FWH : Unicode Gets and RowSet

Postby nageswaragunupudi » Sat Oct 01, 2016 8:17 pm

When an application is built with Unicode enabled, the TGet object accepts Unicode characters as input. To create a Unicode application, we should call the function
Code:

FW_SetUnicode( .T. )
 

at the very beginning, particularly before any Window or Dialog is created for the first time. After the first window/dialog is created, this setting cannot be changed.

The rest of the discussion in this post applies to Unicode applications only; for ANSI (non-Unicode) applications, the behavior remains as it has always been.

Non-Character Gets are always ANSI:

Even in a Unicode application, Unicode character input is enabled only for character variables. Gets of all other variables (types D, L, N and T, i.e. date, logical, numeric and timestamp) behave identically to pure ANSI Gets.

By default, any Get of a character variable in a Unicode application accepts Unicode characters. Unicode character Gets (hereafter referred to as Unicode Gets) do not respect any picture clause except "@!".

ANSI Gets in Unicode Application:

However, it is possible to create a pure ANSI Get even in a Unicode application by including the clause "CHRGROUP CHR_ANSI". Such Gets do not accept Unicode characters and behave identically to Gets in a pure ANSI application. These Gets can have picture clauses like any ANSI Get.

Maximum possible length of input in Unicode Get:

By default, a Get restricts the length of the input to the length of the variable in bytes. If cVar is Space(15), GET ... cVar restricts the input to 15 bytes. Because a Unicode character outside the ASCII range occupies 2 to 4 bytes in UTF-8 (3 bytes for most Indic and East Asian scripts), we typically cannot input more than about 5 such characters in 15 bytes. If the trimmed length of the input text is less than 15 bytes, the value is right-padded to 15 bytes.

Note:
Len( cVar ) --> Length in Bytes
HB_UTF8LEN( cVar ) --> Length in Characters.
In the case of English, Len( cVar ) is always the same as HB_UTF8LEN( cVar )
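
The difference between the two lengths can be checked with a small console program. This is only a sketch using core Harbour functions; it assumes the source file is saved as UTF-8 and that the default (non-UTF8) Harbour codepage is active, so that Len() counts bytes. The variable names are illustrative.

```harbour
procedure Main()

   local cEng := "Hello"
   local cTel := "తెలుగు"    // 6 Telugu code points, 3 bytes each in UTF-8

   ? Len( cEng ), hb_utf8Len( cEng )   // 5 bytes, 5 characters
   ? Len( cTel ), hb_utf8Len( cTel )   // 18 bytes, 6 characters

   return
```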

This default behavior suits XBase applications. A character field of length 15 can store a maximum of 15 bytes of text, whether ANSI or Unicode. So, in effect, the maximum Unicode text that can be stored is approximately 5 characters.

The case of SQL servers is different. For example, VarChar(15) in MySql with the utf8 charset, or NVARCHAR(15) in MsSql, Oracle, etc., accommodates 15 characters even if the length in bytes exceeds 15. To meet the requirements of these databases we need a different behavior from the Get object.

cVar := Space(15)
@ r,c GET cVar ........ CHRGROUP CHR_WIDE
meets this requirement.

In this case, the user input is restricted to 15 characters not bytes. The returned value is padded to 15 characters, i.e., HB_UTF8LEN( cVar ) will be 15 though Len( cVar ) may be >= 15 depending on the type of characters input.

The following sample may be tried to test this functionality.
Code:

#include "fivewin.ch"

function TestUnicodeGets()

   local aText[ 3 ]
   local aGets[ 3 ]
   local oDlg, oSegoe, oSmall

   FW_SetUnicode( .T. )

   AFill( aText, Space( 15 ) )

   DEFINE FONT oSmall NAME "TAHOMA"   SIZE 0,-12
   DEFINE FONT oSegoe NAME "Segoe UI" SIZE 0,-20
   DEFINE DIALOG oDlg SIZE 650,340 PIXEL TRUEPIXEL FONT oSegoe ;
      TITLE "FWH : UNICODE GETS"

   @  20, 360 SAY "BYTES" SIZE 110,34 PIXEL OF oDlg FONT oSmall RIGHT
   @  20, 480 SAY "CHARACTERS" SIZE 110,34 PIXEL OF oDlg FONT oSmall RIGHT

   @  60,  40 SAY "ANSI Text upto 15 Bytes" SIZE 300,20 PIXEL OF oDlg FONT oSmall
   @  60, 360 SAY "Full" SIZE 50,20 PIXEL OF oDlg FONT oSmall RIGHT
   @  60, 420 SAY "Trim" SIZE 50,20 PIXEL OF oDlg FONT oSmall RIGHT
   @  60, 480 SAY "Full" SIZE 50,20 PIXEL OF oDlg FONT oSmall RIGHT
   @  60, 540 SAY "Trim" SIZE 50,20 PIXEL OF oDlg FONT oSmall RIGHT

   @  80,  40 GET aGets[ 1 ] VAR aText[ 1 ] SIZE 300,34 PIXEL OF oDlg CHRGROUP CHR_ANSI VALID ( oDlg:Update(), .t. )

   @  80, 360 SAY Len( aText[ 1 ] ) PICTURE "99" SIZE 50,34 PIXEL OF oDlg RIGHT UPDATE
   @  80, 420 SAY Len( Trim( aText[ 1 ] ) ) PICTURE "99" SIZE 50,34 PIXEL OF oDlg RIGHT UPDATE
   @  80, 480 SAY HB_UTF8LEN( aText[ 1 ] ) PICTURE "99" SIZE 50,34 PIXEL OF oDlg RIGHT UPDATE
   @  80, 540 SAY HB_UTF8LEN( Trim( aText[ 1 ] ) ) PICTURE "99" SIZE 50,34 PIXEL OF oDlg RIGHT UPDATE

   @ 120,  40 SAY "Unicode Text upto 15 Bytes" SIZE 300,20 PIXEL OF oDlg FONT oSmall
   @ 140,  40 GET aGets[ 2 ] VAR aText[ 2 ] SIZE 300,34 PIXEL OF oDlg CHRGROUP CHR_ANY VALID ( oDlg:Update(), .t. )

   @ 140, 360 SAY Len( aText[ 2 ] ) PICTURE "99" SIZE 50,34 PIXEL OF oDlg RIGHT UPDATE
   @ 140, 420 SAY Len( Trim( aText[ 2 ] ) ) PICTURE "99" SIZE 50,34 PIXEL OF oDlg RIGHT UPDATE
   @ 140, 480 SAY HB_UTF8LEN( aText[ 2 ] ) PICTURE "99" SIZE 50,34 PIXEL OF oDlg RIGHT UPDATE
   @ 140, 540 SAY HB_UTF8LEN( Trim( aText[ 2 ] ) ) PICTURE "99" SIZE 50,34 PIXEL OF oDlg RIGHT UPDATE

   @ 180,  40 SAY "Unicode Text upto 15 Characters" SIZE 300,20 PIXEL OF oDlg FONT oSmall
   @ 200,  40 GET aGets[ 3 ] VAR aText[ 3 ] SIZE 300,34 PIXEL OF oDlg CHRGROUP CHR_WIDE VALID ( oDlg:Update(), .t. )

   @ 200, 360 SAY Len( aText[ 3 ] ) PICTURE "99" SIZE 50,34 PIXEL OF oDlg RIGHT UPDATE
   @ 200, 420 SAY Len( Trim( aText[ 3 ] ) ) PICTURE "99" SIZE 50,34 PIXEL OF oDlg RIGHT UPDATE
   @ 200, 480 SAY HB_UTF8LEN( aText[ 3 ] ) PICTURE "99" SIZE 50,34 PIXEL OF oDlg RIGHT UPDATE
   @ 200, 540 SAY HB_UTF8LEN( Trim( aText[ 3 ] ) ) PICTURE "99" SIZE 50,34 PIXEL OF oDlg RIGHT UPDATE

   @ 260, 040 BUTTON "REFRESH" SIZE 150,40 PIXEL OF oDlg ACTION oDlg:Update()

   ACTIVATE DIALOG oDlg CENTERED
   RELEASE FONT oSegoe, oSmall


return nil
 


When text is entered to maximum allowed length:

Image

FW MARIADB RowSet:

Now let us see how the FWH implementation of MySql/MariaDB simplifies the handling of ANSI and Unicode fields of tables.

First we need to establish a Unicode connection to the server. In a Unicode application (i.e., when FW_SetUnicode( .T. ) is called first), the connection to the MySql server is established as a Unicode connection.

When a table is opened, the RowSet object recognises the character set of each field of the table (or sql). XBrowse and DataRow objects are tightly integrated with FW MySql objects. Both XBrowse and DataRow objects learn from the RowSet object which fields are to be edited as ANSI and which fields are to be edited as Unicode with CHR_WIDE type.

A table's character set defaults to database's character set and fields' character set defaults to table's character set. Because defaults may be deceptive at times, it is highly recommended to specify the character set of table and fields at the time of creation of the table.

Example:
Code:

#include "fivewin.ch"

function TestUnicodeRowSet

   local cHost       := "localhost"
   local cUser       := "root"
   local cPassword   := <secret>
   local cDB         := "fwh"
   local oCn, oRs

   FW_SetUnicode( .t. )

   FWCONNECT oCn HOST cHost USER cUser PASSWORD cPassword DATABASE cDb
   if oCn == nil
      ? "Connect Fail"
      return nil
   endif

   if .not. oCn:TableExists( "ansiutf" )

      oCn:CreateTable( "ansiutf", { { "ansitext", 'C',  5, 0, "latin1" }, ;
                                    { "utf8text", 'C', 15, 0, "utf8"   }  }, ;
                       .t., "utf8" )

   endif

   oRs   := oCn:RowSet( "ansiutf" )
   XBROWSER oRs FASTEDIT

   oRs:Close()
   oCn:Close()

return nil
 


The above CreateTable() method internally generates the following SQL to create the table:
Code:
CREATE TABLE `ansiutf` (
   `ID` INT AUTO_INCREMENT PRIMARY KEY,
   `ansitext` VARCHAR( 5 ) CHARACTER SET latin1 COLLATE latin1_general_ci,
   `utf8text` VARCHAR( 15 ) CHARACTER SET utf8 COLLATE utf8_unicode_ci
) CHARACTER SET utf8 COLLATE utf8_unicode_ci


Image

As long as we use FW MariaDB, XBrowse and the default DataRow dialogs, everything is automated and needs no special effort from the programmer.

But we do not always want to use DataRow's default dialogs and may prefer to design our own. In such cases we need to specify for each Get whether to use CHR_ANSI or CHR_WIDE. Even this is simple.

Example:
Code:

function MyDialog( oRec )

@ r,c GET oRec:fieldname SIZE ........ CHRGROUP oRec:FieldnChrGrp( "fieldname" )

...
 

There is no need for the programmer to keep note of the character sets of different fields.

Note: If TDataRow is used for editing, default is CHR_WIDE.

When using 3rd party libs, the programmer should carefully declare the clauses CHR_ANSI / CHR_WIDE.
Regards

G. N. Rao.
Hyderabad, India
nageswaragunupudi
Posts: 10632
Joined: Sun Nov 19, 2006 5:22 am
Location: India

Postby richard-service » Sun Oct 02, 2016 2:23 am

Mr. Rao,
Could you provide a working EXE example to test here?
Does this support ADS/DBF databases?
Best Regards,

Richard

Harbour 3.2.0dev (r2402101027) => Borland C++ v7.7 32bit
MySQL v8.0 /ADS v10
Harbour 3.2.0dev (r2011030937) => Borland C++ v7.4 64bit
richard-service
Posts: 803
Joined: Tue Oct 16, 2007 8:57 am
Location: New Taipei City, Taiwan

Postby nageswaragunupudi » Sun Oct 02, 2016 2:49 am

Mr Richard

May I know what is the latest FWH version you are using?

For DBF/ADS, if the field width is 'n', we can store a string of at most n bytes, i.e., approximately n/3 Unicode characters.

I shall send you exes by mail.

Postby richard-service » Sun Oct 02, 2016 2:52 am

I use FWH1606build2. Please send them to my e-mail.
Thanks a lot.

Postby nageswaragunupudi » Mon Oct 03, 2016 2:26 pm

We expect that Unicode programmers are already aware that ANSI functions like CHR(), LEFT(), RIGHT(), SUBSTR(), STUFF(), LEN() etc. should not be used with Unicode strings. Instead we should use their UTF8 counterparts, viz., HB_UTF8CHR(), HB_UTF8LEFT(), HB_UTF8RIGHT(), HB_UTF8SUBSTR(), HB_UTF8STUFF(), HB_UTF8LEN(). It is once again desirable to review the application programs and any 3rd party libraries they use and make the appropriate corrections.

In the same way, PAD() or PADR() should not be used with Unicode strings. For example, PADR( cStr, 10 ), where cStr is a Unicode string longer than 10 bytes, truncates at a byte position and is very likely to cut a character in the middle. The result is an invalid UTF-8 string, and we just see many ??? marks instead of valid characters.
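
A small console sketch of the problem, assuming a UTF-8 source file and the default Harbour codepage (so that PadR() and Len() work on bytes). The sample text and variable names are only illustrative.

```harbour
procedure Main()

   local cStr  := "తెలుగు భాష"           // every Telugu character here is 3 bytes
   local cBad  := PadR( cStr, 10 )        // byte cut: 3 whole characters + 1 stray byte
   local cGood := hb_utf8Left( cStr, 3 )  // character cut: 3 whole characters

   ? Len( cBad )    // 10 bytes, the last one belongs to an incomplete character
   ? Len( cGood )   // 9 bytes, still valid UTF-8

   return
```

The byte-based cut keeps the requested width but leaves a dangling lead byte at the end, which is what later shows up as "?" marks.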

FWH provides two substitute functions for handling truncation and padding of Unicode strings.

FW_UTF8PADCHAR( cStr, nCharacters ) --> cPartStr whose HB_UTF8LEN() is nCharacters
FW_UTF8PADBYTE( cStr, nBytes ) --> cPartStr whose Len() is nBytes without invalidating the part string.
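
To make the two behaviours concrete, here is a rough sketch of helpers with the same intent, written with core Harbour functions only. The names Utf8PadChars() and Utf8PadBytes() are hypothetical and this is not the FWH source; in a real FWH application the two functions listed above should be used. It assumes a UTF-8 source file and the default Harbour codepage.

```harbour
procedure Main()

   local cStr := "తెలుగు"     // 6 characters, 18 bytes

   ? hb_utf8Len( Utf8PadChars( cStr, 4 ) )    // 4 characters (truncated)
   ? hb_utf8Len( Utf8PadChars( cStr, 10 ) )   // 10 characters (6 + 4 spaces)
   ? Len( Utf8PadBytes( cStr, 10 ) )          // 10 bytes (3 characters + 1 space)

   return

// Hypothetical helper: truncate or pad so that the result has nChars characters
static function Utf8PadChars( cStr, nChars )

   local nLen := hb_utf8Len( cStr )

   if nLen >= nChars
      return hb_utf8Left( cStr, nChars )
   endif

   return cStr + Space( nChars - nLen )

// Hypothetical helper: keep only whole characters that fit in nBytes,
// then pad with spaces so that the result is exactly nBytes long
static function Utf8PadBytes( cStr, nBytes )

   local cOut   := cStr
   local nChars := hb_utf8Len( cStr )

   do while Len( cOut ) > nBytes
      nChars--
      cOut := hb_utf8Left( cStr, nChars )
   enddo

   return cOut + Space( nBytes - Len( cOut ) )
```

The byte version drops whole characters until the text fits, so the padding spaces absorb whatever room is left over instead of a partial character.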

It may be noted that our own legacy applications and many 3rd party libs were originally written to handle ANSI strings, and all of them need careful review and appropriate modifications.

Despite the best efforts made by FWH to provide the best UI for Unicode programming, all this effort will go to waste if the programmer does not fully and properly take care of these issues in his programs and in the 3rd party libs he is using.

Programmers using TDolphin and TMySql also need to keep some issues in mind. Neither of these libraries was written with the handling of Unicode strings in mind. The programmer needs to make knowledgeable modifications to the sources to handle the requirements of Unicode strings.

In these libraries, character fields defined as Char( n ) and VarChar( n ) are treated as fields with a length of ( 3 * n ) bytes, and the values returned to the user are padded to 3*n bytes. These values are not suitable for any kind of Get.

We need to use FW_UTF8PADCHAR( string, n ) and then create a Get with the resultant string.

Postby richard-service » Tue Oct 04, 2016 2:47 am

Mr. Rao,
Yes, you're right.
We know the old functions need to be replaced with their Unicode counterparts.
Thanks for sharing these Unicode functions. In fact, I did not know the names of these functions.
Anyway, I now use the FW_UTF8PADCHAR() function when modifying character data, and it is working fine.

Postby nageswaragunupudi » Tue Oct 04, 2016 2:54 am

Glad the posting helped you.

Postby richard-service » Tue Oct 04, 2016 4:04 am

nageswaragunupudi wrote:Glad the posting helped you.

I did not know that the new HB_xx/FW_xx functions support database objects.
Maybe next time I will test it.

Postby nageswaragunupudi » Sat Oct 15, 2016 12:34 am

These notes may help some beginners of Unicode programming with FWH.

FW_SetUnicode( .T. ):
It is necessary to call this function at the beginning of the Main() function, and in any case before any Window/Dialog is created and any screen I/O starts.

KeyBoard:
It is possible to use (a) Windows on-screen keyboards, (b) utilities like Google Input Tools or (c) hardware keyboards. First ensure the keyboard works with Windows Notepad, Word, Excel, etc. If the keyboard works with Windows software, then it also works with an FWH application.

Unicode GETs and Length of Variables to use:
It is very important to know that we need to use character variables with larger sizes to handle Unicode input.

Each single Unicode character requires about 3 bytes. That means we need to increase the DBF field widths and the lengths of the variables used in our Gets to almost 3 times their present size, or even more.

For example, our present ANSI application may have a character field with a width of 2 bytes. We can comfortably enter values like "NY", "OH", etc. But if we try to enter Unicode characters, we cannot input even a single character. We need a space of at least 6 bytes, or even more.

What do we mean by a single Unicode character?
This is another important thing to understand. Some languages form one letter from more than one character. For example, this is a composite letter in my mother tongue. To the eye, it looks like one letter.

క్ష్మ

But as encoded here it is made up of 5 Unicode characters (code points), viz.

KA + VIRAMA + SSA + VIRAMA + MA (phonetically K + SH + M + A)

What appears to be a single letter requires 5 Unicode characters and 5 x 3 = 15 bytes of storage. It is now clear that we cannot input this letter in a variable or field of width less than 15 bytes. From this we can understand how large the field sizes must be to store Unicode data and what size of variables we should use in GETs.
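
The composition of such a letter can be inspected with a short console program, so that the counts can be checked against the estimate above. This is a sketch with core Harbour functions, assuming the source file is saved as UTF-8 and the default Harbour codepage is active; the variable names are illustrative.

```harbour
procedure Main()

   local cText := "క్ష్మ"
   local cChar, n

   // total number of Unicode characters and total storage in bytes
   ? "Characters:", hb_utf8Len( cText ), "Bytes:", Len( cText )

   for n := 1 to hb_utf8Len( cText )
      cChar := hb_utf8SubStr( cText, n, 1 )
      // code point in hex, and the bytes that single character needs
      ? "U+" + hb_NumToHex( hb_utf8Asc( cChar ), 4 ), Len( cChar )
   next

   return
```

The byte total printed on the first line is the minimum field or variable width needed to hold the text.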

This is some essential information to keep in mind, before we start working with a Unicode application.