In this issue, the doctor addresses hot issues he has received by e-mail:
| Switching Language Versions of Windows XP | |
| Alternate Language Keyboards for Login | |
| Supplementary Characters, Surrogate Pairs and SQL |
I just purchased a computer in France/Germany/Japan (name your country) and it came with the French/German/Japanese/(name your language) version of Windows 2000 or Windows XP. How can I change it to the English/Russian/Arabic (name your desired language) version?
Dr. International gets this question all the time. The answer is very straightforward. The only way to replace a version of Windows 2000 or XP with a different language version of Windows 2000 or XP is to:
| • | Buy a copy of Windows 2000 or XP in the desired language version. Keep in mind that stores normally stock the local language and maybe the English versions of Microsoft products, but you can find products in other languages from online retailers. |
| • | Back up the important files on your hard drive to different media |
| • | Reformat the hard drive |
| • | Install the new language version |
| • | Restore your backed-up files to the hard drive |
Since my login password contains accented or non-Latin characters, I would like to have the ability to switch input languages in the login prompt dialog just like I can in many of my applications after I have logged in. How can I do this?
To add multiple input languages to the Windows XP or Windows Server 2003 logon prompt dialog, you need to assign the input languages to the machine’s “default user profile”. Here is how to do this (note: you must either be an administrator for this computer or at least have administrator rights).
1. | Go to Start --> Control Panel --> Regional and Language Options |
2. | Make sure that all the input languages that you want are installed. If you are not sure how to do this, see Enabling International Support in Windows XP/Server 2003 Family |
3. | Once you have all the input languages installed, click on the “Advanced” tab of the Regional and Language Options property sheet. |
4. | Now select the check box under “Default user account settings”. |
5. | The “Change Default User Settings” dialog box will appear. Click the “OK” button in this dialog to continue. |
6. | Now click “OK” to apply your changes. |
7. | The next time you get the “Login” dialog it will have a language indicator in the lower left-hand corner of the dialog. You can change this to the language you want by pressing the Left Alt + Shift keys. |
Note: To take out the language indicator from the “Login” dialog box, one must remove all the installed input language services and then again select the “Apply all settings…” box in the “Default user account settings.”
I need to store and fetch several languages (Chinese, Japanese, etc.) that contain supplemental Unicode characters. I know that SQL stores these as Unicode surrogate pairs, but am unsure if I need to do anything special to process them.
This is a very good question, but before I give you an answer, I will take a few lines to make sure everyone understands a few terms.
Unicode standard: The Unicode standard maps characters to code points. For example, the character “A” is mapped to the code point U+0041; the Greek delta “Δ” character is mapped to code point U+0394; and the Chinese character for hand “手” is U+624B. A Unicode code point can be defined between U+0000 and U+10FFFF (a total of 1,114,112 code points).
Note: The syntax U+(xx)xxxx is the standard notation for a Unicode code point with the “U” meaning Unicode and the “x”s representing the code point value in hexadecimal.
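To make the notation concrete, here is a minimal Python sketch that prints the U+ notation for the three example characters above and confirms the size of the code point range. The helper name `code_point_notation` is made up for this illustration.

```python
def code_point_notation(ch: str) -> str:
    """Return the U+XXXX notation for a single character (hypothetical helper)."""
    # ord() gives the code point; pad to at least four hexadecimal digits.
    return f"U+{ord(ch):04X}"

# "A", Greek capital delta, and the Chinese character for hand,
# written as escapes so the script itself stays plain ASCII.
for ch in "A\u0394\u624B":
    print(code_point_notation(ch))

# Number of code points in the range U+0000 to U+10FFFF.
print(0x10FFFF + 1)
```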
To store Unicode data in memory there are three primary encoding forms in use today – UTF-8, UTF-16 and UTF-32 (we will only discuss UTF-16 in this answer).
UTF-16: This encoding stores the basic Unicode characters using single 16-bit units and other characters using two 16-bit units. UTF-16 is the primary encoding mechanism used by Microsoft Windows 2000, Windows 2000 Server, Windows XP and Windows Server 2003. SQL Server 2000, on the other hand, uses the UCS-2 encoding scheme to store Unicode data.
Note: The UCS-2 encoding scheme is actually a subset of the UTF-16 scheme. Every UCS-2 encoded code point is identical to the encoding of the same code point in UTF-16. Also, most new implementations using the Unicode standard now employ UTF-16, UTF-8 or UTF-32 instead of UCS-2.
Supplementary Characters: Some languages, like Chinese and Japanese, need thousands and thousands of characters for completeness. Besides a main block of ideographic characters, the Unicode standard defines a high secondary block of these ideographic characters and other characters (such as some historic scripts and musical symbols) in the range of code points from U+10000 to U+10FFFF and calls these code points “supplementary code points.” Characters in this range are called supplementary characters.
Surrogate Pairs: Supplementary characters are stored in UTF-16 using two 16-bit units. These two 16-bit units are called a “surrogate pair,” with the leading surrogate called the “high” surrogate and the trailing surrogate called the “low” surrogate. (To learn more about surrogate pairs, see section “3.8 Surrogates” in the 4.0 Unicode Standard at http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf )
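For readers who want to see the arithmetic, the following Python sketch splits a supplementary code point into its high and low surrogates and joins them back. It follows the UTF-16 definition in the Unicode Standard (surrogate ranges U+D800 to U+DBFF and U+DC00 to U+DFFF); the function names are assumptions chosen for this example.

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into (high, low)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into one supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x20000)
print(f"U+20000 -> {high:04X} {low:04X}")   # U+20000 -> D840 DC00
```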
As mentioned above, UCS-2 is a subset of the UTF-16 scheme. The only difference between UCS-2 and UTF-16 is in the recognition of supplementary characters. UTF-16 uses sets of two 16-bit units called surrogate pairs to store supplementary characters. UCS-2 only allows 65,536 different code points and thus treats a supplementary character’s surrogate pair as two undefined Unicode code points and not as a single character.
Since SQL Server uses UCS-2, it stores each supplementary character as two undefined Unicode code points. This way, SQL Server allows for the storage of supplementary characters without the risk of loss or corruption.
The good news is that the high surrogate and low surrogate code point values sit within well-defined ranges of Unicode code points. High surrogates are in the range U+D800 to U+DBFF. Low surrogates are in the range U+DC00 to U+DFFF. Thus if you know you may have supplementary characters in your data store, you can check whether the UCS-2 character retrieved is a high surrogate or a low surrogate. If it is a high surrogate, you know that you need to get the next character to complete the surrogate pair; if it is a low surrogate, you know that you need to retrieve the previous character to complete the pair.
Note: You must have one high and one low surrogate to have a valid surrogate pair to represent a supplementary character. A single surrogate code unit all by itself does not represent any valid character.
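The check described above can be sketched in a few lines of Python. This is only an illustration of the range test on data that has already been fetched as 16-bit units, not SQL Server code; the helper names (`utf16_units`, `group_code_units`, and so on) are invented for the example.

```python
from typing import List

def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF

def utf16_units(text: str) -> List[int]:
    """Return the 16-bit units of text, as a UCS-2 store would hold them."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def group_code_units(units: List[int]) -> List[List[int]]:
    """Keep each high+low surrogate pair together; other units stand alone.

    A lone surrogate is returned as a single-unit group so the caller can
    decide how to handle it (it does not represent a valid character).
    """
    groups = []
    i = 0
    while i < len(units):
        if (is_high_surrogate(units[i]) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            groups.append([units[i], units[i + 1]])
            i += 2
        else:
            groups.append([units[i]])
            i += 1
    return groups

units = utf16_units("a\U00020000b")
print([f"{u:04X}" for u in units])      # ['0061', 'D840', 'DC00', '0062']
print(len(group_code_units(units)))     # 3
```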
A few things to remember when working with SQL Server and supplementary characters are:
| • | Since these characters’ surrogate pairs are considered two separate Unicode code points, n in nvarchar(n) must be at least 2 to hold a single supplementary character (i.e., space for a surrogate pair). |
| • | String operations are not supplementary character aware. Thus operations such as Substring(nvarchar(2),1,1) will return only the high surrogate of the supplementary character’s surrogate pair. Also, the Len operation will return a count of two characters for every supplementary character encountered: one for the high surrogate and one for the low surrogate. |
| • | In sorting and searching, all supplementary characters compare equal to all other supplementary characters |
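The sizing and counting points above can be modeled on the client side. The Python sketch below counts 16-bit units the way the list describes (two per supplementary character) and truncates a string to a unit budget without cutting a surrogate pair in half. It is a model of the behavior under those assumptions, not a call into SQL Server, and both function names are hypothetical.

```python
def utf16_length(text: str) -> int:
    """Number of 16-bit units needed to store text (two per supplementary character)."""
    return len(text.encode("utf-16-le")) // 2

def truncate_utf16(text: str, max_units: int) -> str:
    """Longest prefix of text that fits in max_units without splitting a surrogate pair."""
    kept = []
    used = 0
    for ch in text:
        needed = 2 if ord(ch) > 0xFFFF else 1   # supplementary characters need a pair
        if used + needed > max_units:
            break
        kept.append(ch)
        used += needed
    return "".join(kept)

sample = "a\U00020000b"                      # one supplementary character in the middle
print(utf16_length("\u624B"))                # 1
print(utf16_length("\U00020000"))            # 2
print(utf16_length(sample))                  # 4
print(utf16_length(truncate_utf16(sample, 2)))   # 1 (the pair did not fit, so it was dropped whole)
```

A column meant to hold up to k supplementary characters would therefore need room for 2k units under this model.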
Note:
The concept of two or more Unicode code points representing a single displayable character is not new. The standard has for several versions defined what is called combining characters. These are characters not used in isolation, but when combined with “base characters,” display as a single character. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs and Indic matras. An example would be the Latin letter “A” U+0041 plus the combining ring character U+030A, which graphically produces “Å”. Although displayed as a single character, SQL functions treat combined characters the same as they do supplementary characters. They process them as two separate Unicode code points.
For more information about combining characters, see section “3.6 Combination” of Chapter 3 of the Unicode Standard 4.0.
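The “A” plus combining ring example can be checked with Python's standard `unicodedata` module. The sketch below attaches each combining mark to the base character before it; this is a deliberate simplification rather than full text segmentation as defined by the Unicode Standard, and the function name `group_combining` is made up for the example.

```python
import unicodedata
from typing import List

def group_combining(text: str) -> List[str]:
    """Attach each combining mark to the base character that precedes it."""
    clusters: List[str] = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

a_ring = "A\u030A"                    # Latin "A" followed by the combining ring
print(len(a_ring))                    # 2 code points
print(len(group_combining(a_ring)))   # 1 displayed character
```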
Two good resources for SQL Server 2000 and Unicode are:
1. | International Features in Microsoft SQL Server 2000 http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql2k/html/intlfeaturesinsqlserver2000.asp |
2. | Preserving Client-Server Data Integrity with Unicode and Microsoft SQL Server 2000 http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql2k/html/sql_dataencoding.asp |