Global Development and Computing Portal

Ask Dr. International

Column #2

The Doctor's back to answer more of your questions! This issue will cover:

Arabic: Script or Language?
The real deal about surrogates
Hijri Dates in SQL Server 2000

Arabic: Script or Language?

Dear Dr. International,

I am trying to support Farsi (Persian) in my application. I used the GetLocaleInfo API with an LCID of 0x0429 and the LOCALE_IDEFAULTANSICODEPAGE parameter, and it returned 1256 as the code page to use. However, 1256 seems to be missing some characters. How do I support Farsi with the code page returned by GetLocaleInfo?

(From the Internet)

Dr. International replies:

Dr. International does know what is happening here. The Arabic Windows code page (1256) actually refers to the Arabic language rather than the Arabic script (which supports many additional languages, such as Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Kurdish, Pashto, Sindhi, Uighur, Urdu, and more). Unfortunately, there is simply not enough room on code page 1256 to support all of the additional characters that these other languages require. Although some Farsi letters were added to the code page in previously unassigned spots for Windows Me and Windows 2000, there are still characters that simply cannot be represented in code page 1256. And while only a handful of characters are missing for Farsi, it's important to realize that Urdu and other languages need even more characters that are not present.

In this case, the GetLocaleInfo API call can be best described as returning the "best fit" code page for the Farsi locale. Even though it is the best possible choice for a code page, it is not going to be able to support all possible characters. There is not currently any way under Windows to ask whether a code page completely supports a locale's language or whether it is a "best fit" code page, but this would be a very useful feature.

How do you work around this? By supporting Unicode, of course! Unicode can and does represent all of these languages that use the Arabic script. The same solution will work for Urdu and other languages for which this is even more of an issue. By treating languages such as Farsi and Urdu the same way as the "Unicode only" languages such as Hindi, you can be sure that you will not lose information on conversion.
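The gap described above can be checked directly. The following is a minimal sketch in Python (not part of the original column), assuming that Python's built-in cp1256 codec is close enough to the Windows code page for illustration; the helper name unencodable_chars is hypothetical. It lists the characters in a string that the code page cannot represent, and shows that a Unicode encoding keeps all of them.

```python
def unencodable_chars(text, codepage="cp1256"):
    """Return the characters in text that the given code page cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# PEH (U+067E) is one of the Farsi letters present in code page 1256;
# FARSI YEH (U+06CC) and the Persian digit ONE (U+06F1) are not.
sample = "\u067e\u06cc\u06f1"
for ch in unencodable_chars(sample):
    print("not in cp1256: U+%04X" % ord(ch))

# A Unicode encoding such as UTF-16 loses nothing on the round trip.
assert sample.encode("utf-16-le").decode("utf-16-le") == sample
```

A check like this is only a diagnostic. The remedy remains the one the Doctor prescribes: keep the text in Unicode from end to end.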

The real deal about surrogates

Dear Dr. International,

What is a surrogate? I have been hearing about surrogates a lot but I do not know what they are and I'd like to find out before my boss asks me!

(From the Internet)

Dr. International replies:

Dr. International did promise to talk about some new terminology in each column, didn't he?

By now, it seems like almost everyone knows (or pretends to know) the answer to the question "What is Unicode?". It is an encoding standard that is designed to provide a unique number for every character, no matter what the language. By using 16 bits for every code point, a total of 65,536 characters can be represented. But it turned out that was not going to be enough... the problem is that when you look at languages such as Chinese, Japanese, and Korean, there are still tens of thousands of ideograms that are not yet encoded. And even though many of these are rarely used characters, they are still present in dictionaries and important to scholars. Add to that many of the rarely used characters in other languages and all of the dead languages not yet encoded, and suddenly 65,536 does not seem like such a very large number!

Therefore, starting with Unicode 2.0, the surrogate range was defined. By taking two 16-bit values (one from the high surrogate range and the other from the low surrogate range), an additional 1,048,576 characters can be defined, and that really ought to be enough, don't you think? Dr. International thinks that with over a million code points, the languages of the world should be covered. Up to now, this has all been a theoretical matter, since no characters had actually been assigned in the surrogate range. But as Dr. International is writing this column, he sees that the Unicode Technical Committee (UTC) has just officially decided to assign over 40,000 CJKV ideographs to the surrogate range, and that means that soon all of these characters will be officially assigned and people will expect to be able to use them in software programs.
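The arithmetic behind that 1,048,576 figure can be sketched in a few lines of Python. This is an illustration rather than anything from the original column: the function names are hypothetical, and the range boundaries (0xD800 to 0xDBFF for high surrogates, 0xDC00 to 0xDFFF for low surrogates) come from the Unicode Standard.

```python
HIGH_START = 0xD800  # first high surrogate; the range ends at 0xDBFF
LOW_START = 0xDC00   # first low surrogate; the range ends at 0xDFFF


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary range")
    offset = code_point - 0x10000  # a 20-bit value
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# 1,024 high surrogates times 1,024 low surrogates.
print(1024 * 1024)  # 1048576

high, low = to_surrogate_pair(0x20000)
print("%04X %04X" % (high, low))  # D840 DC00
```

Each range holds 1,024 values, so every pairing of one high and one low surrogate names exactly one additional character.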

Versions of Windows prior to Windows 2000 generally used UCS-2 as their Unicode encoding. With Windows 2000, UTF-16 is the encoding standard for Unicode. The important difference between the two is awareness of surrogates.

Many people wonder why a move to UCS-4/UTF-32 (which uses 32 bits per character) is not being considered. The most common reason for this suggestion is a concern that handling these extra characters via surrogates seems so much like the methods that DBCS required to support a mix of 8-bit and 16-bit characters. The need for functions such as IsDBCSLeadByte makes DBCS string handling very difficult, as many people can attest. Surrogates, however, are always made up of 16-bit values, and the high and low surrogates each fall in their own specific range. This makes surrogates much easier to detect, and it makes string handling routines much easier to implement. Given the heavy investment that the Windows platform and COM have in UCS-2/UTF-16, moving to UCS-4/UTF-32 would require a rearchitecture of Windows that, while not as large as the move from Win16 to Win32, could certainly cause programmers pain on the same order of magnitude. In short:

The move to UCS-2/UTF-16 from DBCS is one that can make life easier and provide a great deal of expanded functionality.

The move from UCS-2/UTF-16 to UCS-4/UTF-32 can make life significantly more difficult with no significant expanded functionality.

In any case, Dr. International will not need to prescribe anything for headaches if you have a Unicode application, because even if your program is not surrogate aware, it will simply believe that these are two characters instead of one. It may mean a bit of strangeness in your cursor movement (two arrow keypresses to move past the character if you have a font that recognizes the character). You will not be happy with the results if you insert text between the high and low surrogate values. And until SQL Server or Windows 2000 releases a service pack, you may not get any special sorting of surrogates. However, the most important issue, data integrity, will not be a problem in your existing applications. In all of your future applications, you probably will want to consider surrogate awareness, because any character that was considered important enough to be included may well be important enough for your application to support.
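To make the cursor and insertion issues concrete, here is a small Python sketch of what surrogate awareness can look like over an array of 16-bit values. It is an illustration under stated assumptions, not a description of how Windows does it: the helper names are hypothetical, and the string is turned into UTF-16 code units first because Python strings are not stored that way.

```python
import struct


def utf16_units(text):
    """Return the 16-bit UTF-16 code units of a Python string."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def count_characters(units):
    """Count characters, treating each valid surrogate pair as one."""
    count = 0
    i = 0
    while i < len(units):
        paired = (is_high_surrogate(units[i]) and i + 1 < len(units)
                  and is_low_surrogate(units[i + 1]))
        i += 2 if paired else 1
        count += 1
    return count


def snap_to_boundary(units, index):
    """Move an insertion index back if it would split a surrogate pair."""
    if (0 < index < len(units) and is_low_surrogate(units[index])
            and is_high_surrogate(units[index - 1])):
        return index - 1
    return index


units = utf16_units("a\U00020000b")    # U+20000 needs a surrogate pair
print(len(units))                      # 4 sixteen-bit values
print(count_characters(units))         # 3 characters
print(snap_to_boundary(units, 2))      # 1, so the pair stays together
```

Because a value's range alone says whether it is a high or a low surrogate, each check looks at no more than one neighboring value. A DBCS trail byte, by contrast, cannot be recognized without scanning back to find where the characters start.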

Windows 2000 has limited support for the input and output of surrogate characters. However, since the characters are not yet officially encoded, there is no keyboard/IME support yet. SQL Server 2000 will consider all such characters to be undefined and thus collation for them is undefined. As fonts become available and more sophisticated sort/collation support is added to future service packs of Windows 2000 and SQL Server, Dr. International is sure that he will be discussing with you the details of how to make your applications surrogate aware. Be sure to stay tuned!

Hijri Dates in SQL Server 2000

Dear Dr. International,

I am coding a database solution that needs to store dates using the Hijri (Arabic) calendar. Given that SQL Server can only store dates from 1753 onward and it is currently 1421 in Hijri, it would appear that there is a bit of a problem here...

(From Graeme E. Coutts)

Dr. International replies:

Hello Graeme,

Dr. International is well aware of the problem you refer to, and can provide two possible solutions, either of which will hopefully solve the issue. The problem is not that Hijri dates are outside the range of Gregorian dates that SQL Server can use, because you would not want SQL Server to be assuming that Hijri dates are in fact Gregorian ones when you do date calculations. What you need to be able to do is convert them properly from the format that SQL Server uses internally to Hijri format. Dr. International will discuss two methods of doing this:

The first method, which was added for SQL Server 2000, is to convert the date to and from the Hijri calendar via the CONVERT intrinsic. There are two CONVERT styles that are available in SQL Server 2000 to support Hijri dates:

130 - Returns the date via the Hijri calendar, in dd mon yyyy hh:mi:ss:mmmAM format.

131 - Returns the date via the Hijri calendar, in dd/mm/yyyy hh:mi:ss:mmmAM format.

For example, to convert a date to Hijri format in Transact-SQL, you would use syntax such as the following:

SELECT CONVERT(nchar, GETDATE(), 131)

This query will return a string such as the following in its result set:

7/05/1421 12:14:35:727PM

The reverse operation is also possible. The following syntax would be used to go in the opposite direction:

SELECT CONVERT(datetime, ' 7/05/1421 12:14:35:727PM', 131)

This query converts the string to the SQL Server datetime type, which Query Analyzer displays as:

2000-08-07 12:14:35.727

The second method is to do the conversions on the client side once a date value has been returned. In VBA code, this can be done by setting the VBA.Calendar property to vbCalHijri, after which all dates will be converted to and from strings assuming the Hijri format for the date strings. This will allow you to manipulate Hijri dates directly. In C++, you can use the VarBstrFromDate and VarDateFromStr functions defined in oleauto.h (with the VAR_CALENDAR_HIJRI flag) to convert to and from Hijri format dates.

Perhaps Dr. International should provide some background to help explain why SQL Server refers to this as an Arabic style date that uses the Kuwaiti algorithm. The Hijri calendar is a very old and complex calendar, which has an issue when it comes to automating conversion between Gregorian and Hijri: there are specific days on which the conversion can be off by a day or two in either direction. The reason is that the new moon is proclaimed by religious authorities based on the visibility of the lunar crescent. Therefore, the natural temptation of programmers to want to automate everything must be resisted in this case. The Hijri calendar is very important to Saudi Arabia and other countries such as Kuwait, and thus this seemingly unsolvable problem must be solved.

In an effort to solve this challenging problem, several years ago some of the top developers in Microsoft's Middle East Products Division (MEPD) did extensive research into it. They had the longest timeline of information on the Hijri calendar as it is used in Kuwait, and they took this information and did statistical analysis on it, finally arriving at the most accurate algorithm they could devise. This algorithm is used in many Microsoft products, including all operating systems that support Arabic locales, Microsoft Office, COM, Visual Basic, VBA, and SQL Server 2000. Whether you refer to this as the Hijri date, the Arabic style, or the Kuwaiti algorithm, you should understand that it is technically none of these things; it is simply the most accurate algorithm that Microsoft was able to derive using a large number of known Hijri dates. The actual determination of the new moon by religious authorities does not bow to a computer algorithm (nor should it, obviously!).

Windows does provide a means to make corrections to the date when needed, in the Regional Options applet in the Control Panel. In the case of Windows 2000, choosing an Arabic locale and then switching to the Date tab will reveal a Calendar Type dropdown. Choosing the first option (Hijri Calendar) will enable the 'Adjust Hijri Date to:' dropdown, which allows you to modify the Hijri date by one or two days in either direction, to support the occasions when the 'Kuwaiti algorithm' does not match the proclaimed date. You can see this dropdown in Figure 1, below.

Figure 1

Prior versions of Windows that include Arabic support have similar functionality, by way of an Advance Hijri Date check box. It can be used any time the previous month contained fewer than 30 days according to the appropriate authorities.

Unfortunately, many applications (including SQL Server 2000, COM, and VBA) do not use this extra control panel setting, and you would have to make such corrections yourself. Windows itself does use this setting in its calendar operations, as does Microsoft Outlook.
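Making such a correction yourself might look like the following Python sketch. Several assumptions are worth stating loudly. The conversion used here is the well-known arithmetic ("tabular") Islamic calendar with the civil epoch, which is only a stand-in and is not Microsoft's algorithm, whose details this column does not give. The function names and the adjustment parameter are hypothetical, and applying the correction by shifting the day count before converting is an assumption about how such a setting can be honored.

```python
from datetime import date

# 1 Muharram AH 1 in the tabular calendar (civil epoch):
# 16 July 622 Julian, which is 19 July 622 in the proleptic Gregorian calendar.
EPOCH = date(622, 7, 19).toordinal()


def hijri_to_ordinal(year, month, day):
    """Day number (Python date ordinal) of a tabular Hijri date."""
    return (EPOCH - 1 + 354 * (year - 1) + (3 + 11 * year) // 30
            + 29 * (month - 1) + month // 2 + day)


def to_hijri(gregorian, adjustment=0):
    """Convert a Gregorian date to (year, month, day) in the tabular calendar.

    adjustment shifts the result by -2..+2 days, mimicking a manual
    correction for a proclaimed date (an assumption, see the lead-in).
    """
    if not -2 <= adjustment <= 2:
        raise ValueError("adjustment must be between -2 and 2 days")
    n = gregorian.toordinal() + adjustment
    year = (30 * (n - EPOCH) + 10646) // 10631
    prior_days = n - hijri_to_ordinal(year, 1, 1)
    month = (11 * prior_days + 330) // 325
    day = n - hijri_to_ordinal(year, month, 1) + 1
    return year, month, day


print(to_hijri(date(2000, 8, 7)))                # (1421, 5, 6)
print(to_hijri(date(2000, 8, 7), adjustment=1))  # (1421, 5, 7)
```

Note that the unadjusted tabular result is one day behind the 7/05/1421 that the SQL Server example above pairs with 2000-08-07. That is exactly the kind of one-day disagreement between algorithms, and between any algorithm and the proclaimed date, that makes a manual correction necessary.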

Obviously, even if the "Advance Hijri Date" functionality were supported by all of the programs that support Hijri dates, it would still not be a perfect solution, and Microsoft is very proactive in its work with both individuals and governments to try to improve its Hijri date support.

Keep those questions coming!

Dr. International has been feeling gratified by the many questions that people have been sending to him related to issues with internationalization and localization. Please keep sending those questions to the Doctor. Although Dr. International cannot answer every e-mail personally, many of them will end up (with the answers!) on this very web site. See you next time!

Dr. International
Windows International Division
