Global Development and Computing Portal

Ask Dr. International

Column #17

The Doctor has finally had the time to prepare a holiday present for everyone, and decided to close the year with a bang. Amazing how quickly time flies when you're having fun!

This installment will address:

On This Page
Exchanging non–English e–mails
Windows XP/2000 default system font
Which Windows charsets have support for the Euro symbol?
Difference between Spanish traditional sort and modern sort
How to perform string manipulation on UTF–8 strings?
MLang and NLS compared
Globalization of websites
Multilanguage web forms
Column 16: Handling Multilingual Data & Changing Codepages on older versions of Microsoft Windows
Column 18: Windows XP and Unicode Surrogate Code Points, CJK Extensions A and B

Exchanging non–English e–mails

Dear Dr. International,

I bought a new PC with Windows Me English version preinstalled. I'd like to exchange e–mail in Japanese. So, I installed the Global IME 5.02 for Japanese. I could read e-mails in Japanese. However, the e-mail that I wrote in Japanese was garbled at the recipient's.

Why were letters garbled even though I could type Japanese on the PC? Could you help me?

Ayako from Japan

Dr. International replies:

Dear Ayako,

A very good question. This happens when you send e–mail containing non–Latin text from Outlook or Outlook Express on any version of Windows. The problem lies in the character encoding that your mail client stamps on the message. By default, every outgoing message is marked with the encoding that matches your system's codepage, but you can change the setting manually:

1. In Outlook Express: Tools > Options > Send > International Options

2. In Outlook 2000/XP: Tools > Options > Mail Format > International Options

Here are the detailed steps for Outlook 2000 or 2002 (part of Office XP):

1. Go to the Tools menu and select Options.

2. Click the Mail Format tab.

3. In the Message Format section, click International Options.

4. In the dialog that appears, select Unicode (UTF–8) from the "Preferred encoding for outgoing messages" drop–down list. Click OK twice to save the new settings.

This is where you set the encoding for outgoing messages. You can choose either Japanese (or the codepage of whichever language you need) or UTF–8. Both will work, but some recipients' mail clients may not handle UTF–8. Try both and see which works best for you.
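To see why the stamped encoding matters, here is a minimal Python sketch. It is not how Outlook works internally; the sample text and the three encodings are chosen only for illustration. It stores Japanese text in a Western European codepage, which has no slots for those characters, and then in two encodings that do.

```python
# Sample Japanese text, written with escapes so the source stays ASCII.
text = "\u65e5\u672c\u8a9e"

# A Western European codepage cannot represent Japanese characters,
# so each one is replaced by "?" and the original text is lost.
garbled = text.encode("cp1252", errors="replace").decode("cp1252")

# A Japanese mail encoding and UTF-8 both carry the text intact.
via_jis = text.encode("iso2022_jp").decode("iso2022_jp")
via_utf8 = text.encode("utf-8").decode("utf-8")

print(garbled)
print(via_jis == text, via_utf8 == text)
```

The recipient's mail client also has to be told which encoding was used, which is exactly what the International Options setting controls.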


Windows XP/2000 default system font

Dear Dr. International,

Are Windows 2000 default system fonts the same for other languages as for the US–English version? If not, where can I find a listing of default fonts by OS language?

Dion from the Carolinas

Dr. International replies:

Hi Dion,

The default Shell UI font varies from one set of languages to another. For all Latin and European languages, the font is the same as for English (Tahoma). For East Asian and Middle–Eastern versions, here is a quick summary:

| OS Language | Default Desktop Theme | Classic Desktop Font | Standard Desktop Font |
|---|---|---|---|
| English (US) | Standard | Microsoft Sans Serif | Tahoma |
| Japanese | Classic | MS UI Gothic | Tahoma |
| Korean | Classic | Gulim | Tahoma |
| Simplified Chinese | Classic | SimSun | Tahoma |
| Traditional Chinese | Classic | PMingLiU | Tahoma |
| Arabic | Classic | Microsoft Sans Serif | Tahoma |
| Hebrew | Standard | Microsoft Sans Serif | Tahoma |

Note: This is also applicable to Windows XP.


Which Windows charsets have support for the Euro symbol?

Dear Dr. International,

With the Euro being launched in January 2002, I am wondering which Windows character sets support the Euro symbol?

Euro Fan

Dr. International replies:

Dear Euro Fan,

The Windows character sets/codepages that support the Euro are:

OEM code page 858 (Multilingual Latin I + Euro), at code point 0xD5

All Windows codepages have the Euro symbol at code point 0x80, except the 1251 Cyrillic codepage, where the Euro is at 0x88.

East Asian code pages:

932 (Japanese Shift–JIS) Windows/OEM code page does not currently support Euro

936 (Simplified Chinese GBK) Windows/OEM code page at 0x80

949 (Korean) Windows/OEM code page at 0xA2E6

950 (Traditional Chinese Big 5) Windows/OEM code page at 0xA3E1
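A few of these code points can be spot-checked with Python's built-in codecs. This is only a sketch: Python ships its own mapping tables, which are an independent implementation and may not match the Windows tables for every codepage (the East Asian ones in particular), and the helper name `euro_code_point` is made up for this example.

```python
EURO = "\u20ac"  # the Euro sign

def euro_code_point(codepage: str):
    """Return the Euro's byte value(s) in the given codec as hex, or None if unsupported."""
    try:
        return EURO.encode(codepage).hex().upper()
    except UnicodeEncodeError:
        return None

for name in ("cp858", "cp1252", "cp1251", "cp932"):
    print(name, euro_code_point(name))
```

A result of None means the codec has no slot for the Euro, so the character cannot survive a round trip through that codepage.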

Many of you have written to the Doctor concerning your inability to see the € symbol in the subject line or the body of your e–mail messages. I've done some investigation and I've seen some mails with the Euro character displayed properly in the message's body but not in the subject line. Usually it happened when the message containing Euro symbol was sent from one service provider to another – Euro's Hex number 80 was in its place. Strangely enough, when I sent the same message that exhibited the problem to different accounts with the same service provider, they remained intact – € was in the subject and body of the message. Other messages remained fully intact when sent amongst different service providers. The problem lies in:

1. How e–mail clients/readers and the various web mail service providers handle the Euro symbol;

2. How well they conform to RFC 2047, which specifies message header encoding for non–ASCII text;

3. Whether the outgoing message encoding that your e–mail client is configured to use supports the Euro symbol. See "Exchanging non–English e–mails" above to find out how to change this in Microsoft products.

The Doctor can only say that Outlook 2000 and 2002, Outlook Express 5.5 and 6.0, and Hotmail conform to RFC 2047 and keep the Euro symbol intact. Outlook and Exchange Server apply the same character set to the header as to the text parts of the message body. For other clients, please contact your vendor if you have problems.
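For readers who want to see what RFC 2047 header encoding looks like, here is a small sketch using Python's standard email package. The subject line is a hypothetical example, and UTF-8 is an arbitrary choice of charset; a real mail client would use whatever outgoing encoding it is configured for.

```python
from email.header import Header, decode_header, make_header

subject = "Price: 100 \u20ac"  # hypothetical subject line containing the Euro sign

# Encode the subject as an RFC 2047 "encoded-word" so that only ASCII travels in the header.
encoded = Header(subject, "utf-8").encode()
print(encoded)

# A conforming reader reverses the process and recovers the original text.
decoded = str(make_header(decode_header(encoded)))
print(decoded == subject)
```

A client or gateway that skips this step and sends the raw byte instead is where a stray 0x80 in the subject line can come from.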

FYI – Windows XP is Euro–ready. It defaults to the Euro as the currency when the user locale (Standards and formats in Windows XP parlance) is set to Austria, Belgium, Finland, France, Germany, Greece, Italy, Luxembourg, the Netherlands, Portugal, the Republic of Ireland, or Spain. Monaco, which previously used the French currency, will also switch to the Euro. Users in those 12 countries will see currency amounts appear in Euros rather than in the national currency of their respective countries. You can switch to the national currency, or to any other currency, by customizing your currency settings in Regional and Language Options in the Control Panel.

To learn more about Euro support in Microsoft products, see Using Euro as your Currency on TechNet.


Difference between Spanish traditional sort and modern sort

Dear Dr. International,

Can you kindly tell me what the difference is between Spanish_Traditional_Sort and Spanish_Modern_Sort? The break–down of the Spanish is quite extensive for the LDICs...however, I do not see a code for Spain or the US market....can you provide some insight for me?

I look forward to hearing back from you at your earliest possible convenience.

Curious Translator

Dr. International replies:

Dear Translator,

As you know, there are two different flavors of Spanish sorting: modern and traditional. The traditional sort has become less preferred over time and is being replaced by the modern sort. Within the traditional sort, the following rules apply:

1. CH is treated as a compression (where C and H are treated as a single unit of sorting), which sorts between C and D as a unique letter.

2. LL is also treated as a compression, which sorts between L and M as a unique letter.

3. N tilde (Ñ) is a unique letter between N and O.

For the modern (international) sort, the only rule that applies is that N tilde is a unique letter between N and O.
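The rules above can be sketched as a pair of sort keys in Python. This is an illustration of the three rules only, not the Windows NLS implementation: the function names are invented for the example, the input is assumed to use the precomposed N tilde, and accented vowels are not handled.

```python
import string

N_TILDE = "\u00f1"  # precomposed lowercase n with tilde

def build_alphabet(traditional: bool) -> list:
    """Return the Spanish alphabet as an ordered list of sorting units."""
    letters = []
    for ch in string.ascii_lowercase:
        letters.append(ch)
        if ch == "n":
            letters.append(N_TILDE)   # both sorts: N tilde follows N
        if traditional and ch == "c":
            letters.append("ch")      # traditional only: CH follows C
        if traditional and ch == "l":
            letters.append("ll")      # traditional only: LL follows L
    return letters

def sort_key(word: str, traditional: bool) -> list:
    """Map a word to a list of ranks (case is folded; accented vowels are not handled)."""
    alphabet = build_alphabet(traditional)
    rank = {unit: i for i, unit in enumerate(alphabet)}
    word = word.lower()
    key, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in rank:   # a compression such as "ch" or "ll"
            key.append(rank[pair])
            i += 2
        else:
            key.append(rank.get(word[i], len(alphabet)))  # unknown characters sort last
            i += 1
    return key

words = ["dama", "chico", "cuna", "luz", "llama", "nube", N_TILDE + "u", "oso"]
modern = sorted(words, key=lambda w: sort_key(w, traditional=False))
traditional = sorted(words, key=lambda w: sort_key(w, traditional=True))
```

With this word list, the modern sort puts "chico" before "cuna" and "llama" before "luz", while the traditional sort reverses both pairs because CH and LL sort after every other word starting with C or L.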


How to perform string manipulation on UTF–8 strings?

Dear Dr. International,

I am writing an application that handles UTF–8, and need to find out the length of UTF–8 strings being entered. I've used StrLen, and MbsLen, but they don't give me the right results. Please help.

String Counter

Dr. International replies:

Dear String Counter,

strlen and _mbslen do the same job: each returns the number of characters in a string. The former assumes single–byte characters, so it really counts bytes; the latter handles MBCS (Multi–byte Character Sets). The generic–text name that maps to either is _tcslen. From an OS point of view, we do not offer real string manipulation support for UTF–8 (apart from UTF–16 <–> UTF–8 conversion).

Let me explain in more detail. Let's say that my original Unicode string is "سليم". This Arabic name is composed of 4 Unicode code points.

wcslen on this string will return 4 (in fact 4 wchars).

Now, if I convert this to ANSI (with Arabic code page), strlen on that ANSI string will still return 4 (this time 4 chars).

In your case, you are converting the original UTF–16 string to UTF–8. For that range of Unicode characters (U+0080 to U+07FF), each code point is represented by 2 bytes in UTF–8. So the conversion yields 8 bytes, which look like 8 high ANSI chars and cannot be displayed meaningfully in GUI mode (the failure is not in the conversion, but in the string manipulation). Running strlen on the converted string therefore returns 8: not the number of characters, as expected, but the number of bytes! The same mismatch occurs with any other non–ASCII script (Hebrew, CJK...); only pure ASCII text, which UTF–8 encodes as one byte per character, gives matching counts.

In conclusion, UTF–8 string manipulation should be avoided. Convert your string to UTF–16 and do all analysis in this encoding. Once you're done, convert back to UTF–8 if needed.
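The counts in this example can be reproduced with a short Python sketch. Python is used here only as a convenient calculator; the comments map each value to the C function it stands in for, and cp1256 is assumed as the Arabic ANSI code page.

```python
# The Arabic name from the example, written as four Unicode code points.
name = "\u0633\u0644\u064a\u0645"

utf16_units = len(name.encode("utf-16-le")) // 2   # what wcslen would count
ansi_bytes = len(name.encode("cp1256"))            # what strlen counts on the Arabic ANSI string
utf8_bytes = len(name.encode("utf-8"))             # what strlen counts on the UTF-8 string

# Convert the UTF-8 bytes back to a Unicode string before counting characters.
characters = len(name.encode("utf-8").decode("utf-8"))

print(utf16_units, ansi_bytes, utf8_bytes, characters)  # prints "4 4 8 4"
```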


MLang and NLS compared

Dear Dr. International,

We are in a process of rewriting the code that handles internationalization. So far we work with NLS files and third party conversion tables (for codepages which are not supported by NLS) and we consider shifting to Mlang. Can you provide us with the pros and cons of using MLang versus NLS? Also, how do I perform CharNext on UTF–8 and MLang

Ori in Israel

Dr. International replies:

Dear Ori,

MLang is a COM component that provides a variety of services for detecting the character encoding used by web pages and emails, converting text from one encoding to another as part of an import or export operation, and the display of characters that are not included within the font specified for parts of a web page.

There is some overlap in features between MLang and NLS, but there are more differences than overlap.

1. NLS is Win32; MLang is COM.

2. NLS is used by the system itself and most Win32 programs; MLang is used by IE and some other programs.

3. NLS supports locales, sorting, case conversion, etc.; MLang doesn't.

4. MLang supports many queries about code page properties (including autodetection) that NLS doesn't.

5. Both MLang and NLS support code page conversions. In many cases MLang uses NLS to convert, but it also does some conversions itself.

6. Both MLang and NLS have extensive documentation, including on http://msdn.Microsoft.com

Except for code page conversion, the decision to use NLS or MLang should usually be obvious – they provide different functionality and many programs use both.

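As for your question about CharNext: CharNext doesn't support UTF–8. It's easy to implement yourself, though: look at the top bits of each byte. If the top bit is 0, the character is a single byte. If the top two bits are 11, the byte starts a multi–byte character, and the number of leading 1 bits gives its length in bytes. If the top two bits are 10, the byte is a continuation byte inside a character. See the table below for the encoding. Finally, MLang doesn't do CharNext. This is one of the many areas where there is no overlap.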

| Unicode Range | UTF-8 Encoded Bytes |
|---|---|
| 0x00000000 - 0x0000007F | 0xxxxxxx |
| 0x00000080 - 0x000007FF | 110xxxxx 10xxxxxx |
| 0x00000800 - 0x0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| 0x00010000 - 0x0010FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
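Based on the bit patterns in the table, a CharNext-style step for UTF-8 can be sketched as follows. The function names are hypothetical, and the sketch assumes the buffer holds well-formed UTF-8 and that the starting position is on a character boundary; it does no validation.

```python
def utf8_char_next(buf: bytes, pos: int) -> int:
    """Return the index of the character that follows the one starting at pos."""
    if pos >= len(buf):
        return len(buf)
    pos += 1
    # Continuation bytes have the form 10xxxxxx; skip them to reach the next lead byte.
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

def utf8_char_count(buf: bytes) -> int:
    """Count characters by walking the buffer one character at a time."""
    count, pos = 0, 0
    while pos < len(buf):
        pos = utf8_char_next(buf, pos)
        count += 1
    return count

# One character from each row of the table: 1, 2, 3 and 4 bytes long.
sample = "a\u00e9\u20ac\U00010000".encode("utf-8")
boundaries = [0]
while boundaries[-1] < len(sample):
    boundaries.append(utf8_char_next(sample, boundaries[-1]))
print(boundaries)  # prints [0, 1, 3, 6, 10]
```

Skipping continuation bytes, rather than decoding the length from the lead byte, keeps the step simple and lets it resynchronize at the next lead byte even if it is handed a truncated character.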


Globalization of websites

Dear Dr. International,

I've read several of your articles on how to achieve globalization for an application, but have not been able to find any on the topic globalized web sites/platforms.

We are currently in the design stages of a project that calls for a single code base to support a web site that will be viewed by international users in several different countries (we're starting with UK, US, Israel and Germany).

We are running Windows 2000, IIS 5.0. All of our presentation logic is contained within ASP (VBScript) files. Our business logic is encapsulated in VB and VC++ DLLs. At the moment, we maintain separate code trees on the ASP layer to achieve multilingual support.

I am currently investigating ways in which the language/locale specific resources of each web site can be bundled together and "swapped" in/out on the presentation layer.

Can you point me to articles that specifically address web site globalization on all levels – database, business objects, and presentation.

Peter in UK

Dr. International replies:

Dear Peter,

You're right: we do not have a central reference point for web globalization, though we are working on one for the near future. For the time being, to answer your question: most of the globalization and localizability considerations are common to Win32 and Web development. However, in a web context, you can rely on your HTML/ASP rendering mechanism (for example, the Trident engine in Internet Explorer) to do the job for you. Here are the globalization requirements that you should be aware of in a web context:

Encode universally:

For HTML: <meta http-equiv="Content-Type" content="text/html; charset=<charset>">

For ASP (where <codepage> is a numeric codepage identifier, such as 65001 for UTF–8):

Per session: <% Session.CodePage = <codepage> %>

Per page: <%@ CODEPAGE=<codepage> %>

Locale awareness – Check out the browser sniffing question at: http://www.microsoft.com/globaldev/drintl/columns/001/default.mspx

Font independency – Make sure that you are not using in–line styles. Use CSS files instead to define the font type that you want to use:

<style>
.myStyle {font-size: 10pt; font-family: Arial;}
</style>
<span class = myStyle> Hello </span>

This approach will make it easier for localizers to define a per–language font later, during the localization process. You can also use the Web Embedding tool to link new fonts to your web pages. For more information, see http://www.microsoft.com/typography/web/default.htm.

Multilingual awareness: the content of your site should be offered in multiple languages, and users should be allowed to select their preferred UI language. In Internet Explorer, you can retrieve the browser's UI language from the browserLanguage property of the navigator object (navigator.browserLanguage).

Tips for enabling language awareness:

Store your translatable resources in a DB, XML file, or resource files.

Reference string resources by variables.

Detect the display language at initialization time.

Load appropriate resources at display time.
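The four tips above can be sketched in a few lines of Python. Everything here is hypothetical: the resource table, the key names, and the fallback to English are assumptions made for the example, and a real site would load its strings from a database, XML file, or resource file rather than a literal dictionary.

```python
# Translatable strings kept apart from the code and referenced by key.
RESOURCES = {
    "en": {"greeting": "Hello", "farewell": "Goodbye"},
    "de": {"greeting": "Hallo"},
}
DEFAULT_LANGUAGE = "en"

def pick_language(requested: str) -> str:
    """Choose a supported language for a tag such as 'de-AT', falling back to the default."""
    tag = requested.strip().lower()
    if tag in RESOURCES:
        return tag
    primary = tag.split("-")[0]
    return primary if primary in RESOURCES else DEFAULT_LANGUAGE

def get_string(language: str, key: str) -> str:
    """Look up a string, falling back to the default language if it is not translated yet."""
    table = RESOURCES.get(language, {})
    if key in table:
        return table[key]
    return RESOURCES[DEFAULT_LANGUAGE][key]

language = pick_language("de-AT")        # detected once, at initialization time
print(get_string(language, "greeting"))  # loaded at display time
print(get_string(language, "farewell"))  # not translated, so the default is used
```

Falling back to the default language for a missing key is one possible design choice; it keeps a partially translated site usable while localization is still in progress.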

Mirroring awareness:
To learn about mirroring in general, check out http://www.microsoft.com/globaldev/getwr/steps/WRG_mirror.mspx

In a Web context, mirroring and right–to–left (RTL) reading order go hand in hand and can be set by adding the direction attribute, DIR. The DIR attribute is added during the Arabic and Hebrew localization process and can be set:

at HTML level by: <html dir="rtl">

at element level by: <span dir="rtl">

at DHTML document object level by: document.dir = "rtl"

Setting the DIR attribute to RTL would:

Set the alignment of the text to the right

Set the reading order of the text to right–to–left

Mirror the page content

Leave the orientation of stationary elements (such as images) unchanged

Here is a small sample that demonstrates how to toggle mirroring on/off at run–time:

function mirror()
   {
   if (document.dir == "rtl")
      {
      document.dir = "ltr";
      }
   else
      {
      document.dir = "rtl";
      }
   }

Tips to accommodate BiDi localization:

As mentioned above, the DIR attribute leaves the orientation of stationary elements unchanged. However, for directional images (such as arrows), it might be necessary to change the direction of the image during BiDi localization. A filter style for images (IMG) allows developers to implement a flippable image: "<IMG style=filter:flipH SRC=arrow.jpg>".

Tables and cells are left aligned by default. Avoid specifying left alignment explicitly ("align = left"), since it overrides the DIR attribute and causes the BiDi text to be left aligned as well.

When the DIR attribute is applied, all controls are rearranged in the web page with a right alignment. Avoid using absolute positioning for controls.

Tables and cells are automatically mirrored, so take advantage of this feature and place your controls within cells for robust reversibility.

For more information about this subject, please search MSDN for scripting, and resource isolation. Also see the whitepaper outlining the international features of SQL Server 2000.


Multilanguage web forms

Dear Dr. International,

I am currently facing the following problem while developing an international e–business site:

Inside a WEB form,

1. some fields can be filled by user with different language characters

2. other ones are filled with a default language characters

For each field filled with a specific language characters, the user will specify the language used.

When this form is submitted to the web server, I need to get all the values set by the user and stored them inside a SQL 7 database (nvarchar() columns).

I am using the following configuration:

1. Web Server : NT4.0 SP6a, IIS4.0, ASP with JScript

2. SQL Server : NT4.0 SP6a, SQL7 SR1

3. Web browser : NT Workstation/W2k, IE5

Let's say my HTML form is like this one:

<html><head><title>Multi–language form</title></head>
<body> 
<form action=mypage.asp method=post> 
<input type=text name=field1>
<select name=lang_field1> 
<option value="">default</option> 
<option value="en">english</option> 
<option value="fr">french</option> 
<option value="gr">greek</option> 
<option value="ru">russian</option> 
</select> 
<br> 
<input type=text name=field2>
<select name=lang_field2> 
<option value="">default</option> 
<option value="en">english</option> 
<option value="fr">french</option> 
<option value="gr">greek</option> 
<option value="ru">russian</option>
</select> 
<br>
<input type=submit name=subf value="Submit these fields...">
</form>
</body>
</html> 

Now let's say user has filled the form with the following values:

1. field1=some russian text

2. lang_field1=ru

3. field2=some greek text

4. lang_field2=gr

Inside mypage.asp, when I call Request.Form(field_name), the current value of Session.Codepage is used and the translation to UNICODE seems to be applied on ALL the post data which is not convenient for me, as I need to use the codepage 1251 to get value of field1 and codepage 1253 to get value of field2.

Any help to solve this problem will be appreciate.

Webmaster from France

Dr. International replies:

Dear Webmaster,

The easiest way would be simply to use Unicode (UTF–8) as your encoding. You could instead set Session.CodePage based on the lang_fieldX settings to process each of the fields, but this is not a scenario the setting was designed for: it can be disrupted by server–side includes, and other problems can affect your data processing depending on the site architecture.

If your multilingual ASP uses string literals, then you need to set the @CODEPAGE directive as well. Without Unicode, the architecture becomes too complicated to be workable, because you need multiple language–specific ASP pages. UTF–8 is definitely the way to go!
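The difference between the two approaches can be illustrated with a short Python sketch. It is not ASP code; the sample Russian and Greek strings stand in for the values of field1 and field2, and the variable names are invented for the example.

```python
russian = "\u043f\u0440\u0438\u0432\u0435\u0442"   # sample text for field1
greek = "\u03b3\u03b5\u03b9\u03b1"                   # sample text for field2

# With per-language codepages, each field needs its own decoder.
field1_bytes = russian.encode("cp1251")
field2_bytes = greek.encode("cp1253")

# Decoding field1 with the Greek codepage silently produces the wrong letters.
wrong = field1_bytes.decode("cp1253", errors="replace")

# With UTF-8, one decoder handles every field in the same request.
fields = {"field1": russian.encode("utf-8"), "field2": greek.encode("utf-8")}
decoded = {name: raw.decode("utf-8") for name, raw in fields.items()}

print(wrong != russian)
print(decoded["field1"] == russian, decoded["field2"] == greek)
```

Because a single codepage applies to the whole posted form, the mis-decoding shown for field1 is what happens to at least one of the two fields unless the form is submitted as UTF-8.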

Note: If you are doing similar development work on IIS 5.0, please see Knowledge Base articles Q294831 and Q294833.

To set the compile–time encoding of an ASP page to UTF–8, start it with:

<%@ LANGUAGE="VBscript" CODEPAGE=65001 %>

('language' may be different, of course)

Then in the ASP code you should call

<% Session.CodePage = 65001 %>

before you process any data.

Finally, if you plan to display multilingual data, put the following in the HTML header:

<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
...

Happy Holidays! See you all next year.

Dr. International
Windows Division
