Global Development and Computing Portal

Ask Dr. International

Column #18

Over the past several months, I have received many questions about Windows XP's support for new features introduced in Unicode, namely Surrogates (more accurately referred to as Surrogate Pairs, or Surrogate Code Points), CJK Extension A, and CJK Extension B.

What is CJK Extension A?

CJK (Chinese, Japanese, and Korean) Unified Ideographs Extension A consists of 6,582 characters added to Unicode's CJK repertoire in Unicode Standard 3.0. Its block occupies the Unicode range U+3400 to U+4DBF.

What is CJK Extension B?

CJK Unified Ideographs Extension B consists of 42,711 characters added to Unicode's CJK repertoire in Unicode Standard 3.1. Its block occupies the Unicode range U+20000 to U+2A6DF.
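As a rough illustration, an application could test whether a character falls in either extension block by comparing its code point against the two ranges quoted above. This is a minimal sketch in Python; the function name `cjk_extension` is made up for this example and is not part of any Windows API.

```python
from typing import Optional

# Block ranges as quoted in the text: Extension A and Extension B.
CJK_EXT_A = range(0x3400, 0x4DBF + 1)
CJK_EXT_B = range(0x20000, 0x2A6DF + 1)


def cjk_extension(ch: str) -> Optional[str]:
    """Return "A" or "B" if ch lies in a CJK extension block, else None.

    Hypothetical helper for illustration only. It checks block membership,
    not whether a character is actually assigned at that code point.
    """
    cp = ord(ch)
    if cp in CJK_EXT_A:
        return "A"
    if cp in CJK_EXT_B:
        return "B"
    return None
```

For instance, `cjk_extension("\u3400")` reports Extension A, while `cjk_extension("\u4e00")` returns `None` because U+4E00 belongs to the original CJK Unified Ideographs block.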

What are surrogates?

There are two terms that describe the concept of surrogates more accurately:

1. Surrogate Pair: a coded character representation for a single abstract character that consists of a sequence of two code points, where the first point of the pair is a high-surrogate and the second is a low-surrogate. A high-surrogate is a Unicode code point in the range U+D800 through U+DBFF, and a low-surrogate is a Unicode code point in the range U+DC00 through U+DFFF.

2. Surrogate Code Point: a Unicode code point in the range U+D800 through U+DFFF. This term is reserved for use by UTF-16, where a pair of surrogate code points (a high-surrogate followed by a low-surrogate) "stand in" for a supplementary code point.

Surrogates or surrogate characters are misnomers because encoded characters cannot have a surrogate code point.

An example of a CJK Extension B character that UTF-16 represents with a surrogate pair is U+20001, which is encoded as the surrogate code points U+D840 U+DC01.
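The arithmetic behind that example can be sketched as follows. This is a minimal Python illustration of the standard UTF-16 calculation, using the surrogate ranges given above; the function name `to_surrogate_pair` is an assumption made for this example.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits select the high-surrogate
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits select the low-surrogate
    return high, low


high, low = to_surrogate_pair(0x20001)
print(f"U+{high:04X} U+{low:04X}")
```

For U+20001 the offset is 0x10001, so the upper 10 bits are 0x40 and the lower 10 bits are 0x001, which gives the pair quoted in the text.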

For more information, visit the Unicode Consortium's website, http://www.unicode.org.

How does GB18030 relate to CJK Extensions A and B?

GB18030 contains exactly the same characters as Unicode 3.2. This includes Latin, Greek, Arabic, Hebrew, Thai, and other characters, as well as CJK Extensions A and B. The Simsun18030.ttc font that ships with the Windows GB18030 support package contains the CJK Extension A glyphs, but not the CJK Extension B glyphs.

To find out more about GB18030, see Dr. International #15.

To what extent does Windows XP support CJK Extension A?

Windows XP has some support for CJK Ext A. For example, CompareString recognizes such characters and will sort them according to the default order, which does not match the linguistic practice for any East Asian language. Microsoft may implement the language-specific orders once they are defined. However, the system fonts that ship with Windows XP do not contain any of the CJK Ext A glyphs, so this support is largely invisible.

If the user obtains a font containing the CJK Ext A glyphs, e.g., simsun18030.ttc, they can display these characters in documents. However, input will be restricted to direct numeric entry. Also, since the font is not linked to a system UI font, CJK Ext A characters in UI and file names will appear as boxes.

Future system releases will address all these limitations.

To what extent does Windows XP support CJK Extension B?

Windows XP has some support for CJK Ext B. For example, CompareString recognizes such characters and will sort them according to the default order, which does not match the linguistic practice for any East Asian language. Microsoft may implement the language-specific orders once they are defined. However, the system fonts that ship with Windows XP do not contain any of the CJK Ext B glyphs, so this support is largely invisible.

If the user obtains a font containing the CJK Ext B glyphs, they can display these characters in documents. However, input will be restricted to direct numeric entry. Also, since the font is not linked to a system UI font, CJK Ext B characters in UI and file names will appear as boxes.

Future system releases will address all these limitations.

To what extent does Windows XP support GB18030?

Windows XP supports the GB18030 encoding. As with other encodings, Windows XP handles GB18030 content by converting it to UTF-16, for example for display or sorting. In addition, we provide a GB18030 support package that installs the simsun18030.ttc font and a small API layer for GB18030-native applications. See the GB18030 FAQ in Dr. International #15 for more information on this package.
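To make the conversion idea concrete, here is a minimal sketch that decodes GB18030 bytes and lists the resulting UTF-16 code units. It uses Python's built-in `gb18030` codec rather than any Windows API, and the function name `gb18030_to_utf16_units` is an assumption made for this example.

```python
from typing import List


def gb18030_to_utf16_units(data: bytes) -> List[int]:
    """Decode GB18030 bytes and return the text as UTF-16 code units.

    Hypothetical helper for illustration; it relies on Python's standard
    "gb18030" and "utf-16-be" codecs, not on the Windows conversion tables.
    """
    text = data.decode("gb18030")
    raw = text.encode("utf-16-be")
    # Every UTF-16 code unit is two bytes; read them in big-endian order.
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


# U+20001 (CJK Extension B) occupies four bytes in GB18030.
sample = "\U00020001".encode("gb18030")
print(len(sample), [f"{u:04X}" for u in gb18030_to_utf16_units(sample)])
```

A single four-byte GB18030 sequence for an Extension B character comes out as two UTF-16 code units, that is, a surrogate pair.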

The SimSun18030.ttc font contains the CJK Ext A glyphs but has neither the CJK Ext B glyphs nor the complete GB18030 repertoire. Indeed, since the GB18030 repertoire is identical to that of Unicode 3.2, no single TrueType font can hold all of it: the repertoire is larger than the 65,535-glyph limit of a TrueType font. Moreover, since simsun18030.ttc is not font-linked to SimSun (or any other system UI font), it can be used only by applications that explicitly select it, and not in controls, font names, or other UI that uses system fonts.

The NeiMa IME in Unicode input mode allows entry of the Simplified Chinese characters that are defined in CJK Extension A and Extension B and included in GB18030. Extension B characters are entered in the form of surrogate pairs. If you have only the SimSun18030 typeface, only the CJK Extension A characters are displayed.

For more information about GB18030 support on Windows 2000 and Windows XP, see Dr. International #15.

To what extent does Windows XP support surrogate pairs/code points?

Windows XP has some support for surrogate code points. For example, CompareString recognizes such characters and will sort them according to the default order mentioned earlier. However, the system fonts that ship with Windows XP do not contain any glyphs represented by surrogate pairs, so this support is largely invisible.
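An application that walks UTF-16 text itself has to treat each high-surrogate and low-surrogate pair as one character. The following is a minimal sketch of that scan in Python, not a description of how CompareString works internally; the function name `code_points_from_utf16` is an assumption made for this example.

```python
from typing import Iterable, List


def code_points_from_utf16(units: Iterable[int]) -> List[int]:
    """Combine UTF-16 code units into code points, pairing surrogates.

    Hypothetical helper for illustration. Raises ValueError on an
    unpaired surrogate instead of guessing at a replacement.
    """
    result: List[int] = []
    pending_high = None
    for unit in units:
        if pending_high is not None:
            if not 0xDC00 <= unit <= 0xDFFF:
                raise ValueError("high-surrogate not followed by low-surrogate")
            # Recombine the two 10-bit halves into a supplementary code point.
            result.append(0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00))
            pending_high = None
        elif 0xD800 <= unit <= 0xDBFF:
            pending_high = unit
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("low-surrogate without a preceding high-surrogate")
        else:
            result.append(unit)
    if pending_high is not None:
        raise ValueError("text ends in the middle of a surrogate pair")
    return result
```

Given the code units 0x0041, 0xD840, 0xDC01, 0x3400, the scan yields three code points: U+0041, U+20001, and U+3400.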

If the user obtains a font containing glyphs for characters represented by surrogate pairs, they can display these characters in documents. However, input will be restricted to direct numeric entry. Also, since the font is not linked to a system UI font, these characters will appear as boxes in UI and file names.

Future system releases will address all these limitations.

Regards,

Dr. International
Windows Division
