Over the past several months, I have received many questions about Windows XP's support for new features introduced in Unicode, namely Surrogates (more accurately referred to as Surrogate Pairs, or Surrogate Code Points), CJK Extension A, and CJK Extension B.
CJK (Chinese, Japanese, and Korean) Unified Ideographs Extension A consists of 6,582 characters added to Unicode's CJK repertoire when Unicode Standard 3.0 was ratified, and occupies the block U+3400 to U+4DBF.
CJK Unified Ideographs Extension B consists of 42,711 characters added to Unicode's CJK repertoire when Unicode Standard 3.1 was ratified, and occupies the block U+20000 to U+2A6DF.
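As a rough illustration, the two block ranges quoted above can be turned into a simple membership test. This is only a sketch: the function name `cjk_extension` is hypothetical, and the check covers the whole block, including any code points in it that have no character assigned.

```python
from typing import Optional

# Block ranges as quoted in the text above.
CJK_EXT_A = (0x3400, 0x4DBF)
CJK_EXT_B = (0x20000, 0x2A6DF)


def cjk_extension(ch: str) -> Optional[str]:
    """Return 'A' or 'B' if the single character ch falls in the
    CJK Extension A or B block, otherwise None."""
    cp = ord(ch)
    if CJK_EXT_A[0] <= cp <= CJK_EXT_A[1]:
        return "A"
    if CJK_EXT_B[0] <= cp <= CJK_EXT_B[1]:
        return "B"
    return None


if __name__ == "__main__":
    for ch in ("\u3400", "\U00020001", "\u4e00"):
        print(f"U+{ord(ch):04X}: {cjk_extension(ch)}")
```

Note that Extension A lies inside the Basic Multilingual Plane, while Extension B lies above U+FFFF and therefore needs a surrogate pair in UTF-16.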
There are two terms that describe the concept of surrogates more accurately:
1. Surrogate Pair: a coded character representation for a single abstract character that consists of a sequence of two code points, where the first of the pair is a high-surrogate and the second is a low-surrogate. A high-surrogate is a Unicode code point in the range U+D800 through U+DBFF, and a low-surrogate is a Unicode code point in the range U+DC00 through U+DFFF.
2. Surrogate Code Point: a Unicode code point in the range U+D800 through U+DFFF. These code points are reserved for use by UTF-16, where a pair of surrogate code points (a high-surrogate followed by a low-surrogate) "stands in" for a supplementary code point.
The terms "surrogates" and "surrogate characters" are misnomers, because no encoded character is ever assigned to a surrogate code point.
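The ranges in the two definitions above translate directly into a check on 16-bit values. The following is a minimal sketch, not a Windows API; the names `surrogate_kind` and `is_surrogate_pair` are made up for illustration.

```python
def surrogate_kind(code_unit: int) -> str:
    """Classify a 16-bit value as 'high', 'low', or 'none',
    using the ranges given in the definitions above."""
    if 0xD800 <= code_unit <= 0xDBFF:
        return "high"
    if 0xDC00 <= code_unit <= 0xDFFF:
        return "low"
    return "none"


def is_surrogate_pair(first: int, second: int) -> bool:
    """A well-formed pair is a high-surrogate followed by a low-surrogate."""
    return surrogate_kind(first) == "high" and surrogate_kind(second) == "low"


if __name__ == "__main__":
    print(is_surrogate_pair(0xD840, 0xDC01))  # high then low
    print(is_surrogate_pair(0xDC01, 0xD840))  # wrong order
```

A lone high-surrogate or low-surrogate, or a pair in the wrong order, does not represent any character.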
An example of a CJK Extension B character that is encoded as a surrogate pair is 𠀁 (U+20001, or U+D840 U+DC01 as a surrogate pair).
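The correspondence between U+20001 and U+D840 U+DC01 follows from the standard UTF-16 arithmetic: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. Here is a small sketch of that calculation; the function names are hypothetical, and the final line cross-checks the result against Python's own UTF-16 encoder.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    (high-surrogate, low-surrogate) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x20001)
    print(f"U+{high:04X} U+{low:04X}")     # U+D840 U+DC01
    print(f"U+{from_surrogate_pair(high, low):X}")  # U+20001
    # Cross-check against the built-in encoder.
    print("\U00020001".encode("utf-16-be").hex())   # d840dc01
```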
For more information, visit the Unicode Consortium's website, http://www.unicode.org.
GB18030 contains exactly the same characters as Unicode 3.2. This includes Latin, Greek, Arabic, Hebrew, and Thai characters, among others. It also includes CJK Extensions A and B. The SimSun18030.ttc font that ships with the Windows GB18030 support package contains the CJK Extension A glyphs, but not the CJK Extension B glyphs.
To find out more about GB18030, see Dr. International #15.
Windows XP has some support for CJK Ext A. For example, CompareString recognizes such characters and sorts them according to a default order, which does not match the linguistic practice of any East Asian language. Microsoft may implement the language-specific orders once they are defined. However, the system fonts that ship with Windows XP do not contain any of the CJK Ext A glyphs, so this support is largely invisible.
If the user obtains a font containing the CJK Ext A glyphs, e.g., SimSun18030.ttc, they can display these characters in documents. However, input will be restricted to direct numeric entry. Also, since the font is not linked to a system UI font, CJK Ext A characters in UI and file names will appear as boxes.
Future system releases will address all these limitations.
Windows XP has some support for CJK Ext B. For example, CompareString recognizes such characters and sorts them according to a default order, which does not match the linguistic practice of any East Asian language. Microsoft may implement the language-specific orders once they are defined. However, the system fonts that ship with Windows XP do not contain any of the CJK Ext B glyphs, so this support is largely invisible.
If the user obtains a font containing the CJK Ext B glyphs, they can display these characters in documents. However, input will be restricted to direct numeric entry. Also, since the font is not linked to a system UI font, CJK Ext B characters in UI and file names will appear as boxes.
Future system releases will address all these limitations.
Windows XP supports the GB18030 encoding. As with other encodings, Windows XP handles GB18030 content by converting it to UTF-16, for example for display or sorting. In addition, we provide a GB18030 support package that installs the SimSun18030.ttc font and a small API layer for GB18030-native applications. See the GB18030 FAQ in Dr. International #15 for more information on this package.
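To make the "convert to UTF-16" step concrete, here is a small sketch that decodes GB18030 bytes and lists the resulting UTF-16 code units. It uses Python's built-in `gb18030` codec purely as a stand-in; it is not the Windows conversion code, and the helper name `gb18030_to_utf16_units` is hypothetical.

```python
from typing import List


def gb18030_to_utf16_units(data: bytes) -> List[int]:
    """Decode GB18030 bytes and return the equivalent UTF-16 code units."""
    text = data.decode("gb18030")
    raw = text.encode("utf-16-le")
    # Every UTF-16 code unit is two bytes, little-endian here.
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


if __name__ == "__main__":
    # One Extension A character and one Extension B character.
    for ch in ("\u3400", "\U00020001"):
        gb = ch.encode("gb18030")
        units = gb18030_to_utf16_units(gb)
        print(gb.hex(), "->", " ".join(f"U+{u:04X}" for u in units))
```

The Extension A character comes back as a single UTF-16 code unit, while the Extension B character comes back as a surrogate pair.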
The SimSun18030.ttc font contains the CJK Ext A glyphs but contains neither the CJK Ext B glyphs nor the complete GB18030 repertoire. Indeed, since the entire GB18030 repertoire is identical to Unicode 3.2, no single TrueType font can hold it: the repertoire is larger than the 65,535-glyph limit for a TrueType font. Moreover, since SimSun18030.ttc is not font-linked to SimSun (or any other system UI font), the font can be used only by applications that explicitly select it; it cannot be used in controls, font names, or other UI that use system fonts.
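A quick back-of-the-envelope check shows why a single TrueType font cannot cover the repertoire. The sketch below counts only the CJK ideographs, using the assigned ranges from the Unicode code charts of that era (these ranges are my assumption here, not figures taken from the GB18030 package), and compares the total against the 16-bit glyph index limit.

```python
# Assigned ranges (inclusive) taken from the Unicode 3.x code charts.
# These are assumptions for illustration, not data from the font itself.
ASSIGNED_RANGES = {
    "CJK Unified Ideographs": (0x4E00, 0x9FA5),
    "CJK Extension A": (0x3400, 0x4DB5),
    "CJK Extension B": (0x20000, 0x2A6D6),
}

# TrueType glyph indices are 16-bit values.
MAX_GLYPHS = 0xFFFF


def range_size(first: int, last: int) -> int:
    """Number of code points in an inclusive range."""
    return last - first + 1


def total_ideographs() -> int:
    """Sum the sizes of all ranges listed above."""
    return sum(range_size(*r) for r in ASSIGNED_RANGES.values())


if __name__ == "__main__":
    for name, r in ASSIGNED_RANGES.items():
        print(f"{name}: {range_size(*r)}")
    print("total:", total_ideographs())          # total: 70195
    print("fits in one font:", total_ideographs() <= MAX_GLYPHS)  # False
```

The ideographs alone already exceed the limit, before any Latin, Greek, Arabic, Hebrew, Thai, or other characters are counted.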
The NeiMa IME in Unicode input mode allows the entry of Simplified Chinese characters that are defined in CJK Extension A and Extension B and included in GB18030. Extension B characters are entered in the form of surrogate pairs. If you have only the SimSun18030 typeface, only CJK Extension A characters are displayed.
For more information about GB18030 support on Windows 2000 and Windows XP, see Dr. International #15.
Windows XP has some support for surrogate code points. For example, CompareString recognizes characters encoded as surrogate pairs and sorts them according to the default order mentioned earlier. However, the system fonts that ship with Windows XP do not contain glyphs for any characters represented by surrogate pairs, so this support is largely invisible.
If the user obtains a font containing glyphs for characters encoded as surrogate pairs, they can display these characters in documents. However, input will be restricted to direct numeric entry. Also, since the font is not linked to a system UI font, such characters in UI and file names will appear as boxes.
Future system releases will address all these limitations.
Regards,
Dr. International
Windows Division