

Supporting multilanguage text layout and complex scripts with Windows 2000

By F. Avery Bishop, David C. Brown, David M. Meltzer

This article is provided for historic reference purposes. The three authors of this article worked in the Windows Operating System division International group at Microsoft. F. Avery Bishop was an evangelist for international software development. David C. Brown was the architect of Uniscribe. David M. Meltzer managed the OpenType specification and related developer resources prior to leaving Microsoft.


This article assumes you’re familiar with International Windows.


Introduction

All international versions of Windows 2000, from Japanese to Hebrew, are based on the same binary files. In addition, Microsoft has created new services for text layout that support a wide range of languages. By internationalizing the Windows text layout interfaces, Microsoft has made it easier to develop applications that can lay out text for almost any language.

For developers of global applications, the shift to a single worldwide binary strategy is welcome news; different localized versions of an application can now be developed under a single Windows 2000-based system. That is, for most applications you won't have to switch from one localized system to another during development (although you should certainly test on each targeted platform before release).

In this article, we'll explore techniques for developing applications that handle multilingual text and complex scripts. We'll introduce Uniscribe, the Windows Unicode script processor, which comes with Windows 2000. This article also will cover existing interfaces that Microsoft has extended to meet the requirements of complex text layout.


Multilingual Features in Windows 2000

All language versions of Windows 2000 will be enabled for all supported languages, including European and Far Eastern languages. This includes languages written with complex scripts such as Arabic, Hebrew, Thai, Devanagari, and Tamil. Applications that display plain text using Unicode can handle mixed text from any of the supported scripts. For example, you can pass a Unicode string containing French, Thai, Hindi, Korean, and Arabic text to ExtTextOutW, which will display the whole string in one pass. Thus, a Unicode application (one compiled with the -DUNICODE option) can display text in any of the supported scripts, without changing locales, on any language version of Windows 2000.

An ANSI application that targets a particular language can run on any language version of Windows 2000 if you first set the system default appropriately. For example, setting the system default locale to Japanese on a system localized to French allows the user to run a Japanese ANSI application.

Windows 2000 includes a new Unicode script processor called Uniscribe that supports line measurement, display, caret movement, character selection, justification, and line breaking of Unicode plain text and rich text.

The input method manager (IMM) runs on any version of Windows 2000. You can install Chinese, Japanese, and Korean input method editors (IME), and use them to enter text in the appropriate language. Similarly, all keyboard drivers work on all language versions of Windows 2000, and all locales are supported by all versions of Windows 2000. For example, users on any system can set their user locale to Arabic and the calendar type to Hijri.

A multilanguage version of Windows 2000 will allow per-user setting of the user-interface language. One user can see system messages, menus, and other text in Japanese, while another user logging onto the same system can see the corresponding text in French. Plus, Windows 2000 can provide a fallback glyph for characters that have no corresponding glyph in the currently selected font. This enables an application to display multilingual Unicode plain text.


Characteristics of Complex Scripts

The rules governing the shaping and positioning of glyphs are specified and cataloged within the Unicode standard. The shaping engines that comprise the Windows Unicode Script Processor implement this standard for applications performing complex text layout.

A complex script is one that requires special processing to display and edit because the characters are not laid out in a simple linear progression from left to right, as most European characters are. This special processing falls into several general classes.

Character Reordering: Characters must be rearranged from logical (keystroke) order to visual order.

Contextual Shaping: In some languages, the choice of which glyph to display depends on the surrounding characters.

Display of Combining Characters and Diacritics: Multiple characters must be stacked or combined into one cluster.

Specialized Word-break and Justification Rules: Some languages require special word-break logic because there is no fixed set of characters that delimit words.

Cursor Movement and Hit Testing: The mapping between screen position and a character index for, say, selection of text or cursor display requires knowledge of the layout algorithms.


The OpenType Font Format

The Unicode-based OpenType font format has been developed jointly by Microsoft and Adobe; it extends the TrueType font file format originally designed by Apple. OpenType fonts allow mapping between characters and glyphs, enabling support for ligatures, positional forms, alternates, and other substitutions. OpenType fonts may also include information that supports two-dimensional glyph positioning and glyph attachment, and may contain either TrueType or PostScript outlines.

Layout features within OpenType fonts are organized by scripts and languages, allowing a single font to support multiple writing systems, even within the same script. To ensure consistency in text layout operations and to avoid unnecessary overhead in font files or applications, many of the text layout and language semantic algorithms are included in Uniscribe. This relieves the font developer from having to define generalized script rules within a font.

Applications may introduce their own knowledge or preferences regarding script layout. OpenType layout fonts may even contain layout rules that duplicate or supersede those applied by OS services. The layered structure of OS services supporting text layout allows a client to choose which layout information to use, and how to apply it.

At a minimum, font developers should be able to expect that an application has knowledge of (or services for executing) script rules as defined in the Unicode standard. Application developers should be able to expect that a font has glyphs and positioning information representing layout features as defined by the Unicode standard.


Multilingual Input

Chapter 6 of Developing International Software for Windows 95 and Windows NT by Nadine Kano (Microsoft Press, 1995) explains how an application can handle switching of the input locale by the user. When the book was published, support for switching input locales was only available in Windows 95; Windows NT 3.5 did not support the new messages and interfaces. This support has been in Windows NT since version 4.0.

Notice that we use the term "input locale" rather than "keyboard layout." When users change the input locale, they're doing much more than changing the keyboard layout, and in some cases a keyboard may not be involved at all. The input locale consists of an input language and a method of input. The input language can be any LangID supported in Windows NT, as found in the WINNT.H header file. The method of input is often a keyboard layout, but it can be anything from an IME to a speech recognition system.

The input language is of interest to the user and to the application. For example, applications can use the input language to tag text or to choose a new font. In general, however, applications do not care what method of input is used. The application always gets the characters from the user in the same way, through the WM_CHAR or WM_IME_CHAR messages. (IME-aware applications, those that display their own UI for an IME, are an obvious exception.)

When the user changes the input locale, the application receives a WM_INPUTLANGCHANGEREQUEST message, with lParam set to the new HKL. (HKL originally stood for "handle to a keyboard layout," but as you've seen, it isn't necessarily associated with a keyboard at all.) The HKL contains the LangID of the new language in the low word. You can use this to select compatible fonts or to tag text for other processing.


DWORD   dwCodePage ;
HKL     hkl = (HKL) lParam ;
LOGFONT lf ;
HDC     hDc ;
CHARSETINFO cs ;
TCHAR   szLocaleData [BUFFER_SIZE] ;

// EnumFontFamiliesEx examines lfCharSet, lfFaceName, and
// lfPitchAndFamily, so clear the whole structure first.
// An empty face name enumerates all faces for the charset.
ZeroMemory (&lf, sizeof (lf)) ;
lf.lfCharSet = DEFAULT_CHARSET ;

// This is a hack for Hindi and Tamil, since they
// don't have charsets. Mangal and Latha are the
// fonts for Hindi and Tamil shipping with Windows NT 5.0.
// A better hack would be to put these strings in
// data files that can be updated with new
// typeface names. You would then call
// EnumFontFamiliesEx once per face name.
if (LOWORD(hkl) ==
    MAKELANGID(LANG_HINDI, SUBLANG_DEFAULT))
{
    lstrcpy (lf.lfFaceName, TEXT("Mangal")) ;
}
else if (LOWORD(hkl) ==
    MAKELANGID(LANG_TAMIL, SUBLANG_DEFAULT))
{
    lstrcpy (lf.lfFaceName, TEXT("Latha")) ;
}
else
{
    // Find out what charset the new input locale wants.
    // A code page number needs at most 6 characters,
    // including the terminating null.
    if (GetLocaleInfo (MAKELCID (LOWORD(hkl), SORT_DEFAULT),
                       LOCALE_IDEFAULTANSICODEPAGE,
                       szLocaleData, 6))
    {
        dwCodePage = _ttol (szLocaleData) ;

        // With TCI_SRCCODEPAGE, the first argument carries the
        // code page value itself rather than a pointer to it.
        if (TranslateCharsetInfo (
                (DWORD *) (DWORD_PTR) dwCodePage, &cs,
                TCI_SRCCODEPAGE))
        {
            lf.lfCharSet = (BYTE) cs.ciCharset ;
        }
    }
}

// Get list of fonts that support this charset.
// hDc is needed by EnumFontFamiliesEx; the callback uses hDlg.
hDc = GetDC (hWnd) ;
EnumFontFamiliesEx (hDc, &lf,
                    (FONTENUMPROC) EnumFontProc, (LPARAM) hDlg,
                    (DWORD) 0) ;
ReleaseDC (hWnd, hDc) ;

Figure 1. Finding a Locale-compatible Font

For example, the code in Figure 1 identifies the charset corresponding to the new input locale and enumerates a set of compatible fonts. Note that hWnd is the window handle where the font will be used, and hDlg is a dialog to display the font list. In this example, the callback function passed to EnumFontFamiliesEx is as follows:

 // Callback invoked once for each enumerated font.
 int CALLBACK EnumFontProc (ENUMLOGFONTEX* lpelfe,
                            NEWTEXTMETRICEX* lpntme,
                            int iFontType, LPARAM lParam)
 {
     // Size computed from format used below
     // and buffer limits
     TCHAR szFaceName [4+LF_FULLFACESIZE+LF_FACESIZE] ;

     wsprintf (szFaceName, TEXT("%s (%s)"),
               lpelfe->elfFullName, lpelfe->elfScript) ;

     // Add string to listbox to describe this font
     SendDlgItemMessage ((HWND) lParam,
                         IDC_FONTLIST, LB_ADDSTRING,
                         (WPARAM) 0,
                         (LPARAM) szFaceName) ;

     // A nonzero return value continues the enumeration
     return TRUE ;
 }


Each time the callback function is called, it builds a string containing the full typeface name and the script name (the elfScript field), and adds that string to the listbox in a dialog box.

As you can see from the sample code in Figure 1, Indic scripts must be handled separately because they have no charset values. Since there is no default ACP value for Indic scripts, none of the Win32 ANSI entry points (the A routines) will work with Indic text. Indic text is not automatically translated to Unicode. There are ways to force translation of Indic text to or from Unicode using MultiByteToWideChar and WideCharToMultiByte by specifying the appropriate code page. However, an Indic input locale can only pass Indic text to a Unicode application, so full support for Indic scripts requires a Unicode application.


Doing Text Layout Using Win32 APIs

An application has the following options for performing text layout:

  • Calling Win32 text APIs
  • Instantiating Win32 edit controls
  • Instantiating a RichEdit control
  • Calling Uniscribe

Some applications will use a combination of these methods. Responsibility for performing and tracking text layout operations depends on a client's implementation model. For example, some clients handle line breaking and some don't. Some functionality, such as managing memory and maintaining a backing store, is shared by all clients.

Many applications deal mostly in plain text—text that is all in the same typeface, weight, color, and so on. Such applications have traditionally displayed text using standard Win32 display entry points (TextOut, ExtTextOut, TabbedTextOut, and DrawText) to write text to a window, and the GetTextExtent family of functions to measure line lengths. As you'll see later, Uniscribe provides ScriptString APIs for better plain text processing on Windows 2000 and Windows 9x. There is good news for existing applications that use the standard Win32 API for plain text processing; it just works!

In Windows 2000, the standard entry points have been extended to support display of complex scripts and, through the font fallback mechanisms mentioned earlier, multilingual Unicode text. In general, this support is transparent to the application itself, so properly designed applications require no changes to support complex scripts through these interfaces.

There are two requirements for displaying complex scripts correctly using the standard Win32 entry points. First, applications should save characters in a buffer and display the whole line of text at once rather than, for example, calling ExtTextOut on each character as it is typed in by the user. When characters are written out one by one, the complex script shaping modules cannot determine the context for correct reordering and glyph shaping.

Second, applications should use one of the GetTextExtentXxx functions to determine line length rather than computing line lengths from cached character widths. This is because the width of a glyph used to display a character may vary by context.


/* 
If static variables bother you, use your 
favorite technique to save the state 
of lAlign without using a static variable.
*/
static LONG lAlign = TA_LEFT ;

// Do the following when the user 
// toggles alignment. This assumes
// that the TA_CENTER alignment is 
// not supported
lAlign ^= TA_RIGHT ;

// Do this when the user toggles reading
// order
lAlign ^= TA_RTLREADING ;

// 
// Before calling ExtTextOut, (e.g., when
// processing WM_PAINT messages), do the
// following:
SetTextAlign (hDc, lAlign) ;

Figure 2. Toggling Alignment and Reading Order

In addition, complex script-aware applications should consider adding support for right-to-left reading order and right alignment. You can toggle the reading order or alignment between left and right with the code shown in Figure 2. Of course, you can toggle both attributes at once, as Notepad does, by executing the following statement:

lAlign ^= TA_RIGHT|TA_RTLREADING;

Follow this with the calls to SetTextAlign and ExtTextOut as shown in Figure 2.
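
To see why Figure 2 assumes that TA_CENTER is not in use, it can help to look at the bit values involved. The following is a portable sketch, not part of the sample application: the SK_ constants are hypothetical local names that repeat the values wingdi.h assigns to TA_LEFT, TA_RIGHT, TA_CENTER, and TA_RTLREADING, and the helper functions are invented for this illustration.

```c
#include <assert.h>

/* Hypothetical local copies of the wingdi.h values, so that the
   sketch compiles without the Windows headers. */
#define SK_TA_LEFT        0x0000u
#define SK_TA_RIGHT       0x0002u
#define SK_TA_CENTER      0x0006u
#define SK_TA_RTLREADING  0x0100u

/* Flip between left and right alignment. Only meaningful when the
   current horizontal alignment is left or right, not center. */
unsigned toggle_alignment(unsigned align)
{
    assert((align & SK_TA_CENTER) != SK_TA_CENTER);
    return align ^ SK_TA_RIGHT;
}

/* Flip the reading order bit without touching the alignment. */
unsigned toggle_reading_order(unsigned align)
{
    return align ^ SK_TA_RTLREADING;
}

/* Flip alignment and reading order together. */
unsigned toggle_both(unsigned align)
{
    return align ^ (SK_TA_RIGHT | SK_TA_RTLREADING);
}
```

Because TA_CENTER (0x6) shares a bit with TA_RIGHT (0x2), XOR-ing a centered value with TA_RIGHT would leave 0x4, which is none of the three horizontal alignments. That is the reason the toggle is only safe when centering is not supported.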


Standard Edit Control

The standard edit control has been extended in Windows 2000 to support multilingual text and complex scripts. This includes not only input and display, but also correct cursor movement over character clusters (in Thai and Devanagari script, for example).

As with the standard Win32 API functions, a well-written application will receive this support automatically, without modification. Again, you should consider adding support for right-to-left reading order and right alignment. In this case, toggle the extended style flags of the edit control window to control these attributes:

 // ID_EDITCONTROL is the control ID in the
 // resource file.
 HWND hWndEdit = GetDlgItem(hDlg, ID_EDITCONTROL);
 LONG lAlign = GetWindowLong(hWndEdit, GWL_EXSTYLE) ;
 //...
 // To toggle alignment
 lAlign ^= WS_EX_RIGHT ;
 // To toggle reading order
 lAlign ^= WS_EX_RTLREADING ;

After setting the lAlign value, enable the new display by setting the extended style of the edit control window as follows:

 // (This assumes your edit control is in a
 // dialog box. If not, you can 
 // get the edit control handle
 // from another source)
 SetWindowLong(hWndEdit, GWL_EXSTYLE, lAlign);
 InvalidateRect(hWndEdit, NULL, FALSE);

One new feature of the standard edit control is a context menu that allows the user to toggle the reading order and insert/display Unicode bidirectional control characters (see Figure 3).


Figure 3. Edit Control Context Menu


RichEdit Control

RichEdit 3.0 is a higher-level collection of interfaces that takes advantage of Uniscribe to further insulate text layout clients from the complexities of certain scripts. RichEdit is designed for clients whose primary purpose is not necessarily text layout, but who nonetheless need to display complex scripts.

RichEdit provides fast, versatile editing of rich Unicode multilingual text and simple plain text. It includes extensive message and COM interfaces, text editing, formatting, line breaking, simple table layout, vertical text layout, bidirectional text layout, Indic and Thai support, a Word-like edit UI, and Text Object Model interfaces. RichEdit is the simplest way for a client to support features of complex scripts. Clients use its TextOut function to automatically parse, shape, position, and break lines.


Uniscribe

The new Unicode Script Processor (USP10.DLL), also known as Uniscribe, is a collection of APIs that enables a text layout client to format complex scripts. Uniscribe supports the complex rules found in scripts such as Arabic, the Indic scripts, and Thai. Uniscribe also handles scripts written from right to left, such as Arabic or Hebrew, and supports the mixing of scripts. For plain-text clients, Uniscribe provides a range of ScriptString functions that are similar to TextOut, with additional support for caret placement. The remainder of the Uniscribe interfaces provide finer control to clients.

Although native to Windows NT 5.0, the Uniscribe DLL may also be distributed for use on Windows NT 4.0, Windows 95, and Windows 98-based systems. USP10.DLL is also expected to ship with Microsoft Internet Explorer 5.0.

Uniscribe uses multiple shaping engines that contain the layout knowledge for particular scripts (see Figure 4). It also takes advantage of the OpenType layout shaping engine for handling font-specific script features such as glyph generation, extent measurement, and word-breaking support.

Figure 4.

Uniscribe subdivides strings of characters into items (a character string having all the same script and direction attributes), runs (portions of an item that have continuous formatting attributes), and clusters (script-defined, indivisible character groupings). The client builds runs based on its own stored formatting attributes and on the item boundaries obtained by calling the Uniscribe ScriptItemize API.

The Uniscribe ScriptShape API breaks a run into clusters according to script rules and then generates glyphs. The ScriptPlace API generates advance widths and x, y offsets for those glyphs. The ScriptTextOut API then displays the glyphs using these positions.

Uniscribe supports line breaking at word boundaries through ScriptBreak. Hit testing and cursor positioning are supported by ScriptCPtoX and ScriptXtoCP. Character-to-glyph mapping is provided by ScriptGetCMap. Uniscribe manages bidirectional character reordering using the Unicode bidirectional algorithm, and understands non-OpenType layout font formats for Arabic, Hebrew, and Thai shaping.

Using Uniscribe, clients need only manage a backing store of Unicode character codes. Text layout clients do not need to maintain any other buffer or mapping table to track character order. A client only needs to store and manage the order in which the characters were entered by the user. This is the same logical order as defined by Unicode. The client's backing store never changes as a result of layout operations. Uniscribe maintains an index from the reordered clusters to the original character boundaries passed by the client. Using the Uniscribe interfaces ScriptCPtoX and ScriptXtoCP, clients can support cursor positioning and hit testing.
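
The idea behind the two mapping directions can be illustrated with a small, portable sketch. It is not Uniscribe code and does not call ScriptCPtoX or ScriptXtoCP: the function names are hypothetical, and the sketch assumes a single left-to-right run in which every character is its own cluster, so a plain array of per-character widths is enough. The real APIs also account for clusters and right-to-left runs.

```c
#include <assert.h>

/* Character position to x offset. "trailing" selects the trailing
   (right-hand, in a left-to-right run) edge of the character. */
int sketch_cp_to_x(const int *widths, int n, int cp, int trailing)
{
    int i, x = 0;

    assert(cp >= 0 && cp <= n);
    for (i = 0; i < cp; i++)
        x += widths[i];
    if (trailing && cp < n)
        x += widths[cp];
    return x;
}

/* x offset to character position. *trailing is set to 1 when x falls
   in the second half of the character, which is what a caret needs
   to decide on which side of the character to sit. */
void sketch_x_to_cp(const int *widths, int n, int x,
                    int *cp, int *trailing)
{
    int i, left = 0;

    for (i = 0; i < n; i++) {
        if (x < left + widths[i]) {
            *cp = i;
            *trailing = (x - left) * 2 >= widths[i];
            return;
        }
        left += widths[i];
    }
    *cp = n;        /* x lies beyond the end of the run */
    *trailing = 0;
}
```

Note that both functions work purely from the logical character index, so the caller's backing store is never consulted or modified.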

All Uniscribe APIs are Unicode APIs. Uniscribe is a single API for Unicode output across Microsoft's operating system range. Scripts are supported as shown in Figure 5.

Platform | Western scripts | Arabic, Hebrew, Thai, Vietnamese | Hindi, Tamil | Chinese, Japanese, Korean
Windows NT 4.0 X X X X
Windows NT 5.0 X X X X
Windows 98 X X X
Windows 95 (FE) X
Windows 95 (rest) X X X

Figure 5. Scripts supported by platform


Problems with Common Text-layout Methods

The most common way to break a simple text paragraph into filled lines is to sum the widths of the individual characters until the line is full, and then back up to the closest preceding space. Lines of simple text are conventionally split into runs based on hdc attributes such as font and color. Each run is displayed immediately to the right of the previous run, using SelectObject to set the style and ExtTextOut to display the text.
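
As a point of comparison for what follows, here is a minimal sketch of that conventional approach in portable C. The function name and the 256-entry width table are assumptions made for this illustration; a real application would fill the table from the selected font. The sketch keeps a trailing space on the line it ends, and it forces a break inside a word when no space is available.

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many characters of text[0..len) go on the first line.
   widths is a cached table of character widths indexed by character
   code, which is exactly the assumption that complex scripts break. */
size_t naive_line_break(const char *text, size_t len,
                        const int widths[256], int max_width)
{
    int used = 0;
    size_t i, j;

    assert(text != NULL && widths != NULL);

    /* Sum character widths until the line is full. */
    for (i = 0; i < len; i++) {
        used += widths[(unsigned char) text[i]];
        if (used > max_width)
            break;
    }
    if (i == len)
        return len;             /* the whole text fits */

    /* If the overflowing character is a space, end the line after it. */
    if (text[i] == ' ')
        return i + 1;

    /* Otherwise back up to just after the closest preceding space. */
    for (j = i; j > 0 && text[j - 1] != ' '; j--)
        ;
    if (j > 0)
        return j;

    /* No space on the line: break inside the word, keeping at least
       one character so that the caller always makes progress. */
    return i > 0 ? i : 1;
}
```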

For complex scripts these simple approaches have problems. First, the width of a complex script character depends on its context. It is not possible to save the widths in simple tables. Second, breaking between words in scripts like Thai requires dictionary support since there is no separator character between Thai words. Third, Arabic, Hebrew, Farsi, Urdu and other bidirectional text requires reordering before display. And finally, some form of font association is often required to easily use complex scripts.


Formatting Paragraphs in the Sample Application

Our sample application, CSSamp, demonstrates Uniscribe APIs displaying text (see Figure 6). (The code for this example is included in the archive, which can be found at the top of this article. Ed.) DspPlain.cpp shows how to use the ScriptString APIs to display plain text. These APIs give similar functionality to ExtTextOut, DrawText, and GetTextExtent, providing full complex script support under Windows 9x and Windows NT 4.0 and higher. DspFormt.cpp shows how to use lower-level APIs such as ScriptItemize, ScriptShape, and ScriptPlace to display high quality formatted text. In the sample application, all the paragraph formatting code is in DspFormt.cpp.

Figure 6.

Let's summarize the text formatting process. First, runs are built that are unique in style, script, and direction. Then, lines are broken into whole runs. For each line, a map is built from visual position to a run. For each run, the codepoints are shaped in visual order into glyphs, which are then positioned and rendered.

The function PaintFormattedText in DspFormt.cpp breaks the text buffer (g_wcBuf) at CR+LF and passes single paragraphs to PaintFormattedTextPara. To keep the app simple, it reapplies the entire layout process every time it displays a paragraph. (In real-world applications, it may be a better idea to save formatting information such as run sizes and line boundaries.)

The entry conditions for PaintFormattedTextPara are the text buffer (g_wcBuf), the text style list (head at g_pFirstFormatRun), and the character positions at the start and end of the paragraph. The entire paragraph is first split into runs, each of a single script, a single style, and a single direction. The paragraph is broken into lines by measuring runs in logical order until the line overflows. A word-breaking algorithm then breaks the overflowing run between the current line and the next line. Finally, the lines are displayed one at a time by PaintFormattedTextLine.

BuildParaRunList creates runs of text that contain no changes of style, script, or direction.

The Uniscribe API ScriptItemize is passed the entire paragraph and then breaks it at script and direction boundaries. Script boundaries are determined by the codepoint coverage of the shaping engines; direction boundaries are evaluated according to the Unicode bidirectional algorithm. ScriptItemize accepts many options in its SCRIPT_CONTROL and SCRIPT_STATE parameters, some of which are Unicode bidirectional algorithm control and choice of digits. ScriptItemize fills a buffer with information about each item, including an internal script enumeration, the direction encoded as a Unicode embedding level, and the flags to be passed to the shaping engine.


Shaping Engines and Font Association

Itemization serves two main purposes. It breaks the string into runs that match the codepoint ranges of the shaping engines or where a change of direction will require reordering. Direction changes are identified according to the Unicode bidirectional algorithm. Many applications will do their own reordering, applying what Unicode calls a higher protocol, because applications generally know more about a string than just the codepoints and can do a better job of reordering than the Unicode bidirectional algorithm. For example, an application may derive directionality from the keyboard layout used to enter a character. This approach gives a consistent and easily understood user interface. If that's what you want your application to do, instruct ScriptItemize to break only for the shaping engine by passing NULL to the SCRIPT_CONTROL and SCRIPT_STATE parameters.

The sample application merges the items in the ScriptItemize buffer with its own style list, returning a paragraph run list. The representation of the style list and paragraph run list in the sample is arbitrary. The sample uses a linked list with nodes containing length and style. You may prefer a dynamic array or an STL container.
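
The merge itself amounts to walking two sets of boundaries in step. The following portable sketch shows one way to do it with plain arrays rather than the sample's linked list; ParaRun and build_run_list are hypothetical names, not taken from CSSamp. It assumes the item boundaries are given as start offsets followed by one terminal entry equal to the paragraph length (ScriptItemize appends such a terminal item), and that the style list is given as positive run lengths covering the same paragraph.

```c
#include <assert.h>

/* One output run: uniform in both item (script, direction) and style. */
typedef struct {
    int start;   /* offset of the first character in the paragraph */
    int length;  /* number of characters                           */
    int item;    /* index of the item the run belongs to           */
    int style;   /* index of the style the run belongs to          */
} ParaRun;

/* item_starts holds n_items + 1 entries; the last one is the total
   length. Returns the number of runs written, or -1 if out is full. */
int build_run_list(const int *item_starts, int n_items,
                   const int *style_lengths, int n_styles,
                   ParaRun *out, int max_out)
{
    int pos = 0, item = 0, style = 0, n = 0;
    int style_end;

    assert(n_items > 0 && n_styles > 0);
    style_end = style_lengths[0];

    while (item < n_items && style < n_styles) {
        int item_end = item_starts[item + 1];
        int end = item_end < style_end ? item_end : style_end;

        if (end > pos) {
            if (n == max_out)
                return -1;
            out[n].start = pos;
            out[n].length = end - pos;
            out[n].item = item;
            out[n].style = style;
            n++;
        }
        pos = end;

        /* Advance whichever list (or both) ended at this boundary. */
        if (pos == item_end)
            item++;
        if (pos == style_end) {
            style++;
            if (style < n_styles)
                style_end += style_lengths[style];
        }
    }
    return n;
}
```

For example, two items starting at offsets 0 and 5 in a 12-character paragraph, combined with style runs of lengths 3 and 9, yield three runs: characters 0 to 2, 3 to 4, and 5 to 11.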

Lines containing one or more runs are constructed by measuring the runs in logical order until a run causes the line to overflow. The overflowing run is passed to BreakRun, which determines a suitable wordbreak position. BreakRun uses ScriptGetLogicalWidths to convert the glyph widths returned by ScriptPlace into character widths. ScriptGetLogicalWidths returns virtual character widths ordered one for one with the logical character buffer. These widths are summed to identify the physical end of line as a logical character position.

BreakRun then uses ScriptBreak to obtain character classifications, including whitespace and the start of a word in scripts like Thai. BreakRun retreats from the character break position to the nearest preceding start of a word. The run is split at this point. Spaces, if any, are left attached to the end of the previous line so that the new line always begins with a nonspace character. (For simplicity, the sample does not implement Far East word-breaking rules.) BreakRun treats a break request at the beginning of a line as a special case to ensure that each line contains at least one cluster. In this case BreakRun uses the logical cluster array returned by ScriptShape to make sure that combining characters are not split from their base characters.
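
The retreat logic can be sketched in portable C as follows. This is not the code from BreakRun. LogAttr is a hypothetical structure that loosely stands in for the per-character attributes ScriptBreak reports, with break_before meaning that a line may start at this character, and widths stands in for the logical widths described above. Cluster handling is left out: the final fallback simply keeps one character where the sample would keep one whole cluster.

```c
#include <assert.h>

/* Hypothetical per-character break attributes. */
typedef struct {
    int break_before;  /* a new line may begin at this character */
    int white_space;   /* the character is whitespace            */
} LogAttr;

/* Returns how many of the n characters stay on the current line,
   given avail units of remaining width. */
int find_line_break(const int *widths, const LogAttr *attrs,
                    int n, int avail)
{
    int used = 0, i = 0, b;

    assert(n > 0);

    /* Sum logical widths to find the first character that overflows. */
    while (i < n && used + widths[i] <= avail) {
        used += widths[i];
        i++;
    }
    if (i == n)
        return n;               /* the whole run fits */

    /* Overflow on whitespace: leave the spaces on this line so the
       next line begins with a nonspace character. */
    if (attrs[i].white_space) {
        b = i;
        while (b < n && attrs[b].white_space)
            b++;
        return b;
    }

    /* Otherwise retreat to the nearest preceding start of a word. */
    for (b = i; b > 0 && !attrs[b].break_before; b--)
        ;
    if (b > 0)
        return b;

    /* No break opportunity: keep at least one character. */
    return i > 0 ? i : 1;
}
```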

Next, PaintFormattedTextLine is passed a single line of runs for display. Before the runs can be rendered, the correct display order must be established.

BuildVisualDisplayOrder passes the Unicode embedding levels from the runs in the line to the ScriptLayout API. ScriptLayout can return both logical-to-visual and visual-to-logical mapping arrays. A logical-to-visual mapping array is indexed by a logical (stored) run offset, providing the appropriate visual position for each run. A visual-to-logical mapping is indexed by a visual run position, providing the index of the logical run that should be displayed at that position. BuildVisualDisplayOrder uses the logical-to-visual mapping to construct and return two arrays. pVisualOrder is indexed by a visual run index and provides a pointer to the logical run that should be displayed at that visual index. iPos is indexed by a visual run index and returns the offset to the first character of the logical run that should be displayed at that visual index.
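
To make the two mapping arrays concrete, here is a portable sketch that derives them from a list of embedding levels. It does not call ScriptLayout. Instead it reimplements the reordering rule of the Unicode bidirectional algorithm (from the highest level down to the lowest odd level, reverse every contiguous sequence of runs at that level or higher), and build_visual_order is a hypothetical name chosen for this example.

```c
#include <assert.h>

/* levels[i] is the embedding level of logical run i.
   On return, visual_to_logical[v] is the logical run shown at visual
   position v, and logical_to_visual[l] is the visual position of
   logical run l. */
void build_visual_order(const unsigned char *levels, int n,
                        int *visual_to_logical, int *logical_to_visual)
{
    int i, lvl, highest = 0, lowest_odd = 256;

    assert(n >= 0);

    for (i = 0; i < n; i++) {
        visual_to_logical[i] = i;
        if (levels[i] > highest)
            highest = levels[i];
        if ((levels[i] & 1) && levels[i] < lowest_odd)
            lowest_odd = levels[i];
    }

    /* With no odd level present the loop body never runs and the
       visual order equals the logical order. */
    for (lvl = highest; lvl >= lowest_odd; lvl--) {
        i = 0;
        while (i < n) {
            int j, a, b;

            if (levels[visual_to_logical[i]] < lvl) {
                i++;
                continue;
            }
            /* Find the end of the sequence at this level or higher. */
            j = i;
            while (j + 1 < n && levels[visual_to_logical[j + 1]] >= lvl)
                j++;
            /* Reverse the sequence in place. */
            for (a = i, b = j; a < b; a++, b--) {
                int tmp = visual_to_logical[a];
                visual_to_logical[a] = visual_to_logical[b];
                visual_to_logical[b] = tmp;
            }
            i = j + 1;
        }
    }

    for (i = 0; i < n; i++)
        logical_to_visual[visual_to_logical[i]] = i;
}
```

For five runs with levels 0, 1, 2, 1, 0, the visual-to-logical array comes out as 0, 3, 2, 1, 4: the three middle runs form a right-to-left sequence and swap ends, while the level 2 run in the center stays in place.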

PaintFormattedTextPara now loops through the runs on the line in visual order, using pVisualOrder and iPos to pass the correct logical runs to PaintFormattedRun.

PaintFormattedRun displays a single run in a single style. These are the steps for displaying the run.

  1. Update the hdc as necessary for any change in style from the previously displayed run.
  2. Call ShapePlaceRun to generate glyphs and positions.
  3. Call ScriptTextOut to render the run to the hdc.
  4. Call CaretHandling to process any caret display or mouse hit testing required in this run.

In the sample application, styles are simply hFonts, so the style change is a simple SelectObject call.


Shaping Functions

ShapePlaceRun encapsulates the calls to ScriptShape and ScriptPlace and implements simple font association. The call to ScriptShape requires the SCRIPT_ANALYSIS returned by ScriptItemize. If ScriptShape returns USP_E_SCRIPT_NOT_IN_FONT, it means the shaping engine was unable to generate glyphs for this script with the currently selected font. To handle this case, the sample app tries using the first style to shape the run. A real-world application might keep a list of standard fonts to try. By keeping such a list indexed by the script number in the itemization analysis, the application can avoid running through many alternative attempts.

If this fallback strategy fails, the sample application restores the original style and changes the script field of the itemization analysis to SCRIPT_UNDEFINED (the only publicized script number). SCRIPT_UNDEFINED causes ScriptShape to bypass shaping and use the 1:1 codepoint to glyph mappings from the font CMAP table. Most likely this will display the missing glyph for each character in the run. (The missing glyph is usually represented as an empty rectangle.)

The glyphs are then passed to ScriptPlace for positioning. ScriptPlace returns an advance width and an x, y offset for each glyph. Usually, base characters have an advance width and no x, y offset, and combining characters have a zero advance width and an x, y offset to place them correctly over the preceding base glyph.
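
How advance widths and offsets combine into drawing positions can be shown with a short portable sketch. The types and the function are hypothetical; GlyphOffset is only loosely modeled on the offset pair that ScriptPlace returns for each glyph. The sketch assumes glyphs in left-to-right visual order and treats a positive dv as moving in the direction of increasing y, which is an assumption of this example rather than a statement about Uniscribe's coordinate convention.

```c
#include <assert.h>

typedef struct { int du; int dv; } GlyphOffset;  /* x, y offset */
typedef struct { int x;  int y;  } GlyphPos;     /* final position */

/* Walks the glyphs, placing each at the current pen position plus
   its offset, then moving the pen by the glyph's advance width.
   Returns the total advance of the run. */
int position_glyphs(const int *advances, const GlyphOffset *offsets,
                    int n, int origin_x, int origin_y, GlyphPos *out)
{
    int i, pen = origin_x;

    assert(n >= 0);

    for (i = 0; i < n; i++) {
        out[i].x = pen + offsets[i].du;
        out[i].y = origin_y + offsets[i].dv;
        pen += advances[i];   /* zero for a combining mark */
    }
    return pen - origin_x;
}
```

A combining mark with zero advance and a negative du is drawn back over the base glyph that precedes it, and the glyph that follows starts where it would have started without the mark.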

Once ShapePlaceRun has generated glyphs and widths, the glyphs are rendered by a call to ScriptTextOut. ScriptTextOut is a slightly extended form of ExtTextOut(… ETO_GLYPH_INDEX …) that can handle the x, y combining character offsets.

Finally, the run display process checks for any caret display or mouse hit testing activity required in this run. We do this here in the sample application because we don't keep width information hanging around. In your apps, you might save enough information to do hit testing and caret placement at least on the current line without requiring reprocessing of the paragraph.

Microsoft will add more shaping engines to Uniscribe in the future. The exact codepoint ranges assigned to each shaping engine may vary, so with the exception of SCRIPT_UNDEFINED, script numbers are not published. Currently, codepoint range divisions include the following: complex text ranges such as Arabic, Hebrew, Thai, Hindi; complex script digit ranges; basic punctuation; ASCII; other Western text; and Far East CJK.

Although the script numbers are not published, attributes of the scripts can be tested. There is a global script properties table that can be indexed by script number.

 const SCRIPT_PROPERTIES **g_ppScriptProperties;
 int g_iMaxScript;

 ScriptGetProperties(&g_ppScriptProperties,
                     &g_iMaxScript);

 hResult = ScriptItemize( … , pItems, &cItems);
 for (i=0; i<cItems; i++) {
     if (g_ppScriptProperties[pItems[i].a.eScript]
             ->fComplex) {

         // Item [i] is complex script text
         // requiring glyph shaping
     }
 }


Applications may use these properties to help combine their own layout rules with the required shaping engine divisions.

The complex script, digit, punctuation, and ASCII shaping engines all validate the font in the hdc before shaping, and return the HRESULT USP_E_SCRIPT_NOT_IN_FONT if the font does not contain sufficient glyphs and/or shaping tables. Only scripts that have the property fComplex should be shaped with the script returned by ScriptItemize. All other runs may be merged and shaped with SCRIPT_UNDEFINED. If there are characters not supported by the font, SCRIPT_UNDEFINED will not fail with USP_E_SCRIPT_NOT_IN_FONT. Missing glyphs will usually be displayed as an empty rectangle. An application can determine if a codepoint is supported by a font by calling ScriptGetFontProperties to obtain the default glyph index, and ScriptGetCMap to look up font glyphs for Unicode codepoints.
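The merging of non-complex runs can be sketched in portable C. This is only an illustration under stated assumptions: Run, RUN_SCRIPT_UNDEFINED, and merge_simple_runs are hypothetical names, the complex field stands in for the fComplex property looked up for each item, and the sketch ignores style and direction boundaries that a real application would also have to respect before merging.

```c
/* Hypothetical run record; "complex" mirrors the fComplex property. */
typedef struct {
    int start;    /* first codepoint of the run        */
    int length;   /* number of codepoints in the run   */
    int script;   /* script number from itemization    */
    int complex;  /* non-zero if shaping is required   */
} Run;

#define RUN_SCRIPT_UNDEFINED 0   /* stand-in for SCRIPT_UNDEFINED */

/* Merge adjacent non-complex runs in place and mark them as undefined
   script. Complex runs keep their script number. Returns the new count. */
int merge_simple_runs(Run *runs, int count)
{
    int out = 0;
    for (int i = 0; i < count; i++) {
        Run r = runs[i];
        if (!r.complex)
            r.script = RUN_SCRIPT_UNDEFINED;
        if (out > 0 && !r.complex && !runs[out - 1].complex)
            runs[out - 1].length += r.length;   /* extend previous run */
        else
            runs[out++] = r;                    /* start a new run     */
    }
    return out;
}
```

Fewer runs mean fewer shaping and placement calls, which is the practical reason to merge.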


The Unicode Bidirectional Algorithm

The Unicode bidirectional algorithm resolves the layout of mixed-direction text in the absence of higher-level protocols. Here are some of the general assumptions it makes. Adjacent runs of words of opposite language direction are laid out according to the base level—left-to-right for an English paragraph, right-to-left for an Arabic paragraph. Numbers following left-to-right words should be displayed to the right of the words. Numbers following right-to-left words should be displayed to the left of the words. Punctuation between words of the same language direction should be displayed between those words. Punctuation between runs of words of opposite language direction appears between those runs. Punctuation at the beginning or end of a paragraph is laid out according to the paragraph direction and is not affected by the direction of adjacent text.

The digits of numbers are laid out left-to-right in the number. Commas and periods are considered part of a number when immediately surrounded by digits. Other characters, such as currency signs, are considered part of a number when immediately adjacent to a digit. The algorithm makes a valiant and surprisingly successful stab at resolving what can be very ambiguous text. In applications such as databases and forms, it is often sufficient. In applications such as word processors, it is usually considered necessary to give the user more direct control over bidirectional text layout.
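Once the algorithm has resolved an embedding level for each character or run (even levels are left-to-right, odd levels are right-to-left), the final step is to reorder for display. The following portable C sketch shows only that reordering step, under the assumption that the levels have already been resolved; visual_order is a hypothetical name. In a Uniscribe application, ScriptLayout plays this role for run levels.

```c
#include <stddef.h>

/* Fill visual[] with logical indices in display order, left to right.
   From the highest level down to the lowest odd level, reverse every
   maximal sequence of positions whose level is at least that level. */
void visual_order(const unsigned char *levels, size_t n, size_t *visual)
{
    unsigned char max_level = 0, min_odd = 255;
    for (size_t i = 0; i < n; i++) {
        visual[i] = i;
        if (levels[i] > max_level) max_level = levels[i];
        if ((levels[i] & 1) && levels[i] < min_odd) min_odd = levels[i];
    }
    if (min_odd == 255) return;   /* no right-to-left content */

    for (unsigned char lvl = max_level; lvl >= min_odd; lvl--) {
        size_t i = 0;
        while (i < n) {
            if (levels[visual[i]] < lvl) { i++; continue; }
            size_t j = i;
            while (j + 1 < n && levels[visual[j + 1]] >= lvl) j++;
            for (size_t a = i, b = j; a < b; a++, b--) {   /* reverse i..j */
                size_t tmp = visual[a];
                visual[a] = visual[b];
                visual[b] = tmp;
            }
            i = j + 1;
        }
    }
}
```

For levels 0 1 1 2 2 1 0, the two level-2 positions (a number inside right-to-left text) keep their left-to-right order while the surrounding level-1 positions are reversed around them.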


Experimenting with the Sample App

You can use the sample application to experiment with reading order. The default text for the sample shows the line "123-52 is 71" twice, once in English and once with "is" translated to Arabic.

In the second case, the number 71 following the Arabic translation of "is" has right-to-left layout because it follows Arabic text. Thus, it is displayed to the left of the translation of "is". Since the overall line direction is left-to-right, there is a conflict with the first part "123-52" which is assumed to be left-to-right since there is no preceding text.

Now press the RTL button in the SCRIPT_STATE control group. Notice how the second example now looks better, and the first (English) numeric example suffers from the conflict instead.

Is the right-to-left sample now correct? Should "One hundred and twenty three minus fifty two" appear as "123-52" or as "52-123"? It depends on the country. In Israel and Iran, sums like this are usually presented left-to-right, while for the rest of the bidirectional world they are presented right-to-left.

Set the AraNum Context checkbox in the SCRIPT_STATE control group to change the display to "52-123." (The Arabic number context is normally set by preceding Arabic text; AraNum Context sets the initial value).


Caret Placement and Hit Testing

Complex script languages are broken into clusters by ScriptShape. Character reordering always occurs within cluster boundaries. The clusters themselves are guaranteed to advance monotonically in the reading order.

Conventions for caret placement within clusters depend on the script. For the Arabic script, if the cursor position is set between a base character and its combining mark, then the caret is displayed halfway through the base character. For the Thai script, the cursor may not be positioned within a cluster. When the user advances the cursor, the application must advance over all characters that make up the cluster.

In the sample app, caret placement and hit testing are performed in CaretHandling. The Uniscribe APIs ScriptXtoCP and ScriptCPtoX translate between cursor positions (in codepoint offsets) and x positions (in logical pixels). Both APIs require the attribute and position information returned by ScriptShape and ScriptPlace. In the sample app, pending caret displays and mouse clicks are saved in global variables and processed during line display. A real-world application might choose to cache this information for the current line.

ScriptXtoCP returns a trailing edge flag so the caller knows which side of the character or cluster the user has clicked on. The value of the flag is either zero or the width of the character or cluster in codepoints. The returned CP is the position of the character on which the user clicked. Most editors place the cursor at the character edge nearest the click. To achieve this, add the flag value to the returned CP.
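The arithmetic of adding the trailing flag to the returned CP can be sketched in portable C. This is a simplified stand-in, not the ScriptXtoCP implementation: HitResult, hit_test, and caret_from_hit are hypothetical names, the run is assumed to be left-to-right with one codepoint per glyph, and the treatment of clicks outside the run is this sketch's own convention.

```c
/* Hypothetical result pair: the character hit and the trailing flag. */
typedef struct { int cp; int trailing; } HitResult;

/* Find the character under x in a left-to-right run with one codepoint
   per glyph. trailing is 1 when x falls in the right half of the glyph. */
HitResult hit_test(const int *advances, int count, int x)
{
    HitResult r;
    int left = 0;
    if (x < 0) {                 /* before the run: caret resolves to 0 */
        r.cp = -1;
        r.trailing = 1;
        return r;
    }
    for (int i = 0; i < count; i++) {
        int right = left + advances[i];
        if (x < right) {
            r.cp = i;
            r.trailing = (2 * (x - left) >= advances[i]) ? 1 : 0;
            return r;
        }
        left = right;
    }
    r.cp = count;                /* after the run: caret resolves to count */
    r.trailing = 0;
    return r;
}

/* The caret goes to the edge nearest the click: CP plus the flag. */
int caret_from_hit(HitResult r)
{
    return r.cp + r.trailing;
}
```

For three glyphs of width 10, a click at x = 7 lands in the right half of the first glyph, so the caret is placed at position 1, while a click at x = 3 places it at position 0.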

For languages such as Thai where the user conventionally does not want to place the cursor into a cluster, ScriptXtoCP sets the trailing side flag to zero or the cluster width. The application should also advance the cursor position in whole clusters for languages such as Thai. For languages such as Arabic, where the user expects to be able to edit within a cluster, ScriptXtoCP sets the trailing side flag to zero or one. Uniscribe provides information on valid cursor positions in the fCharStop BOOL in the logical attributes returned by ScriptBreak: TRUE for most characters and FALSE for characters in the interior of a cluster in scripts such as Thai. Check the fNeedsCaretInfo flag in the SCRIPT_PROPERTIES for an item to see if it is necessary to call ScriptBreak to check for valid cursor positions. If fNeedsCaretInfo is FALSE then all codepoints are valid cursor positions.
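Cursor movement over whole clusters can be sketched in portable C. The sketch assumes the application has already copied the fCharStop values into a plain array with one entry per codepoint; char_stop, next_caret, and prev_caret are hypothetical names, not Uniscribe identifiers.

```c
/* char_stop[i] is non-zero when the caret may sit before codepoint i.
   Position "count" (end of text) is always treated as valid. */

/* Move right from pos to the next valid caret position. */
int next_caret(const unsigned char *char_stop, int count, int pos)
{
    if (pos >= count) return count;
    pos++;
    while (pos < count && !char_stop[pos])
        pos++;                      /* skip the interior of a cluster */
    return pos;
}

/* Move left from pos to the previous valid caret position. */
int prev_caret(const unsigned char *char_stop, int count, int pos)
{
    if (pos > count) pos = count;
    if (pos <= 0) return 0;
    pos--;
    while (pos > 0 && !char_stop[pos])
        pos--;                      /* back up to the cluster start */
    return pos;
}
```

When every entry is non-zero, which corresponds to the case where fNeedsCaretInfo is FALSE, both functions reduce to moving by a single codepoint.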


Digit Shape Selection

Unicode provides separate digit codepoints for each script that has its own digits. For historical reasons, the conventional names for some of these digit styles are confusing. The Arabic numerals used in America and Europe aren't used in the Arab world. The Arabic-Indic digits used in the Arab world aren't used in Indic countries. And Farsi and Urdu use Eastern Arabic-Indic digits, which, just to keep things confusing, aren't used in the nations that use Arabic or Indic numerals. Other complex scripts have their own digit shapes, including Thai, Tibetan, and all nine Indian scripts.

Although Unicode provides separate codepoints for alternate digit shapes, very little software will recognize them if entered into a numeric form field, and most software will produce ASCII digits (U+0030 through U+0039) when converting from internal (binary) representation to character codes.

The fDigitSubstitute flag in SCRIPT_STATE, together with the fContextDigits flag and the uDefaultLanguage field in SCRIPT_CONTROL, determines how ScriptItemize will classify ASCII digits. To cause U+0030 through U+0039 to display in an alternate digit script, set fDigitSubstitute to TRUE and uDefaultLanguage to the language with which the digits are associated. You can also set fContextDigits to have digits displayed in the language of preceding letters in the same itemization.
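The correspondence between ASCII digits and an alternate digit block is a fixed offset, which the following portable C sketch illustrates. It is a conceptual aid only: Uniscribe performs the substitution when shaping for display and leaves the backing text as ASCII digits, whereas this hypothetical substitute_digit function simply maps one codepoint to another. The macro names are also hypothetical; the codepoint values are the zero digits of the corresponding Unicode blocks.

```c
/* Zero digits of several Unicode decimal digit blocks. */
#define DIGIT_ZERO_ARABIC_INDIC          0x0660u
#define DIGIT_ZERO_EXTENDED_ARABIC_INDIC 0x06F0u  /* Farsi, Urdu */
#define DIGIT_ZERO_DEVANAGARI            0x0966u
#define DIGIT_ZERO_THAI                  0x0E50u

/* Map U+0030..U+0039 onto the digit block that starts at zero_digit.
   Any other codepoint is returned unchanged. */
unsigned int substitute_digit(unsigned int cp, unsigned int zero_digit)
{
    if (cp >= 0x0030u && cp <= 0x0039u)
        return zero_digit + (cp - 0x0030u);
    return cp;
}
```

For example, substitute_digit('7', DIGIT_ZERO_ARABIC_INDIC) yields U+0667, the Arabic-Indic digit seven.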


Caching

Uniscribe saves Unicode to glyph mappings (CMAP), glyph widths, and OpenType script shaping tables. A handle to the tables for a particular font of a particular size is called a SCRIPT_CACHE. Uniscribe functions look first for information through the SCRIPT_CACHE, using the hdc only when required tables are not already cached. When calling ScriptShape, ScriptPlace, and ScriptTextOut, you must provide a pointer to a SCRIPT_CACHE variable, which you must initially set to NULL.

For ScriptShape and ScriptPlace it is valid to pass the hdc as NULL. Most often the call will be successful as required tables will already be cached. If the shaping or placement requires access to an hdc, ScriptShape or ScriptPlace will return immediately with the HRESULT E_PENDING. This allows the client to avoid most SelectObject calls.


Symbol and Device Fonts

Symbolic fonts can be recognized by calling GetTextMetrics and checking for a tmCharSet value of SYMBOL_CHARSET or OEM_CHARSET. Such fonts do not necessarily conform to Unicode conventions. Although Uniscribe will process such fonts, it probably makes no sense to itemize them. Instead, consider a run formatted with such a font as a single item with eScript SCRIPT_UNDEFINED.

Printer device fonts are not processed by Uniscribe. If you call ScriptShape, ScriptPlace, and ScriptTextOut with such a font, the strings are sent to ExtTextOut without any manipulation. You cannot use ScriptGetCMap on a printer device font.


Supporting Multilingual and Complex Scripts

In the preceding sections we've discussed four ways to enable your application to support multilingual content in documents: standard Win32 API functions, edit controls, RichEdit controls, and Uniscribe. Figure 7 explains which platforms support complex scripts through which interfaces.

Platform                            Win32 API Interfaces   Edit Control   RichEdit   Uniscribe
English Windows 95/98                                                                M
Enabled/localized Windows 95/98     En                     En             En         M
U.S. Windows NT 4.0                                                                  M
Enabled/localized Windows NT 4.0    En                     En             En         M
U.S. Windows NT 5.0                 M                      M              M          M
Localized Windows NT 5.0            M                      M              M          M

En: Supports complex scripts consistent with the platform enabling. For example, Hebrew-enabled Windows 95 will support Hebrew through all interfaces, but not Arabic or Thai.

M: Supports all multilingual text.

Figure 7. Interfaces by platform.


Until now, the discussion has assumed that you have Unicode strings that you pass to the Uniscribe entry points. What should you do if your application needs to run on Windows 95, Windows 98, and Windows NT, given that Windows 95 and Windows 98 provide only limited Unicode support? There are a few strategies that work.

Before we get into these strategies, let's briefly review the A and W entry points in the Win32 API. In essence, all entry points used in a normal Win32-based application, such as RegisterClass and CreateWindowEx, are actually symbols defined in the Windows header files in the following manner:

 #ifdef UNICODE
 #define MessageBox MessageBoxW
 #else
 #define MessageBox MessageBoxA
 #endif // !UNICODE

In this example, the application actually calls MessageBoxW if you compile your source code with the -DUNICODE switch; otherwise it calls MessageBoxA. Text strings passed to or from these entry points all have the LPCTSTR or LPTSTR data types, which are typedefed as pointers to 16-bit wide characters (unsigned short) if the symbol UNICODE is defined, and as pointers to char otherwise. Unicode applications (those that call the W interfaces) get all characters and strings from the system as Unicode, whereas ANSI applications (users of the A routines) get text encoded in the ACP. It's important to keep in mind that this applies not only to the arguments of the Win32 entry points, but to all text passed to or from the application. For example, window messages such as WM_CHAR, WM_GETTEXT, and WM_SETTEXT that pass text in the wParam or lParam parameters also use Unicode or ANSI, depending on the type of application.

With this in mind, how can you encode text in Unicode and run on all Win32-based platforms?


Strategy 1

Always run as a pure Unicode application. Compile the application with the -DUNICODE switch so that you use only the W entry points. All text passed to and from the application is in Unicode. This is by far the easiest strategy to program. It also supports all Indic scripts and any new scripts added to Windows NT in the future. However, the application will not run on Windows 95 or Windows 98. This is the best approach if your application is targeted for Windows NT only, as is the case for many in-house or vertical applications.


Strategy 2

Create two binaries, one for Windows NT using Unicode and one for Windows 95 and Windows 98 using ANSI. Use LPTSTR for pointers to string buffers, TCHAR for characters, and so on. Use -DUNICODE to compile the Windows NT version only. This strategy is easy to program in its simplest form, and covers Windows 95, Windows 98, and Windows NT. Unfortunately, the ANSI version is basically restricted to the Win32 API standard calls. Localization, distribution, and maintenance of two binaries are difficult. This strategy is recommended only for simple or special in-house applications.


Strategy 3

Always run as an ANSI application, but use Unicode internally. This is the strategy used by the CSSamp sample code. The source code is compiled as an ANSI application, which receives text from the keyboard through the WM_CHAR or WM_IME_CHAR messages in the codepage of the current input locale. In general, this is not the same as the ACP. The application converts text to Unicode using the codepage of the current input locale. If the input locale changes in the middle of a string, the application will have to concatenate strings converted from different codepages. With this strategy, the same binary runs on Windows NT, Windows 95, and Windows 98, and it supports nearly all of the scripts in Unicode.

On the other hand, it's somewhat more difficult to program than the pure Unicode approach. Also, it does not support scripts without an ACP, such as Indic scripts, even when running on Windows NT. This is a sound approach if your application must run on Windows 95 and Windows 98 and does not need to support Indic scripts and others without an ACP.


Strategy 4

Detect the system and explicitly call the W APIs for Windows NT and the A routines for Windows 95 and Windows 98. The application registers itself as a Unicode application on Windows NT and as an ANSI application on Windows 95 and Windows 98.

The easiest way to implement this approach is to write a set of functions, say U routines, that parallel the Win32 W and A routines. Your application first calls GetVersionEx to detect the system, and stores that information into a global variable:

 BOOL g_IsWindowsNT;

Each U interface looks just like the corresponding W interface. For example, the prototype for CreateWindowExU would be:

 WINUSERAPI
 HWND
 WINAPI
 CreateWindowExU(
 DWORD dwExStyle,
 LPCWSTR lpClassName,
 LPCWSTR lpWindowName,
 DWORD dwStyle,
 int X,
 int Y,
 int nWidth,
 int nHeight,
 HWND hWndParent,
 HMENU hMenu,
 HINSTANCE hInstance,
 LPVOID lpParam);


You can implement CreateWindowExU as a function pointer. When the app is launched, your initialization code checks to see if g_IsWindowsNT is TRUE, and if so, sets CreateWindowExU equal to CreateWindowExW. Otherwise (that is, when running on Windows 95 or Windows 98), CreateWindowExU is set to a routine you write yourself, say CreateWindowExAU. This routine converts lpClassName and lpWindowName to the ACP using WideCharToMultiByte, and passes those parameters along with everything else to CreateWindowExA.
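The function-pointer dispatch can be sketched in portable C so that its shape is visible without the Windows headers. Everything here is a hypothetical stand-in: ShowTitleW and ShowTitleA play the roles of a W and an A entry point, ShowTitleAU is the hand-written conversion wrapper, g_last_backend exists only so the chosen path can be observed, and the standard wcstombs function substitutes for WideCharToMultiByte.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

const char *g_last_backend = "";   /* records which entry point ran */

/* Stand-ins for the system's W and A entry points. */
int ShowTitleW(const wchar_t *title)
{
    g_last_backend = "W";
    return (int)wcslen(title);
}

int ShowTitleA(const char *title)
{
    g_last_backend = "A";
    return (int)strlen(title);
}

/* "AU" wrapper: accepts Unicode, converts it, and calls the A routine. */
int ShowTitleAU(const wchar_t *title)
{
    char narrow[128];
    size_t n = wcstombs(narrow, title, sizeof narrow - 1);
    if (n == (size_t)-1)
        return -1;                 /* character not representable */
    narrow[n] = '\0';
    return ShowTitleA(narrow);
}

/* The U interface has the W prototype and is bound once at startup. */
int (*ShowTitleU)(const wchar_t *title) = NULL;

void InitUnicodeLayer(int isWindowsNT)
{
    ShowTitleU = isWindowsNT ? ShowTitleW : ShowTitleAU;
}
```

After initialization, the rest of the application calls ShowTitleU with Unicode strings and never needs to know which platform it is running on. The cost of the design is one wrapper per entry point used, which is the development investment this strategy requires.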

This approach also requires special handlers for messages such as WM_CHAR and WM_GETTEXT to convert the text passed in the wParam or lParam parameters to or from Unicode when g_IsWindowsNT is false. In the case of WM_CHAR and WM_IME_CHAR when running on Windows 95 and Windows 98, the application will also have to build up the Unicode string from multiple conversions via MultiByteToWideChar if the user switches input locales while typing in text.

This strategy runs on all platforms with the same binary files. It supports all scripts when running on Windows NT, including Indic, and allows use of Uniscribe on all platforms. The only disadvantage to this approach is that it requires considerable development investment. This strategy is your best choice for any application that needs to run on Windows 95 and Windows 98 and needs universal support for all scripts on Windows NT.


Summary

It's time to make your application multilingual! Don't overlook the new markets into which you can now more easily localize your application. The Uniscribe and RichEdit libraries enable you to rely on consistent and standardized layout of complex scripts—and, of course, typical scripts as well. Applications performing advanced typographic layout may complement their own proprietary layout engines with features available from these libraries.


Glossary

  • ACP - The active code page. Windows NT uses this code page to convert to/from Unicode automatically whenever an application calls one of the A entry points.
  • Character - The simplest element used to represent written languages. Note that the appearance of a character is not constant; the glyph used to display a character depends on the font used as well as the context of surrounding text. See glyph.
  • Character encoding - A one-to-one mapping from a set of characters into a set of numbers, used to represent text in software.
  • Complex script - Scripts that require special processing to display, print, and edit.
  • Font - A collection of glyphs for displaying text in a particular typeface.
  • Formatted text - Text displayed with multiple attributes, such as typeface, slant, weight, and color, and special effects such as shading, underlining, and blinking.
  • Globalization - Designing and implementing software so that it can support all targeted locales and user interface languages without modification to the software source itself. This process includes enabling for all target languages, and adding NLS support for target locales.
  • Glyph - A graphical representation of a character.
  • IME - Input method editor, used to enter text with large character sets, such as Chinese, Japanese, and Korean.
  • Internationalization - See globalization.
  • Input locale - An ordered pair consisting of an input language (LangID) and a method of inputting characters in the language. The method can be a keyboard layout, an IME, or other device provided by a vendor, such as a speech recognition engine.
  • LangID - A 16-bit value that identifies a language. A LangID consists of a primary language, such as Arabic, and a sublanguage, such as Arabic for Saudi Arabia.
  • Language enabling - 1. Adding support to software for document content in a particular language. In this sense, to enable an application for Japanese means to modify the software so that the user can enter, display, edit, and print text containing Japanese. 2. Modifying software so that it can be localized to a particular language. In this sense, enabling for Japanese means to modify software so that it can display Japanese text correctly in menus, dialog boxes, and other user interface elements. Note that in either sense, an enabled product may still have the user interface in English, that is, it may not be localized. Contrast localization.
  • LCID - A 32-bit value that identifies a locale. An LCID consists of a LangID and a sort key ID.
  • Locale - A generic term indicating a set of attributes related to language and other regional/ethnic preferences. Examples include currency symbol, date and time format, calendar type, number formats, default character encoding, and keyboard layouts. Microsoft uses this term in combination with others to specify a subclass of these preferences. See input locale, system locale, and user locale.
  • Localization to a language - Translating the user interface elements from the original language, usually English, to the target language. Contrast enabling.
  • Logical order - The ordering of characters in text corresponding to that when writing by hand, or keying in text using a keyboard. Contrast visual order.
  • National Language Support (NLS) - The set of system functions in 32-bit Windows that contains national language support.
  • Plain text - A string of text to be displayed with one value for each text attribute: one typeface, one slant, and one weight. Contrast formatted text.
  • Reading order - The overall direction of a sequence of text. Whereas words in a given script always flow in the direction associated with that script (left-to-right for Latin, right-to-left for Hebrew), the flow of the sentence itself depends on the reading order. For example, a mixture of Arabic and English text may be regarded as English embedded in an overall Arabic sentence, implying right-to-left reading order, or as Arabic embedded in English, implying left-to-right reading order.
  • Rich text - See formatted text.
  • Script - A collection of characters for displaying written text, all of which have a common characteristic that justifies their consideration as a distinct set. One script may be used for several different languages (for example, Latin script, which covers all of Western Europe), and some written languages require multiple scripts (for example, Japanese requires at least three scripts: the hiragana and katakana syllabaries, and the Kanji ideographs imported from China). Note that this sense of the word has nothing to do with programming scripts such as Perl or VBScript.
  • Slant - The obliqueness or tilt of the glyphs in a font. The most common slants are regular and italic.
  • System locale - The locale emulated by the system, as seen by applications. For example, if the system locale for U.S. Windows NT 5.0 is set to Hebrew, then ANSI applications will see it as Hebrew localized Windows NT 5.0, although the user interface of the system will still be in English. The system locale is systemwide, in that it applies to all users. Changing the system locale requires a reboot.
  • Typeface - Name given to a particular style of text. In contrast, a font is an implementation of a typeface.
  • User locale - The user default preferences for calendar type, date format, currency, and number format. User locale is a per-user setting, and does not require a reboot or logoff/logon.
  • Visual order - The ordering used to display glyphs on a screen, printed page, or other medium. Usually used with bidirectional text, because reordering is required to go from logical order to visual order. Contrast logical order.
  • Weight - The thickness or darkness of glyphs in a font. The most common weights are regular and bold.
  • Writing system - The collection of scripts and orthography required to represent a given human language in visual media.


Based on an article that appeared in the November 1998 issue of Microsoft Systems Journal.



this page was last updated 30 January 2003
© 2003 Microsoft Corporation. All rights reserved. Terms of use.