Key Point: Add-ins let MS Internet Explorer display content written in the world's major languages. |
Detail: Medium | Task: Planning, implementation |
Article Section | What's There |
Introduction | Growth of the Web has created a global market for content. |
Browsing Multilingual Content | Basics for Internet Explorer users who want to view non-English content. |
Creating Multilingual Content | More technical advice for content creators. |
Technical Background | Details of how major writing systems are encoded as Web content. Of interest mostly to those wanting the nuts-and-bolts level view. |
Conclusion | |
More Information | Pointers to related articles and URLs on TechNet and the Web. |
In the last few years the term "World Wide Web" has gone from computer industry jargon to household phrase in many countries, and for good reason: Web technology makes content available worldwide through the Internet and through companies' private, multi-national intranets. But in this case worldwide refers to geographic reach, not to readability. Most Web content is written in English, especially American English, yet a large and growing number of Web users read English as a second or third language or not at all, and would prefer to read content in their native languages. Whether your Web site advertises products and builds brand recognition on the Internet, or uses an intranet to publish shared schedules and group goals, using English-only content can make your communications less effective.
Web technology is flexible enough to deliver content in many writing systems, but users and content providers (especially those in the United States) may not know what tools they need to take advantage of this flexibility. The first section of this article explains how intranet administrators can configure Microsoft Internet Explorer to view multilingual content, the second section describes how content providers can create that content, and the third provides technical background that explains why these solutions work. The article concludes with pointers to other sources of information.
Internet Explorer 3.0 is adapted for 29 markets: Arabic, Chinese (Simplified and Traditional), Czech, Danish, Dutch, English (US and International), Finnish, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Iberian and Brazilian), Russian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish, and Vietnamese.
Which of these "localized" versions you'd want to use depends on three factors: the language you want the browser to use when displaying menus and dialogs; the writing system of the text you want to view; and the localized version of the operating system on which you run the browser.
Localization affects the browser's user interface, such as menu items and dialog box text, but does not affect the content that the browser can display. For example, while the US-English version of Internet Explorer 3.0 has a "File" menu, the German-localized version has a "Datei" menu. Nonetheless, both can display the same kinds of text.
When considering the content localized browsers can display, it helps to group languages according to the writing systems they use:
| • | Latin-1: Includes Danish, Dutch, English (US and International), Finnish, French, German, Italian, Norwegian, Portuguese (Iberian and Brazilian), Spanish, and Swedish |
| • | Latin-2: Includes Albanian, Belarussian, Croatian, Czech, Greek, Hungarian, Polish, Romanian, Russian, Turkish, Slovak, Slovenian |
| • | Arabic |
| • | Hebrew |
| • | Chinese: As written in Taiwan and Hong Kong |
| • | Chinese: As written in the People's Republic of China, and in Singapore |
| • | Japanese |
| • | Korean |
| • | Thai |
All localized versions of Internet Explorer can display text written with the Latin-1 writing system. All except the US-English version can display text written with the Latin-2 characters. For example, the Japanese version can display text written in Japanese, or any language that uses a Latin-1 or Latin-2 writing system.
Suppose you want to read text that is not written in English, and is not written in the language for which your browser is localized. For example, you have the US-English version and want to read Russian text. In most cases you need to install one or more Internet Explorer 3.0 language support packs (also known as International Extensions):
| • | Chinese Simplified: Supports the writing style used in the People's Republic of China, and in Singapore |
| • | Chinese Traditional: Supports the writing style used in Taiwan and Hong Kong |
| • | Japanese |
| • | Korean |
| • | Pan European: Supports all of the European languages; ships with the localized versions of Internet Explorer, and is available separately because it is not supplied with the US-English version |
To view Arabic, Hebrew, or Thai content, you must run the appropriate localized browser on the localized operating system. For example, to view Arabic content, you must run the Arabic version of Internet Explorer on the Arabic version of Windows 95.
As a rule, a localized version of Internet Explorer 3.0 is supported only on a similarly-localized Microsoft Windows operating system. For example, Japanese Internet Explorer runs on the Japanese version of Windows 95, but not on US-English or French versions. The exception is US-English Internet Explorer: Like most US-English software from Microsoft, it runs on any localized operating system.
Internet Explorer 3.0 and language support packs let you view content written in various languages. Users sometimes misunderstand these capabilities, expecting the software to "transliterate" the content (convert it to another character set) or translate it. For example, after installing the Japanese language pack on US-English Internet Explorer, American English-speaking users can view this HTML content:
<p>Windows 95 " ItBX 95 Abvf_[g Lbĝ" </p>
as this:
[Image: the same line displayed in Japanese characters]
However, users might expect the language pack to convert the Japanese characters into an equivalent written in the Latin character set:
Windows95 ban offisu95 appude-to kitto no go-annai
or they might expect the software to translate the content into English:
Information about the Office95 update kit for Windows 95
Internet Explorer and the language packs do not perform translation or transliteration. Some third-party add-ins can do this, but they rarely do it well, and the software is usually of little use to English-speaking users because most convert English into another language (usually Japanese), not the other way around.
Which tools you need to create multilingual content depends mainly on how much multilingual content you need to create and how often. You should also know your audience and the language or languages you need to communicate with them.
If you seldom need to create multilingual content, and even then you need only a few characters, the most efficient method may be a hex editor or text editor, and a chart of the character set you want to use. For instance, to create a British pound sign (character 163 in the ANSI character set) on a page of American English content, you can use a hex editor to create a byte with value 163 in the HTML content:
<p>This item costs £10.</p>
You can also use a text editor to put the "pound encoding" equivalent of this character in the HTML content:
<p>This item costs &#163;10.</p>
A third approach is to put the symbol's named entity in the HTML content:
<p>This item costs &pound;10.</p>
Each of these methods works in most cases, but none works in every case. Named entities are often best for special punctuation, such as "smart quotes," but many language-specific characters have no equivalent named entity. Pound encoding is often the best (or only) solution for characters specific to European languages, but pound encoded punctuation often displays incorrectly on Asian-language browsers. You can minimize the latter problem by using the META CHARSET tag on pages that use pound encoding (this tag is described under Browser Considerations in the Technical Background section, below).
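The three approaches can be compared with a short Python sketch. Python and its standard `html` module are not part of the original article; they are used here only to illustrate that the raw byte, the pound-encoded form, and the named entity all resolve to the same character, assuming the raw byte is decoded with the Windows-1252 code page.

```python
import html

POUND = "\u00a3"  # the British pound sign

# Method 1: a raw byte with value 163 (0xA3), as a hex editor would write it.
raw_bytes = b"<p>This item costs \xa310.</p>"
from_raw_byte = raw_bytes.decode("cp1252")  # assumption: Windows-1252 page

# Method 2: "pound encoding" (a numeric character reference), plain ASCII.
from_numeric = html.unescape("<p>This item costs &#163;10.</p>")

# Method 3: the named entity, also plain ASCII.
from_named = html.unescape("<p>This item costs &pound;10.</p>")

all_equal = from_raw_byte == from_numeric == from_named
print(all_equal, POUND in from_named)
```

Only the first method depends on the code page the browser assumes; the other two travel as plain ASCII and are resolved by the HTML parser.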
If your organization plans to create a lot of content for multilingual audiences, the most effective solution is to keep on hand a computer with the proper input hardware, running the localized operating system, applications, and Web page creation tools. For example, you might use the French versions of Windows 95, Word for Windows 97, and Internet Assistant for Word for Windows 97 on a computer with a French keyboard. There are some caveats you should keep in mind because of how localized software is sold and supported. All of Microsoft's international subsidiaries support US-English software, as well as localized versions. However, the support groups in the US support only the US-English versions, and the US sales groups sell only the US-English versions. You have several options if you choose to buy a localized version:
| • | Purchase it from the subsidiary. If you need technical support, you must contact the subsidiary that sold you the software. For telephone support this may require an international long distance call. |
| • | Purchase it from a domestic reseller. The purchase process is simpler than when dealing directly with the subsidiary, but the subsidiary that normally sells the product may not offer support for software purchased from domestic resellers. |
| • | Use third-party emulation software to host localized applications on a US-English operating system. This is an important option when you need to use Asian-localized applications because it eliminates the need for a localized operating system. Application problems are often easier to solve than operating system problems, so this option may reduce your need for support. However, output quality and portability vary considerably. Also, Microsoft does not provide technical support for problems with localized applications unless you can show that they occur without the emulation software (that is, while running on a localized operating system). Finally, third-party vendors offer their own support policies. |
If you need to create content in a European language, you can use a US keyboard and remap the keyboard layout. This solution is not enough for Asian languages: these require specialized software (input method editors, or IMEs) to construct the characters they use. An IME is included with each localized operating system, but only for the operating system's language; for example, Japanese Windows 95 includes a Japanese IME, but not a Korean or Chinese IME. If you choose to run emulation software as described in the third bullet above, you still need an IME, and the emulation software usually provides one.
You can choose the right content creation tools, and select and use the right browser, without knowing much about how the software works. However, if you want to create multilingual content or troubleshoot browser problems, you need some technical background. The first part of this section helps you avoid a common roadblock to thinking about the content, and the remaining parts explain character set standards and how the browser uses them to display content written in various languages.
User interfaces let you interact with a computer using human language, such as English, rather than computer language: you type characters on a keyboard and see them on a screen or printout. This interface is so convenient and pervasive that you can easily forget that the characters are an illusion. For example, you treat text files as if they contained letters, digits, and punctuation, when in reality they contain only numbers. When you view a text file, the application and operating system display those numbers to you as a pattern of dots that look like letters or words.
Applying this insight to the Web, HTML content is simply numbers, and the client's browser, operating system and fonts are interpreters that use one of several conventions to display those numbers as characters. Both the content creation tools and the clients' browsers must use the same convention, or else the content will not display correctly.
There are numerous such conventions, loosely known as "code pages" or "character sets." A character set is simply a set of characters, and a code page maps a number to each character in a character set. You pick a code page to support a specific language, and this means picking one that supports the language's script: the set of characters used to write the language's words. The sections below discuss several major languages, the scripts they use, and the common code pages with which you can encode content written in those scripts.
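As a small illustration of why creator and browser must agree on the convention, the following Python sketch decodes one and the same number under three code pages. The choice of the value 224 and of these three Windows code pages is an assumption made for the example, not something the article prescribes.

```python
# One byte value, interpreted under three single-byte code pages.
value = bytes([0xE0])  # the number 224

code_pages = ("cp1252", "cp1251", "cp1253")  # Latin 1, Cyrillic, Greek
decoded = {cp: value.decode(cp) for cp in code_pages}

for cp, char in decoded.items():
    # Print the Unicode code point so the output does not depend on fonts.
    print(f"{cp}: {value[0]} -> U+{ord(char):04X}")
```

The file contents never change; only the interpretation does, which is why a page created with one code page and viewed with another shows the wrong characters.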
European languages vary considerably, but (as opposed to having a unique character for each word) most use the Latin alphabet or characters that vary slightly. These languages are supported by single-byte character sets (SBCSs) because none uses more than one to two hundred characters.
Western and Central European Languages
Most European languages use the Latin script, which consists of 26 base letter characters (A through Z) in upper and lower case. Icelandic has four unique characters: Ð, ð, Þ, þ. Several languages use ligatures, characters created by joining two other characters together: Æ, æ, Œ, œ, and ß.
Also, most European languages, in effect, create additional characters by adding one of fourteen diacritical marks, commonly called "accent marks," to some set of the 26 base characters. For example, German uses the 26 base characters, the ß ligature, and six diacritical combinations: Ä, ä, Ö, ö, Ü, and ü. Most European languages have at least six such additional characters; Dutch has the most (42), and English has none.
One of the first encodings developed for a European language was the American Standard Code for Information Interchange (ASCII). It maps characters into the first seven bits of each byte (the numbers from 0 through 127). Thirty-three of those numbers are used for control functions, 26 for upper-case characters, 26 for lower-case, 10 for digits, and 33 for punctuation marks commonly used in American English. There are no numbers left over for ligatures, or for characters with diacritical marks. As a result, ASCII is poorly suited to writing anything but English (specifically, American English: ASCII doesn't include a British pound sign).
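The breakdown of the 128 ASCII values can be checked with a few lines of Python. This is only a sketch: it assumes that the space character is counted among the 33 punctuation marks and that the DEL value (127) is counted among the 33 control functions, which is the grouping that makes the article's figures add up.

```python
import string

def count_ascii(predicate):
    """Count the 7-bit values (0 through 127) that satisfy the predicate."""
    return sum(1 for n in range(128) if predicate(n))

counts = {
    # Assumption: values 0-31 plus DEL (127) are the control functions.
    "control": count_ascii(lambda n: n < 32 or n == 127),
    "upper": count_ascii(lambda n: chr(n) in string.ascii_uppercase),
    "lower": count_ascii(lambda n: chr(n) in string.ascii_lowercase),
    "digits": count_ascii(lambda n: chr(n) in string.digits),
    # Assumption: the space character is counted with the punctuation marks.
    "punctuation": count_ascii(lambda n: chr(n) in string.punctuation + " "),
}
total = sum(counts.values())
print(counts, total)
```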
The general solution is to enhance ASCII with a code page that uses all eight bits of each byte. ASCII is used for the values from 0 through 127, and other characters are mapped to the values from 128 through 255. Numerous character sets use this solution, but this provides only 128 more numbers, not enough to support the nearly 200 diacritical combinations, or the dozens of additional punctuation marks, collectively found in the Latin-based European scripts.
MS-DOS attempts to work around the problem by defining "OEM code pages" for numerous languages. Important examples include:
| • | Code page 437 (Latin-US) |
| • | Code page 850 (Latin 1) |
| • | Code page 852 (Latin 2) |
Microsoft Windows, beginning with version 3.1, works around the problem in a similar way by extending two American National Standards Institute (ANSI) standards:
| • | Windows-1252 (Latin 1): Based on ANSI-1252; like MS-DOS code page 850, it includes ligatures and diacritical combinations for Western European languages |
| • | Windows-1250 (Latin 2): Based on ANSI-1250; like MS-DOS code page 852, it includes diacritical combinations to support Eastern European languages and Greek |
Turkish
Turkish is written with a Latin-based script. Common code pages are Windows 1254 and MS-DOS code page 857. Both use ASCII for the numbers 0 through 127, and map Turkish characters to the numbers between 128 and 255.
Russian
Russian is written with the Cyrillic script, and other languages may use this script as well; these include Bulgarian, Belarussian, Serbian, and Ukrainian. The most widely used Cyrillic code pages are Windows-1251, KOI8-R, and MS-DOS code page 866. In each case, ASCII is used for the numbers 0 through 127, and Cyrillic characters are mapped to numbers between 128 and 255.
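The following Python sketch illustrates the point for one Cyrillic letter. The codec names `cp1251`, `koi8_r`, and `cp866` are the Python standard library's names for the three code pages mentioned above; the choice of letter is arbitrary.

```python
letter = "\u042f"  # CYRILLIC CAPITAL LETTER YA
code_pages = ("cp1251", "koi8_r", "cp866")

# Each code page stores the letter as a single byte, but not the same byte.
values = {cp: letter.encode(cp)[0] for cp in code_pages}

# The ASCII half (0 through 127) is identical in all three code pages.
ascii_same = all("A".encode(cp) == b"A" for cp in code_pages)

for cp, value in values.items():
    print(f"{cp}: {value} (0x{value:02X})")
print("ASCII shared:", ascii_same)
```

Because all three map the letter to a different number above 127, a browser that assumes the wrong Cyrillic code page still shows Cyrillic-looking text, just not the intended words.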
Greek
Greek is written with the Greek script. Common Greek code pages include Windows-1253, and MS-DOS code pages 737, 851, and 869. In each case, ASCII is used for the numbers 0 through 127, and Greek characters are mapped to numbers between 128 and 255.
The scripts of most of the world's languages are written from left to right, but the Arabic and Hebrew scripts are written from right to left. Content written in Arabic or Hebrew often contains words borrowed from languages that use a left-to-right script. A person writing such "mixed" content (or a computer displaying that content) often switches between left-to-right and right-to-left several times in a single sentence, and for this reason the Arabic and Hebrew scripts are called "bidirectional" (or simply "bidi").
There are two ways to encode a bidi language:
| • | Logically: In the order that the characters would be written by hand: mainly from right to left |
| • | Visually: In the order that the characters appear, from left to right |
Web content can use either convention, and should include an HTML tag to indicate which convention is in use on any particular page.
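For a run of text that is purely right-to-left, the two conventions differ only in storage order, as this minimal Python sketch suggests. It deliberately ignores mixed-direction text, which needs a full bidirectional reordering algorithm; the sample word and the use of the `unicodedata` module are assumptions made for illustration.

```python
import unicodedata

# A Hebrew word stored in logical order (first letter written comes first).
logical = "\u05e9\u05dc\u05d5\u05dd"

# Confirm that every character has the right-to-left bidirectional class.
is_rtl = all(unicodedata.bidirectional(ch) == "R" for ch in logical)

# Visual order stores the characters as they appear from left to right,
# which for a purely right-to-left run is simply the reverse.
visual = logical[::-1]

print(is_rtl, [f"U+{ord(ch):04X}" for ch in visual])
```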
Switching direction may seem trivial, but much of the logic built into the Windows display engine assumes a left-to-right script. The display engines in the Arabic and Hebrew localized versions of Windows 95 are modified to support this switching, and the Arabic and Hebrew localized versions of Internet Explorer take advantage of the modifications to display content in these scripts. However, versions of Internet Explorer localized for any other language (including the US-English version) cannot display Arabic or Hebrew content.
Arabic
The Arabic script is used mainly to write the Arabic language, although some other languages (such as Farsi, spoken mainly in Iran) use this script as well. Font files with Arabic characters tend to be rather large, for two reasons:
| • | The appearance of a letter changes depending on its position within a word (at the front, at the end, in the middle, or by itself). Thus, the font needs four renditions of each of the two dozen base characters. |
| • | Arabic is a cursive script, so each character is joined to the adjacent ones. In addition to providing an image of each of four forms of each letter, the font must contain ligature characters that join letter pairs correctly. |
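The four renditions of a letter can be seen in the Unicode character database, which lists separate "presentation forms" for each Arabic letter. The sketch below uses Python's `unicodedata` module and the letter beh as an example; Unicode is not one of the Arabic code pages discussed in this section, so treat this purely as an illustration of the positional forms.

```python
import unicodedata

base = "\u0628"  # ARABIC LETTER BEH, the abstract letter

# Unicode assigns four consecutive presentation forms to this letter.
forms = [chr(cp) for cp in range(0xFE8F, 0xFE93)]
names = [unicodedata.name(form) for form in forms]

# Compatibility normalization folds every positional form back to the letter.
folded = {unicodedata.normalize("NFKC", form) for form in forms}

for name in names:
    print(name)
print(folded == {base})
```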
Arabic is usually encoded using the MS-DOS 708 code page, or the Windows 1256 code page. In each, ASCII is used for the numbers 0 through 127, and Arabic characters are mapped to numbers from 128 through 255.
Hebrew
The Hebrew script is used mainly to write the Hebrew language, although Yiddish uses this script as well. It is usually encoded using the MS-DOS 862 code page, or the Windows 1255 code page. In each case, ASCII is used for the numbers 0 through 127, and Hebrew characters are mapped to numbers from 128 through 255.
Character sets for the major Asian languages (Chinese, Japanese, and Korean) are similar to those for Russian and Greek in that they include ASCII characters, but they differ from all European languages in that:
| • | Each can be written in several scripts, whereas each European language has only one script |
| • | Each has at least one script that uses a unique character for each word; these scripts require thousands of characters, so each script requires a double-byte character set (DBCS) |
Chinese
Two thousand years ago, the Han dynasty in China defined a script in which each character represented a word or part of a word. This script remains largely intact today, and is called Chinese Traditional. It is used primarily in Hong Kong and Taiwan. The main character set encoding for this script is called BIG-5.
Over the last hundred years or so, a second script has been developed. It defines fewer characters (around 7000), and generally uses fewer strokes to draw each character. This script is known as Chinese Simplified and is used today in the People's Republic of China and in Singapore; the code page that defines this script is GB 2312-80.
Japanese
Japanese is written using three scripts:
| • | Kanji: Borrowed long ago from Chinese Traditional, each of the thousands of kanji characters represents a word or part of a word |
| • | Hiragana: A script of 71 characters which is used to conjugate verbs and provide sentence structure, and to spell some words of Japanese origin |
| • | Katakana: A syllabic script of 71 characters that is used mainly to approximate words borrowed from other languages |
Coincidentally, there are three major Japanese character encodings: EUC, JIS, and Shift-JIS. Each includes the characters from all three scripts, but any particular character might be mapped to different numbers.
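A quick way to see that one character maps to different numbers is to encode it three times. In this Python sketch the codec names `euc_jp`, `iso2022_jp`, and `shift_jis` are the standard library's implementations, assumed here to correspond to the EUC, JIS, and Shift-JIS encodings named above.

```python
kanji = "\u65e5"  # a common kanji, used here only as a sample character

encodings = {"EUC": "euc_jp", "JIS": "iso2022_jp", "Shift-JIS": "shift_jis"}
encoded = {label: kanji.encode(codec) for label, codec in encodings.items()}

for label, data in encoded.items():
    # Show the raw numbers in hexadecimal.
    print(f"{label}: {data.hex()}")

# Each encoding decodes its own bytes back to the same character.
round_trips = all(
    encoded[label].decode(codec) == kanji for label, codec in encodings.items()
)
print(round_trips)
```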
Korean
Korean is written using two scripts:
| • | Hanja: Characters borrowed long ago from Chinese Traditional and typically used in formal writing, and for names |
| • | Hangul: Also spelled "Hangeul," a syllabic script in which each character is the combination of symbols for consonants and vowels |
Most writing is done with Hangul characters. There are more than 11,000 possible characters, defined by the Johab character set (KS C-5601-1992, Windows-1361). However, only a small subset are actually used; these are defined by the Wansung character set (KS C-5601-1987, Windows-949).
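The figure of more than 11,000 possible characters follows from the way a Hangul syllable is assembled. The sketch below uses the layout of the Unicode Hangul Syllables block (an assumption for illustration; the Johab and Wansung code pages assign different numbers) with 19 initial consonants, 21 vowels, and 28 final positions, one of which means "no final consonant."

```python
INITIALS, VOWELS, FINALS = 19, 21, 28  # 27 final consonants plus "none"
possible = INITIALS * VOWELS * FINALS

def compose(initial: int, vowel: int, final: int = 0) -> str:
    """Build one syllable from zero-based indexes, using Unicode's layout."""
    return chr(0xAC00 + (initial * VOWELS + vowel) * FINALS + final)

han = compose(18, 0, 4)  # initial h + vowel a + final n
print(possible, f"U+{ord(han):04X}")
```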
This section covers issues specific to viewing multilingual content with Web browsers.
META CHARSET Tag
Different Web sites assume different character sets, and encode content with different code pages. In some cases the content creator includes the META CHARSET tag that tells the browser what character set to assume when decoding the content. For example, a browser reading this line from the header of an HTML document:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Windows-1251">
should assume the data to be encoded with the Windows-1251 standard. The META CHARSET tag is supported by the major browsers, including both MS Internet Explorer 3.0 and Netscape Navigator 3.0.
Note that this tag applies to an entire HTML document, so you can't mix, say, Japanese and Korean text in one HTML document. You can, however, load two HTML files, each using a different character set, into two frames that the user sees side by side on the screen.
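What a browser does with this tag can be approximated in a few lines of Python. This is a simplified sketch, not how Internet Explorer or any other browser is actually implemented: the regular expression, the `decode_page` helper, and the Windows-1252 fallback are all assumptions made for the example.

```python
import re

# Find "charset=<name>" inside a META tag in the still-undecoded bytes.
CHARSET_RE = re.compile(rb'<meta[^>]*charset=["\']?([A-Za-z0-9_\-]+)',
                        re.IGNORECASE)

def decode_page(raw: bytes, default: str = "cp1252") -> str:
    """Decode HTML bytes with the declared charset, else with the default."""
    match = CHARSET_RE.search(raw)
    charset = match.group(1).decode("ascii") if match else default
    return raw.decode(charset, errors="replace")

russian = "\u041f\u0440\u0438\u0432\u0435\u0442"  # a Russian greeting
page = (
    b'<META HTTP-EQUIV="Content-Type" '
    b'CONTENT="text/html; charset=Windows-1251">'
    b"<p>" + russian.encode("cp1251") + b"</p>"
)
print(russian in decode_page(page))
```

Because the tag itself is written in ASCII, the browser can read it before it knows which code page applies to the rest of the document.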
Changing Character Set Decoding
In practice, few content creators include the META CHARSET tag in their HTML files, so the browser should set a default code page and give the user a way to redisplay a page with any other code page that the browser supports. With Internet Explorer 3.0, you can change the default by selecting Options from the View menu, then selecting Font Settings from the General tab. You can redisplay the current Web page using another code page by selecting the International icon at the bottom right corner, then selecting a character set from the resulting list.
Double Byte Character Set (DBCS) Support
DBCS standards support documents with mixed English and Asian characters because they retain the ASCII character set. If the browser reads the first byte of a piece of text and finds its value is below 128, it interprets the value as an ASCII character. If that byte's value is 128 or greater, the browser determines whether the value is a valid "lead byte" or not for the current character set. For example, when using the GB 2312-80 character set (Chinese Simplified), values from 0xA1 through 0xFE are valid lead bytes. If the lead byte is valid, the browser reads the second byte ("trail byte"), and if it, too, is valid the browser asks the operating system to display the character identified by these two bytes.
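The lead-byte logic described above can be sketched in Python as follows. The lead-byte range comes from the text; the trail-byte range of 0xA1 through 0xFE and the `split_dbcs` helper are assumptions made for this GB 2312-80 example, and a real browser would handle invalid bytes more gracefully than by raising an error.

```python
def split_dbcs(data, lead_range=(0xA1, 0xFE), trail_range=(0xA1, 0xFE)):
    """Split bytes into one-byte ASCII units and two-byte DBCS units."""
    units = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 128:
            # Below 128: a single ASCII character.
            units.append(data[i:i + 1])
            i += 1
        elif (lead_range[0] <= byte <= lead_range[1]
              and i + 1 < len(data)
              and trail_range[0] <= data[i + 1] <= trail_range[1]):
            # Valid lead byte followed by a valid trail byte.
            units.append(data[i:i + 2])
            i += 2
        else:
            raise ValueError(f"invalid byte 0x{byte:02X} at offset {i}")
    return units

sample = "A\u4e2dB".encode("gb2312")  # ASCII, one Chinese character, ASCII
units = split_dbcs(sample)
text = "".join(unit.decode("gb2312") for unit in units)
print([unit.hex() for unit in units], text == "A\u4e2dB")
```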
Note that the major DBCS standards have overlapping lead byte ranges and trail byte ranges. As a result, the browser cannot always identify the character set or code page simply by analyzing the lead and trail bytes.
Also, recall that ASCII maps US-English characters to values below 128, and European-language code pages map "extended" characters to values from 128 through 255. Browsers interpreting text as DBCS will misinterpret these "extended" characters (including pound-encoded punctuation marks such as smart quotes) as DBCS lead bytes.
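The smart-quote problem can be reproduced with a short Python sketch. It assumes the page was written with Windows-1252 smart quotes and is then read by a browser that assumes Shift-JIS; the `errors="replace"` setting stands in for whatever a real browser does with bytes it cannot interpret.

```python
quoted = "\u201cHi\u201d"       # the word Hi inside smart quotes
data = quoted.encode("cp1252")  # each smart quote is one byte above 127

# Read with the code page the author used: the text survives.
as_latin1 = data.decode("cp1252")

# Read as Shift-JIS: the quote bytes are treated as DBCS lead bytes.
as_sjis = data.decode("shift_jis", errors="replace")

print(data.hex(), as_latin1 == quoted, as_sjis == quoted)
```

Declaring the character set with the META CHARSET tag, as described earlier, is what lets the browser pick the first interpretation rather than the second.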
Microsoft's Asian-localized operating systems are designed to handle DBCS text; this is why they and Asian-localized browsers can display English content without additional software. European-localized operating systems and browsers are designed for single byte character sets; they interpret HTML content one byte at a time, and so require additional software to display Asian-language content. That software consists of:
| • | An add-in that lets the browser interpret data in HTML files as DBCS text |
| • | Tables that implement DBCS standards by mapping DBCS values to characters |
| • | Fonts that supply those characters |
Each of the Asian language packs for Internet Explorer 3.0 supplies each of these components for one or more writing systems. The Pan European language pack supplies the latter two components to support Greek, Cyrillic, and other European languages, but because these are single-byte character sets, the first component is not needed.
Unicode Support
One reason multilingual content is confusing is that the lack of a single standard complicates content creation and viewing, and even development of browser software. The simplest solution is a single character set that includes all the characters used by all of the world's major languages. ISO standard 10646 (the Unicode standard) implements this solution as a double-byte character set.
At first this seems unworkable: supporting 30,000 words in each of Chinese, Japanese, and Korean would seem to require at least 90,000 characters, and a two-byte system can support no more than 65,536 characters. There is a workaround, though, because many Chinese Traditional, Japanese kanji, and Korean hanja characters are based on a single "root" script. Thus, the symbol for a word written with one of these scripts often resembles the symbols from the other two scripts, even though the Chinese, Japanese, and Korean pronunciations may be quite different. In effect, these scripts share approximately 20,000 characters, and Unicode identifies them with 20,000 values rather than with 60,000 values.
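The arithmetic, and the idea of one shared value, can be sketched in Python. The sample character and the four legacy codecs are assumptions chosen for illustration; in particular, Python's `euc_kr` codec is used here as a stand-in for the Wansung character set.

```python
two_byte_capacity = 2 ** 16  # values available in a two-byte system
separate = 3 * 30_000        # three scripts, each encoded separately
fits = separate <= two_byte_capacity
print(two_byte_capacity, separate, fits)

shared = "\u4e2d"  # one character common to all three languages
legacy = {
    "Chinese Traditional": "big5",
    "Chinese Simplified": "gb2312",
    "Japanese": "shift_jis",
    "Korean": "euc_kr",
}
# Each legacy code page gives the character its own two-byte number...
legacy_bytes = {label: shared.encode(codec) for label, codec in legacy.items()}

# ...but every one of them decodes to the same single Unicode value.
same_value = all(
    legacy_bytes[label].decode(codec) == shared
    for label, codec in legacy.items()
)
print(same_value, f"U+{ord(shared):04X}")
```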
Both Windows 95 and Windows NT support Unicode, and these operating systems supply several fonts, such as Arial, Courier New, and Times New Roman, that implement subsets of the Unicode standard that support many European languages. Language packs supply additional Unicode fonts that implement other subsets of the standard to support their languages: Chinese Simplified supplies MingLiU, Japanese supplies MS Gothic, and Korean supplies GulimChe.
Universal adoption of Unicode would simplify browser development and make content more flexible because content creators could combine symbols from any languages in the same document.
One drawback is that Unicode's support for multiple Asian scripts in one double-byte character set can cause some cultural friction. A Unicode font has one image for each of the 20,000 "shared" characters, and readers who are used to one script may recognize many of these images as foreign renditions, or may not recognize them at all. An analogous, but much less serious problem occurs when a British reader sees the American spelling "color" instead of "colour." This problem occurs mainly on pages that mix two or more Asian scripts on a single page; otherwise, the browser can display text using a font whose images are "tuned" for a particular language.
Despite this obstacle, Unicode is the best prospect for a single character set. It will make multilingual content easier to create and view, and it is the standard endorsed by such bodies as the World Wide Web Consortium.
The Web can improve global communication by letting people everywhere browse content in familiar languages. Anyone can download the Internet Explorer 3.0 browser software free from the Microsoft Web site, and localized content creation software is available both in the United States and abroad. But this software is only the delivery mechanism: someone has to create the content, and that person must be sensitive to linguistic and cultural variations. For example, if you're selling baked goods on the Internet you should be aware that what Americans call "cookies" the British call "biscuits." Other more embarrassing linguistic faux pas get in the way of building good communications and strong relationships with others. In short, Microsoft software lets you publish words in many languages, but you still need to choose those words carefully.
The best in-depth reference guide for creating multilingual content is Developing International Software For Windows 95 and Windows NT, by Nadine Kano, 1995, published by Microsoft Press (ISBN 1-55615-840-8). This is an invaluable resource for professional content creators.
You can find specific information about these topics on Microsoft's Web site:
| • | Internet Explorer, including language packs: http://www.microsoft.com/ie/default.htm |
| • | Creating Web sites (including a META CHARSET reference): the Site Builder Workshop: http://www.microsoft.com/workshop/default.htm |
| • | Examples of non-English Web content: choose one of Microsoft's international subsidiaries from http://www.microsoft.com |
Microsoft TechNet
May 1997
Volume 5, Issue 5