Global Development and Computing Portal

Authoring HTML for Middle Eastern Content

The World Wide Web Consortium (W3C) created the HTML and CSS standards with a global audience in mind. These standards include features specific to displaying Middle Eastern languages correctly. Many browsers now support Middle Eastern content, either natively or through plug-in extensions created by third parties. This article gives an overview of the HTML 4.0 and CSS2 specifications as they pertain to Middle Eastern content. It should help the reader create web pages with conformant HTML that is portable across platforms.

On This Page
Use of charset
Use of Entities
The LANG attribute
Font Size
Bidirectional Layout
Applying the DIR to groups of block level or in-line/phrase elements
The BDO Element
Setting Direction with CSS/Stylesheets
Using the Document Object Model with Bidirectional Text
CSS and Script Examples
HTML for Farsi and Urdu
Summary
Glossary
About the Author

Use of charset

The names used for document charsets, or encodings, are registered with IANA (Internet Assigned Numbers Authority). IANA is dedicated to preserving the central coordinating functions of the global Internet for the public good.

The charset is used in the HEAD portion of the HTML document and tells the browser how the text in the document has been encoded. Typically you will see the charset mixed in with other information in the following format:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-6">

For Middle Eastern web pages you will commonly see the following charsets:

UTF-8 - The W3C's recommended encoding. Can represent all characters defined in the Unicode standard.

windows-1252 - Windows 1252 codepage (no Arabic characters included)

windows-1256 - Windows Arabic codepage

asmo-708 - ASMO 708 codepage

dos-720 - Arabic DOS 720 codepage

Be mindful that each charset contains a limited set of characters. If your page uses any characters that are not in its charset, you must identify them explicitly with entities (character references) so that the browser can render them properly.
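As a minimal sketch of this rule, assume a page saved in ISO-8859-6, the charset used in the META example above. The fragment types the beh directly and writes the peh (U+067E), which that charset does not contain, as a numeric character reference. The page title and sample letters are illustrative only.

```html
<HTML DIR="RTL">
   <HEAD>
      <!-- Declares how the bytes of this file are encoded -->
      <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-6">
      <TITLE>Character outside the charset</TITLE>
   </HEAD>
   <BODY>
      <!-- beh exists in ISO-8859-6 and can be typed directly -->
      <P>ب</P>
      <!-- peh (U+067E, decimal 1662) is not in ISO-8859-6,
           so it is written as a numeric character reference -->
      <P>&#1662;</P>
   </BODY>
</HTML>
```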


Use of Entities

Improper use of entities for Arabic is a common cause of Middle Eastern web pages showing up in browsers as upper ASCII characters. Awareness of this issue can help you author web content that displays correctly.

For example, the Arabic beh (ب) is located at decimal 200 in the Windows-1256 codepage. To make the beh appear in an HTML document, either type the keystroke that produces the character in the editor, or explicitly declare its Unicode value (&#1576;). Some HTML pages have used &Egrave; for the beh. Under the HTML 4.0 specification, &Egrave; renders as the capital letter E with a grave accent (È), not the beh. By HTML definition, a character reference introduced by the ampersand character (&) bypasses the codepage-to-Unicode mapping and explicitly displays the character specified.

Some HTML editors incorrectly represent Arabic characters as entities. When this happens, it is impossible to use a change in encoding to view entities as Arabic text in an HTML compliant browser. If your editor generates incorrect upper-ASCII entities for Arabic characters, consider switching editors. Otherwise, it will be necessary to hand-edit the HTML to remove them.
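The difference can be sketched as follows, assuming a page saved in the Windows-1256 codepage. The first two paragraphs are correct ways to write the beh. The third shows the incorrect entity, which a conformant browser displays as È whatever the charset is.

```html
<HTML DIR="RTL">
   <HEAD>
      <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1256">
      <TITLE>Entities and Arabic text</TITLE>
   </HEAD>
   <BODY>
      <!-- Correct: the character itself, stored as byte 200 in Windows-1256 -->
      <P>ب</P>
      <!-- Correct: decimal and hexadecimal references to U+0628 (beh) -->
      <P>&#1576; &#x0628;</P>
      <!-- Incorrect for Arabic: always renders as È, not as the beh -->
      <P>&Egrave;</P>
   </BODY>
</HTML>
```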


The LANG attribute

The LANG attribute is provided solely to assist with lexical analysis of the text, such as spell checking. It must never be used to specify the directional layout properties of the HTML document. Currently, few applications provide proofing tools for native HTML. However, as more applications migrate toward native web support, this attribute could become very useful.
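A short sketch of the separation described above: LANG names the language ("ar" and "en" are the standard codes for Arabic and English), while DIR sets the layout direction independently. The sample sentences are the ones used elsewhere in this article.

```html
<!-- LANG identifies the language; DIR sets the layout direction -->
<P LANG="ar" DIR="RTL">اهلا وسهلا من الشرق الأوسط.</P>
<P LANG="en" DIR="LTR">Hello from the Middle East.</P>
```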


Font Size

When a browser displays a page with multiple scripts, for example Arabic and English, the default font size may leave one script readable but the other illegibly small or unaesthetically large. When authoring such pages, the best workaround is to set font sizes explicitly, preferably in style sheet rules for each script. Following is the source of the example seen at Figure 1.

<STYLE type="text/css">
/* One rule per script so each can be sized independently */
SPAN.arabic
   { font-family: "Traditional Arabic";
    font-size: 120%; }
SPAN.english
   { font-family: "Times New Roman";
    font-size: 100%; }
</STYLE>

<BODY>
   <P>
      <SPAN class="arabic">
         اهلا وسهلا من الشرق الأوسط.
      </SPAN>
      <SPAN class="english">
         Hello from the Middle East.
      </SPAN>
   </P>
</BODY>

Figure 1


Bidirectional Layout

The DIR (direction) attribute is used to indicate the base directionality of an element and provides for proper resolution of weak and neutral characters. The Unicode specification assigns directionality to characters and defines an algorithm for determining proper layout of text based on its direction. This algorithm is known as the Unicode Bidirectional Algorithm. Weak characters are characters like numerals that may behave differently depending upon the language with which they are used. Neutral characters, like punctuation, do not change with the language used. However, they may be laid out differently depending upon the directional flow of the text.

When writing HTML pages, it is advisable to use the DIR attribute for text layout rather than the Unicode control characters that provide the same functionality (LRM (left-to-right mark), RLM (right-to-left mark), LRO (left-to-right override), RLO (right-to-left override), etc.). The Unicode control characters are not visible in most editors, so using them makes your HTML page harder to edit. However, LRM and RLM are valid, and DIR is not a substitute for them.

The DIR attribute is inherited by the element's children and may be overridden.
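The inheritance and override behavior can be sketched with a small fragment. This is an illustration only; under the Unicode Bidirectional Algorithm, the trailing exclamation mark in the last paragraph, being a neutral character in a right-to-left paragraph, should appear to the left of the word.

```html
<DIV DIR="RTL">
   <!-- Inherits RTL from the DIV -->
   <P>اهلا وسهلا من الشرق الأوسط.</P>
   <!-- Overrides the inherited direction for this paragraph only -->
   <P DIR="LTR">Hello from the Middle East.</P>
   <!-- No DIR here, so the inherited RTL base direction applies
        and decides where the neutral "!" is placed -->
   <P>Hello!</P>
</DIV>
```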

Placement of the DIR on the HTML element vs. the BODY element

By HTML specification, the DIR attribute placed on the HTML element sets the layout direction of the document. In IE5 this also sets the ambient property of the document (used by embedded OLE objects). When no DIR attribute is placed on the HTML element, the direction is left to right. When <HTML DIR=RTL> is present, the following behavior can be expected:

The OLE/COM ambient property of the document is set to AMBIENT_RIGHTTOLEFT.

The document direction can be toggled through the document object model (DOM) by setting document.dir to "ltr" or "rtl".

An HTML Dialog will get the correct extended window styles set so it displays as an RTL dialog on a Bidi-enabled system.

If the document has vertical scrollbars, they will be on the left side if DIR=RTL.

If the DIR=RTL attribute is placed on the body instead of the HTML element:

The OLE/COM ambient property for the document will not reflect the direction on the BODY.

The ability to toggle the document's direction will be lost, because the body's direction is explicitly set.

Dialog window frames and captions will not reflect the direction of the BODY.

Vertical scrollbars will reflect the direction assigned to the body, not the document.
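Given the differences listed above, a right-to-left document would typically carry the attribute on the HTML element. The following is a minimal skeleton under that assumption; the title and paragraph text are illustrative.

```html
<!-- DIR on the HTML element sets the direction of the whole document -->
<HTML DIR="RTL">
   <HEAD>
      <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1256">
      <TITLE>Right-to-left document</TITLE>
   </HEAD>
   <!-- BODY carries no DIR of its own, so the document direction
        can still be toggled through the DOM -->
   <BODY>
      <P>اهلا وسهلا من الشرق الأوسط.</P>
   </BODY>
</HTML>
```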

Alignment should be implicit with the use of the DIR attribute on some HTML elements. Text in items like <P> and <H1> will be right aligned by default if they have DIR=RTL assigned to them. Therefore it is redundant to assign DIR=RTL and ALIGN=RIGHT.

Placement of the DIR on the TABLE element

<TABLE DIR=RTL> changes the visual layout of the table. Column A will be on the right of the table, while column C will be on the left. See Figure 2.


Figure 2
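A table along the lines of the one described for Figure 2 could be marked up as in the sketch below. The cell contents are placeholders.

```html
<!-- DIR=RTL makes the first column (Column A) the rightmost one -->
<TABLE DIR="RTL" BORDER="1">
   <TR>
      <TH>Column A</TH>
      <TH>Column B</TH>
      <TH>Column C</TH>
   </TR>
   <TR>
      <TD>Cell 1</TD>
      <TD>Cell 2</TD>
      <TD>Cell 3</TD>
   </TR>
</TABLE>
```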


Applying the DIR to groups of block level or in-line/phrase elements

When pages have groups of block level elements or groups of in-line elements, it is not necessary to apply the same DIR attribute to all of the members of the group.

Groups of block level elements that have the same parent can be placed in a DIV that has the direction that should be applied to the group. In this way a section of the document can have a direction that is different from the rest of the document.

Similarly, groups of in-line elements can be grouped inside a SPAN element to which the appropriate direction is assigned.

The default behavior of a document always resumes whenever the specified block or phrase ends, if the parent element does not have direction explicitly set.
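The grouping described in this section can be sketched as follows, assuming a document whose default direction is left to right. The headings and text are illustrative.

```html
<!-- One DIR on the DIV covers all three block level elements -->
<DIV DIR="RTL">
   <H2>الشرق الأوسط</H2>
   <P>اهلا وسهلا من الشرق الأوسط.</P>
   <P>اهلا وسهلا.</P>
</DIV>

<!-- One DIR on the SPAN covers the in-line elements inside it -->
<P>The greeting is
   <SPAN DIR="RTL"><B>اهلا</B> <I>وسهلا</I></SPAN>
   in Arabic.</P>

<!-- After the DIV and SPAN end, the document default resumes -->
<P>Hello from the Middle East.</P>
```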


The BDO Element

The BDO element (bidirectional override) is used to override the direction of text. The DIR attribute is required to be used with the BDO element so the proper direction override is implemented. The BDO element is useful for overriding the Unicode bidirectional algorithm when mixing alphabetic and numeric characters in part numbers.

Figure 3 will help you understand visually how the BDO works.


Figure 3
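As a small sketch of the override, the first paragraph below forces Latin text to run right to left, and the second forces a part number to be laid out strictly left to right in the order typed. The part number is hypothetical.

```html
<!-- Every character is treated as strong right-to-left: shows "RORRIM" -->
<P><BDO DIR="RTL">MIRROR</BDO></P>

<!-- Inside a right-to-left paragraph, keep the (hypothetical) part
     number in typed order regardless of the characters it contains -->
<P DIR="RTL">اهلا <BDO DIR="LTR">ب12-AB34</BDO></P>
```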


Setting Direction with CSS/Stylesheets

CSS has a feature similar to the DIR attribute. In CSS the direction is set through the use of two markup properties: direction and unicode-bidi. The following text is either copied from the CSS2 specification, or summarized.

'direction'

This property specifies the base writing direction of blocks and the direction of embeddings and overrides (see 'unicode-bidi') for the Unicode bidirectional algorithm. In addition, it specifies the direction of column layout in tables, the direction of horizontal overflow, and the position of an incomplete last line in a block in case of 'text-align: justify'.

Values for this property have the following meanings:

ltr - Left-to-right direction.

rtl - Right-to-left direction.

inherit - the property takes the same value as the property for the element's parent.

For the 'direction' property to have any effect on inline-level elements, the 'unicode-bidi' property's value must be 'embed' or 'override'.

Note: The 'direction' property, when specified for table column elements, is not inherited by cells in the column since columns don't exist in the document tree. Thus, CSS cannot easily capture the "dir" attribute inheritance rules described in HTML 4.0, section 11.3.2.1.

'unicode-bidi'

User agents following the bidirectional algorithm will implicitly display characters in the correct writing direction automatically. In cases where authors must assist the user agent with explicit directional markup, the author may signal that an element opens an explicit embedding or directional override.

Values for 'unicode-bidi' have the following meanings:

normal - The element follows implicit bidi rules. This means that implicit reordering works across element boundaries based on the 'direction' property and strong direction of characters in the text.

embed - The element explicitly opens an additional embedding and is equivalent to using the LRE or RLE mark in the code. The agent must terminate the explicit embedding at the end of the element (using a PDF).

bidi-override - The element explicitly forces the characters to be treated as strong characters in the direction specified by the 'direction' property. This is equivalent to using the HTML BDO element. The agent must terminate the explicit override at the end of the element (using a PDF). This is used for special cases, such as part numbers.

inherit - the property takes the same value as the property for the element's parent.

The 'direction' property has an implicit effect on in-line elements when applying the Unicode Bidirectional Algorithm. The 'unicode-bidi' property is only required when explicit markup is required to achieve the desired result.
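To illustrate the two explicit values, here is a minimal sketch of a style sheet that opens an embedding for a quoted phrase and applies an override to a part number. The class names "quote" and "partno" and the part number itself are hypothetical.

```html
<STYLE type="text/css">
   /* Opens a right-to-left embedding for the in-line element */
   SPAN.quote  { direction: rtl; unicode-bidi: embed; }
   /* Forces strict left-to-right order, like <BDO DIR="LTR"> */
   SPAN.partno { direction: ltr; unicode-bidi: bidi-override; }
</STYLE>

<P>The greeting is <SPAN class="quote">اهلا وسهلا</SPAN> in Arabic.</P>
<P>Part number: <SPAN class="partno">ب12-AB34</SPAN></P>
```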

The following HTML example demonstrates that the 'unicode-bidi' property is required to achieve proper text layout with in-line elements based on the Unicode Bidirectional Algorithm. It also illustrates an important design principle: web page designers should take bidirectionality into account, both in the language proper (elements and attributes) and in any accompanying style sheets. The style sheets should be designed so that bidirectional rules are separate from other markup rules. The bidirectional rules should not be overridden by other style sheets, so that the bidirectional behavior of the document language or style sheet is preserved.

<HTML DIR=RTL>
   <HEAD>
      <TITLE>Direction with Styles</TITLE>
      <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1256">
      <STYLE>
         DIV.arabic
            {direction: rtl; unicode-bidi: normal;}
         DIV.english
            {direction: ltr; unicode-bidi: normal;}
         SPAN.arabic1
            {direction: rtl; unicode-bidi: normal;}
         SPAN.arabic2
            {direction: rtl; unicode-bidi: embed;}
      </STYLE>
   </HEAD>
   <BODY>
      <DIV class=arabic>
         <P>يبرعلا1 يبرعلا2 english3 يبرعلا4 يبرعلا5</P>
         <P>يبرعلا6 <B>يبرعلا7 </B>يبرعلا8</P>
      </DIV>
      <DIV class=english>
         <P>english9 english10 english11 يبرعلا12 يبرعلا13</P>
         <P>english14 english15 english16</P>
         <P>english17 <SPAN class=arabic1>يبرعلا18 english19 يبرعلا20</SPAN></P>
         <P>english21 <SPAN class=arabic2> يبرعلا22 english23 يبرعلا24</SPAN></P>
      </DIV>
   </BODY>
</HTML>

Notice the difference in the 'unicode-bidi' assignment given to SPAN.arabic1 and SPAN.arabic2 above. Figure 4 gives a visual illustration of using the correct 'unicode-bidi' assignment for correct layout.


Figure 4


Using the Document Object Model with Bidirectional Text

This section covers the use of the DIR property in HTML, the direction property in Cascading Style Sheets (CSS2), and the Command IDs assigned to assist with implementing the DIR property in scripts. The samples below that use the style attribute are examples of how to accomplish your tasks. Use the style attribute sparingly in your daily work, as there's no need to style or mark up every paragraph.

1. Item for Properties Reference

Style:

<SPAN style="font-family: 'Courier New'; font-size: 120%; color: blue;">dir</span>

Syntax:

<SPAN style="font-family: 'Courier New'; font-size: 120%; color: blue;">object.dir[ = dir ]</span>

Valid values for dir are "ltr" and "rtl".

ltr - left-to-right

rtl - right-to-left

Remarks: This property has read-write permissions, meaning you can change as well as retrieve its current value.

Applies To: All elements but APPLET, BASE, BASEFONT, BR, FRAME, FRAMESET, HR, IFRAME, PARAM, SCRIPT

Expected Behavior:

HTML - Specifies the document default layout direction, applies the DIR property to the HTML element and refreshes the document.

TABLE - Specifies the table column layout direction, applies the DIR property to the TABLE element and updates the table layout. Actual positioning of the element will depend upon the parent's direction.

DIV, P, H1, H2, H3, H4, H5, H6, TH, TD, THEAD, TBODY, TFOOT, CAPTION - specifies reading order and default alignment (align right for right to left). Actual positioning of the element will depend upon the parent's direction.

SPAN and other phrase elements - specifies reading order of text only.

IMG and other non-text type elements - specifies reading order of any tool tip associated with the item.

BDO - required property that indicates the reading order direction that will be overridden. For example, <BDO DIR=RTL>MIRROR</BDO> will be displayed as "RORRIM".
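As a sketch of the read-write behavior described in the Remarks, the script below retrieves the dir property of a paragraph and flips it. The ID "greeting" is hypothetical, and the comparison is lowercased because the stored value may keep the case used in the markup.

```html
<P ID="greeting" DIR="RTL">اهلا وسهلا من الشرق الأوسط.</P>

<SCRIPT type="text/javascript">
   // Look up the paragraph (the ID "greeting" is hypothetical)
   var para = document.getElementById("greeting");
   // Read the current direction, then assign the opposite one
   if (para.dir.toLowerCase() == "rtl") {
      para.dir = "ltr";
   } else {
      para.dir = "rtl";
   }
</SCRIPT>
```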

2. CSS2 Support for the DIRECTION Property

Style:

<SPAN style="font-family: 'Courier New'; font-size: 120%; 
color: blue;">direction</span>

Description: Specifies the reading order of the item specified.

Syntax:

<SPAN style="font-family: 'Courier New'; font-size: 120%; 
color: blue;">object.style.direction[ = value ]</span>

Valid values for direction are "ltr", "rtl" and "inherit".

ltr - left-to-right

rtl - right-to-left

inherit - take the direction of the parent element

This property determines whether the direction of flow in an inline formatting context is left-to-right or right-to-left. It also specifies the direction of table layout.

3. CSS2 Support for the UNICODE-BIDI Property

Style:

<SPAN style="font-family: 'Courier New'; font-size: 120%; 
color: blue;">unicode-bidi</span>

Description: Specifies the manner in which the Unicode bidirectional algorithm processes the direction associated with the element.

Syntax:

<SPAN style="font-family: 'Courier New'; font-size: 120%; 
color: blue;">object.style.unicodeBidi[ = value ]</span>

Valid values for unicode-bidi are "normal", "embed", "bidi-override" and "inherit".

normal - handle the direction implicitly with the Unicode bidirectional algorithm

embed - force an additional embedding level

bidi-override - force the text to be laid out in the direction given

inherit - take the unicode-bidi value of the parent element
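The two style properties can be set together from script, as in this sketch. In script the CSS property 'unicode-bidi' is reached through the camel-cased name unicodeBidi. The ID "partno" and the part number are hypothetical.

```html
<P DIR="RTL">اهلا <SPAN ID="partno">ب12-AB34</SPAN></P>

<SCRIPT type="text/javascript">
   // Look up the span (the ID "partno" is hypothetical)
   var span = document.getElementById("partno");
   // Same effect as the rule { direction: ltr; unicode-bidi: bidi-override; }
   span.style.direction = "ltr";
   span.style.unicodeBidi = "bidi-override";
</SCRIPT>
```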


CSS and Script Examples

In style sheets the direction property may be assigned as in examples below.

a. Direction used Inline:


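<HTML>
   <HEAD>
      <TITLE>Style Sheets With DIR</TITLE>
   </HEAD>
   <BODY>
      <!-- Direction set through the inline STYLE attribute -->
      <P STYLE="direction: rtl; unicode-bidi: normal;">اهلا وسهلا من الشرق الأوسط.</P>
      <!-- Same direction set with the DIR attribute -->
      <P DIR=RTL>اهلا وسهلا من الشرق الأوسط.</P>
   </BODY>
</HTML>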

b. Direction used in an embedded style sheet:

<HTML>
   <HEAD>
      <TITLE>Style Sheets With DIR</TITLE>
      <STYLE>
         /* english direction is not really needed as it is default; */
         /* it is shown for example only */
         P.english
            { color: blue; direction: ltr; unicode-bidi: normal;}
         P.arabic
            { color: red; direction: rtl; unicode-bidi: normal; }
      </STYLE>
   </HEAD>
   <BODY>
      <P CLASS="english">Hello from the Middle East.</P>
      <P CLASS="arabic">اهلا وسهلا من الشرق الأوسط.</P>
   </BODY>
</HTML>

In scripts the direction property may be manipulated as in the example below.

<HTML>
   <HEAD>
      <TITLE>Changing Direction with SCRIPT</TITLE>
      <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1256">
      <SCRIPT>
         function changeDirection()
            {
            var curDir = document.dir;
            if (curDir == "rtl")
               {
               document.dir = "ltr";
               }
            else
               {
               document.dir = "rtl";
               }
            }
      </SCRIPT>
   </HEAD>
   <BODY>
      <P>
      <INPUT type=button 
             onClick="changeDirection()"
             value="Change direction">
      </P>
      <P>اهلا وسهلا من الشرق الأوسط. Hello from the Middle East.</P>
      <TABLE BORDER=1>
         <TR>
            <TH>Column A</TH>
            <TH>Column B</TH>
            <TH>Column C</TH>
         </TR>
         <TR>
            <TD>Cell 1</TD>
            <TD>Cell 2</TD>
            <TD>Cell 3</TD>
         </TR>
      </TABLE>
   </BODY>
</HTML>

HTML for Farsi and Urdu

Earlier in the article I talked about the use of hex or decimal entities to represent characters not found in charsets. The use of entities is necessary for writing Farsi and Urdu pages that are portable and do not require special fonts.

With Microsoft's IE5, or other programs that use the Uniscribe engine (USP10.DLL), it is possible to shape all characters that reside within the Unicode Arabic block. This includes support for Arabic, Farsi, Urdu, Sindhi, Pashto, and others. Currently, fonts are the missing link to having web pages that display correctly with the Unicode values. Unfortunately, font designers for Farsi and Urdu often make their own layout and have not designed their fonts to the OpenType specification. Additionally, until Office 2000, Microsoft included only Arabic characters in the fonts it distributed.

Recently, Microsoft has added a few additional characters to the Windows-1256 codepage. While this gives some better coverage if one needs to use a codepage, it is better to create Farsi or Urdu web pages using Unicode encoding (like UTF-8) or to simply use entities. Following is a list of new characters added to the Windows-1256 codepage.

If you are using Arabic Windows9x or WinNT4 Arabic Enabled platforms, you already have support for the following Farsi characters in the old Windows-1256 codepage. Next to the character is the key combination you can type to get that character.


Figure 6

The Simplified and Traditional Arabic fonts that come with Windows have the peh, tcheh, jeh and gaf glyphs in them.

With Office 2000 you will also be able to type the following characters that are included in the new Windows-1256 codepage. To get this support you need an updated CP_1256.nls (Win9x) or C_1256.nls (WinNT) file. These updated files are installed as part of the Office 2000 installation. As noted above, the key combination you can type for the character when you are using the Windows-1256 codepage is next to the character name.


Figure 7

Note: not all Urdu characters are supported by the Windows-1256 codepage. There was not enough empty space to include all Urdu characters. The DecoType fonts that come with Windows 2000 have support for all Farsi and Urdu characters. If you want to type characters not listed above, refer to the Unicode values for character support. Other characters in the Unicode Arabic block can be used by defining character values. Generally, writing text using character values is not recommended, because the result is difficult to maintain.

Figure 5 demonstrates how one can use entities for Farsi and Urdu web pages. Writing pages using entities is somewhat tedious to do by hand. The source code for Figure 5 is:

<HTML DIR=RTL>
   <HEAD>
      <TITLE>Farsi and Urdu Sample</TITLE>
      <STYLE>
         DIV { font-family: 'MS Farsi'; font-size: 200%;}
      </STYLE>
   </HEAD>
   <BODY>
      <h2>Farsi</h2>
      <DIV>&#1575;&#1610;&#1606; &#1580;&#1605;&#1604;&#1607;
&#1610;&#1705; &#1606;&#1605;&#1608;&#1606;&#1607; &#1575;&#1584;
&#1581;&#1585;&#1608;&#1601; &#1601;&#1575;&#1585;&#1587;&#1740;
&#1607;&#1587;&#1578;.</DIV>
      <h2>Urdu</h2>
      <DIV>&#1610;&#1607; &#1580;&#1605;&#1604;&#1607;
&#1575;&#1585;&#1583;&#1615;&#1608; &#1705;&#1746; &#1581;&#1585;&#1608;&#1601;
&#1603;&#1575; &#1575;&#1610;&#1705;
&#1606;&#1605;&#1608;&#1606;&#1607;&#1729;&#1746;.</DIV>
   </BODY>
</HTML>

Figure 5


Summary

In summary, while it is possible to get good results by tailoring your pages to a specific browser, it is better to create pages that conform to current HTML and CSS standards. As HTML standards are refined and upgraded, most browser manufacturers update their product to take advantage of the changes.


Glossary

Bidi: short for Bidirectional.

Bidirectional: The characters in certain scripts are written from right to left. Consequently, text in a single block may appear with mixed directionality. This phenomenon is called bidirectionality, or "bidi" for short.

Unicode Bidirectional Algorithm: The Unicode standard defines a complex algorithm for determining the proper directionality of text. The algorithm consists of an implicit part based on character properties, as well as explicit controls for embeddings and overrides.

User Agent or Agent: A browser or other program that is capable of rendering HTML or CSS.


About the Author

Paul Nelson is a development lead in the Typography team at Microsoft Corporation. He was a member of the team that enabled complex script support in IE4 Complex Script version and IE5.

