Developing OpenType Fonts
for Indic Scripts (2 of 3):
Shaping Engine

The Indic shaping engine of Uniscribe processes text in stages. The stages are:

  1. Analyze the syllables
  2. Reorder characters
  3. Shape (substitute) glyphs with OTLS (OpenType Library Services)
  4. Position glyphs with OTLS

The descriptions that follow will help font developers understand the rationale for the Indic feature encoding model, and help application developers better understand how layout clients can divide responsibilities with operating system functions.


Analyze the syllables

The syllable unit that the shaping engine receives for shaping is a sequence of Unicode characters. The characters are not necessarily stored in the order in which they appear once composed into a syllable. In addition, some of them take priority in combining within the syllable to form another graphically distinct unit (e.g. akhand KSSA).

First, the shaping engine determines syllable boundaries and isolates certain parts and properties of a syllable. To be able to place syllable elements in required positions and find out the possibility of them combining to form graphically distinct units, the shaping engine analyzes the syllable and gets answers to questions like:

  • Is there a 'reph' formation in the syllable? A syllable starting with letter "Ra" + Halant may be flagged as a syllable containing a 'reph'.

  • Which consonants can form 'vattu' variants?

  • Given a consonant within the syllable, does it qualify as a pre-base/below-base/post-base form? Does it qualify as a 'halant' form? (e.g. words in Sanskrit that end with the 'halant' form of a consonant, or chillaksharams in Malayalam).
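
Syllable boundary detection can be sketched as pattern matching over character classes. The example below is a minimal sketch, not Uniscribe's implementation: it encodes the three input syllable forms listed under "Reorder characters" literally, and the one-letter class codes, the `char_class` function and the restriction to Devanagari code points are all assumptions made for illustration.

```python
import re

def char_class(ch):
    """Map a Devanagari character to a one-letter class (hypothetical codes)."""
    cp = ord(ch)
    if 0x0915 <= cp <= 0x0939:
        return "C"  # consonant
    if cp == 0x093C:
        return "N"  # nukta
    if cp == 0x094D:
        return "H"  # halant
    if 0x093E <= cp <= 0x094C:
        return "M"  # matra (dependent vowel sign)
    if 0x0901 <= cp <= 0x0903:
        return "V"  # vowel modifier (VM)
    if cp in (0x0951, 0x0952):
        return "S"  # stress mark (SM)
    if 0x0905 <= cp <= 0x0914:
        return "O"  # independent vowel (VO)
    return "X"      # anything else

# {C + [Nukta] + H} + C + ([M] + [VM] + [SM] | H)   or   VO + [VM] + [SM]
SYLLABLE = re.compile(r"(?:CN?H)*C(?:H|M?V?S?)|OV?S?")

def split_syllables(text):
    """Split text into syllables; unmatched characters become single pieces."""
    classes = "".join(char_class(ch) for ch in text)
    pieces, pos = [], 0
    while pos < len(text):
        m = SYLLABLE.match(classes, pos)
        end = m.end() if m else pos + 1
        pieces.append(text[pos:end])
        pos = end
    return pieces
```

For instance, the sequence KA, Halant, SSA, TA, Halant, RA, I-matra, YA has the class string `CHCCHCMC`, which this pattern divides into three syllables of lengths 3, 4 and 1.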

Next, the base (full-form) consonant of the syllable is identified. All other elements are classified by their position relative to the base: pre-base, below-base, above-base and post-base.

Then the Indic shaping engine splits matras that have components appearing on more than one side of the base glyph into the corresponding parts (pre-base, below-base, above-base or post-base parts).
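
The matra-splitting step can be illustrated with Unicode canonical decompositions. This is only a sketch: it assumes that the parts Uniscribe produces coincide with the canonical decomposition of the matra, which the text above does not state, and the function name `split_matra` is hypothetical.

```python
import unicodedata

def split_matra(ch):
    """Return the parts of a multi-part matra, or [ch] if it has none.

    Assumption: canonical (NFD) decomposition stands in for the shaping
    engine's own character-to-character splitting table.
    """
    return list(unicodedata.normalize("NFD", ch))

# Bengali vowel sign O (U+09CB) has a left part and a right part.
parts = split_matra("\u09CB")
print([f"U+{ord(p):04X}" for p in parts])
```

A single-part matra such as Devanagari vowel sign I (U+093F) is returned unchanged as a one-element list.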


Reorder characters

Uniscribe creates and manages a buffer of appropriately reordered character codes, delineated as "clusters." Uniscribe reorders character codes within clusters according to several rules (described below). Then, Uniscribe obtains the corresponding glyph string by passing the reordered character string to the glyph substitution function of the OTL Services.

Because glyph strings are obtained from reordered character strings, the features in an Indic font must be encoded to map reordered characters (and combinations of characters) to their corresponding glyphs. Consequently, font developers are relieved of several layers of complexity in defining features - allowing Uniscribe to perform standard character reordering operations.

The character reordering rules of the Uniscribe Indic shaping engine are described below. None of the rules need to be encoded in an OpenType font, as long as the font is to be used with Uniscribe (or another client that follows the Unicode Standard for character reordering). In fact, if a font developer attempted to encode such reordering information in an OpenType font, they would need to add a huge number of many-to-many glyph mappings to cover the very simple algorithms that Uniscribe uses.

Uniscribe always performs reordering operations in a specified order, as described below.

Starting with a syllable of one of the following forms:

{C + [Nukta] + H} + C + [M] + [VM] + [SM]

...or a syllable without vowels

{C + [Nukta] + H} + C + H

...or a syllable without consonants

VO + [VM] + [SM]

  1. The shaping engine finds the base consonant of the syllable, using the following algorithm: starting from the end of the syllable, move backwards until a consonant is found that does not have a below-base or post-base form (post-base forms have to follow below-base forms), or arrive at the first consonant. The consonant stopped at will be the base.

    • If the syllable starts with Ra + H (in a script that has 'Reph'), Ra is excluded from candidates for base consonants.

    • In Kannada and Telugu, the base consonant cannot be farther than 3 consonants from the end of the syllable.

  2. If the base consonant is not the last one, Uniscribe moves the halant from the base consonant to the last one.

  3. If the syllable starts with Ra + H, Uniscribe moves this combination so that it follows either:

    • the post-base 'matra' (if any) or the base consonant (in scripts that show similarity to Devanagari, i.e., Devanagari, Gujarati, Bengali)

    • the base consonant (other scripts)

    • the end of the syllable (Kannada)

  4. Uniscribe splits two- or three-part matras into their parts. (This splitting is a character-to-character operation.) Then:

    • in scripts that show similarity to Devanagari, Uniscribe moves the left 'matra' part to the beginning of the syllable.

    • in Malayalam and Tamil, the left 'matra' is moved to immediately precede the base glyph (see section below).

  5. Uniscribe classifies consonants and 'matra' parts as pre-base, above-base (Reph), below-base or post-base. This classification exists on the character code level and is language-dependent, not font-dependent.

  6. Uniscribe then groups elements of the syllable (consonants and 'matras') according to this classification. Pre-base elements will precede the base consonant. The above-base, below-base and post-base components will follow the base glyph.

    • In scripts that show similarity to Devanagari, the Reph (Ra + H) and vowel modifiers will be positioned in the syllable after the post-base 'matra' (if any), since these become marks on the 'matra', not on the base glyph.

    • In Kannada/Telugu, on the other hand, the below-base consonants may appear below the post-base matra (if any) and should be reordered after it.

    • Below-base Ra (vattu) will be positioned following the consonants on which it is placed (which could either be the base consonant or one of the pre-base consonants).

    • 'Halants' and 'nukta' marks are moved with the consonants they affect.
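
Step 1 of the list above (finding the base consonant) can be sketched as follows. This is a simplified illustration under stated assumptions, not Uniscribe's code: the function name, the consonant set and the set of consonants with below-base or post-base forms are hypothetical inputs, the example data treats Devanagari Ra as the only consonant with a below-base form, and the Kannada/Telugu three-consonant limit is omitted. Keeping a leading Ra as a candidate when it is the only consonant is also an assumption.

```python
HALANT = "\u094D"  # Devanagari halant (virama)
RA = "\u0930"      # Devanagari letter Ra

# Illustrative data (assumptions for this sketch).
DEVANAGARI_CONSONANTS = {chr(cp) for cp in range(0x0915, 0x093A)}
BELOW_OR_POST = {RA}

def find_base(syllable, consonants, below_or_post, has_reph=True):
    """Return the index of the base consonant in `syllable`, or None."""
    candidates = [i for i, ch in enumerate(syllable) if ch in consonants]
    if not candidates:
        return None
    # A leading Ra + H forms a reph, so that Ra is excluded from the
    # candidates (assumption: unless it is the only consonant).
    if has_reph and syllable[:2] == RA + HALANT and len(candidates) > 1:
        candidates = candidates[1:]
    # Walk backwards until a consonant without a below-base or
    # post-base form is found.
    for i in reversed(candidates):
        if syllable[i] not in below_or_post:
            return i
    # Otherwise stop at the first remaining consonant.
    return candidates[0]

# KA + H + RA: Ra has a below-base form, so KA (index 0) is the base.
print(find_base("\u0915\u094D\u0930", DEVANAGARI_CONSONANTS, BELOW_OR_POST))
# RA + H + KA: the leading Ra is a reph, so KA (index 2) is the base.
print(find_base("\u0930\u094D\u0915", DEVANAGARI_CONSONANTS, BELOW_OR_POST))
```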

After performing the character reordering steps, the sequence of characters will have one of the following forms:

For Devanagari and Gujarati:

[Mpre] + {Cpre + [Nukta] + H + [Ra + H]vattu} + Cbase +

[Ra + H]vattu + [Mbelow] + [VMbelow] + [SMbelow] +

[Mabove] + [Mpost] + [Ra + H]reph + [VMabove] + [SMabove] + [VMpost]

(Only one of Mpre, Mbelow, Mabove or Mpost can be present)

For Gurmukhi:

[Mpre] + {Cpre + [Nukta] + H} + Cbase +

[Cbelow + H] + [Mbelow] +

[Mabove] + [Cpost + H] + [Mpost] + [VMabove]

(Only one of Mpre, Mbelow, Mabove or Mpost can be present)

For Bengali and Oriya:

[Mpre] + {Cpre + [Nukta] + H} + Cbase +

{Cbelow + H} + [Mbelow] +

[Mabove] + [Ra + H]reph + [VMabove] +

[Cpost + H] + [Mpost] + [VMpost]

(Out of Mpre, Mbelow, Mabove or Mpost at most two can be present)

For Reformed Malayalam:

{Cpre + H} + [Mpre]* + [Ra + H]*vattu + Cbase +

{Cbelow + H} + [Mbelow] +

[Cpost + H] + [Mpost] + [VMpost]

(Out of Mpre1, Mpre2, Mbelow or Mpost different combinations can be present)

For Old/Traditional Malayalam:

{Cpre + H} + [Mpre]* + [Ra + H]*vattu + Cbase +

{Cbelow + H} + [Mbelow] + [Ra + H]reph +

[Cpost + H] + [Mpost] + [VMpost]

(Out of Mpre1, Mpre2, Mbelow or Mpost different combinations can be present)

For Tamil:

{Cpre + H} + [Mpre]* + Cbase + [Mabove] + [Mpost] + [VMpost]

(Out of Mpre, Mabove and Mpost different combinations can be present)

For Telugu:

{Cpre + H} + Cbase + [Mabove] + [Mbelow] + [Mpost] + {Cbelow + H} +

{Cpost + H} + [VMpost]

(Out of Mabove, Mbelow, Mpost and VMpost different combinations can be present)

For Kannada:

{Cpre + H} + Cbase + [Mabove] + [Mpost] + {Cbelow + H} +

{Cpost + H} + [LMpost] + [Ra + H]reph + [VMpost]

(Out of Mabove, Mbelow, Mpost and LMpost different combinations can be present)

* Will be reordered at syllable start for shaping.

In the absence of a vowel, we'll have

{Cpre + [Nukta] + H + [Ra + H]vattu} + Cbase + [Ra + H]vattu + H

Finally, a syllable with independent vowel will look like

VO + [VM1] + [VM2]


Shape glyphs with OTLS

The first step Uniscribe takes in shaping the character string is to map all characters to their nominal form glyphs. Then, Uniscribe applies contextual shape features to the glyph string.

Next, Uniscribe calls the OTL Services Library to shape the Indic syllable. All OTL processing is divided into a set of predefined features (described and illustrated in the Features section of this document). Each feature is applied, one by one, to the appropriate glyphs in the syllable and OTLS processes them. Uniscribe makes as many calls to the OTL Services as there are features. This ensures that the features are executed in the desired order.

The steps of the shaping process are outlined below. Not all of the features listed apply to all Indic script languages.

Shaping features:

  1. Language forms
    1. Apply feature 'nukt' to get nukta forms of consonants.
    2. Apply feature 'akhn' to get akhand ligatures.
    3. Apply feature 'rphf' to Ra and the following halant to get the reph glyph.
    4. Apply feature 'blwf' to get below-base forms of consonants.
    5. Apply feature 'half' to get half forms of pre-base consonants.
    6. Apply feature 'pstf' to get post-base forms of consonants.
    7. Apply feature 'vatu' to get ligatures of the below-base form of 'Ra' and the preceding consonant.

  2. Conjuncts and Typographical forms
    1. Apply feature 'pres' to get pre-base consonant conjuncts and pre-base matra conjuncts (i.e. consonant and matra conjuncts to the left of the base glyph).
    2. Apply feature 'blws' to get below-base consonant conjuncts; below-base matra conjuncts; below-base vowel modifier forms; and below-base stress and tone mark forms (i.e. consonant and matra conjuncts; typographical forms; vowel modifier forms; and stress and tone mark forms of below-base elements).
    3. Apply feature 'abvs' to get above-base matra conjuncts; reph conjuncts; above-base vowel modifiers; and above-base stress and tone marks (i.e. reph and matra conjuncts, typographical forms and vowel modifier forms of above-base elements).
    4. Apply feature 'psts' to get post-base consonant conjuncts, post-base matra conjuncts and post-base vowel modifiers (i.e. consonant and 'matra' conjuncts, typographical forms and vowel modifier forms of post-base elements).

      Note: In scripts that show similarity to Devanagari, the post-base matra goes before the above-base part, so the 'Post-base Substitutions' feature is executed before any above-base features.

  3. Halant form
    1. Apply feature 'haln' to put the base consonant in halant form (if the syllable ends with a halant).

      Note: The halant substitution is performed last to ensure that the base consonant is always in the full form during shaping.
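
The effect of applying features one at a time, in a fixed order, can be shown with a toy model. This is a sketch under heavy simplification: `apply_feature`, `shape`, the glyph names and the dictionary-based "font" are hypothetical and model only simple ligature substitutions; real OTL lookups are far richer, and the engine applies each feature only to the appropriate glyphs.

```python
# Feature tags in the order listed above.
SHAPING_FEATURES = [
    "nukt", "akhn", "rphf", "blwf", "half", "pstf", "vatu",  # language forms
    "pres", "blws", "abvs", "psts",                          # conjuncts
    "haln",                                                  # halant form
]

def apply_feature(glyphs, rules):
    """Apply one feature's rules (glyph-name tuple -> glyph name), left to right."""
    out, i = [], 0
    while i < len(glyphs):
        for seq, replacement in rules.items():
            if tuple(glyphs[i:i + len(seq)]) == seq:
                out.append(replacement)
                i += len(seq)
                break
        else:
            out.append(glyphs[i])
            i += 1
    return out

def shape(glyphs, font):
    """Make one pass per feature so that the features run in the desired order."""
    for tag in SHAPING_FEATURES:
        glyphs = apply_feature(glyphs, font.get(tag, {}))
    return glyphs

# Hypothetical font with an akhand ligature and a half form.
toy_font = {
    "akhn": {("ka", "halant", "ssa"): "kssa"},
    "half": {("ka", "halant"): "ka.half"},
}
print(shape(["ka", "halant", "ssa"], toy_font))
print(shape(["ka", "halant", "ta"], toy_font))
```

Because 'akhn' runs before 'half', the sequence ka + halant + ssa becomes the akhand ligature before the half-form rule has a chance to consume ka + halant.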


Position glyphs with OTLS

Uniscribe next applies features concerned with positioning, calling functions of OTLS to position glyphs.

Positioning features:

  1. Below-base marks
    1. Apply feature 'blwm' to position below-base forms, vowel modifiers and/or stress/tone marks.

  2. Above-base marks
    1. Apply feature 'abvm' to position above-base forms, vowel modifiers and/or stress/tone marks (on base glyph or post-base matra).

  3. Distances
    1. Apply feature 'dist' to adjust other distances. (e.g. to provide kerning between post and pre-base elements and the base glyph).


Base elements

Commonly, a feature is required for dealing with the base glyph and one of the post-base, pre-base or above-base elements. Since it is not possible to reorder ALL of these elements next to the base glyph, we need to skip over the elements "in the middle" (reordering-wise).

The solution is to assign different mark attachment classes to different elements of the syllable and positional forms, and in any given lookup work with one mark type only. For example, in above-base substitutions we need only consider above-base elements most of the time.

Generally, it is good practice to label as "mark" glyphs that are denoted as marks in the Unicode Standard as well as below-base/above-base forms of consonants. Then, different attachment classes should be assigned to different marks depending on their position with respect to the base.
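
The idea of working with one mark type per lookup can be sketched as a matcher that skips marks of other attachment classes. This is an illustrative model only: the function name, the glyph names and the class labels are hypothetical, and a real font expresses this through the lookup's mark attachment settings rather than through code.

```python
def match_with_class(glyphs, start, pattern, mark_class, keep_class):
    """Match `pattern` from glyphs[start], skipping marks of other classes.

    `mark_class` maps mark glyph names to an attachment class; glyphs
    absent from it are treated as non-marks and are never skipped.
    Returns the list of matched indices, or None if there is no match.
    """
    indices, i = [], start
    for wanted in pattern:
        # Skip marks whose attachment class differs from the one kept.
        while i < len(glyphs) and mark_class.get(glyphs[i]) not in (None, keep_class):
            i += 1
        if i >= len(glyphs) or glyphs[i] != wanted:
            return None
        indices.append(i)
        i += 1
    return indices

# Hypothetical classes: a below-base matra sits between base and reph.
classes = {"u_matra": "below", "reph": "above"}
glyphs = ["ka", "u_matra", "reph"]

# An above-base lookup sees ka + reph and ignores the below-base matra.
print(match_with_class(glyphs, 0, ["ka", "reph"], classes, "above"))
# A below-base lookup does not skip the matra, so ka + reph fails to match.
print(match_with_class(glyphs, 0, ["ka", "reph"], classes, "below"))
```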


Left Matras in Malayalam and Tamil

In these languages the left (part of a) 'matra' is not placed in front of the whole syllable but immediately precedes the base glyph.

The problem is that in the presence of (font-dependent) consonant conjuncts it is impossible to predict where the 'matra' should be reordered so that consonant conjunct ligatures don't have to "skip over" it.

Although the Tamil script uses only one consonant conjunct (KSSA), conjuncts are in abundance in Malayalam.

To solve the problem, Uniscribe always places the pre-base 'matras' at the beginning of the syllable for shaping. Then, for the above-mentioned scripts, Uniscribe reorders them to immediately before the base glyph at the end of the script shaping routine for correct placement.
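
The final move of the left matra can be sketched as a simple list operation. The function name, the glyph names and the way the base index is supplied are assumptions made for illustration; how the engine tracks the base glyph after substitutions is not described here.

```python
def move_left_matra(glyphs, base_index, pre_base_matras):
    """Move a syllable-initial pre-base matra to just before the base glyph.

    `glyphs` is the shaped glyph list with the matra at index 0 (where it
    was placed for shaping); `base_index` is the base glyph's index in
    that same list.
    """
    if not glyphs or glyphs[0] not in pre_base_matras:
        return list(glyphs)
    matra, rest = glyphs[0], list(glyphs[1:])
    pos = base_index - 1  # index of the base once the matra is removed
    return rest[:pos] + [matra] + rest[pos:]

# Hypothetical glyph names: one pre-base consonant form, then the base.
shaped = ["matra.E", "C1.pre", "C2.base"]
print(move_left_matra(shaped, 2, {"matra.E"}))
```

When the matra already sits directly before the base (base index 1), the list is returned in the same order.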


Chillaksharams in Malayalam

Some Malayalam consonants have more than one way of representing the consonant followed by a halant (chilla). These forms are distinct from the parent consonant and do not have a visible chilla. They appear only at non-initial or final consonant locations in syllables. Known as Chillaksharams, they are treated as halant forms by the shaping engine. The consonants identified as having a chillaksharam form are: KA, NNA, NA, RA, LA, LLA. Their respective chillaksharams are: IK, INN, IN, IR, IL, ILL.


Invalid combining marks

Combining marks and signs that appear in text not in conjunction with a valid consonant base are considered invalid. Uniscribe displays these marks using the fallback rendering mechanism defined in the Unicode Standard (section 5.12, 'Rendering Non-Spacing Marks' of the Unicode Standard 3.1), i.e. positioned on a dotted circle.

Please note that to render a sign standalone (in apparent isolation from any base) one should apply it on a space (see section 2.5 'Combining Marks' of the Unicode Standard). Uniscribe requires a ZWJ to be placed between the space and a mark for them to combine into a standalone sign (i.e. to get a shape of I-matra without the dotted circle one should type Space + ZWJ + I-matra).

For the fallback mechanism to work properly, an Indic OTL font should contain a glyph for the dotted circle (U+25CC). In case this glyph is missing from the font, the invalid signs will be displayed on the missing glyph shape (white box).
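
The fallback behaviour can be approximated at the character level as follows. This is a rough sketch, not Uniscribe's logic: it treats any letter or preceding mark as a valid base, which is much looser than the engine's syllable analysis, and the function names are hypothetical.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"
ZWJ = "\u200D"

def is_mark(ch):
    """True for non-spacing and spacing combining marks."""
    return unicodedata.category(ch) in ("Mn", "Mc")

def add_fallback_bases(text):
    """Insert a dotted circle before each mark that has no base."""
    out = []
    for i, ch in enumerate(text):
        if is_mark(ch):
            prev = text[i - 1] if i > 0 else ""
            # Space + ZWJ + mark requests a standalone sign.
            standalone = prev == ZWJ and i >= 2 and text[i - 2] == " "
            # Simplifying assumption: any letter or earlier mark is a base.
            has_base = prev != "" and (prev.isalpha() or is_mark(prev))
            if not has_base and not standalone:
                out.append(DOTTED_CIRCLE)
        out.append(ch)
    return "".join(out)

# An isolated Devanagari I-matra (U+093F) gets a dotted circle.
print([f"U+{ord(c):04X}" for c in add_fallback_bases("\u093F")])
```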

In addition to the 'dotted circle', other Unicode code points that are recommended for inclusion in any Indic font are the ZWJ (zero width joiner; U+200D), the ZWNJ (zero width non-joiner; U+200C) and the ZWSP (zero width space; U+200B).

© 2008 Microsoft Corporation. All rights reserved.