Microsoft Typography | Developer information | Specifications | OpenType font development
Indic OpenType Specification | Terms | Shaping | Features | Other | Appendix


How the Indic shaping engine works

The Indic shaping engine of Uniscribe processes text in stages. The stages are:

  1. Analyzing the syllables.
  2. Reordering characters.
  3. Shaping (substituting) glyphs with OTLS (OpenType Library Services).
  4. Positioning glyphs with OTLS.

The descriptions that follow will help font developers understand the rationale for the Indic feature encoding model, and help application developers better understand how layout clients can divide responsibilities with operating system functions.


Analyzing the syllables

The syllable that the shaping engine receives for shaping is a sequence of Unicode characters. These characters are not necessarily stored in the order in which they appear when composed into a syllable. Also, a few of them may take priority in combining within the syllable to form another graphically distinct unit (e.g. akhand KSSA).

First, the shaping engine determines syllable boundaries and isolates certain parts and properties of a syllable. To be able to place syllable elements in required positions and find out the possibility of them combining to form graphically distinct units, the shaping engine analyzes the syllable and gets answers to questions like:

  • Is there a 'reph' formation in the syllable? A syllable starting with letter "Ra" + Halant may be flagged as a syllable containing a 'reph'.

  • Which consonants can form 'vattu' variants?

  • Given a consonant within the syllable, does it qualify as a pre-base/below-base/post-base form? Does it qualify as a 'halant' form? (e.g. words in Sanskrit that end with the 'halant' form of a consonant, or chillaksharams in Malayalam).

Next, the base (full-form) consonant of the syllable is identified. All other elements are classified by their position relative to the base: pre-base, below-base, above-base and post-base.

Then the Indic shaping engine splits matras that have components appearing on more than one side of the base glyph into the corresponding parts (pre-base, below-base, above-base or post-base parts).
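
As an illustration of matra splitting, the following is a minimal sketch in Python. It assumes that Unicode canonical decomposition is an acceptable stand-in for the shaping engine's own split tables, which this document does not list; the two may differ for some matras. The function name split_matra is hypothetical.

```python
import unicodedata


def split_matra(ch: str) -> list[str]:
    """Split a multi-part matra into its parts.

    Assumption: Unicode canonical decomposition stands in for the
    shaping engine's own (unpublished here) matra split tables.
    """
    decomposition = unicodedata.decomposition(ch)
    # An empty string means "no decomposition". A leading "<tag>" marks a
    # compatibility decomposition, which is not a matra split.
    if not decomposition or decomposition.startswith("<"):
        return [ch]
    parts = []
    for code in decomposition.split():
        # Recurse so that three-part matras are split completely.
        parts.extend(split_matra(chr(int(code, 16))))
    return parts


if __name__ == "__main__":
    # Bengali vowel sign O (U+09CB): one part sits before the base,
    # the other after it.
    print([f"U+{ord(p):04X}" for p in split_matra("\u09CB")])
    # Kannada vowel sign OO (U+0CCB) is a three-part matra.
    print([f"U+{ord(p):04X}" for p in split_matra("\u0CCB")])
```

Each returned part can then be classified separately (pre-base, below-base, above-base or post-base), which is what makes the later reordering a simple character-level operation.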


Reordering characters

Uniscribe creates and manages a buffer of appropriately reordered character codes, delineated as "clusters." Uniscribe reorders character codes within clusters according to several rules (described below). Then, Uniscribe obtains the corresponding glyph string by passing the reordered character string to the glyph substitution function of the OTL Services.

Because glyph strings are obtained from reordered character strings, the features in an Indic font must be encoded to map reordered characters (and combinations of characters) to their corresponding glyphs. Consequently, font developers are relieved of several layers of complexity in defining features, because Uniscribe performs the standard character reordering operations.

The character reordering rules of the Uniscribe Indic shaping engine are described below. None of the rules need to be encoded in an OpenType font, as long as the font is to be used with Uniscribe (or another client that follows the Unicode Standard for character reordering). In fact, if a font developer attempted to encode such reordering information in an OpenType font, they would need to add a huge number of many-to-many glyph mappings to cover the very simple algorithms that Uniscribe uses.

Uniscribe always performs reordering operations in a specified order, as described below.

Starting with a syllable of one of the following forms:

{C + [Nukta] + H} + C + [M] + [VM] + [SM]

...or a syllable without vowels

{C + [Nukta] + H} + C + H

...or a syllable without consonants

VO + [VM] + [SM]
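
The three forms above can be read as regular expressions over character classes, where {x} means zero or more repetitions and [x] means optional. The following Python sketch takes that reading literally. The one-letter codes, the FORMS table and the classify function are this sketch's own shorthand, and it assumes the caller has already assigned a class to every character.

```python
import re

# One-letter codes for the character classes (hypothetical shorthand).
CODES = {"C": "C", "Nukta": "N", "H": "H", "M": "M",
         "VM": "V", "SM": "S", "VO": "O"}

# Each syllable form from the text, transcribed as a regular expression.
FORMS = [
    ("consonant syllable", re.compile(r"(?:CN?H)*CM?V?S?")),
    ("syllable without vowels", re.compile(r"(?:CN?H)*CH")),
    ("syllable without consonants", re.compile(r"OV?S?")),
]


def classify(classes):
    """Return the name of the matching syllable form, or None."""
    encoded = "".join(CODES[c] for c in classes)
    for name, pattern in FORMS:
        if pattern.fullmatch(encoded):
            return name
    return None


if __name__ == "__main__":
    print(classify(["C", "H", "C", "M"]))
    print(classify(["C", "Nukta", "H", "C", "H"]))
    print(classify(["VO", "VM"]))
```

A sequence that matches none of the three forms returns None; how the shaping engine treats such input is not covered by this sketch.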

  1. The shaping engine finds the base consonant of the syllable, using the following algorithm: starting from the end of the syllable, move backwards until a consonant is found that does not have a below-base or post-base form (post-base forms have to follow below-base forms), or arrive at the first consonant. The consonant stopped at will be the base.

    • If the syllable starts with Ra + H (in a script that has 'Reph'), Ra is excluded from candidates for base consonants.

    • In Kannada and Telugu, the base consonant cannot be farther than 3 consonants from the end of the syllable.

  2. If the base consonant is not the last one, Uniscribe moves the halant from the base consonant to the last one.

  3. If the syllable starts with Ra + H, Uniscribe moves this combination so that it follows either:

    • the post-base 'matra' (if any) or the base consonant (in scripts that show similarity to Devanagari, i.e., Devanagari, Gujarati, Bengali)

    • the base consonant (other scripts)

    • the end of the syllable (Kannada)

  4. Uniscribe splits two- or three-part matras into their parts. (This splitting is a character-to-character operation.) Then:

    • in scripts that show similarity to Devanagari, Uniscribe moves the left 'matra' part to the beginning of the syllable.

    • in Malayalam and Tamil, the left 'matra' is moved to immediately precede the base glyph (* see section Other Encoding Issues).

  5. Uniscribe classifies consonants and 'matra' parts as pre-base, above-base (Reph), below-base or post-base. This classification exists on the character code level and is language-dependent, not font-dependent.

  6. Uniscribe then groups elements of the syllable (consonants and 'matras') according to this classification. Pre-base elements will precede the base consonant. The above-base, below-base and post-base components will follow the base glyph.

    • In scripts that show similarity to Devanagari, the Reph (Ra + H) and vowel modifiers will be positioned in the syllable after the post-base 'matra' (if any), since these become marks on the 'matra', not on the base glyph.

    • In Kannada/Telugu, on the other hand, the below-base consonants may appear below the post-base matra (if any) and should be reordered after it.

    • Below-base Ra (vattu) will be positioned following the consonants on which it is placed (which could either be the base consonant or one of the pre-base consonants).

    • 'Halants' and 'nukta' marks are moved with the consonants they affect.
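
Step 1 above (finding the base consonant) can be sketched in Python as follows. This is a simplified illustration for Devanagari only, not the engine's actual code. It assumes that consonants are the code points U+0915 to U+0939 and that Ra is the only consonant with a below-base form; the function name find_base and its parameters are hypothetical. The Kannada/Telugu distance limit is not modeled.

```python
HALANT = "\u094D"  # Devanagari sign virama
RA = "\u0930"      # Devanagari letter Ra


def is_consonant(ch):
    # Assumption: plain Devanagari consonants KA..HA only.
    return "\u0915" <= ch <= "\u0939"


def find_base(syllable, non_base_forms=frozenset({RA}), has_reph=True):
    """Return the index of the base consonant, or None if there is none.

    non_base_forms: consonants assumed to have a below-base or
    post-base form (hypothetical parameter, script dependent).
    """
    candidates = [i for i, ch in enumerate(syllable) if is_consonant(ch)]
    if not candidates:
        return None
    # A leading Ra + H forms a reph, so that Ra cannot be the base
    # (unless it is the only consonant in the syllable).
    if has_reph and len(candidates) > 1 and syllable.startswith(RA + HALANT):
        candidates = candidates[1:]
    # Walk backwards from the end; stop at the first consonant that has
    # no below-base or post-base form.
    for i in reversed(candidates[1:]):
        if syllable[i] not in non_base_forms:
            return i
    # Otherwise we have arrived at the first candidate consonant.
    return candidates[0]


if __name__ == "__main__":
    print(find_base("\u0915\u094D\u0930"))  # Ka + H + Ra
    print(find_base("\u0930\u094D\u0915"))  # Ra + H + Ka
    print(find_base("\u0915\u094D\u0924"))  # Ka + H + Ta
```

In Ka + H + Ra the trailing Ra is skipped because it can take a below-base form, so Ka is the base. In Ra + H + Ka the leading Ra is excluded as a reph candidate, so Ka is again the base.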

After performing the character reordering steps, the sequence of characters will have one of the following forms:

For Devanagari and Gujarati:

[Mpre] + {Cpre + [Nukta] + H + [Ra + H]vattu} + Cbase +

[Ra + H]vattu + [Mbelow] + [VMbelow] + [SMbelow] +

[Mabove] + [Mpost] + [Ra + H]reph + [VMabove] + [SMabove] + [VMpost]

(Only one of Mpre, Mbelow, Mabove or Mpost can be present)

For Gurmukhi:

[Mpre] + {Cpre + [Nukta] + H} + Cbase +

[Cbelow + H] + [Mbelow] +

[Mabove] + [Cpost + H] + [Mpost] + [VMabove]

(Only one of Mpre, Mbelow, Mabove or Mpost can be present)

For Bengali and Oriya:

[Mpre] + {Cpre + [Nukta] + H} + Cbase +

{Cbelow + H} + [Mbelow] +

[Mabove] + [Ra + H]reph + [VMabove] +

[Cpost + H] + [Mpost] + [VMpost]

(Out of Mpre, Mbelow, Mabove or Mpost at most two can be present)

For Reformed Malayalam:

{Cpre + H} + [Mpre]* + [Ra + H]*vattu + Cbase +

{Cbelow + H} + [Mbelow] +

[Cpost + H] + [Mpost] + [VMpost]

(Out of Mpre1, Mpre2, Mbelow or Mpost different combinations can be present)

For Old/Traditional Malayalam:

{Cpre + H} + [Mpre]* + [Ra + H]*vattu + Cbase +

{Cbelow + H} + [Mbelow] + [Ra + H]reph +

[Cpost + H] + [Mpost] + [VMpost]

(Out of Mpre1, Mpre2, Mbelow or Mpost different combinations can be present)

For Tamil:

{Cpre + H} + [Mpre]* + Cbase + [Mabove] + [Mpost] + [VMpost]

(Out of Mpre, Mabove and Mpost different combinations can be present)

For Telugu:

{Cpre + H} + Cbase + [Mabove] + [Mbelow] + [Mpost] + {Cbelow + H} +

{Cpost + H} + [VMpost]

(Out of Mabove, Mbelow, Mpost and VMpost different combinations can be present)

For Kannada:

{Cpre + H} + Cbase + [Mabove] + [Mpost] + {Cbelow + H} +

{Cpost + H} + [LMpost] + [Ra + H]reph + [VMpost]

(Out of Mabove, Mbelow, Mpost and LMpost different combinations can be present)

* Will be reordered at syllable start for shaping.

In the absence of a vowel, we'll have

{Cpre + [Nukta] + H + [Ra + H]vattu} + Cbase + [Ra + H]vattu + H

Finally, a syllable with an independent vowel will look like

VO + [VM1] + [VM2]
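
Once every element of a syllable has been classified, the grouping in step 6 amounts to a stable sort by position class. The Python sketch below illustrates this for the Devanagari and Gujarati form given earlier. The slot names are taken from that form; the function group_elements and the data layout are hypothetical. It assumes each element already carries its own halant or nukta, and that a pre-base vattu stays inside its Cpre unit.

```python
# Slot order transcribed from the Devanagari/Gujarati form in the text.
DEVANAGARI_SLOTS = [
    "Mpre", "Cpre", "Cbase", "vattu",
    "Mbelow", "VMbelow", "SMbelow",
    "Mabove", "Mpost", "reph",
    "VMabove", "SMabove", "VMpost",
]
RANK = {slot: i for i, slot in enumerate(DEVANAGARI_SLOTS)}


def group_elements(elements):
    """Reorder (text, slot) pairs into the slot order above.

    sorted() is stable, so several Cpre units keep their original
    relative order.
    """
    ordered = sorted(elements, key=lambda element: RANK[element[1]])
    return [text for text, _slot in ordered]


if __name__ == "__main__":
    # Logical order: Ra + H (reph), Ka (base), vowel sign I (pre-base).
    syllable = [("\u0930\u094D", "reph"), ("\u0915", "Cbase"),
                ("\u093F", "Mpre")]
    reordered = group_elements(syllable)
    print([[f"U+{ord(c):04X}" for c in unit] for unit in reordered])
```

For this input the pre-base matra moves to the front and the reph moves behind the base, which is the order the font's features are expected to match.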


Shaping with OTLS

The first step Uniscribe takes in shaping the character string is to map all characters to their nominal form glyphs. Then, Uniscribe applies contextual shape features to the glyph string.

Next, Uniscribe calls the OTL Services Library to shape the Indic syllable. All OTL processing is divided into a set of predefined features (described and illustrated in the Feature section of this document). Each feature is applied, one by one, to the appropriate glyphs in the syllable and OTLS processes them. Uniscribe makes as many calls to the OTL Services as there are features. This ensures that the features are executed in the desired order.

The steps of the shaping process are outlined below. Not all of the features listed apply to all Indic scripts.

Shaping features:

  1. Language forms
    1. Apply feature 'nukt' to get nukta forms of consonants.
    2. Apply feature 'akhn' to get akhand ligatures.
    3. Apply feature 'rphf' to Ra and the following halant to get the reph glyph.
    4. Apply feature 'blwf' to get below-base forms of consonants.
    5. Apply feature 'half' to get half forms of pre-base consonants.
    6. Apply feature 'pstf' to get post-base forms of consonants.
    7. Apply feature 'vatu' to get ligatures of the below-base form of 'Ra' and the preceding consonant.

  2. Conjuncts and Typographical forms
    1. Apply feature 'pres' to get pre-base consonant conjuncts and pre-base matra conjuncts (i.e. consonant and matra conjuncts to the left of the base glyph).
    2. Apply feature 'blws' to get below-base consonant conjuncts; below-base matra conjuncts; below-base vowel modifier forms; and below-base stress and tone mark forms (i.e. consonant and matra conjuncts; typographical forms; vowel modifier forms; and stress and tone mark forms of below-base elements).
    3. Apply feature 'abvs' to get above-base matra conjuncts; reph conjuncts; above-base vowel modifiers; and above-base stress and tone marks (i.e. reph and matra conjuncts, typographical forms and vowel modifier forms of above-base elements).
    4. Apply feature 'psts' to get post-base consonant conjuncts, post-base matra conjuncts and post-base vowel modifiers (i.e. consonant and 'matra' conjuncts, typographical forms and vowel modifier forms of post-base elements).

      Note: In scripts that show similarity to Devanagari, the post-base matra goes before the above-base part, which causes the 'Post-base Substitutions' feature to be executed before any above-base features.

  3. Halant form
    1. Apply feature 'haln' to put the base consonant in halant form (if the syllable ends with a halant).

      Note: The halant substitution is performed last to ensure that the base consonant is always in the full form during shaping.
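
The one-pass-per-feature approach described above can be sketched in Python as follows. This is a toy model, not OTLS: a "font" is a dictionary from feature tag to a ligature table, glyph names such as k_ssa and ka.half are invented for the example, and every feature is applied to the whole glyph string rather than only to the appropriate glyphs. The feature order follows the list above and ignores the Devanagari exception mentioned in the note.

```python
# Feature tags in the order listed above.
SHAPING_FEATURES = [
    "nukt", "akhn", "rphf", "blwf", "half", "pstf", "vatu",  # language forms
    "pres", "blws", "abvs", "psts",                          # conjuncts
    "haln",                                                  # halant form
]


def apply_ligatures(glyphs, table):
    """Toy substitution: replace the longest matching run at each position."""
    lengths = sorted({len(key) for key in table}, reverse=True)
    out, i = [], 0
    while i < len(glyphs):
        for n in lengths:
            key = tuple(glyphs[i:i + n])
            if key in table:
                out.append(table[key])
                i += len(key)
                break
        else:
            # No rule matched here: keep the glyph unchanged.
            out.append(glyphs[i])
            i += 1
    return out


def shape(glyphs, font):
    """Apply each feature the font defines, one pass per feature, in order."""
    for tag in SHAPING_FEATURES:
        table = font.get(tag)
        if table:
            glyphs = apply_ligatures(glyphs, table)
    return glyphs


if __name__ == "__main__":
    font = {  # hypothetical font data
        "akhn": {("ka", "halant", "ssa"): "k_ssa"},
        "half": {("ka", "halant"): "ka.half"},
    }
    print(shape(["ka", "halant", "ssa"], font))
    print(shape(["ka", "halant", "ta"], font))
```

Because 'akhn' runs before 'half', the sequence ka + halant + ssa becomes the akhand ligature before the half-form rule has a chance to consume ka + halant. With the order reversed, the ligature rule would never match.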


Positioning glyphs with OTLS

Uniscribe next applies features concerned with positioning, calling functions of OTLS to position glyphs.

Positioning features:

  1. Below-base marks
    1. Apply feature 'blwm' to position below-base forms, vowel modifiers and/or stress/tone marks.

  2. Above-base marks
    1. Apply feature 'abvm' to position above-base forms, vowel modifiers and/or stress/tone marks (on base glyph or post-base matra).

  3. Distances
    1. Apply feature 'dist' to adjust other distances (e.g. to provide kerning between post-base and pre-base elements and the base glyph).
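
This document does not describe how a font encodes the positioning features. One common OpenType mechanism is anchor-based mark attachment, in which a mark is shifted so that its anchor point coincides with an anchor on the base glyph. The Python sketch below illustrates that idea under simplifying assumptions: all coordinates are relative to the base glyph's origin, advance widths are ignored, 'dist' adjustments are not modeled, and the glyph names and anchor values are invented.

```python
# Positioning feature tags in the order listed above.
POSITIONING_FEATURES = ["blwm", "abvm", "dist"]


def mark_offset(base_anchor, mark_anchor):
    """Offset that moves the mark's anchor onto the base's anchor."""
    base_x, base_y = base_anchor
    mark_x, mark_y = mark_anchor
    return (base_x - mark_x, base_y - mark_y)


def position_marks(base_anchors, marks):
    """Place marks feature by feature.

    base_anchors: {feature tag: (x, y)} anchors on the base glyph.
    marks: list of (glyph name, feature tag, (x, y) anchor on the mark).
    Returns (glyph name, offset) pairs in the order they were positioned.
    """
    placed = []
    for tag in POSITIONING_FEATURES:
        for name, mark_tag, anchor in marks:
            if mark_tag == tag and tag in base_anchors:
                placed.append((name, mark_offset(base_anchors[tag], anchor)))
    return placed


if __name__ == "__main__":
    base_anchors = {"blwm": (250, 0), "abvm": (250, 600)}  # invented values
    marks = [("anusvara", "abvm", (0, 550)), ("ukar", "blwm", (120, 0))]
    print(position_marks(base_anchors, marks))
```

The below-base mark is placed first even though it comes second in the input, mirroring the feature order in the list above.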



this page was last updated December 2001
© 2001 Microsoft Corporation. All rights reserved. Terms of use.

 
