The Indic shaping engine processes Malayalam text in stages. The stages are:
Analyze the text
Reorder characters
Shape glyph sequences (GSUB processing)
Position glyph sequences (GPOS processing)
The descriptions which follow will help font developers understand the rationale for the Malayalam feature encoding model and help application developers better understand how layout clients can divide responsibilities with operating system functions.
Character properties
The shaping engine divides the text into syllable clusters and identifies character properties.
Character properties are used in parsing syllables and identifying their parts, in determining proper character or glyph reordering, and in OpenType feature application. Properties for each character are divided into two types: static properties and dynamic properties.
Static properties define basic characteristics that do not change from font to font: character type (consonant, matra, vedic sign, etc.) or type of matra reordering. They differ from script to script, but cannot be controlled by the font developer.
Dynamic properties are font dependent and are retrieved by the shaping engine as the font is loaded. These properties affect shaping and reordering behavior.
*Note: in old shaping-engine implementations, all consonant properties were static: consonants were assumed to have particular conjoining forms. In the new implementation model, consonant conjoining behavior is a dynamic property.
Retrieving dynamic character properties from Indic fonts
Fonts define dynamic properties for consonants by implementing standard features. Consonant types (and corresponding feature tags) that the shaping engine reads from the font are:
Reph (<rphf>)
Half forms (<half>)
Pre-base-reordering forms (<pref>)
Below-base forms (<blwf>)
Post-base forms (<pstf>)
Each of the features above is applied together with the <locl> feature to input sequences consisting of two characters: for <rphf> and <half>, features are applied to Consonant + Halant combinations; for <pref>, <blwf> and <pstf>, features are applied to Halant + Consonant combinations. This is done for each consonant. If these two glyphs form a ligature, with no additional glyphs in context, the consonant has the corresponding form. For instance, if a substitution occurs when the <half> and <locl> features are applied to the sequence Da + Halant, then Da is classified as having a half form.
Note that a font may be implemented to re-order a Ra to pre-base position only in certain syllables and display it as a below-base or post-base form otherwise. This means that the Pre-base-form classification is not mutually exclusive with either Below-base-form or Post-base-form classifications. However, all classifications are determined as described above using context-free substitutions.
Font-dependent character classification only defines consonant types. Reordering positions, however, are fixed for each character class.
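The context-free classification described above can be sketched as follows. This is a minimal illustration, not a shaping-engine implementation: `substitute` is a hypothetical callback standing in for "apply this feature together with <locl> to the glyph sequence", the glyph names are invented, and the below-base tag is assumed to be the registered `blwf`.

```python
HALANT = "halant"  # hypothetical glyph name for U+0D4D

# Feature tag -> (classification name, input order), per the description above.
# "CH" means Consonant + Halant, "HC" means Halant + Consonant.
FEATURE_TESTS = {
    "rphf": ("Reph", "CH"),
    "half": ("Half form", "CH"),
    "pref": ("Pre-base form", "HC"),
    "blwf": ("Below-base form", "HC"),
    "pstf": ("Post-base form", "HC"),
}


def classify_consonant(consonant, substitute):
    """Return the set of forms a consonant has in a given font.

    substitute(tag, glyphs) is an assumed callback that returns the glyph
    tuple produced by applying the feature `tag` (plus locl) to `glyphs`.
    """
    forms = set()
    for tag, (name, order) in FEATURE_TESTS.items():
        if order == "CH":
            sequence = (consonant, HALANT)
        else:
            sequence = (HALANT, consonant)
        # A two-glyph input collapsing to one glyph means a ligature formed.
        if len(substitute(tag, sequence)) == 1:
            forms.add(name)
    return forms


# Toy "font": only Da + Halant ligates under the half feature.
toy_lookups = {("half", ("da", HALANT)): ("da.half",)}


def toy_substitute(tag, glyphs):
    return toy_lookups.get((tag, glyphs), glyphs)
```

With the toy callback, `classify_consonant("da", toy_substitute)` yields only the half-form classification. Because a set is returned, a consonant can carry several classifications at once, which matches the note that the pre-base classification is not mutually exclusive with the below-base or post-base ones.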
*Note: for fonts that support the old implementation, all features are applied to Consonant + Halant sequences.
Indic input processing
The following steps should be repeated while there are characters left in the input sequence. All shaping operations are done on a syllable-by-syllable basis, independent from other characters.
Find next syllable in the input
The engine should find the character sequence matching one of the patterns below:
Consonant syllable:
{C+[N]+<H+[<ZWNJ|ZWJ>]|<ZWNJ|ZWJ>+H>} +
C+[N]+[A] + [< H+[<ZWNJ|ZWJ>] | {M}+[N]+[H]>]+[SM]+[(VD)]
Vowel-based syllable:
[Ra+H]+V+[N]+[<[<ZWJ|ZWNJ>]+H+C|ZWJ+C>]+[{M}+[N]+[H]]+[SM]+[(VD)]
Stand Alone cluster (at the start of the word only):
#[Ra+H]+NBSP+[N]+[<[<ZWJ|ZWNJ>]+H+C>]+[{M}+[N]+[H]]+[SM]+[(VD)]
Where
| { } | zero or more occurrences |
| [ ] | optional occurrence |
| <|> | "one of" |
| ( ) | one or two occurrences |
| C | consonant |
| V | independent vowel |
| N | nukta |
| H | halant/virama |
| ZWNJ | zero width non-joiner |
| ZWJ | zero width joiner |
| M | matra (up to one of each type: pre-, above-, below- or post- base) |
| SM | syllable modifier signs |
| VD | vedic |
| A | anudatta (U+0952) |
| NBSP | NO-BREAK SPACE |
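As an illustration of how the consonant-syllable pattern might be matched, the sketch below translates it into a regular expression over one-letter class codes. The character classification is a simplified assumption covering only a few Malayalam code points (vedic signs and nukta are never produced by `char_class`, although the pattern allows for them), and the fallback of treating an unmatched character as its own cluster is also an assumption.

```python
import re

ZWNJ, ZWJ = "\u200C", "\u200D"


def char_class(ch):
    """Map a character to a one-letter class code (simplified, assumed table)."""
    cp = ord(ch)
    if 0x0D15 <= cp <= 0x0D39:
        return "C"  # consonant
    if cp == 0x0D4D:
        return "H"  # halant/virama
    if 0x0D3E <= cp <= 0x0D4C or cp == 0x0D57:
        return "M"  # matra (range includes two unassigned code points)
    if cp in (0x0D02, 0x0D03):
        return "S"  # syllable modifier sign (anusvara, visarga)
    if cp == 0x0952:
        return "A"  # anudatta
    if ch == ZWNJ:
        return "j"
    if ch == ZWJ:
        return "J"
    return "X"      # anything else


# {C+[N]+<H+[<ZWNJ|ZWJ>]|<ZWNJ|ZWJ>+H>} + C+[N]+[A]
#   + [<H+[<ZWNJ|ZWJ>] | {M}+[N]+[H]>] + [SM] + [(VD)]
CONSONANT_SYLLABLE = re.compile(
    r"(?:CN?(?:H[jJ]?|[jJ]H))*"   # zero or more consonant + halant units
    r"CN?A?"                      # final consonant, optional nukta and anudatta
    r"(?:H[jJ]?|M*N?H?)?"         # trailing halant, or matras
    r"S?(?:D{1,2})?"              # syllable modifier, one or two vedic signs
)


def next_syllable(text, start=0):
    """Return the index just past the syllable that begins at `start`."""
    classes = "".join(char_class(ch) for ch in text[start:])
    match = CONSONANT_SYLLABLE.match(classes)
    # Assumption: an unmatched character forms a cluster on its own.
    return start + match.end() if match else start + 1
```

For the input KA + Halant + KA + I-matra + KA, the first call stops after the I-matra and the second call consumes the final KA as a separate syllable.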
Syllable structure consists of the following parts:
Reph + HalfConsonant(s) + MainConsonant(s) + BelowBaseConsonant(s) + PostBaseConsonant(s) + PreBaseReorderingRa + MatrasAndSigns
The consonant parts include all associated halants and nuktas. (For example, an instance of BelowBaseConsonant consists of a sequence of Halant + Below-base-forming Consonant.) All parts are optional, except the main consonant.
All parts are shown in the order they would occur within a syllable, with one qualification: depending on a font implementation, PreBaseReorderingRa may occur before all BelowBaseConsonants, after BelowBaseConsonants and before PostBaseConsonants, or after PostBaseConsonants. Also, a font may be implemented to re-order a Ra to pre-base position only in certain syllables and display it as a below-base or post-base form otherwise. Thus, final determination of whether an occurrence of Ra in a specific syllable can be treated as a pre-base reordering Ra can be made only after the <pref> feature has been applied to that syllable.
There could be several main consonants in the case where more than one consonant doesn't have a half-, below-base, post-base or pre-base form. In a case of a cluster where the first consonant does not have a half form, the shaping engine will recognize it as the 1st 'full form' and go on to identify the 2nd full form consonant, if there is one. This information will then be used to determine the reordering behavior of the reph or any matras, vowel modifiers or stress marks.
All other elements are classified by their position relative to the base: pre-base (half forms and reordering pre-base Ra forms), below-base, above-base and post-base.
Indic clusters are subject to the following constraints:
Once the Indic shaping engine has analyzed the cluster as described above, it creates and manages a buffer of appropriately reordered elements (glyphs) representing the cluster, according to several rules (described below).
The OpenType lookups in an Indic font must be written to match glyph sequences after re-ordering has occurred. OpenType fonts should not have substitutions that attempt to perform the re-ordering. If a font developer attempted to encode such reordering information in an OpenType font, they would need to add a huge number of many-to-many glyph mappings to cover the general algorithms that a shaping engine will use.
Find base consonant: The shaping engine finds the base consonant of the syllable, using the following algorithm: starting from the end of the syllable, move backwards until a consonant is found that does not have a below-base or post-base form (post-base forms have to follow below-base forms), or that is not a pre-base reordering Ra, or arrive at the first consonant. The consonant stopped at will be the base.
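The backward search can be sketched as below. This is a simplified reading of the algorithm: it takes the positions of the consonants in a syllable and a hypothetical predicate `is_dependent_form` that answers whether the consonant at a position has a below-base or post-base form or is a pre-base reordering Ra. The ordering constraint between post-base and below-base forms is not modeled.

```python
def find_base_consonant(consonant_positions, is_dependent_form):
    """Return the position of the base consonant.

    consonant_positions: non-empty list of indices, in logical order.
    is_dependent_form:   assumed predicate; True if the consonant at that
                         index takes a below-base or post-base form, or is
                         a pre-base reordering Ra.
    """
    # Fallback: if every consonant is dependent, the first one is the base.
    base = consonant_positions[0]
    for position in reversed(consonant_positions):
        if not is_dependent_form(position):
            base = position
            break
    return base
```

For example, with consonants at positions 0, 2 and 4 where only the last one has a post-base form, the search stops at position 2.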
Decompose and reorder Matras: Each matra and any syllable modifier sign in the cluster are moved to the appropriate position relative to the consonant(s) in the cluster. The shaping engine decomposes two- or three-part matras into their constituent parts before any repositioning. Matra characters are classified by which consonant in a conjunct they have affinity for and are reordered to the following positions:
Reorder marks to canonical order: Adjacent nukta and halant or nukta and vedic sign are always repositioned if necessary, so that the nukta is first.
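A minimal sketch of this canonical-order rule, working on a list of (glyph, class) pairs. The class codes "N" (nukta), "H" (halant) and "VD" (vedic sign) and the glyph names are illustrative assumptions.

```python
def nukta_first(items):
    """Move each nukta ahead of any halant or vedic sign directly before it.

    items: list of (glyph, class_code) pairs in logical order.
    """
    out = list(items)
    for i in range(1, len(out)):
        j = i
        # Bubble the nukta leftwards past adjacent halants and vedic signs.
        while j > 0 and out[j][1] == "N" and out[j - 1][1] in ("H", "VD"):
            out[j - 1], out[j] = out[j], out[j - 1]
            j -= 1
    return out
```

Sequences that are already in canonical order are returned unchanged.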
Final reordering: After the localized forms and basic shaping forms GSUB features have been applied (see below), the shaping engine performs some final glyph reordering before applying all the remaining font features to the entire cluster.
Reorder matras: If a pre-base matra character had been reordered before applying basic features, the glyph can be moved closer to the main consonant based on whether half-forms had been formed. Actual position for the matra is defined as "after last standalone halant glyph, after initial matra position and before the main consonant". If ZWJ or ZWNJ follows this halant, the position is moved after it.
Reorder reph: Reph's original position is always at the beginning of the syllable (i.e., it is not reordered at the character reordering stage). However, it will be reordered according to the basic-forms shaping results. Possible positions for reph, depending on the script, are: after main, before post-base consonant forms, and after post-base consonant forms.
Reorder pre-base reordering consonants: If a pre-base reordering consonant is found, reorder it according to the following rules:
| Characters | Reorder Class |
| 0D30 | AfterMain |
| 0D46, 0D47, 0D48 | BeforeMain |
| 0D3E-0D43, 0D57 | AfterPostscript |
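The table above, together with the matra decomposition step, can be illustrated with a small sketch. It assumes that Unicode canonical decomposition (NFD) is an acceptable stand-in for the engine's own splitting of multi-part matras, and it places parts around a single base consonant only; `place_matra` is a hypothetical helper, not an engine API.

```python
import unicodedata

# Reorder classes transcribed from the table above.
REORDER_CLASS = {0x0D30: "AfterMain"}
REORDER_CLASS.update({cp: "BeforeMain" for cp in (0x0D46, 0x0D47, 0x0D48)})
REORDER_CLASS.update(
    {cp: "AfterPostscript" for cp in list(range(0x0D3E, 0x0D44)) + [0x0D57]}
)


def place_matra(base, matra):
    """Decompose a matra and arrange its parts around one base consonant."""
    # NFD splits the two-part matras U+0D4A, U+0D4B and U+0D4C.
    parts = unicodedata.normalize("NFD", matra)
    before = [p for p in parts if REORDER_CLASS.get(ord(p)) == "BeforeMain"]
    after = [p for p in parts if REORDER_CLASS.get(ord(p)) != "BeforeMain"]
    return "".join(before) + base + "".join(after)
```

For KA followed by the O matra (U+0D4A), the result is the E matra, then KA, then the AA matra, which is the order the font's lookups would see.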
All characters from a string are first mapped to their nominal glyphs using the cmap lookup. The shaping engine proceeds to shape (substitute) the glyphs using GSUB lookups. The features of the basic shaping forms are applied one at a time to the cluster or portion of the cluster. The result impacts the analysis in terms of the conjoining behavior.
The features of the presentation forms are applied next to the entire cluster simultaneously. These predefined features are described and illustrated in the Features section.
Once the basic consonant types and parts of a cluster have been identified, the following OpenType features are applied in the order outlined below.
Shaping features:
Localized forms
Basic Shaping forms
Presentation forms
The shaping engine next processes the GPOS (glyph positioning) table, applying features concerned with positioning. All features are applied simultaneously to the entire cluster.
The font developer must consider the effects of re-ordering when creating the GPOS feature and lookup tables (i.e., the glyphs will be in the order they were in after the GSUB presentation forms features were applied).
Positioning features:
Kerning
Above-base marks
Below-base marks
Commonly, a feature is required for dealing with the base glyph and one of the post-base, pre-base, above-base or below-base elements. Since it is not possible to reorder ALL of these elements next to the base glyph, we need to skip over the elements "in the middle" (reordering-wise).
The solution is to assign different mark attachment classes to different elements of the syllable and positional forms, and in any given lookup work with one mark type only. For example, in above-base substitutions we need only consider above-base elements most of the time.
Generally, it is good practice to label as "mark" any glyphs that are denoted as combining marks in the Unicode Standard as well as below-base/above-base forms of consonants. Then, different attachment classes should be assigned to different marks depending on their position with respect to the base.
For example, after the shaping engine has re-ordered elements within the cluster, matras will always occur before a syllable modifier sign such as the candrabindu. In an actual sequence, though, some other mark glyph, such as a nukta, may occur between the matra and the candrabindu. Thus, when processing the matra and candrabindu, you may need to allow for the possibility that other mark glyphs occur between them. Using lookup flags, you can specify that a lookup should process only a certain class of marks, such as 'above-base marks', and ignore all other marks. In that way, a match will occur whether or not a mark from another class is present. Otherwise, the lookup would fail to apply.
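The effect of restricting a lookup to one mark class can be imitated with a short sketch. It does not reproduce how any layout engine evaluates lookup flags; the glyph names and class names are hypothetical, and glyphs absent from the class table are treated as non-marks that are never skipped.

```python
def lookup_matches(glyphs, pattern, mark_class, only_class=None):
    """Check whether `pattern` matches at the start of `glyphs`.

    mark_class: dict mapping mark glyph names to attachment class names.
    only_class: if set, marks of any other class are skipped when matching.
    """
    visible = [
        g for g in glyphs
        if only_class is None or mark_class.get(g) in (None, only_class)
    ]
    return visible[: len(pattern)] == pattern


# Hypothetical glyphs: a nukta sits between the matra and the candrabindu.
classes = {"matra": "TopMarks", "candrabindu": "TopMarks", "nukta": "BottomMarks"}
sequence = ["matra", "nukta", "candrabindu"]
```

Matching `["matra", "candrabindu"]` against `sequence` succeeds when `only_class="TopMarks"` and fails when all marks are processed, because the intervening nukta then breaks the sequence.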
Using Microsoft VOLT, you can assign glyphs to attachment classes.
In the example below, this 'abvm' feature was set to process only TopMarks; therefore, the presence of a mark from another class would be ignored. If Process ALL were used and another mark glyph followed the matra, this positioning lookup would fail to apply. This example comes from the Devanagari font Mangal.

Combining marks and signs that do not occur in conjunction with a valid base are considered invalid. Shaping engine implementations may adopt different strategies for how invalid marks are handled. For example, a shaping engine implementation might treat an invalid mark as a separate cluster and display the stand-alone mark positioned on some default base glyph, such as a dotted circle. (See Fallback Rendering in section 5.13 of the Unicode Standard 4.0.) Shaping engine implementations may vary somewhat with regard to what sequences are or are not considered valid. For instance, some implementations may impose a limit of at most one above-base vowel mark while others may not.
To allow for shaping engine implementations that expect to position an invalid mark on a dotted circle, it is recommended that a Malayalam OT font contain a glyph for the dotted circle character, U+25CC. If this character is not supported in the font, such implementations will display invalid signs on the missing glyph shape (white box).
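One possible dotted-circle strategy is sketched below. It is only an illustration of the idea: a combining mark is treated as invalid when it appears at the start of the text or directly after an ordinary space, which is a much cruder test than a real syllable analysis, and `show_invalid_marks` is a hypothetical helper name.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"


def show_invalid_marks(text):
    """Insert U+25CC before marks that have no base (simplified test)."""
    out = []
    previous = None
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        # Assumption: start of text or a preceding space means "no base".
        if is_mark and (previous is None or previous == " "):
            out.append(DOTTED_CIRCLE)
        out.append(ch)
        previous = ch
    return "".join(out)
```

A mark that follows a consonant or a no-break space is left alone, since both can serve as a base.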

In addition to the 'dotted circle', other Unicode code points that are recommended for inclusion in any Malayalam font are the ZWNJ (zero width non-joiner; U+200C), the ZWJ (zero width joiner; U+200D) and the ZWSP (zero width space; U+200B). For more information see the Suggested glyphs section of the OpenType Font Development document.
Unicode defines specific behaviors for ZWJ and ZWNJ in relation to Indic scripts. The Indic-specific behavior retains the general behavior that ZWJ requests connection between text elements while ZWNJ inhibits connection between text elements.
The following example illustrates some of these behaviors:

Just as the ZWJ can be used to display a form in isolation, it can also be used to display a mark, sub- or post-base form in isolation. Unlike the stand-alone Chillaksaram form, however, sequences to display them must begin with a no-break space (NBSP). This is because marks, sub- and post-base forms are zero-width and so must be placed on the NBSP. For example, to get the shape of the I-matra without the dotted circle, one should type NBSP + I-matra.
In the illustration below the I-matra is displayed without the dotted circle by using the NBSP. The combination of NBSP and ZWJ is used to display the below-base form of La in isolation.
