
Developing OpenType Fonts
for Khmer Script (2 of 3):
Shaping Engine

The Uniscribe Khmer shaping engine processes text in stages. The stages are:

  1. Analyze syllables and reorder characters
  2. Shape (substitute) glyphs with OTLS (OpenType Library Services)
  3. Position glyphs with OTLS

The descriptions that follow will help font developers understand the rationale for the Khmer feature encoding model, and help application developers better understand how layout clients can divide responsibilities with operating system functions.


Analyze syllables and reorder characters

All Khmer syllables begin with a consonant, independent vowel, or number. The following should be considered as canonical ordering for Khmer Unicode input. The ordering is in the same order that the Khmer syllable is formed and produces the correct sort/search order. Any device using Khmer Unicode should use this input sequence order to correctly handle Khmer text. One complex and two simple constructs are elaborated below.

Canonical ordering
It is important for the user inputting the text to remember that although it is possible to input some of the formed sequences by using individual glyphs, the Unicode characters that are input must be in a correctly defined and consistent order for sorting and searching mechanisms to work. For example, a person might try to enter a syllable with U+17C1 and U+17B6. A user might think that they look the same as inserting U+17C4. However, the meaning is very different. Any devices, like text-to-speech, that require correct characters for correct output would not consider these the same. More importantly, the user's attempt to incorrectly use U+17C1 and U+17B6 in the same syllable would result in breaking the rule of having only one vowel character per syllable.
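As a rough illustration of why software treats the two inputs differently, the following Python sketch compares them using only the standard `unicodedata` module. The sample base letter and the variable names are illustrative choices, not part of the specification.

```python
import unicodedata

base = "\u1780"                     # KHMER LETTER KA, used only as a sample base
two_vowels = base + "\u17C1\u17B6"  # U+17C1 followed by U+17B6 (incorrect input)
one_vowel = base + "\u17C4"         # U+17C4 (correct input)

# The two strings are different code point sequences.
same_raw = two_vowels == one_vowel

# Unicode normalization does not unify them either, so sorting and
# searching keep seeing two different strings.
same_nfc = (unicodedata.normalize("NFC", two_vowels)
            == unicodedata.normalize("NFC", one_vowel))
same_nfd = (unicodedata.normalize("NFD", two_vowels)
            == unicodedata.normalize("NFD", one_vowel))

print(same_raw, same_nfc, same_nfd)
```

All three comparisons are expected to be false, which is why a search for one form will not find text entered in the other.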

Syllables beginning with consonants
Consonant-based syllables are formed in the following order:

Cons + {COENG + (Cons | IndV)} + [PreV | BlwV] + [RegShift] + [AbvV] + {AbvS} + [PstV] + [PstS]
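The ordering above can be read as a pattern over character classes. The Python sketch below checks a sequence of class names against it, assuming that braces mean "zero or more" and square brackets mean "optional"; that reading of the notation, the one-letter codes, and the function name are assumptions made for illustration. The sketch works on class names rather than code points, because it does not assume any particular class assignment.

```python
import re

# Arbitrary one-letter codes for the class names used in the formula.
CODES = {
    "Cons": "C", "IndV": "I", "COENG": "J", "PreV": "P", "BlwV": "B",
    "RegShift": "R", "AbvV": "A", "AbvS": "S", "PstV": "V", "PstS": "T",
}

# Cons + {COENG + (Cons | IndV)} + [PreV | BlwV] + [RegShift]
#      + [AbvV] + {AbvS} + [PstV] + [PstS]
SYLLABLE = re.compile(r"C(?:J[CI])*[PB]?R?A?S*V?T?")

def is_consonant_syllable(classes):
    """Return True if the class sequence fits the consonant syllable order."""
    # Unknown class names map to "?", which can never match the pattern.
    encoded = "".join(CODES.get(name, "?") for name in classes)
    return SYLLABLE.fullmatch(encoded) is not None

print(is_consonant_syllable(["Cons", "COENG", "Cons", "AbvV"]))  # True
print(is_consonant_syllable(["Cons", "AbvV", "AbvV"]))           # False
```

The second call fails because the pattern allows at most one AbvV. The sketch checks ordering only; it does not enforce any limit on the number of subscripts.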

RegShift case - The RegShift glyphs automatically take positioning based on the context of the vowel above. Normally, the RegShift will be rendered immediately above the base glyph. In the event that the RegShift character precedes an AbvV, the RegShift is normally rendered as a vertical stroke at the lowest extreme of the syllable. In some cases it is necessary to force the RegShift to be placed above the base glyph. In this case a ZERO WIDTH NON-JOINER (ZWNJ) is inserted between the RegShift and the AbvV to prevent the context rule of the shaping engine from being applied.

U+179F U+17CA U+17B8 (for a child or animal 'to eat') is an example where the below base form of TRIISAP is used.

U+1784 U+17C9 U+17B7 U+1780 U+1784 U+17C9 U+1780 U+17CB ('sulky') is an example where the first MUUSIKATOAN is in a below base form and the second in an above base form.

U+17A2 U+200C (ZWNJ) U+17CA U+17B7 U+17A2 U+17BB U+17CA U+17C7 is an interesting case where the first TRIISAP needs to be escaped, but the second does not (as there is a below base vowel).
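The context rule from the RegShift paragraph can be sketched in Python as follows. This is a minimal illustration, not the engine's logic: the set of above vowels is an assumed subset chosen for the example, the function name is hypothetical, and the sketch only models the case stated in that paragraph, where a ZWNJ sits between the RegShift and the AbvV.

```python
ZWNJ = 0x200C                                    # ZERO WIDTH NON-JOINER
REG_SHIFT = {0x17C9, 0x17CA}                     # MUUSIKATOAN, TRIISAP
ABOVE_VOWELS = {0x17B7, 0x17B8, 0x17B9, 0x17BA}  # assumed AbvV subset

def regshift_forms(codepoints):
    """Map the index of each RegShift to 'below' or 'above'."""
    forms = {}
    for i, cp in enumerate(codepoints):
        if cp in REG_SHIFT:
            nxt = codepoints[i + 1] if i + 1 < len(codepoints) else None
            # Only an immediately following above vowel triggers the
            # below-base form; an intervening ZWNJ breaks the context.
            forms[i] = "below" if nxt in ABOVE_VOWELS else "above"
    return forms

sulky = [0x1784, 0x17C9, 0x17B7, 0x1780, 0x1784, 0x17C9, 0x1780, 0x17CB]
print(regshift_forms(sulky))                             # {1: 'below', 5: 'above'}
print(regshift_forms([0x179F, 0x17CA, ZWNJ, 0x17B8]))    # {1: 'above'}
```

For the 'sulky' sequence the first MUUSIKATOAN is followed by U+17B7 and takes the below-base form, while the second is followed by a consonant and stays above the base.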

An overview of the logic used when analyzing and reordering characters in the shaping engine follows:

  1. Khmer shaping assumes that a syllable will begin with a Cons, IndV, or Number.

  2. When a COENG + (Cons | IndV) combination is found (and the subscript count is less than two), the character combination is handled according to the subscript type of the character following the COENG.

    1. Subscript Type 1 - The COENG + (Cons | IndV) characters are assigned to have the 'blwf' OpenType feature applied to them.
    2. Subscript Type 2 - The COENG + RO characters are reordered to immediately before the base glyph. Then the COENG + RO characters are assigned to have the 'pref' OpenType feature applied to them.
    3. Subscript Type 3 - The COENG + Cons characters are assigned to have the 'pstf' OpenType feature applied to them.

  3. When a RegShift character is followed by an AbvV character, the RegShift character is assigned to have the 'blwf' OpenType feature applied, which changes its shape to the below base form of the RegShift glyph (like U+17BB).

  4. When an AbvV character with KHF_ABVSPLIT assigned is found, the pre-base vowel part (U+17C1) is prepended to the beginning of the cluster. The AbvV character is then assigned to have the 'abvf' OpenType feature applied so the glyph form is changed to the shape of the above vowel (like U+17B8).

  5. When a PstV character with KHF_PSTSPLIT assigned is found, the pre-base vowel part (U+17C1) is prepended to the beginning of the cluster. The PstV character is then assigned to have the 'pstf' OpenType feature applied so the glyph form is changed to the shape of the second half.
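Part of the logic above can be sketched in Python. The sketch covers only the COENG + RO reordering (step 2, subscript type 2), the RegShift rule (step 3), and the split above vowel (step 4); the split post vowel in step 5 follows the same prepend pattern and is left out. Other subscripts pass through untagged, since distinguishing types 1 and 3 needs a class table that is not assumed here. The vowel sets, the choice of U+17BE as a character carrying KHF_ABVSPLIT, and the function name are illustrative assumptions.

```python
COENG = 0x17D2
RO = 0x179A
PRE_VOWEL_E = 0x17C1
REG_SHIFT = {0x17C9, 0x17CA}
ABOVE_VOWELS = {0x17B7, 0x17B8, 0x17B9, 0x17BA, 0x17BE}  # assumed AbvV set
ABV_SPLIT = {0x17BE}               # assumed to carry KHF_ABVSPLIT

def reorder_cluster(cluster):
    """Return [(codepoint, feature_or_None), ...] in reordered sequence."""
    base, rest = cluster[0], cluster[1:]
    before_base, after_base = [], []
    i = 0
    while i < len(rest):
        cp = rest[i]
        nxt = rest[i + 1] if i + 1 < len(rest) else None
        if cp == COENG and nxt == RO:
            # Subscript type 2: move COENG + RO before the base, tag 'pref'.
            before_base += [(COENG, "pref"), (RO, "pref")]
            i += 2
            continue
        if cp in REG_SHIFT and nxt in ABOVE_VOWELS:
            after_base.append((cp, "blwf"))      # below base RegShift
        elif cp in ABV_SPLIT:
            # Prepend the pre-base part to the very start of the cluster.
            before_base.insert(0, (PRE_VOWEL_E, None))
            after_base.append((cp, "abvf"))
        else:
            after_base.append((cp, None))
        i += 1
    return before_base + [(base, None)] + after_base

for cp, feature in reorder_cluster([0x1780, COENG, RO, 0x17BE]):
    print(f"U+{cp:04X}", feature)
```

For the sample cluster the result starts with U+17C1, then COENG and RO tagged 'pref', then the base, then U+17BE tagged 'abvf'.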


Shape glyphs with OTLS

The first step Uniscribe takes in shaping the reordered character string is to apply the assigned layout features to the glyph string during the shaping process. These features, described and illustrated later in this document, are always applied in the order in which they are listed below.

Next, Uniscribe calls OTLS to apply the features. All OTL processing is divided into a set of predefined features (described and illustrated in the Features section). Each feature is applied, one by one, to the appropriate glyphs in the syllable and OTLS processes them. Uniscribe makes as many calls to the OTL Services as there are features. This ensures that the features are executed in the desired order.

The steps of the shaping process are outlined below.

Shaping features:

  1. Language forms
    1. Apply feature 'pref' to get pre-base ligatures.
    2. Apply feature 'blwf' to get below-base ligatures or below-base RegShift.
    3. Apply feature 'abvf' to Ro and the following COENG to get the Robat glyph, or to the AbvV that has KHF_ABVSPLIT to get the above glyph.
    4. Apply feature 'pstf' to get post-base ligatures.

  2. Conjuncts and Typographical forms
    1. Apply feature 'pres' to get pre-base substitutions on the COENG RO glyph when there is a subscript type 1 on the syllable.
    2. Apply feature 'blws' to get below base substitutions that might be required for typographical correctness.
    3. Apply feature 'abvs' to get above base substitutions that might be required for typographical correctness.
    4. Apply feature 'psts' to get post base substitutions that might be required for typographical correctness. For example, a subscript type 3 glyph that needs to have a lower descent when a subscript type 1 glyph is on the syllable.
    5. Apply feature 'clig' to form ligatures that are desired for typographical correctness. For example, a subscript type 3 glyph that is followed by the OO glyph (U+17C4.secondhalf).
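The fixed ordering of the shaping features listed above, with one call per feature, can be sketched in Python as follows. The `apply_feature` callback is a hypothetical stand-in for the call into OTLS; it is not a real API, and the glyph names in the usage lines are made up.

```python
# Shaping features in the order given in the list above.
SHAPING_FEATURES = [
    "pref", "blwf", "abvf", "pstf",          # language forms
    "pres", "blws", "abvs", "psts", "clig",  # conjuncts and typographical forms
]

def shape(glyphs, apply_feature):
    """Call apply_feature(tag, glyphs) once per feature, in fixed order."""
    for tag in SHAPING_FEATURES:
        glyphs = apply_feature(tag, glyphs)
    return glyphs

# Usage: a callback that only records the order in which it is called.
calls = []

def record(tag, glyphs):
    calls.append(tag)
    return glyphs

shape(["ka", "coeng", "ro"], record)
print(calls)
```

Because each feature gets its own call, a later feature such as 'pres' always sees the output of the earlier language-form features.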


Position glyphs with OTLS

Uniscribe next applies features concerned with positioning, calling functions of OTLS to position glyphs.

Positioning features:

  1. Distances
    1. Apply feature 'dist' to adjust other distances, e.g. to provide kerning between post- and pre-base elements and the base glyph.

  2. Below-base marks
    1. Apply feature 'blwm' to position below-base forms, vowel modifiers and/or stress/tone marks on the base glyph.

  3. Above-base marks
    1. Apply feature 'abvm' to position above-base forms, vowel modifiers and/or stress/tone marks on the base glyph.

  4. Mark to mark
    1. Apply feature 'mkmk' to position AbvS glyphs above AbvV glyphs or BlwV glyphs below subscript glyphs.


Base elements

Commonly, a feature is required for dealing with the base glyph and one of the post-base, pre-base or above-base elements. Since it is not possible to reorder ALL of these elements next to the base glyph, we need to skip over the elements "in the middle" (reordering-wise).

The solution is to assign different mark attachment classes to different elements of the syllable and positional forms, and in any given lookup work with one mark type only. For example, in above-base substitutions we need only consider above-base elements most of the time.

Generally, it is good practice to classify as "mark" those glyphs that are denoted as marks in the Unicode standard, as well as the below-base and above-base forms of consonants. Then, different attachment classes should be assigned to different marks depending on their position with respect to the base.
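The effect of restricting a lookup to one mark attachment class can be sketched in Python: marks of other classes are skipped, so the base and the element of interest become adjacent from the lookup's point of view. The `Glyph` type, the glyph names, and the class numbers below are hypothetical; real class values are defined by the font.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    name: str
    is_mark: bool
    attach_class: int = 0   # 0 = no class; other values are font-defined

def glyphs_seen_by_lookup(glyphs, mark_attach_class):
    """Glyphs a lookup considers when it works with one mark class only."""
    # Non-mark glyphs are always kept; marks are kept only if their
    # attachment class matches the one the lookup works with.
    return [g for g in glyphs
            if not g.is_mark or g.attach_class == mark_attach_class]

BELOW, ABOVE = 1, 2   # hypothetical attachment class numbers
syllable = [
    Glyph("base", is_mark=False),
    Glyph("subscript.below", is_mark=True, attach_class=BELOW),
    Glyph("vowel.above", is_mark=True, attach_class=ABOVE),
]
print([g.name for g in glyphs_seen_by_lookup(syllable, ABOVE)])
```

With the above-base class selected, the lookup sees only the base and the above vowel, so a rule written for that pair still matches although a subscript sits between them.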


Invalid combining marks

Combining marks and signs that appear in text not in conjunction with a valid consonant base are considered invalid. Uniscribe displays these marks using the fallback rendering mechanism defined in the Unicode standard (section 5.12, 'Rendering Non-Spacing Marks' of Unicode Standard 3.1), i.e. positioned on a dotted circle.

Please note that to render a sign standalone (in apparent isolation from any base) one should apply it on a space (see section 2.5 'Combining Marks' of Unicode Standard 3.1).

For the fallback mechanism to work properly, a Khmer OTL font should contain a glyph for the dotted circle (U+25CC). If this glyph is missing from the font, the invalid signs will be displayed on the missing glyph shape (white box).

In addition to the dotted circle, other Unicode code points recommended for inclusion in any Khmer font are the ZWJ (zero width joiner; U+200D), the ZWNJ (zero width non-joiner; U+200C) and the ZWSP (zero width space; U+200B), which can be used for word boundaries.

If an invalid combination is found (more than one vowel character in a syllable, more than two subscripts on the same base character, or incorrectly ordered subscripts), a new cluster is formed that has the dotted circle as its base glyph. The shaping engine for non-OpenType fonts will cause invalid mark combinations to overstrike. This is the problem that inserting the dotted circle for the invalid base solves. Note that the dotted circle is not inserted into the application's backing store; it is a run-time insertion into the glyph array that is returned from the ScriptShape function.
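The simplest case, a combining mark with no base in front of it, can be sketched in Python. This is a deliberately reduced illustration under stated assumptions: it uses Unicode general categories in place of the engine's Khmer classes, treats letters, numbers and spaces as acceptable bases, and does not detect the multi-vowel or subscript-ordering cases. The function name is hypothetical.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"

def insert_dotted_circles(text):
    """Return a display copy with U+25CC placed before base-less marks."""
    out = []
    has_base = False
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("M"):          # combining mark or sign
            if not has_base:
                out.append(DOTTED_CIRCLE)     # supply a fallback base
                has_base = True
        else:
            # Assumption: letters, numbers and spaces can carry marks.
            has_base = category[0] in "LN" or category == "Zs"
        out.append(ch)
    return "".join(out)

stored = "\u17B8"                      # a vowel sign with no base
shown = insert_dotted_circles(stored)
print([f"U+{ord(c):04X}" for c in shown])
```

The function returns a new string and leaves `stored` unchanged, which mirrors the point that the dotted circle exists only in the returned glyph array and never in the backing store.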

The invalid diacritic logic for Khmer is based on the classes listed below.

Class Description Code points
XXXX NEED CLASS INFO HERE and unicode points

