Creating and supporting OpenType fonts for the Universal Shaping Engine

Microsoft Typography
Last updated: January 2017

This document presents information that will help font developers in creating OpenType fonts for complex scripts included in the Unicode Standard 8.0, but not otherwise supported by a dedicated shaping engine.

Introduction

This document targets developers implementing shaping behavior compatible with the Microsoft OpenType specification for complex scripts not supported by a dedicated shaping engine. It contains information about terminology, font features and behavior of the Universal Shaping Engine (USE). While it does not contain instructions for creating fonts, it will help font developers understand how the Universal Shaping Engine processes complex script text.

How the Universal Shaping Engine works

The Universal Shaping Engine processes text in stages. The stages are:

  1. Character classification
    1. Overrides to Unicode categories
  2. Split vowel handling
  3. Cluster validation
  4. OpenType feature application I
    1. Basic cluster formation, GSUB
  5. Glyph reordering
  6. OpenType feature application II
    1. Topographical features, GSUB
    2. Standard typographical features, GSUB
    3. Custom substitution features requested by the application, GSUB
    4. Positional features, GPOS

Character classification

The run of text that the shaping engine receives for the purpose of shaping is a sequence of Unicode characters. Itemization preprocessing ensures that the runs of text being shaped contain characters belonging to a single script but may include SCRIPT_COMMON characters. The shaping engine divides the text into syllable clusters and identifies character properties. Character properties are used in parsing syllables and identifying their parts as well as determining whether any special behavior or contextual reordering is required. The Universal Shaping Engine (USE) generates character properties from Unicode data. There is a mapping between Unicode’s categories and the classes used internally by USE. This section defines how USE’s classes and subclasses are derived from Unicode data.

Unicode categories:

UISC = Unicode, Indic Syllabic Category

UGC = Unicode, General Category

UIPC = Unicode, Indic Positional Category. Prior to Unicode 8.0 this property was known as Indic Matra Category.

Sigla USE class Derivation
B BASE

UISC = Number;

UISC = Avagraha & UGC = Lo;

UISC = Bindu & UGC = Lo;

UISC = Consonant;

UISC = Consonant_Final & UGC = Lo;

UISC = Consonant_Head_Letter;

UISC = Consonant_Medial & UGC = Lo;

UISC = Consonant_Subjoined & UGC = Lo;

UISC = Tone_Letter;

UISC = Vowel & UGC = Lo;

UISC = Vowel_Independent;

UISC = Vowel_Dependent & UGC = Lo

CGJ CGJ

U+034F

CM CONS_MOD

UISC = Nukta;

UISC = Gemination_Mark;

UISC = Consonant_Killer

CS CONS_WITH_STACKER

UISC = Consonant_With_Stacker

F CONS_FINAL

UISC = Consonant_Final & UGC != Lo;

UISC = Consonant_Succeeding_Repha

FM CONS_FINAL_MOD

UISC = Syllable_Modifier

GB BASE_OTHER

UISC = Consonant_Placeholder;

U+2015, U+2022, U+25FB–25FE

H HALANT

UISC = Virama;

UISC = Invisible_Stacker

HN HALANT_NUM

UISC = Number_Joiner

IND BASE_IND

UISC = Consonant_Dead;

UISC = Modifying_Letter;

UGC = Po (Punctuation signs), except U+104E, U+2022;

U+002D

M CONS_MED

UISC = Consonant_Medial & UGC != Lo

N BASE_NUM

UISC = Brahmi_Joining_Number

O OTHER

Any other SCRIPT_COMMON characters;

White space characters, UGC=Zs

R REPHA

UISC = Consonant_Preceding_Repha;

UISC = Consonant_Prefixed

Rsv Reserved characters

Any character not currently assigned or otherwise reserved in Unicode

S SYM

UGC = So except U+25CC;

UGC = Sc

SM SYM_MOD

U+1B6B, U+1B6C, U+1B6D, U+1B6E, U+1B6F, U+1B70, U+1B71, U+1B72, U+1B73

SUB CONS_SUB

UISC = Consonant_Subjoined & UGC != Lo

V VOWEL

UISC = Vowel & UGC != Lo;

UISC = Vowel_Dependent & UGC != Lo;

UISC = Pure_Killer

VM VOWEL_MOD

UISC = Bindu & UGC != Lo;

UISC = Tone_Mark;

UISC = Cantillation_Mark;

UISC = Register_Shifter;

UISC = Visarga

VS VARIATION_SELECTOR

U+FE00–FE0F

WJ Word joiner

U+2060

ZWJ Zero width joiner

UISC = Joiner

ZWNJ Zero width non-joiner

UISC = Non_Joiner

For classes that can vary by position, additional subclasses are defined based on Unicode’s Indic_Positional_Category (UIPC):

Sigla USE subclass Derivation
CMAbv CONS_MOD_ABOVE UIPC = Top
CMBlw CONS_MOD_BELOW UIPC = Bottom
FAbv CONS_FINAL_ABOVE UIPC = Top
FBlw CONS_FINAL_BELOW UIPC = Bottom
FPst CONS_FINAL_POST UIPC = Right
MAbv CONS_MED_ABOVE UIPC = Top
MBlw CONS_MED_BELOW UIPC = Bottom
MPre CONS_MED_PRE UIPC = Left
MPst CONS_MED_POST UIPC = Right
SMAbv SYM_MOD_ABOVE U+1B6B, U+1B6D, U+1B6E, U+1B6F, U+1B70, U+1B71, U+1B72, U+1B73
SMBlw SYM_MOD_BELOW U+1B6C
VAbv VOWEL_ABOVE UIPC = Top
  VOWEL_ABOVE_BELOW UIPC = Top_And_Bottom
  VOWEL_ABOVE_BELOW_POST UIPC = Top_And_Bottom_And_Right
  VOWEL_ABOVE_POST UIPC = Top_And_Right
VBlw VOWEL_BELOW UIPC = Bottom; UIPC = Overstruck
  VOWEL_BELOW_POST UIPC = Bottom_And_Right
VPre VOWEL_PRE UIPC = Left
VPst VOWEL_POST UIPC = Right
  VOWEL_PRE_ABOVE UIPC = Top_And_Left
  VOWEL_PRE_ABOVE_POST UIPC = Top_And_Left_And_Right
  VOWEL_PRE_POST UIPC = Left_And_Right
VMAbv VOWEL_MOD_ABOVE UIPC = Top
VMBlw VOWEL_MOD_BELOW UIPC = Bottom; UIPC = Overstruck
VMPre VOWEL_MOD_PRE UIPC = Left
VMPst VOWEL_MOD_POST UIPC = Right
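
The derivation tables above can be read as a two-step lookup: first a class from UISC and UGC, then a positional subclass from UIPC. The following is a minimal sketch of that reading, not the engine's implementation. It covers only a subset of the rows, the function names `use_class` and `use_subclass` are hypothetical, and it assumes the caller already has the UISC, UGC and UIPC values for a character. Combinations the tables do not list fall through to `O` here, which is a simplification.

```python
# Subset of the class derivation table, keyed on UISC and UGC.
ALWAYS_BASE = {"Number", "Consonant", "Consonant_Head_Letter",
               "Tone_Letter", "Vowel_Independent"}
BASE_IF_LO = {"Avagraha", "Bindu", "Consonant_Final", "Consonant_Medial",
              "Consonant_Subjoined", "Vowel", "Vowel_Dependent"}

def use_class(uisc, ugc):
    """Return the USE class siglum for (UISC, UGC); 'O' if not covered here."""
    if uisc in ALWAYS_BASE or (uisc in BASE_IF_LO and ugc == "Lo"):
        return "B"
    if uisc in ("Nukta", "Gemination_Mark", "Consonant_Killer"):
        return "CM"
    if uisc in ("Virama", "Invisible_Stacker"):
        return "H"
    if uisc in ("Consonant_Final", "Consonant_Succeeding_Repha"):
        return "F"
    if uisc == "Consonant_Medial":
        return "M"
    if uisc == "Consonant_Subjoined":
        return "SUB"
    if uisc in ("Vowel", "Vowel_Dependent", "Pure_Killer"):
        return "V"
    if uisc in ("Bindu", "Tone_Mark", "Cantillation_Mark",
                "Register_Shifter", "Visarga"):
        return "VM"
    return "O"

# Positional suffixes that exist for each class in the subclass table.
SUFFIXES = {
    "CM": {"Top": "Abv", "Bottom": "Blw"},
    "F": {"Top": "Abv", "Bottom": "Blw", "Right": "Pst"},
    "M": {"Top": "Abv", "Bottom": "Blw", "Left": "Pre", "Right": "Pst"},
    "V": {"Top": "Abv", "Bottom": "Blw", "Overstruck": "Blw",
          "Left": "Pre", "Right": "Pst"},
    "VM": {"Top": "Abv", "Bottom": "Blw", "Overstruck": "Blw",
           "Left": "Pre", "Right": "Pst"},
}

def use_subclass(cls, uipc):
    """Append the positional suffix, or return cls unchanged if none applies."""
    return cls + SUFFIXES.get(cls, {}).get(uipc, "")
```

For example, `use_subclass(use_class("Vowel_Dependent", "Mn"), "Left")` gives `VPre`, while a character with UISC = Consonant_Medial and UGC = Lo is classified as `B`. Compound positions such as Top_And_Bottom are left without a suffix because they belong to split vowels, which are decomposed before validation.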

Overrides to Unicode categories

USE uses the following overrides to Unicode categories in order to achieve the desired shaping behavior.

Overrides to Indic_Syllabic_Category
  # Indic_Syllabic_Category=Bindu
  AA29       ; Bindu          # Mn       CHAM VOWEL SIGN AA
  # ================================================
  # Indic_Syllabic_Category=Nukta
  0F71       ; Nukta          # Mn       TIBETAN VOWEL SIGN AA
  # ================================================
  # Indic_Syllabic_Category=Tone_Mark
  A982       ; Tone_Mark      # Mn       JAVANESE SIGN LAYAR
  # ================================================
  # Indic_Syllabic_Category=Consonant_Dead
  0F7F       ; Consonant_Dead # Mc       TIBETAN SIGN RNAM BCAD
  # ================================================
  # Indic_Syllabic_Category=Gemination_Mark 
  11134      ; Gemination_Mark # Mc      CHAKMA MAAYYAA
    
Overrides to Indic_Positional_Category
  # Indic_Matra_Category=Top
  0F74        ; Top     # Mn      TIBETAN VOWEL SIGN U
  AA35        ; Top     # Mn      CHAM CONSONANT SIGN
  1A18        ; Top     # Mn      BUGINESE VOWEL SIGN U
  
  # ================================================
  # Indic_Matra_Category=Bottom
  0F72        ; Bottom  # Mn      TIBETAN VOWEL SIGN I
  0F7A..0F7D  ; Bottom  # Mn  [4] TIBETAN VOWEL SIGN E..TIBETAN VOWEL SIGN OO
  0F80        ; Bottom  # Mn      TIBETAN VOWEL SIGN REVERSED I
  11127..11129; Bottom  # Mn  [3] CHAKMA VOWEL SIGN A..CHAKMA VOWEL SIGN II
  1112D       ; Bottom  # Mn      CHAKMA VOWEL SIGN AI
  11130       ; Bottom  # Mn      CHAKMA VOWEL SIGN OI
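
One way to apply the overrides listed above is to consult them before falling back to the Unicode property values. The sketch below assumes the caller supplies the Unicode-derived UISC and UIPC values (the Python standard library does not expose them). The table and function names are hypothetical.

```python
# Overrides copied from the two listings above.
UISC_OVERRIDES = {
    0xAA29: "Bindu",
    0x0F71: "Nukta",
    0xA982: "Tone_Mark",
    0x0F7F: "Consonant_Dead",
    0x11134: "Gemination_Mark",
}

UIPC_OVERRIDES = {0x0F74: "Top", 0xAA35: "Top", 0x1A18: "Top",
                  0x0F72: "Bottom", 0x0F80: "Bottom",
                  0x1112D: "Bottom", 0x11130: "Bottom"}
# Ranges 0F7A..0F7D and 11127..11129 (inclusive).
for cp in list(range(0x0F7A, 0x0F7E)) + list(range(0x11127, 0x1112A)):
    UIPC_OVERRIDES[cp] = "Bottom"

def effective_properties(cp, unicode_uisc, unicode_uipc):
    """Return (UISC, UIPC) for a code point after applying the overrides."""
    return (UISC_OVERRIDES.get(cp, unicode_uisc),
            UIPC_OVERRIDES.get(cp, unicode_uipc))
```

A code point without an override keeps the values passed in, so the function can be placed in front of any classification step without changing its result for other characters.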
    

Split vowel handling

USE decomposes split vowel characters belonging to UISC = Vowel_Dependent according to character decomposition mappings defined in UnicodeData.txt:

  0DCF;SINHALA VOWEL SIGN AELA-PILLA;Mc;0;L;;;;;N;;;;;
  0DD9;SINHALA VOWEL SIGN KOMBUVA;Mc;0;L;;;;;N;;;;;
  0DDC;SINHALA VOWEL SIGN KOMBUVA HAA AELA-PILLA;Mc;0;L;0DD9 0DCF;;;;N;;;;;

When a decomposition is not defined in UnicodeData.txt, it is up to the font developer to handle any required decomposition during GSUB processing.

Cluster validation is based on the decomposed state of a split vowel. The validation schemas therefore only take into account the cardinal positions (Pre, Above, Below, Post), since each component of a full decomposition occupies one position only. As a result, cluster validation depends on the sequence of decomposed components, which may be more restrictive than with ordinary vowels.

  1. Split vowel decomposition needs to be applied recursively so that split vowels are fully decomposed before shaping is applied.
  2. If a character belonging to a split vowel class that includes Pre does not have a canonical decomposition, it is up to the font developer to specify a decomposition. The logical first glyph in that decomposition will be considered to be the VPre. Any subsequent glyphs from that decomposition will not reorder. There are no characters of this category in the currently supported scripts.

Note: Font developers must include glyphs for all required decompositions.
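
The recursive decomposition described above can be sketched with Python's `unicodedata` module, which exposes the decomposition field of UnicodeData.txt. This is an illustration only: the function name `decompose_fully` is hypothetical, and the sketch assumes the caller applies it only to characters with UISC = Vowel_Dependent, since it would otherwise decompose any character that has a canonical mapping.

```python
import unicodedata

def decompose_fully(ch):
    """Recursively expand a character's canonical decomposition mapping."""
    mapping = unicodedata.decomposition(ch)
    # No mapping, or a compatibility mapping (tagged like "<compat>"): keep as is.
    if not mapping or mapping.startswith("<"):
        return [ch]
    parts = []
    for hex_cp in mapping.split():
        parts.extend(decompose_fully(chr(int(hex_cp, 16))))
    return parts
```

With the UnicodeData.txt entries quoted above, `decompose_fully("\u0DDC")` yields U+0DD9 followed by U+0DCF, and U+0DD9 itself is returned unchanged because it has no decomposition.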

Cluster validation

Cluster validation allows sequences of characters to be arranged into groups called “clusters” based on their classification (both class and subclass). In Abugida writing systems, a cluster is an orthographic unit of text that combines multiple phonetic and orthographic elements. It is desirable to control the sequence of characters forming a cluster so that a single visual cluster does not have multiple different encoding sequences as that would create problems for data interchange in terms of stability and security.

USE employs a generalized and permissive cluster structure in order to be flexible enough to accommodate a wide range of script needs. The goal of the clustering logic is to enable what is graphically consistent with a given script’s rules, rather than enforcing particular orthographic or linguistic rules. Such considerations should be applied at another layer, such as a spelling checker.

The maximal cluster scheme used by USE may be visualized as follows:

[Figure: Visualized form of standard cluster in the Universal Shaping Engine]

Schemas and rules for cluster analysis and syllable analysis use the following additional symbols:

X* sequence of zero or more occurrences of X
X+ sequence of one or more occurrences of X
<X | Y> disjunction of elements: X or Y
[X] optional (zero or one) occurrence of X
# occurrence of a boundary
× no boundary allowed at indicated position
÷ boundary allowed at indicated position
^ Except

Well-formed character clusters can have combinations of groups as defined below. There are six options:

  1. Independent cluster

      < IND | O | Rsv | WJ > [VS]

    The independent cluster normally consists of a single member. The only other character class it can combine with is VARIATION_SELECTOR. The BASE_IND, OTHER, Reserved Characters and the word joiner (WJ) can only have a single code point per cluster. When a VARIATION_SELECTOR occurs in any context other than immediately following one of the valid bases (IND, O, Rsv, WJ, B, GB, N, S), it forms an independent cluster.

  2. Standard cluster

      [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< H B | SUB > [VS] (CMAbv)* (CMBlw)*)*
      [MPre] [MAbv] [MBlw] [MPst]
      (VPre)* (VAbv)* (VBlw)* (VPst)*
      (VMPre)* (VMAbv)* (VMBlw)* (VMPst)*
      (FAbv)* (FBlw)* (FPst)* [FM]

    The only required component of a standard cluster is a BASE or BASE_OTHER. A cluster may optionally begin with a REPHA or CONS_WITH_STACKER. A BASE or BASE_OTHER may be followed immediately by a VARIATION_SELECTOR and/or multiple CONS_MOD characters in the order CONS_MOD_ABOVE, CONS_MOD_BELOW. Multiple sequences of HALANT BASE (or CONS_SUB), each with an optional VARIATION_SELECTOR and optional CONS_MODs, can occur. The sequence can continue with zero or one CONS_MED for each cardinal position (Pre, Above, Below, Post); zero to many VOWEL characters in each cardinal position; zero to many VOWEL_MODs in each cardinal position; zero to many CONS_FINALs in each of Above, Below, and Post; and lastly, an optional CONS_FINAL_MOD.

  3. Virama terminated cluster

      [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< H B | SUB > [VS] (CMAbv)* (CMBlw)*)* H

    This is similar to the Standard cluster but terminates in a final HALANT. When a HALANT follows a BASE or BASE_OTHER it will form a cluster. When any character other than a BASE follows the HALANT there will be a cluster break between the HALANT and the following character. Multiple sequences of HALANT BASE (or CONS_SUB), each with an optional VARIATION_SELECTOR and optional CONS_MODs, can occur. A CONS_SUB is equivalent to the sequence HALANT BASE.

  4. Number-joiner terminated cluster

      N [VS] (HN N [VS])* HN

    When HALANT_NUM follows a BASE_NUM it will form a cluster. When any character other than BASE_NUM follows the HALANT_NUM there will be a cluster break between the HALANT_NUM and the following character. A BASE_NUM may be followed immediately by a VARIATION_SELECTOR. Multiple sequences of a HALANT_NUM BASE_NUM with optional VARIATION_SELECTOR can occur.

  5. Numeral cluster

      N [VS] (HN N [VS])*

    A BASE_NUM may form a cluster with another BASE_NUM when joined using a HALANT_NUM. The join may be repeated. Any BASE_NUM may be followed by a VARIATION_SELECTOR.

  6. Symbol cluster

      < S | GB > [VS] (SMAbv)* (SMBlw)*

    A SYM character may be followed by an optional VARIATION_SELECTOR and zero to many SYM_MOD_ABOVE, then SYM_MOD_BELOW.
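
The standard and virama terminated schemas above translate fairly directly into regular expressions over a sequence of class sigla. The following is a rough sketch of that idea, not the engine's actual matcher: it handles only those two cluster types, ignores CGJ, ZWJ and ZWNJ, and the name `cluster_kind` is hypothetical. Each siglum is written with a trailing space so that a pattern can never match part of a longer siglum.

```python
import re

# Optional variation selector and consonant modifiers after a base.
BASE_TAIL = r"(?:VS )?(?:CMAbv )*(?:CMBlw )*"

# Shared head: [<R|CS>] <B|GB> [VS] CM* (<H B|SUB> [VS] CM*)*
HEAD = (
    r"(?:(?:R|CS) )?(?:(?:B|GB) )" + BASE_TAIL
    + r"(?:(?:H B |SUB )" + BASE_TAIL + r")*"
)

# Medials, vowels, vowel modifiers and finals, in schema order.
TAIL = (
    r"(?:MPre )?(?:MAbv )?(?:MBlw )?(?:MPst )?"
    r"(?:VPre )*(?:VAbv )*(?:VBlw )*(?:VPst )*"
    r"(?:VMPre )*(?:VMAbv )*(?:VMBlw )*(?:VMPst )*"
    r"(?:FAbv )*(?:FBlw )*(?:FPst )*(?:FM )?"
)

STANDARD = re.compile(HEAD + TAIL)
VIRAMA_TERMINATED = re.compile(HEAD + r"H ")

def cluster_kind(classes):
    """Classify a list of sigla as one whole cluster, or return None."""
    text = "".join(c + " " for c in classes)
    if VIRAMA_TERMINATED.fullmatch(text):
        return "virama-terminated"
    if STANDARD.fullmatch(text):
        return "standard"
    return None
```

For example, `cluster_kind(["B", "H", "B", "VPre", "VMAbv"])` returns `"standard"` and `cluster_kind(["B", "H"])` returns `"virama-terminated"`. `cluster_kind(["B", "VPst", "VPre"])` returns `None` because the vowels are out of schema order. A full implementation would instead scan a run for the longest valid cluster at each position.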

Independent Vowel (IV) plus Dependent Vowel constraints (DV)

The core specification of the Unicode Standard prohibits forming certain IV forms from other bases plus a DV (e.g., TUS Table 12-1). Since these prohibitions apply to particular pairs and not globally across classes, USE maintains a list of prohibited sequences:

  0905 0946       ; # DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN SHORT E
  0905 093E       ; # DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN AA
  0930 094D 0907  ; # DEVANAGARI LETTER RA, DEVANAGARI SIGN VIRAMA, DEVANAGARI LETTER I
  0909 0941       ; # DEVANAGARI LETTER U, DEVANAGARI VOWEL SIGN U
  090F 0945       ; # DEVANAGARI LETTER E, DEVANAGARI VOWEL SIGN CANDRA E
  090F 0946       ; # DEVANAGARI LETTER E, DEVANAGARI VOWEL SIGN SHORT E
  090F 0947       ; # DEVANAGARI LETTER E, DEVANAGARI VOWEL SIGN E
  0905 0949       ; # DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN CANDRA O
  0906 0945       ; # DEVANAGARI LETTER AA, DEVANAGARI VOWEL SIGN CANDRA E
  0905 094A       ; # DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN SHORT O
  0906 0946       ; # DEVANAGARI LETTER AA, DEVANAGARI VOWEL SIGN SHORT E
  0905 094B       ; # DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN O
  0906 0947       ; # DEVANAGARI LETTER AA, DEVANAGARI VOWEL SIGN E
  0905 094C       ; # DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN AU
  0906 0948       ; # DEVANAGARI LETTER AA, DEVANAGARI VOWEL SIGN AI
  0905 0945       ; # DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN CANDRA E
  0905 093A       ; # DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN OE
  0905 093B       ; # DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN OOE
  0906 093A       ; # DEVANAGARI LETTER AA, DEVANAGARI VOWEL SIGN OE
  0905 094F       ; # DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN AW
  0905 0956       ; # DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN UE
  0905 0957       ; # DEVANAGARI LETTER A, DEVANAGARI VOWEL SIGN UUE
  0985 09BE       ; # BENGALI LETTER A, BENGALI VOWEL SIGN AA
  098B 09C3       ; # BENGALI LETTER VOCALIC R, BENGALI VOWEL SIGN VOCALIC R
  098C 09E2       ; # BENGALI LETTER VOCALIC L, BENGALI VOWEL SIGN VOCALIC L
  0A05 0A3E       ; # GURMUKHI LETTER A, GURMUKHI VOWEL SIGN AA
  0A72 0A3F       ; # GURMUKHI IRI, GURMUKHI VOWEL SIGN I
  0A72 0A40       ; # GURMUKHI IRI, GURMUKHI VOWEL SIGN II
  0A73 0A41       ; # GURMUKHI URA, GURMUKHI VOWEL SIGN U
  0A73 0A42       ; # GURMUKHI URA, GURMUKHI VOWEL SIGN UU
  0A72 0A47       ; # GURMUKHI IRI, GURMUKHI VOWEL SIGN EE
  0A05 0A48       ; # GURMUKHI LETTER A, GURMUKHI VOWEL SIGN AI
  0A73 0A4B       ; # GURMUKHI URA, GURMUKHI VOWEL SIGN OO
  0A05 0A4C       ; # GURMUKHI LETTER A, GURMUKHI VOWEL SIGN AU
  0A85 0ABE       ; # GUJARATI LETTER A, GUJARATI VOWEL SIGN AA
  0A85 0AC5       ; # GUJARATI LETTER A, GUJARATI VOWEL SIGN CANDRA E
  0A85 0AC7       ; # GUJARATI LETTER A, GUJARATI VOWEL SIGN E
  0A85 0AC8       ; # GUJARATI LETTER A, GUJARATI VOWEL SIGN AI
  0A85 0AC9       ; # GUJARATI LETTER A, GUJARATI VOWEL SIGN CANDRA O
  0A85 0ACB       ; # GUJARATI LETTER A, GUJARATI VOWEL SIGN O
  0A85 0ABE 0AC5  ; # GUJARATI LETTER A, GUJARATI VOWEL SIGN AA, GUJARATI VOWEL SIGN CANDRA E
  0A85 0ACC       ; # GUJARATI LETTER A, GUJARATI VOWEL SIGN AU
  0A85 0ABE 0AC8  ; # GUJARATI LETTER A, GUJARATI VOWEL SIGN AA, GUJARATI VOWEL SIGN AI
  0AC5 0ABE       ; # GUJARATI VOWEL SIGN CANDRA E, GUJARATI VOWEL SIGN AA
  0B05 0B3E       ; # ORIYA LETTER A, ORIYA VOWEL SIGN AA
  0B0F 0B57       ; # ORIYA LETTER E, ORIYA AU LENGTH MARK
  0B13 0B57       ; # ORIYA LETTER O, ORIYA AU LENGTH MARK
  0C12 0C55       ; # TELUGU LETTER O, TELUGU LENGTH MARK
  0C12 0C4C       ; # TELUGU LETTER O, TELUGU VOWEL SIGN AU
  0C3F 0C55       ; # TELUGU VOWEL SIGN I, TELUGU LENGTH MARK
  0C46 0C55       ; # TELUGU VOWEL SIGN E, TELUGU LENGTH MARK
  0C4A 0C55       ; # TELUGU VOWEL SIGN O, TELUGU LENGTH MARK
  0C89 0CBE       ; # KANNADA LETTER U, KANNADA VOWEL SIGN AA
  0C92 0CCC       ; # KANNADA LETTER O, KANNADA VOWEL SIGN AU
  0C8B 0CBE       ; # KANNADA LETTER VOCALIC R, KANNADA VOWEL SIGN AA
  0D07 0D57       ; # MALAYALAM LETTER I, MALAYALAM AU LENGTH MARK
  0D09 0D57       ; # MALAYALAM LETTER U, MALAYALAM AU LENGTH MARK
  0D0E 0D46       ; # MALAYALAM LETTER E, MALAYALAM VOWEL SIGN E
  0D12 0D3E       ; # MALAYALAM LETTER O, MALAYALAM VOWEL SIGN AA
  0D12 0D57       ; # MALAYALAM LETTER O, MALAYALAM AU LENGTH MARK
  0D85 0DCF       ; # SINHALA LETTER AYANNA, SINHALA VOWEL SIGN AELA-PILLA
  0D85 0DD0       ; # SINHALA LETTER AYANNA, SINHALA VOWEL SIGN KETTI AEDA-PILLA
  0D85 0DD1       ; # SINHALA LETTER AYANNA, SINHALA VOWEL SIGN DIGA AEDA-PILLA
  0D8B 0DDF       ; # SINHALA LETTER UYANNA, SINHALA VOWEL SIGN GAYANUKITTA
  0D8D 0DD8       ; # SINHALA LETTER IRUYANNA, SINHALA VOWEL SIGN GAETTA-PILLA
  0D8F 0DDF       ; # SINHALA LETTER ILUYANNA, SINHALA VOWEL SIGN GAYANUKITTA
  0D91 0DCA       ; # SINHALA LETTER EYANNA, SINHALA SIGN AL-LAKUNA
  0D91 0DD9       ; # SINHALA LETTER EYANNA, SINHALA VOWEL SIGN KOMBUVA
  0D91 0DDA       ; # SINHALA LETTER EYANNA, SINHALA VOWEL SIGN DIGA KOMBUVA
  0D91 0DDC       ; # SINHALA LETTER EYANNA, SINHALA VOWEL SIGN KOMBUVA HAA AELA-PILLA
  0D91 0DDD       ; # SINHALA LETTER EYANNA, SINHALA VOWEL SIGN KOMBUVA HAA DIGA AELA-PILLA
  0D94 0DDF       ; # SINHALA LETTER OYANNA, SINHALA VOWEL SIGN GAYANUKITTA
  11005 11038     ; # BRAHMI LETTER A, BRAHMI VOWEL SIGN AA
  1100B 1103E     ; # BRAHMI LETTER VOCALIC R, BRAHMI VOWEL SIGN VOCALIC R
  1100F 11042     ; # BRAHMI LETTER E, BRAHMI VOWEL SIGN E
  11680 116AD     ; # TAKRI LETTER A, TAKRI VOWEL SIGN AA
  11686 116B2     ; # TAKRI LETTER E, TAKRI VOWEL SIGN E
  11680 116B4     ; # TAKRI LETTER A, TAKRI VOWEL SIGN O
  11680 116B5     ; # TAKRI LETTER A, TAKRI VOWEL SIGN AU
  112B0 112E0     ; # KHUDAWADI LETTER A, KHUDAWADI VOWEL SIGN AA
  112B0 112E5     ; # KHUDAWADI LETTER A, KHUDAWADI VOWEL SIGN E
  112B0 112E6     ; # KHUDAWADI LETTER A, KHUDAWADI VOWEL SIGN AI
  112B0 112E7     ; # KHUDAWADI LETTER A, KHUDAWADI VOWEL SIGN O
  112B0 112E8     ; # KHUDAWADI LETTER A, KHUDAWADI VOWEL SIGN AU
  11481 114B0     ; # TIRHUTA LETTER A, TIRHUTA VOWEL SIGN AA
  114AA 114B5     ; # TIRHUTA LETTER LA, TIRHUTA VOWEL SIGN VOCALIC R
  114AA 114B6     ; # TIRHUTA LETTER LA, TIRHUTA VOWEL SIGN VOCALIC RR
  1148B 114BA     ; # TIRHUTA LETTER E, TIRHUTA VOWEL SIGN SHORT E
  1148D 114BA     ; # TIRHUTA LETTER O, TIRHUTA VOWEL SIGN SHORT E
  11600 11639     ; # MODI LETTER A, MODI VOWEL SIGN E
  11600 1163A     ; # MODI LETTER A, MODI VOWEL SIGN AI
  11601 11639     ; # MODI LETTER AA, MODI VOWEL SIGN E
  11601 1163A     ; # MODI LETTER AA, MODI VOWEL SIGN AI
    

Note: The above is a complete listing of prohibited sequences documented in TUS at the time of this writing. This includes sequences from scripts that are not currently handled by USE (e.g., Devanagari).
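
Because the prohibitions apply to specific code point sequences, they can be checked with a plain table lookup. The sketch below loads only four entries from the list above for brevity, and the names `PROHIBITED` and `find_prohibited` are hypothetical. It only reports where a prohibited sequence occurs. What the engine does with such a sequence is not specified here.

```python
# A few entries from the list above; a real table would hold all of them.
PROHIBITED = {
    (0x0905, 0x093E),          # DEVANAGARI LETTER A + VOWEL SIGN AA
    (0x0930, 0x094D, 0x0907),  # DEVANAGARI RA + VIRAMA + LETTER I
    (0x0A85, 0x0ABE, 0x0AC5),  # GUJARATI A + VOWEL SIGN AA + CANDRA E
    (0x0D91, 0x0DDC),          # SINHALA EYANNA + KOMBUVA HAA AELA-PILLA
}

def find_prohibited(text, table=PROHIBITED):
    """Return (index, sequence) for each prohibited sequence found in text."""
    cps = [ord(ch) for ch in text]
    hits = []
    for i in range(len(cps)):
        for seq in table:
            if tuple(cps[i:i + len(seq)]) == seq:
                hits.append((i, seq))
    return hits
```

Calling `find_prohibited("\u0905\u093E")` reports one hit at index 0, while a consonant followed by the same vowel sign, such as `"\u0915\u093E"`, reports none.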

Combining Grapheme Joiner (CGJ)

CGJ has been omitted from the above schema in order to avoid unnecessary complexity. It may occur anywhere in a cluster with no effect. The purpose of CGJ is to block normalization processing which could change the order of marks in a sequence. CGJ handling will need to be updated if USE is modified to support normalization.

Zero-width non-joiner (ZWNJ)

The zero-width non-joiner is used to prevent a fusion of two characters. It continues a preceding cluster but causes a cluster break after itself when the following character is not a mark character (gc=Mn or gc=Mc). ZWNJ does not reset the cluster model.

Zero-width joiner (ZWJ)

The zero-width joiner is used to fuse two characters. It continues a preceding cluster and joins it to a following character unless the following character is another ZWJ, in which case there will be a cluster break between the two ZWJs. ZWJ does not reset the cluster model.

Standalone characters

Characters of the class BASE_IND occur as standalone characters and do not form clusters with following characters. When a character of the class VS or J occurs outside of a cluster, i.e., at the start of a run or following one of these standalone characters, it should also be treated as a standalone character.

Defective clusters

When a cluster starts with any character that has UGC=Mc or UGC=Mn, USE inserts a dotted circle glyph (U+25CC) to indicate a broken cluster. Defective clusters do not form extended clusters themselves. A sequence of marks without a valid base forms separate clusters for each mark. Note that an explicit character U+25CC is a valid generic base (GB, BASE_OTHER) and so can form extended clusters.
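
The defective-cluster rule can be illustrated at the character level. USE inserts a dotted circle glyph rather than a character, so the sketch below is only an approximation, and the name `repair_cluster` is hypothetical.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"

def repair_cluster(cluster):
    """Prefix a dotted circle when the cluster starts with a mark (Mn or Mc)."""
    if cluster and unicodedata.category(cluster[0]) in ("Mn", "Mc"):
        return DOTTED_CIRCLE + cluster
    return cluster
```

A cluster that already begins with a base, including an explicit U+25CC, is returned unchanged.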

Maximum cluster length

The maximum cluster length is 31 glyphs. After this limit is reached, a cluster break is forced and a new cluster is started.
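
As a small illustration of the length limit, the sketch below splits an over-long glyph sequence into pieces of at most 31 glyphs. It assumes the forced break simply starts the next cluster at the following glyph. How the remainder is then validated is not described above, and the function name is hypothetical.

```python
MAX_CLUSTER_LENGTH = 31

def split_long_cluster(glyphs):
    """Split a glyph list into consecutive pieces of at most 31 glyphs."""
    return [glyphs[i:i + MAX_CLUSTER_LENGTH]
            for i in range(0, len(glyphs), MAX_CLUSTER_LENGTH)]
```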

OpenType feature application I

Basic cluster formation GSUB

Default glyph pre-processing group

These features are applied together, one cluster at a time. Lookups associated with any of these features will be triggered in the lookup-order specified by the font developer. It is possible to interleave lookups for different features. The order of features given here is recommended.

Reordering group

These features are applied individually in this order: rphf, pref. The output of each of these features is reordered as specified in the next section.

Note that reordering is not done until after all of the basic features have been applied.

Orthographic unit shaping group

Like the Default glyph pre-processing group, these features are applied together, one cluster at a time. It is, therefore, up to the font developer to specify the order of lookups for these features in the font’s OTL. The order given here is recommended but not required.

Glyph reordering

All reordering and anchoring of marks is done in relation to the base consonant. USE does reordering in a single late phase. This is because all reordering is dependent on the formation of the base which may be modified during basic cluster formation. USE does not do checks or look-ahead in a font’s feature tables. Therefore feature lookups for the basic features must be designed for the logical glyph order before any reordering has been applied. There are two categories of glyphs that reorder: feature based and property based. Actual glyph reordering is done between basic cluster formation (OpenType feature application I) and Topographical substitutions (first part of OpenType feature application II). Reordering is applied in logical order: rphf, pref, VPre, VMPre:

[Figure: Before reordering]
[Figure: After reordering]

When the base cluster is broken by an explicit virama that has not been replaced during basic cluster formation the reordering is impacted since reordering components do not reorder past an explicit virama:

[Figure: Before reordering]
[Figure: After reordering]

If a font developer does not want the reordering behavior to be blocked by an explicit virama, they can replace the virama with an alternate glyph during basic cluster formation so that USE treats the cluster as a continuous cluster without an explicit virama.

Feature-based reordering
Sigla Name Description
rphf Reph form Many abugida scripts render a preconsonantal r as a sign above the consonant it precedes. This sign is called reph. For the purposes of OT layout, reph is normally rendered with a mark glyph, and as such, must follow the base to which it applies. Contexts which produce the reph glyph must use the <rphf> feature. The output of this feature reorders after the following full base. Reph does not reorder past an explicit Virama. Lookups under the rphf feature should output no more than one glyph per cluster.
pref Pre-base form Scripts such as Javanese and Balinese reorder a medial consonant to the beginning of a cluster based on context. The contextual logic is encoded in the font’s OT logic. Cases which require reordering should use the <pref> feature to identify the glyph or glyphs to be reordered. One or more glyphs may be substituted for a single feature glyph that is to be reordered before the first spacing glyph in the cluster or the first spacing glyph after an explicit virama if present. Only one such glyph is reordered per cluster.

Property-based reordering
Sigla Name Description
R REPHA Pre-base REPHA is reordered by USE after a following full base as if the <rphf> feature has been applied. Font developers do not need to use the <rphf> feature explicitly.
VPre VOWEL_PRE Pre-base vowels and pre-base vowel components from split vowels are reordered before the base glyph and, if present, before a pre-base glyph reordered via the <pref> feature.
VMPre VOWEL_MOD_PRE Pre-base vowel modifiers are reordered before the base glyph and, if present, before a pre-base vowel and/or before a pre-base glyph reordered via the <pref> feature.
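
The property-based rules for VPre and VMPre can be sketched as a reordering over one cluster of (glyph, class) pairs. This is a simplified illustration under stated assumptions: it handles only VPre and VMPre (not REPHA or the <pref> feature), it treats any remaining glyph of class H as an explicit virama that blocks movement, and the glyph names and the function `reorder_pre_base` are hypothetical.

```python
def reorder_pre_base(cluster):
    """Move VMPre and VPre glyphs before the base, stopping at an explicit virama.

    cluster is a list of (glyph_name, use_class) pairs in logical order.
    """
    # Insertion point: just after the last explicit halant, else cluster start.
    start = 0
    for i, (_, cls) in enumerate(cluster):
        if cls == "H":
            start = i + 1
    head, tail = cluster[:start], cluster[start:]
    vmpre = [g for g in tail if g[1] == "VMPre"]
    vpre = [g for g in tail if g[1] == "VPre"]
    rest = [g for g in tail if g[1] not in ("VPre", "VMPre")]
    # Vowel modifiers go before pre-base vowels, which go before the base.
    return head + vmpre + vpre + rest
```

With `[("ka", "B"), ("e", "VPre")]` the vowel moves to the front. With an explicit virama, as in `[("ka", "B"), ("virama", "H"), ("ta", "B"), ("e", "VPre")]`, the vowel stops immediately after the virama and lands before `ta`.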

OpenType feature application II

Topographical features, GSUB

USE applies positional features required for scripts like Arabic which have alternate glyph shapes depending on the position of a glyph within a word.

USE applies these as required features on a per cluster basis in order to invoke a particular form based on non-joining or white space boundaries. Non-joining boundaries occur between glyphs belonging to a joining script when one or both characters have a non-joining property that applies to the side of the connection on which a join would occur. The control characters ZWJ and ZWNJ may be used to artificially invoke or prevent a join. Joining properties are defined by the Unicode property Joining Type in ArabicShaping.txt.

Note: Support for topographical features for non-joining scripts is not currently implemented in USE. Additional specification is required.

Standard typographic presentation, GSUB

The remaining required features are applied all together to the entire run. It is up to the font developer to specify the order of lookups for this set of features:

Custom substitution features requested by the application, GSUB

Positional feature application, GPOS

During GPOS processing, these required features are applied simultaneously to the entire run. The order specified here is recommended, but it is up to font developers to define the order of GPOS lookups for this set of features in OTL.

An important OT technique for excluding certain base glyphs from contextual lookups is to classify the base glyph as a mark in the font's GDEF table, since marks can be selectively included or omitted from OT processing. However, whenever this technique is used, the width of the base glyph must be added back using the <dist> feature. This is necessary because OT processing cancels the width associated with a mark. It is necessary to cancel the width of a non-spacing mark because it is not clear where to apply the width of a non-spacing mark during OpenType processing.

A typical use case of this is Javanese, which has prebase vowels. Since the prebase vowels do not reorder until after basic cluster formation, they are present in their logical position, which may interrupt other contextual substitutions. If the vowels are treated as marks, they can be excluded from OT context, which reduces the number of contextual rules required for processing. The expected width of the prebase vowels can then be restored with the dist feature, for example (in VOLT OT Language):

  DEF_LOOKUP "j.dist_preVowel" PROCESS_BASE PROCESS_MARKS ALL DIRECTION LTR
    AS_POSITION
        ADJUST_SINGLE
            GLYPH "jSignE"  BY POS ADV 1433 END_POS
            GLYPH "jSignAi" BY POS ADV 1433 END_POS
        END_ADJUST
    END_POSITION
  END

Other encoding issues

Handling invalid combining marks

Combining marks and signs that do not occur in conjunction with a valid base are considered invalid. USE treats an invalid mark as a separate cluster and displays the stand-alone mark positioned on a dotted circle (U+25CC). If multiple marks are required to position on a dotted circle, the dotted circle can be explicitly inserted into the text stream followed by any marks in accordance with the standard clustering rules.

To allow for shaping engine implementations that expect to position an invalid mark on a dotted circle, it is recommended that fonts using USE contain a glyph for the dotted circle character, U+25CC. If this character is not supported in the font, such implementations will display invalid signs on the missing glyph shape (white box).

Recommended Glyphs

Unicode code points that are recommended for inclusion in any font using USE are:

Code point Description
U+200B Zero Width Space
U+200C Zero Width Non-Joiner
U+200D Zero Width Joiner
U+25CC Dotted Circle
U+00A0 No-break space
U+00D7 Multiplication sign
U+2012 Figure dash
U+2013 En dash
U+2014 Em dash
U+2015 Horizontal bar
U+2022 Bullet
U+25FB White medium square
U+25FC Black medium square
U+25FD White medium small square
U+25FE Black medium small square

Appendix

Writing system and language tags

OpenType features are enabled in a font according to both a designated script and language system. The language system tag specifies a typographic convention associated with a language or linguistic subgroup. Not all software applications support specific language tags for use when rendering text runs.

* NOTE: It is strongly recommended to include the “dflt” language tag in all OpenType fonts because it defines the basic script handling for a font. The “dflt” language system is used as the default if no other language specific features are defined or if the application does not support that particular language. If the “dflt” tag is not present for the script being used, the font may not work in some applications.
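
The fallback behavior described in the note can be sketched as a simple lookup. The set of language tags and the function name `select_language_system` are hypothetical. An actual application reads the language systems from the font's script table.

```python
def select_language_system(available, requested):
    """Pick the requested language system, else "dflt", else None."""
    if requested in available:
        return requested
    if "dflt" in available:
        return "dflt"
    # No default defined for this script: the font may not work.
    return None
```

For a script that defines only `{"dflt"}`, any requested language resolves to `"dflt"`. If the script defines language systems but no `"dflt"`, an unsupported request resolves to nothing, which is the failure case the note warns about.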

The following table lists the registered tag names for scripts currently supported by USE. This list will grow as new complex scripts are enabled by Unicode and USE is updated. Language system tags are not listed here and should be determined by font developers as appropriate for the script concerned.

Registered tags for the Universal Shaping Engine
Script Script tag
Balinese bali
Batak batk
Brahmi brah
Buginese bugi
Buhid buhd
Chakma cakm
Cham cham
Duployan dupl
Egyptian Hieroglyphs egyp
Grantha gran
Hanunoo hano
Javanese java
Kaithi kthi
Kayah Li kali
Kharoshthi khar
Khojki khoj
Khudawadi sind
Lepcha lepc
Limbu limb
Mahajani mahj
Mandaic mand
Manichaean mani
Meitei Mayek mtei
Modi modi
Mongolian mong
N’Ko nko
Pahawh Hmong hmng
Phags-pa phag
Psalter Pahlavi phlp
Rejang rjng
Saurashtra saur
Sharada shrd
Siddham sidd
Sinhala sinh
Sundanese sund
Syloti Nagri sylo
Tagalog tglg
Tagbanwa tagb
Tai Le tale
*Tai Tham lana
Tai Viet tavt
Takri takr
Tibetan tibt
Tifinagh tfng
Tirhuta tirh

Note: script tags are case sensitive (script tags should be lowercase) and must contain four characters.

* Note: Tai Tham support is limited to mono-syllabic clusters.
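
The script tag constraints noted above (lowercase, four characters) can be checked with a short helper. The sketch assumes that a tag shown with fewer than four characters, such as `nko`, is padded with trailing spaces to four characters, as OpenType tags are. The function name is hypothetical.

```python
def is_valid_script_tag(tag):
    """Check that a script tag is lowercase ASCII and fits in four characters."""
    padded = tag.ljust(4, " ")  # pad short tags with trailing spaces
    return len(padded) == 4 and padded.isascii() and padded == padded.lower()
```

Under these assumptions `"bali"` and `"nko"` are accepted, while `"Bali"` and `"balinese"` are rejected.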