Creating and Supporting OpenType Fonts for Sinhala Script

Article
06/09/2022

This document presents information that will help font developers in creating OpenType fonts for Sinhala script as covered by the Unicode Standard 6.0. The Sinhala script is used to write the Sinhala language. It is also used to write Pali and Sanskrit.

NOTE: Starting in Windows 10, Sinhala will be supported by the Universal Shaping Engine rather than a stand-alone shaping engine. Moving forward, developers should refer to this new specification.

Introduction

This document targets developers implementing shaping behavior compatible with the Microsoft OpenType specification for the Sinhala script. It contains information about terminology, font features and behavior of the Sinhala shaping engine. While it does not contain instructions for creating Sinhala fonts, it will help font developers understand how the Sinhala shaping engine processes Sinhala text.

Terms

The following terms are useful for understanding the layout features and script rules discussed in this document.

Akhand ligature – A required consonant ligature that may appear anywhere in the syllable and may or may not involve the base glyph. Akhand ligatures have the highest priority and are formed first; some languages include them in their alphabets

Al-lakuna (halant/virama) – The character used after a consonant to suppress its inherent vowel

Base glyph – Any glyph that can have a diacritic mark attached to it. Layout operations are defined in terms of a base glyph, not a base character, as a ligature may act as a base

Character – Each character represents a Unicode character code point. For example the Sinhala letter Ayanna (අ) is U+0D85. A character may have multiple forms of glyphs

Cluster – A group of characters that form an integral unit in Indic scripts, oftentimes this corresponds to a syllable

Consonant – Sinhala consonants have an inherent vowel (the short vowel /a/ called ayanna). For example, “Ka” and “Ta”, rather than just “K” or “T”

Consonant conjunct (aka ‘conjunct’) – A ligature of two or more consonants

Format controls – special formatting characters used in the shaping process of Sinhala scripts (U+200c and U+200D). These characters have no visual appearance, except when an application chooses to display zero width glyphs

Glyph – A glyph represents a form of one or more characters

Halant – See Al-lakuna.

Ligature – A combination of glyphs that join to form a single glyph. For example the touching-letter sequence of alpapraana kayanna and vayanna (U+0D9A, U+0DCA, U+200D, U+0DC0)

Matra (dependent vowel) – Used to represent a vowel sound that is not inherent to the consonant. Dependent vowels are referred to as “matras” in Sanskrit. They are always depicted in combination with a single consonant, or with a consonant cluster. The greatest variation among different Indian scripts is found in the rules for attaching dependent vowels to base characters

OpenType layout engine – The library responsible for executing OpenType layout features in a font. In the Microsoft text formatting stack, it is named OTLS (OpenType layout services)

OpenType tag – A 4-byte identifier for script, language system or feature in the font

Rakaaraansaya (rakar) – The below-base form of “Ra” which forms a ligature with most preceding consonant(s)

Repaya (reph) – The above-base form of the letter “Ra” that is used if “Ra” is the first consonant in the syllable and is not the base consonant

Shaping engine - Code responsible for shaping input, classified to a particular script

Split matra – A matra that is decomposed into pieces for rendering. Usually the different pieces appear in different positions relative to the base. For instance, part of the matra may be placed at the beginning of the cluster and another part at the end of the cluster

Touching letters – A pure consonant written touching a following letter instead of using al-lakuna. Used in classical and Buddhist texts.

Yansaya – The post-base form of “Ya” which goes with a preceding consonant(s)

The Uniscribe Sinhala shaping engine processes text in stages. The stages are:

Analyzing the characters
Processing split matras
Reordering characters
Applying OpenType GSUB features to get the correct glyph shape
Applying OpenType GPOS features to position glyphs or marks

The descriptions which follow will help font developers understand the rationale for the Sinhala feature encoding model, and help application developers better understand how layout clients can divide responsibilities with operating system functions.

Analyzing the characters

The run of text that the shaping engine receives for the purpose of shaping is a sequence of Unicode characters. The shaping engine divides the text into syllable clusters and identifies character properties. Character properties are used in parsing syllables and identifying their parts as well as determining whether any contextual reordering is required.

Additionally, the engine verifies that the run consists of valid clusters and inserts a placeholder glyph (U+25CC) wherever combining marks occur without a valid base.

In the Sinhala engine, OpenType features are applied to the entire run after any reordering has been completed.

In the diagrams below, the rules for forming clusters are given in terms of the types of characters in the character stream. The meanings of the symbols are:

C	Consonant (U+0D9A–U+0DB1, U+0DB3–U+0DBB, U+0DBD, U+0DC0–U+0DC6)
IV	Independent vowel (U+0D85–U+0D96)
H	Halant/Al-lakuna (U+0DCA)
VM	Vowel modifier (U+0D82, U+0D83)
DV	Dependent vowel (U+0DCF–U+0DD1, U+0DD4, U+0DD6, U+0DD8, U+0DDF, U+0DF2–U+0DF3, U+0DD9, U+0DDB, U+0DD2–U+0DD3, U+0DDA, U+0DDC–U+0DDE)
ZWJ	Zero width joiner (U+200D)
{ }	Zero to many occurrences. Because this could theoretically go on forever, an arbitrary limit of eight occurrences has been set.
[ ]	Optional occurrence
< x \| y >	Choice of x or y

There are two types of clusters:

Consonant Cluster
{C + H + ZWJ } + C + [< H \| DV >] + [VM]	Used for semivowel and ligature clusters
{C + ZWJ + H} + C + [< H \| DV >] + [VM]	Used for clusters of touching letters
Independent Vowel Cluster
IV + [VM]

Processing split matras

The shaping engine inserts the prebase component of the split matras into the glyph sequence. The following split matras are defined for Sinhala:

Code point	Glyph	Description
U+0DDA		Sinhala vowel sign diga kombuva
U+0DDC		Sinhala vowel sign kombuva haa aela-pilla
U+0DDD		Sinhala vowel sign kombuva haa diga aela-pilla
U+0DDE		Sinhala vowel sign kombuva haa gayanukitta

The shaping engine will insert the glyph for U+0DD9 into the glyph sequence to serve as the prebase component for subsequent reordering. The second component of the split matras should be substituted appropriately by the font using the pstf feature, see Basic shaping forms.

Reordering characters

Once the Sinhala shaping engine has analyzed the run as described above, it creates a buffer of appropriately reordered elements (glyphs) representing the cluster according to the following rules:

Prebase vowels and prebase vowel-components of split matras reorder to start of a cluster
Yansaya fuses with a preceding base glyph
Repaya reorders after the base glyph or after yansaya if also present

The OpenType lookups in a Sinhala font must be written to match glyph sequences after re-ordering has occurred. OpenType fonts should not have substitutions that attempt to perform the re-ordering. If a font developer attempted to encode such reordering information in an OpenType font, they would need to add a huge number of many-to-many glyph mappings to cover the general algorithms that a shaping engine will use.

Apply OpenType GSUB features

Uniscribe calls OTLS to apply the features. All OTL processing is divided into a set of predefined features (described and illustrated in the Feature section of this document). Each feature is applied to the entire run and OTLS processes them. Uniscribe makes as many calls to the OTL Services as there are features. This ensures that the features are executed in the desired order.

The steps of the shaping process are outlined below. Not all of the features listed must be used by all languages using the Sinhala script.

Shaping features:

Localized forms
1. Apply feature ‘locl’ to preprocess any localized forms for the current language
Basic shaping forms
1. Apply feature ‘ccmp’ to preprocess any glyphs that require composition or decomposition
2. Apply feature ‘akhn’ to get akhand ligature forms
3. Apply feature ‘rphf’ to get repaya glyph forms
4. Apply feature ‘vatu’ to get rakaaraansaya and yansaya glyph forms
5. Apply feature ‘pstf’ to get post-base forms of split-matra glyphs
Presentation forms
1. Apply feature ‘pres’ to substitute prebase glyph forms
2. Apply feature ‘abvs’ to substitute above-base glyph forms
3. Apply feature ‘blws’ to substitute below-base glyph forms
4. Apply feature ‘psts’ to substitute postbase glyph forms

Note: since the presentation form features are applied simultaneously over the entire cluster, several features are operationally equivalent to a single feature. Multiple features are provided as an aid for font developers to organize the lookups they implement.

Apply OpenType GPOS features

The shaping engine next processes the GPOS (glyph positioning) table, applying features concerned with positioning. All features are applied simultaneously to the entire cluster.

The font developer must consider the effects of re-ordering when creating the GPOS feature and lookup tables (i.e., the glyphs will be in the order they were in after the GSUB presentation forms features were applied).

Kerning
1. Apply feature ‘kern’ to provide pair kerning between glyphs for better typographic quality. Note this feature may be disabled by some applications
2. Apply feature ‘dist’ to make any required distance adjustments
Mark placement
1. Apply feature ‘abvm’ to position diacritic glyphs above the base glyph
2. Apply feature ‘blwm’ to position diacritic glyphs below the base glyph

Features

The features listed below have been defined to create the basic forms for the scripts and languages that are supported on Sinhala systems. Regardless of the model an application chooses for supporting layout of complex scripts, Uniscribe requires a fixed order for executing features within a run of text to consistently obtain the proper basic form. This is achieved by calling features one-by-one in the standard order listed below.

The order of the lookups within each feature is also very important. For more information on lookups and defining features in OpenType fonts, see Encoding feature information in the OpenType font development section.

The standard order for applying Sinhala features encoded in OpenType fonts:

Feature	Feature function	Layout operation	Required
Localized forms
locl		GSUB
Basic shaping forms
ccmp	Character composition/decomposition substitution	GSUB
akhn	Akhand ligature substitution	GSUB	X
rphf	Repaya substitution	GSUB	X
vatu	Rakaaraansaya and yansaya substitution	GSUB	X
pstf	Post-base form substitution	GSUB	X
Presentation forms
pres	Pre-base substitution	GSUB
abvs	Above-base substitution	GSUB
blws	Below-base substitution	GSUB
psts	Post-base substitution	GSUB
Kerning:
kern	Pair kerning	GPOS
dist	Distance adjustments	GPOS
Mark placement
abvm	Above-base mark positioning	GPOS
blwm	Below-base mark positioning	GPOS
[GSUB = glyph substitution, GPOS = glyph positioning]

Feature examples

The registered features described and illustrated in this document are based on the Microsoft OpenType font Iskoola Pota (iskpota.ttf). Iskoola Pota contains layout information and glyphs to support all of the required features for the Sinhala script.

The illustrations in the following examples show the result of that particular feature being applied. Features must be written to match glyph sequences after re-ordering has occurred. Note that the input context for a feature may be the result of a previous feature having already been applied.

Localized forms

Feature Tag: 'locl'

This feature is used in association with OpenType language system tags to trigger lookups that will select alternate glyphs needed for language-specific typographic conventions. The ‘locl’ should not be used in association with the default language system, but only used with other language system tags. See the Appendix of this document for language system tags associated with the Sinhala script. The Iskoola Pota font does not use this feature.

Basic shaping forms

Feature Tag: 'ccmp'

The composition/decomposition feature may be used to trigger lookups to form particular composed or decomposed forms. It should not be used to create decompositions of the split matras as this is done by the shaping engine before any features are applied. The Iskoola Pota font does not use this feature.

Feature Tag: 'akhn'

The akhand ligature feature is used to create all required ligatures including touching consonants used in Pali and Sanskrit.

Ligatures that are used in Sinhala text are encoded using the sequence Halant ZWJ (U+0DCA U+200D) between the two components that are to be joined.

Screenshot that shows the 'a k h n' feature is used to create all required ligatures including touching consonants used in Pali and Sanskrit.
Lookup example:

Sequence	Result
		ක්‍ව
U+0D9A U+0DCA U+200D U+0DC0	Image	Text

Touching consonants that are used in Pali and Sanskrit are encoded with the sequence ZWJ Halant (U+200D U+0DCA) between the two components that are to be joined.

Screenshot that shows how touching consonants that are used in Pali and Sanskrit are handled with the 'a k h n' feature.
Lookup example:

Sequence	Result
		ක‍්ක
U+0D9A U+200D U+0DCA U+0D9A	Image	Text

Feature Tag: 'rphf'

This feature is used to substitute the repaya form.

Screenshot that shows the 'r p h f' feature is used to substitute the repaya form.
Example:

Sequence	Result
		ර්‍ක
U+0DBB U+0DCA U+200D U+0D9A	Image	Text

Feature Tag: 'vatu'

This feature is used to substitute the rakaaraansaya and yansaya forms.

Screenshot that shows the 'vatu' feature is used to substitute the rakaaraansaya and yansaya forms.
Rakaaraansaya example:

Sequence	Result
		ක්‍ර
U+0D9A U+0DCA U+200D U+0DBB	Image	Text

Yansaya example:

Sequence	Result
		ක්‍ය
U+0D9A U+0DCA U+200D U+0DBA	Image	Text

Feature Tag: 'pstf'

This feature is used to convert a split matra into the corresponding second half of the split matra.

Screenshot that shows the 'p s t f' feature is used to convert a split matra into the corresponding second half of the split matra.
Lookup example:

Sequence	Result
		කො
U+0D9A U+0DDC	Image	Text

Presentation forms

Feature Tag: 'pres'

This feature may be used to substitute presentation forms involving pre-base elements. The Iskoola Pota font does not use this feature.

Feature Tag: 'abvs'

This feature is used to substitute ligatures involving a base glyph and an above mark.

Screenshot that shows the 'a b v s' feature is used to substitute ligatures involving a base glyph and an above mark.
Lookup example:

Sequence	Result
		කි
U+0D9A U+0DD2	Image	Text

This should also include marks that are the outcome of previous lookups and reordering such as rakaaraansaya and yansaya.

Screenshot that shows the inclusion of marks that are the outcome of previous lookups and reordering such as rakaaraansaya and yansaya.
Rakaaraansaya + yansaya example:

Sequence	Result
		ර්‍ය්‍ය
U+0DBB ^{(rakaaraansaya)} U+0DCA U+200D U+0DBA U+0DCA ^(yansaya) U+200D U+0DBA	Image	Text

Sequence

Result

Illustration that shows rakaaraansaya + yansaya rendered as an image when using the 'a b v s' feature.

ර්‍ය්‍ය

U+0DBB ^{(rakaaraansaya)} U+0DCA U+200D U+0DBA

U+0DCA ^(yansaya) U+200D U+0DBA

Image

Text

Feature Tag: 'blws'

This feature is used to substitute ligatures involving a base glyph and a below mark.

Screenshot that shows the 'b l w s' feature is used to substitute ligatures involving a base glyph and a below mark.
Lookup example:

Sequence	Result
		ඛු
U+0D9B U+0DD4	Image	Text

Feature Tag: 'psts'

This feature is used to substitute ligatures involving a base glyph and a post-base element.

Screenshot that shows the 'p s t s' feature is used to substitute ligatures involving a base glyph and a post-base element.
Lookup example:

Sequence	Result
		රැ
U+0DBB U+0DD0	Image	Text

Kerning

Feature Tag: 'kern'

This feature may be used to adjust the positioning of glyph pairs. The Iskoola Pota font does not use this feature.

Feature Tag: 'dist'

This feature may be used to adjust distances. The Iskoola Pota font does not use this feature.

Mark placement

Feature Tag: 'abvm'

This feature is used to position marks above base glyphs. The Iskoola Pota font uses this feature to position marks relative to the dotted circle placeholder glyph.

Screenshot that shows the 'a b v m' feature is used to position marks above base glyphs.

Feature Tag: 'blwm'

This feature is used to position marks below base glyphs. The Iskoola Pota font uses this feature to position marks relative to the dotted circle placeholder glyph.

Screenshot that shows the 'b l w m' feature is used to position marks below base glyphs.

Handling invalid combining marks

Combining marks and signs that do not occur in conjunction with a valid base are considered invalid. Shaping engine implementations may adopt different strategies for how invalid marks are handled. For example, a shaping engine implementation might treat an invalid mark as a separate cluster and display the stand-alone mark positioned on some default base glyph, such as a dotted circle (U+25CC). (See Fallback Rendering in section 5.13 of the Unicode Standard 4.0.) Shaping engine implementations may vary somewhat with regard to what sequences are or are not considered valid. For instance, some implementations may impose a limit of at most one above-base vowel mark while others may not.

To allow for shaping engine implementations that expect to position an invalid mark on a dotted circle, it is recommended that a Sinhala OT font contain a glyph for the dotted circle character, U+25CC. If this character is not supported in the font, such implementations will display invalid signs on the missing glyph shape (white box).

Recommended Glyphs

Unicode code points that are strongly recommended for inclusion in any Sinhala font are:

Code point	Description
U+200B	Zero Width Space
U+200C	Zero Width Non-Joiner
U+200D	Zero Width Joiner
U+25CC	Dotted Circle

Appendix

Appendix: Writing system and language tags

Features are encoded according to both a designated script and language system. The language system tag specifies a typographic convention associated with a language or linguistic subgroup. For example, there are different language systems defined for the Sinhala script; Sinhala, Pali, and Sanskrit.

Currently, the Uniscribe engine only supports the “default” language for each script. However, font developers may want to build language specific features which are supported in other applications and will be supported in future Microsoft OpenType implementations.

NOTE: It is strongly recommended to include the “dflt” language tag in all OpenType fonts because it defines the basic script handling for a font. The “dflt” language system is used as the default if no other language specific features are defined or if the application does not support that particular language. If the “dflt” tag is not present for the script being used, the font may not work in some applications.

The following tables list the registered tag names for scripts and language systems.

Registered tags for the Sinhala script		Registered tags for Sinhala language systems
Script tag	Script	Language system tag	Language
'sinh'	Sinhala	'dflt'	*default script handling
		'SNH '	Sinhala
		'PAL '	Pali
		“SAN”	Sanskrit

Note: both the script and language tags are case sensitive (script tags should be lowercase, language tags are all caps) and must contain four characters (ie. you must add a space to the three character language tags).

Share via