Analyze the text
Character properties
The shaping engine divides the text into syllable clusters and identifies character
properties.
Character properties are used in parsing syllables and identifying their parts, in
determining proper character or glyph reordering and in OpenType feature application.
Properties for each character are divided into two types: static properties and
dynamic properties.
Static properties define basic characteristics that do not change from
font to font: character type (consonant, matra, vedic sign, etc.) or type of matra
reordering. They differ from script to script, but can’t be controlled by
the font developer.
Dynamic properties are font dependent and are retrieved by the shaping
engine as the font is loaded. These properties affect shaping and reordering behavior.
*Note: in old shaping-engine implementations, all consonant properties were static:
consonants were assumed to have particular conjoining forms. In the new implementation
model, consonant conjoining behavior is a dynamic property.
Retrieving dynamic character properties from Indic fonts
Fonts define dynamic properties for consonants through implementing standard features.
Consonant types (and corresponding feature tags) that the shaping engine reads from
the font are:
Reph <rphf>
Half forms <half>
Pre-base-reordering forms of Ra/Rra <pref>
Below-base forms <blwf>
Post-base forms <pstf>
Each of the features above is applied together with the <locl> feature to input
sequences consisting of two characters: for <rphf> and <half>, features
are applied to Consonant + Halant combinations; for <pref>, <blwf>
and <pstf>, features are applied to Halant + Consonant combinations.
This is done for each consonant. If these two glyphs form a ligature, with no additional
glyphs in context, this means the consonant has the corresponding form. For instance,
if a substitution occurs when the <half> and <locl> features are applied
to a sequence Da + Halant, then Da is classified as having a half form.
Note that a font may be implemented to re-order a Ra to pre-base position only in
certain syllables and display it as a below-base or post-base form otherwise. This
means that the Pre-base-form classification is not mutually exclusive with either
Below-base-form or Post-base-form classifications. However, all classifications
are determined as described above using context-free substitutions.
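The classification procedure above can be sketched as follows. This is a minimal illustration, not an engine implementation: `apply_features` is a hypothetical callback standing in for whatever GSUB machinery the engine uses, and the glyph IDs are placeholders. The sketch assumes the new-implementation input order and treats "two glyphs became one" as the sign that a ligature formed.

```python
from typing import Callable, Dict, List, Set

# Feature tag -> True when the feature is tested on Consonant + Halant,
# False when it is tested on Halant + Consonant.
FORM_FEATURES: Dict[str, bool] = {
    "rphf": True,
    "half": True,
    "pref": False,
    "blwf": False,
    "pstf": False,
}

# Hypothetical signature: takes glyph IDs and the feature tags to enable,
# returns the glyph IDs after GSUB substitution.
ApplyFeatures = Callable[[List[int], List[str]], List[int]]


def consonant_forms(consonant: int, halant: int,
                    apply_features: ApplyFeatures) -> Set[str]:
    """Return the feature tags for which this consonant has a form."""
    forms: Set[str] = set()
    for tag, consonant_first in FORM_FEATURES.items():
        pair = [consonant, halant] if consonant_first else [halant, consonant]
        result = apply_features(pair, ["locl", tag])
        # The two input glyphs collapsing into a single glyph is taken
        # to mean the consonant has the corresponding form.
        if len(result) == 1:
            forms.add(tag)
    return forms
```

Because the test sequences carry no surrounding context, a font that restricts a form to certain syllables with contextual lookups is still classified by its context-free substitutions only.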
Font-dependent character classification only defines consonant types. Reordering
positions, however, are fixed for each character class.
*Note: for fonts that support the old implementation, all features are applied to
Consonant + Halant sequences.
Indic input processing
The following steps should be repeated while there are characters left in the input
sequence. All shaping operations are done on a syllable-by-syllable basis, independent
from other characters.
Find next syllable in the input
The engine should find the character sequence matching one of the patterns below:
Consonant syllable:
{C+[N]+<H+[<ZWNJ|ZWJ>]|<ZWNJ|ZWJ>+H>} +
C+[N]+[A] + [< H+[<ZWNJ|ZWJ>] | {M}+[N]+[H]>]+[SM]+[(VD)]
Vowel-based syllable:
[Ra+H]+V+[N]+[<[<ZWJ|ZWNJ>]+H+C|ZWJ+C>]+[{M}+[N]+[H]]+[SM]+[(VD)]
Stand Alone cluster (at the start of the word only):
#[Ra+H]+NBSP+[N]+[<[<ZWJ|ZWNJ>]+H+C>]+[{M}+[N]+[H]]+[SM]+[(VD)]
Where
| { } | zero or more occurrences |
| [ ] | optional occurrence |
| <|> | “one of” |
| ( ) | one or two occurrences |
| C | consonant |
| V | independent vowel |
| N | nukta |
| H | halant/virama |
| ZWNJ | zero width non-joiner |
| ZWJ | zero width joiner |
| M | matra (up to one of each type: pre-, above-, below- or post- base) |
| SM | syllable modifier sign |
| VD | vedic |
| A | anudatta (U+0952) |
| NBSP | NO-BREAK SPACE |
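One possible way to prototype the syllable patterns is to map each character to a one-letter category and express the patterns as regular expressions over the category string. The sketch below makes several assumptions that are not part of this specification: the code point ranges in `category` are approximate, the category letters (`j` for ZWJ, `n` for ZWNJ, `B` for NBSP, `D` for a vedic sign, and so on) are invented for the example, and the word-start condition (`#`) on the stand-alone cluster is not checked.

```python
import re


def category(ch: str) -> str:
    """Map a character to a one-letter class (approximate Gujarati ranges)."""
    cp = ord(ch)
    if cp == 0x0AB0:
        return "R"  # Ra (also a consonant)
    if 0x0A95 <= cp <= 0x0AB9:
        return "C"  # consonant
    if 0x0A85 <= cp <= 0x0A94:
        return "V"  # independent vowel
    if 0x0ABE <= cp <= 0x0ACC:
        return "M"  # matra
    if 0x0A81 <= cp <= 0x0A83:
        return "S"  # syllable modifier sign
    single = {0x0ABC: "N", 0x0ACD: "H", 0x200D: "j", 0x200C: "n",
              0x0952: "A", 0x0951: "D", 0x00A0: "B"}
    return single.get(cp, "X")  # X: not part of any pattern


CONS = r"[CR]"
JOIN = r"[jn]"
TAIL = r"S?D{0,2}"  # [SM]+[(VD)]
CONSONANT_SYLLABLE = (rf"(?:{CONS}N?(?:H{JOIN}?|{JOIN}H))*"
                      rf"{CONS}N?A?(?:H{JOIN}?|M*N?H?)?{TAIL}")
VOWEL_SYLLABLE = rf"(?:RH)?VN?(?:{JOIN}?H{CONS}|j{CONS})?(?:M*N?H?)?{TAIL}"
STANDALONE = rf"(?:RH)?BN?(?:{JOIN}?H{CONS})?(?:M*N?H?)?{TAIL}"
PATTERNS = [re.compile(p)
            for p in (CONSONANT_SYLLABLE, VOWEL_SYLLABLE, STANDALONE)]


def next_syllable_end(text: str, start: int = 0) -> int:
    """Return the index just past the syllable that begins at `start`."""
    cats = "".join(category(ch) for ch in text)
    ends = [m.end() for p in PATTERNS if (m := p.match(cats, start))]
    # Prefer the longest match; an unmatched character is its own cluster.
    return max(ends, default=start + 1)
```

Taking the longest of the three matches matters for input such as Ra + Halant + independent vowel, where the consonant pattern alone would stop after the halant.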
Identify key positions inside syllable
Syllable structure consists of the following parts:
Reph + HalfConsonant(s) + MainConsonant(s) + BelowBaseConsonant(s) + PostBaseConsonant(s)
+ PreBaseReorderingRa + MatrasAndSigns
The consonant parts include all associated halants and nuktas. (For example, an
instance of BelowBaseConsonant consists of a sequence of Halant + Below-base-forming
Consonant.) All parts are optional, except the main consonant.
All parts are shown in the order they would occur within a syllable, with one qualification:
depending on a font implementation, PreBaseReorderingRa may occur before all BelowBaseConsonants,
after BelowBaseConsonants and before PostBaseConsonants, or after PostBaseConsonants.
Also, a font may be implemented to re-order a Ra to pre-base position only in certain
syllables and display it as a below-base or post-base form otherwise. Thus, final
determination of whether an occurrence of Ra in a specific syllable can be treated
as a pre-base reordering Ra can be made only after the <pref> feature has
been applied to that syllable.
There could be several main consonants in the case where more than one consonant
doesn’t have a half-, below-base, post-base or pre-base form. In a case of
a cluster where the first consonant does not have a half form, the shaping engine
will recognize it as the 1st ‘full form’ and go on to identify the 2nd
full form consonant, if there is one. This information will then be used to determine
the reordering behavior of the reph or any matras, vowel modifiers or stress marks.
All other elements are classified by their position relative to the base: pre-base
(half forms and reordering pre-base Ra forms), below-base, above-base and post-base.
Indic clusters are subject to the following constraints:
- Only one reph is allowed per syllable.
- Only one pre-base reordering Ra is allowed per syllable.
- A nukta can be placed on a consonant, matra or independent vowel. It cannot be
placed on a pre-composed nukta character.
- One matra from each positioning class is permitted (exception in the Kannada script).
A composite matra is treated as belonging to all the classes to which its components
belong.
- One syllable modifier sign is allowed per cluster.
- Vedic signs are combining marks (used for Sanskrit) that should be included in
all Indic scripts.
- Danda and Double Danda are punctuation marks that should be included in all Indic
scripts.
Reorder characters
Once the Indic shaping engine has analyzed the cluster
as described above, it creates and manages a buffer of appropriately reordered elements
(glyphs) representing the cluster, according to several rules (described below).
The OpenType lookups in an Indic font must be written to match glyph sequences after
re-ordering has occurred. OpenType fonts should not have substitutions that attempt
to perform the re-ordering. If a font developer attempted to encode such reordering
information in an OpenType font, they would need to add a huge number of many-to-many
glyph mappings to cover the general algorithms that a shaping engine will use.
- Find base consonant: The shaping engine finds the base consonant
of the syllable, using the following algorithm: starting from the end of the syllable,
move backwards until a consonant is found that does not have a below-base or post-base
form (post-base forms have to follow below-base forms), or that is not a pre-base
reordering Ra, or arrive at the first consonant. The consonant stopped at will be
the base.
- If the syllable starts with Ra + Halant (in a script that has Reph) and
has more than one consonant, Ra is excluded from candidates for base consonants.
- Decompose and reorder Matras: Each matra and any syllable modifier
sign in the cluster are moved to the appropriate position relative to the consonant(s)
in the cluster. The shaping engine decomposes two- or three-part matras
into their constituent parts before any repositioning. Matra characters are classified
by which consonant in a conjunct they have affinity for and are reordered to the
following positions:
- Before first half form in the syllable
- After subjoined consonants
- After post-form consonant
- After main consonant (for above marks)
- Reorder marks to canonical order: Adjacent nukta and halant or
nukta and vedic sign are always repositioned if necessary, so that the nukta is
first.
- Final reordering: After the localized forms and
basic shaping forms GSUB features have been applied (see below), the shaping
engine performs some final glyph reordering before applying all the remaining font
features to the entire cluster.
- Reorder matras: If a pre-base matra character had been reordered before
applying basic features, the glyph can be moved closer to the main consonant based
on whether half-forms had been formed. Actual position for the matra is defined
as “after last standalone halant glyph, after initial matra position and before
the main consonant”. If ZWJ or ZWNJ follow this halant, position is moved
after it.
- Reorder reph: Reph’s original position is always at the beginning
of the syllable (i.e., it is not reordered at the character reordering stage). However,
it will be reordered according to the basic-forms shaping results. Possible positions
for reph, depending on the script, are: after main, before post-base consonant forms,
and after post-base consonant forms.
1. If reph should be positioned after post-base consonant forms, proceed to step
5.
2. If the reph repositioning class is not after post-base: the target position is after
the first explicit halant glyph between the first post-reph consonant and the last main
consonant. If ZWJ or ZWNJ follows this halant, the position is moved after it.
If such a position is found, this is the target position. Otherwise, proceed to the
next step. Note: in old-implementation fonts, where classifications were fixed
in the shaping engine, there was no case where the reph position would be found at this step.
3. If reph should be repositioned after the main consonant: find the first consonant
not ligated with main, or find the first consonant that is not a potential pre-base
reordering Ra.
4. If reph should be positioned before post-base consonant, find the first post-base
classified consonant not ligated with main. If no consonant is found, the target
position should be before the first matra, syllable modifier sign or vedic sign.
5. If no consonant is found in steps 3 or 4, move reph to a position immediately
before the first post-base matra, syllable modifier sign or vedic sign that has
a reordering class after the intended reph position. For example, if the
reordering position for reph is post-main, it will skip above-base matras that also
have a post-main position.
6. Otherwise, reorder reph to the end of the syllable.
- Reorder pre-base reordering consonants: If a pre-base reordering consonant
is found, reorder it according to the following rules:
- Only reorder a glyph produced by substitution during application of the <pref>
feature. (Note that a font may shape a Ra consonant with the <pref> feature
generally but block it in certain contexts.)
- Try to find a target position the same way as for pre-base matra. If it is found,
reorder pre-base consonant glyph.
- If position is not found, reorder immediately before main consonant.
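The base-consonant search described in the "Find base consonant" rule can be sketched as a backwards scan. This is a simplified illustration under stated assumptions: the input list holds only the consonants of one syllable in logical order (halants and nuktas are omitted), and the `Consonant` record with its flags is a hypothetical structure whose values would come from the font-dependent classification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Consonant:
    name: str
    has_below_base_form: bool = False
    has_post_base_form: bool = False
    is_pre_base_reordering_ra: bool = False


def find_base(consonants: List[Consonant],
              starts_with_reph: bool = False) -> int:
    """Return the index of the base consonant within `consonants`."""
    # A leading Ra + Halant is excluded from the candidates when the
    # syllable has more than one consonant.
    first = 1 if starts_with_reph and len(consonants) > 1 else 0
    seen_below = False
    for i in range(len(consonants) - 1, first, -1):
        c = consonants[i]
        if c.is_pre_base_reordering_ra:
            continue
        # Post-base forms have to follow below-base forms, so a consonant
        # only counts as post-base if no below-base form follows it.
        if c.has_post_base_form and not seen_below:
            continue
        if c.has_below_base_form:
            seen_below = True
            continue
        return i
    # Arrived at the first candidate consonant.
    return first
```

For example, in a two-consonant syllable whose second consonant has a post-base form, the scan skips that consonant and settles on the first one as the base.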
Character reordering Classes for Gujarati:
| 0AB0 (reph) | BeforePostscript |
| 0ABF | BeforeHalf |
| 0AC5, 0AC7, 0AC8 | AfterSubscript |
| 0AC1-0AC4, 0AE2, 0AE3 | AfterPostscript |
| 0ABE, 0AC0, 0AC9, 0ACB, 0ACC | AfterPostscript |
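For prototyping, the reordering classes can be held in a simple lookup keyed by code point. The sketch below is one way to encode the table; the helper names are invented for the example, and U+0AC7 (GUJARATI VOWEL SIGN E) is used for the second entry of the AfterSubscript row.

```python
from typing import Dict, Optional


def _expand(*items):
    """Expand single code points and (first, last) ranges into a list."""
    cps = []
    for item in items:
        if isinstance(item, tuple):
            cps.extend(range(item[0], item[1] + 1))
        else:
            cps.append(item)
    return cps


REORDER_CLASS: Dict[int, str] = {}
for _name, _cps in [
    # Applies to Ra only when it forms a reph.
    ("BeforePostscript", _expand(0x0AB0)),
    ("BeforeHalf", _expand(0x0ABF)),
    ("AfterSubscript", _expand(0x0AC5, 0x0AC7, 0x0AC8)),
    ("AfterPostscript", _expand((0x0AC1, 0x0AC4), 0x0AE2, 0x0AE3,
                                0x0ABE, 0x0AC0, 0x0AC9, 0x0ACB, 0x0ACC)),
]:
    for _cp in _cps:
        REORDER_CLASS[_cp] = _name


def reorder_class(ch: str) -> Optional[str]:
    """Return the reordering class for a character, or None if it has none."""
    return REORDER_CLASS.get(ord(ch))
```

A caller would still need to check that a U+0AB0 actually forms a reph before using its class, since Ra as an ordinary consonant is not reordered.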
Shape glyph sequences (GSUB processing)
All characters from a string are first mapped to their nominal glyphs using the
cmap lookup. The shaping engine then proceeds to shape (substitute) the glyphs using
GSUB lookups.
The features for localized forms and basic shaping forms are applied
one at a time to the cluster or a relevant portion of the cluster.
The results after basic shaping forms features have been applied impact
the final syllable analysis in terms of final designation of Ra as a pre-base reordering
form and final reordering positions for reph and matras. Next, the features for
presentation forms are applied to the entire cluster simultaneously. Note:
since the presentation form features are applied simultaneously over the
entire cluster, several features are operationally equivalent to a single feature.
Multiple features are provided as an aid for font developers to organize the lookups
they implement.
Note: final reordering occurs after features for basic shaping forms have
been applied and before features for presentation forms are applied. Font
developers must consider the effects of initial reordering (before any features
are applied) and final reordering (after basic shaping forms features have
been applied) when they create GSUB feature and lookup tables.
These predefined features are described and illustrated in the
Features section and are applied in the order below.
Shaping features:
Localized forms
- Apply feature 'locl' to select language-specific forms.
Basic Shaping forms
- Apply feature 'nukt' to substitute nukta forms of consonants.
- Apply feature 'akhn' to substitute required akhand ligatures, or to substitute
forms that take precedence over forms produced by features applied later.
- Apply feature 'rphf' to substitute reph glyph (above-base form of 'Ra').
- Apply feature 'rkrf' to substitute any rakaar ligatures.
- Apply feature 'blwf' to substitute below-base forms.
- Apply feature 'half' to substitute half forms of pre-base consonants.
- Apply feature 'vatu' to substitute ligature consonant-vattu or conjunct-vattu
forms for sequences of a consonant or conjunct glyph (full or half form) followed
by the below-base rakaar mark. (This feature is not needed if the rkrf feature is
used, but is available for old-behavior implementations).
- Apply feature 'cjct' to substitute conjunct forms. (This is needed particularly
for ligature conjunct forms when the pre-base consonant does not have a half form).
Presentation forms
- Apply feature 'pres' to substitute pre-base consonant conjuncts and pre-base
matra conjuncts. (i.e., consonant and matra conjuncts to the left of the base glyph).
- Apply feature 'abvs' to substitute above-base matra conjuncts, reph conjuncts,
above-base vowel modifiers and above-base stress and tone marks.
- Apply feature 'blws' to substitute below-base consonant conjuncts, below-base
matra conjuncts, below-base vowel modifier forms and below-base stress and tone
mark forms.
- Apply feature 'psts' to substitute post-base consonant conjuncts, post-base
matra conjuncts and post-base vowel modifiers.
- Apply feature 'haln' to substitute the halant form of base (or conjunct
base) glyph in syllables ending with a halant.
- Apply feature 'calt' to substitute the contextual alternate of a consonant.
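The ordering of GSUB passes and reordering steps described in this section can be summarized as a plan that an engine walks through for each cluster. The sketch below only encodes the sequence; the step names `initial_reorder` and `final_reorder` and the tuple layout are illustrative choices, not part of any API.

```python
from typing import List, Tuple

LOCALIZED = ["locl"]
BASIC = ["nukt", "akhn", "rphf", "rkrf", "blwf", "half", "vatu", "cjct"]
PRESENTATION = ["pres", "abvs", "blws", "psts", "haln", "calt"]


def gsub_plan() -> List[Tuple[str, List[str]]]:
    """Return the per-cluster steps as (action, feature tags) pairs."""
    # Initial reordering happens before any feature is applied.
    plan: List[Tuple[str, List[str]]] = [("initial_reorder", [])]
    # Localized and basic shaping forms are applied one feature at a time.
    plan += [("apply", [tag]) for tag in LOCALIZED + BASIC]
    # Final reordering uses the results of the basic shaping forms.
    plan.append(("final_reorder", []))
    # Presentation forms are applied together over the entire cluster.
    plan.append(("apply", list(PRESENTATION)))
    return plan
```

Keeping the presentation features in a single step reflects the note that, operationally, they behave like one feature.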
Position glyph sequences (GPOS processing)
The shaping engine next processes the GPOS (glyph positioning) table, applying features
concerned with positioning. All features are applied simultaneously to the entire
cluster.
The font developer must consider the effects of re-ordering when creating the GPOS
feature and lookup tables (i.e., the glyphs will be in the order they were in after
the GSUB presentation forms features were applied).
Positioning features:
Kerning
- Apply feature 'kern' to adjust distances (e.g., to provide kerning between
post- or pre-base elements and the base glyph).
- Apply feature 'dist' to adjust distances. (NOTE – the feature ‘dist’
can be used in the same way as the ‘kern’ feature. The advantage of
using the ‘dist’ feature is that it does not rely on the application
to enable kerning. Therefore, if you want to make sure certain spacing adjustments
will always be displayed, you should use the ‘dist’ feature).
Above-base marks
- Apply feature 'abvm' to position above-base forms, vowel modifiers and/or
stress/tone marks (on base glyph or post-base matra).
Below-base marks
- Apply feature 'blwm' to position below-base forms, vowel modifiers and/or
stress/tone marks.
Base elements
Commonly, a feature is required for dealing with the base glyph and one of the post-base,
pre-base or above-base elements. Since it is not possible to reorder ALL of these
elements next to the base glyph, we need to skip over the elements "in the
middle" (reordering-wise).
The solution is to assign different mark attachment classes to different elements
of the syllable and positional forms, and in any given lookup work with one mark
type only. For example, in above-base substitutions we need only consider above-base
elements most of the time.
Generally, it is good practice to label as "mark" glyphs that are denoted
as marks in the Unicode Standard as well as below-base/above-base forms of consonants.
Then, different attachment classes should be assigned to different marks depending
on their position with respect to the base.
For example, after the shaping engine has re-ordered
elements within the cluster, matras will always occur before syllable modifier sign
such as the candrabindu. In an actual sequence, though, potentially some
other mark glyph, such as nukta, may occur between the matra and the candrabindu.
Thus, when processing the matra and candrabindu, you may need to allow for the possibility
that some other mark glyph(s) may occur between them. Using lookup flags, you can
specify that a lookup should process only a certain class of marks,
such as ‘above-base marks’, and ignore all other marks. In that way,
a match will occur whether or not a mark from another class is present. Otherwise,
the lookup would fail to apply.
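The effect of restricting a lookup to one mark class can be illustrated with a small matcher. This is a conceptual sketch only: it does not reproduce the OpenType lookup-flag encoding, and the glyph names and class names (`top`, `bottom`) are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def match_sequence(glyphs: Sequence[str], start: int,
                   pattern: Sequence[str],
                   mark_class: Dict[str, str],
                   process_class: Optional[str]) -> Optional[List[int]]:
    """Match `pattern` from `start`, skipping marks of other classes.

    `process_class=None` means every mark is processed (nothing skipped).
    Returns the matched glyph indices, or None if the pattern fails.
    """
    matched: List[int] = []
    i = start
    for wanted in pattern:
        # Skip marks that belong to a class this lookup does not process.
        while i < len(glyphs):
            cls = mark_class.get(glyphs[i])
            if process_class is not None and cls not in (None, process_class):
                i += 1
                continue
            break
        if i >= len(glyphs) or glyphs[i] != wanted:
            return None
        matched.append(i)
        i += 1
    return matched
```

With a nukta assigned to a `bottom` class between a matra and a candrabindu in the `top` class, a lookup restricted to `top` still matches the pair, while a lookup that processes all marks does not.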
Using Microsoft VOLT, you can assign glyphs to attachment classes.
In the example below this ‘abvm’ feature was set to process only
TopMarks, therefore the presence of another mark class would be
ignored. If Process ALL was used and another mark glyph followed
the matra, this positioning lookup would fail to apply. This example comes from
the Devanagari font Mangal.
Invalid combining marks
Combining marks and signs that do not occur in conjunction with a valid base are
considered invalid. Shaping engine implementations may adopt different
strategies for how invalid marks are handled. For example, a shaping engine implementation
might treat an invalid mark as a separate cluster and display the stand-alone mark
positioned on some default base glyph, such as a dotted circle. (See Fallback
Rendering in section 5.13 of the Unicode Standard 4.0.) Shaping engine implementations
may vary somewhat with regard to what sequences are or are not considered valid.
For instance, some implementations may impose a limit of at most one above-base
vowel mark while others may not.
To allow for shaping engine implementations that expect to position an invalid mark
on a dotted circle, it is recommended that a Gujarati OT font contain a glyph for
the dotted circle character, U+25CC. If this character is not supported in the font,
such implementations will display invalid signs on the missing glyph shape (white
box).
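One strategy mentioned above, placing a stand-alone mark on a dotted circle, might be prototyped as below. The validity test here is a deliberate simplification and an assumption of this sketch: a mark is treated as valid if it follows a letter, another mark or NBSP (format characters such as ZWJ and ZWNJ are ignored), which is far looser than real syllable analysis.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"
NBSP = "\u00A0"


def add_dotted_circles(text: str) -> str:
    """Insert U+25CC before combining marks that have no base."""
    out = []
    has_base = False
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("M"):
            if not has_base:
                out.append(DOTTED_CIRCLE)
                has_base = True  # the dotted circle now acts as the base
        elif cat != "Cf":
            # Letters and NBSP can carry marks; other characters cannot.
            has_base = cat.startswith("L") or ch == NBSP
        out.append(ch)
    return "".join(out)
```

An engine that follows this approach needs the font to supply a glyph for U+25CC, which is why the character is recommended for inclusion.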
In addition to the 'dotted circle', other Unicode code points that are recommended
for inclusion in any Gujarati font are the ZWNJ (zero width non-joiner; U+200C),
the ZWJ (zero width joiner; U+200D) and the ZWSP (zero width space; U+200B). For
more information see the
Suggested glyphs section of the OpenType Font Development document.
Effect of ZWJ, ZWNJ and NBSP on Consonant Shaping
Unicode defines specific behaviors for ZWJ and ZWNJ in relation to Indic scripts.
The Indic-specific behavior retains the general behavior that ZWJ requests connection
between text elements while ZWNJ inhibits connection between text elements.
- The main intent of using ZWJ in this context is to prevent a ligature-conjunct
from forming (and in Devanagari or Gujarati, to request a half form, below-base
form or post-base form instead). The Indic engine does not need to take any action
to prevent ligature-conjunct formation: the presence of ZWJ will prevent GSUB substitution
lookups from matching the input glyph sequence. If the first consonant does not
have a half form, an overt-halant form should result, which would also happen with
no particular action by the engine.
- A secondary intent of using ZWJ in this context is to prevent the display
of reph in the case that the first consonant is RA. If a cluster begins
with RA H (halant) ZWJ, the engine must ensure that the ‘rphf’ feature
is not applied, and that re-ordering for reph does not take place. Note that use
of either joiner in this context should prevent formation and re-ordering of reph
when RA is the first consonant.
- The main intent of using ZWNJ is to prevent conjunct ligature or half forms from
forming, and to display an explicit halant form instead. The shaping engine must
take specific actions to prevent half forms for a sequence of Consonant + Halant
+ ZWNJ.
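The two cases in which the engine must act explicitly can be expressed as small predicates. This is a sketch under assumptions: the syllable is given as a string of characters in logical order, the function names are invented for the example, and a syllable consisting of Ra + Halant alone is assumed not to form a reph.

```python
ZWJ = "\u200D"
ZWNJ = "\u200C"
RA = "\u0AB0"      # GUJARATI LETTER RA
HALANT = "\u0ACD"  # GUJARATI SIGN VIRAMA


def reph_allowed(syllable: str) -> bool:
    """True if an initial Ra + Halant may be shaped and reordered as reph."""
    if not syllable.startswith(RA + HALANT):
        return False
    # Either joiner after the halant blocks 'rphf' and reph reordering.
    return syllable[2:3] not in ("", ZWJ, ZWNJ)


def half_form_blocked(syllable: str, halant_index: int) -> bool:
    """True if the halant at `halant_index` is followed by ZWNJ, so the
    'half' feature must not be applied to the preceding consonant."""
    return syllable[halant_index + 1:halant_index + 2] == ZWNJ
```

No equivalent check is needed to stop ligature conjuncts after ZWJ, since the joiner glyph itself keeps the GSUB lookups from matching.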
The following example illustrates these behaviors:
Just as the ZWJ can be used to display a half form in isolation, it can also be
used to display a mark, sub- or post-base form in isolation. Unlike the stand-alone
half form, however, sequences to display them must begin with a no-break space (NBSP).
This is because marks, sub- and post-base forms have zero width, so they must be
placed on the NBSP. For example, to get a shape of I-matra without the
dotted circle one should type NBSP + I-matra.
In the illustration below the I-matra is displayed without the dotted circle by
using the NBSP.
The combination of NBSP and ZWJ is used to display the below-base form of Ra (Rakaar)
in isolation.