| <?xml version="1.0"?> |
| <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN" |
| "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [ |
| <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'"> |
| <!ENTITY version SYSTEM "version.xml"> |
| ]> |
| <chapter id="shaping-concepts"> |
| <title>Shaping concepts</title> |
| <section id="text-shaping-concepts"> |
| <title>Text shaping</title> |
| <para> |
| Text shaping is the process of transforming a sequence of Unicode |
| codepoints that represent individual characters (letters, |
| diacritics, tone marks, numbers, symbols, etc.) into the |
| orthographically and linguistically correct two-dimensional layout |
| of glyph shapes taken from a specified font. |
| </para> |
| <para> |
| For some writing systems (or <emphasis>scripts</emphasis>) and |
| languages, the process is simple, requiring the shaper to do |
| little more than advance the horizontal position forward by the |
| correct amount for each successive glyph. |
| </para> |
| <para> |
| But, for <emphasis>complex scripts</emphasis>, any combination of |
| several shaping operations may be required, and the rules for how |
| and when they are applied vary from script to script. HarfBuzz and |
| other shaping engines implement these rules. |
| </para> |
| <para> |
| The exact rules and necessary operations for a particular script |
| constitute a shaping <emphasis>model</emphasis>. OpenType |
| specifies a set of shaping models that covers all of |
| Unicode. Other shaping models are available, however, including |
| Graphite and Apple Advanced Typography (AAT). |
| </para> |
| </section> |
| |
| <section id="complex-scripts"> |
| <title>Complex scripts</title> |
| <para> |
| In text-shaping terminology, scripts are generally classified as |
| either <emphasis>complex</emphasis> or <emphasis>non-complex</emphasis>. |
| </para> |
| <para> |
| Complex scripts are those for which transforming the input |
| sequence into the final layout requires some combination of |
| operations—such as context-dependent substitutions, |
| context-dependent mark positioning, glyph-to-glyph joining, |
| glyph reordering, or glyph stacking. |
| </para> |
| <para> |
| In some complex scripts, the shaping rules require that a text |
| run be divided into syllables before the operations can be |
| applied. Other complex scripts may apply shaping operations over |
| entire words or over the entire text run, with no subdivision |
| required. |
| </para> |
| <para> |
| Non-complex scripts, by definition, do not require these |
| operations. However, correctly shaping a text run in a |
| non-complex script may still involve Unicode normalization, |
| ligature substitutions, mark positioning, kerning, and applying |
| other font features. The key difference is that a text run in a |
| non-complex script can be processed sequentially and in the same |
| order as the input sequence of Unicode codepoints, without |
| requiring an analysis stage. |
| </para> |
| </section> |
| |
| <section id="shaping-operations"> |
| <title>Shaping operations</title> |
| <para> |
| Shaping a complex-script text run involves transforming the |
| input sequence of Unicode codepoints with some combination of |
| operations that is specified in the shaping model for the |
| script. |
| </para> |
| <para> |
| The specific conditions that trigger a given operation for a |
| text run varies from script to script, as do the order that the |
| operations are performed in and which codepoints are |
| affected. However, the same general set of shaping operations is |
| common to all of the complex-script shaping models. |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| A <emphasis>reordering</emphasis> operation moves a glyph |
| from its original ("logical") position in the sequence to |
| some other ("visual") position. |
| </para> |
| <para> |
| The shaping model for a given complex script might involve |
| more than one reordering step. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| A <emphasis>joining</emphasis> operation replaces a glyph |
| with an alternate form that is designed to connect with one |
| or more of the adjacent glyphs in the sequence. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| A contextual <emphasis>substitution</emphasis> operation |
| replaces either a single glyph or a subsequence of several |
| glyphs with an alternate glyph. This substitution is |
| performed when the original glyph or subsequence of glyphs |
| occurs in a specified position with respect to the |
| surrounding sequence. For example, one substitution might be |
| performed only when the target glyph is the first glyph in |
| the sequence, while another substitution is performed only |
| when a different target glyph occurs immediately after a |
| particular string pattern. |
| </para> |
| <para> |
| The shaping model for a given complex script might involve |
| multiple contextual-substitution operations, each applying |
| to different target glyphs and patterns, and which are |
| performed in separate steps. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| A contextual <emphasis>positioning</emphasis> operation |
| moves the horizontal and/or vertical position of a |
| glyph. This positioning move is performed when the glyph |
| occurs in a specified position with respect to the |
| surrounding sequence. |
| </para> |
| <para> |
| Many contextual positioning operations are used to place |
| <emphasis>mark</emphasis> glyphs (such as diacritics, vowel |
| signs, and tone markers) with respect to |
| <emphasis>base</emphasis> glyphs. However, some complex |
| scripts may use contextual positioning operations to |
| correctly place base glyphs as well, such as |
| when the script uses <emphasis>stacking</emphasis> characters. |
| </para> |
| </listitem> |
| |
| </itemizedlist> |
| </section> |
| |
| <section id="unicode-character-categories"> |
| <title>Unicode character categories</title> |
| <para> |
| Shaping models are typically specified with respect to how |
| scripts are defined in the Unicode standard. |
| </para> |
| <para> |
| Every codepoint in the Unicode Character Database (UCD) is |
| assigned a <emphasis>Unicode General Category</emphasis> (UGC), |
| which provides the most fundamental information about the |
| codepoint: whether the codepoint represents a |
| <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a |
| <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a |
| <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>, |
| or something else (<emphasis>Other</emphasis>). |
| </para> |
| <para> |
| These UGC properties are "Major" categories. Each codepoint is |
| further assigned to a "minor" category within its Major |
| category, such as "Letter, uppercase" (<literal>Lu</literal>) or |
| "Letter, modifier" (<literal>Lm</literal>). |
| </para> |
| <para> |
| Shaping models are concerned primarily with Letter and Mark |
| codepoints. The minor categories of Mark codepoints are |
| particularly important for shaping. Marks can be nonspacing |
| (<literal>Mn</literal>), spacing combining |
| (<literal>Mc</literal>), or enclosing (<literal>Me</literal>). |
| </para> |
| <para> |
| In addition to the UGC property, codepoints in the Indic and |
| Southeast Asian scripts are also assigned |
| <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and |
| <emphasis>Unicode Indic Positional Category</emphasis> (UIPC) |
| properties that provide more detailed information needed for |
| shaping. |
| </para> |
| <para> |
| The UISC property sub-categorizes Letters and Marks according to |
| common script-shaping behaviors. For example, UISC distinguishes |
| between consonant letters, vowel letters, and vowel marks. The |
| UIPC property sub-categorizes Mark codepoints by the relative visual |
| position that they occupy (above, below, right, left, or in |
| multiple positions). |
| </para> |
| <para> |
| Some complex scripts require that the text run be split into |
| syllables. What constitutes a valid syllable in these |
| scripts is specified in regular expressions, formed from the |
| Letter and Mark codepoints, that take the UISC and UIPC |
| properties into account. |
| </para> |
| |
| </section> |
| |
| <section id="text-runs"> |
| <title>Text runs</title> |
| <para> |
| Real-world text usually contains codepoints from a mixture of |
| different Unicode scripts (including punctuation, numbers, symbols, |
| white-space characters, and other codepoints that do not belong |
| to any script). Real-world text may also be marked up with |
| formatting that changes font properties (including the font, |
| font style, and font size). |
| </para> |
| <para> |
| For shaping purposes, all real-world text streams must be first |
| segmented into runs that have a uniform set of properties. |
| </para> |
| <para> |
| In particular, shaping models always assume that every codepoint |
| in a text run has the same <emphasis>direction</emphasis>, |
| <emphasis>script</emphasis> tag, and |
| <emphasis>language</emphasis> tag. |
| </para> |
| </section> |
| |
| <section id="opentype-shaping-models"> |
| <title>OpenType shaping models</title> |
| <para> |
| OpenType provides shaping models for the following scripts: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| The <emphasis>default</emphasis> shaping model handles all |
| non-complex scripts, and may also be used as a fallback for |
| handling unrecognized scripts. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The <emphasis>Indic</emphasis> shaping model handles the Indic |
| scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, |
| Malayalam, Oriya, Tamil, Telugu, and Sinhala. |
| </para> |
| <para> |
| The Indic shaping model was revised significantly in |
| 2005. To denote the change, a new set of <emphasis>script |
| tags</emphasis> was assigned for Bengali, Devanagari, |
| Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and |
| Telugu. For the sake of clarity, the term "Indic2" is |
| sometimes used to refer to the current, revised shaping |
| model. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The <emphasis>Arabic</emphasis> shaping model supports |
| Arabic, Mongolian, N'Ko, Syriac, and several other connected |
| or cursive scripts. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The <emphasis>Thai/Lao</emphasis> shaping model supports |
| the Thai and Lao scripts. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The <emphasis>Khmer</emphasis> shaping model supports the |
| Khmer script. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The <emphasis>Myanmar</emphasis> shaping model supports the |
| Myanmar (or Burmese) script. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The <emphasis>Tibetan</emphasis> shaping model supports the |
| Tibetan script. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The <emphasis>Hangul</emphasis> shaping model supports the |
| Hangul script. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The <emphasis>Hebrew</emphasis> shaping model supports the |
| Hebrew script. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The <emphasis>Universal Shaping Engine</emphasis> (USE) |
| shaping model supports complex scripts not covered by one of |
| the above, script-specific shaping models, including |
| Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi, |
| Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai |
| Viet, and many others. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Text runs that do not fall under one of the above shaping |
| models may still require processing by a shaping engine. Of |
| particular note is <emphasis>Emoji</emphasis> shaping, which |
| may involve variation-selector sequences and glyph |
| substitution. Emoji shaping is handled by the default |
| shaping model. |
| </para> |
| </listitem> |
| |
| </itemizedlist> |
| |
| </section> |
| |
| <section id="graphite-shaping"> |
| <title>Graphite shaping</title> |
| <para> |
| In contrast to OpenType shaping, Graphite shaping does not |
| specify a predefined set of shaping models or a set of supported |
| scripts. |
| </para> |
| <para> |
| Instead, each Graphite font contains a complete set of rules that |
| implement the required shaping model for the intended |
| script. These rules include finite-state machines to match |
| sequences of codepoints to the shaping operations to perform. |
| </para> |
| <para> |
| Graphite shaping can perform the same shaping operations used in |
| OpenType shaping, as well as other functions that have not been |
| defined for OpenType shaping. |
| </para> |
| </section> |
| |
| <section id="aat-shaping"> |
| <title>AAT shaping</title> |
| <para> |
| In contrast to OpenType shaping, AAT shaping does not specify a |
| predefined set of shaping models or a set of supported scripts. |
| </para> |
| <para> |
| Instead, each AAT font includes a complete set of rules that |
| implement the desired shaping model for the intended |
| script. These rules include finite-state machines to match glyph |
| sequences and the shaping operations to perform. |
| </para> |
| <para> |
| Notably, AAT shaping rules are expressed for glyphs in the font, |
| not for Unicode codepoints. AAT shaping can perform the same |
| shaping operations used in OpenType shaping, as well as other |
| functions that have not been defined for OpenType shaping. |
| </para> |
| </section> |
| </chapter> |