|  | <?xml version="1.0"?> | 
|  | <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN" | 
|  | "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [ | 
|  | <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'"> | 
|  | <!ENTITY version SYSTEM "version.xml"> | 
|  | ]> | 
|  | <chapter id="shaping-concepts"> | 
|  | <title>Shaping concepts</title> | 
|  | <section id="text-shaping-concepts"> | 
|  | <title>Text shaping</title> | 
|  | <para> | 
|  | Text shaping is the process of transforming a sequence of Unicode | 
|  | codepoints that represent individual characters (letters, | 
|  | diacritics, tone marks, numbers, symbols, etc.) into the | 
|  | orthographically and linguistically correct two-dimensional layout | 
|  | of glyph shapes taken from a specified font. | 
|  | </para> | 
|  | <para> | 
|  | For some writing systems (or <emphasis>scripts</emphasis>) and | 
|  | languages, the process is simple, requiring the shaper to do | 
|  | little more than advance the horizontal position forward by the | 
|  | correct amount for each successive glyph. | 
|  | </para> | 
|  | <para> | 
|  | For <emphasis>complex scripts</emphasis>, however, any combination of | 
|  | several shaping operations may be required, and the rules for how | 
|  | and when they are applied vary from script to script. HarfBuzz and | 
|  | other shaping engines implement these rules. | 
|  | </para> | 
|  | <para> | 
|  | The exact rules and necessary operations for a particular script | 
|  | constitute a shaping <emphasis>model</emphasis>. OpenType | 
|  | specifies a set of shaping models that covers all of | 
|  | Unicode. Other shaping models are available, however, including | 
|  | Graphite and Apple Advanced Typography (AAT). | 
|  | </para> | 
|  | </section> | 
|  |  | 
|  | <section id="complex-scripts"> | 
|  | <title>Complex scripts</title> | 
|  | <para> | 
|  | In text-shaping terminology, scripts are generally classified as | 
|  | either <emphasis>complex</emphasis> or <emphasis>non-complex</emphasis>. | 
|  | </para> | 
|  | <para> | 
|  | Complex scripts are those for which transforming the input | 
|  | sequence into the final layout requires some combination of | 
|  | operations, such as context-dependent substitutions, | 
|  | context-dependent mark positioning, glyph-to-glyph joining, | 
|  | glyph reordering, or glyph stacking. | 
|  | </para> | 
|  | <para> | 
|  | In some complex scripts, the shaping rules require that a text | 
|  | run be divided into syllables before the operations can be | 
|  | applied. Other complex scripts may apply shaping operations over | 
|  | entire words or over the entire text run, with no subdivision | 
|  | required. | 
|  | </para> | 
|  | <para> | 
|  | Non-complex scripts, by definition, do not require these | 
|  | operations. However, correctly shaping a text run in a | 
|  | non-complex script may still involve Unicode normalization, | 
|  | ligature substitutions, mark positioning, kerning, and applying | 
|  | other font features. The key difference is that a text run in a | 
|  | non-complex script can be processed sequentially and in the same | 
|  | order as the input sequence of Unicode codepoints, without | 
|  | requiring an analysis stage. | 
|  | </para> | 
|  | </section> | 
|  |  | 
|  | <section id="shaping-operations"> | 
|  | <title>Shaping operations</title> | 
|  | <para> | 
|  | Shaping a complex-script text run involves transforming the | 
|  | input sequence of Unicode codepoints with some combination of | 
|  | operations that is specified in the shaping model for the | 
|  | script. | 
|  | </para> | 
|  | <para> | 
|  | The specific conditions that trigger a given operation for a | 
|  | text run vary from script to script, as do the order in which the | 
|  | operations are performed and the codepoints that are | 
|  | affected. However, the same general set of shaping operations is | 
|  | common to all of the complex-script shaping models. | 
|  | </para> | 
|  |  | 
|  | <itemizedlist> | 
|  | <listitem> | 
|  | <para> | 
|  | A <emphasis>reordering</emphasis> operation moves a glyph | 
|  | from its original ("logical") position in the sequence to | 
|  | some other ("visual") position. | 
|  | </para> | 
|  | <para> | 
|  | The shaping model for a given complex script might involve | 
|  | more than one reordering step. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | <listitem> | 
|  | <para> | 
|  | A <emphasis>joining</emphasis> operation replaces a glyph | 
|  | with an alternate form that is designed to connect with one | 
|  | or more of the adjacent glyphs in the sequence. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | <listitem> | 
|  | <para> | 
|  | A contextual <emphasis>substitution</emphasis> operation | 
|  | replaces either a single glyph or a subsequence of several | 
|  | glyphs with an alternate glyph. This substitution is | 
|  | performed when the original glyph or subsequence of glyphs | 
|  | occurs in a specified position with respect to the | 
|  | surrounding sequence. For example, one substitution might be | 
|  | performed only when the target glyph is the first glyph in | 
|  | the sequence, while another substitution is performed only | 
|  | when a different target glyph occurs immediately after a | 
|  | particular string pattern. | 
|  | </para> | 
|  | <para> | 
|  | The shaping model for a given complex script might involve | 
|  | multiple contextual-substitution operations, each applying | 
|  | to different target glyphs and patterns and each performed | 
|  | in a separate step. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | <listitem> | 
|  | <para> | 
|  | A contextual <emphasis>positioning</emphasis> operation | 
|  | adjusts the horizontal position, the vertical position, or both, of a | 
|  | glyph. This adjustment is performed when the glyph | 
|  | occurs in a specified position with respect to the | 
|  | surrounding sequence. | 
|  | </para> | 
|  | <para> | 
|  | Many contextual positioning operations are used to place | 
|  | <emphasis>mark</emphasis> glyphs (such as diacritics, vowel | 
|  | signs, and tone markers) with respect to | 
|  | <emphasis>base</emphasis> glyphs. However, some complex | 
|  | scripts may use contextual positioning operations to | 
|  | correctly place base glyphs as well, such as | 
|  | when the script uses <emphasis>stacking</emphasis> characters. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | </itemizedlist> | 
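|  | <para> | 
|  | As a rough illustration of a reordering operation, the following | 
|  | Python sketch moves the Devanagari vowel sign I (U+093F), which is | 
|  | stored after its consonant in logical order, to the position | 
|  | before that consonant. The function name | 
|  | <literal>reorder_prebase_vowel</literal> is hypothetical, and the | 
|  | single-consonant rule is a deliberate simplification: an actual | 
|  | shaping model reorders within whole syllables and operates on | 
|  | glyphs rather than on codepoints. | 
|  | </para> | 
|  | <programlisting language="python"><![CDATA[ | 
```python
# Toy illustration only: real shapers reorder glyphs within a syllable,
# following the rules of the script's shaping model.
PREBASE_VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I


def reorder_prebase_vowel(logical):
    """Return the sequence with each vowel sign I moved before the
    character that precedes it in logical order."""
    visual = []
    for ch in logical:
        if ch == PREBASE_VOWEL_SIGN_I and visual:
            # Place the vowel sign just before the preceding character.
            visual.insert(len(visual) - 1, ch)
        else:
            visual.append(ch)
    return visual


logical = ["\u0915", "\u093F"]           # KA, VOWEL SIGN I (logical order)
visual = reorder_prebase_vowel(logical)  # VOWEL SIGN I, KA (visual order)
```
|  | ]]></programlisting> | 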
|  | </section> | 
|  |  | 
|  | <section id="unicode-character-categories"> | 
|  | <title>Unicode character categories</title> | 
|  | <para> | 
|  | Shaping models are typically specified with respect to how | 
|  | scripts are defined in the Unicode standard. | 
|  | </para> | 
|  | <para> | 
|  | Every codepoint in the Unicode Character Database (UCD) is | 
|  | assigned a <emphasis>Unicode General Category</emphasis> (UGC), | 
|  | which provides the most fundamental information about the | 
|  | codepoint: whether the codepoint represents a | 
|  | <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a | 
|  | <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a | 
|  | <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>, | 
|  | or something else (<emphasis>Other</emphasis>). | 
|  | </para> | 
|  | <para> | 
|  | These seven values are the "Major" categories. Each codepoint is | 
|  | further assigned to a "minor" category within its Major | 
|  | category, such as "Letter, uppercase" (<literal>Lu</literal>) or | 
|  | "Letter, modifier" (<literal>Lm</literal>). | 
|  | </para> | 
|  | <para> | 
|  | Shaping models are concerned primarily with Letter and Mark | 
|  | codepoints. The minor categories of Mark codepoints are | 
|  | particularly important for shaping. Marks can be nonspacing | 
|  | (<literal>Mn</literal>), spacing combining | 
|  | (<literal>Mc</literal>), or enclosing (<literal>Me</literal>). | 
|  | </para> | 
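|  | <para> | 
|  | The General Category of a codepoint can be inspected with any | 
|  | Unicode-aware library. The following minimal sketch uses Python's | 
|  | standard <literal>unicodedata</literal> module, which is not part | 
|  | of HarfBuzz. The helper name <literal>describe</literal> is | 
|  | hypothetical and chosen only for illustration. | 
|  | </para> | 
|  | <programlisting language="python"><![CDATA[ | 
```python
import unicodedata


def describe(ch):
    """Return (major, minor) General Category codes for one character."""
    gc = unicodedata.category(ch)  # two-letter code such as "Lu" or "Mn"
    return gc[0], gc


samples = [
    "A",       # LATIN CAPITAL LETTER A
    "\u0301",  # COMBINING ACUTE ACCENT
    "\u093E",  # DEVANAGARI VOWEL SIGN AA
    "\u20DD",  # COMBINING ENCLOSING CIRCLE
]
for ch in samples:
    major, minor = describe(ch)
    print(f"U+{ord(ch):04X} major={major} minor={minor}")
```
|  | ]]></programlisting> | 
|  | <para> | 
|  | The first letter of each two-letter code is the Major category, | 
|  | so the three Mark samples above share the Major category | 
|  | <literal>M</literal> while differing in their minor category. | 
|  | </para> | 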
|  | <para> | 
|  | In addition to the UGC property, codepoints in the Indic and | 
|  | Southeast Asian scripts are also assigned | 
|  | <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and | 
|  | <emphasis>Unicode Indic Positional Category</emphasis> (UIPC) | 
|  | properties that provide more detailed information needed for | 
|  | shaping. | 
|  | </para> | 
|  | <para> | 
|  | The UISC property sub-categorizes Letters and Marks according to | 
|  | common script-shaping behaviors. For example, UISC distinguishes | 
|  | between consonant letters, vowel letters, and vowel marks. The | 
|  | UIPC property sub-categorizes Mark codepoints by the relative visual | 
|  | position that they occupy (above, below, right, left, or in | 
|  | multiple positions). | 
|  | </para> | 
|  | <para> | 
|  | Some complex scripts require that the text run be split into | 
|  | syllables. What constitutes a valid syllable in these | 
|  | scripts is specified in regular expressions, formed from the | 
|  | Letter and Mark codepoints, that take the UISC and UIPC | 
|  | properties into account. | 
|  | </para> | 
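|  | <para> | 
|  | The idea of a syllable expressed as a regular expression over | 
|  | character classes can be sketched as follows. The class table is | 
|  | hypothetical and assigned by hand for a few Devanagari | 
|  | codepoints, and the pattern is drastically simplified; it is not | 
|  | the syllable grammar of any actual shaping model. Codepoints | 
|  | missing from the table are given the class <literal>X</literal> | 
|  | and are skipped. | 
|  | </para> | 
|  | <programlisting language="python"><![CDATA[ | 
```python
import re

# Hypothetical, hand-assigned classes for a few Devanagari codepoints.
# A real implementation would derive them from the UISC and UIPC data.
CLASSES = {
    "\u0915": "C",  # LETTER KA   (consonant)
    "\u0937": "C",  # LETTER SSA  (consonant)
    "\u0924": "C",  # LETTER TA   (consonant)
    "\u094D": "H",  # SIGN VIRAMA (joins consonants)
    "\u093E": "M",  # VOWEL SIGN AA
    "\u093F": "M",  # VOWEL SIGN I
}

# Simplified syllable: consonants joined by viramas, then an optional
# vowel sign.
SYLLABLE = re.compile(r"(?:CH)*CM?")


def split_syllables(text):
    """Split text into substrings that match the simplified pattern."""
    classes = "".join(CLASSES.get(ch, "X") for ch in text)
    # One class letter per codepoint, so match offsets index into text.
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


# KA, VIRAMA, SSA, TA, VOWEL SIGN I
syllables = split_syllables("\u0915\u094D\u0937\u0924\u093F")
```
|  | ]]></programlisting> | 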
|  |  | 
|  | </section> | 
|  |  | 
|  | <section id="text-runs"> | 
|  | <title>Text runs</title> | 
|  | <para> | 
|  | Real-world text usually contains codepoints from a mixture of | 
|  | different Unicode scripts, along with punctuation, numbers, symbols, | 
|  | white-space characters, and other codepoints that do not belong | 
|  | to any particular script. Real-world text may also be marked up with | 
|  | formatting that changes font properties (including the font, | 
|  | font style, and font size). | 
|  | </para> | 
|  | <para> | 
|  | For shaping purposes, all real-world text streams must first be | 
|  | segmented into runs that have a uniform set of properties. | 
|  | </para> | 
|  | <para> | 
|  | In particular, shaping models always assume that every codepoint | 
|  | in a text run has the same <emphasis>direction</emphasis>, | 
|  | <emphasis>script</emphasis> tag, and | 
|  | <emphasis>language</emphasis> tag. | 
|  | </para> | 
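|  | <para> | 
|  | A minimal sketch of segmenting text into runs by script is shown | 
|  | here. It rests on several assumptions: the script is guessed from | 
|  | the first word of the character name, which is only a heuristic | 
|  | standing in for the Unicode Script property; the set | 
|  | <literal>KNOWN_SCRIPTS</literal> and both function names are | 
|  | hypothetical; and direction and language, which a real | 
|  | segmenter must also keep uniform within a run, are ignored. | 
|  | Script-neutral characters are attached to the neighboring run. | 
|  | </para> | 
|  | <programlisting language="python"><![CDATA[ | 
```python
import unicodedata

# Hypothetical subset of script names, matched against character names.
KNOWN_SCRIPTS = {"LATIN", "ARABIC", "HEBREW", "DEVANAGARI", "THAI"}


def guess_script(ch):
    """Heuristic: take the first word of the character name as the script."""
    first_word = unicodedata.name(ch, "").split(" ")[0]
    return first_word if first_word in KNOWN_SCRIPTS else "COMMON"


def split_runs(text):
    """Group consecutive characters into (script, substring) runs."""
    runs = []  # list of [script, characters]
    for ch in text:
        script = guess_script(ch)
        if runs and (script == "COMMON" or runs[-1][0] in (script, "COMMON")):
            # Same script or a script-neutral character: extend the run.
            if runs[-1][0] == "COMMON":
                runs[-1][0] = script  # leading neutrals adopt this script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chars) for script, chars in runs]


# Latin letters and a space, then four Hebrew letters and "!"
runs = split_runs("abc \u05E9\u05DC\u05D5\u05DD!")
```
|  | ]]></programlisting> | 
|  | <para> | 
|  | In this sketch the space stays with the Latin run and the | 
|  | exclamation mark with the Hebrew run, because neutral characters | 
|  | simply extend whichever run is current. | 
|  | </para> | 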
|  | </section> | 
|  |  | 
|  | <section id="opentype-shaping-models"> | 
|  | <title>OpenType shaping models</title> | 
|  | <para> | 
|  | OpenType provides shaping models for the following scripts: | 
|  | </para> | 
|  |  | 
|  | <itemizedlist> | 
|  | <listitem> | 
|  | <para> | 
|  | The <emphasis>default</emphasis> shaping model handles all | 
|  | non-complex scripts, and may also be used as a fallback for | 
|  | handling unrecognized scripts. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | <listitem> | 
|  | <para> | 
|  | The <emphasis>Indic</emphasis> shaping model handles the Indic | 
|  | scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, | 
|  | Malayalam, Oriya, Tamil, Telugu, and Sinhala. | 
|  | </para> | 
|  | <para> | 
|  | The Indic shaping model was revised significantly in | 
|  | 2005. To denote the change, a new set of <emphasis>script | 
|  | tags</emphasis> was assigned for Bengali, Devanagari, | 
|  | Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and | 
|  | Telugu. For the sake of clarity, the term "Indic2" is | 
|  | sometimes used to refer to the current, revised shaping | 
|  | model. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | <listitem> | 
|  | <para> | 
|  | The <emphasis>Arabic</emphasis> shaping model supports | 
|  | Arabic, Mongolian, N'Ko, Syriac, and several other connected | 
|  | or cursive scripts. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | <listitem> | 
|  | <para> | 
|  | The <emphasis>Thai/Lao</emphasis> shaping model supports | 
|  | the Thai and Lao scripts. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | <listitem> | 
|  | <para> | 
|  | The <emphasis>Khmer</emphasis> shaping model supports the | 
|  | Khmer script. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | <listitem> | 
|  | <para> | 
|  | The <emphasis>Myanmar</emphasis> shaping model supports the | 
|  | Myanmar (or Burmese) script. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | <listitem> | 
|  | <para> | 
|  | The <emphasis>Tibetan</emphasis> shaping model supports the | 
|  | Tibetan script. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | <listitem> | 
|  | <para> | 
|  | The <emphasis>Hangul</emphasis> shaping model supports the | 
|  | Hangul script. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | <listitem> | 
|  | <para> | 
|  | The <emphasis>Hebrew</emphasis> shaping model supports the | 
|  | Hebrew script. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | <listitem> | 
|  | <para> | 
|  | The <emphasis>Universal Shaping Engine</emphasis> (USE) | 
|  | shaping model supports complex scripts not covered by one of | 
|  | the script-specific shaping models above, including | 
|  | Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi, | 
|  | Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai | 
|  | Viet, and many others. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | <listitem> | 
|  | <para> | 
|  | Text runs that do not fall under one of the above shaping | 
|  | models may still require processing by a shaping engine. Of | 
|  | particular note is <emphasis>Emoji</emphasis> shaping, which | 
|  | may involve variation-selector sequences and glyph | 
|  | substitution. Emoji shaping is handled by the default | 
|  | shaping model. | 
|  | </para> | 
|  | </listitem> | 
|  |  | 
|  | </itemizedlist> | 
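|  | <para> | 
|  | The list above amounts to a lookup from script to shaping model, | 
|  | which can be sketched as a simple table. The model labels and | 
|  | the function name <literal>shaping_model_for</literal> are | 
|  | hypothetical, and the table contains only the scripts named in | 
|  | this section; it does not reflect how any particular shaping | 
|  | engine stores this information. Unlisted scripts fall back to | 
|  | the default model. | 
|  | </para> | 
|  | <programlisting language="python"><![CDATA[ | 
```python
# Scripts named in this section, grouped by shaping model.
# The model labels are illustrative, not identifiers from any library.
SCRIPTS_BY_MODEL = {
    "indic": ["Bengali", "Devanagari", "Gujarati", "Gurmukhi", "Kannada",
              "Malayalam", "Oriya", "Tamil", "Telugu", "Sinhala"],
    "arabic": ["Arabic", "Mongolian", "N'Ko", "Syriac"],
    "thai/lao": ["Thai", "Lao"],
    "khmer": ["Khmer"],
    "myanmar": ["Myanmar"],
    "tibetan": ["Tibetan"],
    "hangul": ["Hangul"],
    "hebrew": ["Hebrew"],
    "use": ["Javanese", "Balinese", "Buginese", "Batak", "Chakma",
            "Lepcha", "Modi", "Phags-pa", "Tagalog", "Siddham",
            "Sundanese", "Tai Le", "Tai Tham", "Tai Viet"],
}

# Invert the table so that each script maps to its model.
MODEL_BY_SCRIPT = {
    script: model
    for model, scripts in SCRIPTS_BY_MODEL.items()
    for script in scripts
}


def shaping_model_for(script):
    """Return the model label for a script, or "default" if unlisted."""
    return MODEL_BY_SCRIPT.get(script, "default")


model = shaping_model_for("Devanagari")
```
|  | ]]></programlisting> | 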
|  |  | 
|  | </section> | 
|  |  | 
|  | <section id="graphite-shaping"> | 
|  | <title>Graphite shaping</title> | 
|  | <para> | 
|  | In contrast to OpenType shaping, Graphite shaping does not | 
|  | specify a predefined set of shaping models or a set of supported | 
|  | scripts. | 
|  | </para> | 
|  | <para> | 
|  | Instead, each Graphite font contains a complete set of rules that | 
|  | implement the required shaping model for the intended | 
|  | script. These rules include finite-state machines to match | 
|  | sequences of codepoints to the shaping operations to perform. | 
|  | </para> | 
|  | <para> | 
|  | Graphite shaping can perform the same shaping operations used in | 
|  | OpenType shaping, as well as other functions that have not been | 
|  | defined for OpenType shaping. | 
|  | </para> | 
|  | </section> | 
|  |  | 
|  | <section id="aat-shaping"> | 
|  | <title>AAT shaping</title> | 
|  | <para> | 
|  | In contrast to OpenType shaping, AAT shaping does not specify a | 
|  | predefined set of shaping models or a set of supported scripts. | 
|  | </para> | 
|  | <para> | 
|  | Instead, each AAT font includes a complete set of rules that | 
|  | implement the desired shaping model for the intended | 
|  | script. These rules include finite-state machines that match glyph | 
|  | sequences and select the shaping operations to perform. | 
|  | </para> | 
|  | <para> | 
|  | Notably, AAT shaping rules are expressed for glyphs in the font, | 
|  | not for Unicode codepoints. AAT shaping can perform the same | 
|  | shaping operations used in OpenType shaping, as well as other | 
|  | functions that have not been defined for OpenType shaping. | 
|  | </para> | 
|  | </section> | 
|  | </chapter> |