| <?xml version="1.0"?> |
| <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN" |
| "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [ |
| <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'"> |
| <!ENTITY version SYSTEM "version.xml"> |
| ]> |
| <chapter id="clusters"> |
| <title>Clusters</title> |
| <section id="clusters-and-shaping"> |
| <title>Clusters and shaping</title> |
| <para> |
| In text shaping, a <emphasis>cluster</emphasis> is a sequence of |
| characters that needs to be treated as a single, indivisible |
| unit. A single letter or symbol can be a cluster of its |
| own. Other clusters correspond to longer subsequences of the |
| input code points — such as a ligature or conjunct form |
| — and require the shaper to ensure that the cluster is not |
| broken during the shaping process. |
| </para> |
| <para> |
| A cluster is distinct from a <emphasis>grapheme</emphasis>, |
| which is the smallest unit of meaning in a writing system or |
| script. |
| </para> |
| <para> |
| The definitions of the two terms are similar. However, clusters |
| are only relevant for script shaping and glyph layout. In |
| contrast, graphemes are a property of the underlying script, and |
| are of interest when client programs implement orthographic |
| or linguistic functionality. |
| </para> |
| <para> |
| For example, two individual letters are often two separate |
| graphemes. When two letters form a ligature, however, they |
| combine into a single glyph. They are then part of the same |
| cluster and are treated as a unit by the shaping engine — |
| even though the two original, underlying letters remain separate |
| graphemes. |
| </para> |
| <para> |
| HarfBuzz is concerned with clusters, <emphasis>not</emphasis> |
| with graphemes — although client programs using HarfBuzz |
| may still care about graphemes for other reasons from time to time. |
| </para> |
| <para> |
| During the shaping process, there are several shaping operations |
| that may merge adjacent characters (for example, when two code |
| points form a ligature or a conjunct form and are replaced by a |
| single glyph) or split one character into several (for example, |
| when decomposing a code point through the |
| <literal>ccmp</literal> feature). Operations like these alter |
| clusters; HarfBuzz tracks the changes to ensure that no clusters |
| get lost or broken during shaping. |
| </para> |
| <para> |
| HarfBuzz records cluster information independently from how |
| shaping operations affect the individual glyphs returned in an |
| output buffer. Consequently, a client program using HarfBuzz can |
| utilize the cluster information to implement features such as: |
| </para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| Correctly positioning the cursor within a shaped text run, |
| even when characters have formed ligatures, composed or |
| decomposed, reordered, or undergone other shaping operations. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Correctly highlighting a text selection that includes some, |
| but not all, of the characters in a word. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Applying text attributes (such as color or underlining) to |
| part, but not all, of a word. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Generating output document formats (such as PDF) with |
| embedded text that can be fully extracted. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Determining the mapping between input characters and output |
| glyphs, such as which glyphs are ligatures. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Performing line-breaking, justification, and other |
| line-level or paragraph-level operations that must be done |
| after shaping is complete, but which require examining |
| character-level properties. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </section> |
| <section id="working-with-harfbuzz-clusters"> |
| <title>Working with HarfBuzz clusters</title> |
| <para> |
| When you add text to a HarfBuzz buffer, each code point must be |
| assigned a <emphasis>cluster value</emphasis>. |
| </para> |
| <para> |
| This cluster value is an arbitrary number; HarfBuzz uses it only |
| to distinguish between clusters. Many client programs will use |
| the index of each code point in the input text stream as the |
| cluster value. This is for the sake of convenience; the actual |
| value does not matter. |
| </para> |
| <para> |
| Some of the shaping operations performed by HarfBuzz — |
| such as reordering, composition, decomposition, and substitution |
| — may alter the cluster values of some characters. The |
| final cluster values in the buffer at the end of the shaping |
| process will indicate to client programs which subsequences of |
| glyphs represent a cluster and, therefore, must not be |
| separated. |
| </para> |
| <para> |
| In addition, client programs can query the final cluster values |
| to discern other potentially important information about the |
| glyphs in the output buffer (such as whether or not a ligature |
| was formed). |
| </para> |
| <para> |
| For example, if the initial sequence of cluster values was: |
| </para> |
| <programlisting> |
| 0,1,2,3,4 |
| </programlisting> |
| <para> |
| and the final sequence of cluster values is: |
| </para> |
| <programlisting> |
| 0,0,3,3 |
| </programlisting> |
| <para> |
| then there are two clusters in the output buffer: the first |
| cluster includes the first two glyphs, and the second cluster |
| includes the third and fourth glyphs. It is also evident that a |
| ligature or conjunct has been formed, because there are fewer |
| glyphs in the output buffer (four) than there were code points |
| in the input buffer (five). |
| </para> |
| <para> |
| Although client programs using HarfBuzz are free to assign |
| initial cluster values in any manner they choose to, HarfBuzz |
| does offer some useful guarantees if the cluster values are |
| assigned in a monotonic (either non-decreasing or non-increasing) |
| order. |
| </para> |
| <para> |
| For buffers in the left-to-right (LTR) |
| or top-to-bottom (TTB) text flow direction, |
| HarfBuzz will preserve the monotonic property: client programs |
| are guaranteed that monotonically increasing initial cluster |
| values will be returned as monotonically increasing final |
| cluster values. |
| </para> |
| <para> |
| For buffers in the right-to-left (RTL) |
| or bottom-to-top (BTT) text flow direction, |
| the directionality of the buffer itself is reversed for final |
| output as a matter of design. Therefore, HarfBuzz inverts the |
| monotonic property: client programs are guaranteed that |
| monotonically increasing initial cluster values will be |
| returned as monotonically <emphasis>decreasing</emphasis> final |
| cluster values. |
| </para> |
| <para> |
| Client programs can adjust how HarfBuzz handles clusters during |
| shaping by setting the |
| <literal>cluster_level</literal> of the |
| buffer. HarfBuzz offers three <emphasis>levels</emphasis> of |
| clustering support for this property: |
| </para> |
| <itemizedlist> |
| <listitem> |
| <para><emphasis>Level 0</emphasis> is the default. |
| </para> |
| <para> |
| The distinguishing feature of level 0 behavior is that, at |
| the beginning of processing the buffer, all code points that |
| are categorized as <emphasis>marks</emphasis>, |
| <emphasis>modifier symbols</emphasis>, or |
| <emphasis>Emoji extended pictographic</emphasis> modifiers, |
| as well as the <emphasis>Zero Width Joiner</emphasis> and |
| <emphasis>Zero Width Non-Joiner</emphasis> code points, are |
| assigned the cluster value of the closest preceding code |
| point from <emphasis>different</emphasis> category. |
| </para> |
| <para> |
| In essence, whenever a base character is followed by a mark |
| character or a sequence of mark characters, those marks are |
| reassigned to the same initial cluster value as the base |
| character. This reassignment is referred to as |
| "merging" the affected clusters. This behavior is based on |
| the Grapheme Cluster Boundary specification in <ulink |
| url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode |
| Technical Report 29</ulink>. |
| </para> |
| <para> |
| This cluster level is suitable for code that likes to use |
| HarfBuzz cluster values as an approximation of the Unicode |
| Grapheme Cluster Boundaries as well. |
| </para> |
| <para> |
| Client programs can specify level 0 behavior for a buffer by |
| setting its <literal>cluster_level</literal> to |
| <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <emphasis>Level 1</emphasis> tweaks the old behavior |
| slightly to produce better results. Therefore, level 1 |
| clustering is recommended for code that is not required to |
| implement backward compatibility with the old HarfBuzz. |
| </para> |
| <para> |
| <emphasis>Level 1</emphasis> differs from level 0 by not merging the |
| clusters of marks and other modifier code points with the |
| preceding "base" code point's cluster. By preserving the |
| separate cluster values of these marks and modifier code |
| points, script shapers can perform additional operations |
| that might lead to improved results (for example, coloring |
| mark glyphs differently than their base). |
| </para> |
| <para> |
| Client programs can specify level 1 behavior for a buffer by |
| setting its <literal>cluster_level</literal> to |
| <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <emphasis>Level 2</emphasis> differs significantly in how it |
| treats cluster values. In level 2, HarfBuzz never merges |
| clusters. |
| </para> |
| <para> |
| This difference can be seen most clearly when HarfBuzz processes |
| ligature substitutions and glyph decompositions. In level 0 |
| and level 1, ligatures and glyph decomposition both involve |
| merging clusters; in level 2, neither of these operations |
| triggers a merge. |
| </para> |
| <para> |
| Client programs can specify level 2 behavior for a buffer by |
| setting its <literal>cluster_level</literal> to |
| <literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>. |
| </para> |
| </listitem> |
| </itemizedlist> |
| <para> |
| As mentioned earlier, client programs using HarfBuzz often |
| assign initial cluster values in a buffer by reusing the indices |
| of the code points in the input text. This gives a sequence of |
| cluster values that is monotonically increasing (for example, |
| 0,1,2,3,4). |
| </para> |
| <para> |
| It is not <emphasis>required</emphasis> that the cluster values |
| in a buffer be monotonically increasing. However, if the initial |
| cluster values in a buffer are monotonic and the buffer is |
| configured to use cluster level 0 or 1, then HarfBuzz |
| guarantees that the final cluster values in the shaped buffer |
| will also be monotonic. No such guarantee is made for cluster |
| level 2. |
| </para> |
| <para> |
| In levels 0 and 1, HarfBuzz implements the following conceptual |
| model for cluster values: |
| </para> |
| <itemizedlist spacing="compact"> |
| <listitem> |
| <para> |
| If the sequence of input cluster values is monotonic, the |
| sequence of cluster values will remain monotonic. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Each cluster value represents a single cluster. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Each cluster contains one or more glyphs and one or more |
| characters. |
| </para> |
| </listitem> |
| </itemizedlist> |
| <para> |
| In practice, this model offers several benefits. Assuming that |
| the initial cluster values were monotonically increasing |
| and distinct before shaping began, then, in the final output: |
| </para> |
| <itemizedlist spacing="compact"> |
| <listitem> |
| <para> |
| All adjacent glyphs having the same final cluster |
| value belong to the same cluster. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Each character belongs to the cluster that has the highest |
| cluster value <emphasis>not larger than</emphasis> its |
| initial cluster value. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </section> |
| |
| <section id="a-clustering-example-for-levels-0-and-1"> |
| <title>A clustering example for levels 0 and 1</title> |
| <para> |
| The basic shaping operations affect clusters in a predictable |
| manner when using level 0 or level 1: |
| </para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| When two or more clusters <emphasis>merge</emphasis>, the |
| resulting merged cluster takes as its cluster value the |
| <emphasis>minimum</emphasis> of the incoming cluster values. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| When a cluster <emphasis>decomposes</emphasis>, all of the |
| resulting child clusters inherit as their cluster value the |
| cluster value of the parent cluster. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| When a character is <emphasis>reordered</emphasis>, the |
| reordered character and all clusters that the character |
| moves past as part of the reordering are merged into one cluster. |
| </para> |
| </listitem> |
| </itemizedlist> |
| <para> |
| The functionality, guarantees, and benefits of level 0 and level |
| 1 behavior can be seen with some examples. First, let us examine |
| what happens with cluster values when shaping involves cluster |
| merging with ligatures and decomposition. |
| </para> |
| |
| <para> |
| Let's say we start with the following character sequence (top row) and |
| initial cluster values (bottom row): |
| </para> |
| <programlisting> |
| A,B,C,D,E |
| 0,1,2,3,4 |
| </programlisting> |
| <para> |
| During shaping, HarfBuzz maps these characters to glyphs from |
| the font. For simplicity, let us assume that each character maps |
| to the corresponding, identical-looking glyph: |
| </para> |
| <programlisting> |
| A,B,C,D,E |
| 0,1,2,3,4 |
| </programlisting> |
| <para> |
| Now if, for example, <literal>B</literal> and <literal>C</literal> |
| form a ligature, then the clusters to which they belong |
| "merge". This merged cluster takes for its cluster |
| value the minimum of all the cluster values of the clusters that |
| went in to the ligature. In this case, we get: |
| </para> |
| <programlisting> |
| A,BC,D,E |
| 0,1 ,3,4 |
| </programlisting> |
| <para> |
| because 1 is the minimum of the set {1,2}, which were the |
| cluster values of <literal>B</literal> and |
| <literal>C</literal>. |
| </para> |
| <para> |
| Next, let us say that the <literal>BC</literal> ligature glyph |
| decomposes into three components, and <literal>D</literal> also |
| decomposes into two components. Whenever a cluster decomposes, |
| its components each inherit the cluster value of their parent: |
| </para> |
| <programlisting> |
| A,BC0,BC1,BC2,D0,D1,E |
| 0,1 ,1 ,1 ,3 ,3 ,4 |
| </programlisting> |
| <para> |
| Next, if <literal>BC2</literal> and <literal>D0</literal> form a |
| ligature, then their clusters (cluster values 1 and 3) merge into |
| <literal>min(1,3) = 1</literal>: |
| </para> |
| <programlisting> |
| A,BC0,BC1,BC2D0,D1,E |
| 0,1 ,1 ,1 ,1 ,4 |
| </programlisting> |
| <para> |
| Note that the entirety of cluster 3 merges into cluster 1, not |
| just the <literal>D0</literal> glyph. This reflects the fact |
| that the cluster <emphasis>must</emphasis> be treated as an |
| indivisible unit. |
| </para> |
| <para> |
| At this point, cluster 1 means: the character sequence |
| <literal>BCD</literal> is represented by glyphs |
| <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any |
| further. |
| </para> |
| </section> |
| <section id="reordering-in-levels-0-and-1"> |
| <title>Reordering in levels 0 and 1</title> |
| <para> |
| Another common operation in some shapers is glyph |
| reordering. In order to maintain a monotonic cluster sequence |
| when glyph reordering takes place, HarfBuzz merges the clusters |
| of everything in the reordering sequence. |
| </para> |
| <para> |
| For example, let us again start with the character sequence (top |
| row) and initial cluster values (bottom row): |
| </para> |
| <programlisting> |
| A,B,C,D,E |
| 0,1,2,3,4 |
| </programlisting> |
| <para> |
| If <literal>D</literal> is reordered to the position immediately |
| before <literal>B</literal>, then HarfBuzz merges the |
| <literal>B</literal>, <literal>C</literal>, and |
| <literal>D</literal> clusters — all the clusters between |
| the final position of the reordered glyph and its original |
| position. This means that we get: |
| </para> |
| <programlisting> |
| A,D,B,C,E |
| 0,1,1,1,4 |
| </programlisting> |
| <para> |
| as the final cluster sequence. |
| </para> |
| <para> |
| Merging this many clusters is not ideal, but it is the only |
| sensible way for HarfBuzz to maintain the guarantee that the |
| sequence of cluster values remains monotonic and to retain the |
| true relationship between glyphs and characters. |
| </para> |
| </section> |
| <section id="the-distinction-between-levels-0-and-1"> |
| <title>The distinction between levels 0 and 1</title> |
| <para> |
| The preceding examples demonstrate the main effects of using |
| cluster levels 0 and 1. The only difference between the two |
| levels is this: in level 0, at the very beginning of the shaping |
| process, HarfBuzz merges the cluster of each base character |
| with the clusters of all Unicode marks (combining or not) and |
| modifiers that follow it. |
| </para> |
| <para> |
| For example, let us start with the following character sequence |
| (top row) and accompanying initial cluster values (bottom row): |
| </para> |
| <programlisting> |
| A,acute,B |
| 0,1 ,2 |
| </programlisting> |
| <para> |
| The <literal>acute</literal> is a Unicode mark. If HarfBuzz is |
| using cluster level 0 on this sequence, then the |
| <literal>A</literal> and <literal>acute</literal> clusters will |
| merge, and the result will become: |
| </para> |
| <programlisting> |
| A,acute,B |
| 0,0 ,2 |
| </programlisting> |
| <para> |
| This merger is performed before any other script-shaping |
| steps. |
| </para> |
| <para> |
| This initial cluster merging is the default behavior of the |
| Windows shaping engine, and the old HarfBuzz codebase copied |
| that behavior to maintain compatibility. Consequently, it has |
| remained the default behavior in the new HarfBuzz codebase. |
| </para> |
| <para> |
| But this initial cluster-merging behavior makes it impossible |
| for client programs to implement some features (such as to |
| color diacritic marks differently from their base |
| characters). That is why, in level 1, HarfBuzz does not perform |
| the initial merging step. |
| </para> |
| <para> |
| For client programs that rely on HarfBuzz cluster values to |
| perform cursor positioning, level 0 is more convenient. But |
| relying on cluster boundaries for cursor positioning is wrong: cursor |
| positions should be determined based on Unicode grapheme |
| boundaries, not on shaping-cluster boundaries. As such, using |
| level 1 clustering behavior is recommended. |
| </para> |
| <para> |
| One final facet of levels 0 and 1 is worth noting. HarfBuzz |
| currently does not allow any |
| <emphasis>multiple-substitution</emphasis> GSUB lookups to |
| replace a glyph with zero glyphs (in other words, to delete a |
| glyph). |
| </para> |
| <para> |
| But, in some other situations, glyphs can be deleted. In |
| those cases, if the glyph being deleted is the last glyph of its |
| cluster, HarfBuzz makes sure to merge the deleted glyph's |
| cluster with a neighboring cluster. |
| </para> |
| <para> |
| This is done primarily to make sure that the starting cluster of the |
| text always has the cluster index pointing to the start of the text |
| for the run; more than one client program currently relies on this |
| guarantee. |
| </para> |
| <para> |
| Incidentally, Apple's CoreText does something different to |
| maintain the same promise: it inserts a glyph with id 65535 at |
| the beginning of the glyph string if the glyph corresponding to |
| the first character in the run was deleted. HarfBuzz might do |
| something similar in the future. |
| </para> |
| </section> |
| <section id="level-2"> |
| <title>Level 2</title> |
| <para> |
| HarfBuzz's level 2 cluster behavior uses a significantly |
| different model than that of level 0 and level 1. |
| </para> |
| <para> |
| The level 2 behavior is easy to describe, but it may be |
| difficult to understand in practical terms. In brief, level 2 |
| performs no merging of clusters whatsoever. |
| </para> |
| <para> |
| This means that there is no initial base-and-mark merging step |
| (as is done in level 0), and it means that reordering moves and |
| ligature substitutions do not trigger a cluster merge. |
| </para> |
| <para> |
| Only one shaping operation directly affects clusters when using |
| level 2: |
| </para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| When a cluster <emphasis>decomposes</emphasis>, all of the |
| resulting child clusters inherit as their cluster value the |
| cluster value of the parent cluster. |
| </para> |
| </listitem> |
| </itemizedlist> |
| <para> |
| When glyphs do form a ligature (or when some other feature |
| substitutes multiple glyphs with one glyph) the cluster value |
| of the first glyph is retained as the cluster value for the |
| resulting ligature. |
| </para> |
| <para> |
| This occurrence sounds similar to a cluster merge, but it is |
| different. In particular, no subsequent characters — |
| including marks and modifiers — are affected. They retain |
| their previous cluster values. |
| </para> |
| <para> |
| Level 2 cluster behavior is ultimately less complex than level 0 |
| or level 1, but there are several cases for which processing |
| cluster values produced at level 2 may be tricky. |
| </para> |
| <section id="ligatures-with-combining-marks-in-level-2"> |
| <title>Ligatures with combining marks in level 2</title> |
| <para> |
| The first example of how HarfBuzz's level 2 cluster behavior |
| can be tricky is when the text to be shaped includes combining |
| marks attached to ligatures. |
| </para> |
| <para> |
| Let us start with an input sequence with the following |
| characters (top row) and initial cluster values (bottom row): |
| </para> |
| <programlisting> |
| A,acute,B,breve,C,circumflex |
| 0,1 ,2,3 ,4,5 |
| </programlisting> |
| <para> |
| If the sequence <literal>A,B,C</literal> forms a ligature, |
| then these are the cluster values HarfBuzz will return under |
| the various cluster levels: |
| </para> |
| <para> |
| Level 0: |
| </para> |
| <programlisting> |
| ABC,acute,breve,circumflex |
| 0 ,0 ,0 ,0 |
| </programlisting> |
| <para> |
| Level 1: |
| </para> |
| <programlisting> |
| ABC,acute,breve,circumflex |
| 0 ,0 ,0 ,5 |
| </programlisting> |
| <para> |
| Level 2: |
| </para> |
| <programlisting> |
| ABC,acute,breve,circumflex |
| 0 ,1 ,3 ,5 |
| </programlisting> |
| <para> |
| Making sense of the level 2 result is the hardest for a client |
| program, because there is nothing in the cluster values that |
| indicates that <literal>B</literal> and <literal>C</literal> |
| formed a ligature with <literal>A</literal>. |
| </para> |
| <para> |
| In contrast, the "merged" cluster values of the mark glyphs |
| that are seen in the level 0 and level 1 output are evidence |
| that a ligature substitution took place. |
| </para> |
| </section> |
| <section id="reordering-in-level-2"> |
| <title>Reordering in level 2</title> |
| <para> |
| Another example of how HarfBuzz's level 2 cluster behavior |
| can be tricky is when glyphs reorder. Consider an input sequence |
| with the following characters (top row) and initial cluster |
| values (bottom row): |
| </para> |
| <programlisting> |
| A,B,C,D,E |
| 0,1,2,3,4 |
| </programlisting> |
| <para> |
| Now imagine <literal>D</literal> moves before |
| <literal>B</literal> in a reordering operation. The cluster |
| values will then be: |
| </para> |
| <programlisting> |
| A,D,B,C,E |
| 0,3,1,2,4 |
| </programlisting> |
| <para> |
| Next, if <literal>D</literal> forms a ligature with |
| <literal>B</literal>, the output is: |
| </para> |
| <programlisting> |
| A,DB,C,E |
| 0,3 ,2,4 |
| </programlisting> |
| <para> |
| However, in a different scenario, in which the shaping rules |
| of the script instead caused <literal>A</literal> and |
| <literal>B</literal> to form a ligature |
| <emphasis>before</emphasis> the <literal>D</literal> reordered, the |
| result would be: |
| </para> |
| <programlisting> |
| AB,D,C,E |
| 0 ,3,2,4 |
| </programlisting> |
| <para> |
| There is no way for a client program to differentiate between |
| these two scenarios based on the cluster values |
| alone. Consequently, client programs that use level 2 might |
| need to undertake additional work in order to manage cursor |
| positioning, text attributes, or other desired features. |
| </para> |
| </section> |
| <section id="other-considerations-in-level-2"> |
| <title>Other considerations in level 2</title> |
| <para> |
| There may be other problems encountered with ligatures under |
| level 2, such as if the direction of the text is forced to |
| the opposite of its natural direction (for example, Arabic text |
| that is forced into left-to-right directionality). But, |
| generally speaking, these other scenarios are minor corner |
| cases that are too obscure for most client programs to need to |
| worry about. |
| </para> |
| </section> |
| </section> |
| </chapter> |