<chapter id="clusters">
  <sect1 id="clusters">
    <title>Clusters</title>
    <para>
      In shaping text, a <emphasis>cluster</emphasis> is a sequence of
      code points that must be treated as a single, indivisible unit.
    </para>
    <para>
      When you add text to a HarfBuzz buffer, each character is
      associated with a <emphasis>cluster value</emphasis>. As far as
      HarfBuzz is concerned, this is an arbitrary number.
    </para>
    <para>
      Most clients will use UTF-8, UTF-16, or UTF-32 indices as cluster
      values, but the actual numbers do not matter. Cluster values are
      not even required to be monotonically increasing: the code makes
      no such assumption, although nearly all of HarfBuzz's tests use
      monotonically increasing cluster values. With that in mind, let's
      examine what happens to cluster values during shaping under each
      cluster level.
    </para>
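    <para>
      As a rough illustration of the index-based cluster values
      mentioned above, the following sketch computes the starting
      code-unit index of each character under the three encodings. The
      function name <literal>cluster_values</literal> is hypothetical
      and is not part of HarfBuzz; it only shows what a client might
      pass in as initial cluster values.
    </para>

```python
def cluster_values(text, encoding):
    """Return the starting code-unit index of each character in `text`.

    `encoding` is "utf-8", "utf-16" or "utf-32". The returned indices are
    the kind of values a client might use as initial cluster values.
    """
    unit_size = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[encoding]
    # The "-le" codecs avoid the byte-order mark that plain utf-16/32 prepend.
    codec = {"utf-8": "utf-8", "utf-16": "utf-16-le", "utf-32": "utf-32-le"}[encoding]
    values = []
    offset = 0
    for ch in text:
        values.append(offset)
        # Advance by the number of code units this character occupies.
        offset += len(ch.encode(codec)) // unit_size
    return values


sample = "a\u00e9\U0001F600b"  # one-, two- and four-byte UTF-8 characters
for name in ("utf-8", "utf-16", "utf-32"):
    print(name, cluster_values(sample, name))
```

    <para>
      Whichever encoding is chosen, the resulting values are distinct
      and increasing, which is the situation the rest of this chapter
      assumes in its examples.
    </para>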
    <para>
      HarfBuzz provides three <emphasis>levels</emphasis> of clustering
      support. Level 0 is the default and reproduces the behavior of
      the old HarfBuzz library. Level 1 adjusts this behavior slightly
      to produce better results, so level 1 clustering is recommended
      for code that does not need to remain backward compatible with
      the old HarfBuzz.
    </para>
    <para>
      Level 2 differs significantly in how it treats cluster values.
      Levels 0 and 1 both process ligatures and glyph decomposition by
      merging clusters; level 2 does not.
    </para>
    <para>
      The conceptual model for what the cluster values mean, in levels 0
      and 1, is this:
    </para>
    <itemizedlist spacing="compact">
      <listitem>
        <para>
          the sequence of cluster values always remains monotone
        </para>
      </listitem>
      <listitem>
        <para>
          each value represents a single cluster
        </para>
      </listitem>
      <listitem>
        <para>
          each cluster contains one or more glyphs and one or more
          characters
        </para>
      </listitem>
    </itemizedlist>
    <para>
      Assuming that the initial cluster values were distinct and
      monotonically increasing, all adjacent glyphs having the same
      cluster value belong to the same cluster, and each character
      belongs to the cluster that has the highest value not larger than
      the character's initial cluster value. This will become clearer
      with an example.
    </para>
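    <para>
      The character-to-cluster rule just stated can be sketched as a
      small lookup. This is only an illustration under the stated
      assumption of distinct, increasing initial values; the function
      name <literal>cluster_of_character</literal> and the sample
      values are hypothetical.
    </para>

```python
from bisect import bisect_right


def cluster_of_character(initial_value, glyph_clusters):
    """Return the cluster that a character ended up in after shaping.

    `glyph_clusters` holds the cluster values read from the shaped glyphs;
    `initial_value` is the value the character was given on input.
    Assumes the initial values were distinct and increasing.
    """
    surviving = sorted(set(glyph_clusters))
    # Highest surviving cluster value that is not larger than initial_value.
    position = bisect_right(surviving, initial_value)
    if position == 0:
        raise ValueError("no cluster at or below this value")
    return surviving[position - 1]


# Hypothetical shaped run: six glyphs whose clusters are 0, 1, 1, 1, 1, 4.
shaped = [0, 1, 1, 1, 1, 4]
for value in range(5):
    print(value, "->", cluster_of_character(value, shaped))
```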
  </sect1>
  <sect1 id="a-clustering-example-for-levels-0-and-1">
    <title>A clustering example for levels 0 and 1</title>
    <para>
      Let's say we start with the following character sequence and
      cluster values:
    </para>
    <programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
    <para>
      We then map the characters to glyphs. For simplicity, let's assume
      that each character maps to the corresponding, identical-looking
      glyph:
    </para>
    <programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
    <para>
      Now if, for example, <literal>B</literal> and <literal>C</literal>
      ligate, then the clusters to which they belong "merge". The merged
      cluster takes as its cluster value the minimum of the cluster
      values of all the clusters that went into it. In this case, we
      get:
    </para>
    <programlisting>
A,BC,D,E
0,1 ,3,4
</programlisting>
    <para>
      Now let's assume that the <literal>BC</literal> glyph decomposes
      into three components, and that <literal>D</literal> decomposes
      into two. Each component inherits the cluster value of its parent:
    </para>
    <programlisting>
A,BC0,BC1,BC2,D0,D1,E
0,1  ,1  ,1  ,3 ,3 ,4
</programlisting>
    <para>
      Now if <literal>BC2</literal> and <literal>D0</literal> ligate, then
      their clusters (values 1 and 3) merge into
      <literal>min(1,3) = 1</literal>. Note that <literal>D1</literal>
      belongs to the merged cluster too, so its value also becomes 1:
    </para>
    <programlisting>
A,BC0,BC1,BC2D0,D1,E
0,1  ,1  ,1    ,1 ,4
</programlisting>
    <para>
      At this point, cluster 1 means: the character sequence
      <literal>BCD</literal> is represented by glyphs
      <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
      further.
    </para>
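    <para>
      The walk-through above can be replayed with a small simulation.
      This is a simplified model written for illustration, not
      HarfBuzz's implementation; the helper names
      <literal>ligate</literal> and <literal>decompose</literal> are
      hypothetical, and glyphs are modelled as (name, cluster) pairs.
    </para>

```python
def ligate(glyphs, start, end, name):
    """Fuse glyphs[start:end] into one glyph and merge their clusters."""
    involved = {cluster for _, cluster in glyphs[start:end]}
    merged = min(involved)
    # Every glyph of every participating cluster takes the merged value.
    relabelled = [(g, merged if c in involved else c) for g, c in glyphs]
    return relabelled[:start] + [(name, merged)] + relabelled[end:]


def decompose(glyphs, index, parts):
    """Replace one glyph with `parts`, each inheriting the parent's cluster."""
    _, cluster = glyphs[index]
    return glyphs[:index] + [(p, cluster) for p in parts] + glyphs[index + 1:]


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
glyphs = ligate(glyphs, 1, 3, "BC")                    # B and C ligate
glyphs = decompose(glyphs, 1, ["BC0", "BC1", "BC2"])   # BC splits in three
glyphs = decompose(glyphs, 4, ["D0", "D1"])            # D splits in two
glyphs = ligate(glyphs, 3, 5, "BC2D0")                 # BC2 and D0 ligate
print(glyphs)
```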
  </sect1>
  <sect1 id="reordering-in-levels-0-and-1">
    <title>Reordering in levels 0 and 1</title>
    <para>
      Another common operation in the more complex shapers is
      reordering. In those cases, to maintain monotone clusters,
      HarfBuzz merges the clusters of everything in the reordered
      sequence. For example, let's again start with the character
      sequence:
    </para>
    <programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
    <para>
      If <literal>D</literal> is reordered before <literal>B</literal>,
      then the <literal>B</literal>, <literal>C</literal>, and
      <literal>D</literal> clusters merge, and we get:
    </para>
    <programlisting>
A,D,B,C,E
0,1,1,1,4
</programlisting>
    <para>
      This is clearly not ideal, but it is the only sensible way to
      maintain monotone indices and retain the true relationship between
      glyphs and characters.
    </para>
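    <para>
      A minimal sketch of this reordering rule follows. It is an
      illustrative model only; the function name
      <literal>move_glyph</literal> is hypothetical, and glyphs are
      again modelled as (name, cluster) pairs.
    </para>

```python
def move_glyph(glyphs, source, target):
    """Move glyphs[source] to index `target`, merging the clusters it crosses."""
    low, high = min(source, target), max(source, target)
    involved = {cluster for _, cluster in glyphs[low:high + 1]}
    merged = min(involved)
    # All clusters inside the reordered span collapse to the minimum value.
    relabelled = [(g, merged if c in involved else c) for g, c in glyphs]
    relabelled.insert(target, relabelled.pop(source))
    return relabelled


start = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
print(move_glyph(start, 3, 1))  # D moves before B
```

    <para>
      Because the whole span takes one value, the output stays monotone
      no matter where the glyph lands inside it.
    </para>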
  </sect1>
  <sect1 id="the-distinction-between-levels-0-and-1">
    <title>The distinction between levels 0 and 1</title>
    <para>
      Cluster levels 0 and 1 both behave as described above. The only
      difference between the two is this: in level 0, at the very
      beginning of the shaping process, we also merge clusters between
      base characters and all Unicode marks (combining or not) following
      them. For example:
    </para>
    <programlisting>
A,acute,B
0,1    ,2
</programlisting>
    <para>
      will become:
    </para>
    <programlisting>
A,acute,B
0,0    ,2
</programlisting>
    <para>
      This is the default behavior. It remains the default because both
      Windows and the old HarfBuzz behaved this way. However, this
      behavior makes it impossible to color diacritic marks differently
      from their base characters, which is why level 1 does not perform
      this initial merging step.
    </para>
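    <para>
      The initial merging step of level 0 can be approximated as
      follows. This sketch assumes that "mark" means any character whose
      Unicode general category starts with <literal>M</literal>, and
      that the input values are increasing; the function name
      <literal>merge_marks</literal> is hypothetical and the real
      implementation may differ in detail.
    </para>

```python
import unicodedata


def merge_marks(text, clusters):
    """Give every mark the cluster value of the character before it."""
    merged = list(clusters)
    for i in range(1, len(text)):
        # General categories Mn, Mc and Me cover all Unicode marks.
        if unicodedata.category(text[i]).startswith("M"):
            merged[i] = merged[i - 1]
    return merged


# "A", COMBINING ACUTE ACCENT, "B" with initial cluster values 0, 1, 2.
print(merge_marks("A\u0301B", [0, 1, 2]))
```

    <para>
      Under level 1 this step is simply skipped, so the acute accent
      keeps its own cluster value.
    </para>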
    <para>
      Level 0 is more convenient for clients that rely on HarfBuzz
      clusters for cursor positioning. That approach is wrong, however:
      cursor positions should be determined from Unicode grapheme
      boundaries, <emphasis>not</emphasis> from shaping clusters. As
      such, level 1 clusters are preferred.
    </para>
    <para>
      One last note about levels 0 and 1. We currently don't allow a
      <literal>MultipleSubst</literal> lookup to replace a glyph with zero
      glyphs (that is, to delete a glyph). In some other situations,
      however, glyphs can be deleted. In those cases, if the glyph being
      deleted is the last glyph of its cluster, we make sure to merge
      the cluster with a neighboring cluster.
    </para>
    <para>
      This is done primarily to make sure that the first cluster of a
      run always carries the cluster value that points to the start of
      the run's text; more than one client currently relies on this
      guarantee.
    </para>
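    <para>
      The deletion rule can be sketched as below. Two assumptions are
      made that the text does not spell out: "last glyph of its cluster"
      is read as "last remaining glyph", and the preceding cluster is
      preferred as the merge partner, with the following one used only
      when nothing precedes. The function name
      <literal>delete_glyph</literal> is hypothetical.
    </para>

```python
def delete_glyph(glyphs, index):
    """Delete glyphs[index]; if its cluster would vanish, merge it into a neighbor."""
    _, cluster = glyphs[index]
    remaining = glyphs[:index] + glyphs[index + 1:]
    if not remaining or any(c == cluster for _, c in remaining):
        return remaining  # nothing left, or the cluster survives on its own
    # Assumption: prefer the preceding cluster, otherwise take the following one.
    neighbor = remaining[index - 1][1] if index > 0 else remaining[0][1]
    merged = min(cluster, neighbor)
    return [(g, merged if c == neighbor else c) for g, c in remaining]


run = [("A", 0), ("B", 1), ("C", 2)]
print(delete_glyph(run, 0))  # first glyph deleted: B inherits cluster 0
print(delete_glyph(run, 1))  # middle glyph deleted: values are unchanged
```

    <para>
      In the first call the run still begins with cluster value 0, which
      is the guarantee described above.
    </para>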
    <para>
      Incidentally, Apple's CoreText does something else to maintain the
      same promise: it inserts a glyph with ID 65535 at the beginning of
      the glyph string if the glyph corresponding to the first character
      in the run was deleted. HarfBuzz might do something similar in the
      future.
    </para>
  </sect1>
  <sect1 id="level-2">
    <title>Level 2</title>
    <para>
      Level 2 is a different beast from levels 0 and 1. It is simple to
      describe, but hard to make sense of: it does no cluster merging
      whatsoever. When glyphs ligate, or when multiple glyphs otherwise
      turn into one, the cluster value of the first glyph is retained.
    </para>
    <para>
      Here are a few examples of why processing cluster values produced at
      this level might be tricky:
    </para>
    <sect2 id="ligatures-with-combining-marks">
      <title>Ligatures with combining marks</title>
      <para>
        Imagine that capital letters are bases and lowercase letters are
        combining marks. With an input sequence like this:
      </para>
      <programlisting>
A,a,B,b,C,c
0,1,2,3,4,5
</programlisting>
      <para>
        if <literal>A,B,C</literal> ligate, then here are the cluster
        values one would get under the various levels:
      </para>
      <para>
        level 0:
      </para>
      <programlisting>
ABC,a,b,c
0  ,0,0,0
</programlisting>
      <para>
        level 1:
      </para>
      <programlisting>
ABC,a,b,c
0  ,0,0,5
</programlisting>
      <para>
        level 2:
      </para>
      <programlisting>
ABC,a,b,c
0  ,1,3,5
</programlisting>
      <para>
        The level 2 result is the hardest for a client to make sense of,
        because there is nothing in the cluster values to suggest that
        <literal>B</literal> and <literal>C</literal> ligated with
        <literal>A</literal>.
      </para>
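      <para>
        The three results above can be reproduced with a short
        simulation. It is a simplified model for illustration, assuming
        the sequence starts with a base and that marks are kept after
        the ligature; the function name
        <literal>ligate_bases</literal> is hypothetical, and glyphs are
        modelled as (name, cluster, is_mark) tuples.
      </para>

```python
def ligate_bases(glyphs, level):
    """Ligate all base glyphs under the given cluster level (0, 1 or 2)."""
    glyphs = list(glyphs)
    if level == 0:
        # Level 0 first folds each mark into the cluster of the preceding glyph.
        for i in range(1, len(glyphs)):
            name, _, is_mark = glyphs[i]
            if is_mark:
                glyphs[i] = (name, glyphs[i - 1][1], is_mark)
    bases = [i for i, g in enumerate(glyphs) if not g[2]]
    first, last = bases[0], bases[-1]
    if level in (0, 1):
        # Merge every cluster touched by the span from first to last base.
        involved = {c for _, c, _ in glyphs[first:last + 1]}
        merged = min(involved)
        glyphs = [(n, merged if c in involved else c, m) for n, c, m in glyphs]
    # Under level 2 nothing was merged: the ligature keeps the first cluster.
    ligature = ("".join(glyphs[i][0] for i in bases), glyphs[first][1], False)
    return [ligature] + [g for g in glyphs if g[2]]


sequence = [("A", 0, False), ("a", 1, True), ("B", 2, False),
            ("b", 3, True), ("C", 4, False), ("c", 5, True)]
for level in (0, 1, 2):
    print(level, [cluster for _, cluster, _ in ligate_bases(sequence, level)])
```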
    </sect2>
    <sect2 id="reordering">
      <title>Reordering</title>
      <para>
        Another tricky case is reordering. Under level 2, suppose we
        start with:
      </para>
      <programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
      <para>
        Now imagine <literal>D</literal> moves before
        <literal>B</literal>:
      </para>
      <programlisting>
A,D,B,C,E
0,3,1,2,4
</programlisting>
      <para>
        Now, if <literal>D</literal> ligates with <literal>B</literal>, we
        get:
      </para>
      <programlisting>
A,DB,C,E
0,3 ,2,4
</programlisting>
      <para>
        In a different scenario, <literal>A</literal> and
        <literal>B</literal> could have ligated
        <emphasis>before</emphasis> <literal>D</literal> reordered; that
        would have resulted in:
      </para>
      <programlisting>
AB,D,C,E
0 ,3,2,4
</programlisting>
      <para>
        There's no way to differentiate between these two scenarios based
        on the cluster numbers alone.
      </para>
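      <para>
        The ambiguity can be checked with a small model of level 2, in
        which moving a glyph leaves every cluster value alone and a
        ligature keeps the value of its first glyph. The helper names
        <literal>ligate</literal> and <literal>move</literal> are
        hypothetical and exist only for this sketch.
      </para>

```python
def ligate(glyphs, index, name):
    """Level 2: fuse glyphs[index] and glyphs[index + 1], keeping the first cluster."""
    _, cluster = glyphs[index]
    return glyphs[:index] + [(name, cluster)] + glyphs[index + 2:]


def move(glyphs, source, target):
    """Level 2: move a glyph without touching any cluster value."""
    glyphs = list(glyphs)
    glyphs.insert(target, glyphs.pop(source))
    return glyphs


start = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]

# Scenario 1: D moves before B, then D and B ligate.
first = ligate(move(start, 3, 1), 1, "DB")
# Scenario 2: A and B ligate, then D moves before C.
second = move(ligate(start, 0, "AB"), 2, 1)

first_clusters = [cluster for _, cluster in first]
second_clusters = [cluster for _, cluster in second]
print(first_clusters, second_clusters)
```

      <para>
        Both lists come out identical even though the glyph sequences
        differ, so a client looking only at cluster values cannot tell
        the scenarios apart.
      </para>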
      <para>
        Another problem occurs with ligatures under level 2 if the
        direction of the text is forced to the opposite of its natural
        direction (for example, left-to-right Arabic). But that's too
        much of a corner case to worry about.
      </para>
    </sect2>
  </sect1>
</chapter>