docs/usermanual-clusters.xml - third_party/harfbuzz - Git at Google

 <chapter id="clusters">
 <sect1 id="clusters">
   <title>Clusters</title>
   <para>
     In shaping text, a <emphasis>cluster</emphasis> is a sequence of
     code points that needs to be treated as a single, indivisible unit.
   </para>
   <para>
     When you add text to a HB buffer, each character is associated with
     a <emphasis>cluster value</emphasis>. This is an arbitrary number as
     far as HB is concerned.
   </para>
   <para>
     Most clients will use UTF-8, UTF-16, or UTF-32 indices, but the
     actual number does not matter. Moreover, it is not required for the
     cluster values to be monotonically increasing, but pretty much all
     of HB's tests are performed on monotonically increasing cluster
     numbers. Nevertheless, there is no such assumption in the code
     itself. With that in mind, let's examine what happens with cluster
     values during shaping under each cluster-level.
   </para>
   <para>
     HarfBuzz provides three <emphasis>levels</emphasis> of clustering
     support. Level 0 is the default behavior and reproduces the behavior
     of the old HarfBuzz library. Level 1 tweaks this behavior slightly
     to produce better results, so level 1 clustering is recommended for
     code that is not required to implement backward compatibility with
     the old HarfBuzz.
   </para>
   <para>
     Level 2 differs significantly in how it treats cluster values.
     Levels 0 and 1 both process ligatures and glyph decomposition by
     merging clusters; level 2 does not.
   </para>
   <para>
     The conceptual model for what the cluster values mean, in levels 0
     and 1, is this:
   </para>
   <itemizedlist spacing="compact">
     <listitem>
       <para>
         the sequence of cluster values will always remain monotone
       </para>
     </listitem>
     <listitem>
       <para>
         each value represents a single cluster
       </para>
     </listitem>
     <listitem>
       <para>
         each cluster contains one or more glyphs and one or more
         characters
       </para>
     </listitem>
   </itemizedlist>
   <para>
     Assuming that initial cluster numbers were monotonically increasing
     and distinct, then all adjacent glyphs having the same cluster
     number belong to the same cluster, and all characters belong to the
     cluster that has the highest number not larger than their initial
     cluster number. This will become clearer with an example.
   </para>
 </sect1>
 <sect1 id="a-clustering-example-for-levels-0-and-1">
   <title>A clustering example for levels 0 and 1</title>
   <para>
     Let's say we start with the following character sequence and cluster
     values:
   </para>
   <programlisting>
    A,B,C,D,E
    0,1,2,3,4
 </programlisting>
   <para>
     We then map the characters to glyphs. For simplicity, let's assume
     that each character maps to the corresponding, identical-looking
     glyph:
   </para>
   <programlisting>
    A,B,C,D,E
    0,1,2,3,4
 </programlisting>
   <para>
     Now if, for example, <literal>B</literal> and <literal>C</literal>
     ligate, then the clusters to which they belong &quot;merge&quot;.
     This merged cluster takes for its cluster number the minimum of all
     the cluster numbers of the clusters that went in. In this case, we
     get:
   </para>
   <programlisting>
    A,BC,D,E
    0,1 ,3,4
 </programlisting>
   <para>
     Now let's assume that the <literal>BC</literal> glyph decomposes
     into three components, and <literal>D</literal> also decomposes into
     two. The components each inherit the cluster value of their parent:
   </para>
   <programlisting>
    A,BC0,BC1,BC2,D0,D1,E
    0,1  ,1  ,1  ,3 ,3 ,4
 </programlisting>
   <para>
     Now if <literal>BC2</literal> and <literal>D0</literal> ligate, then
     their clusters (numbers 1 and 3) merge into
     <literal>min(1,3) = 1</literal>:
   </para>
   <programlisting>
    A,BC0,BC1,BC2D0,D1,E
    0,1  ,1  ,1    ,1 ,4
 </programlisting>
   <para>
     At this point, cluster 1 means: the character sequence
     <literal>BCD</literal> is represented by glyphs
     <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
     further.
   </para>
 </sect1>
 <sect1 id="reordering-in-levels-0-and-1">
   <title>Reordering in levels 0 and 1</title>
   <para>
     Another common operation in the more complex shapers is when things
     reorder. In those cases, to maintain monotone clusters, HB merges
     the clusters of everything in the reordering sequence. For example,
     let's again start with the character sequence:
   </para>
   <programlisting>
    A,B,C,D,E
    0,1,2,3,4
 </programlisting>
   <para>
     If <literal>D</literal> is reordered before <literal>B</literal>,
     then the <literal>B</literal>, <literal>C</literal>, and
     <literal>D</literal> clusters merge, and we get:
   </para>
   <programlisting>
    A,D,B,C,E
    0,1,1,1,4
 </programlisting>
   <para>
     This is clearly not ideal, but it is the only sensible way to
     maintain monotone indices and retain the true relationship between
     glyphs and characters.
   </para>
 </sect1>
 <sect1 id="the-distinction-between-levels-0-and-1">
   <title>The distinction between levels 0 and 1</title>
   <para>
     So, the above is pretty much what cluster levels 0 and 1 do. The
     only difference between the two is this: in level 0, at the very
     beginning of the shaping process, we also merge clusters between
     base characters and all Unicode marks (combining or not) following
     them. E.g.:
   </para>
   <programlisting>
   A,acute,B
   0,1    ,2
 </programlisting>
   <para>
     will become:
   </para>
   <programlisting>
   A,acute,B
   0,0    ,2
 </programlisting>
   <para>
     This is the default behavior. We do it because Windows did it and
     old HarfBuzz did it, so this remained the default. But this behavior
     makes it impossible to color diacritic marks differently from their
     base characters. That's why in level 1 we do not perform this
     initial merging step.
   </para>
   <para>
     For clients, level 0 is more convenient if they rely on HarfBuzz
     clusters for cursor positioning. But that's wrong anyway: cursor
     positions should be determined based on Unicode grapheme boundaries,
     NOT shaping clusters. As such, level 1 clusters are preferred.
   </para>
   <para>
     One last note about levels 0 and 1. We currently don't allow a
     <literal>MultipleSubst</literal> lookup to replace a glyph with zero
     glyphs (i.e., to delete a glyph). But in some other situations,
     glyphs can be deleted. In those cases, if the glyph being deleted is
     the last glyph of its cluster, we make sure to merge the cluster
     with a neighboring cluster.
   </para>
   <para>
     This is, primarily, to make sure that the starting cluster of the
     text always has the cluster index pointing to the start of the text
     for the run; more than one client currently relies on this
     guarantee.
   </para>
   <para>
     Incidentally, Apple's CoreText does something else to maintain the
     same promise: it inserts a glyph with id 65535 at the beginning of
     the glyph string if the glyph corresponding to the first character
     in the run was deleted. HarfBuzz might do something similar in the
     future.
   </para>
 </sect1>
 <sect1 id="level-2">
   <title>Level 2</title>
   <para>
     Level 2 is a different beast from levels 0 and 1. It is simple to
     describe, but hard to make sense of. It simply doesn't do any
     cluster merging whatsoever. When things ligate or otherwise multiple
     glyphs turn into one, the cluster value of the first glyph is
     retained.
   </para>
   <para>
     Here are a few examples of why processing cluster values produced at
     this level might be tricky:
   </para>
   <sect2 id="ligatures-with-combining-marks">
     <title>Ligatures with combining marks</title>
     <para>
       Imagine capital letters are bases and lower case letters are
       combining marks. With an input sequence like this:
     </para>
     <programlisting>
   A,a,B,b,C,c
   0,1,2,3,4,5
 </programlisting>
     <para>
       if <literal>A,B,C</literal> ligate, then here are the cluster
       values one would get under the various levels:
     </para>
     <para>
       level 0:
     </para>
     <programlisting>
   ABC,a,b,c
   0  ,0,0,0
 </programlisting>
     <para>
       level 1:
     </para>
     <programlisting>
   ABC,a,b,c
   0  ,0,0,5
 </programlisting>
     <para>
       level 2:
     </para>
     <programlisting>
   ABC,a,b,c
   0  ,1,3,5
 </programlisting>
     <para>
       Making sense of the last example is the hardest for a client,
       because there is nothing in the cluster values to suggest that
       <literal>B</literal> and <literal>C</literal> ligated with
       <literal>A</literal>.
     </para>
   </sect2>
   <sect2 id="reordering">
     <title>Reordering</title>
     <para>
       Another tricky case is when things reorder. Under level 2:
     </para>
     <programlisting>
   A,B,C,D,E
   0,1,2,3,4
 </programlisting>
     <para>
       Now imagine <literal>D</literal> moves before
       <literal>B</literal>:
     </para>
     <programlisting>
   A,D,B,C,E
   0,3,1,2,4
 </programlisting>
     <para>
       Now, if <literal>D</literal> ligates with <literal>B</literal>, we
       get:
     </para>
     <programlisting>
   A,DB,C,E
   0,3 ,2,4
 </programlisting>
     <para>
       In a different scenario, <literal>A</literal> and
       <literal>B</literal> could have ligated
       <emphasis>before</emphasis> <literal>D</literal> reordered; that
       would have resulted in:
     </para>
     <programlisting>
   AB,D,C,E
   0 ,3,2,4
 </programlisting>
     <para>
       There's no way to differentitate between these two scenarios based
       on the cluster numbers alone.
     </para>
     <para>
       Another problem appens with ligatures under level 2 if the
       direction of the text is forced to opposite of its natural
       direction (e.g. left-to-right Arabic). But that's too much of a
       corner case to worry about.
     </para>
   </sect2>
 </sect1>
 </chapter>
	<chapter id="clusters">
	<sect1 id="clusters">
	<title>Clusters</title>
	<para>
	In shaping text, a <emphasis>cluster</emphasis> is a sequence of
	code points that needs to be treated as a single, indivisible unit.
	</para>
	<para>
	When you add text to a HB buffer, each character is associated with
	a <emphasis>cluster value</emphasis>. This is an arbitrary number as
	far as HB is concerned.
	</para>
	<para>
	Most clients will use UTF-8, UTF-16, or UTF-32 indices, but the
	actual number does not matter. Moreover, it is not required for the
	cluster values to be monotonically increasing, but pretty much all
	of HB's tests are performed on monotonically increasing cluster
	numbers. Nevertheless, there is no such assumption in the code
	itself. With that in mind, let's examine what happens with cluster
	values during shaping under each cluster-level.
	</para>
	<para>
	HarfBuzz provides three <emphasis>levels</emphasis> of clustering
	support. Level 0 is the default behavior and reproduces the behavior
	of the old HarfBuzz library. Level 1 tweaks this behavior slightly
	to produce better results, so level 1 clustering is recommended for
	code that is not required to implement backward compatibility with
	the old HarfBuzz.
	</para>
	<para>
	Level 2 differs significantly in how it treats cluster values.
	Levels 0 and 1 both process ligatures and glyph decomposition by
	merging clusters; level 2 does not.
	</para>
	<para>
	The conceptual model for what the cluster values mean, in levels 0
	and 1, is this:
	</para>
	<itemizedlist spacing="compact">
	<listitem>
	<para>
	the sequence of cluster values will always remain monotone
	</para>
	</listitem>
	<listitem>
	<para>
	each value represents a single cluster
	</para>
	</listitem>
	<listitem>
	<para>
	each cluster contains one or more glyphs and one or more
	characters
	</para>
	</listitem>
	</itemizedlist>
	<para>
	Assuming that initial cluster numbers were monotonically increasing
	and distinct, then all adjacent glyphs having the same cluster
	number belong to the same cluster, and all characters belong to the
	cluster that has the highest number not larger than their initial
	cluster number. This will become clearer with an example.
	</para>
	</sect1>
	<sect1 id="a-clustering-example-for-levels-0-and-1">
	<title>A clustering example for levels 0 and 1</title>
	<para>
	Let's say we start with the following character sequence and cluster
	values:
	</para>
	<programlisting>
	A,B,C,D,E
	0,1,2,3,4
	</programlisting>
	<para>
	We then map the characters to glyphs. For simplicity, let's assume
	that each character maps to the corresponding, identical-looking
	glyph:
	</para>
	<programlisting>
	A,B,C,D,E
	0,1,2,3,4
	</programlisting>
	<para>
	Now if, for example, <literal>B</literal> and <literal>C</literal>
	ligate, then the clusters to which they belong "merge".
	This merged cluster takes for its cluster number the minimum of all
	the cluster numbers of the clusters that went in. In this case, we
	get:
	</para>
	<programlisting>
	A,BC,D,E
	0,1 ,3,4
	</programlisting>
	<para>
	Now let's assume that the <literal>BC</literal> glyph decomposes
	into three components, and <literal>D</literal> also decomposes into
	two. The components each inherit the cluster value of their parent:
	</para>
	<programlisting>
	A,BC0,BC1,BC2,D0,D1,E
	0,1 ,1 ,1 ,3 ,3 ,4
	</programlisting>
	<para>
	Now if <literal>BC2</literal> and <literal>D0</literal> ligate, then
	their clusters (numbers 1 and 3) merge into
	<literal>min(1,3) = 1</literal>:
	</para>
	<programlisting>
	A,BC0,BC1,BC2D0,D1,E
	0,1 ,1 ,1 ,1 ,4
	</programlisting>
	<para>
	At this point, cluster 1 means: the character sequence
	<literal>BCD</literal> is represented by glyphs
	<literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
	further.
	</para>
	</sect1>
	<sect1 id="reordering-in-levels-0-and-1">
	<title>Reordering in levels 0 and 1</title>
	<para>
	Another common operation in the more complex shapers is when things
	reorder. In those cases, to maintain monotone clusters, HB merges
	the clusters of everything in the reordering sequence. For example,
	let's again start with the character sequence:
	</para>
	<programlisting>
	A,B,C,D,E
	0,1,2,3,4
	</programlisting>
	<para>
	If <literal>D</literal> is reordered before <literal>B</literal>,
	then the <literal>B</literal>, <literal>C</literal>, and
	<literal>D</literal> clusters merge, and we get:
	</para>
	<programlisting>
	A,D,B,C,E
	0,1,1,1,4
	</programlisting>
	<para>
	This is clearly not ideal, but it is the only sensible way to
	maintain monotone indices and retain the true relationship between
	glyphs and characters.
	</para>
	</sect1>
	<sect1 id="the-distinction-between-levels-0-and-1">
	<title>The distinction between levels 0 and 1</title>
	<para>
	So, the above is pretty much what cluster levels 0 and 1 do. The
	only difference between the two is this: in level 0, at the very
	beginning of the shaping process, we also merge clusters between
	base characters and all Unicode marks (combining or not) following
	them. E.g.:
	</para>
	<programlisting>
	A,acute,B
	0,1 ,2
	</programlisting>
	<para>
	will become:
	</para>
	<programlisting>
	A,acute,B
	0,0 ,2
	</programlisting>
	<para>
	This is the default behavior. We do it because Windows did it and
	old HarfBuzz did it, so this remained the default. But this behavior
	makes it impossible to color diacritic marks differently from their
	base characters. That's why in level 1 we do not perform this
	initial merging step.
	</para>
	<para>
	For clients, level 0 is more convenient if they rely on HarfBuzz
	clusters for cursor positioning. But that's wrong anyway: cursor
	positions should be determined based on Unicode grapheme boundaries,
	NOT shaping clusters. As such, level 1 clusters are preferred.
	</para>
	<para>
	One last note about levels 0 and 1. We currently don't allow a
	<literal>MultipleSubst</literal> lookup to replace a glyph with zero
	glyphs (i.e., to delete a glyph). But in some other situations,
	glyphs can be deleted. In those cases, if the glyph being deleted is
	the last glyph of its cluster, we make sure to merge the cluster
	with a neighboring cluster.
	</para>
	<para>
	This is, primarily, to make sure that the starting cluster of the
	text always has the cluster index pointing to the start of the text
	for the run; more than one client currently relies on this
	guarantee.
	</para>
	<para>
	Incidentally, Apple's CoreText does something else to maintain the
	same promise: it inserts a glyph with id 65535 at the beginning of
	the glyph string if the glyph corresponding to the first character
	in the run was deleted. HarfBuzz might do something similar in the
	future.
	</para>
	</sect1>
	<sect1 id="level-2">
	<title>Level 2</title>
	<para>
	Level 2 is a different beast from levels 0 and 1. It is simple to
	describe, but hard to make sense of. It simply doesn't do any
	cluster merging whatsoever. When things ligate or otherwise multiple
	glyphs turn into one, the cluster value of the first glyph is
	retained.
	</para>
	<para>
	Here are a few examples of why processing cluster values produced at
	this level might be tricky:
	</para>
	<sect2 id="ligatures-with-combining-marks">
	<title>Ligatures with combining marks</title>
	<para>
	Imagine capital letters are bases and lower case letters are
	combining marks. With an input sequence like this:
	</para>
	<programlisting>
	A,a,B,b,C,c
	0,1,2,3,4,5
	</programlisting>
	<para>
	if <literal>A,B,C</literal> ligate, then here are the cluster
	values one would get under the various levels:
	</para>
	<para>
	level 0:
	</para>
	<programlisting>
	ABC,a,b,c
	0 ,0,0,0
	</programlisting>
	<para>
	level 1:
	</para>
	<programlisting>
	ABC,a,b,c
	0 ,0,0,5
	</programlisting>
	<para>
	level 2:
	</para>
	<programlisting>
	ABC,a,b,c
	0 ,1,3,5
	</programlisting>
	<para>
	Making sense of the last example is the hardest for a client,
	because there is nothing in the cluster values to suggest that
	<literal>B</literal> and <literal>C</literal> ligated with
	<literal>A</literal>.
	</para>
	</sect2>
	<sect2 id="reordering">
	<title>Reordering</title>
	<para>
	Another tricky case is when things reorder. Under level 2:
	</para>
	<programlisting>
	A,B,C,D,E
	0,1,2,3,4
	</programlisting>
	<para>
	Now imagine <literal>D</literal> moves before
	<literal>B</literal>:
	</para>
	<programlisting>
	A,D,B,C,E
	0,3,1,2,4
	</programlisting>
	<para>
	Now, if <literal>D</literal> ligates with <literal>B</literal>, we
	get:
	</para>
	<programlisting>
	A,DB,C,E
	0,3 ,2,4
	</programlisting>
	<para>
	In a different scenario, <literal>A</literal> and
	<literal>B</literal> could have ligated
	<emphasis>before</emphasis> <literal>D</literal> reordered; that
	would have resulted in:
	</para>
	<programlisting>
	AB,D,C,E
	0 ,3,2,4
	</programlisting>
	<para>
	There's no way to differentitate between these two scenarios based
	on the cluster numbers alone.
	</para>
	<para>
	Another problem appens with ligatures under level 2 if the
	direction of the text is forced to opposite of its natural
	direction (e.g. left-to-right Arabic). But that's too much of a
	corner case to worry about.
	</para>
	</sect2>
	</sect1>
	</chapter>