<chapter id="clusters">
<sect1 id="clusters">
<title>Clusters</title>
<para>
In shaping text, a <emphasis>cluster</emphasis> is a sequence of
code points that needs to be treated as a single, indivisible unit.
</para>
<para>
When you add text to a HarfBuzz buffer, each character is associated
with a <emphasis>cluster value</emphasis>. This is an arbitrary
number as far as HarfBuzz is concerned.
</para>
<para>
Most clients will use UTF-8, UTF-16, or UTF-32 indices, but the
actual number does not matter. Moreover, cluster values are not
required to be monotonically increasing, although nearly all of
HarfBuzz's tests are performed on monotonically increasing cluster
numbers. The code itself makes no such assumption. With that in
mind, let's examine what happens to cluster values during shaping
under each cluster level.
</para>
<para>
HarfBuzz provides three <emphasis>levels</emphasis> of clustering
support. Level 0 is the default behavior and reproduces the behavior
of the old HarfBuzz library. Level 1 tweaks this behavior slightly
to produce better results, so level 1 clustering is recommended for
code that is not required to implement backward compatibility with
the old HarfBuzz.
</para>
<para>
Level 2 differs significantly in how it treats cluster values.
Levels 0 and 1 both process ligatures and glyph decomposition by
merging clusters; level 2 does not.
</para>
<para>
The conceptual model for what the cluster values mean, in levels 0
and 1, is this:
</para>
<itemizedlist spacing="compact">
<listitem>
<para>
the sequence of cluster values will always remain monotone
</para>
</listitem>
<listitem>
<para>
each value represents a single cluster
</para>
</listitem>
<listitem>
<para>
each cluster contains one or more glyphs and one or more
characters
</para>
</listitem>
</itemizedlist>
<para>
Assuming that the initial cluster numbers were monotonically
increasing and distinct, all adjacent glyphs having the same cluster
number belong to the same cluster, and each character belongs to the
cluster that has the highest number not larger than its initial
cluster number. This will become clearer with an example.
</para>
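<para>
The rule above can be sketched in a few lines of Python. This is
only an illustrative model and not part of the HarfBuzz API: the
function name <literal>clusters_from_output</literal> and its two
list arguments are assumptions made for this sketch, which takes
the cluster values of the shaped glyphs and the initial cluster
values of the characters and recovers which glyphs and characters
belong to each cluster.
</para>
<programlisting language="python"><![CDATA[
```python
from itertools import groupby


def clusters_from_output(glyph_clusters, char_clusters):
    """Group glyph and character indices by cluster (levels 0 and 1).

    glyph_clusters: cluster value of each output glyph, in output order.
    char_clusters:  initial cluster value of each input character,
                    assumed distinct and monotonically increasing.
    Returns a list of (cluster_value, glyph_indices, char_indices).
    """
    groups = []
    pos = 0
    # Adjacent glyphs with the same cluster value form one cluster.
    for value, run in groupby(glyph_clusters):
        count = len(list(run))
        groups.append((value, list(range(pos, pos + count)), []))
        pos += count
    # A character belongs to the cluster with the highest value
    # that is not larger than the character's initial value.
    for index, initial in enumerate(char_clusters):
        candidates = [g for g in groups if g[0] <= initial]
        if candidates:
            max(candidates, key=lambda g: g[0])[2].append(index)
    return groups


for value, glyph_ids, char_ids in clusters_from_output(
        [0, 1, 1, 1, 1, 4], [0, 1, 2, 3, 4]):
    print(value, glyph_ids, char_ids)
```
]]></programlisting>
<para>
The sample input matches the final state of the example in the next
section: six glyphs with cluster values 0,1,1,1,1,4 produced from
five characters.
</para>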
</sect1>
<sect1 id="a-clustering-example-for-levels-0-and-1">
<title>A clustering example for levels 0 and 1</title>
<para>
Let's say we start with the following character sequence and cluster
values:
</para>
<programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
<para>
We then map the characters to glyphs. For simplicity, let's assume
that each character maps to the corresponding, identical-looking
glyph:
</para>
<programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
<para>
Now if, for example, <literal>B</literal> and <literal>C</literal>
ligate, then the clusters to which they belong "merge".
The merged cluster takes as its cluster number the minimum of the
cluster numbers of the clusters that were merged. In this case, we
get:
</para>
<programlisting>
A,BC,D,E
0,1 ,3,4
</programlisting>
<para>
Now let's assume that the <literal>BC</literal> glyph decomposes
into three components, and <literal>D</literal> also decomposes into
two. The components each inherit the cluster value of their parent:
</para>
<programlisting>
A,BC0,BC1,BC2,D0,D1,E
0,1 ,1 ,1 ,3 ,3 ,4
</programlisting>
<para>
Now if <literal>BC2</literal> and <literal>D0</literal> ligate, then
their clusters (numbers 1 and 3) merge into
<literal>min(1,3) = 1</literal>:
</para>
<programlisting>
A,BC0,BC1,BC2D0,D1,E
0,1 ,1 ,1 ,1 ,4
</programlisting>
<para>
At this point, cluster 1 means: the character sequence
<literal>BCD</literal> is represented by glyphs
<literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
further.
</para>
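<para>
The walkthrough above can be replayed with a small Python model.
This is a sketch of the merging rule only, not HarfBuzz's actual
implementation; the helpers <literal>merge_clusters</literal>,
<literal>ligate</literal>, and <literal>decompose</literal> are
hypothetical names, and glyphs are modeled as (name, cluster) pairs.
</para>
<programlisting language="python"><![CDATA[
```python
def merge_clusters(glyphs, start, end):
    """Merge every cluster touched by glyphs[start:end].

    All glyphs belonging to a touched cluster, including glyphs outside
    the range, receive the minimum of the touched cluster values.
    """
    touched = {cluster for _, cluster in glyphs[start:end]}
    low = min(touched)
    return [(name, low if cluster in touched else cluster)
            for name, cluster in glyphs]


def ligate(glyphs, start, end, name):
    """Replace glyphs[start:end] with one glyph after merging clusters."""
    merged = merge_clusters(glyphs, start, end)
    return merged[:start] + [(name, merged[start][1])] + merged[end:]


def decompose(glyphs, index, names):
    """Replace one glyph with components that inherit its cluster."""
    cluster = glyphs[index][1]
    parts = [(name, cluster) for name in names]
    return glyphs[:index] + parts + glyphs[index + 1:]


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
glyphs = ligate(glyphs, 1, 3, "BC")                   # B and C ligate
glyphs = decompose(glyphs, 1, ["BC0", "BC1", "BC2"])  # BC decomposes
glyphs = decompose(glyphs, 4, ["D0", "D1"])           # D decomposes
glyphs = ligate(glyphs, 3, 5, "BC2D0")                # BC2 and D0 ligate
print(glyphs)
```
]]></programlisting>
<para>
Note that the last ligation also changes the cluster value of
<literal>D1</literal>, which is not part of the ligature but shares
a cluster with <literal>D0</literal>.
</para>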
</sect1>
<sect1 id="reordering-in-levels-0-and-1">
<title>Reordering in levels 0 and 1</title>
<para>
Another common operation in the more complex shapers is glyph
reordering. In those cases, to maintain monotone clusters, HarfBuzz
merges the clusters of everything in the reordering sequence. For
example, let's again start with the character sequence:
</para>
<programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
<para>
If <literal>D</literal> is reordered before <literal>B</literal>,
then the <literal>B</literal>, <literal>C</literal>, and
<literal>D</literal> clusters merge, and we get:
</para>
<programlisting>
A,D,B,C,E
0,1,1,1,4
</programlisting>
<para>
This is clearly not ideal, but it is the only sensible way to
maintain monotone indices and retain the true relationship between
glyphs and characters.
</para>
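<para>
A minimal Python sketch of this reordering rule follows. It is a
toy model rather than HarfBuzz code: <literal>reorder_before</literal>
is a hypothetical helper, glyphs are (name, cluster) pairs, and the
sketch assumes that each glyph in the reordered span is the only
glyph of its cluster.
</para>
<programlisting language="python"><![CDATA[
```python
def reorder_before(glyphs, src, dst):
    """Move glyphs[src] to position dst (with dst before src).

    Every glyph in the span dst..src takes the minimum cluster value
    found in that span, which keeps the cluster values monotone.
    """
    span = glyphs[dst:src + 1]
    low = min(cluster for _, cluster in span)
    moved = [span[-1]] + span[:-1]
    merged = [(name, low) for name, _ in moved]
    return glyphs[:dst] + merged + glyphs[src + 1:]


glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
reordered = reorder_before(glyphs, 3, 1)  # D moves before B
print(reordered)
```
]]></programlisting>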
</sect1>
<sect1 id="the-distinction-between-levels-0-and-1">
<title>The distinction between levels 0 and 1</title>
<para>
The above describes what cluster levels 0 and 1 both do. The only
difference between the two is this: in level 0, at the very
beginning of the shaping process, we also merge the cluster of each
base character with those of all Unicode marks (combining or not)
that follow it. For example:
</para>
<programlisting>
A,acute,B
0,1 ,2
</programlisting>
<para>
will become:
</para>
<programlisting>
A,acute,B
0,0 ,2
</programlisting>
<para>
This is the default behavior, kept because Windows did it and old
HarfBuzz did it. But this behavior makes it impossible to color
diacritic marks differently from their base characters. That's why
in level 1 we do not perform this initial merging step.
</para>

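<para>
The initial merging step can be approximated in Python as follows.
This sketch assumes that "mark" means any character whose Unicode
general category starts with <literal>M</literal>, which is a
simplification chosen for illustration; the function name
<literal>initial_clusters</literal> is hypothetical and the real
shaper may classify characters differently.
</para>
<programlisting language="python"><![CDATA[
```python
import unicodedata


def initial_clusters(text, level):
    """Return per-character cluster values before shaping starts.

    Level 0 merges each mark into the cluster of the preceding
    character; level 1 leaves the initial values untouched.
    """
    clusters = []
    for index, char in enumerate(text):
        is_mark = unicodedata.category(char).startswith("M")
        if level == 0 and is_mark and clusters:
            clusters.append(clusters[-1])
        else:
            clusters.append(index)
    return clusters


text = "A\u0301B"  # A, combining acute accent, B
print(initial_clusters(text, 0))
print(initial_clusters(text, 1))
```
]]></programlisting>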
<para>
For clients, level 0 is more convenient if they rely on HarfBuzz
clusters for cursor positioning. But that's wrong anyway: cursor
positions should be determined based on Unicode grapheme boundaries,
not shaping clusters. As such, level 1 clusters are preferred.
</para>
<para>
One last note about levels 0 and 1. We currently don't allow a
<literal>MultipleSubst</literal> lookup to replace a glyph with zero
glyphs (i.e., to delete a glyph). But in some other situations,
glyphs can be deleted. In those cases, if the glyph being deleted is
the last glyph of its cluster, we make sure to merge the cluster
with a neighboring cluster.
</para>
<para>
This is primarily to make sure that the first cluster of a run
always has a cluster value pointing to the start of the text for
that run; more than one client currently relies on this guarantee.
</para>
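<para>
The following Python sketch illustrates the idea. The text above
does not say which neighboring cluster is chosen, so the policy used
here (merge with the preceding cluster if there is one, otherwise
with the following one) is an assumption of this sketch, and
<literal>delete_glyph</literal> is a hypothetical helper operating
on (name, cluster) pairs.
</para>
<programlisting language="python"><![CDATA[
```python
def delete_glyph(glyphs, index):
    """Delete glyphs[index], merging its cluster into a neighbor if needed."""
    _, cluster = glyphs[index]
    rest = glyphs[:index] + glyphs[index + 1:]
    # If another glyph still carries this cluster value, the cluster
    # survives and nothing needs to be merged.
    if any(c == cluster for _, c in rest):
        return rest
    if not rest:
        return rest
    # Assumed policy: prefer the preceding cluster, else the following one.
    neighbor = glyphs[index - 1][1] if index > 0 else rest[0][1]
    low = min(neighbor, cluster)
    return [(name, low if c == neighbor else c) for name, c in rest]


run = [("X", 0), ("B", 1), ("C", 2)]
print(delete_glyph(run, 0))
```
]]></programlisting>
<para>
In this model, deleting the first glyph of the run leaves
<literal>B</literal> with cluster value 0, so the run still begins
with a cluster that points to the start of the text.
</para>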
<para>
Incidentally, Apple's CoreText does something else to maintain the
same promise: it inserts a glyph with ID 65535 at the beginning of
the glyph string if the glyph corresponding to the first character
in the run was deleted. HarfBuzz might do something similar in the
future.
</para>
</sect1>
<sect1 id="level-2">
<title>Level 2</title>
<para>
Level 2 is a different beast from levels 0 and 1. It is simple to
describe, but hard to make sense of. It simply doesn't do any
cluster merging whatsoever. When glyphs ligate, or when multiple
glyphs are otherwise replaced by one, the cluster value of the
first glyph is retained.
</para>
<para>
Here are a few examples of why processing cluster values produced at
this level might be tricky:
</para>
<sect2 id="ligatures-with-combining-marks">
<title>Ligatures with combining marks</title>
<para>
Imagine capital letters are bases and lower case letters are
combining marks. With an input sequence like this:
</para>
<programlisting>
A,a,B,b,C,c
0,1,2,3,4,5
</programlisting>
<para>
if <literal>A,B,C</literal> ligate, then here are the cluster
values one would get under the various levels:
</para>
<para>
level 0:
</para>
<programlisting>
ABC,a,b,c
0 ,0,0,0
</programlisting>
<para>
level 1:
</para>
<programlisting>
ABC,a,b,c
0 ,0,0,5
</programlisting>
<para>
level 2:
</para>
<programlisting>
ABC,a,b,c
0 ,1,3,5
</programlisting>
<para>
The level 2 result is the hardest for a client to interpret,
because there is nothing in the cluster values to suggest that
<literal>B</literal> and <literal>C</literal> ligated with
<literal>A</literal>.
</para>
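<para>
The three results above can be reproduced with a toy Python model.
It is a sketch built only from the rules described in this chapter,
not from HarfBuzz's code; the function
<literal>shape_abc</literal> is hypothetical and hard-codes this one
input, treating upper case names as bases and lower case names as
marks.
</para>
<programlisting language="python"><![CDATA[
```python
def shape_abc(level):
    """Ligate the bases of A,a,B,b,C,c under the given cluster level."""
    names = ["A", "a", "B", "b", "C", "c"]
    clusters = list(range(len(names)))
    if level == 0:
        # Level 0 first merges each mark with the preceding base.
        for i, name in enumerate(names):
            if name.islower() and i > 0:
                clusters[i] = clusters[i - 1]
    bases = [i for i, name in enumerate(names) if name.isupper()]
    first, last = bases[0], bases[-1]
    if level in (0, 1):
        # Levels 0 and 1 merge every cluster touched by the ligature,
        # including the clusters of the marks between the bases.
        touched = set(clusters[first:last + 1])
        low = min(touched)
        clusters = [low if c in touched else c for c in clusters]
    # Under level 2 nothing is merged; the ligature simply keeps the
    # cluster value of its first glyph.
    marks = [(n, clusters[i]) for i, n in enumerate(names) if n.islower()]
    return [("ABC", clusters[first])] + marks


for level in (0, 1, 2):
    print(level, shape_abc(level))
```
]]></programlisting>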
</sect2>
<sect2 id="reordering">
<title>Reordering</title>
<para>
Another tricky case is reordering. Under level 2, suppose we start
with:
</para>
<programlisting>
A,B,C,D,E
0,1,2,3,4
</programlisting>
<para>
Now imagine <literal>D</literal> moves before
<literal>B</literal>:
</para>
<programlisting>
A,D,B,C,E
0,3,1,2,4
</programlisting>
<para>
Now, if <literal>D</literal> ligates with <literal>B</literal>, we
get:
</para>
<programlisting>
A,DB,C,E
0,3 ,2,4
</programlisting>
<para>
In a different scenario, <literal>A</literal> and
<literal>B</literal> could have ligated
<emphasis>before</emphasis> <literal>D</literal> reordered; that
would have resulted in:
</para>
<programlisting>
AB,D,C,E
0 ,3,2,4
</programlisting>
<para>
There's no way to differentiate between these two scenarios based
on the cluster numbers alone.
</para>
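<para>
The ambiguity is easy to demonstrate with a short Python model of
level 2, in which nothing is merged and a ligature keeps the cluster
value of its first glyph. The helpers <literal>move</literal> and
<literal>ligate_pair</literal> are hypothetical names used only for
this sketch.
</para>
<programlisting language="python"><![CDATA[
```python
def move(glyphs, src, dst):
    """Move the glyph at src to position dst; cluster values are unchanged."""
    result = list(glyphs)
    result.insert(dst, result.pop(src))
    return result


def ligate_pair(glyphs, index, name):
    """Ligate glyphs[index] and glyphs[index + 1] under level 2.

    The ligature keeps the cluster value of the first glyph.
    """
    return glyphs[:index] + [(name, glyphs[index][1])] + glyphs[index + 2:]


start = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]

# Scenario 1: D moves before B, then D and B ligate.
first = ligate_pair(move(start, 3, 1), 1, "DB")

# Scenario 2: A and B ligate, then D moves before C.
second = move(ligate_pair(start, 0, "AB"), 2, 1)

print([cluster for _, cluster in first])
print([cluster for _, cluster in second])
```
]]></programlisting>
<para>
Both scenarios end with the same sequence of cluster values even
though the glyphs differ, so a client looking only at the numbers
cannot tell them apart.
</para>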
<para>
Another problem happens with ligatures under level 2 if the
direction of the text is forced to the opposite of its natural
direction (e.g. left-to-right Arabic). But that's too much of a
corner case to worry about.
</para>
</sect2>
</sect1>
</chapter>