<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
"http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
<!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
<!ENTITY version SYSTEM "version.xml">
]>
<chapter id="buffers-language-script-and-direction">
<title>Buffers, language, script and direction</title>
<para>
The input to the HarfBuzz shaper is a series of Unicode characters, stored in a
buffer. In this chapter, we'll look at how to set up a buffer with
the text that we want and how to customize the properties of the
buffer. We'll also look at a piece of lower-level machinery that
you will need to understand before proceeding: the functions that
HarfBuzz uses to retrieve Unicode information.
</para>
<para>
After shaping is complete, HarfBuzz puts its output back
into the buffer. But getting that output requires setting up a
face and a font first, so we will look at that in the next chapter
instead of here.
</para>
<section id="creating-and-destroying-buffers">
<title>Creating and destroying buffers</title>
<para>
As we saw in our <emphasis>Getting Started</emphasis> example, a
buffer is created and
initialized with <function>hb_buffer_create()</function>. This
produces a new, empty buffer object, instantiated with some
default values and ready to accept your Unicode strings.
</para>
<para>
HarfBuzz objects such as buffers are reference-counted, so you
never free their memory directly. When you have finished working
with a buffer, call <function>hb_buffer_destroy()</function> to
release your reference:
</para>
<programlisting language="C">
hb_buffer_t *buf = hb_buffer_create();
...
hb_buffer_destroy(buf);
</programlisting>
<para>
This will destroy the object and free its associated memory,
unless some other part of the program holds a reference to this
buffer. If you acquire a HarfBuzz buffer from another subsystem
and want to ensure that it is not freed when someone
else destroys it, you should increase its reference count:
</para>
<programlisting language="C">
void somefunc(hb_buffer_t *buf) {
  buf = hb_buffer_reference(buf);
  ...
</programlisting>
<para>
And then decrease it once you're done with it:
</para>
<programlisting language="C">
  hb_buffer_destroy(buf);
}
</programlisting>
<para>
While we are on the subject of reference-counting buffers, it is
worth noting that an individual buffer can only meaningfully be
used by one thread at a time.
</para>
<para>
To throw away all the data in your buffer and start from scratch,
call <function>hb_buffer_reset(buf)</function>. If you want to
throw away the string in the buffer but keep settings such as the
Unicode functions and the replacement code point, you can
instead call <function>hb_buffer_clear_contents(buf)</function>.
</para>
</section>

<section id="adding-text-to-the-buffer">
<title>Adding text to the buffer</title>
<para>
Now we have a brand new HarfBuzz buffer. Let's start filling it
with text! From HarfBuzz's perspective, a buffer is just a stream
of Unicode code points, but your input string is probably in one of
the standard Unicode character encodings (UTF-8, UTF-16, or
UTF-32). HarfBuzz provides convenience functions that accept
each of these encodings:
<function>hb_buffer_add_utf8()</function>,
<function>hb_buffer_add_utf16()</function>, and
<function>hb_buffer_add_utf32()</function>. Other than the
character encoding they accept, they function identically.
</para>
<para>
You can add UTF-8 text to a buffer by passing in the text array,
the array's length, an offset into the array for the first
character to add, and the length of the segment to add:
</para>
<programlisting language="C">
void
hb_buffer_add_utf8 (hb_buffer_t *buf,
                    const char *text,
                    int text_length,
                    unsigned int item_offset,
                    int item_length);
</programlisting>
<para>
So, in practice, you can say:
</para>
<programlisting language="C">
hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
</programlisting>
<para>
This will append your new characters to
<parameter>buf</parameter>, not replace its existing
contents. Also, note that you can use <literal>-1</literal> in
place of the first instance of <function>strlen(text)</function>
if your text array is NUL-terminated. Similarly, you can also use
<literal>-1</literal> as the final argument if you want to add
everything from the offset to the end of the text.
</para>
<para>
Whatever <parameter>item_offset</parameter> and
<parameter>item_length</parameter> you provide, HarfBuzz will also
attempt to grab the five characters <emphasis>before</emphasis>
the offset point and the five characters
<emphasis>after</emphasis> the designated end. These are the
before and after "context" segments, which HarfBuzz uses
internally to make shaping decisions. They will not be part of
the final output, but they ensure that HarfBuzz's
script-specific shaping operations are correct. If there are
fewer than five characters available for the before or after
contexts, HarfBuzz will just grab what is there.
</para>
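<para>
To make the arithmetic concrete, here is a sketch in plain C that
computes which positions would fall into the before and after
contexts. It models only the rule described above, with positions
counted in code points; the type
<type>context_ranges_t</type> and the function
<function>context_for_item()</function> are hypothetical and are
not part of HarfBuzz, whose internal bookkeeping may differ.
</para>

```c
#include <stddef.h>

/* Number of context code points on each side, per the text above. */
#define CONTEXT_LENGTH 5

/* Hypothetical type: half-open ranges [start, end) of positions. */
typedef struct {
  size_t pre_start, pre_end;   /* context before the item */
  size_t post_start, post_end; /* context after the item */
} context_ranges_t;

/* Sketch only. Assumes item_offset + item_length <= text_length,
 * with all three values counted in code points. */
context_ranges_t
context_for_item (size_t text_length, size_t item_offset, size_t item_length)
{
  context_ranges_t r;
  size_t item_end = item_offset + item_length;
  size_t after = text_length - item_end;

  /* Up to five code points before the item, fewer near the start. */
  r.pre_start = item_offset > CONTEXT_LENGTH ? item_offset - CONTEXT_LENGTH : 0;
  r.pre_end = item_offset;

  /* Up to five code points after the item, fewer near the end. */
  r.post_start = item_end;
  r.post_end = item_end + (after > CONTEXT_LENGTH ? CONTEXT_LENGTH : after);
  return r;
}
```

<para>
With a 20-code-point text, an offset of 8, and a length of 4, this
yields a before context of positions 3 to 7 and an after context of
positions 12 to 16.
</para>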
<para>
For longer text runs, such as full paragraphs, it might be
tempting to only add smaller sub-segments to a buffer and
shape them in piecemeal fashion. Generally, this is not a good
idea, however, because a lot of shaping decisions are
dependent on this context information. For example, in Arabic
and other connected scripts, HarfBuzz needs to know the code
points before and after each character in order to correctly
determine which glyph to return.
</para>
<para>
The safest approach is to add all of the text available, then
use <parameter>item_offset</parameter> and
<parameter>item_length</parameter> to indicate which characters you
want shaped, so that HarfBuzz has access to any context.
</para>
<para>
You can also add Unicode code points directly with
<function>hb_buffer_add_codepoints()</function>. The arguments
to this function are the same as those for the UTF
functions. It is particularly important to note, however, that
HarfBuzz does not check the validity of text added this way. The
UTF functions replace invalid input with the replacement code
point, but with <function>hb_buffer_add_codepoints()</function>
it is up to you to do any sanity checking necessary.
</para>
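<para>
If you cannot vouch for your input, one possible pre-check is
sketched below in plain C. It treats a value as valid when it is a
Unicode scalar value (at most <literal>U+10FFFF</literal> and not a
surrogate) and substitutes <literal>U+FFFD</literal> otherwise. The
function names are hypothetical, and this is only one reasonable
definition of "valid"; your application may need stricter rules.
</para>

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHARACTER 0xFFFDu

/* A Unicode scalar value is at most U+10FFFF and is not a
 * surrogate (U+D800..U+DFFF). */
static int
is_scalar_value (uint32_t cp)
{
  return cp <= 0x10FFFFu && !(cp >= 0xD800u && cp <= 0xDFFFu);
}

/* Hypothetical helper: replaces invalid entries in place and returns
 * how many entries were replaced. */
size_t
sanitize_code_points (uint32_t *text, size_t length)
{
  size_t replaced = 0;
  size_t i;

  for (i = 0; i < length; i++)
    if (!is_scalar_value (text[i]))
      {
        text[i] = REPLACEMENT_CHARACTER;
        replaced++;
      }
  return replaced;
}
```

<para>
The cleaned array can then be handed to
<function>hb_buffer_add_codepoints()</function>, since
<type>hb_codepoint_t</type> is a 32-bit unsigned integer.
</para>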
</section>

<section id="setting-buffer-properties">
<title>Setting buffer properties</title>
<para>
Buffers containing input characters still need several
properties set before HarfBuzz can shape their text correctly.
</para>
<para>
Initially, all buffers are set to the
<literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content
type. After adding text, the buffer should be set to
<literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which
indicates that it contains un-shaped input
characters. After shaping, the buffer will have the
<literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type.
</para>
<para>
<function>hb_buffer_add_utf8()</function> and the
other UTF functions set the content type of their buffer
automatically. But if you are reusing a buffer you may want to
check its state with
<function>hb_buffer_get_content_type(buffer)</function>. If
necessary you can set the content type with
</para>
<programlisting language="C">
hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE);
</programlisting>
<para>
to prepare for shaping.
</para>
<para>
Buffers also need to carry information about the script,
language, and text direction of their contents. You can set
these properties individually:
</para>
<programlisting language="C">
hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
hb_buffer_set_script(buf, HB_SCRIPT_LATIN);
hb_buffer_set_language(buf, hb_language_from_string("en", -1));
</programlisting>
<para>
However, since these properties are often repeated for
multiple text runs, you can also save them in a
<literal>hb_segment_properties_t</literal> for reuse:
</para>
<programlisting language="C">
/* The caller owns the struct; HarfBuzz fills it in. */
hb_segment_properties_t savedprops;
hb_buffer_get_segment_properties (buf, &amp;savedprops);
...
/* Apply the saved direction, script, and language to another buffer. */
hb_buffer_set_segment_properties (buf2, &amp;savedprops);
</programlisting>
<para>
HarfBuzz also provides getter functions to retrieve a buffer's
direction, script, and language properties individually.
</para>
<para>
HarfBuzz recognizes four text directions in
<type>hb_direction_t</type>: left-to-right
(<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>),
top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and
bottom-to-top (<literal>HB_DIRECTION_BTT</literal>). For the
script property, HarfBuzz uses identifiers based on the
<ulink
url="https://unicode.org/iso15924/">ISO 15924
standard</ulink>. For languages, HarfBuzz uses tags based on the
<ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard.
</para>
<para>
Helper functions are provided to convert character strings into
the necessary script and language tag types.
</para>
<para>
Two additional buffer properties to be aware of are the
"invisible glyph" and the replacement code point. The
replacement code point is inserted into buffer output in place of
any invalid code points encountered in the input. By default, it
is the Unicode <literal>REPLACEMENT CHARACTER</literal> code
point, <literal>U+FFFD</literal> "�". You can change this with
</para>
<programlisting language="C">
hb_buffer_set_replacement_codepoint(buf, replacement);
</programlisting>
<para>
passing in the replacement Unicode code point as the
<parameter>replacement</parameter> parameter.
</para>
<para>
The invisible glyph is used to replace all output glyphs that
are invisible. By default, the standard space character
<literal>U+0020</literal> is used; you can replace this (for
example, when using a font that provides script-specific
spaces) with
</para>
<programlisting language="C">
hb_buffer_set_invisible_glyph(buf, replacement_glyph);
</programlisting>
<para>
Do note that in the <parameter>replacement_glyph</parameter>
parameter, you must provide the glyph ID of the replacement you
wish to use, not the Unicode code point.
</para>
<para>
HarfBuzz supports a few additional flags you might want to set
on your buffer under certain circumstances. The
<literal>HB_BUFFER_FLAG_BOT</literal> and
<literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz
that the buffer represents the beginning or end (respectively)
of a text element (such as a paragraph or other block). Knowing
this allows HarfBuzz to apply certain contextual font features
when shaping, such as initial or final variants in connected
scripts.
</para>
<para>
<literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal>
tells HarfBuzz not to hide glyphs with the
<literal>Default_Ignorable</literal> property in Unicode. This
property designates control characters and other non-printing
code points, such as joiners and variation selectors. Normally
HarfBuzz replaces them in the output buffer with zero-width
space glyphs (using the "invisible glyph" property discussed
above); setting this flag causes them to be printed, which can
be helpful for troubleshooting.
</para>
<para>
Conversely, setting the
<literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag
tells HarfBuzz to remove <literal>Default_Ignorable</literal>
glyphs from the output buffer entirely. Finally, setting the
<literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal>
flag tells HarfBuzz not to insert the dotted-circle glyph
(<literal>U+25CC</literal>, "◌"), which is normally
inserted into buffer output when broken character sequences are
encountered (such as combining marks that are not attached to a
base character).
</para>
</section>

<section id="customizing-unicode-functions">
<title>Customizing Unicode functions</title>
<para>
HarfBuzz requires some simple functions for accessing
information from the Unicode Character Database (such as the
<literal>General_Category</literal> (gc) and
<literal>Script</literal> (sc) properties) that is useful
for shaping, as well as some useful operations like composing and
decomposing code points.
</para>
<para>
HarfBuzz includes its own internal, lightweight set of Unicode
functions. At build time, it is also possible to compile support
for some other options, such as the Unicode functions provided
by GLib or the International Components for Unicode (ICU)
library. Generally, this option is only of interest for client
programs that have specific integration requirements or that do
a significant amount of customization.
</para>
<para>
If your program has access to other Unicode functions, however,
such as through a system library or application framework, you
might prefer to use those instead of the built-in
options. HarfBuzz supports this by implementing its Unicode
functions as a set of virtual methods that you can replace
without otherwise affecting HarfBuzz's functionality.
</para>
<para>
The Unicode functions are specified in a structure called
<literal>unicode_funcs</literal> which is attached to each
buffer. But even though <literal>unicode_funcs</literal> is
associated with a <type>hb_buffer_t</type>, the functions
themselves are called by other HarfBuzz APIs that access
buffers, so it would be unwise for you to hook different
functions into different buffers.
</para>
<para>
In addition, you can mark your <literal>unicode_funcs</literal>
as immutable by calling
<function>hb_unicode_funcs_make_immutable (ufuncs)</function>.
This is especially useful if your code is a
library or framework that will have its own client programs. By
marking your Unicode function choices as immutable, you prevent
your own client programs from changing the
<literal>unicode_funcs</literal> configuration and introducing
inconsistencies and errors downstream.
</para>
<para>
You can retrieve the Unicode-functions configuration for
your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>:
</para>
<programlisting language="C">
hb_unicode_funcs_t *ufunctions;
ufunctions = hb_buffer_get_unicode_funcs(buf);
</programlisting>
<para>
The current version of <literal>unicode_funcs</literal> uses six functions:
</para>
<itemizedlist>
<listitem>
<para>
<function>hb_unicode_combining_class_func_t</function>:
returns the Canonical Combining Class of a code point.
</para>
</listitem>
<listitem>
<para>
<function>hb_unicode_general_category_func_t</function>:
returns the General Category (gc) of a code point.
</para>
</listitem>
<listitem>
<para>
<function>hb_unicode_mirroring_func_t</function>: returns
the Mirroring Glyph code point (for bi-directional
replacement) of a code point.
</para>
</listitem>
<listitem>
<para>
<function>hb_unicode_script_func_t</function>: returns the
Script (sc) property of a code point.
</para>
</listitem>
<listitem>
<para>
<function>hb_unicode_compose_func_t</function>: returns the
canonical composition of a sequence of two code points.
</para>
</listitem>
<listitem>
<para>
<function>hb_unicode_decompose_func_t</function>: returns
the canonical decomposition of a code point.
</para>
</listitem>
</itemizedlist>
<para>
Note, however, that future HarfBuzz releases may alter this set.
</para>
<para>
Each Unicode function has a corresponding setter, with which you
can assign a callback to your replacement function. For example,
to replace
<function>hb_unicode_general_category_func_t</function>, you can call
</para>
<programlisting language="C">
hb_unicode_funcs_set_general_category_func (ufuncs, func, user_data, destroy);
</programlisting>
<para>
Virtualizing this set of Unicode functions is primarily intended
to improve portability. There is no need for every client
program to make the effort to replace the default options, so if
you are unsure, do not feel any pressure to customize
<literal>unicode_funcs</literal>.
</para>
</section>

</chapter>