<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="buffers-language-script-and-direction">
  <title>Buffers, language, script and direction</title>
  <para>
    The input to the HarfBuzz shaper is a series of Unicode characters, stored in a
    buffer. In this chapter, we'll look at how to set up a buffer with
    the text that we want and how to customize the properties of the
    buffer. We'll also look at a piece of lower-level machinery that
    you will need to understand before proceeding: the functions that
    HarfBuzz uses to retrieve Unicode information.
  </para>
  <para>
    After shaping is complete, HarfBuzz puts its output back
    into the buffer. But getting that output requires setting up a
    face and a font first, so we will look at that in the next chapter
    instead of here.
  </para>
  <section id="creating-and-destroying-buffers">
    <title>Creating and destroying buffers</title>
    <para>
      As we saw in our <emphasis>Getting Started</emphasis> example, a
      buffer is created and
      initialized with <function>hb_buffer_create()</function>. This
      produces a new, empty buffer object, instantiated with some
      default values and ready to accept your Unicode strings.
    </para>
    <para>
      HarfBuzz objects, such as buffers, are reference-counted, so you
      do not free their memory yourself. When you have finished working
      with a buffer, release your reference to it by calling
      <function>hb_buffer_destroy()</function>:
    </para>
    <programlisting language="C">
      hb_buffer_t *buf = hb_buffer_create();
      ...
      hb_buffer_destroy(buf);
    </programlisting>
    <para>
      This will destroy the object and free its associated memory,
      unless some other part of the program still holds a reference to
      the buffer. If you acquire a HarfBuzz buffer from another subsystem
      and want to ensure that it is not freed when someone else
      destroys it, you should increase its reference count:
    </para>
    <programlisting language="C">
      void somefunc(hb_buffer_t *buf) {
        buf = hb_buffer_reference(buf);
        ...
    </programlisting>
    <para>
      And then decrease it once you're done with it:
    </para>
    <programlisting language="C">
        hb_buffer_destroy(buf);
      }
    </programlisting>
    <para>
      While we are on the subject of reference-counting buffers, it is
      worth noting that an individual buffer can only meaningfully be
      used by one thread at a time.
    </para>
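    <para>
      The reference-counting pattern described above can be sketched
      in plain C. This is only an illustrative model, not HarfBuzz
      code: the <literal>toy_buffer_t</literal> type and the
      <literal>toy_buffer_*</literal> functions are hypothetical
      names invented for this example.
    </para>
    <programlisting language="C"><![CDATA[
```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted object; not HarfBuzz code. */
typedef struct {
  int ref_count;
} toy_buffer_t;

/* Create an object that is owned by exactly one reference. */
static toy_buffer_t *toy_buffer_create(void) {
  toy_buffer_t *buf = malloc(sizeof *buf);
  if (buf) buf->ref_count = 1;
  return buf;
}

/* Take an additional reference and return the same pointer. */
static toy_buffer_t *toy_buffer_reference(toy_buffer_t *buf) {
  if (buf) buf->ref_count++;
  return buf;
}

/* Drop one reference; free the object when the last one goes away.
 * Returns 1 if the object was freed, 0 otherwise. */
static int toy_buffer_destroy(toy_buffer_t *buf) {
  if (!buf) return 0;
  assert(buf->ref_count > 0);
  if (--buf->ref_count > 0) return 0;
  free(buf);
  return 1;
}
```
]]></programlisting>
    <para>
      In this model, every call to the create or reference function
      must eventually be paired with one call to the destroy function,
      and the memory is released only by the last of them.
    </para>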
    <para>
      To throw away all the data in your buffer and start from scratch,
      call <function>hb_buffer_reset(buf)</function>. If you want to
      throw away the string in the buffer but keep the options, you can
      instead call <function>hb_buffer_clear_contents(buf)</function>.
    </para>
  </section>

  <section id="adding-text-to-the-buffer">
    <title>Adding text to the buffer</title>
    <para>
      Now we have a brand new HarfBuzz buffer. Let's start filling it
      with text! From HarfBuzz's perspective, a buffer is just a stream
      of Unicode code points, but your input string is probably in one of
      the standard Unicode character encodings (UTF-8, UTF-16, or
      UTF-32). HarfBuzz provides convenience functions that accept
      each of these encodings:
      <function>hb_buffer_add_utf8()</function>,
      <function>hb_buffer_add_utf16()</function>, and
      <function>hb_buffer_add_utf32()</function>. Other than the
      character encoding they accept, they function identically.
    </para>
    <para>
      You can add UTF-8 text to a buffer by passing in the text array,
      the array's length, an offset into the array for the first
      character to add, and the length of the segment to add:
    </para>
    <programlisting language="C">
      void
      hb_buffer_add_utf8 (hb_buffer_t  *buf,
                          const char   *text,
                          int           text_length,
                          unsigned int  item_offset,
                          int           item_length);
    </programlisting>
    <para>
      So, in practice, you can say:
    </para>
    <programlisting language="C">
      hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text));
    </programlisting>
    <para>
      This will append your new characters to
      <parameter>buf</parameter>, not replace its existing
      contents. Also, note that you can use <literal>-1</literal> in
      place of the first instance of <function>strlen(text)</function>
      if your text array is NUL-terminated. Similarly, you can use
      <literal>-1</literal> as the final argument if you want to add
      the text's full contents.
    </para>
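    <para>
      As a rough illustration of how the <literal>-1</literal>
      shorthand can be interpreted, the following sketch resolves both
      length arguments into explicit values. The helper names are
      hypothetical, and the sketch assumes that <literal>-1</literal>
      for the item length means "everything from
      <parameter>item_offset</parameter> to the end of the text"; it
      is not taken from the HarfBuzz sources.
    </para>
    <programlisting language="C"><![CDATA[
```c
#include <assert.h>
#include <string.h>

/* Resolve a text_length argument: a negative value means
 * "measure the NUL-terminated string". */
static size_t resolve_text_length(const char *text, int text_length) {
  return text_length < 0 ? strlen(text) : (size_t)text_length;
}

/* Resolve an item_length argument: a negative value is assumed to mean
 * "everything from item_offset to the end of the text". */
static size_t resolve_item_length(size_t text_length, size_t item_offset,
                                  int item_length) {
  assert(item_offset <= text_length);
  return item_length < 0 ? text_length - item_offset : (size_t)item_length;
}
```
]]></programlisting>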
    <para>
      Whatever <parameter>item_offset</parameter> and
      <parameter>item_length</parameter> you provide, HarfBuzz will also
      attempt to grab the five characters <emphasis>before</emphasis>
      the offset point and the five characters
      <emphasis>after</emphasis> the designated end. These are the
      before and after "context" segments, which HarfBuzz uses
      internally to make shaping decisions. They will not be part of
      the final output, but they ensure that HarfBuzz's
      script-specific shaping operations are correct. If there are
      fewer than five characters available for the before or after
      contexts, HarfBuzz will just grab what is there.
    </para>
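    <para>
      The arithmetic behind these context segments can be sketched as
      follows. This is a simplified model that counts in code points
      and uses hypothetical names (<literal>context_t</literal>,
      <literal>compute_context()</literal>); the window size of five
      is taken from the description above, and the sketch makes no
      claim about how HarfBuzz implements this internally.
    </para>
    <programlisting language="C"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Assumed context size, taken from the "five characters" described above. */
#define CONTEXT_LENGTH 5

typedef struct {
  size_t pre_start;  /* index of the first pre-context code point */
  size_t pre_len;    /* number of pre-context code points */
  size_t post_start; /* index of the first post-context code point */
  size_t post_len;   /* number of post-context code points */
} context_t;

/* Compute the context windows around the item
 * [item_offset, item_offset + item_length) in a text of
 * text_length code points. */
static context_t compute_context(size_t text_length, size_t item_offset,
                                 size_t item_length) {
  context_t c;
  size_t item_end = item_offset + item_length;
  size_t remaining;

  assert(item_end <= text_length);
  remaining = text_length - item_end;

  /* Take up to CONTEXT_LENGTH code points on each side, or fewer
   * if the text does not extend that far. */
  c.pre_len = item_offset < CONTEXT_LENGTH ? item_offset : CONTEXT_LENGTH;
  c.pre_start = item_offset - c.pre_len;
  c.post_start = item_end;
  c.post_len = remaining < CONTEXT_LENGTH ? remaining : CONTEXT_LENGTH;
  return c;
}
```
]]></programlisting>
    <para>
      For a 20-code-point text with an item starting at offset 8 and
      spanning 4 code points, this model yields a pre-context covering
      indices 3 to 7 and a post-context covering indices 12 to 16.
    </para>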
    <para>
      For longer text runs, such as full paragraphs, it might be
      tempting to only add smaller sub-segments to a buffer and
      shape them in piecemeal fashion. Generally, this is not a good
      idea, however, because a lot of shaping decisions are
      dependent on this context information. For example, in Arabic
      and other connected scripts, HarfBuzz needs to know the code
      points before and after each character in order to correctly
      determine which glyph to return.
    </para>
    <para>
      The safest approach is to add all of the text available, then
      use <parameter>item_offset</parameter> and
      <parameter>item_length</parameter> to indicate which characters you
      want shaped, so that HarfBuzz has access to any context.
    </para>
    <para>
      You can also add Unicode code points directly with
      <function>hb_buffer_add_codepoints()</function>. The arguments
      to this function are the same as those for the UTF
      functions. It is particularly important to note, however, that
      HarfBuzz does not do deep validity checking on the text that is
      added to a buffer. Invalid code points will be replaced, but any
      further sanity checking is up to you.
    </para>
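    <para>
      If you want to check code points yourself before adding them,
      one possible pre-check is sketched below. It treats a value as
      valid when it is a Unicode scalar value (at most
      <literal>U+10FFFF</literal> and not a surrogate) and substitutes
      <literal>U+FFFD</literal> otherwise. The function names are
      hypothetical, and this is only one reasonable definition of
      "valid", not a description of what HarfBuzz does.
    </para>
    <programlisting language="C"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CODEPOINT 0xFFFDu /* U+FFFD REPLACEMENT CHARACTER */

/* A Unicode scalar value is at most U+10FFFF and is not a surrogate. */
static int is_unicode_scalar_value(uint32_t cp) {
  return cp <= 0x10FFFFu && !(cp >= 0xD800u && cp <= 0xDFFFu);
}

/* Replace every invalid entry in place; return how many were replaced. */
static size_t sanitize_codepoints(uint32_t *text, size_t length) {
  size_t replaced = 0;
  size_t i;

  assert(text != NULL || length == 0);
  for (i = 0; i < length; i++) {
    if (!is_unicode_scalar_value(text[i])) {
      text[i] = REPLACEMENT_CODEPOINT;
      replaced++;
    }
  }
  return replaced;
}
```
]]></programlisting>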
  </section>

  <section id="setting-buffer-properties">
    <title>Setting buffer properties</title>
    <para>
      Buffers containing input characters still need several
      properties set before HarfBuzz can shape their text correctly.
    </para>
    <para>
      Initially, all buffers are set to the
      <literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content
      type. After adding text, the buffer should be set to
      <literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which
      indicates that it contains un-shaped input
      characters. After shaping, the buffer will have the
      <literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type.
    </para>
    <para>
      <function>hb_buffer_add_utf8()</function> and the
      other UTF functions set the content type of their buffer
      automatically. But if you are reusing a buffer you may want to
      check its state with
      <function>hb_buffer_get_content_type(buffer)</function>. If
      necessary you can set the content type with
    </para>
    <programlisting language="C">
      hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE);
    </programlisting>
    <para>
      to prepare for shaping.
    </para>
    <para>
      Buffers also need to carry information about the script,
      language, and text direction of their contents. You can set
      these properties individually:
    </para>
    <programlisting language="C">
      hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
      hb_buffer_set_script(buf, HB_SCRIPT_LATIN);
      hb_buffer_set_language(buf, hb_language_from_string("en", -1));
    </programlisting>
    <para>
      However, since these properties are often repeated across
      multiple text runs, you can also save them in an
      <literal>hb_segment_properties_t</literal> for reuse:
    </para>
    <programlisting language="C">
      /* Allocate the struct itself; the getter fills it in through a pointer. */
      hb_segment_properties_t savedprops;
      hb_buffer_get_segment_properties (buf, &amp;savedprops);
      ...
      hb_buffer_set_segment_properties (buf2, &amp;savedprops);
    </programlisting>
    <para>
      HarfBuzz also provides getter functions to retrieve a buffer's
      direction, script, and language properties individually.
    </para>
    <para>
      HarfBuzz recognizes four text directions in
      <type>hb_direction_t</type>: left-to-right
      (<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>),
      top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and
      bottom-to-top (<literal>HB_DIRECTION_BTT</literal>). For the
      script property, HarfBuzz uses identifiers based on the
      <ulink
      url="https://unicode.org/iso15924/">ISO 15924
      standard</ulink>. For languages, HarfBuzz uses tags based on the
      <ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard.
    </para>
    <para>
      Helper functions are provided to convert character strings into
      the necessary script and language tag types.
    </para>
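    <para>
      To give a feel for what such a conversion involves, the sketch
      below packs a four-letter ISO 15924 code such as
      <literal>Latn</literal> into a single 32-bit value, which is a
      common way to represent four-character tags. The
      <literal>pack_tag()</literal> and <literal>unpack_tag()</literal>
      helpers are hypothetical and stand in for, rather than
      reproduce, the helper functions that HarfBuzz provides.
    </para>
    <programlisting language="C"><![CDATA[
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack a four-letter code such as "Latn" into one 32-bit value, with the
 * first letter in the most significant byte. Returns 0 for any string
 * that is not exactly four characters long. */
static uint32_t pack_tag(const char *code) {
  if (code == NULL || strlen(code) != 4) return 0;
  return ((uint32_t)(unsigned char)code[0] << 24) |
         ((uint32_t)(unsigned char)code[1] << 16) |
         ((uint32_t)(unsigned char)code[2] << 8) |
          (uint32_t)(unsigned char)code[3];
}

/* Unpack a tag into a NUL-terminated five-byte buffer. */
static void unpack_tag(uint32_t tag, char out[5]) {
  assert(out != NULL);
  out[0] = (char)((tag >> 24) & 0xFF);
  out[1] = (char)((tag >> 16) & 0xFF);
  out[2] = (char)((tag >> 8) & 0xFF);
  out[3] = (char)(tag & 0xFF);
  out[4] = '\0';
}
```
]]></programlisting>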
    <para>
      Two additional buffer properties to be aware of are the
      "invisible glyph" and the replacement code point. The
      replacement code point is inserted into buffer output in place of
      any invalid code points encountered in the input. By default, it
      is the Unicode <literal>REPLACEMENT CHARACTER</literal> code
      point, <literal>U+FFFD</literal> "&#xFFFD;". You can change this with
    </para>
    <programlisting language="C">
      hb_buffer_set_replacement_codepoint(buf, replacement);
    </programlisting>
    <para>
      passing in the replacement Unicode code point as the
      <parameter>replacement</parameter> parameter.
    </para>
    <para>
      The invisible glyph is used to replace all output glyphs that
      are invisible. By default, the standard space character
      <literal>U+0020</literal> is used; you can replace this (for
      example, when using a font that provides script-specific
      spaces) with
    </para>
    <programlisting language="C">
      hb_buffer_set_invisible_glyph(buf, replacement_glyph);
    </programlisting>
    <para>
      Do note that in the <parameter>replacement_glyph</parameter>
      parameter, you must provide the glyph ID of the replacement you
      wish to use, not the Unicode code point.
    </para>
    <para>
      HarfBuzz supports a few additional flags you might want to set
      on your buffer under certain circumstances. The
      <literal>HB_BUFFER_FLAG_BOT</literal> and
      <literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz
      that the buffer represents the beginning or end (respectively)
      of a text element (such as a paragraph or other block). Knowing
      this allows HarfBuzz to apply certain contextual font features
      when shaping, such as initial or final variants in connected
      scripts.
    </para>
    <para>
      <literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal>
      tells HarfBuzz not to hide glyphs with the
      <literal>Default_Ignorable</literal> property in Unicode. This
      property designates control characters and other non-printing
      code points, such as joiners and variation selectors. Normally
      HarfBuzz replaces them in the output buffer with zero-width
      space glyphs (using the "invisible glyph" property discussed
      above); setting this flag causes them to be printed, which can
      be helpful for troubleshooting.
    </para>
    <para>
      Conversely, setting the
      <literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag
      tells HarfBuzz to remove <literal>Default_Ignorable</literal>
      glyphs from the output buffer entirely. Finally, setting the
      <literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal>
      flag tells HarfBuzz not to insert the dotted-circle glyph
      (<literal>U+25CC</literal>, "&#x25CC;"), which is normally
      inserted into buffer output when broken character sequences are
      encountered (such as combining marks that are not attached to a
      base character).
    </para>
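    <para>
      When a paragraph has to be shaped as several runs, only the
      first run begins the text and only the last run ends it. The
      sketch below shows one way to decide which beginning-of-text and
      end-of-text flags apply to each run. The
      <literal>TOY_FLAG_*</literal> names and their numeric values are
      hypothetical placeholders chosen for this example; real code
      would use the <literal>HB_BUFFER_FLAG_*</literal> constants from
      the HarfBuzz headers.
    </para>
    <programlisting language="C"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for illustration only; HarfBuzz defines its
 * own constants, and these are not guaranteed to match them. */
enum {
  TOY_FLAG_DEFAULT = 0x0,
  TOY_FLAG_BOT     = 0x1, /* run starts at the beginning of the text */
  TOY_FLAG_EOT     = 0x2  /* run ends at the end of the text */
};

/* Flags for run number `index` out of `count` runs of one paragraph. */
static unsigned int flags_for_run(size_t index, size_t count) {
  unsigned int flags = TOY_FLAG_DEFAULT;

  assert(index < count);
  if (index == 0) flags |= TOY_FLAG_BOT;
  if (index == count - 1) flags |= TOY_FLAG_EOT;
  return flags;
}
```
]]></programlisting>
    <para>
      A paragraph shaped as a single run receives both flags, while
      the middle runs of a longer paragraph receive neither.
    </para>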
  </section>

  <section id="customizing-unicode-functions">
    <title>Customizing Unicode functions</title>
    <para>
      HarfBuzz requires some simple functions for accessing
      information from the Unicode Character Database (such as the
      <literal>General_Category</literal> (gc) and
      <literal>Script</literal> (sc) properties) that is useful
      for shaping, as well as some useful operations like composing and
      decomposing code points.
    </para>
    <para>
      HarfBuzz includes its own internal, lightweight set of Unicode
      functions. At build time, it is also possible to compile support
      for some other options, such as the Unicode functions provided
      by GLib or the International Components for Unicode (ICU)
      library. Generally, this option is only of interest for client
      programs that have specific integration requirements or that do
      a significant amount of customization.
    </para>
    <para>
      If your program has access to other Unicode functions, however,
      such as through a system library or application framework, you
      might prefer to use those instead of the built-in
      options. HarfBuzz supports this by implementing its Unicode
      functions as a set of virtual methods that you can replace
      without otherwise affecting HarfBuzz's functionality.
    </para>
    <para>
      The Unicode functions are specified in a structure called
      <literal>unicode_funcs</literal> which is attached to each
      buffer. But even though <literal>unicode_funcs</literal> is
      associated with a <type>hb_buffer_t</type>, the functions
      themselves are called by other HarfBuzz APIs that access
      buffers, so it would be unwise for you to hook different
      functions into different buffers.
    </para>
    <para>
      In addition, you can mark your <literal>unicode_funcs</literal>
      as immutable by calling
      <function>hb_unicode_funcs_make_immutable (ufuncs)</function>.
      This is especially useful if your code is a
      library or framework that will have its own client programs. By
      marking your Unicode function choices as immutable, you prevent
      your own client programs from changing the
      <literal>unicode_funcs</literal> configuration and introducing
      inconsistencies and errors downstream.
    </para>
    <para>
      You can retrieve the Unicode-functions configuration for
      your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>:
    </para>
    <programlisting language="C">
      hb_unicode_funcs_t *ufunctions;
      ufunctions = hb_buffer_get_unicode_funcs(buf);
    </programlisting>
    <para>
      The current version of <literal>unicode_funcs</literal> uses six functions:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          <function>hb_unicode_combining_class_func_t</function>:
          returns the Canonical Combining Class of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_general_category_func_t</function>:
          returns the General Category (gc) of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_mirroring_func_t</function>: returns
          the Mirroring Glyph code point (for bi-directional
          replacement) of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_script_func_t</function>: returns the
          Script (sc) property of a code point.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_compose_func_t</function>: returns the
          canonical composition of a sequence of two code points.
        </para>
      </listitem>
      <listitem>
        <para>
          <function>hb_unicode_decompose_func_t</function>: returns
          the canonical decomposition of a code point.
        </para>
      </listitem>
    </itemizedlist>
    <para>
      Note, however, that future HarfBuzz releases may alter this set.
    </para>
    <para>
      Each Unicode function has a corresponding setter, with which you
      can register your replacement function as a callback. For example,
      to replace
      <function>hb_unicode_general_category_func_t</function>, you can call
    </para>
    <programlisting language="C">
      hb_unicode_funcs_set_general_category_func (ufuncs, func, user_data, destroy);
    </programlisting>
    <para>
      Virtualizing this set of Unicode functions is primarily intended
      to improve portability. There is no need for every client
      program to make the effort to replace the default options, so if
      you are unsure, do not feel any pressure to customize
      <literal>unicode_funcs</literal>.
    </para>
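    <para>
      The general shape of such a replaceable function table, with a
      callback, a <parameter>user_data</parameter> pointer, a
      <parameter>destroy</parameter> notifier, and an immutability
      switch, can be modeled in plain C as shown below. Every
      <literal>toy_*</literal> name is hypothetical, and the behavior
      on a rejected update (ownership of the new data stays with the
      caller) is a simplifying assumption of this sketch rather than a
      statement about HarfBuzz.
    </para>
    <programlisting language="C"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback types that only mimic the shape described above. */
typedef int (*toy_category_func_t)(uint32_t cp, void *user_data);
typedef void (*toy_destroy_func_t)(void *user_data);

typedef struct {
  toy_category_func_t category; /* current callback */
  void *user_data;              /* opaque data handed back to the callback */
  toy_destroy_func_t destroy;   /* releases user_data when it is replaced */
  int immutable;                /* non-zero once the table is frozen */
} toy_funcs_t;

/* Default callback: classify every code point as category 0. */
static int toy_default_category(uint32_t cp, void *user_data) {
  (void)cp;
  (void)user_data;
  return 0;
}

/* Example replacement: 1 for ASCII letters, 0 otherwise. */
static int toy_ascii_letter_category(uint32_t cp, void *user_data) {
  (void)user_data;
  return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
}

/* Example destroy callback: counts how often it is invoked. */
static void toy_count_destroy(void *user_data) {
  ++*(int *)user_data;
}

/* Install a replacement callback. Returns 1 on success, or 0 if the
 * table is immutable, in which case nothing changes. On success the
 * previously installed user_data is released through its destroy callback. */
static int toy_funcs_set_category(toy_funcs_t *f, toy_category_func_t func,
                                  void *user_data, toy_destroy_func_t destroy) {
  assert(f != NULL);
  if (f->immutable) return 0;
  if (f->destroy) f->destroy(f->user_data);
  f->category = func ? func : toy_default_category;
  f->user_data = user_data;
  f->destroy = destroy;
  return 1;
}

/* Freeze the table so that later setter calls are rejected. */
static void toy_funcs_make_immutable(toy_funcs_t *f) {
  assert(f != NULL);
  f->immutable = 1;
}
```
]]></programlisting>
    <para>
      Once the table has been frozen, the setter in this model leaves
      the installed callback untouched, which is the kind of guarantee
      a library would want before handing the table to its own
      clients.
    </para>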
  </section>

</chapter>