| <?xml version="1.0"?> |
| <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN" |
| "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [ |
| <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'"> |
| <!ENTITY version SYSTEM "version.xml"> |
| ]> |
| <chapter id="buffers-language-script-and-direction"> |
| <title>Buffers, language, script and direction</title> |
| <para> |
| The input to the HarfBuzz shaper is a series of Unicode characters, stored in a |
| buffer. In this chapter, we'll look at how to set up a buffer with |
| the text that we want and how to customize the properties of the |
| buffer. We'll also look at a piece of lower-level machinery that |
| you will need to understand before proceeding: the functions that |
| HarfBuzz uses to retrieve Unicode information. |
| </para> |
| <para> |
| After shaping is complete, HarfBuzz puts its output back |
| into the buffer. But getting that output requires setting up a |
| face and a font first, so we will look at that in the next chapter |
| instead of here. |
| </para> |
| <section id="creating-and-destroying-buffers"> |
| <title>Creating and destroying buffers</title> |
| <para> |
| As we saw in our <emphasis>Getting Started</emphasis> example, a |
| buffer is created and |
| initialized with <function>hb_buffer_create()</function>. This |
| produces a new, empty buffer object, instantiated with some |
| default values and ready to accept your Unicode strings. |
| </para> |
| <para> |
| HarfBuzz manages the memory of objects (such as buffers) that it |
| creates, so you don't have to. When you have finished working on |
| a buffer, you can call <function>hb_buffer_destroy()</function>: |
| </para> |
| <programlisting language="C"> |
| hb_buffer_t *buf = hb_buffer_create(); |
| ... |
| hb_buffer_destroy(buf); |
| </programlisting> |
| <para> |
| This will destroy the object and free its associated memory - |
| unless some other part of the program holds a reference to this |
| buffer. If you acquire a HarfBuzz buffer from another subsystem |
| and want to ensure that it is not garbage collected by someone |
| else destroying it, you should increase its reference count: |
| </para> |
| <programlisting language="C"> |
| void somefunc(hb_buffer_t *buf) { |
| buf = hb_buffer_reference(buf); |
| ... |
| </programlisting> |
| <para> |
| And then decrease it once you're done with it: |
| </para> |
| <programlisting language="C"> |
| hb_buffer_destroy(buf); |
| } |
| </programlisting> |
| <para> |
| While we are on the subject of reference-counting buffers, it is |
| worth noting that an individual buffer can only meaningfully be |
| used by one thread at a time. |
| </para> |
| <para> |
| To throw away all the data in your buffer and start from scratch, |
| call <function>hb_buffer_reset(buf)</function>. If you want to |
| throw away the string in the buffer but keep the options, you can |
| instead call <function>hb_buffer_clear_contents(buf)</function>. |
| </para> |
| </section> |
| |
| <section id="adding-text-to-the-buffer"> |
| <title>Adding text to the buffer</title> |
| <para> |
| Now we have a brand new HarfBuzz buffer. Let's start filling it |
| with text! From HarfBuzz's perspective, a buffer is just a stream |
| of Unicode code points, but your input string is probably in one of |
| the standard Unicode character encodings (UTF-8, UTF-16, or |
| UTF-32). HarfBuzz provides convenience functions that accept |
| each of these encodings: |
| <function>hb_buffer_add_utf8()</function>, |
| <function>hb_buffer_add_utf16()</function>, and |
| <function>hb_buffer_add_utf32()</function>. Other than the |
| character encoding they accept, they function identically. |
| </para> |
| <para> |
| You can add UTF-8 text to a buffer by passing in the text array, |
| the array's length, an offset into the array for the first |
| character to add, and the length of the segment to add: |
| </para> |
| <programlisting language="C"> |
| hb_buffer_add_utf8 (hb_buffer_t *buf, |
| const char *text, |
| int text_length, |
| unsigned int item_offset, |
| int item_length) |
| </programlisting> |
| <para> |
| So, in practice, you can say: |
| </para> |
| <programlisting language="C"> |
| hb_buffer_add_utf8(buf, text, strlen(text), 0, strlen(text)); |
| </programlisting> |
| <para> |
| This will append your new characters to |
| <parameter>buf</parameter>, not replace its existing |
| contents. Also, note that you can use <literal>-1</literal> in |
| place of the first instance of <function>strlen(text)</function> |
| if your text array is NULL-terminated. Similarly, you can also use |
| <literal>-1</literal> as the final argument want to add its full |
| contents. |
| </para> |
| <para> |
| Whatever start <parameter>item_offset</parameter> and |
| <parameter>item_length</parameter> you provide, HarfBuzz will also |
| attempt to grab the five characters <emphasis>before</emphasis> |
| the offset point and the five characters |
| <emphasis>after</emphasis> the designated end. These are the |
| before and after "context" segments, which are used internally |
| for HarfBuzz to make shaping decisions. They will not be part of |
| the final output, but they ensure that HarfBuzz's |
| script-specific shaping operations are correct. If there are |
| fewer than five characters available for the before or after |
| contexts, HarfBuzz will just grab what is there. |
| </para> |
| <para> |
| For longer text runs, such as full paragraphs, it might be |
| tempting to only add smaller sub-segments to a buffer and |
| shape them in piecemeal fashion. Generally, this is not a good |
| idea, however, because a lot of shaping decisions are |
| dependent on this context information. For example, in Arabic |
| and other connected scripts, HarfBuzz needs to know the code |
| points before and after each character in order to correctly |
| determine which glyph to return. |
| </para> |
| <para> |
| The safest approach is to add all of the text available (even |
| if your text contains a mix of scripts, directions, languages |
| and fonts), then use <parameter>item_offset</parameter> and |
| <parameter>item_length</parameter> to indicate which characters you |
| want shaped (which must all have the same script, direction, |
| language and font), so that HarfBuzz has access to any context. |
| </para> |
| <para> |
| You can also add Unicode code points directly with |
| <function>hb_buffer_add_codepoints()</function>. The arguments |
| to this function are the same as those for the UTF |
| encodings. But it is particularly important to note that |
| HarfBuzz does not do validity checking on the text that is added |
| to a buffer. Invalid code points will be replaced, but it is up |
| to you to do any deep-sanity checking necessary. |
| </para> |
| |
| </section> |
| |
| <section id="setting-buffer-properties"> |
| <title>Setting buffer properties</title> |
| <para> |
| Buffers containing input characters still need several |
| properties set before HarfBuzz can shape their text correctly. |
| </para> |
| <para> |
| Initially, all buffers are set to the |
| <literal>HB_BUFFER_CONTENT_TYPE_INVALID</literal> content |
| type. After adding text, the buffer should be set to |
| <literal>HB_BUFFER_CONTENT_TYPE_UNICODE</literal> instead, which |
| indicates that it contains un-shaped input |
| characters. After shaping, the buffer will have the |
| <literal>HB_BUFFER_CONTENT_TYPE_GLYPHS</literal> content type. |
| </para> |
| <para> |
| <function>hb_buffer_add_utf8()</function> and the |
| other UTF functions set the content type of their buffer |
| automatically. But if you are reusing a buffer you may want to |
| check its state with |
| <function>hb_buffer_get_content_type(buffer)</function>. If |
| necessary you can set the content type with |
| </para> |
| <programlisting language="C"> |
| hb_buffer_set_content_type(buf, HB_BUFFER_CONTENT_TYPE_UNICODE); |
| </programlisting> |
| <para> |
| to prepare for shaping. |
| </para> |
| <para> |
| Buffers also need to carry information about the script, |
| language, and text direction of their contents. You can set |
| these properties individually: |
| </para> |
| <programlisting language="C"> |
| hb_buffer_set_direction(buf, HB_DIRECTION_LTR); |
| hb_buffer_set_script(buf, HB_SCRIPT_LATIN); |
| hb_buffer_set_language(buf, hb_language_from_string("en", -1)); |
| </programlisting> |
| <para> |
| However, since these properties are often repeated for |
| multiple text runs, you can also save them in a |
| <literal>hb_segment_properties_t</literal> for reuse: |
| </para> |
| <programlisting language="C"> |
| hb_segment_properties_t *savedprops; |
| hb_buffer_get_segment_properties (buf, savedprops); |
| ... |
| hb_buffer_set_segment_properties (buf2, savedprops); |
| </programlisting> |
| <para> |
| HarfBuzz also provides getter functions to retrieve a buffer's |
| direction, script, and language properties individually. |
| </para> |
| <para> |
| HarfBuzz recognizes four text directions in |
| <type>hb_direction_t</type>: left-to-right |
| (<literal>HB_DIRECTION_LTR</literal>), right-to-left (<literal>HB_DIRECTION_RTL</literal>), |
| top-to-bottom (<literal>HB_DIRECTION_TTB</literal>), and |
| bottom-to-top (<literal>HB_DIRECTION_BTT</literal>). For the |
| script property, HarfBuzz uses identifiers based on the |
| <ulink |
| url="https://unicode.org/iso15924/">ISO 15924 |
| standard</ulink>. For languages, HarfBuzz uses tags based on the |
| <ulink url="https://tools.ietf.org/html/bcp47">IETF BCP 47</ulink> standard. |
| </para> |
| <para> |
| Helper functions are provided to convert character strings into |
| the necessary script and language tag types. |
| </para> |
| <para> |
| Two additional buffer properties to be aware of are the |
| "invisible glyph" and the replacement code point. The |
| replacement code point is inserted into buffer output in place of |
| any invalid code points encountered in the input. By default, it |
| is the Unicode <literal>REPLACEMENT CHARACTER</literal> code |
| point, <literal>U+FFFD</literal> "�". You can change this with |
| </para> |
| <programlisting language="C"> |
| hb_buffer_set_replacement_codepoint(buf, replacement); |
| </programlisting> |
| <para> |
| passing in the replacement Unicode code point as the |
| <parameter>replacement</parameter> parameter. |
| </para> |
| <para> |
| The invisible glyph is used to replace all output glyphs that |
| are invisible. By default, the standard space character |
| <literal>U+0020</literal> is used; you can replace this (for |
| example, when using a font that provides script-specific |
| spaces) with |
| </para> |
| <programlisting language="C"> |
| hb_buffer_set_invisible_glyph(buf, replacement_glyph); |
| </programlisting> |
| <para> |
| Do note that in the <parameter>replacement_glyph</parameter> |
| parameter, you must provide the glyph ID of the replacement you |
| wish to use, not the Unicode code point. |
| </para> |
| <para> |
| HarfBuzz supports a few additional flags you might want to set |
| on your buffer under certain circumstances. The |
| <literal>HB_BUFFER_FLAG_BOT</literal> and |
| <literal>HB_BUFFER_FLAG_EOT</literal> flags tell HarfBuzz |
| that the buffer represents the beginning or end (respectively) |
| of a text element (such as a paragraph or other block). Knowing |
| this allows HarfBuzz to apply certain contextual font features |
| when shaping, such as initial or final variants in connected |
| scripts. |
| </para> |
| <para> |
| <literal>HB_BUFFER_FLAG_PRESERVE_DEFAULT_IGNORABLES</literal> |
| tells HarfBuzz not to hide glyphs with the |
| <literal>Default_Ignorable</literal> property in Unicode. This |
| property designates control characters and other non-printing |
| code points, such as joiners and variation selectors. Normally |
| HarfBuzz replaces them in the output buffer with zero-width |
| space glyphs (using the "invisible glyph" property discussed |
| above); setting this flag causes them to be printed, which can |
| be helpful for troubleshooting. |
| </para> |
| <para> |
| Conversely, setting the |
| <literal>HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES</literal> flag |
| tells HarfBuzz to remove <literal>Default_Ignorable</literal> |
| glyphs from the output buffer entirely. Finally, setting the |
| <literal>HB_BUFFER_FLAG_DO_NOT_INSERT_DOTTED_CIRCLE</literal> |
| flag tells HarfBuzz not to insert the dotted-circle glyph |
| (<literal>U+25CC</literal>, "◌"), which is normally |
| inserted into buffer output when broken character sequences are |
| encountered (such as combining marks that are not attached to a |
| base character). |
| </para> |
| </section> |
| |
| <section id="customizing-unicode-functions"> |
| <title>Customizing Unicode functions</title> |
| <para> |
| HarfBuzz requires some simple functions for accessing |
| information from the Unicode Character Database (such as the |
| <literal>General_Category</literal> (gc) and |
| <literal>Script</literal> (sc) properties) that is useful |
| for shaping, as well as some useful operations like composing and |
| decomposing code points. |
| </para> |
| <para> |
| HarfBuzz includes its own internal, lightweight set of Unicode |
| functions. At build time, it is also possible to compile support |
| for some other options, such as the Unicode functions provided |
| by GLib or the International Components for Unicode (ICU) |
| library. Generally, this option is only of interest for client |
| programs that have specific integration requirements or that do |
| a significant amount of customization. |
| </para> |
| <para> |
| If your program has access to other Unicode functions, however, |
| such as through a system library or application framework, you |
| might prefer to use those instead of the built-in |
| options. HarfBuzz supports this by implementing its Unicode |
| functions as a set of virtual methods that you can replace — |
| without otherwise affecting HarfBuzz's functionality. |
| </para> |
| <para> |
| The Unicode functions are specified in a structure called |
| <literal>unicode_funcs</literal> which is attached to each |
| buffer. But even though <literal>unicode_funcs</literal> is |
| associated with a <type>hb_buffer_t</type>, the functions |
| themselves are called by other HarfBuzz APIs that access |
| buffers, so it would be unwise for you to hook different |
| functions into different buffers. |
| </para> |
| <para> |
| In addition, you can mark your <literal>unicode_funcs</literal> |
| as immutable by calling |
| <function>hb_unicode_funcs_make_immutable (ufuncs)</function>. |
| This is especially useful if your code is a |
| library or framework that will have its own client programs. By |
| marking your Unicode function choices as immutable, you prevent |
| your own client programs from changing the |
| <literal>unicode_funcs</literal> configuration and introducing |
| inconsistencies and errors downstream. |
| </para> |
| <para> |
| You can retrieve the Unicode-functions configuration for |
| your buffer by calling <function>hb_buffer_get_unicode_funcs()</function>: |
| </para> |
| <programlisting language="C"> |
| hb_unicode_funcs_t *ufunctions; |
| ufunctions = hb_buffer_get_unicode_funcs(buf); |
| </programlisting> |
| <para> |
| The current version of <literal>unicode_funcs</literal> uses six functions: |
| </para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| <function>hb_unicode_combining_class_func_t</function>: |
| returns the Canonical Combining Class of a code point. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <function>hb_unicode_general_category_func_t</function>: |
| returns the General Category (gc) of a code point. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <function>hb_unicode_mirroring_func_t</function>: returns |
| the Mirroring Glyph code point (for bi-directional |
| replacement) of a code point. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <function>hb_unicode_script_func_t</function>: returns the |
| Script (sc) property of a code point. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <function>hb_unicode_compose_func_t</function>: returns the |
| canonical composition of a sequence of two code points. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <function>hb_unicode_decompose_func_t</function>: returns |
| the canonical decomposition of a code point. |
| </para> |
| </listitem> |
| </itemizedlist> |
| <para> |
| Note, however, that future HarfBuzz releases may alter this set. |
| </para> |
| <para> |
| Each Unicode function has a corresponding setter, with which you |
| can assign a callback to your replacement function. For example, |
| to replace |
| <function>hb_unicode_general_category_func_t</function>, you can call |
| </para> |
| <programlisting language="C"> |
| hb_unicode_funcs_set_general_category_func (*ufuncs, func, *user_data, destroy) |
| </programlisting> |
| <para> |
| Virtualizing this set of Unicode functions is primarily intended |
| to improve portability. There is no need for every client |
| program to make the effort to replace the default options, so if |
| you are unsure, do not feel any pressure to customize |
| <literal>unicode_funcs</literal>. |
| </para> |
| </section> |
| |
| </chapter> |