Mapping to characters

It is natural to assume that splitting a word into individual characters is easy. For the computer, it turns out that this isn’t really the case. If we look at a system that understands Unicode (like XeTeX or LuaTeX), most of the time one ‘character’ is stored as one codepoint. A codepoint is the number Unicode assigns to a single encoded entity, and it is the basic unit that Unicode-aware software works with. For example, if we take the input café, it is made up of four codepoints:

U+0063 (LATIN SMALL LETTER C)
U+0061 (LATIN SMALL LETTER A)
U+0066 (LATIN SMALL LETTER F)
U+00E9 (LATIN SMALL LETTER E WITH ACUTE)

So in XeTeX/LuaTeX we could use a simple mapping to grab one codepoint at a time and process it. However, one codepoint is not always one ‘character’. Take for example Spın̈al Tap. The dotless i is a single codepoint, but there is no codepoint for an umlauted n. Instead, that is represented by two codepoints: a normal n followed by a combining diaeresis (U+0308). As a user, we’d clearly want to get a single ‘character’ here, so there’s more work to do.
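
To see the problem concretely, here is a minimal sketch of that naive approach, assuming XeTeX or LuaTeX so that each codepoint is read as one token. It uses the general-purpose \tl_map_inline:nn, which knows nothing about graphemes.

\ExplSyntaxOn
% Naive mapping: one iteration per token, that is, per codepoint
\tl_map_inline:nn { Spın̈al ~ Tap } { (#1) }
\ExplSyntaxOff

Here the n and the combining diaeresis are passed to the loop body in separate iterations, so the umlauted n is split in two. This mapping also drops the space, as \tl_map_inline:nn ignores spaces between items.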

Luckily, this is not just a TeX problem, and the Unicode Consortium have thought about it for us. They provide a data file and rules that describe how to divide input into grapheme clusters, or graphemes for short: ‘user-perceived characters’. So ‘all’ that is needed is to examine the input using these rules, and to divide it up so that ‘characters’ stay together.

For pdfTeX, there’s an additional wrinkle: it works with bytes, not codepoints, so a naïve TeX mapping would split any codepoint outside the ASCII range into its separate UTF-8 bytes: not good. Luckily, the UTF-8 encoding is predictable: all that is needed is to examine the first byte of a sequence, which tells us how many further bytes to collect, and then re-combine them into a valid codepoint.
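
As a worked example, é (U+00E9) is stored in UTF-8 as the two bytes 0xC3 0xA9, and the lead byte 0xC3 is what signals that one further byte follows. The rule can be sketched in expl3 as below. The function name \demo_utf_length:n is made up for illustration and is not part of any package, and the sketch assumes its argument is a valid UTF-8 lead byte given as an integer, not a continuation byte. It is only an illustration of the rule, not the code used inside expl3.

\ExplSyntaxOn
% Hypothetical helper: number of bytes in a UTF-8 sequence,
% worked out from the value of its first (lead) byte
\cs_new:Npn \demo_utf_length:n #1
  {
    \int_compare:nNnTF {#1} < { "80 }
      { 1 } % 0x00-0x7F: ASCII, a single byte
      {
        \int_compare:nNnTF {#1} < { "E0 }
          { 2 } % 0xC2-0xDF: lead byte of a two-byte sequence
          {
            \int_compare:nNnTF {#1} < { "F0 }
              { 3 } % 0xE0-0xEF: three bytes
              { 4 } % 0xF0-0xF4: four bytes
          }
      }
  }
\demo_utf_length:n { "C3 } % lead byte of é: yields 2
\ExplSyntaxOff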

This work isn’t something the average end user wants to do. Happily, they don’t have to: the LaTeX team have looked at this and created a suitable pair of expl3 functions to do it, \text_map_function:nN and \text_map_inline:nn. So for example we can do

\ExplSyntaxOn
\text_map_inline:nn { Spın̈al ~ Tap } { (#1) }
\ExplSyntaxOff

and get

(S)(p)(ı)(n̈)(a)(l)( )(T)(a)(p)

in any TeX engine (assuming we are set up to print the characters, of course).
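
The function-based variant works the same way, but passes each grapheme to a one-argument function instead of inline code. As a small sketch, with \demo_wrap:n as a made-up name for illustration:

\ExplSyntaxOn
% Hypothetical one-argument function applied to each grapheme
\cs_new:Npn \demo_wrap:n #1 { [#1] }
\text_map_function:nN { Spın̈al ~ Tap } \demo_wrap:n
\ExplSyntaxOff

This should group the input exactly as in the output above, only with square brackets in place of the parentheses.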

Taking a more ‘serious’ example (and one that is going to use LuaTeX for font reasons), we might want to map over Bangla text. It’s easy to attempt that with the expl3 function \tl_map_inline:nn, but it gives very odd results, because it splits the text at every codepoint. In contrast, \text_map_inline:nn divides up the characters correctly.

\documentclass{article}
\usepackage{fontspec}
\newfontface\harfbengali
  {NotoSansBengali-VariableFont_wdth,wght.ttf}[Renderer=HarfBuzz,Script=Bengali]
\begin{document}
\harfbengali
\ExplSyntaxOn
ন্দ্রকিন্দ্র
\par
\text_map_inline:nn{ন্দ্রকিন্দ্র}{(#1)}
\par
\tl_map_inline:nn{ন্দ্রকিন্দ্র}{(#1)}
\ExplSyntaxOff
\end{document}

which gives

[Figure: example output]

(You’ll need Noto Sans Bengali available to make this work locally.)

So, as you can see, mapping to ‘real’ text is easy with expl3: you just need to know that the tools are there.