Article ID: 241020
This article was previously published under Q241020
Applications usually draw Unicode text by specifying character codes in a string of WORDs. When a TrueType font is used, the operating system converts these character codes into TrueType glyph indices when it draws the glyphs. An application may need to translate character codes to glyph indices. This article discusses, with sample source code, how to obtain glyph indices from TrueType font files.
Drawing a character onto a device in Windows involves mapping the character code to a font-dependent index of the character's graphic (a glyph). When a TrueType font is used, these indices are called glyph indices.
TrueType font files have a flexible, table-oriented file format. The flexibility of this file format supports a variety of character encodings (or mappings) from a character code to the respective glyph. One such encoding can be from Unicode to glyph indices.
The Microsoft Windows operating systems use TrueType font files that are almost always Unicode-encoded. Even the TrueType fonts used by Windows 95 (and by its predecessors back to Windows 3.1) use the Unicode standard as their internal encoding. See the References section of this article for more information on the Unicode standard and the TrueType font file.
The functions of the Win32 Application Programming Interface (API) are exported as two separate entry points. The first has the base function name with an appended "A"; the second has the base function name with an appended "W". These entry points support the character set (or ANSI) string encoding and the Unicode (or wide character) encoding, respectively. The Platform SDK header files define the base function name as the "A" or "W" variant, depending on the use of the UNICODE macro, but either variant can be referenced directly.
On Windows 95 and its successors, most Win32 API functions are implemented only in their ANSI versions. In addition to the ANSI API, a limited subset of Unicode-capable functions is supported, such as TextOutW and lstrlenW. See the References section for more information on Unicode support in Windows 95 and its successors.
This set of Unicode or wide-character functions supports writing a limited Unicode-capable application. Just as an ANSI application may need to use glyph indices, a Unicode-capable application may need access to them as well. Such a case occurs when the application uses some advanced features of TrueType font files, if the application needs the glyph definitions, or if it needs to implement a workaround or functionality not otherwise present in the operating system.
ANSI applications can use the GetCharacterPlacementA function to translate a byte string of character codes to glyph indices. If the application uses a Unicode string encoding and it runs on Windows NT, there is a wide-character version of the GetCharacterPlacement function. On Windows 95, the GetCharacterPlacementW function is not implemented, so there is no API to convert from Unicode to glyph indices.
To convert a Unicode character code to a TrueType glyph index, an application must retrieve the table data for the Unicode-to-glyph-index encoding. The data in a TrueType font file can be obtained by calling the GetFontData Win32 function. This function returns an unprocessed byte buffer, but it can extract this buffer relative to the start of a named TrueType table. The function also operates on the TrueType font file currently realized in the Device Context (DC). These features make the function more useful than the alternative of locating the font file and parsing its table directory to seek to the appropriate table.
A Unicode-to-glyph-index encoding is located in a table of a TrueType font file that is marked as "cmap", which is the tag name for the table containing the character mappings for the font file. This table may contain one or more subtables that are different mappings.
Immediately following the initial unsigned short values of the "cmap" table is a directory of each encoding that the TrueType font file contains. Per the TrueType specification, the Unicode encoding is located in the subtable marked with a PlatformId value of 3 and a SpecificId value of 1. The specification also defines the subtable referenced by this 3-1 encoding to be a Format 4 subtable. An encoding with a PlatformId of 3 and a SpecificId of 0 (zero) is also a Format 4 encoding, but the font file is usually interpreted as a symbol font file. Per the symbol font suggestion in the TrueType specification, one would expect this encoding to contain character codes from the Unicode Private Use Area.
A Format 4 subtable is a sparse array. To accommodate the 64K entries needed by the 16-bit Unicode standard, the Format 4 subtable collects neighboring sequences of characters into segments. Each segment is defined by the first and last character code value covered by that range of characters. The collection of segments is stored in the subtable as a set of parallel arrays: one for the starting character of the range (startCount) and one for the ending character of the range (endCount). The segment array is a classification of a character code's potential mapping, not the actual mapping.
The mapping to a glyph index requires the use of a third parallel array from the subtable called idRangeOffset. This determines which of two methods is used to compute the final glyph index. The first method uses a simple delta value to compute the glyph index from the character code. The second method uses a lookup into an intermediate glyph ID table. If the value in this table is zero, there is no glyph; otherwise, the value is used to compute the final glyph index. The lookup table is an efficient way of representing a collection of noncontiguous character codes that span a segment.
The balance of this article covers how to obtain TrueType glyph indices from a TrueType font file. The following assumptions have been made: the TrueType font is currently selected into the application's DC, the current string encoding is Unicode, and the TrueType font file has a Unicode character-map table ("cmap" Format 4 subtable).
NOTE: This technique for obtaining glyph indices is applicable to Windows NT; however, because a Unicode version of the GetCharacterPlacement function has been implemented, it is typically unnecessary. If a Windows 95 application's strings are not Unicode, it can call the GetCharacterPlacementA function to convert to glyph indices instead.
The complete source code is at the end of this article. Within the text of the article, numerous references to the source code are made. Please refer to the sample source code to examine the relevant references in their full context.
Odds and Ends
To work with the data in a TrueType font file, a number of data-type problems must be well understood. In the TrueType specification, all tables in the font file are defined as a collection of base data types. The base data types are also defined by the specification but they correspond well to some base data types defined by the Windows Platform SDK.
The TrueType font file is byte-packed, meaning that all data types sit on a byte offset within the file. Specific padding bytes are included in any table definitions where applicable to the specification. Most compilers allow and sometimes default to structure alignment on other than byte boundaries. This means that defining "C" language structures to mimic the definition of a table may not be compatible.
This sample code uses structure definitions where possible to represent tables. To work correctly, the code must be compiled with byte structure alignment such as is ensured by the sample code's use of the pack pragma.
Even when byte packing is guaranteed, just reading the tables into a structure is incorrect. TrueType font files use "Big Endian" or Motorola-style byte ordering while the Intel microprocessors use the "Little Endian" byte ordering. This means that all data larger than a byte taken from a TrueType font file must have the bytes swapped. The swapping of bytes makes the data compatible with Intel microprocessors. The SWAPWORD and SWAPLONG macro definitions provide a facility for doing this.
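As a rough illustration of the byte swapping described above, the following sketch shows one way the two macros could be written in portable C. The article names SWAPWORD and SWAPLONG, but the macro bodies shown here are an assumption, and read_be16 is a hypothetical helper that is not part of the sample code.

```c
#include <stdint.h>

/* Assumed definitions: reverse the byte order of a 16-bit or 32-bit value.
   Each macro evaluates its argument more than once, so avoid side effects. */
#define SWAPWORD(x) ((uint16_t)((((uint16_t)(x) & 0x00FFu) << 8) | \
                                (((uint16_t)(x) & 0xFF00u) >> 8)))
#define SWAPLONG(x) ((uint32_t)((((uint32_t)(x) & 0x000000FFu) << 24) | \
                                (((uint32_t)(x) & 0x0000FF00u) << 8)  | \
                                (((uint32_t)(x) & 0x00FF0000u) >> 8)  | \
                                (((uint32_t)(x) & 0xFF000000u) >> 24)))

/* Hypothetical helper: assemble a big-endian 16-bit value byte by byte.
   Unlike the macros, this gives the right answer on any host byte order. */
static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}
```

The macros swap unconditionally, so they are only correct on a Little Endian host such as the Intel processors the article targets.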
The Definitions
The "cmap" table of a TrueType font file consists of subtables, each of which defines a different encoding. To locate the subtables, the sample code defines a macro called CMAPHEADERSIZE, which is convenient for calculating the offset to the beginning of the subtable directory. This macro returns the size of the two unsigned short data types used to store the "cmap" table version and the number of subtables in the "cmap" table:
Each encoding subtable has a directory entry in the main "cmap" table. This is represented in the source code by the structure definition _CMapEncoding. This structure contains two ID fields used to distinguish each subtable and the offset from the start of the "cmap" table where the subtable is located:
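The directory layout described above could be walked roughly as follows. This is a sketch under stated assumptions, not the article's code: CMAPHEADERSIZE is given a plausible definition, CMapEncodingSketch stands in for _CMapEncoding with guessed field names, and find_cmap_encoding, be16, and be32 are hypothetical helpers. The sketch reads the raw big-endian bytes directly, so no structure packing or byte swapping is needed.

```c
#include <stddef.h>
#include <stdint.h>

#define CMAPHEADERSIZE   (sizeof(uint16_t) * 2)  /* table version + subtable count */
#define CMAPENCODINGSIZE 8                       /* two 16-bit IDs + one 32-bit offset */

/* Stand-in for the article's _CMapEncoding directory entry. */
typedef struct CMapEncodingSketch {
    uint16_t PlatformId;
    uint16_t SpecificId;
    uint32_t Offset;      /* measured from the start of the "cmap" table */
} CMapEncodingSketch;

static uint16_t be16(const uint8_t *p) { return (uint16_t)((p[0] << 8) | p[1]); }
static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 1 and fills *out if the directory lists the requested encoding,
   otherwise returns 0. cmap points at the raw table; cb is its size. */
static int find_cmap_encoding(const uint8_t *cmap, size_t cb,
                              uint16_t platformId, uint16_t specificId,
                              CMapEncodingSketch *out)
{
    if (cb < CMAPHEADERSIZE) return 0;
    uint16_t count = be16(cmap + 2);             /* number of subtables */
    for (uint16_t i = 0; i < count; i++) {
        size_t pos = CMAPHEADERSIZE + (size_t)i * CMAPENCODINGSIZE;
        if (pos + CMAPENCODINGSIZE > cb) return 0;   /* truncated directory */
        if (be16(cmap + pos) == platformId && be16(cmap + pos + 2) == specificId) {
            out->PlatformId = platformId;
            out->SpecificId = specificId;
            out->Offset = be32(cmap + pos + 4);
            return 1;
        }
    }
    return 0;
}
```

A caller looking for the Unicode encoding would ask for PlatformId 3 and SpecificId 1, and could fall back to 3 and 0 for a symbol font.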
The GetFontData function in the Win32 API takes a single DWORD argument as the table name. This works because the TrueType specification defines a table name as a four-byte tag sequence. To properly pack the table name into the DWORD parameter of the GetFontData function call, the sample code defines a macro called MAKETABLENAME. The macro works by sequentially shifting the four individual byte values of the table name tag into a DWORD data type:
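One plausible form of such a macro is sketched below. The macro body is an assumption, and uint32_t stands in for the Win32 DWORD type so that the fragment compiles without the Platform SDK headers. The first tag character is placed in the low-order byte, so that on a Little Endian host the four bytes of the value read "cmap" in memory order.

```c
#include <stdint.h>

/* Assumed definition: pack four tag characters into a 32-bit value,
   first character in the least significant byte. */
#define MAKETABLENAME(ch1, ch2, ch3, ch4) \
    (((uint32_t)(uint8_t)(ch4) << 24) | ((uint32_t)(uint8_t)(ch3) << 16) | \
     ((uint32_t)(uint8_t)(ch2) << 8)  |  (uint32_t)(uint8_t)(ch1))

/* The packed "cmap" tag, as a value suitable for a table-name argument. */
static const uint32_t dwCmapName = MAKETABLENAME('c', 'm', 'a', 'p');
```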
The Unicode-encoding subtable is marked by a PlatformId of 3 and a SpecificId of 1. This "3-1" encoding is a Format 4 subtable according to the TrueType specification. Defined in the source code is a structure, _CMap4, which corresponds to the first seven data types of the Format 4 subtable plus a symbolic array of one unsigned short. The array represents the balance of the subtable definition that consists of multiple unsigned short arrays. By including the array symbol in the structure definition, a convenient address to the start of the unsigned short arrays is defined. The array symbol can then be used to compute the other array start addresses by using their offsets from the first array. This is useful when casting a larger memory buffer containing a full subtable to dereference one of the unsigned short arrays:
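A structure along these lines might look like the sketch below. CMap4Sketch stands in for _CMap4; its field names follow the Format 4 layout in the TrueType specification rather than the article's source, and the four accessor functions are hypothetical. The sketch assumes the subtable has already been byte-swapped into host order. Indexing past the one-element Arrays member relies on the buffer really being larger than the structure, which is the casting technique the paragraph above describes.

```c
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)        /* byte alignment, matching the font file layout */
typedef struct CMap4Sketch {
    uint16_t format;         /* 4 for this subtable format */
    uint16_t length;         /* subtable size in bytes */
    uint16_t version;
    uint16_t segCountX2;     /* twice the number of segments */
    uint16_t searchRange;
    uint16_t entrySelector;
    uint16_t rangeShift;
    uint16_t Arrays[1];      /* first word of the endCount array */
} CMap4Sketch;
#pragma pack(pop)

/* Array order after the header: endCount, one reserved pad word,
   startCount, idDelta, idRangeOffset, then the glyph ID array. */
static uint16_t *cmap4_end_count(CMap4Sketch *t)
{
    return t->Arrays;
}
static uint16_t *cmap4_start_count(CMap4Sketch *t)
{
    return t->Arrays + t->segCountX2 / 2 + 1;   /* skip endCount and the pad */
}
static uint16_t *cmap4_id_delta(CMap4Sketch *t)
{
    return cmap4_start_count(t) + t->segCountX2 / 2;
}
static uint16_t *cmap4_id_range_offset(CMap4Sketch *t)
{
    return cmap4_id_delta(t) + t->segCountX2 / 2;
}
```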
The Process
The sample code implements two basic tasks: retrieval of a Unicode "cmap" subtable from the font file and a search of the subtable to find a TrueType glyph index for a Unicode character code.
To retrieve table data from a TrueType font file given a DC, the code uses the GetFontData function. This function requires that the TrueType font file be selected into the DC. To retrieve table data indexed by a named TrueType table, the four-byte table tag name must be packed into a DWORD. Because we are only interested in getting data from the "cmap" table, the sample code defines a global DWORD, dwCmapName, that is packed with the "cmap" tag. All calls to the GetFontData function are coded to use the global dwCmapName variable.
The GetTTUnicodeCoverage function is the source code to retrieve the Unicode "cmap" subtable. It is declared as:
This function retrieves the full Unicode subtable from the TrueType font's "cmap" table. If it is called with a buffer that is too small (for example, a cbSize of zero) or with NULL for the pBuffer parameter, it fails and returns FALSE. When it fails in this manner, it calculates the size of the buffer needed and returns it in the pcbNeeded parameter. When the function succeeds, the pBuffer parameter is filled and the number of bytes copied is placed in the pcbNeeded parameter.
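The size-query contract described above can be modeled in a few lines of portable C. The function below is a hypothetical stand-in, not GetTTUnicodeCoverage itself: it copies from an in-memory source instead of a font, and only the parameter names pBuffer, cbSize, and pcbNeeded are taken from the article.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the contract: fail and report the required size
   when the buffer is missing or too small; otherwise copy and report the
   number of bytes copied. */
static int copy_subtable(const void *src, size_t cbSrc,
                         void *pBuffer, size_t cbSize, size_t *pcbNeeded)
{
    if (pBuffer == NULL || cbSize < cbSrc) {
        *pcbNeeded = cbSrc;      /* tell the caller how much to allocate */
        return 0;
    }
    memcpy(pBuffer, src, cbSrc);
    *pcbNeeded = cbSrc;          /* on success: bytes actually copied */
    return 1;
}
```

A caller would typically call once with a NULL buffer, allocate the number of bytes reported, and then call a second time with the new buffer.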
This function first searches for a subtable that would contain a Unicode encoding, either a "3-1" encoding or a "3-0" encoding. These are Format 4 subtables as defined in the "cmap" data type chapter of the TrueType specification.
If the function finds a Unicode encoding, it retrieves the first seven elements of the Format 4 subtable by using the GetFontFormat4Header function. The code then calculates the size of the buffer that is needed to return the entire Unicode subtable. If the buffer is too small or not provided, the size is returned to the caller so that the caller can allocate an appropriately sized buffer and call the function again.
If the buffer supplied by the caller is large enough, the sample code then uses the GetFontFormat4Subtable function to retrieve the entire subtable. This function appropriately reorders the bytes to accommodate Intel microprocessors. If the subtable retrieval was successful, the result is copied to the caller's buffer. If it was not successful, the code has not modified the user's buffer, and can safely return failure. By setting the bytes-needed parameter to zero, the sample code can indicate that it did not copy bytes to the buffer and can distinguish this failure from that of a lack of buffer space.
Once the Unicode subtable has been obtained, it can be used to retrieve glyph indices for a character code or to implement a number of other useful functions.
Converting a Unicode character code to a glyph index is accomplished in the sample code's GetTTUnicodeGlyphIndex function:
This function has a simpler interface. It requires only hdc, the handle to a DC that contains the TrueType font, and ch, the Unicode character code to convert. The function returns the glyph index for ch when it is successful. If the Unicode character code is not located in the encoding (that is, there is no glyph), the missing glyph-index value of zero is returned.
It first retrieves the Unicode subtable by allocating a buffer and calling the GetTTUnicodeCoverage function. If an error occurs, the sample code fails the call by returning the missing glyph index. At this point, failure can mean that either the DC does not contain a TrueType font, or the TrueType font does not contain a suitable Unicode subtable.
Next, the code attempts to locate the Unicode character code in the encoding. The search is performed per the Format 4 subtable reference in the "cmap" data type chapter of the TrueType specification. The code segments from the subtable are linearly searched by the FindFormat4Segment function. If no code segment brackets this character code, then the font file does not contain an encoding for that character and therefore there is no glyph. The code then returns the missing glyph index.
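A linear segment search of this kind might be sketched as follows. The function name and signature are hypothetical and are not those of the article's FindFormat4Segment. The sketch assumes the startCount and endCount arrays are already in host byte order and that segments are sorted by ascending endCount, as the Format 4 layout requires.

```c
#include <stdint.h>

/* Returns the index of the segment whose [startCount, endCount] range
   contains ch, or -1 if no segment brackets the character code. */
static int find_format4_segment(const uint16_t *startCount,
                                const uint16_t *endCount,
                                uint16_t segCount, uint16_t ch)
{
    for (uint16_t i = 0; i < segCount; i++) {
        if (endCount[i] >= ch) {
            /* First segment that ends at or after ch: it either
               contains ch or ch falls in the gap before it. */
            return (startCount[i] <= ch) ? (int)i : -1;
        }
    }
    return -1;
}
```

Because the segments are sorted, a binary search driven by the searchRange, entrySelector, and rangeShift header fields is also possible; the linear scan is simply easier to follow.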
The lookup of the glyph index occurs in the last half of the GetTTUnicodeGlyphIndex function. There are two methods of looking up the glyph index for a particular glyph. Both cases use the idRangeOffset array by examining the value at the ordinal index of the segment in which the character code was found.
In the first case, if the value located at the segment index of the idRangeOffset array is zero, the code dereferences the idDelta array with the same array index and converts to the glyph index using modulo arithmetic:
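The first case can be sketched as a single expression. The function name is hypothetical. The modulo-65536 arithmetic falls out of truncating the sum to 16 bits.

```c
#include <stdint.h>

/* Delta method: glyph index = (ch + idDelta) mod 65536.
   The cast back to uint16_t performs the modulo. */
static uint16_t glyph_from_delta(uint16_t ch, uint16_t idDelta)
{
    return (uint16_t)(ch + idDelta);
}
```

For instance, a segment starting at 0x0041 with an idDelta of 0xFFC2 maps 0x0041 to glyph index 3, because 0x0041 + 0xFFC2 = 0x10003, which wraps to 3.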
In the second case, the value located at the ordinal segment index is part of an index into a lookup table for glyph indices. Based upon the order and location of the subtable's arrays, an obscure indexing trick using the address of the value at the idRangeOffset element returns an intermediate ID value. The indexing mechanism is explained in the TrueType specification's Format 4 subtable chapter. If nonzero, this value is then added to the idDelta value and converted with modulo arithmetic; otherwise, there is no glyph and the missing glyph index is returned:
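The second case might be sketched as shown below. The function name and parameters are hypothetical. The sketch assumes the subtable is in host byte order, that the glyph ID array immediately follows the idRangeOffset array in memory as it does in the font file, and that the caller has already confirmed that ch lies inside the segment. It performs no bounds checking.

```c
#include <stdint.h>

/* pRangeOffset points at this segment's own idRangeOffset element.
   That element holds a byte offset, measured from its own address,
   to the glyph ID entry for the segment's first character. */
static uint16_t glyph_from_range_offset(const uint16_t *pRangeOffset,
                                        uint16_t startCount,
                                        uint16_t idDelta, uint16_t ch)
{
    /* Divide by 2 to turn the byte offset into a count of 16-bit words. */
    const uint16_t *pId = pRangeOffset + *pRangeOffset / 2 + (ch - startCount);
    if (*pId == 0) {
        return 0;                       /* missing glyph */
    }
    return (uint16_t)(*pId + idDelta);  /* modulo 65536 via truncation */
}
```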
Some other useful functions can be derived from the decoding of the Unicode subtable. For instance this sample code implements a function called:
This function adds up the characters covered by each segment in the Format 4 subtable to find the total number of Unicode character codes represented in the TrueType font file. Note, however, that if a segment uses the glyph ID array rather than a simple delta, the function must test each individual character code in that segment for a mapping.
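The counting approach described above could look roughly like this. The function name and signature are hypothetical, the arrays are assumed to be in host byte order, and the glyph ID array is assumed to follow idRangeOffset in memory. Note that this sketch counts the terminating 0xFFFF segment that the Format 4 layout requires like any other segment.

```c
#include <stdint.h>

static uint32_t count_mapped_chars(const uint16_t *startCount,
                                   const uint16_t *endCount,
                                   const uint16_t *idRangeOffset,
                                   uint16_t segCount)
{
    uint32_t total = 0;
    for (uint16_t i = 0; i < segCount; i++) {
        if (idRangeOffset[i] == 0) {
            /* Delta segment: every code in the range is counted. */
            total += (uint32_t)endCount[i] - startCount[i] + 1;
        } else {
            /* Glyph ID array segment: test each code individually.
               ch is 32 bits wide so the loop ends when endCount is 0xFFFF. */
            for (uint32_t ch = startCount[i]; ch <= endCount[i]; ch++) {
                const uint16_t *pId = &idRangeOffset[i]
                                    + idRangeOffset[i] / 2
                                    + (ch - startCount[i]);
                if (*pId != 0) {
                    total++;
                }
            }
        }
    }
    return total;
}
```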
It is also instructive to note that the count of Unicode character codes that are mapped to a glyph is not necessarily equivalent to the number of glyphs contained in the font file. There may be fewer glyphs if the Unicode encoding maps multiple character codes to the same glyph. There may also be more glyphs in the font file than the mapping suggests. For example: TrueType Open (now called OpenType Layout) tables define glyph index substitutions to multiple, alternative glyphs.
The GetTTUnicodeGlyphIndex function could also be used to implement a function to determine whether a given TrueType font contains a glyph for a given Unicode character code. Just call the GetTTUnicodeGlyphIndex function with the character code and test the return for equivalence with the missing glyph index (a value of zero).
A Word About Implementation
This sample code was written for clarity of explanation. It is not optimized for repeated use because it allocates and retrieves TrueType tables each time a public function is called. For real applications, a good optimization would be to cache the Unicode encoding for the TrueType font file as long as it remained in the DC. An application can compare to see whether the font selected into a DC is the same TrueType font file by caching and comparing the checksum value of the font file. This checksum is located in the Table Directory of the TrueType font file at the beginning of the file and can be retrieved by using the GetFontData function. See the TrueType specification's discussion of "The Table Directory" under the Data Types chapter to locate the checksum of a font file.
The Complete Source Code
References
For more information on the Unicode standard, please see:
The Unicode Consortium. The Unicode Standard, Version 2.0. Reading, MA, Addison-Wesley Developers Press, 1996. ISBN 0-201-48345-9.
On the internet: The Unicode Consortium (http://www.unicode.org)
For additional information, see the following article in the Microsoft Knowledge Base:
210341 (http://support.microsoft.com/kb/210341/EN-US/) INFO: Unicode Support in Windows 95 and Windows 98
For more information on the TrueType specification, please see:
Microsoft TrueType Specifications (http://www.microsoft.com/typography/tt/tt.htm)
Also available on the Microsoft Developer Network Library CDs under Specifications.