Character set terms defined

Over on Typophile, Nick Shinn asked: “What is the difference between a code page, a glyph list, and a Unicode chart?” (By that last I think he meant “Unicode range.”) Nick also made some mentions of “encoding” in the same thread, so I thought I would define the whole lot of them. I expect this will be useful to font developers, and perhaps some other folks as well. I’ve just dashed this off, so I reserve the right to twiddle the descriptions for more accuracy and/or clarity. Especially if my readers point out errors. 🙂

A Unicode range is a block of Unicode codepoints dedicated to representing some particular grouping of characters. Often these are the characters of some particular writing system. In some cases they are a specific subset of that writing system, such as “Latin Extended-B.”

Unicode ranges often have available character slots which are as yet unassigned, and each new version of Unicode commonly assigns some of these. This makes supporting entire Unicode ranges a bit of a moving target for a font developer. I could release a font tomorrow that supports a bunch of specified Unicode ranges, and then when the next version of Unicode comes out, I might find that there are now more characters in those ranges.
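To make the moving-target point concrete, here is a minimal sketch that counts how many slots in one range are actually assigned, according to the Unicode data bundled with whatever Python you run it on. The function name and the choice of the Greek and Coptic range are mine, purely for illustration; the count you get depends on the Unicode version, which is exactly the problem described above.

```python
import unicodedata

def assigned_codepoints(start, end):
    """Return the codepoints in [start, end] that the bundled Unicode data assigns."""
    # General category "Cn" means "unassigned" in this Unicode version.
    return [cp for cp in range(start, end + 1)
            if unicodedata.category(chr(cp)) != "Cn"]

# Greek and Coptic range: U+0370 through U+03FF (chosen only as an example).
GREEK_START, GREEK_END = 0x0370, 0x03FF
assigned = assigned_codepoints(GREEK_START, GREEK_END)
total = GREEK_END - GREEK_START + 1

print("Unicode data version:", unicodedata.unidata_version)
print(f"{len(assigned)} of {total} slots assigned")
```

Running the same script under a Python built against a newer Unicode version may report a larger number for the same range.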

An encoding is a standardized system, across multiple computers and fonts, which maps numbered codepoints to characters. Those numbered codepoints are used as an access mechanism for those characters, by software and operating systems. Unicode is the main encoding in use these days, though there are many others. I think of an encoding as a set of slots (characters), and a font as having glyphs which fill some or all of those slots (default glyphs for various characters). Some glyphs may not be encoded at all (not in any slot), or may not fit neatly in a single slot (many ligatures are unencoded), or may be alternate forms for a default glyph (maybe tied by a string to the default glyph sitting in the slot?).
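The slots-and-glyphs picture can be sketched in a few lines. Everything here is a toy: the dictionary, the glyph names, and the helper function are hypothetical and do not reflect any real font format. The only real facts used are the Unicode codepoints themselves.

```python
# Hypothetical font: encoded glyphs, keyed by the Unicode slot they fill.
encoded_glyphs = {
    0x0041: "A",   # LATIN CAPITAL LETTER A
    0x0066: "f",   # LATIN SMALL LETTER F
    0x0069: "i",   # LATIN SMALL LETTER I
    0xFB01: "fi",  # LATIN SMALL LIGATURE FI (one of the few encoded ligatures)
}

# Glyphs that sit in no slot at all (illustrative names only).
unencoded_glyphs = ["f_f_j", "a.sc"]

def missing_characters(text, cmap):
    """Return the characters in text that have no glyph in the given slot map."""
    return sorted({ch for ch in text if ord(ch) not in cmap})

print(missing_characters("Abif", encoded_glyphs))
```

Software reaches the glyphs in the first table by codepoint; the glyphs in the second list can only be reached some other way, such as through layout features.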

A code page is an encoding, typically a single-byte encoding, specific to some group of languages. Within any given system of code pages, the same codepoints are generally recycled over and over, so they have different meanings in different code pages. For example, the Windows code pages work that way, as do the Mac OS code pages. These days, code pages are mostly obsolete for their original purpose, because they have largely been supplanted by Unicode as a single universal encoding. But code pages still serve as a convenient shorthand for indicating character set coverage. Thus some of Adobe’s font readmes have talked about which Windows and Mac code pages are supported by particular fonts.
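A quick way to see the recycling is to decode one and the same byte under several code pages. This is a small sketch using Python's built-in codecs; the helper name and the choice of byte 0xE9 are my own, and the four code pages are just a sample.

```python
import unicodedata

# A sample of single-byte code pages, using Python's codec names.
CODE_PAGES = ["cp1252", "cp1251", "cp1253", "mac_roman"]

def decode_in_each(byte_value, code_pages=CODE_PAGES):
    """Decode a single byte under each code page and return {code page: character}."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in code_pages}

results = decode_in_each(0xE9)
for name, ch in results.items():
    # Show which Unicode character this byte stands for in each code page.
    print(f"{name}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The same numeric slot comes out as a Western European letter in one code page, a Cyrillic letter in another, and a Greek letter in a third, whereas each of those characters has its own distinct codepoint in Unicode.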

“Character set” is a term sometimes applied to a glyph list as well. But strictly speaking, each entry on the list should be a single codepoint. So, Microsoft’s WGL-4 is a character set, as is H&FJ’s

A glyph list is a slightly broader term than a character set. It can include glyphs which are not characters, or do not correspond directly to a single character. For example, there are some combinations of a base letter and an accent which do not exist as single characters in Unicode, but could be represented by a single pre-composed glyph (or by two separate glyphs). For that matter, there are cases of single Unicode characters which might be handled by two or more glyphs. If the listing happens to include such cases, it suddenly matters whether it’s a “character set” or a “glyph list.”
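One rough way to check whether a base-plus-accent combination exists as a single Unicode character is to see whether NFC normalization collapses it to one codepoint. This is only a sketch, and the function name is mine. It also rests on an assumption: a handful of precomposed characters are excluded from NFC composition, so a False result is a strong hint rather than proof.

```python
import unicodedata

def has_precomposed(base, combining_mark):
    """Return True if NFC normalization composes base + mark into one codepoint."""
    composed = unicodedata.normalize("NFC", base + combining_mark)
    return len(composed) == 1

COMBINING_ACUTE = "\u0301"
COMBINING_TILDE = "\u0303"

print(has_precomposed("e", COMBINING_ACUTE))  # e with acute
print(has_precomposed("q", COMBINING_TILDE))  # q with tilde
```

A combination that stays at two codepoints can still be drawn by a single pre-composed glyph in a font, which is the kind of entry that belongs on a glyph list but not in a character set.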

One Response

  1. Randy Jones says:

    Trying to understand the difference between character set and glyph list… So by these definitions small cap glyphs are not a part of a unicode range, encoding, or code page. [Pretty much, yes. They aren’t part of any Unicode range (because they’re not encoded in Unicode proper), nor part of any code page. They are sometimes given Private Use Area assignments in Unicode, as we used to do. – T] But they are in a character set and glyph list, or maybe not a character set? “But strictly speaking, each one entry on the [character set] list should be a single codepoint”; this makes me think no. [Exactly so. If the small caps are unencoded (as they are in the most recent Adobe fonts, and in all fonts from Microsoft and Apple AFAIK), then they aren’t characters, and a definition that includes them isn’t strictly speaking a character set, but a glyph set or glyph complement. – T]


Thomas Phinney

Adobe type alumnus (1997–2008), now VP at FontLab, also helped create WebINK at Extensis. Lives in Portland (OR), enjoys board games, movies, and loves spicy food.
