Extended Latin Character Sets

About two years ago I posted my thoughts on extended Cyrillic character sets. Now we’re finally ready to talk about future extended Latin character sets, and to better document what we consider to be the existing Latin character sets as well. The largest character sets here (Adobe Latin 4 and Adobe Latin 5) are drafts; I welcome any feedback, especially (though not only) on things that “ought to be in Adobe Latin 5” but aren’t there yet.

This post owes a special thanks to my colleague Miguel Sousa, who spent many hours compiling lists based on my spreadsheets and directions, and checked my data repeatedly in various ways. Any errors are probably mine, but he created the HTML tables linked from this page, as well as the tab-delimited text files linked in turn from those pages.


Prelude

What is “Latin” anyway? That’s the writing system used for European languages including English, but also everything from Portuguese to Turkish, Lithuanian to Italian. Many languages written with the Latin writing system use accented characters, and some use unusual letters peculiar to only one or two languages, such as the eth, thorn and yogh. Some countries have switched between Latin and other writing systems at various points in history; for example, in recent years a number of former Soviet bloc countries have switched from Cyrillic to Latin for political reasons.

I should also point out that there’s a bit of a misnomer in the title and body text of this post: I refer to “characters” and “character sets.” Why is this a problem? A “character” is a fundamental element with its own unique codepoint in Unicode. A “glyph” may represent a single character, several characters (as with a ligature), or even just a part of a character that combines with other glyphs to create the full character.

The distinction becomes relevant because the larger of these character sets start including glyphs which are not encoded by themselves, but really represent combinations of Unicode characters. Fonts consist of glyphs, and support characters. But I’d already gone down this path in my post on Cyrillic, and even the type industry usage is usually to talk about “character sets”… so there is some imprecision of wording in spots. But usually I’ll be more careful/precise in my word usage (and you should be too, if you’re working in this area), okay?
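To make the character/glyph distinction concrete, here is a minimal sketch using Python’s standard unicodedata module. The letters are my own picks for illustration (e with acute, and g with tilde as used in Guarani); they are not taken from the character set tables. The first has its own codepoint, the second exists in Unicode only as a base letter plus a combining mark, so a font that wants a single well-drawn shape for it needs a glyph with no codepoint of its own.

```python
import unicodedata

# One encoded character: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
e_acute = "\u00e9"

# Two encoded characters: g followed by U+0303 COMBINING TILDE.
# Unicode has no single precomposed codepoint for this letter.
g_tilde = "g\u0303"

def describe(text):
    """Return (codepoint, Unicode name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# NFC normalization composes a sequence only where a precomposed
# character exists, so it shows which case we are in.
e_composed = unicodedata.normalize("NFC", "e\u0301")
g_composed = unicodedata.normalize("NFC", g_tilde)

print(describe(e_composed))  # a single character
print(describe(g_composed))  # still two characters
```

How a font actually maps such a sequence to one glyph (for example with OpenType layout features) is outside what this sketch shows.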

Overview

As part of the exercise to name the new character sets, I also codified and renamed some existing ones. I’m hoping this is short-term awkwardness with a long-term gain, but feel free to complain if you think it’s a boneheaded move; it’s not written in stone yet. Here’s our hierarchy of Latin character sets, with the more rationalized naming scheme.

Adobe Latin 1 = 229 glyphs, formerly ISO-Adobe
Adobe Latin 2 = 250 glyphs, formerly Adobe Western 2
Adobe Latin 3 = 329 glyphs, Adobe Western 2 + CE, the basic western “Pro” font charset
Adobe Latin 4 = 616 glyphs, Adobe Western 2 + CE + most of WGL-4 + Vietnamese + (a few things)
Adobe Latin 5 = 1119 glyphs, all of the above + more accented characters + phonetics and transliteration needs

Existing public charset info is here: http://www.adobe.com/type/browser/info/charsets.html

What’s covered here, and what’s not?

These character set definitions cover Latin characters, accented forms, and symbols commonly used in countries using these characters. They also cover phonetics where the phonetics are closely based on Latin letterforms (e.g. IPA and APA).

Fonts with character sets covering Adobe Latin 3 (or bigger) may also support Greek and/or Cyrillic, but that character coverage is not included in the AL-3 definition (nor in AL-4 or AL-5). It would be pretty improbable that we’d ever do an AL-4 or AL-5 typeface without Greek and Cyrillic, however. Certain currency symbols which are closely tied to non-Latin countries, such as the Ukrainian hryvnia, are covered in the appropriate other character set and are not here.

Other symbols such as math and technical symbols are covered here in the Latin character sets rather than elsewhere.

Adobe Latin 1
(ISO-Adobe)
AL-1 is the original character set we ended up with for Type 1 fonts, including the later addition of the euro, for 229 glyphs plus .notdef. It covers major Western European Latin languages (a.k.a. “Roman” languages) and some minor ones: Afrikaans, Basque, Breton, Catalan (without the l-dot, added in Adobe Latin 3), Danish, Dutch, English, Finnish, French, Gaelic, German, Icelandic, Indonesian, Irish, Italian, Norwegian, Portuguese, Sami, Spanish, Swahili and Swedish.

(Note that the term “Roman” is to be avoided! It has multiple independent and often conflicting meanings in typography: upright vs italic/oblique, single-byte Latin vs CE/other, or in the style of classical Roman lettering vs not.)

Adobe Latin 2
(Adobe Western 2)
AL-2, 250 characters plus .notdef, is the most basic OpenType character set for Adobe fonts, the minimum for “Standard” fonts with western language support.

AL-2 is Adobe Latin 1 plus: uni2113 (litre), estimated, the 14 Mac “symbol substitution” characters (uni2126 (ohm/omega), pi, partialdiff, uni2206 (Delta), product, summation, radical, infinity, integral, approxequal, notequal, lessequal, greaterequal, and lozenge), and 5 characters using duplicated glyphs from other characters (uni00A0 from space, uni00AD from hyphen, uni02C9 from macron, uni2215 from fraction, and uni2219 from periodcentered).
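As a quick sanity check on that list, here is a small sketch that tallies the additions and decodes the “uni” names. The glyph names are copied from the paragraph above; the helper function and variable names are my own, and the parsing assumes the usual convention of “uni” followed by four uppercase hex digits.

```python
import re

AL1_COUNT = 229  # Adobe Latin 1 glyph count, as stated earlier

# Glyph names added in AL-2, copied from the paragraph above.
SINGLES = ["uni2113", "estimated"]
MAC_SYMBOLS = [
    "uni2126", "pi", "partialdiff", "uni2206", "product", "summation",
    "radical", "infinity", "integral", "approxequal", "notequal",
    "lessequal", "greaterequal", "lozenge",
]
# New glyph name -> existing glyph whose outline it duplicates.
DUPLICATES = {
    "uni00A0": "space",
    "uni00AD": "hyphen",
    "uni02C9": "macron",
    "uni2215": "fraction",
    "uni2219": "periodcentered",
}

def uni_name_to_codepoint(name):
    """Return the codepoint for a 'uniXXXX' glyph name, else None."""
    match = re.fullmatch(r"uni([0-9A-F]{4})", name)
    return int(match.group(1), 16) if match else None

added = SINGLES + MAC_SYMBOLS + list(DUPLICATES)
al2_count = AL1_COUNT + len(added)

print(al2_count)
print(hex(uni_name_to_codepoint("uni00A0")))
```

Names such as “pi” or “lozenge” are not decoded by this helper; resolving those needs a glyph-name lookup table, which I have left out.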

Adobe Latin 3

AL-3, 329 glyphs + .notdef, is the union of the AL-2 character set and the Adobe CE character set, which includes major Latin-based Central European (CE) languages. AL-3 is the minimum character set for western “Pro” fonts.

AL-3 covers the same languages as AL-1/AL-2, plus 79 characters to cover Croatian (Latin), Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Serbian (Latin), Slovak, Slovenian and Turkish.

Adobe Latin 4

AL-4, 616 glyphs plus .notdef, is the character set that will be used for future mega-families on the scale of Hypatia Sans Pro, Arno Pro and Garamond Premier Pro. Not all new Western typefaces from Adobe will necessarily be AL-4, although we expect many will be.

Compared to AL-3, it adds 287 glyphs to support: Afar, Azerbaijani, Belarusian (Latin), Chichewa, Esperanto, Gikuyu, Greenlandic, Guarani, Ibo/Igbo, Kuskokwim, Luba (Ciluba), Malay, isiNdebele, Oromo, Pilipino/Tagalog, Setswana, Sidamo, Somali, Sotho (Northern and Southern), Swazi, XiTsonga, Tuareg, Uzbek (Latin), Vietnamese, Welsh, isiXhosa, Yoruba, and isiZulu. For Latin-based transliteration, it supports several schemes for Indic and Arabic, as well as Cyrillic and Pinyin Chinese.

We’ve long had Latin character sets beyond AL-3, but their definition has been fluid and has grown over time. All our fonts that are significantly beyond AL-3 should “at some point” be upgraded to full AL-4. But don’t hold your breath, okay? Some things that are in some of those current fonts are not going to be part of the official AL-4 definition either; not everything that’s in our existing Latin coverage will make it into AL-4 and future mega-fonts.

Adobe Latin 5

AL-5 is a massive Latin glyph complement of 1119 glyphs (plus .notdef). We currently intend to use it in only a couple of core typefaces.

Additional languages supported in AL-5, by adding 502 glyphs over AL-4, include: Bambara, Baule, Northern Berber languages (Tarifit, Kabyle, Tachelhit, Tamashek, etc.), Bislama, Cornish, Dinka, Fula, Hausa, Latin, Luxembourgish, Malagasy, Maltese, Marshallese, Mende, Sami, Venda, and Wolof. For transliteration and phonetics, it adds support for the International Phonetic Alphabet (IPA), Khmer, Mongolian, another Hindi scheme, pan-Athapaskan Native American languages, and the Americanist Phonetic Alphabet (APA). Note that most of the added glyphs are actually supporting transliteration and phonetic needs.

Supplemental Spreadsheet, 470K Excel file

This spreadsheet shows the additional characters/glyphs in the Adobe Latin 4 and 5 character sets, and which languages need which, but only over and above the characters already supported in Myriad Pro version 2.0 (the current version in 2007-08, shipped in CS3). Being based on “what’s extra over some arbitrary typeface” makes it a highly Adobe-centric document, and I hesitated to upload it at first. But I concluded that it was better to give more information (however awkward it might be for non-Adobe folks to use it), than to not give enough.
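The “which languages need which characters” idea can be sketched as a simple set comparison. Everything here is hypothetical and for illustration only: the two language entries are small hand-picked samples of accented letters (not the full requirements, and not copied from the spreadsheet), and the font’s supported characters are supplied by hand rather than read from a real font’s cmap.

```python
# Hypothetical, partial letter requirements; NOT taken from the spreadsheet.
LANGUAGE_LETTERS = {
    # c, g, h, j, s with circumflex and u with breve, lower then upper case
    "Esperanto": "\u0109\u011d\u0125\u0135\u015d\u016d"
                 "\u0108\u011c\u0124\u0134\u015c\u016c",
    # w and y with circumflex, lower then upper case
    "Welsh": "\u0175\u0177\u0174\u0176",
}

def missing_letters(supported, language):
    """Return the sample letters for `language` that `supported` lacks."""
    required = set(LANGUAGE_LETTERS[language])
    return sorted(required - set(supported))

def supported_languages(supported):
    """Return the languages whose sample letters are all present."""
    return [lang for lang in LANGUAGE_LETTERS
            if not missing_letters(supported, lang)]

# A made-up font: basic lowercase, all the Esperanto letters,
# and only one of the four Welsh letters.
font_chars = (set("abcdefghijklmnopqrstuvwxyz")
              | set(LANGUAGE_LETTERS["Esperanto"])
              | {"\u0175"})

print(supported_languages(font_chars))
print([f"U+{ord(ch):04X}" for ch in missing_letters(font_chars, "Welsh")])
```

With real data, the per-language letter lists would come from a table like the spreadsheet, and the supported set from the font itself.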

[Updated spreadsheet Aug 29th to add a note on Croatian digraphs (only supported in AL-5) and unhide some rows.]

One column in there shows which of the characters in question are supported by Hypatia Sans Pro. The asterisks in that column are for things needed in AL-4 that aren’t in Hypatia Sans Pro. I was wondering what it would take to upgrade Hypatia Sans to AL-4. The answer as it turns out is “enough that it’s not going to happen as part of the release of the full family” (when the italics are done). Right now I don’t even have time to finish the darn italics. sigh.

3 Responses

  1. Nice and very interesting stuff. Could there be a typo where “Croatian (Latin)” is both in AL-3 and AL-4?

    [Yes, it is added in AL-3, so shouldn’t be repeated in AL-4 (I’ll correct this in the main post). This warrants some additional explanation as well: Latin-based Croatian support could be seen as including certain digraphs, but in practice those digraphs are never used as separate characters. So the support of those digraphs as separate characters is only defined in AL-5. – T]

  2. David W. Goodrich says:

    For those needing to romanize East Asian languages, AL 4 has what is needed for the CJK triumvirate: macrons for Japanese, breve-o and -u for Korean, and tone-marked vowels for Chinese. I still wish Unicode hadn’t used grave and acute vowels for tone-marks — they’re too indistinct for textbooks, the place where tones are most common and important. But then I also wish Unicode didn’t describe CJK characters as “ideographs” (a term that many sinologists hate).

  3. Randy Jones says:

    Thanks for the recent posts about encoding/character sets etc. This is a very hazy area for me! I’m trying (!) to wrap up an ambitious font (for me) that has a modestly large set. It looks to me like it’s about the same as Adobe Latin 3. Ultimately I want to communicate in the easiest way what the font contains. It seems like the most user friendly way is with a list of languages supported (also for testing purposes I want to test spacing and kerning for all these languages).

    Questions:

    1. Looking at my font window in FontLab, I’m looking at nearly 700 glyphs (including smallcap glyphs). Is there an easy way to tell what languages are covered?

    [I do not know any easy way. You realize that a character set like AL-3 supports dozens of languages, right? It would be really cool if somebody invented a tool that “knows” which characters are needed for what languages, analyzes the font’s cmap, and reports what languages the font supports. Or it could be used to determine what fonts support a particular language. Hmmm… something one of the font management vendors should do. – T]

    2. Can I tell if Adobe Latin 3 is covered without hunting and pecking from your list and my font? Seems like I’d need to import Adobe Latin 3 into FontLab, and select it from the menu at the bottom of the font window… Codepages > Adobe > Adobe Latin 3. If that is correct, how do I do that? :-)

    [Somebody would have to process the posted info into a FontLab .enc encoding file, I think. If somebody makes one (or a set), I’d be happy to post it/them. – T]

    Again thanks for the articles. They are a great help (though my terminology is still butchered!)


Thomas Phinney

Adobe type alumnus (1997–2008), now VP at FontLab, also helped create WebINK at Extensis. Lives in Portland (OR), enjoys board games, movies, and loves spicy food.
