Extending Cyrillic (and later Latin) character sets

One project I’ve been working on is to define some extended character sets: that is, more characters that we’re going to have as standard in many of our future new fonts, and in some cases in future revisions to existing fonts.

First I tackled Cyrillic. What I’m presenting here is what you might consider a “late draft” – it’s what we’re going to go with unless we find errors or key omissions, and what Robert and I have been putting into the next typefaces we’re working on. Feedback welcome!

Some time later this year I’m likely to get into the same process for Latin-based languages. So, if you have any thoughts on characters that we aren’t yet putting in our newest Pro fonts, but you think would be useful additions, please comment. I’ll accumulate any and all suggestions, and look into them more carefully when time permits.

Note: In your comments, I’d appreciate your referring to characters both by Unicode and some kind of descriptive name, or with links to detailed info. For example, you might say “you should include the new Ukrainian hryvnia currency symbol, U+20B4, in all your fonts which support Cyrillic,” or “you should make sure your new or extended Latin fonts support the six accented letters needed by Esperanto: u with breve, and c g h j and s with circumflex. See http://en.wikipedia.org/wiki/Esperanto.”

[NOTE: Special thanks to Emil Yakupov of Paratype for his careful analysis and review of the original version of this post. On Aug 3, 2006, I did several minor updates as noted below in the text. None of these affected our actual character set definition, however.]


CYRILLIC OVERVIEW:

This is an “extended Cyrillic character set,” a list of characters we will add to our newer Cyrillic fonts. This is in addition to the characters of our old “Adobe Cyrillic” standard, seen in our Cyrillic Type 1 fonts, and in OpenType format in Excelsior Std and Helvetica Std (among others). We might label this “Adobe Cyrillic 2 extension” or some such.

The possible characters in an “extended Cyrillic” character set, beyond Adobe Cyrillic, number about 100. We wanted to do some subset of that which would give us the best bang for our buck. We’re adding the new Ukrainian hryvnia currency symbol, plus 27 characters to support the written language needs of 46.8M of the 59M Cyrillic-writing people who are unserved by the old Adobe Cyrillic standard. Specifically, that’s:

- ghe w/stroke (U+0492, U+0493)
- zhe w/descender (U+0496, U+0497)
- ka w/descender (U+049A, U+049B)
- en w/descender (U+04A2, U+04A3)
- straight u (U+04AE, U+04AF)
- straight u w/stroke (U+04B0, U+04B1)
- ha w/descender (U+04B2, U+04B3)
- che w/descender (U+04B6, U+04B7)
- shha (U+04BA, U+04BB)
- palochka (U+04C0, U+04CF) cap/smallcap I shape
(lowercase version is new in Unicode v5)
- cap Cyrillic schwa (U+04D8)
(we already have lowercase version in Adobe Cyrillic)
- I w/macron (U+04E2, U+04E3)
- barred o (U+04E8, U+04E9)
- u w/macron (U+04EE, U+04EF)
- hryvnia (U+20B4)
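
For readers who do not carry a mental image of these characters, the list above can be turned into a small script that prints each character next to its Unicode name. This is only a sketch: the `ADDITIONS` list and the `describe` helper are names made up for this illustration, and whether the glyphs actually display depends on the fonts installed on your system.

```python
import unicodedata

# Code points copied from the list above. ADDITIONS and describe() are
# illustrative names, not part of any Adobe tool or charset file.
ADDITIONS = [
    0x0492, 0x0493,  # ghe w/stroke
    0x0496, 0x0497,  # zhe w/descender
    0x049A, 0x049B,  # ka w/descender
    0x04A2, 0x04A3,  # en w/descender
    0x04AE, 0x04AF,  # straight u
    0x04B0, 0x04B1,  # straight u w/stroke
    0x04B2, 0x04B3,  # ha w/descender
    0x04B6, 0x04B7,  # che w/descender
    0x04BA, 0x04BB,  # shha
    0x04C0, 0x04CF,  # palochka (cap and lowercase)
    0x04D8,          # cap schwa only; lowercase is already in Adobe Cyrillic
    0x04E2, 0x04E3,  # I w/macron
    0x04E8, 0x04E9,  # barred o
    0x04EE, 0x04EF,  # u w/macron
    0x20B4,          # hryvnia
]

def describe(cp):
    """Return 'U+XXXX  <char>  <Unicode name>' for one code point."""
    ch = chr(cp)
    # Fall back to a placeholder if this Python's Unicode data lacks the name.
    name = unicodedata.name(ch, "<name not available>")
    return f"U+{cp:04X}  {ch}  {name}"

# Everything except the currency sign counts as a letter here.
letters = [cp for cp in ADDITIONS if cp != 0x20B4]

for cp in ADDITIONS:
    print(describe(cp))
print(len(letters), "letters plus the hryvnia sign")
```

The final line is a quick check that the list really does contain 27 letters in addition to the currency symbol.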

[UPDATE Aug 2: Here’s a link to the Unicode chart so you can see what these things look like. The Hryvnia is in a separate chart, however.]

If you’re wondering what was already in Adobe Cyrillic, here’s a text file listing for Adobe Western 2 plus Adobe Cyrillic.

[UPDATE Aug 3, 2006: Here is a Zipped version of the Excel spreadsheet I used to compile the Cyrillic charset data. I can’t promise it is completely error-free, however.]

The additional Cyrillic languages (and numbers of people) we would support with this proposed Adobe Cyrillic 2 character set are:

Abaza (45K)
Adyghian (300K)
Avar (600K)
Buryat (440K)
Chechen (1M)
Dargin (0.4M)
Dungan (50K)
Ingush (230K)
Kabardian (650K)
Kalmyk (160K)
Kara-Kalpak (200K) *
Kazakh (8M)
Kyrgyz (1.5M)
Lakh (145K)
Lezgi (400K)
Mongolian (5M)
Tabasaran (100K)
Tajik (4.4M)
Tatar (7M)
Turkmen (6.4M) *
Tuvan (200K)
Uzbek (16.5M)

* these people have switched from Cyrillic to Latin. [NOTE: updated August 3rd to reflect Tatar, Chechen and Ingush not switching to Latin.]

CONSIDERATIONS:

In considering whether to support each language, I looked at numerous factors, including number of people, number of characters needed, and the difficulty of designing and kerning those particular glyphs. Because number of people varies so massively per language, it ended up being a primary determinant, with number and difficulty of glyphs behind.

GROUPS:

I also took heavily into account groupings and overlaps; some languages end up getting a “free ride” due to their overlaps with character sets of more widely used languages. I grouped characters together due to their being needed for a given language or group of languages with heavy overlaps in their character needs.

Each group also assumes that we have added support for all the groups before it. I am only listing the incremental characters needed for a given language group given that all the previous groups have been added.

NOTES:

1) With regard to character sets, our Adobe Original Pro fonts are somewhat inconsistent in Cyrillic coverage. I have not analyzed every single one of them, but I note that Garamond Premier Pro has three characters not in either Myriad Pro or Excelsior Std (the original Adobe Cyrillic charset):
- Palochka (U+04C0, cap I shape), and
- ghe with descender in cap and lowercase (U+04F6, U+04F7).
As the ghe with descender is only needed for Yup’ik (10,000 people) and Nivkh (1,000 people), I am NOT recommending including it in our new fonts.

2) We have something very odd in our old Cyrillic charset. We have the lowercase Cyrillic schwa (U+04D9) in our charsets but not the cap Cyrillic schwa (U+04D8). Now, you might wonder why nobody has complained about this, if it’s a problem. It’s because we don’t (yet) support any of the languages that require the schwa. In other words, before this it was a mistake to include the schwa at all for the languages we were covering. Now, a bunch of the languages below do need it, so I recommend adding the cap schwa (see light green highlight in spreadsheet).

3) Asterisks below indicate languages for which the written form has switched to Latin in 1989 or later.

4) COLORS below refer to colors in the supporting Excel spreadsheet I used to record my research.
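
To illustrate note 2: a lowercase schwa without its capital only causes trouble once text that uses it gets uppercased, for example by an all-caps style. The sketch below shows that gap. `OLD_CHARSET` and `NEW_CHARSET` are hypothetical stand-ins that contain only the schwa entries, not the real Adobe Cyrillic lists, and `missing_after_uppercasing` is a helper invented for this example.

```python
# Hypothetical, schwa-only stand-ins for the charsets discussed in note 2.
OLD_CHARSET = {0x04D9}                 # lowercase schwa only
NEW_CHARSET = OLD_CHARSET | {0x04D8}   # with the cap schwa added

def missing_after_uppercasing(text, charset):
    """Return the code points that text.upper() needs but the charset lacks."""
    needed = {ord(ch) for ch in text.upper()}
    return sorted(needed - charset)

lowercase_schwa = "\u04d9"

# With the old charset the uppercased text needs a character that is absent.
print([f"U+{cp:04X}" for cp in missing_after_uppercasing(lowercase_schwa, OLD_CHARSET)])

# With the cap schwa added, nothing is missing.
print(missing_after_uppercasing(lowercase_schwa, NEW_CHARSET))
```

The same check could be run against a full charset and a sample text in any of the languages listed, provided you have both to hand.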

GROUP DETAILS:

Adding the palochka alone (yellow highlight in my spreadsheet) gives us full support for the following languages which we’ll call group A (# people in parens):
Abaza (45K)
Adyghian (300K)
Avar (600K)
Chechen (1M)
Dargin (0.4M)
Ingush (230K)
Kabardian (650K)
Lakh (145K)
Lezgi (400K)
Tabasaran (100K)

2 characters
- palochka (U+04C0, U+04CF)
TOTAL A: 3.9 M people served

Next, I suggest as group B (light blue highlight), we also add support for the first four languages below, which happen to have immense overlap with each other, and have a lot of speakers. This also gets us support for the next four “for free”:
Uzbek (16.5M)
Mongolian (5M)
Kazakh (8M)
Kyrgyz (1.5M)
Buryat (440K)
Kalmyk (160K)
Kara-Kalpak (200K) *
Tuvan (200K)

16 characters
- ghe w/stroke (U+0492, U+0493)
- ka w/descender (U+049A, U+049B)
- en w/descender (U+04A2, U+04A3)
- straight u (U+04AE, U+04AF)
- straight u w/stroke (U+04B0, U+04B1)
- ha w/descender (U+04B2, U+04B3)
- shha (U+04BA, U+04BB)
- barred o (U+04E8, U+04E9)
TOTAL B: 32M served

Then, I suggest group C (grey highlight), Tatar and Turkmen, which also gets us Dungan for free.
Tatar (7M)
Turkmen (6.4M) *
Dungan (50K)

2 characters
- zhe w/descender (U+0496, U+0497)
TOTAL C: 7M served (13.4M if you include those phasing out Cyrillic)

Next, group D (turquoise blue):
Tajik (4.4M)

6 characters
- che w/descender (U+04B6, U+04B7)
- I w/macron (U+04E2, U+04E3)
- u w/macron (U+04EE, U+04EF)
TOTAL D: 4.4M
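
The group arithmetic above can be re-added with a short script. This is a sketch that simply re-enters the per-language figures quoted in groups A through D (in thousands of people). The `GROUPS` structure, the `STARRED` set and the `summarize` helper are names invented for this illustration.

```python
# (group name, characters added by the group, {language: thousands of people})
# Figures are copied from the group lists above.
GROUPS = [
    ("A", 2, {"Abaza": 45, "Adyghian": 300, "Avar": 600, "Chechen": 1000,
              "Dargin": 400, "Ingush": 230, "Kabardian": 650, "Lakh": 145,
              "Lezgi": 400, "Tabasaran": 100}),
    ("B", 16, {"Uzbek": 16500, "Mongolian": 5000, "Kazakh": 8000,
               "Kyrgyz": 1500, "Buryat": 440, "Kalmyk": 160,
               "Kara-Kalpak": 200, "Tuvan": 200}),
    ("C", 2, {"Tatar": 7000, "Turkmen": 6400, "Dungan": 50}),
    ("D", 6, {"Tajik": 4400}),
]

# Languages marked with an asterisk (written form switched to Latin).
STARRED = {"Kara-Kalpak", "Turkmen"}

def summarize(groups):
    """Return (name, running character count, total, total excluding starred)."""
    rows = []
    running_chars = 0
    for name, chars, langs in groups:
        running_chars += chars
        total = sum(langs.values())
        unstarred = sum(n for lang, n in langs.items() if lang not in STARRED)
        rows.append((name, running_chars, total, unstarred))
    return rows

for name, chars, total, unstarred in summarize(GROUPS):
    print(f"Group {name}: {chars:2d} characters so far, "
          f"{total / 1000:.2f}M people ({unstarred / 1000:.2f}M excluding starred)")
```

The per-group sums come out close to the rounded totals quoted for each group. The running character count ends at 26; the cap schwa recommended in note 2 brings it to the 27 quoted in the overview.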

That’s everything we have tentatively decided to do. There are several more groups that we considered, but are not going to do at this time. Here are the specifics:

Group E (red/pink):
Bashkir (1M)
Chuvash (2M)
TOTAL: 3M
12 characters
[NOTE: deleted incorrect listing of abkhazian che U+04BE, U+04BF]
- ze w/descender (U+0498, U+0499)
- bashkir ka (U+04A0, U+04A1)
- es w/descender (U+04AA, U+04AB)
- a w/breve (U+04D0, U+04D1)
- ie w/breve (U+04D6, U+04D7)
- u w/double acute (U+04F2, U+04F3)

Group F (orange):
Komi – both Zyryan and Permyak, for 360K people in Russia.
2 glyphs
- o dieresis (U+04E6, U+04E7)

Group G (purple)
Moldavian (2.6M)
However, they switched to Latin back in 1989.
2 glyphs
- zhe w/breve (U+04C1, U+04C2)

Group H (navy blue):
Azeri (6M)
5 characters
However, the vertical-stroke glyphs could be problematic designs in heavy weights (especially for Myriad and Hypatia).
- ka w/vertical stroke (U+049C, U+049D)
- che w/vertical stroke (U+04B8, U+04B9)
- modifier letter apostrophe (U+02BC)

Group I (bright green):
Ossetic (500K)
2 glyphs
- ligature a ie (U+04D4, U+04D5)

14 Responses

  1. Joe Clark says:

    Few of us, even obsessifs who think Cyrillic and Unicode are fabulously interesting, have mental images of all these characters. Couldn’t you have just shown them in your blog posting? They’re easy enough to enter if you know the Unicode hex value, which you do.

  2. John Hudson says:

    Tom, what was your source for character-to-language mapping?

    Cyrillic is still quite heavily used for many of the languages that officially switched to Latin alphabets after the demise of the Soviet Union, so I’m glad to see you extending your Cyrillic support.

  3. Thomas Phinney says:

    Joe: I’ve added links to the two relevant Unicode charts, so you can see what things should look like.

    John: I used several sources. One was the letter database here: http://www.eki.ee/letter/. Another was Michael Everson’s “Alphabets of Europe” here: http://www.evertype.com/alphabets/index.html.

    Regards,
    T

  4. si says:

    Tom, people may ask why Adobe doesn’t just include all the extended Cyrillic codepoints – without mentioning $ values what are the cost savings as a percentage (taking into account development, test and productization) of your approach vs just adding them all?

  5. Thomas Phinney says:

    Si,

    Well, it’s hard to quantify, but 75 extra Cyrillic characters, plus small caps, and any needed alternates for consistency with the rest of the font… I think it would be about 150 extra glyphs in the typeface I’m working on right now.

    It’s also not so much money as time. We have a relatively small team working on what I call western fonts (Latin/Greek/Cyrillic). The bigger the fonts get, the fewer of them we produce. With a typical new family having 2000 glyphs per font, we may only issue something like two a year. Anything that delays that has to be carefully balanced.

    Another reason not to do everything is the question of the bar for the other writing systems covered by the same typeface. If we do *all* of extended Cyrillic, why not do all the extra bits for Greek and Latin as well? But by the time we’re done, we’ve probably added at least 25% to the glyph count.

    Another problem with increasing character sets is kerning. Although class kerning helps immensely, adding lots of new glyphs generally means creating more classes. Unfortunately, the number of class interactions is equal to the number of left-hand classes multiplied by the number of right-hand classes. Simplifying a bit, think of it as being proportional to the square of the number of classes.

    Of course, there are numerous bits of work that are not affected in the least by adding lots more glyphs, but basically I see it as a “slippery slope,” something that it is easy to go too far on and then be stuck with doing in lots more typefaces going forwards.

    So, I’ll continue thinking about character sets in terms of trying to help the most people for the least work.

    Cheers,
    T

  6. John Hudson says:

    Thanks for the extra information, Thomas. The two sources you cite are pretty reliable, I think (I was a contributor on the project that Michael Everson published as “Alphabets of Europe”).

  7. Ognyan Kulev says:

    I would very much like it if you added U+0462 Ѣ, U+0463 ѣ, U+046A Ѫ and U+046B ѫ. They are used in the old Bulgarian spelling from before 1945. Both characters are mentioned in http://www.evertype.com/alphabets/bulgarian.pdf. I understand that these don’t qualify as adding support for new languages, but I believe Bulgarian support (OK, we are not many) will benefit from these characters.

  8. Klaas Ruppel says:

    It is a very good thing that Adobe is going to extend its coverage of Cyrillic. However, I am a little bit puzzled that even now you are spending a lot of time thinking about which languages to cover and which not. Since we now have one – and only one – code table (Unicode) and not thousands of them, I would have thought that font designers would free themselves from the limitations of the 8-bit world … and create characters on a block or script basis rather than on a language basis. So it would be, e.g., Cyrillic script coverage and no longer coverage of specific languages using Cyrillic script. As a side effect you would avoid impacts on politics, minority rights, human rights and so on. With your current plans of language coverage you would make a statement concerning minority languages in Russia like Mari and Udmurt (which you plan not to cover). This would be a statement an American company would not necessarily want to make, I presume.

  9. Thomas Phinney says:

    Klaas,

    Certainly, Unicode frees us from the constraints of arbitrary code pages. But as I mentioned above in earlier comments, we’re talking about a very large amount of additional work to support the entire Unicode ranges for Cyrillic, Latin and Greek.

    The only statement that I am making is this: it is impractical for Adobe to cover all possible languages in every typeface we release, but we remain deeply interested in expanding the language coverage of our fonts, and we will continue to do so in several different ways.

    One is to define new character sets that increase the language coverage for many of our new fonts (and possibly some revised fonts). That’s what this particular blog post is about.

    A second route is to add more support than that to a couple of core typefaces (say, Minion Pro and Myriad Pro), while keeping our standard set for new typefaces a little more manageable. We have done this a bit in the past, and one can imagine that some day we might expand our core families to full coverage of the Latin, Greek and Cyrillic Unicode blocks.

    Third, we have recently commissioned and released new typefaces that are specifically designed to deal with a given writing system, such as our recent Adobe Thai, Adobe Hebrew and Adobe Arabic typefaces (with Adobe Arabic being a TDC award winner).

    Regards,
    T

  10. Thomas Phinney says:

    I just wanted to expand on my comments to Klaas, by quoting from a reply I just did to somebody with similar concerns, on the OpenType mailing list:

    The original Adobe Cyrillic character set from the early 1990s supports a number of Cyrillic languages, but not all of them. Some of those are “minority languages.” This is true for the shipping Windows system fonts as well, and I imagine for the overwhelming majority of Cyrillic fonts from the overwhelming majority of font sources/vendors. What would you suggest, then – should we instead rip out the Cyrillic support from existing fonts because we can’t extend it to cover all Cyrillic characters?

    As noted earlier, it is impractical for us, with our current level of resources, to simply add such support to all our shipping fonts that support Cyrillic.

    Assuming we are fine with the Adobe Cyrillic character set for whatever reason, but extended Cyrillic must support ALL of Cyrillic, well, we’d just have few or none of our new typefaces supporting extended Cyrillic.

    Either way, it’s conceivable that we could support the full range of Cyrillic characters in one or two typefaces, eventually. I like this idea, myself – but even if we go down that path, it won’t happen quickly.

    But I think it would be a shame to not have something in between. Currently, there are ~60M people whose Cyrillic languages are not supported by our Cyrillic. If we can support ~80% of them while adding only 27 of the possible ~100 additional Cyrillic characters (and mostly an easier 27), I think that’s a Good Thing.

    Please understand that I do take your concerns seriously. If enough people think that our existing Adobe Cyrillic character set is inoffensive, but any proposed extension which is less than full Cyrillic coverage is controversial, then I expect we’d reconsider the extension.

    Regards,
    T

  11. Vessela Kouzova says:

    I think organizing characters by language is important. Isn’t it connected with typing and keyboards? Can we organize characters by language, but whenever we don’t have a character available just leave an empty space? Is it possible for type designers to collaborate on creating additional characters? Then the missing symbols could eventually be added in the proper place, using one standard and not thousands.

    Regarding adding more characters to Cyrillic, I wish there were “Cyrillic quotation marks”, stressed vowels, plus stressed “iu” and “ia” (the last letters of the Bulgarian alphabet). Please take a look at http://www.hermessoft.com/

  12. Charles Ellertson says:

    What would be really nice are the Latin characters used in transliteration of non-Latin languages, such as the underdots for transliterated Arabic (typically d, h, s, t, and z); what some of my customers call “Sanskrit” (see the IAST list [http://en.wikipedia.org/wiki/IAST]); and the phonetic characters, which also see a lot of use.

    [Thanks, Charles - this is a very helpful suggestion. - TP]

  13. Thomas Milo says:

    I am delighted to see the extent of Adobe coverage of the Cyrillic and Latin Unicode blocks increase.

    For the record, in 1988 I was requested by Adobe Europe to map all Soviet Cyrillic orthographies. Robert’s Minion project triggered Adobe’s first interest in global computing. For the definition of the Cyrillic portion of the font, David Birnbaum (Slavic) and I (Turkic and other non-Slavic former Soviet languages) worked for Bur Davis (Adobe US) and Tony Arcari (Adobe Europe) on what was to become the basis of the Cyrillic block in the Unicode Standard. For your and everybody’s amusement, here’s the Report on Cyrillic that I wrote between 1988 and 1990 using Ye Olde WordPerfect:

    www.decotype.com/publications/1990_Adobe_Cyrillic/Comprehensive_Report_on_Cyrillic_1990.pdf

    It has at least a primitive image of each glyph discussed.

    On the basis of David’s and my contribution I believe Robert’s Minion typeface became one of the first, if not the first, new designs with Cyrillic, although I believe many of the non-Slavic elements were very slowly implemented.

    I hope this will be corrected. I understand the cost factor, but supporting Unicode in width and depth is also a cultural responsibility. The fact that a Cyrillic-based orthography is replaced by a Latin-based one does not make it less relevant. After all, we are dealing with text, not speech.

  14. Daniel Garcia says:

    I found this post by chance while looking for information on support for characters 040D and 045D (i with an accent), which are used in Bulgarian.

    I am testing an Adobe OpenType font (MyriadPro) and it does not seem to include these two characters.

    Are they going to be supported in the future?

    Daniel

    [A future version of Myriad Pro, which is currently in the works, will include glyphs for these (and many more) characters. Unfortunately, I cannot say exactly when it will be released. I suggest that you stay tuned to this blog. – KL]


Thomas Phinney

Adobe type alumnus (1997–2008), now VP at FontLab, also helped create WebINK at Extensis. Lives in Portland (OR), enjoys board games, movies, and loves spicy food.