Tchaikovsky and Chekhov; Tchernobyl, Chernobyl and Tschernobyl

One of the most challenging aspects of automated name-processing is accounting for the many ways in which a name can change when it is adapted from its original written form for representation in a different writing system. Perhaps the most benign form of this problem arises when a name is transliterated between two writing systems that are similar in their representational technique. The Cyrillic alphabet used to represent Russian (inter alia) and the Roman alphabet used to represent English (inter alia) are very similar linguistically. Better yet, there are official, widely promulgated rules for converting text in Cyrillic to its equivalent form in the Roman alphabet.

It might be reasonable to expect, then, that a Russian personal name or place-name would be spelled consistently in its romanized form. However, even a quick look at name-data shows that this is not the case. And the consequences for automated searching, matching, screening or filtering of romanized Russian names can be significant.

The problem is caused not by the lack of a rigorous Cyrillic-to-Roman transliteration system, but by the fact that there are too many of them. There are systems for German speakers, different ones for French speakers, and still different ones for English speakers. Try a Google search for the 1986 nuclear-accident site using Chernobyl and the top results are entirely in English; use Tschernobyl and the results are almost entirely in German; try the same search using Tchernobyl and the top listings are almost all in French. Different name romanizations mean a different Internet…
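As a rough illustration of how one Cyrillic spelling fans out into three Roman spellings, the sketch below romanizes Чернобыль letter by letter under three conventions. The tables are deliberately simplified assumptions that cover only the letters in this one name; they are not any official standard, and the names `CONVENTIONS` and `romanize` are made up for this example.

```python
# Simplified, illustrative per-letter tables (an assumption, not an official
# standard). Only the letters needed for "Чернобыль" are included.
COMMON = {"е": "e", "р": "r", "н": "n", "о": "o",
          "б": "b", "ы": "y", "л": "l", "ь": ""}

# The three conventions differ here only in how they render the letter Ч.
CONVENTIONS = {
    "english": {**COMMON, "ч": "ch"},
    "french": {**COMMON, "ч": "tch"},
    "german": {**COMMON, "ч": "tsch"},
}

def romanize(name, convention):
    """Romanize a Cyrillic name one letter at a time.

    Raises KeyError for any letter missing from the simplified table.
    """
    table = CONVENTIONS[convention]
    return "".join(table[ch] for ch in name.lower()).capitalize()

for convention in CONVENTIONS:
    print(convention, romanize("Чернобыль", convention))
# english Chernobyl
# french Tchernobyl
# german Tschernobyl
```

A single letter (Ч) accounts for all three variants, which is also the letter behind the Tchaikovsky and Chekhov spellings in the title.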

So why is this a big deal? In the U.S., we just need to use and expect our own Anglo-oriented romanization system(s) for Russian names, and all is well, right? Perhaps not.

One subtle but basic problem is that it is frequently not possible to know which romanization process was applied to a Russian name. Many official romanization standards are not reversible; that is, the original Cyrillic form of a name cannot be unambiguously recovered from the romanized form. Another problem is that Russian names, once romanized, do not necessarily stay within their original sphere of reference. That is why the most common spelling of the Russian composer’s name in English-language texts is TCHAIKOVSKY, yet the most common spelling of the Russian playwright and author’s name is CHEKHOV, even though both of these Russian surnames begin with the same Cyrillic letter (Ч). The French-style romanization of the composer’s name can probably be attributed to the fact that both official and popular romanization of Russian names followed a French standard from at least the 18th century until the late 1990s. The author’s name, by contrast, may be best known in its English-oriented form because one of his earliest translators and popularizers outside Russia lived in England.

What if the romanized form of the name is taken directly from an “official” document, such as a passport? The rules used to produce the official romanized form of a name in a Russian passport followed the traditional/French system until 1997; they were changed in 2010 and then changed again in 2013. This means that a Soviet/Russian citizen can have three different romanized forms of his or her name in circulation, depending on the way the name is spelled and on the exact issue date of the passport.

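These confusions may seem abstruse, academic and irrelevant to IT specialists tasked with ensuring the accuracy of an organization’s name-searching operations, but even slight romanization differences can be fatal for many key-based approaches, including SOUNDEX and Metaphone. Such methods derive the match key from the spelling itself, and a SOUNDEX key retains the first letter of the name, so two romanizations that begin with different letters can never share a key.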
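To make the effect on key-based matching concrete, here is a minimal sketch of the standard American Soundex rules applied to the three spellings of the accident site. The function name `soundex` and the `CODES` table are written for this illustration and are not taken from any particular library or product.

```python
# Standard American Soundex digit groups.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the 4-character Soundex key for a name in Roman letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                 # H and W do not separate equal codes
        code = CODES.get(ch, "")     # vowels and Y have no code
        if code and code != prev:
            key += code
        prev = code                  # a vowel resets the previous code
    return (key + "000")[:4]         # pad with zeros, truncate to 4

for spelling in ("Chernobyl", "Tchernobyl", "Tschernobyl"):
    print(spelling, soundex(spelling))
# Chernobyl C651
# Tchernobyl T265
# Tschernobyl T265
```

The French and German spellings happen to share a key, but the English spelling does not, so an exact-key lookup on C651 would miss both of the other forms.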

This problem is not limited to Russian names in the English-speaking world. A recent article by a Spanish political commentator shows how the dominance of English-language news and media sources has caused difficulties for many Spanish speakers: many prominent “names in the news” that originate in non-Roman writing systems are romanized according to conventions that make no sense in Spanish, and are spelled in a way that makes it difficult for Spanish readers to connect the spoken and written forms of those names.

[For a more complete discussion of the issues with Hispanic, Islamic, Korean, and Russian names, download our white paper on Improved OFAC Name Screening.]
