Why KYD Matters
In my previous post, I included a chart showing the extent to which a relatively small number of name-models predominate in OFAC SDN data. My purpose in providing that graphic was two-fold:
- To show that reliance on mediocre, Anglo-centric name-screening techniques such as SOUNDEX and Jaro-Winkler has a broadly negative impact on the quality of matching results, especially with respect to search recall.
- To exemplify ways in which better knowledge of the OFAC SDN names themselves can help financial institutions determine whether the name-screening techniques and products they now use are adequate for the task at hand, and thus an arguably adequate basis for due diligence.
The pie-chart diagram shows that names following the Hispanic and Islamic name-models, taken together, cover almost three-fourths of the OFAC SDN list, followed distantly by Slavic and “Other” (a grab-bag of mostly Anglo names, along with smatterings of names from other languages/cultures, as well as names that were either too hard to classify or names that showed potential evidence of multiple name-models).
Thus, it’s fairly easy to realize that whatever technology you’re using to do your OFAC name-clearance had better be good — make that really good — at anticipating the kinds of problems that are commonly observed with Hispanic and Islamic names. In this post, I’ll present a few well-known patterns of variation uniquely associated with names from the Hispanophone (Spanish-speaking) world.
[For a more complete discussion of the issues with Hispanic, Islamic, Korean, and Russian names, download our white paper on Improved OFAC Name Screening.]
Dealing With Hispanic Names
Before I begin, a disclaimer. Not every personal name in every Hispanophone country exemplifies these patterns, so (as the IT guys love to say): YMMV. But the patterns presented here are prominent enough to matter once you start looking at large collections of Hispanic names from the most populous Hispanophone countries. These patterns count as factual in the same way that “The average American family has 3.13 members” (or it did in 2014, anyway) is true, even though it is impossible to produce even a single American family that has that many members in it.
In my experience, the biggest problem that Anglocentric name-screening systems encounter when handling personal names of Hispanic origin is posed by the widespread use of compound (multiple) surnames in most Hispanophone countries (Argentina generally being something of an exception, at least these days).
What this means, first off, is that the “last” name is not really or entirely the last name (or surname, as I prefer to call it). If your system uses a mechanical way to select the surname when confronted with a fairly common name like Maria Ana Torres Gomez, you’re already in trouble, because you’ve lopped off half the surname by taking just the rightmost (i.e., “last”) piece of it.
The problem gets much worse as soon as Maria follows the standard practice of using just her father’s name (Torres) in all but the most formal of circumstances, especially when she is around Anglos. It gets worse still when Maria marries Juan Diego Gonzalez Huerta, and thus officially becomes Maria Ana Torres de Gonzalez, at which point all trace of her former “last” name (Gomez) has disappeared from your records, even though all the other parts are still safely preserved.
One other nasty little wrinkle: recent (1995) legislation in Spain allows any citizen to choose whether the matronimico (mother’s name) or the patronimico (father’s name) is officially the first/left-most surname, so Maria Ana Torres Gomez might also choose to be Maria Ana Gomez Torres there.
(Bonus: see if you can find at least one Hispanic name in the OFAC SDN list that was likely mis-parsed by taking the “last”/rightmost part of the surname, instead of taking the entire compound surname.)
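To make the compound-surname problem concrete, here is a minimal Python sketch of the surname forms described above for a single person. It assumes the paternal and maternal surnames have already been separated correctly. The function name `surname_variants` and its parameters are hypothetical, invented for this illustration. They are not part of any screening product.

```python
from typing import Optional, Set


def surname_variants(paternal: str, maternal: str,
                     husband: Optional[str] = None) -> Set[str]:
    """Return surname forms that one person might plausibly appear under.

    Hypothetical helper: covers only the patterns discussed in this post.
    """
    variants = {
        f"{paternal} {maternal}",  # full compound surname
        paternal,                  # everyday short form (father's name only)
        f"{maternal} {paternal}",  # reversed order, as permitted in Spain
    }
    if husband:
        # Married form: the maternal surname is replaced by "de <husband>"
        variants.add(f"{paternal} de {husband}")
    return variants


for form in sorted(surname_variants("Torres", "Gomez", husband="Gonzalez")):
    print(form)
```

A system that indexes only the rightmost token (“Gomez”) would miss three of these four forms.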
Well into the 19th Century, there was an active tradition in the English-speaking world of using abbreviated forms of the most common male given-names, especially in official records and important written or printed documents. Genealogical researchers digging through old English records frequently confront the forms Jas (James) and Jno (John). Although this convention has largely passed from use in the Anglophone world, it is still in active use in many parts of the Hispanophone world, especially when personal names are captured for use in official documents and commercial listings. And it is used with common given-names of both sexes.
Thus, when you are handling and processing Hispanic names, sooner or later you will be confronted by a Fco (Francisco) or a Gpe (Guadalupe) or a Ma (Maria) in a record, and you will need to match that form against its non-abbreviated equivalent. If your match-logic is just counting the characters that two names have in common, there will always be some loss of recall. Matches will slip past unnoticed. You can count on that.
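One common way to keep abbreviations from slipping past is to expand them before any scoring happens. The sketch below assumes a small lookup table containing only the three abbreviations mentioned above. A real table would be far longer, and the names `GIVEN_NAME_ABBREVIATIONS`, `expand_given_name` and `same_given_name` are hypothetical.

```python
# Illustrative table only: just the abbreviations mentioned in this post.
GIVEN_NAME_ABBREVIATIONS = {
    "fco": "francisco",
    "gpe": "guadalupe",
    "ma": "maria",
}


def expand_given_name(token: str) -> str:
    """Lower-case a given-name token and expand it if it is a known abbreviation."""
    key = token.strip().rstrip(".").lower()
    return GIVEN_NAME_ABBREVIATIONS.get(key, key)


def same_given_name(a: str, b: str) -> bool:
    """Compare two given-name tokens after abbreviation expansion."""
    return expand_given_name(a) == expand_given_name(b)


print(same_given_name("Fco.", "Francisco"))
print(same_given_name("Gpe", "Guadalupe"))
```

In practice the expanded forms would feed into whatever similarity scoring the system already uses, rather than an exact comparison.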
Initials in surnames
The name-abbreviation phenomenon among many Hispanic people is not limited to given-name components in a personal name. In many areas of Central and South America, people with a name that includes a very common matronimico (mother’s name) or apellido de casada (husband’s name, for a married woman) can simply use an initial instead.
Under this perfectly normal convention, Maria Ana Torres Gomez might also represent herself, quite legally, as Maria Ana Torres G, in certain circumstances.
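A matcher can allow for this by treating a one-letter surname part as compatible with any full part that starts with the same letter. The following is a rough sketch under that assumption. The function names are hypothetical, and for simplicity it accepts an initial in any position, not just the second.

```python
from typing import Sequence


def _clean(part: str) -> str:
    """Trim white-space and a trailing period, then lower-case."""
    return part.strip().rstrip(".").lower()


def surname_part_matches(a: str, b: str) -> bool:
    """True if the parts are equal, or one is an initial of the other."""
    a, b = _clean(a), _clean(b)
    if not a or not b:
        return False
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def compound_surname_matches(first: Sequence[str], second: Sequence[str]) -> bool:
    """Compare two already-split compound surnames part by part."""
    return len(first) == len(second) and all(
        surname_part_matches(x, y) for x, y in zip(first, second)
    )


print(compound_surname_matches(["Torres", "Gomez"], ["Torres", "G"]))
```

An initial-only match carries much less evidence than a full match, so a real scorer would probably give it a lower weight instead of a flat yes or no.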
A related but distinct issue in the surnames of some married Hispanic women arises when the husband dies. Taking our example above, Maria Ana Torres de Gonzalez, wife of Juan Diego Gonzalez Huerta, becomes Maria Ana Torres viuda de Gonzalez (“widow of Gonzalez”) when her husband passes away. The tricky bit is the viuda de, which can appear in written form as v. de or even vde in some official records.
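The written variants of viuda de can be folded into one canonical form before comparison. This sketch uses Python's standard `re` module and covers only the forms mentioned above, plus "v de" without the period. The pattern and the function name `normalize_widow_marker` are assumptions made for illustration.

```python
import re

# Matches "viuda de", "v. de", "v de" and the run-together "vde".
# Caveat: "v de" would also match a middle initial V followed by "de",
# so a real system would need extra context before rewriting.
_WIDOW_MARKER = re.compile(r"\b(?:viuda\s+de|v\.?\s+de|vde)\b", re.IGNORECASE)


def normalize_widow_marker(name: str) -> str:
    """Rewrite any recognized widow marker as the canonical 'viuda de'."""
    return _WIDOW_MARKER.sub("viuda de", name)


for raw in ("Maria Ana Torres viuda de Gonzalez",
            "Maria Ana Torres v. de Gonzalez",
            "Maria Ana Torres vde Gonzalez"):
    print(normalize_widow_marker(raw))
```

All three inputs come out as the same string, so they can then be compared like for like.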
One of the most vexing aspects of automating the name-screening process in ways that handle Hispanic patterns is the presence or absence of white-space, i.e., blanks. Many Hispanic surnames contain prefixes and other small items which complicate the decision about aligning the corresponding parts, so that apples can be compared to apples. This issue seems to be most severe when the surname “stem” (the part after all the prefixes) is relatively short.
So, for example, it is easy to see pairs of names like de la Fe and Delafe, or De la Hoz and Delahoz or Del Carmen and Delcarmen, especially in official documents like passports and visas and airline-passenger manifests, where space is at a premium, and long, multi-part Hispanic names are frequently too large for many data-capture screens.
Innocently squeezing out a blank-space here and there to make a name fit may be OK for humans, but it can be fatal for naive name-screening operations that rely on consistent white-space placement to derive correct scores and make solid matches.
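One simple defense is to compare surnames on a key that ignores white-space and letter case altogether. The sketch below assumes the key is applied to the surname field only, not to the full name. The function name `surname_key` is hypothetical.

```python
def surname_key(surname: str) -> str:
    """Build a comparison key that ignores white-space and letter case."""
    # str.split() with no arguments splits on any run of white-space,
    # so joining the pieces removes every blank.
    return "".join(surname.lower().split())


pairs = [("de la Fe", "Delafe"),
         ("De la Hoz", "Delahoz"),
         ("Del Carmen", "Delcarmen")]
for spaced, squeezed in pairs:
    print(spaced, "~", squeezed, surname_key(spaced) == surname_key(squeezed))
```

The key throws away the token boundaries, so it is probably best used alongside the original token-based comparison instead of replacing it.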
Yes, the Spanish-speaking world makes extensive use of informal/familiar given-name forms, the ones we call “nicknames.” And like our nicknames, sometimes these are fairly easy to pair with the corresponding full form (Caro~Carolina, Edu~Eduardo) and sometimes they are not (Chuy~Jesus, Paco~Francisco).
The nature of documents required in financial transactions makes it unlikely that a nickname form would appear, instead of the corresponding full “official” form, but the same cannot be said for either the OFAC SDN data, or for PEP data that you may be using. But it’s also possible that something appearing to be a nickname is, in fact, a legal given-name, just as Jack can serve either as a nickname for John, or a full legal given-name in its own right. Consider the complexities of the major-league baseball player whose legal name is Jhonny Peralta (nope, that’s not a typo; it’s really spelled that way).
Without any way to know that two names being compared and scored are closely related in this way, an automated name-screening system will generally find the similarity between a nickname and its associated full version too weak to be considered a hit. And with that, a golden opportunity for improved recall is entirely missed.
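The usual remedy is a nickname table that is consulted before (or alongside) similarity scoring. Here is a minimal sketch, assuming a table that contains only the four pairs mentioned above. The names `NICKNAMES`, `given_name_candidates` and `given_names_related` are hypothetical. Note that the original form is always kept as a candidate, so a nickname that is also a legal given-name still matches itself.

```python
from typing import Set

# Illustrative table only: the nickname pairs mentioned in this post.
NICKNAMES = {
    "caro": {"carolina"},
    "edu": {"eduardo"},
    "chuy": {"jesus"},
    "paco": {"francisco"},
}


def given_name_candidates(name: str) -> Set[str]:
    """Return the name itself plus any full forms it may stand for."""
    key = name.strip().lower()
    return {key} | NICKNAMES.get(key, set())


def given_names_related(a: str, b: str) -> bool:
    """True if the two given-names share at least one candidate form."""
    return bool(given_name_candidates(a) & given_name_candidates(b))


print(given_names_related("Paco", "Francisco"))
print(given_names_related("Chuy", "Jesus"))
```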
A Parting Shot
One last zinger: many of these patterns of name-variation are manifested very differently, or not at all, in Lusophone (Portuguese-speaking) countries such as Portugal, Brazil and even Mozambique. If you’re lumping the Lusophone name-model together with the Hispanic one, you might want to reconsider…
What Harry Callahan Can Teach Us About KYD
Oof! I know: this is looking more like a poorly edited master’s thesis than a blog post. But this is by no means an exhaustive list of the ways in which Hispanic personal names can give Anglocentric name-screening systems severe indigestion. And we’re talking about something like 40% of the OFAC SDN folks, any one of whom could be your very last customer, if you let him (or her) sign up.
So, take a pointer from Dirty Harry, and count the bullets in that gun pointed at you. Get to know your data, so you can do a better job of knowing your customers. While you’re at it, get to know how your current name-screening technology does when these issues come up. Because they will.
Or, like the Scorpio Killer, maybe you’re feeling lucky…
For an excellent, if concise, overview of personal-name patterns found in the Hispanophone world and elsewhere too, please take a look at this Wikipedia entry. [Article in Spanish]