GNR: An Advanced Name Finding Tool
This article is about names. Not any names in particular, but about finding names. With an increasingly diverse set of names becoming an everyday occurrence for any business (large or small), there is an increased emphasis on resolving multiple variations of a person’s name into a unique qualifier. Resolving names has become a Critical Success Factor for key enterprise initiatives, such as CDI (Customer Data Integration), Data Warehousing, and even MDM (Master Data Management). Organizations are finding that getting lots of names is easy. Finding the names you want is hard.
Name Damage
Name Damage means the inadvertent or intentional errors or irregularities introduced into the name. Let’s start by looking at some of the ways names can be damaged. The three main causes of name damage are:
- Data Capture Issues
- Hidden Names
- Deception Issues
Data Capture Issues
Data Capture Issues concern how names are saved into your computers for long term storage. Take a look at just a few examples of how people often spell the “same” name differently:
John, Jon, Jonathan, Johnny
How many inventive ways have you seen your own name spelled? People can also make mistakes (simple typos or swapping first & last names). Customers may speak unclearly; names may sound one way on the phone but be spelled differently (Thomson vs. Thompson, Cline vs. Kline). Optical Character Recognition (OCR) document scanning can also damage names. It only takes minor character changes to make it very hard to find someone later.
Sometimes people have to put names into a single field. Thus they put: “Bob Smith”, “Robert Smith”, “Smith Bobby”, or “Smith Bob” - which one is correct? Of course there is no single “correct” way; they all may be good enough for some purpose. One division may keep notes the first way. Another group might track shipment history with the second. The Customer Payments group might use the third, while the Help Desk might use another for support tickets. Style changes within a group can complicate finding records for any given person. Shifting styles across groups and divisions can make it nearly impossible.
Different name lists can have their own ideas about how to best keep names. What happens when you buy a mailing list or acquire another company? Combining another name list with your existing names can be non-trivial. What should one do when each list has its own idea of how first & last names are arranged? And can you be sure that “first name” and “last name” have been used consistently during the life of a given name list?
Hidden Names
So far we have considered fields that are intended to hold names, but what happens when people get creative and start hiding names in other fields? We are not referring to street names like ADDRESS1=“MARTIN LUTHER KING DRIVE” but rather to entries like ADDRESS1=“ATTN MARIA HILBERT”.
All of these are things that “just happen” during the work day: data entry errors, personal errors, and the like. However, Data Capture Issues pale in comparison to Deception Issues.
Deception Issues
Deception Issues occur when someone tries to hide their identity. Sometimes people change their names; they may use variations, intentional misspellings, or nicknames. They may also apply the same tricks to someone else’s name (e.g. identity theft). Researchers found that over 62% of the criminal records in the Tucson, AZ police department files contained misleading name data provided by suspects. There is a long list of theft and fraud activities that can hurt your organization. Some people may be very motivated to hide their identities from you. A creative person with malicious intent can make the name damage caused by Data Capture Issues (above) look harmless by comparison.
Name Damage Impact
Small changes (even innocent typos) can cause names to just “drop off your radar.” This is an especially challenging problem because people “don’t know what they don’t know.” Your customer service center may successfully handle thousands of requests every day; who would ever know if a name fell through the cracks? By definition, they are unknown because existing searches do not reveal them – and no human has the time (or ability) to dig through mountains of names looking for strays. Now that you know how names can be damaged, we will take a look at why name searching is so hard.
Name Resolution – Limitations of Prevailing Search Strategies
A helpful search strategy will find similarities between names, either based on how names sound or how similar their spellings are. The table below shows a brief history of name search strategies. Using a search strategy is conceptually simple. Take the name you want to find, and a list of candidate names to search through (maybe a phone book, or a stack of customer files, etc.). Generate a score for the search-name and each candidate-name. Scoring is where the “strategy” part comes into play; the strategy defines rules for cranking out a score based on the name’s original spelling. If the scores match, you should keep that candidate because the names might be the same. The difficulty with search strategies is that two names “might be” the same, but often they are not the same at all. To see an example of this, let us look at how Soundex works.
| Original Name | Soundex (1918) | Edit Distance (1965) | NYSIIS (1970) | Metaphone (1990) |
| --- | --- | --- | --- | --- |
| CLINE | C450 | dist = 1 | CLAN | KLN |
| KLINE | K450 | | CLAN | KLN |
| MCALEVEY | M241 | dist = 4 | MCALAFY | MKLF |
| CALVERLEY | C416 | | CALVARLY | KLFR |
Search Strategy – A Brief History with Examples
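To make the score-and-compare loop concrete, here is a minimal Python sketch. It assumes the commonly described American Soundex rules (keep the first letter, map consonants to digits, skip vowels, collapse adjacent duplicates, pad to four characters); other Soundex variants exist, and the helper name `find_candidates` and the candidate list are purely illustrative.

```python
# Consonant groups used by the commonly described American Soundex rules.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character Soundex code, e.g. 'C450' for 'CLINE'."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = CODES.get(ch, "")  # vowels and Y have no code
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (code + "000")[:4]


def find_candidates(search_name, candidates, score=soundex):
    """Keep every candidate whose score equals the search name's score."""
    target = score(search_name)
    return [c for c in candidates if score(c) == target]


print(soundex("CLINE"), soundex("KLINE"))          # C450 K450
print(soundex("MCALEVEY"), soundex("CALVERLEY"))   # M241 C416
print(find_candidates("CLINE", ["KLINE", "COLLIN", "CALVERLEY"]))  # ['COLLIN']
```

The last line shows both failure modes at once: KLINE is missed because Soundex keeps the first letter, while COLLIN is kept even though it is a different name.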
All of these strategies have trouble coping with Name Damage, Name Variants, and Cultural / Regional differences. For a simple example of Cultural differences, consider CLINE and KLINE. In Anglo cultures, C and K sometimes sound the same. In Russian, the letter C always sounds like S, as in “city”. If you had to search Russian names, strategies designed for English sounds would work poorly. Thus all the strategies above have their own limitations, the kind that may keep you from meeting your business objectives.
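Spelling-based strategies have the same blind spot. The sketch below is a standard Levenshtein edit distance in Python (the function name is ours); it reproduces the distances quoted in the table and shows that the score knows nothing about which letter swaps are plausible in a given culture. ALINE is just an illustrative comparison name.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


print(levenshtein("CLINE", "KLINE"))         # 1
print(levenshtein("MCALEVEY", "CALVERLEY"))  # 4
print(levenshtein("CLINE", "ALINE"))         # 1, same score as the sound-alike KLINE
```

A sound-alike (KLINE) and an unrelated name (ALINE) get the same distance from CLINE, so the score alone cannot tell them apart.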
In summary, the more a strategy knows about names (such as cultural background, phonetics sensitivity, etc.), the better job it can do of finding names.
Resolving Names – The GNR Approach
If we wanted to make a better name search strategy, it would be a great help to have “inside information” about how names worked. To the outside observer, it would look as if we had an unfair advantage over those other strategies. Let us look at some of the things we might be able to “learn” about names, and how that knowledge can help build better search strategies.
The Knowledge Advantage
Let us start with a small example. Consider the 100 most common Last Names for a given culture, like Chinese. Now it turns out there are over 2,000 different Chinese Last Names. But interestingly enough, the top 100 names account for 85% of all Chinese people. In other words, knowing 5% of the names covers 85% of the people.
So how does this help us build a better search strategy? When we make our sounds-like rules, we can focus on the most popular names and build rules that are customized for that culture. And because we know the most popular names, we could check to see if the search request had a typo and then (automatically) look for similar spellings. This would leave Soundex-like strategies wandering around lost while we home in on interesting search results. Note that this would also be a remarkably effective way to deal with name damage, rather like having a spell checker that was name-smart.
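A rough idea of what a “name-smart spell checker” could look like is sketched below in Python. This is not how GNR works internally; it simply compares the search name against a list of popular surnames using the standard library's `difflib`. The five-name list, the function name `suggest_spellings`, and the 0.75 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib

# Tiny illustrative sample; a real list would hold the 100 or so most
# popular surnames for the culture being searched.
POPULAR_SURNAMES = ["WANG", "LI", "ZHANG", "LIU", "CHEN"]


def suggest_spellings(search_name, popular=POPULAR_SURNAMES, cutoff=0.75):
    """Return the name itself if it is popular, else similar popular spellings."""
    name = search_name.strip().upper()
    if name in popular:
        return [name]  # already a known popular name, nothing to correct
    # get_close_matches ranks candidates by difflib's similarity ratio
    return difflib.get_close_matches(name, popular, n=3, cutoff=cutoff)


print(suggest_spellings("Zhnag"))  # ['ZHANG']
print(suggest_spellings("liu"))    # ['LIU']
```

Because the comparison list is small and culture-specific, a typo is pulled toward a name that real people actually have, rather than toward any string that happens to be spelled similarly.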
Untangling Names
Consider the following name:
Alejandr Rodriguez de la Pena y de Ybarra
Which part is the First Name?
Which part is the Last Name?
If we can separate the different parts of a name, we can make much better decisions about searching. This extra “high definition” helps us resolve the name more meaningfully.
Name Quiz – Answer
| Given | Family | 1st Surname | 2nd Surname |
| --- | --- | --- | --- |
| Alejandr | Rodriguez | de la Pena | y de Ybarra |
Recall our discussion about Soundex-like strategies. They work rather well when comparing any single part of a name. However, those strategies run into huge difficulties when comparing entire names. Part of the problem is that people can write the same names in so many different ways. Soundex requires we feed it only “Last Names” – but how do you do that if you are not sure what part the “Last Name” is?
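As a very rough illustration of separating a name into parts, the Python sketch below groups small connecting words (“de”, “la”, “y”, and so on) with the word that follows them. The particle list and the function name `segment_name` are assumptions made for this example; a real name parser needs far more cultural knowledge, and this heuristic is not GNR's method.

```python
# Hypothetical, deliberately short list of Hispanic name particles.
PARTICLES = {"de", "del", "la", "las", "los", "y"}


def segment_name(full_name):
    """Split a full name into segments, attaching particles to the next word."""
    segments = []
    pending = []  # particles waiting for the word they belong to
    for token in full_name.split():
        pending.append(token)
        if token.lower() not in PARTICLES:
            segments.append(" ".join(pending))
            pending = []
    if pending:  # trailing particles with nothing after them
        segments.append(" ".join(pending))
    return segments


print(segment_name("Alejandr Rodriguez de la Pena y de Ybarra"))
# ['Alejandr', 'Rodriguez', 'de la Pena', 'y de Ybarra']
```

Once the segments are known, a single-part strategy such as Soundex can be applied to each segment rather than to the whole string.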
Name Variations Across Cultures
Another useful kind of “name knowledge” pertains to name variations. We saw some simple examples of name variants earlier. Does culture really matter for name variations? The answer is yes, culture makes a huge difference in determining which variations to use when searching for names. Since we are already keeping track of how popular names are in various cultures, we could also make note of which names are variants of one another. If we did that, we would end up with something like the table below (Table 1), which shows up to six of the most popular variations for a given name in each culture.
| Original Culture | Original Spelling | Anglo variants | Arabic variants | Chinese variants | Hispanic variants | Russian variants |
| --- | --- | --- | --- | --- | --- | --- |
| Arabic | Isaac | ISAAC, ISACK, ISSAC, ISAC, ISSAAC, ISSACE | ISAAC, ALISAC, ESAAC, ESAK, ESSAC, ISAAQ | ISAAC | ISAAC, YSAAC, IZAAC | ISAAC |
| Russian Юрий | Yury | YURI, YURAI, YURY | YIOURI, EURI, YURY | YUJI, YOUJI, YURI, YURY | YURI, YURY, LLURI, LLURY | YURIY, IOURI, YURY, YURI, JOURI, YOURI |
| Hispanic | Manuel | MANUEL, MANUELE | MANUEL, MANAWEL, MANAWEIL, MANOAEL, MANOEIL, MANOIL | MANUEL | MANUEL, MANOLO, LICO, MANOLETE, MEME | MANUEL |
| Chinese 劉 | Liu | LIU, LIUU | ABDEOLEU, LAYO, LIU | LU, LIU, LIAO, LAU, LUI, LAO | LIU | LJU, LU, LIU |
Table 1 - Name Variants Across Cultures
How many of the variants listed in Table 1 would you have thought of? More to the point, how many would your computers have thought of? Knowing these name variants would give our search strategy a huge advantage in coping with name damage. This kind of advantage makes the difference between finding needles in haystacks versus going home empty handed.
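One simple way to put such knowledge to work is a lookup that expands a search name into its known variants for a given culture before comparing. The Python sketch below copies a few rows from Table 1; the dictionary layout and the function names `expand_name` and `is_variant_match` are assumptions for illustration, not GNR's actual data model or API.

```python
# Variant lists copied from Table 1; keys are (original spelling, culture).
VARIANTS = {
    ("YURY", "Anglo"): ["YURI", "YURAI", "YURY"],
    ("YURY", "Arabic"): ["YIOURI", "EURI", "YURY"],
    ("YURY", "Chinese"): ["YUJI", "YOUJI", "YURI", "YURY"],
    ("YURY", "Hispanic"): ["YURI", "YURY", "LLURI", "LLURY"],
    ("YURY", "Russian"): ["YURIY", "IOURI", "YURY", "YURI", "JOURI", "YOURI"],
    ("LIU", "Chinese"): ["LU", "LIU", "LIAO", "LAU", "LUI", "LAO"],
}


def expand_name(name, culture):
    """Return the known variants of a name in a culture, or just the name."""
    key = name.strip().upper()
    return VARIANTS.get((key, culture), [key])


def is_variant_match(search_name, culture, candidate):
    """True if the candidate is a known variant of the search name."""
    return candidate.strip().upper() in expand_name(search_name, culture)


print(expand_name("Yury", "Hispanic"))                # ['YURI', 'YURY', 'LLURI', 'LLURY']
print(is_variant_match("Yury", "Hispanic", "Lluri"))  # True
print(is_variant_match("Yury", "Anglo", "Lluri"))     # False
```

The same candidate (LLURI) matches or fails depending on the culture being searched, which is exactly the distinction that culture-blind strategies cannot make.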
GNR Defined
Global Name Recognition (GNR) is a powerful name searching toolkit. For over 20 years, professional linguists and computer experts have been analyzing names to help make GNR smart. They have compiled an extensive set of cultural databases from about one billion names gathered from all over the world (while not the whole planet, that is a respectable start). They have also boiled down the sounds-like rules for different cultures. This gives GNR peerless insight into names, pronunciation, and more. In other words, GNR comes pre-tuned for everyone, everywhere, right out of the box.
GNR is a toolkit because it comes “straight from the factory” as a low-level set of libraries ready to bolt into your applications (typically via SOA or linked in C++). GNR is not a rip & replace strategy. Most companies have hundreds of man-years (or more!) invested in their systems, and “just” replacing these systems is non-trivial. Instead, in many cases we find there are straightforward integration points that GNR can plug into. And of course GNR can be used in new projects as well.
Would you like to see GNR in action? You already have! Many of the examples in this article were produced with the GNR toolkit.
What Can GNR Do For Me?
Alpine would love to show you! One of Alpine’s popular services is a Data Quality Assessment (DQA). In addition to a data health-check, a DQA can give you a baseline of interesting name search results that you can measure your existing search strategy’s performance against.
Next, consider the Data Capture Issues that were discussed earlier. You could try being strict with your customers and staff to eliminate or reduce those issues. Your computers would have a much easier life if only people would behave properly (stop making typos, be consistent, no “creative” field use, etc.). Another approach would be to give your computers a “lift” with GNR – this more reasonable approach would help your computers “accommodate” those unreliable, unpredictable humans. I will leave it as an exercise for the reader to work out which approach is a “solving it once” scenario and which involves “solving it over and over again.”
Finally consider the Deception Issues discussed earlier as well. If shrinkage or compliance issues are proving to be difficult to manage, tools like GNR can help rope them in.
Summary
In this article we covered how name damage happens and the kinds of challenges that must be handled when searching for names. As organizations’ customer, supplier, and partner profiles become increasingly global, adopting state-of-the-art Name Resolution technologies holds the promise of better relationships and better protection of one’s business interests. Many ‘Early Adopter’ organizations are reaping the benefits of technologies such as GNR in compliance, cost savings, and potential for additional revenue.