Fuzzy Matching for Sanctions Screening: Why Exact Match Isn't Enough
The biggest technical challenge in sanctions screening is not speed or scale. It is name matching. A sanctioned individual might be listed as "Muammar al-Qadhafi" but your customer might enter "Moammar Gaddafi." Both refer to the same person, but an exact string comparison would miss the match entirely.
This article explains the three matching techniques used in production sanctions screening systems, why each one matters, and how they work together to minimize both false negatives (missed matches) and false positives (incorrect matches).
The problem with exact matching
Exact matching is the simplest approach: normalize both strings (lowercase, remove diacritics, strip whitespace) and compare them character by character. If they match, you have a hit.
This works only when both sides write the name the same way. If your customer enters "Vladimir Putin" and the OFAC list contains "PUTIN, Vladimir Vladimirovich," normalization yields "vladimir putin" and "putin vladimir vladimirovich," and they still do not match. The name order is different, and the OFAC entry includes the patronymic.
Now consider names transliterated from non-Latin scripts. The Arabic name "محمد" can be written as Mohammed, Muhammad, Mohamed, Mohamad, Muhammed, or dozens of other variations. None of these are typos. They are all legitimate transliterations of the same name. Exact matching fails on every single variation.
This is why every serious sanctions screening system uses multiple matching stages beyond exact comparison.
Stage 1: Exact match (the fast path)
Despite its limitations, exact matching is still the first step because it is the fastest and most confident. When it works, you get a 100% confidence result with zero ambiguity.
A good exact matching implementation does more than raw string comparison. It should:
- Normalize Unicode characters (remove diacritics: é becomes e, ü becomes u)
- Convert to lowercase
- Strip punctuation and extra whitespace
- Check against both the primary name and all known aliases
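In Verifex, exact matching runs against pre-computed normalized names stored in indexed database columns. This means each lookup is an index probe rather than an O(n) scan of the whole list: effectively O(1) with a hash index, O(log n) with a B-tree. Even with 30,000+ entries, exact matching completes in under 1 millisecond.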
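The normalization steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not Verifex's actual code: the function names, the sample entries, and the in-memory dictionary standing in for a database index are all assumptions.

```python
import string
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Replace ASCII punctuation with spaces so "al-Qadhafi" keeps its word break.
    table = str.maketrans({p: " " for p in string.punctuation})
    cleaned = without_marks.lower().translate(table)
    return " ".join(cleaned.split())


def build_exact_index(entries):
    """Map every normalized primary name and alias to a set of entity IDs."""
    index = {}
    for entity_id, names in entries.items():
        for name in names:
            index.setdefault(normalize_name(name), set()).add(entity_id)
    return index


# Hypothetical sample data: entity ID -> primary name followed by aliases.
SAMPLE_ENTRIES = {
    "E1": ["PUTIN, Vladimir Vladimirovich", "Vladimir Putin"],
    "E2": ["SBERBANK OF RUSSIA", "Sberbank"],
}
EXACT_INDEX = build_exact_index(SAMPLE_ENTRIES)


def exact_match(query: str):
    """Return the entity IDs whose normalized name or alias equals the query."""
    return EXACT_INDEX.get(normalize_name(query), set())
```

In this sketch, `exact_match("vladimir PUTIN")` finds `E1` only because "Vladimir Putin" is stored as an alias in the sample data. A query with a typo, such as "Vladmir Putin", returns an empty set and has to be handled by a later stage.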
Stage 2: Fuzzy matching with Levenshtein distance
Levenshtein distance measures how many single-character edits (insertions, deletions, or substitutions) it takes to transform one string into another. For example:
- "kitten" to "sitting" = distance 3 (substitute k with s, substitute e with i, insert g)
- "Muhammad" to "Mohammed" = distance 2 (substitute u with o, substitute a with e)
- "Putin" to "Puttin" = distance 1 (insert extra t)
The lower the distance relative to the string length, the more similar the names are. We convert this to a confidence score: a distance of 1 on a 10-character name gives about 90% confidence, while a distance of 3 gives about 70%.
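As a rough sketch, the distance and the score can be computed as follows. The dynamic-programming routine is the standard Levenshtein algorithm. The scoring formula (one minus the distance divided by the length of the longer string) is an assumption that reproduces the figures above; it is not necessarily the exact formula Verifex uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed scoring: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

With this formula, one edit on a 10-character name scores 0.9 and three edits score 0.7. Shorter names are penalized more heavily per edit: "putin" vs "puttin" is a single edit but scores about 0.83.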
The challenge with Levenshtein matching is performance. Computing the distance between a query name and all 30,000+ entries in the database would take too long for a real-time API. This is where blocking strategies come in.
The blocking technique
Instead of comparing against every entry, we first filter the database to a smaller candidate set. Verifex uses a prefix blocking strategy: we only compute Levenshtein distance against entries that share the same first two characters as the query name.
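This reduces the candidate set from 30,000+ to roughly 200-500 entries, making real-time fuzzy matching feasible. The trade-off is that we miss matches where the first two characters differ (e.g., "Gaddafi" vs "Qadhafi"). The phonetic matching stage recovers some of these, such as "Poutin" vs "Putin." Note, however, that standard Soundex keeps the first letter of the name, so "Gaddafi" (G310) and "Qadhafi" (Q310) still get different codes. Variants like these have to be covered by the alias list or by a phonetic algorithm that does not depend on the first letter.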
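A prefix block can be sketched as a dictionary keyed by the first two characters of each normalized name. The function names and the sample names below are hypothetical; a production system would typically do this with an indexed database column rather than an in-memory dictionary.

```python
from collections import defaultdict


def build_prefix_blocks(normalized_names, prefix_len=2):
    """Group names by their first `prefix_len` characters."""
    blocks = defaultdict(list)
    for name in normalized_names:
        blocks[name[:prefix_len]].append(name)
    return blocks


def candidates_for(query, blocks, prefix_len=2):
    """Return only the names that share the query's prefix."""
    # .get() avoids creating an empty block for an unseen prefix.
    return blocks.get(query[:prefix_len], [])


# Hypothetical, already-normalized list entries.
BLOCKS = build_prefix_blocks(
    ["putin vladimir", "puttin", "qadhafi muammar", "gaddafi"]
)
```

Here `candidates_for("putin", BLOCKS)` returns the two "pu" entries, so the fuzzy stage compares against two names instead of four. The same lookup for "gaddafi moammar" never sees "qadhafi muammar", which is the trade-off described above.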
Stage 3: Phonetic matching with Soundex
Phonetic matching addresses the transliteration problem directly. Instead of comparing how names are spelled, it compares how they sound.
Soundex is one of the oldest and most widely used phonetic algorithms. It works by converting a name to a four-character code based on its pronunciation:
- "Robert" becomes R163
- "Rupert" becomes R163
- "Smith" becomes S530
- "Schmidt" becomes S530
Robert and Rupert get the same code because they sound similar. Smith and Schmidt also match. This is exactly what you want for sanctions screening, where the same person might be listed under a different transliteration.
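For reference, here is a compact sketch of the common American Soundex variant, which reproduces the four codes above. It is an illustration only, and other Soundex variants differ in details such as how "h" and "w" are treated.

```python
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w are skipped and do not separate repeated codes.
            continue
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        # Vowels have no code and reset `previous`, so the same
        # consonant code is kept again after a vowel.
        previous = code
    # Keep the first letter, pad with zeros, truncate to four characters.
    return (first.upper() + "".join(digits) + "000")[:4]
```

With this sketch, `soundex("Robert")` and `soundex("Rupert")` both return `R163`, and `soundex("Smith")` and `soundex("Schmidt")` both return `S530`.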
In Verifex, we pre-compute Soundex codes for every sanctions entry and store them as indexed columns. When a query comes in, we generate its Soundex code and look up all entries with the same code. This is another indexed lookup rather than a scan, so it adds minimal latency.
The base confidence for a phonetic match is set lower than fuzzy matching (around 70-75%) because two names having the same Soundex code does not guarantee they are the same person. We then adjust the score based on the actual Levenshtein similarity between the original strings.
How the three stages work together
The matching pipeline runs the three stages sequentially with short-circuiting:
- Exact match first. If we find an exact match, return immediately with 100% confidence. No need to run fuzzy or phonetic stages.
- Fuzzy match second. If no exact match, run Levenshtein matching against the blocked candidate set. This catches typos, minor spelling variations, and partial name matches.
- Phonetic match third. If fuzzy matching returns few or no results, run phonetic matching. This catches transliterations and names that sound similar but are spelled very differently.
Results from all stages are deduplicated by entity ID and sorted by confidence score. The top 10 results are returned to the caller.
This design keeps screening fast. An exact hit returns immediately and skips the more expensive stages. When there is no exact hit, which is the common case when screening a non-sanctioned person, the fuzzy stage runs only against the blocked candidate set and the phonetic stage is a single indexed lookup, so the full pipeline still completes in under 50 milliseconds.
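The control flow of the pipeline can be sketched as below. The three stages are passed in as functions so the example stays self-contained. The `Match` type, the `screen` and `rank` names, and the `min_fuzzy_hits` threshold are assumptions, since the article does not say how many fuzzy results count as "few."

```python
from typing import Callable, Iterable, List, NamedTuple


class Match(NamedTuple):
    entity_id: str
    confidence: float  # 0.0 to 1.0
    stage: str


Stage = Callable[[str], Iterable[Match]]


def rank(hits: Iterable[Match], limit: int) -> List[Match]:
    """Keep the best hit per entity ID, then sort by confidence."""
    best = {}
    for hit in hits:
        current = best.get(hit.entity_id)
        if current is None or hit.confidence > current.confidence:
            best[hit.entity_id] = hit
    ordered = sorted(best.values(), key=lambda m: m.confidence, reverse=True)
    return ordered[:limit]


def screen(query: str, exact: Stage, fuzzy: Stage, phonetic: Stage,
           min_fuzzy_hits: int = 3, limit: int = 10) -> List[Match]:
    """Run exact, then fuzzy, then phonetic matching with short-circuiting."""
    exact_hits = list(exact(query))
    if exact_hits:
        # Short-circuit: an exact hit skips the more expensive stages.
        return rank(exact_hits, limit)
    hits = list(fuzzy(query))
    if len(hits) < min_fuzzy_hits:
        # Assumed threshold for "few results"; tune for your own data.
        hits.extend(phonetic(query))
    return rank(hits, limit)
```

Deduplicating by keeping the highest-confidence hit per entity means that an entity found by both the fuzzy and the phonetic stage is reported once, with the stronger of the two scores.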
Real-world examples
Here are some practical examples of how the three stages catch different types of matches:
| Query | Matched entry | Stage | Why |
|---|---|---|---|
| Sberbank | SBERBANK OF RUSSIA | Exact | Alias match |
| Vladmir Putin | Vladimir Putin | Fuzzy | 1 character typo |
| Wladimir Putin | Vladimir Putin | Fuzzy | German transliteration |
| Poutin | Putin | Phonetic | French transliteration |
Confidence scores matter
Every match should include a confidence score that tells you how likely it is to be a true match. This is critical for building automated workflows. Here is a general framework:
- 90-100% (critical). Exact match or very close fuzzy match. Almost certainly the same person. Block the transaction and escalate to your compliance team.
- 75-89% (high). Strong fuzzy or phonetic match. Likely the same person but needs human review. Flag for investigation.
- 60-74% (medium). Moderate similarity. Could be the same person or could be a coincidence. Worth a second look, especially if other details (date of birth, nationality) also match.
- Below 60% (low). Weak match. Probably a different person, especially for common names. Most businesses auto-approve at this level unless other risk indicators are present.
The confidence score is not a probability in the statistical sense. It is a heuristic based on string similarity. But it gives you a consistent, actionable number to build your compliance logic around.
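In application code, the framework above might translate into a small mapping from score to band to action. The band boundaries come from the list above. The function name and the action identifiers are hypothetical and should be replaced with whatever your compliance workflow uses.

```python
def risk_band(confidence_pct: float) -> str:
    """Map a 0-100 confidence score to the bands described above."""
    if not 0 <= confidence_pct <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if confidence_pct >= 90:
        return "critical"
    if confidence_pct >= 75:
        return "high"
    if confidence_pct >= 60:
        return "medium"
    return "low"


# Hypothetical action names; adapt them to your own workflow.
ACTIONS = {
    "critical": "block_and_escalate",
    "high": "flag_for_investigation",
    "medium": "second_look",
    "low": "auto_approve",
}


def action_for(confidence_pct: float) -> str:
    """Return the default action for a match with this confidence."""
    return ACTIONS[risk_band(confidence_pct)]
```

Keeping the thresholds in one place makes it easier to tune them later, for example by tightening the "low" band for high-risk products.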
Beyond Soundex: advanced techniques
Soundex is a good starting point, but there are more sophisticated phonetic algorithms available:
- Metaphone and Double Metaphone. More accurate than Soundex for English names, with better handling of consonant clusters and silent letters.
- Caverphone. Designed for New Zealand English but useful for names with British English pronunciation patterns.
- Beider-Morse Phonetic Matching. Specifically designed for matching names across multiple languages. It considers the name's likely origin language when generating phonetic codes.
For most sanctions screening applications, Soundex combined with Levenshtein distance provides a good balance of accuracy and performance. More advanced algorithms add marginal improvement at the cost of complexity.
Summary
Effective sanctions screening requires going beyond exact string matching. A three-stage pipeline that combines exact matching (fast and certain), fuzzy matching (catches typos and variations), and phonetic matching (catches transliterations) provides comprehensive coverage while keeping response times under 50 milliseconds.
Every match should include a confidence score so your application can make risk-proportionate decisions: block critical matches, flag high and medium matches for review, and auto-approve low matches. This gives you compliance coverage without creating an unworkable volume of false positives for your operations team.