What's in a name?
We have already mentioned the problem of duplication in pension schemes and annuities. As it is an issue we encounter frequently, it is worth talking a little about some technology that can be used to counter it.
What we find in practice is that the unique member identifiers used within financial administration systems are all too frequently, well, not unique. We know that converting policy or benefit orientated data into individual person orientated data is vital statistically, but how can this be done reliably?
The answer is to use a combination of other data attributes present for each member to create a deduplication key around which multiple records can be merged. One common case would be to merge records which shared a common birthdate, name, gender and postcode.
To do this there are a few issues with names that need to be addressed:
- Names often appear with and without embedded titles - from the relatively mundane Mr, Mrs and Ms, to those ennobled between policy purchases (it happens; not in my life, but it happens). Titles need to be recognised and extracted prior to merging.
- Forenames are commonly abbreviated and suffer from variant spelling - Stephen, Steven, Steve - and may even be truncated to the first initial. We eliminate these challenges by working exclusively with the first initial, although weaker deduplication schemes can leave out the forename altogether.
- Surnames or family names have far greater variant spelling potential than forenames. Business transacted by tele-servicing in particular is less likely to trap and correct these variants upfront, so they often affect policy records. As an example, my surname can commonly appear as any of Ritche, Richie or Richey, along with a host of less common variants. But as the family name is an important part of the deduplication key we need to harness it. So we use double metaphone.
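The title and forename handling described in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production routine: the `TITLES` set is a deliberately short, hypothetical list, and the function names `split_titles`, `first_initial` and `partial_key` are invented for this example. The surname is left out of the key here because it is matched phonetically rather than exactly.

```python
# Hypothetical, deliberately short title list; a real portfolio needs a fuller one.
TITLES = {"mr", "mrs", "ms", "miss", "dr", "sir", "dame", "lord", "lady"}


def split_titles(forename_field):
    """Separate any leading titles from the rest of a forename field."""
    tokens = forename_field.split()
    titles = []
    # Peel off leading tokens for as long as they look like titles ("Ms." == "Ms").
    while tokens and tokens[0].rstrip(".").lower() in TITLES:
        titles.append(tokens.pop(0))
    return titles, " ".join(tokens)


def first_initial(forename_field):
    """Reduce a forename field to its first initial, ignoring embedded titles."""
    _, rest = split_titles(forename_field)
    return rest[:1].upper()


def partial_key(dob, forename_field, gender, postcode):
    """Build the exact-match part of a deduplication key (everything but surname)."""
    return (
        dob,
        first_initial(forename_field),
        gender.upper(),
        postcode.replace(" ", "").upper(),  # ignore spacing and case differences
    )


# "G", "Sir Gavin" and "Gavin" all collapse to the same partial key.
keys = {
    partial_key("25/09/1948", forename, "M", "EH4 2DA")
    for forename in ("G", "Sir Gavin", "Gavin")
}
print(keys)
```

Working with the initial alone trades some precision for robustness: two different people sharing a birthdate, postcode, gender and initial would collide, which is why the surname still needs to contribute to the key.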
Double metaphone is an algorithm developed by Lawrence Philips. It looks through variant spellings by reducing surnames to phonetic codes. The "double" in the title stems from the fact that returning up to two codes for a single surname allows the algorithm to deal with common-case Anglo-Saxon and foreign-pronunciation variants simultaneously.
As an example, say we have three annuity records in a portfolio:
Date of Birth | Surname | Forename | Postcode | Gender | Surname metaphone codes (primary / alternate) |
---|---|---|---|---|---|
25/09/1948 | Smith | G | EH4 2DA | M | SM0 / XMT |
25/09/1948 | Smythe | Sir Gavin | EH4 2DA | M | SM0 / XMT |
25/09/1948 | Schmidt | Gavin | EH4 2DA | M | XMT / SMT |
Although these surnames differ and would fail a straightforward text match, double metaphone shows all to match on some combination of the primary or alternate phonetic codings. Primary to primary matches - as in Smith and Smythe - are the strongest, but even our alternate to primary match with Schmidt indicates a likely duplicate in the presence of the other corroborating attributes.
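The comparison of primary and alternate codes can be sketched as follows. This is an illustrative sketch rather than the method used in any particular system: the function name `match_strength` and its return labels are assumptions made for this example, and the codes are copied from the table above rather than computed, on the assumption that an existing double metaphone implementation supplies them.

```python
def match_strength(codes_a, codes_b):
    """Compare two (primary, alternate) code pairs.

    Returns "primary" when the primary codes agree, "alternate" when the
    pairs share any other code, and None when they share nothing.
    """
    primary_a, _ = codes_a
    primary_b, _ = codes_b
    # Strongest case: both surnames reduce to the same primary code.
    if primary_a and primary_a == primary_b:
        return "primary"
    # Weaker case: any overlap involving an alternate code (ignoring blanks).
    shared = {c for c in codes_a if c} & {c for c in codes_b if c}
    if shared:
        return "alternate"
    return None


# (primary, alternate) codes copied from the table above, not computed here.
smith = ("SM0", "XMT")
smythe = ("SM0", "XMT")
schmidt = ("XMT", "SMT")

print(match_strength(smith, smythe))   # primary
print(match_strength(smith, schmidt))  # alternate
```

A weaker "alternate" result is best treated as a candidate duplicate only when the other attributes (birthdate, initial, gender and postcode) also agree, as they do for the three records above.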
Of course, European names also bring accented characters, and if you are a German speaker you might be relieved to note that Strasser and Straßer both share a primary metaphone code of STRS. Well, it pleases us anyway - duplication is a problem you don't want in your models, whatever the language!