What's in a name?

We have already mentioned the problem of duplication in pension schemes and annuities. As it is an issue we encounter frequently, it is worth talking a little about some technology that can be used to counter it.

What we find in practice is that the unique member identifiers used within financial administration systems are all too frequently, well, not unique. We know that converting policy- or benefit-orientated data into person-orientated data is vital statistically, but how can this be done reliably?

The answer is to use a combination of other data attributes present for each member to create a deduplication key around which multiple records can be merged. One common case would be to merge records which shared a common birthdate, name, gender and postcode.
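
As a minimal sketch of the idea, assuming each record is a dict with the hypothetical field names `dob`, `surname`, `gender` and `postcode` (none of these names come from a real administration system), records can be grouped under a composite key. This version matches the name exactly after tidying case and spacing, so it is deliberately too strict for real names.

```python
from collections import defaultdict


def dedup_key(record):
    """Build a composite key from attributes that should agree for one person.

    The field names are assumptions made for this illustration.
    """
    return (
        record["dob"],
        record["surname"].strip().upper(),
        record["gender"].strip().upper(),
        record["postcode"].replace(" ", "").upper(),
    )


def group_by_key(records):
    """Collect records sharing the same deduplication key."""
    groups = defaultdict(list)
    for record in records:
        groups[dedup_key(record)].append(record)
    return dict(groups)
```

Any group holding more than one record is a candidate for merging into a single person.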

To do this there are a few issues with names that need to be addressed:

  • Names often appear with and without embedded titles - from the relatively mundane Mr, Mrs and Ms, to those ennobled between policy purchases (it happens; not in my life, but it happens). Titles need to be recognised and extracted prior to merging.
  • Forenames are commonly abbreviated and suffer from variant spelling - Stephen, Steven, Steve - and may even be truncated to the first initial. We eliminate these challenges by working exclusively with the first initial, although weaker deduplication schemes can leave out the forename altogether.
  • Surnames or family names have far greater variant spelling potential than forenames. Business transacted by tele-servicing in particular is less likely to trap and correct these variants upfront, so they often affect policy records. As an example, my surname can commonly appear as any of Ritche, Richie or Richey, along with a host of less common variants. But as the family name is an important part of the deduplication key we need to harness it. So we use double metaphone.
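
The first two points can be sketched as follows. This is an illustration only: the `TITLES` set is a short, assumed list rather than a complete one, and `split_forename` is a hypothetical helper name. It separates any recognised titles from a raw forename field and reduces what remains to a first initial.

```python
# Illustrative and deliberately incomplete list of titles (an assumption).
TITLES = {"MR", "MRS", "MS", "MISS", "DR", "SIR", "DAME", "LORD", "LADY"}


def split_forename(raw):
    """Return (titles, first_initial) extracted from a raw forename field."""
    # Upper-case each token and drop surrounding full stops, so "Ms." -> "MS".
    tokens = [token.strip(".").upper() for token in raw.split()]
    tokens = [token for token in tokens if token]
    titles = [token for token in tokens if token in TITLES]
    names = [token for token in tokens if token not in TITLES]
    # Work only with the first initial of the first remaining name.
    initial = names[0][0] if names else ""
    return titles, initial
```

With this, "Sir Gavin", "Gavin" and "G" all reduce to the initial G, so the title and the abbreviated forename no longer block a match.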

Double metaphone is an algorithm developed by Lawrence Philips. It looks through variant spellings by reducing surnames to phonetic codes. The "double" in the title stems from the fact that returning up to two codes for a single surname allows the algorithm to deal with common-case Anglo-Saxon and foreign-pronunciation variants simultaneously.

As an example, say we have three annuity records in a portfolio:

Date of Birth | Surname | Forename  | Postcode | Gender | Surname Metaphone
25/09/1948    | Smith   | G         | EH4 2DA  | M      | SM0 / XMT
25/09/1948    | Smythe  | Sir Gavin | EH4 2DA  | M      | SM0 / XMT
25/09/1948    | Schmidt | Gavin     | EH4 2DA  | M      | XMT / SMT

Although these surnames differ and would fail a straightforward text match, double metaphone shows all to match on some combination of the primary or alternate phonetic codings. Primary to primary matches - as in Smith and Smythe - are the strongest, but even our alternate to primary match with Schmidt indicates a likely duplicate in the presence of the other corroborating attributes.
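
The grading of matches described above can be sketched like this. The phonetic codes are copied from the example records rather than computed, and `match_strength` and its labels are hypothetical names chosen for this illustration, not part of any library.

```python
def match_strength(codes_a, codes_b):
    """Classify agreement between two (primary, alternate) code pairs.

    Returns a label for the strongest agreement found, or None if no
    combination of codes matches.
    """
    primary_a, alternate_a = codes_a
    primary_b, alternate_b = codes_b
    if primary_a and primary_a == primary_b:
        return "primary-primary"
    if primary_a and primary_a == alternate_b:
        return "primary-alternate"
    if primary_b and primary_b == alternate_a:
        return "primary-alternate"
    if alternate_a and alternate_a == alternate_b:
        return "alternate-alternate"
    return None


# (primary, alternate) codes taken from the example records.
CODES = {
    "Smith": ("SM0", "XMT"),
    "Smythe": ("SM0", "XMT"),
    "Schmidt": ("XMT", "SMT"),
}
```

Here `match_strength(CODES["Smith"], CODES["Smythe"])` gives the strongest grade, while Smith against Schmidt gives the weaker primary-alternate grade, which would be accepted only alongside the other corroborating attributes.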

Of course, European names also bring accented characters, and if you are a German speaker you might be relieved to note that Strasser and Straßer both share a primary metaphone code of STRS. Well, it pleases us anyway - duplication is a problem you don't want in your models, whatever the language!

Metaphone in Longevitas

Longevitas users can choose whether or not to apply metaphone algorithms during deduplication. Simply go to the Configuration section and open the Deduplication tab. There you will have the option to select deduplication schemes with or without metaphone encoding.
