Confounding compounding
Earlier posts discussed the importance of deduplication in annuity portfolios and pension schemes, and some of the issues around deduplicating names, specifically the use of the double metaphone algorithm to see through common variant spellings of a surname or family name.
One problem is that the surname field is often prefixed with first or middle names as well. Or it might be suffixed with a post-nominal term, as in Douglas Fairbanks Junior. Trickier still is the presence of compound names like Simon Van der Valk, and the fact that in teleservicing Van der Valk sounds awfully like Vandervalk or even Vander Valk.
So trying to match Mr Simon Piet Van der Valk with S VanderValk Senior PHD isn't a walk in the park. If we try a metaphone match on the final token, we'll find that Valk doesn't match PHD on either a primary or alternate basis.
What to do? Well, you can never be perfect in this area, but with the appropriate effort you can be more than good enough. Recognising common trailing terms and disregarding them, as we already do with titles, takes us part of the way there.
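To illustrate the idea, here is a minimal Python sketch of stripping leading titles and trailing post-nominal terms before matching. The TITLES and POST_NOMINALS vocabularies are illustrative assumptions, not a production list:

```python
# Illustrative vocabularies only; a real system would use a fuller,
# territory-specific list of titles and post-nominal terms.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
POST_NOMINALS = {"junior", "jnr", "jr", "senior", "snr", "sr", "phd", "obe"}

def strip_decorations(name: str) -> list[str]:
    """Lower-case and tokenise a name, then drop leading titles
    and trailing post-nominal terms."""
    tokens = name.lower().split()
    while tokens and tokens[0] in TITLES:
        tokens.pop(0)          # discard leading title, e.g. "Mr"
    while tokens and tokens[-1] in POST_NOMINALS:
        tokens.pop()           # discard trailing term, e.g. "Senior", "PHD"
    return tokens

strip_decorations("S VanderValk Senior PHD")  # ['s', 'vandervalk']
```

After this step the final token of S VanderValk Senior PHD is once again a surname rather than a qualification.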
The final trick is to combine space-separated tokens based on a list of compounding name elements, such as the fragment shown here:
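A minimal Python sketch of this token-combining step might look as follows. The COMPOUNDING set is an assumed, illustrative list of elements, not the article's actual fragment:

```python
# Assumed, illustrative list of compounding name elements.
COMPOUNDING = {"van", "von", "der", "de", "del", "della",
               "la", "le", "mac", "mc", "ter", "ten"}

def combine_compounds(tokens: list[str]) -> list[str]:
    """Merge a run of compounding elements into the token that follows,
    so ['van', 'der', 'valk'] becomes ['vandervalk']."""
    out, pending = [], ""
    for tok in tokens:
        if tok in COMPOUNDING:
            pending += tok             # accumulate the compound prefix
        else:
            out.append(pending + tok)  # attach any accumulated prefix
            pending = ""
    if pending:                        # compound elements with no surname after them
        out.append(pending)
    return out

combine_compounds(["simon", "piet", "van", "der", "valk"])
# ['simon', 'piet', 'vandervalk']
```

With lower-cased tokens, the final token of Simon Piet Van der Valk then collapses to vandervalk, the same string as in S VanderValk.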
Doing this, you can actually match such complex names on a string equivalence basis without metaphone.
This kind of name is of course more common in some territories than others, and some might argue it will be a small part of most portfolios. This may be true, but if it occurs amongst the wealthiest policyholders, who represent the largest concentration of risk, it has a disproportionate impact.
The general point shouldn't be conceded in any case, since creating statistical models responsibly means making every effort to preserve the independence assumption. And that makes it worth going the extra mile.