Logistical nightmares
A common Generalised Linear Model (GLM) for mortality modelling is logistic regression, also sometimes described as a Bernoulli GLM with a logit link function. This models mortality at the level of the individual, and the quantity modelled is the rate of mortality over a single year. When age is used as a continuous covariate, logistic regression has some very useful properties for pensioner mortality: approximately exponentially increasing mortality from age 60 to 90 (say), with slower, non-exponential increases at higher ages. Logistic regression was the foundation of the models presented in a SIAS paper on annuitant mortality.
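The shape described above can be checked with a minimal sketch. The intercept and slope below are illustrative assumptions, not fitted values from any data set or from the SIAS paper; the point is only that the year-on-year ratio of mortality rates is nearly constant at pensioner ages and falls towards one at very high ages.

```python
import math

# Illustrative (assumed) parameters, not fitted to any real portfolio.
ALPHA = -12.0   # intercept on the logit scale
BETA = 0.11     # increase in logit(q) per year of age

def q(age):
    """One-year mortality probability under a logistic model in age."""
    return 1.0 / (1.0 + math.exp(-(ALPHA + BETA * age)))

def ratio(age):
    """Ratio of next year's mortality rate to this year's."""
    return q(age + 1) / q(age)

for age in (60, 75, 90, 100, 110):
    print(age, round(q(age), 4), round(ratio(age), 4))
```

While q is small the ratio stays close to exp(BETA), which is the near-exponential phase; as q grows the ratio shrinks, which is the slower increase at higher ages.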
Although logistic regression for the rate of mortality is nowadays superseded by more-powerful survival models, it is still used in some quarters. However, it is a matter of real concern that we still encounter invalid implementations of this particular GLM in some actuarial applications. The mistake is to take a sixty-year-old observed over three years (say) and split this into three year-long "observations" (one at age 60, one at 61 and one at 62) in order to force multi-year data into a model for one-year mortality rates. That this is a mistake is clear from the fact that such "observations" are indistinguishable from the data for three separate individuals aged 60, 61 and 62 observed over one year each. You don't need to be a statistician to know that one person observed over three years is not the same as three people observed over one year. This treatment of the data makes it impossible to distinguish between the two, so what impact does this have?
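A small sketch makes the problem concrete. The function name and the (age, death indicator) record layout are hypothetical, and whole years of observation are assumed for simplicity; the sketch reproduces the invalid splitting only to show that its output carries no trace of which life each record came from.

```python
def split_into_years(start_age, years_observed, died):
    """Split one life's observation period into one-year records.

    This is the invalid treatment described in the text. Each record is
    (age, death_indicator); a death can only fall in the final year.
    """
    records = []
    for k in range(years_observed):
        is_last = (k == years_observed - 1)
        records.append((start_age + k, 1 if (died and is_last) else 0))
    return records

# One sixty-year-old observed for three years, surviving throughout.
one_person = split_into_years(60, 3, died=False)

# Three separate survivors aged 60, 61 and 62, each observed for one year.
three_people = [
    rec
    for age in (60, 61, 62)
    for rec in split_into_years(age, 1, died=False)
]

print(one_person)
print(three_people)
print(one_person == three_people)
```

The two lists are identical, so a model fed these records cannot tell the two situations apart.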
When the time period over which an individual is observed is split up in this way, the assumption of independence of observations is violated. This assumption is crucial to all statistical models, of course, not just GLMs or logistic regression. The independence assumption can obviously hold true for the three individuals aged 60, 61 and 62 in the second example: any one of them could die in the year observed, or all of them, or none of them. Knowledge of what happened to the 62-year-old tells us nothing about what happened to the other two. However, when observation periods are split up, the resulting "observations" are clearly not independent. In the one-person case above, the very existence of the "observation" at age 62 tells us automatically that there was no death at 60 or 61. The independence assumption no longer holds true, and the situation is obviously worse when longer time periods are split into more "observations".
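The loss of independence can be seen by simply listing the possible outcomes. The sketch below is illustrative only, with hypothetical names; it enumerates the death-indicator patterns available to three separate individuals and compares them with the patterns a single life can generate once its three years are split into records.

```python
from itertools import product

# Three separate individuals aged 60, 61 and 62: each one can die (1) or
# survive (0) regardless of what happens to the other two.
independent_outcomes = set(product((0, 1), repeat=3))

def split_outcomes(max_years):
    """All death-indicator sequences one life can produce when split."""
    outcomes = {(0,) * max_years}  # survives the whole period
    for death_year in range(max_years):
        # Dies in this year, so no later records can exist.
        outcomes.add((0,) * death_year + (1,))
    return outcomes

one_life_outcomes = split_outcomes(3)

print(len(independent_outcomes))   # 8
print(len(one_life_outcomes))      # 4
print(sorted(one_life_outcomes, key=len))
```

Every sequence for the single life that reaches age 62 starts with two zeros: the record at 62 can only exist if there was no death at 60 or 61.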
What are the consequences of this error? The most obvious one is that the standard errors produced by the GLM software will be misleading. The software is being told that there are more independent observations than there really are. As a result, the standard errors produced will be too low and they will understate the uncertainty around the parameter estimates.
A follow-on consequence of too-small standard errors is that the p-values will be wrong, i.e. the significance of parameters will be over-stated. This can lead to insignificant rating factors being falsely classed as significant.
The third consequence may be the most damaging of all: parameter estimates will be biased. The number of times a life appears in the "observations" is clearly linked to how long they live. This leads to longer-lived sub-groups being over-represented in the model-fitting.
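The link between lifetime and record count is simple arithmetic, sketched below with made-up numbers. The two groups, their sizes and the five-year window are hypothetical extremes chosen for illustration; the sketch only counts records and does not itself measure the bias in any fitted parameter.

```python
def records_contributed(death_year, window):
    """Number of one-year records a life contributes when split.

    death_year is the 1-based year of the window in which the life dies,
    or None if the life is still alive at the end of the window.
    """
    return window if death_year is None else min(death_year, window)

window = 5                    # hypothetical five-year observation window
short_lived = [1] * 100       # hypothetical: 100 lives, all dying in year 1
long_lived = [None] * 100     # hypothetical: 100 lives, all surviving

short_records = sum(records_contributed(d, window) for d in short_lived)
long_records = sum(records_contributed(d, window) for d in long_lived)
share = long_records / (short_records + long_records)

print(short_records, long_records, round(share, 3))  # 100 500 0.833
```

The two groups contain the same number of lives, yet the longer-lived group supplies five-sixths of the "observations".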
If the results of such a GLM were being used for financial purposes, it is obviously a major concern if the estimates are biased and the standard errors are wrong. The solution is simple: enforce the independence assumption by making sure each person appears only once in the data fed into the model. To use logistic regression at the individual level, this means using a single year's data only. Of course, this is wasteful if you have multiple years' data, and you are somewhat vulnerable to period effects. If you want to model mortality at the individual level over several years, therefore, the obvious approach is to use survival models for the force of mortality.