How much data do you need?
We have written before about how survival models make better use of available data. Another way of viewing this is that survival models can make do with smaller data volumes than methods based on the rate of mortality, qx. But what do we mean by "data volumes"? Should we measure this by claim events, by number of lives or by exposure time? And how much is enough?
For survival models the most sensible measure is a combination of claim events and exposure time. The number of lives is of secondary importance, since survival models naturally and easily span multi-year investigations. For a survival model it matters far less whether 10,000 life-years of exposure are observed amongst 10,000 people for one year or amongst 5,000 people for two years.
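To make the exposure-time point concrete, here is a minimal sketch in Python. The two cohorts are hypothetical and simply mirror the figures in the paragraph above; the function name `total_exposure` is ours, not part of any library.

```python
def total_exposure(years_observed):
    """Sum individual observation periods (in years) into total life-years."""
    return sum(years_observed)

# Hypothetical cohort A: 10,000 people, each observed for one year.
one_year_cohort = [1.0] * 10_000
# Hypothetical cohort B: 5,000 people, each observed for two years.
two_year_cohort = [2.0] * 5_000

exposure_a = total_exposure(one_year_cohort)
exposure_b = total_exposure(two_year_cohort)

# Different numbers of lives, same exposure time.
print(f"Cohort A: {len(one_year_cohort):,} lives, {exposure_a:,.0f} life-years")
print(f"Cohort B: {len(two_year_cohort):,} lives, {exposure_b:,.0f} life-years")
```

In a real investigation each entry would be the time an individual policy was actually under observation, so the list would contain a mix of partial and multi-year periods.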
In an analysis of a critical-illness portfolio we had 267 claims out of nearly 130,000 life-years of exposure. Of these 267 claims, just 56 were from smokers, who accounted for 20,000 life-years of the exposure time. A natural reaction would be to think that these claim counts are too small to detect any smoker/non-smoker differential. Natural, but mistaken: the survival model we fitted to this data estimated that smokers had a 57% higher critical-illness claim rate than non-smokers, with a standard error of 15%. This gave a p-value of 0.02% for the effect of smoking, i.e. the result was highly significant even at the 0.1% test level.
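As a rough cross-check on the significance quoted above, the sketch below applies a simple two-sided Wald test to the rounded estimate and standard error. This is an assumption on our part: it treats the estimate as approximately normal and uses the rounded figures, so it will not reproduce the fitted model's p-value exactly, only its order of magnitude.

```python
import math

# Rounded figures quoted in the text (assumed to be on the same scale).
estimate = 0.57    # estimated smoker excess
std_error = 0.15   # standard error of that estimate

# Wald statistic: how many standard errors the estimate lies from zero.
z = estimate / std_error

# Two-sided p-value under a normal approximation:
# 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.2f}")
print(f"two-sided p-value = {p_two_sided:.4%}")
print("significant at the 0.1% level:", p_two_sided < 0.001)
```

An estimate nearly four standard errors from zero is comfortably significant at the 0.1% level, consistent with the result reported above.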
The reason such a significant result can be obtained from so few claims is that the event we are modelling is comparatively rare in this portfolio: roughly two claims on average for each thousand life-years of exposure. Thus, comparatively few additional claims are required amongst a sub-group to provide significant evidence of higher risk.
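The arithmetic behind this can be sketched from the portfolio totals alone. The example below is a crude illustration, not the survival model itself: it ignores age and other risk factors, takes the non-smoker exposure as roughly 110,000 life-years (130,000 less 20,000, an approximation since the total is quoted as "nearly" 130,000), and treats the non-smoker rate as if it were known exactly.

```python
import math

# Portfolio figures quoted in the text.
total_claims, portfolio_exposure = 267, 130_000
smoker_claims, smoker_exposure = 56, 20_000

# Derived non-smoker figures (exposure is approximate, see lead-in).
nonsmoker_claims = total_claims - smoker_claims
nonsmoker_exposure = portfolio_exposure - smoker_exposure

# Crude claim rates per thousand life-years.
overall_rate = 1000 * total_claims / portfolio_exposure
smoker_rate = 1000 * smoker_claims / smoker_exposure
nonsmoker_rate = 1000 * nonsmoker_claims / nonsmoker_exposure

# Claims we would expect from smokers if they claimed at the non-smoker rate.
expected = smoker_exposure * nonsmoker_claims / nonsmoker_exposure

def poisson_upper_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean), summing the lower tail directly."""
    term = math.exp(-mean)   # P(X = 0)
    cdf = 0.0
    for i in range(k):       # accumulate P(X = 0) .. P(X = k - 1)
        cdf += term
        term *= mean / (i + 1)
    return 1.0 - cdf

tail = poisson_upper_tail(smoker_claims, expected)

print(f"overall rate:     {overall_rate:.2f} per thousand life-years")
print(f"smoker rate:      {smoker_rate:.2f} per thousand life-years")
print(f"non-smoker rate:  {nonsmoker_rate:.2f} per thousand life-years")
print(f"expected smoker claims at non-smoker rate: {expected:.1f}")
print(f"observed smoker claims: {smoker_claims}")
print(f"P(at least {smoker_claims} claims by chance): {tail:.4f}")
```

Because the expected count is small, an excess of fewer than twenty claims is already hard to explain by chance, even in this crude comparison.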
Even if you think you have relatively little data, you might be surprised what you can achieve with survival models.