What's in a (file)name?
The upcoming EU General Data Protection Regulation (GDPR) focuses on the potential for personal data exposures to create a risk to the rights of natural persons. The best way to reduce such risk is to minimise the ability to identify individuals from the data you use in your analysis. Thankfully, not all data used for modelling runs the risk of identifying individuals. Group data, such as that used by Longevitas group count survival models, or the grouped death and exposure formats used within the Projections Toolkit service, is not personal data under the terms of the GDPR. Such data stands no risk of identifying individuals. However, individual data used within mortalityrating.com, and within Longevitas individual-level survival models, may, depending on content, be classified as personal data.
There are various technical measures adopted within software to minimise the risk that individuals can be identified. Such mechanisms, including encryption, multi-factor authentication, and pseudonymisation, are all valuable. A more fundamental technique to guard against personal data risk is to remove unnecessary data elements to reduce (preferably to zero!) the number of ways an individual might be traced from the data shared and processed. This might be thought of as a variation on the popular security concept of Need to Know. If a calculation, such as a rating or a survival model, doesn't require a piece of knowledge, then our goal should be to remove that knowledge from the process. How can you avoid combining postcodes and dates of birth? How can you avoid combining names and sensitive codes? Questions such as these were the focus of our previous blog on the latest release of mortalityrating.com, and the Transform on Download feature available since February 2016.
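As a rough illustration of the Need to Know idea, the sketch below strips every field a calculation does not require before the data is shared. It is not the mortalityrating.com or Longevitas implementation; the function name `minimise_records` and the field names are hypothetical, and the choice of "required" fields is an assumption made purely for the example.

```python
from typing import Any, Dict, Iterable, List


def minimise_records(
    records: Iterable[Dict[str, Any]],
    required_fields: Iterable[str],
) -> List[Dict[str, Any]]:
    """Return copies of the records holding only the required fields.

    Anything not explicitly listed (names, postcodes, reference
    numbers and so on) is dropped, so it never leaves the source system.
    """
    required = set(required_fields)
    return [
        {key: value for key, value in record.items() if key in required}
        for record in records
    ]


# Hypothetical example: one member record with more detail than is needed.
members = [
    {
        "name": "A. Person",
        "postcode": "EH1 1AA",
        "date_of_birth": "1950-01-31",
        "gender": "F",
        "pension": 12000,
    }
]

# Assumption for illustration only: the calculation needs just these three fields.
shared = minimise_records(members, ["date_of_birth", "gender", "pension"])
```

Using an explicit list of permitted fields, rather than a list of fields to remove, means that any new column added to the source extract is excluded by default.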
However, we should not forget aspects that are seemingly more mundane. What knowledge is encoded in uploaded file names and file descriptions? Clearly, if we use publicly recognisable references for pension schemes or annuity portfolios, that piece of context may, when combined with other fields in the dataset, make it easier to identify individuals. Identifying the dataset member who is oldest, youngest or has the highest or lowest pension may be made easier by knowing the source of their annuity, and is certainly made easier with knowledge of the organisation paying their defined-benefit pension. For this reason our latest GDPR updates focus on such details in two ways:
- On file upload the system will propose a random, neutral description that can be retained or overtyped.
- The system will discard all knowledge of the original file name and rely only upon the user-supplied description.
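The two behaviours above might be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in our services: the function names `propose_neutral_description` and `register_upload`, the "Dataset" prefix and the shape of the stored record are all hypothetical.

```python
import secrets
from typing import Dict, Optional, Union


def propose_neutral_description(prefix: str = "Dataset") -> str:
    """Suggest a random description that carries no business context."""
    # token_hex(4) yields eight random hexadecimal characters.
    return f"{prefix} {secrets.token_hex(4).upper()}"


def register_upload(
    original_filename: str,
    contents: bytes,
    user_description: Optional[str] = None,
) -> Dict[str, Union[str, int]]:
    """Build the stored record for an uploaded file.

    The original file name is accepted but deliberately never stored;
    only the description (proposed or overtyped) is kept.
    """
    overtyped = (user_description or "").strip()
    description = overtyped if overtyped else propose_neutral_description()
    return {"description": description, "size_bytes": len(contents)}


# Hypothetical usage: the scheme name in the file name is not retained.
record = register_upload("AcmePensionScheme_2018.csv", b"dob,gender\n")
```

Note that a user can still overtype the proposal with an identifying description, so the neutral default reduces the risk rather than removing it.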
These changes are already in place for the latest releases of mortalityrating.com, Longevitas and the Projections Toolkit. Contact us if you need further information.