Privacy and De-Identification

July 26th, 2007 | by Neal Levene

When data is integrated and aggregated, there are additional privacy and security concerns that do not exist for a typical transactional system. When multiple slices of the same data are presented either through ad-hoc queries or formal reported results, mathematical models may sometimes be used to specifically identify an individual from within aggregated results.

A recent article in DM Review, Complying with HIPAA’s Privacy Rule - Tabular Data, addressed this topic extremely well, although from a rather technical, mathematical perspective.

An example helps. Suppose a report on test performance breaks the data out by national origin. If only one or two people of a given national origin appear in the data, it could be possible, even in aggregated and summarized results, to identify a specific individual, thereby violating privacy policies, regulations, or law.

Many studies on de-identifying information start from a sample drawn from a population. Sophisticated mathematical models have been developed to assign the probability of correctly identifying an individual.

On many projects, rows in tabular data are combined into an “other” category when the number of people represented in a report result falls below some minimum (commonly 2 to 5). The DM Review article shows that identification can actually be more likely when table values are higher.
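The "other" category rule can be sketched in a few lines of Python. This is only an illustration: the function name combine_small_cells, the threshold of 5, and the sample counts are all hypothetical and do not come from the DM Review article.

```python
def combine_small_cells(counts, min_cell_size=5, other_label="Other"):
    """Fold every cell smaller than min_cell_size into a single 'Other' cell.

    counts maps a category name to the number of people in that cell.
    The threshold is an assumption; projects typically pick a value from 2 to 5.
    """
    combined = {}
    other_total = 0
    for category, count in counts.items():
        if count < min_cell_size:
            other_total += count          # too small to report on its own
        else:
            combined[category] = count
    if other_total:
        combined[other_label] = combined.get(other_label, 0) + other_total
    return combined


# Hypothetical report counts, for illustration only.
report = {"Group A": 120, "Group B": 45, "Group C": 3, "Group D": 1}
print(combine_small_cells(report))
# {'Group A': 120, 'Group B': 45, 'Other': 4}
```

Note that the combined "Other" cell here still holds only 4 people, so a rule like this does not by itself guarantee that every reported cell meets the minimum.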

As the number of guesses increases for a given cell size or percentage of the population, the chance that someone can correctly identify an individual increases. For example, in a cell size that is 20 percent of the total population, the chance of correctly identifying one individual with one guess is 0.2. With two guesses, the chance of correctly identifying one individual within that same cell size is 0.36 - slightly less than double. With three guesses, the chance is 0.49, four guesses is 0.59 and five guesses is 0.67.
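The figures above are consistent with treating each guess as an independent attempt that succeeds with probability equal to the cell's share of the population, so that the chance of at least one success in k guesses is 1 - (1 - p)^k. That formula is an assumption inferred from the numbers quoted here, not necessarily the exact model used in the article, and the function name below is hypothetical. A minimal sketch:

```python
def chance_of_identification(cell_share, guesses):
    """Chance that at least one of `guesses` independent guesses is correct.

    cell_share is the cell size as a fraction of the total table population.
    Assumes each guess succeeds independently with probability cell_share.
    """
    if not 0.0 <= cell_share <= 1.0:
        raise ValueError("cell_share must be between 0 and 1")
    if guesses < 0:
        raise ValueError("guesses must not be negative")
    return 1.0 - (1.0 - cell_share) ** guesses


for k in range(1, 6):
    print(k, round(chance_of_identification(0.2, k), 2))
# 1 0.2
# 2 0.36
# 3 0.49
# 4 0.59
# 5 0.67
```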

Thus, when safeguarding data to protect against the potential identification of an individual, one must consider both the cell size as a percentage of the total table population and the number of attempts or guesses that could be made to identify an individual.

We recognize that health care organizations need to share critical information with their clients for business purposes. However, there are serious risks associated with unintentional disclosure. The easiest and most efficient ways to protect health care information and mitigate the risk of disclosure are to reduce the amount of information that is released to others and to modify table structures to contain percentages and averages, rather than actual counts or detailed information on individuals.

We recommend that health care organizations routinely review the data they produce to ensure adherence to minimum cell sizes and to check the concentration of values within cells. We also recommend that table categories be created to distribute individuals more evenly across cells so that the probability of identifying any individual is reduced. When constructing tables, it is important to consider potential identification of individuals in small cell sizes as well as large cell sizes.
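A routine review of this kind could look roughly like the sketch below, which flags cells that fall under a minimum size as well as cells that hold a large share of the table population. The function name review_cells, both thresholds (5 people and 20 percent), and the sample table are assumptions chosen for illustration, not recommended values.

```python
def review_cells(counts, min_cell_size=5, max_cell_share=0.2):
    """Return a dict of category -> reason for every cell that needs attention.

    counts maps a category name to the number of people in that cell.
    Both thresholds are illustrative assumptions, not regulatory values.
    """
    total = sum(counts.values())
    findings = {}
    for category, count in counts.items():
        share = count / total if total else 0.0
        if 0 < count < min_cell_size:
            findings[category] = "below minimum cell size"
        elif share > max_cell_share:
            findings[category] = "high concentration"
    return findings


# Hypothetical table of 100 people, for illustration only.
table = {"A": 60, "B": 15, "C": 15, "D": 7, "E": 3}
print(review_cells(table))
# {'A': 'high concentration', 'E': 'below minimum cell size'}
```

Checking both ends of the distribution reflects the point made above: small cells and large cells can each carry identification risk.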

Additionally, table linkages should be reviewed periodically to ensure that information from one table will not be compromised by information contained in another table.

This is a subtle and frequently misunderstood security requirement. The best solutions attack the problem at the data level rather than within the logic of specific queries and analytics. The image below shows the Universal De-Identification Platform being developed at IBM. This is one of many solutions to these kinds of issues.

[Image: deidentification.jpg]
