Statistical Privacy

I just stumbled across an interesting observation by Peter Eckersley, who blogs for the EFF. He points out that, statistically, you can identify an individual with no more than thirty-three data points (strictly speaking, thirty-three bits of information, since 2^33 is about 8.6 billion, more than the world's population). Thirty-three pieces of information… and often far fewer. (Obviously, some pieces of information are more useful than others. Billionaires? Lots. Male billionaires? Many. Male billionaires in the IT industry? A half-dozen or so, from what I can tell. Russian-born male billionaires in the IT industry? Probably just one.)
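The thirty-three figure falls out of simple arithmetic: each yes-or-no fact can, at best, halve the pool of candidates, so singling out one person takes about log2(population) such facts. Here is a minimal Python sketch of that calculation. The population constant is an assumed round figure for 2009, and the function names are made up for illustration.

```python
import math

# Assumption: a rough 2009 world population, used only for illustration.
WORLD_POPULATION = 6_800_000_000


def bits_to_single_out(population: int) -> float:
    """Bits of information needed to pick one individual out of `population`."""
    return math.log2(population)


def yes_no_facts_needed(population: int) -> int:
    """Whole number of perfectly even yes-or-no facts that would suffice."""
    return math.ceil(bits_to_single_out(population))


if __name__ == "__main__":
    print(bits_to_single_out(WORLD_POPULATION))   # a little under 33
    print(yes_no_facts_needed(WORLD_POPULATION))  # 33
```

Real facts rarely split the population evenly, which is why a single rare fact (a Russian-born IT billionaire, say) can do the work of many common ones.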

This has interesting privacy implications in an era of surveillance and data-mining. Consider: under some circumstances, just visiting a website, once, can disclose:

What ISP you use, and where you live, down to the city or area level (two data points);
What browser you use, what operating system you use, and possibly what kind of computer hardware you have (three more data points);
What language(s) you speak (one more data point);
What you’re interested in (one further data point);
How you reached that website (one last data point).

That’s eight data points – roughly a quarter of the maximum required to identify you – just from a webserver log.
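Not all of those eight facts are worth the same. A fact shared by a fraction p of people reveals -log2(p) bits, so the rarer the fact, the more it narrows things down. The Python sketch below adds up the bits for the kinds of facts listed above. Every fraction in it is invented purely for illustration, not measured, and the sum assumes the facts are independent of one another, which real ones usually are not.

```python
import math


def surprisal_bits(fraction: float) -> float:
    """Bits revealed by a fact that is true of `fraction` of the population."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return -math.log2(fraction)


def total_bits(fractions) -> float:
    """Sum of bits across facts, assuming the facts are independent."""
    return sum(surprisal_bits(f) for f in fractions)


def remaining_candidates(population: float, bits: float) -> float:
    """Expected number of people still matching after `bits` of information."""
    return population / 2 ** bits


# Hypothetical fractions: placeholders, NOT real statistics.
HYPOTHETICAL_FACTS = {
    "ISP": 1 / 50,
    "city or area": 1 / 100_000,
    "browser": 1 / 200,
    "operating system": 1 / 100,
    "hardware": 1 / 4,
    "language": 1 / 20,
    "interest": 1 / 10,
    "referring site": 1 / 10,
}

if __name__ == "__main__":
    for name, fraction in HYPOTHETICAL_FACTS.items():
        print(f"{name}: {surprisal_bits(fraction):.1f} bits")
    bits = total_bits(HYPOTHETICAL_FACTS.values())
    print(f"total: {bits:.1f} bits")
    # Assumed round world population, for illustration only.
    print(remaining_candidates(6_800_000_000, bits))
```

With made-up numbers like these, the total can easily pass the roughly 33 bits needed to single out one person, but that says more about the invented fractions and the independence assumption than about any real webserver log.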

Thing is, it’s not inconceivable that, under many circumstances, that information alone is enough to identify you, beyond a reasonable doubt, to anyone with access to the right data. (For example, suppose someone downloads something they shouldn’t, using a public wifi connection at a public library in, oh, Haverhill, MA, at 11am on a Wednesday during the school year, using the Konqueror web browser on a 64-bit computer running SuSE Linux, and following a link from, say, Yahoo webmail. If you believe some of the more outlandish stories about Big Brother and the breadth and depth of government data-mining, The Man might just be able to figure out who that someone is, by working out who in or around Haverhill uses SuSE on a 64-bit laptop, has a Yahoo email account, and has nothing better to do at 11am on a Wednesday than download illegal material.)

That’s kind of a worst-case scenario, but it demonstrates how little information really is necessary to identify someone, at least in theory. (And also why data-mining is slightly scary.) I don’t know how well statistical identification would hold up in court, but it’s something to think about for would-be dissidents, whistleblowers, or document leakers, especially in those parts of the world where due process only happens to other people. The fewer accurate data points you give away, and the more misleading ones you can provide, the better.

Something to think about, anyway.

I could write more, but it’s a fairly nice day, I spent all day at work in front of a computer, and my Volkswagen Jetta needs washing, so…

Published in: Geekiness, General | on April 2nd, 2009
