"How to Break Anonymity of the Netflix Prize Dataset"

Frequently Asked Questions

Last update: Nov 28, 2007

The purpose of this page is to answer frequently asked questions and to dispel common misconceptions about our paper "How to Break Anonymity of the Netflix Prize Dataset" (latest version: Mar 7, 2008).

- Arvind Narayanan and Vitaly Shmatikov

Q: Nobody cares about privacy of movie ratings.

Clearly some people do, which is why the Video Privacy Protection Act of 1988 exists. It provides one of the strongest consumer privacy protections in federal law -- stronger, for example, than those for health records under HIPAA.

Q: Even if you know a few movies someone liked or disliked, you can only find a record which looks like that person in the Netflix Prize dataset. You cannot verify that you found the right record or person.

Our algorithm is specifically designed to minimize the chance of a false positive, that is, identifying the "wrong" record in the dataset.

We developed a measure called eccentricity to quantify how far the best-matching record stands out from the second-best match, given the information we have about the person. For instance, we were able to match an IMDb user so well with a Netflix record that the second-best match was 28 standard deviations away. It is exceedingly improbable that this is a spurious match.

Furthermore, even if the algorithm finds the "wrong" record, with high probability this record is very similar to the right record, so it still tells us a lot about the person we are looking for. This is mathematically quantifiable; see the theorems in the paper.
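A minimal sketch of the idea behind eccentricity, assuming it is computed as the gap between the best and second-best match scores divided by the standard deviation of all candidate scores. The function name, the input format, and the choice of the population standard deviation are illustrative assumptions; the paper gives the exact definition.

```python
from statistics import pstdev


def eccentricity(scores):
    """Gap between the best and second-best score, measured in
    standard deviations of all candidate scores (assumed form)."""
    if len(scores) < 2:
        raise ValueError("need at least two candidate scores")
    ordered = sorted(scores, reverse=True)
    sigma = pstdev(scores)
    if sigma == 0:
        # All candidates score identically, so none stands out.
        return 0.0
    return (ordered[0] - ordered[1]) / sigma
```

A large value means the top candidate is far ahead of every other record; a value near zero means several records fit about equally well.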

Q: Netflix published only a small part of its dataset. Even if you find a matching record, it is probably a false positive.

Our algorithm accounts for the fact that only a small part of the dataset was released. If it concludes that the dataset contains no matching record, it says so, and if it does find a match, the probability is very high that this is not a false positive (see Fig. 5 in the paper).
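The "no match" behavior can be sketched as a threshold test on how much the top candidate stands out from the runner-up. The `best_match` name, the `threshold` parameter, and the score dictionary are hypothetical, and the scores themselves are assumed to come from some similarity function not shown here.

```python
from statistics import pstdev


def best_match(scores, threshold):
    """scores: dict mapping a record id to its similarity score.
    Returns the top record id, or None when no record stands out
    by at least `threshold` standard deviations (assumed rule)."""
    if len(scores) < 2:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_id, top_score = ranked[0]
    second_score = ranked[1][1]
    sigma = pstdev(list(scores.values()))
    if sigma == 0 or (top_score - second_score) / sigma < threshold:
        return None  # report "no match" rather than guess
    return top_id
```

Returning `None` instead of the nearest record is what keeps a partial dataset from turning every query into a false positive.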

Q: You are not really de-anonymizing anyone, because your algorithm cannot link identities to anonymous Netflix records.

Here is how our algorithm works. If you already know someone's identity and a few of the movies this person liked or disliked, you can use the Netflix dataset to find their entire movie viewing history prior to 2005 (provided, of course, they were a Netflix subscriber and their record was one of those released as part of the dataset).

Q: You are not learning the person's identity, only the movies that he liked or disliked. This information is useless if you don't know who the person is.

See the answer to the previous question.

Q: Your algorithm only works if the person has accounts on both IMDb and Netflix and rates the same movies on both.

First, our algorithm is not specific to IMDb. You can run our algorithm for any person if they were a Netflix subscriber back in 2005 and you know a little bit about their movie viewing history.

Second, the IMDb record does not have to be the same as the Netflix record. Even a small intersection between the two records is sufficient for our algorithm to succeed with high statistical confidence.

Q: Why would a user who publicly rates movies on IMDb care about privacy of their Netflix record?

First, our algorithm works for any Netflix subscriber, not just those who also use IMDb. If their record is in the dataset and you know a few of the movies the subscriber watched prior to 2005, you can identify their record.

Second, in our experiments, Netflix records contain many more movies than the public IMDb records, revealing much more information about the user than is available from IMDb.

Q: You are just linking an IMDb username to their Netflix record. An IMDb username is not an identity.

On IMDb, users do not view their profiles as anonymous and there is no expectation of anonymity. As IMDb's user profile page says, "Below, you can learn basic information about the user, such as their website address or personal biography (if they've chosen to enter either.) You can also see how long the user has been registered with the message boards, the last time they were active, and a log of their most recent posts. Most importantly, this profile page allows you to interact with the user three different ways..."

Q: You cannot infer anything about a person just because they gave certain ratings to a political or gay-themed movie.

Even if we can't, their boss, colleague, relative or significant other might.

Q: Few of the IMDb users are Netflix subscribers; you cannot uncover the private movie ratings of an average IMDb user.

Certainly true. IMDb, however, is only one source of external data. It is the major one today, but what about 5 years from now? 10 years? The Netflix dataset is always going to be out there; the genie cannot be put back into the bottle.

Q: All you are doing is predicting unknown movie ratings from known movie ratings. Of course you can do this; that's what the Netflix Prize competition is all about!

The two situations are not analogous. We show that one can link an anonymous Netflix record to external, public data not in the dataset, such as public IMDb ratings, which are associated with a person's identity.

Q: Your algorithm works only for very obscure movies.

Incorrect. Our algorithm works better with obscure movies, but the effect is quantifiable and not huge (see Fig. 11 in the paper).
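One way to see why obscure movies help without being required is a score that weights each matching movie by its rarity. The sketch below assumes a weight of the form 1/log(number of raters) and a simple rating tolerance; the function names, the `popularity` table, and the `tolerance` parameter are hypothetical, and the paper gives the exact scoring function.

```python
import math


def movie_weight(num_raters):
    # Fewer raters gives a larger weight (assumed 1/log form).
    # Clamp at 2 so the logarithm stays positive.
    return 1.0 / math.log(max(num_raters, 2))


def weighted_score(aux, record, popularity, tolerance=1):
    """aux: movie -> rating known about the target person.
    record: movie -> rating in one anonymous record.
    popularity: movie -> number of subscribers who rated it."""
    score = 0.0
    for movie, known_rating in aux.items():
        rating = record.get(movie)
        if rating is not None and abs(rating - known_rating) <= tolerance:
            score += movie_weight(popularity[movie])
    return score
```

Popular movies still contribute to the score, only less per match, and movies missing from the record are simply skipped, so the known ratings need to overlap the record only partially.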

Q: If you don't publicly rate movies on IMDb and similar forums, there is nothing to worry about.

Indeed. By the same token, you should never mention any movies you watched prior to 2005 on a public blog or website. Everybody who was a Netflix subscriber prior to 2005 should refrain from these activities if they care about the privacy of their movie viewing history.

We do not think this is a feasible privacy policy.