Ashwin Kumar V Machanavajjhala
Research Publications CV Cornell DB Group
Ph. D. Candidate with Prof. Johannes Gehrke
Department of Computer Science
Cornell University
4142 Upson Hall
Ithaca NY 14853
607 351 0477
mvnak(at)cs(dot)cornell(dot)edu
I expect to complete my Ph.D. by August 2008. I am currently looking for tenure track positions in academia. My primary area of interest is databases. My specific interests are in data privacy.
L-Diversity: Privacy Beyond k-Anonymity
Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer,
Muthuramakrishnan Venkitasubramaniam, In ICDE 2006 and TKDD 2007. [PDF:
Conf Version], [PDF: Journal Version]
On the
Efficiency of Checking Perfect Privacy
Ashwin Machanavajjhala, Johannes Gehrke, In PODS 2006. [PDF]
Privacy: From Theory to Practice on the Map [NEW]
Ashwin Machanavajjhala, Daniel Kifer, John Abowd, Johannes
Gehrke, Lars Vilhuber, In ICDE 2008. [PDF]
Privacy in Data Publishing:
Access to real life data is an important requirement for quality
computer science research. However, real data collected by various agencies
and corporations is not shared due to privacy concerns. Most of this data is
in the form of a relation; each tuple corresponds to a unique individual,
and some of the attributes in the table are marked sensitive. In the recent
years, anonymous data has been published by coarsening the non-sensitive
attributes such that a condition known as k-anonymity is satisfied. In a
k-anonymized dataset, each record is indistinguishable from at least (k
-1)
other records with respect to the non-sensitive attributes. This disallows
an attacker from re-identifying an individual in the published dataset.
I demonstrated, with simple attacks, that a k-anonymized dataset has some
subtle, yet severe privacy problems. First, an adversary can discover
sensitive information when there is little diversity in values of the
sensitive attributes. Second, adversaries often have background knowledge,
and k-anonymity does not guarantee privacy against such adversaries. I
initiated a formal study of adversarial background knowledge, and proposed a
novel and powerful privacy definition called L-Diversity. To
ensure L-diversity, every group of tuples that share the same
non-sensitive information should be associated with at least ` roughly
equally distributed sensitive values. This simple criterion guarantees
privacy even against adversaries who have (the most disclosing) (L-2)
different pieces of background knowledge of the form "Bob does not have
diabetes." In addition to having a formal mathematical foundation, L-diversity
is practical and can be efficiently implemented. L-Diversity has had
a great impact on the database privacy community, and has already been cited
by over 20 publications in the major database conferences over the past two
years.
In a subsequent paper, presented in ICDE 2007, I helped study privacy in the
face of more complex adversarial background knowledge. We used propositional
formulas over the tuples in the table to express adversarial back-ground
knowledge, and showed that any adversarial knowledge can be divided into a
conjunction of implications of the form, "If Bruce has the Flu, then
his wife Carla also has the Flu." Though reasoning about the
information disclosed by an arbitrary set of implications was #P-hard, we
proposed efficient algorithms to compute the worst L implications
that lead to the largest information disclosure, and proposed efficient
anonymization algorithms that guard against such adversaries.
Privacy on the Map.
In a recent collaboration with Prof. John Abowd, I studied the privacy
requirements of OnTheMap, a real world privacy application. This
application plots commute patterns of workers on the US Map to study
workforce indicators. In this application, while worker destinations (i.e.,
workplaces) are publicly known, origins (i.e., residences) must be kept
secret. This is done using the synthetic data generation technique, which is
very popular in the statistical disclosure limitation literature. Synthetic
data points are drawn from a statistical model built on a noise infused
version of the data. This synthetic population is then published. However,
there has been little research on deriving formal guarantees of privacy for
such an approach. Moreover, interesting real data is usually very sparse;
this is an important open problem that has been overlooked by research in
data publishing. Most previous work deals with data where the number of
individuals is typically larger than the size of the sensitive attribute
domain. In our application, however, only tens or hundreds of workers
commute to each destination from roughly 8 million origins on the U.S. map.
When applied to such data, current provably private algorithms create
spurious commute patterns by adding noise to all the 8 million origins. A
paper that I will present at ICDE 2008 contains the first rigorous
theoretical privacy analysis of a synthetic data generation algorithm,
sufficient conditions for the privacy of this technique, and the first
solutions that ensure privacy in sparse data.
Efficient Enforcement of Complex Privacy Policies.
In traditional hospital and military databases, and in futuristic web
data sharing scenarios, different entities are authorized to access
different parts of the data. Hence, it is crucial that an unauthorized
entity does not learn any information that it is not allowed to access. One
method to formally specify such (possibly complex) privacy policies is to
express the sensitive information as conjunctive queries (a subset of SQL)
and enforce perfect privacy, a very strong notion of privacy that ensures
that a query is answered by the database only if it does not disclose any
information about the answer to the sensitive query. However, checking
whether one query leaks any information about another query is known to be
intractable (hard for the second level of the polynomial hierarchy).
In a paper presented at PODS 2006, I showed that enforcing
perfect privacy is tractable, and indeed very efficient, for large and
interesting subclasses of conjunctive queries. Instead of giving different
arguments for query classes of varying complexities, I made a connection
between perfect privacy and the well known database problem of checking
query containment, thus relating the tractability of enforcing perfect
privacy to the tractability of query containment.
Outsourcing Databases.
There is a recent buzz in both academic research and the industry
regarding third-party-hosted software services. One such application that
various corporations are jostling to provide is the online "office
suite," where documents can be stored online and concurrently edited by
different users. This is attractive since users can collaborate with little
or no communication external to the server. However, users might not adopt
this service without an integrity guarantee that the server has not
maliciously altered the files.
I helped study the possibility of checking server integrity
in the face of concurrent edits from different users with little or no
communication between users. Under reasonable requirements, we showed that
server integrity cannot be guaranteed in the absence of external
communication, and designed efficient algorithms that ensure server
integrity with minimal external communication. These results were presented
at STD3S 2006.
Peer-to-Peer Data Management.
Peer-to-peer file sharing has received considerable attention in the
distributed systems community. In order to balance the load (number of
les served by a peer) across peers, search is usually implemented using
hash index structures. As a result, such systems only expose an equality
query interface. Web-scale data management solutions require richer query
interfaces, and load balance is a crucial challenge in supporting even range
queries. As part of the PEPPER project, I designed new load balancing
algorithms while supporting range queries. My algorithms provably bounded
the ratio between the most loaded and least loaded peers by 2 + (
> 0, a real number), beating the previous best algorithm's ratio of 4:6.
These algorithms maintain the load balance with an O(1) cost amortized over
the updates to the system.
I was also a major contributor to the design and
implementation of P-Ring, a complete peer-to-peer system for storage and
indexing data. The system supports both equality and range queries, is
fault-tolerant, provides logarithmic search performance, guarantees
correctness and availability, and provably ensures load-balancing as
described above. The system has been deployed on PlanetLab, and the design
is currently being commercialized by ATC, NY under a government contract for
distributed management of emergency information. This system was presented
at WWW 2004, SIGMOD 2004, 2007.
Secure Distributed Communication.
Secure and reliable communication over networks is an important problem,
since some of the routers in the network may be compromised (potentially in
a correlated fashion) and may maliciously alter or drop packets that are
routed through them. Traditionally, secure protocols have been proposed
where at most t routers are compromised. In reality, however, some routers
are more susceptible than others and the above threshold fault model
precludes using such information.
As part of my undergraduate thesis work, I studied the possibility of secure communication (both synchronous and asynchronous) in a non-threshold fault model. I proved necessary and sufficient conditions on the connectivity of a network that allow secure communication, and proposed communication protocols on networks where secure communication is possible. These results were presented in PODC 2002 and Asiacrypt 2002.
L-Diversity: Privacy beyond k-Anonymity [PDF]
Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, Muthuramakrishnan Venkitasubramaniam, In TKDD, vol 1, no. 1, 2007.A Security Assurance Framework for Component Based Software Development
M. V. N. Ashwin Kumar, Arun K. Singh, Ramesh Babu, In Informatica, vol 25, no. 4, 2001.
Privacy: Theory meets Practice on the Map [PDF]
Ashwin Machanavajjhala, Daniel Kifer, John Abowd, Johannes Gehrke, Lars Vilhuber, In Proc. ICDE 2008.P-Ring: An Effcient and Robust P2P Range Index Structure [PDF]
Adina Crainiceanu, Prakash Linga, Ashwin Machanavajjhala, Johannes Gehrke, Jayavel Shanmugasundaram, In Proc. SIGMOD 2007.Worst Case Background Knowledge [PDF]
David Martin, Daniel Kifer, Ashwin Machanavajjhala, Johannes Gehrke, Joseph Halpern, In Proc. ICDE 2007.On the Efficiency of Checking Perfect Privacy [PDF]
Ashwin Machanavajjhala, Johannes Gehrke, In Proc. ACM PODS 2006.L-Diversity: Privacy beyond k-Anonymity [PDF]
Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, Muthuramakrishnan Venkitasubramaniam, In Proc. ICDE 2006.On Perfectly Secure Communication over Arbitrary Networks [PDF]
M. V. N. Ashwin Kumar, Pranava. R. Goundan, K. Srinathan, C. Pandu Rangan, In Proc. ACM PODC 2002.Asynchronous Secure Communication tolerating Mixed Adversaries [PDF]
K. Srinathan, M. V. N. Ashwin Kumar, C. Pandu Rangan, In Proc. ASIACRYPT 2002, LNCS, Springer Verlag.Asynchronous Perfectly Secure Computation tolerating Generalized Adversaries [PDF]
M. V. N. Ashwin Kumar, K. Srinathan, C. Pandu Rangan, In Proc. ACISP 2002, LNCS, Springer Verlag.Theory of Equal-Flows in Networks [PDF]
Pranava. R. Goundan, K. Srinathan, M. V. N. Ashwin Kumar, R. Nandakumar, C. Pandu Rangan, In Proc. COCOON-2002, LNCS, Springer Verlag.
Trusted CVS
Muthuramakrishnan Venkitasubramaniam, Ashwin Machanavajjhala, David Martin, Johannes Gehrke, In ICDE Workshops - STD3S 2006.Beyond k-Anonymity: New Schemes for Privacy Preserving Data Publishing
Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, (Poster Paper) Best Visionary Poster In DB/IR Day, April, 2005.An Indexing Framework for P2P Systems
Adina Crainiceanu, Prakash Linga, Ashwin Machanavajjhala, Johannes Gehrke, Jayavel Shanmugasundaram, (Demo Paper) In Proc. ACM SIGMOD 2004.A Storage and Indexing Framework for P2P Systems
Adina Crainiceanu, Prakash Linga, Ashwin Machanavajjhala, Johannes Gehrke, Jayavel Shanmugasundaram, (Poster Paper) In Proc. WWW 2004.
Randomization Methods for ensuring Data Privacy
Ashwin Machanavajjhala, Johannes Gehrke, Encyclopedia of Database Systems, Springer, 2008.