Causal Databases - Youtopia
Database users increasingly expect to work with, create, and query not just data, but rich and expressive metadata as well. One particular flavor of metadata that is increasingly important in real-world applications is causal metadata. This describes the causal structure, provenance and other causal constraints associated with the data.
In the Youtopia project, we are laying the foundations for a new breed of databases -- causal databases. Causal databases can model causal information explicitly, and allow for queries regarding causality and explanations, as well as supporting emerging applications that make use of causality information in novel ways. This is a joint project with the University of Washington Database group.
- Research Topics
- Modeling Causality
- Causality and Coordination on the Social Web (Entangled Queries)
- Quantum Databases
- Community Data Integration
- Disclosure Control and Explainable Security
Modeling and understanding causality
We are pursuing several foundational projects on causality. The first involves examining more carefully how to model causality in real-world situations. As we now understand, this involves understanding the interplay between the effects of interventions in the world (modeled by the structural equations) and the expectations of agents (what they consider to be "normal" or "typical"). These issues arise in particular when causality is applied in the law. Much of this work is joint with Christopher Hitchcock; see "Actual causation and the art of modeling" and "Graded causation and defaults".
Another line of work explores the connection between the standard approach to modeling causality in the philosophy literature, which involves possible worlds models called counterfactual structures, and structural models approach that is the basis of work being done under this grant. In "From causal models to counterfactual structures" it is shown that, despite well-known claim to the contrary in the literature, these two approaches are, in general, inequivalent in expressive power, although in the special case most often considered in the literature, where the causal structure is acyclic, the structural equations approach can be viewed as a special case of the counterfactual structures approach.
Yet another lines of work involve understanding the extent to which causality is transitive and an improved definition of causality; see "Sufficient conditions for causality to be transitive" and A modification of the Halpern-Pearl definition of causality.
Much of this work is discussed in Halpern's recent book Actual Causality, published by MIT Press.
Causality and coordination on the Social Web
Database users frequently want to explore causal information about data and queries. However, they also increasingly want to accomplish new tasks where they themselves create causality-related metadata. Nowhere is this more apparent than on the Social Web, where new data-driven applications are constantly emerging.
We focus on supporting one such emerging application - online coordination. We have developed entangled queries, a declarative, high-level data-centric abstraction that allows users to specify causal coordination constraints. When the system evaluates these queries, it ensures that all causality constraints are satisfied; thus, all users are guaranteed that the coordination has occurred as desired.
In the last decade, the amount of data on the Internet has grown at a staggering rate. The rise of Web 2.0 has fueled that growth by providing platforms that enable people to create, upload, and share content very easily. However, people do much more than just produce and consume data; they are beginning to center their life around the Web and use social applications for very complex data-driven tasks, many of which involve coordination.
As a concrete example, consider the use case of online travel planning. All of us have faced the scenario where we wished to coordinate travel plans with family, friends or colleagues. How does this coordination happen today? Typically, it starts with a phone or email conversation to decide on the itinerary. Then, one person books tickets for everyone, and there is another round of emails to sort out the finances. Alternately, everybody arranges to book at about the same time, hoping that the airline seats do not fill up in the meantime. Clearly, there has to be a better way.
Entangled queries are a declarative abstraction that allows users to specify causal constraints on coordination. For example, suppose Donald wants to travel to Los Angeles on the same flight as his friend Minnie. Rather than communicating out-of-band, he can use a simple visual interface to specify his coordination request:
The system translates his request into an SQL-like query specification of the form:
SELECT 'Donald', fno INTO ANSWER Booking WHERE fno IN (SELECT flightno FROM Flight WHERE dest="LA") AND ('Minnie', fno) IN ANSWER Booking CHOOSE 1
This entangled query is a request to learn the number of a single flight to LA on which Minnie also plans to travel. Assuming Minnie also wants to coordinate with Donald, she issues a symmetric query with the strings 'Donald' and 'Minnie' exchanged. The system receives these queries for evaluation, recognizes that Donald and Minnie want to coordinate, chooses a flight and returns the same flight number to both friends. Donald and Minnie's queries may be embedded within entangled transactions - larger unites of code whose execution remains isolated except for the single information exchange about the flight number.
Our existing publications on this topic outline our vision for D3C, introduce entangled queries in detail, explain how these queries can be used in real-world applications, and present the entangled transaction abstraction in depth.
Many OLTP applications manage allocation of some commodity or resource such as plane tickets, places in courses, or slots in meeting calendars. In many cases the actual assignment of resources has flexibility; for example, a traveler may want to sit in an aisle seat but does not care about the exact row. This is information the system should be able to use for optimal seat allocation. However, without ad-hoc code, there is currently no clean way for users to specify their "don't cares" or for a system to utilize this information even if it were available.
The quantum database project introduces a new database abstraction that makes resources a first-class citizen in the system. A quantum database processes and commits transactions that request resources without finalizing the resource assignment; the database is in a "quantum state" where it is guaranteed that resources for committed transactions are available, although their exact values are not fixed until necessary. Thus, quantum databases separate a transaction's commit from true transaction execution, with the former often occurring much earlier than the latter. The maintenance of a quantum state allows our system to enforce a more flexible notion of serializability without sacrificing data consistency. In our causally-based semantic serializability model, transactions are ordered based on true dependencies rather than purely on commit order.
Community Data Integration
Communities everywhere on the Web want to share, store and query data. Their motivations for data sharing are very diverse, as is the format of the data involved. However, community data sharing scenarios have in common many high-level properties that translate into design desiderata for Community Data Integration (CDI) systems. First, a CDI system must enable best-effort cooperation among community members with respect to maintenance of the data and metadata. Next, it must manage disagreement regarding the data and schema or other metadata. Finally, it must maximize data utility. We are building a system to address these desiderata and enable community data sharing in arbitrary settings.
Disclosure Control and Explainable Security
Companies and organizations collect and use vast troves of sensitive user data whose release must be carefully controlled. In practice, the access policies that govern this data are often fine-grained, complex, poorly documented, and difficult to reason about. As a result, principals frequently request and are granted access to data they never use. The problem is particularly salient in the context of app ecosystems, in which users store personal data on platforms such as Facebook or smartphones and APIs enable third-party applications (apps) to utilize that data.
In this project, we have developed a new formal model of data disclosure, which we present in our SIGMOD 2013 paper. In contrast with previous solutions, our model is data-derived and semantically meaningful. Information disclosure is modeled in terms of a set of distinguished security views. Each query is labeled with the precise set of security views that is needed to answer it, and these labels drive policy decisions.
To encourage developers and administrators to use security mechanisms more effectively, we have built on the above foundation to create a security model in which all security decisions are formally explainable. We introduce this model in our SIGMOD 2014 paper. Whether a query is accepted or denied, the system returns a concise yet formal explanation which can allow the issuer to reformulate a rejected query or adjust his/her security credentials. Our approach has a strong formal foundation based on previously unexplored connections between disclosure lattices and policy algebras. We build on this foundation and implement a disclosure control system that handles a wide variety of real SQL queries and can accommodate complex policy constraints. The system has been released as open source software and is available here.
- Gabriel Bender
- Johannes Gehrke
- Nitin Gupta
- Joseph Halpern
- Christopher Hitchcock
- Christoph Koch
- Lucja Kot
- Milos Nikolic
- Sudip Roy
This research has been supported by the National Science Foundation under Grants IIS-0534404 and IIS-0911036, by a Google Research Award, by New York State Science Technology and Academic Research under agreement number C050061, and by the iAd Project funded by the Research Council of Norway. This research has also been supported by a Google Research Award, and we are very thankful to Google for their support.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.