Causal Databases - Youtopia

Database users increasingly expect to work with, create, and query not just data, but rich and expressive metadata as well. One particular flavor of metadata that is increasingly important in real-world applications is causal metadata. This describes the causal structure, provenance and other causal constraints associated with the data.

In the Youtopia project, we are laying the foundations for a new breed of databases -- causal databases. Causal databases can model causal information explicitly, and allow for queries regarding causality and explanations, as well as supporting emerging applications that make use of causality information in novel ways. This is a joint project with the University of Washington Database group.

Modeling and understanding causality

We are pursuing several foundational projects on causality. The first involves examining more carefully how to model causality in real-world situations. As we now understand, this involves understanding the interplay between the effects of interventions in the world (modeled by the structural equations) and the expectations of agents (what they consider to be "normal" or "typical"). These issues arise in particular when causality is applied in the law. Much of this work is joint with Christopher Hitchcock; see "Actual causation and the art of modeling" and "Graded causation and defaults".

Another line of work explores the connection between the standard approach to modeling causality in the philosophy literature, which involves possible worlds models called counterfactual structures, and structural models approach that is the basis of work being done under this grant. In "From causal models to counterfactual structures" it is shown that, despite well-known claim to the contrary in the literature, these two approaches are, in general, inequivalent in expressive power, although in the special case most often considered in the literature, where the causal structure is acyclic, the structural equations approach can be viewed as a special case of the counterfactual structures approach.

Yet another lines of work involve understanding the extent to which causality is transitive and an improved definition of causality; see "Sufficient conditions for causality to be transitive" and A modification of the Halpern-Pearl definition of causality.

Much of this work is discussed in Halpern's recent book Actual Causality, published by MIT Press.

Causality and coordination on the Social Web

Database users frequently want to explore causal information about data and queries. However, they also increasingly want to accomplish new tasks where they themselves create causality-related metadata. Nowhere is this more apparent than on the Social Web, where new data-driven applications are constantly emerging.

We focus on supporting one such emerging application - online coordination. We have developed entangled queries, a declarative, high-level data-centric abstraction that allows users to specify causal coordination constraints. When the system evaluates these queries, it ensures that all causality constraints are satisfied; thus, all users are guaranteed that the coordination has occurred as desired.

Online coordination

In the last decade, the amount of data on the Internet has grown at a staggering rate. The rise of Web 2.0 has fueled that growth by providing platforms that enable people to create, upload, and share content very easily. However, people do much more than just produce and consume data; they are beginning to center their life around the Web and use social applications for very complex data-driven tasks, many of which involve coordination.

As a concrete example, consider the use case of online travel planning. All of us have faced the scenario where we wished to coordinate travel plans with family, friends or colleagues. How does this coordination happen today? Typically, it starts with a phone or email conversation to decide on the itinerary. Then, one person books tickets for everyone, and there is another round of emails to sort out the finances. Alternately, everybody arranges to book at about the same time, hoping that the airline seats do not fill up in the meantime. Clearly, there has to be a better way.

Entangled queries

Entangled queries are a declarative abstraction that allows users to specify causal constraints on coordination. For example, suppose Donald wants to travel to Los Angeles on the same flight as his friend Minnie. Rather than communicating out-of-band, he can use a simple visual interface to specify his coordination request:

Visual interface for coordination

The system translates his request into an SQL-like query specification of the form:

SELECT 'Donald', fno INTO ANSWER Booking
fno IN (SELECT flightno FROM Flight WHERE dest="LA")
AND ('Minnie', fno) IN ANSWER Booking

This entangled query is a request to learn the number of a single flight to LA on which Minnie also plans to travel. Assuming Minnie also wants to coordinate with Donald, she issues a symmetric query with the strings 'Donald' and 'Minnie' exchanged. The system receives these queries for evaluation, recognizes that Donald and Minnie want to coordinate, chooses a flight and returns the same flight number to both friends. Donald and Minnie's queries may be embedded within entangled transactions - larger unites of code whose execution remains isolated except for the single information exchange about the flight number.

Our existing publications on this topic outline our vision for D3C, introduce entangled queries in detail, explain how these queries can be used in real-world applications, and present the entangled transaction abstraction in depth.

Quantum Databases

Many OLTP applications manage allocation of some commodity or resource such as plane tickets, places in courses, or slots in meeting calendars. In many cases the actual assignment of resources has flexibility; for example, a traveler may want to sit in an aisle seat but does not care about the exact row. This is information the system should be able to use for optimal seat allocation. However, without ad-hoc code, there is currently no clean way for users to specify their "don't cares" or for a system to utilize this information even if it were available.

The quantum database project introduces a new database abstraction that makes resources a first-class citizen in the system. A quantum database processes and commits transactions that request resources without finalizing the resource assignment; the database is in a "quantum state" where it is guaranteed that resources for committed transactions are available, although their exact values are not fixed until necessary. Thus, quantum databases separate a transaction's commit from true transaction execution, with the former often occurring much earlier than the latter. The maintenance of a quantum state allows our system to enforce a more flexible notion of serializability without sacrificing data consistency. In our causally-based semantic serializability model, transactions are ordered based on true dependencies rather than purely on commit order.

We introduce the quantum database abstraction in our CIDR 2013 paper; experimental repeatability materials available here.

Community Data Integration

Communities everywhere on the Web want to share, store and query data. Their motivations for data sharing are very diverse, as is the format of the data involved. However, community data sharing scenarios have in common many high-level properties that translate into design desiderata for Community Data Integration (CDI) systems. First, a CDI system must enable best-effort cooperation among community members with respect to maintenance of the data and metadata. Next, it must manage disagreement regarding the data and schema or other metadata. Finally, it must maximize data utility. We are building a system to address these desiderata and enable community data sharing in arbitrary settings.

Disclosure Control and Explainable Security

Companies and organizations collect and use vast troves of sensitive user data whose release must be carefully controlled. In practice, the access policies that govern this data are often fine-grained, complex, poorly documented, and difficult to reason about. As a result, principals frequently request and are granted access to data they never use. The problem is particularly salient in the context of app ecosystems, in which users store personal data on platforms such as Facebook or smartphones and APIs enable third-party applications (apps) to utilize that data.

In this project, we have developed a new formal model of data disclosure, which we present in our SIGMOD 2013 paper. In contrast with previous solutions, our model is data-derived and semantically meaningful. Information disclosure is modeled in terms of a set of distinguished security views. Each query is labeled with the precise set of security views that is needed to answer it, and these labels drive policy decisions.

To encourage developers and administrators to use security mechanisms more effectively, we have built on the above foundation to create a security model in which all security decisions are formally explainable. We introduce this model in our SIGMOD 2014 paper. Whether a query is accepted or denied, the system returns a concise yet formal explanation which can allow the issuer to reformulate a rejected query or adjust his/her security credentials. Our approach has a strong formal foundation based on previously unexplored connections between disclosure lattices and policy algebras. We build on this foundation and implement a disclosure control system that handles a wide variety of real SQL queries and can accommodate complex policy constraints. The system has been released as open source software and is available here.



  • Joseph Y. Halpern, Actual Causality, MIT Press, 2016.
  • Joseph Y. Halpern, Sufficient conditions for causality to be transitive, 83:2, 2016, 213--226, Phiosophy of Science.
  • Joseph Y. Halpern, Appropriate causal models and stability of causation, Review of Symbolic Logic 9:1, 2016, pp. 76-102.
  • Joseph Y. Halpern and Christopher Hitchock, Graded causation and defaults, British Journal for the Philosophy of Science 66:2, 2015, pp. 413--457.
  • Bailu Ding, Lucja Kot, Alan Demers, Johannes Gehrke (2015). Centiman: Elastic, High Performance Optimistic Concurrency Control by Watermarking. ACM Symposium on Cloud Computing 2015 (SoCC'15).
  • Joseph Y. Halpern, Cause, responsibility, and blame: a structural-model approach, Law, Probability, and Risk 14:2, 2015, pp. 91--118.
  • Tobias Gerstenberg, Joseph Y. Halpern, and Joshua B. Tenenbaum, Responsibility judgments in voting scenarios, Proceedings of the 37th Annual Conference of the Cognitive Science Society (CogSci 2015, 2015, pp. 788--793.
  • Joseph Y. Halpern, A modification of the Halpern-Pearl definition of causality Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015), 2015, pp. 3022-3033.
  • Sudip Roy, Lucja Kot, Gabriel Bender, Bailu Ding, Hossein Hojjat, Christoph Koch, Nate Foster, Johannes Gehrke. The Homeostasis Protocol: Avoiding Transaction Coordination Through Program Analysis. SIGMOD 2015.
  • H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, Cyrus Shahabi (2014). Big data and its technical challenges. Communications of the ACM. 57 (7).
  • Gabriel Bender, Lucja Kot, Johannes Gehrke. Explainable Security for Relational Databases. SIGMOD 2014. (electronic appendix) (source code)
  • Gadi Aleksandrowicz, Hana Chockler, Joseph Y. Halpern, and Alexander Ivrii, The computational complexity of structure-based causality, AAAI-14 (Proceedings of the Twenty-Eighth National Conference on Artificial Intelligence), 2014, pp. 974-980.
  • Tao Zou, Ronan Le Bras, Marcos Vaz Salles, Alan Demers, Johannes Gehrke, ClouDiA: A Deployment Advisor for Public Clouds. Proceedings of the VLDB Endowment. 6 (2), 2013, pp. 121-132.
  • Gabriel Bender, Lucja Kot, Johannes Gehrke, Christoph Koch. Fine-Grained Disclosure Control for App Ecosystems. SIGMOD 2013.
  • Joseph Y. Halpern and Christopher Hitchcock, Compact representations of extended causal models, Cognitive Science 37:6, 2013, pp. 986--1010 (with C. Hitchcock).
  • Sudip Roy, Lucja Kot, Christoph Koch. Quantum Databases. CIDR 2013. (Experimental repeatability materials here).
  • Cheng Li, Daniel Porto, Allen Clement, Johannes Gehrke, Nuno Preguica, and Rodrigo Rodrigues. Making Geo-Replicated Systems Fast if Possible, Consistent when Necessary. 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12). 2012,
  • Konstantinos Mamouras, Sigal Oren, Lior Seeman, Lucja Kot, Johannes Gehrke. The Complexity of Social Coordination. VLDB 2012.
  • Yin Lou, Rich Caruana, Johannes Gehrke (2013). Intelligible models for classification and regression. Proceedings of the the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12.
  • Nitin Gupta, Milos Nikolic, Sudip Roy, Gabriel Bender, Lucja Kot, Johannes Gehrke, Christoph Koch. Entangled Transactions. VLDB 2011. Slides.
  • Nitin Gupta, Lucja Kot, Sudip Roy, Gabriel Bender, Johannes Gehrke, and Christoph Koch. Entangled queries: enabling declarative data-driven coordination. SIGMOD 2011. (Best Paper Award Winner) Slides.
  • Nitin Gupta, Lucja Kot, Gabriel Bender, Sudip Roy, Johannes Gehrke, Christoph Koch. Coordination Through Querying in the Youtopia System (Demonstration Paper). SIGMOD 2011.
  • J. Y. Halpern, From causal models to counterfactual structures, Proceedings of the Twelfth International Conference on Principles of Knowledge Representation and Reasoning (KR 2010), 2010.
  • R. Dechter, H. Geffner, J.Y. Halpern, Heuristics, Probability, and Causality: A Tribute to Judea Pearl, College Publications, 2010.
  • J.Y. Halpern and C. Hitchcock, Actual causation and the art of modeling, in Heuristics, Probability and Causality: A Tribute to Judea Pearl, (editors, R. Dechter, H. Geffner, and J. Y. Halpern), College Publications, 2010, pp. 383-406
  • A. Meliou, W. Gatterbauer, J.Y. Halpern, C. Koch, K. F. Moore, and D. Suciu, Causality in databases, IEEE Data Engineering Bulletin 33:3, 2010, pp. 59-67.
  • Lucja Kot, Nitin Gupta, Sudip Roy, Johannes Gehrke, and Christoph Koch. Beyond Isolation: Research Opportunities in Declarative Data-Driven Coordination. SIGMOD Record, 2010.
  • Lucja Kot, Christoph Koch. Cooperative Update Exchange in the Youtopia System. VLDB 2009.
  • Funding

    This research has been supported by the National Science Foundation under Grants IIS-0534404 and IIS-0911036, by a Google Research Award, by New York State Science Technology and Academic Research under agreement number C050061, and by the iAd Project funded by the Research Council of Norway. This research has also been supported by a Google Research Award, and we are very thankful to Google for their support.

    Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.