The Youtopia Project
Community Data Integration
Communities everywhere on the Web want to share, store and query data. Their motivations for data sharing are diverse, ranging from entertainment or commercial activity to the desire to collaborate on scientific or artistic projects. The data involved is also varied, running the gamut from unstructured through semistructured to relational. The solutions used for data sharing are frequently custom-built for a concrete scenario; as such, they exhibit significant diversity themselves. To name only a few prominent solutions, Wiki software has proved very successful for community management of unstructured data; scientists use custom portals to pool their datasets; and an increasingly large number of vertical social networking sites include a topic-specific database that is maintained by the site's members.
While the scenarios mentioned above vary widely in their parameters, they have in common many high-level properties that translate into concrete design desiderata for Collaborative Data Integration (CDI) systems. In the Youtopia project, we are building a system to address these desiderata and enable community data sharing in arbitrary settings. Our initial focus is on relational data; however, the ultimate goal is to include arbitrary data formats and manage the data in its full heterogeneity.
CDI has three fundamental aspects that distinguish it from other paradigms such as classical data integration. First, a CDI system must enable best-effort cooperation among community members with respect to maintenance of the data and metadata. That is, no worthwhile contribution to the repository should be rejected because it is incomplete, as another community member may be able to supply the knowledge required to complete it. This means a CDI system must be equipped to deal with incomplete data and metadata, and must provide a way for users to complete them at a later time. Next, a CDI solution must manage disagreement regarding the data and schema or other metadata. Finally, it must maximize data utility.
These three aspects have clear tradeoffs in the extent to which they can be addressed; as such, they define a design space within which we can situate existing solutions and Youtopia. The structure of this design space also clarifies the relationship of CDI to classical data integration; the latter is fundamentally an effort to maintain utility while permitting as much disagreement as possible. CDI builds on this by introducing the added element of best-effort cooperation, familiar from the Web 2.0 model of enabling all users to create their own content on the internet.
Youtopia
Youtopia is a system that allows users to add, register, update and maintain relational data in a collaborative fashion. The architecture of Youtopia is shown above. The storage manager provides a logical abstraction of the repository. In this abstraction, the repository consists of a set of logical tables or views containing the data; these are tied together by a set of mappings (or tuple-generating dependencies). The mappings are supplied by the users as the repository grows and serve to propagate changes to the data in a variant of the chase process. Thus, at the logical level Youtopia is an update exchange system.
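To make the role of mappings concrete, the following is a minimal sketch of a textbook-style chase over tuple-generating dependencies. It is not Youtopia's implementation: the Youtopia variant involves human participation, which is omitted here. The relation names (`Attraction`, `City`), the `Null` class and the `match` and `chase` helpers are all hypothetical, and every term in a mapping is assumed to be a variable.

```python
from itertools import count


class Null:
    """A labeled null: a placeholder for a value nobody has supplied yet."""
    _ids = count(1)

    def __init__(self):
        self.id = next(Null._ids)

    def __repr__(self):
        return f"_N{self.id}"


def match(atoms, facts, binding=None):
    """Yield each variable binding under which every atom occurs in facts."""
    binding = binding or {}
    if not atoms:
        yield dict(binding)
        return
    (rel, variables), rest = atoms[0], atoms[1:]
    for fact_rel, values in facts:
        if fact_rel != rel or len(values) != len(variables):
            continue
        trial = dict(binding)
        # Bind unbound variables; reject the fact if a bound variable disagrees.
        if all(trial.setdefault(var, val) == val
               for var, val in zip(variables, values)):
            yield from match(rest, facts, trial)


def chase(facts, mappings):
    """Add tuples until every mapping holds.

    Assumption: the mappings are acyclic; otherwise this loop may not terminate.
    """
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in mappings:
            for binding in list(match(body, facts)):
                if next(match(head, facts, binding), None) is not None:
                    continue  # the head already holds for this binding
                for _, variables in head:
                    for var in variables:
                        if var not in binding:
                            binding[var] = Null()  # existential variable
                for rel, variables in head:
                    facts.add((rel, tuple(binding[v] for v in variables)))
                changed = True
    return facts


# Hypothetical mapping: every attraction's city must appear in City,
# with some (possibly unknown) country.
mapping = ([("Attraction", ("name", "city"))],
           [("City", ("city", "country"))])
facts = {
    ("Attraction", ("Louvre", "Paris")),
    ("Attraction", ("Colosseum", "Rome")),
    ("City", ("Paris", "France")),
}
result = chase(facts, [mapping])
for fact in sorted(result, key=repr):
    print(fact)
```

In this sketch the mapping is already satisfied for Paris, so only Rome receives a new `City` tuple, whose country is a labeled null. Such a null is the kind of incomplete data that another community member could fill in later.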
The following is our vision for Youtopia, organized by how the system addresses each of the three CDI goals.
Enabling best-effort cooperation
Youtopia is designed from the ground up to allow users to cooperate on all data management tasks.
- Our change propagation model is novel and includes human participation, which takes a fundamentally collaborative form. Details can be found in our VLDB 2009 paper, listed under Publications.
- When a user adds a new table to a Youtopia repository, it is also desirable to add mappings that connect the table to others. However, such mappings are not always easy to specify. There is a need for an infrastructure to facilitate mapping creation. Youtopia provides such an infrastructure; notably, it allows users to cooperate and pool their understanding in setting up and refining mappings.
- Mapping creation is also made easier by the presence of subdomain-specific summary views: knowledgeable users can define views whose schemas capture the essence of a subdomain. Much as portals and topic lists in Wikipedia can guide contributors in the categorization of their articles, such views can guide table owners in the definition of their mappings.
Maximizing utility
Ensuring high utility of data in a Youtopia repository requires both maintaining good data quality and providing flexible and appropriate mechanisms for data querying and browsing.
- The system provides mechanisms for cooperative data cleaning.
- There is support for user ranking of tables and data. These rankings are used to provide a better query and browsing experience.
- The subdomain-specific summary views mentioned above are also useful for data navigation and as entry points for structured queries.
- Youtopia supports a mixture of keyword search and structured queries over the data in the repository. The query engine is equipped to handle data that is incomplete, inconsistent or both. This is done through the use of multiple query semantics: a certain semantics that guarantees correctness while potentially omitting some results, and a best-effort semantics that includes all potentially relevant results at the risk of some incorrectness.
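The difference between the two query semantics can be illustrated with a small sketch. This is only one possible reading, not Youtopia's query engine: it assumes that the sole source of uncertainty is missing values (represented as `None`) and evaluates a selection with three-valued logic. The table contents and the `equals` and `query` helpers are hypothetical.

```python
from typing import Optional

# Hypothetical table; None marks a value that no user has supplied yet.
rows = [
    {"city": "Paris", "country": "France"},
    {"city": "Rome", "country": None},
    {"city": "Berlin", "country": "Germany"},
]


def equals(value, target) -> Optional[bool]:
    """Three-valued equality: None means the answer is unknown."""
    if value is None:
        return None
    return value == target


def query(rows, column, target, semantics="certain"):
    """Select rows where column equals target under the chosen semantics.

    "certain": keep a row only if the condition definitely holds.
    "best_effort": also keep rows where the condition might hold.
    """
    selected = []
    for row in rows:
        verdict = equals(row[column], target)
        if verdict is True or (semantics == "best_effort" and verdict is None):
            selected.append(row)
    return selected


certain = query(rows, "country", "France")
best_effort = query(rows, "country", "France", semantics="best_effort")
print([r["city"] for r in certain])
print([r["city"] for r in best_effort])
```

Under the certain semantics only Paris is returned. The best-effort semantics also returns Rome, which may turn out to be wrong once the missing country is filled in.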
Handling disagreement
As data and mappings are added to the repository, disagreement is inevitable.
- Youtopia provides mechanisms for disagreement resolution by the community, such as arbitration and moderation. This requires appropriate support for access control with respect to all tasks that Youtopia users perform.
- In the event that disagreement is not resolved, Youtopia can handle queries over inconsistent data, while indicating to the user the presence and nature of the inconsistencies.
Finally, privacy is always a consideration. Youtopia therefore includes social-network-like functionality that allows users to establish a network of trusted acquaintances or friends, so that data, mappings, rankings and user-defined views can be shared to a varying extent.
People
Publications
- Lucja Kot, Christoph Koch. Cooperative Update Exchange in the Youtopia System. VLDB 2009.