Managing Metadata and Provenance for Elementary-Particle Physics
Overview
This project is a collaboration between faculty and researchers at Cornell's Wilson Laboratory and the Computer Science department.
The particle detector in Cornell's Wilson Laboratory plays a significant role in current research on the structure of matter. The laboratory is one of ten major accelerator facilities in the world today. The subatomic particles produced by the collision of electrons and positrons are studied by the multi-institutional collaboration that runs CLEO (the actual detector) and conducts research in elementary-particle physics, the study of the basic building blocks of matter. Please refer to the respective home pages for more information on CLEO and the Laboratory for Elementary-Particle Physics.
CLEO produces large amounts of data that are analyzed by physicists all over the world. The data consist of raw measurements of particle collisions and additional information about the detector calibration at the time the collisions were recorded. A fairly complex workflow, called reconstruction, is used to clean the raw data and to refine the calibration information. The first goal of our project is to provide a database infrastructure for managing the metadata that are generated at different stages of the reconstruction process. This will simplify both the analysis done by end users and the reconstruction process itself.
The second goal of our project is to add support for provenance to the existing system. The final results of a CLEO particle event analysis depend not only on the actual measurements related to a particle collision event, but also on numerous other factors. The two major factors are the detector calibration constants and the data processing software. Both can change over time as new insights are gained from previous analyses (e.g., corrections to the assumed positions of measuring wires in the detector, or a new version of the track reconstruction software). Such changes in data and software can affect the validity of previous results and can limit the re-use of previously derived data in future analyses. In particular, users of the CLEO data are interested in the following questions:
- Which software and input data were used to generate my final results?
- Does a given update of a software module affect my final results?
- Does a given update of some calibration constants affect my final results?
- Were two given final results obtained with a consistent setup (e.g., same software release and data versions)?
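The questions above can be illustrated with a minimal sketch, assuming a single relational table that records which version of each input a final result was derived from. The table name `used`, its columns, and the helper functions are hypothetical and are not taken from the project's actual schema. The sketch also records only direct inputs, so it ignores dependencies that propagate through intermediate processing steps.

```python
import sqlite3


def open_provenance_db():
    """Create an in-memory database with one hypothetical provenance table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE used (
            result  TEXT NOT NULL,   -- name of a final result
            kind    TEXT NOT NULL,   -- 'software', 'calibration', or 'data'
            name    TEXT NOT NULL,   -- module, constant set, or dataset name
            version TEXT NOT NULL,   -- version used to derive the result
            PRIMARY KEY (result, kind, name)
        )
        """
    )
    return conn


def record(conn, result, kind, name, version):
    """Record that `result` was derived using `name` at `version`."""
    conn.execute(
        "INSERT INTO used VALUES (?, ?, ?, ?)", (result, kind, name, version)
    )


def inputs_of(conn, result):
    """Which software and input data were used to generate this result?"""
    rows = conn.execute(
        "SELECT kind, name, version FROM used "
        "WHERE result = ? ORDER BY kind, name",
        (result,),
    )
    return rows.fetchall()


def affected_by(conn, result, kind, name, new_version):
    """Was `result` derived with a version of `name` other than `new_version`?"""
    row = conn.execute(
        "SELECT version FROM used WHERE result = ? AND kind = ? AND name = ?",
        (result, kind, name),
    ).fetchone()
    return row is not None and row[0] != new_version


def consistent(conn, result_a, result_b):
    """Do the two results agree on the version of every input they share?"""
    (conflicts,) = conn.execute(
        "SELECT COUNT(*) FROM used AS a JOIN used AS b "
        "ON a.kind = b.kind AND a.name = b.name "
        "WHERE a.result = ? AND b.result = ? AND a.version <> b.version",
        (result_a, result_b),
    ).fetchone()
    return conflicts == 0
```

With this layout, `affected_by` covers both the software and the calibration question by passing a different `kind`, and `consistent` treats two results as compatible when no shared input appears in two different versions. A real schema would likely need to follow chains of derived data as well, which this sketch leaves out.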
Research Foci
- Design of a metadata schema
- Design of a schema for managing provenance information
- Implementation of the schemata in a relational DBMS, including selection of indices, stored procedures, etc.
- Publication of the information as a web service
- Improvements to current data processing workflow
People
Manuel Calimlim
Johannes Gehrke
Lawrence Gibbons
Chris Jones
Valentin Kuznetsov
Mirek Riedewald
Dan Riley
Anders Ryd
Gregory J. Sharp
Internal
Detailed description, summary of current status, schemata, etc.