Database Colloquium Schedule: Spring 2009

The database colloquium is the weekly meeting of students and faculty at Cornell interested in data management, data mining, and related topics. The colloquium is typically a presentation of a seminal or recent paper of general interest. While many of the speakers are from the Cornell community, the colloquium also invites outside speakers to talk about their research. The colloquium is held every Monday from 12:15-1 pm in 5130 Upson Hall.

On days when the database colloquium does not have an outside speaker, it is replaced by a more informal database lunch: a short lunch starting at noon, followed by an informal discussion of a paper on a recent topic of interest.

Fast Randomized Algorithms for Data Streams

In this talk, I present methods to speed up sketch computation. Sketches are randomized algorithms that use a small amount of memory and can be computed in one pass over the data. Frequency moments represent important distributional characteristics of the data that are required in any statistical modeling method. This talk focuses on AGMS sketches, which are used to estimate the second frequency moment. Fast-AGMS sketches use hash functions to speed up the computation by reducing the number of basic estimators that need to be updated. I show that hashing also changes the distribution of the estimator, improving the accuracy by orders of magnitude. The second speed-up method is related to the degree of randomization used to build the estimator. I show that using 3-wise independent random variables instead of the originally proposed 4-wise independent ones yields significant improvements in both computation time and memory usage while the accuracy of the estimator stays the same. The last speed-up method I discuss combines sketches and sampling: instead of sketching the entire data, the sketch is built only over a sample of it. I show that the accuracy of the estimator is not drastically affected even when the sample contains only a small fraction of the original data, and the gain in speed is inversely proportional to the sample size. When the three methods are put together, it is possible to sketch streams with input rates of millions of tuples per second in small memory while providing accuracy similar to the original AGMS sketches.
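To make the sketching idea concrete, here is a minimal toy version of a Fast-AGMS-style sketch for estimating the second frequency moment F2 = sum of squared item frequencies. It is an illustrative sketch only, not the speaker's implementation: the row/bucket sizes, the simple modular hash functions, and the sign hash are all assumptions chosen for brevity. The key point from the abstract is visible in `update`: hashing means only one counter per row is touched per stream item.

```python
import random
from statistics import median

class FastAGMS:
    """Toy Fast-AGMS-style sketch estimating the second frequency
    moment F2 of a stream of integer items (illustration only)."""

    def __init__(self, rows=5, buckets=256, seed=0):
        rng = random.Random(seed)
        self.rows, self.buckets = rows, buckets
        self.p = 2_147_483_647  # Mersenne prime 2^31 - 1
        # per-row bucket hash: ((a*x + b) mod p) mod buckets
        self.h = [(rng.randrange(1, self.p), rng.randrange(self.p))
                  for _ in range(rows)]
        # per-row sign hash giving a pseudo-random +1/-1 per item
        self.s = [(rng.randrange(1, self.p), rng.randrange(self.p))
                  for _ in range(rows)]
        self.counters = [[0] * buckets for _ in range(rows)]

    def _bucket(self, r, x):
        a, b = self.h[r]
        return ((a * x + b) % self.p) % self.buckets

    def _sign(self, r, x):
        a, b = self.s[r]
        return 1 if ((a * x + b) % self.p) % 2 == 0 else -1

    def update(self, x, weight=1):
        # only one counter per row is updated -> fast per-tuple cost
        for r in range(self.rows):
            self.counters[r][self._bucket(r, x)] += self._sign(r, x) * weight

    def estimate_f2(self):
        # per-row estimate: sum of squared bucket counters;
        # the median across rows boosts confidence
        return median(sum(c * c for c in row) for row in self.counters)
```

Feeding a skewed stream through the sketch and comparing `estimate_f2()` against the exact F2 shows the estimate landing close to the true value while using a fixed, small amount of memory regardless of stream length.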

Florin Rusu (University of Florida) 5130 Upson
February 23 DB COLLOQUIUM:
Artifacts in Business Processes: From Process-centric to Data-centric

Traditional workflow is process-centric, which brings challenges in the areas of easy evolvability, component re-use, distribution of workflows across multiple organizations, the use of broadly based key performance indicators (KPIs), and provenance. IBM Research has been working on a new approach to the design and implementation of workflows and business processes that is fundamentally data-centric. In this approach, business operations are first modeled in terms of the business artifacts (or entities, or documents) that flow through the system, along with a high-level specification of their life-cycle. Services (or tasks) specify the automated and/or human steps that move artifacts through their life-cycle, and the services are associated with the artifacts using procedural, graph-based, and/or declarative formalisms. This talk introduces the artifact-centric approach to workflow, describes a framework and tools currently in use to design and implement artifact-centric workflows, and overviews a variety of research challenges and opportunities stemming from the artifact-centric approach, ranging from the theoretical to the practical.
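A small sketch may help fix the idea of an artifact with an explicit life-cycle. The artifact type (`OrderArtifact`), its states, and the services below are hypothetical examples invented for illustration, not part of IBM's framework: the artifact carries its data and a declarative table of legal life-cycle transitions, and services are simply the steps that advance it.

```python
from enum import Enum, auto

class OrderState(Enum):
    CREATED = auto()
    APPROVED = auto()
    SHIPPED = auto()
    CLOSED = auto()

class OrderArtifact:
    """Toy business artifact: data plus an explicit life-cycle.
    States and services are hypothetical, for illustration only."""

    # declarative life-cycle specification: which transitions are legal
    TRANSITIONS = {
        OrderState.CREATED: {OrderState.APPROVED},
        OrderState.APPROVED: {OrderState.SHIPPED},
        OrderState.SHIPPED: {OrderState.CLOSED},
        OrderState.CLOSED: set(),
    }

    def __init__(self, order_id):
        self.order_id = order_id
        self.state = OrderState.CREATED
        self.history = [self.state]  # provenance falls out naturally

    def advance(self, new_state):
        if new_state not in self.TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)

# services (automated or human steps) move the artifact along its life-cycle
def approve(order): order.advance(OrderState.APPROVED)
def ship(order): order.advance(OrderState.SHIPPED)
def close(order): order.advance(OrderState.CLOSED)
```

Because the life-cycle lives with the data rather than in a process graph, the same artifact can be handed between organizations, and its `history` provides provenance and a natural hook for KPIs.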

Rick Hull (IBM T.J. Watson Research Center) 5130 Upson

Recent prior semesters: