Expressiveness of Stream Processing Languages

People    Publications     Acknowledgments

One of the motivations for the Cayuga has been to find a "sweet spot" between expressiveness and performance in data stream processing. Users want highly expressive queries (e.g. more than publish/subscribe) but are willing to accept some restrictions if it can provide significant performance enhancements.

The problem with establishing this claim is that there are a many data stream languages out there, and very little is known about their relative expressive power. Clearly the queries of publish/subscribe are must less expressive than those of a fully fledged DSMS system like the Stanford STREAM project. But what about event systems? How much more expressive are they than publish/subscribe, and how much less expressive are they that STREAM? And where does streaming XML fit into all this?

The purpose of this project is to investigate various streaming languages and understand the exact range of their expressive power. At the same time we are exploring what types of language restrictions give as performance benefits and what do not.

The Theory of Streaming Data

Part of this work involves the investigation of the underlying theory of streaming data. For example, what is the right time stamp semantics for streaming data? There are many proposed event systems and they all use different time stamps for managing their data streams. It is very difficult to compare their expressiveness if we do not understand the semantics of these time stamps and how they compare to one another.

An important example of how different event systems can have different time stamp semantics is the issue of "successor". Event systems are used to analyze time series queries online. Therefore, they need to have some semantics for when one event is a succesor of another. Generally, an event B is a successor to event A if no event happened in between A and B. For example, suppose we have the events pictured below.

In this picture, event B is clearly the successor of A. The events C and D cannot be a successor of A because event B happens in between both C and D.

This concept of successor makes sense if the time stamps are points in time, and hence linearly ordered. But what if the time stamps are intervals? Event systems like Cayuga use interval time stamps because events may have duration, and this is a necessary part of the query language. are a necessary part of their query language. What is the "correct" successor for interval time stamps?

Consider the collection of intervals in the picture above. What is the successor of event A? Again, it appears to be event B. However, event B ends before event A begins; a typical user might reject the claim that B happens after event A, especially if the query is trying to establish some sort of causality. But what does that leave us with? Event E is the first event to start after A completes. Should it be the successor even though events C and D complete first? If completion time is more important, how do we pick between events C and D? Or should they both be the successor to A

Investigating questions like this are important to understanding the semantics of time stamps, which must are an integral part of the expressiveness of stream query languages.




This material is based upon work supported by the National Science Foundation under Grant No. 0621438. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.