Knowledge Broker: Analysis and Monitoring of Massive Data Streams
In many applications, such as telephone fraud detection, network management, stock analysis, Internet-scale news filtering and dissemination, and message filtering and analysis, data arrives in a stream. For example, a telephone call record is generated for each call, typically at the end of the call. The resulting stream of call records may be used by applications such as telephone fraud detection, which examine the stream for patterns of potentially fraudulent calling behavior. Stream processing systems typically have to support large numbers of long-running continuous queries. For example, stock analysts continuously monitor incoming stock quotes in order to discover relevant patterns in real time.
Traditional database systems are optimized for one-shot queries over persistent datasets, and straightforward stream processing solutions based on database triggers simply do not scale. We are therefore building a new system architecture for processing massive data streams. Within the Knowledge Broker project we are currently exploring several research directions, covering the whole range from the foundations of stream processing to system implementation.
Maintaining statistics over data streams. The large volume of stream data and the on-line nature of the applications that operate on such data make it imperative to compute and maintain a variety of statistical summaries in an on-line fashion. We developed histogram-based techniques for maintaining correlated aggregates [GKS01]. More recently we extended sketches to support complex aggregate queries involving multiple joins [DGGR02], and we developed the first on-line technique for maintaining statistics over spatial and spatio-temporal data with provable (probabilistic) error bounds [DGR04a]. The latter can also be used for high-quality selectivity estimation in spatial and spatio-temporal databases; it supports operators such as the spatial join, proximity joins (epsilon-joins), and range queries.
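To illustrate the flavor of sketch-based stream summaries, the following is a minimal AMS-style sketch for estimating the size of a join between two streams on a single attribute. It is a sketch of the general idea only, not the algorithm of [DGGR02]; all class and function names are illustrative.

```python
import random
from statistics import median

class AMSSketch:
    """Minimal AMS-style sketch: each counter accumulates the stream's
    frequency vector weighted by a pseudo-random +/-1 hash of the value.
    Two sketches built with the same seed can estimate join sizes."""

    def __init__(self, depth=5, width=64, seed=0):
        rng = random.Random(seed)
        # one hash seed per counter; both streams' sketches must share them
        self.seeds = [[rng.getrandbits(32) for _ in range(width)]
                      for _ in range(depth)]
        self.counters = [[0] * width for _ in range(depth)]

    def _sign(self, seed, value):
        # pseudo-random +/-1 hash of the join-attribute value
        return 1 if hash((seed, value)) & 1 else -1

    def update(self, value, count=1):
        # process one stream tuple in a single pass
        for d in range(len(self.counters)):
            for w in range(len(self.counters[d])):
                self.counters[d][w] += count * self._sign(self.seeds[d][w], value)

def estimate_join_size(sk_r, sk_s):
    """Estimate |R join S|: average counter products within each row,
    then take the median across rows for robustness."""
    rows = []
    for cr, cs in zip(sk_r.counters, sk_s.counters):
        rows.append(sum(x * y for x, y in zip(cr, cs)) / len(cr))
    return median(rows)
```

The key property is that each sketch is tiny and is maintained in a single pass over its stream, while the product of two sketches is an unbiased estimator of the join size.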
Load shedding and load smoothing. In practice, data stream arrival patterns are often characterized by successions of load spikes (periods of high arrival rates) followed by periods of low arrival rates. A typical example is news coverage, where an event can trigger a sudden burst of articles on a topic. Dealing with such load spikes is challenging because during a burst the system might run out of resources and can no longer process all arriving tuples. One approach to this problem is to shed load, i.e., to address a resource bottleneck by selectively dropping input tuples from overflowing operator queues. The goal is to select the right tuples to drop so as to minimize a given loss metric (e.g., loss in query result accuracy) [DGR03]. However, simply discarding tuples is not acceptable for several applications, e.g., monitoring of critical infrastructure, financial applications, and intelligence applications (the needle-in-the-haystack problem). Instead of shedding load, these applications require load to be smoothed, i.e., incoming tuples are buffered during bursts and processed later during periods of low load.
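The shedding idea can be sketched as a bounded operator queue that, when full, drops the tuple contributing least to the loss metric. This is a toy illustration under simplified assumptions (a per-tuple "utility" standing in for the loss metric), not the shedding policy of [DGR03].

```python
import heapq

class SheddingQueue:
    """Bounded buffer that sheds the lowest-utility tuple on overflow.
    'utility' is a stand-in for the loss metric a real system optimizes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []    # min-heap ordered by utility
        self.counter = 0  # insertion tie-breaker so comparisons never hit tuples

    def offer(self, tup, utility):
        """Enqueue a tuple; return the tuple that was shed, or None."""
        entry = (utility, self.counter, tup)
        self.counter += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)
            return None  # room available, nothing dropped
        if utility > self.heap[0][0]:
            # new tuple beats the current minimum: shed the minimum instead
            dropped = heapq.heapreplace(self.heap, entry)
            return dropped[2]
        return tup  # the arriving tuple itself is shed

    def drain(self):
        """Return buffered tuples in utility order (e.g., after a burst)."""
        return [t for _, _, t in sorted(self.heap)]
```

Load smoothing corresponds to the opposite policy: make the buffer large (or spill it to disk) and never drop, deferring processing to quiet periods via `drain`.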
Knowledge Broker system architecture. The Knowledge Broker architecture is designed to support continuous and one-shot queries over data streams and persistent datasets. Its core components are a subscription matching engine, an archive database, and data mining modules. The subscription matching engine supports continuous queries; it extends publish/subscribe functionality and multi-query optimization techniques to support real-time discovery of temporal patterns in data streams. The archive database stores all relevant information, including stream tuples and materialized results of queries running in the subscription matching engine and the data mining modules. The data mining modules provide one-shot query functionality, e.g., complex data mining algorithms that make multiple passes over data in the archive. The three components will be integrated, enabling powerful functionality such as data mining queries that automatically set up subscriptions to monitor incoming stream data for relevant information, and subscriptions that access historic data stored in the archive [DGR04b].
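The interplay of the three components can be sketched as follows. This is a toy model of the architecture's data flow, with all names and the predicate-based subscription interface invented for illustration; it is not the actual system API described in [DGR04b].

```python
class SubscriptionEngine:
    """Toy model of the Knowledge Broker split: continuous queries are
    matched against each arriving tuple, while every tuple is also
    archived so one-shot queries can access the history."""

    def __init__(self):
        self.subscriptions = {}  # subscription id -> predicate over a tuple
        self.archive = []        # archive database stand-in: all tuples seen
        self.next_id = 0

    def subscribe(self, predicate):
        """Register a continuous query; returns its subscription id."""
        sub_id = self.next_id
        self.next_id += 1
        self.subscriptions[sub_id] = predicate
        return sub_id

    def publish(self, tup):
        """Process one stream tuple: archive it, then return the ids of
        all subscriptions it matches."""
        self.archive.append(tup)
        return [sid for sid, pred in self.subscriptions.items() if pred(tup)]

    def one_shot(self, predicate):
        """One-shot query (e.g., a mining pass) over the archived history."""
        return [t for t in self.archive if predicate(t)]
```

A real matching engine would, of course, index the subscriptions and share work across them rather than evaluate each predicate per tuple; that is precisely where the multi-query optimization techniques mentioned above come in.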
[DGR04a] A. Das, J. Gehrke, and M. Riedewald. Approximation Techniques for Spatial Data. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD 2004), Paris, France, June 2004.
[DGR04b] A. Demers, J. Gehrke, and M. Riedewald. The Architecture of the Cornell Knowledge Broker. In Proceedings of the Second Symposium on Intelligence and Security Informatics (ISI-2004), 2004.
[DGR03] A. Das, J. Gehrke, and M. Riedewald. Approximate Join Processing Over Data Streams. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD 2003), San Diego, CA, June 2003.
[DGGR02] A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing Complex Aggregate Queries over Data Streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, June 2002.
[GKS01] J. Gehrke, F. Korn, and D. Srivastava. On Computing Correlated Aggregates Over Continual Data Streams. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, California, May 2001.