Abhinandan Das: Research Overview
I am a member of the Cornell Database Group, advised by Prof. Johannes Gehrke. I am interested in applied research problems in the broad area of data management; my main research focus has been on data streaming. For my thesis, I worked on techniques for approximate query answering over data streams. Besides stream processing, my research interests include distributed algorithms, database privacy, self-tuning databases, and data mining. A brief summary of some of the research I have been involved in appears below. Here is a list of my publications.
The emergence of sensor networks and of a new class of high-speed, data-intensive streaming applications such as IP network management, financial analysis, and telephone fraud detection has begun to change the traditional view of databases as a static store of data. The need to query, analyze, and manage vast amounts of continuous stream data in an online fashion, potentially in real time, makes it imperative to design one-pass algorithms that use limited resources to process queries over these streams. In all of these applications, for all but the simplest queries, computing exact answers online in one pass with limited storage is usually not possible. In addition, stream data is often lost, becomes stale, or is intentionally dropped (e.g., during load shedding). All of this leads to approximate answers to queries. In most online applications, however, approximate answers suffice, and for applications with real-time requirements, approximate but timely answers are often preferable to exact answers returned after a long delay. Approximation algorithms for stream query processing are thus not only necessary but often desirable, and they constitute an integral part of any practical data stream processing system.
My thesis research focuses on approximate query answering over data streams, and encompasses several approximation techniques for online processing of streaming data. These include techniques for online data summarization, as well as algorithms for one pass approximate query answering over data streams. Some representative research is outlined below.
Semantic Load Shedding for Data Streams
The online nature of data streams and their potentially high arrival rates impose heavy resource requirements on data stream processing systems. To deal with resource constraints gracefully, we propose semantic load shedding as a promising approximation approach for saving resources. As in traditional relational database systems, the join is a central operator in a data stream processing system, but its computation is highly resource intensive. We consider the problem of approximating sliding-window joins over data streams in a data stream processing system with limited resources. We present optimal offline and fast online techniques for load shedding, along with hardness results and approximation algorithms for semantic join approximation. Our load shedding techniques are fairly general and can handle scenarios where resource availability and stream arrival rates vary continuously over time.
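To make the idea concrete, here is a minimal sketch of a sliding-window equi-join with semantic load shedding. The class name, the time-based window model, and the eviction heuristic (evict the tuple whose join key is rarest in the opposite window, as a crude proxy for expected join output) are illustrative assumptions on my part, not the actual algorithms from this work:

```python
from collections import deque

class WindowJoin:
    """Toy sliding-window equi-join over two streams (sides 0 and 1)
    with semantic load shedding when a window exceeds its memory budget.
    Illustrative only; the real utility measures are more sophisticated."""

    def __init__(self, window_size, memory_budget):
        self.window_size = window_size      # time-based window, in ticks
        self.memory_budget = memory_budget  # max tuples kept per window
        self.windows = {0: deque(), 1: deque()}

    def _expire(self, side, now):
        # Drop tuples that have fallen out of the time window.
        w = self.windows[side]
        while w and now - w[0][0] > self.window_size:
            w.popleft()

    def _shed(self, side):
        # Semantic load shedding: evict the tuple least likely to join,
        # approximated here by the rarest key in the opposite window.
        w = self.windows[side]
        if len(w) <= self.memory_budget:
            return
        other_keys = [k for _, k in self.windows[1 - side]]
        victim = min(range(len(w)), key=lambda i: other_keys.count(w[i][1]))
        del w[victim]

    def insert(self, side, ts, key):
        """Insert tuple (ts, key) into stream `side`; return matching keys."""
        for s in (0, 1):
            self._expire(s, ts)
        matches = [k for _, k in self.windows[1 - side] if k == key]
        self.windows[side].append((ts, key))
        self._shed(side)
        return matches
```

With a budget of two tuples per window, inserting a third tuple triggers eviction of the one least likely to produce join output.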
While approximations play an important role in stream processing, for certain applications, e.g., intrusion detection, results of limited accuracy are undesirable because of the risk of missing the proverbial needle in the haystack. Although it is often impossible to estimate peak arrival rates or to size a system for peak loads, we observe that, due to their bursty nature, streams do not generate consistently high loads on the system, so input processing can often be deferred to periods of low load. For such applications, we propose semantic load smoothing as an alternative to load shedding: instead of being shed during periods of high load, input is buffered and post-processed during periods of low load. Thus, in this case, approximate results are returned during the online pass and refined during later passes when the load on the system is low.
Online Data Summarization
Since data streams may be conceptually unbounded while queries over them must be processed online using limited resources, data summarization techniques play a key role in processing large volumes of data in a timely fashion. We propose online techniques for summarizing spatial data, as well as summary structures for processing correlated aggregate queries. We introduce novel sketch-based methods that permit high-quality selectivity estimation for spatial joins and range queries. Our sketch synopses can be constructed in a single scan over the input and handle inserts and deletes to the database incrementally, and hence can be used for processing streaming data. In contrast to previous approaches, which provide no guarantees on the quality of the approximate results, our techniques return approximate results with provable quality guarantees. These guarantees are user tunable and permit a graceful tradeoff between space consumption and approximation quality. Our summary structures for processing correlated-sum aggregate queries, a natural and powerful class of queries formed by composing basic aggregates, guarantee a priori approximation bounds when the independent aggregate can be computed exactly over a data stream in limited space. When the independent aggregate cannot be computed exactly over a data stream, we show the negative result that the approximation error of any sample-based summary structure can be arbitrarily large. Besides sketch-based synopses and summary structures for correlated aggregates, I have also worked on algorithms for histogram construction and on an online algorithm (QUADS) for maintaining QUantiles over Data Streams.
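For a flavor of the sketch-based approach, the following is a minimal AMS ("tug-of-war") style sketch for estimating a stream's self-join size (second frequency moment). It is built in one pass and handles inserts and deletes symmetrically. The lazily drawn ±1 hashes and the parameter choices are simplifications for illustration; this is not the spatial-join synopsis described above:

```python
import random

class AMSSketch:
    """Minimal AMS sketch estimating F2 = sum of squared frequencies.
    Each counter accumulates sum_k f(k) * s_i(k) for a random +/-1 hash
    s_i; the square of a counter is an unbiased F2 estimate, and
    averaging over counters reduces variance."""

    def __init__(self, num_counters=64, seed=0):
        self.rng = random.Random(seed)
        self.signs = {}                    # lazily drawn +/-1 per (counter, key)
        self.counters = [0] * num_counters

    def _sign(self, i, key):
        h = self.signs.get((i, key))
        if h is None:
            h = self.signs[(i, key)] = self.rng.choice((-1, 1))
        return h

    def update(self, key, delta=1):
        """delta=+1 for an insert, -1 for a delete; both take one pass."""
        for i in range(len(self.counters)):
            self.counters[i] += delta * self._sign(i, key)

    def estimate_f2(self):
        return sum(c * c for c in self.counters) / len(self.counters)
```

Because deletes simply apply the opposite update, a delete exactly cancels the corresponding insert, which is what makes such synopses incrementally maintainable.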
Distributed Data Stream Processing
With the widespread use and deployment of networks linking together a broad range of devices, distributed data streaming applications are becoming increasingly common. In such applications, stream updates arrive continuously at several remote sites and monitoring queries require online processing of data from multiple remote sources. We examine the problem of tracking set expression cardinality in a distributed streaming environment. We analyze and present the first algorithmic techniques for minimizing communication costs while answering set expression cardinality queries with guaranteed accuracy. Set expression queries subsume the important class of distinct value queries and applications of this work include detecting Distributed Denial of Service (DDoS) attacks in real time, detecting deviations in traffic flow across network routers, and tracking web usage patterns of users.
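A simple way to see why sketches are communication-friendly in this setting: Flajolet-Martin style bitmaps for distinct counting merge by bitwise OR, so each remote site can summarize its stream locally and ship only a few bits to the coordinator, which can then estimate the cardinality of the union. The sketch below (hash choice, correction constant, a single bitmap with no averaging) is a textbook illustration, not the set-expression tracking algorithm itself:

```python
import hashlib

def fm_bitmap(items, bits=32):
    """Flajolet-Martin style bitmap summarizing one site's stream:
    set bit r if some item hashes to a value whose lowest set bit
    is at position r."""
    bitmap = 0
    for x in items:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16)
        r = (h & -h).bit_length() - 1   # position of lowest set bit
        bitmap |= 1 << min(r, bits - 1)
    return bitmap

def estimate_distinct(bitmap):
    """Estimate the number of distinct items from the lowest unset bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return int(2 ** r / 0.77351)        # standard FM correction factor

# Distributed union: sites sketch locally, the coordinator ORs the
# bitmaps -- estimating |A ∪ B| needs no raw data exchange.
```

The key property is that the bitmap of a union equals the OR of the per-site bitmaps, so accuracy is unaffected by how the data is partitioned across sites.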
- Distributed Set-Expression Cardinality Estimation
Abhinandan Das, Sumit Ganguly, Minos Garofalakis, and Rajeev Rastogi.
In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB 2004), Toronto, Canada, August 2004.
- Approximation Techniques for Spatial Data
Abhinandan Das, Johannes Gehrke, and Mirek Riedewald.
In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD 2004), Paris, France, June 2004.
- Approximate Join Processing over Data Streams
Abhinandan Das, Johannes Gehrke, and Mirek Riedewald.
In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD 2003), San Diego, CA, June 2003.
- Efficient Approximation of Correlated Sums on Data Streams
Rohit Ananthakrishna, Abhinandan Das, Johannes Gehrke, Flip Korn, S. Muthukrishnan, and Divesh Srivastava.
In IEEE Transactions on Knowledge and Data Engineering (TKDE), Volume 15, No. 3, May/June 2003.
The SWIM Group Membership Protocol
Distributed large-scale peer-to-peer applications usually require knowledge of process group membership at all participating processes. Along with two other students here at Cornell, I developed the highly scalable SWIM group membership protocol, which distributed applications can use to maintain group membership, a basic primitive in peer-to-peer applications. Unlike the traditional heartbeat protocols popular with distributed systems designers today, SWIM uses a randomized probing technique that imposes constant expected load per member and constant expected failure detection time, independent of group size. The SWIM system was prototyped and deployed on a large cluster of commodity PCs here at Cornell.
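The core of SWIM's randomized probing can be sketched in a few lines. This toy simulation (an `alive` dict standing in for the network, k=3 relays) illustrates the ping / ping-req pattern and why per-member load is constant; it is not the deployed protocol:

```python
import random

def swim_probe_round(members, alive, me, k=3, rng=random.Random(1)):
    """One failure-detection round at member `me`, SWIM-style: ping one
    randomly chosen target; if the direct ping fails, ask k randomly
    chosen relays to ping it indirectly. Each member does O(1) work per
    round regardless of group size. `alive` simulates the network."""
    others = [m for m in members if m != me]
    target = rng.choice(others)
    if alive[target]:
        return target, True                  # direct ack received
    relays = rng.sample([m for m in others if m != target],
                        min(k, len(others) - 1))
    # A relay's indirect ping succeeds only if the target is actually up,
    # so a dead target draws no ack and becomes a suspect.
    return target, any(alive[target] for _ in relays)
```

Contrast this with all-to-all heartbeating, where every member's per-round work grows linearly with group size.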
- SWIM Project Web Page
- SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol
Abhinandan Das, Indranil Gupta and Ashish Motivala.
In Proceedings of the 2002 IEEE International Conference on Dependable Systems and Networks (DSN 2002), Washington DC, June 2002.
- The Dping Scalable Membership Service
Abhinandan Das, Indranil Gupta and Ashish Motivala.
In 2001 ACM Symposium on Operating Systems Principles (SOSP 2001), Banff, Canada, October 2001. (poster)
Automating Database Layout
The choice of database layout, i.e., how database objects such as tables and indexes are assigned to disk drives, can significantly impact the I/O performance of a database system. Today, DBAs typically rely on fully striping objects across all available disk drives (RAID) as the basic mechanism for optimizing I/O performance. While full striping maximizes I/O parallelism, we show that when query execution involves co-access of two or more large objects, this strategy may be highly suboptimal. We propose a framework for automating the choice of database layout that also takes into account the effects of co-accessed objects in the workload faced by the system. We designed and implemented a prototype layout wizard on the Microsoft SQL Server 2000 product and experimentally demonstrated the superiority of the layouts generated by the proposed algorithm over traditional RAID-based techniques. This work was done as part of the AutoAdmin project in the Data Management, Exploration and Mining (DMX) group at Microsoft Research, whose goal is to make database systems self-tuning and self-administering, thus reducing the high total cost of ownership of today's enterprise database systems.
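The intuition behind co-access-aware layout can be shown with a small greedy placement routine. The function name, the conflict proxy, and the tie-breaking rule below are hypothetical stand-ins for the layout wizard's cost-based search, not the actual algorithm:

```python
def assign_layout(objects, coaccess, num_disks):
    """Toy co-access-aware layout: place objects (largest first) on the
    disk that minimizes (a) co-access conflict with objects already on
    that disk, then (b) total bytes already placed there. `objects`
    maps name -> size; `coaccess[(a, b)]` (with a < b) counts how often
    a and b are scanned together. Hypothetical illustration only."""
    disks = [[] for _ in range(num_disks)]

    def conflict(disk, obj):
        return sum(coaccess.get((min(obj, o), max(obj, o)), 0) for o in disk)

    for obj, _size in sorted(objects.items(), key=lambda kv: -kv[1]):
        best = min(range(num_disks),
                   key=lambda i: (conflict(disks[i], obj),
                                  sum(objects[o] for o in disks[i])))
        disks[best].append(obj)
    return disks
```

On a workload where two large tables are frequently joined, this places them on different disks so their scans do not contend for the same arm, which is exactly where full striping loses.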
Data Mining
Traditional mining approaches assume that the data to be mined is presented to the mining algorithm as a unified central table. In real life, however, data is usually distributed across several tables or databases, and substantial pre-processing (e.g., joins on relevant attributes) is needed to create the central table. Besides the cost of this pre-processing step, the space and time required to mine the highly denormalized central table can be extremely large. For my senior thesis, I developed an efficient classification algorithm for a vertically partitioned hierarchical database. The idea is to have the algorithm incorporate information from the various tables in the database without first manually preprocessing the data into a 'flattened' table. The algorithm avoids processing excessive amounts of data by exploiting the hierarchy to speed up classification without sacrificing accuracy.
Another direction I would like to explore in the near future, that combines my interests in data streams and data mining, is online mining of data streams. The idea here is to be able to identify time changing models of the data as it evolves, as opposed to the traditional mining approach which often bulk processes data collected over several months or years, thus failing to capture changing trends in the underlying processes generating the data. Some interesting applications in this area include intrusion and anomaly detection, trend prediction and anticipatory marketing.
- Distributed Data Mining
Senior Thesis, IIT Bombay, May 2000
Abhinandan Das. Advised By: Sunita Sarawagi and S. Sudarshan.
Database Privacy
Database privacy is an emerging research area that has recently piqued my interest, given its importance in this age of electronic commerce and data outsourcing. With the explosive growth of the internet and rapid increases in digital storage capacity making information exchange easier than ever, the privacy concerns of end consumers will be one of the key issues of the electronic age. As a start, I am currently exploring the possibility of employing sketch-based methods to develop privacy-preserving algorithms, for problems such as privacy-preserving distance computation and frequency estimation, which have direct applications in areas such as collaborative mining, data matching, and database outsourcing.
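As one concrete instance of the sketch-based direction, two parties holding private vectors can agree on a seed, project with the same random ±1 matrix, and exchange only the low-dimensional sketches; distances between sketches then approximate distances between the originals. This is a standard random-projection illustration (function names and parameters are my own), not a complete privacy protocol, e.g. it does not address a party inverting many sketches:

```python
import random

def rp_sketch(vec, dim=128, seed=42):
    """Project a private vector with a seeded random +/-1 matrix; only
    this `dim`-length sketch is shared, never `vec` itself. Both parties
    must use the same `dim` and `seed`."""
    rng = random.Random(seed)
    proj = [[rng.choice((-1, 1)) for _ in vec] for _ in range(dim)]
    return [sum(r * x for r, x in zip(row, vec)) for row in proj]

def sketch_distance(sa, sb):
    """Approximate Euclidean distance between the original vectors,
    computed from their sketches alone (the squared-distance estimate
    is unbiased; averaging over `dim` coordinates controls variance)."""
    dim = len(sa)
    return (sum((x - y) ** 2 for x, y in zip(sa, sb)) / dim) ** 0.5
```

The accuracy/communication tradeoff is tunable through `dim`, mirroring the space/quality tradeoffs of the stream synopses discussed earlier.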