Abhinandan Das: Research Overview

 I am a member of the Cornell Database Group, and am advised by Prof. Johannes Gehrke. I am interested in applied research problems in the broad area of data management. My main research focus has been on data streaming. For my thesis, I worked on techniques for approximate query answering over data streams.  Besides stream processing, my research interests include areas such as distributed algorithms, database privacy, self tuning databases, and data mining. A brief summary of some of the research I have been involved in is outlined below. Here is a list of my publications.

The emergence of sensor networks and a new class of high-speed, data-intensive streaming applications such as IP network management, financial analysis, and telephone fraud detection, have begun to change the traditional view of databases as a static store of data. The need to query, analyze and manage vast amounts of continuous stream data in an online fashion, potentially in real time, makes it imperative to design one pass algorithms which use limited resources to process queries over these data streams. In all these applications, for all but the simplest queries, computing exact answers online in one pass with limited storage is usually not possible. In addition, stream data is often lost, becomes stale, or is intentionally dropped for various reasons (e.g. in load shedding). All of this leads to approximate answers to queries. However, in most online applications, approximate answers usually suffice and in fact, for applications with real time requirements, approximate but timely answers are often preferable to exact answers returned after a relatively long delay. Thus, approximation algorithms for stream query processing are not only necessary, but often desirable and thus constitute an integral part of any practical data stream processing system.

 My thesis research focuses on approximate query answering over data streams, and encompasses several approximation techniques for online processing of streaming data. These include techniques for online data summarization, as well as algorithms for one pass approximate query answering over data streams. Some representative research is outlined below.

Related Documents:

Distributed large scale peer-to-peer applications usually require knowledge of process group membership information at all participating processes. Along with two other students here at Cornell, I developed the highly scalable SWIM group membership protocol which can be used by distributed applications for maintaining group membership, a basic primitive in peer-to-peer applications. Unlike the traditional heart beating protocols which are popular with distributed systems designers today, SWIM uses a randomized probing technique that imposes constant expected load per member and constant expected failure detection time, independent of group size. The SWIM system was prototyped and deployed on a large cluster of commodity PCs here at Cornell. 

Related Documents

The choice of database layout i.e. how database objects such as tables and indexes are assigned to disk drives can significantly impact the I/O performance of a database system. Today, DBA's typically rely on fully striping onjects across all available disk drives (RAID) as the basic mechanism for optimizing I/O performance. While full striping maximizes I/O parallelism, we show that when query execution involves co-access of two or more large objects, the above strategy may be highly suboptimal. We propose a framework for automating the choice of database layout for a given database that also takes into account the effects of co-accessed objects in the workload faced by the system. We designed and implemented a prototype layout wizard on the Microsoft SQL Server 2000 product and experimentally demonstrated the superiority of the layouts generated by the proposed algorithm over traditional RAID based techniques. This work was done as part of the AutoAdmin project in the Data Management, Exploration and Mining (DMX) group at Microsoft Research. The goal of this project is to help make database systems self-tuning and self-administering, thus reducing the high total cost of ownership of today’s enterprise database systems.

Related Documents 

Traditional mining approaches assume that the data to be mined is presented as a unified central table to the mining algorithm. However, in real life, data is usually distributed across several tables/databases and a lot of pre-processing work (e.g. joins on relevant attributes) needs to be done to create the central table. Besides the cost of the pre-processing step, the space and time required to mine the highly denormalized central table could be extremely large. For my senior thesis, I worked on the development of an efficient classification algorithm for a vertically partitioned hierarchical database. The idea here is to have the algorithm incorporate information from various tables in the database without the need for first manually preprocessing the data as a ‘flattened table’. The algorithm avoids processing excessive amounts of data by exploiting the existence of the hierarchy to speed up classification without sacrificing accuracy. 

Another direction I would like to explore in the near future, that combines my interests in data streams and data mining, is online mining of data streams. The idea here is to be able to identify time changing models of the data as it evolves, as opposed to the traditional mining approach which often bulk processes data collected over several months or years, thus failing to capture changing trends in the underlying processes generating the data. Some interesting applications in this area include intrusion and anomaly detection, trend prediction and anticipatory marketing.

Related Documents

Database privacy is an emerging research area that has recently piqued my interest, given its importance in this age of electronic commerce and data outsourcing. With the explosive growth of the internet and rapid increase of digital storage capacities making information exchange easier than ever before, privacy concerns of end consumers will be one of the key issues arising in the electronic age. As a start, I am currently exploring the possibility of employing sketch based methods for developing privacy preserving algorithms. These include problems such as privacy preserving distance computation, frequency estimation etc., all of which have direct applications in areas such as collaborative mining, data matching and database outsourcing.

Back to Home                                      Publications                              Contact Information