Database Colloquium

The database colloquium is the weekly meeting of students and faculty interested in data management, data mining, or related topics at Cornell. The colloquium typically features a presentation of a seminal or recent paper of general interest. While many of the speakers are from the Cornell community, the colloquium also invites outside speakers to talk about their research. The colloquium is held every Monday from 12:15 to 1 pm in 5130 Upson Hall.

On days when the database colloquium does not have an outside speaker, the colloquium is replaced by a more informal database lunch. This is a short lunch starting at noon, followed by an informal paper discussion on a recent topic of interest.





September 6 CS Department Colloquium Magda Balazinska Upson B17, 4:15pm
September 10

Scalable and Flexible Similarity Matching Operators

Similarity matching is a classical problem. However, the fast increase in both the amount and the dimensionality of the data calls for new scalable methods, especially those suitable for massively parallel and distributed environments. Furthermore, the presence of large data also means that data become more feature-rich, which leads to new meaningful semantics when designing similarity matching operators to satisfy the needs of user applications. In this talk, we will share our experience and findings in processing kNN joins for large data in MapReduce, and answering flexible aggregate similarity search queries. Lastly, when executing similarity matching operators for large data in the cloud, security is also a paramount concern. We will briefly explore this topic at the end.

Speaker bio: Feifei Li has been an assistant professor at the School of Computing, University of Utah, since August 2011. He was an assistant professor at the Computer Science Department, Florida State University, between August 2007 and July 2011. He obtained his B.S. in computer engineering from Nanyang Technological University, Singapore, in 2002 (transferred from Tsinghua University, China) and his PhD in computer science from Boston University in 2007. His research focuses on various issues in databases and large-scale data management. He received an NSF CAREER award in 2011, two HP IRP awards in 2011 and 2012 respectively, and the IEEE ICDE best paper award in 2004.

Feifei Li 5130 Upson
September 17

Urban Computing with City Dynamics

Urban computing is emerging as a concept where every sensor, device, person, vehicle, building, and street in urban areas can be used as a component to probe city dynamics and further enable city-wide computing that serves people and their cities. Urban computing aims to enhance both human life and the urban environment through a recurrent process of sensing, mining, understanding, and improving. It also aims to deeply understand the nature and sciences behind the phenomena occurring in urban spaces, using a variety of heterogeneous data sources that reflect city dynamics, such as traffic flows, human mobility, geographic and map data, environment, energy consumption, populations, and economics. In this talk, we will present our recent research into urban computing with city dynamics, introducing innovative application scenarios and the technology for integrating and mining heterogeneous city dynamics, such as finding smart driving directions based on taxi trajectories, identifying different functional regions (e.g., residential and commercial areas) in urban spaces using both POIs and human mobility, gleaning problematic city configurations, and detecting anomalies in road traffic flows (these examples have recently been published in top-tier conferences and journals such as KDD, UbiComp, and ICDE). More details can be found on this page.

Speaker bio: Dr. Yu Zheng is a lead researcher at Microsoft Research Asia. He is an IEEE senior member and an ACM senior member. His research interests include location-based services, spatio-temporal data mining, ubiquitous computing, and mobile social applications. He has published over 50 refereed papers at high-quality international conferences and journals, such as SIGMOD, SIGKDD, AAAI, ICDE, WWW, UbiComp, and IEEE TKDE, where he has received 3 best paper awards, 1 best paper nomination, and a number of most cited papers. These papers have also been featured multiple times by top-tier press outlets such as MIT Technology Review. In addition, he has served over 30 prestigious international conferences as a chair or a program committee member, including ICDE, KDD, UbiComp, and IJCAI. So far, he has received 3 technical transfer awards from Microsoft and has 20 granted or filed patents. In 2008, he was recognized as the Microsoft Golden Star. His homepage is found here.

Yu Zheng 5130 Upson
September 20 CS Department Colloquium Sam Madden Upson B17, 4:15pm
September 24 Bidirectional Transformations for Web Data Nate Foster 5130 Upson
October 1

Google Ads Backend System

In this talk, we will present the high-level architecture of the "Ads Backend" system, the engine powering search advertising at Google. We will then describe in more detail one large-scale distributed data processing system in this area, Photon.

Photon: Fault-tolerant and scalable joining of continuous data streams

Photon is a highly fault-tolerant, scalable, low-latency, and stateful distributed system for joining multiple streams of continuously flowing data. Joining these data streams is critical for extracting key metrics about Google’s ads system that are used for billing and internal analysis. Photon achieves exactly-once semantics and can automatically withstand datacenter-level outages, providing an order of magnitude higher uptime SLA than a single-datacenter system. Our production deployment processes over one hundred thousand events per second at peak, with an end-to-end latency of less than 10 seconds. In this talk, we will focus on the high-level architecture of Photon, including a highly scalable Paxos-based storage system.

Manpreet Singh 5130 Upson
October 22

Local Thresholding for Structured and Unstructured Graphs

Local thresholding algorithms were first offered a decade ago as a communication-thrifty alternative for computation in large distributed environments. Their disadvantage, however, has always been their brittleness: a single cycle in the communication graph could mean the algorithm converges to the wrong value. This talk describes two advances in local thresholding algorithms that overcome the demand for cycle freedom. The first is a local tree induction protocol for structured peer-to-peer networks, which integrates seamlessly with the local thresholding algorithm. The second is a set of new local stopping and update rules that permit execution of the local thresholding algorithm on general graphs. The first solution vastly outperforms a gossip-based algorithm on simple computation tasks in a Chord-like peer-to-peer network. The second may transform the way data is processed in wireless sensor networks, where gossip is mostly considered impermissibly costly.

Ran Wolff 5130 Upson
November 5

TileHeat: A Framework for Tile Selection

Public geospatial services are now commonly available on the Web. These services often render maps to users by dividing the maps into tiles. Given that geospatial services experience significant user load, it is desirable to pre-compute tiles at a time of low load in order to increase overall performance. This talk reports on our experience with tile caching based on our analysis of the request log of a public geospatial service provider. We observe that windows of low load occur with a periodic pattern. In addition, our analysis shows that tile access patterns exhibit strong spatial skew. Based on these observations, we propose an adaptive strategy that restricts the set of pre-computed tiles to fit the low-load time window. Ideally, the restricted tile set should deliver performance comparable to the full tile set. To achieve this result, tiles should be selected based on their expected popularity. Our key observation is that the popularity of a tile can be estimated by analyzing the tiles that users have previously requested. Our adaptive strategy constructs heatmaps of previous requests and uses this information to decide which tiles to pre-compute. In addition, we model local variations in both space and time to increase the quality of our predictions. We evaluate our methods against a real production log, and observe that the latter heuristic achieves a 25% increase in the hit ratio compared to current methods, without pre-computing a larger set of tiles.
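
As a rough illustration of the heatmap idea described in the abstract (not the TileHeat implementation itself), the Python sketch below counts past tile requests and picks the most requested tiles up to a fixed budget. The tile key format (zoom, x, y), the function names, the budget parameter, and the toy request logs are all assumptions made for this example; the spatial and temporal modeling mentioned in the abstract is not included.

```python
from collections import Counter


def build_heatmap(requests):
    """Count how often each tile key (zoom, x, y) appears in a request log."""
    return Counter(requests)


def select_tiles(heatmap, budget):
    """Return up to `budget` tiles, most requested first.

    Ties are broken by tile key so the result is deterministic.
    """
    ranked = sorted(heatmap.items(), key=lambda item: (-item[1], item[0]))
    return [tile for tile, _count in ranked[:budget]]


def hit_ratio(selected, requests):
    """Fraction of requests that would be served by the pre-computed tiles."""
    if not requests:
        return 0.0
    selected = set(selected)
    hits = sum(1 for tile in requests if tile in selected)
    return hits / len(requests)


# Hypothetical logs: one from a past period, one from the following period.
past_log = [(12, 5, 7)] * 4 + [(12, 5, 8)] * 2 + [(12, 9, 1)]
next_log = [(12, 5, 7), (12, 5, 8), (12, 9, 1), (12, 5, 7)]

heatmap = build_heatmap(past_log)
chosen = select_tiles(heatmap, budget=2)
print(chosen)
print(hit_ratio(chosen, next_log))
```

In this sketch the budget stands in for the number of tiles that can be rendered during the low-load window; a plain request count is only the simplest possible popularity estimate.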

Speaker bio: Marcos Vaz Salles is an assistant professor at the Department of Computer Science (DIKU) of the University of Copenhagen. His research targets building novel data-driven systems that bring classic database benefits, such as scalability and ease of programming, to new domains. In work that started during his postdoc at Cornell University, Marcos is investigating how to bring data management techniques into programming of parallel applications, such as computer games and behavioral simulations, in cloud platforms. During his PhD in the Systems Group at ETH Zurich, he investigated hybrid search and data integration architectures for personal dataspace management in the iMeMex project. Previously, Marcos obtained his MSc from PUC-Rio, Brazil, and his BSc from UNICAMP, Brazil.

Marcos Vaz Salles 5130 Upson

Prior semesters: