Data quality is one of the most important problems in data management and data science, since dirty data often leads to inaccurate analytics results and wrong business decisions. This is why, according to recent studies, data scientists spend 60-80% of their time cleaning and transforming data sets. A typical data cleaning process consists of three steps: data quality rule specification, error detection, and error repair.
In this talk, I will discuss my proposals for addressing challenges in data cleaning workflows. First, I will introduce a system that automatically discovers data quality rules from possibly dirty data instances, instead of requiring domain experts to design these rules, which is often expensive and rarely done in practice. Second, I will show a holistic error detection and error repair technique that accumulates evidence from a broad spectrum of data quality rules and suggests more accurate data repairs. I will conclude the talk by discussing some ongoing work on IoT data quality and my long-term vision of debugging data analytics.
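To make the three-step workflow (rule specification, error detection, error repair) concrete, here is a minimal Python sketch. It is not the system or technique presented in the talk. It assumes a single hand-written functional dependency rule (`zip` determines `city`), a made-up toy table, and a simple majority-vote repair; the function names `detect_fd_violations` and `repair_by_majority` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy table; the values are made up for illustration.
rows = [
    {"zip": "14850", "city": "Ithaca"},
    {"zip": "14850", "city": "Ithaca"},
    {"zip": "14850", "city": "Ithca"},    # likely typo
    {"zip": "10044", "city": "New York"},
]


def detect_fd_violations(rows, lhs, rhs):
    """Error detection: find groups that violate the rule lhs -> rhs.

    Returns {lhs_value: Counter of rhs values} for every lhs value
    that maps to more than one distinct rhs value.
    """
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    return {key: counts for key, counts in groups.items() if len(counts) > 1}


def repair_by_majority(rows, lhs, rhs, violations):
    """Error repair (assumed strategy): give each violating group its
    most frequent rhs value. Returns a new list; the input is untouched."""
    repaired = []
    for row in rows:
        fixed = dict(row)
        if row[lhs] in violations:
            fixed[rhs] = violations[row[lhs]].most_common(1)[0][0]
        repaired.append(fixed)
    return repaired


# Rule specification: the functional dependency zip -> city, written by hand.
violations = detect_fd_violations(rows, "zip", "city")
cleaned = repair_by_majority(rows, "zip", "city", violations)
```

A repair driven by one rule in isolation can pick the wrong value when several rules interact, which is the kind of situation a holistic approach that combines evidence from many rules is meant to handle.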
Bio:
Xu Chu is a PhD candidate in the Data Systems Group at the University of Waterloo, advised by Prof. Ihab Ilyas. He is broadly interested in data management, with a special focus on new theories, algorithms, and systems for managing large volumes of dirty and inconsistent data, including data quality rule discovery, error and outlier detection, and automatic data repair. Xu was awarded the Microsoft Research PhD Fellowship in 2015 and the Cheriton Fellowship from the University of Waterloo in 2013.