Self-Driving Databases & My Pregnant Wife: The Hard Parts

Abstract:
A current research trend is to develop "learned" components that supplement or replace legacy components in database management systems (DBMSs). Such learned components use machine learning (ML) methods to identify non-trivial trends and correlations in the DBMS's runtime behavior. They then use this information to create execution strategies and data structures that are tailored to the application's access patterns. The hope is that learned components will enable new optimizations that are not possible today because the complexity of managing DBMSs has surpassed the abilities of humans. This could then lead to the ultimate goal of achieving a "self-driving" DBMS that is able to configure, manage, and optimize itself automatically as the database and its workload evolve over time. The bad news is that creating such a fully autonomous DBMS is harder than simply adding learned components to an existing system. The problem requires both holistic systems engineering and novel ML solutions.

In this talk, I will discuss the pressing unsolved problems in self-driving DBMSs. These include how to support training data collection, fast state changes, succinct state and action representations, and accurate reward observations. I will also present techniques for building a new autonomous DBMS, as well as the steps needed to retrofit an existing one to enable automated control.

Bio:
Andy Pavlo is an Associate Professor of Databaseology in the Computer Science Department at Carnegie Mellon University. His (unnatural) infatuation with database systems has inadvertently caused him to incur several distinctions, such as the NSF CAREER (2019), a Sloan Fellowship (2018), and the ACM SIGMOD Jim Gray Dissertation Award (2014).