Configuration errors are among the dominant causes of service-wide, catastrophic failures in today's cloud and datacenter systems. Despite the wide adoption of fault-tolerance and recovery techniques, these large-scale software systems still fail to deal effectively with configuration errors. In fact, even the tolerance and recovery mechanisms themselves are often misconfigured and thus crippled in practice.
In this talk, I will describe our research efforts towards hardening cloud and datacenter systems against configuration errors. First, I will briefly present work that seeks to understand the fundamental causes of misconfigurations, in particular how sysadmins configure systems in the real world and how misconfigurations are introduced in the field. I will then focus on work that enables cloud and datacenter systems to anticipate and defend against configuration errors, including checking configurations proactively to detect errors early and reacting gracefully when errors occur.
Tianyin Xu is a Ph.D. candidate in Computer Science and Engineering at UC San Diego. His research interests lie at the intersection of systems, software engineering, and HCI, with the overarching goal of building reliable and secure systems. His dissertation work has impacted the configuration design and implementation of real-world commercial and open-source systems, and has received a Best Paper Award at OSDI 2016.