Heading off Correlated Failures in Cloud-Scale Systems

Speaker Name: Ennan Zhai
Speaker Title: Associate Research Scientist
Speaker Organization: Computer Science Department at Yale University
Start Time: Friday, February 23, 2018 - 11:00am
End Time: Friday, February 23, 2018 - 12:15pm
Cormac Flanagan


Today's cloud systems rely heavily on redundancy for reliability. Nevertheless, as cloud systems become ever more structurally complex, seemingly independent infrastructure components may unwittingly share deep dependencies. These unexpected common dependencies may result in correlated failures that undermine redundancy efforts. State-of-the-art approaches, e.g., post-failure forensics, not only lead to prolonged failure recovery time in structurally complex systems, but also fail to avoid expensive service downtime. In this talk, I will present a series of works towards preventing correlated failures, not only within a single cloud datacenter but also across multiple cloud providers. In the first part of the talk, I will show a system that helps datacenter operators proactively audit correlated failure risks through: 1) automatically collecting dependencies, 2) constructing a fault graph to model the target system stacks, and 3) analyzing the fault graph to identify potential risks.
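As a rough illustration of steps 2 and 3, the following minimal sketch builds a toy fault graph and enumerates its minimal risk groups, i.e., the smallest sets of components whose joint failure takes down the service. The graph contents, the names `FAULT_GRAPH`, `fails`, and `minimal_risk_groups`, and the brute-force search are all assumptions made for illustration; the system described in the talk relies on SAT/SMT-based primitives rather than exhaustive enumeration.

```python
from itertools import combinations

# Hypothetical fault graph: a gate node maps to (gate type, children).
# Nodes that never appear as keys are basic components (leaves).
FAULT_GRAPH = {
    "service": ("AND", ["replica_a", "replica_b"]),  # redundant: fails only if both fail
    "replica_a": ("OR", ["server_a", "switch_1"]),   # fails if any dependency fails
    "replica_b": ("OR", ["server_b", "switch_1"]),
}

def leaves(graph):
    """Return the basic components, sorted for deterministic output."""
    children = {child for _, kids in graph.values() for child in kids}
    return sorted(children - set(graph))

def fails(graph, node, failed):
    """Evaluate whether `node` fails given the set of failed leaves."""
    if node not in graph:
        return node in failed
    gate, kids = graph[node]
    results = [fails(graph, kid, failed) for kid in kids]
    return all(results) if gate == "AND" else any(results)

def minimal_risk_groups(graph, root):
    """Brute-force search for minimal leaf sets that cause `root` to fail."""
    basics = leaves(graph)
    found = []
    for size in range(1, len(basics) + 1):
        for combo in combinations(basics, size):
            candidate = set(combo)
            # Skip supersets of an already-found group: they are not minimal.
            if any(group <= candidate for group in found):
                continue
            if fails(graph, root, candidate):
                found.append(candidate)
    return found

RISK_GROUPS = minimal_risk_groups(FAULT_GRAPH, "service")
print(RISK_GROUPS)
```

In this toy graph the shared switch shows up as a risk group of size one, which is exactly the kind of hidden common dependency that defeats the two-replica redundancy.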
To ensure the practicality, efficiency, and accuracy of our approach, we further equip our system with a domain-specific auditing language framework, a set of high-performance auditing primitives based on SAT/SMT solvers, and an automatic correlated failure risk repair engine. Our system is 300x more efficient in auditing time than state-of-the-art systems. In the second part of the talk, I will present another system capable of preventing correlated failures across multiple cloud providers that are unwilling to share their dependency information for business privacy reasons. We construct a private set similarity protocol to evaluate the independence of each alternative inter-cloud replication without leaking any sensitive information, thus preventing correlated failures across different providers at an early stage.
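To make the independence metric concrete, here is a minimal sketch that scores alternative two-provider replications by the Jaccard similarity of their dependency sets, where a lower score means fewer shared dependencies. It is only an assumption that the protocol targets a Jaccard-style measure, and the provider names, dependency sets, and function names are hypothetical. The sketch computes the similarity in the clear; the cryptographic layer that would let providers keep their sets private is deliberately omitted.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical dependency sets. In the setting described above, these would
# never be exchanged in plaintext between providers.
PROVIDER_DEPS = {
    "cloud_x": {"isp_1", "power_grid_1", "dns_1"},
    "cloud_y": {"isp_1", "power_grid_2", "dns_1"},
    "cloud_z": {"isp_2", "power_grid_3", "dns_2"},
}

def rank_replications(deps):
    """Rank provider pairs from most independent (lowest similarity) to least."""
    pairs = combinations(sorted(deps), 2)
    scored = [(jaccard(deps[p], deps[q]), p, q) for p, q in pairs]
    return sorted(scored)

RANKED = rank_replications(PROVIDER_DEPS)
for score, p, q in RANKED:
    print(f"{p} + {q}: similarity {score:.2f}")
```

Replicating across `cloud_x` and `cloud_y` scores worst here because the two share an ISP and a DNS provider, so a failure in either could take down both replicas at once.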


Ennan Zhai is currently an Associate Research Scientist in the Computer Science Department at Yale University, where he also received his Ph.D. in 2015. His research focuses on building secure and reliable computer systems. Specifically, his work takes advantage of an interdisciplinary approach, integrating areas including distributed systems, security, verification, and programming languages.