Alluxio: Building a Distributed Data Access Layer for Big Data Analytics on Any Cloud

Speaker Name: 
Dr. Bin Fan
Speaker Organization: 
Alluxio, Inc.
Start Time: 
Thursday, March 21, 2019 - 10:00am
End Time: 
Thursday, March 21, 2019 - 11:00am
Location: 
E2 280
Organizer: 
Professor Chen Qian

Abstract

The rise of computation-intensive workloads and the adoption of cloud storage (like S3, GCS) and object storage (like Ceph, Swift) have driven organizations to adopt a decoupled architecture for modern workloads, one in which compute scales independently from storage. Alluxio (www.alluxio.org) is an open-source distributed file system that sits between the compute and storage layers and allows you to realize the benefits of a decoupled architecture with improved performance. Alluxio not only provides distributed applications like Presto and Apache Spark with a common, unified data access layer to different data sources, but also intelligently manages and places data and metadata closer to where they are needed. As a result, big data or ML applications can seamlessly access multiple different data sources with consistent performance. Alluxio began as a research project named “Tachyon” at UC Berkeley AMPLab and has more than 900 contributors on GitHub today.

In this talk, we will discuss the design of Alluxio, its architecture and workflow, and its use cases. We will dive into the choices in its design space and share experiences from implementing data tiering, storage options, and cache eviction policies.

Bio

Bin Fan is the founding engineer of Alluxio, Inc. and a PMC maintainer of the Alluxio open source project. Prior to Alluxio, he worked at Google, building its next-generation storage infrastructure. Bin received his Ph.D. in Computer Science from Carnegie Mellon University for work on the design and implementation of distributed systems and algorithms, including Cuckoo Filter, MemC3, and SILT.