Practical Reproducibility in Systems Research - A CROSS Mini-Workshop

Start Time: 
Friday, February 28, 2020 - 1:30pm
End Time: 
Friday, February 28, 2020 - 5:00pm
Location: 
E2-506
Organizer: 
CROSS

Workshop Details:

Reception starts at 1:30pm; the workshop starts at 2:00pm.

Ivo Jimenez (UCSC): Popper 2.x: A Container-native Workflow Engine For Complex Application Testing and Validating Scientific Claims

Abstract: Software containers allow users to "bring their own environment" to shared computing platforms, reducing the friction between system administrators and their users. In recent years, multiple container runtimes have arisen, each addressing distinct needs (e.g. Singularity, Podman, rkt, among others), and an ongoing effort from the Linux Foundation (Open Container Initiative) is standardizing the specification of Linux container runtimes. While containers solve a big part of the reproducibility problem, there are scenarios where multi-container workflows are not fully addressed by existing runtimes or workflow engines. Current alternatives require a full scheduler (e.g. Kubernetes), a scientific workflow engine (e.g. Pegasus), or are constrained in the type of logic that users can express (e.g. Docker-compose). Ideally, users should be able to express workflows with the same user-friendliness and portability as `Dockerfile`s (write once, run anywhere). In this article, we introduce "Popper 2.x", a container-native workflow execution engine that allows users to express complex workflows much as they would in other scientific workflow languages, but with the advantage of running in container runtimes, bringing portability and ease of use to multiple platforms (HPC, Cloud, on-prem). Popper 2.x cleanly separates the three main concerns that are common in experimentation scenarios: experimentation logic, environment preparation, and system configuration. To exemplify the suitability of the tool, we present a case study where we take the experimentation pipeline defined for MLPerf and turn it into a Popper workflow.

Bio: Ivo Jimenez is a Research Scientist at UC Santa Cruz and Incubator Fellow at the UC Santa Cruz Center for Research on Open Source Software (CROSS). He is interested in large-scale distributed data management systems. His 2019 PhD dissertation focused on the practical aspects in the reproducible evaluation of systems research, work for which he was awarded the 2018 Better Scientific Software Fellowship. Ivo is originally from Mexico, where he got his B.S. in Computer Science from Universidad de Sonora.

Marios Kogias (EPFL): Infrastructure for microsecond-scale latency experiments

Abstract: In this talk I’m going to present our ongoing research effort on building load generators and latency-measuring tools for microsecond-scale datacenter services. In the first part of the talk I’ll talk about Lancet, a self-correcting tool designed to measure the open-loop tail latency of μs-scale datacenter applications with high fan-in connection patterns. Lancet is self-correcting in that it relies on online statistical tests to detect situations in which tail latency cannot be measured accurately from a statistical perspective, including situations where the workload configuration, the client infrastructure, or the application itself does not allow it. When available, Lancet leverages NIC-based hardware timestamping to measure RPC end-to-end latency. Otherwise, it uses an asymmetric setup with a latency agent that leverages busy-polling system calls to reduce the client bias. Lancet was presented at USENIX ATC 2019. In the second part of the talk I’ll describe SLOG (Switch LOad Generator), a programmable load generator and latency-measuring tool based on a programmable Tofino ASIC. SLOG leverages the programming capabilities and the fixed-function units of Tofino to generate load and measure tail latency for both NFs and RPC services, while being able to generate a Poisson inter-arrival distribution. To our knowledge, SLOG is the only hardware-based tool that is able to generate a randomized inter-arrival distribution, which is crucial for a realistic latency experiment.

Bio: Marios Kogias is a fifth-year PhD student at EPFL working with Edouard Bugnion. His research focuses on datacenter systems, and specifically on microsecond-scale Remote Procedure Calls. He is interested in improving the tail latency of networked systems by rethinking both operating system mechanisms, e.g. schedulers, and networking, e.g. transport protocols, while leveraging emerging datacenter hardware for in-network compute. Marios has interned at Microsoft Research, Google, and CERN, and he is an IBM PhD Fellow.

Alexandru Uta (Leiden University): Is Big Data Performance Reproducible in Modern Cloud Networks?

Abstract: Performance variability has been acknowledged as a problem for over a decade by cloud practitioners and performance engineers. Yet, our survey of top systems conferences reveals that the research community regularly disregards variability when running experiments in the cloud. Focusing on networks, we assess the impact of variability on cloud-based big-data workloads by gathering traces from mainstream commercial clouds and private research clouds. Our dataset consists of millions of datapoints gathered while transferring over 9 petabytes on cloud providers' networks. We characterize the network variability present in our data and show that, even though commercial cloud providers implement mechanisms for quality-of-service enforcement, variability still occurs, and is even exacerbated by such mechanisms and service provider policies. We show how big-data workloads suffer from significant slowdowns and lack predictability and replicability, even when state-of-the-art experimentation techniques are used. We provide guidelines to reduce the volatility of big-data performance, making experiments more repeatable.

Bio: Alexandru Uta has been an assistant professor at Leiden University since February 2020. Previously, he worked as a postdoctoral researcher at VU Amsterdam in the Massivizing Computer Systems Group. He received his PhD and MSc in 2017 and 2012, respectively, from the HPDC group at VU Amsterdam. His current research focuses on studying modern distributed systems, with a special interest in the performance variability and reproducibility of cloud computing and big data systems, variability-proof and reproducible experiment design, as well as resource management, and converged HPC, big data, and AI infrastructure. His work has been funded through industry (Google) and Dutch national grants (SURF), and he has received the best eScience project award at the e-Science Conference 2015.

Event Type: 
Event