Defense: Bayesian Modeling for Heterogeneous Multivariate Data

Speaker Name: 
Arthur Lui
Speaker Title: 
PhD Candidate
Speaker Organization: 
Statistical Science PhD
Start Time: 
Friday, March 5, 2021 - 10:00am
End Time: 
Friday, March 5, 2021 - 11:00am
Zoom - - Passcode: 815223

Abstract: This defense presents Bayesian statistical methods for analyzing heterogeneous multivariate data, with application to marker expression data obtained from cytometry by time-of-flight (CyTOF). A Bayesian feature allocation model (FAM) is presented for identifying cell subpopulations from multiple samples of cell surface or intracellular marker expression levels measured by CyTOF. Cell subpopulations are characterized by differences in marker expression patterns, and individual cells are clustered into subpopulations according to the patterns of their observed expression levels. A finite Indian buffet process is used to model subpopulations as latent features, and a model-based method built on these latent feature subpopulations is used to construct cell clusters within each sample. Non-ignorable missing data due to technical artifacts in mass cytometry instruments are accounted for by defining a static missingness mechanism. A repulsive FAM (rep-FAM), which restructures the probability distribution of a traditional FAM so that the identified features are more likely to be distinct from each other, is then presented. The rep-FAM eliminates the positive probability that a conventional FAM places on repeating a feature, and it increases the probability of larger differences between features. The rep-FAM yields clusters that are more biologically interpretable than those identified by a conventional FAM. Methods for assessing differential distributions between two experimental conditions, in the context of CyTOF data, are then presented. A zero-inflated mixture of log-skew-t distributions is used to model the multimodal, heavy-tailed, and often highly skewed distributions that arise from these marker expression levels. A distance metric is proposed to quantify differences between distributions under various experimental conditions. The performance and limitations of the proposed methodologies are assessed through simulation studies and real data analyses.
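
As a rough illustration of the finite Indian buffet process mentioned in the abstract, the sketch below draws a binary marker-by-feature matrix from the standard beta-Bernoulli construction. The abstract does not state the parameterization used in the defense, so the Beta(alpha/K, 1) form, the function names, and the parameter values here are assumptions chosen for illustration only. The second helper counts exactly repeated columns, which is the situation a conventional FAM allows with positive probability and the rep-FAM is described as ruling out.

```python
import random


def sample_finite_ibp(num_markers, num_features, alpha, seed=None):
    """Draw a binary marker-by-feature matrix Z from a finite IBP prior.

    Assumed (textbook) construction, not taken from the abstract:
        v_k ~ Beta(alpha / K, 1),   Z[j][k] ~ Bernoulli(v_k).
    """
    rng = random.Random(seed)
    # One inclusion probability per latent feature (cell subpopulation).
    probs = [rng.betavariate(alpha / num_features, 1.0)
             for _ in range(num_features)]
    # Each marker j is independently switched on in feature k with prob v_k.
    return [[1 if rng.random() < p else 0 for p in probs]
            for _ in range(num_markers)]


def count_duplicate_features(Z):
    """Count columns of Z that exactly repeat another column."""
    columns = list(zip(*Z))
    return len(columns) - len(set(columns))


# Hypothetical sizes: 6 markers, 4 latent features.
Z = sample_finite_ibp(num_markers=6, num_features=4, alpha=2.0, seed=0)
for row in Z:
    print(row)
print("duplicate features:", count_duplicate_features(Z))
```

Running the sampler over several seeds shows that repeated columns do occur under this prior, which gives some intuition for why a repulsive prior over features might be preferred.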

Advisor: 
Juhee Lee
Graduate Program: 
Statistical Science PhD