Statistical Methods for Microbiome Data

Block Bootstrap Method

Rapidly evolving high-throughput sequencing technology enables a comprehensive search for microbial biomarkers using longitudinal experiments. Such experiments consist of repeated biological observations from each subject over time and are essential for accounting for the high between-subject and within-subject variability.

Longitudinal microbiome data are used either to model abundance over time or to compare the abundances of bacteria between two or more cohorts. We have devised a method for making nonparametric inferences from longitudinal microbiome data in the latter case.


The proposed resampling method combines the moving block bootstrap (MBB) method (Lahiri 2013), an empirical subsampling method (Hall, Horowitz, and Jing 1995), a mixture model (McMurdie and Holmes 2014), a generalized linear model (Diggle 2002), generalized estimating equations (Liang and Zeger 1986), the median-ratio method (Anders and Huber 2010), and shrinkage estimation (Robbins 1956; Stephens 2016) to enable inference on longitudinal microbiome data. With the optimal block size computed using subsampling, the MBB method accounts for within-subject dependency by resampling overlapping blocks of repeated observations within each subject, which allows valid inferences based on approximately pivotal statistics. This resampling method for dependent data was motivated by the literature on the block bootstrap for time series data, together with the subsampling method for estimating the optimal block size.
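The overlapping-block idea can be illustrated with a minimal sketch. This is a generic moving block bootstrap written in Python for illustration only; it is not the bootLong implementation. The function names, the toy abundance values, and the fixed block length of 3 are all assumptions made for the example, and the sketch does not include the subsampling step that selects the block size, the normalization, or the model fitting.

```python
import random


def moving_blocks(series, block_len):
    """Return every overlapping block of length block_len from one series."""
    if not 1 <= block_len <= len(series):
        raise ValueError("block_len must be between 1 and len(series)")
    return [series[i:i + block_len]
            for i in range(len(series) - block_len + 1)]


def mbb_resample(series, block_len, rng):
    """Draw one bootstrap replicate by concatenating randomly chosen blocks.

    Blocks are sampled with replacement and the result is trimmed to the
    original length, so short-range ordering inside each block is preserved.
    """
    blocks = moving_blocks(series, block_len)
    n_blocks = -(-len(series) // block_len)  # ceiling division
    replicate = []
    for _ in range(n_blocks):
        replicate.extend(rng.choice(blocks))
    return replicate[:len(series)]


def mbb_by_subject(data, block_len, seed=0):
    """Resample each subject's series separately (data: subject -> list)."""
    rng = random.Random(seed)
    return {subject: mbb_resample(obs, block_len, rng)
            for subject, obs in data.items()}


# Hypothetical toy counts for one taxon at six time points per subject.
abundance = {
    "subject_A": [12, 15, 14, 20, 18, 25],
    "subject_B": [3, 4, 8, 7, 9, 11],
}
replicate = mbb_by_subject(abundance, block_len=3, seed=1)
```

Because blocks are drawn only from a subject's own series, each replicate keeps the local time ordering within that subject. Repeating the draw many times and recomputing a statistic on each replicate gives a bootstrap distribution for that statistic.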


I developed an open-source R package, bootLong, to make our method accessible and to provide tools for exploring within-subject dependency in longitudinal data. See the tutorial.


The manuscript is available at arXiv.

My Presentations

A Bayesian Approach to Contamination Removal in Molecular Microbial Studies

With the potential to diagnose any known microbial organism, metagenomic Next-Generation Sequencing (NGS) has been regarded as a tool that will revolutionize infectious disease diagnostics. NGS removes the need for a pre-informed hypothesis from clinicians, detects nonculturable organisms, and can be optimized for a turnaround time of 24-48 hours. Only recently, however, has the scientific community begun to understand the pitfalls of NGS. Microbial nucleic acids from reagent and lab environment contamination have been shown to produce signals that researchers falsely infer to be the cause of a patient’s illness. This problem is exacerbated in low biomass samples, such as plasma, where more than 99% of sequencing reads align to the human genome.

Although extracting and sequencing molecular-grade water to provide negative controls helps overcome this issue, downstream analysis can still be challenging because many common contaminants are also clinically relevant organisms. Computational methods to identify contaminants in low biomass samples are limited.


Our method will be available as an open-source R package at BARBI. A tutorial will be available soon.


  • Method paper (submitted)

  • Application to suspected sepsis diagnosis (submitted)

  • Application to identify translocation of bacteria (ready to submit)

My Presentations

  • July 24-27, 2019: 21st Meeting of New Researchers in Statistics and Probability: schedule available soon.

  • April 2, 2019: 3rd Workshop on Statistical and Algorithmic Challenges in Microbiome Data Analysis: video link here.


Anders, Simon, and Wolfgang Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biology 11 (10). BioMed Central: R106.

Diggle, Peter. 2002. Analysis of Longitudinal Data. Oxford University Press.

Hall, Peter, Joel L Horowitz, and Bing-Yi Jing. 1995. “On Blocking Rules for the Bootstrap with Dependent Data.” Biometrika 82 (3). Biometrika Trust: 561–74.

Lahiri, Soumendra Nath. 2013. Resampling Methods for Dependent Data. Springer Science & Business Media.

Liang, Kung-Yee, and Scott L Zeger. 1986. “Longitudinal Data Analysis Using Generalized Linear Models.” Biometrika 73 (1). Oxford University Press: 13–22.

McMurdie, Paul J, and Susan Holmes. 2014. “Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible.” PLoS Comput Biol 10 (4). Public Library of Science: e1003531.

Robbins, Herbert. 1956. “An Empirical Bayes Approach to Statistics.” Columbia University, New York City, United States.

Stephens, Matthew. 2016. “False Discovery Rates: A New Deal.” Biostatistics 18 (2). Oxford University Press: 275–94.
