Abstract

One of the important tasks in molecular microbiome data is the quantification and statistical inference of micriobial communities across envrionment or groups. The count data are presented as a contingency table, for each sample the number of sequence fragments that have been assigned to each taxon. The package diffTop provides methods to estimate topics using latent Dirichlet allocation, align posterior samples across chains, find topic differences across environments by use of generalized linear models, visualize the topic and taxa proportions. This vignette explains the use of the package and demonstrates typical workflows.

library(diffTop)

Standard workflow

Note: if you use diffTop in published research, please cite:

Jeganathan, P. and Holmes, S.P. (2021) A statistical perspective on the challenges in molecular microbial biology. Journal of Agricultural, Biological and Environmental Statistics, 26(2):131 - 160. 10.1007/s13253-021-00447-1

Quick start

Here we show the steps for the differential topic analysis. The following code assume that the phyloseq object is available. Please refer to phyloseq for creating a phyloseq class object.

K <- 11
alpha <- 0.8
gamma <- 0.8
ps <- psE_BARBI
stan.data <- setUpData(
  ps = ps,
  K = K,
  alpha = alpha,
  gamma = gamma
)

iter <- 2000
chains <- 4
stan.fit <- LDAtopicmodel(
  stan_data = stan.data, 
  iter = iter, 
  chains = chains, 
  sample_file = NULL,
  diagnostic_file = NULL,
  cores = 4,
  control = list(adapt_delta = 0.9),
  save_dso = TRUE,
  algorithm = "NUTS"
  )

samples <- rstan::extract(
  stan.fit, 
  permuted = TRUE, 
  inc_warmup = FALSE, 
  include = TRUE
  )

theta <- samples$theta 

aligned <- alignmentMatrix(
  theta, 
  ps, 
  K, 
  iterUse = iter/2,
  chain = chains,
  SampleID_name = "unique_names"
  )

theta_aligned <- thetaAligned(
  theta, 
  K, 
  aligned, 
  iterUse = iter/2, 
  chain = chains
  )

beta <- samples$beta 

beta_aligned <- betaAligned(
  beta, 
  K, 
  aligned, 
  iterUse = iter/2, 
  chain = chains
  ) 


dds_all <- diffTopAnalysis(
  design = ~ pna,
  ps,
  theta_aligned,
  subsetSample = sample_names(psE_BARBI)
)
res_all <- results(dds_all)
res_all

Input data

We set-up data for latent Dirichlet allocation (LDA) based on a phyloseq object, the number of topics, hyperparameters for topic proportion in each sample and ASVs proportion in each topic.

We set the hyperparameters \(\alpha\) and \(\gamma\) less than one to generate mixtures that are different from each other. We choose 0.8 for \(\alpha\) across all samples so that we avoid generating unrealistic topics.

data("psE_BARBI")
ps <- psE_BARBI
K <- 11
alpha <- 0.8
gamma <- 0.8
stan.data <- setUpData(
  ps = ps,
  K = K,
  alpha = alpha,
  gamma = gamma
)

Posterior sampling

We need to estimate \(\theta\) and \(B = \left[\beta_{1}, \cdots, \beta_{K} \right]^{T}.\)

We estimate the parameters using HMC NUTS with four chains and 2000 iterations. Out of these 2000 iterations, 1000 iterations were used as warm-up samples.

LDAtopicmodel uses an LDA model written in lda.stan. The user doesn’t need to write the Stan model. We use stan function from rstan package. Please refer to ?stan to specify the inputs for the stan function. For the whole dataset, we can run the following chunk in high-performance computing and save the results.

iter <- 2000
chains <- 4

stan.fit <- LDAtopicmodel(
  stan_data = stan.data, 
  iter = iter, 
  chains = chains, 
  sample_file = NULL,
  diagnostic_file = NULL,
  cores = 4,
  control = list(adapt_delta = 0.9),
  save_dso = TRUE,
  algorithm = "NUTS"
  )

We extract posterior samples from the results.

samples <- rstan::extract(
  stan.fit, 
  permuted = TRUE, 
  inc_warmup = FALSE, 
  include = TRUE
  )

Alignment

We estimated the parameters using the HMC-NUTS sampler with four chains and 2000 iterations. Out of these 2000 iterations, 1000 iterations were used as warm-up samples and discarded. Label switching across chains makes it difficult to directly compute log predictive density, split-\(\hat{R}\), effective sample size for model assessment, and evaluate convergence and mixing of chains.

To address this issue, we fixed the order of topics in chain one and then found the permutation that best aligned the topics across all four chains. For each chain two to four, we identified the estimated topics pair with the highest correlation, then found the next highest pair among the remaining, and so forth.

We create a Topic \(*\) Chain matrix.

theta <- samples$theta 

aligned <- alignmentMatrix(
  theta, 
  ps, 
  K, 
  iterUse = iter/2,
  chain = chains,
  SampleID_name = "unique_names"
  )

Align topic proportion

We align the topic proportion in each specimen across four chains.

theta_aligned <- thetaAligned(
  theta, 
  K, 
  aligned, 
  iterUse = iter/2, 
  chain = chains
  )

Align ASV proportion

We align the ASVs proportion in each topic across four chains.

# an array (iterations *topic * ASV)
beta <- samples$beta 

# an array (iterations *topic * ASV)
beta_aligned <- betaAligned(
  beta, 
  K, 
  aligned, 
  iterUse = iter/2, 
  chain = chains
  ) 

Visualization

Plot topic proportion

Plot the topic proportion in specimens and we can save it for publication purposes. We can draw an informative summary of bacterial communities in each phenotype.

plotTopicProportion(
  ps,
  theta_aligned,
  K,
  col_names_theta_all = c("iteration", "Sample", "Topic", "topic.dis"),
  chain = 4,
  iterUse = iter/2,
  SampleIDVarInphyloseq = "unique_names",
  design = ~ pna
)

Plot ASVs distribution

We plot the ASVs distribution in each topic.

plotASVCircular(
  ps,
  K,
  iterUse = iter/2,
  chain = chains,
  beta_aligned,
  varnames = c("iterations", "topic", "rsv_ix"),
  value_name = "beta_h",
  taxonomylevel = "Class",
  thresholdASVprop = 0.008
  )

Model assessement

We do goodness of fit test using the posterior estimates and observed data.

modelAssessment(
  ps,
  stan.fit,
  iterUse = iter/2,
  statistic = "max",
  ASVsIndexToPlot = c(1, 3, 10:14, 19:26, 36, 51:53, 148)
)

Diagnostics

Effective sample size (ESS)

We compute effective sample size for each parameter.

p <- diagnosticsPlot(
  theta_aligned = theta_aligned,
  beta_aligned = beta_aligned
  iterUse = iter/2,
  chain = chains
)
p_ess_bulk <- p[[[1]]]
p_ess_bulk

\(\hat{R}\)

We can check whether \(\hat{R}\) is less than 1.02

p_rhat <- p[[2]]
p_rhat

Differential topic analysis

Test on all specimens

dds_all <- diffTopAnalysis(
  design = ~ pna,
  ps,
  theta_aligned,
  subsetSample = sample_names(ps),
  fitType = "mean"
)
results(dds_all)

Test on selected specimens

Let’s test on the paired-specimens.

dds_all_paired <- diffTopAnalysis(
  design = ~ pna,
  ps,
  theta_aligned,
  subsetSample = paired_specimens
)
results(dds_all_paired)