diffTop.Rmd
Abstract
One of the important tasks in molecular microbiome data is the quantification and statistical inference of micriobial communities across envrionment or groups. The count data are presented as a contingency table, for each sample the number of sequence fragments that have been assigned to each taxon. The package diffTop provides methods to estimate topics using latent Dirichlet allocation, align posterior samples across chains, find topic differences across environments by use of generalized linear models, visualize the topic and taxa proportions. This vignette explains the use of the package and demonstrates typical workflows.
library(diffTop)
Note: if you use diffTop in published research, please cite:
Jeganathan, P. and Holmes, S.P. (2021) A statistical perspective on the challenges in molecular microbial biology. Journal of Agricultural, Biological and Environmental Statistics, 26(2):131 - 160. 10.1007/s13253-021-00447-1
Here we show the steps for the differential topic analysis. The following code assume that the phyloseq
object is available. Please refer to phyloseq
for creating a phyloseq
class object.
K <- 11
alpha <- 0.8
gamma <- 0.8
ps <- psE_BARBI
stan.data <- setUpData(
ps = ps,
K = K,
alpha = alpha,
gamma = gamma
)
iter <- 2000
chains <- 4
stan.fit <- LDAtopicmodel(
stan_data = stan.data,
iter = iter,
chains = chains,
sample_file = NULL,
diagnostic_file = NULL,
cores = 4,
control = list(adapt_delta = 0.9),
save_dso = TRUE,
algorithm = "NUTS"
)
samples <- rstan::extract(
stan.fit,
permuted = TRUE,
inc_warmup = FALSE,
include = TRUE
)
theta <- samples$theta
aligned <- alignmentMatrix(
theta,
ps,
K,
iterUse = iter/2,
chain = chains,
SampleID_name = "unique_names"
)
theta_aligned <- thetaAligned(
theta,
K,
aligned,
iterUse = iter/2,
chain = chains
)
beta <- samples$beta
beta_aligned <- betaAligned(
beta,
K,
aligned,
iterUse = iter/2,
chain = chains
)
dds_all <- diffTopAnalysis(
design = ~ pna,
ps,
theta_aligned,
subsetSample = sample_names(psE_BARBI)
)
res_all <- results(dds_all)
res_all
We set-up data for latent Dirichlet allocation (LDA) based on a phyloseq
object, the number of topics, hyperparameters for topic proportion in each sample and ASVs proportion in each topic.
We set the hyperparameters \(\alpha\) and \(\gamma\) less than one to generate mixtures that are different from each other. We choose 0.8 for \(\alpha\) across all samples so that we avoid generating unrealistic topics.
data("psE_BARBI")
ps <- psE_BARBI
K <- 11
alpha <- 0.8
gamma <- 0.8
stan.data <- setUpData(
ps = ps,
K = K,
alpha = alpha,
gamma = gamma
)
We need to estimate \(\theta\) and \(B = \left[\beta_{1}, \cdots, \beta_{K} \right]^{T}.\)
We estimate the parameters using HMC NUTS with four chains and 2000 iterations. Out of these 2000 iterations, 1000 iterations were used as warm-up samples.
LDAtopicmodel
uses an LDA model written in lda.stan. The user doesn’t need to write the Stan model. We use stan
function from rstan
package. Please refer to ?stan
to specify the inputs for the stan
function. For the whole dataset, we can run the following chunk in high-performance computing and save the results.
iter <- 2000
chains <- 4
stan.fit <- LDAtopicmodel(
stan_data = stan.data,
iter = iter,
chains = chains,
sample_file = NULL,
diagnostic_file = NULL,
cores = 4,
control = list(adapt_delta = 0.9),
save_dso = TRUE,
algorithm = "NUTS"
)
We extract posterior samples from the results.
samples <- rstan::extract(
stan.fit,
permuted = TRUE,
inc_warmup = FALSE,
include = TRUE
)
We estimated the parameters using the HMC-NUTS sampler with four chains and 2000 iterations. Out of these 2000 iterations, 1000 iterations were used as warm-up samples and discarded. Label switching across chains makes it difficult to directly compute log predictive density, split-\(\hat{R}\), effective sample size for model assessment, and evaluate convergence and mixing of chains.
To address this issue, we fixed the order of topics in chain one and then found the permutation that best aligned the topics across all four chains. For each chain two to four, we identified the estimated topics pair with the highest correlation, then found the next highest pair among the remaining, and so forth.
We create a Topic \(*\) Chain matrix.
theta <- samples$theta
aligned <- alignmentMatrix(
theta,
ps,
K,
iterUse = iter/2,
chain = chains,
SampleID_name = "unique_names"
)
We align the topic proportion in each specimen across four chains.
theta_aligned <- thetaAligned(
theta,
K,
aligned,
iterUse = iter/2,
chain = chains
)
We align the ASVs proportion in each topic across four chains.
# an array (iterations *topic * ASV)
beta <- samples$beta
# an array (iterations *topic * ASV)
beta_aligned <- betaAligned(
beta,
K,
aligned,
iterUse = iter/2,
chain = chains
)
Plot the topic proportion in specimens and we can save it for publication purposes. We can draw an informative summary of bacterial communities in each phenotype.
plotTopicProportion(
ps,
theta_aligned,
K,
col_names_theta_all = c("iteration", "Sample", "Topic", "topic.dis"),
chain = 4,
iterUse = iter/2,
SampleIDVarInphyloseq = "unique_names",
design = ~ pna
)
We plot the ASVs distribution in each topic.
plotASVCircular(
ps,
K,
iterUse = iter/2,
chain = chains,
beta_aligned,
varnames = c("iterations", "topic", "rsv_ix"),
value_name = "beta_h",
taxonomylevel = "Class",
thresholdASVprop = 0.008
)
We do goodness of fit test using the posterior estimates and observed data.
modelAssessment(
ps,
stan.fit,
iterUse = iter/2,
statistic = "max",
ASVsIndexToPlot = c(1, 3, 10:14, 19:26, 36, 51:53, 148)
)
dds_all <- diffTopAnalysis(
design = ~ pna,
ps,
theta_aligned,
subsetSample = sample_names(ps),
fitType = "mean"
)
results(dds_all)
Let’s test on the paired-specimens.
dds_all_paired <- diffTopAnalysis(
design = ~ pna,
ps,
theta_aligned,
subsetSample = paired_specimens
)
results(dds_all_paired)