Biostatistical Methods In Psychiatry

Deciphering the ‘gut–brain axis’ through microbiome diversity

Abstract

Incentivised by breakthroughs and data generated by the high-throughput sequencing technology, this paper proposes a distance-based framework to fulfil the emerging needs in elucidating insights from the high-dimensional microbiome data in psychiatric studies. By shifting focus from traditional methods that focus on the observations from each subject to the between-subject attributes that aggregate two or more subjects’ entire feature vectors, the described approach revolutionises the conventional prescription for high-dimensional observations via microbiome diversity. To this end, we enrich the classical generalised linear models to articulate the multivariable regression relationship between distance-based variables. We also discuss a robust and computationally feasible semiparametric inference technique. Benefitting from the latest advances in the semiparametric efficiency theory for such attributes, the proposed estimators enjoy robustness and good asymptotic properties that guarantee sensitivity in detecting signals between clinical outcomes and microbiome diversity. It offers a readily implementable and easily interpretable solution for deciphering the gut–brain axis in mental health research.

Introduction

The human microbiome is the totality of the microbes (microbiota), their genetic elements (metagenome) and the interactions they have with surrounding environments throughout the human body.1 In contrast to the human genome, the human microbiome is highly variable, displays substantial intra-individual variation at different body sites (gut, skin, lung, vagina, oral cavity, etc), inter-individual variation at the same body sites and intra-individual variation at different times in longitudinal studies.2

The human microbiome plays a key role in human disease and health. A preponderance of human microbiome studies have implicated the human microbiome in the pathogenesis of many human diseases, such as obesity, diabetes, alcoholic liver disease, vaginosis and even cancers.1 3 The genotypic effect on the microbiome may explain the missing link between genetics and disease since a disease-susceptibility genotype may affect the disease outcome through the alteration of the microbiome composition.4 5 Therefore, identifying potential factors that influence the microbiome composition and discovering their relationship with biological or clinical outcomes help demystify the inherent disease mechanism and enable the possibility of modulating the microbiome composition for therapeutic purposes.

Fuelled by the technological advancement of next-generation sequencing, the human microbiome can be interrogated using high-throughput sequencing. One strategy amplifies and sequences the bacterial 16S ribosomal RNA from the samples. We then cluster the similar sequences into operational taxonomic units (OTUs). By comparing OTUs with reference databases, we identify existing species in the samples and also obtain the OTU abundance profiles. The OTU abundance profiles refer to a matrix with the (i, j)-th element referring to the number of sequence reads that represent the j-th OTU (or species, roughly speaking) in the i-th subject. This count matrix forms the foundation for statistical analyses.6 The notable features of OTU abundances are high-dimensional (p>>n) and skewed counts with a preponderance of zeros. One line of research aims to advance statistical tools to directly tackle such data features to find individual OTU culprits for certain diseases of interest.7 8 Another emerging paradigm, however, shifts gears to study the impact of the overall microbiome composition represented as diversity metrics, such as alpha-diversity and beta-diversity,6 which we introduce and focus on in this paper.

Microbiome diversity metrics

Alpha-diversity

Alpha-diversity is a subject-level metric summarising OTU abundance into a scalar value for each person. Consider a sample of n subjects and a vector  Inline Formula  denoting the counts of the p OTUs for subject i . Alpha-diversity for subject i is:

Display Formula

where m is the order controlling the weight allocation among taxa. Varying m permits different alpha diversities. For example,  Inline Formula  yields the observed species index  Inline Formula  that counts the total number of species present, hence weighing more on the rare taxa and indexes the richness of microbe in total.9 On the other hand,  Inline Formula  leads to the Simpson index  Inline Formula  that assigns more weight towards abundant taxa, hence indicating species evenness.9 When  Inline Formula  (1) is undefined but the resulting limit is the Shannon index,  Inline Formula .

Beta-diversity

Unlike the OTU abundance or alpha-diversity that describes features within each subject, beta-diversity is a between-subject attribute6 indexing the dissimilarity between any two individuals or a pair. Essentially, beta-diversity compares the feature differences between any pair  Inline Formula  using a dissimilarity or distance metric, denoted as  Inline Formula , where  Inline Formula :  Inline Formula  is a mapping from the high-dimensional OTUs to a scalar value for that pair. Different choices of  Inline Formula  lead to distinct versions of beta-diversity, with the commonly used ones including Aitchison, Bray-Curtis, Jaccard, Unifrac, and so on. For example, the Aitchison beta-diversity10 is defined as:

Display Formula

By integrating information from high-dimensional features, beta-diversity represents a totality measure of dissimilarity between two subjects across all the OTUs, which constitutes a biologically relevant indicator of human health11 and merits interest. It is non-negative, with 0 (bigger values) indicating the same (very different) taxonomic abundances. Also, its dimension does not change with the number of taxonomic units  Inline Formula  hence can be viewed as a dimension reduction.

Statistical methods

The research community is primarily interested in using statistical methods to link various diversity metrics introduced above to clinical variables, such as disease status.

Alpha-diversity

The alpha-diversity is composed of data from one subject; therefore, most standard statistical methods are readily applicable for the analysis of alpha-diversity. For example, popular tools for associating a phenotype with alpha-diversity include the Kruskal-Wallis H (or Mann-Whitney U) test12 for categorical variables and the Spearman’s correlation for continuous phenotype variables. When controlling for covariates (eg, demographics) is needed in more complex settings, the robust regression framework is often suggested to ensure valid inferences given the non-normality nature of alpha-diversity. For a study with size n , let  Inline Formula  denote a response, and  Inline Formula  an explanatory variable for the i -th subject. The semiparametric GLM characterising the within-subject relationship between  Inline Formula  and  Inline Formula  is:

Display Formula

where  Inline Formula  is the inverse of some link functions. Compared with the classical parametric GLM, (3) is more flexible. It removes the distributional assumption on  Inline Formula , thus yielding valid inference even when the data deviate from such an assumption. Hence, it is especially suited for modelling the alpha-diversity by specifying  Inline Formula  in (1). It is worth noting that the same method can be used if alpha-diversity is the predictor.

Beta-diversity

Given a main factor  Inline Formula  with K levels for the group membership, let  Inline Formula  and  Inline Formula  denote the mean and variance of  Inline Formula  for the k -th group. We can formalise the scientific question into a hypothesis to test the mean beta-diversity distance across the K groups:

Display Formula

However, unlike the case with alpha-diversity, the  Inline Formula  involves the OTU abundances from a pair, so it introduces rather complex correlation structures that are difficult, if not impossible, to model using parametric form.

Permutational Multivariate Analysis of Variance Using Distance Matrices

One solution is the distance-based Permutational Multivariate Analysis of Variance Using Distance Matrices (PERMANOVA).13 It defines a pseudo-F statistic as:

Display Formula

where  Inline Formula  is the trace of a matrix,  Inline Formula  is the hat matrix of the design matrix  Inline Formula  and G is the Gower’s centred matrix obtained from  Inline Formula  Due to the complexity of beta-diversity (usually non-Euclidean) distance, the limit of this pseudo- F is unlikely to follow the F -distribution, and the non-parametric permutation is thus adopted for p values.

Although routinely being used, some concerns have been raised regarding PERMANOVA. First, it does not provide any coefficient estimators for explanatory variables, which hinders generating interpretable results on the direction or size of the effects or discerning the sources of differences. Second, it only describes relationships between beta-diversity (a between-subject attribute) and a within-subject main categorical factor, not with their between-subject counterpart or any other types of variables, such as a continuous variable. Third, it requires a large number of permutations for stable results, and thus carries more overhead in terms of the computational burden. Last but not least, it is difficult to extend PERMANOVA to longitudinal studies (especially with missing data) to discover valuable scientific insights from dynamic and highly personalised microbiome data.

To resolve those limitations, recently, a more flexible alternative has been proposed.6 Granted by its distance-based regression setup, this approach permits elucidating the association between the beta-diversity  Inline Formula  and a categorical variable (eg, group difference) or even more general variable types, such as perceived stress, a continuous instrument in mental health research.14

Distance-based regression

Modelling the between-subject attributes

To enlarge the semiparametric GLM framework for beta-diversity, consider a column vector of multivariate response and an explanatory variable  Inline Formula  for the i -th subject, where  Inline Formula , and where  Inline Formula . By concatenating  Inline Formula  into a (scalar) functional response from two subjects with some scalar-valued function, such as the beta-diversity  Inline Formula  in (2), we can model  Inline Formula  as a function of  Inline Formula :

Display Formula

where  Inline Formula  is some smooth function (eg, with continuous derivatives up to the second order), and β is a vector of parameters. Akin to (3), (5) remains semiparametric by imposing minimum distributional assumption on  Inline Formula . This introduces greater flexibility in practice.

The distance-based regression introduced above is a special case of a class of semiparametric functional response models (FRM).6 Equation (5) achieves effective dimension-reduction (with the mapping  Inline Formula ) and is well-suited for data entailing the intrinsic between-subject nature, that is, outcomes that are composed of pairs of subjects. With this setup, we can formalise the scientific question by regressing  Inline Formula  on the explanatory variables to test their associations, adjusting for covariates.

To illustrate, consider a categorical variable  Inline Formula  (such as the group membership) with K levels. We transform  Inline Formula  into a between-subject attribute by defining a set of pairwise indicators (or dummy variables) for  Inline Formula  through the one-hot encoding function  Inline Formula 

Display Formula

Display Formula

where  Inline Formula  indicates the pair with the same k -th concordant ( Inline Formula ) or discordant ( Inline Formula ) levels for  Inline Formula  . For example, if  Inline Formula  is a binary indicator of disease, we use  Inline Formula  to index the respective diseased-diseased, healthy-healthy and healthy-diseased pairs. With  Inline Formula  the beta-diversity for the i -th pair, we can model its conditional mean among subgroups with (5):

Display Formula

here  Inline Formula  is adopted since the response beta-diversity is non-negative.

The coefficients β reveal the heterogeneity in  Inline Formula  among different subgroups. Constructing tests among subgroups also helps disentangle different types of heterogeneity. These insights, in terms of effect sizes and directions, are of considerable interest to researchers but are difficult to achieve for PERMANOVA. Additional strengths of FRM include the computational scalability over permutation-based mainstream for such data, as well as the ease of including covariates in (7), either between-subject or within-subject.6

A more generalised variation can be specified in this distance-based regression. For example, by switching the response and explanatory variable, we can define the ‘difference indices Inline Formula   Inline Formula  for some clinical outcomes (such as body mass index difference) and model their relationship with beta-diversity (as a predictor):

Display Formula

One glitch is that while  Inline Formula  is non-negative,  Inline Formula  here can be positive or negative. This can be readily fixed by setting  Inline Formula  to  Inline Formula , where  Inline Formula  denotes the sign function with  Inline Formula  if  Inline Formula   Inline Formula  if  Inline Formula  and  Inline Formula  otherwise. For brevity, we continue to denote  Inline Formula   Inline Formula  by  Inline Formula  in what follows.

Now for (8), β represents the differential response  Inline Formula  per unit difference in the beta-diversity  Inline Formula  for the i -th pair. This generalisation is especially useful when interest lies in evaluating the role of alpha-diversity and beta-diversity metrics together on a clinical outcome (see section ‘Real data analysis’). Furthermore, (8) even permits multivariate clinical outcomes  Inline Formula , where some domain-specific distance can be prespecified as  Inline Formula .

Statistical inference and hypothesis testing

As the response function in (5) or (8) involves pairs of subjects, inferences about β must tackle their interlocking dependencies. A class of U-statistics-based GEE (UGEE) has been proposed accordingly15 for this. Let

Display Formula

In practice,  Inline Formula  is unknown and substituted by a working variance such as  Inline Formula , with  Inline Formula  as an unknown constant. Thus, the UGEE takes a familiar form

Display Formula

where the estimates  Inline Formula  are obtained through the Newton-Raphson method.

The theory of U-statistics guarantees that  Inline Formula  by solving for (10) is consistent and asymptotically normal (CAN) under mild regularity conditions:

Display Formula

where  Inline Formula  denotes convergence in distribution and a consistent ‘sandwich’ variance estimator of  Inline Formula  and  Inline Formula , which can be obtained by substituting consistent estimates of β and moments of the respective quantities.

This can be readily applied to testing any linear hypotheses concerning β through the linear contrast  Inline Formula vs  Inline Formula , where C is a matrix of known constants with rank s . Under the null, the Wald statistic has an asymptotic  Inline Formula  distribution:

Display Formula

where  Inline Formula  denotes a (central)  Inline Formula  distribution with s df. A score test can also be constructed if needed.6

More importantly,  Inline Formula  is also semiparametrically efficient whose asymptotic variance  Inline Formula  is the smallest among all models satisfying the same moment restriction in (7) or (8), which can lead to reduced sample size requirements for clinical studies in detecting a given effect size.16 Altogether, this semiparametric inference technique ensures both robustness and sensitivity to help facilitate data-driven scientific findings.

Real data analysis

Recent studies suggest that the gut microbiome plays a major role in the development and functioning of the central nervous system via the microbiome–gut–brain axis.17 Although numerous pieces of evidence have implicated strong relationships among psychosocial factors, few have investigated their relationship with the gut microbiome. A recent study18 fills this gap by collecting both self-reported psychosocial measurements and faecal samples from 184 community-dwelling adults (aged 28–97 years). Participants completed the validated measures, including physical and mental health 36-Item Short Form Survey (SF-36), resilience, optimism, loneliness, wisdom, compassion and social support. DNA extraction and 16S rRNA amplicon sequencing were completed using the Earth Microbiome Project standard protocols. The feature dimension of the microbiome taxonomic units was quite high (m = 12 131).

The original paper18 focuses primarily on the predicting role of Faith’s phylogenetic alpha-diversity on the psychosocial variables, using a robust regression model. The study revealed that lower levels of loneliness and higher levels of wisdom, compassion, social support and social engagement were all associated with greater alpha-diversity of the gut microbiome, and age plays an important moderating role. Partial least squares (PLS) analysis was first applied to all the collected psychosocial variables to summarise them into PLS components of the negative impact of loneliness on gut health. The results supported the previous findings and related literature on the negative impact of loneliness on gut health.19 These encouraging findings motivated us to further investigate the role of microbiome beta-diversity on psychosocial outcomes, intending to uncover different aspects from the between-subject attributes. Below, we present the results of applying the distance-based model to elucidate the impact of beta-diversity on perceived stress and positive states (traits). For illustration, we did not correct for multiple comparisons.

Perceived stress as a continuous response

Perceived stress was assessed by the standardised instruments of the Perceived Stress Scale (PSS)14; it can be viewed as continuous with larger values indicating higher levels of perceived stress. The distance-based model in (8) with predictors of beta-diversity  Inline Formula , alpha-diversity difference  Inline Formula , age difference  Inline Formula  and the one-hot encoded gender  Inline Formula  can be specified as  Inline Formula , where  Inline Formula  denotes the difference score for the perceived stress of the i -th pair.

Here,  Inline Formula  represents the mean difference between the perceived stress per unit difference in the beta-diversity  Inline Formula ,  Inline Formula   Inline Formula  indicates the directional mean difference between the perceived stress per unit difference in the alpha-diversity (age), and  Inline Formula  is the mean difference between the perceived stress comparing male–female pairs with homogeneous gender pairs.

When the response and explanatory variable are both continuous, the between-subject regression will preserve their corresponding relationships among within-subject attributes. We demonstrate this in the continuous alpha-diversity (age) in figure 1. Shown at the top of figure 1 are the scatter plots with locally estimated scatterplot smoothing (LOESS) curves for the perceived stress (within-subject) with alpha-diversity and age, respectively. On average, the perceived stress did not correlate strongly with alpha-diversity, but it showed a negative relationship with age, which was confirmed by the univariate linear regression where the alpha-diversity effect was insignificant (t-statistic=−0.3145, p=0.448) but had a significant negative age effect (t-statistic=−1.1021, p<0.001). The bottom of figure 1 shows the scatter plots for their between-subject counterparts. As expected, the within-subject relationships were well-preserved in the between-subject regression.

Figure 1
Figure 1

Real data analysis. The first row shows the scatter plots and LOESS curves for the perceived stress (within-subject) against alpha-diversity (left) and age (right). The second row shows scatter plots and LOESS curves for the difference in perceived stress against the difference in alpha-diversity (left) and the difference in age (right). LOESS, locally estimated scatterplot smoothing.

Shown at the top of table 1 are the results of perceived stress from the robust distance-based regression model. Per unit difference in a pair’s beta-diversity is associated with a 0.1457 (Wald=2.9752, p=0.085) unit difference in their perceived stress, suggesting that a larger difference between two subjects’ microbiome profiles implies more discrepancies in their stress levels. The mean (directional) stress score difference for any pair was −0.0270 per unit difference in their alpha-diversity but was not significant (Wald=0.1869, p=0.666). Per unit age difference was significantly associated with a 0.2520 unit decrease in the mean perceived stress difference (Wald=10.7857, p=0.001). This is expected since the scatter plots in figure 1 suggest that age is negatively related to perceived stress. The mean perceived stress difference comparing male–female pairs versus homogeneous gender pairs (male–male and female–female pairs) was 0.0218 (Wald=0.343, p=0.558); that is, we did not find strong evidence implicating perceived stress in distinct gender pairs being different from homogeneous gender pairs. Although signals between perceived stress and the microbiome composition were not strong, the negative association between age and stress shown here has been demonstrated in the literature.20

Table 1
|
Estimates, asymptotic SEs, Wald statistics, p values for the real study data using distance-based model, controlling for body mass index (continuous) and gender (binary)

Positive states (traits) as a multivariate response

In mental health studies, some traits are evaluated as a composite outcome.9 Particularly, resilience, optimism, mental well-being and wisdom all belong to positive states (traits). We devised the distance-based regression notion to link this four-dimensional multivariate outcome  Inline Formula  to the microbiome composition by constructing the composite pairwise outcomes with the Euclidean distance and adopting (8) with a log link to decipher their relationships.

At the bottom of table 1 are the results of the positive states. Significant associations were found between the mean distance (variability) in the positive states and beta-diversity ( Inline Formula , Wald=181.3488, p<0.001) but not with the variability in alpha-diversity ( Inline Formula  Wald=0.5804, p=0.446). This non-significant alpha-diversity is similar to the case of perceived stress as the univariate outcome. Hence, the microbiome beta-diversity may be a more sensitive indicator in capturing the between-subject attributes of mental health than the alpha-diversity in this dataset. Age variability was only marginally significant ( Inline Formula , Wald=3.6775, p=0.055). The mean distance (variability) in the positive states for male–male pairs was  Inline Formula  the distance for female–female pairs (Wald=1.1775, p=0.278); the mean distance (variability) for male–female pairs was  Inline Formula  of that for female–female pairs (Wald=1.8043, p=0.179) but neither effect was significant. Hence, no significant gender effect was observed, similar to the case with perceived stress.

Taken together, those findings help support the existence of the microbiome–gut–brain axis17 19 and, more importantly, provide clinical implications for developing microbiota-related interventions to improve mental health and mitigate its related consequences. For example, given the positive association between beta-diversity and positive states, strategies that modulate patients’ microbiome composition may be beneficial to improve their mental health in general.

Conclusion and discussion

In this paper, we discussed both prevailing and emerging statistical approaches to demystifying the microbiome–gut–brain axis by studying the relationship between the univariate or multivariate clinical outcomes and various microbiome diversity metrics. Depending on its attribute (within-subject or between-subject), each type of microbiome diversity metric merits its own clinical and scientific exploration and, hence, well-designed statistical tools.

We introduced the definition and characteristics of popular alpha-diversity (within-subject) and beta-diversity (between-subject) measures. Specifically, for the between-subject beta-diversity, a semiparametric distance-based regression was discussed in detail. This distance-based framework has unique advantages in dealing with the complex dependency structures in between-subject attributes such as beta-diversity, which is difficult for traditional approaches. The regression setup also provides coefficient estimates to characterise the associations, facilitating in-depth scientific findings in mental health research. We briefly discussed their theoretical properties as well.

By further augmenting the choice set for both the response and explanatory variables, this framework permits elucidating the relationship among multiple high-dimensional variables. We illustrated predicting univariate or multivariate clinical variables using microbiome diversity by first transforming those clinical outcomes into between-subject attributes. This strategy is especially relevant to mental health research as many psychometric measures are evaluated as a composite rather than a univariate outcome. We also presented the analyses of an actual study to implicate the essential role of the microbiome on mental health. The compelling evidence was consistent with previous findings to support the bridge between the gut and mental health. Simultaneously, this timely demonstration offers a new angle to analyse complex omics data or other types of data of similar format to prepare for the new line of interdisciplinary research in psychiatry.

Jinyuan Liu obtained her PhD in Biostatistics in 2022 from UC San Diego in the USA. She is currently working as an Assistant Professor of Biostatistics at Vanderbilt University Medical Center in the USA, with a secondary appointment at the Department of Psychiatry. Her main research interests include semiparametric efficiency, causal inference, psychometrics, and building effective dimension-reduction and efficient integrative modeling of high-dimensional data from various disciplines, including genomics, imaging, and wearables.


Biography image