Article Text

Download PDFPDF

Deciphering the ‘gut–brain axis’ through microbiome diversity
  1. Jinyuan Liu1,
  2. Ke Xu1,
  3. Tsungchin Wu2,
  4. Lydia Yao1,
  5. Tanya T Nguyen3,
  6. Dilip Jeste3 and
  7. Xinlian Zhang2
  1. 1 Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, USA
  2. 2 Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health and Human Longevity Science, UC San Diego, La Jolla, California, USA
  3. 3 Department of Psychiatry, Stein Institute for Research on Aging, UC San Diego, La Jolla, California, USA
  1. Correspondence to Dr Xinlian Zhang; xizhang{at}


Incentivised by breakthroughs and data generated by the high-throughput sequencing technology, this paper proposes a distance-based framework to fulfil the emerging needs in elucidating insights from the high-dimensional microbiome data in psychiatric studies. By shifting focus from traditional methods that focus on the observations from each subject to the between-subject attributes that aggregate two or more subjects’ entire feature vectors, the described approach revolutionises the conventional prescription for high-dimensional observations via microbiome diversity. To this end, we enrich the classical generalised linear models to articulate the multivariable regression relationship between distance-based variables. We also discuss a robust and computationally feasible semiparametric inference technique. Benefitting from the latest advances in the semiparametric efficiency theory for such attributes, the proposed estimators enjoy robustness and good asymptotic properties that guarantee sensitivity in detecting signals between clinical outcomes and microbiome diversity. It offers a readily implementable and easily interpretable solution for deciphering the gut–brain axis in mental health research.

  • statistics, nonparametric
  • regression analysis
  • mental health
  • multivariate analysis

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

Statistics from

Request Permissions

If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.


The human microbiome is the totality of the microbes (microbiota), their genetic elements (metagenome) and the interactions they have with surrounding environments throughout the human body.1 In contrast to the human genome, the human microbiome is highly variable, displays substantial intra-individual variation at different body sites (gut, skin, lung, vagina, oral cavity, etc), inter-individual variation at the same body sites and intra-individual variation at different times in longitudinal studies.2

The human microbiome plays a key role in human disease and health. A preponderance of human microbiome studies have implicated the human microbiome in the pathogenesis of many human diseases, such as obesity, diabetes, alcoholic liver disease, vaginosis and even cancers.1 3 The genotypic effect on the microbiome may explain the missing link between genetics and disease since a disease-susceptibility genotype may affect the disease outcome through the alteration of the microbiome composition.4 5 Therefore, identifying potential factors that influence the microbiome composition and discovering their relationship with biological or clinical outcomes help demystify the inherent disease mechanism and enable the possibility of modulating the microbiome composition for therapeutic purposes.

Fuelled by the technological advancement of next-generation sequencing, the human microbiome can be interrogated using high-throughput sequencing. One strategy amplifies and sequences the bacterial 16S ribosomal RNA from the samples. We then cluster the similar sequences into operational taxonomic units (OTUs). By comparing OTUs with reference databases, we identify existing species in the samples and also obtain the OTU abundance profiles. The OTU abundance profiles refer to a matrix with the (i, j)-th element referring to the number of sequence reads that represent the j-th OTU (or species, roughly speaking) in the i-th subject. This count matrix forms the foundation for statistical analyses.6 The notable features of OTU abundances are high-dimensional (p>>n) and skewed counts with a preponderance of zeros. One line of research aims to advance statistical tools to directly tackle such data features to find individual OTU culprits for certain diseases of interest.7 8 Another emerging paradigm, however, shifts gears to study the impact of the overall microbiome composition represented as diversity metrics, such as alpha-diversity and beta-diversity,6 which we introduce and focus on in this paper.

Microbiome diversity metrics


Alpha-diversity is a subject-level metric summarising OTU abundance into a scalar value for each person. Consider a sample of n subjects and a vector Embedded Image denoting the counts of the p OTUs for subject i . Alpha-diversity for subject i is:

Embedded Image (1)

where m is the order controlling the weight allocation among taxa. Varying m permits different alpha diversities. For example, Embedded Image yields the observed species index Embedded Image that counts the total number of species present, hence weighing more on the rare taxa and indexes the richness of microbe in total.9 On the other hand, Embedded Image leads to the Simpson index Embedded Image that assigns more weight towards abundant taxa, hence indicating species evenness.9 When Embedded Image (1) is undefined but the resulting limit is the Shannon index, Embedded Image .


Unlike the OTU abundance or alpha-diversity that describes features within each subject, beta-diversity is a between-subject attribute6 indexing the dissimilarity between any two individuals or a pair. Essentially, beta-diversity compares the feature differences between any pair Embedded Image using a dissimilarity or distance metric, denoted as Embedded Image , where Embedded Image : Embedded Image is a mapping from the high-dimensional OTUs to a scalar value for that pair. Different choices of Embedded Image lead to distinct versions of beta-diversity, with the commonly used ones including Aitchison, Bray-Curtis, Jaccard, Unifrac, and so on. For example, the Aitchison beta-diversity10 is defined as:

Embedded Image (2)

By integrating information from high-dimensional features, beta-diversity represents a totality measure of dissimilarity between two subjects across all the OTUs, which constitutes a biologically relevant indicator of human health11 and merits interest. It is non-negative, with 0 (bigger values) indicating the same (very different) taxonomic abundances. Also, its dimension does not change with the number of taxonomic units Embedded Image hence can be viewed as a dimension reduction.

Statistical methods

The research community is primarily interested in using statistical methods to link various diversity metrics introduced above to clinical variables, such as disease status.


The alpha-diversity is composed of data from one subject; therefore, most standard statistical methods are readily applicable for the analysis of alpha-diversity. For example, popular tools for associating a phenotype with alpha-diversity include the Kruskal-Wallis H (or Mann-Whitney U) test12 for categorical variables and the Spearman’s correlation for continuous phenotype variables. When controlling for covariates (eg, demographics) is needed in more complex settings, the robust regression framework is often suggested to ensure valid inferences given the non-normality nature of alpha-diversity. For a study with size n , let Embedded Image denote a response, and Embedded Image an explanatory variable for the i -th subject. The semiparametric GLM characterising the within-subject relationship between Embedded Image and Embedded Image is:

Embedded Image (3)

where Embedded Image is the inverse of some link functions. Compared with the classical parametric GLM, (3) is more flexible. It removes the distributional assumption on Embedded Image , thus yielding valid inference even when the data deviate from such an assumption. Hence, it is especially suited for modelling the alpha-diversity by specifying Embedded Image in (1). It is worth noting that the same method can be used if alpha-diversity is the predictor.


Given a main factor Embedded Image with K levels for the group membership, let Embedded Image and Embedded Image denote the mean and variance of Embedded Image for the k -th group. We can formalise the scientific question into a hypothesis to test the mean beta-diversity distance across the K groups:

Embedded Image (4)

However, unlike the case with alpha-diversity, the Embedded Image involves the OTU abundances from a pair, so it introduces rather complex correlation structures that are difficult, if not impossible, to model using parametric form.

Permutational Multivariate Analysis of Variance Using Distance Matrices

One solution is the distance-based Permutational Multivariate Analysis of Variance Using Distance Matrices (PERMANOVA).13 It defines a pseudo-F statistic as:

Embedded Image

where Embedded Image is the trace of a matrix, Embedded Image is the hat matrix of the design matrix Embedded Image and G is the Gower’s centred matrix obtained from Embedded Image Due to the complexity of beta-diversity (usually non-Euclidean) distance, the limit of this pseudo- F is unlikely to follow the F -distribution, and the non-parametric permutation is thus adopted for p values.

Although routinely being used, some concerns have been raised regarding PERMANOVA. First, it does not provide any coefficient estimators for explanatory variables, which hinders generating interpretable results on the direction or size of the effects or discerning the sources of differences. Second, it only describes relationships between beta-diversity (a between-subject attribute) and a within-subject main categorical factor, not with their between-subject counterpart or any other types of variables, such as a continuous variable. Third, it requires a large number of permutations for stable results, and thus carries more overhead in terms of the computational burden. Last but not least, it is difficult to extend PERMANOVA to longitudinal studies (especially with missing data) to discover valuable scientific insights from dynamic and highly personalised microbiome data.

To resolve those limitations, recently, a more flexible alternative has been proposed.6 Granted by its distance-based regression setup, this approach permits elucidating the association between the beta-diversity Embedded Image and a categorical variable (eg, group difference) or even more general variable types, such as perceived stress, a continuous instrument in mental health research.14

Distance-based regression

Modelling the between-subject attributes

To enlarge the semiparametric GLM framework for beta-diversity, consider a column vector of multivariate response and an explanatory variable Embedded Image for the i -th subject, where Embedded Image , and where Embedded Image . By concatenating Embedded Image into a (scalar) functional response from two subjects with some scalar-valued function, such as the beta-diversity Embedded Image in (2), we can model Embedded Image as a function of Embedded Image :

Embedded Image (5)

where Embedded Image is some smooth function (eg, with continuous derivatives up to the second order), and β is a vector of parameters. Akin to (3), (5) remains semiparametric by imposing minimum distributional assumption on Embedded Image . This introduces greater flexibility in practice.

The distance-based regression introduced above is a special case of a class of semiparametric functional response models (FRM).6 Equation (5) achieves effective dimension-reduction (with the mapping Embedded Image ) and is well-suited for data entailing the intrinsic between-subject nature, that is, outcomes that are composed of pairs of subjects. With this setup, we can formalise the scientific question by regressing Embedded Image on the explanatory variables to test their associations, adjusting for covariates.

To illustrate, consider a categorical variable Embedded Image (such as the group membership) with K levels. We transform Embedded Image into a between-subject attribute by defining a set of pairwise indicators (or dummy variables) for Embedded Image through the one-hot encoding function Embedded Image

Embedded Image (6)

Embedded Image

where Embedded Image indicates the pair with the same k -th concordant (Embedded Image ) or discordant (Embedded Image ) levels for Embedded Image . For example, if Embedded Image is a binary indicator of disease, we use Embedded Image to index the respective diseased-diseased, healthy-healthy and healthy-diseased pairs. With Embedded Image the beta-diversity for the i -th pair, we can model its conditional mean among subgroups with (5):

Embedded Image (7)

here Embedded Image is adopted since the response beta-diversity is non-negative.

The coefficients β reveal the heterogeneity in Embedded Image among different subgroups. Constructing tests among subgroups also helps disentangle different types of heterogeneity. These insights, in terms of effect sizes and directions, are of considerable interest to researchers but are difficult to achieve for PERMANOVA. Additional strengths of FRM include the computational scalability over permutation-based mainstream for such data, as well as the ease of including covariates in (7), either between-subject or within-subject.6

A more generalised variation can be specified in this distance-based regression. For example, by switching the response and explanatory variable, we can define the ‘difference indicesEmbedded Image Embedded Image for some clinical outcomes (such as body mass index difference) and model their relationship with beta-diversity (as a predictor):

Embedded Image (8)

One glitch is that while Embedded Image is non-negative, Embedded Image here can be positive or negative. This can be readily fixed by setting Embedded Image to Embedded Image , where Embedded Image denotes the sign function with Embedded Image if Embedded Image Embedded Image if Embedded Image and Embedded Image otherwise. For brevity, we continue to denote Embedded Image Embedded Image by Embedded Image in what follows.

Now for (8), β represents the differential response Embedded Image per unit difference in the beta-diversity Embedded Image for the i -th pair. This generalisation is especially useful when interest lies in evaluating the role of alpha-diversity and beta-diversity metrics together on a clinical outcome (see section ‘Real data analysis’). Furthermore, (8) even permits multivariate clinical outcomes Embedded Image , where some domain-specific distance can be prespecified as Embedded Image .

Statistical inference and hypothesis testing

As the response function in (5) or (8) involves pairs of subjects, inferences about β must tackle their interlocking dependencies. A class of U-statistics-based GEE (UGEE) has been proposed accordingly15 for this. Let

Embedded Image (9)

In practice, Embedded Image is unknown and substituted by a working variance such as Embedded Image , with Embedded Image as an unknown constant. Thus, the UGEE takes a familiar form

Embedded Image (10)

where the estimates Embedded Image are obtained through the Newton-Raphson method.

The theory of U-statistics guarantees that Embedded Image by solving for (10) is consistent and asymptotically normal (CAN) under mild regularity conditions:

Embedded Image (11)

where Embedded Image denotes convergence in distribution and a consistent ‘sandwich’ variance estimator of Embedded Image and Embedded Image , which can be obtained by substituting consistent estimates of β and moments of the respective quantities.

This can be readily applied to testing any linear hypotheses concerning β through the linear contrast Embedded Image vs Embedded Image , where C is a matrix of known constants with rank s . Under the null, the Wald statistic has an asymptotic Embedded Image distribution:

Embedded Image (12)

where Embedded Image denotes a (central) Embedded Image distribution with s df. A score test can also be constructed if needed.6

More importantly, Embedded Image is also semiparametrically efficient whose asymptotic variance Embedded Image is the smallest among all models satisfying the same moment restriction in (7) or (8), which can lead to reduced sample size requirements for clinical studies in detecting a given effect size.16 Altogether, this semiparametric inference technique ensures both robustness and sensitivity to help facilitate data-driven scientific findings.

Real data analysis

Recent studies suggest that the gut microbiome plays a major role in the development and functioning of the central nervous system via the microbiome–gut–brain axis.17 Although numerous pieces of evidence have implicated strong relationships among psychosocial factors, few have investigated their relationship with the gut microbiome. A recent study18 fills this gap by collecting both self-reported psychosocial measurements and faecal samples from 184 community-dwelling adults (aged 28–97 years). Participants completed the validated measures, including physical and mental health 36-Item Short Form Survey (SF-36), resilience, optimism, loneliness, wisdom, compassion and social support. DNA extraction and 16S rRNA amplicon sequencing were completed using the Earth Microbiome Project standard protocols. The feature dimension of the microbiome taxonomic units was quite high (m = 12 131).

The original paper18 focuses primarily on the predicting role of Faith’s phylogenetic alpha-diversity on the psychosocial variables, using a robust regression model. The study revealed that lower levels of loneliness and higher levels of wisdom, compassion, social support and social engagement were all associated with greater alpha-diversity of the gut microbiome, and age plays an important moderating role. Partial least squares (PLS) analysis was first applied to all the collected psychosocial variables to summarise them into PLS components of the negative impact of loneliness on gut health. The results supported the previous findings and related literature on the negative impact of loneliness on gut health.19 These encouraging findings motivated us to further investigate the role of microbiome beta-diversity on psychosocial outcomes, intending to uncover different aspects from the between-subject attributes. Below, we present the results of applying the distance-based model to elucidate the impact of beta-diversity on perceived stress and positive states (traits). For illustration, we did not correct for multiple comparisons.

Perceived stress as a continuous response

Perceived stress was assessed by the standardised instruments of the Perceived Stress Scale (PSS)14; it can be viewed as continuous with larger values indicating higher levels of perceived stress. The distance-based model in (8) with predictors of beta-diversity Embedded Image , alpha-diversity difference Embedded Image , age difference Embedded Image and the one-hot encoded gender Embedded Image can be specified as Embedded Image , where Embedded Image denotes the difference score for the perceived stress of the i -th pair.

Here, Embedded Image represents the mean difference between the perceived stress per unit difference in the beta-diversity Embedded Image , Embedded Image Embedded Image indicates the directional mean difference between the perceived stress per unit difference in the alpha-diversity (age), and Embedded Image is the mean difference between the perceived stress comparing male–female pairs with homogeneous gender pairs.

When the response and explanatory variable are both continuous, the between-subject regression will preserve their corresponding relationships among within-subject attributes. We demonstrate this in the continuous alpha-diversity (age) in figure 1. Shown at the top of figure 1 are the scatter plots with locally estimated scatterplot smoothing (LOESS) curves for the perceived stress (within-subject) with alpha-diversity and age, respectively. On average, the perceived stress did not correlate strongly with alpha-diversity, but it showed a negative relationship with age, which was confirmed by the univariate linear regression where the alpha-diversity effect was insignificant (t-statistic=−0.3145, p=0.448) but had a significant negative age effect (t-statistic=−1.1021, p<0.001). The bottom of figure 1 shows the scatter plots for their between-subject counterparts. As expected, the within-subject relationships were well-preserved in the between-subject regression.

Figure 1

Real data analysis. The first row shows the scatter plots and LOESS curves for the perceived stress (within-subject) against alpha-diversity (left) and age (right). The second row shows scatter plots and LOESS curves for the difference in perceived stress against the difference in alpha-diversity (left) and the difference in age (right). LOESS, locally estimated scatterplot smoothing.

Shown at the top of table 1 are the results of perceived stress from the robust distance-based regression model. Per unit difference in a pair’s beta-diversity is associated with a 0.1457 (Wald=2.9752, p=0.085) unit difference in their perceived stress, suggesting that a larger difference between two subjects’ microbiome profiles implies more discrepancies in their stress levels. The mean (directional) stress score difference for any pair was −0.0270 per unit difference in their alpha-diversity but was not significant (Wald=0.1869, p=0.666). Per unit age difference was significantly associated with a 0.2520 unit decrease in the mean perceived stress difference (Wald=10.7857, p=0.001). This is expected since the scatter plots in figure 1 suggest that age is negatively related to perceived stress. The mean perceived stress difference comparing male–female pairs versus homogeneous gender pairs (male–male and female–female pairs) was 0.0218 (Wald=0.343, p=0.558); that is, we did not find strong evidence implicating perceived stress in distinct gender pairs being different from homogeneous gender pairs. Although signals between perceived stress and the microbiome composition were not strong, the negative association between age and stress shown here has been demonstrated in the literature.20

Table 1

Estimates, asymptotic SEs, Wald statistics, p values for the real study data using distance-based model, controlling for body mass index (continuous) and gender (binary)

Positive states (traits) as a multivariate response

In mental health studies, some traits are evaluated as a composite outcome.9 Particularly, resilience, optimism, mental well-being and wisdom all belong to positive states (traits). We devised the distance-based regression notion to link this four-dimensional multivariate outcome Embedded Image to the microbiome composition by constructing the composite pairwise outcomes with the Euclidean distance and adopting (8) with a log link to decipher their relationships.

At the bottom of table 1 are the results of the positive states. Significant associations were found between the mean distance (variability) in the positive states and beta-diversity (Embedded Image , Wald=181.3488, p<0.001) but not with the variability in alpha-diversity (Embedded Image Wald=0.5804, p=0.446). This non-significant alpha-diversity is similar to the case of perceived stress as the univariate outcome. Hence, the microbiome beta-diversity may be a more sensitive indicator in capturing the between-subject attributes of mental health than the alpha-diversity in this dataset. Age variability was only marginally significant (Embedded Image , Wald=3.6775, p=0.055). The mean distance (variability) in the positive states for male–male pairs was Embedded Image the distance for female–female pairs (Wald=1.1775, p=0.278); the mean distance (variability) for male–female pairs was Embedded Image of that for female–female pairs (Wald=1.8043, p=0.179) but neither effect was significant. Hence, no significant gender effect was observed, similar to the case with perceived stress.

Taken together, those findings help support the existence of the microbiome–gut–brain axis17 19 and, more importantly, provide clinical implications for developing microbiota-related interventions to improve mental health and mitigate its related consequences. For example, given the positive association between beta-diversity and positive states, strategies that modulate patients’ microbiome composition may be beneficial to improve their mental health in general.

Conclusion and discussion

In this paper, we discussed both prevailing and emerging statistical approaches to demystifying the microbiome–gut–brain axis by studying the relationship between the univariate or multivariate clinical outcomes and various microbiome diversity metrics. Depending on its attribute (within-subject or between-subject), each type of microbiome diversity metric merits its own clinical and scientific exploration and, hence, well-designed statistical tools.

We introduced the definition and characteristics of popular alpha-diversity (within-subject) and beta-diversity (between-subject) measures. Specifically, for the between-subject beta-diversity, a semiparametric distance-based regression was discussed in detail. This distance-based framework has unique advantages in dealing with the complex dependency structures in between-subject attributes such as beta-diversity, which is difficult for traditional approaches. The regression setup also provides coefficient estimates to characterise the associations, facilitating in-depth scientific findings in mental health research. We briefly discussed their theoretical properties as well.

By further augmenting the choice set for both the response and explanatory variables, this framework permits elucidating the relationship among multiple high-dimensional variables. We illustrated predicting univariate or multivariate clinical variables using microbiome diversity by first transforming those clinical outcomes into between-subject attributes. This strategy is especially relevant to mental health research as many psychometric measures are evaluated as a composite rather than a univariate outcome. We also presented the analyses of an actual study to implicate the essential role of the microbiome on mental health. The compelling evidence was consistent with previous findings to support the bridge between the gut and mental health. Simultaneously, this timely demonstration offers a new angle to analyse complex omics data or other types of data of similar format to prepare for the new line of interdisciplinary research in psychiatry.

Ethics statements

Patient consent for publication

Ethics approval

Not applicable.


Jinyuan Liu obtained her PhD in Biostatistics in 2022 from UC San Diego in the USA. She is currently working as an Assistant Professor of Biostatistics at Vanderbilt University Medical Center in the USA, with a secondary appointment at the Department of Psychiatry. Her main research interests include semiparametric efficiency, causal inference, psychometrics, and building effective dimension-reduction and efficient integrative modeling of high-dimensional data from various disciplines, including genomics, imaging, and wearables.

Embedded Image


  • Contributors JL contributed to the data analysis and paper writing; KX, TW and LY contributed to the data analysis and paper review; TTN and DJ contributed to the data collection, analysis and paper review; XZ contributed to the data analysis interpretation and paper writing.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Provenance and peer review Commissioned; externally peer reviewed.