Article Text

## Abstract

Incentivised by breakthroughs and data generated by the high-throughput sequencing technology, this paper proposes a distance-based framework to fulfil the emerging needs in elucidating insights from the high-dimensional microbiome data in psychiatric studies. By shifting focus from traditional methods that focus on the observations from each subject to the between-subject attributes that aggregate two or more subjects’ entire feature vectors, the described approach revolutionises the conventional prescription for high-dimensional observations via microbiome diversity. To this end, we enrich the classical generalised linear models to articulate the multivariable regression relationship between distance-based variables. We also discuss a robust and computationally feasible semiparametric inference technique. Benefitting from the latest advances in the semiparametric efficiency theory for such attributes, the proposed estimators enjoy robustness and good asymptotic properties that guarantee sensitivity in detecting signals between clinical outcomes and microbiome diversity. It offers a readily implementable and easily interpretable solution for deciphering the gut–brain axis in mental health research.

- statistics, nonparametric
- regression analysis
- mental health
- multivariate analysis

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

## Statistics from Altmetric.com

## Introduction

The human microbiome is the totality of the microbes (microbiota), their genetic elements (metagenome) and the interactions they have with surrounding environments throughout the human body.1 In contrast to the human genome, the human microbiome is highly variable, displays substantial intra-individual variation at different body sites (gut, skin, lung, vagina, oral cavity, etc), inter-individual variation at the same body sites and intra-individual variation at different times in longitudinal studies.2

The human microbiome plays a key role in human disease and health. A preponderance of human microbiome studies have implicated the human microbiome in the pathogenesis of many human diseases, such as obesity, diabetes, alcoholic liver disease, vaginosis and even cancers.1 3 The genotypic effect on the microbiome may explain the missing link between genetics and disease since a disease-susceptibility genotype may affect the disease outcome through the alteration of the microbiome composition.4 5 Therefore, identifying potential factors that influence the microbiome composition and discovering their relationship with biological or clinical outcomes help demystify the inherent disease mechanism and enable the possibility of modulating the microbiome composition for therapeutic purposes.

Fuelled by the technological advancement of next-generation sequencing, the human microbiome can be interrogated using high-throughput sequencing. One strategy amplifies and sequences the bacterial 16S ribosomal RNA from the samples. We then cluster the similar sequences into operational taxonomic units (OTUs). By comparing OTUs with reference databases, we identify existing species in the samples and also obtain the OTU abundance profiles. The OTU abundance profiles refer to a matrix with the (*i, j*)-th element referring to the number of sequence reads that represent the *j*-th OTU (or species, roughly speaking) in the *i*-th subject. This count matrix forms the foundation for statistical analyses.6 The notable features of OTU abundances are high-dimensional (*p*>>*n*) and skewed counts with a preponderance of zeros. One line of research aims to advance statistical tools to directly tackle such data features to find individual OTU culprits for certain diseases of interest.7 8 Another emerging paradigm, however, shifts gears to study the impact of the overall microbiome composition represented as diversity metrics, such as alpha-diversity and beta-diversity,6 which we introduce and focus on in this paper.

## Microbiome diversity metrics

### Alpha-diversity

Alpha-diversity is a subject-level metric summarising OTU abundance into a scalar value for each person. Consider a sample of n subjects and a vector denoting the counts of the p OTUs for subject i . Alpha-diversity for subject i is:

(1)

where
m
is the order controlling the weight allocation among taxa. Varying
m
permits different alpha diversities. For example,
yields the observed species index
that counts the total number of species present, hence weighing more on the rare taxa and indexes the *richness* of microbe in total.9 On the other hand,
leads to the Simpson index
that assigns more weight towards abundant taxa, hence indicating species evenness.9 When
(1) is undefined but the resulting limit is the Shannon index,
.

### Beta-diversity

Unlike the OTU abundance or alpha-diversity that describes features within each subject, beta-diversity is a between-subject attribute6 indexing the dissimilarity between any two individuals or a pair. Essentially, beta-diversity compares the feature differences between any pair using a dissimilarity or distance metric, denoted as , where : is a mapping from the high-dimensional OTUs to a scalar value for that pair. Different choices of lead to distinct versions of beta-diversity, with the commonly used ones including Aitchison, Bray-Curtis, Jaccard, Unifrac, and so on. For example, the Aitchison beta-diversity10 is defined as:

(2)

By integrating information from high-dimensional features, beta-diversity represents a totality measure of dissimilarity between two subjects across all the OTUs, which constitutes a biologically relevant indicator of human health11 and merits interest. It is non-negative, with 0 (bigger values) indicating the same (very different) taxonomic abundances. Also, its dimension does not change with the number of taxonomic units hence can be viewed as a dimension reduction.

## Statistical methods

The research community is primarily interested in using statistical methods to link various diversity metrics introduced above to clinical variables, such as disease status.

### Alpha-diversity

The alpha-diversity is composed of data from one subject; therefore, most standard statistical methods are readily applicable for the analysis of alpha-diversity. For example, popular tools for associating a phenotype with alpha-diversity include the Kruskal-Wallis H (or Mann-Whitney U) test12 for categorical variables and the Spearman’s correlation for continuous phenotype variables. When controlling for covariates (eg, demographics) is needed in more complex settings, the robust regression framework is often suggested to ensure valid inferences given the non-normality nature of alpha-diversity. For a study with size
n
, let
denote a response, and
an explanatory variable for the
i
-th subject. The semiparametric *GLM* characterising the within-subject relationship between
and
is:

(3)

where is the inverse of some link functions. Compared with the classical parametric GLM, (3) is more flexible. It removes the distributional assumption on , thus yielding valid inference even when the data deviate from such an assumption. Hence, it is especially suited for modelling the alpha-diversity by specifying in (1). It is worth noting that the same method can be used if alpha-diversity is the predictor.

### Beta-diversity

Given a main factor with K levels for the group membership, let and denote the mean and variance of for the k -th group. We can formalise the scientific question into a hypothesis to test the mean beta-diversity distance across the K groups:

(4)

However, unlike the case with alpha-diversity, the involves the OTU abundances from a pair, so it introduces rather complex correlation structures that are difficult, if not impossible, to model using parametric form.

#### Permutational Multivariate Analysis of Variance Using Distance Matrices

One solution is the distance-based Permutational Multivariate Analysis of Variance Using Distance Matrices (PERMANOVA).13 It defines a *pseudo-F statistic* as:

where
is the trace of a matrix,
is the hat matrix of the design matrix
and
G
is the Gower’s centred matrix obtained from
Due to the complexity of beta-diversity (usually non-Euclidean) distance, the limit of this *pseudo-*
F
is unlikely to follow the
F
-distribution, and the non-parametric permutation is thus adopted for p values.

Although routinely being used, some concerns have been raised regarding PERMANOVA. First, it does not provide any coefficient estimators for explanatory variables, which hinders generating interpretable results on the direction or size of the effects or discerning the sources of differences. Second, it only describes relationships between beta-diversity (a between-subject attribute) and a within-subject main categorical factor, not with their between-subject counterpart or any other types of variables, such as a continuous variable. Third, it requires a large number of permutations for stable results, and thus carries more overhead in terms of the computational burden. Last but not least, it is difficult to extend PERMANOVA to longitudinal studies (especially with missing data) to discover valuable scientific insights from dynamic and highly personalised microbiome data.

To resolve those limitations, recently, a more flexible alternative has been proposed.6 Granted by its distance-based regression setup, this approach permits elucidating the association between the beta-diversity and a categorical variable (eg, group difference) or even more general variable types, such as perceived stress, a continuous instrument in mental health research.14

#### Distance-based regression

##### Modelling the between-subject attributes

To enlarge the semiparametric GLM framework for beta-diversity, consider a column vector of multivariate response and an explanatory variable for the i -th subject, where , and where . By concatenating into a (scalar) functional response from two subjects with some scalar-valued function, such as the beta-diversity in (2), we can model as a function of :

(5)

where is some smooth function (eg, with continuous derivatives up to the second order), and β is a vector of parameters. Akin to (3), (5) remains semiparametric by imposing minimum distributional assumption on . This introduces greater flexibility in practice.

The distance-based regression introduced above is a special case of a class of semiparametric functional response models (FRM).6 Equation (5) achieves effective dimension-reduction (with the mapping ) and is well-suited for data entailing the intrinsic between-subject nature, that is, outcomes that are composed of pairs of subjects. With this setup, we can formalise the scientific question by regressing on the explanatory variables to test their associations, adjusting for covariates.

To illustrate, consider a categorical variable (such as the group membership) with K levels. We transform into a between-subject attribute by defining a set of pairwise indicators (or dummy variables) for through the one-hot encoding function

(6)

where
indicates the pair with the same
k
-th *concordant* (
) or *discordant* (
) levels for
. For example, if
is a binary indicator of disease, we use
to index the respective diseased-diseased, healthy-healthy and healthy-diseased pairs. With
the beta-diversity for the
i
-th pair, we can model its conditional mean among subgroups with (5):

(7)

here is adopted since the response beta-diversity is non-negative.

The coefficients β reveal the heterogeneity in among different subgroups. Constructing tests among subgroups also helps disentangle different types of heterogeneity. These insights, in terms of effect sizes and directions, are of considerable interest to researchers but are difficult to achieve for PERMANOVA. Additional strengths of FRM include the computational scalability over permutation-based mainstream for such data, as well as the ease of including covariates in (7), either between-subject or within-subject.6

A more generalised variation can be specified in this distance-based regression. For example, by switching the response and explanatory variable, we can define the ‘*difference indices*’
for some clinical outcomes (such as body mass index difference) and model their relationship with beta-diversity (as a predictor):

(8)

One glitch is that while is non-negative, here can be positive or negative. This can be readily fixed by setting to , where denotes the sign function with if if and otherwise. For brevity, we continue to denote by in what follows.

Now for (8),
β
represents the differential response
per unit difference in the beta-diversity
for the
i
**-**th pair. This generalisation is especially useful when interest lies in evaluating the role of alpha-diversity and beta-diversity metrics together on a clinical outcome (see section ‘Real data analysis’). Furthermore, (8) even permits multivariate clinical outcomes
, where some domain-specific distance can be prespecified as
.

##### Statistical inference and hypothesis testing

As the response function in (5) or (8) involves pairs of subjects, inferences about β must tackle their interlocking dependencies. A class of U-statistics-based GEE (UGEE) has been proposed accordingly15 for this. Let

(9)

In practice, is unknown and substituted by a working variance such as , with as an unknown constant. Thus, the UGEE takes a familiar form

(10)

where the estimates are obtained through the Newton-Raphson method.

The theory of U-statistics guarantees that by solving for (10) is consistent and asymptotically normal (CAN) under mild regularity conditions:

(11)

where denotes convergence in distribution and a consistent ‘sandwich’ variance estimator of and , which can be obtained by substituting consistent estimates of β and moments of the respective quantities.

This can be readily applied to testing any linear hypotheses concerning β through the linear contrast vs , where C is a matrix of known constants with rank s . Under the null, the Wald statistic has an asymptotic distribution:

(12)

where denotes a (central) distribution with s df. A score test can also be constructed if needed.6

More importantly, is also semiparametrically efficient whose asymptotic variance is the smallest among all models satisfying the same moment restriction in (7) or (8), which can lead to reduced sample size requirements for clinical studies in detecting a given effect size.16 Altogether, this semiparametric inference technique ensures both robustness and sensitivity to help facilitate data-driven scientific findings.

## Real data analysis

Recent studies suggest that the gut microbiome plays a major role in the development and functioning of the central nervous system via the microbiome–gut–brain axis.17 Although numerous pieces of evidence have implicated strong relationships among psychosocial factors, few have investigated their relationship with the gut microbiome. A recent study18 fills this gap by collecting both self-reported psychosocial measurements and faecal samples from 184 community-dwelling adults (aged 28–97 years). Participants completed the validated measures, including physical and mental health 36-Item Short Form Survey (SF-36), resilience, optimism, loneliness, wisdom, compassion and social support. DNA extraction and 16S rRNA amplicon sequencing were completed using the Earth Microbiome Project standard protocols. The feature dimension of the microbiome taxonomic units was quite high (m = 12 131).

The original paper18 focuses primarily on the predicting role of Faith’s phylogenetic alpha-diversity on the psychosocial variables, using a robust regression model. The study revealed that lower levels of loneliness and higher levels of wisdom, compassion, social support and social engagement were all associated with greater alpha-diversity of the gut microbiome, and age plays an important moderating role. Partial least squares (PLS) analysis was first applied to all the collected psychosocial variables to summarise them into PLS components of the negative impact of loneliness on gut health. The results supported the previous findings and related literature on the negative impact of loneliness on gut health.19 These encouraging findings motivated us to further investigate the role of microbiome beta-diversity on psychosocial outcomes, intending to uncover different aspects from the between-subject attributes. Below, we present the results of applying the distance-based model to elucidate the impact of beta-diversity on perceived stress and positive states (traits). For illustration, we did not correct for multiple comparisons.

### Perceived stress as a continuous response

Perceived stress was assessed by the standardised instruments of the Perceived Stress Scale (PSS)14; it can be viewed as continuous with larger values indicating higher levels of perceived stress. The distance-based model in (8) with predictors of beta-diversity , alpha-diversity difference , age difference and the one-hot encoded gender can be specified as , where denotes the difference score for the perceived stress of the i -th pair.

Here, represents the mean difference between the perceived stress per unit difference in the beta-diversity , indicates the directional mean difference between the perceived stress per unit difference in the alpha-diversity (age), and is the mean difference between the perceived stress comparing male–female pairs with homogeneous gender pairs.

When the response and explanatory variable are both continuous, the between-subject regression will preserve their corresponding relationships among within-subject attributes. We demonstrate this in the continuous alpha-diversity (age) in figure 1. Shown at the top of figure 1 are the scatter plots with locally estimated scatterplot smoothing (LOESS) curves for the perceived stress (within-subject) with alpha-diversity and age, respectively. On average, the perceived stress did not correlate strongly with alpha-diversity, but it showed a negative relationship with age, which was confirmed by the univariate linear regression where the alpha-diversity effect was insignificant (t-statistic=−0.3145, p=0.448) but had a significant negative age effect (t-statistic=−1.1021, p<0.001). The bottom of figure 1 shows the scatter plots for their between-subject counterparts. As expected, the within-subject relationships were well-preserved in the between-subject regression.

Shown at the top of table 1 are the results of perceived stress from the robust distance-based regression model. Per unit difference in a pair’s beta-diversity is associated with a 0.1457 (Wald=2.9752, p=0.085) unit difference in their perceived stress, suggesting that a larger difference between two subjects’ microbiome profiles implies more discrepancies in their stress levels. The mean (directional) stress score difference for any pair was −0.0270 per unit difference in their alpha-diversity but was not significant (Wald=0.1869, p=0.666). Per unit age difference was significantly associated with a 0.2520 unit decrease in the mean perceived stress difference (Wald=10.7857, p=0.001). This is expected since the scatter plots in figure 1 suggest that age is negatively related to perceived stress. The mean perceived stress difference comparing male–female pairs versus homogeneous gender pairs (male–male and female–female pairs) was 0.0218 (Wald=0.343, p=0.558); that is, we did not find strong evidence implicating perceived stress in distinct gender pairs being different from homogeneous gender pairs. Although signals between perceived stress and the microbiome composition were not strong, the negative association between age and stress shown here has been demonstrated in the literature.20

### Positive states (traits) as a multivariate response

In mental health studies, some traits are evaluated as a composite outcome.9 Particularly, resilience, optimism, mental well-being and wisdom all belong to positive states (traits). We devised the distance-based regression notion to link this four-dimensional multivariate outcome to the microbiome composition by constructing the composite pairwise outcomes with the Euclidean distance and adopting (8) with a log link to decipher their relationships.

At the bottom of table 1 are the results of the positive states. Significant associations were found between the mean distance (variability) in the positive states and beta-diversity ( , Wald=181.3488, p<0.001) but not with the variability in alpha-diversity ( Wald=0.5804, p=0.446). This non-significant alpha-diversity is similar to the case of perceived stress as the univariate outcome. Hence, the microbiome beta-diversity may be a more sensitive indicator in capturing the between-subject attributes of mental health than the alpha-diversity in this dataset. Age variability was only marginally significant ( , Wald=3.6775, p=0.055). The mean distance (variability) in the positive states for male–male pairs was the distance for female–female pairs (Wald=1.1775, p=0.278); the mean distance (variability) for male–female pairs was of that for female–female pairs (Wald=1.8043, p=0.179) but neither effect was significant. Hence, no significant gender effect was observed, similar to the case with perceived stress.

Taken together, those findings help support the existence of the microbiome–gut–brain axis17 19 and, more importantly, provide clinical implications for developing microbiota-related interventions to improve mental health and mitigate its related consequences. For example, given the positive association between beta-diversity and positive states, strategies that modulate patients’ microbiome composition may be beneficial to improve their mental health in general.

## Conclusion and discussion

In this paper, we discussed both prevailing and emerging statistical approaches to demystifying the microbiome–gut–brain axis by studying the relationship between the univariate or multivariate clinical outcomes and various microbiome diversity metrics. Depending on its attribute (within-subject or between-subject), each type of microbiome diversity metric merits its own clinical and scientific exploration and, hence, well-designed statistical tools.

We introduced the definition and characteristics of popular alpha-diversity (within-subject) and beta-diversity (between-subject) measures. Specifically, for the between-subject beta-diversity, a semiparametric distance-based regression was discussed in detail. This distance-based framework has unique advantages in dealing with the complex dependency structures in between-subject attributes such as beta-diversity, which is difficult for traditional approaches. The regression setup also provides coefficient estimates to characterise the associations, facilitating in-depth scientific findings in mental health research. We briefly discussed their theoretical properties as well.

By further augmenting the choice set for both the response and explanatory variables, this framework permits elucidating the relationship among multiple high-dimensional variables. We illustrated predicting univariate or multivariate clinical variables using microbiome diversity by first transforming those clinical outcomes into between-subject attributes. This strategy is especially relevant to mental health research as many psychometric measures are evaluated as a composite rather than a univariate outcome. We also presented the analyses of an actual study to implicate the essential role of the microbiome on mental health. The compelling evidence was consistent with previous findings to support the bridge between the gut and mental health. Simultaneously, this timely demonstration offers a new angle to analyse complex omics data or other types of data of similar format to prepare for the new line of interdisciplinary research in psychiatry.

## Ethics statements

### Patient consent for publication

### Ethics approval

Not applicable.

## References

Jinyuan Liu obtained her PhD in Biostatistics in 2022 from UC San Diego in the USA. She is currently working as an Assistant Professor of Biostatistics at Vanderbilt University Medical Center in the USA, with a secondary appointment at the Department of Psychiatry. Her main research interests include semiparametric efficiency, causal inference, psychometrics, and building effective dimension-reduction and efficient integrative modeling of high-dimensional data from various disciplines, including genomics, imaging, and wearables.

## Footnotes

Contributors JL contributed to the data analysis and paper writing; KX, TW and LY contributed to the data analysis and paper review; TTN and DJ contributed to the data collection, analysis and paper review; XZ contributed to the data analysis interpretation and paper writing.

Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

Competing interests None declared.

Provenance and peer review Commissioned; externally peer reviewed.