Result
Strong negative selection was observed in HLM impacted post-transcriptional regulation
We obtained a list of SNVs between primate common ancestor (multiple alignments) and human hg38 from the CADD training set.3 After the removal of variants within low-quality genomic regions and retaining only human-specific and fixed mutation, we identified 13 007 486 mutations for analysis, defined as HLMs. As expected, these HLMs were strongly depleted in functionally important genomic regions, including exonic (OR=0.49, p<0.001), genic (including exon, intron and flanking regions, OR=0.90, p<0.001) and transcribed (OR=0.84, p<0.001) regions (online supplemental table S1). These HLMs were also depleted in tissue-specific active chromatin states (OR=0.76–0.94 for 222 human tissues in the epimap database,7 (online supplemental table S2) and cell type-specific open chromatin regions (OR=0.82–0.96 for single-cell ATAC peak from 222 human cell types,9 online supplemental table S3). These results suggest that functional consequences of HLMs were mostly intolerant and subjected to strong negative selection.
Next, we proceeded to analyse the impact of HLMs on post-transcription modification and extracted a list of 5 001 228 HLMs falling within transcribed regions of coding genes. By applying Seqweaver deep learning model,4 we quantified the impact of HLMs on 217 RBPs’ affinity (ΔRBP, figure 2A) and compared it with common SNVs in the human population in GnomAD13 (figure 2B). HLMs generally had a significantly smaller ΔRBP than human common SNVs (Wilcoxon test p<0.001). We calculated the maximum ΔRBP for each variant and found that only 13 475 transcribed HLMs (0.27% of all HLMs) had a maximum ΔRBP larger than the top 1% threshold of common SNVs (fold change (FC)=0.27%/1%=0.27, figure 2B), indicating a strong negative selection of HLM’s impact on RBP. When analysing each of the 217 RBP profiles separately, we observed the strongest negative selection on ΔRBP related to alternative splicing (FC<0.01 for ΔRBP of TACA-spliced site binding, online supplemental table S4), indicating strict intolerance to RNA splicing alteration. To further validate this result, we directly quantified the impact of HLMs and common SNVs on alternative splicing by SpliceAI12 and observed a similar depletion of influential HLMs, indicating purifying selection (FC=0.06, online supplemental figure S1).
Figure 2Seqweaver and gene-level analysis. (A) Manhattan plot of Seqweaver result. Y axis denoted the maximum ΔRBP of each HLM. X axis denoted the chromosome position of each HLM. (B) Distribution of ΔRBP. (C) Observed versus expected number of influential HLMs on each gene. Colour and dot size showed the fold enrichment of influential HLMs. (D) Each bar represented a decile of gene ranked by fold enrichment of influential HLMs (the leftmost bar represented the most enriched decile). Bar colour showed the proportion of loss-of-function intolerant genes within each decile. (E) Same as D, but showing the proportion of primate lineage constraint genes defined by Dumas et al. HLM, human lineage mutations; LOEUF, loss-of-function observed/expected upper bound fraction; RBP, RNA-binding protein; SCG, selectively constrained genes; ΔRBP, RNA-binding profile difference.
Based on these results, we defined the 13 475 HLMs with the maximum ΔRBP as influential HLMs and hypothesised that they may have an important role in human evolution. Top influential HLMs included gene encoding histamine receptor H1 (max ΔRBP=0.75), as well as HLMs on other brain-preferentially expressed genes like AMZ1 (max ΔRBP=0.78) and MPRIP (max ΔRBP=0.73). Interestingly, although HLMs generally had a small impact on RBP, these top influential HLMs had a larger impact than top common SNVs: the largest max ΔRBP for common SNVs was 0.70, smaller than these top HLMs. We used these RBP-HLMs for further functional research.
Influential HLMs enriched in genes are highly conserved in primates and human
Next, we analysed the biological significance of influential HLMs. By applying a Poisson regression model, we found that influential HLMs had a dramatically uneven distribution among protein-coding genes (figure 2C). For example, some genes like USP25 carried six times more influential HLMs than expected under null distribution, whereas 112 genes were expected to carry at least three influential HLMs but actually carried none. Thus, we ranked all the protein-coding genes according to the extent to which they enriched for influential HLMs. We found that both the top 10% (most enriched for influential HLMs) and bottom 10% (most depleted for influential HLMs) genes are more likely to be intolerant to loss-of-function mutations13 (OR=2.16 and 1.70, Fisher test p<0.001, respectively, figure 2D). They were also both more conserved in the primate lineage2 (OR=1.53 and 1.42, Fisher test p<0.001, respectively, figure 2E). One possible explanation of this result is that some sequences on some of the conserved genes are by nature more sensitive to mutation than other genes, and the mutations on these sequences are naturally more likely to be influential. To rule out this possibility, we applied saturated mutagenesis around HLMs and found that random mutations on each gene did not have a significant difference in the predicted effect on the RBP profile (online supplemental figure S2).
This bimodal distribution of conservation is in contrast with the classic notion of gene conservation: mutations on essential and conserved genes are less likely to survive natural selection, thus we would observe a depletion of influential mutations on these genes. As a negative control, we ranked all the genes according to the enrichment of influential common SNVs, and observed the expected unimodal distribution under this classic notion (online supplemental figure S3). Specifically, genes more depleted for influential common SNVs were more likely to be loss-of-function-intolerant and conserved in the primate lineage, in contrast to RBP-HLM enrichment results. We hypothesised that, instead of surviving natural selection, these influential RBP-HLMs on essential genes were favoured by natural selection and contributed to the evolutionary force that made us human.
To further verify this hypothesis, we defined 900 genes carrying at least one time more RBP-HLMs than expected (fold enrichment >2), termed RBP-gene, for further functional analysis. As revealed above, RBP-gene significantly enriched in loss-of-function intolerant genes13 (OR=2.10, Fisher test p<0.001), as well as primate constraint genes2 (OR=1.42, Fisher test p<0.001). We reasoned that these genes underwent frequent influential mutations during human speciation, and thus might have an important role in human brain evolution and cognition function. Therefore, we used these genes for further analysis.
RBP-gene involved in synaptic functions and GTPase pathway
Human brain has undergone the most outstanding alteration during human evolution. Thus, if the RBP-genes truly contributed to human evolution, we would anticipate their crucial involvement in brain functions and high expression in neurons. Indeed, by analysing single-cell transcriptome data,25 we found that the RBP-genes were highly expressed in the central nervous system (CNS): in foetal tissue, RBP-genes were only enriched in cerebellum, including Purkinje cells and several other subtypes of neurons (p<0.001, figure 3A). RBP-genes that were highly expressed in Purkinje neurons included USP25, ITPR1, KCNH8, and SCN8A, etc. In adult tissue, RBP-genes were enriched in different subtypes of excitatory neurons from the visual cortex and the frontal cortex (p<0.001, figure 3B). RBP-genes that were highly expressed in cortex excitatory neurons included NTRK2, NLGN1, GABRB2, CACNA1D, etc. The number of CNS cell types with nominally significant enrichment (29) was also larger than all other systems and organs. Interestingly, the cerebral cortex and cerebellar Purkinje cells are vital for cognition functions and cooperation in bipedal walking,2 both of which are key functions during human evolution.2 Taken together, RBP-genes were mostly enriched in the foetal cerebellum and adult cortex, since both the enrichment p value and the number of enriched cell types were the highest compared with other tissues and organs.
Figure 3Functional characteristics of RBP-genes. (A & B) WebCSEA results of cell type-specific expression in different embryonic (A) and adult (B) tissues. A low p value indicates that RBP-genes specifically expressed in the cell type at high significance. (C & D) Gene ontology enrichment analysis of biological process (C) and cellular component (D). Font size indicates fold enrichment, and colour indicates enrichment p value. We manually selected specific terms with FDR-adjusted p<0.05. FDR, false discovery rate; RBP, RNA-binding protein.
We further analysed the biological functions of RBP-genes by GO analysis. As shown in figure 3C and online supplemental table S5), RBP-genes significantly enriched in synapse organisation (p<0.001), regulation of membrane potential (p<0.001), dendrite development (p<0.001), synaptic vesicle cycle (p<0.001), cell junction assembly (p<0.001) as well as other pathways related to neuronal and synaptic functions. Despite neuron-related pathways, RBP-genes also showed strong enrichment in the regulation of GTPase activity (p<0.001). In cellular component analysis (figure 3D), we found that RBP-genes were mainly located in various components of neurons, including presynapse (p<0.001), postsynaptic specialisation (p<0.001) and dendritic spine (p<0.001).
Taken together, genes involved in the synaptic organisation and other pathways of synapse carried an excess number of HLMs that had a large impact on post-transcriptional modification, which might contribute to the evolution of the human brain.
RBP-genes carried excess severe mutations of neurodevelopmental disorders
Human brain has shaped the cognitive functions of modern human, and the genetic architecture of human brain evolution contributes to the genetic basis of brain disorders.26 We hypothesised that rare, damaging variants that disrupt RBP-gene have contributed to brain disorders. As shown in figure 4A, using published cross-sectional burden test17 result for autism, we found that compared with background genes, RBP-genes generally carried the excess burden of damaging coding mutations in patients (fold enrichment=4.33, Fisher test p<0.001). A similar but less significant result was also found in schizophrenia18 (fold enrichment=4.58, Fisher test p=0.004). In trio-based WES analysis, RBP-gene also carried excess de novo damaging mutations in probands with autism19 (fold enrichment=2.87, Fisher test p<0.001) and probands with developmental delay27 (fold enrichment=2.01, Fisher test p<0.001). Similarly, RBP-genes were also more likely to be the risk genes of brain Mendelian disorder (fold enrichment=1.34, Fisher test p=0.001). We repeated these analyses after controlling covariates like LOEUF and gene length and achieved consistently significant results, although with lower statistical power (online supplemental table S6). These results suggested that rare and severe mutations in RBP-genes are more likely to cause neurodevelopmental disorders.
Figure 4Phenotypic consequence of RBP-genes. (A) Enrichment of RBP-genes in brain diseases gene lists. The error bar indicated a 95% CI. (B) Quantile-quantile plot of LDSC p value for trait heritability enrichment around RBP-gene. (C) Quantile-quantile plot of AMM p value for trait heritability enrichment around RBP-gene. (D) Quantile-quantile plot of SLDP p value for trait heritability enrichment around humanisation score. (E) Quantile-quantile plot of LDSC p value for trait heritability enrichment around absolute humanisation score. AMM, abstract mediation model; ASD, autism spectrum disorder; BrainMD, brain Mendelian disorders; Dnm: de novo mutation; LDSC, linkage disequilibrium score regression; RBP, RNA-binding protein; SCZ, schizophrenia; SLDP, sign linkage disequilibrium profile.
We further analysed whether common variants on RBP-genes had a significant phenotypic consequence. By applying LDSC on a set of about 1000 polygenic traits, we found that the SNV around RBP-gene did not explain a significantly higher proportion of trait heritability (FDR-adjusted p value>0.05, figure 4B and online supplemental table S6). Using AMM instead of LDSC also revealed no significant result (figure 4C and online supplemental table S7). This result could be expected if the effect of post-transcriptional modification on human evolution is oligogenic instead of polygenic. In fact, if human speciation were driven by a few vital mutations in a few vital genes, there would not be a large number of variations with small but non-zero contributions. Under this scenario, the large number of common SNVs actually had no association with post-transcriptional modification during human speciation. Given the fact that influential mutations are mostly eliminated by purifying selection and the remaining RBP-HLMs are very sparse, the oligogenic view is plausible.
Oligogenic view of post-transcriptional modification changes in human evolution
To assess this theory, we used Seqweaver to calculate the RBP difference of all the genome-wide transcribed regions between hg38 and primate common ancestor genome alignment. We then applied Seqweaver to calculate how each 1000 genome common SNV intensifies (over-humanise) or weakens (de-humanise) this difference, termed humanisation score. If human evolution on RBP profile is polygenic, a large number of transcribed regions would have RBP alterations that have a small phenotypic effect. Then, the common SNVs with large over-humanisation or de-humanisation effects would collectively explain an excess proportion of trait heritability. However, this is not true: humanisation score was not significantly associated with trait heritability in LDSC (all traits had FDR-adjusted p value>0.05, figure 4E and online supplemental table S8); the result remained insignificant when taking the direction of humanisation score into consideration by using the SLDP (figure 4D and online supplemental table S9). Taken together, these findings are in line with an oligogenic view of human evolution of RBP profile, although it is difficult to draw statistical conclusions from null results.
Prioritising ITPR1 and NTRK2 in human evolution
The oligogenic view of the human evolution of the RBP profile suggested that among the 900 RBP-genes showing enrichment of influential HLM, only a small subset might actually contribute to human evolution, which gave rise to the functional enrichment of RBP-genes. Thus, we sought to aggregate all functional and phenotypic evidence in the above analysis for all genes (online supplemental table S10) and prioritise the most probable genes that took part in the human evolution of the RBP profile. As shown in figure 5A, there were 22 RBP-genes that had at least five pieces of evidence of functional and phenotypic importance. Two top genes, NTRK2 and ITPR1, had seven pieces of aggregated evidence.
Figure 5Prioritising top candidate genes. (A) All RBP-genes with at least five aggregated evidences. (B) Seqweaver predicted the difference of EVAVL-RNA binding affinity in the human brain between hg38 and primate common ancestor sequence of NTRK2. Each dot represented a 1000 bp block on NTRK2. (C) Same as B, but for ITPR1. (D) Spatiotemporal expression of NTRK2 in the human developing brain. Figure generated at Brainspan website. (E) Same as D, but for ITPR1. AMY, amygdala; ASD, autism spectrum disorder; Brain MD, brain Mendelian disorders; CBC, cerebellar cortex; DD, development delay; HIP, hippocampus; ITPR1, Inositol 1,4,5-Trisphosphate Receptor Type 1; LOEUF, loss-of-function observed/expected upper bound fraction; MD, mediodorsal nucleus of thalamus; NCX, neocortex; NTRK2, neurotrophic receptor tyrosine kinase 2; RBP, RNA-binding protein; SCG, selectively constrained genes; SCZ, schizophrenia; STR, striatum.
NTRK2 (chr9: 84668522–85027054) encodes neurotrophic receptor tyrosine kinase 2. NTRK2 is intolerant to loss-of-function in human and is conserved in the primate lineage, highly expressed in excitatory neurons and takes part in the synapse process and GTP pathway (figure 5A). Seqweaver revealed that HLM at the fourth intron of NTRK2 has profoundly decreased the binding affinity with EVAVL in the human brain (ΔRBP=–0.33, figure 5B), which was the largest alteration among all the 217 RBPs. In Brainspan data,14 the cerebral expression level of NTRK2 consistently increased until childhood, and remained at peak expression level until adulthood (figure 5D). This is in line with the fact that NTRK2 is associated with neurodevelopmental disorders including developmental delay and multiple brain Mendelian disorders (figure 5A) like astrocytoma and developmental and epileptic encephalopathy.
ITPR1 (chr3: 4493348–4847506) encodes inositol 1,4,5-trisphosphate receptor type 1. ITPR1 is also conserved in both human and primate lineages, and is highly expressed in both excitatory neurons and Purkinje neurons (figure 5A). In Seqweaver analysis, the HLM in the second intron of ITPR1 caused the most profound alteration of RBP (ΔRBP=–0.30, figure 5C). Interestingly, this alteration was also on EVAVL binding affinity in the human brain, just like NTRK2. Consistent with its high expression in Purkinje neurons, ITPR1 has the highest expression in the cerebellum, and the expression value continuously increases throughout human developmental periods (figure 5E). Furthermore, ITPR1 has been identified as a risk gene for several cerebellar genetic disorders, such as different subtypes of spinocerebellar ataxia and Gillespie syndrome, suggesting that ITPR1 may have played a role in the evolution of bipedal walking.