Introduction

The human microbiome refers to the millions of microbial organisms that live on different body sites [1, 2]. These microbes are essential for maintaining human health by producing nutrients and vitamins, regulating the immune system, providing the host with beneficial molecules, protecting the gut barrier, and fighting off disease-causing pathogens. For healthy humans, the microbiome gets populated immediately at birth. After birth, various microbial species from mothers and from the surrounding environments quickly colonize the infants, especially in the gastrointestinal tract. This initial colonization is influenced by birth modes (vaginal delivery or C-section), use of antibiotics, breastfeeding, and several other factors [3,4,5,6,7]. The maternal gut has been found to be the largest source of colonizing bacteria in the gastrointestinal tract of healthy infants [8].

The composition of the human microbiome is thought to evolve and change as humans age. After birth, the infant’s gut microbiome undergoes a development phase for about a year, then a transition phase for another year, and then reaches a stable phase [9]. Youth, adult, and senior populations each harbor age-specific features in their microbiomes [10]. Gut microbiome composition can be disrupted due to acute diseases or infections [11], diet [12, 13], use of antibiotics [14] and other drugs [15], travel [16], and many other factors. However, the composition of an established gut microbiome has a tendency to restore after disruption [17]. The stability of the human gut microbiome has been shown by longitudinal studies. One study [18], which followed subjects for years, showed that some microbial species can stay in the human gut for many years. However, so far, no study has demonstrated whether the microorganisms established at birth or in early childhood, either transmitted from parents or obtained from the environment, remain in the human body until adulthood or senior age. To directly answer such a question is difficult, because microbiome samples at childhood and at later adulthood for the same individual would need to be collected and compared.

The microbial species in the human microbiome exhibit large genomic variations among different individuals. A large number of single nucleotide variations (SNVs), short insertions/deletions (indels), and structural variants from the same species have been found among different samples [19]. These metagenomic variations are very unique and can be used as signatures to identify individuals [20]. Here, we analyzed 1004 gut microbiome samples from senior adults (65 ± 7.8 years) from the TwinsUK cohort [21]. In this study, by analyzing the rare SNVs and indels in this metagenomic dataset, we can indirectly show whether species in the human gut acquired at birth or in early childhood can persist for a lifetime until senior ages.

The microbiome datasets from the TwinsUK cohort generated by either 16S rDNA sequencing or shotgun metagenomic sequencing have been analyzed in several different studies, which have suggested a heritability of the human microbiome and how genetic and environmental factors impact the gut microbiome, as well as the associations between the gut microbiome and metabolites in stool and blood samples [22,23,24,25]. In these studies, as well as other microbiome studies involving samples from twins [26], it has been repeatedly observed that the similarity of the microbiome between twin pairs is significantly higher than in unrelated subjects. These observations were explained by the host genetics, which also explains why the microbiome similarity within the monozygotic (MZ) twin pairs is greater than the dizygotic (DZ) twin pairs [22,23,24]. However, the previous studies have not really answered the question as to when these species shared by twin pairs started to colonize the gut.

In this study, we used the same dataset from the TwinsUK cohort, but focused on a set of rare SNVs and indels in the microbial species, and generated results that support the observation that the shared rare genomic variants between twin pairs are from species that colonize them at an early age.

Results and Discussion

In this study, we re-analyzed the stool samples from 1004 individuals from the TwinsUK cohort that were sequenced using the shotgun metagenomic approach from our previous study [25]. This dataset includes 161 MZ twin pairs, 201 DZ twin pairs, and 280 individuals without a matching twin. We first identified a subset of abundant and common species for the analysis of SNVs and indels. We found 27 species that meet the following criteria: (1) present in more than 100 individuals, (2) present in at least 10 twin pairs, (3) average depth of sequencing coverage ≥ 2, (4) the fraction of genome covered by mapped metagenomic reads is ≥ 75%. These criteria ensure enough sample size for statistical analysis and sequencing coverage for reliable variant calls. The list of the 27 species and number of samples and number of twin pairs for these species are presented in Table S1.

SNVs and indels for these species were identified using the Varscan 2 program (see Materials and Methods). For each species, rare variants (SNVs or indels) that were only found in less than 20% of samples were kept for further analysis. On average, 171,316 ± 90,761 rare SNVs and 3308 ± 1424 rare indels were found per species (Table S1). For each species, a variant cutoff is defined so that among all the pairs of unrelated individuals having this species, only 1% of these pairs share more rare variants than this cutoff (Table S1). For example, the cutoff for Methanobrevibacter smithii is 971. It means that there is a 1% probability that a pair of unrelated individuals with M. smithii share ≥ 971 rare variants. Here, for a species, if two individuals meet this cutoff, we consider they share the same strain-specific variant signature (at a significance level of 0.01). Having strain-specific variant signatures indicates that the strains from the two individuals can be traced back to the same origin.

For some species, the frequency for the twin pairs sharing strain-specific variant signatures is much higher than 1%, which is the expected frequency for unrelated pairs. Exact binomial test (binom.test in R package) was used to calculate the p value for the observed number of twin pairs sharing strain-specific variant signature (Table 1). These species, in order of increasing p value, are M. smithii, Bacteroides caccae, Bifidobacterium adolescentis, Alistipes putredinis, Bacteroides fragilis, Bacteroides dorei, Eubacterium hallii, Phascolarctobacterium faecium, Bacteroides uniformis, Blautia obeum, Bifidobacterium longum, Dialister invisus, and Akkermansia muciniphila.

Table 1 Fraction of twin pairs that share strain-specific variants

While it is expected that only 1% of pairs of unrelated individuals may share strain-specific variant signatures by chance, for M. smithii, 36% of twin pairs did. This suggests that for twin pairs where both individuals have M. smithii, 36% of them can be traced back to the same origin. Other species with high frequency for the twin pairs sharing strain-specific variant signatures include B. fragilis (23%), B. caccae (14%), and E. hallii (9%) (Table 1). This highly suggests that these particular strains shared by the twins originated from the same source and colonized the twins at their early ages. Otherwise, although the twins may preferably select similar microbial species in later adulthood while they are physically and geographically separated, it is difficult to explain why these strains share so many rare genomic variants. In addition, we did not find significant differences between MZ and DZ twin pairs.

Our observation about the strain-specific variant signatures shared between twins is highly connected with the heritability of the microbiome. Species with most genomic variants preserved in twins, such as M. smithii, are also among the list of heritable microorganisms shown in previous publications [18, 22,23,24]. In fact, if a species that is originally colonized both twins during childhood survived until their adulthood, this species is likely to be a heritable species. However, there are differences between heritability analysis and our analysis. Heritability was calculated based on the similarity of microbial species abundance between individuals’ microbiome [18, 22,23,24]. But in this study, species abundance was not considered, only the variants were compared. In heritability analysis, the effect of non-genetic factors (e.g., environment) on the data was modeled and removed [22,23,24], so the estimated heritability is only due to the genetic factors. Therefore, the heritable species were different between twins and unrelated pairs and were also different between MZ and DZ twins. In our analysis, we only observed the difference between twins and unrelated pairs, not between MZ and DZ twins. So, shared environment played an important role in addition to genetic factors in our analysis.

Some species, such as Bacteroides vulgatus, Subdoligranulum sp. APC924/74, and Dorea longicatena, showed no significant difference between twin pairs and unrelated pairs (Table 1), which indicates that these strains colonized in later adult stages when the twins were separated, or the strains had diverged too much to be detected by our genomic variant analysis, or had been replaced by related strains.

It is expected that microbes, after the initial colonization in the human gut, continue to evolve and diverge. However, to answer the question about how fast different species in the human microbiome evolve over lifetime, longitudinal samples are needed. So, we cannot address this question here. A previous study [18] with longitudinal samples for up to 5 years showed microbiome community divergence over time based on 16S rDNA sequencing data. In this study, we tried to see if there is a correlation between number of shared variants in twins and their age. However, we did not find any significant correlation. This suggested that although the species continued to evolve, the accumulation of genomic variants did not happen at a linear rate. Otherwise, we should have observed that older twins have less shared variants than younger twins.

In this study, using strain-level genomic variant analysis on one of the largest available human metagenomic datasets from the TwinsUK cohort, we estimate that some species in the human gut microbiome may stay for life after their initial colonization since early childhood. After many decades, the strain-specific genomic variant signatures can still be detected. To further understand how the species evolve and how their genomic variants occur over a lifetime, more investigations with longitudinal samples are needed.

Materials and Methods

Study Cohort and Samples

The subjects are from the TwinsUK adult twin registry, which includes about 14,000 subjects, predominantly females [21]. The stool samples were collected from a subset of subjects. Details are described in our previous study [25]. Briefly, stool samples from 1004 individuals (39 males, 965 females) of European ancestry were collected. These subjects (65.0 ± 7.8 years) were all living in the UK at the time of specimen collection. This dataset includes 161 monozygotic (MZ) twin pairs, 201 dizygotic (DZ) twin pairs, and 280 singletons. Data on the TwinsUK participants are available for research under managed access due to governance and ethical constraints and can be requested from http://twinsuk.ac.uk/resources-for-researchers.

Metagenomic Sequencing

DNA extraction from stool samples, DNA library preparation, and metagenomic sequencing of the 1004 samples were also described in our previous study [25]. Sequencing of the stool samples yielded an average number of reads of 54 M per sample. The raw metagenomic sequences are available from the European Nucleotide Archive website (study accession number: PRJEB32731).

Metagenomic Sequence Analysis

Raw reads were processed using Trimmomatic (version 0.36) [27] to trim low-quality bases (parameters: SLIDINGWINDOW:4:20 LEADING:3 TRAILING:3 MINLEN:80 MAXINFO:80:0.5). Only paired-end (PE) reads both of ≥ 80 bp were kept for analysis. High-quality PE reads were mapped to the human reference genome (hg38) with BWA-MEM (version 0.7.12) [28] with default parameters and removed, if they mapped concordantly with an alignment score of ≥ 60. We maintain a comprehensive microbiome reference genome database for mapping the metagenomic reads. This database was compiled and curated from NCBI Refseq genomes covering complete and draft bacteria, archaea, viruses, fungi, and microbial eukaryotes species. Currently, the database contains 27,115 representative genomes. Taxonomy profiles were determined through reference genome mapping using Centrifuge (version 1.0.4) [29] with default parameters.

After initial mapping using Centrifuge, we first filtered out mapped genomes from contamination. The depth of coverage (read length x number of mapped reads/genome length) and fraction of coverage (number of bases covered by mapped reads/genome length) for each mapped genome was calculated using the alignment file provided by Centrifuge. Some genomes were filtered out using the following criteria. In sequencing, the observed number of times a base is sequenced follows a Poisson distribution [30]. Given the observed depth of coverage calculated from the alignment file, the expected fraction of coverage is (1–1/(edepth of coverage)) [30]. If the observed fraction of coverage was smaller than 1/10 of the expected value, then it suggested aligned reads were piled up at a small fraction on the genome rather than uniformly distributed along the genome, the genome was removed.

Then relative taxonomy abundance was calculated based on the filtered mapping results.

Genomic Variant Analysis

Our approach for genomic variants is similar to several existing methods that can analyze metagenomic data at strain-level such as StrainPhlAn [31], StrainEST [32], and MIDAS [33]. StrainPhlAn detects single nucleotide variation (SNV) based by mapping the reads to a set of conserved and unique species marker genes. StrainEST and MIDAS perform full-length genome SNV calling. Here, we implemented a novel approach for genomic variant analysis, including analysis of both SNVs and also indels.

First, we selected the species with at least 2× depth of coverage, based on the mapping data from the previous Centrifuge run. We then performed a second round of mapping to align the reads to these selected species using BWA-MEM [28] (parameters: -T 80). In this round of reads mapping, only one representative genome was used for one species. After mapping, we analyzed the BAM file using SAMtools (version 1.9) depth command [34] and calculated the depth of coverage and fraction of coverage for each genome. Genomes were removed if the depth of coverage is < 2 or fraction of coverage is < 0.75. SAMtools mpileup command was used to convert the BAM file to a mpileup file. Then, Varscan (version 2.4.2) pileup2snp [35] was used to call SNVs (parameters: --min-coverage 2 --min-reads2 2 --min-var-freq 0.66). Varscan pileup2indel was used to detect indels (parameters: --min-coverage 2 --min-reads2 2 --min-var-freq 0.66).

For one species, the detected SNVs and indels were collected for the samples that had this species at ≥ 2 depth of coverage and ≥ 0.75 fraction of coverage. And then from these collected variants, only rare SNVs and indels that exist in less than 20% of the samples were kept for further analysis, because common SNVs or indels are not discriminative in tracking strains between subjects.