Great ape Y Chromosome and mitochondrial DNA phylogenies reflect subspecies structure and patterns of mating and dispersal

The distribution of genetic diversity in great ape species is likely to have been affected by patterns of dispersal and mating. This has previously been investigated by sequencing autosomal and mitochondrial DNA (mtDNA), but large-scale sequence analysis of the male-specific region of the Y Chromosome (MSY) has not yet been undertaken. Here, we use the human MSY reference sequence as a basis for sequence capture and read mapping in 19 great ape males, combining the data with sequences extracted from the published whole genomes of 24 additional males to yield a total sample of 19 chimpanzees, four bonobos, 14 gorillas, and six orangutans, in which interpretable MSY sequence ranges from 2.61 to 3.80 Mb. This analysis reveals thousands of novel MSY variants and defines unbiased phylogenies. We compare these with mtDNA-based trees in the same individuals, estimating time-to-most-recent common ancestor (TMRCA) for key nodes in both cases. The two loci show high topological concordance and are consistent with accepted (sub)species definitions, but time depths differ enormously between loci and (sub)species, likely reflecting different dispersal and mating patterns. Gorillas and chimpanzees/bonobos present generally low and high MSY diversity, respectively, reflecting polygyny versus multimale–multifemale mating. However, particularly marked differences exist among chimpanzee subspecies: The western chimpanzee MSY phylogeny has a TMRCA of only 13.2 (10.8–15.8) thousand years, but that for central chimpanzees exceeds 1 million years. Cross-species comparison within a single MSY phylogeny emphasizes the low human diversity, and reveals species-specific branch length variation that may reflect differences in long-term generation times.

approaches are merged into an overlapping set. We observe statistically significant differences between WGS and sequence-capture in the mean number of variant sites in each species. However, no significant differences are seen between capture and WGS samples for any of the species when these are extracted from the final merged dataset (Supplemental Table S3). Estimates of divergence from human for individual samples, whether based on sequence-capture, WGS, or the merged dataset, are very similar (Supplemental Table S4), showing that when overlapping MSY regions are retained then the datasets behave in the same way.
The obtained differences between capture and WGS data are likely due to a combination of factors such as inappropriate reference sequence, differences in read lengths and depth of coverage, and in the specific MSY regions retained in each analysis. Different regions of a chromosome may differ in mutation rate, as has been reported recently for MSY (Trombetta et al. 2015). The reference sequence effect can neither be overcome nor properly evaluated until high-quality MSY sequences for all species become available. Notably, when mapping chimpanzees to the chimpanzee reference, we see an increase in the total number of variants identified, but the fact that TMRCA estimates are only very slightly affected suggests a minor effect on the overall results and conclusions of our analysis.

Concordance of genotype calls
Overlap of some great-ape samples between datasets allowed an assessment of genotype concordance and therefore also gave an indication of data quality in our sequencing.
For mtDNA, GGG_Guy was sequenced both by us and within a published dataset (Xue et al. 2015), and PPA_Bono was sequenced by us and in a published dataset (Prado-Martinez et al. 2013). In each case, comparing respectively 15,453 and 15,447 bp, there were no discordant calls.
For MSY, PPA_Bono was sequenced by us and in a published dataset (Prado-Martinez et al. 2013), and showed no discordant calls out of the 54,695 variant sites used to build the cross-species phylogeny shown in Figure 4.
For autosomal data a total of 28 genotypes out of 14,024 variant sites differed for PPA_Bono between the datasets, suggesting an overall discordance rate ~0.2%.

Relatedness analysis
Relatedness is known to be a problem when using great-ape samples, in particular for captive-born individuals. For our samples we had no prior knowledge of possible relatedness, except for gorillas GGG_Nikumba and GGG_Tomoka, who have been reported to be father and son (Boyer et al. 1973). Also, some individuals were known to be closely related based on published genetic data (Prado-Martinez et al. 2013;Xue et al. 2015). In order to estimate the kinship coefficients between all samples, the software KING was used (Manichaikul et al. 2010). Among bonobos, no relatedness was identified. Among chimpanzees, a second-degree relationship was identified between PTV_8 and PTV_Donald, but since their MSY sequences differ by 82 variants across the final 2,496,576 bp, both samples were retained in all MSY analyses. However, because no differences were found in the mtDNAs between the two, PTV_Donald was retained in the mtDNA tree, but removed from dating and calculations of genetic diversity. Among gorillas, the father-son relationship between GGG_Nikumba and GGG_Tomoka was confirmed. For other gorillas we obtained the same relationships as previously published (Prado-Martinez et al. 2013;Xue et al. 2015). Among orangutans, first-degree relationships were found between PPY_Thai and PPY_Temmy, and PAB_Sinjo and PAB_Babu.

ADMIXTURE analysis
Analysis of shared genetic components in autosomal data was done using the ADMIXTURE program (Alexander et al. 2009) with 10-fold cross-validation (CV) (Alexander and Lange 2011). In order to avoid biases from related individuals, the following samples were excluded: PTV_Donald, GGG_Tomoka, GGG_Bulera, GGG_Suzie, GGG_Oko, GGG_Kowali, GBB_Umurimo, GBB_Maisha, PPY_Temmy and PAB_Babu. Overall the results (Supplemental Figure S3) agree well with the PCAs (Figure 2; Supplemental Figure S1). The CV error is lowest for k=1 for bonobos, k=3 for chimpanzees, and k=2 for gorillas and orangutans (Supplemental Figure S4). For the final number of autosomal variants used for each species, see Supplemental Table S2.

Ancestral states of MSY variants
In order to define the ancestral states for the final filtered variants used in intraspecific phylogenetic tree construction (Figure 3), we used sequence data generated from the other species (including humans), as well as the phylogenetic information from the maximum parsimony tree extracted from the PHYLIP outfile.
First, final variant sites from the species to be rooted were called from other species using SAMTools (Li et al. 2009) as described above, filtered for strand bias and minimum depth ≥ 1. For rooting the bonobo and chimpanzee trees, the allele matching gorillas or orangutans was preferred, and when missing data or nonmatching alleles precluded this, the human allele was used. For gorillas, the orangutan allele was preferred and for orangutans the gorilla allele was preferred. If the sites were missing in the other species or alleles did not match, then the ancestral allele was assigned randomly.
The allelic state defined using other species was checked against the reconstructed ancestral state from the PHYLIP output. If the defined allele did not match with the deepest rooting branches, but an allele from a different species did, then the ancestral state was changed and after checking all sites the maximum parsimony tree was constructed again. For all species this left a small proportion of sites (0.12-1.82% of variants) where the ancestral state was not supported by reconstructed states in the PHYLIP output. Numbers in parentheses in the 'Node' column refer to numbered nodes in the trees in Figure  3; for abbreviations in the same column, refer also to Figure 3; N -number of individuals; Snumber of variable sites; TMRCA -time to most recent common ancestor; HPD -highest posterior density. Numbers in parentheses in the 'Node' columns refer to numbered nodes in the trees in Figure  3; for abbreviations in the same column, refer also to Figure

D Orangutans P P A _ B o n o P P A _ P P A 2 P P A _ M a s ik in i P P A _ D e s m o n d P P A _ C a th e ri n e P P A _ K o m b o te P P A _ K u m b u k a P P A _ H e rm ie n P P A _ D z e e ta P P A _ L B 5 0 2 P P A _ H o rt e n s e P P A _ K o s a n a P P A _ C h ip it a P P A _ N a ta li e P P A _ S a lo n g a
Supplemental Figure S3: Model-based estimation of ancestry in great apes using ADMIXTURE. All individuals (columns) are grouped into di erent clusters (k=2 to k=5, rows) coloured according to their shared genetic structure. Names of great-ape males sequenced in this study are in red. In each plot the y-axis ranges from zero to 150 ×. Sample size (N) is given to the left, and the extent of sequence obtained (bp) to the right. Chimpanzees carry two distinct structural variant sequences (Ptr1 and 2) di ering by a large deletion highlighted by a magenta bar; the gaps indicated by the green bars correspond to large blocks that are absent from the human reference sequence. The Ptr2-speci c retention of a sequence block indicated in Figure 1 is not visible here since it does not exist in the chimpanzee reference sequence.