Indigenous Arabs are descendants of the earliest split from ancient Eurasian populations

An open question in the history of human migration is the identity of the earliest Eurasian populations that have left contemporary descendants. The Arabian Peninsula was the initial site of the out-of-Africa migrations that occurred between 125,000 and 60,000 yr ago, leading to the hypothesis that the first Eurasian populations were established on the Peninsula and that contemporary indigenous Arabs are direct descendants of these ancient peoples. To assess this hypothesis, we sequenced the entire genomes of 104 unrelated natives of the Arabian Peninsula at high coverage, including 56 of indigenous Arab ancestry. The indigenous Arab genomes defined a cluster distinct from other ancestral groups, and these genomes showed clear hallmarks of an ancient out-of-Africa bottleneck. Similar to other Middle Eastern populations, the indigenous Arabs had higher levels of Neanderthal admixture compared to Africans but had lower levels than Europeans and Asians. These levels of Neanderthal admixture are consistent with an early divergence of Arab ancestors after the out-of-Africa bottleneck but before the major Neanderthal admixture events in Europe and other regions of Eurasia. When compared to worldwide populations sampled in the 1000 Genomes Project, although the indigenous Arabs had a signal of admixture with Europeans, they clustered in a basal, outgroup position to all 1000 Genomes non-Africans when considering pairwise similarity across the entire genome. These results place indigenous Arabs as the most distant relatives of all other contemporary non-Africans and identify these people as direct descendants of the first Eurasian populations established by the out-of-Africa migrations.

-6proportion of differences between the human and macaque reference after Jukes-Cantor correction) were estimated over 100 kb loci obtained by grouping bases along chromosomes that remained after filtering. We refer to these normalized diversity estimates for individual populations in ChrX, the autosomes, or the ratio of X/A as absolute diversity estimates (Gottipati 2011;Arbiza 2014). Absolute X/A diversity in individual populations is influenced by several biases affecting ChrX and the autosomes differently, including the effect of selection on genic sites in linkage disequilibrium with intergenic regions (Gottipati 2011;Arbiza 2014), we also used genetic distance to the nearest gene (HapMap recombination rates, scaled by two-thirds for ChrX to yield sex-averaged rates, and RefSeq genes) to obtain a collection of regions that are far from genes and the least influenced by any such effect. To do this we partitioned all bases passing filters to each of seven bins spanning distances between 0.0 and 0.4 cM in such a manner that an equal fraction of bases in ChrX fit each bin and estimated absolute X/A diversity in all bases falling in the 7th and last bin (0.18 -0.40cM) using the same procedures described above.
We also consider relative X/A diversity estimates, defined as the ratio of diversity estimates between a given pair of populations (e.g., relative X/A diversity). Standard errors of the mean for absolute and relative diversity estimates in ChrX, the autosomes, and X/A were obtained by a moving block bootstrap procedure (Liu and Singh 1992;Lahiri 2003;Keinan 2007), where 10,000 random data sets were produced by resampling with replacement from the full set of 100 kb loci, independently for ChrX and the autosomes, and used to compute standard errors for all estimates.

Coalescent Analysis
The pairwise sequential Markov coalescent (PSMC) (Li and Durbin 2011) was applied to the 96 Q1 (Bedouin), Q2 (Persian-South Asian), or Q3 (African) Qatari genomes. A plot of effective population size vs years in the past was generated for each of the genome using -7instructions from the PSMC manual (Li and Durbin 2011), first generating genome-wide calls using SAMtools mpileup command ) then applying the PSMC algorithm to infer demographic history for each sample. The parameters in the PSMC model include the mutation rate (μ) and generation time (g) of μ =1.2x10 -8 and g=25 years (Roach 2010). Since PSMC outputs different x-axis coordinates for each individual genome, in order to calculate the average effective population size across multiple time points for each population, a spline was fitted to each PSMC output. The spline fitting function, as implemented in the R lattice package (lattice: Trellis Graphics for R 2015), outputs effective population size in 1000 year intervals from 1000 years ago to 1 million years ago. The median was calculated for Q1 (Bedouin), Q2 (Persian-South Asian), and Q3 (African) and plotted ( Figure 3).

Genome-Wide Admixture Analysis
A genome-wide admixture analysis was conducted on the combined dataset of 104 Qatari genomes, 1000 Genomes minus Human Origins overlap, and Human Origins (see Supplemental Methods). Using the integrated and linkage disequilibrium-pruned dataset of 197,714 SNPs, the genome-wide ancestry proportions were calculated using ADMIXTURE (Alexander 2009). The ADMIXTURE algorithm takes as input a dataset and a K value which indicates the expected number of ancestral populations for the dataset, and outputs the proportion of ancestry in each ancestry population for each individual. The result is visualized in a plot generated using R, which color-codes each ancestral group. In order to determine the optimal K, the analysis was run for K=3 to K=18, and the cross validation error was calculated for each iteration.
Three visualizations of the data were produced. The first includes all genomes in the combined dataset, sorted by dataset (96 Q1, Q2 or Q3 Qatari genomes, 1000 Genomes minus Human Origins, and Human Origins) and then by region (Africa, America, Central Asia / Siberia, East Asia, South Asia, and West Eurasia) as defined in the supplement of Lazardis et al -8 -2014 (Lazaridis 2014). In this analysis, the lowest cross validation error was observed for K=12, and at this level of resolution, an ancestry component that dominates in Q1 (Bedouin) Qataris, Saudi, and Bedouin B populations was observed. A second visualization includes populations defined here as Middle Eastern, as well as populations in the K=12 analysis with >10% admixture in the Bedouin B component that also dominates in Q1 (Bedouin) Qataris and Saudi.
A third visualization looks at the K=12 results for both the full dataset and the Middle Eastern dataset (Supplemental Figure 7).

African Admixture Proportion and Timing
ALDER 1.2 (Loh 2013) was used to analyze the proportion and timing of African admixture in Qatari populations, the genomes of Qataris and world populations. For each Q1 (Bedouin), Q2 (Persian-South Asian), or Q3 (African) Qatari and Human Origins population, Yoruba was used as a reference panel, and the proportion and timing of African admixture was estimated with confidence intervals established using a block jackknife (Reich 2009). A result was obtained for a subset of the populations; the remainder resulted in ALDER runtime errors due to small sample size or high error estimates.

Local Admixture Analysis
We performed an admixture deconvolution analysis on the 96 Q1 (Bedouin), Q2 (Persian-South Asian), or Q3 (African) Qatari genomes using the 11,711,386 autosomal SNPs segregating in both 1000 Genomes and Qatari genomes. The Qatari genomes and 1000 Genomes autosomal haplotypes were phased using SHAPEIT2 using default parameter settings (Delaneau 2013) and SupportMix (Omberg 2012) was used to assign 2000 SNP autosomal segments of each Qatari using populations of the 1000 Genomes as "ancestral proxy reference panels". Six of the 14 populations were used, including two European, two Asian, and two African. The European populations were CEU (Utah residents with Central and Western European ancestry) and TSI -9 -(Tuscan Italian), the Asian populations were CHB (Han Chinese Beijing) and JPT (Japanese in Tokyo), and the African populations were YRI (Yoruba in Nigeria) and LWK (Luhuya in Kenya) (see Supplemental Table IV for population details). The input parameters of SupportMix included the phased reference populations (1000 Genomes), the generations to admixture (2,400, assuming generation time of 25 years and 60,000 years to admixture), and the ancestry inference window size (2,000 SNPs, approximately 0.5 cM). SupportMix divides each chromosome into intervals and infers the most similar population for each haplotype. Based on the results, a tract length distribution was constructed using a Python script that scans each haplotype, counting the length of each interval where the haplotype is assigned to the same continental population (African, European, or Asian).
The distribution of African ancestry tracts was tabulated for each Qatari population in R (R Core Team 2015) and plotted for visualization (Supplemental Figure 9). While in theory, identity by descent for an ancient African ancestry tract 2,400 generations ago would be <1 cM, with the poor sampling of populations closely related to Qatari in the 1000 Genomes "proxy reference panel," we expect more and larger tract assignments to African populations, such that the analysis provided a relative metric for assessing ancient vs recent African ancestry in the Qatari genomes.

Neanderthal Ancestry
In order to compare the proportion of Neanderthal admixture in Q1 (Bedouin) Qataris with that of other populations in 1000 Genomes (The 1000 Genomes Project Consortium 2012) and Human Origins (Lazaridis 2014), the qpF4ratio program in the AdmixTools 3.0 package (Patterson 2012) was used to calculate the F 4 ratio and the qpDstat program in the AdmixTools 3.0 package (Patterson 2012) was used to calculate the D-statistic.
-10 - The F 4 ratio estimates α, the proportion of Neanderthal ancestry in the focal population by detecting introgression based on incomplete lineage sorting as reflected in the allele frequencies (Reich 2009). The F 4 ratio is calculated for 5 populations, including one outgroup (O = chimpanzee), two archaic (A = Denisova, B = Neanderthal) and two contemporary (C = Yoruba, X = tested population). The model is that population B contributed α percent ancestry to population X (via a subset of B called B'), and population C contributed 1α (via a subset of C called C'). The F 4 ratio is obtained by dividing two f 4 statistics (f 4 (A,O;X,C) / f 4 (A,O;B,C)). The f 4 statistic detects introgression based on incomplete lineage sorting, which is reflected in the allele frequencies (Reich 2009).
The D-statistic assesses counts of derived Neanderthal alleles vs ancestral (Chimpanzee) alleles in Q1 (Bedouin) and a compared population. The D-statistic was calculated for 4 populations (W = tested population, X = Q1, Y = Neanderthal, Z = Chimp), inspecting sites where Y and Z differ. The counts are the number of sites where the Y (derived) allele is observed in W and the Z (ancestral) allele is observed in X, and vice versa. An excess of population Y (Neanderthal) alleles in population W compared to Q1 (Bedouin) results in a higher D-statistic score, which translates to higher Neanderthal ancestry in population W compared to Q1 (Bedouin).
The F 4 ratio and the D-statistic relative to Q1 (Bedouin) was calculated for each population in the combined Qatari genome (QG), 1000 Genomes minus Human Origins overlap (1000G-HO), and Human Origins (HO) dataset for populations from Africa, West Eurasia, Central Asia and Siberia, East Asia, South Asia, Oceania, Middle East, and America. Neanderthal and Chimpanzee genotypes for this analysis were obtained from the Human Origins dataset. For populations that overlap between 1000 Genomes and Human Origins, the analysis was conducted twice, once for the 1000 Genomes (excluding duplicates in Human Origins) and a -11second time for the Human Origins sample, where the 1000 Genomes populations are labeled using 3-letter codes (e.g. "YRI"), while Human Origins populations are labeled using full labels from the Lazardis et al. 2014 supplement (e.g. "Yoruba").

TreeMix Analysis
We performed a TreeMix analysis (Pickrell and Pritchard 2012) of the 96 Q1 (Bedouin), Q2 (Persian-South Asian), or Q3 (African) Qatari genomes and 1000 Genomes excluding admixed populations (PUR, MXL, CLM, and ASW; abbreviations defined in Supplemental Table   IV). After exclusion of admixed populations, the allele count for 11,701,491 biallelic SNPs was calculated using PLINK (Chang 2015), and Python (Python Software Foundation 2015) was used to convert this file to a TreeMix input file with reference and alternate allele counts for each SNP in each population. The maximum likelihood trees reflecting population splits among the potentially mixing populations were produced using default settings, allowing from 0-5 migrations (mixtures), each in a separate analysis (Supplemental Figure 11). Trees in the supplement were plotted using FigTree 1.4 (Rambaut 2015), and residuals were plotted using the TreeMix R package R functions (Pickrell and Pritchard 2012). For each migration event observed, both the direction and percentage of admixture was recorded. In each successive run of TreeMix, the original tree was included as a reference, and the migration edges were inferred de novo.
-12 -  (Bentley 2008) was used to discover potential variant sites in each individual genome. In the second phase, the GATK best practices workflow (DePristo 2011) was used to simultaneously call genotypes in all 108 Qatari genomes at all potentially variant sites identified by the CASAVA pipeline. Reads were remapped using BWA 0.5.9 (Li and Durbin 2009), PCR duplicate reads were removed using SAMtools , and variants were called from reads of mapping quality >10 and bases with quality >17 using GATK (DePristo 2011). GATK 0.2.6 was used for autosomes, while GATK 3.1 was used for ChrX, ChrY, and MtDNA in order to take advantage of the haploid chromosome calling algorithm implemented in this version. Autosomal genotyping was limited to potential variant sites identified by the CASAVA pipeline (Bentley 2008) and 1000 Genomes Project Phase 1 sites (The 1000 Genomes Project Consortium 2012), while ChrY genotyping was limited to 10 Mb of ChrY amenable to next-generation sequencing analysis, MtDNA genotyping included the complete Cambridge reference sequence (Anderson 1981), and ChrX genotyping included all non-N bases on the chromosome. A summary of mapping and genotyping results is presented, including per-genome means and population (n=108) totals. 2 Shown is the mean of each statistic per genome for the 108 Qatari genomes (with percentages in parentheses).

Supplemental
Each result is shown for the complete genome: autosomal sites (Chr1 to Chr22), ChrX in females, the 10 Mb of ChrY amenable to next-generation sequencing analysis, and MtDNA in all individuals. One of the individuals originally considered male has no calls on ChrY, hence genotypes are included for only autosomes and MtDNA in this individual. From top-to-bottom is shown the number of individuals (n); number of alleles (2n for autosomes, n for ChrY and MtDNA); total gigabases mapped to the 1000 Genomes Project version of the hg19/GRCh37 reference genome by BWA 0.5.9; surveyed genome mean number of genome sites mapped with ≥1 read and % of mappable genome; mean depth of mapped reads among sites with at least one read mapped; mean number and % novel (not in DbSNP 135 nor 1000 Genomes Phase 1) of biallelic SNPs per genome; and transition-totransversion (Ti:TV) ratio for biallelic SNPs. * Due to divide-by-zero errors in samples with no transversions, Ti:Tv ratios were calculated on n=74 samples with at least one transversion. 3 Statistics are shown for female ChrX only, as these were used for calculation of ChrX to autosome diversity (Gottipati 2011;Arbiza 2014  One of the second-degree relationships was reported subsequent to sample selection to be a full-siblingship, but according to KING's and PREST-plus's estimated kinship coefficients (0.1435 and 0.1014, respectively) is most likely a second-degree relationship, likely a half-siblingship. Shown are the ID, population, and gender of the 1 st and 2 nd relative in the pair, the inferred and confirmed relationship degree, the confirmed actual relationship, observed IBS proportions, inferred kinship coefficients using KING-robust and PREST-plus, and the estimated IBD proportions inferred by PREST-plus.  Figure 5B), X-linked to autosomal diversity (Supplemental Figure Figure 11) and pairwise neighbor-joining cluster analysis ( Figure 6). Shown are the group, population code, description of populations sampled, and the total number of males and females in each.    2014. For the remaining 1000 Genomes individuals, the population label is a 3-letter code as used in the 1000 Genomes Phase 1 paper (The 1000 Genomes Project Consortium 2012). For example, the "Yoruba" population in this study represents the Human Origins data, while the "YRI" population represents the 1000 Genomes sample not duplicated in Human Origins. 4 For Human Origins samples, the sample size is as in the Lazardis 2014 supplemental materials. In cases where Human Origins and 1000 Genomes samples overlapped, the Human Origins data was kept and the 1000 Genomes duplicate was removed, hence 1000 Genomes population sample sizes are reduced compared to Supplemental  Table IV. The Qataris include 104 total genomes, where n=4 Q1 (Bedouin) of the total 108 sequenced were excluded based on a first-degree or second-degree relationship to another individual in the sample (Supplemental Table III). . Plink outputs 3 estimates of inbreeding, named "fhat1", "fhat2", and "fhat3", in this analysis the "fhat2" estimate was used. Shown are the region, population, inbreeding coefficient mean, and standard deviation for each Qatari (QG), 1000 Genomes minus Human Origins samples (1000G-HO), and Human Origins (HO) populations with sufficient sample size. P-value 0.01662 0.02737 1 In order to confirm qualitative differences in the distribution of haplogroups between ChrY and MtDNA across Qatari populations, an analysis of molecular variance (AMOVA) (Excoffier 1992) was conducted using Arlequin (Excoffier and Lischer 2010). For each of 47 ChrY sequences and 96 MtDNA sequences in Q1 (Bedouin), Q2 (Perisan-South Asian), and Q3 (African) populations, the VCF file of genotypes for segregating sites (see Supplemental Table II) was converted into Arlequin input format using a Python script and then imported into Arlequin for analysis. The AMOVA variance components analysis and F st were calculated for a three-way comparison of the Qatari subpopulations and for all 2-way comparisons. Shown are the comparison, the statistic name, and the statistic result in MtDNA and ChrY. Estimates of nucleotide diversity (the average number of pairwise differences per base between all haploid samples in a population) normalized by human-macaque divergence (the proportion of differences between the human and macaque reference after Jukes-Cantor correction) were estimated over 100 kbp loci obtained by grouping bases along chromosomes that remained after filtering. Shown are the normalized (absolute) diversity estimates (with standard error in parentheses) for individual populations in the autosomes, ChrX, and the ratio X/A (Gottipati 2011;Arbiza 2014). Population codes for 1000 Genomes are described in Supplemental Table IV.  Table V, Yoruba is used as a reference panel, and the timing and percent of African admixture is estimated, with confidence intervals established using a block jackknife (Reich 2009). Shown are the populations where a result was obtained, based on sufficiently large population sample and low error of estimates. Populations are sorted from top to bottom by decreasing proportion of estimated African admixture. Shown are the regional origin, the population, generations to African admixture with standard error, and proportion of African admixture with standard error.   Genomes Project Consortium 2012) are described in Supplemental Table IV.   1  4  7  10  13  16  19  22  25  28  31  34  37  40  43  46  49  52  55  58  61  64  67  70  73  76  79  82  85  88  91  94  97