Indels, structural variation, and recombination drive genomic diversity in Plasmodium falciparum

The malaria parasite Plasmodium falciparum has a great capacity for evolutionary adaptation to evade host immunity and develop drug resistance. Current understanding of parasite evolution is impeded by the fact that a large fraction of the genome is either highly repetitive or highly variable and thus difficult to analyze using short-read sequencing technologies. Here, we describe a resource of deep sequencing data on parents and progeny from genetic crosses, which has enabled us to perform the first genome-wide, integrated analysis of SNP, indel and complex polymorphisms, using Mendelian error rates as an indicator of genotypic accuracy. These data reveal that indels are exceptionally abundant, being more common than SNPs and thus the dominant mode of polymorphism within the core genome. We use the high density of SNP and indel markers to analyze patterns of meiotic recombination, confirming a high rate of crossover events and providing the first estimates for the rate of non-crossover events and the length of conversion tracts. We observe several instances of meiotic recombination within copy number variants associated with drug resistance, demonstrating a mechanism whereby fitness costs associated with resistance mutations could be compensated and greater phenotypic plasticity could be acquired.


Whole genome sequencing
Sample preparation and sequencing was performed as described in (Manske, Miotto et al., 2012) except that PCR-free library preparation was used throughout (Kozarewa et al. 2009). All sequences were deposited in the European Nucleotide Archive and a mapping from sample identifiers to ENA accessions is given in Table S1 and in the web application at http://www.malariagen.net/apps/pf-crosses/.
Typically in high throughput sequencing studies of humans or other higher eukaryotes multiple sequencing runs will be obtained for each sample, then data from each run (lane) are combined to increase coverage. However in this study a single sequencing run was sufficient to obtain ~100X coverage of the P. falciparum genome, so only a single sequencing run was obtained for each sample. Samples that represented biological replicates (DNA derived from the same clone but obtained from different cultures) were treated separately, with separate DNA library preparation and sequencing runs. Thus in this study there is always a one-to-one mapping from sample (biological replicate) to sequence run.
For convenience we use a three-part identifier for each sample, e.g., "3D7/PG0051-C/ERR019061", where the first part identifies the clone (e.g., "3D7"), the second part is our internal lab identifier for the DNA sample (e.g., "PG0051-C"), and the third part is the accession for the sequencing run at the ENA (e.g., "ERR019061"). The second and third parts are redundant, because as mentioned above there is a one-to-one mapping from sample to sequencing run, however we include both for transparency. The data files available from the FTP site and the web application use the same identifier system for consistency.

Sequence alignment and genome region classification
Sequence reads from each sample were aligned to the 3D7 version 3 reference genome using BWA (Li and Durbin 2009) version 0.6.1-r104 with the following parameter settings: bwa aln -n 0.01 -k 4 bwa sampe We found that the custom parameters to the aln command served to slightly increase the sensitivity and improve consistency of the alignment in regions with clusters of SNPs, such as the polymorphisms found at the chloroquine resistance locus (Fidock et al. 2000), however the vast majority of alignments were identical under the custom and default settings (data not shown). Sequence alignments can be downloaded as BAM files from the FTP site at ftp://ngs.sanger.ac.uk/production/pf-crosses/1.0/ and browsed via the web application at http://www.malariagen.net/apps/pf-crosses/.
Various metrics were then calculated from the alignments of each sample. These metrics were computed per genome position based on the pileup of aligned reads, using the program pysamstats 1 . Metrics calculated include the total depth of coverage, percentage of reads aligned in a proper pair (i.e., in correct orientation and reasonable distance apart, as defined by the aligner), average mapping quality and percentage of reads aligned ambiguously (mapping quality zero).
Alignment metrics for each of the parental samples were then plotted for each chromosome, alongside other metrics derived from the reference genome sequence, including the %GC content in a 300bp window and the non-uniqueness score (defined as the smallest k-mer size at which all kmers overlapping a given position are unique within the genome (Manske, Miotto et al., 2012); a high score for this metric is bad, in the sense that it indicates low uniqueness). An example plot for sample HB3/PG0052-C/ERR019054 and chromosome 4 is shown in Figure S1. The alignments themselves were also visualised using the LookSeq web application (Manske & Kwiatkowski, 2009).
From these visualisations a clear, qualitative distinction could be seen between regions of the genome with consistent coverage across all parent samples, and regions with significant alignment issues in one or more parents. To capture these large-scale qualitative differences we defined the following heuristic scheme for classifying genome regions: • Core -Regions with near-continuous coverage in all samples, with a high percentage of reads mapping in a proper pair and a low proportion of reads aligned ambiguously.
• Subtelomeric Hypervariable -Gene-containing regions towards the sub-telomere of a chromosome, with patchy and/or highly variable coverage in one or more samples and/or a low proportion of reads aligned in a proper pair and/or a high percentage of ambiguous alignments and/or a high percentage of aligned bases mismatching the reference.
• Internal Hypervariable -Gene-containing regions towards the centromere of a chromosome, with patchy and/or highly variable coverage in one or more samples and/or a low proportion of reads aligned in a proper pair and/or a high percentage of ambiguous alignments and/or a high percentage of aligned bases mismatching the reference.
• Subtelomeric Repeat -Gene-free regions with repetitive sequence at the end of a chromosome, typically with highly variable coverage and a high percentage of ambiguous alignments.
• Centromere -Centromere as given in the GeneDB genome annotation.
Within each chromosome we defined boundaries for these regions by eye from the visualisations described above. Table S2 gives the region boundaries, Figure S2 shows a map of the genome regions defined, and Figure S3 gives a summary of alignment statistics for each parental clone by region class. At least 99.6% of core genome positions were covered in all parents, and at least 98.8% of the core genome was covered by unambiguously mapped reads.
Our definition of the core genome is subjective, and more sophisticated methods could be devised to partition the genome into regions with different alignment characteristics. However the contrast between these different regions of the genome is very striking, and we believe the definitions given here capture the major qualitative features in a useful way.
The genome region classification can be browsed alongside coverage, mapping quality and other metrics via the web application URL http://www.malariagen.net/apps/pf-crosses/. A BED file defining the region boundaries can be downloaded from the FTP site at ftp://ngs.sanger.ac.uk/production/pf-crosses/1.0/. Starting from the reads aligned to the 3D7 version 3 reference genome as described above, the following steps were performed to prepare the BAM files. Using Picard tools version 1.77 the commands CleanSam, FixMateInformation, AddOrReplaceReadGroups and MarkDuplicates were run on each BAM file in that order.
Base quality score recalibration (BQSR) was then applied to the BAM files. BQSR empirically recalibrates the base quality scores reported for each base in each sequence read, by observing the correlation between mismatches in the aligned sequence reads and various covariates, including the original base quality reported by the sequencing machine, in addition to other factors like the local sequence context. BQSR thus relies on the assumption that a substantial number of bases mismatching the reference in aligned sequence reads are due to sequencing error and not true variation, alignment error or some other type of artefact. From a visual inspection of the alignments for the parental clones (see, e.g., Figure S1) it was apparent that the mismatch rate within hypervariable regions was extremely high, and given the other alignment symptoms in hypervariable regions including patchy coverage and ambiguous mapping, we assumed the vast majority of these mismatches were due to divergence between clones and not sequencing error. To avoid hypervariable regions overwhelming BQSR we limited the building of the covariates table to the core genome. BQSR also requires a set of known variant positions to exclude when building the covariates table. To bootstrap BQSR we created an initial set of variant calls for each cross from the raw BAM files using UnifiedGenotyper, then filtered these calls to exclude any that had less than 2 confident (GQ = 99) ALT calls, contained Mendelian errors, had more than 2 missing calls or were part of a homopolymer run of length 5 or more.
We then applied INDEL realignment to the recalibrated BAMs. Each BAM file was realigned separately, but to improve the sensitivity of INDEL realignment we provided as input the set of bootstrap INDEL calls obtained from the previous BQSR step, which has the effect of sharing information about possible INDEL alleles between samples. All other settings were default.
We then generated a raw variant callset from the realigned BAMs using UnifiedGenotyper run under a haploid model (-ploidy 1).
The next step was to empirically recalibrate variant quality scores (VQSR). VQSR requires at least a positive training set of known true variants, and optionally one or more negative training sets of sites where variant calls are likely to be spurious. We defined a positive training set for each cross by selecting variants from the raw callset that segregated within the cross according to Mendelian inheritance (i.e., parents had different genotypes, progeny had no Mendelian errors) and also produced highly parsimonious patterns of inheritance (i.e., did not induce an unrealistically high rate of recombination). Specifically, the positive training sets included only SNP and INDEL variants within the core genome, with no missing calls, no non-Mendelian calls, and no calls inducing an apparent double-crossover at a single variant. We also created two negative training sets for each cross, the first containing variants with Mendelian errors, the second containing variants inducing single-variant double-crossovers in one or more samples.
To verify that the VQSR runs had been effective we plotted the rate of Mendelian error against the number of variants for different thresholds of the VQSLOD score (similar to an ROC curve) ( Figure S4). For all three crosses and for both SNPs and INDELs, we observed an inflection point in these curves, corresponding to a Mendelian error rate of approximately 0.05% or ~1 Mendelian error in 2000 genotype calls. Thresholds (minimum values) were chosen for the VQSLOD separately for SNPs and INDELs in each of the three crosses at the inflection point in the curve. For SNPs the thresholds were 3D7xHB3: 2.5, HB3xDd2: 3, 7G8xGB4: 4; for INDELs the thresholds were 3D7xHB3: 1, HB3xDd2: 1.5, 7G8xGB4: 1.8.
We generated a final, analysis-ready VCF for each cross by adding the following filter annotations: • LOW_CONFIDENCE -Variant confidence is low (VQSLOD falls below the chosen threshold).
• NON_MENDELIAN -Variant calls are not consistent with Mendelian segregation because one or more progeny have an allele not found in either parent.
• MISSING_PARENT -One or both parents have a missing genotype call.
• NON_SEGREGATING -Variant is fixed within the sample set (not necessarily a spurious variant but a useful filter annotation as most analyses shown here use only segregating variants).
• NON_CORE -Variant is not within the core genome.
• LOW_CONFIDENCE_PARENT -Genotype confidence for one or both parents is low (GQ < 99).
• CNV -There is evidence for copy number variation at this locus.
The CNV filter was applied based on evidence from depth of coverage data, described in the section on CNV analysis below.
For all downstream analyses we also treated genotype calls with a genotype quality (GQ) of less than 99 as missing, although this annotation is not included in the VCF files.

Figure S6
illustrates variant calls from the alignment-based method before and after filtering for a single cross and chromosome. Variant calls for all crosses and chromosomes can be browsed via the web application at http://www.malariagen.net/apps/pf-crosses/ both with and without filters. VCF files can be downloaded from the FTP site.
The rate of Mendelian error was plotted against the number of variants for different thresholds of the SITE_CONF score ( Figure S5). Based on these plots we used a target Mendelian error rate of ~0.05% to decide variant and call filtering strategies. For SNPs we chose a SITE_CONF threshold of 50 and for INDELs we chose a SITE_CONF threshold of 200. These thresholds were the same for all crosses.
We generated a final, analysis-ready VCF for each cross by adding the following filter annotations: • LOW_CONFIDENCE -Variant confidence is low (SITE_CONF falls below the chosen threshold).
• NON_MENDELIAN -Variant calls are not consistent with Mendelian segregation because one or more progeny have an allele not found in either parent.
• MISSING_PARENT -One or both parents have a missing genotype call.
• NON_SEGREGATING -Variant is fixed within the sample set (not necessarily a spurious variant but a useful filter annotation as most analyses shown here use only segregating variants).
• NON_CORE -Variant is not within the core genome.
• CNV -There is evidence for copy number variation at this locus.
Note that these are in addition to a number of filter annotations previously added as a standard part of the Cortex pipeline.
For all downstream analyses we also treated genotype calls with a GT_CONF of less than 50 as missing, although this annotation is not included in the VCF files.

Figure S7
illustrates variant calls from the assembly-based method before and after filtering for a single cross and chromosome. Variant calls for all crosses and chromosomes can be browsed via the web application at http://www.malariagen.net/apps/pf-crosses/ both with and without filters. VCF files can be downloaded from the FTP site.

1.3.3.Combined callset
A single callset of segregating variants was constructed for each cross by combining variant calls from the alignment and assembly-based methods as follows. For each calling method, a VCF was derived from the full analysis-ready VCF by selecting only variants that passed all filters and segregated within the cross. These two VCFs were then combined into a single VCF using the GATK CombineVariants task, taking genotype calls from the alignment-based calling method where both methods reported the same variant (because the alignment-based method had lower levels of missingness). This produced a single combined VCF of segregating variation for each cross. These VCFs were then post-processed to add a DUP_SITE filter annotation to any variant that coincided with another variant but reported different alleles.

1.3.4.Genotype concordance between biological replicates
In the 3D7xHB3 cross one replicate for clone C01 and 3 replicates for clone C02 were sequenced and genotyped independently. This provided 6 replicate pairs for analysis of genotype concordance. In the 7G8xGB4 cross a single replicate was obtained for each of 10 progeny clones, providing 10 replicate pairs. We computed genotype concordance for each replicate pair and for each of the three available callsets (alignment-based method, assembly-based method, combined) after filtering variants and genotype calls as described above. We computed concordance for each replicate pair as the number of sites where both samples had a matching genotype call divided by the number of sites where both samples had a non-missing genotype call. The results are given in Table S3.

1.3.5.Estimation of FDR and sensitivity
To estimate false discovery rate (FDR) and sensitivity, we compared the variant calls generated in this study with pre-existing sequence data resources for the clone HB3. We downloaded contigs from the HB3 draft genome assembly produced from shotgun sequencing by Birren et al. (2006). We also downloaded HB3 sequences for individual genes deposited in GenBank. We aligned both the HB3 contigs and the gene sequences to the 3D7 reference genome using bwa mem with the -x intractg option (parameters tuned for mapping contigs within a species). We limited further analyses to a set of 32 genes that were completely covered by a single uniquely mapped contig from the draft assembly and by a gene sequence (Table S4). In spite of these criteria there remained some considerable discordance between the draft assembly and the gene sequences, particularly regarding INDELs ( Figure S17). Given that both of these sources may themselves contain errors, we used the following methods to estimate FDR and sensitivity. To estimate FDR we compared variants discovered in this study with the union of variants found in the draft assembly and the gene sequences. Thus a true positive is a variant discovered in this study and also found in either of the other sources, and a false positive is a variant discovered in this study but not present in either of the other sources. To estimate sensitivity we compared variants discovered in this study with the intersection of variants found in the draft assembly and the gene sequences. Thus a false negative is a variant not discovered in this study but present in both of the other sources.
FDR and sensitivity were computed for the replicates HB3(1) and HB3(2) separately, and for each of the two variant calling methods. For INDELs these metrics were computed under two different matching schemes: "position match" where we require the position and type (insertion/deletion) of the variant to match but allow the allele to be different, and "allele match" where we require the position and allele to match perfectly. The results are reported in Table S5.
Note that for these comparisons we included all variant alleles called for an HB3 sample, regardless of whether they segregated within a cross (i.e., we ignored the NON_SEGREGATING filter annotations). This is particularly relevant for the HB3(2) sample which was genotyped as part of the HB3xDd2 cross and where many alternate alleles were shared with clone Dd2 and were fixed in all progeny.
We found that the assembly calling method had lower SNP and indel sensitivity for clone HB3(2) compared to HB3(1). This lower sensitivity for HB3(2) was partly due to a technical limitation of the assembly calling method, which was only capable of genotyping variants with a single nonreference allele. Clone HB3(1) was called together with all samples in the 3D7xHB3 cross, and because 3D7 is also the reference clone, only one non-reference haplotype was present. Clone HB3 (2) was called with all samples in the HB3xDd2 cross, thus two non-reference haplotypes were present, and some variants would be expected where these haplotypes carry different non-reference alleles. Indeed the alignment method, capable of genotyping any number of alleles, found only 379 INDELs with more than one non-reference allele in the 3D7xHB3 cross, compared with 3732 in HB3xDd2 and 3834 in 7G8xGB4. This limitation does not account for the lower SNP sensitivity, however, as the number of SNPs called by the alignment method with more than one non-reference allele was similar in all three crosses.
The estimated INDEL FDR of 8.3-12.5% seems at odds with the fact that inheritance of SNP and INDEL alleles was highly concordant in all three crosses, and INDEL genotypes were almost perfectly reproducible across multiple biological replicates. If we relaxed the FDR matching condition to require only that variants match type and position, estimated INDEL FDR for the alignment-based method was reduced to 5.3-6.2%. The mismatching alleles were always STR INDELs with the correct type (insertion/deletion) and repeat unit (e.g., "AT") but an incorrect allele length. This could indicate a tendency for the alignment-based method to systematically miscall STR allele length. Some of these mismatches could also be due to genetic variation between HB3 clones with different culturing histories. We also noted considerable discordance between the HB3 draft assembly and published gene sequences regarding INDELs. Although the draft assembly seemed generally more concordant with our variant calls at the 32 genes examined, this was not always the case, and we suspect both the draft assembly and published gene sequences contain INDEL errors. These findings highlight the need for multiple P. falciparum genomes fully assembled to the same quality as the current 3D7 reference, so that methods for calling all types of polymorphism and can be accurately evaluated.

Recombination analyses 1.4.1.Calling CO and NCO recombination events
Two types of recombination event are expected: crossover (CO) and non-crossover (NCO). A CO is a reciprocal exchange accompanied by a conversion tract, whereas a NCO is a conversion tract without reciprocal exchange (Youds and Boulton 2011). A conversion tract can either be simple (all alleles converted to the same parent) or complex (containing switches between parental alleles converted). In studies of yeast or other organisms where all four daughters of a single meiosis can be captured and genotyped, NCO events can be inferred directly from a non-Mendelian ratio of segregation of alleles. For P. falciparum crosses it is not possible to isolate all four daughters of a single meiosis, and thus unequal segregation cannot be directly observed. However, CO and NCO events can be inferred from the patterns of allelic inheritance in the progeny of each cross. Two or more CO events are unlikely to appear in the same progeny clone in close proximity, and thus two or more nearby switches in allelic inheritance are more likely to indicate a conversion tract.
To determine an appropriate threshold for differentiating CO from NCO events, for each cross we first identified contiguous blocks of markers within each progeny where alleles were all inherited from the same parent (parental haplotype blocks). Boundaries between such blocks thus indicate switches in parental inheritance. Each such block has a minimal size, given by the distance between the outer markers within the block, and a maximal size, given by the distance between the markers flanking the block. The distribution of minimal block sizes was plotted for each cross ( Figure  S10). The resulting distributions were bimodal for all three crosses with a minor peak of blocks around ~1kb extending upward to ~10kb. This minor peak would not be expected from CO events, and suggests an expected size range for NCO conversion tracts, although at this stage we have not accounted for complex conversion tracts (which will appear as multiple adjacent short blocks).
To determine a size limit below which to assume that blocks indicate conversion tracts, we computed the number of blocks of a given size that would be expected from CO events alone, using previously published estimates for the CO recombination rate, which should be reasonably accurate given that a high marker resolution is not required to ascertain CO events. Assuming a uniform CO recombination rate of 12 kb/cM (Ranford-Cartwright and Mwangi 2012) we would expect to observe less than 1% of CO events within 10kb of another CO (by the CDF of the exponential distribution). This model is overly simplistic but serves to provide an estimate for the frequency of small block sizes expected from double cross-over events which is conservative because CO interference is likely to reduce further the true probability of observing smaller blocks. We thus assumed that all blocks observed with minimal length shorted than 10kb were either whole or part of conversion tracts.
The following algorithm was then used to call conversion tracts and CO and NCO events from the size and arrangement of parental haplotype blocks. For each progeny clone, genotype calls were used to identify contiguous regions of the genome where all alleles were inherited from the same parent (inheritance blocks) by iterating through variants within a chromosome and recording switches in inheritance between adjacent variants. Any inheritance blocks with minimal length <10kb occurring in isolation were called as simple conversion tracts. Any blocks with minimal length <10kb occurring directly adjacent to each other were merged into a single complex conversion tract. To identify CO events, all genotype calls within conversion tracts were first masked, and remaining switches in parental inheritance were called as CO events. Conversion tracts occurring directly adjacent to a CO were then identified, and the remaining conversion tracts were called as NCO events. Putative conversion tracts supported by a single marker or with a minimal length less than 100bp were excluded from further analyses. This algorithm is similar in motivation to that used by Samarakoon et al., (2011) but does not depend on a windowed analysis and has greater flexibility for detecting tracts spanning windows.

1.4.2.Estimation of CO and NCO recombination parameters
To estimate genetic map length from the CO recombination rate the identity map functions was used, as marker density was high (~300bp) and thus no adjustment was required to account for the possibility of unobserved double crossovers.
To estimate conversion tract length, we fitted a geometric distribution as suggested by Hilliker et al., (1994). This distribution has a single parameter phi, which can be interpreted as the per-basepair probability of extending a conversion tract, once it has been initiated. To account for the effect of ascertaining conversion tracts via the available SNP and INDEL markers, we simulating 50000 random NCO conversion tracts for each of a range of values for phi, then passed the simulated tracts through the same NCO calling process as used for the real data to obtain an empirical distribution for the observed minimal tract length. For each simulation run we compared the resulting distribution of tract lengths with the actual data via a quantile-quantile plot (e.g., Figure  3E) and chose a value of phi with the smallest total residual. This process was repeated for each cross separately, and all three crosses gave a close agreement for the value of phi.
Once we had fitted the tract length distribution, we then used this information to estimate the NCO discovery rate. This was necessary because we filtered out putative conversion tracts supported by less than two markers spanning less than 100bp, and thus some true conversion tracts will have been filtered, and some conversion tracts may have fallen entirely between markers thus not been observed. We used the NCO simulation described above for phi = 0.9993 and for each cross calculated the fraction of simulated NCOs that would have actually been observed using the given markers and the NCO calling process. We then used these estimates for the NCO discovery rate to multiply the observed number of NCO rates per meiosis in each cross by an appropriate factor, to arrive at an estimate for the actual rate of NCO recombination events.
To estimate the CO recombination rate relative to centromere position we divided each chromosome into non-overlapping windows of 20000bp starting from the centromere moving both up and down stream. Within each window we calculated a map length from CO events using the identity map function, then averaging over all windows at each given distance from the centromere we plotted the mean and 95% confidence interval from 1000 bootstrap replicates ( Figure 3C).
To estimate the expected number of CO events occurring within genes, intergenic regions, exons and introns we performed Monte Carlo simulations under a null hypothesis of random recombination. In each run a number of CO events for each cross were simulated, then for each CO event the flanking markers were determined according to the markers available in each cross. The number of CO events simulated in each run was equal to the number of CO events observed for each cross. From each run we counted the number of CO events that would have been observed entirely within a gene, entirely within an intergenic region, spanning a gene boundary, etc. We obtained an empirical distribution for these counts from 10000 simulation runs. A similar process was followed for NCO events.

1.4.3.Recombination within CNVs
Read counts were computed for all samples in non-overlapping 300bp windows over the genome, counting the number of reads whose alignment starts within the window, using pysamstats. We observed some GC-related coverage bias in almost all sequencing runs, such that coverage was lower in windows where GC was below 20%. We tried correcting for this bias using the method of Abyzov et al. (2011) however the variance in coverage relative to the median was also higher where GC was below 20% (data not shown) and this correction led to many apparent increases in copy number in low-GC regions, so instead we simply excluded windows where GC content was below 20% from further plots and analyses. To normalise the read count data for each sample we divided the read counts in each window by the median across all windows within the core region of chromosome 14, after visually inspecting the read counts for chromosome 14 in all samples to verify that there was no evidence for copy number variation on that chromosome. These normalised read counts are plotted as black circles in Figure 3 and Figure S13. To view the evidence for the arrangement of amplified segments, depth of coverage was computed within regions of interest for reads aligned facing away from their mate pair ("face-away", indicating a tandem array) and for reads aligned in the same orientation as their mate pair (indicating a tandem inversion), using pysamstats. These data are plotted as purple and green lines in Figure 3 and Figure S13.
We also fitted a model of copy number state to the normalised read count data, to aid in visualising the copy number changes in plots. For each sample we fitted a Gaussian HMM using the scikitlearn package 2 . Parameters for the model were tuned to minimise prediction of copy number changes within the core genome for clone 3D7 (also the reference clone) and to minimise prediction of copy number changes that do not segregate in both parents and progeny of a cross, assuming that these two conditions are both indicative of false positive predictions of copy number change, and to maximise sensitivity for detecting known copy number amplifications at the mdr1 and gch1 loci. A complete worked example is given at http://nbviewer.ipython.org/github/alimanfoo/hmmcnv/blob/master/tutorial.ipynb.
We also used the fitted HMM model to scan the core genome for other loci with evidence of copy number change. Apart from the mdr1 and gch1 loci, the only other loci with evidence for copy number change spanning one or more genes were a deletion at the clag3 locus (Chung et al. 2007;Sepúlveda et al. 2013) present in GB4 and segregating in the 7G8xGB4 cross, and a translocated duplication adjacent to the subtelomere of chromosome 11 (Hinterberg et al. 1994) present in both HB3 replicates and segregating in the HB3xDd2 progeny.
To examine evidence for regions of pseudo-heterozygosity at the gch1 locus in progeny clones C05, C06 and CH3_61 we used the raw allele depths (AD) emitted by GATK at variant sites and plotted the ratio of allele depths for the first parent's allele (e.g., 3D7 for clones C05 and C06) divided by the total depth of coverage.

A Web application to facilitate data exploration and re-use
The sequence and variation data generated in this study are a rich resource and could serve many purposes beyond the analyses presented here. To facilitate re-use of these data we developed a Web application that provides a number of novel tools for intuitive, interactive data exploration, available at www.malariagen.net/apps/pf-crosses. The introduction page ( Figure S19A) provides navigation to a set of tools, including a tool for browsing and querying a table of variants for each cross and calling method ( Figure S19B); a tool for visualising and browsing the genotype calls at individual samples and patterns of inheritance and recombination within each cross ( Figure S19C); a tool for browsing the genome, allowing the location of variants to be viewed in the context of genome features and alignment metrics ( Figure S19D); and a browser for visualising the sequence alignments themselves, implemented by embedding the LookSeq software (Manske and Kwiatkowski 2009) (Fig S19E). All tools are highly interactive, for example when browsing genotypes the user can hover over any variant and view further information about the reference and alternate alleles, effect prediction, etc. Filters applied to variants can also be changed dynamically, allowing users to explore the entire dataset and compare calling and filtering methods. For the genome browser, a multi-resolution filterbank was implemented to enable highly responsive browsing at all scales from base-pair resolution up to entire chromosomes. The underlying technologies for this Web application are being developed as a generic framework so that they can be used with other organisms and datasets, as part of the open source Panoptes project 3 .
3 https://github.com/cggh/panoptes       Figure S7: Illustration of the assembly-based callset before and after variant filtration. The upper plot shows the raw variant calls, the lower plot shows the filtered variant calls. The main subplot in each shows each sample as a row and each variant as a column, painting genotype calls as described in Figure S7. Figure S8: Comparison of SNP and INDEL calls. The main subplot in each plot shows each sample as a row and each variant as a column, painting genotype calls as described in Figure S7. Lines from the inheritance subplot to the accessibility track indicate the physical position of variants, with one line drawn for every 10 variants.  Figure S10: Size distribution for haplotype blocks transmitted from parents to progeny in the three crosses. "Parent 1" means the first named parent in each cross, i.e., 3D7 for the cross 3D7 x HB3. Figure S11: Physical locations of CO and NCO recombination events observed. Events are shown for each of the 14 nuclear chromosomes. The lower track for each chromosome shows the genome region classification, with colours as in Figure S3. Figure S12: Long-range complex recombination events. Figure S13: CNV spanning MDR1 in the HB3xDd2 cross. A, evidence for CNV in the parent clones, as per Figure 4 in the main text. B, copy number prediction for all parents and progeny for chromosome 5. Figure S14: Copy number prediction for all parents and progeny of cross 3D7xHB3 for chromosome 12 (including the gch1 locus).

Index of tables
Figure S15: Copy number prediction for all parents and progeny of cross HB3xDd2 for chromosome 12 (including the gch1 locus).
Figure S16: Copy number prediction for all parents and progeny of cross 7G8xGB4 for chromosome 12 (including the gch1 locus). Figure S17: Variant concordance between previously published sequences for clone HB3. Each Venn diagram shows the number of variants present in the HB3 sequences relative to the 3D7 reference genome. Figure S18. INDEL polymorphism within tandem repeats by repeat unit length and tract length. Tandem repeat sequences were identified within the reference genome following (Montgomery et al. 2013). The fraction polymorphic is computed over all three crosses. Figure S19. Screenshots from the Web application at www.malariagen.net/apps/pf-crosses providing access to sequence and variation data on the three crosses. A, Introduction page, providing navigation to different tools for data exploration. B, Browse and query data on variants (SNPs and INDELs) discovered in the crosses by different calling methods. C, Browse genotype calls in parents and progeny and visualise patterns of allelic inheritance and recombination. D, Genome browser, providing multi-resolution views of various data tracks including coverage and mapping quality. E, Sequence alignment browser (LookSeq).