Alignathon: a competitive assessment of whole-genome alignment methods

Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark data sets for protein and, to a lesser extent, global nucleotide MSAs are available, but less effort has been made to establish benchmarks in the more general problem of whole-genome alignment (WGA). Using the same model as the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments and then assessments were performed collectively after all the submissions were received. Three data sets were used: two were simulated and based on primate and mammalian phylogenies, and one comprised 20 real fly genomes. In total, 35 submissions were assessed, submitted by 10 teams using 12 different alignment pipelines. We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable differences in the alignment quality of differently annotated regions and found that few tools aligned the duplications analyzed. We found that many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all data sets, submissions, and assessment programs for further study and provide, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.


List of Tables
S7 Mammal simulation F-score results stratified by annotation type.
S8 Primate simulation F-score results stratified by annotation type.
S9 Fly pseudo F-score regional results.
S10 Jaccard distance values for all submissions to the primate dataset.
S11 Jaccard distance values for all submissions to the mammal dataset.
S12 Jaccard distance values for all submissions to the fly dataset.
S13 Primate simulation precision results.
S14 Primate simulation recall results.
S15 Mammal simulation precision results.
S16 Mammal simulation recall results.
S17 Mammal simulation precision results stratified by annotation type.
S18 Mammal simulation recall results stratified by annotation type.
S19 Primate simulation precision results stratified by annotation type.
S20 Primate simulation recall results stratified by annotation type.

0.1 Details of Regional Alignments & PSAR

To convert each subregion into a 2D alignment matrix, we designed a pipeline that used the chosen reference genome interval to create a "reference alignment" as follows:

1. For each submission, for each region, we used mafExtractor to pull out MAF blocks that contained any positions in the reference that were within the target region. These blocks were then trimmed to contain only positions that aligned to the reference within the region of interest.
2. We used the mafTransitiveClosure tool to compute the transitive closure of the resulting alignment (see discussion), making each alignment a consistent multiple sequence alignment. This was particularly important for the GenomeMatch alignments, which were all pairwise. Unlike for the whole-genome alignments, taking the transitive closure of the aligned pairs just within the subregion generally did not perturb the performance substantially (data not shown).
3. Next, a row-deduplication step was performed using mafDuplicateFilter, so that each species contributed at most one row to the 2D alignment. In short, for every block in the alignment a consensus sequence is created, and then, for each species with multiple instances present in the block, a similarity score is calculated relative to the consensus. Only the sequence closest to the consensus in terms of substitution distance is kept; in the event of a tie, the sequence closest to the start of the block is used. All the other instances are discarded.
4. The previous step could result in the removal of the reference sequence that tied the block into the region of interest. As a result, a second step of reference-oriented, region-based extraction was performed to eliminate all blocks that align more strongly to an area outside the region of interest (i.e., a part of the reference outside the region of interest had a higher similarity score in the row-deduplication step).
5. The blocks were sorted based upon their left-to-right order along the positive strand of the region of interest in the reference. Then the rows of all blocks were reordered into a standardized, consistent species order (i.e., alphabetical), and finally all of the reference sequence instances were forced to be relative to the positive strand.
Rearrangements in the non-reference sequences were ignored by concatenating the fragments of each non-reference sequence together according to their ordering along the reference sequence, as is standard practice in constructing "reference" alignments.
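The consensus-based row deduplication in step 3 can be sketched as follows. This is a minimal illustration, not the mafDuplicateFilter source; the block layout (a list of (species, aligned-row) pairs in block order, gaps as '-') is an assumption for the example.

```python
from collections import Counter

def deduplicate_block(block):
    """Keep at most one row per species in an alignment block.

    `block` is a list of (species, aligned_row) pairs in block order.
    A consensus is built per column; the row closest to the consensus
    (by substitution distance) is kept, with ties going to the row
    appearing earliest in the block.
    """
    # Consensus: the most common character in each column of the block.
    columns = zip(*(row for _, row in block))
    consensus = [Counter(col).most_common(1)[0][0] for col in columns]

    def distance(row):
        # Substitution distance of a row to the consensus.
        return sum(a != b for a, b in zip(row, consensus))

    kept = {}  # species -> (distance, row)
    for species, row in block:
        d = distance(row)
        # Strict '<' keeps the earlier row on ties ("closest to the start").
        if species not in kept or d < kept[species][0]:
            kept[species] = (d, row)
    return {sp: row for sp, (d, row) in kept.items()}
```

For example, a block with two mouse rows keeps only the one nearer the consensus, while single-copy species pass through unchanged.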
To ascertain how these manipulations affected the alignments we calculated regional precision and recall values for the simulated subregions, as we did for the larger alignments. Looking at the mammals, we find reasonable linear correlations between regional and overall precision (r² = 0.594), recall (r² = 0.989), and F-score values (r² = 0.992) (Supplemental Figure S9), suggesting that the alignment manipulations did not bias the results too substantially. Furthermore, for most submissions we find a low variance in these results between different regions, suggesting that five regions were likely enough to get a good approximation of the overall results.
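For reference, the F-score combining these precision and recall values is their harmonic mean; a minimal helper (the pseudo F-score used for the fly analysis applies the same formula with coverage standing in for recall):

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: precision 0.9 and recall 0.6 give
# F = 2 * 0.9 * 0.6 / (0.9 + 0.6) = 0.72
```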
To run PSAR on the regional alignments we broke up each regional alignment into 2 kb windows and ran PSAR independently on each window. This windowing is likely to have somewhat affected the scores, though the effect was likely very minor, as the number of breaks between chunks is very small compared to the number of columns in the alignment. To make the scores comparable between different alignments, we looked only at pairs including the reference sequence, which is a shared constant, and thus ignored non-reference pairs.
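The windowing step amounts to a column-wise split of the 2D alignment; a sketch (illustrative only, assuming the alignment is held as a list of equal-length row strings; PSAR itself is then run on each window independently):

```python
def column_windows(alignment, window=2000):
    """Split a 2D alignment (list of equal-length row strings) into
    chunks of at most `window` columns each."""
    n_cols = len(alignment[0])
    return [[row[start:start + window] for row in alignment]
            for start in range(0, n_cols, window)]
```

A 5,000-column alignment thus yields two full 2,000-column windows plus one final 1,000-column window.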

Details of Submissions
Each group was asked to detail how they computed their submissions, the computational resources used and the runtimes involved. What follows are the responses of the participating groups.

AutoMZ and TBA
I used LASTZ to produce pairwise alignments. AutoMZ and TBA are slightly modified from the package available at http://www.bx.psu.edu/miller_lab, with no major algorithmic changes. AutoMZ is reference-dependent and progressively aligns species in the species tree using the reference sequence as the guide. TBA progressively aligns species in the tree without a specified reference.
The overall steps of the pipeline are: run LASTZ to get pairwise alignments, then use single_cov2 (in the MULTIZ/TBA software package) to post-process the pairwise alignments, and finally run AutoMZ or TBA to produce the multiple alignment.

          Primates      Mammals       Flies
AutoMZ    90 minutes    220 minutes   27 hours
TBA       125 minutes   204 minutes   7 days*

Table S0: AutoMZ and TBA submission runtime details. *The program got very low priority on the server, and other computation-intensive programs ran on the same server for an unknown amount of time.

Among all steps, producing the LASTZ alignments was the most non-trivial part of my pipeline. A few sequences (in the flies) were not repeat-masked well, and LASTZ failed on the computer cluster I used. I eventually got help from Bob Harris (at Penn State) to separately repeat-mask these sequences in order to run LASTZ. Once the LASTZ alignments were produced, the remaining steps simply involved invoking the automated commands from the package.
For pairwise alignments, I used the following resource to set the parameters for running LASTZ: http://genomewiki.ucsc.edu/index.php/Hg19_conservation_lastz_parameters. For species pairs not listed in the resource, I used default parameters.
For all multi-alignments, I used default parameters. For AutoMZ alignments, I used references suggested by the organizers [simHuman for both simulations, dm3 for flies].
The pairwise alignments were distributed across a computer cluster; the total amount of compute time was at least thousands of hours. The time for the multi-alignments was estimated from the timestamps of files on the server; the server may have had other programs running at the same time these alignments were computed. Table S0 shows the run times of the different programs on the three datasets.
Peak memory usage was not recorded. Most pairwise alignments were done on a computer cluster where each compute node had 1G of memory. For the pairs of species that exceeded this memory, I split the sequences into smaller fragments and combined the separate alignments later. The job requiring the most memory was producing the TBA alignment of the flies, which was done on a machine with 10G of memory, so the peak memory use of the pipeline must be within 10G.
Pairwise alignments were computed on a computer cluster of around 100 processors, each with 1G of memory. Multi-alignments were computed on a server configured with a maximum of 10G of memory.
Pairwise alignments used a batch system. Multi-alignments used a single machine.
One person spent approximately 100 hours on these submissions. Most of the time was spent obtaining the pairwise alignments: because some of the sequences were not repeat-masked well, the pipeline broke many times when running on the computer cluster, and it took time to track down and fix these failures. Once the pairwise alignments were computed, it did not take much effort to get the multiple alignment results.

Cactus
All three datasets were aligned using the default parameters of progressiveCactus. progressiveCactus differs from the original Cactus program (described in Paten, Earl, Nguyen, Diekhans, Zerbino & Haussler 2011) in that it uses a progressive alignment strategy to align the genomes along a guide tree, employing the Cactus alignment algorithm at each step. In each case the guide tree used was the one provided (shown in Figure 1). progressiveCactus can be found at: https://github.com/glennhickey/progressiveCactus/. The alignments were computed on a 64-core machine with 1 terabyte of RAM. CPU hours were recorded only for the simulated mammal dataset, totalling just over 500 hours.

EPO (Enredo Pecan Ortheus), mammals simulation
EPO stands for Enredo-Pecan-Ortheus. Enredo uses a set of anchors mapped onto the genomes. For a given set of species, we have to generate a specific set of anchors; note that the same set can be re-used for different runs, for instance if a new sequence is added. The anchors are extracted from conserved regions identified in sets of pairwise alignments; GERP is used to find the most conserved regions in long alignments. All anchors are mapped onto every genome. Hits are filtered and passed to Enredo. Enredo is a graph-based method that defines blocks of collinear sequences; these blocks can contain segmental duplications. Collinear segments are aligned with Pecan/Ortheus. Ortheus relies on the species tree when no duplication is detected in a given block; for the other blocks, it uses a recursive loop to find the best guide tree/alignment, starting with a random guide tree and using Semphy to infer a tree from the alignment.
Pecan was run as part of Ortheus; we use Ortheus to infer ancestral sequences. One person worked on this submission. Enredo's parameters have been optimised for real mammalian genomes; we have noticed that the coverage on the simulated genomes is much lower than on real genomes.

Pecan (Mercator-Pecan), primates simulation
Sequence similarities among coding genes are used to build an orthology map with Mercator. Collinear segments are aligned with Pecan.
Pecan was run with default parameters. Pecan used 14.2 CPU hours and 3:05 wall-clock hours. The Sanger default compute farm was used (see above for specs). We use eHive to manage the jobs; processes in the farm are controlled by LSF.
Two people worked on this submission. We wish to thank Carsten Kemena for providing the Mercator results.

Pecan, mammals simulation
Sequence similarities among coding genes are used to build an orthology map with Mercator. Collinear segments are aligned with Pecan.
Pecan was run with default parameters. Pecan used 49.5 CPU hours, 0:36 wall-clock hours, and 2640 Mb of memory (peak).
The Sanger default compute farm was used (see above for specs).
We use eHive to manage the jobs. Processes in the farm are controlled by LSF.
Two people worked on this submission. We wish to thank Carsten Kemena for providing the Mercator results.

GenomeMatch
The GenomeMatch pipeline was developed to find alignments between genome sequences as well as to construct synteny maps. At the initial stage it generates a list of all homology blocks with length > 35 and similarity > 70%. As preliminary local alignments, we consider consecutive chains of blocks that are closer than 20,000 bp, with gap lengths < 200,000 bp. At the next stage we identify alignments of smaller blocks within the intervals between blocks found at the initial stage and add them to the preliminary alignments, forming a set of final local alignments.
To build synteny maps we apply dynamic programming to the final local alignments to find the best set of non-overlapping alignments.
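The dynamic-programming selection of non-overlapping alignments can be sketched as classic weighted interval scheduling along one genome. This is an illustrative one-dimensional version, not the (unpublished) GenomeMatch implementation; intervals are assumed half-open on one genome with a score per alignment.

```python
import bisect

def best_nonoverlapping(alignments):
    """Maximum total score of a non-overlapping subset of local alignments.

    `alignments` is a list of (start, end, score) intervals on one genome.
    Classic weighted-interval-scheduling DP, O(n log n).
    """
    alignments = sorted(alignments, key=lambda a: a[1])  # sort by end
    ends = [a[1] for a in alignments]
    # best[i] = best total score using only the first i alignments
    best = [0] * (len(alignments) + 1)
    for i, (start, end, score) in enumerate(alignments, 1):
        # Number of alignments ending at or before this one's start,
        # i.e. the latest compatible predecessor.
        j = bisect.bisect_right(ends, start, 0, i - 1)
        best[i] = max(best[i - 1], best[j] + score)
    return best[-1]
```

For example, given alignments (0, 10, 5), (5, 15, 6), and (10, 20, 5), the DP picks the first and third (total 10) rather than the single higher-scoring middle one.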
We computed alignments applying default parameters that were initially optimized using simulated similar sequences.
The alignment of the 4th chromosome of mouse with the 1st chromosome takes 35 minutes on one core of a 2.3 GHz Intel processor. GenomeMatch runs alignments of different sequence pairs on different cores of a multiprocessor computer; each sequence pair requires ~0.3-3 GB of shared memory. For the analysis of the competition data we used a single shared-memory machine with 12 dual-core processors and 256 GB of memory.

Mugsy
Mugsy x86-64-v1r2.2 was used in single-threaded mode with entirely default parameters for both of the simulated datasets. Both datasets took approximately one day of compute time. Though parameter adjustment might have improved the sensitivity somewhat, the submitter is not convinced that MUMmer would have been able to adequately deal with the degree of divergence in the simulated mammal dataset. The default minimum length for maximal unique matches (MUMs) in the version used was 15, which appears to be quite low. It is the submitter's opinion that there is not much room to improve sensitivity without moving to approximate string matching.
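As a toy illustration of why a fixed minimum exact-match length caps sensitivity, here is a simplified k-mer-based unique-match finder. Mugsy/MUMmer use suffix structures, not this hash-based approach; the point is simply that a 15-character anchor must match exactly, so any window containing a substitution yields no anchor.

```python
from collections import Counter

def unique_matches(seq_a, seq_b, k=15):
    """Return (pos_a, pos_b) pairs for k-mers occurring exactly once in
    each sequence and shared between them -- a simplified stand-in for
    minimum-length maximal unique matches (MUMs)."""
    def unique_kmers(seq):
        # Count every k-mer, then keep only those seen exactly once.
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        return {seq[i:i + k]: i for i in range(len(seq) - k + 1)
                if counts[seq[i:i + k]] == 1}
    a, b = unique_kmers(seq_a), unique_kmers(seq_b)
    return sorted((a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
```

Under divergence, a substitution every ~15 bp leaves no shared 15-mers at all, which is the sensitivity ceiling the submitter describes.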

MULTIZ primates
The standard UCSC MULTIZ pipeline was used; the resulting alignment is reference-based. The first step is to build pairwise alignments using LASTZ, which are chained using axtChain and then further filtered using chainNet. The resulting chains are then given to MULTIZ, which generates the multiple alignment (Kent, Baertsch, Hinrichs, Miller & Haussler 2003; Blanchette, Kent, Riemer, Elnitski, Smit, Roskin, Baertsch, Rosenbloom, Clawson, Green et al. 2004). Default parameters were used for LASTZ. Minimum chain score: 3000 (slightly conservative). Chain linear gap costs: medium (default). Net used: standard net. Chaining took 1435 CPU hours, 8 wall-clock hours, and 4 GB peak memory; MULTIZ took 1 CPU hour, 1 wall-clock hour, and 8 GB peak memory. The UCSC swarm cluster was used, utilizing 512 cores (4 GB of memory per 2 cores) running the Parasol job batching system. One person worked on this submission for a total of five hours.

progressiveMauve
The progressiveMauve multiple genome alignment software calculates positional homology alignments among two or more DNA sequences (Darling, Mau & Perna 2010). The algorithm does not align paralogs to each other, but does align the positionally-conserved copy of repeat sequences in different genomes.
The progressiveMauve software is a single standalone program for 64-bit Linux that incorporates the following steps: (1) identify unique multi-matches among all input sequences, (2) construct local chains of matches called Locally Collinear Blocks, (3) multiple-align the Locally Collinear Blocks, and (4) filter out low-quality alignments with a homology HMM. Much more detail is available in the publication. Default parameters were used. The software version is the February 2nd, 2011 build (predating the Alignathon by about a year).
4.45 CPU hours, 4.46 wall-clock hours were used. Peak memory was 11.2GB.
1 CPU on a shared memory machine was used. One person spent three person-hours for alignment, four person-hours for writing an XMFA to MAF converter.
Starting from the primate sequences directory, the command used to generate the alignment was: progressiveMauve --output=primate *.fa. Timing statistics were collected with /usr/bin/time -v. Time to convert from XMFA to MAF was not included, but it is negligible (a few minutes at most).
Results for the mammal simulation were nearly complete when a power failure interrupted the program and all results were lost. The program had been running for several weeks and had used at least 200 GB of memory. It is fair to say this implementation does not scale reasonably to such datasets.

PSAR-Align
For each alignment block in a given MAF alignment, the Gblocks program was used, with its default options, to find highly conserved anchors of length ≥ 200 bp. Then, for each inter-anchor region, (i) PSAR was run to sample sub-optimal alignments, (ii) posterior probabilities of aligning two residues from two different sequences were computed, and (iii) revised alignments were built from the posterior probabilities.
Both simulated datasets were run on a compute cluster, utilizing 200 Intel(R) Xeon CPU 2.67GHz cores with 1GB memory for each job.

Robusta
We used the Mercator program to split the genomes into blocks. As input we used the provided GFF files for the simulated sets, and predictions from Geneid as well as data from UCSC for the Drosophila set. The blocks were then aligned using the Robusta alignment program (not published). Robusta is a meta-aligner, an extension of the M-Coffee package, that combines the output of several alternative aligners into one final model. Alignment (Drosophila): we used the Robusta program we developed to calculate different combinations of existing alignment methods (Pecan, Mavid, progressiveMauve, Lastz). The different methods were run pairwise on the different datasets and the resulting alignments combined using the T-Coffee consistency algorithm.
In order to identify the best combination of methods, we used available RNA-Seq data (odorant receptor, larval stage) collected on 6 of the 20 considered species (see below). Mapped read patterns were projected onto the sequences and their matches were used to define an objective function. This function was used to compute the score of every block containing enough mapped RNA-Seq data. For this analysis we used two alternative protocols for the selection of the alignments: 1) AverageBest: all blocks were produced with the same method, and the best-scoring combination of primary methods was selected; in our tests this was Pecan-Mavid-Lastz, which combines three of the considered packages. 2) PickBest: for each block we picked the method which gave the best RNA-Seq score; whenever not enough RNA-Seq data was available, we fell back to the AverageBest combination. The following alignment methods were considered: Robusta: Pecan-Mavid-Lastz, Pro-Coffee, Pecan, Robusta: Pecan-Lastz, Robusta: Pecan-Mavid-pMauve, Robusta: Lastz, Robusta: Pecan-Mavid, Robusta: Mavid-Lastz, Robusta: Pecan, Robusta: pMauve-Lastz, Robusta: Pecan-pMauve-Lastz, Robusta: Mavid, Robusta: Pecan-Mavid-pMauve-Lastz, Mavid, Robusta: Pecan-pMauve, Robusta: pMauve, Robusta: Mavid-pMauve, Robusta: Mavid-pMauve-Lastz. Simulated: we used the method which on average gave the best RNA-Seq value for the fly set (Robusta: Pecan-Mavid-Lastz).
Evaluation using RNA-Seq: we have RNA-Seq data for 6 of the 20 fly species (dp4, droAna3, droEre2, droWil1, droYak2, droVir3). We mapped the reads using the segemehl program with default settings. We limited the measurement of accuracy to those sites in the data where a strong difference in the number of mapped reads (either a drop or an increase) occurs. We then scored each pair of aligned nucleotides with +1 when both nucleotides show an increase/decrease, or with -1 when an increase is matched with a decrease. For each pair of sequences the values are summed to produce the final score of the alignment.
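The pairwise scoring just described can be sketched as follows. The per-column ±1/0 encoding of coverage changes is an assumption for illustration; the actual projection of mapped reads onto the alignment is not specified in detail.

```python
def rnaseq_pair_score(changes_a, changes_b):
    """Score one pair of aligned sequences from coverage-change signals.

    `changes_a[i]` / `changes_b[i]` are +1 where the mapped read count
    sharply increases at aligned column i, -1 where it sharply drops,
    and 0 elsewhere (assumed encoding). Agreeing changes score +1,
    conflicting changes score -1; columns without a change in either
    sequence are ignored.
    """
    score = 0
    for a, b in zip(changes_a, changes_b):
        if a and b:
            score += 1 if a == b else -1
    return score
```

Summing this score over all sequence pairs gives the alignment's final RNA-Seq score, as described above.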
CPU hours: Primate: 1d 8h 44m; Mammal: 3d 13h 12m; Fly (AverageBest): 52d 10h 25m; Fly (PickBest): 518d 3h 10m. Wall-clock time was not recorded. Peak memory was not recorded, but the maximum memory available at any time was 150 GB. Nine machines were used, each having between 4 and 12 CPUs and between 4 and 150 GB of memory.
The genomes were split and the resulting blocks were then run on single machines.
Five people worked on these submissions.

VISTA-LAGAN
We aligned three datasets: the simulated test set, the simulated primate set, and the simulated mammal set. The VISTA pipeline infrastructure (Frazer, Pachter, Poliakov, Rubin & Dubchak 2004; Dubchak, Poliakov, Kislyuk & Brudno 2009) was utilized for the construction of genome-wide multiple DNA alignments between all genome assemblies. VISTA uses an efficient combination of global and local alignment methods and consists of the following steps: obtaining a map of large blocks of conserved synteny between two species by applying the Shuffle-LAGAN glocal chaining algorithm (Brudno, Malde, Poliakov, Do, Couronne, Dubchak & Batzoglou 2003) to local alignments produced by translated BLAT (Kent 2002); using Supermap, the fully symmetric whole-genome extension to Shuffle-LAGAN; aligning with PROLAGAN, a variation of the original Multi-LAGAN program that allows for the alignment of two alignments (profiles); and predicting ancestral contigs using a maximum matching algorithm (Dubchak et al. 2009). The four stages (local hits, chaining, global alignment, and ancestral reconstruction) are repeated for every node in the phylogenetic tree.

Table S7: Mammal simulation F-score results stratified by annotation type. Values reported here include self-alignment pairs, that is to say pairs where a sequence is aligned to itself. Rows are sorted in descending order according to overall F-score. Columns are, left to right: the F-score of the submission for the entire genome, for just the genic regions, for just the neutral regions, and for just the repetitive regions.

Figure S1: Region 1 of simHuman with respect to simOrang of the regional analysis of the primate simulation data set. Region 1 is defined as bases 35,525,375-36,025,374 of simHuman chromosome A (horizontal axis). Rows are: the relative true coverage of any part of simOrang onto this region of the reference; the relative abundance of genes within the region; the relative abundance of repetitive sequence in the region; and the submissions in descending order of average F-score. Each submission row shows the F-score of the submission at a given location of the region in red. The vertical axis of each row is on the same scale, as labeled in the bottom row. The top submission (PSAR-Align) is shown in grey in the background. Note that all of the aligners perform well in this region and that the top ten are likely all within sampling noise of one another, though the GenomeMatch submissions seem to have a systematic difficulty.

Figure S2: Region 4 of simHuman with respect to simMouse of the regional analysis of the mammal simulation data set. Region 4 is defined as bases 69,805,407-70,305,406 of simHuman chromosome J (horizontal axis). Rows are: the relative true coverage of any part of simMouse onto this region of the reference; the relative abundance of genes within the region; the relative abundance of repetitive sequence in the region; and the submissions in descending order of average F-score. Each submission row shows the F-score of the submission at a given location of the region in black. The vertical axis of each row is on the same scale, as labeled in the bottom row. The top submission (Cactus) is shown in grey in the background. Note that most submissions managed to contain parts of the alignment within the gene regions, though half of the submissions had poor coverage in this region (they lack a black line through most of the plot). The lack of consistent signal from all of the GenomeMatch submissions may be explained by the fact that the submitters excluded repetitive regions from their alignment.

Figure S3: Pseudo F-score results stratified by phylogenetic distance for the real fly data set. For each subplot the vertical axis shows the pseudo F-score and the horizontal axis shows individual submissions. Horizontal black lines show the average pseudo F-score of the submission. Submissions are ordered from left to right (descending) by average pseudo F-score. Each submission is comprised of points connected by a line, where the points are in order of phylogenetic distance (relative to the reference, dm3) and vertical bars are ± one standard deviation. The phylogenetic distances of the pairs are, in ascending order: droSec1, droSim1, droYak2, droEre2, droTak, droRho, droEle, droEug, droFic, droKik, droBip, droAna3, droPer1, droMoj3, droVir3, droWil1, droGri2. For each phylogenetic pair point in the figure, the recall value is taken to be the average within-region pairwise coverage.

Figure S4: Correlation between PSAR-based pseudo F-scores and F-scores for identical bins with respect to the reference (simHuman for both the simulated primate and simulated mammal data sets), using all pairs of species that include the reference, derived from all submissions. Transparency (α = 0.05) is used to represent the data points (n = 174,110) such that highly dense areas are darker than others. The strong cloud of data points to the upper right comprises bins and pairs derived almost entirely from the simulated primate data set. The overall trend is roughly linear, with a coefficient of determination r² = 0.671.

Table S9: Fly pseudo F-score regional results. Rows are ordered by descending pseudo F-score region mean. Columns are: pseudo F-score region mean; pseudo F-score region standard deviation. Values are the means and standard deviations of the submission within the five regions, with the pseudo F-score calculated using the overall average coverage for all pairs of species in the submission as the recall value.

Figure S7: Pseudo F-score along region 4 of simHuman with respect to simMouse of the regional analysis of the mammal simulation data set. Region 4 is defined as bases 69,805,407-70,305,406 of simHuman chromosome J (horizontal axis). Rows are: the relative true coverage of any part of simMouse onto this region of the reference; the relative abundance of genes within the region; the relative abundance of repetitive sequence in the region; and the submissions in descending order of average pseudo F-score. Each submission row shows the pseudo F-score of the submission at a given location of the region in black. The vertical axis of each row is on the same scale, as labeled in the bottom row. The top submission (AutoMZ) is shown in grey in the background. Note that most submissions managed to contain parts of the alignment within the gene regions, though half of the submissions had poor coverage in this region (they lack a black line through most of the plot).

Figure S8: The change in F-score between the original submissions (exactly as provided by participants) and the transitive closure of those same alignments for the simulated primate (A) and simulated mammal (B) data sets. A negative value indicates that following transitive closure the alignment had a lower F-score.

Figure S11: Correlation plots of comparator values between the regional and whole genome analyses.

Values reported here include self-alignment pairs, that is to say pairs where a sequence is aligned to itself. Rows are sorted in descending order according to overall recall. Columns are, left to right: the recall of the submission for the entire genome, for just the genic regions, for just the neutral regions, and for just the repetitive regions.