Efficient taxa identification using a pangenome index

Tools that classify sequencing reads against a database of reference sequences require efficient index data-structures. The r-index is a compressed full-text index that answers substring presence/absence, count, and locate queries in space proportional to the amount of distinct sequence in the database: O(r) space, where r is the number of Burrows–Wheeler runs. To date, the r-index has lacked the ability to quickly classify matches according to which reference sequences (or sequence groupings, i.e., taxa) a match overlaps. We present new algorithms and methods for solving this problem. Specifically, given a collection D of d documents, D = {T1, T2, …, Td} over an alphabet of size σ, we extend the r-index with O(rd) additional words to support document listing queries for a pattern S[1..m] that occurs in ndoc documents in D in O(m log log_w(σ + n/r) + ndoc) time and O(rd) space, where w is the machine word size. Applied in a bacterial mock community experiment, our method is up to three times faster than a comparable method that uses the standard r-index locate queries. We show that our method classifies both simulated and real nanopore reads at the strain level with higher accuracy compared with other approaches. Finally, we present strategies for compacting this structure in applications in which read lengths or match lengths can be bounded.

A natural approach is to use a standard backward search in the r-index and then use locate queries to find the offsets in the concatenated text at which the pattern occurs. These offsets can then be cross-referenced with another structure to determine which documents they occur in. This requires an amount of work proportional to the number of occurrences occ, which is expensive, particularly for repetitive matches against a pangenome.
We hypothesized that extending the r-index to multiclass classification could be accomplished by augmenting it with efficient facilities for document listing, namely, the ability to report all the reference sequences (documents) in which a particular pattern occurs. A document, which we will sometimes call a "class," could consist of a single genome or a collection of related genomes.
An early study by Muthukrishnan (2002) described a specialized index for document listing consisting of a generalized suffix tree and a document array. It provided O(m + ndoc) queries, where m is the length of the pattern, and ndoc is the number of distinct documents in which it occurs. But this came at the cost of O(n log n) bits of space, where n is the total length of the texts, which is impractical for large pangenome databases. Sadakane (2007) improved on this by introducing a new succinct document array representation and building on succinct representations of suffix trees and arrays. He showed how to reduce the index size to |CSA| + O(n) bits, where |CSA| is the size of the compressed suffix array using statistical compression, with an increased time complexity of O(m + ndoc · log n), a high cost for repetitive text collections (Cobas and Navarro 2019). Later efforts further reduced the required space using grammar compression (Cobas and Navarro 2019) and relative Lempel-Ziv compression (Puglisi and Zhukova 2021).
We present a new method that solves the document listing problem in O(m log log_w σ + ndoc) time and O(rd) space using the r-index. Importantly, we also show how to use the prefix-free parsing process to build the profile simultaneously with the BWT. This document array structure (an example is shown in Table 1) can be sampled and stored at the run boundaries of the BWT, yielding a space complexity of O(rd). At query time, after performing a backward search for a pattern, we can report the document listing by simply examining the current document array profile (exemplified in Table 2), which is an array of d integers, as opposed to performing a query for each occurrence of a pattern. We also discuss practical optimizations that can be used to reduce the space usage of this data-structure even further in the context of metagenomic read classification. In our evaluations, we compare the query time and index size for our approach to an alternative that uses the standard r-index locate query to report document listings. Furthermore, we attempt to classify different strains of Escherichia coli and Salmonella enterica using our document array profiles in comparison to using SPUMONI 2's sampled document array (Ahmed et al. 2023). Finally, we believe that our theoretical guarantees will prove useful for the community by allowing read classification to be compared in a grounded manner that complements practical evaluation.

Results
We performed all the experiments on an Intel Xeon Gold 6248R 32-core processor running at 3.00 GHz with 1.59 TB of RAM, running 64-bit Linux. Time was measured using the std::chrono::system_clock from the C++ standard library. Our source and experimental codes can be found at GitHub (see Software availability). The r-index code used in our experiments can be found at GitHub (https://github.com/maxrossi91/r-index).

Comparing the query time and index size
To assess the speed of document listing, we compared the query time for the document array profiles to the query time for locate queries using the r-index. We attempted to compare our solution to the method of Cobas and Navarro (2019); however, we ran into various run-time errors when using it as described, so we were not able to include it in the results.
We built a series of indexes over genomes from different collections of bacterial species, described in Table 3.We simulated nanopore sequencing reads using PBSIM2 (Ono et al. 2021) at 95% read accuracy.We then used MONI (Rossi et al. 2022) to query each read against the pangenome index, extracting a total of 1 million maximal exact matches (MEMs) for each class.
We tested two variants of the document array profile data-structure. The first (labeled "Doc. Array" in Fig. 1) uses the standard document array profile, in which each profile entry requires ⌈log₂(|S|)⌉ bits. The second (labeled "Doc. Array (optimized)") instead stores truncated lcp values: lcps greater than 255 are stored as 255, so that only ⌈log₂(255)⌉ = 8 bits are required per entry. This optimization is appropriate in real-world situations in which either the reads are known to be short (e.g., Illumina sequencing reads) or we would otherwise expect MEMs longer than 255 to be rare.
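The truncation can be illustrated with a minimal sketch (illustrative code, not the tool's actual implementation; function names are our own):

```python
def truncate_profile(profile, cap=255):
    # Clamp each lcp value at `cap` so every entry fits in 8 bits.
    return bytes(min(v, cap) for v in profile)

def matching_documents(profile, match_len):
    # A match of length `match_len` occurs in document j iff profile[j] >= match_len.
    return [j for j, v in enumerate(profile) if v >= match_len]
```

Any query for a match of length at most 255 returns the same answer on the truncated profile as on the full one, which is why the optimization targets short reads and short matches.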
We observed that the query time using document array profiles was faster than the r-index locate query. For the three-class database, the document array profiles ranged from 1.6 to 3.2 times faster. As more genomes were added to the database, the query time for the three-class r-index increased 2.2-fold (214.87 sec vs. 96.2 sec), whereas the query time for the document array profile was essentially constant (67.8 sec vs. 61.7 sec). This shows a key advantage of our document listing approach: unlike r-index locate queries, our query time is independent of the number of pattern occurrences.
We noted that the size of the r-index stayed relatively constant as the number of classes increased. However, for the document array profile (both standard and optimized), the index size grew with the number of classes, consistent with its O(rd) space complexity. As an example, in the "30+" genome database, focusing on the standard document array, the eight-class document array was 2.32 times larger than the five-class document array. Because d increased by 1.67 times and r increased by 1.43 times (79,722,710 vs. 55,559,459), we would expect the index to grow by about 2.39 times (1.67 × 1.43), which is close to what we see in practice (2.32).
We also observed that for the three-class, "30+" genome database, the optimized document array was smaller than the r-index. The r-index stores a run-length encoded BWT (RLEBWT) along with the suffix array sampled at run boundaries in the BWT, where each sample is stored in 5 bytes. The optimized document array also stores a RLEBWT; however, instead of the suffix array, it stores the document array profiles. Because it is a three-class database, each profile sampled at the run boundaries consists of only 3 bytes, which explains why, overall, the optimized document array is smaller than the r-index under those conditions.

Figure 1. Query time and index size when performing document listing queries using the document array profiles and the r-index. We varied the size of the database, increasing from 30 bacterial genomes to 300 bacterial genomes. For each species/class, we simulated nanopore reads at 95% accuracy and extracted 1 million maximal exact matches (MEMs) to query the data-structures. Therefore, for the three-class, five-class, and eight-class indexes, we queried them with 3 million, 5 million, and 8 million MEMs, respectively. This explains why the query time increases for the indexes containing more classes.
Additionally, as expected, the optimized document array profile was smaller than the standard profile; for the 300-genome database, it was 3.3 times smaller.We suggest further optimizations to reduce the document array profile size in the Discussion below.

Species and strain-level classification
We hypothesized that the document array profiles could particularly improve read classification accuracy in difficult scenarios in which it is important to be able to list all documents for each MEM. We compared the performance of the document array profile to another structure designed for read classification: the sampled document array of SPUMONI 2 (v2.0.0) (Ahmed et al. 2023). SPUMONI 2's sampled document array is quite simple: for each BWT run boundary, it converts the suffix array position to the number of the document in which that position occurs. Using these document labels, it can report one document in which a particular exact match occurs. This is sufficient in situations in which reads contain many distinct matches (e.g., MEMs), so that document information can be pooled across the various matches to come to an overall conclusion. But in situations in which the documents are very similar to each other or in which reads are short or have a high error rate, we expect the full document array profile to impart higher accuracy.
We tested the two structures on increasingly difficult data sets, with each data set consisting of reference genomes from four distinct classes (Fig. 2). We used PBSIM2 (Ono et al. 2021) to simulate 50,000 nanopore reads from each class at 95% accuracy and then classified the reads using both document array approaches. Specifically, we identified all MEMs between the reads and the pangenome index, filtering to just MEMs of length 15 or longer. We then used the different document structures to obtain matching documents for each MEM: In the case of SPUMONI 2, we retrieved one document per MEM; in the case of our document array profile, we retrieved all documents in which the MEM occurred. We then weighted the documents according to the length of the MEM and assigned each read to the document that received the largest total weight across all reported MEM/document combinations.
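The MEM-weighted voting scheme just described can be sketched as follows (illustrative code, not the actual implementation of either tool; `mems` pairs each MEM length with the document listing reported for it):

```python
from collections import defaultdict

def classify_read(mems, min_len=15):
    """Assign a read to the document with the largest total MEM-length weight.

    `mems` is a list of (length, documents) pairs, where `documents` is the
    document listing reported for that MEM: all documents under the profile
    approach, a single document under a sampled document array.
    """
    weights = defaultdict(int)
    for length, docs in mems:
        if length < min_len:
            continue  # MEMs this short are treated as too nonspecific
        for d in docs:
            weights[d] += length
    if not weights:
        return None  # no informative MEM; leave the read unclassified
    return max(weights, key=weights.get)
```

Because the profile approach can report every document per MEM, ties between near-identical strains are broken by the full evidence rather than by an arbitrary single document per match.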
We observed that when the data set consisted of classes with low between-class sequence similarity ("different genera" and "same genus"), both methods performed well, with low classification errors (Fig. 3). However, for data sets with high sequence similarity (>97.5% ANI), such as "E. coli strains" and "S. enterica strains," the full document array profile provided greater classification accuracy compared with SPUMONI 2's one-document-per-match approach.

Classification using real nanopore mock community reads
We extended our analysis to real sequencing reads. We used nanopore reads from the UNCALLED (Kovaka et al. 2021) paper, which performed Oxford Nanopore sequencing of a Zymo mock community consisting of eight species (Staphylococcus aureus, S. enterica, E. coli, Pseudomonas aeruginosa, Listeria monocytogenes, Enterococcus faecalis, Bacillus subtilis, and Saccharomyces cerevisiae). We extracted a set of 582,042 reads from the data set that uniquely mapped to one of the seven bacterial species using minimap2 (Li 2018). We shortened each read to 2000 bp.
For each bacterial species, we constructed a database comprising four strains from that species, one of which was the actual strain used for the Zymo mock community. The other three strains were obtained from RefSeq. We then compared the strain-level classification accuracy of the two document array structures using the same MEM-weighted approach as in the previous experiment. As before, we observed that the document array profile enabled more accurate strain-level read classification (Fig. 4). This was true for reads derived from all seven of the bacterial species (though both approaches had near-perfect recall for B. subtilis reads).

Discussion
We described a new data-structure called the document array profile, along with an efficient algorithm for building the structure simultaneously with a pangenome r-index. This structure enables tools to find exact matches with respect to a full-text pangenome index while simultaneously learning which reference sequences the matches belong to. This opens the door to new applications of pangenome indexes, including in metagenomic read classification.
The structure requires O(rd) space and can compute a full document listing for a match in O(m log log_w σ + ndoc) time. We showed that, as the pangenome database grows in size, the document array profile's speed advantage grows relative to the standard r-index and its locate queries. Further, we showed that the structure's ability to list all documents associated with a match enables greater accuracy compared with an existing alternative that considers only one document per match.

Figure 2. Summary of reference data sets of increasing difficulty for read classification. (A) Sequence homology, measured as average nucleotide identity (ANI) for all across-class pairs of sequences. ANI was estimated with fastANI (Jain et al. 2018). (B) List of the specific species and strains used for classes 1, 2, 3, and 4 for each of the four data sets. In the case of "different genera" and "same genus," we used 10 genomes per class. In the case of "E. coli strains" and "S. enterica strains," we used a single genome for each strain.

DocProfiles | Genome Research 1071 | www.genome.org
The main weakness of the document array profile is that its space usage grows linearly with the number of documents d. This makes it difficult to use in scenarios with a large number of documents (classes), as in taxonomic read classification, in which there are thousands of species. However, the data-structure can be optimized further to reduce its space usage with domain-specific knowledge. For example, in sequencing read classification, an exact match shorter than 15 bases might be too nonspecific to be helpful for classification. In that case, each element of the document array profile could be made "sparse," consisting only of values greater than 14.
An additional optimization would be to adopt a "top k" strategy.That is, rather than store lcp values to all possible documents, we can restrict the structure to store only the lcp values to the k documents having the greatest lcp at the run boundary.This allows us to bound the size of the structure while retaining the strongest match-to-document associations.
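The sparse and top-k strategies combine naturally; a minimal sketch (a hypothetical helper with illustrative thresholds, not part of the released tool):

```python
def sparsify_profile(profile, k=2, min_len=15):
    # Drop lcp values below min_len (too nonspecific to help classification),
    # then keep only the k (document, lcp) pairs with the largest lcp values.
    pairs = sorted(((lcp, j) for j, lcp in enumerate(profile) if lcp >= min_len),
                   reverse=True)
    return [(j, lcp) for lcp, j in pairs[:k]]
```

The output is a short list of (document, lcp) pairs whose size is bounded by k regardless of d, at the cost of losing the weaker match-to-document associations.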
Recently, Cobas et al. (2020) designed solutions to the document listing with frequencies problem using the r-index as the text index. This is a more difficult task because it requires reporting not only the document listing but also the frequency of the pattern in each document. The frequency information could add valuable data for taxonomic classification because it indicates whether a pattern is common within a document or rare. Future work on the document array profiles will explore the possibility of integrating elements of the solution of Cobas et al. (2020) to allow the document array profiles to report frequencies along with the document listing.

Methods

Preliminaries
Let S = S[1..n] be a string of length n drawn from an alphabet Σ of size σ. We denote by ε the empty string, the only string of length 0. We assume S is terminated by a special symbol $ ∉ Σ lexicographically smaller than all symbols in Σ. Given two integers 1 ≤ i, j ≤ n, we denote with S[i..j] = S[i] ⋯ S[j] the substring of S spanning positions i through j if i ≤ j, and S[i..j] = ε otherwise. Given two integers 1 ≤ i, j ≤ n, we refer to S[i..n] as the ith suffix of S and to S[1..j] as the jth prefix of S. Given two strings S[1..n] and T[1..m], we denote with lcp(S, T) the length of the longest common prefix of S and T.

Suffix array, inverse suffix array, and longest common prefix array
Given a string S, the suffix array (Manber and Myers 1993) SA_S[1..n] is the permutation of {1, …, n} that lexicographically sorts the suffixes of S. The inverse suffix array ISA_S[1..n] is the inverse permutation of SA_S[1..n]; that is, for all i = 1, …, n, SA_S[ISA_S[i]] = i. The longest common prefix array LCP_S[1..n] stores the length of the longest common prefix between lexicographically consecutive suffixes of S; formally, LCP_S[1] = 0, and for all i = 2, …, n, LCP_S[i] = lcp(S[SA_S[i − 1]..n], S[SA_S[i]..n]).
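These definitions can be made concrete with a small (quadratic-time, illustration-only) implementation; note that the paper's arrays are 1-indexed, whereas this sketch is 0-indexed:

```python
def suffix_array(s):
    # O(n^2 log n) construction; fine for illustrating the definition.
    return sorted(range(len(s)), key=lambda i: s[i:])

def inverse_suffix_array(sa):
    # ISA is the inverse permutation: sa[isa[i]] == i for every position i.
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa

def lcp_array(s, sa):
    # LCP[0] = 0; LCP[i] = lcp of the suffixes at sa[i-1] and sa[i].
    def lcp(a, b):
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        return n
    return [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]
```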

Burrows-Wheeler transform
Given a string S terminated by $, the Burrows-Wheeler transform (Burrows and Wheeler 1994) BWT_S[1..n] is the reversible permutation of S defined as the last column of the matrix of the lexicographically sorted rotations of S; equivalently, for all i = 1, …, n, BWT_S[i] = S[SA_S[i] − 1] if SA_S[i] > 1, and BWT_S[i] = $ otherwise. The LF-mapping is the permutation of {1, …, n} that maps every character of BWT_S to its predecessor in text order; formally, LF(i) = ISA_S[SA_S[i] − 1] if SA_S[i] > 1, and LF(i) = ISA_S[n] otherwise. We define r as the number of maximal equal-letter runs of BWT_S. When clear from the context, we refer to SA_S, ISA_S, LCP_S, and BWT_S as SA, ISA, LCP, and BWT, respectively.
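A minimal 0-indexed sketch of these definitions (illustrative only; the terminator wrap-around replaces the 1-indexed special case above):

```python
def bwt(s):
    # s must end with a unique terminator '$' smaller than all other symbols.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # s[sa[i] - 1] with Python's negative indexing handles sa[i] == 0.
    return "".join(s[i - 1] for i in sa), sa

def lf_mapping(s, sa):
    # LF sends BWT position i to the rank of the suffix starting one
    # character earlier in the text (wrapping at the terminator).
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return [isa[(sa[i] - 1) % len(s)] for i in range(len(s))]
```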

r-Index
The r-index (Gagie et al. 2020) is a text index that stores the run-length encoded BWT and the SA entries sampled at run boundaries. Given a text S[1..n] and a pattern P[1..m], the r-index allows one to find all occurrences of P in S in O(m log log_w(σ + n/r) + occ · log log_w(n/r)) time and O(r) words of space, where occ is the number of occurrences of P in S. This result was later improved to O(m log log_w σ + occ) time (Nishimoto et al. 2022).

Supporting document listing on the r-index
Given the text T that is the concatenation of the documents of D such that T has length n, let BWT be the BWT of T, which has r equal-letter runs. The document array profile P_DA[i] stores, for each document j = 1, …, d, the length of the longest prefix of the suffix T[SA[i]..n] that occurs in document T_j; we call P_DA[i] the entry of the profile of the document array in position i. To report which documents contain a substring S of length ℓ with an occurrence at position SA[i], for all documents j = 1, …, d, we check whether ℓ is less than or equal to the value stored in the profile for the jth document. This requires one comparison per document, or O(d) time.
Example 1. In the example in Table 1, if we look at P_DA[4] = [1, 2, 4], corresponding to the suffix AAC#, we have that (1) for the pair (21, 1) the substring S = A occurs in documents 1, 2, and 3, because no value of P_DA[4] is smaller than 1; and (2) for the pair (21, 2) the substring S = AA occurs in documents 2 and 3, because 2 ≤ P_DA[4][2] and 2 ≤ P_DA[4][3], whereas 2 > P_DA[4][1].

If we store each entry of the profile of the document array P_DA[i] as a list of sorted pairs (P_DA[i][j], j), the query time can be reduced to O(ndoc) by simply scanning the list of pairs from the document with the largest profile value down to the first document whose profile value is smaller than ℓ.
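The sorted-pair scan can be sketched as follows (illustrative code; for P_DA[4] = [1, 2, 4] from Table 1, the sorted pairs are (4, 3), (2, 2), (1, 1)):

```python
def list_documents(sorted_pairs, match_len):
    """Report the documents containing a match of length `match_len`.

    `sorted_pairs` holds (lcp, doc) pairs in decreasing lcp order; the scan
    stops at the first pair whose lcp drops below match_len, so the work is
    O(ndoc) rather than O(d).
    """
    docs = []
    for lcp, doc in sorted_pairs:
        if lcp < match_len:
            break
        docs.append(doc)
    return docs
```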

Sampling the profile of the document array
Storing the entire profile of the document array requires O(nd) words of space, which would be excessive for pangenomes. We seek to compress the profile of the document array by sampling it, similarly to how the r-index samples the suffix array.
Let BWT[s..e] be a maximal equal-letter run of the BWT of T. We store in positions s and e the entries of the profile of the document array in positions LF(s) and LF(e), respectively. Applying the same reasoning as the toehold lemma (Policriti and Prezza 2016), we can show that this is enough to recover the document listing for a query pattern S.
The first property of the profile of the document array that we show is an upper bound on the values of the profile when performing an LF step.

Lemma 2. For all positions i = 1, …, n and for all documents j = 1, …, d, P_DA[LF(i)][j] ≤ P_DA[i][j] + 1.

Proof (sketch). Let k be a position with DA[k] = j whose suffix attains the longest common prefix defining P_DA[i][j]. If we consider the character preceding SA[i], that is, BWT[i], then by maximality of k, prepending this character can extend the longest common prefix by at most one, concluding the proof.

Table 1. An example of document array profiles for three documents. Example of document array profiles (P_DA) for the text T, which is the concatenation of the following three documents: {ATATGGC$, GTAGAAT$, TATGAAC#}. The bold text in the BWM_T column represents the suffix of the text in each row of the Burrows-Wheeler matrix (BWM).

We now show which elements of the profile of the document array can be extended when performing an LF mapping from a position in a maximal equal-letter run. Those are all the profiles corresponding to occurrences that are all preceded by the same character; that is, the corresponding interval in the suffix array is contained in the maximal equal-letter run. We first recall (Definition 1) that given a maximal equal-letter run BWT_T[s..e], the length ℓ of the smallest substring S of T such that all occurrences of S in T are in SA_T[s..e] is ℓ = max(LCP_T[s], LCP_T[e + 1]) + 1.

Example 3. In the example in Table 1, if we consider i = 4, we have that the maximal equal-letter run containing i is BWT[4..5]; hence, the smallest substring S of T such that all occurrences of S in T are in SA_T[4..5] is AA, and its length is given by ℓ = max(LCP[4], LCP[6]) + 1 = max(0, 1) + 1 = 2. Looking at P_DA[4] = [1, 2, 4] and P_DA[LF(4)] = P_DA[14] = [1, 3, 5], the entries for documents 2 and 3, whose values are at least ℓ = 2, each grow by exactly one, as stated by Lemma 3.

Note that the only case in which DA[i] ≠ DA[LF(i)] is when the BWT run is a run of $ symbols. Hence, the above lemma can be applied generally when performing pattern matching queries. We can summarize our solution to Problem 1 in Theorem 1, whose query argument follows. Let S[1..m] be a pattern for which we want to compute the list of documents of D in which S occurs. After we have processed S[q..m], we have an interval BWT[s_q..e_q] containing all the occurrences of S[q..m] in T, as well as a profile P′ such that for all documents j, P′[j] ≥ (m − q + 1) if S[q..m] occurs in T_j, and P′[j] < (m − q + 1) otherwise. Note that the profile is not required to be a document array profile entry for a given position.
If q > 1, we now want to extend the match of S[q..m] to S[q − 1..m] and show how we can maintain the invariant of the profile P′. We consider two cases. The first case is if BWT[s_q..e_q] contains either the beginning or the end of a run of the character S[q − 1]. Here, we can update the interval BWT[s_{q−1}..e_{q−1}] with the standard backward search and can select as P′ the sample of the profile of the document array stored at the run boundary in BWT[s_q..e_q]. The invariant of P′ is preserved by Lemma 1. The second case is when BWT[s_q..e_q] is completely contained in a run, namely, BWT[s_q − 1] = BWT[s_q] = … = BWT[e_q] = BWT[e_q + 1]: Then all occurrences of S[q..m] are preceded by the same character; hence, by Lemma 3, for all j such that S[q..m] occurs in T_j, the profile of the document array P after the backward step satisfies P[j] = P′[j] + 1 ≥ (m − q + 2). Furthermore, for all j such that S[q..m] does not occur in T_j, we have that P′[j] < (m − q + 1); hence, by Lemma 2, we have that P[j] ≤ P′[j] + 1 < (m − q + 2). Hence, if for all j we set P[j] = P′[j] + 1, the invariant requiring that for all documents j, P[j] ≥ (m − q + 2) if S[q − 1..m] occurs in T_j and P[j] < (m − q + 2) otherwise is satisfied, concluding the proof.

Computing the document array profiles
The computation of the document array profiles is performed by scanning the values of BWT, SA, LCP, and DA in a streaming fashion. For all positions i = 1, …, n, we base the computation of P_DA[LF(i)] on the observation that given a collection of documents D and a suffix u of a document, the suffix v of document T_j with the largest longest common prefix with u is the suffix of T_j that either immediately precedes or immediately follows u in SA order.

For each target database size (30, 100, 300 genomes), we include an equal number of genomes from each class. For example, for the three-class data set, we include 10 genomes of each species in the 30-genome database, 34 genomes of each in the 100-genome database, and 100 genomes of each species in the 300-genome database. Note that this leads to collections that slightly exceed the target size; for example, 3 × 34 leads to an index of 102 genomes. For this reason, we refer to the database sizes as "30+," "100+," and "300+."
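The profile computation described at the start of this section can be checked against a naive (quadratic, illustration-only) reference computation over the toy documents of Table 1, taking P_DA[i][j] to be the length of the longest prefix of T[SA[i]..n] occurring in document T_j; the streaming algorithm computes the same values in a single pass:

```python
def document_array_profiles(docs):
    # docs: terminated document strings; T is their concatenation.
    T = "".join(docs)
    sa = sorted(range(len(T)), key=lambda i: T[i:])
    profiles = []
    for pos in sa:
        u = T[pos:]
        row = []
        for doc in docs:
            # Longest prefix of u occurring as a substring of this document.
            best = 0
            for length in range(len(u), 0, -1):
                if u[:length] in doc:
                    best = length
                    break
            row.append(best)
        profiles.append(row)
    return sa, profiles
```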

Figure 3. Classification results using the document array profiles and SPUMONI 2's (Ahmed et al. 2023) sampled document array across four different data sets, each with the four classes described in Figure 2.
We denote with D = {T_1, …, T_d} the collection of documents (strings) T_1, …, T_d, and we denote with T[1..n] = T_1 ⋯ T_d the concatenation of the documents. The document array (Muthukrishnan 2002) DA[1..n] stores for each position i the document index of T[SA_T[i]..n]. An important problem in document retrieval is the document listing problem.

Problem 1. Given a collection D = {T_1, …, T_d} and a pattern P, return the set of documents L ⊆ D where P occurs.
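Given the document array, Problem 1 has an immediate occ-dependent solution that the profile-based method improves on; a toy sketch:

```python
def document_listing(da, s, e):
    # Naive document listing: the pattern's SA interval is [s, e) (0-indexed),
    # so one pass over da[s:e] collects the distinct documents. This costs
    # time proportional to the interval length (occ), not ndoc.
    return sorted(set(da[s:e]))
```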

Figure 4. Comparing the document array profiles and SPUMONI 2's (Ahmed et al. 2023) sampled document array on seven different strain-level classification tasks using real nanopore reads from the UNCALLED (Kovaka et al. 2021) project.

Definition 1. Given a maximal equal-letter run BWT_T[s..e], the length ℓ of the smallest substring S of T such that all occurrences of S in T are in SA_T[s..e] is given by ℓ = max(LCP_T[s], LCP_T[e + 1]) + 1, assuming LCP_T[n + 1] = 0. Note that SA_T[s..e] can also contain occurrences of substrings different from S.

Lemma 3. Given a position i in the BWT of T such that DA[i] = DA[LF(i)], let BWT[s..e] be the maximal equal-letter run such that s ≤ i ≤ e, and let ℓ be the length of the smallest substring S of T such that all occurrences of S in T are in SA_T[s..e]. Then for all j = 1, …, d such that P_DA[i][j] ≥ ℓ, it holds that P_DA[LF(i)][j] = P_DA[i][j] + 1.

Proof. The first observation is that if P_DA[i][j] ≥ ℓ, then there exists a position s ≤ k ≤ e such that DA[k] = j; otherwise, by definition of ℓ and P_DA[i][j], we would have P_DA[i][j] < ℓ. Hence, T[SA[k]..n] is preceded by the same character as T[SA[i]..n] because i and k are in the same BWT run. Therefore, if DA[k] = DA[LF(k)], we have that lcp(T[SA[LF(k)]..n], T[SA[LF(i)]..n]) = lcp(T[SA[k]..n], T[SA[i]..n]) + 1, which concludes the proof.
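Definition 1 can be sketched directly from the LCP array (a 0-indexed illustration; the hardcoded array is the LCP array of "banana$"):

```python
def smallest_substring_len(lcp, s, e):
    # Definition 1: for a run covering SA rows s..e (inclusive), the common
    # prefix must be long enough to exclude the neighboring row on each side.
    left = lcp[s]                                  # lcp with the row just above the run
    right = lcp[e + 1] if e + 1 < len(lcp) else 0  # row just below; 0 past the end
    return max(left, right) + 1

# LCP array of "banana$"; sorted suffixes are $, a$, ana$, anana$, banana$, na$, nana$
BANANA_LCP = [0, 0, 1, 3, 0, 0, 2]
```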

Theorem 1. Given a collection D of d documents D = {T_1, T_2, …, T_d} over an alphabet of size σ, we can extend the r-index with O(rd) additional words to support document listing queries for a pattern S[1..m] that occurs in ndoc documents in D in O(m log log_w(σ + n/r) + ndoc) time and O(rd) space, where w is the machine word size. (Query time can be improved to O(m log log_w σ + ndoc) by using the approach from Nishimoto et al. 2022.)

Proof. Given a collection D, we store the BWT of the concatenation T of the documents of D, and for all maximal equal-letter runs BWT[s..e], we store in the positions s and e the SA samples SA[s] and SA[e], as well as the document array profile samples P_DA[LF(s)] and P_DA[LF(e)].

Table 2. The document listing query process for a small pattern with respect to the documents in Table 1. Querying the document array profiles (P_DA) for the pattern P = TATG by using backward search. The table shows the document array profile at each step of the search; [4, 3, 8] is the final profile, which means that the pattern P occurs in documents 1 and 3 because |P| ≤ 4 and |P| ≤ 8.

Example 2. In the example in Table 1, if we look at P_DA[4] = [1, 2, 4] and at P_DA[LF(4)] = P_DA[14] = [1, 3, 5], every entry grows by at most one, so Lemma 2 is verified.

Table 3. Species included in the data sets of Figure 1: species used in each data set.