Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT

Compacted de Bruijn graphs are one of the most fundamental data structures in computational genomics. Colored compacted de Bruijn graphs are a variant built on a collection of sequences that associates with each k-mer the sequences in which it appears. We present GGCAT, a tool for constructing both types of graphs, based on a new approach that merges the k-mer counting step with the unitig construction step, as well as on numerous practical optimizations. For compacted de Bruijn graph construction, GGCAT achieves speed-ups of 3× to 21× compared with the state-of-the-art tool Cuttlefish 2. When constructing the colored variant, GGCAT achieves speed-ups of 5× to 39× compared with the state-of-the-art tool BiFrost. Additionally, GGCAT is up to 480× faster than BiFrost for batch sequence queries on colored graphs.


Equivalence of node-centric and edge-centric unitigs
In this section we prove the equivalence between the (maximal) unitigs of the node-centric de Bruijn graph and the (maximal) unitigs of the edge-centric de Bruijn graph built on the same set of strings $R$ (i.e., their spellings are exactly the same strings). Note that we give this proof only for directed graphs.
For ease of notation, in this section we denote the edge-centric graph of $R$ as $G^e_k(R)$. The node-centric graph for $R$, which we denote as $G^n_k(R)$, is formally defined by adding a node for every $k$-mer of $R$, and an edge from node $x$ to node $y$ if $\mathrm{suf}_{k-1}(x) = \mathrm{pre}_{k-1}(y)$. In a node-centric graph, a path $P$ (containing at least one node) is said to be a node-centric unitig if all nodes of $P$, except the last node, have out-degree equal to one, and all nodes of $P$, except the first one, have in-degree equal to one [1]. A node-centric unitig is said to be maximal if it cannot be extended by a node on either side [1]. See Supplemental Figure S1 for an example. We will use the term unitig to refer to a unitig in the edge-centric graph (as defined in the Preliminaries subsection of the main paper), and node-centric unitig to refer to the unitigs just defined in the node-centric graph.
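The definition of the node-centric graph and of node-centric unitigs can be made concrete with a small sketch. The function names (`kmers`, `node_centric_graph`, `is_node_centric_unitig`) and the toy input are illustrative assumptions, not part of GGCAT; the code simply follows the definition above on a set of strings.

```python
from collections import defaultdict

def kmers(strings, k):
    """All distinct k-mers occurring in the input strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}

def node_centric_graph(strings, k):
    """Return (succ, pred) adjacency maps of the node-centric graph.

    There is an edge x -> y whenever suf_{k-1}(x) == pre_{k-1}(y).
    """
    nodes = kmers(strings, k)
    by_prefix = defaultdict(set)
    for x in nodes:
        by_prefix[x[:k - 1]].add(x)
    succ = {x: set(by_prefix.get(x[1:], ())) for x in nodes}
    pred = {x: set() for x in nodes}
    for x, ys in succ.items():
        for y in ys:
            pred[y].add(x)
    return succ, pred

def is_node_centric_unitig(path, succ, pred):
    """Check the node-centric unitig condition on a list of k-mers."""
    if not path:
        return False
    # Consecutive nodes must be joined by an edge.
    if any(b not in succ[a] for a, b in zip(path, path[1:])):
        return False
    # All nodes but the last have out-degree one,
    # all nodes but the first have in-degree one.
    return (all(len(succ[x]) == 1 for x in path[:-1])
            and all(len(pred[x]) == 1 for x in path[1:]))

# Hypothetical toy input: ACGT has two out-neighbors (CGTA and CGTC).
succ, pred = node_centric_graph(["AACGTA", "ACGTC"], 4)
print(is_node_centric_unitig(["AACG", "ACGT"], succ, pred))
print(is_node_centric_unitig(["AACG", "ACGT", "CGTA"], succ, pred))
```

In this toy input the second path fails the check because ACGT is not the last node of the path and has out-degree two.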

Supplemental Figure S1: Top: illustration of the node-centric de Bruijn graph $G^n_4(R)$, where we assume $R$ is the set consisting of all 4-mers that label the nodes of the shown graph. Bottom: the edge-centric de Bruijn graph $G^e_4(R)$, built on the same set $R$. In both graphs we draw in red their corresponding maximal unitigs (i.e., node-centric and edge-centric, respectively). In the node-centric graph, the node-centric unitigs in the middle of the figure do not include the node ACGT, since its out-degree is different from one, and do not include the node CTTG, since its in-degree is different from one. Every unitig is labeled with its spelling, also in red. Notice that these spellings are the same in both graphs.
The spelling of a path $P = (x_1, \dots, x_t)$ in $G^n_k(R)$ is analogously defined as the string $x_1 \odot^{k-1} x_2 \odot^{k-1} \cdots \odot^{k-1} x_t$. As in the edge-centric case, by a node-centric unitig we will refer to either a path $P$ in the node-centric graph, or to the spelling of $P$.
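As a minimal sketch of how a spelling is computed, assume that $x \odot^{\ell} y$ denotes gluing two strings that overlap on $\ell$ characters (the operator is defined in the main paper; the reading used here is an assumption inferred from its use in this section). The helper names `glue` and `spell` are hypothetical.

```python
from functools import reduce

def glue(x, y, overlap):
    """Assumed reading of x ⊙^overlap y: merge x and y on a shared overlap."""
    if x[len(x) - overlap:] != y[:overlap]:
        raise ValueError(f"{x!r} and {y!r} do not overlap on {overlap} characters")
    return x + y[overlap:]

def spell(path, overlap):
    """Spelling of a non-empty path: glue its nodes left to right."""
    return reduce(lambda acc, nxt: glue(acc, nxt, overlap), path)

# Node-centric path of 4-mers (overlap k-1 = 3) and the corresponding
# edge-centric path of 3-mers (overlap k-2 = 2) on a hypothetical toy input.
print(spell(["AACG", "ACGT"], 3))
print(spell(["AAC", "ACG", "CGT"], 2))
```

Both calls spell the same string, which is the kind of correspondence the theorem below formalizes.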
Theorem 1. Let $R$ be a set of strings, let $\mathcal{X}$ be the set of all node-centric unitigs of $G^n_k(R)$, and let $\mathcal{Y}$ be the set of all unitigs of $G^e_k(R)$. Then $\mathcal{X} = \mathcal{Y}$.
Proof. We prove the theorem by proving $\mathcal{X} \subseteq \mathcal{Y}$ and $\mathcal{Y} \subseteq \mathcal{X}$.
(Supplemental Figure S2, caption continued) For concreteness, we also draw some example strings on the nodes of the graphs. In the edge-centric graphs on the bottom, we additionally label the edges with their corresponding $k$-mer.
$\mathcal{X} \subseteq \mathcal{Y}$: Let $X \in \mathcal{X}$; we show that $X \in \mathcal{Y}$ holds. Let $X = (x_1, \dots, x_p)$, $p \geq 1$, let $y_i := \mathrm{pre}_{k-1}(x_i)$ for each $i \in \{1, \dots, p\}$, and let $y_{p+1} := \mathrm{suf}_{k-1}(x_p)$, so that $Y := (y_1, \dots, y_{p+1})$ is a path in $G^e_k(R)$ with the same spelling as $X$. We claim that $Y$ is a unitig in $G^e_k(R)$. Suppose for a contradiction that $Y$ is not a unitig in $G^e_k(R)$, namely that there is some internal node $y_i$, for some $i \in \{2, \dots, p\}$, having in-degree or out-degree different from one; suppose w.l.o.g. that it has out-degree different from one. Since the edge $(y_i, y_{i+1})$ exists in $G^e_k(R)$, the out-degree of $y_i$ is non-zero, and thus at least two. Let $(y_i, y')$ be another edge out-going from $y_i$ ($y' \neq y_{i+1}$). Refer to Supplemental Figure S2 (left) for an illustration of this configuration. Therefore, $x' := y_i \odot^{k-2} y'$ is a node in $G^n_k(R)$. Moreover, since $i \geq 2$, there is a node $y_{i-1}$ preceding $y_i$ in the unitig $Y$, and thus a node $x_{i-1}$ preceding $x_i$ in the unitig $X$. Consider now the nodes $x_{i-1}, x_i, x'$. By definition, we have that $\mathrm{suf}_{k-1}(x_{i-1}) = y_i = \mathrm{pre}_{k-1}(x_i)$, and $\mathrm{suf}_{k-1}(x_{i-1}) = y_i = \mathrm{pre}_{k-1}(x')$. Thus, the out-degree of $x_{i-1}$ in $G^n_k(R)$ is at least two. Since $x_{i-1}$ is not the last node of the unitig $X$ (since $i \leq p$), this contradicts the initial assumption that $X$ was a node-centric unitig.
$\mathcal{Y} \subseteq \mathcal{X}$: Let $Y \in \mathcal{Y}$; we show that $Y \in \mathcal{X}$ holds. Let $Y = (y_1, \dots, y_p)$, $p \geq 2$ (since edge-centric unitigs are defined to contain at least one edge), and let $x_i := y_i \odot^{k-2} y_{i+1}$, for each $i \in \{1, \dots, p-1\}$. Since $x_1, \dots, x_{p-1}$ are $k$-mers of $R$, they are all nodes in $G^n_k(R)$, and $X := (x_1, \dots, x_{p-1})$ is a path in $G^n_k(R)$ (note that $p - 1 \geq 1$, since $p \geq 2$); clearly, by construction, $X$ has the same spelling as $Y$. We claim that $X$ is a node-centric unitig in $G^n_k(R)$. Suppose for a contradiction that this is not the case. W.l.o.g., we can assume that there is $x_i$, for some $i \in \{1, \dots, p-2\}$, having out-degree different from one, and thus at least two. Let $x' \neq x_{i+1}$ be another out-neighbor of $x_i$, and let $y' := \mathrm{suf}_{k-1}(x')$. Refer to Supplemental Figure S2 (right) for an illustration of this configuration. Consider now the nodes $y_{i+1}, y_{i+2}, y'$ (recall $i \leq p-2$) in $G^e_k(R)$. By definition, $(y_{i+1}, y_{i+2})$ is an edge in $G^e_k(R)$; moreover, $(y_{i+1}, y')$ is also an edge in $G^e_k(R)$, since $\mathrm{pre}_{k-1}(x') = \mathrm{suf}_{k-1}(x_i) = y_{i+1}$. Thus, the out-degree of the node $y_{i+1}$ is at least two in $G^e_k(R)$. Since $i \leq p-2$, this means that $y_{i+1}$ is a node of the unitig $Y$, different from its last node $y_p$, having out-degree at least two, which contradicts the fact that $Y$ is a unitig in $G^e_k(R)$.
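The degree correspondence used in both directions of the proof can be sanity-checked on a toy input. The sketch below is not GGCAT's algorithm: it is a naive illustration, with hypothetical function names, that computes the spellings of maximal unitigs separately in the node-centric and in the edge-centric graph so that the two sets can be compared. It assumes the directed setting of this section and does not report unitigs that form isolated cycles.

```python
from collections import defaultdict

def kmers(strings, k):
    """All distinct k-mers occurring in the input strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}

def maximal_node_centric_unitigs(strings, k):
    """Spellings of maximal node-centric unitigs (nodes are k-mers)."""
    nodes = kmers(strings, k)
    by_pre, by_suf = defaultdict(list), defaultdict(list)
    for x in nodes:
        by_pre[x[:-1]].append(x)
        by_suf[x[1:]].append(x)

    def succ(x):
        return by_pre.get(x[1:], [])

    def pred(x):
        return by_suf.get(x[:-1], [])

    spellings = set()
    for x in nodes:
        p = pred(x)
        if len(p) == 1 and len(succ(p[0])) == 1:
            continue  # x can be extended to the left, so it is not a start
        path = [x]
        while True:
            s = succ(path[-1])
            if len(s) != 1 or len(pred(s[0])) != 1:
                break
            path.append(s[0])
        spellings.add(path[0] + "".join(y[-1] for y in path[1:]))
    return spellings

def maximal_edge_centric_unitigs(strings, k):
    """Spellings of maximal edge-centric unitigs (nodes are (k-1)-mers)."""
    edges = kmers(strings, k)  # each k-mer e is an edge pre(e) -> suf(e)
    out_edges, in_deg = defaultdict(list), defaultdict(int)
    for e in edges:
        out_edges[e[:-1]].append(e)
        in_deg[e[1:]] += 1

    def internal(v):
        # v may be an internal node of a unitig: in- and out-degree one
        return in_deg.get(v, 0) == 1 and len(out_edges.get(v, [])) == 1

    spellings = set()
    for e in edges:
        if internal(e[:-1]):
            continue  # the unitig containing e starts at an earlier edge
        spelling, v = e, e[1:]
        while internal(v):
            nxt = out_edges[v][0]
            spelling += nxt[-1]
            v = nxt[1:]
        spellings.add(spelling)
    return spellings

# Hypothetical toy input with a branch after ACGT.
R = ["AACGTA", "ACGTC"]
print(sorted(maximal_node_centric_unitigs(R, 4)))
print(sorted(maximal_edge_centric_unitigs(R, 4)))
```

On this input both functions return the same three spellings. The two traversals test the same condition in different terms: the out-degree of a $k$-mer $x$ in the node-centric graph equals the out-degree of $\mathrm{suf}_{k-1}(x)$ in the edge-centric graph, and similarly for in-degrees.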
Next, we prove the same equivalence, but for maximal unitigs.
Corollary 1. Let $R$ be a set of strings, let $\mathcal{X}$ be the set of all maximal node-centric unitigs of $G^n_k(R)$, and let $\mathcal{Y}$ be the set of all maximal unitigs of $G^e_k(R)$. Then $\mathcal{X} = \mathcal{Y}$.
Proof. We prove one inclusion only, since the other one follows completely symmetrically. Let $M$ be a maximal node-centric unitig in $G^n_k(R)$. We want to prove that $M$ is also a maximal unitig in $G^e_k(R)$. Suppose for a contradiction that $M$ is not maximal in $G^e_k(R)$. Then there exists a unitig
Supplemental Table S2

3 Commands used
Here we list all the command templates used to benchmark the tools.

Uncolored building
Supplemental Figure S2: Illustration of the two analogous cases in the proof of Theorem 1.
Supplemental Table S2: The default value of m (minimizer length) for each value of k (k-mer size) in a given range. These default values of m were chosen to give the best running time in the uncolored construction of the 1K Salmonella genomes.