De novo gene birth and the conundrum of ORFan genes in bacteria

  1. Howard Ochman
  1. Department of Molecular Biosciences, University of Texas at Austin, Austin, Texas 78712, USA
  • Corresponding author: h.uzzaman{at}utexas.edu
  • Abstract

    Bacterial genomes are notable in that they contain large numbers of lineage-restricted (“ORFan”) genes, which have been postulated to originate from either horizontal transfer, rapid divergence from pre-existing genes, or de novo emergence from noncoding sequences. We assess the body of research that explores each of these hypotheses and demonstrate that the origin of bacterial ORFans remains unresolved. Nonetheless, bacteria offer several unique avenues for research into the process and mechanics of gene birth at a resolution not feasible in other organisms. Both their amenability to experimental evolutionary analysis and their strain-level variation in gene content foster investigations of how noncoding sequences acquire expression and transition into functionality, questions central to the origin of phenotypic novelty.

    Most new genes are postulated to have formed through a process of duplication and divergence (Chen et al. 2013; Tautz 2014). But if genes arise only from pre-existing gene sequences, one would expect all genes to have homologs. Since the first sequencing projects, researchers have been struck by the occurrence in almost every genome of “ORFan” genes, those that lack homologs outside of the taxon in which they are found. Numerous explanations have been provided for the existence of ORFans (aka “orphans”), including rapid divergence from functional genes, inadequacies of search strategies, and de novo birth from noncoding sequences without a genic precursor (Khalturin et al. 2009; Tautz and Domazet-Lošo 2011; Van Oss and Carvunis 2019).

    The mystery surrounding the origin of ORFans is fundamental to the evolution of bacteria, whose genomes contain numerous species- and strain-restricted genes (Siew and Fischer 2003; Daubin and Ochman 2004a; Siew et al. 2004) but exhibit very low frequencies of gene duplication (Treangen and Rocha 2011; Tria and Martin 2021). Despite wide acknowledgement of the presence of ORFans in bacteria, few studies have attempted to investigate their emergence (Table 1). Of particular note is the phenomenon of de novo gene birth from noncoding sequences, a mode of gene origin that has been investigated in depth in diverse eukaryotic taxa (Begun et al. 2007; Cai et al. 2008; Knowles and McLysaght 2009; McLysaght and Guerzoni 2015; Zhang et al. 2019; Zhuang et al. 2019) but has gone virtually unexplored in bacteria (Vakirlis and Kupczok 2024). Here, we analyze the efforts undertaken to discover the contribution of different evolutionary processes to the enormous ORFan gene pool in bacteria, with a special focus on de novo gene birth.

    Table 1.

    Studies investigating the origin of ORFans in bacteria

    ORFans as artifacts

    Our discussion of bacterial ORFans focuses only on protein-coding annotated genes, noting that not all annotated genes are functional (Ghatak et al. 2019) and that conventional annotations can miss functional genes (Armengaud 2009; de Souza et al. 2009). Identifying the ORFans in an organism's genome involves searching all of its annotated protein sequences against the proteins encoded by all other taxa (“outgroups”) and retaining only those that have no recognizable homolog among outgroups (McLysaght and Hurst 2016; Vakirlis and McLysaght 2019). Because this approach depends on the robustness of the search strategy, it admits the possibility that homologs to a putative ORFan might exist in the database but that the search has failed to identify them. Therefore, before addressing issues concerning ORFan origins, it is essential to rule out such false positives. A conventional protein BLAST search could fail to return hits owing to inadequate annotation, inappropriate application of an e-value cutoff, or failure to account for remote homology (Fig. 1A).
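    The filtering step described above can be sketched in a few lines. This is a minimal illustration, assuming that search results have already been parsed into (query, subject taxon, e-value) records; the names `Hit` and `find_orfans`, the `focal_taxon` argument, and the default cutoff of 1e-5 are hypothetical choices and are not taken from the studies cited.

```python
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    query: str          # protein ID from the focal genome
    subject_taxon: str  # taxon of the matched sequence
    evalue: float       # e-value reported by the search tool


def find_orfans(proteins: Iterable[str], hits: Iterable[Hit],
                focal_taxon: str, max_evalue: float = 1e-5) -> set[str]:
    """Return proteins that have no acceptable hit outside the focal taxon.

    The fixed cutoff is an illustrative assumption; as the text notes,
    a single threshold can misclassify short proteins.
    """
    with_outgroup_homolog = {
        h.query for h in hits
        if h.subject_taxon != focal_taxon and h.evalue <= max_evalue
    }
    return set(proteins) - with_outgroup_homolog


# Toy data, invented for illustration only.
proteins = ["p1", "p2", "p3"]
hits = [
    Hit("p1", "Escherichia", 1e-50),  # within-taxon hit: ignored
    Hit("p1", "Salmonella", 1e-30),   # outgroup homolog: p1 is not an ORFan
    Hit("p2", "Escherichia", 1e-40),  # only within-taxon hits: candidate
    Hit("p3", "Salmonella", 0.5),     # too weak to count: candidate
]
print(sorted(find_orfans(proteins, hits, focal_taxon="Escherichia")))
```

    Note that `p3` is retained as a candidate only because its single outgroup hit fails the cutoff, which is exactly the kind of call that the checks in Figure 1A are meant to revisit.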

    Figure 1.

    Workflow for detecting ORFans and de novo–emerged genes. (A) An augmented search strategy for detecting actual and excluding artifactual ORFan genes. (B) The gold standard of de novo gene detection. (Created with BioRender; https://www.biorender.com/.)

    Annotation inadequacies

    A protein homologous to a putative ORFan might be encoded in outgroup genomes yet not be annotated in any of them. Short proteins are particularly vulnerable to this concern because they often evade detection by conventional annotation programs (Storz et al. 2014; Tonkin-Hill et al. 2023). As a remedy, a protein BLAST search should be performed against all translated open reading frames, not just the annotated genes, in outgroup genomes (Fig. 1A). This procedure can be implemented either by searching all genome sequences with tFASTy and excluding frameshifted or truncated proteins from the results (Pearson et al. 1997; Karlowski et al. 2023) or by extracting all ORFs from outgroup genomes and conducting a protein BLAST search against their translated products.
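    The second option, extracting and translating every ORF in an outgroup genome, can be sketched as follows. This is a simplified illustration rather than the procedure of any cited study: it assumes the standard genetic code, defines an ORF as any stop-free stretch (no start codon required), and uses a hypothetical minimum length `min_aa`.

```python
from itertools import product

# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate from position 0; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_orfs(genome: str, min_aa: int = 10):
    """Yield (strand, frame, peptide) for each stop-free stretch >= min_aa."""
    genome = genome.upper()
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            for peptide in translate(seq[frame:]).split("*"):
                if len(peptide) >= min_aa:
                    yield strand, frame, peptide


# Toy sequence, invented for illustration.
for orf in six_frame_orfs("ATGAAACCCGGGTTTTAA", min_aa=5):
    print(orf)
```

    The peptides produced this way would then serve as the protein search database in place of the annotated proteome.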

    Reliance on e-value thresholds

    In a typical BLAST search, an e-value-based cutoff is implemented to distinguish genuine matches between the query and target sequences from those that are spurious (Vakirlis and McLysaght 2019). But because of the short length of some proteins, even a high-confidence hit might fail to return a low e-value owing to the inherently greater possibility of spurious short alignments. One remedy would be to manually curate protein alignments with higher e-values to rule out false-positive hits (Kuchibhatla et al. 2014). An alternative strategy is to search against the nucleotide sequences of outgroup genomic regions that are in conserved synteny with the putative ORFan. This restriction significantly reduces the size of the search database from entire genomes to, at most, several kilobases for each genome in which a hit is found, making it computationally manageable to incorporate a smaller “word size” parameter, thus leading to significant gains in sensitivity. The resulting alignments can then be manually curated to cull false positives.
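    The dependence of the e-value on the size of the search space can be made concrete with the standard approximation E = m × n × 2^(−S′), where m is the query length, n is the size of the search space, and S′ is the bit score. The sketch below ignores effective-length corrections, and all of the numbers in it are hypothetical values chosen only to illustrate the scaling.

```python
def expected_chance_hits(bit_score: float, query_len: int,
                         search_space: int) -> float:
    """Approximate e-value: E = m * n * 2**(-S'), with S' in bits."""
    return query_len * search_space * 2.0 ** (-bit_score)


# Hypothetical numbers, for illustration only.
query_len = 30              # a short protein, in residues
bit_score = 30.0            # score of one candidate alignment
whole_database = 10**9      # positions in a large outgroup database
syntenic_only = 5 * 10**4   # a few kilobases in each of several genomes

for label, n in (("whole database", whole_database),
                 ("syntenic regions", syntenic_only)):
    e_value = expected_chance_hits(bit_score, query_len, n)
    print(f"{label}: E = {e_value:.3g}")
```

    Under this approximation the same alignment receives a much smaller e-value when the search is confined to syntenic regions, because E scales linearly with the size of the search space.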

    Recognition of remote homology

    Although sequence homology decays quickly between rapidly diverging homologs, structural homology can persist beyond the point at which there is virtually no sequence similarity (Weisman et al. 2020; Stern and Han 2022). Remote homology has been detected by searching hidden Markov model profiles derived from gene alignments against outgroup sequence or profile databases (Remmert et al. 2011; Lobb et al. 2015). The recent advent of protein structure prediction tools, such as ESMFold (Lin et al. 2023), coupled with structure-based search strategies (Edgar 2024; van Kempen et al. 2024) can potentially aid in the recovery of very distant homologs.

    The origins of ORFan genes

    A lack of homologs after these exhaustive search strategies leads to the more provocative issue of determining how such an ORFan gene actually emerges. There are three routes by which ORFans can arise in bacterial genomes: (1) de novo evolution from noncoding sequences (including noncoding alternative reading frames of functional genes), (2) extreme divergence from other functional genes, and (3) horizontal transfer from a source not present in the database (Daubin and Ochman 2004a; Yomtovian et al. 2010; Vakirlis and Kupczok 2024; Pereira et al. 2025).

    De novo gene birth is the most tractable and traceable of these three routes because it is possible, in principle, to identify the ancestral noncoding sequence from which the new gene formed. Based on extensive work in eukaryotes, a “gold standard” for detection of such genes has been proposed (Vakirlis and McLysaght 2019; Zhang et al. 2019; Vakirlis et al. 2022).

    The gold standard of de novo gene detection

    To unambiguously establish that a gene arose de novo, the noncoding sequence that gave rise to the ORFan needs to be detected. This first requires identification of the template sequence that retains the ancestral noncoding status in outgroup genomes (Fig. 1B). To increase confidence in homology inference, the noncoding sequence in the outgroup genomes should be in conserved synteny with the ORFan, with its inert status confirmed by the lack of a start codon and/or interruption by stop codons or frameshift mutations.

    Note that the mere presence of a homologous noncoding sequence in an outgroup genome is not sufficient to establish a de novo origin of the ORFan because it is possible that the outgroup sequence is noncoding as a result of gene inactivation after the ORFan emerged. To exclude this possibility, not only must the noncoding sequence display homology with the ORFan sequence but its “disabling” mutation (i.e., at least one start-codon-disrupting, stop-codon, or frameshift mutation) must occur in two or more outgroup lineages (Fig. 1B). The rationale for this criterion is that when the same disabling mutation is present in at least two outgroup lineages, the most parsimonious reconstruction of events is that the noncoding status of the sequence is ancestral to the ORFan.
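    The shared-disabling-mutation criterion can be expressed as a small check. The sketch below is a deliberately simplified illustration: it assumes each outgroup sequence has already been aligned to the ORFan's coding frame (with `-` marking deletions), treats each dictionary key as a distinct lineage, ignores insertions relative to the ORFan, and uses hypothetical function names and a common set of bacterial start codons.

```python
import re
from collections import Counter

STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # common bacterial start codons (assumption)


def disabling_mutations(aligned_outgroup: str) -> set[str]:
    """Features that would prevent translation of the aligned region."""
    seq = aligned_outgroup.upper()
    found = set()
    if seq[:3] not in STARTS:
        found.add("no_start")
    # A gap run whose length is not a multiple of 3 shifts the frame.
    for gap in re.finditer(r"-+", seq):
        if len(gap.group()) % 3:
            found.add(f"frameshift@{gap.start()}")
    # In-frame stop codons, excluding the terminal codon.
    for i in range(3, len(seq) - 3, 3):
        if seq[i:i + 3] in STOPS:
            found.add(f"stop@{i}")
    return found


def shared_disabling_mutations(outgroups: dict[str, str],
                               min_lineages: int = 2) -> set[str]:
    """Mutations found at the same position in >= min_lineages lineages."""
    counts = Counter()
    for seq in outgroups.values():
        counts.update(disabling_mutations(seq))
    return {m for m, n in counts.items() if n >= min_lineages}


# Toy alignment to the ORFan ATGAAACCCGGGTTTTAA, invented for illustration.
outgroups = {
    "lineage_A": "ATGAAATAAGGGTTTTAA",  # premature stop at position 6
    "lineage_B": "ATGAAATAAGGGTTCTAA",  # same premature stop
    "lineage_C": "ATAAAACCCGGGTTTTAA",  # start codon absent
}
print(shared_disabling_mutations(outgroups))
```

    An empty result would mean that no single disabling mutation is shared, in which case pseudogenization in the outgroups cannot be excluded. Raising `min_lineages` corresponds to applying a stricter version of the criterion.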

    Reported cases of de novo gene birth in bacterial genomes

    Only two studies have attempted to investigate de novo birth of bacterial ORFans by tracing their sequences to noncoding ancestors. In an analysis of the genus Bacillus, Karlowski et al. (2023) traced 331 ORFans to syntenic noncoding sequences in outgroup genomes. However, this study did not establish whether these noncoding sequences share a disabling mutation in more than one distinct lineage, thereby raising uncertainty about whether they are ancestral to the ORFans or simply represent cases in which a pre-existing gene has become pseudogenized in the outgroup genomes.

    Recently, Vakirlis and Kupczok (2024) traced 1075 species-specific ORFans to their putative noncoding ancestral sequences. The authors were able to identify the corresponding syntenic regions in at least two outgroup genomes, one from the same species and one from a close outgroup, but because of the small number of outgroup genomes bearing the putative noncoding sequence, they could not investigate whether the identical disabling mutation was present in two outgroup lineages. As such, their methodology does not exclude the possibility that ORFan genes emerged via rapid divergence from pre-existing genes or that their status as ORFans resulted from loss in multiple lineages via repeated pseudogenization. To acknowledge this uncertainty, the authors refer to these as “de novo gene candidates,” but considering the massive size of their database (4.7 million protein families from 4644 species in the human gut microbiome), the fact that de novo gene candidacy was assigned to only 0.2% of their recognized ORFans underscores the technical limitations in detecting this mode of gene birth in bacterial genomes.

    Difficulties intrinsic to detecting de novo gene birth

    It is only practical to apply the gold standard of de novo gene detection when sequences evolve slowly and are distributed across multiple closely related lineages. Furthermore, bacterial genomes are buffeted by a pervasive deletional bias that removes noncoding regions (Mira et al. 2001; Kuo et al. 2009), thereby rendering it less likely that noncoding sequences would be conserved across multiple lineages. For bacteria that engage in frequent interspecific gene transfer, even the gold-standard criterion demanding the same debilitating mutation(s) in two outgroup lineages may, in fact, be too permissive and should instead require their presence in more than two outgroup lineages to establish the ancestral state. The complicated nature of phylogenetic inferences in bacteria, combined with the low retention of noncoding sequences, suggests that evidence of real de novo genes is uncommon in these genomes.

    As an alternative to the stringent requirements of the gold standard, Vakirlis et al. (2024) have applied an ancestral sequence reconstruction method to detect noncoding ancestral sequence states. Although this might circumvent the need for identifying shared disabling mutation(s), it nonetheless requires the retention of noncoding sequences across multiple lineages, which is feasible for the yeast species investigated in the study but is less common in bacteria.

    It is notable that many of the problems facing the inference that a gene emerged de novo apply equally to discerning whether these genes arose through rapid divergence from functional genes that retain no similarity to the ORFan sequence. Despite these challenges, two indirect methods of investigating the rapid divergence scenario have been proposed. First, if the rate of sequence evolution of a protein can be calculated, one can infer the likelihood that a homolog exists at a given evolutionary distance but that its sequence-level homology has decayed past the point of recognition (Weisman et al. 2020; Barrera-Redondo et al. 2023). However, such an approach requires the protein to be present in at least three species to calculate a rate of sequence evolution, which limits its applicability to species-specific ORFans. Furthermore, proteins may experience lineage-specific changes in their rates of evolution, compromising the utility of such an approach (Prabh and Tautz 2021).
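    The logic of the first method can be sketched under an explicit assumption: that the similarity score between a protein and its homologs decays exponentially with evolutionary distance, S(d) = a·exp(−b·d). The code below is an illustration of that idea only and is not a reimplementation of the cited methods; the function names, the toy data, and the 30-bit detectability threshold are hypothetical.

```python
import math


def fit_score_decay(distances, bit_scores):
    """Least-squares fit of ln(score) = ln(a) - b * distance.

    Requires observations from at least three species, mirroring the
    requirement noted in the text.
    """
    if len(distances) != len(bit_scores) or len(distances) < 3:
        raise ValueError("need matched observations from >= 3 species")
    ys = [math.log(s) for s in bit_scores]
    n = len(distances)
    mean_x = sum(distances) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


def predicted_score(a: float, b: float, distance: float) -> float:
    return a * math.exp(-b * distance)


# Toy data generated from a = 200, b = 5 (substitutions per site as distance).
distances = [0.0, 0.1, 0.2]
scores = [200 * math.exp(-5 * d) for d in distances]
a, b = fit_score_decay(distances, scores)

threshold = 30.0  # hypothetical score below which a hit goes unrecognized
for d in (0.3, 1.0):
    s = predicted_score(a, b, d)
    print(f"distance {d}: predicted score {s:.1f}, detectable: {s >= threshold}")
```

    If the predicted score at the distance of an outgroup falls below the detection threshold, the absence of a hit there is uninformative about whether a homolog exists.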

    Second, Vakirlis et al. (2020a) propose a synteny-informed method to calculate the overall rate of genes emerging via sequence divergence across a genome. Because bacterial ORFans can rarely be traced to a conserved syntenic region found in outgroups, it remains unclear whether this approach would be useful for bacteria.

    Native versus foreign origins of ORFan genes

    Although the competing hypotheses that account for bacterial ORFans—sequence divergence, de novo gene birth, and gene loss—can only rarely be resolved, broad questions relating to the source of ORFan genes can still be addressed. Specifically, do ORFans arise locally, from sequences already present in the genome, or do they arise via transfer from external sources not present in the database?

    It has been posited that ORFans originate in and are acquired from bacteriophages (Daubin and Ochman 2004a,b), a hypothesis bearing some appeal because bacteriophages comprise the largest group of biological entities and are underrepresented in the databases, they serve as agents of gene transfer, and their high mutation rates can generate an extraordinary amount of genetic novelty (Hendrix et al. 1999; Sanjuán et al. 2010; Bar-On et al. 2018; Benler and Koonin 2021). Moreover, it has been known since the 1950s that “lysogenic conversion genes” introduced by temperate bacteriophages can confer beneficial traits to bacteria (Lwoff 1953; Canchaya et al. 2003). Remnants of phage infection are evident in most bacterial genomes and are implicated in a variety of cellular functions (Wang et al. 2010; Bobay et al. 2014; Bondy-Denomy and Davidson 2014), including in defense (Touchon et al. 2017) and in maintaining cell morphology (Randich et al. 2019).

    Phages are unquestionably involved in bacterial gene transfer, but their role as a source of new bacterial genes is uncertain. The contribution of phages to the repertoire of bacterial ORFans has been investigated by two complementary methods. One method is to identify the fraction of ORFans that are traceable to existing phage genes, and in the first study of its kind, Yin and Fischer (2006) reported that ∼3% of all bacterial ORFans have a phage homolog. Although this value is certainly an underestimate on account of early database limitations, Vakirlis and Kupczok (2024) recently reported that only 5% of bacterial ORFans had phage homologs, even after they applied a low stringency threshold. In contrast to what might be expected if ORFans originated in phage, they found that the likelihood of phage homology increases with persistence of a gene, such that more conserved genes are more likely to have a phage homolog. Lobb et al. (2015) reported a similar estimate of phage-traceable ORFans based on remote homology searches but with a significant, albeit minor, enrichment of viral processes among ORFan functional classes.

    Alternatively, similarities in the sequence characteristics of ORFans and phage-encoded genes seeded the idea that many ORFan genes originated in phage, even if they share no observable homology with phage proteins. Compared with the majority of genes residing in a bacterial genome, both ORFans and phage genes are short and AT-rich and have atypical dinucleotide signatures (Daubin and Ochman 2004a,b). Subsequently, Cortez et al. (2009) found that 60% of bacterial ORFans are situated within clusters of genes that display atypical sequence compositions, which led them to deduce that ORFans within such clusters stemmed from events of horizontal transfer, with about one-half derived from viral and plasmid sources. In contrast to their results, Yomtovian et al. (2010) observed no significant similarities in amino acid composition between ORFans and phage proteins. Drawing on a much larger data set, Vakirlis and Kupczok (2024) reported that ORFans do not differ from conserved genes in their average GC content or other sequence properties, thereby undercutting the phage origin hypothesis and reinforcing the view of a local origin of ORFans.
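    The compositional properties compared in these studies are simple to compute. The sketch below calculates GC content and the dinucleotide odds ratio ρ_xy = f_xy / (f_x · f_y) on a single strand; it is a simplified illustration (published genomic-signature analyses typically also account for the reverse strand), and the function names are hypothetical.

```python
from collections import Counter

NUCLEOTIDES = set("ACGT")


def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases that are G or C."""
    seq = seq.upper()
    total = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / total if total else 0.0


def dinucleotide_odds(seq: str) -> dict[str, float]:
    """rho_xy = f_xy / (f_x * f_y) for each observed dinucleotide."""
    seq = seq.upper()
    mono = Counter(b for b in seq if b in NUCLEOTIDES)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1)
                 if set(seq[i:i + 2]) <= NUCLEOTIDES)
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    return {
        xy: (count / n_di) / ((mono[xy[0]] / n_mono) * (mono[xy[1]] / n_mono))
        for xy, count in di.items()
    }


# Toy sequences, invented for illustration.
print(gc_content("GGCCAATT"))
print(dinucleotide_odds("ATATATAT"))
```

    Values of ρ near 1 indicate that a dinucleotide occurs about as often as expected from its base frequencies; comparing such profiles between ORFans, conserved genes, and phage genes is one way to test for the atypical signatures discussed above.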

    Because the vast majority of phage sequence space remains unsampled, as evidenced by the wealth of taxa and genes identified with each new metagenomic or metaproteomic survey (Nayfach et al. 2019, 2021; Sberro et al. 2019; Durrant and Bhatt 2021; Fremin et al. 2022), it is difficult to completely discount the hypothesis that a sizeable fraction of ORFans originate in phage even in light of the low proportion of ORFans with phage homologs. Although the association between the age of genes and their similarity to phage sequences has been taken as evidence that phages are an unlikely source of newly emerged ORFans (Yin and Fischer 2006; Vakirlis and Kupczok 2024), this finding is a predicted consequence of sampling bias because younger, rarer bacterial genes are expected to be rarer in phage sequence space as well.

    Because of the contradictory interpretations drawn from these studies, it is difficult to know how further sequence comparisons will help in resolving this debate. Even if, as previous studies report, ORFans manifest a lower GC content than conserved genes, a feature considered to be a hallmark of phage origin, this compositional bias is also observed in rapidly diverging genes owing to the inherent pattern of mutations in bacterial genomes (Schaaper and Dunn 1991; Sargentini and Smith 1994; Yamamura et al. 2000). Conversely, acquired genes eventually take on the sequence properties of their new genomic host, obscuring their origins (Daubin and Ochman 2004a,b). Limitations of sampling notwithstanding, only the consistent discovery of clear homologs to ORFan genes in viral databases can lend support to the phage-origin hypothesis.

    Alternative routes of de novo gene birth in bacteria

    Although the exact contribution of the different pathways to gene birth remains unclear, it is likely that each of the processes discussed so far has contributed to the generation of ORFan genes in bacteria, with vestiges of some gene-birth events traceable in extant genomes. Owing to their amenability to experimental evolutionary analysis and their strain-level variation in gene content, bacteria offer the opportunity to identify forms of de novo gene birth that are not readily captured by conventional detection methods.

    De novo transcription or translation

    Detecting de novo gene birth need not rely only on the identification of new ORFs originating from noncoding regions. New genes can also arise as a result of transcription and/or translation of pre-existing ORFs that were previously unexpressed, a process that represents one of the more frequently detected routes of new gene birth (Fig. 2A; Grandchamp et al. 2023). Because these ORFans have homologous ORFs in outgroup genomes, they would be ignored based on conventional criteria (Fig. 1).

    Figure 2.

    Alternative forms of de novo gene birth. (A) De novo emerged transcription of a pre-existing ORF. (B) De novo gene emergence from a frameshifted gene sequence. (Created with BioRender; https://www.biorender.com/.)

    Merging comparative genomic and transcriptomic data allows detection of both the lineage-specific transcripts and the mutations responsible for the formation of new promoters (Blevins et al. 2021). Although bacterial promoters are often imprecise and difficult to detect (Coppens and Lavigne 2020; Lagator et al. 2022), and the absence of a canonical promoter or of transcription is not irrefutable evidence that a sequence is noncoding, novel expression and its causative regulatory sequences can be identified across short evolutionary timescales. Working with the Escherichia coli Long-Term Evolution Experiment (Lenski et al. 1991; Tenaillon et al. 2016; Good et al. 2017), a system in which the ancestral states of all genomes and each new mutation are known, we leveraged expression data assayed across a large number of growth conditions (Houser et al. 2015; Caglar et al. 2017; Tjaden 2023) to establish the de novo emergence of new transcripts and proteins (uz-Zaman et al. 2024). But because such idealized conditions are not available for natural populations, inferences about de novo gains in transcription or translation cannot be generalized across species.

    Gene birth via overprinting

    Paralleling the emergence of new transcription and translation of pre-existing open reading frames is the appearance of new genes from the noncoding reading frames of pre-existing genes (Fig. 2B). There is ample evidence that bacterial genes produce transcripts and proteins from alternative reading frames along both strands of DNA (Raghavan et al. 2012; Stringer et al. 2021; Smith et al. 2022) and that some of these proteins exhibit evidence of purifying selection, implying that they are functional (Ardern et al. 2020; Zehentner et al. 2020; Kreitmeier et al. 2022). That such products can serve as raw material for the formation of new genes (Ruiz-Orera et al. 2018) helps mitigate the idea that the lack of intergenic DNA in bacterial genomes limits the potential for new gene formation. Genes formed from within existing coding regions are apt to be easier to detect because their precursor sequences have a greater likelihood of being preserved over evolutionary timescales than do intergenic noncoding sequences. Also, because frameshifted proteins retain many properties of the protein encoded in the original coding frame (Bartonek et al. 2020), they may more easily transition to functionality compared with those originating from intergenic sequences.

    The evolution of new genes via frameshifted overprinting has been widely investigated in viruses (Sabath et al. 2012; Pavesi 2021), and there are numerous cases of gene overlap in bacterial genomes (Rogozin et al. 2002; Wright et al. 2022). But owing to the very short length of most overlaps, often involving only the start and stop codons of adjacent genes, few represent cases of de novo gene birth (Wright et al. 2022). To date, only two studies have investigated the more extensive overlaps between annotated bacterial genes. In a study of chimeric genes in the E. coli pangenome, Watson et al. (2021) identified 767 gene families that contain at least one domain derived from the shifted frame of an annotated protein. Because this study focused solely on chimeric proteins, they described no cases in which a protein was derived exclusively from the shifted frame of another gene. However, this feature was considered in a survey of all species-specific genes in the gut microbiome in which 1.2% of ORFans (representing 7585 families) were derived from the frameshifting of other genes in the same genome (Vakirlis and Kupczok 2024). Although these genes represent unambiguous examples of de novo emergence from a previously noncoding sequence, their estimates suggest that this mode of de novo gene birth is a minor contributor to the pool of bacterial ORFans.

    It is noteworthy that the frameshifted genes identified by Vakirlis and Kupczok (2024) did not overlap the reading frame from which they were derived, which is indicative of past gene duplication events, after which the two paralogs retain functionality in different frames. Similarly, a large fraction (31.5%) of the chimeric proteins reported by Watson et al. (2021) were nonoverlapping and could therefore be implicated in duplication events. Because this mode of gene origination requires paralogs to persist in the same genome, it is predictably rare in bacterial genomes owing to their very low retention of duplicated genes (Treangen and Rocha 2011).

    Investigating the mechanics of de novo gene birth

    The birth of a new gene from pre-existing noncoding sequences can be broadly conceptualized as having two phases: the acquisition of expression (and translation in the case of protein-coding genes) and the transition to functionality. Despite the complications accompanying the identification of fully formed de novo genes in bacteria, insights into the mechanisms of gene origin have been gained by studying these two phases separately.

    Phase 1: the transition to expression

    According to the proto-gene model of gene birth (Carvunis et al. 2012; Weisman and Eddy 2017), the functionality of gene sequences is preceded by their gain of expression and subsequent translation, such that the pool of expressed but nonfunctional proteins (“proto-genes”) harbored in each cell constitutes the raw material from which de novo genes arise. Recent genome-wide surveys of translation in a number of bacterial species have identified an abundance of novel proteins that cannot be detected by conventional gene annotation algorithms (Tables 2, 3; Baek et al. 2017; Hücker et al. 2017; Meydan et al. 2018, 2019; Weaver et al. 2019; Venturini et al. 2020; Stringer et al. 2021; Smith et al. 2022). Although translation of these proteins can be detected by ribosome profiling, their corresponding products mostly escape detection by mass spectrometry and western blots (VanOrsdel et al. 2018). For example, all but one of the studies listed in Table 2 failed to establish mass spectrometric evidence for 90% of the proteins detected by ribosome profiling, with three studies failing to find evidence for a single new protein resolved by ribosome profiling. Discrepancies in the detection of novel proteins have been attributed to their shorter lengths (which limit the generation of tryptic peptides), high hydrophobicity, and low stability in the cell (VanOrsdel et al. 2018; Fijalkowski et al. 2022), features that also explain why their detection suffers from poor reproducibility between studies (Weaver et al. 2019). Many of these novel sequences have been shown to be under purifying selection (Fesenko et al. 2025), but owing to their high rates of divergence (Stringer et al. 2021) and inconsistent translation, they are likely in the prefunctional, proto-gene phase of gene birth. Investigations of the properties and rate of emergence of these proto-genes can ultimately shed light on the first phases of gene birth: the de novo acquisition of open reading frames or expression.
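    The effect of protein length on the supply of tryptic peptides can be illustrated with an in silico digest. The sketch below applies the common trypsin rule (cleavage after K or R unless the next residue is P) with no missed cleavages; the 7 to 30 residue window used to call a peptide detectable is a hypothetical assumption, as are the example sequences and function names.

```python
import re


def tryptic_peptides(protein: str) -> list[str]:
    """Cleave after K or R unless followed by P; no missed cleavages."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def detectable_peptides(protein: str, min_len: int = 7,
                        max_len: int = 30) -> list[str]:
    """Peptides within an assumed (hypothetical) detectable length window."""
    return [p for p in tryptic_peptides(protein)
            if min_len <= len(p) <= max_len]


# Toy sequences, invented for illustration.
short_protein = "MKTLLRAGK"
longer_protein = "MSTNPKPQRKTAVILGGSWDEFK"

for name, seq in (("short", short_protein), ("longer", longer_protein)):
    print(name, tryptic_peptides(seq), detectable_peptides(seq))
```

    In this toy case the short protein yields no peptide inside the assumed window, so it would go unobserved by mass spectrometry even if it were translated.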

    Table 2.

    Studies reporting the presence of nonannotated bacterial proteins using ribosome profiling

    Table 3.

    Studies reporting the presence of nonannotated bacterial proteins using mass spectrometry

    Phase 2: the transition to functionality

    Because of their rapid generation time and ease of propagation and genetic manipulation, bacteria provide excellent model systems to experimentally assay the functional potential of proteins encoded by nongenic sequences. Such evidence is usually obtained by expressing large pools of protein libraries in bacterial cells and testing for a functional phenotype. Using this approach, Knopp et al. (2019, 2021) have demonstrated the ability of random proteins to function as antibiotic-resistance peptides, either by modulating membrane potential or by engaging in specific interactions with transmembrane proteins. More recently, Frumkin and Laub (2023) have demonstrated the activity of a random peptide in inducing antitoxin resistance by interfering with the activity of protein chaperones in the cell. Studies using a rationally designed library of binary-patterned proteins, which contain alternating polar and nonpolar residues, have demonstrated a wide range of auxotroph-rescue phenotypes in bacteria (Kamtekar et al. 1993; Patel et al. 2009; Fisher et al. 2011; Donnelly et al. 2018). Also, in an approach that straddles the phage-transfer and de novo routes to gene origin, Warsi et al. (2020) constructed a gene fusion between bacterial and phage DNA that conferred a temperature-resistance phenotype.

    Cumulatively, such studies not only demonstrate the ability of random proteins to confer beneficial phenotypes but also allow direct tests of hypotheses about the transition to functionality during de novo gene birth. For example, the functional peptides identified by Knopp et al. (2019) were all membrane-associated, which is consistent with the “transmembrane-first” model of gene birth, according to which new genes initially acquire functionality by acting as transmembrane domains (Vakirlis et al. 2020b).

    State of the field and future prospects

    Questions pertaining to the origin of ORFan genes in bacteria, and the degree to which de novo gene evolution contributes to their formation, remain almost as mysterious today as they were two decades ago. This is surprising, given that sequence information has resolved so many other aspects of gene and genome evolution in bacteria (Ochman et al. 2000; Gevers et al. 2004; Lerat et al. 2005; Bratlie et al. 2010; Treangen and Rocha 2011; Tria and Martin 2021). A key barrier to progress is that none of the three mechanisms proposed to explain the origin of ORFans (rapid divergence, de novo birth, and transfer from sources absent in the database) is expected to leave remnants in genomes, making it unusually difficult to reconstruct the origins of most taxon-specific genes. Rapid divergence from a pre-existing gene, by definition, leaves no homologous sequences in the outgroup; detecting de novo origin requires the improbable persistence of noncoding sequences across multiple bacterial lineages; and concerns remain about phage undersampling. Despite these hurdles, bacterial model systems can provide unique and unexplored research avenues. Because of the abundance of genomics and transcriptomics data sets, bacteria offer the opportunity to study the fine-grained stages in the emergence of genes within the history of a single species. Furthermore, bacterial experimental evolution presents an avenue by which the mechanisms of gene birth can be explored at a resolution otherwise not possible in more complex systems. In these ways, research on gene birth in bacteria can illuminate unanswered questions pertaining to the origin of novelty across all life-forms.

    Competing interest statement

    The authors declare no competing interests.

    Acknowledgments

    We thank Kim Hammond for her help in preparing the figures. This work was supported by the National Institutes of Health (R35GM118038 to H.O.). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    Footnotes

    • Received October 25, 2024.
    • Accepted May 30, 2025.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
