Genomics of the future: Identification of quantitative trait loci in the mouse

Lorraine Flaherty1, Bruce Herron, and Derek Symula

Genomics Institute, Wadsworth Center, Troy, New York 12180, USA

Abstract

Positional cloning of quantitative trait loci in rodents is a common approach to identify genes involved in complex phenotypes, including genes important to human disease. However, cloning the causative genes has proved to be more difficult than determining their positions. New tools such as genomic sequence, clone libraries, and new genomic-based methods offer new approaches to identify these genes. Here we review how these new tools and approaches will improve our ability to discover the genes important in complex traits.

Identifying genetic loci controlling complex traits (quantitative trait loci or QTLs) is one of the biggest challenges confronting genetics. These genes influence such traits as growth, morphology, and behavior and determine susceptibility and severity for nearly every disease. In particular, QTLs represent a gateway to the genetic factors controlling common, non-Mendelian diseases, such as heart disease and cancer, which affect many more people than the classic single-gene diseases studied in the early days of positional cloning. These diseases have clear genetic components, yet the underlying genes have proven difficult to identify. Unfortunately, these genes are usually subtle in their expression and effect on the general phenotype of the organism; they interact with other genes and environmental effects, which makes them difficult to isolate, and they often have low penetrance. Positional cloning of these QTLs in rodents has proved to be one of the most powerful tools for the functional identification of these genes (Georges 1997). The first step in this procedure is to map these loci to particular chromosomal regions, usually spanning anywhere from 10 to 40 cM. Over two thousand QTLs have been mapped (http://www.pubmed.com); however, the further narrowing of these regions and the subsequent identification of these QTLs are not easy (Flint et al. 2005). Sometimes, these QTLs contain polymorphisms resulting in obvious and deleterious effects on expression and function. More often, sequence differences are difficult to reconcile with the predicted phenotypes or cause expression and/or phenotypic variations that are subtler. Often there is no “smoking gun.” Given the importance of QTLs in common diseases and in pharmacogenomics, recent genetic studies have focused on the identification of these loci.

Genomic sequence and related tools promised to help us find these genes by improving existing approaches and making new approaches possible. Here, we discuss some of the present and future techniques for making identification of these QTLs an easier task. Now that we are several years into the post-genome era, we can begin to evaluate some of these strategies. These fall into three categories: (1) making candidate intervals as small as possible, (2) efficiently evaluating large numbers of genes in candidate intervals, and (3) testing candidate genes in a more powerful and efficient manner. These strategies are outlined in Figure 1. It is important to note that, with the possible exception of a targeted knock-in mutation of a QTL, none of the methods yet described can provide explicit proof of a gene's involvement in a trait. Rather, each approach tests a gene or (ideally) multiple genes in a different way until the weight of evidence is considered sufficient to prove causation (Abiola et al. 2003).

Detection and localization

In 1992, Dietrich and colleagues (Dietrich et al. 1992) forged a major tool for mapping QTLs—a dense, highly polymorphic linkage map based on molecular markers. Not only did this map become a wonderful tool to map QTLs in standard crosses, but it also made it practical to construct more specialized resources for use in linkage studies. These consisted of new mouse constructs, including large sets of recombinant inbred lines, heterogeneous stocks, consomics, advanced inbred lines, and congenic strains (Kruglyak and Lander 1995a,b; Lander and Kruglyak 1995; Darvasi 1998; Nadeau et al. 2000). For mapping studies, recombinant inbred lines and consomics have been the most useful and have led to the rough localization of a large number of QTLs (see http://www.informatics.jax.org/searches/allele_form.shtml). Perhaps, even more important, the complete C57BL/6 genome and low-coverage sequence for 129/S1, 129X1, A/J, and DBA/2J (Celera/Applera) became available on the Web, and 15 strains of mice are now being resequenced by the Center for Rodent Genetics (http://www.niehs.nih.gov/crg/). The single nucleotide polymorphisms (SNPs) revealed by these sequence databases provide an extremely dense map for linkage mapping.

What excited the QTL community more than improved mapping resolution, however, was the potential to use these SNPs for in silico localization using linkage disequilibrium (LD) analysis. Because most inbred mouse strains have a common ancestral heritage and ancestral genotypes account for most of the genetic variation among inbred strains (Wade et al. 2002; Frazer et al. 2004; Yalcin et al. 2004b), QTL genes are likely (though not certain) to be found in regions of different ancestry in strains with different phenotypes. Whole-genome association studies might be used to identify ancestral haplotype blocks, i.e., blocks of DNA inherited from a common ancestor, which correlate with different values of a trait among a large collection of inbred strains. This approach is attractive because it can use previously generated phenotypes, such as those from the JAX Phenome Database, and requires no further crosses. Thus, in the hypothetical array of strains in Figure 2, sex-linked traits that follow strains A, C, F, and G would be due to the region between markers 5 and 6. For example, Pletcher et al. (2004) used nearly 11,000 SNPs to derive inferred haplotypes for 48 strains and applied these haplotypes to existing phenotype data. Previously identified loci for several monogenic and several quantitative traits showed highly significant associations with the appropriate phenotypes. A similar approach would use haplotypes within a previously mapped QTL to potentially refine the QTL to a smaller haplotype block that associates with the trait (Manenti et al. 2004).
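The haplotype association idea can be illustrated with a small sketch. This is not the procedure of Pletcher et al.; it assumes, for illustration only, that each strain has already been assigned one haplotype label per block and one mean trait value, and it scores each block with a one-way ANOVA F statistic. The strain names, block names, haplotype labels, and trait values below are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of trait values."""
    values = [v for g in groups for v in g]
    k, n = len(groups), len(values)
    if k < 2 or n <= k:
        return 0.0  # not enough groups or strains to test anything
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def scan_blocks(haplotypes, trait):
    """Score each haplotype block by how well it partitions the trait.

    haplotypes: {block_name: {strain: haplotype_label}}
    trait:      {strain: mean trait value}
    """
    scores = {}
    for block, labels in haplotypes.items():
        groups = defaultdict(list)
        for strain, label in labels.items():
            if strain in trait:
                groups[label].append(trait[strain])
        scores[block] = f_statistic(list(groups.values()))
    return scores


# Hypothetical data: strains A, C, F, and G share the high trait value.
trait = {"A": 9.8, "B": 5.1, "C": 10.4, "D": 4.7,
         "E": 5.3, "F": 10.1, "G": 9.6, "H": 4.9}
haplotypes = {
    "block_1": {"A": "h1", "B": "h1", "C": "h2", "D": "h2",
                "E": "h1", "F": "h2", "G": "h1", "H": "h2"},
    "block_2": {"A": "h1", "B": "h2", "C": "h1", "D": "h2",
                "E": "h2", "F": "h1", "G": "h1", "H": "h2"},
}
scores = scan_blocks(haplotypes, trait)
best_block = max(scores, key=scores.get)
print(best_block, scores)
```

In this toy data set only block_2 separates the high and low strains, so it receives the larger score. A real analysis would also need a significance threshold (for example, from permutation) and some treatment of relatedness among strains, neither of which is attempted here.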

Figure 1.

Scheme for the identification of QTLs.

Despite the potential power of LD analysis, the emerging body of studies suggests major challenges to its application in mice (Frazer et al. 2004; Yalcin et al. 2004a). Complete resequencing has revealed greater detail than the whole genome shotgun reads used initially. Haplotype blocks in mice are not consistent among inbred strains; a given region may have different boundaries in different strains. There also are concerns that in silico analysis in mice may be impractical for precise QTL gene localization because of the large number of strains needed for good statistical power even under favorable conditions. It has been estimated that detecting a QTL would require between 40 and 150 strains, with 40 strains needed to determine a major gene controlling the majority of the phenotypic variance (Darvasi 2001). There is still hope for in silico strategies, though. While the structure of the mouse genome is more complex than originally thought, there are still blocks of sequence with clear phylogenetic relationships among the inbred strains (Frazer et al. 2004; Yalcin et al. 2004a). This information still allows for association tests that use a small set of SNPs to represent a given region. Alternatively, we can compare the strain distribution pattern (SDP) of each SNP with the SDP of the trait without regard for the structure of the genome. While this strategy requires complete sequence information for each strain, the limited number of SNP SDPs suggested by results of Yalcin and colleagues may have better statistical power than might be expected (Yalcin et al. 2004b). It seems likely that, given complete sequence coverage for many strains and analysis methods that balance power and resolution, some form of association will be important in QTL localization.
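The SDP comparison mentioned above can be sketched as follows. The sketch assumes, as a simplification, that the trait has been dichotomized into high and low strains and that each SNP is biallelic, so that both the trait and each SNP can be written as a string of 0s and 1s over a fixed strain order. The SNP names and patterns are hypothetical, and the concordance score is only one of many possible measures.

```python
def sdp_concordance(snp_sdp, trait_sdp):
    """Fraction of strains at which a SNP's SDP agrees with the trait SDP.

    Both arguments are equal-length strings of '0'/'1', one character per
    strain in a fixed order. Allele labels are arbitrary, so the
    complementary pattern counts as an equally good match.
    """
    if len(snp_sdp) != len(trait_sdp):
        raise ValueError("SDPs must cover the same strains")
    n = len(trait_sdp)
    matches = sum(a == b for a, b in zip(snp_sdp, trait_sdp))
    return max(matches, n - matches) / n


def rank_snps(snp_sdps, trait_sdp):
    """Return (snp_id, concordance) pairs, best match first."""
    scored = [(snp, sdp_concordance(sdp, trait_sdp))
              for snp, sdp in snp_sdps.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical patterns over eight strains (A..H); 1 = high trait value.
trait_sdp = "10100110"
snp_sdps = {
    "snp_1": "11001010",  # agrees at half of the strains
    "snp_2": "01011001",  # exact complement of the trait pattern
    "snp_3": "10100100",  # differs from the trait at one strain
}
ranking = rank_snps(snp_sdps, trait_sdp)
for snp, score in ranking:
    print(snp, score)
```

Because many SNPs share the same SDP, a practical version would probably collapse identical patterns before testing, which is one reason the number of distinct tests may be smaller than the number of SNPs.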

Candidate gene identification

Genomic tools, techniques, and databases have transformed the hunt for candidate genes. Perhaps the greatest effect concerns the initial characterization of the candidate interval. Database searches focusing on the candidate interval have replaced arduous physical mapping and sequencing that, very recently, were state-of-the-art. In addition, multi-species sequence comparisons have identified conserved non-coding regions that are important in gene regulation (Boffelli et al. 2004; Loots et al. 2005). It is now possible, through database searches and multi-species sequence comparisons, to produce a nearly comprehensive list of candidate functional units from Web-based databases (Ahituv et al. 2005). Thus, genomic tools focus the search for causative genes and polymorphisms to those regions most likely to impact phenotype. Evaluating these genes and regulatory regions for appropriate function and changes in sequence and expression remains a challenge but the efficiency of the techniques used has been increased vastly.

Expression profiling would seem to be a perfect tool for evaluating candidates. As for Mendelian traits, an investigator can assay the entire interval to identify genes whose expression correlates with phenotype. The simplest application of expression profiling is to compare two groups of mice with different phenotypes and ask whether any genes in the candidate region are differentially expressed. The clearest reported success using this approach is susceptibility of mice to allergen-induced airway hyper-responsiveness. Wills-Karp and colleagues (Ewart et al. 2000; Karp et al. 2000) mapped two QTLs for this trait with one differentially expressed gene, hemolytic complement Hc (previously known as C5), as a prime candidate. Subsequently, differences in the Hc sequence were shown to be the cause of this hyper-responsiveness. As simple and attractive as this strategy is, it does not work very often. The investigator must assay the appropriate tissue, cell type, environmental conditions, and developmental stage to evaluate the correct and “bottleneck” phenotype, i.e., the place and time where the QTL makes a difference. Moreover, few cloned QTLs determine expression differences of a magnitude that would be detected by this approach (Flint et al. 2005). Therefore, recent efforts have focused on detecting subtle expression differences and building “gene networks” to take advantage of the tremendous throughput of expression profiling.

One such approach, genetical genomics, treats gene expression levels as quantitative traits and uses a segregating population to map loci that regulate expression of a given gene (Schadt et al. 2003; Bystrykh et al. 2005; Chesler et al. 2005; Hubner et al. 2005; Li et al. 2005). Statistical analyses can then determine if a biological QTL co-segregates with an expression QTL (eQTL), which would suggest that the same gene is responsible for both the biological and expression phenotypes. It is possible to detect subtler expression differences by linkage mapping rather than through a simple two-group comparison (Schadt et al. 2003), in part because recombinant inbred lines, for example, will show trait segregation correlating with a particular genotype, thus increasing the statistical power through replication (Bystrykh et al. 2005; Chesler et al. 2005; Hubner et al. 2005).
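As a rough illustration of treating expression as a quantitative trait, the sketch below regresses one transcript's expression level on each marker genotype across a set of recombinant inbred lines and converts the fit into a LOD score using the standard single-marker relation LOD = -(n/2) log10(1 - r^2). It assumes fully homozygous lines with genotypes coded 0 or 1 for the two parental alleles; the marker names, genotypes, and expression values are hypothetical, and this is not the analysis pipeline of any of the cited studies.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no linkage information
    return sxy / math.sqrt(sxx * syy)


def marker_lod(genotypes, expression):
    """LOD score from regressing expression on a 0/1 marker genotype."""
    r2 = pearson_r(genotypes, expression) ** 2
    r2 = min(r2, 1 - 1e-12)  # guard against log10(0) on a perfect fit
    return -(len(expression) / 2) * math.log10(1 - r2)


def map_eqtl(marker_genotypes, expression):
    """Return the best-supported marker and the LOD score at every marker."""
    lods = {m: marker_lod(g, expression) for m, g in marker_genotypes.items()}
    peak = max(lods, key=lods.get)
    return peak, lods


# Hypothetical expression of one transcript in ten recombinant inbred lines.
expression = [7.9, 8.1, 6.2, 6.0, 8.3, 6.1, 7.8, 6.3, 8.0, 5.9]
marker_genotypes = {
    "marker_a": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "marker_b": [1, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    "marker_c": [1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
}
peak, lods = map_eqtl(marker_genotypes, expression)
print(peak, lods)
```

Running the same scan for a biological trait and checking whether its peak coincides with the expression peak is the co-segregation test described in the text. Significance thresholds, which are commonly set by permutation, are left out of this sketch.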

Figure 2.

Haplotype block mapping of inbred strains to determine phenotype associated with red but not black chromosome.

Cis-acting eQTLs are defined as those that are closely linked to the target gene, while trans-acting eQTLs map elsewhere in the genome. A cis-acting eQTL is of particular interest for candidate gene identification since it is likely to regulate the closely linked structural gene and implicate it as a strong candidate. Trans-acting eQTLs are more difficult to understand since they suggest candidate genes more indirectly. The genes regulated by a given trans-acting eQTL probably represent downstream effects of the regulator and, as might be expected, can have similar functions or act in the same pathway (Chesler et al. 2005). Thus, gene networks defined by trans-acting eQTLs will broaden our interpretations of quantitative traits and provide a wealth of information for understanding gene regulation and coordinately regulated genes. Several reports recently demonstrated the power of this approach in linking expression of specific genes to behavioral (Chesler et al. 2005) and diabetes-related phenotypes in mice (Doss et al. 2005), hypertension in rats (Hubner et al. 2005), and turnover of mouse hematopoietic stem cells (Bystrykh et al. 2005). Importantly, these candidate genes frequently were polymorphic or mapped to regions of differing haplotypes between the parental strains, and thus were strong candidates for regulatory QTLs.
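The cis/trans distinction comes down to comparing the map position of an eQTL peak with the position of the gene whose expression it controls. A minimal sketch follows; the 10-Mb window is an arbitrary illustrative cutoff (the text gives no threshold), and the `Locus` type and all positions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locus:
    chromosome: str
    position_mb: float


def classify_eqtl(gene, eqtl_peak, window_mb=10.0):
    """Label an eQTL 'cis' if its peak lies close to the regulated gene.

    window_mb is an illustrative cutoff, not a value taken from the text.
    """
    same_chromosome = gene.chromosome == eqtl_peak.chromosome
    distance = abs(gene.position_mb - eqtl_peak.position_mb)
    if same_chromosome and distance <= window_mb:
        return "cis"
    return "trans"


# Hypothetical gene and eQTL peak positions.
gene = Locus("7", 45.2)
print(classify_eqtl(gene, Locus("7", 47.0)))    # peak close to the gene
print(classify_eqtl(gene, Locus("7", 120.0)))   # same chromosome, far away
print(classify_eqtl(gene, Locus("12", 47.0)))   # different chromosome
```

In practice the appropriate window would depend on the mapping resolution of the cross, so any fixed cutoff should be treated as a tunable assumption.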

Like the LD studies described above, genetical genomics is a largely in silico method, with increased power to compare phenotypes from existing databases with emerging expression data and sequence variation so that probable candidates can be identified. Some of this analysis software is available on the WebQTL Web site (http://www.genenetwork.org/home). While the authors of these initial studies generated their own expression data, data for other experiments are becoming increasingly available in expression databases such as NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/). This approach is a powerful one and is likely to become common in QTL studies.

Causative gene identification

Once strong candidates are identified, it is crucial to test them. The main problems include the difficulty in predicting the effects of existing polymorphisms on gene expression and function of the candidate gene, in extrapolating these effects to biological impact, and in ascribing phenotypic differences between two strains to one of many genes polymorphic between those strains. In some cases, the primary candidate gene, e.g., Rgs2 (Yalcin et al. 2004b) in the case of behavior, has not been shown to have a structural defect. In others, the only observed polymorphism may have no clear effect on candidate gene function. For example, the Apoa2 gene was a strong functional and positional candidate for a QTL controlling plasma HDL level, but when the coding region was examined only one Apoa2 polymorphism was significantly associated with plasma HDL level: a conserved alanine that is changed to valine in high HDL strains (Wang et al. 2004). Furthermore, because of the common origin of the inbred strains, a candidate interval often carries several polymorphisms that may have nothing causative to do with the expected complex trait phenotype. For most QTLs, investigators must explicitly test new alleles of the candidate gene. They may generate null alleles, complement an existing allele, or, in the most recent twist, find new alleles of the candidate's ortholog in a different species (Lyons et al. 2005; Wang et al. 2005).

Targeted knock-in mutations of a candidate gene remain the gold standard for QTL gene identification (Abiola et al. 2003). A central reason for the utilization of mice in complex trait analysis has been the ability to isolate and genetically manipulate mouse embryonic stem (ES) cells that can be used to create mice with specific gene defects. Historically, generating a targeted mutation in a specific candidate gene was a major undertaking and the resources required to do so on a large scale were unlikely to be found in academic laboratories. Publicly available targeted mutations covered only a fraction of the genome until recently. The availability of the mouse genome sequence and the clones used in the public sequencing effort make generating a targeting construct easier (Cotta-de-Almeida et al. 2003). One can identify the target site in a gene using a sequence database and generate homology arms by PCR of a purified BAC template. More sophisticated targeting constructs for conditional knockouts can be made by recombineering (Copeland et al. 2001). In addition, it soon may become unnecessary to build a single targeting construct. Instead, one can use MICER, a library of pre-made targeting vectors for insertional mutations (Adams et al. 2004). There is also an initiative to make and screen a massive library of ES cells that contain a reporter gene in place of each putative mouse open reading frame (Austin et al. 2004). Thus, heterozygous mice created from each of these lines would provide spatial and temporal information about where the gene is expressed, while homozygous null mice could provide insight into gene function in vivo. A tiered strategy can also be used for mutant line characterization. Here multiple levels of phenotype ascertainment will be used to funnel specific loci into more focused studies. For example, if heterozygous mice show strong reporter expression in brain, a battery of behavioral tests might be warranted for that mutant line (Austin et al. 2004).

Moreover, there is a wealth of genetrap mutations available to the rodent research community. These are mutations that result from the insertion of an expression vector into the target gene. This expression vector contains a reporter gene, a selectable marker, poly(A) site, and translation stop site, but no promoter sequences. The vector is electroporated into mouse ES cells and, if it inserts in a transcribed sequence and produces a fusion transcript composed of part of the endogenous transcribed sequence (i.e., the “trapped” gene), the reporter, and the selectable marker, it will allow these ES cell clones to be selected. The trapped gene then is identified by 5′ gene sequencing. The insertion often serves as a hypomorphic or null mutation, since it prematurely terminates transcription of the endogenous gene, and contains a reporter for the expression patterning of the endogenous gene. More recent genetraps feature site-specific recombinase sequences to create conditional mutations (Schnütgen et al. 2005). Several genetrap ES cell libraries have been compiled and over 45,000 are available publicly (see http://www.sanger.ac.uk/PostGenomics/genetrap/) covering more than one-third of the genes in the mouse genome (Skarnes et al. 2004). These genetrap libraries are particularly useful to investigators who are not mouse geneticists but wish to test a QTL candidate of interest.

Classical complementation is another powerful method for gene testing that has been made more practical by genomics. Correlation of the QTL with the presence of a single BAC would greatly refine the search for the causal mutation. BACs have become broadly utilized in functional mouse genomics and are the vector of choice for large-scale sequencing because of their superior stability and DNA yield compared with other large-insert vectors (Haldi et al. 1994). BACs also broaden the spectrum of candidate gene alleles available for testing, both through the availability of libraries from different strains and through recent advances in BAC engineering that may facilitate the use of BAC transgenic mice to test for strain-specific dominant effects. The limitations of this approach are the availability of the proper BAC library and the considerably large interval that most QTLs currently cover. BAC libraries from many strains are now publicly available, with other strain libraries nearing completion (http://bacpac.chori.org/). Mapping of these BACs on the mouse genome will greatly facilitate their utilization.

SNP data from other species make possible a new, powerful candidate gene test, the cross-species comparison. These SNPs make it possible to ask the following question: Do different alleles of the human ortholog (or rat, platypus, etc.) of a candidate mouse gene correlate with phenotypic differences caused by the QTL? This procedure has been successfully used by Wang and colleagues (Wang et al. 2005), who identified Tnfsf4 as a candidate for diet-induced atherosclerosis based on position, expression, and sequence polymorphism. In addition, the authors found a SNP in the human TNFSF4 gene that was associated with elevated risk of coronary artery disease and myocardial infarction. This approach provides new and independent populations in which to test the original hypothesis. This can be important given the limited polymorphism present in the inbred mouse strains. It further points to the conservation of sequence (though probably not individual polymorphisms) across millions of years of evolutionary distance as a very persuasive demonstration of normal and variant function. However, the failure to recapitulate the effects of a rodent gene in humans (or any other species) is more difficult to interpret, as the results could be attributed to any number of complicating genetic or environmental factors. There is also concern about the low reproducibility of human association studies (Ioannidis 2005). Still, such cross-species comparisons provide powerful evidence of causation when successful and may yet prove to be important.

Mapping the location of the QTL is an extremely important first step in QTL identification. While it is clear that location is not sufficient to identify a QTL gene, it is an integral part of the investigation and substantially limits the candidate gene pool. Given the rapid increases in high-throughput evaluation of genes and functional databases and the difficulties in studying genes of small effect in heterogeneous segregating populations, is it possible to robustly identify candidate genes without positional information? Our hope is that approaches such as genetic transcription networks will soon be refined in a manner that is sufficient to “guess” what genes may be involved in a physiological function. The insertion and/or the deletion of these same candidates could then lead to their identification as QTLs that influence function. For example, Soriano and colleagues demonstrated a position-independent approach to identify genetrap clones mutant in genes important in PDGF signaling (Chen et al. 2004). Such position-independent strategies will certainly be used in the future especially when expression profiling becomes reproducible and available for a large number of tissues and strains.

Conclusions

SNP mappings of 15 strains of mice (sponsored by the Center for Rodent Genetics) will only be the beginning. The mouse investigator will be able to compare and contrast different chromosomal regions to identify new genes influencing the fate of QTLs affecting minor functions. It is likely that the contributions from each strain will focus on narrow regions for identification of these QTLs, leading to ∼100-500 genes with a known tissue-determining effect. Microarray analyses of gene activity should then narrow these genes down to a short list, functionally distinguishing them from their neighbors. Thus, QTL identifications should become easier tasks and ones that are easily accomplished by the amplification of our current mouse and molecular genetic techniques.

Footnotes

  • Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3841405.

  • 1 Corresponding author. E-mail flaherty{at}wadsworth.org; fax (518) 880-1388.
