Copy number variation: New insights in genome diversity

  1. Jennifer L. Freeman1,2,
  2. George H. Perry1,3,
  3. Lars Feuk4,
  4. Richard Redon5,
  5. Steven A. McCarroll6,
  6. David M. Altshuler6,
  7. Hiroyuki Aburatani7,
  8. Keith W. Jones8,
  9. Chris Tyler-Smith5,
  10. Matthew E. Hurles5,
  11. Nigel P. Carter5,
  12. Stephen W. Scherer4, and
  13. Charles Lee1,2,9
  1. 1 Department of Pathology, Brigham and Women’s Hospital, Boston, Massachusetts 02115, USA;
  2. 2 Harvard Medical School, Boston, Massachusetts 02115, USA;
  3. 3 School of Human Evolution and Social Change, Arizona State University, Tempe, Arizona 85287, USA;
  4. 4 Department of Genetics and Genomic Biology, The Hospital for Sick Children, Toronto, Ontario M5G 1X8, Canada;
  5. 5 The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom;
  6. 6 Program in Medical and Population Genetics, Broad Institute of Harvard University and Massachusetts Institute of Technology, Cambridge, Massachusetts 02141, USA;
  7. 7 Genome Science Division, University of Tokyo, Tokyo, 153-8904 Japan;
  8. 8 Molecular Genetics Division, Affymetrix, Inc., Santa Clara, California 95051, USA

Abstract

DNA copy number variation has long been associated with specific chromosomal rearrangements and genomic disorders, but its ubiquity in mammalian genomes was not fully realized until recently. Although our understanding of the extent of this variation is still developing, it seems likely that, at least in humans, copy number variants (CNVs) account for a substantial amount of genetic variation. Since many CNVs include genes that result in differential levels of gene expression, CNVs may account for a significant proportion of normal phenotypic variation. Current efforts are directed toward a more comprehensive cataloging and characterization of CNVs that will provide the basis for determining how genomic diversity impacts biological function, evolution, and common human diseases.

Genomic variability can be present in many forms, including single nucleotide polymorphisms (SNPs), variable number of tandem repeats (VNTRs; e.g., mini- and microsatellites), presence/absence of transposable elements (e.g., Alu elements), and structural alterations (e.g., deletions, duplications, and inversions). Until recently, SNPs were thought to be the predominant form of genomic variation and to account for much normal phenotypic variation (International SNP Map Working Group 2001; The International HapMap Consortium 2005). However, two groups recently reported the widespread presence of copy number variation in normal individuals (Iafrate et al. 2004; Sebat et al. 2004), and these observations have since been replicated and expanded (e.g., de Vries et al. 2005; Schoumans et al. 2005; Sharp et al. 2005; Tuzun et al. 2005; Tyson et al. 2005; Conrad et al. 2006; Hinds et al. 2006; McCarroll et al. 2006; Repping et al. 2006). With this accumulation of information, it now seems appropriate to review our current understanding of copy number variation and its significance in human phenotypic variation (including disease resistance and susceptibility) and to discuss possible future directions for studies in this field.

CNVs in normal individuals

For this review, we prefer to use the term “variant” instead of “polymorphism” when referring to copy number changes. The frequencies of most copy number variants (CNVs) have not yet been well defined in human populations, and “polymorphism” is a term that is usually reserved for genetic variants that have a minor allele frequency of ≥1% in a given population. In our preferred nomenclature, a CNV represents a copy number change involving a DNA fragment that is ∼1 kilobase (kb) or larger (Feuk et al. 2006a). At a recent workshop (“The effects of genomic structural variation on gene expression and human disease,” The Wellcome Trust Sanger Institute, Hinxton, UK; November 27–28, 2005), it was suggested that CNVs not include those variants that arise from the insertion/deletion of transposable elements (e.g., ∼6-kb KpnI repeats) to minimize the complexity of future CNV analyses. The term CNV therefore encompasses previously introduced terms such as large-scale copy number variants (LCVs; Iafrate et al. 2004), copy number polymorphisms (CNPs; Sebat et al. 2004), and intermediate-sized variants (ISVs; Tuzun et al. 2005), but not retroposon insertions. Table 1 lists some of the terminology currently used in the CNV literature.

Table 1.

Selected terms in the CNV literature

Large duplications and deletions have been known for some time to be present within the human genome, initially from cytogenetic observations (e.g., Jacobs et al. 1959, 1978, 1992; Edwards et al. 1960; Patau et al. 1960; Coco and Penchaszadeh 1982), but their frequency was presumed to be low and for the most part directly related either to tandemly repeated genes or to specific genetic disorders (e.g., Lupski 1998; Ji et al. 2000; Inoue and Lupski 2002; Stankiewicz and Lupski 2002). In addition, they were often localized to repeat-rich regions such as telomeres, centromeres, and heterochromatin (e.g., Giglio et al. 2001).

A limited number of studies reported the presence of specific large duplications and deletions that were not apparently related to disease (e.g., Barber et al. 1998; Engelen et al. 2000). For example, a deleted region originally thought to be associated with ovarian cancer was later found to also be present in healthy individuals (Lin et al. 2000). Large duplication and deletion variants of portions of gene families/clusters, including olfactory receptors (Trask et al. 1998), major histocompatibility complex (MHC) class III genes (Ghanem et al. 1988), the β-defensin antimicrobial gene cluster (Hollox et al. 2003), and genes at the amylase locus (Groot et al. 1991), were also reported. Moreover, duplications and/or deletions were identified at a golgin-related gene downstream of the promyelocytic leukemia gene (Gilles et al. 2000) and the α7-nicotinic acetylcholine receptor gene (Riley et al. 2002). These and other studies provided the initial evidence that large duplication and deletion events, even if they contained genes, did not necessarily result in the presentation of early onset, highly penetrant genomic disorders or diseases (Buckland 2003).

Recent advancements in technology have facilitated a shift from locus-specific studies to genome-wide assessments of genetic variation. In 2004, two groups independently described the widespread presence of CNVs in the genomes of healthy people with no obvious genetic disorders (Iafrate et al. 2004; Sebat et al. 2004). Iafrate et al. (2004) used a bacterial artificial chromosome (BAC)-based array, with clones chosen at ∼1-megabase (Mb) intervals throughout the human genome, together with a technique called array-based comparative genomic hybridization (array CGH; Solinas-Toldo et al. 1997; Pinkel et al. 1998). In this study, the investigators identified >200 loci that contained genomic imbalances among 39 unrelated, healthy individuals representing five populations. The most commonly observed CNV encompassed the amylase locus at chromosome region 1p13.3, and high-resolution fiber FISH analyses showed that this region varied between 150 kb and 425 kb among different individuals. Sebat et al. (2004) amplified BglII-fragments from the genomes of 20 individuals representing nine populations and hybridized these DNAs to a microarray platform containing oligonucleotides spaced at 35-kb intervals throughout the genome (ROMA technique; Lucito et al. 2003). In this study, 76 CNVs with a median size of 222 kb and an average size of 465 kb were identified when a CNV cut-off criterion of three consecutive oligonucleotides was used. On average, 11 CNVs (Sebat et al. 2004) or 12.4 CNVs (Iafrate et al. 2004) were detected in each person with these array CGH assays. Initial commentaries (Carter 2004; Cheung 2004; Buckley et al. 2005) noted the low overlap that appeared to exist between the data sets of these two studies. However, when the CNVs were mapped onto the same build of the human genome, more overlap of the two data sets could be appreciated with CNVs of larger size and frequency (Table 2). Because of the small number of individuals examined and the limited resolution of both array platforms, the number of CNVs identified by these two studies was probably an underestimate of the true number of CNVs in humans (Buckley et al. 2005).

Table 2.

Comparison of CNVs identified by Sebat et al. (2004) to Iafrate et al. (2004) based on the number of individuals and the size of the CNVs

Following these two initial studies, Tuzun et al. (2005) used an in silico strategy to compare two human genomes at the DNA sequence level. One genome was represented by the reference human genome sequence (National Center for Biotechnology Information, NCBI build 35). Approximately 67% of this reference sequence originated from a single DNA library (the RPCI-11 BAC library) derived from a single anonymous male. The second genome was in the form of pairs of end-sequence reads from >500,000 fosmid clones of the G248 DNA library. This DNA library was derived from an anonymous North American female of European ancestry. Since the sizes of fosmid clones are tightly regulated at ∼40 kb, the investigators reasoned that pairs of end sequences for a given fosmid clone should align to the reference sequence with ∼40-kb spacing. Significant deviation of the alignment spacing (i.e., <32 kb or >48 kb) would suggest the presence of a CNV at that locus. Using this criterion, Tuzun et al. (2005) identified 241 CNVs, with most in the size range of 8 kb to 40 kb. More than 80% of these CNVs had not been identified previously, and most were below the expected resolution of the array platforms used in the initial CNV discovery studies (Iafrate et al. 2004; Sebat et al. 2004).

This in silico approach has an added advantage over array-based CNV discovery studies in being capable of detecting other structural genomic variants, namely inversions. These would be detected by consistent discrepancies in the aligned orientation of multiple paired end sequences. In this manner, the investigators identified 56 inversion breakpoints in addition to the 241 CNVs. Together, this suggested the presence of almost 300 putative sites of structural variation when comparing the genomes of two individuals by this method.
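The paired-end logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the pipeline used by Tuzun et al. (2005): the `EndPair` record, the inward-facing strand convention, and the label names are assumptions introduced here, and only the 32-kb and 48-kb thresholds come from the text.

```python
from collections import Counter
from dataclasses import dataclass

EXPECTED_MIN = 32_000  # lower bound on concordant end spacing (bp), from the text
EXPECTED_MAX = 48_000  # upper bound on concordant end spacing (bp), from the text


@dataclass
class EndPair:
    """Alignment of one fosmid's two end reads to the reference (hypothetical record)."""
    clone_id: str
    left_pos: int       # reference coordinate of the leftmost end read
    right_pos: int      # reference coordinate of the rightmost end read
    left_strand: str    # '+' or '-'
    right_strand: str   # '+' or '-'


def classify(pair: EndPair) -> str:
    """Label a single end pair as concordant or as one kind of discordance."""
    # Assumed convention: a concordant pair faces inward ('+' on the left, '-' on the right).
    if (pair.left_strand, pair.right_strand) != ("+", "-"):
        return "orientation_discordant"  # candidate inversion breakpoint
    span = pair.right_pos - pair.left_pos
    if span < EXPECTED_MIN:
        # Ends map too close together: the clone carries sequence absent from the reference.
        return "insertion_candidate"
    if span > EXPECTED_MAX:
        # Ends map too far apart: the clone lacks sequence present in the reference.
        return "deletion_candidate"
    return "concordant"


def summarize(pairs):
    """Count how many end pairs fall into each class."""
    return Counter(classify(p) for p in pairs)
```

A single discordant clone is weak evidence; in practice a site would be reported only when several independent clones show the same discrepancy, in line with the "multiple paired end sequences" requirement mentioned above.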

One consistent feature of CNVs that was noted in these three CNV studies (Iafrate et al. 2004; Sebat et al. 2004; Tuzun et al. 2005) was the preponderance of CNVs near known segmental duplications, significantly more often than expected by chance alone. Segmental duplications (also referred to by some as low copy repeats or LCRs; Lupski 1998) can be defined as duplicated DNA fragments that are >1 kb and found either on the same chromosome or on different, nonhomologous chromosomes (Bailey et al. 2002; Lupski and Stankiewicz 2005). Segmental duplications need not vary in copy number, but if they do vary among individuals, they may also be considered CNVs (Feuk et al. 2006a).

Since a significant portion of CNVs was identified in regions containing known segmental duplications, Sharp et al. (2005) reasoned that a custom array, containing DNA clones targeting these known duplicated regions of the human genome (which are also speculated to serve as potential rearrangement hotspots), might be useful in the rapid identification of CNVs. Forty-seven unrelated individuals representing seven different populations were assessed with this targeted array platform, resulting in the identification of 119 CNVs, of which only 39% had been described previously. Moreover, Sharp et al. (2005) concluded that the sharing of CNVs among several populations meant that these specific genomic imbalances either predated the dispersal of modern humans out of Africa or recurred independently in different populations.

Haploinsufficiency is a condition that arises when one copy of a dosage-sensitive gene has been deleted, resulting in developmental delay or impairment. Likewise, the term haplosufficiency could be used to describe genomic deletions that do not result in developmental delay or impairment and can be found in healthy and apparently normal individuals. Recently, three CNV discovery studies that specifically interrogated human genomes for such deletion variants were published concurrently (Conrad et al. 2006; Hinds et al. 2006; McCarroll et al. 2006). Two of these studies (Conrad et al. 2006; McCarroll et al. 2006) relied on available SNP data generated from the International HapMap Project (The International HapMap Consortium 2005). The International HapMap Project was established to study human genetic variation in a cohort of 269 individuals from four populations (The International HapMap Consortium 2003). The first population sample consists of 90 individuals from 30 parent–offspring trios from a U.S. population (in Utah) with Northern and Western European ancestry collected by the Centre d’Etude du Polymorphisme Humain (CEPH). The second population sample is from the Yoruban people of Ibadan, Nigeria, also consisting of 90 individuals from 30 parent–offspring trios. The third population sample is 45 unrelated Han Chinese from Beijing, China, and the fourth population sample consists of 44 unrelated Japanese from Tokyo, Japan. Phase I of the HapMap Project provided a SNP genotype at ∼5-kb resolution in each of these 269 samples studied for a total of 1.2 million SNPs (The International HapMap Consortium 2005). Phase II has now genotyped an additional 4.6 million SNPs to produce a current total of 5.8 million SNPs (http://www.hapmap.org).

Since SNP data are abundant and available at high spatial resolution across the human genome (The International HapMap Consortium 2005), Conrad et al. (2006) and McCarroll et al. (2006) reasoned that these SNP data might be used to discover underlying CNVs, if the underlying CNVs affected the results of SNP genotyping assays. McCarroll et al. (2006) hypothesized that deletion variants could leave at least three kinds of “footprints” in SNP data: (1) the identification of a run of null genotypes in a given individual, (2) the identification of contiguous genomic regions with SNP allele frequencies that deviated from expected Hardy-Weinberg equilibrium ratios, and (3) the recognition of runs of SNP genotyping results that did not fit expected Mendelian inheritance patterns in parent–offspring trios. In total, McCarroll et al. (2006) detected 541 deletion variants, ranging in size from 1 kb to 745 kb. Of the 541 deletions detected, 120 were observed as homozygous deletions (i.e., both copies of the genomic region were absent) in multiple, unrelated individuals. Ten of these homozygous deletions were relatively common and removed one or more exons of genes often involved in activities such as steroid metabolism, olfaction, and drug metabolism.
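The three kinds of footprint can be illustrated with a minimal Python sketch. It is not the method of McCarroll et al. (2006) or Conrad et al. (2006): the genotype encoding (two-character strings such as "AG", with `None` for a failed call), the function names, and the use of a simple chi-square statistic for Hardy-Weinberg deviation are all assumptions made here for illustration.

```python
from itertools import product

NULL = None  # a failed (missing) genotype call


def mendelian_consistent(father, mother, child):
    """True if the child's genotype can be built from one allele of each parent."""
    if NULL in (father, mother, child):
        return True  # missing calls carry no Mendelian information in this sketch
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


def longest_run(flags):
    """Length of the longest run of consecutive True values along the chromosome."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for deviation from Hardy-Weinberg proportions at one SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# A father who is hemizygous (A/deletion) is called "AA"; a child who inherits the
# deletion from him and G from the mother is called "GG", which looks non-Mendelian.
trio_calls = [("AA", "GG", "GG"), ("CC", "TT", "TT"), ("AG", "AG", "AA")]
flags = [not mendelian_consistent(f, m, c) for f, m, c in trio_calls]
```

Here `longest_run(flags)` gives the number of consecutive SNPs showing Mendelian inconsistency, and the same helper applied to a list of `genotype is None` flags measures a run of null genotypes. A hemizygous deletion also produces an apparent deficit of heterozygotes, which raises the Hardy-Weinberg statistic.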

Conrad et al. (2006) focused exclusively on Mendelian inheritance inconsistencies. Numerous deletion variants (586) were identified, ranging from 300 bp to 1.2 Mb in size. Conrad et al. (2006) reported that the deletion CNV regions identified were relatively gene-poor, implying that many gene-containing deletions were subject to purifying selection. Despite this genome-wide trend, many individual genes are nonetheless affected by deletions. They found 92 genes that were completely deleted and another 109 genes that had portions of their coding sequences deleted (though the majority of these deletions were observed in only one trio and therefore may represent rare variants).

Of the 326 deletions that McCarroll et al. (2006) identified only from Mendelian inconsistencies, the overlap with the Conrad et al. (2006) data set was only 61.7% (201/326) (Fig. 1), which reflects, in part, the fact that Conrad et al. (2006) and McCarroll et al. (2006) used different criteria for defining Mendelian inconsistency (i.e., runs of genotypes that included at least two Mendelian inconsistencies and no heterozygous genotypes in single parent–offspring pairs versus runs of SNPs that showed similar patterns of Mendelian inconsistency across an entire HapMap population sample, respectively). Part of the incomplete overlap of these data sets may also be attributed to the estimated 15% false-positive rate of the two studies, based on confirmation studies on ∼100 loci using independent experimental approaches (Conrad et al. 2006; McCarroll et al. 2006).
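A comparison of this kind reduces to counting how many intervals in one call set are matched in the other. The sketch below assumes that "overlap" means sharing at least one base and that calls are stored as `(chromosome, start, end)` tuples with half-open coordinates; neither assumption comes from the studies themselves, and the example coordinates are invented.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlap_fraction(query, target):
    """Return (hits, total, fraction) of query intervals overlapping any target interval."""
    hits = sum(1 for q in query if any(overlaps(q, t) for t in target))
    return hits, len(query), hits / len(query)


# Hypothetical call sets, for illustration only.
set_a = [("chr1", 1_000, 5_000), ("chr1", 20_000, 25_000), ("chr2", 1_000, 2_000)]
set_b = [("chr1", 4_000, 6_000), ("chr2", 5_000, 9_000)]
hits, total, fraction = overlap_fraction(set_a, set_b)

# The figure quoted in the text follows from the reported counts.
reported_percent = round(100 * 201 / 326, 1)
```

The quadratic scan is adequate for a few hundred calls per study; larger call sets would normally be sorted or indexed first.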

Figure 1.

Comparison of overlapping CNVs identified by Conrad et al. (2006) and McCarroll et al. (2006). Conrad et al. (2006) identified a total of 586 deletions based on deviations from expected Mendelian inheritance patterns. McCarroll et al. (2006) identified deviations from Mendelian expectations in addition to null genotypes and deviations from Hardy-Weinberg equilibrium to identify a total of 541 deletions. When overlapping data were compared: (1) 139 deletions were identified only by Mendelian inheritance inconsistency in both studies, (2) 62 deletions were identified by Mendelian inheritance inconsistency and null genotypes by McCarroll et al. (2006) and by Mendelian inheritance inconsistency by Conrad et al. (2006), and (3) four deletions were detected only by null genotypes by McCarroll et al. (2006) but by Mendelian inheritance inconsistency by Conrad et al. (2006).

In the third study specifically identifying deletion variants, Hinds et al. (2006) hybridized DNA samples, from 24 unrelated individuals in a polymorphism discovery resource, to a high-density oligonucleotide array. This resulted in the identification of 215 potential deletion variants ranging from 70 bp to 10 kb. A subset of 100 PCR-confirmed deletions was further characterized, with 41 of the deletions found to be present among the 24 individuals with an allelic frequency of ≥10%. Forty-three deletions overlapped transcripts, and two deletions spanned exons. The deletions were then typed in a sample of 71 individuals who had previously been genotyped for ∼1.6 million genome-wide SNPs (Hinds et al. 2005), enabling comparison of the two data sets. The common deletions were found to be in linkage disequilibrium (LD: nonrandom pattern of alleles at different loci found together, more or less often than expected based on their frequencies) with surrounding SNPs, and the investigators therefore concluded that deletion variants and SNPs may often share similar evolutionary histories. This finding was similar to an observation made by McCarroll et al. (2006) in which many common deletion variants were in LD with nearby SNPs.
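Linkage disequilibrium between a deletion and a nearby SNP can be quantified with the standard coefficients D and r². The following sketch assumes that phased haplotypes are available and that each locus is coded 0/1 (for example, 1 for the deleted allele and 1 for a chosen SNP allele); the function name and the input format are hypothetical, and phasing itself is outside the scope of the example.

```python
def ld_stats(haplotypes):
    """Compute (D, r2) from phased haplotypes given as (allele_a, allele_b) pairs coded 0/1."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2
```

An r² near 1 means that the SNP allele almost perfectly predicts the presence of the deletion, which is the situation in which a genotyped SNP can serve as a proxy for the deletion; an r² near 0 means the two vary independently.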

Clearly, every CNV discovery study has its own bias toward specific types and sizes of CNVs. For example, although the fine-scale approach of Hinds et al. (2006) was capable of detecting deletions of a wide variety of sizes, their analysis avoided repetitive regions (e.g., segmental duplications) that may be more likely to be associated with larger size CNVs (additional discussion below). Currently, the average size of all CNVs cataloged in the Database of Genomic Variants (http://projects.tcag.ca/variation) is ∼118 kb, but the median size is ∼18 kb. This discrepancy in mean and median CNV sizes may be due in part to the fact that more than half of the CNV entries now originate from the three recent deletion studies (Conrad et al. 2006; Hinds et al. 2006; McCarroll et al. 2006), which primarily report smaller CNVs; the majority being <10 kb (Eichler 2006). For CNVs detected by lower-resolution, BAC array-based methods, it is unclear what portion of the CNV-containing clone actually varies in copy number. With BAC array-based CGH methods, a BAC clone that shows copy number variation could entirely encompass a smaller CNV, overlap a CNV, or be totally within a CNV that is actually larger than the BAC clone itself. Because of this ambiguity, the size of the entire BAC clone is used in lieu of the actual size of the CNV.

One could speculate that larger CNVs (especially deletion variants) may be subject to increased selection pressures. Along with differences in mutation rates, this could affect the overall size distribution of human CNVs. Furthermore, the possibility that larger CNVs tend to represent multi-copy duplications is consistent with earlier observations that large segmental duplications are more likely to be tolerated by a genome than are deletions of similar sizes (i.e., >100 kb) (Lindsley et al. 1972; Brewer et al. 1999).

Thus, it appears from recent CNV studies that CNVs are a substantial source of genomic variation among humans. Currently, 1237 CNVs covering an estimated 143 Mb of genomic sequence have been identified (http://projects.tcag.ca/variation; http://paralogy.gs.washington.edu/structuralvariation; Nadeau and Lee 2006). Although it is difficult to compare such different data sets directly, the proportion of nucleotides that differ in copy number between two haploid genomes may be at least as large as the proportion that differs by SNPs. However, one must bear in mind that, for most studies, only a fraction of the putative CNVs have actually been validated by alternate methods or by their presence in multiple, unrelated individuals, and therefore the true number of CNVs in humans is likely to be less than the sum of the data currently being published.

Potential mechanisms of CNV formation

CNVs often occur in regions reported to contain, or be flanked by, large homologous repeats or segmental duplications (Fig. 2; Fredman et al. 2004; Iafrate et al. 2004; Sharp et al. 2005; Tuzun et al. 2005). Segmental duplications could arise by tandem repetition of a DNA segment followed by subsequent rearrangements that place the duplicated copies at different chromosomal loci. Alternatively, segmental duplications could arise via a duplicative transposition-like process: copying a genomic fragment while transposing it from one location to another (Eichler 2001).

Figure 2.

Copy number variation is associated with segmental duplications on chromosome 17. One hundred DNA samples from the HapMap collection were analyzed by CGH on a whole-genome tiling path microarray composed of 27,000 large-insert clones. The coverage of chromosome 17 by the array is displayed in blue (top panel). (Green bars) Frequencies of DNA gains, (red bars) frequencies of DNA losses. Gene density (blue) and presence of segmental duplications along chromosome 17 (orange) are reported in bottom panels. (Black arrows) Hotspots of DNA copy number variation along the chromosome, which all occur in regions containing or flanked by blocks of segmental duplications.

CNVs that are associated with segmental duplications may be susceptible to structural chromosomal rearrangements via non-allelic homologous recombination (NAHR) mechanisms (Lupski 1998). NAHR is a process (Fig. 3) whereby segmental duplications on the same chromosome can facilitate copy number changes of the segmentally duplicated regions along with intervening sequences (Inoue and Lupski 2002). In addition to the formation of CNVs in normal individuals, NAHR may also result in large structural polymorphisms and chromosomal rearrangements that directly lead to genomic instability or to early onset, highly penetrant disorders (Lupski 1998; Ji et al. 2000; Bailey et al. 2002, 2004; Stankiewicz and Lupski 2002; Scherer et al. 2003; Eichler et al. 2004; Shaw and Lupski 2004; Lupski and Stankiewicz 2005).

Figure 3.

Different classes of mutation operating in the human genome. The range of mutation rates and size of mutated locus are plotted for each class of mutation. (Green highlights) Mutation processes associated with structural variation. On rare occasions, minisatellite alleles can differ in size by >1 kb.

Not all CNVs, however, appear to be associated with segmental duplications. It is possible that subsets of CNVs, not associated with segmental duplications, may be formed or maintained by non-homology-based mutational mechanisms (Fig. 3; Shaw and Lupski 2004). Certain CNVs may be found to be associated with non-B DNA structures (DNA regions that differ in structure from the canonical right-handed B-helical duplex, including left-handed Z-DNA and cruciforms). Such DNA structures are believed to promote chromosomal rearrangements (Kurahashi and Emanuel 2001; Bacolla et al. 2004) and may also theoretically contribute to the genesis and maintenance of certain CNVs. Indeed, our understanding of the differential fragility of DNA sequences and mechanisms of non-homologous end-joining repair of double-strand breaks would be greatly improved by future large-scale sequencing efforts and definition of CNV breakpoints.

There may be a relationship between the size of a given CNV and its associated mutational mechanism(s). For example, data from at least two studies have shown that larger CNVs are more frequently associated with segmental duplications than are smaller CNVs (Fig. 4), although the effects of ascertainment biases remain unclear. In addition, there may be differential selection pressures exerted on deletion versus duplication events due to discrepancies in the way genomes tolerate gains and losses of genetic material. Nevertheless, it seems that among the smaller known CNVs, non-homology-driven mutational mechanisms may dominate.

Figure 4.

The positive correlation between size of CNVs and likelihood of association with segmental duplication. This correlation is noted by both the Conrad et al. (2006) and Tuzun et al. (2005) studies. The lower proportion of segmental duplication-associated CNVs in the Conrad et al. (2006) data relates to the greater difficulty in detecting CNVs in regions of segmental duplication when analyzing SNP genotyping data as opposed to fosmid end sequence mapping. The CNV size classes were chosen so as to obtain approximately equal numbers of CNVs in each class for the smaller data set.

Clinical implications and health

Large duplications and deletions have been known for some time to be related to the presentation of specific genetic disorders (Table 3), presumably as a result of copy number changes involving dosage-sensitive developmental genes. This has led to the establishment of genetic diagnostic tests for certain, well-characterized microdeletion and microduplication syndromes (e.g., Angelman syndrome, DiGeorge syndrome, Charcot-Marie-Tooth disease, etc.). If a de novo chromosomal aberration is recognized in a patient with a constitutional genetic abnormality (i.e., follow-up studies fail to reveal a similar chromosomal aberration in either of the two parents, and non-paternity has been excluded) and the aberration is not one of the dozen or so well known common chromosomal polymorphisms (e.g., inversion on chromosome 9; de la Chapelle et al. 1974; Lee 2005), the aberration is assumed to be the cause of the clinically recognized abnormal phenotype.

Table 3.

Examples of disorders caused by genomic imbalances and CNVs identified in regions associated with these disorders

In many ways, the gold standard for clinical cytogenetic testing still remains the GTG-banded karyotype, where a genome-wide analysis usually identifies chromosomal rearrangements/aberrations of 3–5 Mb and larger. However, with the advent of higher resolution, genome-wide assays (e.g., array-based CGH), many more subtle genomic aberrations are being discovered in patients referred for genetic testing. Along with this improved resolution of testing comes the difficulty of interpreting the increasing number of genomic imbalances identified with each sample. To assist with accurate clinical diagnostic interpretations of genome-wide, high-resolution array CGH testing, the Database of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources (DECIPHER, http://www.sanger.ac.uk/PostGenomics/decipher) has been established and is now comprehensively collecting array CGH data and corresponding clinical information from patients referred for genetic testing. The goal of this database is to help improve medical care while facilitating research on the genetic etiology of submicroscopic chromosomal imbalances.

Genomic imbalances that appear to be inherited from a phenotypically normal parent are usually considered to be clinically less significant (Shaw-Smith et al. 2004; de Vries et al. 2005; Tyson et al. 2005). This assumption may not always hold, however. Consider an example where an apparently healthy individual carries a certain copy number change along with other genetic variant(s) that compensate for that genomic imbalance. Another person having the same genomic imbalance may not have inherited the additional compensatory genetic variant(s), leading to a different and possibly clinical phenotype. Such scenarios underlie the growing lack of confidence in interpreting the clinical consequences of genomic imbalances. This is further exacerbated by the fact that genomic imbalances identified by array CGH represent cumulative and not allele-specific CNV values. Thus, true inheritance patterns of CNVs could be masked by array CGH results (Fig. 5). Clearly, accurate interpretations of CNV inheritance patterns will be greatly facilitated with the development of locus-specific and allele-specific quantitative assays for evaluating DNA copy number. Ultimately, clinical diagnostic interpretations should be based on a more holistic view of the genome whereby the assessment of the phenotypic consequences of an imbalance incorporates the genotype and state of all alleles of a given CNV, neighboring DNAs, and other influencing genomic regions (e.g., enhancers, repressors, etc.). Until such comprehensive information is available for each patient, caution should continue to be exercised when trying to interpret the inheritance and clinical significance of copy number variants. Some possible mechanisms by which the same CNV could have differential effects on phenotypic traits and gene expression have been recently reviewed by Feuk et al. (2006b).

Figure 5.

CNV inheritance patterns. To determine whether a CNV may be inherited or is a de novo event, trios including child, mother, and father are assessed. Currently, there is the potential for erroneous interpretation of the trio data since array-based CGH assays calculate copy number additively. For example, a mother who has a one-copy CNV on one chromosome and a three-copy CNV on the homologous chromosome (i.e., four total copies) would have no copy number difference when compared with a reference individual with four total copies (i.e., two copies on each homologous chromosome). In addition, a father who has two copies present on each homologous chromosome (i.e., four total copies) would also have no copy number difference when compared with the reference individual. If the child inherits the maternal chromosome containing three copies and the paternal chromosome containing two copies of the CNV, the child ends up with a total of five copies of the CNV. Upon comparison with the reference, the child could appear to have a de novo CNV. The development of locus-specific and allele-specific quantitative assays will aid in the interpretation of these CNV inheritance patterns.
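The arithmetic in this example can be checked with a short Python sketch. The per-chromosome copy numbers are those given in the caption; the function names and the tuple representation are assumptions introduced here, and the "allele-specific" check simply stands in for the kind of assay the text says is still needed.

```python
def array_cgh_difference(sample, reference):
    """Total-copy difference that an additive (non-allele-specific) assay would report."""
    return sum(sample) - sum(reference)


def explained_by_parents(child, mother, father):
    """True if the child's (maternal, paternal) allele copy numbers each occur in that parent."""
    maternal, paternal = child
    return maternal in mother and paternal in father


reference = (2, 2)   # two copies on each homolog, four in total
mother = (1, 3)      # one-copy and three-copy alleles, four in total
father = (2, 2)      # four in total
child = (3, 2)       # maternal three-copy allele plus paternal two-copy allele

mother_diff = array_cgh_difference(mother, reference)
father_diff = array_cgh_difference(father, reference)
child_diff = array_cgh_difference(child, reference)
inherited = explained_by_parents(child, mother, father)
```

Both parents show no difference from the reference while the child shows a gain of one copy, so the additive read-out alone suggests a de novo event, whereas the allele-level comparison shows that both of the child's alleles were inherited.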

CNVs that do not directly result in early onset, highly penetrant genomic disorders may consequently be considered to be neutral in function, only to be shown later to play a role in later onset genomic disorders or common diseases. Analyses of the functional attributes of currently known CNVs reveal a remarkable enrichment for genes that are relevant to molecular–environmental interactions and influence our response to specific environmental stimuli (Sebat et al. 2004; Tuzun et al. 2005; Feuk et al. 2006a; Nguyen et al. 2006). These include, but are not limited to, processes involving drug detoxification (e.g., glutathione-S-transferase, cytochrome P450 genes, and carboxylesterase gene families), immune response and inflammation (e.g., leukocyte immunoglobulin-like receptor, defensin, and APOBEC gene families), surface integrity (e.g., late epidermal cornified envelope and mucin gene families), and surface antigens (e.g., galectin, melanoma antigen gene, and rhesus blood group gene families). Likewise, some CNVs encompass genes that may contribute to interindividual variation in drug responses (Ouahchi et al. 2006), as well as in immune defense and disease resistance/susceptibility among humans. For example, interindividual and interpopulation differences in the copy number of the gene encoding CCL3L1, a human immunodeficiency virus-1 (HIV-1)–suppressive chemokine and ligand for the HIV coreceptor CCR5, were recently reported (Gonzalez et al. 2005). Individuals with a lower-than-average number of CCL3L1 copies had lower levels of CCR5-CCL3L1 complexes, leaving more CCR5 available for HIV entry and hence increasing their susceptibility to HIV/AIDS. Most recently, Aitman et al. (2006) discovered copy number variation of the Fcgr3 gene in rats, which predisposed those animals carrying fewer gene copies to develop a condition similar to glomerulonephritis in humans. Fcgr3 encodes a transmembrane receptor found on the cell surfaces of macrophages that, when activated, results in phagocytosis and cytotoxicity. The duplicated (paralogous) Fcgr3-rs gene appears to have an inhibitory effect on Fcgr3 such that loss of Fcgr3-rs leads to an increased immune response and, in some cases, possibly autoimmunity. The orthologous gene in humans varies in copy number from 0 to 4, and association studies revealed that a lower copy number of the Fcgr3 ortholog (FCGR3B) in humans is an independent risk factor predisposing those individuals to immunologically related glomerulonephritis.

One obvious way in which CNVs may contribute to human phenotypic diversity is by altering the transcriptional levels (and presumably the subsequent translational levels) of genes that vary in copy number. Such a correlation has already been demonstrated for certain CNV genes at the transcriptional (Hollox et al. 2003; Aldred et al. 2005; McCarroll et al. 2006) and translational (Gonzalez et al. 2005; Linzmeier and Ganz 2005) levels. Studies correlating mRNA and protein levels with genomic copy number of CNV genes should always consider that some CNVs may have phenotypic effects that are apparent only in certain tissues and/or stages of development. Experimental approaches may also be required to distinguish between the effects of CNVs themselves and any regulatory SNPs with which they may be in strong LD (e.g., Stranger et al. 2005).

CNVs in other species and evolution

Another aspect of CNVs that needs to be addressed is whether levels and patterns of copy number variation among humans are similar to those in non-human primates and other organisms. Widespread copy number variation has already been documented among inbred strains of laboratory mice (Li et al. 2004; Adams et al. 2005). Li et al. (2004) used a minimal-tiled, whole-genome array platform containing ∼19,000 mouse BAC clones from the RPCI-23 library (derived from a C57BL/6J strain mouse) to interrogate the genomes of 15 commonly used inbred mouse strains. In total, the investigators identified 346 BAC clones that showed copy number variation among the mouse strains tested when they used the C57BL/6J as a reference. Adams et al. (2005) used a 1-Mb mouse BAC-based array CGH platform to compare genomic DNA from a 129S5 mouse with that from a C57BL/6J mouse and identified a total of 112 CNVs (corresponding to 130 BAC clones of the 2803 clones on their array).

Li et al. (2004) found that ∼10% of the CNV-containing BAC clones identified were within 200 kb of known segmental duplications, similar to that observed in humans. This again suggests that NAHR may play a role in the genesis and evolution of specific subsets of CNVs. Interestingly, large-scale deletions may be more tolerated in mice than in humans, especially when the deletion encompasses gene desert regions (Nobrega et al. 2004). Li et al. (2004) also found that unsupervised hierarchical cluster analysis of CNV patterns for each mouse strain led to stratification of the strains in a manner comparable to their known evolutionary history.

It is interesting to speculate on the phenotypic effects of these CNV patterns in different mouse strains. The mouse has long been recognized as a valuable model system for genetic research of human diseases, and sophisticated genetic manipulation studies can provide critical insights into the function of the corresponding genes in humans. Along with SNPs, CNVs may contribute to phenotypic variation among mouse strains and explain why different strains of mice sometimes produce apparently contradicting phenotypes when the same gene is knocked out/mutated. By carefully correlating functional variation with specific CNVs or sets of CNVs in the mouse, it may be possible to begin extrapolating the phenotypic consequences of orthologous CNVs in other organisms, including humans.

Insights into the evolutionary properties of CNVs can be obtained from cross-species comparative studies. Nguyen and colleagues (2006) compared the genes within known human and mouse CNVs and determined that human CNVs were often associated with genes that have relatively elevated ratios of non-synonymous (amino-acid-changing) to synonymous substitution rates. This may be interpreted as evidence for positive selection on CNVs during the evolutionary history of modern humans. Alternatively, this pattern may also reflect relaxation of selection or the presence of higher levels of purifying selection against CNVs in other gene types and families.
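
For reference, the ratio discussed above is conventionally written as follows. The symbol ω and the threshold interpretation are standard conventions in molecular evolution rather than anything specified in the text, and Nguyen and colleagues (2006) report relatively elevated ratios, not necessarily ratios above 1.

```latex
% d_N: non-synonymous substitutions per non-synonymous site
% d_S: synonymous substitutions per synonymous site
\omega = \frac{d_N}{d_S}, \qquad
\begin{cases}
\omega < 1 & \text{consistent with purifying selection} \\
\omega \approx 1 & \text{consistent with neutral evolution or relaxed constraint} \\
\omega > 1 & \text{consistent with positive selection}
\end{cases}
```

An elevated but still sub-unity ω is therefore ambiguous on its own, which is why both positive selection and relaxed selection remain possible explanations.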

In an effort to understand the evolutionary history and significance of CNVs, chimpanzee (Pan troglodytes) CNV regions were recently identified and compared with human CNV regions (Perry et al. 2006). Using the same BAC array CGH platform that Iafrate et al. (2004) employed to identify >200 CNVs among 39 unrelated humans, Perry et al. (2006) identified 331 CNVs among the genomes of 20 wild-born western chimpanzees. Interestingly, 74 of the chimpanzee CNVs occurred in the same regions as known human CNVs, and many of these CNVs were frequent in both species. These loci were also enriched (>20-fold, compared with all clones on the array) for segmental duplications that are shared by both species’ genomes. From an evolutionary standpoint, this raises at least two issues. First, CNVs may be discovered in homologous regions of other closely related species, depending on when the ancestral segmental duplications in these regions arose. Second, if NAHR occurs regularly in these regions, the high intraspecific frequency of some CNVs may be the result of multiple recurrences within a species rather than a single ancestral duplication or deletion event followed by an increase in frequency.

Gene duplication is known to be an important long-term evolutionary force, and as suggested for different strains of mice, some lineage-specific copy number differences may contribute to the phenotypic differences among taxa, including those that distinguish humans from chimpanzees and bonobos (Pan paniscus) (Ohno 1970; Samonte and Eichler 2002; Locke et al. 2003; Shaw and Lupski 2004; Feuk et al. 2005; Newman et al. 2005; Goidts et al. 2006; Wilson et al. 2006). Two studies have presented data suggesting that the fixation rate of unique duplications and gene-containing duplications on the human lineage was elevated relative to that of the chimpanzee (Fortna et al. 2004; Cheng et al. 2005). Currently, it is unclear whether these results reflect experimental ascertainment biases, different duplication mutation rates, relaxed functional constraint, or human lineage-positive selection for duplications. Regardless, in their analyses, Cheng et al. (2005) and Newman et al. (2005) established an important correlation between differences in lineage-specific copy number and changes in gene expression, a relationship previously inferred in a study of gene expression differences between humans and chimpanzees (Khaitovich et al. 2004). Detailed experimental efforts (including the generation of accurate finished sequences for some regions of the chimpanzee genome) will be necessary to link the fixation of any human lineage-specific CNVs to significant events in our evolutionary history. Similar studies would likely also be valuable in understanding the evolution of any organism.

Toward a global CNV map of the human genome

An important long-term goal for copy number variation research is to establish a comprehensive atlas of CNVs in the human genome. Such an effort would include correlation to phenotypes, mutational and evolutionary aspects, and behavior with other genomic factors (e.g., epigenetic control, linkage disequilibrium, etc.). Clearly, there are multiple methods for CNV discovery, with advantages and disadvantages for each technique. For example, the fosmid paired end sequence comparison strategy has proven to be an excellent means for CNV discovery (Tuzun et al. 2005), but is limited by the availability of DNA sequence data. The National Human Genome Research Institute of the NIH recently announced its intention to establish DNA libraries from 48 of the HapMap individuals (http://www.genome.gov/18016538) for the purposes of end sequencing as many as 1 million clones from each DNA library for fosmid paired end sequence comparisons. Such work should provide a catalog of structural variants in these representative individuals, including CNVs and balanced rearrangements (e.g., inversions), as well as lead to rapid demarcation of boundaries of specific copy number changes and genomic alterations. However, this strategy may be best suited to identifying variants in the 8-kb to 40-kb range (since virtually no fosmids have inserts much larger than 40 kb), and it is unclear to what extent cloning artifacts and cloning biases lead to false-positive and false-negative results.

Array-based comparative genomic experiments (e.g., array CGH) have also been shown to be valuable for discovering CNVs. Advantages of array-based CGH approaches include cost effectiveness and rapid screening of numerous individuals with a given platform, but clearly the resolution is limited by the size and the number of elements placed on the array. However, higher resolution arrays are now being assembled that could be used in such CNV discovery studies, including tiling arrays and higher density oligonucleotide arrays (Ishkanian et al. 2004; Dhami et al. 2005; Selzer et al. 2005; Urban et al. 2006). Typical array CGH assays are unable to provide some of the allele-specific CNV information that can be deduced from fosmid paired end sequence comparison strategies, but the array CGH assays do have the potential to identify a larger size range of CNVs. Finally, array CGH assays do not provide data on absolute copy number of a given CNV since the copy number of that CNV is unknown in the reference sample being used in the CGH assay. Hence, a copy number loss detected by array CGH may represent a deletion in the test material or a multi-copy duplication that is simply present in more copies in the reference sample being used.
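
The ambiguity described above follows directly from the fact that a CGH signal is a ratio of test to reference. The sketch below is an idealized illustration only: the function name `expected_log2_ratio` and the copy numbers are assumptions chosen for the example, and real hybridization data are noisy rather than exact.

```python
import math

def expected_log2_ratio(test_copies, reference_copies):
    """Idealized log2(test/reference) signal for one locus; absolute copy number is not recoverable."""
    if test_copies <= 0 or reference_copies <= 0:
        raise ValueError("this sketch only handles loci present in both samples")
    return math.log2(test_copies / reference_copies)

# Scenario 1: heterozygous deletion in the test sample against a two-copy reference.
single_deletion = expected_log2_ratio(1, 2)

# Scenario 2: the test sample has the usual two copies, but the reference carries a
# multi-copy duplication (four copies) at the same locus.
reference_has_duplication = expected_log2_ratio(2, 4)

print(single_deletion, reference_has_duplication)
```

Both scenarios yield the same expected ratio, so an apparent copy number loss cannot be assigned to a deletion in the test sample without independent knowledge of the reference copy number.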

The Copy Number Variation Project, an international consortium including founding researchers from The Wellcome Trust Sanger Institute (Hinxton, United Kingdom), Hospital for Sick Children (Toronto), University of Tokyo (Tokyo), Affymetrix (Santa Clara, CA), and Harvard Medical School/Brigham and Women’s Hospital (Boston, MA) aims to discover and characterize CNVs in human populations (http://www.sanger.ac.uk/humgen/cnv) using different technologies (Fig. 6). The initial goal of the consortium is to comprehensively identify CNVs in the 269 samples used for the International HapMap Project. By using the HapMap individuals as a resource for CNV studies, the resulting CNV data can be integrated with available SNP data to broaden our understanding of the genetic variation within an individual and eventually permit subsequent detailed association studies of genetic variation and human diseases.

Figure 6.

Cross-platform identification and validation of CNVs. (A) Array CGH, (B) Nimblegen array, (C) Agilent array, and (D) Affymetrix 500k SNP array platforms all identifying copy number variants in the GM 15510 individual from whom the G248 fosmid DNA library, used in the Tuzun et al. (2005) study, was created.

As the pace of CNV discovery accelerates, we caution that there will be numerous false positives and false negatives, irrespective of the platform used, and a priority will be to minimize these. For example, many CNV discovery studies utilize material from established cell cultures. The use of cell cultures provides an ongoing resource, with the possibility for multiple replicate experiments, follow-up validation studies, and subsequent transcriptional and translational associations and functional assays. However, if CNVs are relatively unstable regions of the genome, it is possible that some small genomic imbalances will arise as a result of the cell culture transformation and propagation, and these genomic imbalances could be erroneously typed as endogenous CNVs. Hence, in CNV discovery studies, validation should be given high priority. Validation (with varying degrees of confidence) might include the observation of the same CNV among multiple individuals using one or more experimental methods (e.g., array CGH, ROMA, fosmid end sequencing, analyses of SNP data sets) or confirmation in the same individual with different technologies (e.g., quantitative PCR, direct sequencing, fluorescence in situ hybridization [FISH], and fiber FISH [Fig. 7]).

Figure 7.

Fiber FISH image of the Dystrophin locus. Copy number variation has been identified at the Dystrophin locus in phenotypically normal humans (Iafrate et al. 2004; Conrad et al. 2006). Deletions at this locus have also been associated with Duchenne muscular dystrophy. Cytogenetic tools such as fiber FISH can be used to study the fine-scale structure of CNVs. (A) The genome structure from the UCSC genome browser showing the location of the 1-kb intron (two intron probes, purple dots), the exon (exon 2, red dot), and the three-color fiber FISH image (RP4–769D20, green). The Dystrophin locus CNV overlaps the 5′ end of Dystrophin, including exon 2 (red) and much of intron 1 (first purple dot). (B) The genome structure from the UCSC Genome Browser and the location of the 1-kb intron (three intron probes, purple dots) including non-polymorphic flanking BACs (RP4–672M15, red; RP6–60B16, orange) and a four-color fiber FISH image (RP4–769D20, green).

Since it now seems likely that CNVs are responsible for extensive differences in interindividual expression of immunological and environmental sensor genes, there is great interest in the possibility that CNVs play a role in the etiology of common diseases such as diabetes, cancer, and heart disease. Their potential relevance to common diseases and complex disorders deserves full investigation and may be accomplished by large-scale studies comprehensively comparing the CNV patterns between carefully phenotyped cohorts. However, while some CNVs may be in LD with flanking SNPs and could be effectively assayed by SNP genotyping (Hinds et al. 2006; McCarroll et al. 2006; Newman et al. 2006), other CNVs may have recurred multiple times independently (Conrad et al. 2006; Perry et al. 2006; Repping et al. 2006) and may not be as readily detectable through SNP-based association studies. Moreover, SNP and STR genotyping within CNV regions may be affected by variations in copy number of the SNP and STR sites themselves. For example, a multisite variant may not be genotyped correctly and almost certainly would not be scored such that the true underlying nature of this variant could be recovered (Fredman et al. 2004). This is worth consideration when moving toward fine-scale linkage and association studies, as unexpected fluctuations of significant scores may occur near or within the CNV region itself. SNP and STR markers in heterozygously deleted CNV regions may be scored as homozygous for the remaining allele, while SNP and STR markers at multicopy CNVs may be scored as homozygous for the most common SNP or STR allele. Indeed, some typing methods have even made calls of SNPs in homozygously deleted regions. In each case, statistical power may be compromised in or near these regions during linkage and association analyses. 
In addition, CNV alterations at one or more sites in the genome may themselves introduce genetic and phenotypic heterogeneity, adding further levels of complexity in genetic disease studies. Direct and accurate genotyping of the CNVs themselves will help to resolve some of these issues, so assessment of suitable large-scale technologies to accomplish this should also be made a priority.
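
The effect of a heterozygous deletion on SNP scoring can be illustrated with a small sketch. It assumes a deliberately naive caller that cannot see deleted copies; the functions `scored_genotype` and `mendelian_consistent`, the allele labels, and the trio are all hypothetical, and the multicopy case mentioned above is not modeled.

```python
def scored_genotype(true_alleles):
    """Genotype a naive caller would report when deleted copies (None) are invisible."""
    present = sorted(a for a in true_alleles if a is not None)
    if not present:
        return None                  # homozygous deletion: simplified here to "no call"
    if len(present) == 1:
        return present[0] * 2        # the single remaining allele is scored as homozygous
    return "".join(present)

def mendelian_consistent(child, mother, father):
    """True if the child's call can be formed from one allele of each parental call."""
    expected = {"".join(sorted(m + f)) for m in mother for f in father}
    return child in expected

# True allelic states at one SNP inside a CNV region (None marks a deleted copy).
father_true = ("A", None)            # heterozygous deletion
mother_true = ("B", "B")
child_true = ("B", None)             # B from the mother, deleted copy from the father

father_call = scored_genotype(father_true)
mother_call = scored_genotype(mother_true)
child_call = scored_genotype(child_true)

print(father_call, mother_call, child_call)
print("Mendelian consistent:", mendelian_consistent(child_call, mother_call, father_call))
```

In this constructed trio the scored genotypes look like an inheritance error even though transmission was normal, which is one way such miscalls can distort linkage and association statistics in or near a CNV region.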

Conclusions

The recent discovery of widespread copy number variation in human and other mammalian genomes provides immediate insights into genetic variability among populations and lays a foundation for studies of the contribution of CNVs to evolution and disease. The published data are still largely rudimentary, but new developments in high-resolution scanning technologies will likely facilitate the establishment of comprehensive CNV maps. It is unlikely that any one technology alone will allow thorough identification of all classes of CNVs, so future work should prioritize verifying primary results, integrating multiple data sources, and assigning population frequencies to these genomic variants.

Acknowledgments

We thank Don Conrad (University of Chicago) for additional data included in Figure 4, Shona Hislop for Figure 5, Shumpei Ishikawa (University of Tokyo) for Figure 6D, John Iafrate (Massachusetts General Hospital, Boston) for Figure 7, and Nancy Voynow for critical reading of this manuscript. Some of the work presented here, from the Copy Number Variation Project, has been supported by grants from Genome Canada/Ontario Genomics Institute and the Canadian Institutes of Health Research (S.W.S.), the Department of Pathology at Brigham and Women’s Hospital and the Leukemia and Lymphoma Society (C.L.), and The Wellcome Trust (N.P.C., M.E.H., and C.T.-S.). S.W.S. is an International Scholar of the Howard Hughes Medical Institute.
