RESOURCE

BodyMap: A Collection of 3′ ESTs for Analysis of Human Gene Expression Information

Published November 1, 2000. Vol 10 Issue 11, pp. 1817-1827. https://doi.org/10.1101/gr.151500

Abstract

BodyMap is a collection of site-directed 3′ expressed sequence tags (ESTs) (gene signatures, GSs) that contains the transcript compositions of various human tissues and was the first systematic effort to acquire gene expression data. For the construction of BodyMap, cDNA libraries were made, preserving abundance information and histologic resolutions of tissue mRNAs. By sequencing 164,000 randomly selected clones, 88,587 GSs that represent chromosomally coded transcripts have been collected from 51 human organs and tissues. They were clustered into 18,722 independent 3′ termini from transcripts, and more than 3000 of these were not found among ESTs assembled in UniGene (Build 75). Assessment of the prevalence of polyadenylation signals and comparison with GenBank cDNAs indicated that there was no significant contamination by internally primed cDNAs or genomic fragments but that there was a relatively high incidence (12%) of alternative polyadenylation sites. We evaluated the sensitivity and resolution of expression information in BodyMap by in silico Northern hybridization and selection of tissue-specific gene probes. BodyMap is a unique resource for estimation of the absolute abundance of transcripts and selection of gene probes for efficient hybridization-based gene expression profiling. [BodyMap data are available at http://bodymap.ims.u-tokyo.ac.jp.]


In the early phase of its development, the expressed sequence tag (EST) collection (Adams et al. 1993, 1995) primarily served as a catalog to be screened for clones of interest by sequence homology. In the next phase, gene coverage was pursued (Aaronson et al. 1996; Williamson 1999) by using normalized libraries and/or highly complex sources (Soares et al. 1994; Hillier et al. 1996) to use the entries as markers to create a transcript map of the human genome, after clustering redundantly accumulated ESTs into gene units (Schuler et al. 1996). As genome sequencing efforts progress, ESTs have been used for exon identification (Dunham et al. 1999; Hattori et al. 2000), and they are being mapped and organized in the framework of genome sequence at a resolution of single nucleotides. Progress in the integration of ESTs into the genomic sequence will make EST data more of an expression of gene records rather than merely a pool of nucleotide sequences. Reflecting this trend, the major EST collection projects have shifted emphasis from efficiency of identifying novel sequences to meaningful source selection, such as coverage of a majority of cancer types (Strausberg et al. 2000).

BodyMap is a collection of site-directed 3′ ESTs (gene signatures, GSs) designed as an anatomical database of human gene expression in which sequences are used as identifiers (Okubo et al. 1992). Construction of BodyMap began in 1991 (Okubo et al. 1991) and representative human tissues and organs have been incorporated. During the collection of GSs, nonstructural information about the mRNA, including transcript abundance and anatomical distribution, was preserved. The libraries were constructed from well-characterized sources by using methods that minimize the differences in cloning efficiencies among transcripts, and libraries were never amplified before sequencing (Okubo et al. 1991). Accordingly, BodyMap has characteristics distinct from those of other public EST data sets, which were generated as sequence collections at the expense of expression information (Bonaldo et al. 1996). BodyMap has been used in the isolation and characterization of tissue-specific transcripts (Nishida et al. 1996; Ohno et al. 1996; Maeda et al. 1997; Shimizu-Matsumoto et al. 1997) and in disease gene identification (Irvine et al. 1997; Nishida et al. 1997). Here we describe the structure and features of 88,587 GSs from human tissues collected in BodyMap.

RESULTS

Sources and Library Construction

The numbers of informative 3′ site-directed ESTs representing chromosomally coded genes are summarized in Table 1. We refer to these 3′ ESTs, covering restricted 3′ ends in the sense direction, as gene signatures (GSs) (Okubo et al. 1992). Sources were selected to cover the most representative tissues and cell types. Emphasis was placed on pure connective tissues and epithelial cells, which are underrepresented in dbEST. In every case, tissue preparation was performed carefully, sometimes by microscopy, to minimize contamination by other cell types. For example, human epithelial cells were prepared by careful isolation of a monolayer or layers of cells free from visible contamination by connective tissues and blood cells (Ohnishi et al. 1999). As a result, the sequence of the immunoglobulin λ chain transcript, which was found in 1% (11/870) of clones from colonic mucosa having a thin lining of loose connective tissue (lamina propria), was not identified in 20,440 clones from purified epithelium. Because of the elaborate manipulation steps, libraries were sometimes constructed by direct priming of less than a microgram of total RNA. Nevertheless, contamination by ribosomal RNA was very low (0.26%), probably because of the high specificity of first-strand synthesis with a low concentration of vector primer.

Table 1.

The Most Abundant Transcripts in Human Tissues

B01. HL60 857 | C08. Aortic media 1002 | N01. Retina 877 | X01. HepG2 740
Ribosomal protein S8 X67247 24 | Elastin M17282 155 | Opsin K02281 14 | EF-1α X16869 17
Ribosomal protein L9 U09953 22 | Osteonectin J03040 27 | Na/K ATPase β2 D87330 13 | Albumin L00133 17
Ribosomal protein L23 X53777 17 | Ribosomal protein L21 X89401 16 | Aldolase C X07292 10 | TPT-1 X16064 9
Ribosomal protein L7a X52138 16 | GS13325 None 16 | Ribosomal protein L9 U09953 9 | Ribosomal protein L31 X69181 9
B02. HL60/DMSO 1081 | C09. Ventricle muscle 3785 | N02. Cortex 2242 | X02. Neonate liver 739
β-actin X00351 14 | Myosin heavy chain M25139 101 | Myelin basic protein M13577 49 | Albumin L00133 227
HHCPA78 homolog S73591 6 | Ig-λ light chain D01059 75 | hng/RC3 Y15059 14 | Apolipoprotein B J02775 38
Ribosomal protein L3 M90054 5 | Myoglobin X00373 60 | Apolipoprotein J M74816 13 | α2-HS-glycoprotein M16961 21
L-plastin L05492 5 | Troponin C, skel/card M37984 53 | Aldolase C X07292 9 | Haptoglobin α 1S X00637 16
B03. HL60/TPA 889 | C10. Atrial muscle 2823 | N03. Cerebellum 1107 | X03. Fetal liver 641
EF-1α X16869 26 | ANF M54951 203 | GFAP S40719 22 | Albumin L00133 109
Methionine AT-a L43509 14 | Actin, α-cardiac J00073 88 | Aldolase C X07292 7 | Haptoglobin α 1S X00637 27
Ferritin L M11147 14 | αB-crystallin S45630 27 | Myelin basic protein M13577 5 | γ-G globin X55656 17
TPT-1 X16064 13 | Troponin T, cardiac X74819 25 | Apolipoprotein J M74816 5 | Apolipoprotein AII X04898 14
B04. Granulocyte 1164 | C11. Skeletal muscle 4527 | N04. Neuroblast 1235 | X04. Adult liver 956
β-2-microglobulin M17987 25 | α-actin, skeletal M20543 301 | EF-1α X16869 19 | Albumin L00133 279
Spermidine/spermine AT M77693 22 | Myosin heavy chain X03741 173 | H3.3 histone M11354 17 | Haptoglobin α 1S X00637 41
HLA-Cw1 M26429 21 | Myosin heavy chain X03740 137 | Ribosomal protein L9 U09953 16 | α-1 acid gp M13692 20
Pre-B enhancing factor U02020 20 | Troponin C, skeletal X07898 121 | TPT-1 X16064 15 | Apolipoprotein B J02775 19
B05. CD8 T cell 1104 | C12. Hair follicle 2164 | N05. Caudate nucl. 1077 | X05. Lung 874
β-2-microglobulin M17987 16 | Fibronectin K00799 84 | hng/RC3 Y15059 12 | Pulmonary SAP M30838 87
TPT-1 X16064 12 | COL1A1 M32798 57 | TALLA-1 D29808 11 | Clara cells 10 kd prot. U01101 31
EF-1α X16869 10 | EF-1α X16869 43 | Myelin basic protein M13577 8 | HLA-E heavy chain X64881 12
Yeast rp L4 homolog Z12962 10 | Osteonectin J03040 30 | KIAA0607 AB011179 7 | Fibronectin K00799 10
B06. CD4 T cell 1028 | E01. Keratinocyte 820 | N06. Thalamus 912 | X06. Colon mucosa 921
β-2-microglobulin M17987 19 | Cytokeratin 14 J00124 15 | Myelin basic protein M13577 36 | L-FABP M10617 40
Ribosomal protein L11 X79234 14 | Metallothionein V00594 10 | GFAP S40719 8 | Galectin-4 AF0148318
TPT-1 X16064 13 | Lipocortin II D00017 9 | apo J M74816 7 | CLCA1 AF0394013
23 kD highly basic protein X56932 12 | Ribosomal protein S19 M81757 8 | Sox 8 AF164104 6 | Ig-λ light chain D01059 12
C01. Adipose tissue 1488 | E02. Cornea 2793 | N07. Putamen 871 | X07. Small cell ca. lung 843
Gelatin BP AB012165 19 | Apolipoprotein J M74816 73 | Myelin basic protein M13577 8 | BBC1 X64707 8
Ribosomal protein S8 X67247 16 | Cytokeratin 12 D78367 55 | GS04506 None 8 | 23 kD highly basic prot. X56932 7
apM2 D45370 15 | apM2 NM006829 40 | TPT-1 X16064 7 | Ribosomal protein S11 X06617 7
TPT-1 X16064 14 | Ferritin H M11146 31 | Na/K ATPase β2 D87330 7 | Ribosomal protein L7a X52138 6
C02. Aortic endothel* 967 | E03. Conjunctiva 937 | N08. Astrocyte* 1103 | X08. Adeno ca. of lung 1183
Fibronectin K00799 36 | β-2-microglobulin M17987 23 | EF-1α X16869 26 | COL3A1 X14420 20
TPT-1 X16064 14 | Cytokeratin 13 X52426 23 | Ribosomal protein S17 M13932 14 | Thymosin β4 M17733 15
PAI-1 X13345 12 | Lipocortin X05908 8 | GFAP S40719 12 | Ig-λ light chain D01059 14
CTGF X78947 12 | EF-4AII D30655 8 | Thymosin β4 M17733 10 | Ig-κ light chain M11937 13
C03. Osteoblast* 928 | E04. Intest. metaplasia 2192 | N09. Schwann cell* 975 | X09. Squamous cell ca. lung 1190
COL1A2 J03464 25 | Calcyclin J02763 25 | Ribosomal protein L10 AB007170 18 | Calcyclin J02763 23
Fibronectin K00799 22 | EF-1α X16869 20 | Ribosomal protein L9 U09953 17 | Ferritin L chain M11147 14
Osteonectin J03040 20 | Aminopeptidase N M22324 17 | Ribosomal protein S19 M81757 17 | Cystatin B L03558 11
COL3A1 X14420 18 | PSCA AF043489 17 | Ribosomal protein S29 U14973 15 | Cathepsin B L16510 9
C04. Fibroblast* 1097 | E05. Fundic gland 3304 | N10. Fetal neuron* 1108 | X10. Iris 3314
Stromelysin X05232 26 | Pepsinogen J00287 572 | Ribosomal protein L37a X66699 9 | Apolipoprotein D M16696 45
Fibronectin K00799 22 | Lysozyme X14008 33 | TPT-1 X16064 9 | 1-8D X57351 43
Collagenase X05231 15 | Gastric lipase X05997 32 | Ribosomal protein L5 U14966 8 | Yeast rp L41 homolog Z12962 42
PAI-1 X13345 14 | TPT-1 X16064 29 | Thymosin β4 M17733 8 | TPT-1 X16064 40
C05. Mesangium 1101 | E06. Ileum epithel 3675 | N11. Corpus callosum 949 | X11. Skin full thickness 4604
Fibronectin K00799 48 | β-2-microglobulin M17987 59 | GFAP S40719 12 | Yeast rp L41 homolog Z12962 59
Calcyclin J02763 30 | CLCA1 AF039400 51 | Ribosomal protein L37a X66699 7 | GS20959 None 58
Ribosomal protein S19 M81757 13 | Defensin 6 M98331 46 | Ribosomal protein S8 X67247 7 | Ribosomal protein S18 X69150 57
Yeast rp S28 homolog D14530 10 | GS2706 None 34 | Myelin basic protein M13577 7 | Delta-6 desaturase AF0367957
C06. Itoh cell 1283 | E07. Colon epithel 6451 | N12. Substantia nigra 3477 | X12. Tumor infiltrates 1585
Osteonectin J03040 23 | Galectin-4 AF014838 106 | Myelin basic protein M13577 129 | β-2-microglobulin M17987 49
Fibronectin K00799 18 | Cytokeratin 8 X12882 86 | Ribosomal protein L7a X52138 51 | LD78α D90144 31
PAI-1 X13345 16 | L-FABP M10617 78 | EF-1α X16869 48 | EF-1α X16869 25
EF-1α X16869 16 | Calcyclin J02763 75 | αB-crystallin S45630 46 | Yeast rp S28 homolog D14530 24
C07. Bone flakes 1042 | E08. Pituitary 1015 | N13. Fetal brain 3797
Osteonectin J03040 25 | Prolactin M29386 181 | α-Tubulin X01703 35
COL1A2 J03464 24 | Growth hormone M13438 20 | Ribosomal protein L37a X66699 21
β-Globin V00497 13 | Secretogranin I Y00064 12 | EF-1α X16869 20
COL3A1 X14420 12 | sGTP-bp X07036 9 | Stathmin J04991 16

[i] The source tissue or cells and the number of total ESTs representing chromosomally encoded genes are given (shaded cells). The source groups were blood cells (B01–B06), connective and muscular tissues (C01–C12), epithelial tissues (E01–E08), and nervous tissues (N01–N13). When the source tissue was composed of multiple cell types or an uncategorizable cell type, it was categorized as complex (X01–X12). Asterisks denote primary cultured cells. The identities of the most frequently isolated tags are given along with their frequencies. (TPT-1) Translationally controlled tumor protein; (methionine AT-a) methionine adenosyltransferase-α; (spermine AT) spermine acetyltransferase; (EF-1α) elongation factor 1-α; (apM2) adipose most abundant protein-2; (PAI-1) plasminogen activator inhibitor-1; (ANF) atrial natriuretic factor; (EF-4AII) elongation factor 4AII; (CLCA1) calcium activated chloride channel 1; (L-FABP) liver fatty acid binding protein; (sGTP-BP) stimulatory GTP-binding protein; (hng/RC3) human neurogranin; (GFAP) glial fibrillary acidic protein; (TALLA-1) T-cell acute lymphoblastic leukemia associated antigen 1; (pulmonary SAP) pulmonary surfactant apoprotein.

Validation of Collected GSs

Collected GS sequences were evaluated to determine whether they represented true mRNA termini. Of 3928 independent GS sequences that matched GenBank entries, 3470 (88%) represented the most 3′ MboI fragments of the deposited cDNA sequences. The rest represented alternatively polyadenylated mRNAs or internally primed artifacts, which cannot be distinguished by inspecting individual sequences. Thus, the presence of poly(A) addition signals upstream of the addition site was used to validate the data in bulk. The canonical signal (AATAAA) and sequences with single-base substitutions were sought 10 bp to 50 bp from the poly(A) tail in 4431 independent GS sequences (Fig. 1A). In 93% of GS sequences, AATAAA or a single-base variant was found. The prevalence of AATAAA and single-base variants was quite similar to that observed in 1123 GenBank human cDNAs with clear annotations of poly(A) sites. In the case of 3′ ESTs deposited in dbEST, the proportions of signals differed greatly between those starting with a stretch of Ts and those without them (Fig. 1A). The former showed very similar signal occurrence, but in the latter the proportion of AATAAA was greatly reduced, indicating that the short region following a poly(T) stretch was trimmed before data submission, as reported (Hillier et al. 1996). The 458 BodyMap GSs that matched internal regions of GenBank mRNAs had similar frequencies of hexanucleotide signals, suggesting that the majority of them are also polyadenylated in vivo. Because we counted not only the well-known single-base variants, such as ATTAAA, but also all of the possible single-base substitutions, the fraction of each of these variants was compared for cDNA ends having only one candidate signal (Fig. 1B). The proportion was consistent across GenBank primate sequences, BodyMap, and untrimmed 3′ ESTs. Trimmed 3′ ESTs served as nonterminal controls. This agreement between the three sets of data suggests that some of the uncommon hexanucleotide variants, such as BATAAA and AATABA (B = T, G, C), are functional and that most of the cDNA sequences with only single-base variants in either set of data represent true 3′ termini of transcripts.
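The signal search described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function names, the `min_tail` threshold used to recognize a poly(A) tail, and the choice to require the whole hexamer to lie inside the 10–50 bp window are all assumptions made for the example.

```python
CANONICAL = "AATAAA"


def single_base_variants(signal=CANONICAL):
    """Return every hexamer that differs from `signal` at exactly one position."""
    variants = set()
    for i, base in enumerate(signal):
        for alt in "ACGT":
            if alt != base:
                variants.add(signal[:i] + alt + signal[i + 1:])
    return variants


VARIANTS = single_base_variants()  # 6 positions x 3 substitutions


def classify_signal(seq, near=10, far=50, min_tail=8):
    """Classify the poly(A) signal found upstream of a poly(A) tail.

    Returns ("AATAAA", hexamer), ("variant", hexamer), ("none", None),
    or ("no_tail", None) when the sequence does not end in a poly(A) run.
    `min_tail` is a hypothetical threshold, not taken from the paper.
    """
    seq = seq.upper()
    body = seq.rstrip("A")  # sequence with the trailing A run removed
    if len(seq) - len(body) < min_tail:
        return ("no_tail", None)
    window = body[-far:-near]  # bases lying 10-50 bp upstream of the tail
    if CANONICAL in window:
        return ("AATAAA", CANONICAL)
    for i in range(len(window) - 5):
        hexamer = window[i:i + 6]
        if hexamer in VARIANTS:
            return ("variant", hexamer)
    return ("none", None)
```

Counting the returned categories over a set of tags gives the kind of proportions compared in Figure 1A; counting the returned variant hexamers for tags with a single candidate signal corresponds to Figure 1B.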

Figure 1.

Distribution of AATAAA and single-base variants in 3′ ends of ESTs. (A) The hexanucleotide signals from the 3′ ends of four sets of cDNA sequences. The regions 10–50 bp from the polyadenylation sites were examined for the presence of AATAAA or its single-base variants. The 3′ ESTs from dbEST were divided into those starting with more than seven Ts (Tn > 7) and those without a T in the first position (T0). (B) The prevalence of each single-base variant in the cDNA termini with only one variant signal and no AATAAA within the 10–50-bp region from the poly(A) tail in each of four data sets is shown. The frequencies (%) of all 18 possible variants in each data set are shown beside bars. In vitro polyadenylation activities for each variant measured in the context of the SV40 polyadenylation signal are reproduced from the literature (Wickens et al. 1984; Sheets et al. 1990).

Constitution of GS Population

Without exception, the most recurrent GSs in differentiated cells or in adult tissues were from nonhousekeeping genes (Table 1). Some were unique to each tissue, and some were shared among cells of the same lineage. The fraction of the most abundant GS varied more than tenfold across tissues or cell types. There were six tissues in which more than 10% of the total ESTs were attributable to a single GS cluster (Fig. 2); these were secretory epithelia or muscular tissues. In the remaining tissues, the most abundant GS accounted for only a few percent of the total (mean, 2.5%; SD, 1.8%).

Figure 2.

The relative contents of the most abundant transcripts in 51 human tissues or cell types as measured by gene signature collection. The error range indicates the P value of 0.1 calculated for each observed occurrence. The identities of some transcripts are given. For the identities of other transcripts, see Table 1.

The characteristics of the GS population for each source group (nervous tissues, connective tissues, and epithelial tissues) are illustrated in the accumulated frequency curve (Fig. 3), in which the cumulative sums of occurrences were plotted in descending order of GS occurrence. The epithelial and connective tissues have very similar curves, whereas that of the nervous system is clearly shifted downward. The curve for neural tissues did not overlap with the others at a confidence level of 0.85 in the top 486 genes. As seen in Figure 3, 50% of the mass is accounted for by ∼500 genes in connective and epithelial tissues but by >900 genes in nervous tissue.
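The cumulative curve and the "number of genes accounting for 50% of the mass" can be computed from per-cluster tag counts as in the following sketch. The function names and the input layout (a mapping from cluster ID to tag count) are assumptions chosen for illustration.

```python
def cumulative_curve(counts):
    """Cumulative fraction of all tags, taking clusters in descending order of count.

    `counts` maps a cluster ID to its number of tag occurrences.
    """
    ordered = sorted(counts.values(), reverse=True)
    total = sum(ordered)
    curve = []
    running = 0
    for count in ordered:
        running += count
        curve.append(running / total)
    return curve


def genes_to_reach(counts, fraction=0.5):
    """Smallest number of top-ranked clusters whose tags reach `fraction` of the total."""
    for rank, cumulative in enumerate(cumulative_curve(counts), start=1):
        if cumulative >= fraction:
            return rank
    return None  # only reached when `counts` is empty
```

Applied to the pooled counts of one tissue category, `genes_to_reach(counts, 0.5)` gives the value read off the curve at the 50% level; a more complex transcript population yields a larger number.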

Figure 3.

Cumulative frequencies of gene signature (GS) sequences. The cumulative sums calculated in descending order of GS frequencies are plotted as a percentage of total tag occurrence. Tag occurrences in each of three major tissue categories were plotted separately.

Overlapping with dbEST

To further characterize BodyMap data, we compared them with dbEST entries in UniGene (Build 75) that were clustered into 72,831 physical and annotational clusters. Of 18,722 GS clusters composed of 89,831 GS tags in BodyMap, 3,382 GS clusters did not match ESTs listed.

GSs in the overlapping fraction had an average redundancy of 5.6 in BodyMap, whereas GS clusters unique to BodyMap had an average redundancy of 1.3. GSs unique to BodyMap were distributed at frequencies of 1%–5% in every library (mean, 4.0%; SD, 2.2%) and had hexanucleotide signal occurrences similar to the rest (data not shown). Nervous system tissues had a slightly greater content of unique GSs (mean, 5.4%; SD, 2.1%) than was found in other tissues (mean, 3.8%; SD, 1.8%). These values equaled or exceeded 10% in only two libraries: full thickness of skin (10%) and fetal neuron (11%).

In Silico RNA Experiments

The primary goal for the construction of BodyMap was to create a genes × tissues matrix of transcription levels that could be used for in silico experiments such as Northern hybridization and subtraction cloning. Although the depth of the clone collection limits the sensitivity and specificity of experiments, for abundant clones these primary objectives have been achieved. The sensitivity of the present matrix was assessed by probing the data with several genes known to have moderate expression levels and known tissue specificities. As shown in Table 2, transcripts for cytoskeletal intermediate filaments and collagens segregate clearly by tissue type, which suggests that similarly clear segregation also applies to anonymous genes with similar expression levels. Such pure segregation patterns are not seen in libraries constructed from complex starting materials.
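As a minimal illustration of probing such a matrix, the sketch below normalizes the tag count of one gene by the total number of tags in each library. The nested-dictionary layout and the function name are assumptions; the example counts and library totals are those listed in Table 1, and only a few libraries are included.

```python
def in_silico_northern(matrix, library_totals, gene):
    """Return the fraction of tags in each library that belong to `gene`.

    `matrix` maps gene -> {library: tag count}; libraries in which the gene
    was never observed are reported as 0.0.
    """
    row = matrix.get(gene, {})
    return {
        library: row.get(library, 0) / total
        for library, total in library_totals.items()
    }


# Counts and totals copied from Table 1 (a small subset of libraries).
matrix = {
    "GFAP": {"N03": 22, "N06": 8, "N08": 12, "N11": 12},
    "Albumin": {"X02": 227, "X03": 109, "X04": 279},
}
library_totals = {"N03": 1107, "N06": 912, "N08": 1103, "N11": 949, "X04": 956}

gfap_profile = in_silico_northern(matrix, library_totals, "GFAP")
albumin_profile = in_silico_northern(matrix, library_totals, "Albumin")
```

Because the libraries differ in depth, comparing fractions rather than raw counts keeps a deep library from appearing to express a gene more strongly merely because more clones were sequenced.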

Table 2.

Distribution of Transcripts for Cytoskeletal Filaments and Collagens across Libraries

Blood Connective/muscle Epithelial Neural Cluster ID
Gene name GenBank B01 B02 B03 B04 B05 B06 C01 C02 C03 C04 C05 C06 C07 C08 C09 C10 C11 C12 E01 E02 E03 E04 E05 E06 E07 E08 N01 N02 N03 N04 N05 N06 N07 N08 N09 N10 N11 N12 N13 BodyMap UniGene
Cytokeratin 4 X07695 4 GS08105 Hs.3235
Cytokeratin 5 M19723 27 2 GS05829 Hs.195850
Cytokeratin 6 L42601 3 GS05780 Hs.111758
Cytokeratin 8 X12882 1222112262986 GS00223 Hs.78271
Cytokeratin 12 D78367 55 GS08025 Hs.66739
Cytokeratin 13 X52426 223 GS06283 Hs.74070
Cytokeratin 14 J00124 152 GS05738 Hs.117729
Cytokeratin 17 Z19574 3 GS05804 Hs.2785
Cytokeratin 18 X12883 114558 GS00243 Hs.65114
Cytokeratin 19 Y00503 141413525 GS05013 Hs.182265
α-Actin, cardiac J00073 13881 GS13600 Hs.7768
α-Actin, skeletal M20543 182301111 GS14552 Hs.1288
α-Actin, vascular X13839 3328545123 GS03142 Hs.195851
β-Actin X00351 1414212112215322422107121716431318 GS00244 Hs.180952
γ-Actin, cytoskeletal M19283 1514139935211122493173612498429 GS00114 Hs.215747
γ-Actin, enteric D00654 116 GS03304 Hs.77443
Neurofilament-L X05608 16151 GS04860 Hs.222661
Neurofilament-M Y00067 34121 GS05625 Hs.71346
GFAP S40719 82238121274 GS06290 Hs.1447
COL1A1 M32798 111429115711 GS02049 Hs.172928
COL1A2 J03464 261315342238154 GS02285 Hs.179573
COL2A1 X16468 2 GS12481 Hs.81343
COL3A1 X14420 4128516141252112 GS03074 Hs.119571
COL4A1 M26576 11121 GS06075 Hs.119129
COL4A2 J02760 1 GS16997 Hs.75617
COL5A1 M76729 13121 GS03963 Hs.146428
COL5A2 M11718 13121 GS03623 Hs.82985
COL6A1 X15880 211 GS03931 Hs.10885
COL6A2 M34572 3112 GS02756 Hs.219020
COL6A3 X52022 1521621 GS03837 Hs.80988
COL9A2 M95610 21 GS09520 Hs.37165
COL9A3 L41162 1 GS05017 Hs.53563
COL11A1 J04177 1 GS12953 Hs.82772
COL15A1 L25286 11213 GS3082 Hs.83164
COL18A1 L22548 12311 GS02482 Hs.78409

[i] The numbers are the observed recurrences of each GS in each library, grouped by tissue system. For the source materials and number of total isolates, see Table 1.

Another example of an in silico experiment is the selection of genes with given patterns of expression. For example, genes differentially expressed in myeloid cells, selected by the criterion that the frequency variation between myeloid cells and nonmyeloid cells was highly significant (P < 0.005), are shown in Table 3. By relaxing the P-value threshold to 1%, 112 more genes were selected (data not shown).

Table 3.

Known Genes Selected as Uniquely Expressed in Myeloid Cells

Cluster ID | GenBank identity | Accession | P value
GS08362 | Granulocyte colony-stimulating factor receptor | S71484 | 2.56E-17
GS00697 | Plasminogen activator inhibitor-2 (PAI-2) | M24657 | 1.13E-09
GS01724 | Pleckstrin (P47) | X07743 | 1.13E-09
GS01024 | Leukocyte adhesion protein/CD18 | M15395 | 2.14E-08
GS01345 | Bactericidal permeability increasing protein (BPI) | J04739 | 7.56E-06
GS08325 | Phosphatidylinositol 3-kinase p110delta | U86453 | 7.56E-06
GS01990 | Secreted protein (I-309) | M57502 | 7.56E-06
GS01200 | ICB-1 mRNA | AF044896 | 1.42E-04
GS01719 | Myeloid cell nuclear differentiation antigen | M81750 | 1.42E-04
GS01202 | Neutrophil oxidase factor (NCF2)/p67-phox | U00788 | 1.42E-04
GS00779 | Wegener's granulomatosis autoantigen proteinase 3 | M97911 | 1.42E-04
GS01687 | c-raf-1 proto-oncogene | L00212 | 2.68E-03
GS01000 | EVI2B3P | M60830 | 2.68E-03
GS08337 | Grancalcin (neutrophil monocyte Ca binder) | M81637 | 2.68E-03
GS00610 | Beige protein homolog (chs) | U67615 | 2.68E-03
GS01164 | Differentiation antigen (CD33) | M23197 | 2.68E-03
GS08512 | Monocytic leukemia zinc finger protein | U47742 | 2.68E-03
GS01229 | Migration inhibitory factor-related protein 8 | M21005 | 2.68E-03
GS01963 | Type II interleukin-1 receptor antagonist (IL-1ra3) | AF057168 | 2.68E-03

[i] Gene signature (GS) clusters were selected when the probability of unregulated expression between myeloid cells (HL60, HL60/DMSO, HL60/TPA, granulocytes) and the remaining tissues was less than 0.5%. P values represent, for each gene, the probability of unregulated expression between the two sets of libraries (see Methods for calculations). Along with these known genes, 28 GSs for novel genes were selected: GS01371, GS01572, GS08424, GS01582, GS01553, GS01965, GS01356, GS00656, GS08595, GS01922, GS08435, GS01123, GS00963, GS01383, GS08551, GS08572, GS01352, GS01109, GS01561, GS08460, GS05157, GS00549, GS01251, GS00627, GS08477, GS08379, GS01458, GS01987. For sequences, refer to http://bodymap.ims.u-tokyo.ac.jp.

DISCUSSION

The wide coverage of human genes in dbEST permits parallel gene expression monitoring based on prior knowledge of gene sequence (Lockhart et al. 1996; Iyer et al. 1999). However, from a practical perspective, researchers must select a set of genes suitable for target tissues to make testing efficient (Loftus et al. 1999). The well-preserved abundance information and high anatomical resolution make BodyMap a preferable source for probe selection (http://bodymap.ims.u-tokyo.ac.jp). Another unique feature of BodyMap is the absolute abundance values for transcripts in various tissues. Such information is also found in SAGE, a collection of shorter tags (Velculescu et al. 1995; Welle et al. 1999), and the two resources complement each other in tissue coverage. The abundance data covering various tissues are complementary to relative gene expression comparison (DeRisi et al. 1996; Schena et al. 1996; Kawamoto et al. 1999) for evaluating the functions of uncharacterized genes.

Site-directed EST sequences are indispensable for identification of gene ends within genomic sequences because even the most sophisticated computer programs tend to overpredict the presence of exons (Dunham et al. 1999). The overlap of dbESTs with BodyMap indicates that there are still more transcripts to be identified in brain and other tissues. The higher complexity of transcripts in brain, as shown by the accumulated frequency curve, supports this idea. Possible overprediction of genes by using 3′ ESTs is due to cloning artifacts and alternative polyadenylation. Validation of 3′ ESTs by using hexanucleotide signals suggested that such artifacts were negligible in our data set. The 3′ ends without AATAAA were observed at high incidences not only in BodyMap (39%), but also in human cDNAs in GenBank (37%) and qualified 3′ ESTs from dbEST (40%). In those 3′ ends, several uncommon single-base variants, such as BATAAA and AATABA (B = T, G, C), plausibly responsible for poly(A) formation in these 3′ ends, were found at very similar rates. After this paper was submitted, Beaudoing et al. (2000) published similar results from an analysis of 4344 human 3′ untranslated regions (UTRs) and 3′ ESTs overlapping with them. The proportion of 3′ ends without AATAAA was 41.8% in their analysis, and uncommon single-base variants were found at significant frequencies among them. In BodyMap, upstream alternative polyadenylation was found in 12% of GenBank mRNA entries. Assuming the same incidence of downstream alternatives, our estimate of alternative polyadenylation is 24%, close to the reported estimates by EST clustering (16%) (Gautheret et al.) and recent 3′ UTR analysis (28.6%). Although generation of multiple 3′ ESTs from one gene may affect transcript counting by EST clustering, assigning them to genomic sequences will easily resolve this problem as long as the ESTs are not far apart.

In summary, our site-directed 3′ ESTs can serve as a resource for selection of probes for sequence-based expression profiling methods and can provide absolute levels of gene expression that are important in considering gene function. Our collection covers various rare tissues and provides information on their mRNA populations. To allow full use of BodyMap for in silico mRNA experiments, the representation frequency matrix of gene × sources and all representative sequences have been made available through our ftp site (http://bodymap.ims.u-tokyo.ac.jp/datasets/index.html).

METHODS

Library Construction

Of 51 human libraries listed (Table 1), 15 were made by direct priming of total RNA. The specimens used for RNA preparations and the methods used are described elsewhere (http://bodymap.ims.u-tokyo.ac.jp). The other libraries were made from poly(A)-selected RNA. For counting transcripts by sequencing, only the most 3′-terminal fragment left by cleaving off the bulk of the fragment with MboI from the pUC119 vector-primed cDNA was cloned, as described previously (Matsubara et al. 1993). This shortening of the inserts facilitates the unbiased representation of mRNAs regardless of their original sizes, at the expense of losing ∼5% of gene sequences because an MboI site is absent or is located too close to the poly(A) tail.
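The expected tag for a known cDNA can be predicted in silico by taking everything from its last MboI site (GATC) to the 3′ end. The short sketch below illustrates this; the function name is hypothetical, and the input is assumed to be a sense-strand cDNA sequence written 5′ to 3′.

```python
MBOI_SITE = "GATC"


def most_3prime_mboi_fragment(cdna):
    """Return the fragment from the last GATC to the 3' end, or None if no site exists."""
    position = cdna.upper().rfind(MBOI_SITE)
    if position == -1:
        return None  # transcripts without an MboI site cannot yield a tag
    return cdna[position:]
```

A cDNA with no GATC returns `None`, which corresponds to the transcripts lost because they lack an MboI site.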

Data Collection and Cleansing

Starting with randomly isolated transformants, sequence templates were prepared by PCR amplification of the insert cDNA in single-stranded phage released into the culture medium. All sequences were read from the MboI site toward poly(A), which allows unambiguous identification of the original transcripts. They were referred to as GSs (Okubo et al. 1992). In half of the cases, dye primer chemistry was used, and in the remaining cases, DYEnamic ET* Terminator Cycle Sequencing Kit (Amersham Pharmacia Biotech Inc.) was used. Sequences with >5% Ns, not starting with GATC (the MboI site), or having more than one GATC were eliminated. We then eliminated those sequences having >90% similarity in an overlap longer than 50 bp or 70% of the sequence length with vectors and ribosomal sequences. Sequences for mitochondrial transcripts were also eliminated. When the GATC and poly(A) tail were separated by <17 bp, the sequences were eliminated from the analysis because they were not always unique enough. Lastly, sequences were compared with a library of repetitive sequences, REPBASE (Jurka 1995, ftp://ncbi.nlm.nih.gov/repository/repbase/) by using BLAST (Altschul et al. 1990), and repetitive regions were masked as previously reported (Hishiki et al. 2000). All GS sequences were submitted to the DNA DataBank of Japan (DDBJ) and made available at our web site (http://bodymap.ims.u-tokyo.ac.jp).
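The sequence-level filters above (N content, a single GATC at the start, and the minimum distance between GATC and the poly(A) tail) can be expressed as a simple predicate. The sketch below is an illustration only: it omits the similarity screens against vector, ribosomal, mitochondrial, and repetitive sequences, and the `min_tail` parameter used to recognize a poly(A) tail is an assumption that does not come from the text.

```python
def passes_basic_filters(seq, max_n_fraction=0.05, min_insert=17, min_tail=8):
    """Apply the composition and site filters to one raw GS read.

    `min_tail` (number of trailing A's treated as a poly(A) tail) is a
    hypothetical parameter introduced for this sketch.
    """
    seq = seq.upper()
    if not seq:
        return False
    if seq.count("N") / len(seq) > max_n_fraction:
        return False  # more than 5% ambiguous bases
    if not seq.startswith("GATC"):
        return False  # read must begin at the MboI site
    if seq.count("GATC") != 1:
        return False  # an internal second MboI site is not allowed
    body = seq.rstrip("A")
    has_tail = len(seq) - len(body) >= min_tail
    if has_tail and len(body) - len("GATC") < min_insert:
        return False  # GATC and poly(A) separated by fewer than 17 bp
    return True
```

Reads that fail any check are dropped before clustering, so the order of the checks affects only which reason is reported, not the outcome.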

Transcript Counting/EST Clustering

Sequences from each new library were first compared to each other with FASTA (Pearson et al. 1988). When the similarity exceeded 95% for an overlap longer than 50 bp or 70% of insert length and the overlap started at a GATC, they were considered the same tag and clustered (primary cluster). From each cluster, one representative GS was selected and compared with representative sequences from previously generated clusters. By using the same criteria, clusters of the same GS were grouped, and a new representative tag was selected from the new cluster (secondary cluster). A five-figure cluster ID, referred to as the GS number, was assigned to each independent cluster. Representative sequences for the GS clusters were compared periodically with primate sequences in GenBank (Re. 110.0) and ESTs in UniGene (Build 75, http://www.ncbi.nlm.nih.gov/UniGene/). The criteria for identity were the same as those used for clustering. The correspondence of BodyMap ID (GS) to UniGene ID (Hs) was submitted to GenBank and implemented in UniGene.
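The clustering criteria can be illustrated with a simplified sketch. It departs from the procedure above in several ways that should be read as assumptions: FASTA alignment is replaced by an ungapped comparison anchored at the shared 5′ GATC, the "70% of insert length" rule is interpreted relative to the longer of the two tags, and clusters are formed by single linkage with a union-find structure rather than by comparing representative sequences.

```python
def same_tag(a, b, min_identity=0.95, min_overlap=50, min_fraction=0.70):
    """Decide whether two GATC-anchored tags represent the same transcript end.

    Ungapped, position-by-position comparison from the 5' GATC; this is a
    stand-in for the FASTA alignment used in the paper.
    """
    overlap = min(len(a), len(b))
    if overlap == 0:
        return False
    longer = max(len(a), len(b))
    if not (overlap > min_overlap or overlap >= min_fraction * longer):
        return False
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / overlap > min_identity


def cluster_tags(tags):
    """Group tag indices into single-linkage clusters using union-find."""
    parent = list(range(len(tags)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(tags)):
        for j in range(i + 1, len(tags)):
            if same_tag(tags[i], tags[j]):
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(tags)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

The number of tags in each resulting cluster is the transcript count for that GS in the library. The all-against-all loop is quadratic in the number of tags, which is acceptable for a sketch but not for a collection of this size.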

Selection of Differentially Expressed Genes

For the selection of genes preferentially expressed in a given set of tissues, for example tissues A, B, and C, libraries A–C were considered one library and the remaining 48 libraries in BodyMap another library. The probability of unregulated expression between the two hypothetical libraries was calculated for each GS by the equation reported by Audic and Claverie (Audic et al. 1997):

$$P(y \mid x) = \left(\frac{N_2}{N_1}\right)^{y} \frac{(x+y)!}{x!\,y!\,\left(1+\frac{N_2}{N_1}\right)^{x+y+1}}$$

Total isolation in A–C is N1 and isolation of the relevant GS is x. The total isolation in the remaining libraries is N2 and the occurrence of the relevant GS is y.
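The probability P(y|x) can be evaluated numerically as in the sketch below. The function name is hypothetical, and the log-gamma formulation is a choice made here to avoid overflow in the factorials for large counts; how P(y|x) values are turned into the significance thresholds used for selection is not detailed in this section and is not reproduced.

```python
from math import exp, lgamma, log


def audic_claverie_probability(x, y, n1, n2):
    """P(y | x): probability of seeing y tags among n2 clones, given x tags among n1.

    Computed in log space; lgamma(k + 1) equals log(k!).
    """
    ratio = n2 / n1
    log_p = (
        y * log(ratio)
        + lgamma(x + y + 1)
        - lgamma(x + 1)
        - lgamma(y + 1)
        - (x + y + 1) * log(1 + ratio)
    )
    return exp(log_p)
```

For a pooled comparison, `n1` and `x` are summed over the libraries of interest and `n2` and `y` over all remaining libraries. For fixed `x`, the values of P(y|x) over all `y` sum to 1, which provides a quick sanity check of an implementation.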

Analysis of Polyadenylation Signals

Among 62,710 entries of primate sequences in GenBank (Re.97), all human mRNAs with a single “poly(A)-site” listed in the features were used. From the representative sequences for all GSs, we selected those that satisfied all of the following conditions. The GS does not have matches in GenBank, is longer than 100 bp, and ends with poly(A). The GS sequence does not contain more than 5% Ns within 100 bp of the poly(A). The GS does not contain repetitive sequences or an N in the AATAAA sequence, such as 'NATAAA'.

From dbEST (Re. 93), we selected 118,353 3′ ESTs from the Washington-U/Merck project (Hillier et al. 1996) to avoid confusion due to inconsistencies in the feature descriptions from different laboratories. EST matches to BodyMap entries and GenBank primate mRNAs were eliminated first. Those ESTs with discrepancies between clone name and definition (5′ in clone name and 3′ EST in definition), and those denoted as “possible reverse clone” were also eliminated. 3′ ESTs with a stretch of longer than seven Ts (Tn > 7) at the beginning and those starting with A, G, or C (T0) were analyzed separately. Those ESTs starting with one to seven Ts were not used. Within each of these four categories, the 100 bases from the poly(A) site were compared with each other with BLASTN with the same criteria used for GS clustering, and the fragment containing the lowest number of Ns was selected from each cluster and used in the analysis.

Acknowledgments

The authors thank Ms. Kumiko Takagi for her secretarial assistance. This work was supported in part by a Grant-in-Aid for Scientific Research on Priority Areas from the Ministry of Education, Science and Culture, and by the Research for the Future program of the Japan Society for the Promotion of Science, Japan.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

Notes

[4] Corresponding author.

[5] E-MAIL [email protected]; FAX 81-6-6877-1922.

[6] Article and publication are at www.genome.org/cgi/doi/10.1101/gr.151500.

REFERENCES

  1. J.S. Aaronson, B. Eckman, R.A. Blevins, J.A. Borkowski, J. Myerson, S. Imran, K.O. Elliston (1996) Toward the development of a gene index to the human genome: An assessment of the nature of high-throughput EST sequence data. Genome Res. 6:829–845.
  2. M.D. Adams, A.R. Kerlavage, C. Fields, J.C. Venter (1993) 3,400 new expressed sequence tags identify diversity of transcripts in human brain. Nat. Genet. 4:256–267.
  3. M.D. Adams, A.R. Kerlavage, R.D. Fleischmann, R.A. Fuldner, C.J. Bult, N.H. Lee, E.F. Kirkness, K.G. Weinstock, J.D. Gocayne, O. White (1995) Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature 377:3–174.
  4. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman (1990) Basic local alignment search tool. J. Mol. Biol. 215:403–410.
  5. S. Audic, J.-M. Claverie (1997) The significance of digital gene expression profiles. Genome Res. 7:986–995.
  6. E. Beaudoing, S. Freier, J.R. Wyatt, J.-M. Claverie, D. Gautheret (2000) Patterns of variant polyadenylation signal usage in human genes. Genome Res. 10:1001–1010.
  7. M.F. Bonaldo, G. Lennon, M.B. Soares (1996) Normalization and subtraction: Two approaches to facilitate gene discovery. Genome Res. 6:791–806.
  8. J. DeRisi, L. Penland, P.O. Brown, M.L. Bittner, P.S. Meltzer, M. Ray, Y. Chen, Y.A. Su, J.M. Trent (1996) Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat. Genet. 14:457–460.
  9. I. Dunham, N. Shimizu, B.A. Roe, S. Chissoe, A.R. Hunt, J.E. Collins, R. Bruskiewich, D.M. Beare, M. Clamp, L.J. Smink (1999) The DNA sequence of human chromosome 22. Nature 402:489–495.
  10. D. Gautheret, O. Poirot, F. Lopez, S. Audic, J.-M. Claverie (1998) Alternate polyadenylation in human mRNAs: A large-scale analysis by EST clustering. Genome Res. 8:524–530.
  11. M. Hattori, A. Fujiyama, T.D. Taylor, H. Watanabe, T. Yada, H.S. Park, A. Toyoda, K. Ishii, Y. Totoki, D.K. Choi (2000) The DNA sequence of human chromosome 21. Nature 405:311–319.
  12. L.D. Hillier, G. Lennon, M. Becker, M.F. Bonaldo, B. Chiapelli, S. Chissoe, N. Dietrich, T. DuBuque, A. Favello, W. Gish (1996) Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 6:807–828.
  13. T. Hishiki, S. Kawamoto, S. Morishita, K. Okubo (2000) BodyMap: A human and mouse gene expression database. Nucleic Acids Res. 28:136–138.
  14. A.D. Irvine, L.D. Corden, O. Swensson, B. Swensson, J.E. Moore, D.G. Frazer, F.J. Smith, R.G. Knowlton, E. Christophers, R. Rochels (1997) Mutations in cornea-specific keratin K3 or K12 genes cause Meesmann's corneal dystrophy. Nat. Genet. 16:184–187.
  15. V.R. Iyer, M.B. Eisen, D.T. Ross, G. Schuler, T. Moore, J.C.F. Lee, J.M. Trent, L.M. Staudt, J. Hudson Jr., M.S. Boguski (1999) The transcriptional program in the response of human fibroblasts to serum. Science 283:83–87.
  16. S. Kawamoto, T. Ohnishi, H. Kita, O. Chisaka, K. Okubo (1999) Expression profiling by iAFLP: A PCR-based method for genome-wide gene expression profiling. Genome Res. 9:1305–1312.
  17. D.J. Lockhart, H. Dong, M.C. Byrne, M.T. Follettie, M.V. Gallo, M.S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 14:1675–1680.
  18. S.K. Loftus, Y. Chen, G. Gooden, J.F. Ryan, G. Birznieks, M. Hilliard, A.D. Baxevanis, M. Bittner, P. Meltzer, J. Trent (1999) Informatic selection of a neural crest-melanocyte cDNA set for microarray analysis. Proc. Natl. Acad. Sci. 96:9277–9280.
  19. K. Maeda, K. Okubo, I. Shimomura, K. Mizuno, Y. Matsuzawa, K. Matsubara (1997) Analysis of an expression profile of.
  20. K. Matsubara, K. Okubo (1993) cDNA analyses in the human genome project. Gene 135:265–274.
  21. K. Nishida, W. Adachi, A. Shimizu-Matsumoto, S. Kinoshita, K. Mizuno, K. Matsubara, K. Okubo (1996) A gene expression profile of human corneal epithelium and the isolation of human keratin 12 cDNA. Invest. Ophthalmol. Vis. Sci. 37:1800–1809.
  22. K. Nishida, Y. Honma, A. Dota, S. Kawasaki, W. Adachi, T. Nakamura, A.J. Quantock, H. Hosotani, S. Yamamoto, M. Okada (1997) Isolation and chromosomal localization of a cornea-specific human keratin 12 gene and detection of four mutations in Meesmann corneal epithelial dystrophy. Am. J. Hum. Genet. 61:1268–1275.
  23. T. Ohnishi, K. Okubo (1999) Isolation of pure human mucosal epithelium for RNA analysis. Biotechniques 27:978–986.
  24. I. Ohno, J. Hashimoto, K. Shimizu, K. Takaoka, T. Ochi, K. Matsubara, K. Okubo (1996) A cDNA cloning of human AEBP1 from primary cultured osteoblasts and its expression in a differentiating osteoblastic cell line. Biochem. Biophys. Res. Commun. 228:411–414.
  25. K. Okubo, N. Hori, R. Matoba, T. Niiyama, K. Matsubara (1991) A novel system for large-scale sequencing of cDNA by PCR amplification. DNA Seq. 2:137–144.
  26. K. Okubo, N. Hori, R. Matoba, T. Niiyama, A. Fukushima, Y. Kojima, K. Matsubara (1992) Large scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression. Nat. Genet. 2:173–179.
  27. W.R. Pearson, D.J. Lipman (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. 85:2444–2448.
  28. M. Schena, D. Shalon, R. Heller, A. Chai, P.O. Brown, R.W. Davis (1996) Parallel human genome analysis: Microarray-based expression monitoring of 1000 genes. Proc. Natl. Acad. Sci. 93:10614–10619.
  29. G.D. Schuler, M.S. Boguski, E.A. Stewart, L.D. Stein, G. Gyapay, K. Rice, R.E. White, P. Rodriguez-Tome, A. Aggarwal, E. Bajorek (1996) A gene map of the human genome. Science 274:540–546.
  30. M.D. Sheets, S.C. Ogg, M.P. Wickens (1990) Point mutations in AAUAAA and the poly (A) addition site: Effects on the accuracy and efficiency of cleavage and polyadenylation in vitro. Nucleic Acids Res. 18:5799–5805.
  31. A. Shimizu-Matsumoto, W. Adachi, K. Mizuno, J. Inazawa, K. Nishida, S. Kinoshita, K. Matsubara, K. Okubo (1997) An expression profile of genes in human retina and isolation of a complementary DNA for a novel rod photoreceptor protein. Invest. Ophthalmol. Vis. Sci. 38:2576–2585.
  32. M.B. Soares, M.F. Bonaldo, P. Jelene, L. Su, L. Lawton, A. Efstratiadis (1994) Construction and characterization of a normalized cDNA library. Proc. Natl. Acad. Sci. 91:9228–9232.
  33. R.L. Strausberg, K.H. Buetow, M.R. Emmert-Buck, R.D. Klausner (2000) The cancer genome anatomy project: Building an annotated gene index. Trends Genet. 16:103–106.
  34. V.E. Velculescu, L. Zhang, B. Vogelstein, K.W. Kinzler (1995) Serial analysis of gene expression. Science 270:484–487.
  35. S. Welle, K. Bhatt, C.A. Thornton (1999) Inventory of high-abundance mRNAs in skeletal muscle of normal men. Genome Res. 9:506–513.
  36. M. Wickens, P. Stephenson (1984) Role of the conserved AAUAAA sequence: Four AAUAAA point mutants prevent messenger RNA 3′ end formation. Science 226:1045–1051.
  37. A.R. Williamson (1999) The Merck Gene Index project. Drug Discov. Today 4:115–122.