Structural Characterization of the Human Proteome

This directory contains supplementary material for

Muller A, MacCallum RM & Sternberg MJE. Structural Characterization of the
Human Proteome. Genome Research 2002.

Additional resources and web-based searches and applications are available
from: http://www.sbg.bio.ic.ac.uk/

contact: Arne Muller (a.mueller@cancer.org.uk) or
         Mike Sternberg (m.sternberg@ic.ac.uk)

This file describes the data and formats of the supplementary material. See
file 'ls-sR' for a list and sizes (in kbytes) of the included files.

Web resources:

Protein sequences of the proteomes are from the genome section of the NCBI
ftp-server (ftp.ncbi.nih.gov). Human proteins are from ENSEMBL-0.8.0
(www.ensembl.org). Links between human sequences and OMIM are taken from the
genelink table of the ENSEMBL-0.8.0 database. SCOP is available from the MRC
(http://scop.mrc-lmb.cam.ac.uk/scop/) and Stanford (SCOP domain sequence
files, http://astral.stanford.edu/). PFAM is available from The Sanger Centre
(http://www.sanger.ac.uk/).

All SCOP superfamily codes refer to SCOP version 1.53. PFAM is version 6.2.

Proteome abbreviations:

Bacteria:
=========
Mtub  = Mycobacterium tuberculosis
Ecoli = Escherichia coli
Bsub  = Bacillus subtilis
Mgen  = Mycoplasma genitalium
Vcho  = Vibrio cholerae
Hpyl  = Helicobacter pylori
Aquae = Aquifex aeolicus

Archaea:
========
Aero = Aeropyrum pernix
Pyro = Pyrococcus horikoshii
Mjan = Methanococcus jannaschii

Eukaryota:
==========
Human                  = Homo sapiens
S_cerevisiae or yeast  = Saccharomyces cerevisiae
D_melanogaster or fly  = Drosophila melanogaster
C_elegans or worm      = Caenorhabditis elegans

All files contain tab-delimited data tables. A single column can contain
spaces, as in "Serine proterase inhibitors" (one column). All columns of the
form 'MD5_11eaacf2431bafb6ec80cec311d77b5f' hold the 32-character hexadecimal
MD5 checksum of a protein sequence. These MD5 fingerprints are used as
identifiers within the data tables and can be used to link tables.
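
As an illustration of how such an identifier could be recomputed, the sketch
below applies the normalisation stated in this file (capital letters,
non-standard amino acid letters replaced by 'X') and prefixes the hexadecimal
digest with 'MD5_'. The function name, the choice of the 20 standard
one-letter codes as the "standard" alphabet, and the removal of whitespace
are assumptions made for this example; it is not guaranteed to reproduce the
identifiers in the data tables.

```python
import hashlib

# Assumption: "standard" means the 20 common one-letter amino acid codes.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def md5_fingerprint(sequence: str) -> str:
    """Return an identifier of the form 'MD5_<32 hex characters>'."""
    # Assumption: whitespace and line breaks are not part of the sequence.
    cleaned = "".join(sequence.split()).upper()
    # Replace every letter outside the standard alphabet by 'X'.
    normalised = "".join(aa if aa in STANDARD_AA else "X" for aa in cleaned)
    digest = hashlib.md5(normalised.encode("ascii")).hexdigest()
    return "MD5_" + digest


if __name__ == "__main__":
    # Lower-case input and line breaks do not change the fingerprint.
    print(md5_fingerprint("mkvlaa\nGIVG"))
    print(md5_fingerprint("MKVLAAGIVG"))
```

Because the fingerprint depends only on the normalised sequence, identical
sequences from different source databases receive the same identifier, which
is what allows the tables to be linked on this column.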

Below, 'MD5' refers to the MD5 fingerprint of a protein sequence. All
amino-acid sequences are converted to capital letters, and non-standard amino
acid letters are replaced by 'X' (the MD5 is calculated from that sequence).

Data files:

nrprot.fasta.bz2:
=================
Compressed (with bzip2) FASTA-formatted protein sequence database used for
the analysis. The MD5 code of each sequence is used as the main sequence
name; other names and descriptions are appended to the description line,
separated by ASCII character 01 (also see '*.fasta' below).

SCOP-IMALA-DB.tar.bz2:
======================
Compressed (with bzip2) tar archive of IMPALA profiles for SCOP domains. The
sequence names are MD5 fingerprints, which can be looked up in the
nrprot.fasta file. Please see
ftp://ftp.ncbi.nih.gov/blast/executables/README.imp for information on how to
use these profiles.

scopsfam.txt:
=============
#1  SCOP superfamily code
#2  SCOP domain code of a representative domain of this superfamily
#3  Superfamily name

omimseqlink.txt:
================
#1  MD5 of a human (ENSEMBL) sequence in OMIM
#2  The OMIM identifier

Each directory contains data files for the results of a genome or a set of
genomes (bacteria and archaea). The file formats and contents are described
below:

*.fasta:
========
FASTA-formatted protein sequence file for the proteome. Many sequences may be
found in several databases with different identifiers and sometimes different
text descriptions. Different database entries are separated by the ASCII
character 01. Sequence names are in NCBI-nrprot style, i.e. the database
names and identifiers/accession numbers are connected by '|'.

dbj:     CDS from the DNA Data Bank of Japan
gb:      GenBank CDS
ref:     CDS from the RefSeq database
gnl:     wildcard, any sequence entry that is not in any other category
emb:     CDS from EMBL
sp:      SwissProt
pir:     PIR
prf:     sequence from the Patent database
pdb:     PDB chain
scop:    SCOP-1.53 domain
foldlib: sequence of the 3D-PSSM fold-library (SCOP-1.53 + recent PDB chains)
sanger:  S. pombe sequences from The Sanger Centre
ENST*:   ENSEMBL transcript

*.pfam:
=======
#1  MD5
#2  PFAM accession code
#3  Start of HMM-alignment in query
#4  Stop of HMM-alignment in query
#5  bit-score
#6  e-value
#7  first residue of HMM in alignment
#8  last residue of HMM in alignment
#9  length of HMM (in residues)
#10 Name of PFAM entry
#11 Description of PFAM entry

*.prosite:
==========
#1  MD5
#2  PROSITE pattern accession code
#3  Start of pattern in query (first residue matched)
#4  Stop of pattern in query (last residue matched)

*.region:
=========
To reduce the amount of data to post-process, PSI-BLAST, BLAST and IMPALA
alignments are clustered into 'regions' of the query by their overlap. SCOP
domains are clustered according to Muller et al. (1999), J. Mol. Biol. 293,
1257-1271. Other sequences are clustered by creating a single-linkage cluster
of overlapping sequences.

#1  MD5
#2  Name of region: SCOP domain, PDB region, annotated region (functionally
    annotated as described in the paper), any homology (as described in the
    paper)
    Note: 'regions' are not domains (except for SCOP), but just the location
    (in the query) of the homologues in the same cluster. Regions of the same
    type (name) are non-overlapping, but regions of different types can
    overlap.
#3  Start of the region within the query
#4  Stop of the region within the query
#5  MD5 of the representative of the region (the homologue of lowest e-value)
#6  Start of the alignment within the representative
#7  Stop of the alignment within the representative
#8  bit-score of the representative alignment
#9  e-value of the representative alignment
#10 Sequence identity of the representative alignment
#11 Type of the alignment (PsiBlastHit, BlastHit or IMPALAHit)
    Note: a PsiBlastHit does *not* mean this homology was not found by BLAST!
    BLAST was only run for those sequences that contain non-globular
    coiled-coil, trans-membrane or repeat regions.
#12 Accession numbers of the representative homologue
#13 Description of the representative homologue

Columns 12 and 13 are in the same format as the names and descriptions in the
FASTA-formatted sequence files.

*.scop:
=======
SCOP superfamilies found in the proteome.

#1  SCOP superfamily code
#2  Number of domains of this superfamily in the proteome
#3  #2/all_scop_domains_in_proteome
#4  Number of sequences with at least one domain of this superfamily
#5  #4/all_sequences_in_proteome
#6  Number of domains of this superfamily found in trans-membrane proteins
#7  Number of co-occurring SCOP superfamilies
#8  Number of co-occurring PFAM families

*.domparterns:
==============
SCOP superfamilies found in the same sequence (co-occurring superfamilies).

#1  SCOP superfamily code A
#2  SCOP superfamily code B
#3  Number of sequences containing this co-occurrence

Note: self-association is also counted (e.g. 004.082.001 004.082.001).

*.summary:
==========
Overview and summary of the annotation status of a proteome.

#1  What the 'Number' of annotation features is based on (sequences, regions
    (see above for a definition of the term 'region') or residues)
#2  The name/type of the annotation:

    AnnotRegion: Functionally annotated region
    HomolRegion: Region with a homologue (functionally annotated or
                 un-annotated)
    ScopRegion:  SCOP domain
    PDBRegion:   Region homologous to a PDB chain
    HMMHit:      homology to an uncharacterised PFAM family
    HMMHitAnnot: homology to a characterised PFAM family
    LCR:         low complexity region
    Coil:        Coiled-coil
    TMH:         trans-membrane region
    SigPep:      Signal peptide
    NonGlob:     LCR or Coil or TMH or SigPep or NonGlob
    Total:       a total number (of sequences or residues or regions)

    '_cumulative' means that this annotation feature was counted in a
    cumulative way: sequences or residues already counted are not counted
    again, so that the 'Total' is not exceeded. The priority of assignment is
    then:

    ScopRegion > PdbRegion > HMMHitAnnot > AnnotRegion > HMMHit >
    HomolRegion > HMMHitAnnot > SigPep > TMH > Coil > LCR

#3  The number of observations

[ end of file ]


  1. doi: 10.1101/gr.221202 Genome Res. November 1, 2002 vol. 12 no. 11 1625-1641
