Structural Characterization of the Human Proteome
This directory contains supplementary material for Muller A, MacCallum
RM & Sternberg MJE. Structural Characterisation of the Human
Proteome. Genome Research 2002.
Additional resources and web-based searches and applications are available
from: http://www.sbg.bio.ic.ac.uk/
contact: Arne Muller (a.mueller@cancer.org.uk) or
Mike Sternberg (m.sternberg@ic.ac.uk)
This file describes the data and formats of the supplementary
material. See file 'ls-sR' for a list and sizes (in kbytes) of included files.
Web resources:
Protein sequences of the proteomes are from the genome section of the NCBI
ftp-server (ftp.ncbi.nih.gov). Human proteins are from ENSEMBL-0.8.0
(www.ensembl.org). Links between human sequences and OMIM are taken from the
genelink table from the ENSEMBL-0.8.0 database. SCOP is available from The MRC
(http://scop.mrc-lmb.cam.ac.uk/scop/) and Stanford (scop domain sequence files
http://astral.stanford.edu/). PFAM is available from The Sanger Centre
(http://www.sanger.ac.uk/).
All SCOP superfamily codes refer to SCOP version 1.53. PFAM is version 6.2.
Proteome abbreviations:
Bacteria:
=========
Mtub = Mycobacterium tuberculosis
Ecoli = Escherichia coli
Bsub = Bacillus subtilis
Mgen = Mycoplasma genitalium
Vcho = Vibrio cholerae
Hpyl = Helicobacter pylori
Aquae = Aquifex aeolicus
Archaea:
========
Aero = Aeropyrum pernix
Pyro = Pyrococcus horikoshii
Mjan = Methanococcus jannaschii
Eukaryota:
==========
Human = Homo sapiens
S_cerevisiae or yeast = Saccharomyces cerevisiae
D_melanogaster or fly = Drosophila melanogaster
C_elegans or worm = Caenorhabditis elegans
All files contain tab-delimited data tables. A single column can
contain spaces, as in "Serine protease inhibitors" (one column).
All columns in the form 'MD5_11eaacf2431bafb6ec80cec311d77b5f' hold the 32
character hexadecimal MD5 checksum of a protein sequence. These MD5
fingerprints are used as identifiers within the data tables and can be used to
link tables. Below, 'MD5' refers to the MD5 fingerprint of a protein sequence.
Before the MD5 is calculated, all amino-acid sequences are converted to capital
letters and non-standard amino acid letters are replaced by 'X'.
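The normalisation and fingerprinting described above can be sketched in
Python as follows. This is only an illustration, not the code used for the
analysis. It assumes that "standard" means the 20 canonical one-letter amino
acid codes and that whitespace is removed before hashing; the function names
are hypothetical.

```python
import hashlib

# Assumption: "standard" letters are the 20 canonical amino acid codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def normalize_sequence(seq: str) -> str:
    """Upper-case a sequence and replace every non-standard letter by 'X'."""
    cleaned = "".join(seq.split()).upper()  # drop whitespace and line breaks
    return "".join(c if c in STANDARD_AMINO_ACIDS else "X" for c in cleaned)


def sequence_md5(seq: str) -> str:
    """Return an identifier of the form 'MD5_<32 hexadecimal characters>'."""
    digest = hashlib.md5(normalize_sequence(seq).encode("ascii")).hexdigest()
    return "MD5_" + digest
```

With these assumptions, sequence_md5("mkvlaagiBz") and
sequence_md5("MKVLAAGIXX") return the same identifier.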
Data files:
nrprot.fasta.bz2:
=================
compressed (with bzip2) FASTA-formatted protein sequence database used for the
analysis. The MD5 code for each sequence is used as the main sequence name;
other names and descriptions are appended to the description line, separated
by ASCII character 01 (also see '*.fasta' below).
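A minimal Python sketch for reading the compressed database without unpacking
it first. It assumes that the main sequence name is the first
whitespace-delimited token of the header, before the first ASCII 01 character;
the latin-1 encoding and the function names are also assumptions.

```python
import bz2
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def load_md5_index(path: str) -> Dict[str, str]:
    """Map the main sequence name (the MD5 code) to its sequence."""
    index = {}
    # Assumption: latin-1 decodes any byte, so unusual header bytes are kept.
    with bz2.open(path, "rt", encoding="latin-1") as handle:
        for header, seq in read_fasta(handle):
            fields = header.split("\x01", 1)[0].split()
            if fields:
                index[fields[0]] = seq
    return index
```

Calling load_md5_index("nrprot.fasta.bz2") holds every sequence in memory;
for a large database it may be preferable to iterate with read_fasta directly.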
SCOP-IMALA-DB.tar.bz2:
======================
compressed (with bzip2) tar archive of IMPALA profiles for SCOP domains. The
sequence names are MD5 fingerprints which can be looked up in the nrprot.fasta
file. Please see ftp://ftp.ncbi.nih.gov/blast/executables/README.imp for
information on how to use these profiles.
scopsfam.txt:
=============
#1 SCOP superfamily code
#2 SCOP domain code of a representative domain of this superfamily
#3 Superfamily name
omimseqlink.txt:
================
#1 MD5 of a human (ENSEMBL) sequence in OMIM
#2 The OMIM identifier
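Because the MD5 is shared between tables, omimseqlink.txt can be joined to any
other table that has an MD5 column. The following Python sketch shows one way
to do this. The function names are hypothetical, and it assumes the files have
no header row.

```python
import csv
from collections import defaultdict


def load_omim_links(handle):
    """Return {MD5: [OMIM identifier, ...]} from omimseqlink.txt-style rows."""
    links = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 2:
            links[row[0]].append(row[1])
    return dict(links)


def rows_with_omim(table_handle, links, md5_column=0):
    """Yield (row, omim_ids) for rows whose MD5 column has an OMIM link."""
    for row in csv.reader(table_handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) > md5_column and row[md5_column] in links:
            yield row, links[row[md5_column]]
```

For tables where the MD5 is not in the first column, pass the zero-based
column index as md5_column.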
Each directory contains data files for the results of a genome or a set of
genomes (bacteria and archaea). The file formats and contents are described
below:
*.fasta:
========
FASTA-formatted protein sequence file for the proteome. Many sequences may be
found in several databases with different identifiers and sometimes different
text descriptions. Different database entries are separated by the ASCII
character 01. Sequence names are in NCBI-nrprot style, i.e. the database names
and identifiers/accession numbers are connected by '|'.
dbj: CDS from the DNA Data Bank of Japan
gb: GenBank CDS
ref: CDS from the RefSeq database
gnl: wildcard, any sequence entry that is not in any other category
emb: CDS from EMBL
sp: SwissProt
pir: PIR
prf: sequence from the Protein Research Foundation (PRF) database
pdb: PDB chain
scop: SCOP-1.53 domain
foldlib: sequence of the 3D-PSSM fold-library (SCOP-1.53 + recent PDB chains)
sanger: S. pombe sequences from The Sanger Centre
ENST*: ENSEMBL transcript
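A header line in this style can be split into its database entries with a few
lines of Python. The sketch assumes that within each entry the name is
separated from its description by the first space; the example identifiers in
the usage note are invented, not taken from the data.

```python
from typing import List, NamedTuple


class FastaEntry(NamedTuple):
    name_fields: List[str]  # name split on '|', e.g. ['sp', 'P00000', 'EXAMPLE']
    description: str


def parse_header(header: str) -> List[FastaEntry]:
    """Split a header into one FastaEntry per ASCII-01-separated database entry."""
    entries = []
    for part in header.lstrip(">").split("\x01"):
        part = part.strip()
        if not part:
            continue
        # Assumption: the first space separates the name from the description.
        name, _, description = part.partition(" ")
        entries.append(FastaEntry(name.split("|"), description.strip()))
    return entries
```

For the invented header ">sp|P00000|EXAMPLE hypothetical protein" followed by
ASCII 01 and "gb|AAA00000 example CDS", parse_header returns two entries whose
first name fields are 'sp' and 'gb'.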
*.pfam:
=======
#1 MD5
#2 PFAM accession code
#3 Start of HMM-alignment in query
#4 Stop of HMM-alignment in query
#5 bit-score
#6 e-value
#7 first residue of HMM in alignment
#8 last residue of HMM in alignment
#9 length of HMM (in residues)
#10 Name of PFAM entry
#11 Description of PFAM entry
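The eleven columns can be read into typed records as sketched below in Python.
The field names are hypothetical labels for the columns listed above. The
sketch assumes there is no header row and that the numeric columns parse with
int() and float().

```python
import csv
from typing import Iterator, NamedTuple


class PfamHit(NamedTuple):
    md5: str
    accession: str
    query_start: int
    query_stop: int
    bit_score: float
    e_value: float
    hmm_first: int
    hmm_last: int
    hmm_length: int
    name: str
    description: str


def read_pfam(handle) -> Iterator[PfamHit]:
    """Yield one PfamHit per tab-delimited line of a *.pfam file."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 11:
            continue  # skip blank or truncated lines
        yield PfamHit(row[0], row[1], int(row[2]), int(row[3]),
                      float(row[4]), float(row[5]),
                      int(row[6]), int(row[7]), int(row[8]),
                      row[9], row[10])


def hmm_coverage(hit: PfamHit) -> float:
    """Fraction of the HMM covered by the alignment (columns 7 to 9)."""
    return (hit.hmm_last - hit.hmm_first + 1) / hit.hmm_length
```

Columns 7 to 9 make it possible to filter out partial domain matches, for
example by keeping only hits with a high hmm_coverage value.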
*.prosite:
==========
#1 MD5
#2 PROSITE Pattern accession code
#3 Start of pattern in query (first residue matched)
#4 Stop of pattern in query (last residue matched)
*.region:
=========
To reduce the amount of data to post-process, PSI-BLAST, BLAST and IMPALA
alignments are clustered into 'regions' of the query by their overlap. SCOP
domains are clustered according to Muller et al. (1999), J. Mol. Biol. 293,
1257-1271. Other sequences are clustered by creating a single linkage cluster
of overlapping sequences.
#1 MD5
#2 Name of region
SCOP domain, PDB region, annotated region (functionally annotated as
described in the paper), any homology (as described in the paper)
Note, 'regions' are not domains (except for SCOP), but just the location (in
the query) of the homologues in the same cluster. Regions of the same type
(name) are non-overlapping, but different region types can overlap.
#3 Start of the region within the query
#4 Stop of the region within the query
#5 MD5 of the representative of the region (the homologue of lowest e-value)
#6 Start of the alignment within the representative
#7 Stop of the alignment within the representative
#8 bit-score of the representative alignment
#9 e-value of the representative alignment
#10 Sequence identity of the representative alignment
#11 Type of the alignment (PsiBlastHit, BlastHit or IMPALAHit)
Note, a PsiBlastHit does *not* mean this homology was not found by BLAST!
BLAST was only run for those sequences that contain non-globular coiled-coil,
trans-membrane or repeat regions.
#12 accession numbers of the representative homologue
#13 Description of the representative homologue
Columns 12 and 13 are in the same format as the names and descriptions for the
FASTA-formatted sequence files.
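The single linkage clustering of overlapping alignments used for non-SCOP
regions can be illustrated with a short Python sketch. This is not the
original implementation. It assumes that two alignments are linked as soon as
they share at least one query residue, since a minimum overlap is not stated
here.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, stop) in the query, inclusive


def single_linkage_regions(alignments: List[Interval]) -> List[Interval]:
    """Merge alignments that are connected by a chain of overlaps."""
    regions: List[List[int]] = []
    for start, stop in sorted(alignments):
        # Assumption: sharing one residue is enough to join a cluster.
        if regions and start <= regions[-1][1]:
            regions[-1][1] = max(regions[-1][1], stop)
        else:
            regions.append([start, stop])
    return [(first, last) for first, last in regions]
```

For intervals on a line, single linkage by overlap is equivalent to merging
sorted intervals, which is why one pass over the sorted list is enough. This
also shows why regions of the same type never overlap each other.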
*.scop:
=======
SCOP superfamilies found in the proteome
#1 SCOP superfamily code
#2 Number of domains of this superfamily in the proteome
#3 #2/all_scop_domains_in_proteome
#4 Number of sequences with at least one domain of this superfamily
#5 #4/all_sequences_in_proteome
#6 Number of domains of this superfamily found in trans-membrane proteins
#7 Number of co-occurring SCOP superfamilies
#8 Number of co-occurring PFAM families
*.domparterns:
==============
SCOP superfamilies found in the same sequence (co-occurring superfamilies).
#1 SCOP superfamily code A
#2 SCOP superfamily code B
#3 Number of sequences containing this co-occurrence
Note, self association is also counted (e.g. 004.082.001 004.082.001)
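One possible way to derive such a table from per-sequence domain assignments
is sketched below in Python. How self association was counted in the original
analysis is not specified here. The sketch assumes that a superfamily pairs
with itself when a sequence carries at least two domains of that superfamily.
The input structure and function name are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List


def count_cooccurrences(domains_by_sequence: Dict[str, List[str]]) -> Counter:
    """Count, for each superfamily pair, the sequences that contain it."""
    counts: Counter = Counter()
    for superfamilies in domains_by_sequence.values():
        # Each pair is counted once per sequence, however often it occurs.
        pairs = {tuple(sorted(pair)) for pair in combinations(superfamilies, 2)}
        counts.update(pairs)
    return counts
```

The keys of the returned counter are (code A, code B) tuples in sorted order,
and the values correspond to column 3.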
*.summary:
==========
Overview and summary of the annotation status of a proteome.
#1 What the 'Number' of annotation features is based on (Sequences, Regions
(see above for a definition of the 'region' term) or Residues)
#2 The name/type of the annotation
AnnotRegion: Functionally annotated region
HomolRegion: Region with a homologue (functionally annotated or un-annotated)
ScopRegion: SCOP domain
PDBRegion: Region homologous to a PDB chain
HMMHit: homology to an uncharacterised PFAM family
HMMHitAnnot: homology to a characterised PFAM family
LCR: low complexity region
Coil: Coiled-coil
TMH: trans-membrane region
SigPep: Signal peptide
NonGlob: LCR or Coil or TMH or SigPep
Total: a total number (of sequences or residues or regions)
'_cumulative' means that this annotation feature was counted in a cumulative
way. Sequences or residues already counted will not be counted again, so that
the 'Total' will not be exceeded. The priority of assignment is then:
ScopRegion > PDBRegion > HMMHitAnnot > AnnotRegion > HMMHit > HomolRegion >
HMMHitAnnot > SigPep > TMH > Coil > LCR
#3 The number of observations
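The '_cumulative' counting at residue level can be illustrated with the
following Python sketch. It is an interpretation of the priority order given
above, not the original code. HMMHitAnnot is listed only once because a
repeated entry has no effect under cumulative counting. The input structure
and function name are hypothetical.

```python
from typing import Dict, List, Tuple

# Priority order from the description above, highest priority first.
PRIORITY = ["ScopRegion", "PDBRegion", "HMMHitAnnot", "AnnotRegion",
            "HMMHit", "HomolRegion", "SigPep", "TMH", "Coil", "LCR"]


def cumulative_residue_counts(
    length: int, features: Dict[str, List[Tuple[int, int]]]
) -> Dict[str, int]:
    """Assign each residue to the highest-priority feature that covers it.

    features maps an annotation name to (start, stop) pairs, 1-based inclusive.
    """
    assigned = [None] * (length + 1)  # index 0 is unused
    for name in PRIORITY:
        for start, stop in features.get(name, []):
            for pos in range(max(start, 1), min(stop, length) + 1):
                if assigned[pos] is None:  # residues are never counted twice
                    assigned[pos] = name
    counts = {name: 0 for name in PRIORITY}
    for name in assigned[1:]:
        if name is not None:
            counts[name] += 1
    return counts
```

Because every residue is assigned at most once, the counts always sum to no
more than the sequence length, which matches the statement that the 'Total'
is not exceeded.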
[ end of file ]