The PhyloFacts resource contains pre-calculated structural and phylogenomic analyses of over 15,000 protein family "books" across the Tree of Life. Each book includes a multiple sequence alignment, one or more phylogenetic trees, predicted subfamilies, predicted 3D protein structures, active sites and other key residues, cellular localization, and Gene Ontology (GO) annotations and evidence codes. PhyloFacts includes hidden Markov models for classification of user-submitted (DNA or protein) sequences to protein families and subfamilies. Our current focus is on covering all the gene families represented in the human genome and all structural domains, but we plan to expand the resource to include all proteins in all species. PhyloFacts enables biologists to avoid the systematic errors associated with function prediction by homology through the integration of a variety of experimental data and bioinformatics methods in an evolutionary framework.
This work was supported by a Presidential Early Career Award for Scientists and Engineers (PECASE) from the National Science Foundation, and by an R01 from the National Human Genome Research Institute of the NIH.
1. Krishnamurthy, N., Brown, D., Kirshner, D. and Sjölander, K. (2006). PhyloFacts: An online structural phylogenomic encyclopedia for protein functional and structural classification. Genome Biology, 7(9):R83.
The Markov Chain Promoter Prediction Server (McPromoter) uses statistics to predict eukaryotic DNA transcription start sites.
Creates hidden Markov model of motif from MEME output and searches sequence database for matches to this motif.
Collection of multiple sequence alignments and hidden Markov models covering many common protein domains.
EVEREST is an automatic computational process identifying protein domains and classifying them into families. The EVEREST database contains 20,029 families, each defined by one or more HMMER HMMs. EVEREST has been thoroughly tested and evaluated, and has been shown to reconstruct 56% of Pfam A families and 63% of SCOP families with high accuracy, and to suggest many new domain families.
The current release of the EVEREST database was constructed by scanning UniProt release 8.1 and the sequences of all PDB structures with each of the EVEREST families.
The EVEREST database of protein domain families can be accessed through the EVEREST website. The website allows browsing through EVEREST domain families as well as domain families defined by SCOP, CATH and Pfam A. A family page on the website provides a graphical representation of all proteins containing a domain of the family, and of all domains, as defined by the above four domain definition systems, on these proteins. The website also features analysis of relationships between families and searches for proteins and families on the basis of keywords, family statistics, family phylogenetic profile and more. Finally, the user may upload a sequence to be scanned for EVEREST families and stored for future browsing by that user.
The EVEREST database is also available for download in flat file format.
While developing EVEREST, E.P. was supported by an Eshkol fellowship from the Israeli Ministry of Science and by the Sudarsky B Center for Computational Biology.
This work is partially funded by NoE (Framework VI) BioSapiens consortium.
1. Portugaly, E., Linial, N. and Linial, M. (2007) EVEREST: A collection of evolutionary conserved protein domains. Nucleic Acids Res, 35: in press.
2. Portugaly, E., Harel, A., Linial, N. and Linial, M. (2006) EVEREST: automatic identification and classification of protein domains in all protein sequences. BMC Bioinformatics, 7: 277.
Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. Pfam is available on the World Wide Web in the UK at http://www.sanger.ac.uk/Software/Pfam/, in Sweden at http://www.cgr.ki.se/Pfam/, and in the US at http://pfam.wustl.edu/. The web pages can give access to the alignments, trees, protein structure and other functional information for each family. The Pfam libraries of HMMs can be used locally to define domains in complete genomes. Pfam currently contains over 6,000 protein families and domains.
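As a rough illustration of what "defining domains" by scanning sequences with a family model involves, the sketch below builds an ungapped log-odds profile from a small alignment and slides it along a sequence. This is a deliberate simplification, not Pfam's method: a real profile HMM also models insertions and deletions, and Pfam searches are run with HMMER. The four-letter alphabet, the function names, the toy alignment and the threshold are all assumptions made for the example.

```python
import math

def log_odds_profile(aligned_seqs, alphabet="ACDE", pseudocount=1.0):
    """Build a per-column log-odds table from an ungapped toy alignment.

    Assumes all sequences have equal length and use only `alphabet`
    (a hypothetical reduced alphabet, chosen to keep the example small).
    """
    background = 1.0 / len(alphabet)  # uniform background, an assumption
    profile = []
    for col in range(len(aligned_seqs[0])):
        counts = {a: pseudocount for a in alphabet}
        for seq in aligned_seqs:
            counts[seq[col]] += 1
        total = sum(counts.values())
        profile.append(
            {a: math.log((counts[a] / total) / background) for a in alphabet}
        )
    return profile

def scan(sequence, profile, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = sum(profile[i][sequence[start + i]] for i in range(width))
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits

# Toy usage: a three-sequence "family" and one sequence to scan.
family = ["ACDE", "ACDE", "ACDA"]
profile = log_odds_profile(family)
print(scan("EEACDEEE", profile, threshold=2.0))
```

Only the window that matches the family consensus scores above the threshold here; in practice thresholds are curated per family rather than chosen by hand.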
TIGRFAMs is a collection of manually curated protein families consisting of hidden Markov models (HMMs), multiple sequence alignments, Gene Ontology (GO) assignments, commentary, literature references and pointers to related TIGRFAMs, Pfam and InterPro models. These models are designed to support both automated and manually curated annotation of genomes. TIGRFAMs contains models of full-length proteins as well as domains at the levels of superfamilies, subfamilies and equivalogs (which are sets of homologous proteins that are conserved with respect to function since their last common ancestor). TIGRFAMs models are allowed to be hierarchically nested to yield the maximum amount of information for the annotation process. TIGRFAMs are thus complementary to Pfam models, which are designed to represent non-overlapping structural domains. The TIGRFAMs database is integrated with the prokaryotic genome annotation pipeline at TIGR and thus is being constantly updated with respect to new information on protein function, model scope and performance. TIGRFAMs currently contains over 1600 protein families, having doubled in size in two years. TIGRFAMs is available for searching or downloading at www.tigr.org/TIGRFAMs.
Recent developments:
Since the TIGRFAMs database was first described in the January 2001 database issue of Nucleic Acids Research, the number of models in TIGRFAMs has doubled to over 1600. A large number of entries have been assigned specific Gene Ontology (GO) terms. TIGRFAMs links are now reported in the SwissProt database. TIGRFAMs has been incorporated into InterPro; InterPro entries based on or including TIGRFAMs entries show parent/child and contains/found in relationships with entries from Pfam, SMART, and other protein classification databases. Continued use of TIGRFAMs in microbial annotation at TIGR has provided steady feedback for improving the accuracy of existing models as new genomes and new functional characterizations have become available. TIGRFAMs models now hit nearly twenty per cent of the proteins of typical newly sequenced bacterial genomes. The equivalog subset can be expected to make about 400 high-confidence specific functional assignments for a typical new 4-megabase bacterial genome.
DBD provides transcription factor predictions for more than 150 completely sequenced genomes, available for browsing and download. Predictions are based on the presence of sequence-specific DNA-binding domain assignments using hidden Markov models from the SUPERFAMILY and Pfam databases. Evaluation shows that our predictions are 97% accurate and give at least 65% coverage. Focusing on individual genomes shows that DBD identifies many novel factors, corresponding, for example, to a 90% increase in the known mouse transcription factor repertoire. Users can browse factors by genome or domain type, search for particular factors or submit a sequence for prediction.
The SUPERFAMILY database contains a library of hidden Markov models representing all proteins of known structure. The database is based on the SCOP 'superfamily' level of protein domain classification which groups together the most distantly related proteins which have a common evolutionary ancestor. There is a public server at http://supfam.org which provides three services: sequence searching, multiple alignments to sequences of known structure, and structural assignments to all complete genomes. Given an amino acid or nucleotide query sequence the server will return the domain architecture and SCOP classification. The server produces alignments of the query sequences with sequences of known structure, and includes multiple alignments of genome and PDB sequences. The structural assignments are carried out on all complete genomes (currently 59) covering approximately half of the soluble protein domains. The assignments, superfamily breakdown, and statistics on them are available from the server. The database is currently used by this group and others for genome annotation, structural genomics, gene prediction, and domain-based genomic studies.
Thanks to Thomas Down for help setting up the DAS server, and Matthew Bashton for contribution to web design.
Gough, J., Karplus, K., Hughey, R. and Chothia, C. (2001). Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that Represent all Proteins of Known Structure. J. Mol. Biol., in press.
Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.
Prediction of vertebrate and C. elegans genes.
Jumping Profile Hidden Markov Model (jpHMM) takes an HIV-1 genome sequence and uses a pre-calculated multiple alignment of the major HIV-1 subtypes to predict the phylogenetic breakpoints and HIV subtype of the submitted sequence.
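The idea of a model that can "jump" between subtypes along a sequence can be illustrated with a much smaller toy problem. The sketch below is not the jpHMM algorithm: it assumes the query is already aligned, position by position, to one reference sequence per subtype, and it uses a fixed match probability and a fixed jump probability (both hypothetical values). Plain Viterbi decoding then assigns each position to a subtype, and positions where the assignment changes are reported as breakpoints.

```python
import math

def assign_subtypes(query, references, p_match=0.9, p_jump=0.01):
    """Viterbi decoding with one hidden state per reference subtype.

    Assumes `query` and every reference are pre-aligned and of equal
    length (a simplification made for this example).
    """
    names = list(references)
    n = len(names)
    log_stay = math.log(1.0 - p_jump)
    log_jump = math.log(p_jump / (n - 1)) if n > 1 else float("-inf")

    def emit(name, i):
        # Log-probability of the query base given the reference base.
        same = references[name][i] == query[i]
        return math.log(p_match if same else 1.0 - p_match)

    # score[k]: best log-probability of any path ending in state k.
    score = [emit(name, 0) - math.log(n) for name in names]
    back = []
    for i in range(1, len(query)):
        new_score, pointers = [], []
        for k, name in enumerate(names):
            best_prev = max(
                range(n),
                key=lambda j: score[j] + (log_stay if j == k else log_jump),
            )
            trans = log_stay if best_prev == k else log_jump
            new_score.append(score[best_prev] + trans + emit(name, i))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace the best path back from the last position.
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [names[k] for k in path]

def breakpoints(path):
    """Positions at which the assigned subtype changes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]

# Toy usage: the query follows subtype "A" for 10 bases, then subtype "B".
refs = {"A": "A" * 20, "B": "C" * 20}
path = assign_subtypes("A" * 10 + "C" * 10, refs)
print(breakpoints(path))  # prints [10]
```

The jump probability controls how much evidence is needed before a switch is accepted: a smaller value suppresses spurious breakpoints caused by isolated mismatches.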
HHrep is a tool for the de novo identification of repeats in protein sequences based on the pairwise comparison of profile hidden Markov models (HMMs).
Based on the comparison of profile HMMs, HHpred takes a protein sequence or multiple sequence alignment as input and searches for remote homologues in an assortment of databases such as PDB, SMART and Pfam. The user can select either a local or global alignment method, and the search results can be used to generate 3D structural models.
Family Identification with Structure Anchored HMMs (FISH) is a server for the identification of remote sequence homologues on the basis of protein domains.
PRED-GPCR is a tool which queries user-supplied sequences against a database of HMMs corresponding to G-protein coupled receptor (GPCR) families in order to determine which GPCR family the query sequence most resembles.
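Querying a sequence against a library of family HMMs and reporting the best-matching family can be sketched as follows. This is a minimal illustration under stated assumptions, not PRED-GPCR's implementation: the two "families" are invented two-state HMMs over a hypothetical reduced alphabet (H for hydrophobic, P for polar), and the score is the plain forward log-likelihood with no null-model correction or significance estimate.

```python
import math

NEG_INF = float("-inf")

def _log(p):
    return math.log(p) if p > 0 else NEG_INF

def _log_sum_exp(values):
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(seq, hmm):
    """log P(seq | hmm), computed with the forward algorithm in log space.

    `hmm` is a dict with 'start', 'trans' and 'emit' probability tables.
    """
    states = list(hmm["start"])
    alpha = {
        s: _log(hmm["start"][s]) + _log(hmm["emit"][s].get(seq[0], 0.0))
        for s in states
    }
    for symbol in seq[1:]:
        alpha = {
            t: _log_sum_exp(
                [alpha[s] + _log(hmm["trans"][s].get(t, 0.0)) for s in states]
            )
            + _log(hmm["emit"][t].get(symbol, 0.0))
            for t in states
        }
    return _log_sum_exp(list(alpha.values()))

def classify(seq, family_models):
    """Return the best-scoring family name and all per-family scores."""
    scores = {
        name: forward_log_likelihood(seq, model)
        for name, model in family_models.items()
    }
    return max(scores, key=scores.get), scores

def toy_family_models():
    """Two hypothetical families that differ only in segment length."""
    emit = {"tm": {"H": 0.9, "P": 0.1}, "loop": {"H": 0.2, "P": 0.8}}
    start = {"tm": 0.5, "loop": 0.5}

    def model(stay):
        move = 1.0 - stay
        trans = {
            "tm": {"tm": stay, "loop": move},
            "loop": {"loop": stay, "tm": move},
        }
        return {"start": start, "trans": trans, "emit": emit}

    return {"long_segments": model(0.9), "short_segments": model(0.1)}

# Toy usage: a query with long hydrophobic and polar runs.
best, scores = classify("HHHHHPPPPPHHHHH", toy_family_models())
print(best)
```

A real server would additionally compare the best score against a threshold, so that sequences resembling none of the families are not forced into one.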