Adopting a systematic gene-centric pipeline approach, GenomeTraFaC (http://genometrafac.cchmc.org) allows genome-wide detection and characterization of compositionally similar cis-clusters that occur in gene orthologs between any two genomes, for both microRNA genes and conventional RNA-encoding genes. Each ortholog gene pair can be scanned to visualize overall conserved sequence regions; within these, the relative density of conserved cis-element motif clusters appears as peaks in a graph. The results of these analyses can be mined en masse to identify the most frequently represented cis-motifs in a list of genes. The system also provides a method for rapid evaluation and visualization of gene model consistency between orthologs, and facilitates consideration of how sequence variation in conserved non-coding regions may affect complex cis-element structures. Using the mouse and human genomes via the NCBI Reference Sequence database and the Sanger Institute miRBase, the system demonstrated the ability to identify validated transcription factor targets within promoter and distal genomic regulatory regions of both conventional and microRNA genes.
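The idea of motif-cluster density forming peaks can be illustrated with a small sliding-window count. This is only a sketch under stated assumptions, not GenomeTraFaC's actual method: the function names, the window and step sizes, and the input format (a list of motif hit start positions within a conserved region) are all hypothetical.

```python
from typing import List, Tuple


def motif_density(hit_positions: List[int], region_length: int,
                  window: int = 200, step: int = 50) -> List[Tuple[int, int]]:
    """Return (window_start, hit_count) pairs along a region.

    hit_positions are 0-based start coordinates of conserved motif hits
    (hypothetical input; the real system's data model is not described here).
    """
    profile = []
    last_start = max(region_length - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        # Count hits whose start falls inside the half-open window [start, end).
        count = sum(1 for p in hit_positions if start <= p < end)
        profile.append((start, count))
    return profile


def peak_window(profile: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return the first window with the highest hit count."""
    return max(profile, key=lambda item: item[1])


if __name__ == "__main__":
    hits = [10, 20, 30, 500, 510, 520, 530, 540]
    profile = motif_density(hits, region_length=1000, window=200, step=100)
    print(peak_window(profile))
```

Plotting the hit counts against the window starts would give a curve whose maxima mark candidate motif clusters; a real analysis would also weight hits by conservation between the two orthologs.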
This work was supported by grants NCI UO1 CA84291-07 (Mouse Models of Human Cancer Consortium), NIEHS ES-00-005 (Comparative Mouse Genome Centers Consortium) and NIEHS P30-ES06096 (Center for Environmental Genetics).
As a member of the International Nucleotide Sequence Database Collaboration (INSDC, http://www.insdc.org), DDBJ (http://www.ddbj.nig.ac.jp) has steadily collected, annotated, released and exchanged original DNA sequence data, as shown, for example, by the growth curve of data submissions over the past years (see http://www.ddbj.nig.ac.jp/images/breakdown_stats/percentage-e.gif). However, the situation of data submissions is changing dramatically with the emergence of ultra-high-speed, or 2nd generation, sequencers (2GS) such as 454 (by 454 Life Sciences), Solexa (by Illumina, Inc.), SOLiD (by Applied Biosystems) and HeliScope (by Helicos BioSciences). With these machines the whole human genome can now be sequenced in one-thousandth or less of the time required for the first cases in 2001 (1, 2). Recently, two reports announced that the whole genomes of two well-known persons had been sequenced (3, 4), which was perhaps the beginning of personal genomics. The 1000 Genomes Project is also underway in the USA, Europe and China to obtain a complete and detailed catalogue of human genetic variation (http://www.1000genomes.org/page.php). These activities warn us that the above growth curve will steepen drastically. At present INSDC has released about 100 billion bases in total, the outcome of more than 20 years of collaboration among the three member banks. However, this number will easily be surpassed when the 1000 Genomes Project is completed and its results are submitted to INSDC in a few years, or even earlier. To cope with these activities, the INSDC collaborators discussed in 2008 how to handle mass submissions produced by 2GS. The common fear among the collaborators was that limited computer storage will sooner or later be filled by continuously arriving mass submissions.
Nevertheless, the collaborators agreed to collect, distribute and exchange mass data of transcriptomes such as trace archives and short reads, on the condition that the sequences are assembled. DDBJ has also started to accept and release such mass sequence data.
Recent developments:
In the following text, DDBJ's activity is reported with a focus on mass data submissions from Japanese universities and institutes. DDBJ (http://www.ddbj.nig.ac.jp) collected and released 2,368,110 entries or 1,415,106,598 bases in the period from July 2007 to June 2008. The releases in this period include genome-scale data of Bombyx mori, Oryzias latipes, Drosophila and Lotus japonicus. In addition, this year we began collecting and releasing trace archive data in collaboration with the National Center for Biotechnology Information (NCBI). The first release contains data for Oryzias latipes and bacterial metagenomes of the human gut. To keep pace with current sequencing technology, we also accepted and released more than 100 million short reads of parasitic protozoa and their hosts that were produced using a Solexa sequencer.
We thank all staff of DDBJ for the data collection, annotation, release, management and software development. DDBJ is funded by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) with the management expenses grant for national university corporations. DDBJ is also supported by a grant from the National Project of Integrating Life Science Databases.
1. Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W. et al. (2001) Initial sequencing and analysis of the human genome, Nature, 409, 860-921
2. Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A. et al. (2001) The sequence of the human genome, Science, 291, 1304-1351
3. Wheeler, D.A., Srinivasan, M., Egholm, M., Shen, Y., Chen, L., McGuire, A., He, W., Chen, Y.-J., Makhijani, V., Roth, G.T. et al. (2008) The complete genome of an individual by massively parallel DNA sequencing, Nature, 452, 872-876
4. Levy, S., Sutton, G., Ng, P.C., Feuk, L., Halpern, A.L., Walenz, B.P., Axelrod, N., Huang, J., Kirkness, E.F., Denisov, G. et al. (2007) The diploid genome sequence of an individual human, PLoS Biology, 5, 2113-2144
The EMBL Nucleotide Sequence Database (URL: http://www.ebi.ac.uk/embl/) is maintained at the European Bioinformatics Institute (EBI) in an international collaboration with the DNA Data Bank of Japan (DDBJ) and GenBank at the NCBI (USA). Data are exchanged amongst the collaborating databases on a daily basis. The major contributors to the EMBL database are individual authors and genome project groups. Webin is the preferred Web-based submission system for individual submitters, whilst automatic procedures allow incorporation of sequence data from large-scale genome sequencing centres and from the European Patent Office (EPO). Database releases are produced quarterly. Network services allow free access to the most up-to-date data collection. For sequence similarity searches a variety of tools (e.g. Fasta, BLAST) are available.
Kulikova, T., Aldebert, P., Althorpe, N., Baker, W., Bates, K., Browne, P., Van den Broek, A., Cochrane, G., Duggan, K., Eberhardt, R., Faruque, N., Garcia-Pastor, M., Harte, N., Kanz, C., Leinonen, R., Lin, Q., Lombard, V., Lopez, R., Mancuso, R., McHale, M., Nardone, F., Silventoinen, V., Stoehr, P., Stoesser, G., Tuli, M.A., Tzouvara, K., Vaughan, R., Wu, D., Zhu, W. and Apweiler, R. (2004) The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 32, D27-D30.
GenBank is a comprehensive sequence database that contains publicly available DNA sequences for more than 170,000 different organisms, obtained primarily through the submission of sequence data from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the BankIt (Web) or Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data are accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical literature via PubMed. Sequence similarity searching is provided by the BLAST family of programs. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. NCBI also offers a wide range of World Wide Web retrieval and analysis services based on GenBank data. The GenBank database and related resources are freely accessible via the NCBI home page at http://www.ncbi.nlm.nih.gov
Patome contains biological sequence data disclosed in patents and published applications, together with analysis information for those sequences. The analysis is divided into two steps. The first is an annotation step, in which the disclosed sequences are annotated against the RefSeq database. The second is an association step, in which the sequences are linked to the Entrez Gene, OMIM and GO databases and the results are saved as a gene-patent table. Patome is available at http://www.patome.org/; the information is updated bimonthly.
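The two-step analysis can be pictured as two lookups chained together. The sketch below is only an illustration under stated assumptions, not Patome's implementation: the function name, the dictionary-based inputs and every identifier in the example are hypothetical placeholders, and the annotation step is assumed to have already produced one best RefSeq match per patent sequence.

```python
from collections import defaultdict
from typing import Dict, List


def build_gene_patent_table(seq_to_patent: Dict[str, str],
                            seq_to_refseq: Dict[str, str],
                            refseq_to_gene: Dict[str, str]) -> Dict[str, List[str]]:
    """Chain annotation (sequence -> RefSeq) and association (RefSeq -> gene).

    Returns a mapping from gene identifier to the sorted list of patents
    that disclose at least one sequence annotated to that gene.
    """
    table = defaultdict(set)
    for seq_id, refseq_id in seq_to_refseq.items():
        gene_id = refseq_to_gene.get(refseq_id)
        patent_id = seq_to_patent.get(seq_id)
        # Skip sequences that could not be linked in either step.
        if gene_id is None or patent_id is None:
            continue
        table[gene_id].add(patent_id)
    return {gene: sorted(patents) for gene, patents in table.items()}


if __name__ == "__main__":
    # All identifiers below are made-up placeholders, not real records.
    seq_to_patent = {"SEQ_1": "PATENT_A", "SEQ_2": "PATENT_A", "SEQ_3": "PATENT_B"}
    seq_to_refseq = {"SEQ_1": "REF_1", "SEQ_2": "REF_2", "SEQ_3": "REF_1"}
    refseq_to_gene = {"REF_1": "GENE_X", "REF_2": "GENE_Y"}
    print(build_gene_patent_table(seq_to_patent, seq_to_refseq, refseq_to_gene))
```

Keeping the table keyed by gene makes the typical query, "which patents mention this gene?", a single lookup; links to OMIM or GO terms could be attached to the gene key in the same way.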
B. Lee thanks Dr. YoungGyun Cho at the Korean Intellectual Property Office (KIPO) for helpful discussion. We especially thank Maryana Bhak for editing this manuscript. This work was supported by the Korean Ministry of Science and Technology (MOST) under grant number M10407010001-04N0701-00110.
The National Center for Biotechnology Information Reference Sequence (RefSeq) database provides curated non-redundant sequence standards for genomic regions, transcripts (including splice variants), and proteins.
Records are compiled using a combined approach of collaboration, automated methods, prediction, and curation and are extensively integrated with other NCBI resources facilitating navigation and discovery. RefSeq records represent the current best view of genomes and their transcript and/or protein products.
Recent developments:
The RefSeq collection continues to expand apace with genome sequencing projects. The complete collection is available by FTP in bimonthly releases (ftp://ftp.ncbi.nih.gov/refseq/release/). RefSeq release 18, provided in July 2006, included over 2.7 million protein records from over 3,600 organisms.
1. The NCBI handbook [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2002 Oct. Chapter 17, The Reference Sequence (RefSeq) Project. Available from http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books
2. Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005 Jan 1;33(1):D501-D504.
3. Pruitt KD, Katz KS, Sicotte H, Maglott DR. Introducing RefSeq and LocusLink: curated human genome resources at the NCBI. Trends Genet. 2000 Jan;16(1):44-47.
Xpro is a relational database that contains all the eukaryotic protein-encoding DNA sequences in GenBank. It provides detailed and comprehensive features for both intron-containing and intron-less genes.
In addition to the information found in the GenBank records, which includes properties such as the sequence, position, length and description of introns, exons and protein-coding regions, Xpro provides annotations on splice-site motifs and intron phases. Furthermore, Xpro validates intron positions using alignments between each record's sequence and EST sequences found in dbEST. The entries in Xpro are also cross-referenced to the SWISS-PROT/TrEMBL (http://www.ebi.ac.uk/trembl/index.html) and Pfam (http://www.sanger.ac.uk/Software/Pfam/) databases.
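Intron phase and splice-site motifs can both be derived from the coding exon coordinates and the intron sequence. The following is a minimal sketch assuming the standard definitions (phase is the number of coding bases preceding the intron, modulo 3; the canonical motif is GT at the donor site and AG at the acceptor site). The function names and input formats are hypothetical and do not describe Xpro's schema or code.

```python
from typing import List, Tuple


def intron_phases(cds_exons: List[Tuple[int, int]]) -> List[int]:
    """Return the phase of each intron between consecutive coding exons.

    cds_exons holds 1-based inclusive (start, end) coordinates of the coding
    part of each exon, listed in transcription order. Phase 0 means the
    intron falls between two codons, phase 1 after the first base of a
    codon, and phase 2 after the second base.
    """
    phases = []
    coding_length = 0
    # Every exon except the last is followed by an intron.
    for start, end in cds_exons[:-1]:
        coding_length += end - start + 1
        phases.append(coding_length % 3)
    return phases


def is_canonical_intron(intron_seq: str) -> bool:
    """Check for the canonical GT...AG splice-site dinucleotides."""
    seq = intron_seq.upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


if __name__ == "__main__":
    exons = [(1, 90), (201, 300), (401, 455)]
    print(intron_phases(exons))                  # coding lengths 90 and 190
    print(is_canonical_intron("gtaagtttttcag"))
```

An intron-less gene simply yields an empty phase list, so both gene classes can be handled by the same routine.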