Sequence databases - Understanding Bioinformatics

Nucleotide sequence related databases (8% of the 858 listed in the Molecular Biology Database Collection) include major international collaborations, such as GenBank and the EMBL-EBI Nucleotide Sequence database, as well as resources that are more gene-specific, with information on introns, exons, and splice sites, as well as motifs and transcriptional regulators and sites. RNA-specific databases comprise 5% of the total, and most have data on secondary structure and other aspects in addition to sequence. Genomic databases form a large part of the database list, with 19% nonhuman and 8% human and other vertebrate genomes.

Figure 3.6

Distribution of the type of databases as classified at the Nucleic Acids Research (NAR) Molecular Biology Database Collection Web site. In 2006 there were 858 databases listed in total, classified into 14 major categories, of which the genome (27%) and sequence (26%) databases form the largest sections.

Several different types of DNA sequence are stored in the databases containing information about nucleic acids. These differ in the way they have been obtained, and each type provides different biological information and must be analyzed differently. The first type is raw genomic sequence, representing the sequence of chromosomal DNA. This is the type of DNA sequence derived from genome sequencing projects and is deposited in GenBank and the organism-specific DNA sequence databases. These sequences include all the elements present in genomic DNA, including noncoding regions, introns, and control regions, as well as the sequences that code for proteins and RNAs. A second type of DNA sequence, known as cDNA, is the sequence of a DNA molecule that has been synthesized by reverse transcription (copying of RNA into DNA) of an mRNA molecule. The mRNA present at the time of the experiment will depend on the nature of the sample, for example the type of cell or tissue, the particular stage of development, or the particular disease. The set of cDNA database entries for that sample represents the genes actually being expressed in that sample. Because they are synthesized using the mRNA as a template, these DNA sequences lack any introns that might exist in the gene and any control sequences that lie outside the region transcribed into RNA (see Chapter 1). The third type of DNA sequence held in databases is known as expressed sequence tags (ESTs). An EST is a partial cDNA sequence. As with cDNAs, a library of ESTs indicates the range of genes being expressed in a sample, and ESTs can be used to scan genome sequences to help identify genes. The NCBI site hosts a special database of ESTs called dbEST.
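The relationship between the three sequence types can be illustrated with a minimal sketch. It assumes a toy genomic sequence and exon coordinates invented purely for illustration, and the helper name `splice_exons` is hypothetical. Joining the exons gives an intron-free, cDNA-like sequence, and a fragment of that sequence plays the role of an EST.

```python
def splice_exons(genomic: str, exons: list[tuple[int, int]]) -> str:
    """Join exon segments of a genomic sequence into one intron-free sequence.

    Coordinates are 1-based and inclusive, as in GenBank feature tables.
    """
    return "".join(genomic[start - 1:end] for start, end in exons)


# Toy genomic sequence (invented for illustration): exons in upper case,
# a single intron in lower case.
genomic = "ATGGCC" + "gtaagtttcag" + "TTTTAA"
exons = [(1, 6), (18, 23)]

# cDNA-like sequence: the genomic sequence with the intron removed.
cdna_like = splice_exons(genomic, exons)
print(cdna_like)  # ATGGCCTTTTAA

# An EST is a partial cDNA sequence; here, a fragment of the spliced sequence.
est = cdna_like[3:9]
print(est, est in cdna_like)  # GCCTTT True
```

Real cDNA entries also carry untranslated regions and depend on which transcripts are present in the sample, so this only shows the removal of introns.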

Figure 3.7

An extract from a GenBank DNA sequence file with the DNA sequence of human LIM domain 7. The type of information on each line is preceded by the field name. A number of accession and version numbers are given, including cross-references (xref), which are links to related entries in other databases. Many lines have been omitted from the file after the "SOURCE" line, which identifies the sequence as being of human origin. The section from "gene" to the line after the "CDS" line identifies the protein-coding sequence (CDS) as only bases 1261 to 5310 of the sequence. The final section of the file, from the "ORIGIN" line, is the formatted nucleotide sequence. Note that the GeneID is the same as in the microarray data (see Figure 3.9).

Protein sequence databases form a large group of the NAR list (14%). These include the major sequence databases such as UniProtKB, with its highly annotated component Swiss-Prot, and the NCBI Protein Database, both being efforts to collect information on all protein sequences. These protein databases are often compiled and annotated from raw nucleotide sequence data. For example, UniProtKB is produced by analysis of all the translations of the EMBL database nucleotide sequences. It has two components: Swiss-Prot, mentioned above, which incorporates manual annotation, and TrEMBL, which is annotated only by computer. TrEMBL has more entries, but its annotation is less accurate than that of Swiss-Prot.
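The idea of deriving protein entries from nucleotide data can be sketched as a conceptual translation of a coding sequence. The example below is not the UniProtKB or TrEMBL pipeline. It assumes the standard genetic code and a coding sequence that starts in frame, and the names `CODON_TABLE` and `translate_cds` are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon.

    Codons containing characters other than A, C, G, or T become 'X'.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy coding sequence, invented for illustration.
print(translate_cds("ATGGCCTTTTAA"))  # MAF
```

A translation like this yields only the sequence. The annotation that distinguishes Swiss-Prot from TrEMBL has to be added separately.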

The sequence databases tend to contain very extensive annotation, and large teams are needed to help with this intensive task. In the case of nucleotide sequences, the typical features identified include the presence of open reading frames, introns, and promoter sites (see Chapter 1), as well as the translated protein sequence. The protein sequence databases have equally detailed annotation, with more emphasis on the properties of the protein, including its localization, biological targets, sequence motifs, active sites, and domains (see Chapter 2). A partial entry for a DNA sequence of the LMO7 gene is illustrated in Figure 3.7 and for its protein sequence in Figure 3.8.
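Feature annotation of this kind is machine-readable. As a minimal sketch, the code below parses a feature line of the form used in GenBank feature tables and works out the length of the coding region. The line itself is a mock-up built only from the coordinates quoted in the Figure 3.7 caption, not text copied from the real record, and the function name `parse_simple_location` is hypothetical.

```python
import re

# Hypothetical feature-table line, built only from the coordinates quoted in
# the Figure 3.7 caption; it is not copied from the real GenBank record.
feature_line = "     CDS             1261..5310"


def parse_simple_location(line: str) -> tuple[str, int, int]:
    """Return (feature key, start, end) for a simple 'start..end' location.

    Handles only contiguous forward-strand locations; join(...) and
    complement(...) locations are out of scope for this sketch.
    """
    match = re.match(r"\s*(\S+)\s+(\d+)\.\.(\d+)\s*$", line)
    if match is None:
        raise ValueError(f"unsupported feature line: {line!r}")
    key, start, end = match.groups()
    return key, int(start), int(end)


key, start, end = parse_simple_location(feature_line)
length = end - start + 1  # coordinates are 1-based and inclusive
print(key, start, end, length, length // 3)  # CDS 1261 5310 4050 1350
```

The region spans 4050 bases, a whole number of codons (1350). Given the full sequence as a Python string `seq`, the coding region would be `seq[start - 1:end]`.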

Secondary databases such as Blocks and ProDom supply information regarding sequence or structural patterns found in proteins. This information is very useful when searching for proteins that are related by evolution (see Chapters 4 and 6). The proteins are grouped together according to sequence or structural similarities such as analogous active sites or substructures. In addition to the general DNA and protein sequence databases, there are many databases that specialize in specific groups of sequences. These specialist databases cover most areas, including repetitive DNA elements, protein motifs, and particular classes of proteins or RNA molecules.
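How a sequence pattern can be used to search proteins is sketched below. Blocks and ProDom store aligned segments and domain families rather than simple patterns, so this example swaps in a PROSITE-style pattern, the N-glycosylation site N-{P}-[ST]-{P}, written as a regular expression. The protein string is invented for illustration, and `find_motif` is a hypothetical helper.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P}: Asn, any residue except Pro,
# Ser or Thr, any residue except Pro. The lookahead allows overlapping hits.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein: str, pattern: re.Pattern) -> list[tuple[int, str]]:
    """Return (1-based position, matched segment) for every hit, overlaps included."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(protein)]


# Toy protein sequence, invented for illustration.
toy_protein = "MKNGSAANPTLNASQ"
print(find_motif(toy_protein, N_GLYCOSYLATION))  # [(3, 'NGSA'), (12, 'NASQ')]
```

The asparagine at position 8 is not reported because it is followed by proline, which the pattern excludes.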

In document Understanding Bioinformatics (Page 77-80)