Usually the data present in database entries should have certain properties that can be checked automatically. For example, a fully defined DNA sequence should consist only of the four bases A, C, G, and T. However, sometimes there is experimental uncertainty about the base identity at certain positions, leading to an extended character set that covers all the possible uncertainties (see Table 3.1). A similar situation occurs with protein sequences, leading to the one-letter codes B, Z, and X to describe uncertainty in the amino acid present at a sequence position. Other aspects of sequences can also be checked; for example, that the sequence length and molecular weight, if reported, agree with the sequence.
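A check of this kind can be sketched in a few lines of Python. The sketch below assumes that the extended character set is the standard IUPAC nucleotide ambiguity code; Table 3.1 remains the authoritative list, and the function names are illustrative only.

```python
# The four fully defined bases.
UNAMBIGUOUS_DNA = set("ACGT")
# Assumed extended set: the four bases plus the IUPAC ambiguity symbols.
AMBIGUOUS_DNA = UNAMBIGUOUS_DNA | set("RYSWKMBDHVN")


def invalid_positions(sequence, alphabet):
    """Return (index, character) pairs for characters outside the alphabet."""
    return [(i, ch) for i, ch in enumerate(sequence.upper()) if ch not in alphabet]


def length_agrees(sequence, reported_length):
    """Check that a reported sequence length matches the sequence itself."""
    return len(sequence) == reported_length
```

Calling `invalid_positions("ACGTN", UNAMBIGUOUS_DNA)` reports the `N` at index 4, whereas the same sequence passes against `AMBIGUOUS_DNA`. An analogous check for a reported molecular weight would need a table of residue masses and is not shown.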
Some data can be checked even more rigorously. The bonding geometry found in protein structures is known to show only limited variation; for example, the length of the main-chain double bond linking the C and O atoms (see Figure 2.2) averages 0.123 nm (1.23 Å) with a standard deviation of 0.002 nm (0.02 Å). Protein structures are defined in the databases by atomic coordinates relative to an origin, so that the geometry is not explicitly given. However, it is easy to calculate the bond lengths of a structure in the database and to check them against their known ranges. This can be done for all bond lengths, bond angles, and torsion angles, as well as chirality (see Chapter 2). Sometimes, when an error is detected, it is possible to identify and correct the error in the atomic coordinates. In other cases the database entry can be annotated to describe the error. Another typical feature of protein structure entries in the databases is that they often lack certain atoms or even entire residues. This is due to the nature of the experimental techniques used, and frequently relates to the degree of freedom of movement of regions of the molecule. Again, these missing atoms should be identified in the annotation.
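As a minimal sketch of the bond-length check, the distance between two atoms can be computed directly from their coordinates and compared with the expected value. The mean and standard deviation below are the C=O values quoted in the text, in ångströms; the tolerance of four standard deviations is an assumption chosen for illustration, not a published validation criterion.

```python
import math

# Main-chain C=O bond length statistics quoted in the text (angstroms).
CO_MEAN = 1.23
CO_SD = 0.02


def bond_length(atom_a, atom_b):
    """Euclidean distance between two (x, y, z) coordinate tuples."""
    return math.dist(atom_a, atom_b)


def is_outlier(length, mean=CO_MEAN, sd=CO_SD, n_sd=4.0):
    """Flag a bond whose length lies more than n_sd standard deviations
    from the mean. The default n_sd is an illustrative assumption."""
    return abs(length - mean) > n_sd * sd
```

With made-up coordinates, `bond_length((0.0, 0.0, 0.0), (1.23, 0.0, 0.0))` gives a length that passes the check, while a separation of 1.5 Å would be flagged for annotation or correction. Bond angles and torsion angles can be checked in the same way once they have been computed from the coordinates.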
Other forms of data present in the database entries may also be amenable to automated checking. For example, in the entry for a gene expression experiment, a check can be made that expression levels are given for every gene under every condition. Wherever cross-references are given to entries in other databases, a check can be made that the databases and entries exist. Automated checks of this sort can usually only detect errors and highlight them for manual resolution. In addition, they may often fail to identify other, more subtle errors. In some cases, such as when submitting microarray data to certain databases, the data must conform to a specific standard that is intended to include sufficient experimental information for the work to be accurately reproduced. One such data standard for microarray experiments is called MIAME. MIAME describes the Minimum Information About a Microarray Experiment that is needed to enable the results of the experiment to be interpreted unambiguously and, potentially, to reproduce the experiment.
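The two checks mentioned above, completeness of expression data and validity of cross-references, might be sketched as follows. The data layout (nested dictionaries and sets of accession codes) and all names are hypothetical; a real database would supply its own schema and lookup mechanism.

```python
def missing_measurements(expression, genes, conditions):
    """Return (gene, condition) pairs with no recorded expression level.

    `expression` maps gene -> {condition -> level}; this layout is assumed.
    """
    missing = []
    for gene in genes:
        row = expression.get(gene, {})
        for condition in conditions:
            if row.get(condition) is None:
                missing.append((gene, condition))
    return missing


def dangling_cross_references(cross_refs, known_entries):
    """Return (database, accession) pairs that do not resolve.

    `known_entries` maps database name -> set of accession codes.
    """
    return [
        (db, acc)
        for db, acc in cross_refs
        if acc not in known_entries.get(db, set())
    ]
```

Both functions only report problems. As noted in the text, resolving them is usually left to a curator.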
As the number and size of databases grew, and annotations became more extensive, some potential problems were recognized in electronic text searching. Often alternative spellings of the same word were encountered, as well as alternative names. For example, it is not uncommon for an enzyme to have three or four alternative names. As a consequence it can be difficult to locate all relevant database entries without searching for all names or spelling variants. To circumvent this problem, ontologies have been proposed covering specific disciplines. By restricting the words used in databases to those in the ontologies, text searches can be rendered more effective. Automated methods can be used to identify alternative terms and replace them with approved terms, as well as to find misspellings.
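The replacement of alternative terms and the detection of misspellings can be illustrated with a small lookup table and approximate string matching. The vocabulary below is a toy example rather than part of any real ontology, and the similarity cutoff is an arbitrary assumption. `difflib.get_close_matches` is part of the Python standard library.

```python
import difflib

# Toy mapping from known variants to the approved term (hypothetical vocabulary).
VARIANT_TO_APPROVED = {
    "hemoglobin": "hemoglobin",
    "haemoglobin": "hemoglobin",
}


def normalize_term(term, variants=VARIANT_TO_APPROVED, cutoff=0.85):
    """Map a term to its approved form, or return None if nothing matches.

    Exact variants are looked up directly; otherwise the closest known
    variant above the similarity cutoff is treated as a likely misspelling.
    """
    key = term.strip().lower()
    if key in variants:
        return variants[key]
    close = difflib.get_close_matches(key, list(variants), n=1, cutoff=cutoff)
    return variants[close[0]] if close else None
```

Here `normalize_term("Haemoglobin")` and the misspelled `normalize_term("hemoglobn")` both map to the approved term, while an unrelated word returns `None` and would be passed on for manual inspection.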
Initial analysis and annotation is usually automated
The data submitted in new database entries are increasingly generated in electronic form by the experiments. In such cases it is relatively easy to include details of important parameters used in obtaining the data. These can be extremely useful in subsequent interpretation. An example of this occurs in the MSD molecular structure database, especially for those structures derived from crystallographic methods. As well as details that may only be of interest to other experimentalists, crystallographers report measures called the resolution and R factor, which are a very useful guide to the overall accuracy of the structure. The resolution indicates how much measured diffraction data were included in the work, while the R factor measures how well the structure agrees with the experimental measurements. Both give strong indications of the effective limit of accuracy with which atomic coordinates can be determined. It is common practice when analyzing general protein molecular geometry to include only structures whose resolution is better than some threshold, such as 0.2 nm (2.0 Å).
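Selecting structures by resolution is a simple filter once these parameters are recorded with each entry. The sketch below assumes each entry is a dictionary with `resolution` (in ångströms) and `r_factor` fields; these field names, the optional R-factor cutoff, and the example entries are hypothetical. Because a smaller resolution value means finer detail, "better than 2.0 Å" is implemented as at or below the threshold, which is an assumption about inclusivity.

```python
def select_by_resolution(structures, max_resolution=2.0, max_r_factor=None):
    """Keep entries whose resolution (angstroms) is at or below the threshold.

    Entries with no reported resolution are excluded rather than guessed at.
    If max_r_factor is given, entries lacking or exceeding it are excluded too.
    """
    selected = []
    for entry in structures:
        resolution = entry.get("resolution")
        if resolution is None or resolution > max_resolution:
            continue
        r_factor = entry.get("r_factor")
        if max_r_factor is not None and (r_factor is None or r_factor > max_r_factor):
            continue
        selected.append(entry)
    return selected


# Hypothetical entries, not real database records.
entries = [
    {"id": "entry1", "resolution": 1.8, "r_factor": 0.19},
    {"id": "entry2", "resolution": 2.5, "r_factor": 0.22},
    {"id": "entry3", "resolution": None, "r_factor": None},
]
```

With the default threshold only `entry1` is retained. Relaxing `max_resolution` to 3.0 also admits `entry2`, unless an R-factor cutoff such as 0.2 is applied as well.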
For data already incorporated in the database, many of the forms of analysis applied in bioinformatics are fully automatable, certainly in the case of sequences and molecular structure data. Many of these techniques are described in detail in the following chapters of this book. One such method is the identification of similar sequences by alignment with all the sequences in a database. By using such techniques on a protein sequence, for example, it is possible to identify the protein family to which it belongs and its likely function. The methods usually include statistical analysis that can assess the likely significance of the result, information which should ideally be presented in the database entry. One way of increasing the reliability of the analysis is to reduce the level of detail, for example identifying a more general class of enzyme function instead of trying to predict exactly which compounds are involved. However, this can render the information too vague to be useful.
There is a particularly high potential for misleading analysis when genes are identified in nucleotide sequences by applying purely computational methods, a situation which is common in the early stages of genome sequence annotation. Such methods are described in detail in Chapters 9 and 10. Sometimes there are no experimental data available to support the predicted genes. In such cases the genes and the proteins they encode are often labeled as "hypothetical" in the databases. The gene prediction methods available are still relatively inaccurate, especially for eukaryotes, so that considerable caution is required when encountering these entries in the databases. In some cases the gene may be correctly predicted, but it is also quite possible that no such gene exists, or that errors have been made in the details of the prediction, resulting, for example, in a different protein sequence. Often considerable experimental work is carried out after genome sequencing is complete to try to obtain experimental evidence for these hypothetical genes and proteins.