Figure 2-17. Data Models. The most common data models in bioinformatics are relational, flat, and object-oriented.

The flat data model is simply a table with no embedded structure information to govern the relationships between records. As a result, a flat file database can work with only one table or file at a time. Strictly speaking, a flat file doesn't fit the criteria for a data model because it lacks an embedded structure. However, that lack of structure is one reason for the popularity of the flat file database in bioinformatics, especially for capturing sequence data. A sequence of a few dozen characters may be followed by a sequence of thousands of characters, with no known relationship between the sequences other than perhaps the tissue sample or sequence run. As such, a separate flat file can be used to efficiently store the sequence data from each sample or run. To make the management of large amounts of sequence or other data more tractable, however, a model with an embedded structure is required.
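As a minimal sketch of how such a file might be read, the Python fragment below parses a FASTA-style flat file, in which each record is a header line beginning with ">" followed by one or more lines of sequence. The FASTA-style layout, the function name read_flat_file, and the sample reads are assumptions made for illustration; the text does not prescribe a particular file format.

```python
import io


def read_flat_file(handle):
    """Yield (identifier, sequence) pairs from a FASTA-style flat file."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            identifier, chunks = line[1:], []
        else:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Hypothetical contents of one sequencing run; record lengths vary freely.
run_file = io.StringIO(">read_001\nACGTTGCA\n>read_002\nACGT\nTTGACCGTA\nGG\n")
records = list(read_flat_file(run_file))
```

Nothing in the file relates one record to another; each read is recovered simply by scanning from one header to the next.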

The relational model, developed in the early 1970s, is based on the concept of a data table in which every row is unique. The records or rows in the table are called tuples; the fields or columns are variously referred to as attributes, predicates, or classes. Database queries are performed with the select operation, which asks for all tuples in a certain relation that meet a certain criterion, for example, a query such as "Which authors write about neurofibromatosis?" To connect the data of two or more relations, an operation called a join is performed. A record is retrieved from the database by means of a key, or label, that may consist of a field, part of a field, or a combination of several fields. The purpose of a relational DBMS is to support this data model so that it is easy to direct a search for the record that contains a particular value of the key. Consider querying a bibliographic database with an "author_subject_table," using the Structured Query Language (SQL) statement:

SELECT * FROM author_subject_table WHERE subject = 'Neurofibromatosis'
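The select operation on author_subject_table can be tried out with Python's standard sqlite3 module and an in-memory database. This is only a sketch: the two-column layout (author, subject) and the sample rows are assumptions, since the text does not specify the table's columns.

```python
import sqlite3

# In-memory database; the columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author_subject_table (author TEXT, subject TEXT)")
conn.executemany(
    "INSERT INTO author_subject_table VALUES (?, ?)",
    [
        ("Author A", "Neurofibromatosis"),
        ("Author B", "Cystic fibrosis"),
        ("Author C", "Neurofibromatosis"),
    ],
)

# "Which authors write about neurofibromatosis?"
rows = conn.execute(
    "SELECT author FROM author_subject_table WHERE subject = ? ORDER BY author",
    ("Neurofibromatosis",),
).fetchall()
authors = [row[0] for row in rows]
```

The subject is passed as a bound parameter instead of being pasted into the statement, which is the usual practice when the search term comes from a user.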

A useful feature of the relational model is that records or rows from different files can be combined as long as the files have one field in common. Theoretically, records with a common field can be combined or joined with an unlimited number of files. The price paid for this flexibility is extended access time. That is, in a database design that doesn't take likely use patterns into account, performance suffers. A large amount of processor time will be spent extracting information from the system as the database program performs joins and other operations. This performance penalty is a reason for not simply polling application databases for data. It's far better, from a performance perspective, to move the data into a separate data repository, a second database that is optimized for the desired searching and analysis.
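The trade-off can be sketched with Python's standard sqlite3 module. In the fragment below, two hypothetical tables (authors and papers) share the field author_id, so they can be joined; the joined result is then copied into a separate, indexed table that stands in for the repository. All table, column, and index names are assumptions made for illustration, and whether such a copy is actually faster depends on the database engine and the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE papers (paper_id INTEGER PRIMARY KEY, author_id INTEGER, subject TEXT);
INSERT INTO authors VALUES (1, 'Author A');
INSERT INTO authors VALUES (2, 'Author B');
INSERT INTO papers VALUES (10, 1, 'Neurofibromatosis');
INSERT INTO papers VALUES (11, 2, 'Cystic fibrosis');
INSERT INTO papers VALUES (12, 1, 'Neurofibromatosis');
""")

# Join the two relations on the field they have in common.
join_sql = """
SELECT a.name AS name, p.paper_id AS paper_id, p.subject AS subject
FROM authors AS a JOIN papers AS p ON p.author_id = a.author_id
"""
joined = conn.execute(join_sql + " ORDER BY p.paper_id").fetchall()

# Copy the joined rows into a separate table and index the search field,
# so later searches by subject do not repeat the join.
conn.execute("CREATE TABLE author_subject_repository AS " + join_sql)
conn.execute(
    "CREATE INDEX idx_repository_subject ON author_subject_repository (subject)"
)
hits = conn.execute(
    "SELECT name, paper_id FROM author_subject_repository "
    "WHERE subject = ? ORDER BY paper_id",
    ("Neurofibromatosis",),
).fetchall()
```

The repository trades storage and freshness for speed: the author's name is now stored once per paper, and the copy must be rebuilt when the source tables change.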

The attraction of the ubiquitous relational model is that it is mature, stable, reliable, well understood, and well suited for a number of different applications in bioinformatics. The basic concepts involved with the relational model are easily grasped; data are populated into rows and columns in a table, and tables are associated with one another by joining fields that match in the two tables. However, the relational model has several limitations. Because the relational model is based on rows and columns, it's most efficient working with scalar data such as names, addresses, and laboratory values. That is, all relationships between objects must be based on data values as opposed to a location or place-holder in the database. This limitation often requires the database designer to create additional relations to describe logical associations between data elements. For example, in a relational database containing both nucleotide and amino acid sequences, the researcher can't relate the two without the aid of tables that relate nucleotide sequences to proteins and protein sequences to specific amino acids.
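The nucleotide and protein example can be sketched as follows, again with Python's sqlite3 module. The table encodes holds no data of its own; it exists only to record which nucleotide sequence corresponds to which protein. The table names, column names, and the short sequence are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nucleotide_sequence (nuc_id INTEGER PRIMARY KEY, bases TEXT);
CREATE TABLE protein (protein_id INTEGER PRIMARY KEY, residues TEXT);
-- The additional relation: it stores only the association between the two.
CREATE TABLE encodes (
    nuc_id INTEGER REFERENCES nucleotide_sequence(nuc_id),
    protein_id INTEGER REFERENCES protein(protein_id)
);
INSERT INTO nucleotide_sequence VALUES (1, 'ATGGCCTAA');
INSERT INTO protein VALUES (7, 'MA');
INSERT INTO encodes VALUES (1, 7);
""")

# Relating a nucleotide sequence to its protein takes two joins,
# both of them through the association table.
pairs = conn.execute("""
SELECT n.bases, p.residues
FROM nucleotide_sequence AS n
JOIN encodes AS e ON e.nuc_id = n.nuc_id
JOIN protein AS p ON p.protein_id = e.protein_id
""").fetchall()
```

Every new kind of association (protein to residue, gene to transcript, and so on) would need another table of this sort, which is the overhead the paragraph above describes.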

An even greater limitation of the relational model from a bioinformatics perspective is that the metaphor of rows and columns often isn't a natural fit for sequence or protein shape data. Recall that one reason for using a DBMS is to allow users to think of data management in abstract, high-level terms, instead of in terms of the underlying algorithms and data representation schemes. Although tables of rows and columns can be considered a simplification over hard disk platters, they can seem opaque to a researcher working with thousands of sequences, genes, and other data that don't fit neatly into a tabular metaphor. That is, the relational model often doesn't hide the complexity of genomic data. As a result, various other data models are used by professionals in the biotech industry.

One alternative to the relational model is the hierarchical model, which predates the relational model by a decade. Unlike the flexible relational model, the hierarchical model defines permanent connections when the database is created. Within the hierarchical database model, the smallest data entity is the record. That is, unlike records in a relational model, records within a hierarchical database are not necessarily broken up into fields. In addition, connections within the hierarchical model don't depend on the data. The hierarchical links, sometimes called the structure of the data, can best be thought of as forming an inverted tree, with the parent file at the top and children files below. The relationship between parent and children is a one-to-many connection, in that one parent may produce multiple children.

The basic operation on the hierarchical database is the tree walk, proceeding from parent to child. Data can be retrieved only by traversing the levels of the hierarchy according to the path defined by the succession of parent fields. This unidirectional convention makes certain relationships difficult to extract from the database, even though they may be explicit in the data. A related characteristic of the hierarchical model is that information must often be repeated. Returning to the author-subject database example, under the topic of neurofibromatosis, if an author wrote more than one paper on the subject, the author's name and contact information would be repeated throughout the database.
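A tree walk over the author-subject example might look like the Python sketch below, in which nested dictionaries stand in for parent and child records. The record contents (paper titles, author name, and address) and the function name tree_walk are hypothetical.

```python
# Nested dictionaries stand in for parent and child records.
database = {
    "record": "subjects",
    "children": [
        {
            "record": "Neurofibromatosis",
            "children": [
                {
                    "record": "Paper 1",
                    "children": [
                        {"record": "Author A, a@example.org", "children": []}
                    ],
                },
                {
                    "record": "Paper 2",
                    "children": [
                        {"record": "Author A, a@example.org", "children": []}
                    ],
                },
            ],
        }
    ],
}


def tree_walk(node, path=()):
    """Visit each record parent-first, yielding the path from the root."""
    path = path + (node["record"],)
    yield path
    for child in node["children"]:
        yield from tree_walk(child, path)


paths = list(tree_walk(database))

# The author record is stored once under each paper, so it appears twice.
author_copies = [p for p in paths if p[-1].startswith("Author A")]
```

Answering "Which papers did Author A write?" requires walking the whole tree from the top, because the links run only from parent to child.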

The hierarchical model was once very popular in medicine, in the form of the Massachusetts General Hospital Utility Multi-Programming System (MUMPS) database language, which was used to develop one of the first electronic medical record (EMR) systems. A reason for the initial popularity of MUMPS in the late 1960s was that the data model is a good fit for clinical data, which tends to follow a standard, hierarchical topic outline. For example, patients at the top of the hierarchy have child nodes containing the elements of the EMR, including chief complaint, diagnosis, and laboratory results, as defined in Table 2-2. The limitation, noted earlier, is that for every patient admission, certain data must be repeated, such as the patient's address, billing information, and other demographic information.

information resides in databases following this model. For example, a descendant of MUMPS called simply M is the standard for EMRs in the Veterans Administration hospitals throughout the U.S.

Because of the storage inefficiency of the hierarchical model for some types of data, the network model was developed in the late 1960s. The network model is more flexible than the hierarchical one because multiple connections can be established between files. These multiple connections enable the user to gain access to a particular file more effectively, without traversing the entire hierarchy above that file. Unlike the one-to-many relationship supported by the hierarchical model, the network model supports many-to-many relationships, in that a child may have more than one parent. The network model matters in bioinformatics because it may play a significant role in the architecture of the Great Global Grid and other Web-based computing initiatives.
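The effect of multiple connections can be sketched in Python with two dictionaries of links, one from each owner record to its members and one from each member back to its owners. The names links, owners, and connect, and the sample records, are assumptions made for illustration; this is not a description of any particular network DBMS.

```python
from collections import defaultdict

links = defaultdict(set)   # owner record -> member records
owners = defaultdict(set)  # member record -> owner records


def connect(owner, member):
    """Record a connection in both directions."""
    links[owner].add(member)
    owners[member].add(owner)


# Each paper is connected to a subject and, independently, to an author.
connect("Neurofibromatosis", "Paper 1")
connect("Neurofibromatosis", "Paper 2")
connect("Author A", "Paper 1")
connect("Author A", "Paper 2")

# The author's papers are reached directly, without descending from the subject.
papers_by_author = sorted(links["Author A"])

# A single paper has more than one owner, which a strict hierarchy does not allow.
paper_owners = sorted(owners["Paper 1"])
```

Because the author record exists once and is linked to each paper, the author's details no longer need to be repeated under every paper.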

One of the most significant alternatives to the relational database model is the object-oriented model in which complex data structures are represented by composite objects, which are objects that contain other objects. These objects may contain other objects in turn, allowing structures to be nested to any degree. This metaphor is especially appealing to those who work with bioinformatics data because this nesting of complexity complements the natural structure of genomic data (see Figure 2-18).
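As a rough illustration of composite objects, the Python sketch below nests exon objects inside a gene object, and gene objects inside a chromosome object. The class names, the placeholder names, and the coordinates are hypothetical and are not drawn from any real annotation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Exon:
    start: int  # inclusive position
    end: int    # exclusive position


@dataclass
class Gene:
    name: str
    exons: List[Exon] = field(default_factory=list)

    def total_exon_length(self) -> int:
        """Sum the lengths of the exons this gene contains."""
        return sum(exon.end - exon.start for exon in self.exons)


@dataclass
class Chromosome:
    name: str
    genes: List[Gene] = field(default_factory=list)


# A chromosome contains genes, and each gene contains exons.
chromosome = Chromosome(
    "example_chromosome",
    [Gene("example_gene", [Exon(100, 160), Exon(400, 520)])],
)
length = chromosome.genes[0].total_exon_length()
```

The nesting can continue to any depth, for example by letting a genome object hold a list of chromosomes, without any change to the classes already defined.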

Figure 2-18. Object-Oriented Data Representation. The object-oriented data

In document Prentice Hall Ptr Bioinformatics Computing pdf (Page 86-89)