• No results found

BASICS OF PHYSICAL DATABASE DESIGN Although this book is not concerned with physical database design, a basic

RETRIEVAL SYSTEMS

5.4 BASICS OF PHYSICAL DATABASE DESIGN Although this book is not concerned with physical database design, a basic

understanding of physical database design is still needed (even for an end user). It is important to keep in mind that searching and sorting methods in a DBMS involves input/output operations with secondary memory.

5.4.1 STORAGE MEDIA

We first give some terminology related to the size of databases: 1 MB (megabyte) = 106 bytes

1 GB (gigabyte) = 109 bytes 1 TB (terabyte) = 1012 bytes 1 PB (petabyte) = 1015 bytes

It is not uncommon for data warehouses to have size in hundreds of terabytes.

Several types of data storage exist, for a hierarchy of storage devices. At one end of this hierarchy is the fastest storage media called cache, and is managed by the operating system. The storage medium used for data that can be operated on is the main memory. The primary medium for the long-term on-line storage of data, however, is the magnetic disk. Traditionally, the entire database is typically stored on magnetic disk, although recently main-memory databases have drawn more and more attention [Eich 1992]. The advent of main-memory database has apparently enhanced the integration of knowledge-based computational intelligence techniques with database techniques. Note that disk storage is referred to as direct-access storage, because it is possible to read data on disk in any order. Having a large number of disks in a system can improve the rate at which data can be read or written, and improve the reliability of data storage by storing redundant information on

multiple disks. For this reason, a variety of disk-organization techniques referred to redundant arrays of independent disks (RAID) have now been widely used. At the other end of the storage-device hierarchy is the tape storage, which is considered as sequential-access storage. Tape storage is slow to access, and used primarily for backup and archival data.

5.4.2 FILE STRUCTURES AND INDEXING

A database is mapped into a number of files, which are maintained by the underlying operating system. These files reside permanently on disks (with backups on tables). A file is organized logically as a sequence of records. These records are mapped into disk blocks. Files are provided as a basic construct in operating systems.

A search key is the attribute (or attributes) used to look up records in a file. Note that it is different from key concepts (e.g., primary keys) in relational database (which is at the logical level of DBMS).

A query may reference only a small proportion of the records in a file. To reduce the overhead in searching for these records, we can construct indices for the files on which the database is stored. There are many types of indices:

(a) Index-sequential files. They are one of the oldest index schemes used in database systems and are designed for applications that require both sequential processing of the file and the random access to individual records. To permit fast retrieval of records in the order of the search key, records should be chained together by pointers. To allow fast random access, an index structure should be used. Indices could be either dense or sparse. In dense index, an index record appears for every search-key value in the file; while in sparse index, an index record is created for only some of the values.

In a standard index-sequential file, only one index is maintained. If several indices on different search keys are used, the index whose search key specifies the sequential order of the file is referred to as the primary index (also called clustering index). The search key of a primary index is usually (but not necessarily) the primary key. Each of the other indices is called a secondary index (or non-clustering index). In other words, a secondary index is an index whose search key specifies an order different from the sequential order of the file (for example, records are ordered by SSN but searched by names). Secondary indices improve the performance of queries that use search keys other than the primary one. However, the price we have to pay is the overhead when the database is modified. Note that regardless what kind of index is used, the index-sequential file organization suffers from performance degrading as the file grows.

(b) B+ tree or B tree index. They are designed to overcome the performance degrading problem. A B+ tree is a balanced tree in which all the leaves (which store the data or contain pointers to the data) are at the same level. The branch factor is usually a relative large number (say, 27 = 128), making only few disk accesses needed. Search operation can be carried out in a straightforward manner. Insert and delete are somewhat complex, because

the balanced condition may be violated and need to be restored. B-tree index is a variation of B+ tree where data may be stored in the internal nodes.

(c) Hash index. An alternative way of using index is instead of using ordered indices (sorted ordering), we can use a hash function to find the address of a data item directly by computing a function on the search-key value of the desired record. Two kinds of hashing can be distinguished:

Static hashing. It uses hash functions in which the set of bucket addresses is fixed. These hash functions cannot easily accommodate databases that grow significantly larger over time.

Dynamic hashing. It allows the hash function to be modified. Different dynamic hashing techniques have been developed. For example, in extendable hashing, buckets used to store the data can be split when the database grows, and can be coalesced when the database shrinks.

5.4.3 TUNING DATABASE SCHEMA

In Chapter 4, we discussed logical database design. The logical design should be followed by the physical database design, where we design the physical schema. It is important to keep in mind that as user requirements evolve, it is usually necessary to tune, or adjust, all aspects of a database design for good performance. There are three kinds of tuning [Ramakrishnan 1998]:

tuning indexes: Based on the observed workload we may refine the initial choice of indexes.

tuning the conceptual schema: This is to make changes to the conceptual schema in order to enhance performance.

tuning queries: This is to rewrite frequently executed queries and transactions in order to run them faster.

5.5 AN OVERVIEW OF QUERY PROCESSING AND