• No results found

RETRIEVAL SYSTEMS

5.7 DATA WAREHOUSING

The Internet provides an excellent chance (and also tremendous challenges) for building data warehouses. In this section, we take a look at this issue. More detailed discussion on data warehousing is provided in Chapter 11.

5.7.1 BASICS OF PARALLEL AND DISTRIBUTED DATABASES

5.7.1.1 Basics of parallel databases

Parallel systems improve processing and I/O speeds by using multiple CPUs and disks in parallel. There are two main measures of performance of a database system: the throughput (the number of tasks that can be completed in a given time interval) and the response time (the amount of time it takes to complete a single task from the time it submitted). A system that processes a large number of small transactions can improve throughput by processing many transactions in parallel. A system that processes large transactions can improve response time as well as throughput by performing subtasks of each transaction in parallel. Two important issues in studying parallelism are speedup and scale-up. Speedup refers to running a given task in less time by increasing the degree of parallelism. Scale-up refers to handling larger tasks by increasing the degree of parallelism. There are several architectural models for parallel machines used in parallel databases: shared memory (all the processors share a common memory), shared disk (all the processors share a common disk), shared nothing (the processors share neither a common memory nor common disk), hierarchical (a combination of the preceding architectures). Shared-nothing architecture has been proven the most successful one, because it provides both linear speedup as well as linear scale up, although it requires extensive reorganization of the DBMS code.

The basic idea behind parallel databases is to carry out evaluation steps in parallel whenever possible, in order to improve performance. Individual relational algebra operations (as discussed in Chapter 4) can be parallelized. In addition, we can execute different operations in an query in parallel and execute multiple queries in parallel. In its simplest form, I/O parallelism refers to reducing the time required to retrieve relations from disk by partitioning the

relations on multiple disks. The most common form of data partitioning in a parallel database environment is horizontal partitioning: the tuples of a relation are divided (or declustered) among many disks, such that each tuple resides on the disk.

Interquery parallelism: In interquery parallelism, different queries or transactions execute in parallel with one another. Transaction throughput can be increased by this form of parallelism. The primary use of interquery parallelism is to scale up a transaction-processing system to support a larger number of transactions per second. However, the response times of individual transactions are no faster than they would be if the transactions were run in isolation.

• Intraquery parallelism: Intraquery parallelism refers to the execution of a single query in parallel on multiple processors and disks. Using intraquery parallelism is important for speeding up long-running queries. The execution of a single query can be parallelized in two ways:

♦ Intraoperation parallelism: To speed up processing of a query by parallelizing the execution of each individual operation, such as sort, select, project, and join.

♦ Interoperational parallelism: To speed up processing of a query by executing in parallel the different operations in a query expression. 5.7.1.2 Distributed database systems

In a distributed database system, the database is stored on several computers. The computers in a distributed system communicate with one another through various communication media, such as high-speed networks or telephone lines. They do not share main memory or disks.

Unlike parallel systems, in which the processors are tightly coupled and constitute a single database system, a distributed database system consists of loosely coupled sites that share no physical components. In addition, the database systems that run on each site may have a substantial degree of mutual independence. In recent years, the need has arisen for accessing and updating data from a variety of preexisting databases, which differ in their hardware and software environments, and in the schemas under which data are stored. A multidatabase system is a software layer that enables such a heterogeneous collection of databases to be treated like a homogeneous distributed database.

A simple and popular distributed DBMS architecture is called client server. A client-server system has one or more client processes (which are responsible for user-interface issues) and one or more server processes (which manage data and execute transactions). A client process can send a query to any server process. Howeer, the client-server architecture does not allow a single query to span multiple servers. As a consequence, a client process could be quite complex, and its capabilities would begin to overlap with the server. To deal with these problems, collaborating server systems have been developed.

In distributed relational databases, relations are usually fragmented. There are two different schemes for fragmenting a relation: horizontal fragmentation

splits a relation by assigning each tuple of relation r to one or more fragments; vertical fragmentation splits the relation by decomposing the scheme R of relation r so that the original relation can be reconstructed by joining the fragments back (it is often convenient to add a special attribute called tuple-ID for this purpose).

For centralized systems, the primary criterion for measuring the cost of a particular strategy is the number of disk accesses; in a distributed system, we must take into account several other matters, including the cost of data transmission over the network and the potential gain in performance from having several sites process parts of the query in parallel.

Distributed environment also brings more challenges for issues related to transaction processing. For example, in order to ensure atomicity, all the sites in which a transaction T is executed must agree on the final outcome of the execution. T must either commit at all sites, or it must abort at all sites. Therefore, the transaction coordinator of T must execute a commit protocol. The simplest and most widely used is the two phase commit protocol (2PC). Roughly speaking, the first phase is to send prepare message while the second phase is based on the received massages to determine commit or abort.

5.7.2 DATA WAREHOUSING AND DECISION SUPPORT

The complexity involved in distributed database systems has stimulated organizations to find alternative ways to achieve decision support. Data warehousing is an emerging approach for effective decision support. A data warehouse is a "subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making." [Inmon 1996]. Though considered by some business people that data warehousing is a low-key answer for the "failed" distributed database systems, data warehousing does take advantage of various techniques related to distributed and parallel computing. A discussion on distributed and parallel computing issues in data warehousing can be found in [Garcia-Molina, Labio, Wiener and Zhuge 1999].

Data warehousing provides an effective approach to deal with complex decision support queries over data from multiple sites. The key to this approach is to create a copy (or derivation) of all the data at some one location, and to use the copy rather than going to the individual sources. Note that the original data may be on different software platforms or belong to different organizations.

Data warehouses contain consolidated data from many sources (different business unit), spanning long time periods, and augmented with summary information. Warehouses are much larger than other kinds of databases, sizes are much larger, typical workloads involve ad hoc, fairly complex queries, and fast response times are important. Data warehousing encompasses frameworks, architectures, algorithms, tools and techniques for bringing together selected data from multiple databases or other information sources into a single repository suitable for direct querying or analysis. Data

warehousing is especially important in industry today because of a need for enterprises to gather all of their information into a single place for in-depth analysis, and the desire to decouple such analysis from their OLTP systems. Since decision support often is the goal of data warehousing, clearly warehouses may be tuned for decision support, and perhaps vice versa.

In its simplest form, data warehousing can be considered as an example of asynchronous replication, in which copies are updated relatively infrequently (see [Ramakrishnam 1998] for more discussion). However, a more advanced implementation of data warehousing would store summary data or other kind of information derived from the source data. In other words, a data warehouse stores materialized views (plus some local relations if needed).

It is common in a data warehousing environment for source changes to be deferred and applied to the warehouse views in large batches for efficiency. Source changes received during the day are applied to the views in a nightly batch window (the warehouse is not available to the users during this period). Most current commercial warehousing systems (e.g. Prism, Redbrick) focus on storing the data for efficient access, and on providing extensive querying facilities at the warehouse. Maintenance of warehousing data (in a large degree, maintenance of materialized views) is thus an important problem. A more detailed discussion on data warehousing will be discussed in Chapter 11.

The widespread adoption of Internet technology will profoundly affect On- Line Analytical Processing (OLAP), which refers to applications dominated by stylized queries that typically involve group-by and aggregation operators for analysis purpose. Such queries are extremely important to organizations to analyze important trends so that better decisions can be made in the future. In addition, most vendors of OLAP engines have focused on Internet-enabling their offerings. The true promise of the Internet is in making OLAP a mainstream technology, that is, moving OLAP from the domain of analysts to consumers. E-commerce has emerged as one of the largest applications of the Internet in decision support. The basic concepts of data warehousing and aggregation have naturally made their way onto the web. In fact, some of the most popular Web sits on the Internet are basically databases. For example, search engines such Alta Vista and Lycos attempt to warehouse the entire web. Aggregation as a means to navigate and comprehend the vast amounts of data on the Internet has to also be recognized. Directory services such as Yahoo and Excite attempt to aggregate the entire web into a category hierarchy and give users the ability to navigate this hierarchy. The infrastructure for decision support is also in the process of improvement [Harinarayan 1997]. A more detailed discussion on data warehousing and related issues will be given in Chapter 11.

5.7.3 MIDDLEWARE

Middleware is a loosely defined term referring to the products to help customers deal with disparate, heterogeneous environments more effectively. In the DBMS arena, middleware products provide a consistent interface to

different local and remote data sources. Typically, data sources are supported through one or more specific drivers that (among other things) pass requests to a given data source and enable the results to be returned to the application.

From a customer's viewpoint, typical elements of data access middleware offerings include the following:

• an application programming interface (API) consisting of a series of available unction calls in C and a series of data access statements in dynamic SQL,

• a component called middleware engine for routing requests to various drivers and performing other functions (structures of middleware engine differ, depending on whether a global catalog or directory exists), and

drivers to translate requests issued through the middleware API to a format intelligible to the various back-end data sources. Gateways may be considered an early attempt at middleware; they provide specific point-to-point connectivity rather than broad-based connectivity.

The logical architecture of a data warehouse, including the roles of the middleware and data marts are depicted in Figure 5.5. Each data mart contains a portion of the data stored in the data warehouse (to be further discussed in Chapter 11). Also note source data are assisted by wrappers that facilitate conversion of data for integration.

Figure 5.5 Logical architecture of a data warehouse