• No results found

Although low level implementation details of different DBMS may vary, from System R[20] until this day nearly every DBMS system comprised of the same high level components. The standardization of the internal database architecture contributed to the fast and successful development of DBMS, since it allowed researchers to port promising techniques over the years from one system to another and it significantly facilitated the transfer of research insights into commercial products.

The main building blocks of a typical DBMS, as shown in Figure2.1, are the query processing module, storage module, concurrency control module and crash recovery module. Each of the modules comprises several components whose roles we discuss next.

1 In this thesis, as well as in the database literature, terms relation and table are used interchangeably. 2 In this thesis we consider relational DBMS whenever referred to DBMS.

2.2. Components of a DBMS DBMS Index files Heap files Database Concurrency control module File and access

methods mgr.

Buffer pool mgr.

Disk space mgr.

Storage module Concurrency control module Transaction mgr. Lock mgr. Crash recovery module Log mgr.

Query processing module

Parser Optimizer Executor

Figure 2.1: Components of a DBMS

The query processing module, which is in charge of processing queries issued by the users, consists of a parser, a query optimizer and a query executor3. When a database user issues a query, the parser, as a first component in the query flow, is responsible for syntactically analyzing the query, making sure it is consistent with the database schema4. The parsed query is then sent to the query optimizer, which uses the properties of the relation algebra and information about how the data is stored to transform the query into an efficient query

execution plan (usually referred to as an optimal plan5) for evaluating the query. A query execution plan is represented as a tree of relational operators. Upon receiving this plan, the query execution engine activates the chosen algorithms (the implementations of relational operators) to execute the query. The relational operators may require data from the storage module6.

The storage module comprises the file and access methods layer, buffer pool manager, and disk

space manager. The file and access methods layer works with files, which, in a DBMS, are a

collection of pages. Two types of files are supported: heap files, or files of unordered pages, and 3 Many DBMS have an additional component between the query parser and the optimizer, called the query

rewriter, which is in charge of simplifying the query (e.g., nesting is unnested, query predicates are simplified, etc.). Nonetheless, the database literature often considers this component as part of the optimizer [142], hence we chose not to show it separately.

4 Database schema is a set of relations with their mutual relationships described through foreign keys.

5 Although called an optimal plan, there are many cases when the query optimizers actually do not manage to find the optimal plan. The last is due to the fact that the optimizers employ different heuristics to prune the space of possible choices, missing thereby potentially optimal plans.

Chapter 2. A Look Inside the DBMS

indexes that assume some form of ordering among pages. The buffer pool manager brings the

pages in from disk to main memory as needed in response to page read requests, maintaining the illusion that a particular page resides in memory. The lowest layer of the DBMS software, the disk space manager, deals with management of space on disk, where the data is stored. Higher layers allocate, deallocate, read, and write pages through routines provided by this layer, while this layer works directly with blocks on disk.

The DBMS provides concurrency control mechanisms by carefully scheduling and coordinat- ing user requests. DBMS components associated with concurrency control are the transaction manager and the lock manager. The transaction manager schedules the execution of transac- tions and ensures that transactions request and release locks according to a suitable locking protocol, while the lock manager keeps track of requests for locks and grants locks on database objects when they become available.

Lastly, the crash recovery module is responsible for maintaining a log of all database changes and restoring the system to a consistent state after a crash.

Since the work presented in this thesis is related mostly to the query processing module, we discuss its components in more detail in the following section.

3

Query Processing

Once entered into the query processing module, SQL queries are represented as trees of relational operators. The relational operators of a tree can be transformed in many ways (e.g. by applying commutativity or associativity on the operator inputs), and furthermore each relational operator has multiple algorithms to implement its logic. We discuss some of the operator implementations in Section3.3. The distinction between the logic of a relation operator (i.e., what it does) and its actual implementation (i.e., how it does it) is usually made through a notion of logical operator and its corresponding physical operator. Having the last in mind, a query execution plan1is defined as a tree of physical operators. The procedure of finding an efficient query execution plan for a given query is called query optimization.

3.1 Metadata information in the system catalog

In order to find the most efficient query execution plan, the query optimizer uses metadata information (i.e., statistics) describing data characteristics to quantify the cost of different operator orderings and their physical implementations. The metadata information is stored in the system catalog of a DBMS discussed next.

A DBMS stores information about every table, index and view it contains in a special set of tables called catalog tables that together form the system catalog. Although the specific pieces of information kept by different systems may vary, overall there is a minimum set of information kept by every DBMS [203]. For each table, a DBMS stores: a) its table name, the file name in which data is stored and the file structure (e.g. is it a heap file or indexed), b) for each attribute of a table, its name and type, c) for each index on the table, its name, d) integrity constraints such as primary and foreign key constraints. For each index, its name, structure (e.g. B+ tree, or Hash) and the search key attributes are stored. Similarly, for each view a DBMS stores its name and definition. This metadata information is used during query parsing to check whether the query syntax is valid (e.g., whether all tables and attributes of the query exist in the database).

Chapter 3. Query Processing

Additionally, a DBMS maintains statistics representing data distributions of a table or an index. Statistics are a crucial component heavily exploited by the query optimizer, since the optimizer uses them directly during query compilation2to quantify alternative choices. The following information is commonly stored: a) table cardinality, defined as the number of tuples in a table, b) table size, defined as the number of pages the table occupies, c) index cardinality, defined as the number of distinct key values of the table, d) index size, defined as the number of pages of an index, e) index height, defined as the number of non-leaf levels3for each tree index, f ) index range, defined through a minimum and maximum present key value.