Terminology and Basic Concepts
2.5 Access Methods
As said before, in order to utilize clustering an index, also referred to as access method, is necessary. An index provides a way to organize and query data, i.e., it helps to efficiently find queried data.
Different storage devices also require different access methods in order to be accessed efficiently. In general, primary storage data structures try to minimize the number of tuples to process, while secondary storage index structures are minimizing the number of accesses to secondary storage. For example, AVL-Trees perform well for main memory since they are balanced binary trees, but B-Trees are superior for secondary storage due to their higher degree of branches per node and thus they are substantially smaller with respect to height and reduce the number of I/Os. This is crucial since accessing the secondary storage is the limiting factor and usually not the processing of tuples.
2.5.1 Clustering Index, Primary Index, Secondary Index
Without an index, data can be only accessed unqualified, i.e., in order to find relevant tuples the whole table has to be scanned. In the relational world this is called full table scan (FTS) or short scan.
A clustering index defines the physical organization of data , i.e., its clustering, and provides direct access to it. There can only be one clustering index for one instance of data and usually it is equivalent to the primary index, i.e., if they have the same key. Any other index on the same instance of data is a secondary index and has no effect on the clustering.
Secondary indexes require an additional indirection, i.e., they return tuple references rather than tuples. Possible tuple references are:
keys of another index, usually the primary index (tuple id: TID)
references to pages containing the relevant tuples
references to pages plus the positions of tuples on the pages (row id: ROWID) Using references to pages allows to circumvent the primary index during tuple fetching, but it requires additional efforts to correct the references when tuples are relocated due to page splits or merges.
Figure 2.6 shows a primary index referring to pages and a secondary index referring to the tuples stored on the data pages of the primary index. Data pages are displayed in physical order and we assume the secondary index has a different order than the primary index. It is obvious that a range query on the secondary index will cause random accesses when fetching the tuples.
One way to circumvent the random accesses is to include all queried attributes in the secondary index, i.e., make it a covering index for the queries that utilize it. Another
Linked Data Pages Clustering Page−Based Index
Secondary Index
Figure 2.6: Clustering Page-Based Index vs. Secondary Index
option is to postpone the access until all (or a substantial amount) of the TIDs/ROWIDs are available and to sort them ascending before actually fetching the tuples.
Further, a clustering index allows to index pages instead of tuples. This allows for a smaller index and better performance due to the page based processing. Pages lying within a queried range can be returned for further processing without inspecting each tuple. Secondary indexes do not allow this optimization since they are not clustering.
2.5.2 Address
To distinguish between the key of a relation and the key of an index we will refer to the latter one as address. Without loss of generality we define an address used by an index as follows:
Definition 2.20 (Address α(~p ) of a tuple ~p )
Let A be the domain of addresses. The address function A : Ω → A maps each tuple
~p ∈ Ω of a relation R to an address α ∈ A.
Note that addresses may not be unique and thus an address function is not necessarily a bijective function. In the following addresses are denoted by Greek letters.
An example for an address is a B-Tree key, which is a single attribute of a tuple or the concatenation of multiple attributes, usually referred to as compound or composite key.
Another example are hashing functions mapping tuples to buckets, thus the address is the bucket id.
Internally indexes usually use a different representation of addresses. A universal in-ternal representation of addresses is a bit-string. An index supporting this representation requires only the definition of transformations from the attribute type to its bit-string representation and thus does not need to know anything about the actual data type.
Furthermore, bit-strings have no size limit like fixed size types (e.g., integer, float, etc.) and bit-operations are supported efficiently by assembler code. A bit-string corresponds to a positive integer number and thus we may use this representation where appropriate.
|α| denotes the length of an address in bits.
2.5.3 Accessed Data
During query processing the actually accessed data differs depending on the used indexing technique which are depicted in Figure 2.7. A perfect index would access only the queried data (Figure 2.7(a)), but this does not happen in general.
A one dimensional index is only able to utilize the restriction on one dimension and has to apply post-filtering to the result as depicted in Figure 2.7(b). In worst case it has to read the whole table, i.e., if a partial match query does not restrict the index dimension.
With secondary indexes there is one index for each dimension. Query processing is performed by utilizing the restriction on each index separately and then the results are intersected. The results are usually tuple identifiers. Finally, the result tuples are fetched by random point queries (middle part of Figure 2.7(c)). Sorting the tuple identifiers ac-cording to the primary index before accessing the data can speedup the access by reducing the randomness, but it breaks pipelining, as fetching the tuples can start only after sorting.
A multidimensional index utilizes the restrictions in all dimensions and in average loads some “small” fraction of additional data besides the actual query shape (Figure 2.7(d)).
dimension 2
dimension 1
(a) Range Query
dimension 2
dimension 1
(b) One-Dimensional Index on dimension 2 with post-filtering on the actual query area
dimension 2
dimension 1
(c) Two Secondary-Indexes (one per dimension) and their intersection
dimension 2
dimension 1
(d) Multi-Dimensional Index accessing some areas outside the actual query area
Figure 2.7: Range Query and actually affected Area of the Universe