Terminology and Basic Concepts

2.1 General Notation

Our focus is on relational database management systems, but most of the results apply to database systems in general. We expect the reader to be familiar with the basic terminology of RDBMS:

Relation: Data is organized in a set of tables also called relations.

Tuple: Each row of a relation is considered one event or unit of data, called a tuple.

Attributes: A relation has a fixed number of columns, referred to as attributes, and all values of an attribute belong to the same domain.

A tuple can be considered as a point in multidimensional space where the coordinates are given by the attribute values of the tuple. Consequently, we also refer to the attributes as dimensions. Often, only a subset of the attributes is used to qualify a tuple. Therefore, the attributes can be partitioned into two sets, a set of qualifying attributes or indexing attributes and a set of informational attributes.

In this thesis a tuple is considered to be a point in multidimensional space. The term universe is used to denote the multidimensional space defined by the Cartesian product of the domains of the attributes. Each attribute determines one dimension. Because of this duality of tuple and point, the terms in each of the following groups are used synonymously:

domain of a relation, universe, multidimensional space

relation, table, subset of a universe

attribute, dimension

attribute value, coordinate

arity, dimensionality

row, tuple, point

2.1.1 Tuple, Relation, Universe

Let D be a domain, i.e., a set of values, with a total ordering <_D. min_D and max_D denote the minimal and maximal values of the domain. If the meaning is clear from the context, we write < instead of <_D. |D| is the cardinality of D.

Due to hardware limitations, the sets of values in real applications are finite, e.g., most of today's CPUs only support numeric data types with a size of 32 bits, and memory and storage are limited.

Definition 2.1 (Direct Neighbor, +_D)

For any domain D with ordering <_D, two values a, b ∈ D are neighbors, i.e., a +_D b, iff (a <_D b ∧ ∄c ∈ D with a <_D c <_D b) ∨ (b <_D a ∧ ∄c ∈ D with b <_D c <_D a).

Lemma 2.1 (Maximum number of neighbors)

A given value a ∈ D has at most 2 neighbors.

Proof 2.1 (Maximum number of neighbors)

The set of possible neighbors of a value a is {b | a +_D b}. There are at most two neighbors for min_D <_D a <_D max_D; otherwise there is only one neighbor, since min_D and max_D are the domain boundaries.
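Definition 2.1 and Lemma 2.1 can be illustrated with a minimal sketch. It assumes the domain is given as a finite collection of distinct, comparable values; the function name `direct_neighbors` is hypothetical and not part of the thesis.

```python
from bisect import bisect_left

def direct_neighbors(domain, a):
    """Return the direct neighbors of `a` in a finite, totally ordered domain.

    `domain` is any iterable of distinct comparable values; `a` must be in it.
    """
    values = sorted(domain)          # materialize the total order <_D
    i = bisect_left(values, a)       # position of a in the ordered domain
    if i == len(values) or values[i] != a:
        raise ValueError(f"{a!r} is not a value of the domain")
    # Predecessor and successor are the only values with nothing in between.
    return values[max(i - 1, 0):i] + values[i + 1:i + 2]

print(direct_neighbors(range(8), 3))    # interior value: two neighbors
print(direct_neighbors(range(8), 0))    # min_D: one neighbor
print(direct_neighbors("acegz", "e"))   # any totally ordered domain works
```

The slices return an empty list at the domain boundaries, which is why a value never has more than two neighbors.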

Definition 2.2 (Attribute, A)

An attribute A is a named domain, i.e., the name can be used to designate the domain.

Definition 2.3 (Tuple, ~t )

A tuple ~t is a vector of n values (t₁, · · · , t_n) from the attributes A₁, · · · , A_n, and all attributes have pairwise different names.

Definition 2.4 (Relation, R)

A relation R is a set of tuples with the same attributes. Without loss of generality, the attributes A₁, · · · , A_d with 1 ≤ d ≤ n are called indexing attributes, and we also refer to them as dimensions D_1, · · · , D_d. The remaining attributes A_{d+1}, · · · , A_n are called informational attributes, and they are qualified by the indexing attributes. |R| is the cardinality of R, i.e., the number of tuples in the relation R. For notational convenience we will refer to the set of possible dimension indices i as D = {1, · · · , d}.

Definition 2.5 (Multidimensional Domain, Ω)

The multidimensional domain Ω of a relation R is the cross product of the d indexing attributes, i.e., Ω = A₁ × · · · × A_d. We say Ω has the dimensionality d. Thus the indexing attributes of a tuple refer to a point in Ω. For tuples having only indexing attributes, we will use the terms tuple and point equivalently.

Definition 2.6 (Volume of Ω, |Ω|)

The volume |Ω| of a universe is given by the product of the cardinalities of the domains, i.e., |Ω| = ∏_{i=1}^{d} |D_i|.

Definition 2.7 (Sparsity, ξ(R))

Given a relation R of the multidimensional domain Ω, the sparsity of R is defined by

ξ(R) = 1 − |R| / |Ω|.

The sparsity of relational applications is typically greater than 99.9% [Ram02], thus most of the universe is not occupied by data points. The unoccupied empty space will be called dead space. There are also applications with a sparsity of 0, usually those processing raster data, e.g., images or signals.
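To give a feeling for the magnitudes involved, the following sketch evaluates Definitions 2.6 and 2.7 with exact rational arithmetic. The relation size and the three 32-bit dimensions are assumptions chosen for illustration only, and the function names are hypothetical.

```python
from fractions import Fraction
from math import prod

def volume(domain_sizes):
    """|Ω|: product of the cardinalities |D_i| of the indexing domains."""
    return prod(domain_sizes)

def sparsity(num_tuples, domain_sizes):
    """ξ(R) = 1 - |R| / |Ω|, computed exactly with rational arithmetic."""
    return 1 - Fraction(num_tuples, volume(domain_sizes))

# Hypothetical relation: 10^9 tuples over three 32-bit integer dimensions.
sizes = [2**32] * 3
xi = sparsity(10**9, sizes)
print(volume(sizes) == 2**96)           # True
print(xi > Fraction(999, 1000))         # True: far above 99.9%
```

Using `Fraction` avoids the rounding that would make such a sparsity indistinguishable from 1 in floating point.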

Definition 2.8 (Multidimensional Ordering, <)

For a given multidimensional domain Ω, two tuples or points ~p, ~q ∈ Ω satisfy ~p < ~q iff ∀i ∈ D : p_i <_{D_i} q_i.

Two tuples are equal when all their attributes are pairwise equal. ~p ≤ ~q is used to denote ~p < ~q ∨ ~p = ~q. While a one-dimensional space has a total order, there is no natural total ordering of multidimensional space that preserves the neighbor relations defined in the following.
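The ordering of Definition 2.8 is only a partial order: some pairs of points are not comparable at all. A minimal sketch, assuming points are given as Python tuples of comparable coordinates (the function names are hypothetical):

```python
def strictly_less(p, q):
    """p < q per Definition 2.8: strictly smaller in every dimension."""
    return len(p) == len(q) and all(pi < qi for pi, qi in zip(p, q))

def less_equal(p, q):
    """p ≤ q, i.e., p < q or p = q."""
    return strictly_less(p, q) or tuple(p) == tuple(q)

print(strictly_less((1, 2), (3, 4)))   # True
print(strictly_less((1, 5), (3, 4)))   # False
print(strictly_less((3, 4), (1, 5)))   # False: the two points are incomparable
```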

Definition 2.9 (Multidimensional Direct Neighbor, +)

For a given multidimensional domain Ω, two points ~p, ~q ∈ Ω are direct neighbors, i.e., ~p + ~q, iff ∃i ∈ D such that p_i +_{D_i} q_i ∧ ∀j ∈ D \ {i} : p_j = q_j.

The number of direct neighbors grows linearly with the number of dimensions.

Definition 2.10 (Multidimensional Neighbor)

For a given multidimensional domain Ω, two points ~p, ~q ∈ Ω are neighbors iff ~p ≠ ~q and ∀i ∈ D : p_i = q_i ∨ p_i +_{D_i} q_i.

In contrast to direct neighbors, this also takes the neighbors at corners into account, i.e., points that differ in more than one dimension. There are up to 3^d − 1 such neighbors, i.e., a d-dimensional cube of side length 3 around ~p without the center ~p. Consequently, the number of neighbors grows exponentially with the number of dimensions. This can be seen as the real reason for the curse of dimensionality.
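The exponential growth can be checked by enumeration. The sketch below assumes integer domains of the form [0, s_i − 1], so that the direct neighbors of a coordinate c are c − 1 and c + 1 where they exist; the function name `neighbors` is hypothetical.

```python
from itertools import product

def neighbors(p, sizes):
    """All neighbors of p (Definition 2.10) in the grid [0, sizes[i]-1]^d."""
    result = []
    for offset in product((-1, 0, 1), repeat=len(p)):
        if not any(offset):
            continue                      # skip p itself
        q = tuple(c + o for c, o in zip(p, offset))
        if all(0 <= c < s for c, s in zip(q, sizes)):
            result.append(q)              # keep only points inside Ω
    return result

for d in (1, 2, 3, 4):
    interior = (1,) * d                   # center of a 3^d cube
    print(d, len(neighbors(interior, (3,) * d)), 3**d - 1)
```

For interior points both printed counts agree; points on the boundary of Ω have fewer neighbors, e.g., a corner of a two dimensional universe has only 3.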

Lemma 2.2 (Maximum and minimum number of direct neighbors of ~p ∈ Ω)

A given point ~p ∈ Ω has at most 2d direct neighbors and at least d.

Proof 2.2 (Maximum number of direct neighbors of ~p ∈ Ω)

According to Definition 2.9, two direct neighbors are equal in all dimensions except one dimension i, and there are d possible values for i. For a given i there are at most 2 neighbors with respect to D_i, as we have seen in Lemma 2.1: if a point is at the start or end of a dimension, it has only one neighbor with respect to this dimension; otherwise it has two neighbors.
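The bounds of Lemma 2.2 can be observed on a small grid. This is a sketch under the assumption of integer domains [0, s_i − 1] with at least two values per dimension; `direct_neighbors_nd` is a hypothetical name.

```python
def direct_neighbors_nd(p, sizes):
    """Direct neighbors of p (Definition 2.9) in the grid [0, sizes[i]-1]^d."""
    result = []
    for i, (c, s) in enumerate(zip(p, sizes)):
        for step in (-1, 1):              # at most two neighbors per dimension
            if 0 <= c + step < s:
                result.append(p[:i] + (c + step,) + p[i + 1:])
    return result

sizes = (8, 8, 8)
print(len(direct_neighbors_nd((3, 4, 5), sizes)))   # interior point: 2d = 6
print(len(direct_neighbors_nd((0, 0, 7), sizes)))   # corner point: d = 3
print(len(direct_neighbors_nd((0, 4, 5), sizes)))   # point on a face: 5
```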

2.1.2 Query, Result Set, and Selectivity

In the following we define the terminology for queries on a relation. We limit our view to selection queries resulting in a subset of the relation, i.e., projection, sorting, and aggregation are not considered. Spatial queries are not considered here either, since relations as defined before consist only of points. For further reading on the management of spatial data the reader is referred to Chapter 7.

Definition 2.11 (Query, Q)

A query Q ⊆ Ω is a subset of the multidimensional domain defined by predicates.

Definition 2.12 (Result Set, Q(R))

The result set Q(R) is the subset of tuples of R within the query, i.e.,

Q(R) = {~t ∈ R|~t ∈ Q} = R ∩ Q.

Definition 2.13 (Selectivity of a Query, sel(Q))

The selectivity of a query is defined as the ratio of the result set size to the size of R, i.e., sel(Q) = |Q(R)| / |R|.
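Definitions 2.12 and 2.13 translate directly into code when the query region Q is represented by a membership predicate. The relation, the predicate, and the function names below are assumptions made for illustration; the sketch also assumes a non-empty relation.

```python
from fractions import Fraction

def result_set(relation, query):
    """Q(R): the tuples of R that lie inside the query region Q."""
    return {t for t in relation if query(t)}

def selectivity(relation, query):
    """sel(Q) = |Q(R)| / |R| as an exact fraction (R must be non-empty)."""
    return Fraction(len(result_set(relation, query)), len(relation))

# Hypothetical relation and query region, chosen only for illustration.
R = {(1, 1), (2, 6), (4, 3), (6, 2)}
Q = lambda t: t[0] <= 4 and t[1] <= 4      # membership test for Q ⊆ Ω
print(sorted(result_set(R, Q)))            # [(1, 1), (4, 3)]
print(selectivity(R, Q))                   # 1/2
```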

Definition 2.14 (Volume of a Query, vol(Q))

The volume of a query is its cardinality, i.e., vol(Q) = |Q|.

Queries can be classified according to their predicate type, i.e., we have the following three basic multidimensional query types, which are also depicted in Figure 2.1:

Figure 2.1: Example for a two dimensional Universe and Queries on it. Panels: (a) Universe, (b) Point Query, (c) Range Query, (d) NN Query; axes: dimension 1 and dimension 2.

Point Query: Q_P(~p) = {~q ∈ R|~q = ~p} for a given ~p ∈ Ω.

Range Query: Q_R(~l, ~u) = {~q ∈ R|~l ≤ ~q ≤ ~u} for a given ~l, ~u ∈ Ω.

Nearest Neighbor Query: Q_NN(~p, ∆, δ) = {~q ∈ R|∆(~p, ~q) ≤ δ} for a given ~p ∈ Ω, distance function ∆ and maximum distance δ.

A range query is specified by a lower bound ~l and an upper bound ~u with ~l ≤ ~u, which restrict dimension i to the interval [l_i, u_i]. Therefore, a range query is a multidimensional interval [~l, ~u] which corresponds to an iso-oriented rectangular subspace of the universe.

A point query is a special case of a range query where lower and upper bound are equal, i.e., ~l = ~u. Queries restricting only a subset of all possible dimensions are called partial match queries.

By combining multiple range queries one can compose arbitrary query shapes [FMB99].

Other query types require additional predicates, e.g., nearest neighbor queries require a distance function.
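The relationship between range, point, and partial match queries can be sketched as follows. The sketch follows the interval reading given above, i.e., every dimension i is restricted to [l_i, u_i]; the function names and the sample relation are hypothetical.

```python
def range_query(relation, lower, upper):
    """Q_R(l, u): tuples q with l_i <= q_i <= u_i in every dimension i."""
    return {q for q in relation
            if all(l <= c <= u for l, c, u in zip(lower, q, upper))}

def point_query(relation, p):
    """Q_P(p) expressed as the range query with l = u = p."""
    return range_query(relation, p, p)

def partial_match_query(relation, bounds, domain_min, domain_max):
    """Restrict only some dimensions; `bounds` maps index i -> (l_i, u_i).

    Unrestricted dimensions span the whole domain [min, max].
    """
    lower, upper = list(domain_min), list(domain_max)
    for i, (l, u) in bounds.items():
        lower[i], upper[i] = l, u
    return range_query(relation, lower, upper)

# Hypothetical data over the universe [0, 7] x [0, 7].
R = {(1, 1), (2, 6), (4, 3), (6, 2)}
print(sorted(range_query(R, (1, 1), (4, 4))))                     # [(1, 1), (4, 3)]
print(sorted(point_query(R, (2, 6))))                             # [(2, 6)]
print(sorted(partial_match_query(R, {1: (2, 3)}, (0, 0), (7, 7))))  # [(4, 3), (6, 2)]
```

Expressing the point and partial match queries through `range_query` mirrors the text: both are range queries with specially chosen bounds.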

Example 2.1 (Relation, Universe, Queries)

The two dimensional universe in Figure 2.1(a) corresponds to the relation R with two integer attributes A₁ = A₂ = [0, 7]. The volume of Ω is |Ω| = |A₁| · |A₂| = 8 · 8 = 64.

The relation consists of the points {(0, 2), (7, 1), (3, 4), (5, 5), (0, 7)} (black squares in Figure 2.1(a)) and thus its sparsity is ξ(R) = 1 − |R| / |Ω| = 1 − 5/64 ≈ 92%. Figure 2.1(b) depicts a point query Q_P(3, 4), Figure 2.1(c) a range query Q_R((2, 2), (5, 6)) and Figure 2.1(d) a nearest neighbor query Q_NN((3, 4), ∆, δ) with ∆ being the Euclidean distance and δ the maximum distance.
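The numbers of Example 2.1 can be recomputed with a few lines. The value δ = 2.5 is an assumption, since the example does not fix the maximum distance; the range query uses the interval reading [l_i, u_i] per dimension.

```python
from fractions import Fraction
from math import dist

R = {(0, 2), (7, 1), (3, 4), (5, 5), (0, 7)}    # points of Example 2.1
volume = 8 * 8                                   # |Ω| for A1 = A2 = [0, 7]
print(float(1 - Fraction(len(R), volume)))       # sparsity 1 - 5/64 = 0.921875

point = {q for q in R if q == (3, 4)}                        # Q_P(3, 4)
box = {q for q in R if 2 <= q[0] <= 5 and 2 <= q[1] <= 6}    # Q_R((2, 2), (5, 6))
delta = 2.5                                                  # assumed value of δ
nn = {q for q in R if dist((3, 4), q) <= delta}              # Q_NN((3, 4), Euclidean, δ)

print(sorted(point), sorted(box), sorted(nn))
print(Fraction(len(box), len(R)))                # selectivity of the range query: 2/5
```

With these assumptions the range query and the nearest neighbor query both return (3, 4) and (5, 5), so each has a selectivity of 2/5.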

In document Advanced Concepts and Applications of the UB-Tree (Page 29-33)