3.1.1 The Model-Theoretic View on Databases

The intellectual roots of databases are in first-order logic (Abiteboul, Hull, and Vianu 1995); in particular, in finite model theory (Libkin 2004). Thus, we adopt the model-theoretic perspective and view databases as first-order structures over some fixed domain.

Definition 3.1 (database) Let σ be a finite relational vocabulary, which consists of finite sets R of predicate, C of constant, and V of variable names. A σ-atom is a ground atom over the vocabulary σ. A database D over a relational vocabulary σ is a finite set of σ-atoms.

We usually omit the vocabulary σ from the notation if it is clear from context. A classical representation of a relational database is in terms of database tables, which organize sets of atoms relative to the predicate names. Each table corresponds to a predicate and its rows correspond to ground atoms of that predicate, which are also called records, facts, or tuples.

Example 3.2 Let us consider a database Dm given in Table 3.1, in terms of three tables for the predicates StarredIn, DirectedBy, and Awarded, respectively. For example, the last row in the DirectedBy table is interpreted as the ground atom DirectedBy(winterSleep, ceylan). This database is represented as

Dm := {StarredIn(deNiro, taxiDriver), . . . , Awarded(winterSleep, palmed’Or)},

in set notation. We will use both notations interchangeably. ♦

The query semantics is then given as a first-order semantics, which also takes into account the additional assumptions employed in databases.

Definition 3.3 (semantics) Let σ be a finite relational vocabulary, which consists of finite sets R of predicate, C of constant, and V of variable names. A database D over σ defines a first-order interpretation over σ, where

i) the domain is given as ∆^D = C, (closed-domain assumption)

ii) a^D = a, for all constants a ∈ C, (standard name assumption)

iii) (a1, . . . , an) ∈ P^D iff P(a1, . . . , an) ∈ D, for all n-ary predicates P ∈ R. (closed-world assumption)

The satisfaction relation |= is then defined, as before. We say that a database D satisfies a formula Φ if D |= Φ. ♦

Let us have a closer look at this semantics and highlight some of the important differences with classical first-order semantics, which will be critical throughout this work. The closed-domain assumption (CDA) restricts the domain of an interpretation to a finite, fixed set of constants; namely, to database constants. As a matter of fact, such interpretations presume that the domain is complete. The standard name assumption (SNA) ensures a bijection between the database constants and the domain: it is not possible to refer to the same individual in the domain with two different constant names. Last, but certainly not least, the closed-world assumption (CWA) of databases forces anything that is not known to be true to be false (Reiter 1978); that is, databases make the data completeness assumption. Besides, under the closed-domain and standard name assumptions, it is easy to see that the given semantics coincides with the Herbrand semantics (Hinrichs and Genesereth 2006). Briefly, a Herbrand interpretation over σ maps every σ-atom to either true or false. Then, a database is simply a Herbrand interpretation, where the atoms that appear in the database are mapped to true, while the ones not in the database are mapped to false, according to the closed-world assumption.
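The three assumptions can be made concrete with a minimal Python sketch over the database Dm of Table 3.1. The encoding of atoms as tuples, the names `C`, `R`, `holds`, and `herbrand`, and the choice of taking C to be exactly the constants occurring in Dm are assumptions made for illustration only; they are not part of the formal development.

```python
from itertools import product

# Hypothetical encoding: a ground atom is a tuple (predicate, arg1, arg2).
Dm = {
    ("StarredIn", "deNiro", "taxiDriver"),
    ("StarredIn", "foster", "taxiDriver"),
    ("StarredIn", "thurman", "pulpFiction"),
    ("StarredIn", "travolta", "pulpFiction"),
    ("DirectedBy", "pulpFiction", "tarantino"),
    ("DirectedBy", "taxiDriver", "scorsese"),
    ("DirectedBy", "whiteRibbon", "haneke"),
    ("DirectedBy", "winterSleep", "ceylan"),
    ("Awarded", "pulpFiction", "palmed’Or"),
    ("Awarded", "taxiDriver", "palmed’Or"),
    ("Awarded", "whiteRibbon", "fibresci"),
    ("Awarded", "winterSleep", "palmed’Or"),
}

# Closed-domain assumption: the domain is the constant set C.
# Assumption of this sketch: C is exactly the set of constants occurring in Dm.
C = {arg for atom in Dm for arg in atom[1:]}

# Predicate names; all three predicates of Dm happen to be binary.
R = {atom[0] for atom in Dm}

# Standard name assumption: every constant denotes itself, so the
# strings in C serve both as names and as domain elements.

def holds(atom):
    """Closed-world assumption: an atom is true iff it occurs in Dm."""
    return atom in Dm

# Herbrand view: every ground atom over R and C is mapped to true or false.
herbrand = {
    (p, a, b): holds((p, a, b))
    for p in R
    for a, b in product(C, repeat=2)
}
```

In this sketch, `holds(("DirectedBy", "taxiDriver", "tarantino"))` evaluates to false simply because the atom is absent from Dm, which is exactly the closed-world reading.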

The simplifying assumptions of databases are useful for a variety of reasons. At the same time, these assumptions can easily lead to undesirable consequences, as we will elaborate. We will revisit some of these assumptions and discuss their implications in depth in the sequel, and illustrate our major motivation from this perspective.

The most fundamental task in databases is query answering; that is, given a database D and a formula Φ(x1, . . . , xn) of first-order logic, to decide whether there exist assignments to the free variables x1, . . . , xn such that the resulting formula is satisfied by the database. Importantly, here the variable assignments are of a special type, also called substitutions. Formally, a substitution [x/t] replaces all occurrences of the variable x by some database constant t in some formula Φ[x, y], denoted Φ[x/t].

Table 3.1: A database Dm represented in terms of relational database tables.

StarredIn
deNiro        taxiDriver
foster        taxiDriver
thurman       pulpFiction
travolta      pulpFiction

DirectedBy
pulpFiction   tarantino
taxiDriver    scorsese
whiteRibbon   haneke
winterSleep   ceylan

Awarded
pulpFiction   palmed’Or
taxiDriver    palmed’Or
whiteRibbon   fibresci
winterSleep   palmed’Or

Given these preliminaries, we can now formulate query answering as a decision problem. Note that we will mostly focus on a special case of this problem, called Boolean query answering, which we will also refer to as query evaluation.

Definition 3.4 (query answering, evaluation) Let σ be a relational vocabulary; Φ(x1, . . . , xn) be a first-order formula over σ; and D be a database over σ. Then, query answering is to decide whether D |= Φ[x1/a1, . . . , xn/an] for a given substitution (answer) [x1/a1, . . . , xn/an] to the free variables x1, . . . , xn. For a Boolean formula Φ, Boolean query answering, or simply query evaluation, is to decide whether D |= Φ. ♦

There exists a plethora of query languages in the literature. Classical database query languages range from the well-known conjunctive queries to arbitrary first-order queries, which we briefly introduce.

Definition 3.5 (query languages) A conjunctive query over σ is an existentially quantified formula ∃x⃗.Φ(x⃗, y⃗), where Φ(x⃗, y⃗) is a conjunction of σ-atoms. A Boolean conjunctive query over σ is a conjunctive query without free variables. A union of conjunctive queries is a disjunction of conjunctive queries. A union of conjunctive queries is Boolean if it does not contain any free variables. The class of Boolean unions of conjunctive queries is denoted as UCQ.

We always focus on Boolean queries throughout this thesis unless explicitly mentioned otherwise. Unions of conjunctive queries are the most common database queries used in practice; thus, they will also be emphasized in this work. Moreover, note that full relational algebra corresponds to the class of first-order formulas (modulo some operators). Therefore, we include fragments of the class of first-order formulas as query languages in our analysis. In particular, we study ∃FO, ∀FO and FO queries, introduced in Chapter 2, as query languages. Finally, we sometimes use different syntactic forms to represent relational queries, such as CNF or DNF.

We also speak of matches for Boolean queries. Informally, a match is an assignment to the variables in the query such that the resulting ground query is satisfied by the database.

Definition 3.6 (match) Let Q be a Boolean query over σ, D a database over σ and V(Q) be the set of variables that occur in Q. A mapping ϕ : V(Q) ↦ C is called a match for the query Q in D if D |= ϕ(Q).
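As a minimal sketch of this definition, the following Python code enumerates all mappings from the query variables to constants and keeps those that are matches. It assumes a conjunction of atoms as the query body, variable names that are disjoint from the constants, and, for brevity, only the DirectedBy facts of Table 3.1; the helper names `apply`, `mappings`, and `is_match` are hypothetical.

```python
from itertools import product

# Fragment of Dm from Table 3.1 (DirectedBy facts only), atoms as tuples.
D = {
    ("DirectedBy", "pulpFiction", "tarantino"),
    ("DirectedBy", "taxiDriver", "scorsese"),
    ("DirectedBy", "whiteRibbon", "haneke"),
    ("DirectedBy", "winterSleep", "ceylan"),
}

# Assumption: the constant set C consists of the constants occurring in D.
C = sorted({arg for atom in D for arg in atom[1:]})

def apply(phi, atom):
    """Ground an atom by replacing each variable with its image under phi."""
    pred, *args = atom
    return (pred, *(phi.get(arg, arg) for arg in args))

def mappings(variables):
    """Enumerate all mappings from the given variables to constants in C."""
    for values in product(C, repeat=len(variables)):
        yield dict(zip(variables, values))

def is_match(phi, query):
    """phi is a match if every atom of the ground conjunction is in D."""
    return all(apply(phi, atom) in D for atom in query)

# Query body DirectedBy(y, tarantino) with the single variable y.
query = [("DirectedBy", "y", "tarantino")]
variables = ["y"]

found = [phi for phi in mappings(variables) if is_match(phi, query)]

# Reading y as existentially quantified: one match is enough.
exists_holds = any(is_match(phi, query) for phi in mappings(variables))
# Reading y as universally quantified: every mapping must be a match.
forall_holds = all(is_match(phi, query) for phi in mappings(variables))
```

Under these assumptions the only match is [y/pulpFiction], so the existential reading holds while the universal reading does not.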

For existentially quantified queries, a single match suffices to satisfy the given Boolean query. Conversely, for universally quantified queries, every mapping must be a match in order to satisfy the query. Let us now briefly illustrate these notions on the database Dm given in Table 3.1 and on a simple conjunctive query.

Example 3.7 Let us consider again the database Dm and the non-Boolean query

Qt(x) := ∃y StarredIn(x, y) ∧ DirectedBy(y, tarantino),

which asks for actors that starred in a Tarantino movie. Answers to such queries are tuples from the database. For example, Qt(x) has two answers in the given database, namely [x/thurman] and [x/travolta]. For each of these answers, there is a match, namely [y/pulpFiction], for the resulting Boolean query. We focus on Boolean variants of these queries. Answers to such queries are either true or false. For example, the query

∃x, y StarredIn(x, y) ∧ DirectedBy(y, tarantino),

where all variables are existentially quantified, returns true on the given database since there is a match for the query. ♦
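The example can be replayed with a short Python sketch that evaluates the conjunctive query as a join over the two relevant tables. The tuple encoding of atoms and the names `witnesses`, `answers`, and `boolean_answer` are assumptions of this sketch; the Awarded facts are omitted because the query does not mention them.

```python
# Fragment of Dm from Table 3.1 that the query refers to, atoms as tuples.
Dm = {
    ("StarredIn", "deNiro", "taxiDriver"),
    ("StarredIn", "foster", "taxiDriver"),
    ("StarredIn", "thurman", "pulpFiction"),
    ("StarredIn", "travolta", "pulpFiction"),
    ("DirectedBy", "pulpFiction", "tarantino"),
    ("DirectedBy", "taxiDriver", "scorsese"),
    ("DirectedBy", "whiteRibbon", "haneke"),
    ("DirectedBy", "winterSleep", "ceylan"),
}

# For each candidate answer [x/actor], collect the witnesses [y/movie]
# such that StarredIn(actor, movie) and DirectedBy(movie, tarantino) hold.
witnesses = {}
for pred, actor, movie in Dm:
    if pred == "StarredIn" and ("DirectedBy", movie, "tarantino") in Dm:
        witnesses.setdefault(actor, set()).add(movie)

# Answers to the non-Boolean query Qt(x).
answers = sorted(witnesses)

# Boolean variant: true iff at least one match exists.
boolean_answer = bool(witnesses)
```

Here `answers` contains thurman and travolta, each witnessed by pulpFiction, and the Boolean variant evaluates to true, in line with the example.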

Importantly, UCQ denotes a class of queries (in analogy to ∃FO, ∀FO, and FO); thus, strictly speaking, it is not an abbreviation for “unions of conjunctive queries”. Nevertheless, we will slightly abuse this notation for unions of conjunctive queries and write “a UCQ Q” instead of “a UCQ query Q”.

From a conceptual perspective, relational databases can also be viewed as propositional models, where every atom is mapped to a different proposition. Similarly, a database query can be rewritten into a propositional formula by naïvely grounding the query over the database constants and then replacing each ground atom in the resulting formula with a propositional variable. The propositional representation of the query is commonly known as the lineage of the query and can be exponentially large in the size of the database.
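The naïve grounding described above can be sketched in Python for the Boolean query ∃x, y StarredIn(x, y) ∧ DirectedBy(y, tarantino). The sketch uses a small fragment of Dm to keep the output readable, represents the lineage as a list of conjuncts (a DNF), and treats each distinct ground atom as one propositional variable; the names `lineage`, `prop_vars`, and `satisfied` are hypothetical.

```python
from itertools import product

# Small fragment of Dm from Table 3.1, atoms as tuples.
D = {
    ("StarredIn", "deNiro", "taxiDriver"),
    ("StarredIn", "thurman", "pulpFiction"),
    ("DirectedBy", "pulpFiction", "tarantino"),
    ("DirectedBy", "taxiDriver", "scorsese"),
}

# Assumption: the constants are those occurring in the fragment.
C = sorted({arg for atom in D for arg in atom[1:]})

# Naive grounding: one conjunct per assignment of (x, y) to constants,
# so the DNF has len(C) ** 2 disjuncts for this two-variable query.
lineage = [
    (("StarredIn", a, b), ("DirectedBy", b, "tarantino"))
    for a, b in product(C, repeat=2)
]

# Each distinct ground atom plays the role of a propositional variable.
prop_vars = {atom for conjunct in lineage for atom in conjunct}

# The database induces a truth assignment: atoms in D are true, all others false.
satisfied = [c for c in lineage if all(atom in D for atom in c)]
query_holds = len(satisfied) > 0
```

With six constants in the fragment, the grounding has 36 disjuncts, of which only the one for [x/thurman, y/pulpFiction] is satisfied by the database.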