Set Theory - SQL-99 Complete Really

Georg Cantor was a German.

He invented Set Theory.

He was committed to a mental institution.

He died in 1918.

We can explain the preceding statements using Georg Cantor's own theory: a set is any collection of definite distinguishable things. We can conceive of the set as a whole, and in fact, we often do.

For example, we speak of "the Germans" (a set) and can rephrase our first statement as "Georg Cantor was a member (or element) of the set of Germans." By rephrasing, we emphasize the collection of individual things over the individual things themselves. That much is intuitive. But Cantor was careful in his choice of words. By "distinguishable'' (or distinct), he meant that in looking at any two things which fit in the set, we must be able to decide whether they are different. By "definite," he meant that if we know what the set is andcontinue

Page 14

we know what the thing is, we can decide whether the thing is a member of the set. Therefore, to know what a set "is," it is sufficient to know what the members are.

Here are a few examples. The "Germans" set also included Kaiser Wilhelm. However, it can be proved from historical records that Cantor was not a pseudonym or alias that the Kaiser used while off duty — therefore, the two members are distinguishable. At the same time, we know that there were several million other Germans, also distinguishable, and we could define the set by taking a census of all the Germans. There might be some difficulty deciding the individual question "What is a German?", but once that was done, there would be no difficulty deciding the collective question "What are the Germans?". Therefore, the members define the set, i.e., the members are definite.

The census we spoke of is possible because the Germans were a finite set (a fact which would have bored Cantor because he developed his theory to explain various gradations of infinity). We could enumerate the set thus:

{Georg Cantor, Kaiser Wilhelm, . . .}

In this enumeration, we used braces to indicate "we are enumerating a set" and an ellipsis to indicate "and so on" — that is, the list could go on but we felt it unnecessary to continue for the sake of our exposition. These are standard conventions and we will use braces and ellipses again.

Enumeration is unwieldy for large sets, so let us revisit the question "What is a German?" by taking the bounds stated in the song "Deutschland Ueber Alles" — a German is a person living between the Maas, the Memel, the Esch, and the Belt (four bodies of water which now lie

respectively in Holland, Russia, Austria, and Denmark). In Cantor's terms, that formula expresses a defining property. It is either true or it is false. If it is true, the person is in the set. If it is false, the person is outside the set.

Without stating the defining property in advance, the German census-takers would be unable to put together their forms and plan their information collection. The objective, though, is to produce an enumeration of all Germans. In computer terminology, we call the definition the database design and the enumeration the database itself. The "Germans" set can be broken up into several subsets such as:

{Berliners, Frankfurters, Hamburgers, . . .}

These subsets are also sets with defining properties (city of residence) and presumably, with members — but that is not necessary. For example, the set of Germans who live in the Rhine River is an empty set, i.e., it has no members, but it is still a set. The implicit breaking-up that happens when we ask "Among the Germans which ones are Frankfurters?" is an example of a set operation. A set operation is something we do with sets that results in the production of more sets.

Relations

First, let's consider a binary relation — that is, a relation between two things. The things don't have to be of the same type; all we are concerned with is that they have some type of bond and that they be in order. Getting back to our hero . . . there is a relationship between Georg Cantor and the concept Set Theory (he invented it). There is also a relationship between Kaiser Wilhelm and World War I (he started it). We could use our braces notation to enumerate this:break

{ (Georg Cantor, Set Theory), (Kaiser Wilhelm, World War I) }

Page 15

but it looks clearer when diagrammed as a two-dimensional table:

NAME ACTIVITY

Georg Cantor Set Theory

Kaiser Wilhelm World War I

There are some points to note about this diagram.

1. The table shows ordered pairs — we couldn't reverse them because Set Theory didn't invent Georg Cantor and World War I didn't cause Kaiser Wilhelm — the relationship between the NAME and ACTIVITY values is directional. Note, however, that the word "ordered" refers only to the horizontal order in the illustration — across the relation. We don't care which member of the set is listed first.

2. The table shows binary pairs — there is no space on the line for marking Georg Cantor's other achievements. Under "What did he do?" we can only express one thing. So, ordered means

ordered, and pair means pair.

So what, precisely, is the "relation" here? Well, it's the whole thing. The relationship is the set of all the ordered pairs and the ordering itself (i.e., how part A relates to part B). What we have in the preceding diagram is a picture of a relation and nothing but a relation.

This relation is a set. It has members. The members define the set. However, the members are no longer "elements," but ordered pairs. There's no reason to limit ourselves to ordered pairs though.

That's just the simplest relation, the binary relation, which is sometimes called "a relation of degree 2" (because there are two columns). We could have relations of degree 3, degree 4, degree 5, degree 6, degree 7, and so on, i.e., relations which are not binary but triple, quadruple,

pentuple, sextuple, septuple . . . Did you notice how after a while all the words ended in "-tuple"?

That's why the general term for a relation with n elements is n-tuple. Here is a relation of degree 4 showing all the information we have so far.

NAME ACTIVITY RESIDENCE DATE_OF_DEATH

Georg Cantor Set Theory Mental Institution

1918

Kaiser Wilhelm World War I Imperial Palace

????

. . .

Some differences now appear between the words Cantor used ("set theory terminology") and the words we use ("database terminology"). The line:

{Georg Cantor, Set Theory,Mental Institution, 1918}

would be one tuple to Cantor, but in SQL, we prefer the word row. Or, to be precise, the row value is the four-element:

{Georg Cantor,Set Theory,Mental Institution, 1918}

and the row is the box that holds that information. (We don't usually need to be so precise in ordinary discussion.) Meanwhile, going down rather than across, Cantor would say that we have four attributes, labeled NAME, ACTIVITY, RESIDENCE, and DATE_OF_DEATH. But in SQL, wecontinue

Page 16

prefer the word column instead and we call each element going down (such as 'Georg Cantor' or 'Kaiser Wilhelm') a column value.

Here's a quick summary. Moving across the relation are tuples (but don't use that word) or rows.

Moving up-and-down the relation are attributes (but don't use that word) or columns. The contents of a row is a row value. The intersection of a row and a column is a column value. The column value is "atomic" — it has only one element. There is always exactly one column value in each atomic box that is formed by the intersection of a row with a column.

Incidentally, in the diagram we used an ellipsis once more to mean "and so on" — there are more Germans in the set. We also used the symbol "????" for the column value of "When did he die?"

for Kaiser Wilhelm. This is not a standard symbol; there is no standard way of expressing the fact that not only do we not know when Kaiser Wilhelm died (i.e., value is Unknown), but we're not even sure that he's dead (i.e., category is Not Applicable). This is not a problem for Set Theory but it is a problem in practice and we'll return to it later. When we do, we'll call the "????" a NULL or null value.

And there we have it. A relation is an ordered n-tuple set with all members having the same degree, which is representable as a table of rows and columns. It really does seem appropriate that we should know what a relation is because (we trust the following is not an unpleasant surprise) SQL is a relational database management system, and that means the database is made of relations. To quote the Kellogg's Rice Krispies commercial, "What the heck did you think they were made of?"

Admittedly, when anything looks that obvious, it must be a lie. And it is. Relational databases are actually made of tables rather than relations. What's the difference? Well, with a table we can have two Georg Cantors. This clearly breaks Cantor's "distinguishable" rule and it's quite a big break; we all know that the famous "Set Of Integers" doesn't go

{1,1,1,1,1,2,2,2,2,2,3,3,3,3,3, . . .}

By definition, a set has no duplicates. A relation is a set, so a relation has no duplicates. But a table can have duplicates. This is a concession to practicalities, since we know that duplicate data happens in the "real world." However, if our database contains two Georg Cantors who really are the same person, then we have an error — while if it contains two Georg Cantors whom we can't distinguish from one another, then we have a design flaw. The long and the short of it all is (a) a relational database consists of tables, (b) a table is not a relation, but the only difference between the two is that a table may have rows with duplicate row values, (c) you should get rid of

duplicates regardless, therefore (d) all relational databases should consist of relations. (By the way, a table is sometimes called a multiset to distinguish it from a regular set. You'll see the word multiset in some Microsoft publications.)

In document SQL-99 Complete Really (Page 38-41)