Each attribute in a UML class becomes a column in a table. Again, you must worry about name length; this is 18—32 characters depending on your target DBMS. Most of the transformation of attributes has to do with creating suitable data type and constraint declarations, however.
Recall the UML attribute definition syntax from Chapter 7: stereotype visibility name : type-expression =
initial-value { property-string }
There are no attribute stereotypes, and visibility in relational databases is always public (+), so you can ignore these parts of the attribute definition. See the section on "Operations and Methods" for a way to pretend that your attributes are private, however.
Figure 11-6 presents the SQL-92 syntax for the column definition.
The <data type> or <domain name> corresponds to the type-expression you've associated with the attribute in UML. See the following section on "Domains and Data Types" for details. The <column constraint definition> is one of the several types of integrity constraint you can specify or the NOT NULL constraint. This corresponds to the tagged values in the property-string. See the following section on "Constraints and Object Identity."
Domains and Data Types Your type-expression may include two kinds of type: a type that directly corresponds to a basic SQL data type or a more complex domain based on such a type (see the previous section "Types and
Domains"). You use your type transformation table to determine a reasonable SQL data type for the attribute. Beyond that, things get complex.
Figure 11-6: The SQL-92 <column definition> Syntax
For enumerated types, you should build a type table (see "Types and Domains"). You can then specify a foreign key constraint to map the column to that table:
REFERENCES LegalStatusType(Value),
This column definition links to the LegalStatus Type table's Value column to get the set of possible values for the LegalStatus column in the Criminal Organization table. You use the same data type as the type table defines (CHAR(1)). The NOT NULL is optional, as is the DEFAULT clause. You may want to depend on a default table approach in which the application queries the default value rather than using a DEFAULT clause.
For the kind of subtyping available in SQL, you can specify character size, numeric precision and scale, and any other such restrictions available in your target RDBMS.
For ranges and other subtypes of that sort, you must use a CHECK constraint: IDNumber NUMBER(10,0) CHECK (IDNumber > 999999999),
This column definition specifies a 10-digit integer number, a number between 1,000,000,000 and 9,999,999,999. Notice how the numeric precision combines with the check constraint to enforce the range. You can also specify the range directly:
IDNumber INTEGER CHECK (IDNumber BETWEEN 1000000000 AND 9999999999),
To exclude noninteger values, this definition uses the INTEGER data type. This is not a native Oracle7 data type but rather an internal synonym for NUMBER with a scale of 0.
To summarize this example, you can go quite far to satisfying most subtype constraints using the SQL built-in types and the CHECK constraint mechanism. Unfortunately, you must specify the entire mess each time you use the UML type, because current RDBMS products do not yet implement the ANSI SQL-92 domain. If your vendor does, by all means use the domain to represent subtypes, as this greatly enhances reusability and consistency in your schema.
Note
Relational schemas cannot represent structured or multivalued data types such as records or arrays. If your UML design includes attributes with these kinds of types, you must treat them as though the type were a class. Create a separate table for the data, then link the table back to the owning table with a foreign key relationship.
Constraints and Object Identity The series of property strings in curly braces in the UML attribute definition includes the following extended tagged values:
{OID}: An explicit primary key
{alternate OID}: An explicit candidate key
{nullable}: A specification that the attribute can be null instead of having a value
The Chapter 7 section on "Object Identity and Uniqueness Constraints" goes into detail on the UML approaches to explicit and implicit object identity. Relational databases rely entirely on explicit identity. Relational theorists raise this feature to a principle, stating that "all information in the database at any given time must be cast explicitly in terms of values in relations and in no other way" [Date and Darwen 1998, pp. 145—146]. The consequence of this principle is that there can be no such thing as an object identifier that is not a value in a table column, nor can there be pointers to values or rows (see Chapter 12 for the OR concept of "reference," for example).
Most DBMS products support some kind of identification "under the covers." For example, Oracle7 provides a ROWID that uniquely identifies a row in a table. That ROWID can change, however, if the DBMS moves the row to a different location on disk. It is a physical identifier, not a logical identifier, and you should not consider it to be an object identifier. You thus can't store off a ROWID value in another table and use it later to retrieve the original row, as the internal value may change. Also, Oracle reserves the right to change the ROWID structure from version to version, as they did in Oracle8. Thus, both as a practical matter and in theory, you should restrict yourself in RDBMS schemas to explicit identity.
Note
SQL Server version 7 provides an OID data type to represent object identity explicitly. This moves SQL Server one step in the direction of the object-relational databases discussed in Chapter 12. It also integrates the concept of object identity directly into the relational model. Whether this constitutes a "value" according to Date's definition is unlikely.
Taking the explicit approach, you must figure out how to transform implicit and explicit UML identity into explicit relational identity. Starting with explicit {OID} attributes is much easier: you just create your columns from the
attributes and put a PRIMARY KEY constraint on them. If it is a single column, you can just add the words PRIMARY KEY to the column definition as a <column constraint definition>. If there are multiple attributes with {OID} tags, you must add a <table constraint definition> to specify the PRIMARY KEY.
Transforming implicit UML object identity to explicit relational identity is a bit harder because the tools vary widely from system to system. Usually, the simplest approach is to add an integer column to the table. The variance is how you assign values to that column. Oracle7, for example, has the SEQUENCE object that you can use to supply
monotonically increasing integer values. This corresponds directly to a primary key domain that any number of tables can share. SQL Server and Sybase, on the other hand, let you add the keyword IDENTITY to the attribute, and it then generates a monotonically increasing number as part of the INSERT operation. This does not correspond to a domain because you can't share the value across tables.
Tip
You should think about the identifier domain from the perspective of the required integer size. Computers limit integer values by the memory architecture of the machine. Many DBMS products limit integers to 4 billion or so (232), or half that for signed integers (-2 billion to 2 billion). If you will have fewer than 2 billion (4 billion) objects, that's fine, but think about it carefully. Your application code should understand the limitation as well and decline gracefully to produce more objects when it nears the limit.
Another alternative is to add code, either as a stored procedure or in your application, to generate these numbers. I've encountered three ways of doing this.
One approach is to create a table that has one row for each domain; you increment the number and store it as the "current" number, emulating the SEQUENCE. The only problem with this approach is that you lock the row (or in SQL Server and Sybase, even worse, the page). Locking can dramatically impact concurrency if you create many rows very quickly from many different connections (the OLTP scenario).
A second alternative is to query the maximum value of the column from the table, increment that by one, and use that number as the new value in an INSERT statement. This approach may or may not perform very well depending on the optimization available from the DBMS; at worst, it can scan all the rows in the table. It also has various impacts on locking the rows in the table (and locking the index pages for the table if the query uses them). A third alternative is to generate a unique number in some way. Variants on this include hashing (using a simple MOD transformation on the current date and time, for example, or some kind of random number generator) or GUID (globally unique identifier) generation. A GUID is also known as a uuid (universally unique identifier) [Leach and Salz 1997]. You generate it using a standard algorithm based on hardware identity and timestamp. The hashing approach works 90% of the time (the real number actually depends on the hash function you use), so you need to check the number or add code to generate a new number if the INSERT fails because of duplicate values in the primary key. The GUID approach works fine as long as you're willing to accept all keys using 36 bytes of fixed-length character space in the table and in the primary key index. For very large tables, this space can really add up compared with a single integer value.
Tip
GUIDs are the method of choice for identity domains that you want to be able to handle virtually any number of objects, as is likely for Holmes PLC. If it concerns you that integer values may be too limiting, you should definitely look into GUIDs [Leach and Salz 1997]. The Win32 application programming interface contains functions that generate GUIDs, as part of the remote procedure control library (rpc.h).
The candidate keys require some additional SQL syntax. Recall that each {alternate OID} tag specifies a number identifying the particular candidate key. This lets you have multiple candidate keys that comprise multiple columns. For each unique number in such a tag, you should generate a UNIQUE constraint, either as a <column constraint definition> or as a <table constraint definition>. Since these candidate keys are always explicit, you just create the attribute(s) as columns, then put in the constraints.
The UML {nullable} tag corresponds to the NOT NULL constraint, at least in its absence. That is, you attach
{nullable} to the attribute in UML if you do not want a NOT NULL constraint in your relational database. If there is no tag on the attribute, you must add the keywords NOT NULL to your column definition. Most RDBMS products let you add a null constraint, the reverse of the NOT NULL constraint, though that isn't part of the ANSI standard for the column definition.