Concatenated Keys - SQL Performance Explained

Although surrogate keys are widely accepted and implemented, there are cases when a key consists of more than one column. The indexes used to support the search on multiple columns are called concatenated or composite indexes.

The order of the individual columns within a concatenated index is not only a frequent cause of confusion but also the foundation for an extraordinary resistant myth; the „ most selective first” myth— Appendix A, Myth Directory has more details. The truth is that the column order affects the number of statements that can use the index.

For the sake of demonstration, let's assume the 1000 employees company from the previous section was bought by a Very Big Company. Unfortunately the surrogate key

values used in our EMPLOYEES table collide with those used by the Very Big Company. The EMPLOYEE_ID values can be reassigned—theoretically—because it's not a natural but a surrogate key.

However, surrogate keys are often used in interface to other systems—like an access control system—so that changing is not as easy. Adding a new column to maintain the uniqueness is often the path of least

resistance.

After all, reality bites, and the

SUBSIDIARY_ID column is added to the table. The primary key is extended

accordingly, the corresponding unique index is replaced by a new one on EMPLOYEE_ID and SUBSIDIARY_ID :

CREATE UNIQUE INDEX employee_pk ON employees (employee_id, subsidiary_id);

The new employee table contains all employees from both companies and has ten times as many records as before. A query to fetch the name of a particular employee has to state both columns in the where clause:

SELECT first_name, last_name FROM employees

WHERE employee_id = 123 AND subsidiary_id = 30;

As intended and expected, the query still uses the INDEX UNIQUE SCAN

operation:

---

---| Id ---| Operation ---|

Name | Rows | Cost |

Information (identified by operation id): 2

-access("EMPLOYEE_ID"=123 AND

"SUBSIDIARY_ID"=30)

The constellation becomes more interesting when the where clause doesn't contain all the indexed columns. For example, a query that lists all employees of a subsidiary:

SELECT first_name, last_name FROM employees

WHERE subsidiary_id = 20;

---

---| Id ---| Operation ---| Name ---| Rows | Cost |

---

---| 0 ---| SELECT STATEMENT ---| ---| 110 | 477 |

|* 1 | TABLE ACCESS FULL| EMPLOYEES | 110 | 477 | --- Predicate Information (identified by operation id): 1

-filter("SUBSIDIARY_ID"=20)

The database performs a FULL TABLE SCAN . A FULL TABLE SCAN reads all table blocks and evaluates every row against the where clause. No index is

used. The performance degrades linear with the data volume; that is, double the amount of data, twice as long to wait for the result. The FULL TABLE SCAN is

amongst the most critical operations used by the database and almost always a problem in online systems.

Full Table Scan

There are several cases when the

database considers a FULL TABLE SCAN the most effective way to retrieve the

requested data.

If the number of selected rows is a considerable fraction of the overall table size, the FULL TABLE SCAN can be more effective than an index lookup. Although this sounds odd in the first place, the FULL TABLE SCAN has an advantage over any index based access: there is no

need for an additional TABLE ACCESS BY INDEX ROWID step. The performance impact caused by the additional table access can be considerable—as explained in the section called “The Leaf Nodes” . Another aspect is that the Oracle database can perform the read operations for a FULL TABLE SCAN in a more efficient way than for an index lookup. The blocks needed for an index lookup are not known in advance. The database must read and process the index nodes in a block-by-block manner. A FULL TABLE SCAN must read the entire table anyway, the database can use the more efficient multi block read.

All of that should not hide the fact that a FULL TABLE SCAN is often caused by a missing or inadequate index.

The database doesn't use the index because it is not suitable for this query. A closer look into the index leaf nodes makes it more apparent.

To repeat the most important lesson from the previous chapter: the index leaf nodes are a sorted representation of the index columns. In case multiple columns are indexed, the first column is the most significant sort criterion, followed by the second, the third, and so on.

As a consequence, the tree structure can be used only if the where clause includes the leading columns of the index. The values of the subsequent index columns are not centralized within the leaf node structure and cannot be localized with a tree traversal.

Figure 2.1. Concatenated Index

Figure 2.1, “Concatenated Index” shows an index fragment with three leaf nodes and the corresponding branch node. The index consists of the EMPLOYEE_ID and

SUBSIDIARY_ID columns (in that order), as in the example above.

The search for SUBSIDIARY_ID = 20 is not supported by the index because the matching entries are distributed over a wide range of the index. Although two index entries match the filter, the branch node doesn't contain the search value at all. The tree cannot be used to find those entries.

Tip

Visualizing an index like Figure 2.1,

“Concatenated Index” helps to understand which queries can be supported by an index and which can't. Although such a figure is very nice, a much simpler picture is sufficient to get the point. It is usually enough to see the index order and know that the tree can quickly localize one

particular place within this sequence. The following SQL template returns the indexed columns in index order; that is the logical order of the index entries in the leaf nodes:

SELECT * FROM (

SELECT <INDEX COLUMN LIST> FROM

<TABLE> ORDER BY <INDEX COLUMN LIST> ) WHERE ROWNUM < 100;

If you insert the index definition and the corresponding table name into that

statement, you will get a small excerpt from the index. Ask yourself where you would start to search for the required data. If there isn't any particular place where the

searched values appear together, the index tree can't be used to find them.

It seems like the primary key index doesn't support the query to list all employees of a subsidiary. The easiest solution to tune the query is to create a new index on

SUBSIDIARY_ID . This index boosts the --- Predicate Information (identified by operation id): 2 -access("SUBSIDIARY_ID"=20)

The execution plan shows an INDEX RANGE SCAN on the new index. Although the solution seems to be perfectly

reasonable, there is an alternative that should be preferred.

Considering that a search for an

EMPLOYEE_ID in any subsidiary is very unlikely, the existing unique index can be restructured to support the primary key lookup as well as the lookup with the SUBSIDIARY_ID only. The trick is to change the column order in the index so that the new index definition is as follows:

CREATE UNIQUE INDEX EMPLOYEES_PK ON EMPLOYEES (SUBSIDIARY_ID, EMPLOYEE_ID);

The index is still unique, so the primary key lookup will perform an INDEX UNIQUE SCAN as before. The reversed column order changed which statements can be

supported by the index. The original definition served queries for

EMPLOYEE_ID only while the new definition supports queries on SUBSIDIARY_ID only.

Important

When defining an index, the number of statements it can support is the most important factor to consider.

Although the two-index solution will also yield very good select performance, the one index variant will give much better insert, delete and update performance. The preserved space might even increase the cache-hit rate so that the overall scalability improves.

To choose the right index, you must not only know how an index works—as

explained in this book—you must also know the business domain. The knowledge about dependencies between various attributes is essential to define an index correctly.

An external performance consultant can have a very hard time to figure out which columns can go alone into the where clause and which are always paired with other attributes. As long as you are not familiar with the business domain, this kind of exercise is actually reverse engineering.

Although I admit that reverse engineering can be fun if practiced every now and then, I know that it becomes a very depressing task if practiced on an every day basis.

Despite the fact that internal database administrators know the industry of their company often better than external

consultants, the detailed knowledge needed to optimally define the indexes is hardly accessible to them. The only place where the technical database knowledge meets the functional knowledge of the business domain is the development department.

In document SQL Performance Explained (Page 50-66)