Chapter 3: Indexing 3.1 Introduction
3.7 So how do you create indexes?
3.7.1 The Transact-SQL CREATE INDEX statement The Transact-SQL syntax is as follows:
CREATE [ UNIQUE ] [ CLUSTERED | NONCLUSTERED ] INDEX index_name
ON { table | view } ( column [ ASC | DESC ] [ ,...n ] ) [ WITH < index_option > [ ,...n] ] [ ON filegroup ] < index_option > :: = { PAD_INDEX | FILLFACTOR = fillfactor | IGNORE_DUP_KEY | DROP_EXISTING | STATISTICS_NORECOMPUTE | SORT_IN_TEMPDB }
The different options will now be described.
To create a clustered index in Transact-SQL the CLUSTERED keyword is used: CREATE CLUSTERED INDEX CI_AccountNo
ON accounts (account_no)
The above example creates a clustered index on the account_no column of the Accounts table. The next example creates a unique clustered index, as follows:
CREATE UNIQUE CLUSTERED INDEX CI_AccountNo ON accounts (account_no)
The unique keyword ensures that only one row has a particular key value, in this case
account_no. In other words, the uniqueness of the key is enforced. Note that the table may or may not already contain data. If it does, and if there are duplicate values, the above CREATE INDEX statement will fail:
CREATE UNIQUE CLUSTERED INDEX CI_AccountNo ON accounts (account_no)
Server: Msg 1505, Level 16, State 1, Line 1
CREATE UNIQUE INDEX terminated because a duplicate key was found. Most significant primary key is '105000'. The statement has been terminated.
Similarly, once the index has been successfully created, an attempt to insert or update a row that would result in a duplicate key value will fail:
INSERT INTO accounts (account_no, customer_no, branch_no, balance, account_notes)
VALUES (1916, 103424, 1012, 10765, 'A busy account') Server: Msg 2601, Level 14, State 3, Line 1
Cannot insert duplicate key row in object 'accounts' with unique index 'CI_AccountNo'.
The statement has been terminated.
This is fine, since we want the account_no column to contain no duplicate values, since this is the way we uniquely identify an account.
As mentioned previously, only one clustered index can be created on a table. This makes sense, since data can only be physically sorted in one order. Any attempt to create a second clustered index will fail:
CREATE CLUSTERED INDEX CI_AccountBalance ON accounts (balance)
Server: Msg 1902, Level 16, State 3, Line 1
Cannot create more than one clustered index on table 'accounts'. Drop the existing clustered index
'CI_AccountNo' before creating another.
To create a nonclustered index the CREATE INDEX statement is used, as it was for creating the clustered index, only in this case the NONCLUSTERED keyword is specified:
CREATE NONCLUSTERED INDEX NCI_AccountBalance ON accounts (balance)
If neither CLUSTERED nor NONCLUSTERED is specified, a nonclustered index is created. The UNIQUE keyword has the same effect as it does for a clustered index. Hence, the
following CREATE INDEX statement defaults to a nonunique, nonclustered index: CREATE INDEX NCI_AccountBalance
ON accounts (balance)
Note The name of the index can be any name considered legal by SQL Server. I prefer to prefix the name with CI_ or NCI_ to signify a clustered or nonclustered index,
respectively. I also find it useful to then use meaningful text that indicates the column name. This does, however, become unwieldy when you have an index that is comprised of many columns, so some compromises will have to be made. No naming scheme is ever perfect!
So far our examples have shown indexes that consist of only one column. It is not uncommon to create an index that consists of more than one column. Such an index is known as a
practical terms, is a limit few people are likely to hit. Also, the sum of the column sizes in the index cannot be greater than 900 bytes. It is not a good idea to choose a composite key of 900 bytes in length, because very few index entries will be able to fit into an index page and so many index pages will be used in the index. This will ultimately result in deep indexes consisting of many index levels. Traversing the index may then require many disk I/Os. In SQL Server 2000 it is, in fact, possible to create an index that contains columns defined with variable-length data types, such as VARCHAR, where the sum of the maximum sizes appears to exceed 900 bytes. However, if an attempt is made to insert a row so that the actual size of the index key would exceed the 900-byte limit, an error is returned.
For example, suppose a table consists of the following structure: CREATE TABLE account_details
(
account_no INT NOT NULL, account_notes VARCHAR(1000) NOT NULL )
If we attempt to create a nonclustered index on the account_notes column, SQL Server will successfully create the index but will warn us that the index key is potentially too large: CREATE NONCLUSTERED INDEX NCI_AccountDetails
ON account_details (account_notes)
Warning!. The maximum permissible key length is 900 bytes. The index 'NCI_AccountDetails' has maximum length of 1000 bytes. For some combination of large values, the insert/update operation will fail.
If we then attempt to insert a short string into the table, there is no problem: INSERT INTO account_details VALUES (1000, 'This string is less than 900')
However, if we attempt to insert a row with a string value large than 900 bytes, we are not allowed to do so:
INSERT INTO account_details
VALUES (1001, 'This string is more than 900'+ REPLICATE('*',900))
Server: Msg 1946, Level 16, State 4, Line 1
Operation failed. The index entry of length 928 bytes for the index 'NCI_AccountDetails' exceeds the maximum
permissible length of 900 bytes.
How do we specify an option to reserve space in index pages when an index is created? Remember that in the case of a clustered index the data pages are considered to be the lowest level of the index, whereas in the case of a nonclustered index the bottom level of the index is considered to be the lowest level of the index pages. In either case the lowest level of index is known as the leaf level.
The FILLFACTOR option is used to reserve space, and this option takes a value from 0 to 100. An index created with a FILLFACTOR of 100 will have its index pages completely filled. This is useful if no data is to be entered into the table in the future.
An index created with a FILLFACTOR of 0 will have its leaf pages completely filled, but other levels in the index will have enough space for a minimum of another index entry. An index created with a FILLFACTOR of between 0 and 100 will have its leaf pages filled to the FILLFACTOR percentage specified, and, again, other levels in the index will have enough space for a minimum of another index entry.
The default FILLFACTOR value is 0, and this default value can be changed with the
sp_configure system stored procedure or via the Database Settings tab in the Server Properties dialog box in the SQL Enterprise Manager. Table 3.1 shows the consequence of different FILLFACTOR values. A FILLFACTOR value of 0 specifies that the leaf-level page of the index should be completely filled, leaving no free space; however, the nonleaf pages should reserve space for one extra index entry. A FILLFACTOR value of 100 percent specifies that the leaf-level page of the index should be completely filled, leaving no free space. There should also be no free space reserved in the index pages. A FILLFACTOR value of 1 percent to 99 percent specifies that the leaf-level page of the index should be filled no more than the FILLFACTOR value. The nonleaf pages should reserve space for one extra index entry. Note that for nonunique clustered indexes, space is reserved for two index entries.
Table 3.1: The Effect of Different FILLFACTOR Values
FILLFACTOR Value % Nonleaf Page Leaf Page
0 one index entry completely full
1–99 one index entry ≤ FILLFACTOR % full 100 completely full completely full
Care should be taken when choosing a FILLFACTOR, since its relevance will depend on the way the application uses the table data. There is little point in reserving space throughout an index if the row inserted always has a key greater than the current maximum key value. The following example creates an index with a FILLFACTOR of 50 percent, meaning that each data page (leaf page) will only be filled to 50 percent. Index pages at the other levels will have room for one or two more index entries.
CREATE CLUSTERED INDEX CI_AccountBalance ON accounts (balance)
WITH FILLFACTOR =50
SQL Server will round up the number of rows placed on a page, so if the FILLFACTOR value would allow 3.3 rows, then 4 rows are stored.
Note that over time, as rows are inserted into the table, the effectiveness of the FILLFACTOR value will vanish, and a planned rebuilding of critical indexes at periodic intervals should be considered if heavy inserts are made to the table. Because SQL Server merges index pages with only one index entry to keep the index compact, the number of items on an index page is never less than two, even if a low value of FILLFACTOR is specified.
Another option, PAD_INDEX on the CREATE INDEX statement, is relevant to reserving space. The PAD_INDEX clause means that the FILLFACTOR setting should be applied to the index pages as well as to the data pages in the index.
The IGNORE_DUP_KEY option is useful when a unique clustered or nonclustered index is to be created on a table that might have rows with duplicate key values inserted. If the IGNORE_DUP_KEY option is set, rows containing duplicate key values are discarded, but the statement will succeed. However, if the IGNORE_DUP_KEY option is not set, the statement as a whole will be aborted.
The DROP_EXISTING option can be a very useful performance optimization. Suppose we have a scenario where we have a table on which we have built a clustered index and perhaps two nonclustered indexes. As discussed earlier, if there is a clustered index present on a table, then the pointers at the leaf level of any nonclustered indexes on that table will be the
clustered index key. Suppose we drop the clustered index from the table. The nonclustered index leaf pages can no longer contain index entries that use the clustered index key as the pointer value—there is no clustered index and therefore no clustered index key!
When the clustered index is dropped, SQL Server will rebuild all the nonclustered indexes on that table so that their index leaf pages will now contain index entries that use the Row ID as the pointer value. Remember, a Row ID is a Page ID (File ID plus page number) plus the position of the row on the page. The important point here is that SQL Server will rebuild all the nonclustered indexes on that table. This obviously can be a very time-consuming and resource-intensive process. But this is only the half of it.
Suppose the reason we wished to drop the clustered index was because we wanted to rebuild it. Perhaps we wanted to reorganize it so that page fragmentation was eliminated. Well, this means that after dropping the clustered index we are now going to create it again. Guess what's going to happen to all the nonclustered indexes on that table? You guessed! SQL Server will rebuild all the nonclustered indexes on that table so that their index leaf pages will now contain index entries that use the clustered index key as the pointer value.
This means that our clustered index reorganization has caused our nonclustered indexes to be rebuilt twice. What's annoying is that their leaf-level pointers have ended up as they started out anyway—clustered index key pointers. So what can we do to reduce the impact of rebuilding a clustered index?
Luckily for us the CREATE INDEX statement allows us to specify the DROP_EXISTING option. This allows us to issue a CREATE INDEX statement with the same name as an existing index. Using this option when you wish to rebuild a clustered index will give you a performance boost. The clustered index will be recreated on a new set of database pages, but, because the clustered index key values remain the same, the nonclustered indexes on the table do not have to be rebuilt. In fact, the recreation of the clustered index can make use of the fact that the data is already sorted in key sequence order so this data does not have to be sorted. The DROP_EXISTING option can also be used if the clustered index key definition changes. Perhaps a new column is used. In this case the nonclustered index will have to be rebuilt—but only once.
The DROP_EXISTING option can also be used for a nonclustered index, and there will be a performance advantage over dropping and creating the nonclustered index. However, the real benefit is with rebuilding clustered indexes. Using this option will definitely use fewer resources than performing a DROP INDEX followed by a CREATE INDEX.
A CREATE INDEX using this option can also be used to rebuild the index that is created when a primary key constraint is defined on a table. This was previously accomplished with DBCC DBREINDEX. Comparing the resource use of both approaches, they seem identical— so there is probably no need to change existing scripts on this basis alone.
The STATISTICS_NORECOMPUTE option dictates that out-of-date index statistics are not automatically recomputed. This is an option I have never had to use. I have found that ensuring that index key distribution statistics are as up-to-date and accurate as possible is the best approach. Index key distribution statistics are discussed in Chapter 4.
The ON FILEGROUP option allows the database administrator to create the index on a filegroup different from the table itself. The use of filegroups was discussed in Chapter 2. The idea is that by using multiple filegroups, disk I/O to the index and table can be spread across separate disk drives for better performance. However, most database administrators typically use a form of disk striping to spread disk I/O. Disk striping is discussed in Chapter 5.
Note Filegroups are also used to facilitate the backing up of large databases. However, if one filegroup contains a table and a separate filegroup contains an index for that table, then both filegroups must be backed up together.
Another index creation option that needs to be discussed is the column [ASC | DESC], which is part of the CREATE INDEX statement. Using these options determines whether an
ascending or descending index is created. When an index is created, each column in the index key can be flagged with ASC or DESC. This specifies whether the index column has its data sorted in an ascending or descending manner. The default is ASC, which ensures that scripts written to create indexes in earlier versions of SQL Server behave correctly.
Suppose we create an index on the Accounts table, as in the following example: CREATE NONCLUSTERED INDEX NCI_CustNoAccountNo
ON accounts (customer_no ASC, account_no DESC)
The data in the customer_no key column will be held in ascending order, whereas the data in the account_no key column will be held in descending order. Why bother providing this capability? After all, the doubly linked lists that chain the index pages in an index level together allow SQL Server to rapidly move backward and forward along the sequence of keys. This is true, but if the query requests data to be sorted in the ascending order of one column and the descending order of another column, then just moving along the chain is not going to provide the optimum performance. If, however, the key columns are actually held in a sequence that matches the ORDER BY, then the chain can be followed in one direction and this will provide the optimum performance, so no additional sorting will be required.
The following query will be fully supported by the NCI_CustNoAccountNo index without an additional sort step:
SELECT customer_no, account_no FROM accounts WHERE customer_no BETWEEN 1000 AND 1500 ORDER BY customer_no ASC, account_no DESC
The following query will not be fully supported by the NCI_CustNoAccountNo index, and it will need an additional sort step:
SELECT customer_no, account_no FROM accounts WHERE customer_no BETWEEN 1000 AND 1500 ORDER BY customer_no ASC, account_no ASC
A new metadata function named INDEXKEY_PROPERTY reports whether an index column is stored in ascending or descending order. The sp_helpindex system stored procedure has also been enhanced to report the direction of index key columns.
Finally, the SORT_IN_TEMPDB option can be used to place the data from intermediate sort runs used while creating the index into tempdb. This can result in a performance improvement if tempdb is placed on another disk drive or RAID array. The default behavior, if this option is not used, is to utilize space in the database in which the index is being created. This means that the disk heads are moving back and forth between the data pages and the temporary sort work area, which may degrade performance.
One aspect of index creation that can be seen from the CREATE INDEX syntax diagram is that SQL Server 2000 can create indexes on views. This is a significant enhancement to the product from a performance perspective and therefore is treated separately later in this chapter.
So, we have looked at the CREATE INDEX statement and the options that can be chosen. There are other ways in which we can create indexes and these are discussed in the following sections.