INDEXING XML DATA AND THE PERFORMANCE OF XQUERY IN RELATIONAL DATABASE BASED ON SQL SERVER
3. XML INDEXES IN SQL SERVER
Although the first capabilities of handling XML data appear in SQL Server 7.0, mechanism for indexing XML data was involved in version 2005. It was natu- ral consequence of the introduction of a native XML data type, which allows to create in relational table one or more columns containing XML documents or fragments stored as BLOB objects up to 2GB in size. SQL Server 2008 (just like preceding version) offers two kinds of XML indexes: one primary and three secondary.
An existence of a primary index on XML column eliminates an overhead, which is caused by the need to shred XML data into relational when every XQuery query is performed (because SQL Server makes preshredding BLOBs at run time). The primary index is used when the engine evaluates an every query except of retrieving the full XML document (then document is retrieved faster from the in- ternal representation). On the primary XML index you can create one or more sec- ondary XML indexes, which may increase performance of some kinds of queries. Secondary indexes provide additional information for query engine to facilitate choice of execution plan. Generally, index selection depends on a type of queries (path-based, value-based queries and property bag scenarios) [1], whose perform- ance might be improved.
112
To create XML index on XML type column in SQL Server we use a modified form of the CREATE INDEX statement of Transact-SQL DDL. A general abridged syntax is as follows:
CREATE [PRIMARY] XML INDEX index_name ON table (xml_column_name)
[USING XML INDEX xml_index_name [FOR {VALUE|PATH|PROPERTY}]]
where the USING clause is applied, when we create the secondary XML index, then
the FOR clause determines type of such index. Additionally, we can also use nine
options using the additional clause WITH [5].
3.1. Primary XML index
As we mentioned above, the primary XML index in SQL Server is B+-tree
containing the shredded representation of XML data, which is conformed to XML Information Set items. Nodes are labeled through a hierarchical labeling system ORDPATH [3], which allows capturing the structural fidelity of XML data without access to XML schema, only within a single column of the primary XML index. For example, subtree of a given XML node can be retrieved from index using a range scan over the OrdPath values of this node and its descendants.
If the primary XML index exists, then query optimizer decides how to carry on the XQuery expression included in SQL query, choosing between:
- top-down execution - rows of the base table (table, which contains targeted XML column) are processed before the primary XML index,
- bottom-up execution - the primary XML index is scanned before back-joins with
the base table.
Structure of the primary XML index was described in details e.g. in [4]. Here, we mention only that the number of rows in the primary index is approximately equal to number of nodes in XML blob, and every row contains the following in- formation about the node: tag name (an element or attribute name), node value, node type, document order information, path from each node to the root of the XML tree and primary key of the base table (total 11 columns). Hence, the primary XML index is larger in size than the text form of XML instance and unfortunately contains some redundancy (e.g. the primary key of the table is repeated in all rows for an XML instance).
The primary XML index requires that the table containing the indexed column must have a clustered primary key defined on it. It causes that the primary key cannot be modified as long as the XML index exists. In contrast to the relational index, the XML index must be created on a single XML type column (however it does not matter whether a given column is typed or untyped) and there is no possi- bility of creating the XML index on a computed column. On a given XML column we can create only one XML primary index and simultaneously many XML secon- dary indexes.
3.2. Secondary XML indexes
XQuery query performance can be additionally optimized in many cases by the
secondary XML indexes: PATH, VALUE and PROPERTY. No secondary index cannot
be created on an XML column, if the primary XML index does not exist on it. The PATH secondary index is used for XQuery queries based on path expres- sions performed on XML type columns especially when predicates containing the comparison operators are specified (e.g. in the query method). This type of index
was pointed out in [1] as particularly useful, when the exist method is used in the
WHEREclause of the base SQL query.
The PROPERTY index is designed to optimize the queries, in which from sin- gle XML instances the scalar values are selected based on their properties such as paths and known primary key.
The VALUE index can speed up query when a node value is important, and a
path expression is not precise i.e. name or exact location of the node are not known. In other words, it refers especially to queries, in which the wildcard charac- ter in a path expression is used and queries with a specified node value in a predi-
cate and the descendant-or-self axis (so-called descending search axis).
The secondary XML indexes improve performance of queries in the bottom- up approach regardless of their type. Therefore, an interesting issue is the analysis of the XML indexing strategy based on the class of queries, which efficiency can be increased by the appropriate choice of the secondary XML indexes. Further discussion on this topic is in Section 4.
3.3. Full-text index
In SQL Server it is also possible to create the full-text index on XML column (in the same way as on relational column). Then the content of the XML data is indexed and the XML markup is ignored (so XML tags and attributes are not in- dexed). Because the full-text search returns the whole XML instance (instead of only a fragment containing the requested data), then in practice it is usually used combination the full-text search index with the XML index. Such approach allows
to filter the XML data using the SQL queries (e.g. with predicates CONTAINS or
FREETEXT), what is particularly useful for document-centric data. Such queries can
use the XML methods value and query, however the SQL predicates have a
higher priority than XQuery expressions regardless of the form of query. From the point of view of XML data processing, it is worth to mention that full-text search in SQL Server 2008 is available completely in 51 languages.
114