XML columns in the geodatabase - Administering ArcSDE Microsoft. for SQL Server

Note: This topic was updated for 9.3.1.

ArcSDE handles XML data nearly the same way it handles spatial GIS data. A business table can have one or more columns of type XML. These columns can store XML documents that contain structured information such as descriptions or classifications of spatial features. XML documents are also useful for storing longer text descriptions than are typically stored in a column, and text indexes built on those descriptions let you search for features using their content. For example, ArcIMS Metadata Services use XML columns to create a searchable collection of documents that describe GIS resources.

An XML column will always have a full-text index that lets you search for a word anywhere in a document. The column may optionally have an XPath index that lets you search for a word or a number in a specific XML element or attribute. Functionality for full-text indexing must be installed and configured in the database before an XML column can be created. SeeConfiguring a database to support XML columnsto get started.

Information about an XML column is maintained in an ArcSDE system table named xml_columns. Information about an XML column's XPath index is maintained in the ArcSDE system tables xml_indexes and

xml_index_tags.

For each XML column, ArcSDE creates additional tables, which are used to store and index XML documents.

◦ The XML document table (sde_xml_doc<id#>) stores the XML document and maintains a full-text index on the document's content.

◦ The XML index table (sde_xml_idx<id#>) is created for XML columns that have an XPath index. This table stores the text or number content for each XPath that is indexed and maintains a full-text index on the content from some XPaths.

The ID number in the table names of the XML index table and XML document table is the internal registration number for the XML column.

To see further details on the tables used to track XML columns in the database, see the following topics:

XML columns in a geodatabase stored in DB2 XML columns in a geodatabase in Informix XML columns in a geodatabase stored in Oracle XML columns in a geodatabase stored in PostgreSQL XML columns in a geodatabase in SQL Server

For information on configuration parameters that control the storage of these tables, seeDBTUNE configuration parameter name-configuration string pairs.

ArcSDE for Oracle includes two scripts to help you configure your database to store XML data. SeeTuning an Oracle instance for XML data storagefor details.

How XML documents are indexed

ArcSDE XML columns always have a full document index. The XML document table stores the original XML document and also a copy of the document's content with the XML markup removed. A database text index is created on the document's content without XML markup. This index lets a search quickly find all documents that contain a given word or phrase, such as "population," anywhere in the text.

ArcSDE XML columns may optionally have an XPath index as well. XPath is a language used to identify specific parts of an XML document. An XML column's XPath index is created from an index definition that uses XPath notation to identify the elements and attributes in the column's XML documents that can be searched.

Each XML column can use a different index definition to build its XPath index.

An XML column's XPath index can be created and managed using ArcObjects; if you are using an Enterprise geodatabase, it can also be created and managed using the sdexml administrative command.

For example, suppose documents in the XML column describe a feature's source and the documents include an accuracy element that stores a number indicating the accuracy of the source in meters. Building an XPath index that includes the accuracy element would let you find all the features whose source accuracy is greater than a given value.

The XPath index should only include elements and attributes in the XML documents that are available to be searched by a client application. Keeping the XPath index small will improve the XML column's performance.

By default, an XML column will not have an XPath index. If you don't need to search individual XML elements or attributes, there is no need to create an XPath index.

If you do create an index definition, you must specify not only which XML elements and attributes, or XPaths, will be indexed but also how their values will be indexed. Each XPath is assigned a data type that is appropriate

to the type of content it contains. Each XPath can be indexed as a number, as a short character string, or using a database text index. These are discussed in more detail below.

The XPath index is really a collection of numeric, character, and text indexes. When an XML document is stored, it is checked to see if it contains any of the elements or attributes in the column's XPath index definition.

If so, the content of those XPaths is extracted from the document and added to a text, character, or numeric index depending on the data type of the XPath in the index definition.

About database text indexes

One of the advantages to storing information in XML documents is being able to use database text indexing and searching functionality to search those documents. Text indexes can provide better search results for large amounts of text and shorter text that has a large amount of variation between documents. Text indexes use linguistic analysis when matching the search phrase to the document's content and typically use relevance ranking to order the results based on the quality and frequency of those matches.

When the database builds a text index, it evaluates the original text and includes words in the index based on the indexing rules and database settings that are in place. Typically, words that have meaning, such as "river,"

"pollution," and "population," will be included in the text index, but words such as "and," "the," and "in" won't be included. Words may be indexed so searching with the word "fishing" would also match the words "fish,"

"fished," and "fisher"; this is known as stemming. Stemming is just one example of the advanced text indexing and searching methods that databases typically provide.

Each database may use different rules or algorithms to index the same text and may therefore produce different results with the same data. Also, each database may use different text indexing methods by default; for example, stemming may be performed by default in one database while in another database it must be enabled. Review the text indexing documentation for your database and set the database and the DBTUNE configuration parameters properly to achieve the desired results for your XML column.

Different rules are used to index text written in different languages. For example, the words that are left out of the text index will be different in English and French such as "et," "le," and "dans." Also, what defines a word is different in some languages; Western languages typically define a word as all the text between two spaces, but different rules may be used in German, Asian, and other languages where the text between two spaces may be a phrase. Different databases may support different languages for text indexing, and all text indexing methods may not be available in all languages.

All documents in an XML column must contain text in the same language. The database text indexing

components and the DBTUNE configuration parameters must be properly configured for that language in those XML documents to be indexed and searched correctly. The language used for text indexing may have to be configured separately from other language settings in the database.

Different databases approach text index updates in different ways. When a new XML document is stored in an XML column, the full text of the document and the content of any XPaths handled with a database text index will have their values extracted from the document and stored in the appropriate places. However, by default the database may not immediately process and analyze the text and then update the text indexes. This is generally a good thing. If many documents containing a lot of text are added to or updated in an XML column, the resources required to index the new text may decrease the performance of other database applications.

XPath index definition data types

In the XPath index, document content may be included in a text, character, or numeric index; this corresponds to the data types STRING, VARCHAR, and DOUBLE in the XPath index definition. If an XPath's data type is STRING, content in that element or attribute is indexed using a text index. XPaths with the data type VARCHAR will have their content placed in a regular character column and indexed as they would be in a typical database table. XPaths with the data type DOUBLE will have their values placed in a numeric column and indexed. An XPath index data type is not provided for dates.

When new documents are published, you can find them right away with searches evaluated using XPaths indexed as VARCHAR or DOUBLE. You will not be able to find new documents with searches against XPaths indexed as STRING (or the full document text index) until the database's text indexes have been updated to include information from the new documents.

Choosing STRING or VARCHAR for an XPath

If the domain of an XPath will always be a short string or a word or phrase from a list, consider indexing the element as VARCHAR. When an XPath contains a text code or a one-word entry, you are often looking for an exact match between the search string and the entire content of the XPath; an application that searches an XPath like this will often provide a list of values to choose from. This type of information is better handled with the VARCHAR data type; the exact match will be faster and the content of an XPath will consume less space in the database than if the same content was indexed as STRING.

However, all values stored in an XPath indexed as VARCHAR must be no more than 254–256 characters long depending on the database. Attempts to store an XML document will fail if the document includes an XPath that is indexed as VARCHAR and the content of the XPath is too long.

Text indexing is better suited for searching XPaths containing text that can be freely entered by a person, even if the text isn't very long. The linguistic analysis performed with a text index may produce better results searching that type of content than traditional searches against a character column.

An XML column's full document text index lets you find a word anywhere in the document. Sometimes a search needs to be more focused than that. For example, if the XML column contains information describing different GIS resources and you want to be able to find all work your organization produced for a project, you might include an element describing the purpose of the resource in the XPath index, and searches by project would look for references to the project name only in that element. This may make searches by project more accurate if the words used in project names may be used in different contexts. Elements describing concepts like the purpose of a resource are typically best handled by database text indexes.

Both VARCHAR and STRING XPaths can be searched using wildcard characters, but the wildcard character supported by the database may be different for these two data types in the same database. Also, the way in which wildcard characters can be used and the performance of the search may be different for each data type. Check your database documentation to determine what characters can be used as wildcards and how they can be used.

For XPaths indexed as STRING, you may not need to use wildcard characters if stemming is performed during text indexing; if stemming is not performed and wildcard characters are used often in searches, consider adding this capability to your XML column's text indexes to improve search performance.

Indexing numbers

If an XPath contains numbers, indexing them as DOUBLE lets you evaluate their contents numerically. For example, if an element contained the scale denominator of the source data for a feature, such as 24,000, you would be able to find all features whose source data was at a scale greater or less than a given value.

If an XPath's data type is DOUBLE in the XML column's index definition but a document contains text in that XPath instead, the document will be stored successfully, but the value in that XPath won't be included in the XPath index and a conflict will be recorded in the geodatabase. An error noting this conflict will appear in an ArcSDE service log file if verbose logging is enabled for the geodatabase.

For the element containing the scale denominator of the source data, the value "24000" will index successfully but the values "1:24000" and "1/24000" will not. The presence of the colon (:) and slash (/) characters means the values can't be manipulated as numbers. Similarly, for elements containing the percent cloud cover in an image or the cost of a resource that are indexed as DOUBLE, the values "10%" or "$50" would not be indexed and conflicts would be recorded.

In contrast, if a value will be indexed as text and a document contains a number in that metadata element, the number will be handled as text; a conflict won't be recorded.

The XPath index definition file

An XPath index is defined using a text file that contains a general description of the index definition and a list of XPaths. A default index definition file is not provided; the contents of an XML column's XPath index depend entirely on the nature of the XML documents it contains and their intended use.

The anatomy of an index definition file

The following example is an excerpt from an XPath index definition used with a collection of documents that describe GIS resources. The individual components of the index definition file are described below.

DESCRIPTION: ISO 19115 metadata collection index

##TAG

LOCATION PATH: /metadata/dataIdInfo/idPurp DESCRIPTION: Purpose of the resource DATA TYPE: STRING

EXCLUSION: FALSE END

##TAG

LOCATION PATH: /metadata/distInfo/ditributor/distorFormat/formatName DESCRIPTION: Format of the resource

DATA TYPE: VARCHAR EXCLUSION: FALSE END

##TAG

LOCATION PATH: /metadata/dataIdInfo/tpCat/TopicCatCd/@value DESCRIPTION: Topic category code

DATA TYPE: DOUBLE EXCLUSION: TRUE END

In this example, an XML element containing information about the purpose of the resource is added to the database text index; longer text authored by a person is better indexed in this manner. An XML element containing the format of the data, map, service, or other resource is added to the character string index; short, brief text is better indexed in this manner, particularly if the values originate from a fixed domain and you want to search the XML column for an exact match. Also, an XML attribute containing a numeric code identifying the theme of the resource, such as agriculture, oceans, or transportation, is ready to be indexed as a number but is currently excluded from the XPath index.

◦ XPath index description— The purpose of the XPath index. If present, the description must be provided on the first line in the file following the text "DESCRIPTION:". Descriptions must be no longer than 64 characters in length; longer text will be truncated to 64 characters.

◦ Set of tags— A list of XPaths, or tags, whose contents can be searched. Each tag must be defined by a set of properties that are preceded by a line with the text "##TAG" and are followed by a line with the text "END". No other text can be present on either of these lines; they signify the beginning and ending of the list of properties that define a tag.

Each tag can have the following properties. Only one property can be defined on each line. Tag properties can be defined in any order.

▪ Tag location— An XPath that identifies the absolute location of an XML element or attribute in an XML document; other parts of an XML document cannot be indexed. When an XML document is stored, if an element or attribute in the document has an XPath that exactly matches the XPath of a tag in the XML column's index definition, its content will be indexed.

This property is mandatory and must be provided on a line beginning with the text

"LOCATION PATH:".

The XPath must begin with a forward slash (/) followed by the document's root element, and it must not use XML namespaces. If an XPath includes an asterisk (*) to identify a set of nodes or square brackets ([]) to include subqueries that conditionally select tags, attempts to create an XPath index on an XML column will fail.

XPaths must be 254–256 characters in length or less, depending on the database being used. If an index definition has an XPath longer than is supported by your database, attempts to create an XPath index for an XML column will fail.

▪ Tag description— A brief overview of the information contained by the tag. Descriptions are optional. They may help someone edit the index definition later. Descriptions must be no longer than 64 characters in length; longer text will be truncated to 64 characters. If provided, this information must be on a line beginning with the text "DESCRIPTION:".

▪ Tag data type— A value indicating how to index the tag's content. Valid values are STRING, VARCHAR, or DOUBLE; these indexing options are discussed in detail in the sections above.

A data type is optional. If provided, this information must be on a line beginning with the text

"DATA TYPE:". If not provided, the tag's content will be indexed as STRING and included in the text index.

▪ Tag exclusion— A value indicating if the tag's content should be included in or excluded from the XPath index. Valid values are FALSE or TRUE. When FALSE, the tag is included in the XPath index and its content is indexed. When TRUE, the tag's definition is stored in the geodatabase but is excluded from the XPath index, and its content will not be indexed. While prototyping or modifying a client application, it can be helpful to keep a tag in the index definition even though it isn't currently being used.

An exclusion value is optional. If provided, this information must be on a line beginning with the text "EXCLUSION:". If not provided, the tag will be included in the XPath index.

Modifying the XPath index

You can change an XML column's index definition at any time. To do so, modify the index definition file by

In document Administering ArcSDE Microsoft. for SQL Server (Page 106-114)