Compressed self-indexed XML representation with efficient XPath evaluation

Doctoral Thesis

PhD candidate: Ana Belén Cerdeira Pena

Advisors: Nieves Rodríguez Brisaboa, Gonzalo Navarro Badino

A Coruña, January 2013

Nieves Rodríguez Brisaboa

Departamento de Computación, Facultad de Informática, Universidade da Coruña

15071 A Coruña (España) Tel: +34 981 167000 ext. 1243

Fax: +34 981 167160 [email protected]

Gonzalo Navarro Badino

Departamento de Ciencias de la Computación Universidad de Chile

Blanco Encalada 2120 Santiago (Chile) Tel: +56 2 6892736

Fax: +56 2 6895531 [email protected]

Acknowledgments

My first words of gratitude are addressed to my thesis advisors, Nieves and Gonzalo. You both trusted me from the beginning, and showed it with your complete dedication whenever I needed it and with your constant support throughout this work. You helped me follow the right path with your knowledge and experience. Today this thesis is the result of everything you have taught me along the way. Thank you for your faith and unconditional help.

I also want to thank all the members of the Database Laboratory, workmates but also friends, who make me enjoy the hours of work every day. With special affection, I express my gratitude to Antonio Fariña and Luisa Carpente for their invaluable help and constant encouragement. It would not have been the same without your support.

I cannot forget all the people I met during my research stays far from home, for their hospitality and friendship when one needs it most. In particular, I would like to acknowledge Diego Arroyuelo, Francisco Claude, Rodrigo Paredes, Rodrigo Cánovas, Daniel Valenzuela, Mauricio Marín, Miguel A. Martínez, Gabriella Pasi, and Emanuele Panzeri. I also want to give special thanks to Felipe Sologuren and Kim Nguyen for their unselfish help in research topics.

Thanks as well to you, Miguel, for your enormous patience, and for all the years we have shared. They have been wonderful, and I know they will remain so.

Finally, and undoubtedly, my deepest gratitude goes to my family: my parents and my sister. You have always been by my side, sharing the joys with me, but also suffering in the bad moments. Whenever I felt down, you gave me the strength to continue, because you taught me that dreams cannot be achieved without effort, without fighting for them. I have followed your example, and today one of my dreams has come true. You were right, it has not been easy, but the fact is that what has been difficult would have been impossible without you.

Acknowledgments (Spanish version)

My first words of thanks go to my thesis advisors, Nieves and Gonzalo. You believed in me from the very first moment, and you showed it with your complete dedication whenever I needed your advice, and with your constant support throughout this work. You have helped me follow the right path with your knowledge and experience. Today, this thesis is the result of everything you have taught me along the way. Thank you for your trust and unconditional help.

I also want to thank all the members of the Database Laboratory, colleagues but also friends, who make me enjoy the hours of work every day. With special affection, I thank Antonio Fariña and Luisa Carpente for their invaluable help and constant encouragement. It would not have been the same without your support.

Nor can I forget all the people I have met during my long stays away from home, for their hospitality and friendship when one needs it most. In particular, I would like to mention Diego Arroyuelo, Francisco Claude, Rodrigo Paredes, Rodrigo Cánovas, Daniel Valenzuela, Mauricio Marín, Miguel A. Martínez, Gabriella Pasi and Emanuele Panzeri. I am also especially grateful to Felipe Sologuren and Kim Nguyen for their selfless help in research matters.

Thank you too, Miguel, for your immense patience and for having shared all these years with me. They have been wonderful, and I know they will continue to be.

Finally, and as it could not be otherwise, my greatest thanks go to my family: to my parents and my sister. You have been by my side at all times, sharing joys, but also suffering with me in the bad moments. If I fell, you were there to help me get up, because you have taught me that dreams are only achieved with effort, by fighting for them. I have followed your example, and today I have fulfilled one of them. You were right, it has not been easy, but the truth is that without you what was difficult would have been impossible.

Abstract

The popularity of the eXtensible Markup Language (XML) has grown continuously since its first introduction, and it is today acknowledged as the de facto standard for semi-structured data representation and data exchange on the World Wide Web. In this scenario, several query languages were proposed to exploit the expressiveness of XML data, as well as systems to support them efficiently. At the same time, as research in compression became more and more relevant, efforts also focused on new approaches that provide efficient solutions using the minimum amount of space. Today, however, there is a lack of practical, available tools that combine efficient query support with minimal space requirements.

In this thesis we address this problem, and propose a new approach for storing, processing and querying XML documents in a time- and space-efficient way, focusing especially on XPath queries. We have developed a new compressed self-indexed representation of XML documents that obtains compression ratios of about 30%-40%, over which a query module providing efficient XPath query evaluation has also been developed. Together, both parts make up a complete system, which we call XXS, for the efficient evaluation of XPath queries over compressed self-indexed XML documents. Experimental results show the outstanding performance of our proposal, which can successfully compete with some of the best-known solutions, and which largely outperforms them in terms of space.

Abstract (Spanish version)

The popularity of the eXtensible Markup Language (XML) has done nothing but grow since its initial introduction, and it is today recognized as the de facto standard for the representation of semi-structured data and for data exchange on the Internet. In this scenario, several query languages have been proposed to exploit the expressiveness of data in XML format, as well as systems that support them efficiently. At the same time, as research in compression has become more and more relevant, efforts have also been directed at studying new approaches that offer efficient solutions while using as little space as possible. Currently, however, there is a clear lack of available practical tools that combine both features: efficient query support with minimal space requirements.

In this thesis we address this problem and propose a new solution for the storage, processing and querying of XML documents that is efficient in both time and space, focusing in particular on the XPath query language. We have developed a new compressed and self-indexed representation of XML documents, which obtains compression ratios of 30%-40%, and on top of which a query module has been built for the efficient evaluation of XPath queries. Together, both contributions make up a complete system, which we have called XXS, for the efficient evaluation of XPath queries over compressed and self-indexed XML documents. The experimental results demonstrate the outstanding behavior of our tool, which is able to compete successfully with some of the best-known solutions, and which also clearly outperforms them in terms of space.

Abstract (Galician version)

The popularity of the eXtensible Markup Language (XML) has done nothing but grow since its initial introduction, and it is today recognized as the de facto standard for the representation of semi-structured data and for data exchange on the Web. In this scenario, several query languages have been proposed to exploit the expressiveness of data in XML format, as well as systems that support them efficiently. At the same time, as research in compression became more and more relevant, efforts were also directed at studying new approaches that offer efficient solutions while using as little space as possible. Currently, however, there is a clear lack of available practical tools that bring together both features: efficient query support along with minimal space requirements.

In this thesis we address this problem and propose a new solution for the storage, processing and querying of XML documents that is efficient in both time and space, focusing in particular on the XPath query language. We have developed a new compressed and self-indexed representation of XML documents, which obtains compression ratios of around 30%-40%, and on top of which a query module has also been built for the efficient evaluation of XPath queries. Together, both contributions make up a complete system, which we call XXS, for the efficient evaluation of XPath queries over compressed and self-indexed XML documents. The experimental results show the outstanding behavior of our tool, which is able to compete successfully with some of the best-known solutions, and which also clearly outperforms them in terms of space.

7 Query Evaluation
7.1 Conceptual Description
7.1.1 Evaluation Strategies
7.2 General Implementations
7.2.1 Leaf Nodes
7.2.1.1 Further Discussions
7.2.2 Internal Nodes
7.2.2.1 Further Discussions

8 Implementations Description
8.1 Practical Segment Representation
8.2 Implementations
8.2.1 Leaf Nodes
8.2.1.1 Elements
8.2.1.2 Attributes and Words
8.2.1.3 Phrases
8.2.1.4 Optimized Leaf Nodes
8.2.2 Internal Nodes
8.2.2.1 Ancestor (or-self)
8.2.2.2 Descendant (or-self)
8.2.2.3 Parent
8.2.2.4 Child
8.2.2.5 Parameterized Operators: the distance parameter
8.2.2.6 Following
8.2.2.7 Preceding
8.2.2.8 Following-sibling
8.2.2.9 Preceding-sibling
8.2.2.10 Basic Operators over Attributes
8.2.2.11 Parameterized Operators over Attributes: the distance parameter
8.2.2.12 And
8.2.2.13 Or
8.2.2.14 Text Functions: contains and equal
8.2.2.15 Other Functions: count
8.2.2.16 Further Discussions

9 Experimental Evaluation
9.1 Experimental Framework
9.1.1 Test Machine
9.1.2 Document Corpus
9.1.3 Query Test Bed
9.2 Compression Properties
9.2.1 Results Evaluation
9.2.1.1 Compression Ratios
9.2.1.2 Time Measures
9.3 Query Evaluation Performance
9.3.1 Documents Tested
9.3.2 Query Results
9.3.2.1 Structural based Queries
9.3.2.2 Text oriented Queries

10 Conclusions and Future Work
10.1 Summary of Contributions

A Publications and Other Research Results

B Algorithms

C Description of the Work Carried Out
C.1 Introduction
C.2 Methodology
C.3 Conclusions and Contributions
C.4 Future Work

List of Figures

1.1 XXS architecture overview.
2.1 Tree view of a sample XML document.
2.2 Example of XML document.
2.3 Examples of XPath axes.
3.1 Building a classic Huffman tree.
3.2 Example of canonical Huffman tree.
3.3 Example of false matchings in Plain Huffman, but not in Tagged Huffman. Notice that special bytes of two bits are used for simplicity.
3.4 Codewords assignment in (s, c)-Dense Code.
3.5 Example of arithmetic compression for the text aabdb.
3.6 Different k-order models.
3.7 Compression of the text abbabcabbbbc using LZ77.
3.8 Compression of the text abbabcabbbbc using LZ78.
3.9 Direct Burrows-Wheeler Transform.
3.10 Example of WTBC structure.
3.11 Example of rank and select operations.
3.12 The wavelet tree of the sequence aadcbdbacd.
3.13 Example of byte-oriented rank operation by using a two-level directory structure of partial counters.
3.14 Succinct representations of trees.
3.15 An example of the range min-max tree.
4.1 BPDT template for /tag1[./tag2].
4.2 An XQuery expression (a) and its corresponding projection tree (b).
4.3 Query rewritten with signOff statements.
4.4 GCX global architecture.
4.5 Example of active garbage collection in GCX query evaluation.
4.6 Storage architecture of eXist.
4.7 XPath axes correspondence in the pre/post plane for the context node f.
4.8 a): context nodes c and f are pruned, since they are inside the ancestor region of e and i. b): the overlapping ancestor regions covered by e and i are partitioned along the pre axis at p1 and p2. c): after hitting f, descendant staircase join infers that no results can occur until h, thus a large part of the pre/post plane is skipped.
4.9 Classification of some examples of XML compression tools.
4.10 Example of text compression with XMill.
4.11 Example of Multiplexed Hierarchical Modeling in XMLPPM.
4.12 Dictionaries created from a sample XML document.
4.13 Operational scheme of XWRT.
4.14 Markup codification used by XComp.
4.15 Tag/attribute identifiers (b) assigned by XComp to compress a sample XML document (a).
4.16 An XML document (a) and its corresponding ordered labeled tree (b).
4.17 The set S after the pre-order traversal of T (left) and after its stable sort regarding the component S_π (right), together with the final output of the XBW transform (bottom).
4.18 Abstract view of XGrind compression.
4.19 Abstract view of XPRESS compression.
4.20 SIT structure (b) of an XML document fragment (a).
4.21 DOM tree division in XMLZip.
4.22 Example of SXSI data model.
4.23 Tree and text data representation in SXSI.
5.1 XML representation of XXS: the XML Wavelet Tree (XWT).
5.2 Example of XWT structure built from an XML document.
5.3 Example of correspondence between the XDTree node and a balanced parentheses representation (BP) of the XML document structure.
5.4 Segments relationships.
6.1 Query parser submodule of the XXS system.
6.2 Example of query parse tree from a query without predicates.
6.3 Example of query parse trees from queries with predicates.
6.4 Examples of use of child_att and parent_att.
6.5 Example of Attributes equality simplification.
6.6 Example of Redundancy suppression.
6.7 Another example of Redundancy suppression.
6.8 Transformations of the Redundancy suppression category.
6.9 Equivalences of the Synonyms translation modification.
6.10 Example of Steps unification.
6.11 Typical scenarios of Steps unification.
6.12 Example of or optimization.
6.13 Example of and optimization.
6.14 Scenarios of Root node deletion.
6.15 Application of Attributes equality simplification transformation over the initial query parse tree.
6.16 Application of Redundancy suppression transformations over the query parse tree obtained from Figure 6.15.
6.17 Application of Synonyms translation modification over the query parse tree resulting from Figure 6.16.
6.18 Steps unification transformations applied over the query parse tree obtained from Figure 6.17.
6.19 Or/and optimizations applied over the query parse tree resulting from Figure 6.18.
6.20 Final query execution tree of the query example described in Figure 6.15.
7.1 Query evaluator submodule of the XXS system.
7.2 Target relations that compared segments must keep to satisfy the semantics of an internal node representing different XPath axes.
7.3 General query evaluation scheme.
7.4 Main strategies that characterize XXS query evaluation.
7.5 Skipping of segments.
7.6 Example of self-nested elements.
8.1 Different segment representations for elements.
8.2 First bytes validation with skipping, used to match a phrase pattern.
8.3 Examples to which optimized next procedures can be applied.
8.4 Segment advance for left<right (a) and left⊆right (b) in full-nested scenario of ancestor axis.
8.5 Example for full-nested variant of parent axis.
8.6 Example for the full-nested variant of child axis.
8.7 Example for the full-nested variant of child_dist axis.
8.8 Special cases of use of child_dist and descendant_dist.
8.9 Example for following axis.
8.10 Example for preceding axis.
8.11 Example for following-sibling axis.
9.1 First group of queries (A).
9.2 Second group of queries (B).
9.3 Third (C) and fourth (D) group of queries.
9.4 Compression ratios achieved by our proposal (in blue), general text compressors (in black), XML conscious non-queriable compressors (in pink), and queriable tools (in green) over different XML documents.
9.5 Compression times. Comparison with general text compressors (top), and with XML conscious non-queriable compression tools (bottom).
9.6 Decompression times. Comparison with general text compressors (top), and XML conscious non-queriable compressors (bottom).
9.7 Construction times of queriable solutions.

List of Tables

4.1 Size contributions maintaining i) only one dictionary, ii) separated vocabularies for each tag, and iii) after merging title and keyword vocabularies.
9.1 Document properties.
9.2 Systems construction performance.
9.3 Running times (in milliseconds) for the group of queries A over XMark2 document.
9.4 Running times (in milliseconds) for the group of queries A over XMark4 document.
9.5 Running times (in milliseconds) for the group of queries B over XMark2 document.
9.6 Running times (in milliseconds) for the group of queries B over XMark4 document.
9.7 Running times (in milliseconds) for the group of queries C over XMark2 document.
9.8 Running times (in milliseconds) for the group of queries C over XMark4 document.
9.9 Running times (in milliseconds) for the group of queries D over XMark2 document.
9.10 Running times (in milliseconds) for the group of queries D over XMark4 document.

List of Algorithms

5.1 Construction of XWT
5.2 Display text position x
5.3 Full text extraction
5.4 Locate j-th occurrence of word w operation
5.5 Count operation for a word w
5.6 Count operation for a word w until a position p
5.7 Count operation for a phrase pattern ph
7.1 General scheme for the next procedure of an internal node
8.1 Next procedure of a non self-nested element
8.2 Next procedure of a self-nested element
8.3 Next procedure of attributes and words
8.4 Next procedure of a continued phrase
8.5 Next procedure of an interleaved phrase
8.6 Optimized next procedure of specific elements (regardless of whether they are self-nested or not)
8.7 Optimized next procedure of any element
8.8 Optimized next procedure of attributes and words
8.9 Next procedure of ancestor operator (non-nested variant)
8.10 Next procedure of ancestor operator (full-nested variant)
8.11 Next procedure of descendant operator (non-nested variant)
8.12 Next procedure of descendant operator (full-nested variant)
8.13 Next procedure of parent operator (non-nested variant)
8.14 Next procedure of parent operator (full-nested variant)
8.15 Next procedure of child operator (non-nested variant)
8.16 Next procedure of child operator (full-nested variant)
8.17 Modification to be applied over full-nested variant of child operator to meet child_dist semantics
8.18 Next procedure of any element of depth d
8.19 Next procedure of any element of depth ≥ d
8.20 find_descendants procedure
8.21 Next procedure of following operator (non-nested variant)
8.22 Next procedure of following operator (full-nested variant)
8.23 Next procedure of preceding operator (non-nested variant)
8.24 Next procedure of preceding operator (full-nested variant)
8.25 Special next procedure of preceding operator (non-nested variant)
8.26 Special next procedure of preceding operator (full-nested variant)
8.27 Next procedure of following-sibling operator (full-nested variant)
8.28 Next procedure of preceding-sibling operator (full-nested variant)
8.29 Next procedure of ancestor_att operator (non-nested variant)
8.30 Next procedure of ancestor_att operator (full-nested variant)
8.31 Next procedure of descendant_att operator (applicable for non-nested and full-nested variants)
8.32 Next procedure of parent_att operator (non-nested variant)
8.33 Next procedure of parent_att operator (full-nested variant)
8.34 Next procedure of child_att operator (non-nested variant)
8.35 Next procedure of child_att operator (full-nested variant)
8.36 Next procedure of and (self) operator (non-nested variant)
8.37 Next procedure of and (self) operator (full-nested variant)
8.38 Next procedure of and_att operator
8.39 Next procedure of or operator (full-nested variant)
8.40 Next procedure of contains text function for single words (full-nested variant)
8.41 Next procedure of contains text function for a phrase (full-nested variant)
8.42 Next procedure of contains_att text function
8.43 Next procedure of equal_att text function
B.1 Next procedure of any element (i.e. `*' applied to elements)
B.2 Next procedure of or operator (non-nested variant)
B.3 Next procedure of or_att operator
B.4 Next procedure of or_phrase operator
B.5 Next procedure of contains text function for single words (non-nested variant)
B.6 Next procedure of contains text function for a phrase (non-nested variant)
B.8 Next procedure of equal text function for single words (full-nested variant)
B.9 Next procedure of equal text function for a phrase (non-nested variant)

Chapter 1

Introduction

1.1 Motivation

Since its first introduction in 1998, the importance of the eXtensible Markup Language (XML) [XMLa] has been constantly increasing, mainly due to its suitability for data exchange on the World Wide Web. Nowadays it is widely employed and has been acknowledged as the de facto standard for semi-structured data representation, being used to store large volumes of information from different domains, such as e-commerce and business, digital libraries, catalogs, chemical and biological areas, metadata specifications, and so on.

To exploit the expressive power of XML, query languages like XPath [XPaa] and XQuery [XQu] have been defined, allowing constraints to be formulated on both document content and structure. The growing interest in them, and the challenge of evaluating them, have triggered much research aimed at providing efficient solutions, either as theoretical proposals or in the form of real systems. These systems are usually divided into two categories: those that follow a streaming approach (such as GCX [SSK07], SPEX [SPE], etc.), which have to read the document sequentially to answer each query; and the indexed ones (such as Saxon [Kay08], Galax [FSC+03], MonetDB/XQuery [BGvK+06], Qizx/DB [Qiz], etc.), which require a first preprocessing of the document to build additional data structures over it, which are then used to solve the queries without sequentially traversing the whole document.

Indexed systems are very interesting solutions for many scenarios, such as those where the documents are so large that a sequential scan is prohibitively costly, or when many queries must be performed over the same document. However, while streaming approaches are supposed to be slower than indexed ones, this may not always be the case. Note that indexed solutions improve querying capabilities at the expense of increasing the space requirements, due to the index structures. Thus, if the space needed for the index makes it necessary to manipulate it on disk, efficiency can be affected by I/O transfer times. Hence many efforts have been devoted to creating in-memory indexes, and to coping with the usually high space requirements of the indexed alternatives. These efforts involve the use of compression techniques to minimize that extra space.

Related to the space challenge, another quite active line of research has been the development of XML compression methods. One of the main features of the XML data model is its great flexibility. However, it also constitutes one of its main drawbacks, since the verbosity of XML documents may result in huge documents, which have to be transmitted, stored and, as just seen, also queried. The use of compression tools therefore saves not only storage space, but also time. Time is the critical factor in efficiency, and working with a compressed version of a document saves time when it is transmitted through a network, when it is read from disk or, more importantly, when it is processed. Compression is therefore clearly advantageous.

Several works have been devoted in recent years to the XML compression task, both in the form of general text compressors, known as XML-blind compressors (e.g. Ziv-Lempel techniques [ZL77, ZL78, Wel84], Huffman compression [Huf52, dMNZBY00], PPM-based methods [CW84], Dense Codes compressors [BFNP07], etc.), and compressors specifically designed to exploit XML document structure. Indeed, most of these XML-conscious compressors have gone one step further and have faced both problems, compression and query support, leading to several queriable compression tools (e.g. XGrind [TH02], XPRESS [MPC03], XCQ [LNWL03, NLWL06], XQzip [CN04], XQueC [ABMP07], etc.). Some of them allow one to perform queries directly over the compressed representation of the text (either sequentially or using indexes), while others need to decompress the data (either fully or partially) before operating on them. However, despite the large amount of research carried out over the years in this area, today there is an acknowledged lack of available practical solutions [Sak09].

A more novel approach has been to combine compression and indexing, creating self-indexed representations of the text [NM07], in such a way that the compressed data represents at the same time the structured text and an index built over it. In recent works [FLMM05, FLMM06] a self-index for XML data was presented (XBzipIndex). This solution provides some query support, yet it is restricted to a very limited class of queries. In [ACM+10], the authors presented another recent proposal for compressed indexing of XML data. This tool, called SXSI, was tailored to work in main memory and has been shown to cope with an important subset of queries. This time, the main inconvenience is that its space requirements are still high compared to the size obtained by a plain compressor.

Hence, we can observe that efficient, scalable and stable implementations that take little space and provide, at the same time, full XML query support, are highly desirable, yet not satisfactorily achieved.

Figure 1.1: XXS architecture overview. The diagram shows the XML Document, the XML Representation (XML Wavelet Tree), and the Query Module (Query Parser and Query Evaluator).

1.2 Contributions

This thesis addresses the open problem pointed out at the end of the previous section, and proposes a complete and competitive solution that efficiently supports XPath queries over a compressed and self-indexed representation of XML documents. We have developed a system, called XXS (XPath evaluation on XML documents using a Self-index), that implements this solution. Figure 1.1 shows the architecture of our proposal. As can be seen, it is mainly composed of two parts, which constitute the two main contributions of this work:

• XML Representation: The first contribution is a new data structure, which we call XML Wavelet Tree (XWT), that provides a compact representation of XML documents with implicit self-indexing capabilities. Its construction is made in two phases. First, an initial pass over the input document¹ is performed to obtain the different words and their frequencies, keeping separate vocabularies depending on the category of the words, according to the different components of the XML data model. Then each word is assigned a codeword using a variant of a word-based byte-oriented compressor, called the (s,c)-Dense Code [BFNP07], particularly tailored to make XWT suitable for querying purposes. The second pass replaces each word of the document by its corresponding codeword, yielding a compressed representation. Yet the bytes of each codeword are not stored consecutively. Instead, they are placed along different nodes of a tree, following a WTBC reorganization of the codeword bytes [BFLN12]. XWT represents XML documents using only about 30%-40% of the original document size, which is a negligible overhead over the compression ratios achieved by the underlying compression method, as experiments prove. What is more striking is that the self-indexing properties and construction features of XWT give this representation the ability to efficiently support XPath queries.

This new representation was published in preliminary form in the 13th European Conference on Digital Libraries (ECDL 2009) [BCPN09].

¹ Notice that a collection of documents can be regarded as a single document that integrates all of them.
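The two construction passes and the byte reorganization described above can be illustrated with a minimal Python sketch. It is not the XWT construction algorithm of the thesis: it assumes a single vocabulary, an already tokenized word list, and the plain (s,c)-Dense Code rather than the tailored variant, and the names `scdc_encode`, `compress`, `build_wtbc` and `count` are hypothetical.

```python
from collections import Counter, defaultdict

def scdc_encode(rank, s=128):
    """(s,c)-Dense Code of a 0-based frequency rank; s stoppers, c = 256 - s continuers."""
    c = 256 - s
    out = [rank % s]              # the last byte is a stopper in [0, s)
    x = rank // s
    while x > 0:
        x -= 1
        out.append(s + x % c)     # earlier bytes are continuers in [s, 256)
        x //= c
    return bytes(reversed(out))

def compress(words, s=128):
    """Two passes: count frequencies, then replace each word by its codeword."""
    freq = Counter(words)                                 # first pass
    vocab = sorted(freq, key=lambda w: (-freq[w], w))     # most frequent first
    code_of = {w: scdc_encode(i, s) for i, w in enumerate(vocab)}
    return code_of, [code_of[w] for w in words]           # second pass

def build_wtbc(codes):
    """Spread codeword bytes over tree nodes keyed by the bytes that precede them."""
    nodes = defaultdict(bytearray)
    for code in codes:
        for level in range(len(code)):
            nodes[code[:level]].append(code[level])       # root node has key b""
    return nodes

def count(nodes, code):
    """Occurrences of a word: occurrences of its last byte in the node of its prefix."""
    return nodes[code[:-1]].count(code[-1])

# Tiny example with s=2, so that the third word already needs two bytes.
code_of, codes = compress("a b a c a b".split(), s=2)
nodes = build_wtbc(codes)
```

In this toy run the root node holds the first byte of every codeword in text order, and the child reached through byte 2 holds the second byte of the only two-byte codeword, so counting a word never touches the bytes of unrelated words.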

• Query Module: As stated, XWT is a new approach to represent and process XML documents in a time- and space-efficient way. But we have also addressed the query needs, by designing and implementing a query module for the efficient evaluation of XPath queries over the XWT representation. The Query module has two main components: the Query parser and the Query evaluator (see Figure 1.1). The Query parser submodule starts by obtaining a preliminary representation of the query, the query parse tree, which results directly from parsing the query syntax. Then several transformations are applied to this representation to produce an equivalent but optimized one that exploits XWT features, the query execution tree. This final representation constitutes the execution plan of the query.

Once the query execution tree is obtained, the Query evaluator submodule directly translates it into operators that perform the global execution process over the XWT representation of the document. Three main strategies characterize the general evaluation procedure: a bottom-up approach, a lazy evaluation scheme, and the use of a skipping strategy. We describe in detail the whole process, and also the implementation of every operator.
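To give an intuition of how the three strategies fit together, the following Python sketch evaluates a single descendant step over (start, end) segments. It is only an illustration under strong assumptions, not the operators of XXS: segments are plain position pairs held in memory, the outer segments are assumed not to nest in each other, and the names `Leaf` and `descendant` are hypothetical.

```python
from bisect import bisect_left

class Leaf:
    """Leaf operand: (start, end) segments of one element name, in document order."""

    def __init__(self, segments):
        self.segments = sorted(segments)
        self.starts = [start for start, _ in self.segments]

    def next(self, from_pos=0):
        """First segment starting at or after from_pos (skipping the rest), or None."""
        i = bisect_left(self.starts, from_pos)
        return self.segments[i] if i < len(self.segments) else None


def descendant(outer, inner):
    """Lazily yield the inner segments nested inside some outer segment.

    Assumes outer segments do not nest in each other and that all segments
    come from well-formed XML (they are properly nested or disjoint).
    """
    pos = 0
    o = outer.next(0)
    while o is not None:
        cand = inner.next(max(pos, o[0] + 1))   # skip inner segments before o
        if cand is None:
            return
        if cand[1] < o[1]:                      # cand lies inside o: report it
            yield cand
            pos = cand[0] + 1
        else:                                   # cand starts after o ends
            o = outer.next(o[1] + 1)


outer = Leaf([(0, 10), (20, 30), (40, 50)])
inner = Leaf([(2, 3), (4, 9), (12, 13), (22, 25), (35, 36), (41, 42)])
results = list(descendant(outer, inner))
```

The leaves are consulted first and feed the internal operator (bottom-up), the generator produces one result per request (lazy), and every call to `next` jumps directly to the first useful position instead of scanning (skipping). Here the segments (12, 13) and (35, 36) are discarded because no outer segment contains them.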

The overall performance of XXS has been tested and compared with some well-known state-of-the-art solutions supporting XPath. Results show that it provides outstanding XPath evaluation capabilities, using little extra space (about 4%-8% of additional space) on top of the XWT representation.

A general description of the XXS system has been presented in the 7th Workshop on Compression, Text, and Algorithms of the 19th International Symposium on


1.3 Structure of the Thesis

After this introductory chapter, the rest of the thesis is organized in two main parts, as follows:

• Part I - Basic Concepts and State of the Art Revision: this part introduces some preliminary concepts for a better understanding of the rest of the thesis.

Chapter 2 introduces the basic concepts about XML documents. It presents a general overview of the eXtensible Markup Language, together with a brief description of the most important languages to process XML documents, among which the XPath query language is given a special focus.

Chapter 3 addresses the relevance and benefits of text compression nowadays to cope with space limitations, improving efficiency. Given the space challenge that may result from XML verbosity, compression becomes crucial. We present a review of some basic notions about general text compression, and describe some classical and up-to-date proposals in this field. This chapter also explains some background information related to succinct data structures, and describes the most important ones regarding the scope of our work, namely, those used to solve basic operations (in particular, rank and select operations) over bit and byte sequences, as well as different succinct tree representations.

Chapter 4 reviews some of the most relevant proposals in the state of the art devoted to XML storage and querying. First, a classification of some well-known streaming and indexed systems developed specifically to provide efficient support for XML query languages is presented. Then the chapter focuses on space aspects, and describes several works that have addressed the problem of minimizing space requirements, in the form of XML queriable and non-queriable compression techniques.

• Part II - The XXS proposal: this part is devoted to explaining the contributions of this thesis that together constitute the core of the XXS system, and to experimentally evaluating its performance.

Remember that our proposal aims at providing a compact representation of XML documents with efficient query support. As shown in Figure 1.1, it is composed of two main parts: the XML representation and the Query module. Chapter 5 focuses on the first one, and presents the XML Wavelet Tree (XWT), the compressed data structure we developed to represent XML documents with self-indexing capabilities. It describes in detail the XWT construction process, and the basic procedures to compress, decompress, and search words and phrases over that representation. This chapter also points out some of the main XWT properties that are key to providing efficient query evaluation.

The Query module of XXS is initially addressed in Chapter 6. In particular, this chapter deals with the Query parser component. The practical subset of XPath targeted in this work is first introduced, and then the process from query parsing to the production of the final query execution plan is described.

Chapter 7 focuses on query evaluation, and closes our proposal with the description of the Query evaluator, the XXS submodule in charge of the efficient evaluation of XPath queries over an XWT representation. The chapter conceptually describes how the general evaluation procedure operates, combining a bottom-up and lazy evaluation approach with a skipping strategy to avoid the processing of those parts of the document that are not relevant for a given query.

Once the global execution process has been described, Chapter 8 details the implementation of every operator and its most relevant features.

Chapter 9 benchmarks XXS, and analyzes both its compression properties, which stem from the underlying XWT representation, and its querying capabilities, by comparing it with some of the best current alternatives in the state of the art.

After that, this work ends with a final summary chapter and several appendixes, which we detail next:

• Chapter 10 summarizes the main contributions of our work, and future directions of research.

• Appendix A lists the publications and other research activities related to this thesis.

• Appendix B describes the pseudocode of some of the operators presented in Chapter 7.

• Following the rules for PhD dissertation in a foreign language at the University of A Coruña, Appendix C contains a description, in Spanish, of this thesis work.


Part I

Basic Concepts and State of the Art Revision


Chapter 2

XML and XPath Query Language

This chapter presents the basic concepts related to XML documents. We first provide an overview of the eXtensible Markup Language in Section 2.1, by describing the main features of this specification. Then, Section 2.2 briefly describes some of the most important languages used to process XML documents, and next focuses on the XPath query language. For this query language, its base data model (Section 2.2.1), as well as the syntax used to create XPath expressions (Section 2.2.2), are shown. Finally, recent and further extensions to XPath are presented in Section 2.2.3.

2.1 XML Overview

The eXtensible Markup Language (XML) is a World Wide Web Consortium (W3C) standard markup language that was originally defined as a simplified subset of the Standard Generalized Markup Language (SGML) for use on the World Wide Web. Since its first introduction in 1998 [XMLa, GP98], the language and its data model soon proved their suitability as the basis for data interchange on the Internet. Today, XML is widely employed as a basic data model for representing general semi-structured information in different domains, ranging from business and e-commerce applications to biology and chemistry.

The XML specification defines a set of rules for designing documents that can be processed by computer programs, while keeping human-readability. XML documents are basically built from strings of text and markups. The basic markup unit, which describes the structure of a document, is called an element (or tag). It is defined by a pair of matching marks, namely the start-tag and the end-tag, that enclose the element content. Start-tags begin with `<', and end-tags with `</'. Both are then followed by the name which identifies the element itself, and are closed by `>'. The name of an element is generally related to the nature of the content it surrounds. For instance:

<section>XML Overview</section>

Some elements may be empty, that is, they have no content. In this case, we call them empty elements, and they are represented by combining the start-tag and the end-tag into a single empty-element tag beginning with `<', but ending with `/>'. There is also a special element, the so-called root element. It is the first element in the document and contains all the other elements of the document.

Elements can have attributes. They consist of name-value pairs, that appear within the start-tag, just after the name of the element. Names are separated from values, which are enclosed in single or double quotation marks, by `='. For example:

<image file="document.png" caption="XML document sample"/>

Notice that image is an empty element, thus without content, but with two attributes, file and caption, whose values are document.png and XML document sample, respectively.
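As a brief illustration, the image element described above can be read with Python's standard xml.etree.ElementTree module. The choice of Python and the variable names are assumptions made only for this sketch; they are not part of the XML specification.

```python
import xml.etree.ElementTree as ET

# The empty element discussed above, with its two attributes.
image = ET.fromstring('<image file="document.png" caption="XML document sample"/>')

# An empty element has neither text content nor child elements.
has_content = image.text is not None or len(image) > 0

# Attributes are exposed as a name -> value mapping.
attributes = dict(image.attrib)
```

Here has_content is False, and attributes maps file and caption to their respective values.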

The element and attribute names of some XML documents may be taken from multiple XML applications. Thus, they may share a common name while standing for different meanings. In those cases, XML namespaces allow one to disambiguate elements and attributes with the same name by assigning them to URIs. Namespaces are implemented by attaching a prefix to each element and attribute name, which is mapped to a URI by using an xmlns:prefix attribute either in the elements in which they are used or in the XML root element. In the following example, the xmlns:bk attribute associates the bk prefix to the URI http://www.vocexample.org/bkvoc, and hence all element and attribute names prefixed by bk are in the same namespace:

<bk:catalog xmlns:bk="http://www.vocexample.org/bkvoc">
  <bk:journal>
    <bk:title>Information Retrieval</bk:title>
    <bk:year>2011</bk:year>
    <bk:citations>7024</bk:citations>
  </bk:journal>
</bk:catalog>
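To make the prefix-to-URI mapping concrete, the following sketch parses the catalog example with Python's xml.etree.ElementTree. The voc prefix used in the query is deliberately different from bk, to show that only the URI matters; this prefix and the variable names are assumptions of this example.

```python
import xml.etree.ElementTree as ET

CATALOG = """
<bk:catalog xmlns:bk="http://www.vocexample.org/bkvoc">
  <bk:journal>
    <bk:title>Information Retrieval</bk:title>
    <bk:year>2011</bk:year>
    <bk:citations>7024</bk:citations>
  </bk:journal>
</bk:catalog>
"""

root = ET.fromstring(CATALOG)

# ElementTree stores each qualified name as "{URI}local-name".
root_tag = root.tag

# The prefix used when querying is arbitrary; it is resolved through this map.
ns = {"voc": "http://www.vocexample.org/bkvoc"}
title = root.find("voc:journal/voc:title", ns).text
```

The root tag is reported with the URI in place of the bk prefix, and the title is found even though the query uses another prefix.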


Some other important markups that can be found in an XML document are comments and processing instructions. Comments begin with `<!--' and end with `-->'. They may appear anywhere in a document outside of other markup. They are not part of the textual content of a document, since comments aim to make the raw XML more legible to human readers. XML processors may or may not retrieve the information included in the comments. Here is an example:

<library>
  <!--This content has been manually generated-->
  <book>
    <title>Three ways to capsize a boat</title>
  </book>
  ...
</library>

On the other hand, processing instructions (referred to as PIs), which appear enclosed by `<?' and `?>', provide information to particular applications that may process the document. The application for which a processing instruction is intended is identified immediately after the initial `<?', with a name called the PI target. The rest of the processing instruction contains the data with the instructions to be passed to the corresponding application. Like comments, processing instructions may appear anywhere in an XML document, outside of other markup. A common example of processing instruction is xml-stylesheet, which allows one to attach stylesheets to documents. For instance, in the following sample, the xml-stylesheet processing instruction indicates that browsers should apply the CSS stylesheet book.css to the document before showing it to the user:

<?xml-stylesheet href="book.css" type="text/css"?>
<library>
  <!--This content has been manually generated-->
  <book>
    <title>Three ways to capsize a boat</title>
  </book>
  ...
</library>

It is forbidden to start a processing instruction with the PI target xml (nor with XML, XmL, xMl, etc.), since this name is reserved to specify the XML declaration of an XML document. It constitutes the prolog¹ that any XML document should have², and provides information about the document itself:

¹ The prolog is everything in the XML document before the root element start-tag.

² An XML document does not have to have an XML declaration. Notwithstanding, if an XML document has it, then the declaration must be the first thing in the document.


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet href="book.css" type="text/css"?>
<library>
  ...
</library>
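The difference between the XML declaration, processing instructions and comments can be observed with a DOM parser. The sketch below uses Python's xml.dom.minidom; the document is a shortened version of the samples above with an empty library element, which is an assumption made to keep the example small.

```python
from xml.dom import minidom

DOCUMENT = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<?xml-stylesheet href="book.css" type="text/css"?>'
    '<!--This content has been manually generated-->'
    '<library/>'
).encode("utf-8")

doc = minidom.parseString(DOCUMENT)

# Classify the top-level nodes of the document, in document order.
kinds = []
for node in doc.childNodes:
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
        kinds.append(("pi", node.target, node.data))
    elif node.nodeType == node.COMMENT_NODE:
        kinds.append(("comment", node.data))
    elif node.nodeType == node.ELEMENT_NODE:
        kinds.append(("element", node.tagName))
```

Note that the XML declaration does not show up among the nodes: only the xml-stylesheet processing instruction (with its target and data), the comment, and the root element are reported.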

When working with characters that are interpreted in a specific way, like the character `<', which is always recognized as the beginning of a start-tag or end-tag, it is necessary to provide escape facilities to include them outside their usual role. To this aim, XML defines entity references, which allow escaping markup characters appearing within the text content or within attribute values. XML has five predefined entity references: i) &lt;, to replace `<', ii) &amp;, used instead of `&', iii) &gt;, representing `>', iv) &quot;, for `"', and v) &apos;, to substitute `''. Only &lt; and &amp; must be used instead of the literal characters inside element content. The others are optional. In turn, &quot; and &apos; are useful inside attribute values in order to avoid misconstruing the end of the value.
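The predefined entity references can be produced and reverted with the escape and unescape helpers of Python's xml.sax.saxutils module, as the following sketch shows. The sample string and the QUOTES table are assumptions of this example; by default the helpers handle only `&', `<' and `>', so the two quote characters are passed explicitly.

```python
from xml.sax.saxutils import escape, unescape

# Extra replacements for attribute values (not applied by default).
QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {"&quot;": '"', "&apos;": "'"}

text = 'if (a < b && name == "x")'

# Enough for element content: '&' and '<' (and '>') are escaped.
escaped_content = escape(text)

# Safe inside a quoted attribute value as well.
escaped_attr = escape(text, QUOTES)

# Reverting the escaping gives back the original string.
restored = unescape(escaped_attr, UNQUOTES)
```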

An alternative to the use of entity references inside large blocks of text containing many occurrences of special characters are CDATA sections. A CDATA section begins with <![CDATA[ and ends with ]]>, and causes its contents to be processed simply as character data, not as markup. That is, markup inside it is not interpreted. For instance, let us consider an XML document including some samples of source code. They may contain characters that an XML processor would recognize as markup (e.g. `&' and `<'). We can use a CDATA section to enclose the samples and prevent the usual processing:

<![CDATA[ a = i << 3; *b = &a; ]]>
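As a small check of this behaviour, the sketch below wraps the CDATA section above in a hypothetical code element (the element name is an assumption of this example) and parses it with Python's xml.etree.ElementTree. It also parses the same content written with entity references.

```python
import xml.etree.ElementTree as ET

# The CDATA section from the text, enclosed in an element so that it can be parsed.
with_cdata = ET.fromstring("<code><![CDATA[ a = i << 3; *b = &a; ]]></code>")

# The same content expressed through entity references instead.
with_entities = ET.fromstring("<code> a = i &lt;&lt; 3; *b = &amp;a; </code>")

cdata_text = with_cdata.text
entity_text = with_entities.text
```

Both variants yield exactly the same character data; the CDATA form is simply easier to read and write when special characters are frequent.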

2.1.1 Well-formedness and Validation

As stated, XML specifies a set of rules that make up the grammar of an XML document. Besides the possible components, it determines, for instance, where elements may be placed, which names are allowed, how attributes are included, and so on. Documents that fulfill the grammar are said to be well-formed. There are many rules, but some of the most important ones that a well-formed XML document must satisfy are the following: i) it has a unique root element, ii) every start-tag has its matching end-tag, iii) elements cannot overlap (i.e. an element cannot be closed until all the elements it contains have been closed), iv) attribute values must be quoted, v) an element may not have two attributes with the same name, vi) markup characters `<' and `&' may not occur in the character data of elements and attributes. Notice that the first three rules induce a proper tree structure on an XML document. Figure 2.1 illustrates an example. Furthermore, the grammar sets the basis needed to create XML parsers, able to read any XML document.
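The rules listed above can be checked in practice by handing a document to any conforming parser, which must reject input that is not well-formed. The following sketch does so with Python's xml.etree.ElementTree; the helper name is_well_formed and the small test documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# One well-formed document, followed by one violation per rule i) to vi).
CASES = {
    "well-formed": "<book><title>T</title></book>",
    "two root elements": "<book/><book/>",
    "missing end-tag": "<book><title>T</book>",
    "overlapping elements": "<b><i>text</b></i>",
    "unquoted attribute value": "<book ref=b1/>",
    "duplicate attribute": '<book ref="b1" ref="b2"/>',
    "literal '<' in character data": "<price>1 < 2</price>",
}

results = {name: is_well_formed(doc) for name, doc in CASES.items()}
```

Only the first document is accepted; every other case is rejected with a parse error.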

XML Document

<book>
  <title>Three ways to capsize a boat</title>
  <year>2010</year>
  <author>
    <name>Chris Stewart</name>
    <country>UK</country>
  </author>
  <price>11.25</price>
</book>

book
  title
    Three ways to capsize a boat
  year
    2010
  author
    name
      Chris Stewart
    country
      UK
  price
    11.25

Figure 2.1: Tree view of a sample XML document.

There are, basically, two main APIs for XML. The Simple API for XML (SAX) [SAXa] is an event-based API. The parser sequentially scans an XML document and raises events that are then handled by the application. Examples of events are, for instance, an occurrence of a start-tag or an end-tag, content characters, a processing instruction, a comment, etc. In contrast, the Document Object Model (DOM) [DOM] is an API that builds a tree representation of the entire document in memory, thus using much more memory than the former approach, but allowing random access to the document and its manipulation.
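The contrast between both APIs can be sketched with the xml.sax and xml.dom.minidom modules of the Python standard library. The EventLogger class, the shortened book document and the edit applied through DOM are assumptions of this example.

```python
import xml.sax
from xml.dom import minidom

BOOK = b"<book><title>Three ways to capsize a boat</title><year>2010</year></book>"


class EventLogger(xml.sax.ContentHandler):
    """Records the SAX events in the order in which the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # A parser may deliver one text run in several chunks; merge them.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + content)
        else:
            self.events.append(("text", content))


# SAX: a single sequential scan, nothing is kept unless the handler stores it.
logger = EventLogger()
xml.sax.parseString(BOOK, logger)

# DOM: the whole tree is in memory, so any node can be reached and modified.
dom = minidom.parseString(BOOK)
year = dom.getElementsByTagName("year")[0]
year.firstChild.data = "2011"
updated = dom.documentElement.toxml()
```

The SAX handler only sees a stream of start, text and end events, whereas the DOM tree allows going straight to the year element and changing its content.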

In addition to being well-formed, an XML document may also be valid. Particular XML applications may need to ensure that a given XML document adheres to some guidelines (rules) imposed by the application itself. In that case, the allowed markups, as well as their composition, are specified in a schema. Whenever an XML document matches the schema it is said to be valid. If not, we say that the XML document is invalid. Hence, the validity of a document depends on the schema it is compared against. Documents do not always need to be valid; for many applications it is enough that the document is well-formed. There are several XML schema languages, each one having a different level of expressiveness. The most widely supported XML schema language³ is the Document Type Definition (DTD). A DTD defines the list of markups (e.g. elements, attributes, entities, etc.) that can be used in a document, and how they can be combined, together with basic content specifications. For example:

<!ELEMENT library (book+)>
<!ELEMENT book (title, summary, chapter*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT summary (#PCDATA | keyword)*>
<!ELEMENT chapter (#PCDATA)>
<!ATTLIST book ref CDATA #REQUIRED
               href CDATA #IMPLIED>

The first element declaration of the DTD sample above states that each library element must contain one or more book child elements⁴. In turn, the second line indicates that each book element must have exactly one title child element followed by exactly one summary element, and zero or more chapter elements⁵. That is, every book must contain a title and a summary, and may or may not have one or more chapter elements. Moreover, the title must come before the summary, and the summary must appear before all chapters.

Regarding title and chapter elements, lines 3 and 5 say that each occurrence of any of these elements may only contain parsed character data (referred to as #PCDATA), that is, raw text, but not any child element. In case mixed content is allowed, we use an element declaration similar to that shown in line 4. This states that a summary element may contain parsed character data as well as keyword children. It does not specify in which order they appear, nor how many instances of each occur. This declaration allows a summary to have 0 keyword children, 1 keyword child, or 26 keyword children.

In addition, ATTLIST declarations are used to declare element attributes. For instance, lines 6 and 7 of the sample DTD we have been analyzing indicate that any book element must have a ref attribute (#REQUIRED). However, the href attribute is optional (#IMPLIED), and may be omitted from particular book elements. Both attributes are asserted to contain character data (i.e. any string of text)⁶.

Therefore, according to the DTD sample just seen, the following XML document would be valid:

<library>
  <book ref="...">
    <title>Three ways to capsize a boat</title>
    <summary>A charming and lyrical read, awash with the joy of discovery</summary>
    <chapter>The proposal</chapter>
    <chapter>When dreams come true</chapter>
    <chapter>Sailing to Greek Islands</chapter>
    ...
  </book>
</library>

⁴ The `+' after book stands for one or more.

⁵ This time `*' after chapter denotes zero or more.

⁶ CDATA is the most generic attribute type. Other attribute types are: NMTOKEN, NMTOKENS, Enumeration, ENTITY, ID, IDREF, etc.

However, this is not the case for the next document, since the summary element comes before the title one, and the book element does not have the mandatory attribute ref:

<library>
  <book>
    <summary>A charming and lyrical read, awash with the joy of discovery</summary>
    <title>Three ways to capsize a boat</title>
    <chapter>The proposal</chapter>
    <chapter>When dreams come true</chapter>
    <chapter>Sailing to Greek Islands</chapter>
    ...
  </book>
</library>
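The two constraints that distinguish both documents can be made explicit with a small hand-written check. The sketch below is not a DTD validator: it only encodes the content model (title, summary, chapter*) of the book element as a regular expression over the names of its children, and tests for the presence of the ref attribute. The function name, the shortened documents and the value b1 given to ref are assumptions of this example.

```python
import re
import xml.etree.ElementTree as ET

# (title, summary, chapter*) as a regular expression over the child names.
BOOK_MODEL = re.compile(r"title,summary(,chapter)*")


def book_conforms(book) -> bool:
    """Check the child order of a book element and its #REQUIRED ref attribute."""
    child_names = ",".join(child.tag for child in book)
    return "ref" in book.attrib and BOOK_MODEL.fullmatch(child_names) is not None


valid_book = ET.fromstring(
    '<book ref="b1">'  # "b1" is a made-up value; the DTD only requires ref to exist
    "<title>Three ways to capsize a boat</title>"
    "<summary>A charming and lyrical read</summary>"
    "<chapter>The proposal</chapter>"
    "<chapter>When dreams come true</chapter>"
    "</book>"
)

invalid_book = ET.fromstring(
    "<book>"  # no ref attribute, and summary precedes title
    "<summary>A charming and lyrical read</summary>"
    "<title>Three ways to capsize a boat</title>"
    "<chapter>The proposal</chapter>"
    "</book>"
)

valid_result = book_conforms(valid_book)
invalid_result = book_conforms(invalid_book)
```

A real validating parser would derive such checks from the DTD itself and would also cover the remaining declarations (library, summary, and so on).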

Usually schemas are supplied in files separate from the documents they describe. Yet, DTDs are the only ones that can also be included inside the XML document. In both cases, the XML markup corresponding to the document type declaration is used. It is included in the prolog of the XML document, just after the XML declaration and before the root element, and it allows one to specify either a reference to an external DTD against which the document should be compared or even the DTD itself (between square brackets). For instance, let us assume that the previously discussed sample DTD is available at http://dtdsamples.com/library.dtd. Then, the document type declaration of an XML document conforming to this DTD looks like:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet href="book.css" type="text/css"?>
<!DOCTYPE library SYSTEM "http://dtdsamples.com/library.dtd">

This document type declaration states that the root element of the document is library and that the DTD for the document can be found at http://dtdsamples.com/library.dtd.

Nevertheless, DTDs may not always be enough, since they provide limited support for type definition of the contained data. That is, a DTD does not allow one to specify, for instance, that an element contains a real number or a date range. Some other well-known and more powerful schema languages that permit this kind of constraint are the W3C XML Schema Language [XSD], RELAX NG [CM01] and Schematron [Sch].

2.2 XPath Query Language

There are several languages for processing XML documents: the XML Path Language (XPath) [XPaa], the XML Query Language (XQuery) [XQu], the XSL Transformation (XSLT) [XSL], the XML Linking Language (XLink) [XLi], the XML Pointer Language (XPointer) [XPo], etc. The reference query languages are XPath and XQuery⁷, while XSLT is used to transform an XML document into another XML document by means of template rules. In turn, XLink allows one to attach simple, bidirectional or even multidirectional links to XML documents, which can be further specified by using XPointer, which allows addressing individual parts of an XML document. As can be seen, each one deals with different aspects of XML processing, yet the relevance of XPath stems from the fact that it constitutes the basis for most of the others. Since this thesis is focused on this language, we will next describe it in detail⁸.

2.2.1 XPath Data Model

XPath aims to select parts of XML documents. The XPath data model considers XML documents as trees made up of nodes of different types. There are basically seven node types: i) the root node, ii) element nodes, iii) text nodes, iv) attribute nodes, v) comment nodes, vi) processing instruction nodes, and vii) namespace nodes. There is always a root node, which is the root of the hierarchy. It has no name and no parent, and its only element child is the element node representing the root element of the document. It may also contain any comment or processing instruction occurring before the root element start-tag or after the root element end-tag.

Element nodes represent the elements of an XML document. Each of them has a parent, which in the case of the root element is the root node and, for the rest of the element nodes, is the node containing it. An element node may have children that can be nodes representing other elements, text, comments, and processing instructions directly contained by the element.

⁷ The main expression of XQuery is the FLWOR expression: FOR, LET, WHERE, ORDER, RETURN. This expression supports iteration and binding of variables to intermediate results. It is commonly assumed that the FLWOR expression serves approximately the same purpose in XQuery as the SELECT expression serves in the SQL language for relational databases.

Each attribute makes up the corresponding attribute node. The parent of an attribute node is the element node it belongs to, yet the attribute is not considered a child of that element. Attribute nodes have no children. The textual content of an element is represented by text nodes. Each text node contains the maximum contiguous run of character data not interrupted by any tag. Like attribute nodes, text nodes do not have child nodes. Finally, comment nodes, processing instruction nodes and namespace nodes correspond to occurrences of the respective components their names refer to. Yet, these are rarely handled.

2.2.2 XPath Expressions

The basic concept in XPath is the expression. XPath syntax mainly consists of expressions whose result is usually a set of nodes⁹, but it can also be a boolean, numeric or string value. That is, expressions allow one to specify a set of nodes and optionally a function on the result. Hence, it is possible to search, for instance, for all the book nodes in an XML document and just deliver the set, or add a counting operation and deliver instead the number of such nodes.

The most important XPath expression is the so-called path expression, also known as location path. A location path identifies a set of nodes in a document and is composed of a sequence of one or more smaller units, namely the location steps. Location paths may start with a slash, `/', in which case they are absolute location paths, which are evaluated from the document root node, or may be relative location paths, which are evaluated from a context node.

Before formally describing path expressions, let us consider the following example of a location path over the XML document of Figure 2.2 to show how they work:

/store/city/books[./@category="fantasy"]/book/title

Since the expression begins with a slash, its evaluation starts from the root node. In particular, we are interested in store element nodes that are children of the root node. The element store (line 1) of the sample XML document satisfies this constraint, so we select it. Then, the following step selects all its children of type city (lines 2, 23 and 37) and, for each selected node, the next step obtains those element nodes that are books child nodes. However, the expression surrounded by the square brackets restricts that selection to only those books element nodes that have an attribute category with value fantasy (lines 3 and 24). Once those elements are retained, we continue by selecting their book children (lines 4, 9, 25 and 30). At last, the final step returns the title child nodes of each of them (lines 5, 10, 26 and 31).

⁹ According to XPath 1.0 [XPaa] results are node sets, hence with no order; while in XPath 2.0 [XPab], results are sequences of nodes in a particular order, the `document order' (which applied over the XML document structure corresponds to a preorder traversal). However, arguably all the systems supporting XPath 1.0 also assume this `document order' when delivering results. In this work we assume it as well, also as a way to allow the compatibility of our system with future extensions.

XML Document

1.  <store>
2.    <city name="Coruña" province="Coruña">
3.      <books category="fantasy">
4.        <book year="1997">
5.          <title>Harry Potter and the Philosopher’s Stone</title>
6.          <author>J.K. Rowling</author>
7.          <price>10.95</price>
8.        </book>
9.        <book year="2000">
10.         <title>Harry Potter and the Goblet of Fire</title>
11.         <author>J.K. Rowling</author>
12.         <price>13.50</price>
13.       </book>
14.     </books>
15.     <books category="literature">
16.       <book year="1999">
17.         <title>Driving over Lemons: An Optimist in Andalucia</title>
18.         <author>Chris Stewart</author>
19.         <price>10.25</price>
20.       </book>
21.     </books>
22.   </city>
23.   <city name="Vigo" province="Pontevedra">
24.     <books category="fantasy">
25.       <book year="1954">
26.         <title>The Two Towers</title>
27.         <author>J.R.R. Tolkien</author>
28.         <price>20.15</price>
29.       </book>
30.       <book year="1955">
31.         <title>The Return of the King</title>
32.         <author>J.R.R. Tolkien</author>
33.         <price>23.75</price>
34.       </book>
35.     </books>
36.   </city>
37.   <city name="Santiago" province="Coruña">
38.   </city>
39. </store>

Figure 2.2: Example of XML document.
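The evaluation just described can be reproduced with the limited XPath support of Python's xml.etree.ElementTree, as sketched below. Several assumptions apply: the document of Figure 2.2 is abridged to the title of each book; the path is written relative to the store element, because this library evaluates paths from an element rather than from the root node; and the predicate is written as [@category='fantasy'], which is the form this library accepts and selects the same nodes as [./@category="fantasy"].

```python
import xml.etree.ElementTree as ET

# Abridged version of the document of Figure 2.2 (authors and prices omitted).
STORE = """<store>
  <city name="Coruña" province="Coruña">
    <books category="fantasy">
      <book year="1997"><title>Harry Potter and the Philosopher's Stone</title></book>
      <book year="2000"><title>Harry Potter and the Goblet of Fire</title></book>
    </books>
    <books category="literature">
      <book year="1999"><title>Driving over Lemons: An Optimist in Andalucia</title></book>
    </books>
  </city>
  <city name="Vigo" province="Pontevedra">
    <books category="fantasy">
      <book year="1954"><title>The Two Towers</title></book>
      <book year="1955"><title>The Return of the King</title></book>
    </books>
  </city>
  <city name="Santiago" province="Coruña">
  </city>
</store>"""

store = ET.fromstring(STORE)

# Counterpart of /store/city/books[./@category="fantasy"]/book/title,
# evaluated with the store element as the context node.
path = "./city/books[@category='fantasy']/book/title"
titles = [title.text for title in store.findall(path)]

# A node set can also be reduced to a single value, e.g. the number of books.
book_count = len(store.findall(".//book"))
```

The four titles are delivered in document order, and the count covers the five book elements of the document, regardless of their category.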


Now we will discuss in detail the main features of the path expressions. As seen, successive location steps are separated by slashes, and are evaluated from left to right. Each step in the path is relative to the one that preceded it. That is, the result of each location step makes up the context for the next. The general pattern of a location step is given by /axis::node_test[predicate]. That is, it is composed of three main parts:

• An axis, which specifies how to move from the context node to look for new nodes. There are 13 different axes, of which the 8 most common are illustrated in Figure 2.3:

  1. child: identifies every child node of the context node¹⁰.
  2. descendant: selects every child node of the context node, their children, and so on. That is, this axis identifies every descendant node of the context node¹⁰.
  3. parent: the parent node of the context node.
  4. ancestor: identifies the parent node of the context node, but also the parent of the parent node, and so forth until reaching the root node.
  5. following: selects every node¹⁰ that appears, in document order, after the context node, excluding all its descendant nodes.
  6. preceding: identifies every node¹⁰ that appears, in document order, before the context node, excluding all its ancestor nodes.
  7. following-sibling: every node¹⁰ sibling of the context node that appears after the context node, in document order.
  8. preceding-sibling: identifies all nodes¹⁰ siblings of the context node appearing before the context node, in document order.
  9. attribute: selects every attribute node of the context node. This axis can only be applied to element nodes.
  10. self: identifies the context node itself.
  11. descendant-or-self: identifies the context node and all its descendants.
  12. ancestor-or-self: selects the context node and all its ancestors.
  13. namespace: identifies all the namespace nodes belonging to the context node. The context nodes can only be element nodes.

Usually axes are classified into forward axes and reverse axes, depending on whether they take nodes that, in document order, are after or before the context node, respectively. Thus the child, descendant, descendant-or-self, following, following-sibling, attribute and namespace axes are all forward axes. In turn, parent, ancestor, ancestor-or-self, preceding, and preceding-sibling are all considered reverse axes. Note that the self axis can be classified either into the forward axes category or into the reverse axes group.

Figure 2.3: Examples of XPath axes.


Some of the axes admit an abbreviated form. For instance, whenever the axis is omitted after `/', as happened in the example previously shown (e.g. /store/city/books[./@category="fantasy"]/book/title), a child axis is assumed (since it is by far the most commonly used). The attribute axis can also be expressed by the symbol `@'. Likewise, the self and parent axes have a shorter notation: a single period (`.') and a double period (`..'), respectively.
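To make the definitions of the axes more tangible, the following sketch implements eight of them over an xml.etree.ElementTree tree in Python. It is only an illustration under several assumptions: it handles element nodes exclusively (no text, attribute or namespace nodes), the class and method names are ours, and results are returned in document order except for ancestor, which lists the nearest ancestor first.

```python
import xml.etree.ElementTree as ET


class Axes:
    """Element-only sketch of several XPath axes over one document."""

    def __init__(self, root):
        # iter() walks the tree in preorder, i.e. in document order.
        self.order = list(root.iter())
        self.parent_of = {c: p for p in self.order for c in p}

    def child(self, node):
        return list(node)

    def descendant(self, node):
        return list(node.iter())[1:]  # drop the context node itself

    def parent(self, node):
        return [self.parent_of[node]] if node in self.parent_of else []

    def ancestor(self, node):
        found = []
        while node in self.parent_of:
            node = self.parent_of[node]
            found.append(node)
        return found  # nearest ancestor first

    def _siblings(self, node):
        return self.child(self.parent_of[node]) if node in self.parent_of else [node]

    def following_sibling(self, node):
        siblings = self._siblings(node)
        return siblings[siblings.index(node) + 1:]

    def preceding_sibling(self, node):
        siblings = self._siblings(node)
        return siblings[:siblings.index(node)]

    def following(self, node):
        # In preorder, the descendants come right after the context node.
        start = self.order.index(node) + 1 + len(self.descendant(node))
        return self.order[start:]

    def preceding(self, node):
        ancestors = self.ancestor(node)
        before = self.order[:self.order.index(node)]
        return [n for n in before if n not in ancestors]


def tags(nodes):
    return [n.tag for n in nodes]


root = ET.fromstring("<a><b><c/></b><d/><e><f/></e></a>")
axes = Axes(root)
b, d = root.find("b"), root.find("d")
```

For instance, with d as the context node, tags(axes.preceding(d)) yields b and c but not the ancestor a, while tags(axes.following(b)) yields d, e and f but not the descendant c.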

• A node test, that indicates the name or the type of the nodes that should be selected along the axis.

Every axis has a principal node type. If an axis can contain elements, then the principal node type is element; otherwise, it is the type of the nodes that the axis can contain. That is, for the attribute axis the principal node type is attribute, for the namespace axis it is namespace, and for the rest of the axes the principal node type is element. Commonly, node tests specify the name that the selected nodes must have. In this scenario, the name test is fulfilled if the type of the node corresponds to the principal node type of the axis specified in the location step and if its name matches that of the test. For example, consider the location step descendant::book in the following path expression: /store/descendant::book. According to it, only book element nodes descending from a store element node are selected. It is also possible to use the wildcard symbol `*' instead of a specific name. In such a case, the name test is true for any node of the principal node type, no matter its name. For instance, if we consider the example of Figure 2.2, the last location step of the path expression /store/city/books/book/* will select the title, author and price element nodes¹¹ that are children of any book node fulfilling the conditions imposed by the previous location steps. Likewise, /store/city/@* will select any attribute node, regardless of its name, of a city element node that is a child of store. Assuming again the example of Figure 2.2, both the name and province attributes of the corresponding cities will be delivered by this path expression.

In addition, node type tests allow selecting nodes of a specific type. Different functions are used to represent the node types we are interested in. For example, node() stands for nodes of any type, and text() selects only text nodes, while comment() and processing-instruction() are used to select comment nodes and processing-instruction nodes, respectively.

As stated in previous examples, when axes are specified through their shorthand notations, the axis and the node test are combined in the location step. For instance, that is the case of /store/city/@name. However, if

¹¹ Note that these three different element nodes correspond to all the types of element nodes that are children of a book element in the document sample.
