HANDLING XML VERBOSITY - THE ECONOMIC SUBSTANTIATION OF XML DATABASES COMPRESSION


2. HANDLING XML VERBOSITY

There are many solutions for dealing with the verbosity of XML. Following the classification proposed by S. Sakr [5], we shall distinguish:

• general text compressors, i.e., tools implementing compression methods conceived for text (or, more broadly, for general data) rather than specifically for XML, which still achieve satisfactory results when applied to XML documents;

• XML-conscious compressors, i.e., tools implementing compression methods conceived especially for XML, and thus capable of taking advantage of specific XML traits and achieving, at least theoretically, superior compression results.

Among the general text compressors often used to compress XML data are widely used utilities such as zip, gzip, bzip2 and 7-zip. They employ general-purpose compression methods, such as Deflate (zip, gzip and 7-zip), LZMA (7-zip), Prediction by Partial Match (7-zip) or the Burrows-Wheeler Transform (bzip2 and 7-zip); a comprehensive explanation of these methods can be found in [10]. As XML documents are generally highly compressible, the compression ratios obtained with these compressors seem high, until they are compared with what can be attained using a specialized approach (see, e.g., [7]).
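To illustrate how general-purpose methods behave on XML, the following minimal sketch compresses a small synthetic document with three methods available in the Python standard library: Deflate (`zlib`), the BWT-based bzip2 (`bz2`) and LZMA (`lzma`). The sample document and the helper name `compressed_sizes` are our assumptions, made only for illustration; the sizes it reports say nothing about the results cited in the text.

```python
import bz2
import lzma
import zlib

# Hypothetical sample: a small, repetitive XML document built for illustration.
records = "".join(
    f'<order id="{i}"><customer>Customer {i % 7}</customer>'
    f"<amount>{i * 3 % 100}.00</amount></order>"
    for i in range(500)
)
xml_bytes = f"<orders>{records}</orders>".encode("utf-8")


def compressed_sizes(data: bytes) -> dict:
    """Return the compressed size in bytes for three general-purpose methods."""
    return {
        "deflate": len(zlib.compress(data, 9)),
        "bzip2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }


print("original:", len(xml_bytes))
for method, size in compressed_sizes(xml_bytes).items():
    print(f"{method}: {size} bytes")
```

Because the markup repeats in every record, all three methods shrink the document considerably, which is the effect referred to above as XML being highly compressible.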

The XML-specialized compression methods are typically implemented as extensions of general-purpose methods. They add preprocessing steps that rearrange and transcode the data, and then apply a general-purpose method either to the entire preprocessed document or to parts of it.

One of the first XML-specialized compressors that gained popularity was XMill by Liefke and Suciu [4]. XMill makes use of three XML-conscious processing techniques. The first one splits the XML document into element and attribute names, text content, and document tree structure. Each of the resulting parts has a different type of content, so compressing them separately improves the compression ratio. The second technique groups together the contents of the same XML elements, which helps compression methods with a limited buffer, such as Deflate. The third, optional technique allows a dedicated method to be applied to each type of data (such as numbers or dates); however, as the user must tell XMill which method should be applied to which container, it is hardly practical.

Originally, XMill used only Deflate as the back-end compression method; later versions allow replacing it with more advanced methods, PPM or BWT. Still, the best relative compression improvement was measured for Deflate.
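The first two XMill techniques can be sketched as follows. This is only an illustration of the idea under simplifying assumptions (text following a child element is ignored, and the token format `S`/`A`/`T`/`E` is ours), not XMill's actual container format; the function names `split_into_containers` and `compress_containers` are hypothetical.

```python
import xml.etree.ElementTree as ET
import zlib
from collections import defaultdict


def split_into_containers(xml_text: str):
    """Separate names, tree structure and content, grouping content by element."""
    names = {}                      # element/attribute name -> dictionary index
    structure = []                  # tree shape as a token stream
    containers = defaultdict(list)  # one content container per element/attribute

    def name_id(name: str) -> int:
        return names.setdefault(name, len(names))

    def walk(elem):
        structure.append(f"S{name_id(elem.tag)}")        # start of element
        for attr, value in elem.attrib.items():
            structure.append(f"A{name_id('@' + attr)}")  # attribute present
            containers["@" + attr].append(value)
        if elem.text and elem.text.strip():
            structure.append("T")                        # text goes to a container
            containers[elem.tag].append(elem.text)
        for child in elem:
            walk(child)
        structure.append("E")                            # end of element

    walk(ET.fromstring(xml_text))
    return names, structure, dict(containers)


def compress_containers(names, structure, containers):
    """Compress every part separately with Deflate, as the back-end method."""
    parts = {"names": "\n".join(names), "structure": " ".join(structure)}
    for key, values in containers.items():
        parts["data:" + key] = "\n".join(values)
    return {key: zlib.compress(text.encode("utf-8")) for key, text in parts.items()}


sample = (
    '<orders><order id="1"><customer>Ann</customer><amount>10.00</amount></order>'
    '<order id="2"><customer>Bob</customer><amount>12.50</amount></order></orders>'
)
names, structure, containers = split_into_containers(sample)
print(containers)
```

In the sample, all customer names end up in one container and all amounts in another, so the back-end method sees runs of similar data instead of values interleaved with markup.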


The first XML-specialized compressor designed for a PPM-based back-end was XMLPPM by Cheney [2]. It applies techniques such as substituting element and attribute names with dictionary indices and removing closing tags while marking their positions (the tags can be reconstructed on decompression provided the document is well-formed). The most important element of XMLPPM, however, is ‘multiplexed hierarchical modeling’, which consists in encoding different kinds of data (element and attribute names, element structure, attribute values, element contents) with distinct PPM statistical models. Additionally, in order to exploit correlations between different kinds of data, the previous symbol, regardless of the model it belongs to, is used as context for the next symbol.
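The two simpler techniques mentioned above, name substitution and end-tag removal, can be sketched as follows. The sketch is ours and makes strong simplifying assumptions: attributes are ignored, the token format and the `END` marker value are hypothetical, and no PPM modeling is performed. It only shows why a single generic marker suffices for closing tags in a well-formed document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

END = -1  # generic end-tag marker; the tag name is implied by the nesting


def encode(xml_text: str):
    """Turn a document into a name dictionary and a token stream."""
    names = []   # dictionary: index -> element name
    tokens = []  # ints encode structure, strings carry text content

    def walk(elem):
        if elem.tag not in names:
            names.append(elem.tag)
        tokens.append(names.index(elem.tag))  # start tag as dictionary index
        if elem.text:
            tokens.append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:                    # text following the child element
                tokens.append(child.tail)
        tokens.append(END)                    # closing tag without its name

    walk(ET.fromstring(xml_text))
    return names, tokens


def decode(names, tokens) -> str:
    """Rebuild the document; a stack of open elements restores closing tags."""
    out, stack = [], []
    for tok in tokens:
        if isinstance(tok, str):
            out.append(escape(tok))
        elif tok == END:
            out.append(f"</{stack.pop()}>")
        else:
            stack.append(names[tok])
            out.append(f"<{names[tok]}>")
    return "".join(out)


doc = "<a><b>x</b><c>y</c><b>z</b></a>"
names, tokens = encode(doc)
print(names, tokens)
print(decode(names, tokens) == doc)
```

In a scheme of the kind described above, the integer tokens and the text tokens would then be routed to separate statistical models rather than kept in one list.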

A modification of XMLPPM is SCMPPM by Adiego, de la Fuente and Navarro [1], in which every class of XML element is handled by a separate PPM model. This helps, but only for large XML documents, as an adaptive PPM model needs to process a significant number of symbols to become effective. A major flaw of SCMPPM is its very high memory usage.

XBzip also uses PPM as its back-end compression method, but only after applying a transform, based on the Burrows-Wheeler Transform, that represents the XML document structure linearly using path-sorting and grouping [3]. XBzip has two working modes. The first does not support queries over compressed data, but can attain compression ratios even higher than those of XMLPPM. The second, query-supporting mode, provided by the XBzip Index utility, splits data into containers and creates an FM-index (a compressed representation of a string that supports efficient substring searches) for each of them. As a result, query processing times are very short, but storing the FM-index seriously degrades the compression ratios.
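To show why an FM-index supports fast substring queries, here is a textbook-style sketch of an uncompressed FM-index with backward search. It is not XBzip's implementation: the class name `FMIndex` is ours, the transform is built naively from sorted rotations, and the occurrence table is stored in full, whereas a practical index would compress it.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (quadratic, demo only)."""
    s = text + "\0"  # unique sentinel; assumes the text contains no NUL character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


class FMIndex:
    def __init__(self, text: str):
        self.last = bwt(text)
        counts = Counter(self.last)
        # first[c]: number of characters in the text strictly smaller than c
        self.first, total = {}, 0
        for ch in sorted(counts):
            self.first[ch] = total
            total += counts[ch]
        # occ[c][i]: number of occurrences of c in last[:i]
        self.occ = {ch: [0] for ch in counts}
        for symbol in self.last:
            for ch, column in self.occ.items():
                column.append(column[-1] + (symbol == ch))

    def count(self, pattern: str) -> int:
        """Count occurrences of pattern by scanning it from right to left."""
        lo, hi = 0, len(self.last)  # current range of matching sorted rotations
        for ch in reversed(pattern):
            if ch not in self.first:
                return 0
            lo = self.first[ch] + self.occ[ch][lo]
            hi = self.first[ch] + self.occ[ch][hi]
            if lo >= hi:
                return 0
        return hi - lo


index = FMIndex("banana")
print(index.count("ana"))  # 2
print(index.count("nan"))  # 1
```

The query cost depends on the pattern length rather than on the text length, which is consistent with the short query times noted above; the tables `first` and `occ` are the extra data whose storage costs compression ratio.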

Recently, very high XML compression ratios were attained by XWRT [7]. As XWRT works as a preprocessor, it can in theory be combined with any general-purpose compression method. The most promising results were obtained for LZMA: although using PPM (or PAQ) helps achieve higher compression ratios, it significantly increases decompression time [9].

The preprocessing stage applies several transformations to the XML document. The main one is the substitution of certain alphanumeric phrases with short identifiers. The substituted phrases include ordinary words (that pass both the length and frequency thresholds), XML element start tags, URLs, e-mail addresses, XML entities, and runs of spaces. Other transformations include succinct binary encoding of numbers, IP addresses, dates, and times, as well as replacing selected digrams and XML element end tags with special flags.
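The phrase substitution step can be sketched as follows. This is a minimal illustration covering only words and element start tags; the thresholds, the regular expression and the use of Unicode Private Use Area code points as short identifiers are our assumptions, not XWRT's actual encoding.

```python
import re
from collections import Counter

# A phrase is an element start tag (without '>') or an alphabetic word.
PHRASE = re.compile(r"<[A-Za-z_][\w.-]*|[A-Za-z]{2,}")
BASE = 0xE000    # first Private Use Area code point; assumed absent from input
MAX_CODES = 6400  # size of the Private Use Area


def build_dictionary(text: str, min_len: int = 3, min_freq: int = 2) -> list:
    """Collect phrases that pass both the length and the frequency threshold."""
    counts = Counter(PHRASE.findall(text))
    frequent = [p for p, n in counts.most_common()
                if len(p) >= min_len and n >= min_freq]
    return frequent[:MAX_CODES]


def encode(text: str, dictionary: list) -> str:
    """Replace every dictionary phrase with its one-character identifier."""
    codes = {phrase: chr(BASE + i) for i, phrase in enumerate(dictionary)}
    return PHRASE.sub(lambda m: codes.get(m.group(), m.group()), text)


def decode(encoded: str, dictionary: list) -> str:
    """Expand identifiers back into the phrases they stand for."""
    limit = BASE + len(dictionary)
    return "".join(dictionary[ord(c) - BASE] if BASE <= ord(c) < limit else c
                   for c in encoded)


text = ("<orders>"
        + "<order><customer>Ann</customer><status>shipped</status></order>" * 3
        + "</orders>")
dictionary = build_dictionary(text)
encoded = encode(text, dictionary)
print(len(text), len(encoded), decode(encoded, dictionary) == text)
```

The encoded text, together with the dictionary, would then be passed to a general-purpose back-end such as LZMA.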

A modification of XWRT, QXT, offers lower compression ratios in exchange for allowing partial decompression of the XML document and fast searching [8]. The main differences from XWRT are that QXT creates an individual container for the contents of each XML element, and packs data into blocks that can be decompressed separately.
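The trade-off behind separately decompressible blocks can be sketched as follows. The block size, the newline separator and the function names `pack_blocks` and `read_value` are our assumptions for illustration; this is not QXT's actual block format.

```python
import zlib


def pack_blocks(values: list, block_size: int = 4) -> list:
    """Compress a container in independent blocks of block_size values each."""
    blocks = []
    for start in range(0, len(values), block_size):
        # Assumes that values contain no newline characters.
        chunk = "\n".join(values[start:start + block_size]).encode("utf-8")
        blocks.append(zlib.compress(chunk))
    return blocks


def read_value(blocks: list, index: int, block_size: int = 4) -> str:
    """Fetch one value by decompressing only the block that holds it."""
    chunk = zlib.decompress(blocks[index // block_size]).decode("utf-8")
    return chunk.split("\n")[index % block_size]


# Hypothetical container: the contents of all <customer> elements.
customers = [f"Customer {i}" for i in range(10)]
blocks = pack_blocks(customers)
print(len(blocks), read_value(blocks, 9))
```

Each block is compressed without knowledge of the others, so redundancy across blocks is not exploited; this is one reason why such a layout yields lower compression ratios than compressing the whole container at once.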

3. ECONOMIC CONSEQUENCES OF APPLYING DATA COMPRESSION TO

In document INFORMATION SYSTEMS IN MANAGEMENT VI (Page 101-103)