An Experimental Investigation of XML Compression Tools
Sherif Sakr
National ICT Australia, 223 Anzac Parade, NSW 2052, Sydney, Australia
ABSTRACT
This paper presents an extensive experimental study of the state of the art in XML compression tools. The study reports the behavior of nine XML compressors using a large corpus of XML documents that covers the different natures and scales of XML documents. In addition to assessing and comparing the performance characteristics of the evaluated XML compression tools, the study tries to assess the effectiveness and practicality of using these tools in the real world. Finally, we provide guidelines and recommendations that can help developers and users make an effective decision when selecting the most suitable XML compression tool for their needs.
1. INTRODUCTION
The eXtensible Markup Language (XML) has been acknowledged as one of the most useful and important technologies to emerge from the immense popularity of HTML and the World Wide Web. Due to the simplicity of its basic concepts and the theories behind it, XML has been used to solve numerous problems, such as providing a neutral data representation between completely different architectures, bridging the gap between software systems with minimal effort, and storing large volumes of semi-structured data. XML is often referred to as self-describing data because it is designed in a way that the schema is repeated for each record in the document. On one hand, this self-describing feature grants XML great flexibility; on the other hand, it introduces the main problem of verbosity, which results in huge document sizes. This huge size means that the amount of information that has to be transmitted, processed, stored, and queried is often larger than that of other data formats. Since XML usage continues to grow and large repositories of XML documents are now pervasive, there is a great demand for efficient XML compression tools. To tackle this problem, several research efforts have proposed the use of XML-conscious compressors which exploit the well-known structure of XML documents to achieve compression ratios that are better than those of general text compressors. The use of XML compression tools has many advantages, such as reducing the network bandwidth required for data exchange, reducing the disk space required for storage, and minimizing the main memory requirements of processing and querying XML documents. Experimental evaluation and comparison of different techniques and algorithms that address the same problem is a crucial aspect, especially in applied domains of computer science. This paper presents an extensive experimental study evaluating the state of the art in XML compression tools.
We examine the performance characteristics of nine publicly available XML compression tools against a wide variety of data sets consisting of 57 XML documents. The web page of this study [1] provides access to the test files, the examined XML compressors, and the detailed results of this study.

The remainder of this paper is organized as follows. Section 2 briefly introduces the XML compression tools examined in our study and classifies them in different ways. Section 3 presents the data sets used to perform the experiments. Section 4 describes the test environments. Detailed and consolidated results of our experiments are presented in Section 5 before we draw our final conclusions in Section 6.
2. SURVEY OF XML COMPRESSION TOOLS

2.1 Features and Classifications
A very large number of XML compressors have been proposed in the literature in recent years. These XML compressors can be classified with respect to three main characteristics. The first classification is based on their awareness of the structure of the XML documents. According to this classification, compressors are divided into two main groups:

• General Text Compressors: Since XML data are stored as text files, the first logical approach for compressing XML documents was to use traditional general-purpose text compression tools. This group of XML compressors [6, 2, 23] is XML-blind: it treats XML documents as ordinary plain text documents and applies traditional text compression techniques [33].

• XML Conscious Compressors: This group of compressors is designed to take advantage of awareness of the XML document structure to achieve better compression ratios than the general text compressors. This group can be further classified according to its dependence on the availability of the schema information of the XML documents as follows:

– Schema dependent compressors, where both the encoder and decoder must have access to the document schema information to achieve the compression process [3, 11, 25, 8].

– Schema independent compressors, where the availability of the schema information is not required to achieve the encoding and decoding processes [29, 21, 9, 18].

Although schema dependent compressors may, theoretically, be able to achieve slightly higher compression ratios, they are not preferable or commonly used in practice because there is no guarantee that the schema information of the XML documents is always available. The second classification of XML compressors is based on their ability to support queries.

• Non-Queriable (Archival) XML Compressors: This group of XML compressors does not allow any queries to be processed over the compressed format [29, 21, 11, 5, 9]. The main focus of this group is to achieve the highest compression ratio. By default, general-purpose text compressors belong to the non-queriable group of compressors.

• Queriable XML Compressors: This group of XML compressors allows queries to be processed over their compressed formats [13, 34, 30]. The compression ratio of this group is usually worse than that of the archival XML compressors. However, the main focus of this group is to avoid full document decompression during query execution. In fact, the ability to perform direct queries on compressed XML formats is important for many applications which are hosted on resource-limited computing devices such as mobile devices and GPS systems. By default, all queriable compressors are XML conscious compressors as well.

The third classification considers whether the compression schemes operate in an online or offline manner.

• Online Compressors are able to stream the compressed data to the decoder, i.e. the decoder is able to begin the decompression process before the encoder has finished transmitting the compressed data.

• Offline Compressors do not allow the decoder to begin the decompression process until the entire compressed file has been received.

The online feature of XML compression tools can be very important for scenarios where users are heavily exchanging compressed XML documents over networks. In these scenarios, online decompression processors can effectively decrease the network latency during the transmission process. Table 1 lists the symbols that indicate the features of each XML compressor included in the list of Table 2.

Symbol  Description
G       General Text Compressor
S       Specific XML Compressor
D       Schema Dependent Compressor
I       Schema Independent Compressor
A       Archival XML Compressor
Q       Queriable XML Compressor
O       Online XML Compressor
F       Offline XML Compressor
Table 1: List of symbols for XML compressor features
In our study we considered, to the best of our knowledge, all XML compression tools which fulfill the following conditions:

1. It is publicly and freely available, either as open source code or as a binary version.

2. It is schema-independent. As previously mentioned, compressors which do not fulfill this condition are not commonly used in practice.

3. It is able to run under our Linux operating system.

Table 2 lists the surveyed XML compressors and their features, where bold font is used to indicate the compressors which fulfill our conditions and are included in our experimental investigation. The border line between the upper section and the lower section of Table 2 differentiates between the non-queriable (upper section) and queriable (lower section) sets of XML compressors. Three compressors (DTDPPM, XAUST, rngzip) have not been included in our study because they do not satisfy Condition 2. Although its source code is available and can be successfully compiled, XGrind did not satisfy Condition 3: it always gives a fixed run-time error message during execution. The rest of the list (11 compressors) do not satisfy Condition 1. The lack of source code or binaries for a large number of the XML compressors proposed in the literature, to the best of our search efforts and contact with the authors, and especially for the queriable class [30, 32, 34], was disappointing for us. This has limited a subset of our initially planned experiments, especially those targeted towards assessing the performance of evaluating XML queries over the compressed representations. In the following we give a brief description of each examined compressor.
General Text Compressors: Numerous algorithms have been devised over the past decades to efficiently compress text data. In our evaluation study we selected three compressors which are considered to be the best representative implementations of the most popular and efficient text compression techniques: gzip [6], bzip2 [2] and PPM [23].
XMill: In [29], Liefke and Suciu presented the first implementation of an XML conscious compressor. In XMill, the structural and data value parts of the source XML document are collected and compressed separately. In the structure part, XML tags and attributes are encoded in a dictionary-based fashion before being passed to a back-end general text compression scheme. Data values are grouped into homogeneous and semantically related containers according to their path and data type. Each container is then compressed separately using a specialized compressor that is ideal for the data type of that container. In the latest versions of the XMill source distribution, the intermediate binaries of the compressed format can be passed to one of three alternative back-end general-purpose compressors: gzip, bzip2 and PPM. In our experiments we evaluated the performance of the three alternative back-ends independently. Hence, in the rest of the paper we refer to them as XMillGzip, XMillBzip and XMillPPM respectively.
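XMill's separation of structure and data can be sketched as follows. This is a minimal illustration, not XMill's actual implementation: tag names are dictionary-encoded into a structure stream, and text values are grouped into one container per path, ready to be handed to a back-end compressor.

```python
# Minimal sketch of XMill-style structure/data separation (illustrative,
# not XMill's actual code): tags are dictionary-encoded into a structure
# stream, and text values are grouped into one container per path.
import xml.etree.ElementTree as ET
from collections import defaultdict

def separate(xml_text):
    tag_dict = {}                   # tag name -> integer code
    structure = []                  # encoded structure stream
    containers = defaultdict(list)  # path -> list of data values

    def walk(elem, path):
        code = tag_dict.setdefault(elem.tag, len(tag_dict))
        structure.append(code)
        p = path + "/" + elem.tag
        if elem.text and elem.text.strip():
            containers[p].append(elem.text.strip())
        for child in elem:
            walk(child, p)
        structure.append(-1)        # end-of-element marker

    walk(ET.fromstring(xml_text), "")
    return tag_dict, structure, dict(containers)

doc = "<a><b>1</b><b>2</b><c>x</c></a>"
tags, struct, conts = separate(doc)
# containers group semantically related values:
# {'/a/b': ['1', '2'], '/a/c': ['x']}
```

Because each container now holds values of the same path (and typically the same type), a back-end compressor sees far more local redundancy than it would in the interleaved original document.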
XMLPPM is considered an adaptation of the general-purpose Prediction by Partial Matching (PPM) compression scheme [23]. In [21], Cheney presented XMLPPM as a streaming XML compressor which uses a Multiplexed Hierarchical PPM Model called MHM. The main idea of this MHM model is to use four different PPM models for compressing the different XML symbols: element, attribute, character and miscellaneous data.

SCMPPM is described by Adiego et al. in [19] as a variant of the XMLPPM compressor. It combines a technique called Structure Context Modelling (SCM) with the PPM compression scheme. It uses a bigger set of PPM models than XMLPPM, as it uses a separate model to compress the text content under each element symbol.

XWRT is presented by Skibinski et al. in [35]. It applies a dictionary-based compression technique called the XML Word Replacing Transform. The idea of this technique is to replace frequently appearing words with references to a dictionary which is obtained by a preliminary pass over the data. XWRT submits the encoded results of the preprocessing step to three alternative general-purpose compression schemes: gzip, LZMA and PPM.

Axechop is presented by Leighton et al. in [27]. It divides the source XML document into structural and data segments. The MPM compression algorithm is used to generate a context-free grammar for the structural segment, which is then passed to an adaptive arithmetic coder. The data segment contents are organized into a series of containers (one container for each element) before applying the Burrows-Wheeler Transformation (BWT) compression [20] over each container.

Exalt: In [36], Toman presented the idea of applying a syntax-oriented approach to compressing XML documents. It is similar to AXECHOP in that it exploits the fact that an XML document can be represented using a context-free grammar. It uses the grammar-based code encoding technique introduced by Kieffer and Yang in [26] to encode the generated context-free grammars.
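The word-replacing transform used by XWRT can be sketched as follows. This is an illustrative toy version, not XWRT's actual code: a preliminary pass collects frequent words, a second pass replaces them with short dictionary references, and the transformed stream is handed to a general-purpose back-end (zlib here as a stand-in for gzip).

```python
# Illustrative sketch of an XWRT-style word-replacing transform (not the
# actual XWRT implementation): a preliminary pass collects frequent words,
# a second pass replaces them with short dictionary references, and the
# result is passed to a general-purpose back-end compressor (zlib here).
import re
import zlib
from collections import Counter

def wrt_encode(text, min_freq=3):
    words = re.findall(r"\w+", text)
    # first pass: build the dictionary of frequently appearing words
    frequent = [w for w, c in Counter(words).most_common() if c >= min_freq]
    table = {w: "\x01%d\x02" % i for i, w in enumerate(frequent)}
    # second pass: replace each frequent word with its dictionary reference
    transformed = re.sub(r"\w+",
                         lambda m: table.get(m.group(0), m.group(0)),
                         text)
    return frequent, zlib.compress(transformed.encode())

text = "<row><name>bob</name></row>" * 100
dictionary, compressed = wrt_encode(text)
# frequent words ('row', 'name', 'bob') are shortened before the
# back-end compressor ever sees the stream
```

The decoder would simply reverse the substitution using the transmitted dictionary before (or while) restoring the document.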
3. OUR CORPUS

3.1 Corpus Characteristics
Determining the XML files that should be used for evaluating the set of XML compression tools is not a simple task. To provide an extensive set of experiments for assessing and evaluating the performance characteristics of the XML compression tools, we have collected and constructed a large corpus of XML documents. This corpus contains a wide variety of XML data sources and document sizes. Table 3 lists the characteristics of the documents of our corpus.
Compressor            Features  Code Available
GZIP (1.3.12) [6]     GAIF      Y
BZIP2 (1.0.4) [2]     GAIF      Y
PPM (j.1) [7]         GAIF      Y
XMill (0.7) [15]      SAIF      Y
XMLPPM (0.98.3) [16]  SAIO      Y
SCMPPM (0.93.3) [9]   SAI       Y
XWRT (3.2) [18]       SAI       Y
Exalt (0.1.0) [5]     SAIF      Y
AXECHOP [27]          SAIF      Y
DTDPPM [3]            SADO      Y
XAUST [11]            SAD       Y
rngzip [8]            SQD       Y
Millau [25]           SADO      N
XComp [28]            SAIF      N
XGrind [13]           SQIO      Y
XBzip [24]            SQI       N
XQueC [17]            SQI       N
XCQ [32]              SQIO      N
XPress [31]           SQIO      N
XQzip [22]            SQI       N
XSeq [30]             SQI       N
QXT [34]              SQI       N
ISX [37]              SQI       N
Table 2: XML Compressors List
Size denotes the disk space of the XML file in MBytes. Tags represents the number of distinct tag names in each XML document. Nodes represents the total number of nodes in each XML data set. Depth is the length of the longest path in the data set.
DataRatio represents the percentage of the size of data values with respect to the document size of each XML file. The documents are selected to cover a wide range of sizes, where the smallest document is 0.5 MB and the biggest document is 1.3 GB. The documents of our corpus can be classified into four categories depending on their characteristics:

• Structural documents: this group of documents has no data contents at all; 100% of each document's size is devoted to its structure information. This category of documents is used to assess the claim of XML conscious compressors of using the well-known structure of XML documents to achieve higher compression ratios on the structural parts of XML documents. Initially, our corpus consisted of 30 XML documents. Three of these documents were generated using our own Java-based random XML generator. This generator produces completely random XML documents to a parameterized arbitrary depth with only structural information (no data values). In addition, we created a structural copy for each of the other 27 original documents (with data values) of the corpus. Thus, each structural copy captures the structure information of the associated original XML copy and removes all data values. In the rest of this paper, we refer to the documents which include the data values as original documents and to the documents with no data values as structural documents. As a result, the final status of our corpus was 57 documents: 27 original documents and 30 structural documents. The sizes of our own 3 randomly generated documents (R1, R2, R3) are indicated in Table 3, and the size of the structural copy of each original document can be computed using the following equation:

size(structural) = (1 - DR) * size(original)

where DR represents the data ratio of the document.
• Textual documents: this category of documents has a simple structure, with a high proportion of its content devoted to data values. The data contents of these documents represent more than 70% of the document size.

• Regular documents: consists mainly of documents with a regular structure and short data contents. This category reflects the XML view of relational data. The data ratio of these documents is in the range of 40 to 60 percent.

• Irregular documents: consists of documents that have a very deep, complex and irregular structure. Similar to purely structural documents, this category mainly focuses on evaluating the efficiency of compressing irregular structural information of XML documents.
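The structural-copy size relation above can be checked with a quick worked example, using the XMark1.xml figures from Table 3 (11.40 MB, data ratio 0.74):

```python
# Worked example of size(structural) = (1 - DR) * size(original),
# using the XMark1.xml figures from Table 3 (11.40 MB, data ratio 0.74).
def structural_size(original_mb, data_ratio):
    return (1 - data_ratio) * original_mb

size_mb = structural_size(11.40, 0.74)
# (1 - 0.74) * 11.40 = 2.964 MB of pure structure
```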
Our data set consists of the following documents:
EXI-Group is a variant collection of XML documents included in the testing framework of the Efficient XML Interchange Working Group [4].
XMark-Group: the XMark documents model an auction database with deeply-nested elements. The XML document instances of the XMark benchmark are produced by the xmlgen tool of the XML benchmark project [14]. For our experiments, we generated three XML documents using three increasing scaling factors.
XBench-Group presents a family of benchmarks that captures different XML application characteristics [12]. The databases it generates come in two main models: 1) the Data-centric (DC) model contains data that are not originally stored in XML format, such as e-commerce catalog data and transactional data; 2) the Text-centric (TC) model represents text data that are more likely to be stored as XML. Each of these two models can be represented either in the form of a single document (SD) or multiple documents (MD). In short, these two levels of classification are combined to generate four database instances: TCSD, DCSD, TCMD, DCMD. In addition, XBench can generate databases in four different sizes: small (11MB), normal (108MB), large (1GB) and huge (10GB). In our experiments, we only use the TCSD and DCSD instances in the small and normal sizes.
Wikipedia-Group: Wikipedia offers free copies of all content to interested users [10]. For our corpus, we selected five samples of the XML dumps with different sizes and characteristics.
DBLP is the well-known database of bibliographic information on computer science journals and conference proceedings.
U.S. House is a legislative document which provides information about the ongoing work of the U.S. House of Representatives.
SwissProt is a curated protein sequence database. It provides a high level of annotation and a minimal level of redundancy.
NASA is an astronomical database constructed by converting legacy flat-file formats into XML documents which are then made available to the public.
Shakespeare is a collection of marked-up Shakespeare plays gathered into a single XML file. It contains many long textual passages.
Lineitem is an XML representation of the transactional relational database benchmark (TPC-H).
Mondial provides basic statistical information on the countries of the world.
BaseBall provides complete baseball statistics for all players of each team that participated in the 1998 Major League season.
Treebank is a large collection of parsed English sentences from the Wall Street Journal. It has a very deep, non-regular and recursive structure.
Random-Group: this group of documents has been generated using our own implementation of a Java-based random XML generator. This generator is designed to produce structural documents with very random, irregular and deep structures according to its input parameters for the number of unique tag names, maximum tree level and document size. We used this XML generator to produce three documents with different sizes and characteristics. The main aim of this group is to challenge the examined compressors and assess the efficiency of compressing the structural parts of XML documents.
4. TESTING ENVIRONMENTS
To ensure the consistency of the performance behavior of the evaluated XML compressors, we ran our experiments in two different environments: one with high computing resources and the other with considerably limited computing resources. Table 4 lists the setup details of our high-resources environment and Table 5 lists the setup details of the limited one.
Operating System  Ubuntu 7.10 (Linux 2.6.22 Kernel)
CPU               Intel Core 2 Duo E6850, 3.00 GHz, FSB 1333 MHz, 4 MB L2 Cache
Hard Disk         Seagate ST3250820AS, 250 GB
RAM
Compilers         gcc/g++ 4.1

Table 4: Setup details of the powerful resources environment

Operating System  Ubuntu 7.10 (Linux 2.6.20 Kernel)
CPU               Intel Pentium 4, 2.66 GHz, FSB 533 MHz, 512 KB L2 Cache
Hard Disk         Western Digital WD400BB, 40 GB
RAM               512 MB
Compilers         gcc/g++ 4.1
Table 5: Setup details of the low resources environment

Data Set Name   Document Name                  Size (MB)  Tags  Number of Nodes  Depth  Data Ratio
EXI [4]         EXI-Telecomp.xml               0.65       39    651398           7      0.48
                EXI-Weblog.xml                 2.60       12    178419           3      0.31
                EXI-Invoice.xml                0.93       52    78377            7      0.57
                EXI-Array.xml                  22.18      47    1168115          10     0.68
                EXI-Factbook.xml               4.12       199   104117           5      0.53
                EXI-Geographic Coordinates.xml 16.20      17    55               3      1
XMark [14]      XMark1.xml                     11.40      74    520546           12     0.74
                XMark2.xml                     113.80     74    5167121          12     0.74
                XMark3.xml                     571.75     74    25900899         12     0.74
XBench [12]     DCSD-Small.xml                 10.60      50    6190628          8      0.45
                DCSD-Normal.xml                105.60     50    6190628          8      0.45
                TCSD-Small.xml                 10.95      24    831393           8      0.78
                TCSD-Normal.xml                106.25     24    8085816          8      0.78
Wikipedia [10]  EnWikiNews.xml                 71.09      20    2013778          5      0.91
                EnWikiQuote.xml                127.25     20    2672870          5      0.97
                EnWikiSource.xml               1036.66    20    13423014         5      0.98
                EnWikiVersity.xml              83.35      20    3333622          5      0.91
                EnWikTionary.xml               570.00     20    28656178         5      0.77
DBLP            DBLP.xml                       130.72     32    4718588          5      0.58
U.S House       USHouse.xml                    0.52       43    16963            16     0.77
SwissProt       SwissProt.xml                  112.13     85    13917441         5      0.60
NASA            NASA.xml                       24.45      61    2278447          8      0.66
Shakespeare     Shakespeare.xml                7.47       22    574156           7      0.64
Lineitem        Lineitem.xml                   31.48      18    2045953          3      0.19
Mondial         Mondial.xml                    1.75       23    147207           5      0.77
BaseBall        BaseBall.xml                   0.65       46    57812            6      0.11
Treebank        Treebank.xml                   84.06      250   10795711         36     0.70
Random          Random-R1.xml                  14.20      100   1249997          28     0
                Random-R2.xml                  53.90      200   3750002          34     0
                Random-R3.xml                  97.85      300   7500017          30     0
Table 3: Characteristics of XML data sets
5. EXPERIMENTS
We evaluated the performance characteristics of the XML compressors by running them through an extensive set of experiments. The setup of our experimental framework was very challenging and complex. The details of this experimental framework are as follows:

• We evaluated 11 XML compressors: 3 general-purpose text compressors (gzip, bzip2, PPM) and 8 XML conscious compressors (XMillGzip, XMillBzip, XMillPPM, XMLPPM, SCMPPM, XWRT, Exalt, AXECHOP). For our main set of experiments, we evaluated the compressors under their default settings. The rationale behind this is that the default settings are considered to be the recommended settings from the developers of each compressor and thus can be assumed to give the best behaviour. In addition to this main set of experiments, we ran an additional set of experiments with tuned parameters using the highest value of the compression-level parameter provided by some compressors (gzip, bzip2, PPM, XMillPPM, XWRT). In total, we thus ran 16 compressor variants. The experiments with the tuned version of XWRT could only be performed on the high-resource setup because they require at least 1 GB of RAM.

• Our corpus consists of 57 documents: 27 original documents, 27 structural copies and 3 randomly generated structural documents (see Section 3.1).

• We ran the experiments on two different platforms: one with limited computing resources and the other with high computing resources.

• For each combination of an XML test document and an XML compressor, we ran two different operations (compression and decompression).
• To ensure accuracy, all reported numbers for our time metrics (compression time and decompression time; see Section 5.2) are the average of five executions with the highest and the lowest values removed.

The above details lead to a total of 9120 runs on each experimental platform (16 * 57 * 2 * 5), i.e. 18240 runs in total. In addition to running this huge set of experiments, we needed to find the best way to collect, analyze and present this huge amount of experimental results. To tackle this challenge, we created our own mix of Unix shell and Perl scripts to run and collect the results of this huge number of runs. In this paper, we present an important part of the results of our experiments. For full detailed results, we refer the reader to the web page of this experimental study [1].
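The timing protocol described above (five executions, with the highest and lowest values discarded before averaging) can be sketched as:

```python
# Sketch of the timing protocol described above: run five times, drop the
# highest and lowest measurements, and average the remaining three.
def trimmed_mean(times):
    assert len(times) == 5
    kept = sorted(times)[1:-1]   # discard the min and max values
    return sum(kept) / len(kept)

runs = [12.1, 11.8, 30.5, 11.9, 12.0]  # one outlier, e.g. a cold cache
average = trimmed_mean(runs)
# (11.9 + 12.0 + 12.1) / 3 = 12.0 seconds
```

Discarding the extremes makes the reported time robust against one-off disturbances such as cold caches or background activity.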
During the run of our experiments, some tools failed to either compress or decompress some of the documents in our corpus. We consider a run unsuccessful if the compressor fails to complete either the encoding or the decoding process for the test document. Thus, we had 57 runs for each compressor (one run per document). Figure 1 presents the percentage of unsuccessful runs of each compressor. For a detailed list of the errors generated during our experiments we refer to the web page of this study [1]. We have two main remarks about the results of Figure 1:

• The general-purpose text compressors have shown complete stability: they were able to successfully perform the complete set of runs. They are XML-blind and thus require no knowledge of the document structure. Hence, they can deal with any XML document even if it suffers from syntax or well-formedness problems. However, XML conscious compressors are very sensitive to such problems. For example, some compressors which use the Expat XML parser, such as XMLPPM, will fail to compress any XML document which uses external entity references but lacks a dummy DTD declaration, because the XML parser strictly applies the W3C specification and will consider such a document not well-formed.

Figure 1: Percentage of unsuccessful runs of each compressor.

• Except for the latest version of XMLPPM (0.98.3), none of the XML conscious compressors was able to execute the whole set of runs successfully. Moreover, the AXECHOP and Exalt compressors showed very poor stability: they failed to successfully run the decoding parts of many runs. They were thus excluded from any consolidated results. Although an earlier version of XMLPPM (0.98.2) suffered from some problems in decompressing the Wikipedia data sets, the latest version of XMLPPM (0.98.3), released by Cheney while the experiments of this work were being conducted, has fixed all earlier bugs and has shown itself to be the best XML conscious compressor from the stability point of view.
We measure and compare the performance of the XML compression tools using the following metrics:
Compression Ratio: represents the ratio between the sizes of the compressed and uncompressed XML documents.
Compression Ratio = (Compressed Size) / (Uncompressed Size)
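This metric can be demonstrated directly, here with zlib (as a stand-in for a gzip-style back-end) on a repetitive XML fragment:

```python
# The compression-ratio metric as defined above (lower is better),
# demonstrated with zlib on a highly repetitive XML fragment.
import zlib

def compression_ratio(compressed_size, uncompressed_size):
    return compressed_size / uncompressed_size

data = ("<item><price>10</price></item>" * 1000).encode()
ratio = compression_ratio(len(zlib.compress(data)), len(data))
# highly repetitive XML compresses to a small fraction of its size
```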
Compression Time: represents the elapsed time of the compression process, i.e. the period from the start of program execution on a document until all the data are written to disk.
Decompression Time: represents the elapsed time of the decompression process, i.e. the period from the start of program execution reading the compressed format of the XML document until the original document is delivered.

For all metrics: the lower the metric value, the better the compressor. In this section we report the results obtained by running our exhaustive set of experiments. Figures 2 to 12 represent an important part of the results of our experiments. Several remarks and guidelines can be observed from these results. Some key remarks are given as follows:

• The results of Figure 2 and Figure 3 show that the tuned run of XWRT with the highest compression level achieves the overall best average compression ratio, at a very expensive cost in terms of compression and decompression times.

• Figure 10(a) shows that the three alternative back-ends of the XMill compressor achieve the best compression ratios over the structural documents. Figure 4 shows that XMillPPM achieves the best compression ratio over all these data sets. The irregular structural documents (Treebank, R1, R2, R3) are very challenging for the whole set of compressors, which explains why they all had their worst compression ratios on them.

• Figures 5 and 10 show that the gzip-based compressors (gzip, XMillGzip) have the worst compression ratios. Excluding these two compressors, Figure 10 shows that the differences in average compression ratio between the rest of the compressors are very narrow: they are very close to each other, and the difference between the best and the worst average compression ratios is less than 5%. Among all compressors, SCMPPM achieves the best average compression ratio.

• Figures 6, 7, 8 and 9 show that the gzip-based compressors have the best performance in terms of the compression time and decompression time metrics in both testing environments. The compression and decompression times of the PPM-based compressors (XMillPPM, XMLPPM, SCMPPM) are much slower than those of the other compressors. Among all compressors, SCMPPM has the longest compression and decompression times.

• Figure 11 illustrates the overall performance of the XML compressors on the high and limited resources setups, where the values of the performance metrics are normalized with respect to bzip2. The results of this figure illustrate the narrow differences between the XML compressors in terms of their compression ratios and the wide differences in terms of their compression and decompression times.
Obviously, it would be nice to use the results of our experiments and our performance metrics to provide a global ranking of the XML compression tools. This is, however, an especially hard task. In fact, the results of our experiments have not shown a clear winner. Hence, different ranking methods and different weights for the factors could be used for this task. Deciding the weight of each metric depends mainly on the scenarios and requirements of the applications in which these compression tools are to be used. In this paper we used three ranking functions which give different weights to our performance metrics. These three ranking functions are defined as follows:

• WF1 = (1/3 * CR) + (1/3 * CT) + (1/3 * DCT)

• WF2 = (1/2 * CR) + (1/4 * CT) + (1/4 * DCT)

• WF3 = (3/5 * CR) + (1/5 * CT) + (1/5 * DCT)

where CR represents the compression ratio metric, CT represents the compression time metric and DCT represents the decompression time metric. In these ranking functions we used increasing weights for the compression ratio (CR) metric (33%, 50% and 60%) while CT and DCT equally shared the remaining weight percentage of each function.

Figure 2: Average Compression Ratios (Tuned Parameters)

Figure 3: Average compression and decompression times over original documents on the high resources setup (Tuned Parameters)

Figure 12 shows that gzip and XMillGzip are ranked as the best compressors by all three ranking functions and in both testing environments. In addition, Figure 12 illustrates that none of the XML compression tools has shown a significant or noticeable improvement with respect to the compression ratio metric: the increasing weight assigned to CR does not change the order of the global ranking between the three ranking functions.
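The three weighted ranking functions (with weights as stated in the text: equal thirds; 50/25/25; 60/20/20) can be applied to metric values normalized with respect to bzip2, in the spirit of Figure 11. The sample triples below are illustrative placeholders, not measured results.

```python
# The three ranking functions described above, applied to metric values
# normalized with respect to bzip2 (as in Figure 11). The sample numbers
# below are illustrative placeholders, not measured results.
def wf1(cr, ct, dct): return cr / 3 + ct / 3 + dct / 3
def wf2(cr, ct, dct): return cr / 2 + ct / 4 + dct / 4
def wf3(cr, ct, dct): return 3 * cr / 5 + ct / 5 + dct / 5

def rank(compressors, wf):
    # lower weighted score = better compressor
    return sorted(compressors, key=lambda name: wf(*compressors[name]))

# hypothetical normalized (CR, CT, DCT) triples, with bzip2 = (1, 1, 1)
compressors = {"bzip2":  (1.0, 1.0, 1.0),
               "gzip":   (1.2, 0.2, 0.1),
               "scmppm": (0.9, 4.0, 5.0)}
best = rank(compressors, wf1)[0]
# with these sample values, gzip's large time advantage outweighs its
# slightly worse compression ratio under all three weightings
```

This mirrors the paper's observation: because the spread in compression ratios is narrow while the spread in times is wide, increasing the CR weight from 33% to 60% does not change the ranking order.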
6. CONCLUSION
We believe that this paper can be valuable for both the developers of new XML compression tools and interested users. Developers can use the results of this paper to decide which points can be improved in order to make an effective contribution. For this category of readers, we recommend tackling the area of developing stable, efficient, queriable XML compressors. Although a lot of literature has been presented in this domain, our experience from this study leads us to the conclusion that we are still missing efficient, scalable and stable implementations in this domain. For users, this study can be helpful for making an effective decision in selecting the suitable compressor for their requirements. For example, for users with the highest compression ratio requirement, the results of Figure 2 recommend the use of either the PPM compressor with the highest compression-level parameter (ppmd e -o16 document.xml) or the XWRT compressor with the highest compression-level parameter (xwrt -l14 document.xml), if they have more than 1 GB of RAM on their systems. For users with fastest compression time and moderate compression ratio requirements, gzip and XMillGzip are considered to be the best choice (Figure 12).

From the experience and the results of this experimental study, we can draw the following conclusions and recommendations:

• The primary innovation in XML compression mechanisms was presented in the first implementation in this domain, XMill. It introduced the idea of separating the structural part of the XML document from the data part and then grouping related data items into homogeneous containers that can be compressed separately. This separation improves the subsequent compression of these homogeneous containers by general-purpose compressors or any other compression mechanism, because redundant data can be detected more easily. Most of the following XML compressors have adopted this idea in different ways.

• The dominant practice in most XML compressors is to utilize the well-known structure of XML documents to apply a pre-processing encoding step and then forward the results of this step to general-purpose compressors.
Consequently, the compressionratio of most XML conscious compressor is very depen-dent and related on the general purpose compressorssuch as: gzip, bzip2 or PPM. Figure 10 shows thatnone of the XML conscious compressors has achievedan outstanding compression ratio over its back-endgeneral purpose compressor. The improvements arealways not significant with 5% being the best of cases.This fact could explain why XML conscious compres-sors are not widely used in practice. • The compression time and decompression time metricsplay a crucial role in the ranking of XML compressors. • The authors of the XML compression tools should pro-vide more attention to provide the source code of theirimplementations available. Many tools presented inthe literature - specially the queriable ones - have noavailable source code which prevents the possibility ofensuring the repeatability of the reported numbers. Italso hinders the possibility of performing fair and con-sistent comparisons between the different approaches.For example in [30], the authors compared the resultsof their implementation
Xseq with
XBzip using an in-consistent way. They used the reported query evalua-tion time of
XBzip in [24] to compare with their timeslthough each of the implementation is running on adifferent environment. • There are no publicly available solid implementationsfor grammar-based XML compression techniques andqueriable XML compressors. These two areas providemany interesting avenues for further research and de-velopment.As a future work, we are planning to continue maintainingand updating the web page of this study with further eval-uations of any new evolving XML compressors. In addition,we will enable the visitor of our web page to perform theironline experiments using the set of the available compressorsand their own XML documents.
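The container-separation idea credited above to XMill can be made concrete with a minimal sketch. The code below is not XMill's actual implementation: the path-based container keys, the newline joining, and the bzip2 back end are our own simplifications (XMill itself defaults to a gzip back end and supports user-defined container grouping). It only illustrates why the separation helps: values from the same element path end up adjacent in one stream, where a general-purpose compressor can detect their redundancy.

```python
import bz2
import io
import xml.etree.ElementTree as ET

def compress_xml(xml_text):
    """XMill-style sketch: split an XML document into a structure
    stream plus homogeneous per-path data containers, then compress
    each stream separately with a general-purpose compressor."""
    structure = []   # tag events only, no character data
    containers = {}  # element path -> list of text values
    path = []

    for event, elem in ET.iterparse(io.StringIO(xml_text),
                                    events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            structure.append("<" + elem.tag)
        else:
            # On "end" the element's text is fully available; route it
            # to the container named after its path, e.g. "db/rec/name".
            if elem.text and elem.text.strip():
                containers.setdefault("/".join(path), []).append(elem.text.strip())
            structure.append(">")
            path.pop()

    # Similar values now sit together in each stream, so the back-end
    # compressor sees long runs of redundant data.
    streams = {"#structure": "\n".join(structure)}
    for key, values in containers.items():
        streams[key] = "\n".join(values)
    return {k: bz2.compress(v.encode("utf-8")) for k, v in streams.items()}
```

For a document such as `<db><rec><name>a</name></rec>...</db>`, all `name` values land in the single container `db/rec/name`, separated from the structure stream and from other containers before compression.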
7. REFERENCES
DCC '04: Proceedings of the Conference on Data Compression, page 522, Washington, DC, USA, 2004. IEEE Computer Society.
[20] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, 1994.
[21] J. Cheney. Compressing XML with Multiplexed Hierarchical PPM Models. In DCC '01: Proceedings of the Data Compression Conference, page 163, Washington, DC, USA, 2001. IEEE Computer Society.
[22] J. Cheng and W. Ng. XQzip: Querying Compressed XML Using Structural Indexing. In Proceedings of the International Conference on Extending Database Technology (EDBT), volume 2992 of LNCS, pages 219-236. Springer, 2004.
[23] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, COM-32(4):396-402, April 1984.
[24] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Compressing and searching XML data via two zips. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 751-760, New York, NY, USA, 2006. ACM.
[25] M. Girardot and N. Sundaresan. Millau: an encoding format for efficient representation and exchange of XML over the Web. Computer Networks, 33(1-6):747-765, 2000.
[26] J. C. Kieffer and E.-H. Yang. Grammar-Based Codes: A New Class of Universal Lossless Source Codes. IEEE Transactions on Information Theory, 46, 2000.
[27] G. Leighton, J. Diamond, and T. Muldner. AXECHOP: A Grammar-based Compressor for XML. In DCC '05: Proceedings of the Data Compression Conference, page 467, Washington, DC, USA, 2005. IEEE Computer Society.
[28] W. Li. An XML compression tool. Master's thesis, University of Waterloo, 2003.
[29] H. Liefke and D. Suciu. XMill: An efficient compressor for XML data. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, pages 153-164. ACM, 2000.
[30] Y. Lin, Y. Zhang, Q. Li, and J. Yang. Supporting efficient query processing on compressed XML files. In SAC '05: Proceedings of the 2005 ACM Symposium on Applied Computing, pages 660-665, New York, NY, USA, 2005. ACM.
[31] J. Min, M. Park, and C. Chung. XPRESS: A queriable compression for XML data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 122-133. ACM Press, 2003.
[32] W. Ng, W. Lam, P. T. Wood, and M. Levene. XCQ: A queriable XML compression system. Knowledge and Information Systems, 10(4):421-452, 2006.
[33] D. Salomon. Data Compression: The Complete Reference. Springer, 2004.
[34] P. Skibinski and J. Swacha. Combining Efficient XML Compression with Query Processing. In ADBIS, pages 330-342, 2007.
[35] P. Skibinski and J. Swacha. Fast Transform for Effective XML Compression. In CADSM, pages 323-326, 2007.
[36] V. Toman. Compression of XML Data. Master's thesis, Charles University, Prague, 2004.
[37] R. K. Wong, F. Lam, and W. M. Shui. Querying and maintaining a compact XML storage. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 1073-1082, New York, NY, USA, 2007. ACM.
Figure 4: Detailed compression ratios of structural documents.

Figure 5: Detailed compression ratios of original documents.

Figure 6: Detailed compression times on the high resources setup.

Figure 7: Detailed compression times on the limited resources setup.

Figure 8: Detailed decompression times on the high resources setup.

Figure 9: Detailed decompression times on the limited resources setup.

Figure 10: Average compression ratios. (a) Structural documents. (b) Original documents.

Figure 11: Overall performance of compressing original documents. (a) Limited resources setup. (b) High resources setup.

Figure 12: Ranking functions (WF1, WF2, WF3) of compressing original documents. (a) Limited resources setup. (b) High resources setup.