Compressed Data Structures for Binary Relations in Practice
Carlos Quijada-Fuentes, Miguel R. Penabad, Susana Ladra, Gilberto Gutiérrez
CARLOS QUIJADA-FUENTES, MIGUEL R. PENABAD, SUSANA LADRA, and GILBERTO GUTIÉRREZ
Universidad del Bío-Bío, Facultad de Ciencias Empresariales, 3800708, Chillán, Chile (e-mail: [email protected])
Universidade da Coruña, Centro de investigación CITIC, Facultade de Informática, 15071, A Coruña, Spain (e-mail: {miguel.penabad,susana.ladra}@udc.es)
Corresponding author: Carlos Quijada-Fuentes (e-mail: [email protected]).
This research has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie [grant agreement No 690941]; from the Ministerio de Ciencia, Innovación y Universidades (PGE and ERDF) [grant numbers TIN2016-77158-C4-3-R; TIN2016-78011-C4-1-R; RTC-2017-5908-7]; Consellería de Economía e Industria of the Xunta de Galicia through the GAIN (Axencia Galega de Innovación), co-funded with ERDF [grant number IN852A 2018/14]; from Xunta de Galicia (co-funded with ERDF) [grant numbers ED431C 2017/58; ED431G/01]; and from University of Bío-Bío [grant numbers 1921192/R and 195119 GI/VC].
ABSTRACT
Binary relations are commonly used in Computer Science for modeling data. In addition to classical representations using matrices or lists, some compressed data structures have recently been proposed to represent binary relations in compact space, such as the k²-tree and the Binary Relation Wavelet Tree (BRWT). Knowing their storage needs, supported operations and time performance is key for enabling an appropriate choice of data representation given a domain or application, its data distribution and the typical operations that are computed over the data.
In this work, we present an empirical comparison among several compressed representations for binary relations. We analyze their space usage and the speed of their operations using different (synthetic and real) data distributions. We include both neighborhood and set operations, and we also propose algorithms for set operations on the BRWT, which had not been presented before in the literature. We conclude that no single representation clearly outperforms the rest, but we give usage recommendations for each compact representation depending on the data distribution and the types of operations performed over the data. We also include a scalability study of the data representations.
INDEX TERMS
Binary relations, compact data structures, compressed binary relations, k²-trees, BRWT, set operations, neighborhood queries
I. INTRODUCTION
Let A and B be two sets of objects. A binary relation R is defined as a subset of the Cartesian product A × B, where for each element (a, b) ∈ R, we say that a is related to b and denote this as aRb. In the areas of Mathematics and Computer Science, binary relations constitute a fundamental conceptual and methodological tool [1] used for representing properties or relationships among objects in a simple and intelligent way [2]. In Computer Science, binary relations can be modeled by using data structures such as graphs, trees, inverted indices, or discrete grids [3], [4]. Binary relations make it possible to model complex problems: for instance, the connections among pages of a particular Web site, or even among all pages in the World Wide Web (WWW) [5]–[8], or automated recommendation systems, where the users (customers) are related to purchased products [9]–[11].
For these relations we can name a number of neighborhood queries, such as those defined in [12]. Consider a relation R ⊆ A × B, with a total ordering ≤_A in A and a total ordering ≤_B in B. Then, we define the following operations:
• isRelated(x, y) = true if xRy, false otherwise.
• successors(x) = {y ∈ B | xRy}
• predecessors(y) = {x ∈ A | xRy}
• rangeNeighborhood(x_1, y_1, x_2, y_2) = {(x, y) | xRy, x_1 ≤_A x ≤_A x_2 ∧ y_1 ≤_B y ≤_B y_2}
Binary relations have traditionally been stored using either adjacency matrices or adjacency lists. However, due to the growth in the size of the binary relations currently being generated (for example, a graph of the whole WWW), it is convenient to store them using compact data structures. The goal of doing so is to reduce the storage needs (either RAM or disk) while maintaining the capacity of processing the data directly in their compressed form. Reducing the storage size may have the advantage of diminishing, or even removing, the need for I/O operations.
Two of the most widely known compact data structures used to store binary relations are the k²-tree [8] and the Binary Relation Wavelet Tree (BRWT) [13], which is based on a Wavelet Tree [14]. In general, these structures support only some basic operations. For example, the k²-tree was initially proposed to represent Web graphs, so it implemented operations such as isRelated (which tests whether page X links to page Y, called access in the original paper), finding successor or predecessor neighbors, and range neighborhood queries. More recently, in [15], the operations over k²-trees were extended to include set operations, that is, union, intersection, or difference, among others. In the case of BRWTs, there are a number of operations defined over them, such as primitive operations obtaining the labels associated to a given object, or the range of objects associated to a given label. However, no set operations are defined over BRWTs.
When deciding which compact data structure to choose for representing binary relations in the context of a given domain or application, it is very convenient to know in advance the adequacy of the available data structures, in terms of storage needs, supported operations, and time performance. The decision can also consider the frequency of each kind of operation. For example, one application might make intensive use of union and intersection operations, but rarely search for predecessor neighbors or perform range neighborhood queries.
In this work, we present a comparison of three compact data structures that can be used to represent binary relations: k²-tree, k²-tree1 and BRWT (Binary Relation Wavelet Tree). The comparison considers the same operations for all evaluated data structures: set operations (union, intersection, difference, and symmetric difference) and primitive neighborhood operations (isRelated, successors, predecessors, and range neighborhood queries). The goal of this comparison is to facilitate the choice of the most suitable data structure for a particular application or domain. As an additional alternative to compact data structures, we also include in our comparison the representation of the binary relations using compressed adjacency lists (using QMX and Rice-runs encoders).
Another contribution of the current work is the design and implementation of all of the algorithms needed to perform the set operations and the neighborhood queries over binary relations represented with BRWT. Like the operations we use for k²-trees and k²-tree1s, these algorithms operate directly over the compact data structures, without decompressing them.
The rest of this paper is organized as follows. Section II reviews the compact data structures considered in the comparison. Section III describes the algorithms for performing set operations over BRWTs. Section IV shows our empirical evaluation. Finally, the last section offers the overall discussion of the results and some conclusions of this work.
II. PREVIOUS WORK
In this section, we describe the compact data structures and encoders that will be used in our comparison.
A. k²-TREE
A k²-tree [8] is a succinct data structure originally designed to represent Web graphs, but it is able to represent any binary relation.
A k²-tree for a binary relation represented by a matrix of size n × n is built as follows: the root node is associated to the whole matrix, which is divided into k² submatrices (k rows by k columns). For each of these submatrices, a child is added to the root node. We store a 0 in the node if all cells in the submatrix are 0s, or a 1 if any cell contains a 1. We then proceed recursively on all children associated to a 1. The recursion stops when the algorithm processes either an individual cell or a submatrix of 0s, so the resulting tree is not balanced.
By design, k²-trees perform very well when the matrix has a relatively low number of 1s that are clustered together, because large areas of 0s are represented by a single 0 bit in the k²-tree. However, each 1 in the matrix can use more than one bit in the k²-tree, so its behavior worsens if the number of 1s increases. To avoid this problem, a variation of the k²-tree, which we denote k²-tree1 in this work, was designed in [16]. Basically, it represents a uniform submatrix (either full of ones or zeroes) by a single 0 (with an additional bitmap to decide whether it is full of ones or zeroes) and mixed submatrices with ones and zeroes by a 1. The recursion proceeds only for these mixed areas.
Because these are succinct data structures, the conceptual trees described above are not stored explicitly; only the bitmaps of their nodes are. Navigation operations over k²-trees and k²-tree1s are described in [8] and [16], respectively. Set operations for both, including the pseudocode for the algorithms as well as empirical results on their performance, are described in [15].
B. BRWT
A Binary Relation Wavelet Tree, or BRWT [13], is a special type of wavelet tree [17] specifically designed to represent binary relations. An example of the conceptual tree built for a given binary matrix is shown in Figure 1. Each node contains two bitmaps, which correspond to two submatrices: the top and bottom halves of the original matrix. A bitmap position is set to 0 if all values in this column of the submatrix are 0s (as in column 1 for the A-D bitmap in the root node) and it is set to 1 if any cell in this column has a 1 (column 2 at the same bitmap).
The left and right subtrees are built recursively considering the bits set to 1 at the top/bottom bitmaps. For example, column 1 does not appear in the left subtree of the root node. Note also that, like column 2, a column can propagate to both left and right subtrees. This fact makes the BRWT different from the original wavelet trees, because a given level may contain more than n bits, where n is the number of objects.
Interested readers can find in [13] more information about the operations supported by BRWTs and some bounds on their complexity. In the next section we provide detailed information, as well as the pseudocode on which our implementation is based, on the set operations for binary relations implemented in BRWTs.
FIGURE 1: A BRWT example (based on [13]). (a) A binary relation matrix. (b) Its BRWT representation.
Note (k²-tree construction): If the matrix is not square or n is not a power of k, it is conceptually extended to the right and to the bottom with 0s, rounding the size up to the next power of k. This does not cause a significant overhead because the k²-tree can handle large areas of 0s efficiently.
C. COMPRESSED ADJACENCY LISTS
A very naive and (generally) space-consuming representation of a binary relation is an adjacency list. Although a valid representation, it is not suitable if the relation is big and does not fit into main memory. To avoid this problem, the adjacency lists can be compressed.
There are multiple techniques for compressing lists of integers, such as QMX and Rice-runs. QMX [18] is a compression algorithm that combines word-alignment, SIMD instructions, and run-length encoding. It also includes a SIMD-aware intersection algorithm [19]. Rice-runs combines the well-known Rice coding [20] with run-length compression [21], [22]. QMX performs really well for long adjacency lists, where SIMD instructions can be exploited. Rice-runs is especially suitable when the lists contain large sequences of consecutive 1s in the input relation matrix, due to the use of run-length compression, boosting both compression and intersection speed.
These techniques can be applied to compress any input. The authors have already used them to compress binary relations, using their own implementation in [15]. We must note, however, that these are not compact data structures. They are only compression schemes, and the lists must be decompressed before they can be used to efficiently perform the requested operations on the uncompressed data.
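As a rough illustration of how a sorted adjacency list can be compressed, the following sketch Rice-codes the gaps between consecutive neighbors. It is only a simplified example under stated assumptions: it omits the run-length step of Rice-runs, it is unrelated to QMX, it is not the implementation used in [15], and the function names and the Rice parameter `b` are hypothetical choices made for this example.

```python
def rice_encode(values, b):
    """Rice-code non-negative integers with parameter b; returns a '0'/'1' string."""
    out = []
    for v in values:
        q, r = v >> b, v & ((1 << b) - 1)
        out.append("1" * q + "0")                    # quotient in unary, closed by a 0
        if b:
            out.append(format(r, "0{}b".format(b)))  # remainder in exactly b bits
    return "".join(out)


def rice_decode(bits, b, count):
    """Decode `count` integers produced by rice_encode with the same b."""
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1                                     # skip the closing 0
        r = int(bits[pos:pos + b], 2) if b else 0
        pos += b
        values.append((q << b) | r)
    return values


def encode_list(neighbors, b):
    """Encode a strictly increasing adjacency list as Rice-coded gaps."""
    gaps = [y - x - 1 for x, y in zip([-1] + neighbors, neighbors)]
    return rice_encode(gaps, b)


def decode_list(bits, b, count):
    """Inverse of encode_list: rebuild the neighbor ids from the gaps."""
    neighbors, prev = [], -1
    for gap in rice_decode(bits, b, count):
        prev += gap + 1
        neighbors.append(prev)
    return neighbors


row = [3, 4, 5, 9, 20]                 # successors of one object (made-up data)
code = encode_list(row, b=2)
print(len(code), "bits:", code)
print(decode_list(code, 2, len(row)) == row)
```

Consecutive neighbors produce gaps of 0, which cost only `b + 1` bits each here; a run-length step such as the one in Rice-runs would shrink such runs further. Note that the whole list must be decoded before it can be queried, which matches the remark above that these are compression schemes rather than compact data structures.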
III. SET OPERATIONS OVER BRWT
We now describe the algorithms for computing union, intersection, difference, and symmetric difference of binary relations represented using BRWTs.
Like the k²-tree, the BRWT is a hierarchical structure. Thus, the approach of the algorithms described in [15], [23] to implement set operations over k²-trees can be applied here. Of course, the differences and specific properties of the BRWT must be taken into account.
The algorithms essentially use a breadth-first traversal of the trees that represent the input relations for the union, and a depth-first traversal for the remaining operations. The algorithm for the union is presented in Subsection III-A, and the algorithm for the intersection is described in Subsection III-B. The remaining operations, difference and symmetric difference, use an algorithm very similar to the intersection, so they are briefly described in the same subsection.
For this section, given a node b of the BRWT, b_l and b_r represent the two bitmaps associated to b. For instance, if b is the root node of the BRWT in Figure 1, b_l is its first bitmap (the A-D bitmap) and b_r is its second bitmap.
A. BREADTH-FIRST TRAVERSAL ALGORITHM (UNION)
Given the properties of the union, if we are processing two nodes (one from each input BRWT), it is possible to obtain the result without accessing the children of these nodes. That is, if a bit representing a column in one of the nodes is 1, the output is 1 regardless of the value of the bit in the other node, and if both bits are 0, the output is 0. This enables the use of a breadth-first traversal over the BRWTs to compute the result of the union operation.
The traversal is performed by doing a synchronized sequential scan of the two input BRWT bitmaps A and B. Note that this synchronization must take into account that there can exist a column that is defined in one of the two nodes being processed, but not in the other. Algorithm 1 uses queues (as usual for breadth-first traversal of any tree). In this case, each element of the queue is a pair of flags ⟨fA, fB⟩ (indicating if the current column is defined in inputs A and B, respectively), which is used to determine the output bit, and whether we should enqueue a new pair to process the current column in the child nodes (at the next level of the conceptual tree).
The algorithm actually uses two queues: Ql and Qr. The reason behind their use is the way the bitmaps are stored: for each node b, we store b_l followed by b_r. Then, Ql is used to manage the breadth-first traversal of the BRWT, while Qr is only used to process the b_r part of each node (we can see in the algorithm that when an element is dequeued from Ql, it is enqueued in Qr, but the enqueuing needed to process lower levels of the conceptual tree is always done in Ql).

Algorithm 1 UnionBRWT(A, B)
  useQl ← true
  pA ← 0, pB ← 0
  bA ← 0, bB ← 0
  for i ← 1 ... numOfObjects do
    Ql.Insert(⟨1, 1⟩)
  end for
  Ql.Insert(⟨0, 0⟩)
  while pA < |A| ∨ pB < |B| do
    if useQl then
      ⟨fA, fB⟩ ← Ql.Remove()
    end if
    children ← 0
    while ((useQl ∧ (fA ∨ fB)) ∨ (!useQl ∧ !Qr.IsEmpty())) do
      if useQl then
        Qr.Insert(⟨fA, fB⟩)
      else
        ⟨fA, fB⟩ ← Qr.Remove()
      end if
      bA ← fA ∧ A[pA]
      bB ← fB ∧ B[pB]
      if (bA ∨ bB) ∧ (!isLeaf(pA) ∨ !isLeaf(pB)) then
        Ql.Insert(⟨bA, bB⟩)
        children ← children + 1
      end if
      R[posR] ← bA ∨ bB
      posR ← posR + 1
      if fA then
        pA ← pA + 1
      end if
      if fB then
        pB ← pB + 1
      end if
      if useQl then
        ⟨fA, fB⟩ ← Ql.Remove()
      end if
    end while
    useQl ← (∼useQl)
    if (!isLeaf(pA) ∨ !isLeaf(pB)) ∧ children > 0 then
      Ql.Insert(⟨0, 0⟩)
    end if
  end while
  return R

B. DEPTH-FIRST TRAVERSAL ALGORITHMS (INTERSECTION, DIFFERENCE, AND SYMMETRIC DIFFERENCE)
The algorithms for the intersection, difference and symmetric difference use a depth-first traversal of the input BRWTs, because the output bit for a column in a node depends on the values of the same column in the descendants of this node, down to the leaves. The algorithms for the three operations are very similar; in fact, the navigation scheme is exactly the same for all of them. The only changes are the value of the output, and the decision of whether it is necessary to explore the children or omit these nodes. Thus, we will explain the depth-first traversal only for the intersection, and highlight the differences for the rest of the operations.
The general idea is to process the input BRWTs column by column, recursively. Algorithm 2 describes how to perform the intersection between two BRWTs, processing every column of the root nodes by calling the recursive algorithm for the intersection (Algorithm 3). An indication of how to adapt these algorithms to perform the difference and symmetric difference is shown later.
For the intersection, when testing the value of a column in a given node, if both inputs have a 1 (meaning this column is defined in both BRWTs), a recursive check is required. However, if one of the inputs does not have this column defined, the output of the intersection will be a 0, and there is no need to process their children. This is done for the parts b_l and b_r of each node b. For the intersection, the algorithm produces a column in the output if any of the parts (b_l or b_r) has this column defined. Otherwise, the column is omitted in the output.
Note that, even though the access to any child could be done by using the rank and select operations for bitmaps, we considered the use of pointers to speed up the operations. The initPointersBRWT operation in Algorithm 2 initializes these pointers to the start of each node. During the operation, if a column is defined in one of the BRWTs but not in the other, the Skip function updates the pointers of the descendant nodes to omit this column. Otherwise, the pointers are shifted one position to process the next column after recursively computing the output value.
Algorithm 4 shows the Skip function. Although in the worst case it would have to process all nodes of the BRWT, this case is extremely infrequent in practice.
Algorithm 2 IntersectBRWT(A, B)
  pA ← initPointersBRWT(A)
  pB ← initPointersBRWT(B)
  bA ← 1, bB ← 1
  idNode ← 0
  for i ← 1 ... numOfObjects do
    Intersect(A, B, pA, pB, bA, bB, R, idNode)
  end for
  return R

As Algorithm 3 shows, lines 8–26 are specific to the intersection. These lines must be modified to implement the difference and symmetric difference. The pseudocode for these changes is shown in Table 1, but in summary there are basically two changes: the output bit for the column, and the management of the column when it is defined in only one BRWT. The output bit in lines 24–25 of Algorithm 3 is computed by performing the AND between the two bits for the intersection, while it is the AND with the negated second bit for the difference, and the XOR for the symmetric difference. The recursion when the current column is defined for both input BRWTs is the same for all the algorithms, but when the column is defined in only one of them, they differ. We introduce a new Copy function, which basically copies a column in one of the input bitmaps to the output bitmap. The algorithm for the difference copies the current column of the first bitmap if it is not defined in the second input. Conversely, if the column is only defined in the second bitmap, it is skipped. For the symmetric difference, if the column is defined in either bitmap, it is copied to the output. No skipping is needed for this algorithm.

Algorithm 3 Intersect(A, B, pA, pB, rA, rB, R, idNode)
   1: idCh_left ← (idNode + 1) ∗ 2
   2: idCh_right ← (idNode + 2) ∗ 2
   3: kl ← 0, kr ← 0
   4: bA_l ← rA ∧ A[pA[idNode]]
   5: bA_r ← rA ∧ A[pA[idNode + 1]]
   6: bB_l ← rB ∧ B[pB[idNode]]
   7: bB_r ← rB ∧ B[pB[idNode + 1]]
      {BEGIN code specific to Intersection}
   8: if !isLeaf(idNode) then
   9:   if bA_l ∧ bB_l then
  10:     kl ← Intersect(A, B, pA, pB, bA_l, bB_l, R, idCh_left)
  11:   else if bA_l then
  12:     Skip(A, pA, idCh_left)
  13:   else if bB_l then
  14:     Skip(B, pB, idCh_left)
  15:   end if
  16:   if bA_r ∧ bB_r then
  17:     kr ← Intersect(A, B, pA, pB, bA_r, bB_r, R, idCh_right)
  18:   else if bA_r then
  19:     Skip(A, pA, idCh_right)
  20:   else if bB_r then
  21:     Skip(B, pB, idCh_right)
  22:   end if
  23: else
  24:   kl ← bA_l ∧ bB_l
  25:   kr ← bA_r ∧ bB_r
  26: end if
      {END code specific to Intersection}
  27: if kl ∨ kr ∨ isRootNode(idNode) then
  28:   R[idNode] ← R[idNode] || kl
  29:   R[idNode + 1] ← R[idNode + 1] || kr
  30: end if
  31: if rA ∨ isRootNode(idNode) then
  32:   pA[idNode] ← pA[idNode] + 1
  33:   pA[idNode + 1] ← pA[idNode + 1] + 1
  34: end if
  35: if rB ∨ isRootNode(idNode) then
  36:   pB[idNode] ← pB[idNode] + 1
  37:   pB[idNode + 1] ← pB[idNode + 1] + 1
  38: end if
  39: return kl ∨ kr

Algorithm 4 Skip(X, pX, idNode)
  if !isLeaf(idNode) then
    bX ← X[pX[idNode]]
    if bX then
      idCh_left ← (idNode + 1) ∗ 2
      Skip(X, pX, idCh_left)
    end if
    bX ← X[pX[idNode + 1]]
    if bX then
      idCh_right ← (idNode + 2) ∗ 2
      Skip(X, pX, idCh_right)
    end if
  end if
  pX[idNode] ← pX[idNode] + 1
  pX[idNode + 1] ← pX[idNode + 1] + 1

Note (on the use of pointers instead of rank and select): We have included in Section IV-D a brief note about the implications of this change.

TABLE 1: Code snippets for set operations.

Difference:
  if !isLeaf(idNode) then
    if bA_l ∧ bB_l then
      kl ← RecDifference(A, B, pA, pB, bA_l, bB_l, R, idCh_left)
    else if bA_l then
      Copy(A, pA, R, idCh_left)
      kl ← 1
    else if bB_l then
      Skip(B, pB, idCh_left)
    end if
    if bA_r ∧ bB_r then
      kr ← RecDifference(A, B, pA, pB, bA_r, bB_r, R, idCh_right)
    else if bA_r then
      Copy(A, pA, R, idCh_right)
    else if bB_r then
      Skip(B, pB, idCh_right)
    end if
  else
    kl ← bA_l ∧ ∼bB_l
    kr ← bA_r ∧ ∼bB_r
  end if

Symmetric difference:
  if !isLeaf(idNode) then
    if bA_l ∧ bB_l then
      kl ← RecSymmDiff(A, B, pA, pB, bA_l, bB_l, R, idCh_left)
    else if bA_l then
      Copy(A, pA, R, idCh_left)
    else if bB_l then
      Copy(B, pB, R, idCh_left)
    else
      kl ← 0
    end if
    if bA_r ∧ bB_r then
      kr ← RecSymmDiff(A, B, pA, pB, bA_r, bB_r, R, idCh_right)
    else if bA_r then
      Copy(A, pA, R, idCh_right)
    else if bB_r then
      Copy(B, pB, R, idCh_right)
    else
      kr ← 0
    end if
  else
    kl ← (∼bA_l ∧ bB_l) ∨ (bA_l ∧ ∼bB_l)
    kr ← (∼bA_r ∧ bB_r) ∨ (bA_r ∧ ∼bB_r)
  end if
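The recursion pattern shared by the three depth-first operations (recurse when a column is defined in both inputs, skip or copy when it is defined in only one, and derive the output bit from the descendants) can be illustrated on a much simpler model. The sketch below is not the bitmap-based algorithm of this section: it is a pointer-based view of a single column, where `None` plays the role of a 0 bit, and the names `column_trie`, `combine` and `rows_of` are hypothetical helpers introduced only for this example.

```python
def column_trie(rows, lo, hi):
    """Per-column view: None = bit 0, True = a cell with a 1, tuple = (top, bottom).

    `rows` is the set of row ids in [lo, hi) where the column has a 1;
    hi - lo is assumed to be a power of two.
    """
    if not rows:
        return None
    if hi - lo == 1:
        return True
    mid = (lo + hi) // 2
    return (column_trie({r for r in rows if r < mid}, lo, mid),
            column_trie({r for r in rows if r >= mid}, mid, hi))


def combine(a, b, op):
    """op: 'and' (intersection), 'diff' (a minus b) or 'xor' (symmetric difference)."""
    if a is None or b is None:
        if op == "and":
            return None                      # like Skip: nothing can survive
        if op == "diff":
            return a if b is None else None  # like Copy of a, or Skip of b
        return a if b is None else b         # xor: like Copy of the defined side
    if a is True:                            # leaf in both inputs: bits are 1 and 1
        return True if op == "and" else None
    top = combine(a[0], b[0], op)
    bottom = combine(a[1], b[1], op)
    # Output bit of this node is (kl or kr): it is known only after the recursion.
    return None if top is None and bottom is None else (top, bottom)


def rows_of(trie, lo, hi):
    """Recover the set of row ids stored in a column trie."""
    if trie is None:
        return set()
    if trie is True:
        return {lo}
    mid = (lo + hi) // 2
    return rows_of(trie[0], lo, mid) | rows_of(trie[1], mid, hi)


a = column_trie({0, 1, 5}, 0, 8)
b = column_trie({1, 5, 6}, 0, 8)
for op in ("and", "diff", "xor"):
    print(op, sorted(rows_of(combine(a, b, op), 0, 8)))
```

The example shows why a depth-first traversal is needed for these operations: two subtrees that are both non-empty can still combine into an empty result, so the bit written for a node is known only after its descendants have been processed.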
IV. EMPIRICAL EVALUATION
In this section we describe the experiments we have conducted to compare the performance of the three compact data structures (k²-trees, k²-tree1s, and BRWT) and that of the compressed adjacency lists used to represent binary relations. We first describe the datasets used in our experiments, one of which is real and three of which are synthetic. Then, we include some implementation details of our algorithms, and describe the experimental hardware and software framework we have used. Finally, we present our results.
A. DATASETS
We ran our experiments over a real dataset (snaps-uk) and three synthetically generated distributions that use well-known random models: Erdős and Rényi [24], small-world (using the Newman-Watts-Strogatz distribution [25]), and the Barabasi-Albert distribution [26]. We shall refer to these dataset distributions as random, smallworld and barabasi, respectively. Each dataset is formed by 12 files, so the metrics obtained on the sizes and timings consider the average for these files.
The snaps-uk dataset was taken from a series of twelve monthly snapshots of a Web graph from the .uk domain, collected by the Laboratory for Web Algorithmics [27]–[29]. We have cut down these graphs to use 1 million nodes (n = 1,000,000). More information on this dataset is also available in [15].
The three synthetic datasets were generated using the NetworkX Python library. For the random distribution, only the number of nodes n and number of edges m were required as arguments. For the Barabasi-Albert and small-world distributions, a third parameter k is needed, to specify the k-nearest neighbors to connect to a given node. This parameter determines the number of edges that is generated. If the number of generated edges is greater than the specified number m, the remaining edges are removed.
The output is a binary file containing the plain adjacency list, which in turn is used to build the compact data structure representations (for k²-trees, k²-tree1s, and BRWT) as well as the compressed adjacency lists (QMX and Rice-runs). We have also generated 12 files for each distribution. All of them have n = 1,000,000 nodes. In order to have the same density, the number of generated edges for each file was the same as the corresponding file for the snaps-uk dataset. That resulted in an average number of edges of m = 2 , , per distribution, which gives a density m/n = 0 . .
Given that the distribution of 1s has a high impact on the size of the compressed structures, as well as on the performance of the data structures, a sample of all datasets is shown in Figure 2. As we can see, the synthetically generated datasets are much less clustered than snaps-uk, the real dataset.
B. EXPERIMENTAL FRAMEWORK
The comparison we performed considered, for all the data structures, the following operations:
• Neighborhood queries: isRelated, successors, predecessors, and range neighborhood.
• Set operations: union, intersection, difference, and symmetric difference.
The work described here required the coding of all of the algorithms (both neighborhood and set operations) for BRWT, as well as the neighborhood operations for k²-tree1s and compressed adjacency lists using QMX and Rice-runs. For the remaining algorithms, we use the source code by the authors of the works described in [8] and [16].
The implementation language for all algorithms is C, compiled with gcc version 6.3.0. The experiments were run on an isolated Intel® Xeon® [email protected] processor with 20 MB of cache, and 64 GB of RAM. It runs Debian 10.1 (buster) with kernel 4.19.0 (64 bits).
Laboratory for Web Algorithmics: http://law.di.unimi.it
NetworkX: https://networkx.github.io/
C. RESULTS
We present in this section the results of our empirical evaluation. First, we study independently the storage needs (the actual size) of the data structures, and their performance for both neighborhood queries and set operations. Then, both sizes and times are considered together in order to present some trade-offs that would apply when choosing a specific data structure. Finally, we analyze the data structures in terms of scalability.
In the neighborhood operations, besides the isRelated operation, we have included a similar one: isRelated-True. In fact, it is the same operation, but the result of the query is known to be true (which corresponds to having a 1 in the matrix). Thus, isRelated-True acts as a worst-case scenario for the isRelated operation in most cases. For example, for k²-trees, we know that this operation must navigate the tree down to its leaves (the result is 1, so the search cannot be cut short by a 0 in a previous level of the tree).
Also note that, for the naming of the data structures in the tables and graphics of this section, we have chosen shorter names: kt for k²-tree, ktone for k²-tree1, brwt for BRWT, qmx for the QMX encoder, and rice for the Rice-runs encoder.
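To make the worst-case argument for isRelated-True concrete, the following sketch builds a conceptual k²-tree as nested lists and counts how many levels an isRelated query descends. It is only an illustration under stated assumptions: the actual structures store bitmaps and navigate them with rank operations, whereas here `build_k2tree` and `is_related` are hypothetical pointer-based helpers, k is fixed to 2 by default, and the matrix side is assumed to be a power of k.

```python
def build_k2tree(cells, n, k=2, row0=0, col0=0):
    """Conceptual k²-tree over the n x n submatrix whose top-left cell is (row0, col0).

    Returns 0 for an all-zero submatrix, 1 for a single cell holding a 1,
    and a list of k*k children (row-major order) otherwise.
    """
    if not any(row0 <= r < row0 + n and col0 <= c < col0 + n for r, c in cells):
        return 0
    if n == 1:
        return 1
    step = n // k
    return [build_k2tree(cells, step, k, row0 + i * step, col0 + j * step)
            for i in range(k) for j in range(k)]


def is_related(node, n, r, c, k=2):
    """Return (answer, levels_descended) for the cell (r, c)."""
    levels = 0
    while isinstance(node, list):
        step = n // k
        node = node[(r // step) * k + (c // step)]   # child covering (r, c)
        r, c, n = r % step, c % step, step
        levels += 1
    return bool(node), levels


cells = {(0, 1), (1, 0), (7, 7)}          # made-up relation on an 8 x 8 matrix
tree = build_k2tree(cells, 8)
print(is_related(tree, 8, 7, 7))          # a related pair: full descent
print(is_related(tree, 8, 5, 1))          # an empty quadrant: early stop
```

With this toy relation, the query for the related pair descends all three levels of the 8 × 8 matrix, while the query that falls in an empty quadrant stops after the first level. A query for an unrelated pair can still reach the leaves when it lies close to existing 1s, which is why isRelated-True is a worst case only "in most cases".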
1) Storage
The average size taken up by each structure for all distributions is shown in Table 2. As a reference, the size of the full uncompressed adjacency lists is also included in the table. Our standard implementation uses 32-bit integers, so it is used as the base number. However, in order to represent a relation for 1 million nodes, 20 bits suffice. Thus, we also show the size theoretically needed to represent this relation. Table 3 shows the same information as a ratio, considering the full adjacency lists as the base for comparison (value 1.00), so the deviations can be better seen.
Considering the datasets, we can see that the distribution that allows for the best compression ratios (actually, the only one that gets compressed by all structures) is snaps-uk. This is reasonable, because this distribution is clustered, unlike the three synthetic ones, which are based on random models. More concretely, the best compression ratios are obtained by the k²-tree variants, which benefit from distributions with a small number of ones that are clustered. As for QMX and Rice-runs, their behavior is worse, because they are based on run-length compression, and shorter runs of 1s produce worse compression ratios.
Considering the compressed data structures, we can see that standard k²-trees obtain better results than the plain adjacency list using 20 bits for the clustered distributions (smallworld and snaps-uk), but not for those that follow a more random model (barabasi and random). In any case, compressed data structures generally obtain better results than compressed adjacency lists (except in the case of barabasi, which obtains better results for Rice-runs than for k²-tree1). Compressed adjacency lists require, for all synthetic data distributions, more space than the original plain representation. For example, using QMX to compress the random distribution actually obtains a file 70% bigger than the full adjacency lists using 32 bits.

FIGURE 2: A sample of all dataset distributions. (a) barabasi (b) random (c) smallworld (d) snaps-uk

TABLE 2: Average size (in bytes) for the different datasets.
Structure                  barabasi     random       smallworld   snaps-uk
Full adj. list (32 bits)   12,963,523   12,963,523   12,963,523   12,963,523
Full adj. list (20 bits)   8,102,201    8,102,201    8,102,201    8,102,201
brwt
kt
ktone
qmx
rice
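The relationship between the two reference rows of Table 2 can be checked with a few lines of arithmetic. The sketch below assumes that the 20-bit figure is obtained by rescaling the 32-bit size by 20/32 and rounding down; that reading is an inference from the numbers in the table, not something stated explicitly in the text.

```python
import math

n = 1_000_000            # nodes per dataset (Section IV-A)
full_32 = 12_963_523     # bytes, "Full adj. list (32 bits)" row of Table 2

bits_needed = math.ceil(math.log2(n))        # bits required to address n nodes
full_20 = full_32 * bits_needed // 32        # same entries at 20 bits each, rounded down
ratio = bits_needed / 32                     # reported as 0.63 in Table 3

print(bits_needed, full_20, ratio)
```

Under that assumption the computed size coincides with the "Full adj. list (20 bits)" row, and the ratio of 0.625 is consistent with the rounded value shown in Table 3.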
2) Timings
The timings shown in this section consider only the timedevoted to the operations themselves, without taking intoaccount the I/O time of reading the structures (for neighbor-hood and set operations) or writing the result (only for setoperations). In the case of the neighborhood operations, thetime shown corresponds to the execution of , queries ofthe same type over the same structure.Let us first discuss the neighborhood operations. Their timings, in milliseconds, are shown in Table 4. For a bettercomparison, Table 5 shows the same information as ratios.The value . corresponds to the shortest time for theoperation on a distribution.The information that stands out most in these tables cor-responds to the predecessors operation, where the fasteststructure is the BRWT, and the QMX and Rice-runs encodersare much slower (up to , times slower in the case ofthe random dataset). This is reasonable because encoderscompress the adjacency lists row by row, and finding the pre-decessors requires the decompression of all of the encodedlists.For successors, however, Rice-runs is the fastest, closely VOLUME 4, 2016 et al. : Compressed Data Structures for Binary Relations in Practice barabasi random smallworld snaps-uk
Full adj. list (32 bits) 1.00 1.00 1.00 1.00Full adj. list (20 bits) 0.63 0.63 0.63 0.63 brwt kt ktone qmx rice TABLE 3: Average size, shown as a ratio.
Query barabasi random smallworld snaps-uk isRelated brwt kt ktone qmx rice brwt kt ktone qmx rice brwt kt ktone qmx rice brwt kt ktone qmx rice brwt kt ktone qmx rice TABLE 4: Timings for neighborhood queries (in ms).followed by QMX, while k -trees and k -tree1s are slower(almost , times). The reason is that, in this case, theencoders have to decompress only one list (at most; if thereare no 1s in the row, the answer is immediate).Considering together predecessors and successors, we cansee that the difference between the best and the worst ismuch larger in the predecessors because, as we mentioned,QMX and Rice-runs must decode all of the lists. However, forsuccessors, the compact data structures have to explore onlypart of the binary relation, not all of it. This is because all ofthem are actually self-indices, so they allow for a fast accessto a portion of the matrix. For the same reason, if we considerthe rangeNeighborhood queries, we can see that the compactdata structures perform better than the encoders, and, in thiscase, k -tree is the fastest structure.For the isRelated and isRelated-True queries, all structuresoffer a more homogeneous behavior. Anyway, BRWT is thefastest for isRelated, while it is Rice-runs for isRelated-True.This is due to the fact that isRelated accesses a random cell inthe matrix, and with a low density BRWT is able to answer false without reaching the leaf nodes, while in the case ofisRelated-True query, BRWT must always reach a leaf node.The same happens for k -trees and k -tree1s.For the set operations, Table 6 shows the actual times of our experiments, and the same information as ratios isshown in Table 7. Even when the difference between the bestand worst data structure is not as large as for neighborhoodoperations, it is clear that the encoders (Rice-runs, closelyfollowed by QMX) are the best option. 
The BRWT is almost always the slowest for these kinds of operations.

The reason behind that behavior is that set operations must, in general, access all the elements of the binary relation. The encoders just decompress row by row, build the result and encode it as a new output row. However, the three compact data structures, being self-indices, have an overhead that (as in general for any type of index) worsens the performance when the full dataset has to be accessed.
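The row-by-row strategy followed by the encoders can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: plain gap encoding stands in for QMX and Rice-runs, and the names encode, decode, merge and set_operation are hypothetical, not taken from the implementation evaluated here.

```python
from typing import Callable, List

Row = List[int]  # a "compressed" row: gaps between consecutive column ids


def encode(columns: List[int]) -> Row:
    """Gap-encode a strictly increasing list of column ids (stand-in encoder)."""
    gaps, prev = [], -1
    for c in columns:
        gaps.append(c - prev)
        prev = c
    return gaps


def decode(row: Row) -> List[int]:
    """Rebuild the sorted column ids from their gaps."""
    cols, prev = [], -1
    for g in row:
        prev += g
        cols.append(prev)
    return cols


def merge(a: List[int], b: List[int],
          keep: Callable[[bool, bool], bool]) -> List[int]:
    """Merge two sorted lists; keep(in_a, in_b) decides which ids survive."""
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            x, in_a, in_b = a[i], True, False
            i += 1
        elif i == len(a) or b[j] < a[i]:
            x, in_a, in_b = b[j], False, True
            j += 1
        else:  # same column id present in both rows
            x, in_a, in_b = a[i], True, True
            i += 1
            j += 1
        if keep(in_a, in_b):
            out.append(x)
    return out


def set_operation(rel_a: List[Row], rel_b: List[Row],
                  keep: Callable[[bool, bool], bool]) -> List[Row]:
    """Decompress row by row, build the result row, and re-encode it."""
    return [encode(merge(decode(ra), decode(rb), keep))
            for ra, rb in zip(rel_a, rel_b)]


UNION = lambda in_a, in_b: in_a or in_b
INTERSECTION = lambda in_a, in_b: in_a and in_b
DIFFERENCE = lambda in_a, in_b: in_a and not in_b
SYMMETRIC_DIFFERENCE = lambda in_a, in_b: in_a != in_b
```

Every row is touched exactly once and no index has to be navigated, which is consistent with the encoders being the fastest option when the whole relation must be read.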
3) Storage size versus time
Let us now consider the trade-off between storage size and performance for all compared data structures. Figures 3–7 analyze their behavior. All figures contain two graphics: (a) for the neighborhood operations, and (b) for the set operations.

For the neighborhood operations, it is clear that the compact data structures, especially the standard k²-tree, are the best option in terms of both size and performance, while the encoders use more space and perform worse. Figure 3 shows this behavior considering an average of all the datasets together. If we take into account the data distribution, we can see that the previous conclusion is generally true, except for the k²-tree1. This data structure is highly dependent on the degree of clustering (remember that it compresses areas of 1s in the matrix), and thus its size grows for non-clustered datasets (like barabasi and random, as seen in Figures 4 and 5, respectively), and it behaves better for the more clustered ones (smallworld and especially snaps-uk, as seen in Figures 6 and 7, respectively).

For the set operations, none of the data structures outperforms the rest in terms of both size and performance. On the contrary, we can see in Figure 3 that there is a trade-off: the compact data structures use less space, but they are slower than the encoders, whereas the encoders are faster but need more space. Thus, the general advice would be the following: if we are primarily interested in speed, and there is enough available RAM to fit the data structures using the encoders, then use the encoders. If the datasets would not fit into RAM, use the compact data structures.

Parts (b) of Figures 4–7 show this behavior for each data distribution. We can see that the BRWT is a poor choice for the set operations in any distribution in terms of speed (not in terms of space). Again, the k²-tree1 shows a behavior highly dependent on the clustering, being one of the fastest data structures for the snaps-uk dataset and one of the slowest for barabasi.

TABLE 5: Timings for neighborhood queries (ratio). Columns: barabasi, random, smallworld, snaps-uk; rows: brwt, kt, ktone, qmx, rice for each query.
TABLE 6: Timings for set operations (in ms). Columns: barabasi, random, smallworld, snaps-uk; rows: brwt, kt, ktone, qmx, rice for each operation.
TABLE 7: Timings for set operations (ratio). Columns: barabasi, random, smallworld, snaps-uk; rows: brwt, kt, ktone, qmx, rice for each operation.
FIGURE 3: Average time versus average size for all distributions. (a) Neighborhood operations. (b) Set operations.
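The effect of clustering on the size of tree-shaped compact structures can be illustrated with a small sketch. It assumes a plain k²-tree that stores k² bits for every non-empty submatrix at each level; the function name k2tree_bits and the toy matrices are hypothetical, and the additional pruning of areas of 1s performed by the k²-tree1 is not modeled.

```python
from typing import Iterable, Tuple


def k2tree_bits(cells: Iterable[Tuple[int, int]], n: int, k: int = 2) -> int:
    """Count the bitmap bits of a plain k²-tree over an n x n binary matrix.

    Assumes n is a power of k. Every non-empty submatrix of side > 1
    contributes k*k bits (one per child); empty submatrices are pruned.
    """
    cells = set(cells)
    bits = 0
    size = n
    while size > 1:
        # Non-empty submatrices of the current side length.
        nonempty = {(r // size, c // size) for r, c in cells}
        bits += k * k * len(nonempty)
        size //= k
    return bits


# Four 1s packed in one corner versus four 1s spread over the corners (8 x 8).
clustered = [(0, 0), (0, 1), (1, 0), (1, 1)]
scattered = [(0, 0), (0, 7), (7, 0), (7, 7)]
print(k2tree_bits(clustered, 8), k2tree_bits(scattered, 8))
```

With the same number of 1s, the clustered cells need 12 bits and the scattered ones 36 in this toy setting, which matches the intuition that clustered distributions favor this family of structures.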
D. A NOTE ON SCALABILITY
The previous section analyzed the performance of the data structures over relations having 1 million nodes. However, we are also interested in the behavior of the data structures when the size of the relation grows.

We describe in this section the behavior of the data structures when the number of nodes varies, growing up to , , nodes. We have chosen the smallworld data distribution for these experiments because it is the least biased distribution: it is not as clustered as the real dataset (snaps-uk), which would benefit the compact data structures, and it is not as evenly distributed as random or barabasi, which would benefit the QMX and Rice-runs encoders. The dataset was also generated using the NetworkX Python library, considering a value of 2 for the k parameter (indicating the k nearest neighbors that would be linked in the graph) in all cases.

Let us first analyze the neighborhood operations shown in Figure 8. In general, we can see that the behavior of the data structures is as expected from the previous analysis. The encoders scale very well for the isRelated-True and successors operations. Note that we have chosen the isRelated-True operation instead of isRelated in order to force the structures to either navigate down the tree (compact data structures) or decompress a non-empty list (encoders). We also concluded that the encoders performed badly for predecessors, and this is confirmed here. In fact, this operation could not be completed for the relations with 5 and 10 million nodes with our hardware configuration. If we remove the encoders from the plot (Figure 8d), we can see that the BRWT scales well (almost constant) while the remaining compact data structures scale in a logarithmic order.
It is also worth noticing that the k²-tree and k²-tree1 have a similar trend in all operations.

For the rangeNeighborhood operation, shown in Figure 9, we have tested the scalability in two different ways: varying the number of nodes (as in the previous cases) but maintaining a fixed range size of × , and fixing the number of nodes but varying the range size (up to , × , ). It might seem strange that QMX and Rice-runs solve the range queries faster when the number of nodes increases. However, this can be explained: when the number of nodes increases, the number of empty rows will probably increase too. Therefore, the number of rows actually explored and decompressed decreases, so the time to solve the query is shorter. Note that the behavior of the compact data structures is quite similar (it is not decreasing, but almost constant). Considering the variation of the range size, we can see that all data structures increase the time for longer ranges, but in a sublinear order in general, with the k²-tree and k²-tree1 being the most efficient data structures.

For the set operations, illustrated in Figure 10, all data structures scale quite well, and in a uniform way (note that the lines in the figure do not cross). We can highlight that the BRWT is the worst option for these operations, while the encoders are the most suitable ones. This confirms what was shown in Section IV-C2.

A final note about scalability concerns some implementation decisions for our algorithms: as we mentioned in Section III-B, we decided to use a set of pointers instead of the rank and select operations to speed up the depth-first set operations over BRWTs. The same decision was taken to speed up the operations over both variants of k²-trees. Figure 11 shows the behavior of the intersection algorithm using both implementations. The version with pointers clearly outperforms the rank/select version, especially for large datasets (up to 3 times faster). Of course, this speed-up comes at a price, because the pointer version takes up more memory (between 30% and 56%).

FIGURE 4: Average time versus average size for barabasi. (a) Neighborhood operations. (b) Set operations.
FIGURE 5: Average time versus average size for random. (a) Neighborhood operations. (b) Set operations.

V. CONCLUSIONS
In this work, we have conducted several experiments to compare the behavior of several data structures used to store binary relations. We have considered three compact data structures (k²-tree, k²-tree1 and BRWT) and two encoders or compressors (QMX and Rice-runs).

FIGURE 6: Average time versus average size for smallworld. (a) Neighborhood operations. (b) Set operations.
FIGURE 7: Average time versus average size for snaps-uk. (a) Neighborhood operations. (b) Set operations.

For the compact data structures, we used the algorithms for k²-trees and k²-tree1s developed in [8] and [15], but the algorithms for set operations over the BRWT are presented here for the first time, thus extending the functionality of this data structure.

We have found that there is no clear winner: no data structure is better than the rest in all cases. All of them have some advantages and disadvantages, depending on several factors. We have considered the storage size and the response time as basic measurements, and have tested them using several datasets with different characteristics, because the data distribution has a great impact on the performance of all data structures.

In order to offer some general conclusions, we can group the data structures into three groups that have a similar behavior: the encoders (QMX and Rice-runs), both k²-tree variants, and BRWT.

With respect to the encoders, they proved to be the fastest option for the set operations in all cases, and are competitive for some neighborhood queries, except for rangeNeighborhood and especially the predecessors queries (which could not actually be executed for large datasets). In terms of storage needs, the encoders use in general more space than the compact data structures.

The k²-trees excel at the rangeNeighborhood queries, but are outperformed for the successors operation. For the rest of the neighborhood queries they are competitive. For the set operations, these structures are not as fast as the encoders, but they are the best option amongst the compact data structures. They also scale reasonably well when the dataset grows. In terms of storage, k²-trees are always the best option, using much less space than the other structures.
This makes the k²-tree a good option even for those operations where it is slower than the encoders, when the encoders cannot fit the relation into main memory.

BRWT is competitive for the neighborhood queries in general, and is the best option for the predecessors queries. For the set operations, however, it is usually the worst option. In terms of storage needs, it is competitive with respect to the remaining compact data structures, and better than the encoders.

Finally, we must highlight that the data distribution has a great impact on both the size of the data structure and the speed of the operations. In general, clustered data distributions tend to favor compact data structures, while more random or evenly distributed datasets tend to benefit the encoders.

We have presented here, to the best of our knowledge, the first study about the behavior of compressed data structures for binary relations, evaluating storage needs and speed of the operations based on different (synthetic and real) data distributions, and also considering the scalability of the data structures.

FIGURE 8: Number of nodes versus average time on neighborhood operations for scalability. (a) isRelated. (b) successors. (c) predecessors. (d) predecessors (compact).
REFERENCES
FIGURE 9: Scalability measures for rangeNeighborhood queries. (a) Varying the number of nodes. (b) Varying the range size.
FIGURE 10: Number of nodes versus average time on set operations for scalability. (a) Union. (b) Intersection. (c) Difference. (d) Symmetric Difference.
FIGURE 11: Speed improvement of queries on BRWT by using pointers.
CARLOS QUIJADA FUENTES obtained his degree in Civil Engineering in Computer Science in 2011 and his Master in Computer Science in 2017, both from the University of Bío-Bío. His research area is data structures and algorithms. Chillán/Chile.
MIGUEL R. PENABAD obtained his Master in Computer Science in 1994 at the University of A Coruña. He received his Ph.D. in 2001 at the University of A Coruña. He has been a professor at the same university since 2000. His main research interests are database query optimization, and algorithms and data structures for information retrieval.
SUSANA LADRA received the bachelor's degree in mathematics from the National Distance Education University (UNED), in 2014, and the master's in computer science engineering and the Ph.D. degree in computer science from the University of A Coruña, in 2007 and 2011, respectively. She is currently an Associate Professor with the Universidade da Coruña. She is the Principal Investigator of several national and international research projects. She has published more than 40 articles in various international journals and conferences. Her research interests include design and analysis of algorithms and data structures, and data compression and data mining in the fields of information retrieval and bioinformatics.
GILBERTO GUTIÉRREZ RETAMAL received his M.Sc. from the University of Chile in 1999 and the Ph.D. in computer science in 2007 from the same university. His research areas include data structures and algorithms, spatial and spatio-temporal databases. He is currently an associate professor in the Department of Computer Science and Information Technology, at the University of Bío-Bío, Chillán / Chile.