Tight and simple Web graph compression
Szymon Grabowski and Wojciech Bieniecki
Computer Engineering Department, Technical University of Łódź, Al. Politechniki 11, 90–924 Łódź, Poland
{sgrabow,wbieniec}@kis.p.lodz.pl

Abstract.
Analysing Web graphs has applications in determining page ranks, fighting Web spam, detecting communities and mirror sites, and more. This study is however hampered by the necessity of storing a major part of huge graphs in the external memory, which prevents efficient random access to edge (hyperlink) lists. A number of algorithms involving compression techniques have thus been presented, to represent Web graphs succinctly while also providing random access. Those techniques are usually based on differential encodings of the adjacency lists, finding repeating nodes or node regions in the successive lists, more general grammar-based transformations or 2-dimensional representations of the binary matrix of the graph. In this paper we present two Web graph compression algorithms. The first can be seen as engineering of the Boldi and Vigna (2004) method. We extend the notion of similarity between link lists and use a more compact encoding of residuals. The algorithm works on blocks of varying size (in the number of input lines) and sacrifices access time for better compression ratio, achieving a more succinct graph representation than other algorithms reported in the literature. The second algorithm works on blocks of the same size, in the number of input lines, and its key mechanism is merging the block into a single ordered list. This method achieves much more attractive space-time tradeoffs.
Key words: graph compression, random access
Development of succinct data structures has been one of the most active research areas in algorithmics in recent years. A succinct data structure shares the interface with its classic (non-succinct) counterpart, but is represented in much smaller space, via data compression. Successful examples along these lines include text indexes [25], dictionaries, trees [24,15] and graphs [24]. Queries to succinct data structures are usually slower (in practice, although not always in complexity terms) than using non-compressed structures, hence the main motivation for using them is to allow huge datasets to be handled in the main memory. For example, indexed exact pattern matching in DNA would be limited to sequences shorter than 1 billion nucleotides on a commodity PC with 4 GB of main memory, if the indexing structure were the classic suffix array (SA), and to even less than half of that, if the SA were replaced with a suffix tree. On the other hand, switching to some compressed full-text index (see [25] for a survey) shifts the limit to over 10 billion nucleotides, which is more than enough to handle the whole human genome.

Another huge object of significant interest is the Web graph. This is a directed unlabeled graph of connections between webpages (i.e., documents), where the nodes are individual HTML documents and the edges from a given node are the outgoing links to other nodes. We assume that the order of hyperlinks in a document is irrelevant. Web graph analyses can be used to rank pages, fight Web spam, detect communities and mirror sites, etc. As of early Sept. 2011, it is estimated that Google's index has about 44 billion webpages.
Assuming 20 outgoing links per node, 5-byte links (4-byte indexes to other pages are simply too small) and pointers to each adjacency list, we would need more than 4.4 TB of memory, way beyond the capacities of current RAM memories. We believe that, confronted with these figures, the reader is now convinced of the necessity of compression techniques for Web graph representation. Preliminary versions of this manuscript were published in [16] and [17].

We assume that a directed graph G = (V, E) is a set of n = |V| vertices and m = |E| edges. The earliest works on graph compression were theoretical, and they usually dealt with specific graph classes. For example, it is known that planar graphs can be compressed into O(n) bits [28,18]. For dense enough graphs, it is impossible to reach o(m log n) bits of space, i.e., to go below the space complexity of the trivial adjacency list representation. Since the seminal Jacobson's thesis [20] on succinct data structures, papers have appeared taking into account not only the space occupied by a graph, but also access times.

There are several works dedicated to Web graph compression. Bharat et al. [4] suggested ordering documents according to their URLs, to exploit the simple observation that most outgoing links actually point to another document within the same Web site. Their Connectivity Server provided linkage information for all pages indexed by the AltaVista search engine at that time. The links are merely represented by node numbers (integers) following the URL lexicographical order. We noted that we assume the order of hyperlinks in a document irrelevant (as most works on Web graph compression do), hence the link lists can be sorted in ascending order. As the successive numbers tend to be close, differential encoding may be applied efficiently. Randall et al.
[27] also use this technique (stating that for their data 80% of all links are local), but they also note that commonly many pages within the same site share large parts of their adjacency lists. To exploit this phenomenon, a given list may be encoded with a reference to another list from its neighborhood (located earlier), plus a set of additions and deletions to/from the referenced list. Their encoding, in the most compact variant, encodes an outgoing link in 5.55 bits on average, a result reported over a Web crawl consisting of 61 million URLs and 1 billion links.

One of the most efficient compression schemes for Web graphs was presented by Boldi and Vigna [7] in 2003. Their method is likely to achieve around 3 bits per edge, or less, at link access times below 1 ms on their 2.4 GHz Pentium4 machine. Of course, the compression ratios vary from dataset to dataset. We are going to describe the Boldi and Vigna algorithm in detail in the next section, as it is the main inspiration for our solution.

Claude and Navarro [11,13] took a totally different approach of grammar-based compression. In particular, they focus on Re-Pair [22] and LZ78 compression schemes, getting close, and sometimes even below, the compression ratios of Boldi and Vigna, while achieving much faster access times. To mitigate one of the main disadvantages of Re-Pair, its high memory requirements, they developed an approximate variant of this algorithm.

Apostolico and Drovandi [2] renumber the nodes according to a BFS traversal of the graph instead of the URL order; a variant of their scheme answers the query whether page i has a link to page j almost twice as fast as returning the whole neighbor list. Still, we note that using a non-lexicographical ordering is harmful for compact storage of the webpage URLs themselves (a problem accompanying pure graph structure compression in most practical applications). Note also that reordering the graph is the approach followed in more recent works from the Boldi and Vigna team [6,5].

Anh and Moffat [1] devised a scheme which seems to use grammar-based compression in a local manner.
They work in groups of h consecutive lists and perform some operations to reduce their size (e.g., a sort of 2-dimensional RLE if a run of successive integers appears on all the h lists). What remains in the group is then encoded statistically. Their results are very promising: graph representations about 15–30% (or even more in some variants) smaller than the BV algorithm with practical parameter choices (in particular, Anh and Moffat achieve 3.81 bpe and 3.55 bpe for the graph EU), with reportedly comparable decoding speed. Details of the algorithm cannot however be deduced from their 1-page conference poster.

Recent works focus on graph compression with support for bidirectional navigation. To this end, Brisaboa et al. [8] proposed the k²-tree, a spatial data structure, related to the well-known quadtree, which performs a binary partition of the graph adjacency matrix and labels empty areas with 0s and non-empty areas with 1s. The non-empty areas are recursively split and labeled, until reaching the leaves (single nodes). An important component in their scheme is an auxiliary structure to compute rank queries [20] efficiently, to navigate between tree levels. It is easy to notice that this elegant data structure supports handling both forward and reverse neighbors, which follows from its symmetry. Ladra [21] proposed a more efficient encoding of the leaves (which are boxes of size e.g. 8 × 8), improving upon the original k²-tree.

Finally, we have to mention the Hernández and Navarro work [19], where they combine their previous techniques, the k²-tree [8] and Re-Pair for compressing the graph binary relation [12], with edge reducing [9], obtaining interesting trade-offs. In particular, if some of the access time can be sacrificed, the space they achieve is the smallest known among the solutions supporting bidirectional queries.
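To make the k²-tree partition concrete, here is a toy sketch (not the authors' code; k = 2, matrix side a power of 2). A real k²-tree stores the per-level bit sequences with rank support for navigation, which we omit.

```python
def k2_build(mat, k=2):
    """Recursively partition a square 0/1 matrix (side a power of k).
    For each sub-square, emit one bit per level: 0 = all-zero (stop),
    1 = non-empty (recurse). Returns the per-level bit sequences."""
    levels = []

    def rec(r0, c0, size, depth):
        if len(levels) <= depth:
            levels.append([])
        sub = size // k
        for dr in range(k):
            for dc in range(k):
                r, c = r0 + dr * sub, c0 + dc * sub
                block = [mat[i][j] for i in range(r, r + sub)
                                   for j in range(c, c + sub)]
                bit = 1 if any(block) else 0
                levels[depth].append(bit)
                if bit and sub > 1:
                    rec(r, c, sub, depth + 1)

    rec(0, 0, len(mat), 0)
    return levels

# 4x4 adjacency matrix with edges (0,1) and (3,3)
mat = [[0, 1, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 1]]
levels = k2_build(mat)  # levels[0] == [1, 0, 0, 1]
```

Only the two non-empty quadrants are expanded at the next level, which is where the space savings on sparse matrices come from.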
Based on WebGraph datasets (http://webgraph.dsi.unimi.it/), Boldi and Vigna noticed that similarity is strongly concentrated; typically, either two adjacency (edge) lists have nothing or little in common, or they share large subsequences of edges. To exploit this redundancy, one bit per entry of the referenced list could be used, to denote which of its integers are copied to the current list, and which are not. Those bit-vectors are dubbed copy lists. Still, Boldi and Vigna go further, noticing that copy lists tend to contain runs of 0s and 1s, thus they compress them using a sort of run-length encoding. They assume the first run consists of 1s (if the copy list actually starts with 0s, the length of the first run is simply zero), which allows them to represent a copy list as only a sequence of run lengths, encoded e.g. with Elias coding.

The integers on the current list which did not occur on the referenced list must be stored too, and how to encode them is another novelty of the described algorithm. They detect intervals of consecutive (i.e., differing by 1) integers and encode them as pairs of the left boundary and the interval length; the left boundary of the next interval on a given list is encoded as the difference to the right boundary of the previous interval minus two (this is because between the end of one interval and the beginning of another there must be at least one integer). The numbers which do not fall into any interval are called residuals and are also stored, encoded in a differential manner.

Finally, the algorithm allows selecting as the reference list one of several previous lines; the size of this window is one of the parameters of the algorithm, posing a tradeoff between compression ratio and compression/decompression time and space. Another parameter affecting the results is the maximum reference count, which is the maximum allowed length of a chain of lists such that one cannot be decoded without extracting its predecessor in the chain.

We present two approaches to Web graph compression working locally, in small blocks; the first one usually reaches slightly higher compression ratios, but the second is more practical, being much faster.

Alg. 1 GraphCompressSSL(G, BSIZE)
 1  firstLine ← true
 2  prev ← []
 3  outB ← []
 4  outF ← []
 5  for line ∈ G do
 6      residuals ← line
 7      if firstLine = false then
 8          f[1 . . . |prev|] ← [1, 1, . . . , 1]
 9          for i ← 1 to |prev| do
10              if prev[i] ∈ line then f[i] ← 0
11              else if prev[i] + 1 ∈ line then f[i] ← 2
12              else if prev[i] + 2 ∈ line then f[i] ← 3
13          append(outF, f)
14          for i ← 1 to |prev| do
15              if f[i] ≠ 1 then
16                  remove(residuals, prev[i] + max(f[i] − 1, 0))   ▷ drop the value matched by flag 0, 2 or 3
17      residuals′ ← RLE(diffEncode(residuals)) + [0]
18      append(outB, byteEncode(residuals′))
19      prev ← line
20      firstLine ← false
21      if |outB| ≥ BSIZE then
22          compress(outB)
23          compress(outF)
24          outB ← []
25          outF ← []
26          firstLine ← true
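The copy-list construction of Alg. 1 (explained in detail below) can be sketched in Python as follows. This is a simplified model with our own function names; the RLE and byte-coding stages, and the handling of overlapping matches in the real implementation, are omitted.

```python
def ssl_copy_list(prev, line):
    """Flags and residuals of `line` against the reference list `prev`.
    Flags: 0 = prev[i] occurs on `line`, 2 = prev[i]+1 occurs,
    3 = prev[i]+2 occurs, 1 = no match; the smallest applicable
    flag wins. Residuals are the values of `line` not covered."""
    cur = set(line)
    flags = []
    covered = set()
    for v in prev:
        if v in cur:
            flags.append(0)
            covered.add(v)
        elif v + 1 in cur:
            flags.append(2)
            covered.add(v + 1)
        elif v + 2 in cur:
            flags.append(3)
            covered.add(v + 2)
        else:
            flags.append(1)
    residuals = [v for v in line if v not in covered]
    return flags, residuals

def diff_encode(sorted_vals):
    """Gap-encode a sorted list, as done before RLE and byte coding."""
    out, prev = [], 0
    for v in sorted_vals:
        out.append(v - prev)
        prev = v
    return out

flags, residuals = ssl_copy_list([10, 20, 30, 40], [10, 21, 32, 99])
# flags == [0, 2, 3, 1], residuals == [99]
```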
Our first algorithm (Alg. 1; SSL stands for "similarity of successive lists") works in blocks consisting of multiple adjacency lists. The blocks in their compact form are approximately equal in size, which means that the number of adjacency lists per block varies; for example, in graph areas with dominating short lists the number of lists per block is greater than elsewhere.

We work in two phases: preprocessing and final compression, the latter using a general-purpose compression algorithm. The algorithm processes the adjacency lines one by one and splits their data into two streams.

One stream holds copy lists, in an extended sense compared to the Boldi and Vigna solution. Our copy lists are no longer binary but consist of four different flag symbols: if the corresponding integer from the reference list is j, then 0 denotes an exact match (i.e., the value j occurs somewhere on the current list), 2 means that the current list contains the integer j + 1, and 3 means that the current list contains the integer j + 2. Finally, the flag 1 corresponds to those items of the reference list which have not been labeled with 0, 2 or 3. Of course, several events may happen for a single element; e.g., the integer 34 from the reference list triggers three events if the current list contains 34, 35 and 36. In such a case, the flag with the smallest value is chosen (i.e., 0 in our example). Moreover, we make things even simpler than in the Boldi–Vigna scheme: our reference list is always the previous adjacency list.

The other stream stores residuals, i.e., the values which cannot be decoded with flags 0, 2 or 3 on the copy lists. First, differential encoding is applied, and then an RLE compressor for differences of 1 only (with the minimum run length set experimentally to 5) is run. The resulting sequence is terminated with a unique value (0) and then encoded using a byte code. For this last step, we consider two variants.
One is similar to the two-byte dense code [26] in spending one flag bit in the first codeword byte to tell the length of the current codeword. Namely, we choose between 1 and b bytes for encoding each number, where b is the minimum integer such that 8b − 1 bits suffice to encode any value (b = 3 for EU and b = 4 for the remaining available datasets). The second coding variant can be classified as a prelude code [14] in which two bits in the first codeword byte tell the length of the current codeword; originally the lengths are 1, 2, 3 and 4, but we take 1, 2 and b, such that 8b − 2 bits suffice to encode any value (b could be 5 or 6 for really huge graphs).

Once the residual buffer reaches at least BSIZE bytes, it is time to end the current block and start a new one. Both residual and flag buffers are then (independently) compressed (we used the well-known Deflate algorithm for this purpose) and flushed. The code of Alg. 1 is slightly simplified; we omitted technical details serving for finding the list boundaries in all cases (e.g., empty lines).

Our second algorithm (Alg. 2; LM stands for "list merging") works in blocks having the same number of lists, h (at least in this aspect our algorithm resembles the one from [1]). Given the block of h lists, the procedure converts it into two streams: one stores a single long list consisting of all integers on the h input lists, without duplicates, and the other stores the flags necessary to reconstruct the original lists. In other words, the algorithm performs a reversible merge of all the lists in the block. The long list is compacted in a manner similar to the previous algorithm: the list is differentially encoded, zero-terminated and submitted to a byte coder (only the variant with 1, 2 and b bytes per codeword was tried). Note that we gave up the RLE phase here.

Alg. 2 GraphCompressLM(G, h)
 1  outF ← []
 2  i ← 1
 3  for line_i, line_{i+1}, . . . , line_{i+h−1} ∈ G do
 4      tempLine ← line_i ∪ line_{i+1} ∪ . . . ∪ line_{i+h−1}
 5      tempLine ← removeDuplicates(tempLine)
 6      longLine ← sort(tempLine)
 7      items ← diffEncode(longLine) + [0]
 8      outB ← byteEncode(items)
 9      for j ← 1 to |longLine| do
10          f[1 . . . h] ← [0, 0, . . . , 0]
11          for k ← 1 to h do
12              if longLine[j] ∈ line_{i+k−1} then f[k] ← 1
13          append(outF, bitPack(f))
14      compress(concat(outB, outF))
15      outF ← []
16      i ← i + h

The flags describe to which input lists a given integer on the output list belongs; the number of bits per item on the output list is h, and in practical terms we assume h to be a multiple of 8 (and even additionally a power of 2, in the experiments to follow). The flag sequence does not need any terminator, since its length is defined by the length of the long list, which is located earlier in the output stream. For example, if the length of the long list is 91 and h = 32, the corresponding flag sequence has 364 bytes. Now, we consider two variations for encoding the flag sequence: either the flags are kept raw (this variant is later denoted as LM-bitmap), or the differences (gaps) between the successive 1s in the flag sequence are written on individual bytes (this variant is later denoted as
LM-diff). We note that each run of h bits corresponding to the flags for a single value on the output list must contain at least one set bit, hence the maximum gap between any two 1s in the resulting sequence is 2h − 1; thus for h ≤ 128 each gap value can be stored on a single byte (a preliminary experiment with h = 256 and using a byte code for gap encoding was rather unsuccessful). Alg. 2 presents the LM-bitmap variant.

Those two sequences, the compacted long list and the flag sequence (either raw or gap-encoded), are then concatenated and compressed with the Deflate algorithm.

One can see that the key parameter here is the block size, h. Using a larger h lets us exploit a wider range of similar lists, but it also has two drawbacks. The flag sequence gets more and more sparse (for example, for h = 64 and the EU-2005 crawl, as much as about 68% of its list indicators have only one set bit out of 64!), and the Deflate compressor becomes relatively inefficient on such data; a drawback more important in the LM-bitmap variant. Worse, decoding larger blocks takes longer.

The experiments with the SSL algorithm comprise only the datasets EU-2005 and Indochina-2004, while the more practical LM variants are tested also on the UK-2002 and Arabic-2005 crawls; all the datasets were downloaded from the WebGraph project (http://webgraph.dsi.unimi.it/), using both direct and transposed graphs. Note that we use the natural order versions of them, as using reordered variants (also available from the WebGraph project) may be more efficient, but then the compression of the corresponding URL data deteriorates. The main characteristics of those datasets are presented in Table 1.
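The reversible merge at the heart of Alg. 2 can be illustrated with a short Python sketch (our own names; diffEncode, byte coding and Deflate are left out, and the h-bit flag vectors are packed into integers rather than a bit stream):

```python
def lm_block(lists):
    """Merge a block of h adjacency lists into one duplicate-free sorted
    list plus, per merged value, an h-bit membership flag vector."""
    h = len(lists)
    sets = [set(l) for l in lists]
    long_list = sorted(set().union(*sets))
    flags = []
    for v in long_list:
        f = 0
        for k in range(h):
            if v in sets[k]:
                f |= 1 << k          # bit k set: value v occurs on list k
        flags.append(f)
    return long_list, flags

def lm_unblock(long_list, flags, h):
    """Reconstruct the original h lists from the merged form."""
    lists = [[] for _ in range(h)]
    for v, f in zip(long_list, flags):
        for k in range(h):
            if f >> k & 1:
                lists[k].append(v)
    return lists

def flags_to_gaps(flags, h):
    """LM-diff: gaps between successive 1s in the concatenated flag
    stream; each h-bit group holds >= 1 set bit, so no gap exceeds 2h - 1."""
    positions = [i * h + k for i, f in enumerate(flags)
                           for k in range(h) if f >> k & 1]
    return [b - a for a, b in zip([-1] + positions, positions)]

long_list, flags = lm_block([[1, 5, 9], [5, 9, 12], [2, 5]])
# long_list == [1, 2, 5, 9, 12]
```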
Dataset                   EU-2005             Indochina-2004        UK-2002              Arabic-2005
                     direct  transposed    direct  transposed   direct  transposed   direct  transposed
Nodes                      862664                7414866             18520486             22744080
Edges                    19235140              194109311            298113762            639999458
Edges / nodes               22.30                  26.18                16.10                28.14
% of empty lists      8.309   0.000        17.655   0.004       14.908   0.637       14.514   0.002
Longest list length    6985   68922          6985  256425         2450  194942         9905  575618

Table 1. Selected characteristics of the datasets used in the experiments.

The main experiments (Sect. 5.1) were run on a machine equipped with an Intel Core 2 Quad Q9450 CPU, 8 GB of RAM, running Microsoft Windows XP (64-bit). Our algorithms were implemented in Java and run on the 64-bit JVM (JRE 6 was used in the first series of tests, involving SSL, and JRE 7 in the later tests, with the LM variants). A single CPU core was used by all implementations. As is seemingly accepted in most reported works, we measure access time per edge, extracting many (100,000 in our case) randomly selected adjacency lists, summing those times, and dividing the total time by the number of edges on the required lists. The space is measured in bits per edge (bpe), dividing the total space of the structure (including entry points to blocks) by the total number of edges. Throughout this section, by 1 KB we mean 1000 bytes.
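The measurement methodology above can be sketched as follows (a minimal model; the `extract` callback standing for the decompressor's list-extraction routine is our hypothetical stand-in):

```python
import random
import time

def measure_bpe_and_access(size_bits, total_edges, extract, n_nodes,
                           samples=100_000):
    """Space in bits per edge, and mean access time per edge obtained by
    extracting `samples` random adjacency lists and dividing the total
    time by the number of edges on the extracted lists."""
    bpe = size_bits / total_edges
    ids = [random.randrange(n_nodes) for _ in range(samples)]
    t0 = time.perf_counter()
    edges = sum(len(extract(v)) for v in ids)
    elapsed = time.perf_counter() - t0
    time_per_edge = elapsed / edges if edges else float('inf')
    return bpe, time_per_edge
```

Note that dividing by the number of extracted edges (rather than by the number of queries) favors neither short nor long lists, which is why the text reports both per-edge and per-list figures.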
Our first algorithm, SSL, has three parameters: the number of flags used (either 2 or 4, where 2 flags mimic the Boldi–Vigna scheme and 4 correspond to Alg. 1), the byte encoding scheme (using either 2 or 3 codeword lengths), and the residual block size threshold BSIZE. As for the last parameter, we initially set it to 8192, which means that the residual block gets closed and is submitted to the Deflate compression once it reaches at least 8192 bytes. Experiments with the block size are presented in the next subsection. The remaining parameters constitute four variants:

2a. Two flags and two codeword lengths are used.
2b. Two flags and three codeword lengths are used.
4a. Four flags and two codeword lengths are used.
4b. Four flags and three codeword lengths are used.

As expected, the compression ratios improve with more flags and denser byte codes (Table 2). Tables 3 and 4 present the compression and access time results for the two extreme variants: 2a and 4b. Here we see that using more aggressive preprocessing is unfortunately slower (partly because of the increased amount of flag data per block); the difference in speed between variants 2a and 4b is close to 50%. Translating the times per edge into times per neighbor list, we need from 410 µs to 550 µs for 2a and from 620 µs to 760 µs for 4b. This is about 10 times less than the access time of 10K or 15K RPM hard disks.

          EU-2005               Indochina-2004
     direct  transposed     direct  transposed
2a    2.286    2.345         1.101    1.087
2b    2.199    2.290         1.062    1.065
4a    1.735    1.809         0.936    0.903
4b    1.696    1.782         0.909    0.890

Table 2. The algorithm based on similarity of successive lists; compression ratios in bits per edge.

Our second algorithm, LM, has one parameter, h, the number of lines (lists) per block. We conducted experiments for h = 16, 32, 64; the results are presented in the last three rows of Tables 3 and 4, respectively. For this comparison, only the LM-bitmap variant is used. We see that even LM64 cannot reach the compression of our 4b variant, but its list extraction is 14–27 times faster. The fastest of the variants presented here, LM16, is 1.3 and 2.0 times slower than BV (7,3), respectively, with much better compression (we also checked LM8, only on EU-2005: the results are 3.814 bpe and 0.20 µs per edge).

           direct graph          transposed graph
           bpe     time [µs]     bpe     time [µs]
BV (7,3)   5.169   0.24          –       –
2a         2.286   18.59         2.345   18.88
4b         1.696   28.93         1.782   27.83
LM16       2.963   0.31          2.576   0.82
LM32       2.373   0.55          2.233   1.05
LM64       2.008   1.05          2.016   2.01

Table 3.
EU-2005 dataset. Compression ratios (bpe) and access times per edge. "LMx" stands for LM-bitmap with h = x. To the results of BV (7,3) the amount of 0.510 bpe should be added, corresponding to extra data required to access the graph in random order.

           direct graph          transposed graph
           bpe     time [µs]     bpe     time [µs]
BV (7,3)   2.063   0.21          –       –
2a         1.101   20.77         1.087   21.10
4b         0.909   29.03         0.890   27.43
LM16       1.668   0.43          1.411   0.47
LM32       1.320   0.55          1.228   0.69
LM64       1.097   0.79          1.093   1.16

Table 4.
Indochina-2004 dataset. Compression ratios (bpe) and access times per edge. "LMx" stands for LM-bitmap with h = x. To the results of BV (7,3) the amount of 0.348 bpe should be added, corresponding to extra data required to access the graph in random order.

The larger experiment was run on four datasets (in both direct and transposed versions); the obtained results are presented in Fig. 1, and the exact numbers, for more careful examination, can be found in the appendix. The LM-bitmap variant fares better in the comparison for smaller blocks (h up to 16), but then the LM-diff variant starts to win in compression, and the gap grows with growing h. Unfortunately, decoding LM-diff blocks is also in most cases costlier, with a 74% maximum loss for Indochina-2004 direct, h = 64. On average, however, its loss in speed to LM-bitmap is not that big.

Obviously, the block size should seriously affect the overall space used by the structure and the access time. Larger blocks mean that the Deflate algorithm is more successful in finding longer matches and that the overhead from encoding the first lines in a block without any reference is smaller. On the other hand, more lines usually have to be decoded before extracting the queried adjacency list. In this experiment we ran the 2a algorithm (the same implementation in Java) with each block of residuals terminated (and later Deflate-compressed) after reaching a BSIZE of 1024, 2048, 4096, 8192 and 16384 bytes, respectively. The test computer had an Intel Pentium4 HT 3.0 GHz CPU, 1 GB of RAM, and was running Microsoft Windows XP Home SP3 (32-bit). The results (Table 5) show that doubling the block size implies a space reduction of about 10%, while the access time grows less than twice (in particular, using 8K blocks is only 2.0–2.5 times slower than using 2K blocks). Still, as the block size gets larger (compare the last two rows in the table), the improvement in compression starts to drop while the slowdown grows.
For a reference, the access times of a practical Boldi–Vigna variant, BV (7,3), are 0.47 µs and 0.42 µs on the test machine.

          EU-2005               Indochina-2004
BSIZE     bpe     time [µs]     bpe     time [µs]
1024      3.398   6.50          1.485   8.99
2048      2.869   8.91          1.292   12.05
4096      2.513   15.93         1.172   17.87
8192      2.286   27.60         1.101   29.83
16384     2.129   48.77         1.061   57.39

Table 5. Compression ratios and access times as a function of the block size. The 2a variant was used. Tests were run on the non-transposed graphs.
Figure 1. Compression ratios (bpe) and access times per edge (time in µs/edge vs. space in bits/edge) for the EU-2005, Indochina-2004, UK-2002 and Arabic-2005 datasets, direct and transposed, comparing WebGraph, BFS, LM-bitmap and LM-diff. [plots omitted]

We presented two algorithms for Web graph compression, encoding blocks consisting of whole lines. All those algorithms achieve much better compression results than those presented in the literature, although two of them at the price of relatively slow access times. The more interesting algorithm, based on list merging, seems to be rather competitive with the algorithms known from the literature. Our approach lets us achieve compression ratios not reported in the literature (LM-diff, 128) for one-directional queries, at a moderate slow-down in list accesses (the best tradeoffs here, however, seem to be the variants LM-diff and LM-bitmap for h = 32). If even better compression ratios are welcome, then our SSL 4b variant can be considered, being more than an order of magnitude slower. We point out that one extreme tradeoff in succinct in-memory data structures is when accessing the structure is only slightly faster than reading the data from disk. The niche for such a solution is when the given Web crawl cannot fit in RAM memory using a less tight compressed representation, while the stronger compression is already enough. The disk transfer rate is of relatively small importance here; what matters is the access time, which is about 10 ms or more for commodity 7200 RPM hard disks. Our algorithms spend significantly less time extracting an average adjacency list, even if they are 1 or 2 orders of magnitude slower than the solutions from [7,11,12]. Another challenge is to compete with SSD disks, which are not much faster than conventional disks in reading or writing sequential data, but whose access times are two orders of magnitude smaller. Here our LM variants are fast enough, though.

Our algorithm works locally. In the future we are going to try to squeeze out some global redundancy while compressing the LM byproducts. A natural candidate for such experiments is the Re-Pair algorithm [23,13]. Other lines of research we are planning to follow are Web graph compression with bidirectional navigation and efficient compression of URLs. As for bidirectional navigation, the very recent idea from Claude and Ladra [10] is a prospective approach, in combination with LM, but even naively summing up the sizes of the two structures we build now, for the direct and the transposed graph, gives quite interesting results (see [19,10] for comparison).
References

1. V. N. Anh and A. F. Moffat: Local modeling for webgraph compression, in DCC, J. A. Storer and M. W. Marcellin, eds., IEEE Computer Society, 2010, p. 519.
2. A. Apostolico and G. Drovandi: Graph compression by BFS. Algorithms, 2(3) 2009, pp. 1031–1044.
3. Y. Asano, Y. Miyawaki, and T. Nishizeki: Efficient compression of web graphs, in COCOON, X. Hu and J. Wang, eds., vol. 5092 of Lecture Notes in Computer Science, Springer, 2008, pp. 1–11.
4. K. Bharat, A. Z. Broder, M. R. Henzinger, P. Kumar, and S. Venkatasubramanian: The Connectivity Server: Fast access to linkage information on the Web. Computer Networks, 30(1–7) 1998, pp. 469–477.
5. P. Boldi, M. Rosa, M. Santini, and S. Vigna: Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks, in WWW, S. Srinivasan, K. Ramamritham, A. Kumar, M. P. Ravindra, E. Bertino, and R. Kumar, eds., ACM, 2011, pp. 587–596.
6. P. Boldi, M. Santini, and S. Vigna: Permuting web and social graphs. Internet Mathematics, 2009, pp. 257–283.
7. P. Boldi and S. Vigna: The webgraph framework I: Compression techniques, in WWW, S. I. Feldman, M. Uretsky, M. Najork, and C. E. Wills, eds., ACM, 2004, pp. 595–602.
8. N. Brisaboa, S. Ladra, and G. Navarro: K2-trees for compact web graph representation, in Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 5721, Springer, 2009, pp. 18–30.
9. G. Buehrer and K. Chellapilla: A scalable pattern mining approach to web graph compression with communities, in WSDM, M. Najork, A. Z. Broder, and S. Chakrabarti, eds., ACM, 2008, pp. 95–106.
10. F. Claude and S. Ladra: Practical representations for web and social graphs, in Proc. ACM Conference on Information and Knowledge Management, ACM, 2011, to appear.
11. F. Claude and G. Navarro: Fast and compact Web graph representations, Tech. Rep. TR/DCC-2008-3, Department of Computer Science, University of Chile, April 2008.
12. F. Claude and G. Navarro: Extended compact web graph representations, in Algorithms and Applications, T. Elomaa, H. Mannila, and P. Orponen, eds., vol. 6060 of Lecture Notes in Computer Science, Springer, 2010, pp. 77–91.
13. F. Claude and G. Navarro: Fast and compact web graph representations. ACM Transactions on the Web (TWEB), 4(4) 2010.
14. J. S. Culpepper and A. Moffat: Enhanced byte codes with restricted prefix properties, in SPIRE, M. P. Consens and G. Navarro, eds., vol. 3772 of Lecture Notes in Computer Science, Springer, 2005, pp. 1–12.
15. R. F. Geary, N. Rahman, R. Raman, and V. Raman: A simple optimal representation for balanced parentheses, in Combinatorial Pattern Matching, 15th Annual Symposium, CPM 2004, Istanbul, Turkey, July 5–7, 2004, Proceedings, S. C. Sahinalp, S. Muthukrishnan, and U. Dogrusöz, eds., vol. 3109 of Lecture Notes in Computer Science, Springer-Verlag, 2004, pp. 159–172.
16. S. Grabowski and W. Bieniecki: Tight and simple Web graph compression, in Proc. Prague Stringology Conference, J. Holub and J. Žďárek, eds., 2010, pp. 127–137.
17. S. Grabowski and W. Bieniecki: Merging adjacency lists for efficient Web graph compression, in Proc. International Conference on Man-Machine Interactions, Springer, 2011, to appear.
18. X. He, M.-Y. Kao, and H.-I. Lu: A fast general methodology for information-theoretically optimal encodings of graphs. SIAM J. Comput., 30(3) 2000, pp. 838–846.
19. C. Hernández and G. Navarro: Compression of web and social graphs supporting neighbor and community queries, in Proc. 5th ACM Workshop on Social Network Mining and Analysis (SNA-KDD), ACM, 2011, to appear.
20. G. Jacobson: Succinct Static Data Structures, PhD thesis, 1989.
21. S. Ladra: Algorithms and Compressed Data Structures for Information Retrieval, PhD thesis, 2011.
22. N. J. Larsson and A. Moffat: Off-line dictionary-based compression. Proceedings of the IEEE, 88(11) Nov. 2000, pp. 1722–1732.
23. N. J. Larsson and A. Moffat: Off-line dictionary-based compression. Proceedings of the IEEE, 88(11) 2000, pp. 1722–1732.
24. J. I. Munro and V. Raman: Succinct representation of balanced parentheses, static trees and planar graphs, in IEEE Symposium on Foundations of Computer Science (FOCS), 1997, pp. 118–126.
25. G. Navarro and V. Mäkinen: Compressed full-text indexes. ACM Computing Surveys, 39(1) 2007, article 2.
26. P. Procházka and J. Holub: New word-based adaptive dense compressors, in IWOCA, J. Fiala, J. Kratochvíl, and M. Miller, eds., vol. 5874 of Lecture Notes in Computer Science, Springer, 2009, pp. 420–431.
27. K. Randall, R. Stata, R. Wickremesinghe, and J. Wiener: The link database: Fast access to graphs of the Web, 2001.
28. G. Turán: On the succinct representation of graphs. Discrete Applied Mathematics, 15(2) May 1984, pp. 604–618.
Appendix

                 direct graph           transposed graph
                 bpe     time [µs]      bpe     time [µs]
BV (7,3)         5.679   0.211          3.304   0.160
BFS, l4          4.325   0.192          3.367   0.144
BFS, l8          3.561   0.219          2.996   0.183
BFS, l16         3.169   0.330          2.803   0.289
BFS, l32         2.969   0.583          2.708   0.576
BFS, l1024       2.776   14.579         2.631   13.134
LM-bitmap, 8     3.814   0.152          2.951   0.173
LM-bitmap, 16    2.963   0.231          2.576   0.275
LM-bitmap, 32    2.373   0.403          2.233   0.508
LM-bitmap, 64    2.008   0.711          2.016   1.004
LM-bitmap, 128   1.838   1.370          1.963   2.176
LM-diff, 8       4.115   0.193          3.204   0.200
LM-diff, 16      2.964   0.296          2.543   0.329
LM-diff, 32      2.275   0.481          2.107   0.547
LM-diff, 64      1.867   0.802          1.854   0.931
LM-diff, 128     1.640   1.396          1.727   1.609

Table 6. EU-2005 dataset. Compression ratios (bpe) and access times per edge. All compressors are written in Java and were run with JRE 7. The extra data required to access the graph in random order are included.

                 direct graph           transposed graph
                 bpe     time [µs]      bpe     time [µs]
BV (7,3)         2.411   0.153          1.384   0.130
BFS, l4          2.331   0.137          1.339   0.091
BFS, l8          1.860   0.199          1.158   0.112
BFS, l16         1.615   0.257          1.063   0.173
BFS, l32         1.488   0.403          1.016   0.326
BFS, l1024       1.363   9.516          0.976   6.128
LM-bitmap, 8     2.207   0.103          1.630   0.121
LM-bitmap, 16    1.668   0.139          1.411   0.169
LM-bitmap, 32    1.320   0.216          1.228   0.297
LM-bitmap, 64    1.097   0.357          1.093   0.568
LM-bitmap, 128   0.982   0.687          1.040   1.219
LM-diff, 8       2.412   0.145          1.824   0.151
LM-diff, 16      1.704   0.221          1.428   0.239
LM-diff, 32      1.295   0.360          1.180   0.404
LM-diff, 64      1.053   0.620          1.030   0.694
LM-diff, 128     0.915   1.127          0.950   1.243

Table 7. Indochina-2004 dataset. Compression ratios (bpe) and access times per edge. All compressors are written in Java and were run with JRE 7. The extra data required to access the graph in random order are included.

                 direct graph           transposed graph
                 bpe     time [µs]      bpe     time [µs]
BV (7,3)         3.567   0.225          2.218   0.200
BFS, l4          3.369   0.236          2.152   0.147
BFS, l8          2.627   0.264          1.883   0.181
BFS, l16         2.242   0.357          1.742   0.260
BFS, l32         2.042   0.542          1.673   0.455
BFS, l1024       1.851   12.618         1.621   10.370
LM-bitmap, 8     3.490   0.158          2.714   0.178
LM-bitmap, 16    2.733   0.219          2.381   0.260
LM-bitmap, 32    2.241   0.346          2.113   0.444
LM-bitmap, 64    1.925   0.584          1.919   0.841
LM-bitmap, 128   1.760   1.120          1.842   1.773
LM-diff, 8       3.853   0.201          3.043   0.213
LM-diff, 16      2.813   0.297          2.438   0.328
LM-diff, 32      2.203   0.468          2.064   0.532
LM-diff, 64      1.843   0.771          1.849   0.900
LM-diff, 128     1.632   1.336          1.742   1.557

Table 8.