Tight and simple Web graph compression
Szymon Grabowski and Wojciech Bieniecki
Computer Engineering Department, Technical University of Łódź, Al. Politechniki 11, 90–924 Łódź, Poland
{sgrabow,wbieniec}@kis.p.lodz.pl

Abstract.
Analysing Web graphs has applications in determining page ranks, fighting Web spam, detecting communities and mirror sites, and more. This study is however hampered by the necessity of storing a major part of huge graphs in the external memory, which prevents efficient random access to edge (hyperlink) lists. A number of algorithms involving compression techniques have thus been presented, to represent Web graphs succinctly while also providing random access. Those techniques are usually based on differential encodings of the adjacency lists, finding repeating nodes or node regions in the successive lists, more general grammar-based transformations or 2-dimensional representations of the binary matrix of the graph. In this paper we present two Web graph compression algorithms. The first can be seen as engineering of the Boldi and Vigna (2004) method. We extend the notion of similarity between link lists and use a more compact encoding of residuals. The algorithm works on blocks of varying size (in the number of input lines) and sacrifices access time for better compression ratio, achieving a more succinct graph representation than other algorithms reported in the literature. The second algorithm works on blocks of the same size, in the number of input lines, and its key mechanism is merging the block into a single ordered list. This method achieves much more attractive space-time tradeoffs.
Key words: graph compression, random access
Development of succinct data structures has been one of the most active research areas in algorithmics in recent years. A succinct data structure shares the interface with its classic (non-succinct) counterpart, but is represented in much smaller space, via data compression. Successful examples along these lines include text indexes [25], dictionaries, trees [24,15] and graphs [24]. Queries to succinct data structures are usually slower (in practice, although not always in complexity terms) than using non-compressed structures, hence the main motivation for using them is to allow huge datasets to be handled in the main memory. For example, indexed exact pattern matching in DNA would be limited to sequences shorter than 1 billion nucleotides on a commodity PC with 4 GB of main memory, if the indexing structure were the classic suffix array (SA), and to even less than half of that, if the SA were replaced with a suffix tree. On the other hand, switching to some compressed full-text index (see [25] for a survey) shifts the limit to over 10 billion nucleotides, which is more than enough to handle the whole human genome.

Another huge object of significant interest is the Web graph. This is a directed unlabeled graph of connections between webpages (i.e., documents), where the nodes are individual HTML documents and the edges from a given node are the outgoing links to other nodes. We assume that the order of hyperlinks in a document is irrelevant. Web graph analyses can be used to rank pages, fight Web spam, detect communities and mirror sites, etc. As of early Sept. 2011, it is estimated that Google's index has about 44 billion webpages.
Assuming 20 outgoing links per node, 5-byte links (4-byte indexes to other pages are simply too small) and pointers to each adjacency list, we would need more than 4.4 TB of memory, way beyond the capacities of current RAM memories. We believe that, confronted with these figures, the reader is now convinced of the necessity of compression techniques for Web graph representation. Preliminary versions of this manuscript were published in [16] and [17].

We assume that a directed graph G = (V, E) is a set of n = |V| vertices and m = |E| edges. The earliest works on graph compression were theoretical, and they usually dealt with specific graph classes. For example, it is known that planar graphs can be compressed into O(n) bits [28,18]. For dense enough graphs, it is impossible to reach o(m log n) bits of space, i.e., to go below the space complexity of the trivial adjacency list representation. Since the seminal Jacobson's thesis [20] on succinct data structures, papers have appeared taking into account not only the space occupied by a graph, but also access times.

There are several works dedicated to Web graph compression. Bharat et al. [4] suggested ordering documents according to their URLs, to exploit the simple observation that most outgoing links actually point to another document within the same Web site. Their Connectivity Server provided linkage information for all pages indexed by the AltaVista search engine at that time. The links are merely represented by node numbers (integers) following the URL lexicographical order. We noted that we assume the order of hyperlinks in a document irrelevant (as most works on Web graph compression do), hence the link lists can be sorted in ascending order. As the successive numbers tend to be close, differential encoding may be applied efficiently. Randall et al.
[27] also use this technique (stating that for their data 80% of all links are local), but they also note that commonly many pages within the same site share large parts of their adjacency lists. To exploit this phenomenon, a given list may be encoded with a reference to another list from its neighborhood (located earlier), plus a set of additions and deletions to/from the referenced list. Their encoding, in the most compact variant, encodes an outgoing link in 5.55 bits on average, a result reported over a Web crawl consisting of 61 million URLs and 1 billion links.

One of the most efficient compression schemes for Web graphs was presented by Boldi and Vigna [7] in 2003. Their method is likely to achieve around 3 bits per edge, or less, at link access times below 1 ms on their 2.4 GHz Pentium4 machine. Of course, the compression ratios vary from dataset to dataset. We are going to describe the Boldi and Vigna algorithm in detail in the next section, as it is the main inspiration for our solution.

Claude and Navarro [11,13] took a totally different approach of grammar-based compression. In particular, they focus on Re-Pair [22] and LZ78 compression schemes, getting close, and sometimes even below, the compression ratios of Boldi and Vigna, while achieving much faster access times. To mitigate one of the main disadvantages of Re-Pair, its high memory requirements, they developed an approximate variant of this algorithm.

Apostolico and Drovandi [2] renumber the nodes according to a BFS traversal of the graph instead of the URL order; a variant of their scheme answers the query whether page i has a link to page j almost twice as fast as returning the whole neighbor list. Still, we note that using a non-lexicographical ordering is harmful for compact storage of the webpage URLs themselves (a problem accompanying pure graph structure compression in most practical applications). Note also that reordering the graph is the approach followed in more recent works from the Boldi and Vigna team [6,5].

Anh and Moffat [1] devised a scheme which seems to use grammar-based compression in a local manner.
They work in groups of h consecutive lists and perform some operations to reduce their size (e.g., a sort of 2-dimensional RLE if a run of successive integers appears on all the h lists). What remains in the group is then encoded statistically. Their results are very promising: graph representations about 15–30% (or even more in some variants) smaller than the BV algorithm with practical parameter choices (in particular, Anh and Moffat achieve 3.81 bpe and 3.55 bpe for the graph EU), with reportedly comparable decoding speed. Details of the algorithm cannot however be deduced from their 1-page conference poster.

Recent works focus on graph compression with support for bidirectional navigation. To this end, Brisaboa et al. [8] proposed the k²-tree, a spatial data structure, related to the well-known quadtree, which performs a binary partition of the graph adjacency matrix and labels empty areas with 0s and non-empty areas with 1s. The non-empty areas are recursively split and labeled, until reaching the leaves (single nodes). An important component in their scheme is an auxiliary structure to compute rank queries [20] efficiently, to navigate between tree levels. It is easy to notice that this elegant data structure supports handling both forward and reverse neighbors, which follows from its symmetry. Ladra [21] proposed a more efficient encoding of the leaves (which are boxes of size e.g. 8 × 8), improving upon the original k²-tree.

Finally, we have to mention the Hernández and Navarro work [19], where they combine their previous techniques, the k²-tree [8] and Re-Pair for compressing the graph binary relation [12], with edge reducing [9], obtaining interesting trade-offs. In particular, if some of the access time can be sacrificed, the space they achieve is the smallest known among the solutions supporting bidirectional queries.
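To make the k²-tree partition concrete, here is a toy sketch (not the authors' code; k = 2, matrix side a power of 2). A real k²-tree stores the per-level bit sequences with rank support for navigation, which we omit.

```python
def k2_build(mat, k=2):
    """Recursively partition a square 0/1 matrix (side a power of k).
    For each sub-square, emit one bit per level: 0 = all-zero (stop),
    1 = non-empty (recurse). Returns the per-level bit sequences."""
    levels = []

    def rec(r0, c0, size, depth):
        if len(levels) <= depth:
            levels.append([])
        sub = size // k
        for dr in range(k):
            for dc in range(k):
                r, c = r0 + dr * sub, c0 + dc * sub
                block = [mat[i][j] for i in range(r, r + sub)
                                   for j in range(c, c + sub)]
                bit = 1 if any(block) else 0
                levels[depth].append(bit)
                if bit and sub > 1:
                    rec(r, c, sub, depth + 1)

    rec(0, 0, len(mat), 0)
    return levels

# 4x4 adjacency matrix with edges (0,1) and (3,3)
mat = [[0, 1, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 1]]
levels = k2_build(mat)  # levels[0] == [1, 0, 0, 1]
```

Only the two non-empty quadrants are expanded at the next level, which is where the space savings on sparse matrices come from.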
Based on WebGraph datasets (http://webgraph.dsi.unimi.it/), Boldi and Vigna noticed that similarity is strongly concentrated; typically, either two adjacency (edge) lists have nothing or little in common, or they share large subsequences of edges. To exploit this redundancy, one bit per entry of the referenced list could be used, to denote which of its integers are copied to the current list, and which are not. Those bit-vectors are dubbed copy lists. Still, Boldi and Vigna go further, noticing that copy lists tend to contain runs of 0s and 1s, thus they compress them using a sort of run-length encoding. They assume the first run consists of 1s (if the copy list actually starts with 0s, the length of the first run is simply zero), which allows them to represent a copy list as only a sequence of run lengths, encoded e.g. with Elias coding.

The integers on the current list which did not occur on the referenced list must be stored too, and how to encode them is another novelty of the described algorithm. They detect intervals of consecutive (i.e., differing by 1) integers and encode them as pairs of the left boundary and the interval length; the left boundary of the next interval on a given list is encoded as the difference to the right boundary of the previous interval minus two (this is because between the end of one interval and the beginning of another there must be at least one integer). The numbers which do not fall into any interval are called residuals and are also stored, encoded in a differential manner.

Finally, the algorithm allows selecting as the reference list one of several previous lines; the size of this window is one of the parameters of the algorithm, posing a tradeoff between compression ratio and compression/decompression time and space. Another parameter affecting the results is the maximum reference count, which is the maximum allowed length of a chain of lists such that one cannot be decoded without extracting its predecessor in the chain.

We present two approaches to Web graph compression working locally, in small blocks; the first one usually reaches slightly higher compression ratios, but the second is more practical, being much faster.

Alg. 1 GraphCompressSSL(G, BSIZE)
 1  firstLine ← true
 2  prev ← []
 3  outB ← []
 4  outF ← []
 5  for line ∈ G do
 6      residuals ← line
 7      if firstLine = false then
 8          f[1 . . . |prev|] ← [1, 1, . . . , 1]
 9          for i ← 1 to |prev| do
10              if prev[i] ∈ line then f[i] ← 0
11              else if prev[i] + 1 ∈ line then f[i] ← 2
12              else if prev[i] + 2 ∈ line then f[i] ← 3
13          append(outF, f)
14          for i ← 1 to |prev| do
15              if f[i] ≠ 1 then
16                  remove(residuals, prev[i] + max(f[i] − 1, 0))   ▷ drop the value matched by flag 0, 2 or 3
17      residuals′ ← RLE(diffEncode(residuals)) + [0]
18      append(outB, byteEncode(residuals′))
19      prev ← line
20      firstLine ← false
21      if |outB| ≥ BSIZE then
22          compress(outB)
23          compress(outF)
24          outB ← []
25          outF ← []
26          firstLine ← true
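The copy-list construction of Alg. 1 (explained in detail below) can be sketched in Python as follows. This is a simplified model with our own function names; the RLE and byte-coding stages, and the handling of overlapping matches in the real implementation, are omitted.

```python
def ssl_copy_list(prev, line):
    """Flags and residuals of `line` against the reference list `prev`.
    Flags: 0 = prev[i] occurs on `line`, 2 = prev[i]+1 occurs,
    3 = prev[i]+2 occurs, 1 = no match; the smallest applicable
    flag wins. Residuals are the values of `line` not covered."""
    cur = set(line)
    flags = []
    covered = set()
    for v in prev:
        if v in cur:
            flags.append(0)
            covered.add(v)
        elif v + 1 in cur:
            flags.append(2)
            covered.add(v + 1)
        elif v + 2 in cur:
            flags.append(3)
            covered.add(v + 2)
        else:
            flags.append(1)
    residuals = [v for v in line if v not in covered]
    return flags, residuals

def diff_encode(sorted_vals):
    """Gap-encode a sorted list, as done before RLE and byte coding."""
    out, prev = [], 0
    for v in sorted_vals:
        out.append(v - prev)
        prev = v
    return out

flags, residuals = ssl_copy_list([10, 20, 30, 40], [10, 21, 32, 99])
# flags == [0, 2, 3, 1], residuals == [99]
```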
Our first algorithm (Alg. 1; SSL stands for "similarity of successive lists") works in blocks consisting of multiple adjacency lists. The blocks in their compact form are approximately equal in size, which means that the number of adjacency lists per block varies; for example, in graph areas with dominating short lists the number of lists per block is greater than elsewhere.

We work in two phases: preprocessing and final compression, the latter using a general-purpose compression algorithm. The algorithm processes the adjacency lines one by one and splits their data into two streams.

One stream holds copy lists, in an extended sense compared to the Boldi and Vigna solution. Our copy lists are no longer binary but consist of four different flag symbols: if the corresponding integer from the reference list is j, then 0 denotes an exact match (i.e., the value j occurs somewhere on the current list), 2 means that the current list contains the integer j + 1, and 3 means that the current list contains the integer j + 2. Finally, the flag 1 corresponds to those items of the reference list which have not been labeled with 0, 2 or 3. Of course, several events may happen for a single element; e.g., the integer 34 from the reference list triggers three events if the current list contains 34, 35 and 36. In such a case, the flag with the smallest value is chosen (i.e., 0 in our example). Moreover, we make things even simpler than in the Boldi–Vigna scheme: our reference list is always the previous adjacency list.

The other stream stores residuals, i.e., the values which cannot be decoded with flags 0, 2 or 3 on the copy lists. First, differential encoding is applied, and then an RLE compressor for differences of 1 only (with the minimum run length set experimentally to 5) is run. The resulting sequence is terminated with a unique value (0) and then encoded using a byte code. For this last step, we consider two variants.
One is similar to the two-byte dense code [26] in spending one flag bit in the first codeword byte to tell the length of the current codeword. Namely, we choose between 1 and b bytes for encoding each number, where b is the minimum integer such that 8b − 1 bits suffice to encode any value (b = 3 for EU and b = 4 for the remaining available datasets). The second coding variant can be classified as a prelude code [14] in which two bits in the first codeword byte tell the length of the current codeword; originally the lengths are 1, 2, 3 and 4, but we take 1, 2 and b, such that 8b − 2 bits suffice to encode any value (b could be 5 or 6 for really huge graphs).

Once the residual buffer reaches at least BSIZE bytes, it is time to end the current block and start a new one. Both residual and flag buffers are then (independently) compressed (we used the well-known Deflate algorithm for this purpose) and flushed. The code of Alg. 1 is slightly simplified; we omitted technical details serving for finding the list boundaries in all cases (e.g., empty lines).

Our second algorithm (Alg. 2; LM stands for "list merging") works in blocks having the same number of lists, h (at least in this aspect our algorithm resembles the one from [1]). Given the block of h lists, the procedure converts it into two streams: one stores a single long list consisting of all integers on the h input lists, without duplicates, and the other stores the flags necessary to reconstruct the original lists. In other words, the algorithm performs a reversible merge of all the lists in the block. The long list is compacted in a manner similar to the previous algorithm: the list is differentially encoded, zero-terminated and submitted to a byte coder (only the variant with 1, 2 and b bytes per codeword was tried). Note that we gave up the RLE phase here.

Alg. 2 GraphCompressLM(G, h)
 1  outF ← []
 2  i ← 1
 3  for line_i, line_{i+1}, . . . , line_{i+h−1} ∈ G do
 4      tempLine ← line_i ∪ line_{i+1} ∪ . . . ∪ line_{i+h−1}
 5      tempLine ← removeDuplicates(tempLine)
 6      longLine ← sort(tempLine)
 7      items ← diffEncode(longLine) + [0]
 8      outB ← byteEncode(items)
 9      for j ← 1 to |longLine| do
10          f[1 . . . h] ← [0, 0, . . . , 0]
11          for k ← 1 to h do
12              if longLine[j] ∈ line_{i+k−1} then f[k] ← 1
13          append(outF, bitPack(f))
14      compress(concat(outB, outF))
15      outF ← []
16      i ← i + h

The flags describe to which input lists a given integer on the output list belongs; the number of bits per item on the output list is h, and in practical terms we assume h to be a multiple of 8 (and even additionally a power of 2, in the experiments to follow). The flag sequence does not need any terminator, since its length is defined by the length of the long list, which is located earlier in the output stream. For example, if the length of the long list is 91 and h = 32, the corresponding flag sequence has 364 bytes. Now, we consider two variations for encoding the flag sequence: either the flags are kept raw (this variant is later denoted as LM-bitmap), or the differences (gaps) between the successive 1s in the flag sequence are written on individual bytes (this variant is later denoted as
LM-diff). We note that each run of h bits corresponding to the flags for a single value on the output list must contain at least one set bit, hence the maximum gap between any two 1s in the resulting sequence is 2h − 1; thus for h ≤ 128 each gap value can be stored on a single byte (a preliminary experiment with h = 256 and using a byte code for gap encoding was rather unsuccessful). Alg. 2 presents the LM-bitmap variant.

Those two sequences, the compacted long list and the flag sequence (either raw or gap-encoded), are then concatenated and compressed with the Deflate algorithm.

One can see that the key parameter here is the block size, h. Using a larger h lets us exploit a wider range of similar lists, but it also has two drawbacks. The flag sequence gets more and more sparse (for example, for h = 64 and the EU-2005 crawl, as much as about 68% of its list indicators have only one set bit out of 64!), and the Deflate compressor becomes relatively inefficient on such data; a drawback more important in the LM-bitmap variant. Worse, decoding larger blocks takes longer.

The experiments with the SSL algorithm comprise only the datasets EU-2005 and Indochina-2004, while the more practical LM variants are tested also on the UK-2002 and Arabic-2005 crawls; all the datasets were downloaded from the WebGraph project (http://webgraph.dsi.unimi.it/), using both direct and transposed graphs. Note that we use the natural order versions of them, as using reordered variants (also available from the WebGraph project) may be more efficient, but then the compression of the corresponding URL data deteriorates. The main characteristics of those datasets are presented in Table 1.
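The reversible merge at the heart of Alg. 2 can be illustrated with a short Python sketch (our own names; diffEncode, byte coding and Deflate are left out, and the h-bit flag vectors are packed into integers rather than a bit stream):

```python
def lm_block(lists):
    """Merge a block of h adjacency lists into one duplicate-free sorted
    list plus, per merged value, an h-bit membership flag vector."""
    h = len(lists)
    sets = [set(l) for l in lists]
    long_list = sorted(set().union(*sets))
    flags = []
    for v in long_list:
        f = 0
        for k in range(h):
            if v in sets[k]:
                f |= 1 << k          # bit k set: value v occurs on list k
        flags.append(f)
    return long_list, flags

def lm_unblock(long_list, flags, h):
    """Reconstruct the original h lists from the merged form."""
    lists = [[] for _ in range(h)]
    for v, f in zip(long_list, flags):
        for k in range(h):
            if f >> k & 1:
                lists[k].append(v)
    return lists

def flags_to_gaps(flags, h):
    """LM-diff: gaps between successive 1s in the concatenated flag
    stream; each h-bit group holds >= 1 set bit, so no gap exceeds 2h - 1."""
    positions = [i * h + k for i, f in enumerate(flags)
                           for k in range(h) if f >> k & 1]
    return [b - a for a, b in zip([-1] + positions, positions)]

long_list, flags = lm_block([[1, 5, 9], [5, 9, 12], [2, 5]])
# long_list == [1, 2, 5, 9, 12]
```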
Dataset                   EU-2005             Indochina-2004        UK-2002              Arabic-2005
                     direct  transposed    direct  transposed   direct  transposed   direct  transposed
Nodes                      862664                7414866             18520486             22744080
Edges                    19235140              194109311            298113762            639999458
Edges / nodes               22.30                  26.18                16.10                28.14
% of empty lists      8.309   0.000        17.655   0.004       14.908   0.637       14.514   0.002
Longest list length    6985   68922          6985  256425         2450  194942         9905  575618

Table 1. Selected characteristics of the datasets used in the experiments.

The main experiments (Sect. 5.1) were run on a machine equipped with an Intel Core 2 Quad Q9450 CPU, 8 GB of RAM, running Microsoft Windows XP (64-bit). Our algorithms were implemented in Java and run on the 64-bit JVM (JRE 6 was used in the first series of tests, involving SSL, and JRE 7 in the later tests, with the LM variants). A single CPU core was used by all implementations. As is seemingly accepted in most reported works, we measure access time per edge, extracting many (100,000 in our case) randomly selected adjacency lists, summing those times, and dividing the total time by the number of edges on the required lists. The space is measured in bits per edge (bpe), dividing the total space of the structure (including entry points to blocks) by the total number of edges. Throughout this section, by 1 KB we mean 1000 bytes.
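The measurement methodology above can be sketched as follows (a minimal model; the `extract` callback standing for the decompressor's list-extraction routine is our hypothetical stand-in):

```python
import random
import time

def measure_bpe_and_access(size_bits, total_edges, extract, n_nodes,
                           samples=100_000):
    """Space in bits per edge, and mean access time per edge obtained by
    extracting `samples` random adjacency lists and dividing the total
    time by the number of edges on the extracted lists."""
    bpe = size_bits / total_edges
    ids = [random.randrange(n_nodes) for _ in range(samples)]
    t0 = time.perf_counter()
    edges = sum(len(extract(v)) for v in ids)
    elapsed = time.perf_counter() - t0
    time_per_edge = elapsed / edges if edges else float('inf')
    return bpe, time_per_edge
```

Note that dividing by the number of extracted edges (rather than by the number of queries) favors neither short nor long lists, which is why the text reports both per-edge and per-list figures.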
Our first algorithm, SSL, has three parameters: the number of flags used (either 2 or 4, where 2 flags mimic the Boldi–Vigna scheme and 4 correspond to Alg. 1), the byte encoding scheme (using either 2 or 3 codeword lengths), and the residual block size threshold BSIZE. As for the last parameter, we initially set it to 8192, which means that the residual block gets closed and is submitted to the Deflate compression once it reaches at least 8192 bytes. Experiments with the block size are presented in the next subsection. The remaining parameters constitute four variants:

2a. Two flags and two codeword lengths are used.
2b. Two flags and three codeword lengths are used.
4a. Four flags and two codeword lengths are used.
4b. Four flags and three codeword lengths are used.

As expected, the compression ratios improve with more flags and denser byte codes (Table 2). Tables 3 and 4 present the compression and access time results for the two extreme variants: 2a and 4b. Here we see that using more aggressive preprocessing is unfortunately slower (partly because of the increased amount of flag data per block); the difference in speed between variants 2a and 4b is close to 50%. Translating the times per edge into times per neighbor list, we need from 410 µs to 550 µs for 2a and from 620 µs to 760 µs for 4b. This is about 10 times less than the access time of 10K or 15K RPM hard disks.

          EU-2005               Indochina-2004
     direct  transposed     direct  transposed
2a    2.286    2.345         1.101    1.087
2b    2.199    2.290         1.062    1.065
4a    1.735    1.809         0.936    0.903
4b    1.696    1.782         0.909    0.890

Table 2. The algorithm based on similarity of successive lists; compression ratios in bits per edge.

Our second algorithm, LM, has one parameter, h, the number of lines (lists) per block. We conducted experiments for h = 16, 32, 64; the results are presented in the last three rows of Tables 3 and 4, respectively. For this comparison, only the LM-bitmap variant is used. We see that even LM64 cannot reach the compression of our 4b variant, but its list extraction is 14–27 times faster. The fastest of the variants presented here, LM16, is 1.3 and 2.0 times slower than BV (7,3), respectively, with much better compression (we also checked LM8, only on EU-2005: the results are 3.814 bpe and 0.20 µs per edge).

           direct graph          transposed graph
           bpe     time [µs]     bpe     time [µs]
BV (7,3)   5.169   0.24          –       –
2a         2.286   18.59         2.345   18.88
4b         1.696   28.93         1.782   27.83
LM16       2.963   0.31          2.576   0.82
LM32       2.373   0.55          2.233   1.05
LM64       2.008   1.05          2.016   2.01

Table 3.
EU-2005 dataset. Compression ratios (bpe) and access times per edge. "LMx" stands for LM-bitmap with h = x. To the results of BV (7,3) the amount of 0.510 bpe should be added, corresponding to extra data required to access the graph in random order.

           direct graph          transposed graph
           bpe     time [µs]     bpe     time [µs]
BV (7,3)   2.063   0.21          –       –
2a         1.101   20.77         1.087   21.10
4b         0.909   29.03         0.890   27.43
LM16       1.668   0.43          1.411   0.47
LM32       1.320   0.55          1.228   0.69
LM64       1.097   0.79          1.093   1.16

Table 4.
Indochina-2004 dataset. Compression ratios (bpe) and access times per edge. "LMx" stands for LM-bitmap with h = x. To the results of BV (7,3) the amount of 0.348 bpe should be added, corresponding to extra data required to access the graph in random order.

The larger experiment was run on four datasets (in both direct and transposed versions); the obtained results are presented in Fig. 1, and the exact numbers, for more careful examination, can be found in the appendix. The LM-bitmap variant fares better in the comparison for smaller blocks (h up to 16), but then the LM-diff variant starts to win in compression, and the gap grows with growing h. Unfortunately, decoding LM-diff blocks is also in most cases costlier, with a 74% maximum loss for Indochina-2004 direct, h = 64. On average, however, its loss in speed to LM-bitmap is not that big.

Obviously, the block size should seriously affect the overall space used by the structure and the access time. Larger blocks mean that the Deflate algorithm is more successful in finding longer matches and that the overhead from encoding the first lines in a block without any reference is smaller. On the other hand, more lines usually have to be decoded before extracting the queried adjacency list. In this experiment we ran the 2a algorithm (the same implementation in Java) with each block of residuals terminated (and later Deflate-compressed) after reaching a BSIZE of 1024, 2048, 4096, 8192 and 16384 bytes, respectively. The test computer had an Intel Pentium4 HT 3.0 GHz CPU, 1 GB of RAM, and was running Microsoft Windows XP Home SP3 (32-bit). The results (Table 5) show that doubling the block size implies a space reduction of about 10%, while the access time grows less than twice (in particular, using 8K blocks is only 2.0–2.5 times slower than using 2K blocks). Still, as the block size gets larger (compare the last two rows in the table), the improvement in compression starts to drop while the slowdown grows.
For a reference, the access times of a practical Boldi–Vigna variant, BV (7,3), are 0.47 µs and 0.42 µs on the test machine.

          EU-2005               Indochina-2004
BSIZE     bpe     time [µs]     bpe     time [µs]
1024      3.398   6.50          1.485   8.99
2048      2.869   8.91          1.292   12.05
4096      2.513   15.93         1.172   17.87
8192      2.286   27.60         1.101   29.83
16384     2.129   48.77         1.061   57.39

Table 5. Compression ratios and access times as a function of the block size. The 2a variant was used. Tests were run on the non-transposed graphs.
Figure 1. Compression ratios (bpe) and access times per edge (time in µs/edge vs. space in bits/edge) for the EU-2005, Indochina-2004, UK-2002 and Arabic-2005 datasets, direct and transposed, comparing WebGraph, BFS, LM-bitmap and LM-diff. [plots omitted]

We presented two algorithms for Web graph compression, encoding blocks consisting of whole lines. All those algorithms achieve much better compression results than those presented in the literature, although two of them at the price of relatively slow access times. The more interesting algorithm, based on list merging, seems to be rather competitive with the algorithms known from the literature. Our approach lets us achieve compression ratios not reported in the literature (LM-diff, 128) for one-directional queries, at a moderate slow-down in list accesses (the best tradeoffs here, however, seem to be the variants LM-diff and LM-bitmap for h = 32). If even better compression ratios are welcome, then our SSL 4b variant can be considered, being more than an order of magnitude slower. We point out that one extreme tradeoff in succinct in-memory data structures is when accessing the structure is only slightly faster than reading the data from disk. The niche for such a solution is when the given Web crawl cannot fit in RAM memory using a less tight compressed representation, while the stronger compression is already enough. The disk transfer rate is of relatively small importance here; what matters is the access time, which is about 10 ms or more for commodity 7200 RPM hard disks. Our algorithms spend significantly less time extracting an average adjacency list, even if they are 1 or 2 orders of magnitude slower than the solutions from [7,11,12]. Another challenge is to compete with SSD disks, which are not much faster than conventional disks in reading or writing sequential data, but whose access times are two orders of magnitude smaller. Here our LM variants are fast enough, though.

Our algorithm works locally. In the future we are going to try to squeeze out some global redundancy while compressing the LM byproducts. A natural candidate for such experiments is the Re-Pair algorithm [23,13]. Other lines of research we are planning to follow are Web graph compression with bidirectional navigation and efficient compression of URLs. As for bidirectional navigation, the very recent idea from Claude and Ladra [10] is a prospective approach, in combination with LM, but even naively summing up the sizes of the two structures we build now, for the direct and the transposed graph, gives quite interesting results (see [19,10] for comparison).
References

1. V. N. Anh and A. F. Moffat: Local modeling for webgraph compression, in DCC, J. A. Storer and M. W. Marcellin, eds., IEEE Computer Society, 2010, p. 519.
2. A. Apostolico and G. Drovandi: Graph compression by BFS. Algorithms, 2(3) 2009, pp. 1031–1044.
3. Y. Asano, Y. Miyawaki, and T. Nishizeki: Efficient compression of web graphs, in COCOON, X. Hu and J. Wang, eds., vol. 5092 of Lecture Notes in Computer Science, Springer, 2008, pp. 1–11.
4. K. Bharat, A. Z. Broder, M. R. Henzinger, P. Kumar, and S. Venkatasubramanian: The Connectivity Server: Fast access to linkage information on the Web. Computer Networks, 30(1–7) 1998, pp. 469–477.
5. P. Boldi, M. Rosa, M. Santini, and S. Vigna: Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks, in WWW, S. Srinivasan, K. Ramamritham, A. Kumar, M. P. Ravindra, E. Bertino, and R. Kumar, eds., ACM, 2011, pp. 587–596.
6. P. Boldi, M. Santini, and S. Vigna: Permuting web and social graphs. Internet Mathematics, 2009, pp. 257–283.
7. P. Boldi and S. Vigna: The webgraph framework I: Compression techniques, in WWW, S. I. Feldman, M. Uretsky, M. Najork, and C. E. Wills, eds., ACM, 2004, pp. 595–602.
8. N. Brisaboa, S. Ladra, and G. Navarro: K2-trees for compact web graph representation, in Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 5721, Springer, 2009, pp. 18–30.
9. G. Buehrer and K. Chellapilla: A scalable pattern mining approach to web graph compression with communities, in WSDM, M. Najork, A. Z. Broder, and S. Chakrabarti, eds., ACM, 2008, pp. 95–106.
10. F. Claude and S. Ladra: Practical representations for web and social graphs, in Proc. ACM Conference on Information and Knowledge Management, ACM, 2011, to appear.
11. F. Claude and G. Navarro: Fast and compact Web graph representations, Tech. Rep. TR/DCC-2008-3, Department of Computer Science, University of Chile, April 2008.
12. F. Claude and G. Navarro: Extended compact web graph representations, in Algorithms and Applications, T. Elomaa, H. Mannila, and P. Orponen, eds., vol. 6060 of Lecture Notes in Computer Science, Springer, 2010, pp. 77–91.
13. F. Claude and G. Navarro: Fast and compact web graph representations. ACM Transactions on the Web (TWEB), 4(4) 2010.
14. J. S. Culpepper and A. Moffat: Enhanced byte codes with restricted prefix properties, in SPIRE, M. P. Consens and G. Navarro, eds., vol. 3772 of Lecture Notes in Computer Science, Springer, 2005, pp. 1–12.
15. R. F. Geary, N. Rahman, R. Raman, and V. Raman: A simple optimal representation for balanced parentheses, in Combinatorial Pattern Matching, 15th Annual Symposium, CPM 2004, Istanbul, Turkey, July 5–7, 2004, Proceedings, S. C. Sahinalp, S. Muthukrishnan, and U. Dogrusöz, eds., vol. 3109 of Lecture Notes in Computer Science, Springer-Verlag, 2004, pp. 159–172.
16. S. Grabowski and W. Bieniecki: Tight and simple Web graph compression, in Proc. Prague Stringology Conference, J. Holub and J. Žďárek, eds., 2010, pp. 127–137.
17. S. Grabowski and W. Bieniecki: Merging adjacency lists for efficient Web graph compression, in Proc. International Conference on Man-Machine Interactions, Springer, 2011, to appear.
18. X. He, M.-Y. Kao, and H.-I. Lu: A fast general methodology for information-theoretically optimal encodings of graphs. SIAM J. Comput., 30(3) 2000, pp. 838–846.
19. C. Hernández and G. Navarro: Compression of web and social graphs supporting neighbor and community queries, in Proc. 5th ACM Workshop on Social Network Mining and Analysis (SNA-KDD), ACM, 2011, to appear.
20. G. Jacobson: Succinct Static Data Structures, PhD thesis, 1989.
21. S. Ladra: Algorithms and Compressed Data Structures for Information Retrieval, PhD thesis, 2011.
22. N. J. Larsson and A. Moffat: Off-line dictionary-based compression. Proceedings of the IEEE, 88(11) Nov. 2000, pp. 1722–1732.
23. N. J. Larsson and A. Moffat: Off-line dictionary-based compression. Proceedings of the IEEE, 88(11) 2000, pp. 1722–1732.
24. J. I. Munro and V. Raman: Succinct representation of balanced parentheses, static trees and planar graphs, in IEEE Symposium on Foundations of Computer Science (FOCS), 1997, pp. 118–126.
25. G. Navarro and V. Mäkinen: Compressed full-text indexes. ACM Computing Surveys, 39(1) 2007, article 2.
26. P. Procházka and J. Holub: New word-based adaptive dense compressors, in IWOCA, J. Fiala, J. Kratochvíl, and M. Miller, eds., vol. 5874 of Lecture Notes in Computer Science, Springer, 2009, pp. 420–431.
27. K. Randall, R. Stata, R. Wickremesinghe, and J. Wiener: The link database: Fast access to graphs of the Web, 2001.
28. G. Turán: On the succinct representation of graphs. Discrete Applied Mathematics, 15(2) May 1984, pp. 604–618.
Appendix

                 direct graph           transposed graph
                 bpe     time [µs]      bpe     time [µs]
BV (7,3)         5.679   0.211          3.304   0.160
BFS, l4          4.325   0.192          3.367   0.144
BFS, l8          3.561   0.219          2.996   0.183
BFS, l16         3.169   0.330          2.803   0.289
BFS, l32         2.969   0.583          2.708   0.576
BFS, l1024       2.776   14.579         2.631   13.134
LM-bitmap, 8     3.814   0.152          2.951   0.173
LM-bitmap, 16    2.963   0.231          2.576   0.275
LM-bitmap, 32    2.373   0.403          2.233   0.508
LM-bitmap, 64    2.008   0.711          2.016   1.004
LM-bitmap, 128   1.838   1.370          1.963   2.176
LM-diff, 8       4.115   0.193          3.204   0.200
LM-diff, 16      2.964   0.296          2.543   0.329
LM-diff, 32      2.275   0.481          2.107   0.547
LM-diff, 64      1.867   0.802          1.854   0.931
LM-diff, 128     1.640   1.396          1.727   1.609

Table 6. EU-2005 dataset. Compression ratios (bpe) and access times per edge. All compressors are written in Java and were run with JRE 7. The extra data required to access the graph in random order are included.

                 direct graph           transposed graph
                 bpe     time [µs]      bpe     time [µs]
BV (7,3)         2.411   0.153          1.384   0.130
BFS, l4          2.331   0.137          1.339   0.091
BFS, l8          1.860   0.199          1.158   0.112
BFS, l16         1.615   0.257          1.063   0.173
BFS, l32         1.488   0.403          1.016   0.326
BFS, l1024       1.363   9.516          0.976   6.128
LM-bitmap, 8     2.207   0.103          1.630   0.121
LM-bitmap, 16    1.668   0.139          1.411   0.169
LM-bitmap, 32    1.320   0.216          1.228   0.297
LM-bitmap, 64    1.097   0.357          1.093   0.568
LM-bitmap, 128   0.982   0.687          1.040   1.219
LM-diff, 8       2.412   0.145          1.824   0.151
LM-diff, 16      1.704   0.221          1.428   0.239
LM-diff, 32      1.295   0.360          1.180   0.404
LM-diff, 64      1.053   0.620          1.030   0.694
LM-diff, 128     0.915   1.127          0.950   1.243

Table 7. Indochina-2004 dataset. Compression ratios (bpe) and access times per edge. All compressors are written in Java and were run with JRE 7. The extra data required to access the graph in random order are included.

                 direct graph           transposed graph
                 bpe     time [µs]      bpe     time [µs]
BV (7,3)         3.567   0.225          2.218   0.200
BFS, l4          3.369   0.236          2.152   0.147
BFS, l8          2.627   0.264          1.883   0.181
BFS, l16         2.242   0.357          1.742   0.260
BFS, l32         2.042   0.542          1.673   0.455
BFS, l1024       1.851   12.618         1.621   10.370
LM-bitmap, 8     3.490   0.158          2.714   0.178
LM-bitmap, 16    2.733   0.219          2.381   0.260
LM-bitmap, 32    2.241   0.346          2.113   0.444
LM-bitmap, 64    1.925   0.584          1.919   0.841
LM-bitmap, 128   1.760   1.120          1.842   1.773
LM-diff, 8       3.853   0.201          3.043   0.213
LM-diff, 16      2.813   0.297          2.438   0.328
LM-diff, 32      2.203   0.468          2.064   0.532
LM-diff, 64      1.843   0.771          1.849   0.900
LM-diff, 128     1.632   1.336          1.742   1.557

Table 8.