Universal Graph Compression: Stochastic Block Models
Alankrita Bhatt, University of California, San Diego, La Jolla, CA 92093, USA. [email protected]
Chi Wang, Microsoft Research, Redmond, WA 98052, USA. [email protected]
Lele Wang, University of British Columbia, Vancouver, BC V6T 1Z4, Canada. [email protected]
Ziao Wang∗, University of British Columbia, Vancouver, BC V6T 1Z4, Canada. [email protected]
Abstract
Motivated by the prevalent data science applications of processing and mining large-scale graph data such as social networks, web graphs, and biological networks, as well as the high I/O and communication costs of storing and transmitting such data, this paper investigates lossless compression of data appearing in the form of a labeled graph. A universal graph compression scheme is proposed, which does not depend on the underlying statistics/distribution of the graph model. For graphs generated by a stochastic block model, which is a widely used random graph model capturing the clustering effects in social networks, the proposed scheme achieves the optimal theoretical limit of lossless compression without the need to know edge probabilities, community labels, or the number of communities.

The key ideas in establishing universality for stochastic block models include: 1) block decomposition of the adjacency matrix of the graph; 2) generalization of the Krichevsky-Trofimov probability assignment, which was initially designed for i.i.d. random processes. In four benchmark graph datasets (protein-to-protein interaction, LiveJournal friendship, Flickr, and YouTube), the compressed files from competing algorithms (including CSR, Ligra+, the PNG image compressor, and the Lempel-Ziv compressor for two-dimensional data) take 2.4 to 27 times the space needed by the proposed scheme.

∗ Authors are ordered alphabetically. This work was supported in part by the NSF Center for Science of Information postdoctoral fellowship, in part by the NSERC Discovery Grant RGPIN-2019-05448, and in part by the NSERC Discovery Launch Supplement DGECR-2019-00447.

1 Introduction
In many data science applications, data appears in the form of large-scale graphs. For example, in social networks, vertices represent users and an edge between vertices represents friendship; in the World Wide Web, vertices are websites and edges indicate the hyperlinks from one site to the other; in biological systems, vertices can be proteins and edges illustrate protein-to-protein interaction. Such graphs may contain billions of vertices. In addition, edges tend to be correlated with each other since, for example, two people sharing many common friends are likely to be friends as well. How to efficiently compress such large-scale structural information to reduce the I/O and communication costs in storing and transmitting such data is an emerging challenge in the era of big data.

The literature on graph compression is vast. Existing compression schemes follow various different methodologies. Several methods exploited combinatorial properties such as cliques and cuts in the graph [1, 2]. Many works targeted domain-specific graphs such as web graphs [3], biology networks [4, 5], and social network graphs [6]. Various representations of graphs were proposed, such as the text-based method, where the neighbor list of each vertex is treated as a "word" [7, 8], and the k^2-tree method, where the adjacency matrix is recursively partitioned into k^2 equal-size submatrices [9]. Succinct graph representations that enable certain types of fast computation, such as adjacency query or vertex degree query, were also widely studied [10]. While most compression schemes are for labeled graphs, there are also works considering lossless compression of unlabeled graphs [11-13], an evolving sequence of graphs [14, 15], or (correlated) data on the graph [16, 17]. We refer the readers to [18] for an exhaustive survey on lossless graph compression and space-efficient graph representations.

In this paper, we take an information-theoretic approach to study lossless compression of a graph. We assume the graph is generated by some random graph model and investigate lossless compression schemes that achieve the theoretical limit, i.e., the entropy of the graph, asymptotically as the number of vertices goes to infinity. When the underlying distribution/statistics of the random graph model is known, optimal lossless compression can be achieved by methods like Huffman coding. However, in most real-world applications, the exact distribution is usually hard to obtain and the data we are given is a single realization of this distribution. This motivates us to consider the framework of universal compression, in which we assume the underlying distribution belongs to a known family of distributions and require that the encoder and the decoder should not be a function of the underlying distribution. For this paper, we focus on the family of stochastic block models, which are widely used random graph models that capture the clustering effect in social networks. Our goal is to develop a universal graph compressor for a family of stochastic block models with as wide a range of parameters as possible.

Universal compression for one-dimensional sequences is a well-studied area, especially for the family of independent and identically distributed (i.i.d.) processes and the family of stationary ergodic processes. A large number of universal compressors have been proposed for sequences, such as the Laplace and Krichevsky-Trofimov (KT) compressors for i.i.d.
processes [19, 20], Lempel-Ziv [21, 22] and the Burrows-Wheeler transform [23] for stationary ergodic processes, and context tree weighting [24] for finite memory processes. Many of these have been adopted in standard data compression applications such as compress, gzip, GIF, TIFF, and bzip2. A natural question arising here is: Can we convert the two-dimensional adjacency matrix of the graph into a one-dimensional sequence in some order and apply a universal compressor for the sequence? For some simple graph models such as the Erdős-Rényi graph, where each edge is generated i.i.d. with probability p, this would indeed work. For more complex graph models including stochastic block models, where edges are correlated with each other, it is unclear whether there is an ordering of the entries that results in a stationary process. We will show in Section 5 that several orders, including row-by-row, column-by-column, and diagonal-by-diagonal, fail to produce a stationary process. On the other hand, in a certain regime of stochastic block models (as shown in Theorem 2), we manage to establish the universality of the Laplace and KT compressors using this approach through a generalization of the analysis from i.i.d. processes to identical but not necessarily independent processes. For experiments, we implement the ordering of the Peano-Hilbert space filling curve and apply the Lempel-Ziv compressor [25] in four benchmark graph datasets. Its compression length turns out to be 2.5 to 4 times the length of our proposed algorithm.

Lossless compression for stochastic block models was first studied by Abbe [16]. The focus there is two-fold: 1) compute the entropy of the stochastic block model; 2) explore the relation between community detection and compression. Several interesting questions were presented: Knowing the community labels will help compression, since edges can be grouped into i.i.d. subsets. But is community detection necessary for compression? In the regime when community detection is not possible, how do we compress the graph? We answer these questions in this paper by presenting a universal compressor that does not require knowledge of the edge probabilities, the community labels, or the number of communities. Our compressor remains universal even in the regime when community detection is not possible. As a consequence, universal compression can be an easier task than community detection for stochastic block models.

The rest of the paper is organized as follows. In Section 1.1, we define universality over a family of graph distributions and the stochastic block models. We present our main result in Section 1.2, which is a graph compressor that is universal for a family containing most non-trivial stochastic block models. We describe the encoding procedure of the proposed graph compressor in Section 2.1. We implement our compressor in four benchmark graph datasets and compare its empirical performance to four competing algorithms in Section 2.2. We illustrate key steps in establishing universality in Section 3 and elaborate the proof of each step in Section 4. Section 5 explains why existing universal compressors developed for stationary processes may not be immediately applicable for some one-dimensional ordering of entries in the adjacency matrix.
Notation. For an integer n, let [n] = {1, 2, . . . , n}. Let log(·) = log_2(·). We follow the standard order notation: f(n) = O(g(n)) if limsup_{n→∞} |f(n)|/g(n) < ∞; f(n) = Ω(g(n)) if liminf_{n→∞} f(n)/g(n) > 0; f(n) = Θ(g(n)) if f(n) = O(g(n)) and f(n) = Ω(g(n)); f(n) = o(g(n)) if lim_{n→∞} f(n)/g(n) = 0; f(n) = ω(g(n)) if lim_{n→∞} |f(n)|/|g(n)| = ∞; and f(n) ∼ g(n) if lim_{n→∞} f(n)/g(n) = 1.

For simplicity, we focus on simple (undirected, unweighted, no self-loop) graphs with labeled vertices in this paper. But our compression scheme and the corresponding analysis can be extended to more general graphs. Let 𝒜_n be the set of all labeled simple graphs on n vertices. Let {0,1}^i be the set of binary sequences of length i, and set {0,1}* = ∪_{i=0}^∞ {0,1}^i. A lossless graph compressor C : 𝒜_n → {0,1}* is a one-to-one function that maps a graph to a binary sequence. Let ℓ(C(A_n)) denote the length of the output sequence. When A_n is generated from a distribution, it is known that the entropy H(A_n) is a fundamental lower bound on the expected length of any lossless compressor [26, Theorem 8.3]:

    H(A_n) − log(e(H(A_n) + 1)) ≤ E[ℓ(C(A_n))],   (1)

and therefore liminf_{n→∞} E[ℓ(C(A_n))]/H(A_n) ≥ 1. A graph compressor C is said to be universal for the family of distributions 𝒫 if for all distributions P ∈ 𝒫 and A_n ∼ P, we have

    limsup_{n→∞} E[ℓ(C(A_n))]/H(A_n) ≤ 1.   (2)

A stochastic block model SBM(n, L, p, W) defines a probability distribution over 𝒜_n. Here n is the number of vertices and L is the number of communities. Each vertex i ∈ [n] is associated with a community label X_i ∈ [L]. The length-L column vector p = (p_1, p_2, . . . , p_L)^T is a probability distribution over [L], where p_i indicates the probability that any vertex is assigned community i. W is an L × L symmetric matrix, where W_{ij} represents the probability of having an edge between a vertex with community label i and a vertex with community label j. We say A_n ∼ SBM(n, L, p, W) if the community labels X_1, X_2, . . . , X_n are generated i.i.d. according to p and, for every pair 1 ≤ i < j ≤ n, an edge is generated between vertex i and vertex j with probability W_{X_i,X_j}. In other words, in the adjacency matrix A_n of the graph, A_{ij} ∼ Bern(W_{X_i,X_j}) for i < j; the diagonal entries A_{ii} = 0 for all i ∈ [n]; and A_{ij} = A_{ji} for i > j. We assume all the entries in W are in the same regime f(n) and write W = f(n)Q, where Q is an L × L symmetric matrix with constant entries Q_{ij} = Θ(1) for all i, j ∈ [L]. We assume all entries in p are Θ(1). We will consider two families of stochastic block models: for 0 < ε < 2,

    𝒫_1(ε) : SBM with L = Θ(1), f(n) = O(1), f(n) = Ω(n^{−(2−ε)}),   (3)
    𝒫_2(ε) : SBM with L = Θ(1), f(n) = o(1), f(n) = Ω(n^{−(2−ε)}).   (4)

Note that the edge probability n^{−2} is the threshold for a random graph to contain an edge with high probability [27]. Thus, the family 𝒫_1(ε) covers most non-trivial SBM graphs. Clearly, 𝒫_2(ε) is a strict subset of 𝒫_1(ε), as it does not contain the constant regime f(n) = 1.

Theorem 1 (Universality over 𝒫_1). For every 0 < ε < 2, the graph compressor C_{k_n} defined in Section 2.1 is universal over the family 𝒫_1(ε), provided that 0 < δ < ε, k_n ≤ √(δ log n), and k_n = ω(1).

Theorem 2 (Universality over 𝒫_2). For every 0 < ε < 2, the graph compressor C_1 defined in Section 2.1 is universal over the family 𝒫_2(ε).
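To make the model concrete, here is a minimal Python sketch (ours, not part of the paper; numpy assumed, and the parameter values in the example are arbitrary) that samples a graph A_n ∼ SBM(n, L, p, W) exactly as defined above.

```python
import numpy as np

def sample_sbm(n, p, W, seed=None):
    """Draw (A_n, X^n) ~ SBM(n, L, p, W): labels i.i.d. from p, then each
    edge (i, j), i < j, is Bern(W[X_i, X_j]) independently given the labels."""
    rng = np.random.default_rng(seed)
    p, W = np.asarray(p), np.asarray(W)
    x = rng.choice(len(p), size=n, p=p)        # community labels X_1, ..., X_n
    U = rng.random((n, n))
    A = np.triu((U < W[x[:, None], x[None, :]]).astype(np.uint8), k=1)
    return A + A.T, x                          # symmetric, zero diagonal

# Example: two equally likely communities, dense within, sparse across.
A, labels = sample_sbm(8, [0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], seed=0)
```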
2.1 The Universal Graph Compressor

For each k that divides n, the graph compressor C_k : 𝒜_n → {0,1}* is defined as follows.
• Block decomposition. Let n′ = n/k. For 1 ≤ i, j ≤ n′, let B_{ij} be the submatrix of A_n formed by the rows (i−1)k+1, (i−1)k+2, . . . , ik and the columns (j−1)k+1, (j−1)k+2, . . . , jk. For example, we have

    B_{1,2} = [ A_{1,k+1}  A_{1,k+2}  ···  A_{1,2k}
                A_{2,k+1}  A_{2,k+2}  ···  A_{2,2k}
                   ⋮          ⋮       ⋱      ⋮
                A_{k,k+1}  A_{k,k+2}  ···  A_{k,2k} ].   (5)

We then write A_n in the block-matrix form as

    A_n = [ B_{1,1}   B_{1,2}   ···  B_{1,n′}
            B_{2,1}   B_{2,2}   ···  B_{2,n′}
               ⋮         ⋮      ⋱       ⋮
            B_{n′,1}  B_{n′,2}  ···  B_{n′,n′} ].   (6)

Denote by

    B^{ut} := B_{1,2}, B_{1,3}, B_{2,3}, B_{1,4}, B_{2,4}, B_{3,4}, . . . , B_{1,n′}, . . . , B_{n′−1,n′}   (7)

the sequence of off-diagonal blocks in the upper triangle and by

    B^d := B_{1,1}, B_{2,2}, . . . , B_{n′,n′}   (8)

the sequence of diagonal blocks.

• Binary to m-ary conversion. Let m = 2^{k^2}. Each k × k block with binary entries in the two block sequences B^{ut} and B^d is converted into a symbol in [m]. (A Python sketch of these two steps appears after Algorithm 1 below.)
• KT probability assignment. Apply the KT sequential probability assignment to the two m-ary sequences B^{ut} and B^d respectively. Given an m-ary sequence x_1, x_2, . . . , x_N, the KT sequential probability assignment defines N conditional probability distributions over [m] as follows. For j = 0, 1, 2, . . . , N − 1, assign conditional probability

    q_KT(i | x^j) := q_KT(X_{j+1} = i | X^j = x^j) = (N_i(x^j) + 1/2)/(j + m/2),  i ∈ [m],   (9)

where X^j := (X_1, . . . , X_j), x^j := (x_1, x_2, . . . , x_j), and N_i(x^j) := Σ_{k=1}^{j} 1{x_k = i} counts the number of occurrences of symbol i in x^j.
• Adaptive arithmetic coding. With the KT sequential probability assignments, compress the two sequences B^{ut} and B^d separately using adaptive arithmetic coding [28].

Algorithm 1: m-ary adaptive arithmetic encoding with KT probability assignment
Input: Data sequence x^N, alphabet size m
Initialize lower = 0, upper = 1, logprob = 0, N_1 = N_2 = · · · = N_m = 0;
for j = 0, 1, . . . , N − 1 do
    range ← upper − lower;
    for i = 1, 2, . . . , x_{j+1} do compute q_KT(i | x^j) = (N_i + 1/2)/(j + m/2);
    upper ← lower + range · Σ_{i=1}^{x_{j+1}} q_KT(i | x^j);
    lower ← upper − range · q_KT(x_{j+1} | x^j);
    N_{x_{j+1}} ← N_{x_{j+1}} + 1;
    logprob ← logprob + log(q_KT(x_{j+1} | x^j));
Output: the binary representation of (lower + upper)/2 with ⌈−logprob⌉ + 1 bits
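The encoding pipeline can be summarized in a short Python sketch (ours, for illustration only): it performs the block decomposition (6)-(8), the binary-to-m-ary conversion, and tallies the code length that Algorithm 1 assigns via the KT probabilities (9). The floating-point tally below stands in for a real arithmetic coder, which would emit bits with integer renormalization as in [28], and is only practical for small k.

```python
import math
import numpy as np

def block_sequences(A, k):
    """Block decomposition (6)-(8): split the n x n adjacency matrix A
    (n divisible by k) into the off-diagonal upper-triangle sequence B_ut,
    in the column-by-column order of (7), and the diagonal sequence B_d."""
    n_prime = A.shape[0] // k
    block = lambda i, j: A[i*k:(i+1)*k, j*k:(j+1)*k]
    B_ut = [block(i, j) for j in range(1, n_prime) for i in range(j)]
    B_d = [block(i, i) for i in range(n_prime)]
    return B_ut, B_d

def block_to_symbol(B):
    """Binary-to-m-ary conversion: read the k x k block row by row as a
    k^2-bit integer, i.e., a symbol in {0, ..., m-1} with m = 2^(k^2)."""
    return int("".join(str(int(b)) for b in B.flatten()), 2)

def kt_code_length(symbols, m):
    """Bits Algorithm 1 assigns to the symbol sequence: by (11), summing the
    log KT conditionals (9) gives log q_KT(x^N), and the output length is
    ceil(-log q_KT(x^N)) + 1.  Floating point; small m only."""
    counts = [0] * m
    logprob = 0.0
    for j, s in enumerate(symbols):
        logprob += math.log2((counts[s] + 0.5) / (j + m / 2.0))
        counts[s] += 1
    return math.ceil(-logprob) + 1

def ugc_length(A, k):
    """Total length of C_k on A (actual bitstream emission omitted)."""
    B_ut, B_d = block_sequences(A, k)
    m = 2 ** (k * k)
    return (kt_code_length([block_to_symbol(B) for B in B_ut], m)
            + kt_code_length([block_to_symbol(B) for B in B_d], m))
```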
Given the compressed graph sequence y^L, the number of vertices n, and the block size k, the graph decompressor D_k : {0,1}* → 𝒜_n is defined as follows.

• Adaptive arithmetic decoding. With the KT sequential probability assignments defined in Section 2.1, decompress the two code sequences for B^{ut} and B^d separately using adaptive arithmetic decoding. The lengths of the data sequences B^{ut} and B^d are (n/k)((n/k) − 1)/2 and n/k, respectively.

Algorithm 2: m-ary adaptive arithmetic decoding with KT probability assignment
Input: Binary sequence y^L, alphabet size m = 2^{k^2}, length of data sequence N
Add '0.' before the sequence y^L and convert it into a decimal real number Y.
Initialize lower = 0, upper = 1, N_1 = N_2 = · · · = N_m = 0;
for j = 0, 1, . . . , N − 1 do
    range ← upper − lower;
    for i = 1, 2, . . . , m do compute q_KT(i | x^j) = (N_i + 1/2)/(j + m/2);
    Find the minimum z ∈ [m] such that lower + range · Σ_{i=1}^{z} q_KT(i | x^j) > Y;
    upper ← lower + range · Σ_{i=1}^{z} q_KT(i | x^j);
    lower ← upper − range · q_KT(z | x^j);
    N_z ← N_z + 1;
    x_{j+1} ← z;
Output: the m-ary data sequence x_1, x_2, · · · , x_N

• m-ary to binary conversion. Each m-ary symbol in the sequence is converted to a k^2-bit binary number and further converted into a k × k block with binary entries.
• Adjacency matrix recovery. With the blocks in B^{ut} and B^d, recover the adjacency matrix A_n in the order described in (6), (7), and (8).

Remark 1 (C_1 compressor). When k = 1, the diagonal sequence B^d becomes an all-zero sequence, since we assume the graph is simple. So we only compress the off-diagonal sequence B^{ut} with the algorithm described above.

Remark 2 (Laplace probability assignment). As an alternative to the KT sequential probability assignment, one can also use the Laplace sequential probability assignment. Given an m-ary sequence x_1, x_2, . . . , x_N, the Laplace sequential probability assignment defines N conditional probability distributions over [m] as follows. For j = 0, 1, 2, . . . , N − 1, we assign conditional probability

    q_L(X_{j+1} = i | X^j = x^j) = (N_i(x^j) + 1)/(j + m) for each i ∈ [m].   (10)

Both methods can be shown to be universal, while the Laplace probability assignment has a much cleaner derivation. However, the KT probability assignment produces a better empirical performance. For this reason, we keep both in the paper.
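Side by side, the two assignments differ only in their additive smoothing, as the following two-function sketch shows (our illustration; the helper names are hypothetical).

```python
def q_kt(count_i, j, m):
    """KT conditional (9): (N_i(x^j) + 1/2) / (j + m/2)."""
    return (count_i + 0.5) / (j + m / 2.0)

def q_laplace(count_i, j, m):
    """Laplace conditional (10): (N_i(x^j) + 1) / (j + m)."""
    return (count_i + 1.0) / (j + m)
```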
Remark 3 (Relation between probability assignment and compression length). In Algorithm 1, the terms log(q(x_{j+1} | x^j)) are added up, which leads to the marginal probability implied by the sequential probability assignment:

    Σ_{j=0}^{N−1} log(q(x_{j+1} | x^j)) = log( Π_{j=0}^{N−1} q(x_{j+1} | x^j) ) = log(q(x^N)).   (11)

For the KT probability assignment, it is known that the marginal probability is

    q_KT(x^N) = [(2N_1 − 1)!! (2N_2 − 1)!! · · · (2N_m − 1)!!] / [m(m + 2) · · · (m + 2N − 2)].   (12)

For the Laplace probability assignment, it is known that the marginal probability is

    q_L(x^N) := (N_1! N_2! · · · N_m! / N!) · 1/\binom{N + m − 1}{m − 1}.   (13)

The compression output length of Algorithm 1 is ⌈log(1/q(x^N))⌉ + 1. This relation will be the basis of our length analysis in Propositions 3 and 4.
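The identity (11) and the closed form (12) can be sanity-checked numerically; the following sketch (ours) compares the product of KT conditionals with the right-hand side of (12) on a toy sequence.

```python
import math
from collections import Counter

def kt_sequential(x, m):
    """Product of the KT conditionals (9); equals q_KT(x^N) by (11)."""
    counts, q = Counter(), 1.0
    for j, s in enumerate(x):
        q *= (counts[s] + 0.5) / (j + m / 2.0)
        counts[s] += 1
    return q

def kt_closed_form(x, m):
    """Right-hand side of (12): prod_i (2N_i - 1)!! / (m(m+2)...(m+2N-2))."""
    N, counts = len(x), Counter(x)
    num = 1.0
    for Ni in counts.values():
        for t in range(1, 2 * Ni, 2):   # (2N_i - 1)!! = 1 * 3 * ... * (2N_i - 1)
            num *= t
    den = 1.0
    for j in range(N):
        den *= m + 2 * j                # m(m+2)...(m+2N-2)
    return num / den

x = [0, 1, 0, 2, 0, 1]
assert math.isclose(kt_sequential(x, 3), kt_closed_form(x, 3))
```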
Remark 4. The computational complexity of the proposed algorithm is O(2^{k^2} n^2). For the choice of k_n that achieves universality over the 𝒫_1(ε) family in Theorem 1, O(2^{k_n^2} n^2) = O(n^{2+δ}) for δ < ε. For the choice k = 1 that achieves universality over the 𝒫_2(ε) family in Theorem 2, O(2^{k^2} n^2) = O(n^2).
Remark 5. Why is C_k well-defined? The block decomposition and the binary to m-ary conversion are clearly one-to-one. It is also known that for any valid probability assignment, arithmetic coding produces a prefix code, which is also one-to-one.
Remark 6. The orders in B^{ut} and B^d do not matter in terms of establishing universality. The current orders in (7) and (8), together with arithmetic coding, enable a horizon-free implementation. That is, the encoder does not need to know the horizon n to start processing the data and can output partial coded bits on the fly before receiving all the data. This leads to short encoding and decoding delay. For some real-world applications, for example, when the number of users increases in a large social network, this compressor has the advantage of not requiring re-processing of existing data and re-compression of the whole graph from scratch.

2.2 Experimental Results

We implement the proposed universal graph compressor (UGC) on four widely used benchmark graph datasets: the protein-to-protein interaction network (PPI) [29], the LiveJournal friendship network (Blogcatalog) [30], the Flickr user network (Flickr) [30], and the YouTube user network (YouTube) [31]. The block decomposition size k is chosen to be 1, 2, . . . . We compare UGC to four competing algorithms.

• CSR: Compressed sparse row is a widely used sparse matrix representation format. In the experiment, we further optimize its default compressor, exploiting the fact that the graph is simple and its adjacency matrix is symmetric with binary entries.

• LZ: This is an implementation of the algorithm proposed in [25], which first transforms the two-dimensional adjacency matrix into a one-dimensional sequence using the Peano-Hilbert space filling curve and then compresses the sequence using the Lempel-Ziv 78 algorithm [22]. (A sketch of this ordering is given after Table 1.)
• PNG: The adjacency matrix of the graph is treated as a gray-scale image and the PNG lossless image compressor is applied.

• Ligra+: This is another powerful sparse matrix representation format [32, 33], which improves upon CSR using byte codes with run-length coding.

The compression ratios of the five algorithms implemented on the four datasets are given in Table 1. The proposed UGC outperforms all competing algorithms in all datasets. The compression ratios from competing algorithms are 2.4 to 27 times that of the universal graph compressor.

                 UGC          CSR        LZ         PNG        Ligra+
    PPI          0.0226       0.166      0.06       0.089      0.0605
    Blogcatalog  0.0267       0.203      0.080      0.096      0.0682
    Flickr       0.00907      0.0584     0.0307     0.0262     0.0217
    YouTube      3.×10^{−4}   .×10^{−4}  .×10^{−4}  .×10^{−4}  .×10^{−4}

Table 1: Comparison of the compression ratios.
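For reference, the Peano-Hilbert ordering used by the LZ baseline can be generated with the standard bit-twiddling construction sketched below (our illustration; the paper's implementation details and the LZ78 coder are omitted, and the matrix side is assumed to be a power of two).

```python
def hilbert_d2xy(order, d):
    """Map distance d along the Hilbert curve filling a 2^order x 2^order
    grid to (x, y) coordinates (standard iterative bit manipulation)."""
    x = y = 0
    s = 1
    while s < (1 << order):
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        d //= 4
        s *= 2
    return x, y

def hilbert_scan(A, order):
    """One-dimensional scan of the adjacency matrix along the Hilbert curve;
    the LZ baseline feeds this sequence to an LZ78 compressor."""
    return [A[x][y] for x, y in (hilbert_d2xy(order, d)
                                 for d in range(4 ** order))]
```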
3 Key Steps in Establishing Universality

In this section, we establish the universality of the graph compressor in Section 2.1. We first calculate the entropy of the (random) graph A_n, which, recall, is the fundamental lower bound on the expected codeword length for any compression scheme. Since to establish universality we need to show that limsup_{n→∞} E[ℓ(C(A_n))]/H(A_n) ≤ 1, we will only be concerned with the first-order term in H(A_n).

Proposition 1 (Graph entropy). Let A_n ∼ SBM(n, L, p, f(n)Q) with f(n) = O(1), f(n) = Ω(n^{−2}), and L = Θ(1). For 0 ≤ p ≤ 1, let h(p) := −p log(p) − (1 − p) log(1 − p) denote the binary entropy function. For a matrix W with entries in [0, 1], let h(W) be a matrix of the same dimension whose (i, j) entry is h(W_{ij}). Then

    H(A_n) = \binom{n}{2} H(A_{1,2} | X_1, X_2)(1 + o(1))   (14)
           = \binom{n}{2} p^T h(f(n)Q) p + o(n^2 h(f(n))).   (15)

In particular, when f(n) = Ω(n^{−2}) and f(n) = o(1), expression (15) can be further simplified as

    H(A_n) = \binom{n}{2} f(n) log(1/f(n)) (p^T Q p + o(1)).   (16)
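The approximations behind (15)-(16), namely h(g(n)) = g(n) log(1/g(n))(1 + o(1)) for g(n) = o(1) and h(f(n) p^T Q p) ∼ p^T h(f(n)Q) p for f(n) = o(1), are easy to check numerically; the parameter values in this sketch (ours) are arbitrary.

```python
import numpy as np

def h(p):
    """Binary entropy (base 2), applied elementwise."""
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = np.array([0.6, 0.4])                      # assumed community priors
Q = np.array([[0.9, 0.2], [0.2, 0.7]])        # assumed constant matrix Q
for f in [1e-2, 1e-4, 1e-6]:
    conditional = p @ h(f * Q) @ p            # per-edge term in (15)
    unconditional = h(f * (p @ Q @ p))        # upper bound via h(f p^T Q p)
    first_order = f * (p @ Q @ p) * np.log2(1 / f)   # leading term in (16)
    print(f, conditional, unconditional, first_order)
# All three agree to first order as f -> 0, as used in proving Proposition 1.
```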
Remark 7. In the regime f(n) = Ω(n^{−1}) and f(n) = O(1), the above result has been established in [16]. We extend the analysis to the regime f(n) = o(n^{−1}) and f(n) = Ω(n^{−2}).
Remark 8. Proposition 1 can be used to calculate the entropy of the graph for certain important regimes of f(n), in which the SBM displays characteristic behavior. For f(n) = 1, we have H(A_n) = \binom{n}{2} p^T h(Q) p (1 + o(1)); for f(n) = (log n)/n (the regime where the phase transition for exact recovery of the community labels occurs [34, 35]), we have H(A_n) = (1/2) n log^2 n (p^T Q p + o(1)); when f(n) = 1/n (the regime where the phase transition for detection between the SBM and the Erdős-Rényi model occurs [36]), we have H(A_n) = (1/2) n log n (p^T Q p + o(1)); when f(n) = 1/n^2 (the regime where the phase transition for the existence of an edge occurs), we have H(A_n) = log n (p^T Q p + o(1)).

To compress the matrix A_n, we wish to decompose it into a large number of components that have little correlation between them. This leads to the idea of block decomposition described previously. Since the sequence of blocks is used to compress A_n, we now prove that these blocks are identically distributed and asymptotically independent in a precise sense described as follows.

Proposition 2 (Block decomposition). Let A_n ∼ SBM(n, L, p, f(n)Q) with f(n) = Ω(n^{−(2−ε)}) for some 0 < ε < 2, f(n) = O(1), and L = Θ(1). Let k be an integer that divides n and let n′ = n/k. Consider the k × k block decomposition in (6). Then all the off-diagonal blocks share the same joint distribution, and all the diagonal blocks share the same joint distribution. In other words, for any 1 ≤ i_1, i_2, j_1, j_2 ≤ n′ with i_1 ≠ j_1, i_2 ≠ j_2 and 1 ≤ l_1, l_2 ≤ n′, we have

    B_{i_1,j_1} =^d B_{i_2,j_2},  B_{l_1,l_1} =^d B_{l_2,l_2}.

In addition, if k = ω(1) and k = o(n), we have

    lim_{n→∞} H(B^{ut}) / (\binom{n′}{2} H(B_{1,2})) = 1.   (17)

Because of this property of the block decomposition, we hope to compress these blocks as if they were independent using a Laplace probability assignment (which, recall, is universal for the class of all m-ary i.i.d. processes). However, since these blocks are still correlated (albeit weakly), we will need a result on the performance of the Laplace probability assignment on correlated sequences with identical marginals, which we give next.

Proposition 3 (Laplace probability assignment for correlated sequences). Consider Z_1, Z_2, · · · , Z_N, where each Z_i is identically distributed over an alphabet of size m ≥ 2, but the Z_i are not necessarily independent. Let ℓ_L(Z_1, · · · , Z_N) = log(1/q_L(Z_1, · · · , Z_N)), where q_L(·) is the Laplace probability assignment in (13). We then have

    E[ℓ_L(Z_1, · · · , Z_N)] ≤ m log(2eN) + N H(Z_1).   (18)

We provide a similar result for the KT probability assignment.

Proposition 4 (KT probability assignment for correlated sequences). Consider Z_1, · · · , Z_N, where each Z_i is identically distributed over an alphabet of size m ≥ 2, but the Z_i are not necessarily independent. Let ℓ_KT(Z_1, · · · , Z_N) = log(1/q_KT(Z_1, Z_2, · · · , Z_N)), where q_KT(·) is the KT probability assignment in (12). We then have

    E[ℓ_KT(Z_1, · · · , Z_N)] ≤ (m/2) log(e(2N/m + 1)) + (1/2) log(πN) + N H(Z_1).   (19)

We are now ready to prove Theorem 1. The proof of Theorem 2 follows similar arguments as in Theorem 1 and is deferred to Section 4.5.

Proof of Theorem 1.
We prove the universality of C_{k_n} for both the KT probability assignment and the Laplace probability assignment. Note that the upper bound on the expected length of KT in (19) is itself upper bounded by the upper bound on the length of Laplace in (18). So it suffices to show that the Laplace probability assignment is universal.

We use the bound in Proposition 3 to establish the upper bound on the length of the code. Recall that here we compress the diagonal blocks B^d (alphabet size m = 2^{k_n^2}, N = n′ blocks) and the off-diagonal blocks B^{ut} (alphabet size m = 2^{k_n^2}, N = \binom{n′}{2} blocks) separately. We have

    E[ℓ(C_{k_n}(A_n))]/H(A_n)
      = (E[ℓ_L(B^{ut})] + E[ℓ_L(B^d)])/H(A_n)
      ≤ (\binom{n′}{2} H(B_{1,2}) + 2^{k_n^2} log(2e\binom{n′}{2}) + n′ H(B_{1,1}) + 2^{k_n^2} log(2en′)) / H(A_n)
    (a)≤ (\binom{n′}{2} H(B_{1,2}) + 2^{k_n^2+1} log(2en^2) + n H(B_{1,1})) / H(A_n)
    (b)≤ (\binom{n′}{2} H(B_{1,2}) + 2^{k_n^2+1} log(2en^2) + n k_n H(A_{1,2})) / H(A_n)
      = \binom{n′}{2} H(B_{1,2})/H(A_n) + 2^{k_n^2+1} log(2en^2)/H(A_n) + n k_n H(A_{1,2})/H(A_n),

where in (a) we bound \binom{n′}{2} ≤ n^2 and n′ ≤ n, and in (b) we note that H(B_{1,1}) ≤ k_n^2 H(A_{1,2}), since the k_n^2 − k_n elements of the block (all apart from the diagonal elements, which are zero) are distributed identically to A_{1,2}. We now analyze each of these three terms separately. First, Proposition 2 yields that \binom{n′}{2} H(B_{1,2})/H(A_n) → 1. Next, since f(n) = Ω(n^{−(2−ε)}), we have H(A_n) = Ω(n^ε log n), and subsequently, substituting k_n ≤ √(δ log n), we have

    2^{k_n^2+1} log(2en^2)/H(A_n) = O(n^δ log n / (n^ε log n)) = O(n^{δ−ε}) = o(1),

since δ < ε. Moreover, we have

    n k_n H(A_{1,2})/H(A_n) ≤ n k_n H(A_{1,2})/H(A_n | X^n) = n k_n H(A_{1,2}) / (\binom{n}{2} H(A_{1,2} | X_1, X_2)) = O(k_n/n) = o(1),

where the penultimate equality used the fact that H(A_{1,2}) = h(f(n) p^T Q p) = Θ(p^T h(f(n)Q) p) = Θ(H(A_{1,2} | X_1, X_2)). We have then established that

    E[ℓ(C_{k_n}(A_n))]/H(A_n) ≤ \binom{n′}{2} H(B_{1,2})/H(A_n) + 2^{k_n^2+1} log(2en^2)/H(A_n) + n k_n H(A_{1,2})/H(A_n) = 1 + o(1),

which finishes the proof.

4.1 Graph Entropy

Proof of Proposition 1.
Note that

    H(A_n) = H(A_n | X^n) + I(X^n; A_n)
           = \binom{n}{2} H(A_{1,2} | X_1, X_2) + I(X^n; A_n)   (20)
           = \binom{n}{2} p^T h(f(n)Q) p + I(X^n; A_n),   (21)

where (21) follows since all the \binom{n}{2} edges are identically distributed and also independent given X^n, and consequently

    H(A_n | X^n) = \binom{n}{2} H(A_{1,2} | X_1, X_2) = \binom{n}{2} Σ_{i,j} H(A_{1,2} | X_1 = i, X_2 = j) p_i p_j = \binom{n}{2} p^T h(f(n)Q) p.

When f(n) = Θ(1), we see that since

    0 ≤ I(X^n; A_n) ≤ H(X^n) = n H(X_1) ≤ n log L,

we have H(A_n) = \binom{n}{2} p^T h(f(n)Q) p + o(n^2 h(f(n))).

Next, consider the case when f(n) = o(1) and f(n) = Ω(n^{−2}). By properties of the entropy, we have

    H(A_n | X^n) ≤ H(A_n) ≤ \binom{n}{2} H(A_{1,2}).   (22)

Note that

    P(A_{1,2} = 1) = Σ_{i,j} P(A_{1,2} = 1 | X_1 = i, X_2 = j) p_i p_j = f(n) p^T Q p,

which yields H(A_{1,2}) = h(f(n) p^T Q p). Substituting this in (22) gives

    \binom{n}{2} p^T h(f(n)Q) p ≤ H(A_n) ≤ \binom{n}{2} h(f(n) p^T Q p).   (23)

Note now that for any g(n) = o(1), we have

    h(g(n)) = −g(n) log g(n) − (1 − g(n)) log(1 − g(n))
            = −g(n) log g(n) (1 + (1 − g(n)) log(1 − g(n)) / (g(n) log g(n))).

By noting that log(1 − g(n))/g(n) → −log e and log g(n) → −∞ as g(n) → 0, we get

    h(g(n)) = g(n) log(1/g(n)) (1 + o(1)).

Using this, we note that p^T h(f(n)Q) p = p^T Q p f(n) log(1/f(n))(1 + o(1)) and h(f(n) p^T Q p) = p^T Q p f(n) log(1/f(n))(1 + o(1)). Finally, substituting this into (23) yields

    H(A_n) = \binom{n}{2} f(n) log(1/f(n)) (p^T Q p + o(1)),

as required.

4.2 Asymptotic i.i.d. via Block Decomposition

We first invoke a known property of stochastic block models (see, for example, [37]). We include the proof here for completeness.
Lemma 1 (Exchangeability of the SBM). Let A_n ∼ SBM(n, L, p, W). For a permutation π : [n] → [n], let π(A_n) be the n × n matrix whose (i, j) entry is given by A_{π(i),π(j)}. Then, for any permutation π : [n] → [n], the joint distribution of A_n is the same as the joint distribution of π(A_n), i.e.,

    A_n =^d π(A_n).   (24)
Proof. Let a_n be a realization of the random matrix A_n and let π(X^n) denote the permuted vector (X_{π(1)}, . . . , X_{π(n)}). For any symmetric binary matrix a_n with zero diagonal entries, we have

    P(A_n = a_n) = Σ_{x^n ∈ [L]^n} P(A_n = a_n, X^n = x^n)
                 = Σ_{x^n ∈ [L]^n} P(A_n = a_n | X^n = x^n) Π_{i=1}^{n} P(X_i = x_i)
               (a)= Σ_{x^n ∈ [L]^n} Π_{i<j} W_{x_i,x_j}^{a_{ij}} (1 − W_{x_i,x_j})^{1−a_{ij}} Π_{i=1}^{n} p_{x_i},

where (a) follows since the edges are independent given the labels. The same computation applied to π(A_n) gives the same sum with x^n replaced by π(x^n); relabeling the summation variable by π leaves the sum unchanged, so P(π(A_n) = a_n) = P(A_n = a_n) and hence A_n =^d π(A_n).

Proof of Proposition 2. Consider a permutation π_1 : [n] → [n] that has π_1(x) = x + (i_2 − i_1)k_n for (i_1 − 1)k_n + 1 ≤ x ≤ i_1 k_n, π_1(x) = x + (j_2 − j_1)k_n for (j_1 − 1)k_n + 1 ≤ x ≤ j_1 k_n, and maps the remaining n − 2k_n arguments to the n − 2k_n values in [n] \ {(i_2 − 1)k_n + 1, · · · , i_2 k_n, (j_2 − 1)k_n + 1, · · · , j_2 k_n} in any order. Lemma 1 implies that B_{i_1,j_1}, which is the submatrix formed by the rows (i_1 − 1)k_n + 1, · · · , i_1 k_n and the columns (j_1 − 1)k_n + 1, · · · , j_1 k_n, has the same distribution as the submatrix formed by the rows π_1((i_1 − 1)k_n + 1), · · · , π_1(i_1 k_n) and the columns π_1((j_1 − 1)k_n + 1), · · · , π_1(j_1 k_n). From the definition of π_1, we see that the latter submatrix is B_{i_2,j_2}, and we establish that B_{i_1,j_1} =^d B_{i_2,j_2}. Similarly, defining a permutation π_2 : [n] → [n] which has π_2(x) = x + (l_2 − l_1)k_n for (l_1 − 1)k_n + 1 ≤ x ≤ l_1 k_n and invoking Lemma 1 establishes B_{l_1,l_1} =^d B_{l_2,l_2}.

Now, clearly H(B^{ut}) ≤ \binom{n′}{2} H(B_{1,2}), and therefore we have

    limsup_{n→∞} H(B^{ut}) / (\binom{n′}{2} H(B_{1,2})) ≤ 1.   (27)

Moreover, we have

    H(A_n) = H(B^{ut}, B^d) ≤ H(B^{ut}) + H(B^d) ≤ H(B^{ut}) + n′ H(B_{1,1}) ≤ H(B^{ut}) + n′ k_n^2 h(f(n) p^T Q p),

where the last inequality follows by noting that except for the diagonal elements of B_{1,1} (which are zero and thus have zero entropy), all other elements have the same distribution as A_{1,2}. We therefore obtain

    H(B^{ut}) ≥ H(A_n) − n′ k_n^2 h(f(n) p^T Q p) = H(A_n) − n k_n h(f(n) p^T Q p)
             ≥ H(A_n | X^n) − n k_n h(f(n) p^T Q p) = \binom{n}{2} p^T h(f(n)Q) p − n k_n h(f(n) p^T Q p).

Consequently,

    H(B^{ut}) / (\binom{n′}{2} H(B_{1,2})) ≥ \binom{n}{2} (p^T h(f(n)Q) p − 2k_n h(f(n) p^T Q p)/(n − 1)) / (\binom{n′}{2} H(B_{1,2})).   (28)

We will now analyze the right-hand side of (28) in two parameter ranges.

• f(n) = 1: We have

    H(B_{1,2}) ≤ H(B_{1,2} | X^{2k_n}) + H(X^{2k_n})   (29)
              ≤ H(B_{1,2} | X^{2k_n}) + 2k_n H(p)   (30)
              = k_n^2 H(A_{1,k_n+1} | X_1, X_{k_n+1}) + 2k_n H(p)
              ≤ k_n^2 (p^T h(Q) p + 2 log L / k_n),   (31)

where (29) follows from the chain rule and (30) follows since all elements of the matrix B_{1,2} are independent given X_1, · · · , X_{2k_n}. Plugging this into the RHS of (28), we obtain

    H(B^{ut}) / (\binom{n′}{2} H(B_{1,2})) ≥ \binom{n}{2} (p^T h(Q) p − 2k_n h(p^T Q p)/(n − 1)) / (\binom{n′}{2} k_n^2 (p^T h(Q) p + 2 log L / k_n)).   (32)

Since k_n = o(n), k_n = ω(1), and \binom{n′}{2} k_n^2 ∼ \binom{n}{2}, we have from (32)

    liminf_{n→∞} H(B^{ut}) / (\binom{n′}{2} H(B_{1,2})) ≥ 1.   (33)

• f(n) = Ω(n^{−(2−ε)}), f(n) = o(1): Since B_{1,2} is a matrix of k_n^2 identically distributed Bernoulli random variables, we have

    H(B_{1,2}) ≤ k_n^2 H(A_{1,k_n+1}) = k_n^2 h(f(n) p^T Q p).   (34)

Plugging this into the RHS of (28) then yields

    H(B^{ut}) / (\binom{n′}{2} H(B_{1,2})) ≥ \binom{n}{2} (p^T h(f(n)Q) p − 2k_n h(f(n) p^T Q p)/(n − 1)) / (\binom{n′}{2} k_n^2 h(f(n) p^T Q p)).   (35)
We first observe that in this parameter range, since f(n) = o(1), we have by Proposition 1

    p^T h(f(n)Q) p ∼ h(f(n) p^T Q p).   (36)

Finally, using that k_n = o(n) and \binom{n′}{2} k_n^2 ∼ \binom{n}{2} establishes

    liminf_{n→∞} H(B^{ut}) / (\binom{n′}{2} H(B_{1,2})) ≥ 1.   (37)

Proof of Proposition 3. Without loss of generality, we can assume the alphabet of the Z_i's to be [m]. Now define θ_i := P(Z_1 = i) and N_i := Σ_{k=1}^{N} 1{Z_k = i}, i ∈ [m]. We then have

    ℓ_L(Z_1, · · · , Z_N) = log(1/q_L(Z_1, · · · , Z_N))
     = log(θ_1^{N_1} θ_2^{N_2} · · · θ_m^{N_m} / q_L(Z_1, · · · , Z_N)) + log(1/(θ_1^{N_1} θ_2^{N_2} · · · θ_m^{N_m}))
     = log \binom{N + m − 1}{m − 1} + log((N!/(N_1! N_2! · · · N_m!)) θ_1^{N_1} θ_2^{N_2} · · · θ_m^{N_m}) + log(1/(θ_1^{N_1} · · · θ_m^{N_m}))
     ≤ log \binom{N + m − 1}{m − 1} + log(1/(θ_1^{N_1} · · · θ_m^{N_m}))   (38)
     ≤ (m − 1) log(e(N + m − 1)/(m − 1)) + log(1/(θ_1^{N_1} · · · θ_m^{N_m}))   (39)
     ≤ m log(2eN) + Σ_{i=1}^{m} N_i log(1/θ_i),   (40)

where (38) follows since (N!/(N_1! N_2! · · · N_m!)) θ_1^{N_1} · · · θ_m^{N_m} is a multinomial probability, which is always upper bounded by 1, and (39) follows since \binom{n}{k} ≤ (en/k)^k. Taking expectations on both sides of (40), we obtain

    E[ℓ_L(Z_1, · · · , Z_N)] ≤ m log(2eN) + Σ_{i=1}^{m} E[N_i] log(1/θ_i)
                            = m log(2eN) + Σ_{i=1}^{m} N θ_i log(1/θ_i)   (41)
                            = m log(2eN) + N H(Z_1),

where (41) follows since E[N_i] = Σ_{k=1}^{N} E[1{Z_k = i}] = N P(Z_1 = i), as the Z_i are identically distributed.

Lemma 2. For any integer m > 1, N_1, N_2, · · · , N_m ∈ ℕ, and probability distribution (θ_1, · · · , θ_m),

    (\binom{N}{N_1, N_2, · · · , N_m} θ_1^{N_1} · · · θ_m^{N_m}) / (\binom{2N}{2N_1, 2N_2, · · · , 2N_m} θ_1^{2N_1} · · · θ_m^{2N_m}) ≥ 1,

where N = Σ_{i=1}^{m} N_i.

Remark 9. Equivalently, consider an urn containing a known number of balls with m different colours. The lemma claims that the probability of getting N_1 balls of colour 1, N_2 balls of colour 2, . . . , N_m balls of colour m out of N draws with replacement is always greater than the probability of getting 2N_1 balls of colour 1, 2N_2 balls of colour 2, . . . , 2N_m balls of colour m out of 2N draws with replacement.

Proof of Proposition 4. In this proof, we define a generalized form of the factorial function: for a positive integer x, let (x + 1/2)! := (1/2)(3/2) · · · (x + 1/2). Since (2N_i − 1)!! = (2N_i)!/(2^{N_i} N_i!), we have

    m(m + 2) · · · (m + 2N − 2) = 2^N (m/2)((m + 2)/2) · · · ((m + 2N − 2)/2) = 2^N (m/2 + N − 1)!/(m/2 − 1)!.

Therefore we can rewrite the KT probability assignment in (12) as

    q_KT(x^N) = ((m/2 − 1)!/(2^N (m/2 + N − 1)!)) Π_{i=1}^{m} (2N_i)!/(N_i! 2^{N_i})
     = ((m/2 − 1)! N!/(m/2 + N − 1)!) (\binom{2N}{N}/4^N) \binom{N}{N_1, · · · , N_m} / \binom{2N}{2N_1, · · · , 2N_m}
    (a)≥ ((m/2 − 1)! \binom{2N}{N}/(4^N (N + 1)(N + 2) · · · (N + m/2 − 1))) \binom{N}{N_1, · · · , N_m} / \binom{2N}{2N_1, · · · , 2N_m}
    (b)= ((m/2 − 1)! \binom{2N}{N}/(4^N (N + 1)(N + 2) · · · (N + m/2 − 1))) θ_1^{N_1} · · · θ_m^{N_m} · (\binom{N}{N_1, · · · , N_m} θ_1^{N_1} · · · θ_m^{N_m}) / (\binom{2N}{2N_1, · · · , 2N_m} θ_1^{2N_1} · · · θ_m^{2N_m}),

where (a) follows since, when m is even, N!/(m/2 + N − 1)! = 1/((N + 1)(N + 2) · · · (N + m/2 − 1)), and when m is odd the analogous bound holds with the generalized factorial, and (b) follows by multiplying and dividing by θ_1^{N_1} · · · θ_m^{N_m}, where θ_i := P(Z_1 = i). By Lemma 2, we have

    q_KT(x^N) ≥ θ_1^{N_1} · · · θ_m^{N_m} (m/2 − 1)! \binom{2N}{N} / (4^N (N + 1)(N + 2) · · · (N + m/2 − 1)).
Thus,

    ℓ_KT(Z_1, · · · , Z_N) = log(1/q_KT(Z_1, Z_2, · · · , Z_N))
     ≤ log(1/(θ_1^{N_1} · · · θ_m^{N_m})) + (m/2 − 1) log(N + m/2 − 1) + log(4^N/\binom{2N}{N}) − log((m/2 − 1)!)
    (a)≤ log(1/(θ_1^{N_1} · · · θ_m^{N_m})) + (m/2 − 1) log(N + m/2 − 1) + log(4^N/\binom{2N}{N}) − (m/2 − 1) log((m/2 − 1)/e)
    (b)∼ log(1/(θ_1^{N_1} · · · θ_m^{N_m})) + (m/2 − 1) log(e(N + m/2 − 1)/(m/2 − 1)) + log √(πN)
     ≤ (m/2) log(e(2N/m + 1)) + (1/2) log(πN) + Σ_{i=1}^{m} N_i log(1/θ_i),

where (a) follows from Stirling's approximation k! ≥ √(2πk)(k/e)^k e^{1/(12k+1)} and (b) follows from Stirling's approximation for the central binomial coefficient, i.e., \binom{2N}{N} ∼ 4^N/√(πN). Therefore, we have

    E[ℓ_KT(Z_1, · · · , Z_N)] ≤ (m/2) log(e(2N/m + 1)) + (1/2) log(πN) + N H(Z_1).

Proof of Lemma 2. Let p_1 = N_1/N, p_2 = N_2/N, · · · , p_m = N_m/N. Notice that Σ_{i=1}^{m} p_i = 1, so (p_1, · · · , p_m) can be viewed as a probability distribution, and the entropy of this distribution is H(p_1, · · · , p_m) = Σ_{i=1}^{m} −p_i log p_i. First we consider the case when N_1, N_2, · · · , N_m are all positive and none of them equals N. By Stirling's approximation for the factorial, √(2πn)(n/e)^n e^{1/(12n+1)} ≤ n! ≤ √(2πn)(n/e)^n e^{1/(12n)}, we can bound

    \binom{N}{N_1, N_2, · · · , N_m} ≥ (exp(1/(12N + 1) − 1/(12N_1) − · · · − 1/(12N_m)) / ((2π)^{(m−1)/2} (p_1 p_2 · · · p_m)^{1/2} N^{(m−1)/2})) · 2^{N H(p_1, · · · , p_m)}.

Similarly, we have

    \binom{2N}{2N_1, 2N_2, · · · , 2N_m} ≤ (exp(1/(24N) − 1/(24N_1 + 1) − · · · − 1/(24N_m + 1)) / ((2π)^{(m−1)/2} 2^{(m−1)/2} (p_1 · · · p_m)^{1/2} N^{(m−1)/2})) · 2^{2N H(p_1, · · · , p_m)}.

Consider the function

    f(N_1, N_2, · · · , N_m) = 1/(12N + 1) − 1/(24N) + (1/(24N_1 + 1) − 1/(12N_1)) + · · · + (1/(24N_m + 1) − 1/(12N_m))

and the function g(n) = 1/(24n + 1) − 1/(12n), where n is a positive integer. The function g(n) is minimized at n = 1 with min g(n) = 1/25 − 1/12, so f(N_1, N_2, · · · , N_m) ≥ 1/(12N + 1) − 1/(24N) + (1/25 − 1/12)m. Finally, we are ready to prove the lemma:

    (\binom{N}{N_1, · · · , N_m} θ_1^{N_1} · · · θ_m^{N_m}) / (\binom{2N}{2N_1, · · · , 2N_m} θ_1^{2N_1} · · · θ_m^{2N_m})
     ≥ 2^{(m−1)/2} exp(f(N_1, · · · , N_m)) / (2^{N H(p_1, · · · , p_m)} θ_1^{N_1} · · · θ_m^{N_m})
     ≥ 2^{(m−1)/2 + N D_KL(p‖θ) + (log e)(1/(12N + 1) − 1/(24N) + (1/25 − 1/12)m)}.

Notice that 1/(12N + 1) − 1/(24N) goes to zero as N → ∞ (and is nonnegative for every N ≥ 1), (m − 1)/2 > (1/12 − 1/25)m log e, and D_KL(p‖θ) ≥ 0, so the exponent is nonnegative and

    (\binom{N}{N_1, · · · , N_m} θ_1^{N_1} · · · θ_m^{N_m}) / (\binom{2N}{2N_1, · · · , 2N_m} θ_1^{2N_1} · · · θ_m^{2N_m}) ≥ 1.

When one of {N_i}_{i=1}^{m} equals N, without loss of generality we assume N_1 = N. We have

    (\binom{N}{N_1, · · · , N_m} θ_1^{N_1} · · · θ_m^{N_m}) / (\binom{2N}{2N_1, · · · , 2N_m} θ_1^{2N_1} · · · θ_m^{2N_m}) = 1/(θ_1^{N_1} · · · θ_m^{N_m}) ≥ 1.

When k of the numbers N_1, N_2, · · · , N_m equal zero, we can simply remove these values and consider the case with alphabet size m − k, and this yields the same result.

Proof of Theorem 2. Once again, we establish universality for both the KT and Laplace probability assignments. Following a similar argument as in the proof of Theorem 1, it suffices to show the universality of Laplace.
Since we are compressing N = \binom{n}{2} identically distributed bits using a Laplace probability assignment, Proposition 3 yields

    E[ℓ(C_1(A_n))]/H(A_n) ≤ (2 log(2eN) + N H(A_{1,2}))/H(A_n)
     ≤ (2 log(2eN) + N H(A_{1,2}))/H(A_n | X^n)
     = ((2 log(2eN) + N H(A_{1,2}))/(N H(A_{1,2}))) · (N H(A_{1,2})/(\binom{n}{2} H(A_{1,2} | X_1, X_2)))
     = (1 + 2 log(2eN)/(N h(f(n) p^T Q p))) · (h(f(n) p^T Q p)/(p^T h(f(n)Q) p))
    (a)= 1 + o(1).

Here, (a) is justified by noting that

    log(2eN)/(N h(f(n) p^T Q p)) ≤ (log(2en^2)/(\binom{n}{2} h(n^{−(2−ε)} p^T Q p))) · (h(n^{−(2−ε)} p^T Q p)/h(f(n) p^T Q p)),

and then noting that log(2en^2)/(\binom{n}{2} h(n^{−(2−ε)} p^T Q p)) = o(1) and h(n^{−(2−ε)} p^T Q p)/h(f(n) p^T Q p) = O(1) when f(n) = Ω(n^{−(2−ε)}), and that h(f(n) p^T Q p) ∼ p^T h(f(n)Q) p since f(n) = o(1).

Remark 10. When f(n) = Θ(1), the compressor C_1 is strictly suboptimal. This is because the length achieved by C_1 is \binom{n}{2} h(f(n) p^T Q p)(1 + o(1)), whereas the first-order term in the entropy is \binom{n}{2} p^T h(f(n)Q) p. When f(n) is o(1), these two have the same first-order term. However, when f(n) is constant, p^T h(f(n)Q) p is strictly smaller than h(f(n) p^T Q p) by concavity of the entropy.

5 Nonstationarity of One-Dimensional Orderings

In this section, we take a closer look at the correlation among entries in the adjacency matrix and explain why existing universal compressors developed for stationary processes may not be immediately applicable for certain orderings of the entries.

Compressing A_n entails compressing A_{1,2}, · · · , A_{1,n}, A_{2,3}, · · · , A_{n−1,n}, i.e., the bits in the upper triangle of A_n. Clearly, these are not independent (because of the dependency through X^n), so one cannot use any of the compressors universal for the class of i.i.d. processes to compress A_n. So, one hopes that it is possible to list the \binom{n}{2} random variables A_{1,2}, · · · , A_{1,n}, A_{2,3}, · · · , A_{n−1,n} in an order that makes the resulting sequence stationary, so that the Lempel-Ziv compressor (which, recall, is universal for the class of stationary processes) may be used. However, we show now that some of the most natural orders of listing these \binom{n}{2} bits result in a sequence that is nonstationary.

1. Horizontally: Listing the bits in the upper triangle row-wise (i.e., first listing the bits in the first row, followed by the bits in the second and so on, ending with A_{n−1,n}) we get the following sequence:

    A_{1,2}, · · · , A_{1,n}, A_{2,3}, · · · , A_{2,n}, · · · , A_{n−1,n},

which can be seen to be nonstationary. Consider the case when n = 4, L = 2, Q_{11} = Q_{22} = 1, Q_{12} = 0. In this case the horizontal ordering is

    A_{1,2}, A_{1,3}, A_{1,4}, A_{2,3}, A_{2,4}, A_{3,4},

and this is seen to be nonstationary by observing that P(A_{1,2} = 1, A_{1,3} = 0, A_{1,4} = 1) > P(A_{2,3} = 1, A_{2,4} = 0, A_{3,4} = 1) = 0.

2. Vertically: Listing the bits in the upper triangle column-wise (i.e., first listing the bits in the first column, followed by the bits in the second and so on, ending with A_{n−1,n}) we get the following sequence:

    A_{1,2}, A_{1,3}, A_{2,3}, A_{1,4}, · · · , A_{n−1,n},

which can be seen to be nonstationary. Consider the case when n = 4, L = 2, Q_{11} = Q_{22} = 1, Q_{12} = 0. In this case the vertical ordering is

    A_{1,2}, A_{1,3}, A_{2,3}, A_{1,4}, A_{2,4}, A_{3,4},

and this is seen to be nonstationary by observing that P(A_{1,2} = 1, A_{1,3} = 0, A_{2,3} = 1) = 0 but P(A_{1,3} = 1, A_{2,3} = 0, A_{1,4} = 1) > 0.
3. Diagonally: Consider the ⌊n/2⌋ sequences defined as

    S_1 := A_{1,2}, A_{2,3}, A_{3,4}, · · · , A_{n−1,n}, A_{n,1},
    S_2 := A_{1,3}, A_{2,4}, A_{3,5}, · · · , A_{n−2,n}, A_{n−1,1}, A_{n,2},
    ⋮
    S_{⌊n/2⌋−1} := A_{1,⌊n/2⌋}, A_{2,⌊n/2⌋+1}, · · · , A_{n,⌊n/2⌋−1},

and

    S_{⌊n/2⌋} = (A_{1,n/2+1}, A_{2,n/2+2}, · · · , A_{n/2,n}) when n is even; (A_{1,⌊n/2⌋+1}, A_{2,⌊n/2⌋+2}, · · · , A_{n,n+⌊n/2⌋}) when n is odd,

where the indices are interpreted modulo n. Concatenating S_1, · · · , S_{⌊n/2⌋} yields a sequence of length \binom{n}{2}. This corresponds to listing the bits diagonal-wise. However, even this does not yield a sequence that is stationary, which can be illustrated by considering the case when n = 4, L = 2, Q_{11} = Q_{22} = 1, Q_{12} = 0. In this case the diagonal ordering is

    A_{1,2}, A_{2,3}, A_{3,4}, A_{1,4}, A_{1,3}, A_{2,4},

and this is seen to be nonstationary by observing that P(A_{1,2} = 0, A_{2,3} = 1, A_{3,4} = 1) > P(A_{3,4} = 0, A_{1,4} = 1, A_{1,3} = 1) = 0.

Acknowledgment

L. Wang would like to thank Emmanuel Abbe and Tsachy Weissman for stimulating discussions in the initial phase of the work. She is grateful to Young-Han Kim for his interest and encouragement in the results.

References

[1] R. Rossi and R. Zhou, "GraphZIP: a clique-based sparse graph compression method," Journal of Big Data, vol. 5, no. 10, 2018.
[2] Y. Lim, U. Kang, and C. Faloutsos, "SlashBurn: Graph compression and mining beyond caveman communities," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 12, pp. 3077-3089, 2014.
[3] P. Boldi and S. Vigna, "The webgraph framework I: Compression techniques," in Proceedings of the 13th International Conference on World Wide Web, ser. WWW 04. New York, NY, USA: Association for Computing Machinery, 2004, pp. 595-602. [Online]. Available: https://doi.org/10.1145/988672.988752
[4] T. C. Conway and A. J. Bromage, "Succinct data structures for assembling large genomes," Bioinformatics, vol. 27, no. 4, pp. 479-486, 2011. [Online]. Available: https://doi.org/10.1093/bioinformatics/btq697
[5] M. Hayashida and T. Akutsu, "Comparing biological networks via graph compression," BMC Systems Biology, vol. 4, Suppl 2, 2010.
[6] F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan, "On compressing social networks," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD 09. New York, NY, USA: Association for Computing Machinery, 2009, pp. 219-228. [Online]. Available: https://doi.org/10.1145/1557019.1557049
[7] G. Navarro, "Compressing web graphs like texts," Dept. of Computer Science, University of Chile, Tech. Rep., 2007.
[8] K. Sadakane, "New text indexing functionalities of the compressed suffix arrays," Journal of Algorithms.
[9] N. R. Brisaboa, S. Ladra, and G. Navarro, "k²-trees for compact web graph representation," in Proceedings of the 16th International Symposium on String Processing and Information Retrieval, ser. SPIRE 09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 18-30.
[10] A. Farzan and J. I. Munro, "Succinct encoding of arbitrary graphs," Theoretical Computer Science.
[11] G. Turán, "On the succinct representation of graphs," Discrete Applied Mathematics, vol. 8, no. 3, pp. 289-294, 1984.
[12] M. Naor, "Succinct representation of general unlabeled graphs," Discrete Applied Mathematics, vol. 28, no. 3, pp. 303-307, 1990.
[13] Y. Choi and W. Szpankowski, "Compression of graphical structures: Fundamental limits, algorithms, and experiments," IEEE Transactions on Information Theory, vol. 58, no. 2, pp. 620-638, Feb. 2012.
[14] P. Delgosha and V. Anantharam, "Universal lossless compression of graphical data," in Proc. IEEE Internat. Symp. Inf. Theory, Jun. 2017.
[15] ——, "Universal lossless compression of graphical data," 2019. [Online].
Available: https://arxiv.org/abs/1909.09844
[16] E. Abbe, "Graph compression: The effect of clusters," in Proc. 54th Ann. Allerton Conf. Commun. Control Comput., 2016, pp. 1-8.
[17] A. Asadi, E. Abbe, and S. Verdú, "Compressing data on graphs with clusters," in Proc. IEEE Internat. Symp. Inf. Theory, Aug. 2017, pp. 1583-1587.
[18] M. Besta and T. Hoefler, "Survey and taxonomy of lossless graph compression and space-efficient graph representations," 2018. [Online]. Available: https://arxiv.org/abs/1806.01799
[19] Q. Xie and A. R. Barron, "Minimax redundancy for the class of memoryless sources," IEEE Transactions on Information Theory, vol. 43, no. 2, pp. 646-657, 1997.
[20] ——, "Asymptotic minimax regret for data compression, gambling, and prediction," IEEE Transactions on Information Theory, vol. 46, no. 2, pp. 431-445, 2000.
[21] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337-343, 1977.
[22] ——, "Compression of individual sequences via variable-rate coding," IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530-536, 1978.
[23] M. Effros, K. Visweswariah, S. R. Kulkarni, and S. Verdú, "Universal lossless source coding with the Burrows-Wheeler transform," IEEE Transactions on Information Theory, vol. 48, no. 5, pp. 1061-1081, 2002.
[24] F. M. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: Basic properties," IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 653-664, 1995.
[25] A. Lempel and J. Ziv, "Compression of two-dimensional data," IEEE Transactions on Information Theory, vol. 32, no. 1, pp. 2-8, 1986.
[26] Y. Polyanskiy and Y. Wu, "Lecture notes on information theory," 2014. [Online]. Available: http://people.lids.mit.edu/yp/homepage/data/itlectures_v5.pdf
[27] A. Frieze and M. Karoński, Introduction to Random Graphs. Cambridge University Press, 2015.
[28] D. Marpe, H. Schwarz, and T. Wiegand, "Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, 2003.
[29] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD 16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 855-864. [Online]. Available: https://doi.org/10.1145/2939672.2939754
[30] L. Tang and H. Liu, "Relational learning via latent social dimensions," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD 09. New York, NY, USA: Association for Computing Machinery, 2009, pp. 817-826. [Online]. Available: https://doi.org/10.1145/1557019.1557109
[31] S. Nandanwar and M. N. Murty, "Structural neighborhood based classification of nodes in a network," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD 16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 1085-1094. [Online]. Available: https://doi.org/10.1145/2939672.2939782
[32] J. Shun and G. E. Blelloch, "Ligra: A lightweight graph processing framework for shared memory," in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP 13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 135-146. [Online]. Available: https://doi.org/10.1145/2442516.2442530
[33] J. Shun, L. Dhulipala, and G. E. Blelloch, "Smaller and faster: Parallel processing of compressed graphs with Ligra+," in Proc. IEEE Data Compression Conference (DCC), 2015, pp. 403-412.
[34] E. Abbe, A. S. Bandeira, and G. Hall, "Exact recovery in the stochastic block model," IEEE Transactions on Information Theory, vol. 62, no. 1, pp. 471-487, 2015.
[35] E. Abbe and C. Sandon, "Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery," in Proc. IEEE 56th Ann. Symp. Foundations of Computer Science (FOCS), 2015, pp. 670-688.
[36] E. Mossel, J. Neeman, and A. Sly, "Reconstruction and estimation in the planted partition model," Probability Theory and Related Fields, vol. 162, no. 3-4, pp. 431-461, 2015.
[37] S. Lauritzen, A. Rinaldo, and K. Sadeghi, "Random networks, graphical models, and exchangeability," Journal of the Royal Statistical Society: Series B, 2018.