[PDF] Non-backtracking walks reveal compartments in sparse chromatin interaction networks

Abstract

Chromatin communities stabilized by protein machinery play essential role in gene regulation and refine global polymeric folding of the chromatin fiber. However, treatment of these communities in the framework of the classical network theory (stochastic block model, SBM) does not take into account intrinsic linear connectivity of the chromatin loci. Here we propose the "polymer" block model, paving the way for community detection in polymer networks. On the basis of this new model we modify the non-backtracking flow operator and suggest the first protocol for annotation of compartmental domains in sparse single cell Hi-C matrices. In particular, we prove that our approach corresponds to the maximum entropy principle. The benchmark analyses demonstrates that the spectrum of the polymer non-backtracking operator resolves the true compartmental structure up to the theoretical detectability threshold, while all commonly used operators fail above it. We test various operators on real data and conclude that the sizes of the non-backtracking single cell domains are most close to the sizes of compartments from the population data. Moreover, the found domains clearly segregate in the gene density and correlate with the population compartmental mask, corroborating biological significance of our annotation of the chromatin compartmental domains in single cells Hi-C matrices.

Full PDF

NNon-backtracking walks reveal compartments in sparse chromatininteraction networks

K. Polovnikov , ∗ , A. Gorsky , , S. Nechaev , , S. V. Razin , , S. Ulyanov , Institute for Medical Engineering and Science,Massachusetts Institute of Technology, Cambridge, MA 02139 Skolkovo Institute of Science and Technology, 143026 Skolkovo, Russia Interdisciplinary Scientiﬁc Center Poncelet (UMI 2615 CNRS), 119002, Moscow, Russia Lebedev Physical Institute RAS, 119991, Moscow, Russia Moscow Institute for Physics and Technology, Dolgoprudnyi, Russia Institute for Information Transmission Problems of RAS, Moscow, Russia Institute of Gene Biology, Russian Academy of Sciences, Moscow, Russia Faculty of Biology, M.V. Lomonosov Moscow State University, Moscow, Russia (Dated: June 23, 2020)Chromatin communities stabilized by protein machinery play essential role in generegulation and reﬁne global polymeric folding of the chromatin ﬁber. However, treat-ment of these communities in the framework of the classical network theory (stochas-tic block model, SBM) does not take into account intrinsic linear connectivity of thechromatin loci. Here we propose the polymer block model, paving the way for com-munity detection in polymer networks. On the basis of this new model we modifythe non-backtracking ﬂow operator and suggest the ﬁrst protocol for annotation ofcompartmental domains in sparse single cell Hi-C matrices. In particular, we provethat our approach corresponds to the maximum entropy principle. The benchmarkanalyses demonstrates that the spectrum of the polymer non-backtracking operatorresolves the true compartmental structure up to the theoretical detectability thresh-old, while all commonly used operators fail above it. We test various operators onreal data and conclude that the sizes of the non-backtracking single cell domains aremost close to the sizes of compartments from the population data. Moreover, thefound domains clearly segregate in the gene density and correlate with the popula-tion compartmental mask, corroborating biological signiﬁcance of our annotation ofthe chromatin compartmental domains in single cells Hi-C matrices.

I. INTRODUCTION

Many real-world stochastic networks split into self-organized communities. Social net-works feature circles of friends [1–3], colleagues [2], members of a karate club [1], communitiesof dolphins [4] etc. Cellular networks demonstrate modular organization, which optimizescrucial biological processes and relationships, such as synchronization of neurons in the con-nectome [5, 6], eﬃciency of metabolic pathways [7, 8], genes specialization [9] or interactionbetween enhancers and promoters [10].Interest to polymer modular networks has appeared recently in the context of genomespatial folding. Proximity of chromatin loci in space is believed to be deeply connected ∗ To whom correspondence should be addressed. Email: [email protected] a r X i v : . [ q - b i o . M N ] J un with gene regulation and function. Hi-C experiments [13, 14, 17] provide the genome-widecolocalization data of chromatin loci. As the main outcome of the experiment, large genome-wide matrices of contacts from each individual cell or from the population are produced.Analyses of these matrices has revealed that the eukaryotic genome is organized in variousand biologically relevant communities, whose main function is to insulate some regionsof DNA and to provide easy access to the others. In particular, the data collected froma population of cells suggest that transcribed (”active”) chromatin segregates from the,”inactive” one, forming two compartments in the bulk of the nucleus [14, 19]. Withincompartments chromatin is organized further as a set of topologically-associated domains(TADs) [20–22] that regulate chromatin folding at ﬁner scales. However, interpretationand validation of communities in individual cells remains vaguely deﬁned due to sparsity ofrespective data.The broad ﬁeld of applications of stochastic modular networks has initiated the boostdevelopment of community detection methods. Spectral algorithms exploit the spectrumof various operators (adjacency, Laplacian, modularity) deﬁned on a network to identifythe number of communities and to infer the optimal network partition [23–27]. Typically,leading eigenvectors of these operators positively correlate with the true community structureor with the underlying core-periphery organization of the network [28]. These algorithms,along with the majority of theoretical results in the ﬁeld, are derived for the stochasticblock model (SBM) [23, 27] as an extension of Erd¨os-R´enyi graphs [29] to graphs withexplicitly deﬁned communities. One of the strongest limitations of the SBM is that edgesbetween vertices belonging to the same cluster inevitably attain equal weights. At thesame time, biological networks typically have several levels of organization within theircommunities [30]. In particular, identiﬁcation of several hierarchical levels in the networkbecomes tremendously important in the case of polymer networks, where diﬀerent pairs ofloci have marginally diﬀerent probabilities to form a contact in space [15], caused by thefrozen linear connectivity along the chain.Even for simplest polymer systems the contact probability demonstrates a power-lawbehavior with the dimensional-dependent scaling exponent characterizing universal long-ranged behavior of polymer folding [16]. In this work we propose the ”polymer stochasticblock model” which reﬂects a speciﬁc global polymer network organization with explicitstructuring into communities. The main new ingredient of the model under considerationis the average contact probability P ( s = | i − j | ) between the pairs of loci ( i, j ) which isconstant for standard non-polymeric networks, however cannot be neglected for polymers.Chromatin single cell networks are not only polymeric, but also sparse [17, 32]. It is knownthat upon reduction of the total number of edges in the network, there is a fundamentalresolution limit for all community detection methods [27, 33]. Furthermore, traditionaloperators (adjacency, Laplacian, modularity) fail far above this resolution limit, i.e. theirleading eigenvectors become uncorrelated with the true community structure above thethreshold [34]. That is explained by emergence of tree-like subgraphs (hubs) overlappingwith true clusters in the isolated part of the spectrum for these operators. The edge ofthe spectral density of sparse networks is universal and demonstrates the so-called ”Lifshitztail” [35, 36, 38, 39]. Localization on hubs, but not on true communities is a drawback ofall conventional spectral methods in the sparse regime.To prevent the eﬀect of localization on hubs and to make spectral methods useful in sparseregime, Krzakala et al. proposed to deal with non-backtracking random walks on a directedgraph that cannot revisit the same node on the subsequent step [34]. The crucial property ofnon-backtracking walks [40] is that they do not concentrate on hubs. It has been shown thatthe non-backtracking operator is able to resolve the community structure in sparse stochasticblock model up to the theoretical resolution limit. Typically, the majority of eigenvalues ofthe non-backtracking operator (which is a non-symmetric matrix with complex eigenvalues)are located inside a disc in a complex plane, and a number of isolated eigenvalues lie on thereal axis.For the sake of community detection in sparse polymer networks we construct thepolymer-type non-backtracking walks, appropriate for community detection in graphs withhidden linear memory (”polymeric background”). We establish the connection between thisoperator and the generalized polymer modularity, thus, bridging a gap with the maximumentropy principle. We test the performance of diﬀerent spectral methods (with and with-out polymer background) on sparse artiﬁcial benchmarks of polymer networks that mimiccompartmentalization in single cell Hi-C graphs. We show that polymer non-backtrackingwalks resolve the structure of communities up to the detectability threshold, while all otheroperators fail above it. In order to demonstrate eﬃciency of the method on real data, wepartition a set of single cell Hi-C contact maps of mouse oocytes into active (A) and in-active (B) compartments by diﬀerent operators. Found domains are shown to have similarsizes to the compartmental domains and correlate with the compartmental mask from thepopulation-averaged data. Analyses of the GC content within the domains demonstratesenrichment and depression of the genes density in the two clusters, thus, corroborating theirbiological signiﬁcance.The structure of the paper is as follows. In Section II we propose the polymer stochasticblock model, derive the entropy and the corresponding generalized modularity functional.In Section III we discuss polymer non-backtracking walks, prove their robustness on thebenchmarks emulating compartments, and, ﬁnally, test them on the real single cell data. InSection IV we draw the conclusions. II. STOCHASTIC BLOCK MODEL WITH POLYMER CONTACTPROBABILITYA. Deﬁnition of the model

Characterize a N -bead polymer chain by coordinates { x , x , ..., x N } of monomers i =1 , , ..., N and construct a corresponding topological graph G = ( V, E ) with the adjacencymatrix A ij (accounting for the bead’s proximity in space). Such graphs are typically con-structed upon processing of chromatin single cell Hi-C data and in computer simulations ofDNA folding [13, 14]. A graph G does not contain pairwise spatial distances of the polymerconﬁguration, however, provides information on spatial proximity of monomers (or groupsof monomers), which is usually of major biological relevance. For the 1-bin resolution of G the polymer beads (bins) are the nodes V . The edge between a pair of nodes ( i, j ) isdeﬁned by the condition ( i, j ) ∈ E iﬀ | x i − x j | < ε , where the threshold ε is some cutoﬀradius with which the contacts between the two loci are registered in Hi-C. Due to ﬁniteexcluded volume of chromatin, the theoretical number of contacts per monomer that canbe registered in single cell experiments is of order of few units, while the total size of thepolymer chain, measured in number of beads, is huge ( N ∼ in the 1-kb resolution forhuman chromosomes). Thus, the single cell contact matrices are essentially sparse [17, 32].Summation over realizations of adjacency matrices A ij obtained from diﬀerent cells resultsin a ”population-averaged” matrix A ij . By construction, entries of the weight matrix A ij are proportional to the probability that the spatial distance between monomers ( i, j ) is lessthan ε .Already for the simplest conﬁgurations, such as a conformation of ideal polymer chainisomorphic to the random walk, the matrix A ij is not expected to be uniform. This is dueto a polymeric power-law behaviour of a contact probability, P ( s ) ∼ s − α , for s = | i − j | (1)By deﬁnition, P ( s ) is probability to ﬁnd two beads of a linear chain, separated by a chemicaldistance s , close to each other in space. The critical exponent, α , is an important parameter,which characterizes the ”memory” about the embedding of a polymer loop of length s ina D -dimensional space [16]. Such a memory can arise due to some equilibrium topologicalstate of chromatin, or could be a result of partial relaxation of mitotic chromosomes [43].Notable examples of α , typically appearing in the chromatin context for chain embedding ina three-dimensional space, are α = 3 / α ≈ N nodes of a network are split into q diﬀerent groups G i , i = 1 , , ..., q and the edges between each pair of nodes are distributed independently with a probability,depending on the group labels (”colors”) of respective nodes. In a matrix of pairwise groupprobabilities Ω = { ω rt } with ( r, t ) = 1 , , ..., q , any randomly chosen pair of nodes ( i, j )(where i ∈ G r , j ∈ G t ) is linked by an edge with probability ω rt . The corresponding entry inthe adjacency matrix A ij is 1 with probability ω rt and 0 otherwise. The sum of many such”single-cell” Bernoulli matrices generates an analogue of the ”population-averaged” Hi-Cmatrix A ij with Poisson distributed number of contacts with the mean (cid:104)A ij (cid:105) = ω rt where i ∈ G r , j ∈ G t . To the ﬁrst approximation, the communities can be considered identical(known as a ”planted” version of the model)Ω rt = (cid:40) w in , r = tw out , r (cid:54) = t (2)Having (1) and (2), the simplest assumption one can come up with is that formationof compartments in chromatin is independent of the global memory of folding. Indeed,phenomenon of compartments is likely related to preferential interactions of nodes of thesame epigenetic type (e.g., ”active” or ”inactive”) and is modelled as a phase separation ofblock-copolymers [18]. This allows to suggest the factorization of (1) and (2), so that theﬁnal probability for the edge ( i, j ) reads P rob ij = P ( | i − j | ) (cid:40) ω in , r = tω out , r (cid:54) = t , i ∈ G r , j ∈ G t (3)To emulate A and B compartments in a single cell Hi-C network, we consider a simpleadjacency benchmark of a polymer with two communities. Namely, we represent the chain asa sequence of alternating segments of A and B type (painted in red and blue), whose lengthsare Poisson-distributed with the mean length λ . An example of the resulting adjacencymatrix is depicted in Fig. 1(a). Note that due to decay of the contact probability, the”checkerboard” compartmentalization pattern is hardly seen in single cells Hi-C data [32].Since segments of the same type are surrounded in space by segments of the other type, theyform local ”blob-like” clusters along the main diagonal of the adjacency matrix reminiscentto topologically-associated domains [20]. However, they are likely formed by a diﬀerentmechanism and have an order of magnitude larger size than TADs [18]. Such a multi-domain blob structure in Fig. 1(a) is a manifestation of the polymeric nature of the networkand it cannot be reproduced with communities of general memory-less networks, i.e. inthe framework of the canonical stochastic block model with two clusters – see Fig. 1(b) forcomparison. (a) (b) Figure 1: Adjacency matrices of N = 1000 with two clusters generated according to the (a) polymerstochastic block model ( w in = 1 , w out = 0 . , P ( s ) = s − , λ = 100) and (b) canonical stochasticblock model ( w in = 0 . , w out = 0 . , λ = 500). Vertices in the graph are enumerated by thepolymer coordinate (a) and ﬁrst all red, then all blue ones (b). B. Statistical inference of polymer SBM and generalized modularity functional

Suppose that a population-averaged matrix A is observed. By deﬁnition, each entry A ij of this matrix counts the amount of reads between the bins i and j coming from a populationof single cells. Thus, after proper normalization, A ij is a Poisson variable with the meandictated by (3), (cid:104) A ij (cid:105) = P ij ω g i g j , and ω g i g j = Ω ij are the pairwise group probabilities (at themoment we do not require all the groups to be identical). Neglecting correlations betweenthe matrix entries, the statistical weight of A conditioned on the cluster probability matrixΩ, background contact probability P and group labels of the nodes { g i } , can be factorizedinto the product of the Poisson probabilities for the entries A ij Z ( A| Ω , P, { g i } ) = (cid:89) i

1, the partitioning yields γ →

1, whichcorresponds to the pure background. To determine the optimal value of γ , one can run arecursive procedure, which consists of iterative maximization of the generalized modularityand renormalization of γ according to (8). We realize this approach in our numerical analysesbelow. III. POLYMER NON-BACKTRACKING FLOW OPERATORA. Non-backtracking walks on a directed polymer network

Search for the global maximum to the modularity functional is a very hard computationalproblem. One of most promising approaches which avoids a brute force, is to suggest thatif the community structure is signiﬁcantly strong, there is an operator whose eigenvectorsencode the network partitioning in these communities [3, 27]. However, as it was ﬁrst notedby Krzakala et al [34], for sparse networks leading eigenvectors become uncorrelated withtrue community structure well above the theoretical threshold. As a result, all conventionaloperators such as adjacency, Laplacian and modularity fail to ﬁnd communities in rathersparse networks.To overcome this diﬃculty, it was proposed to exploit the spectrum of the Hashimotomatrix B , which is a transfer matrix of non-backtracking walks on a graph [40]. It is deﬁnedon the edges of the directed graph, i → j, k → l , as follows B i → j,k → l = δ il (1 − δ jk ) (10)It is seen from (10) that the non-backtracking operator prohibits returns to the point whicha walker has visited at the previous step. Since matrix B is non-symmetric, its spectrumis complex. For Poissonian graphs the spectral density of B is constrained within a circleof radius (cid:112) (cid:104) d (cid:105) in the complex plain and exhibits no ”Lifshits tail” singularities near thespectral edge, in contrast to other conventional operators [34, 35]. Real eigenvalues of B lying out of the circle become relevant to the community structure even in sparse networks.Associating the corresponding eigenvectors with the network partitioning permits to detectcommunities all the way down to the theoretical limit. In [24] M. Newman suggested anormalized operator, that conserves the probability ﬂow at each step of the walker.For the sake of community detection in sparse polymer graphs, we propose a conceptuallysimilar operator that describes the evolution of the non-backtracking probability ﬂow on agraph with intrinsic linear memory R i → j,k → l = δ il (1 − δ jk ) d i − − γ ( d j d l ) − P jl (11)In Appendix we establish the connection between the non-backtracking operator and thegeneralized modularity, derived in the previous Section from the statistical inference of thepolymer SBM. Thus, partitioning of a polymer network into two communities according tothe leading eigenvector of the polymer non-backtracking ﬂow operator (11) responds to themaximum entropy principle.An example of the non-backtracking walk on a polymer graph is illustrated in theFig. 2(a). Note that despite immediate revisiting of nodes is forbidden, the walker is allowedto make loops. The second term in (11) plays a role of neutralization towards the contactprobability, arising from the linear organization of the network. This compensation providesa measure for the non-backtracking operator to tell apart the true communities from theﬂuctuations, evoked by the polymeric scaling. Trivially, the proposed non-backtracking op-erator is converged to the Newman’s ﬂow operator, when the background is non-polymeric,but rather corresponds to the conﬁguration model with ﬁxed degrees H ij = d i d j / m [24].For a pure polymeric graph without contamination by communities, the spectrum of (11)lies inside a circle of radius r = (cid:112) (cid:104) d ( d − − (cid:105) . As suﬃciently resolved communities areformed in the network, isolated eigenvalues pop up at the real axis.In Fig. 2(b) we depict the non-backtracking spectrum of a polymer SBM, correspondingto the fractal globule polymer network with P ( s ) = s − of the size N = 1000 with twocompartments, organized as contiguous alternating segments with the mean length λ = 100.For the parameters w in , w out used, the two compartments are well resolved that is providedby the isolated eigenvalue separated from the circle. Since the leading eigenvector u (1) ofthe polymer non-backtracking ﬂow, in contrast to the adjacency or modularity, is deﬁnedon directed edges of the network, one needs to evaluate the Potts spin variables g i = ± i -th node g i comes from the ﬂow along allthe directed edges pointing to i . Thus, in order to switch from edges to nodes, one needsto evaluate the sign of the sum v i = (cid:80) j A ij u (1) j → i and to assign the node i accordingly, g i = sign ( v i ). (a) 𝑤𝑤 𝑖𝑖𝑖𝑖 = 1 𝑤𝑤 𝑜𝑜𝑜𝑜𝑜𝑜 = 0.4 𝑃𝑃 𝑠𝑠 = 𝑠𝑠 −1 (b) 𝑖𝑖 = 1 𝑖𝑖 = 2 𝑖𝑖 = 3 𝑖𝑖 = 𝑁𝑁 Figure 2: (a) Depiction of the polymer SBM network: the backbone (bold), contacts betweengenomically distant monomers (dashed) and two chemical sorts of the monomers (red and blue),arranged into contiguous alternating segments. An example of the non-backtracking walk on suchgraph is shown by arrows. Immediate returns are forbidden, preventing localization on hubs; (b)Spectrum of the polymer non-backtracking ﬂow (11) for the fractal globular ( P ( s ) = s − ) large-scale organization of the chain with two overlaid compartments with the mean length λ = 100. B. Spectral clustering of the polymer stochastic block model

In this section we investigate spectral properties of the polymer non-backtracking ﬂowand compare performance of various linear operators in partition the polymer SBM. Thetwo compartments with λ = 100 are superimposed over the fractal globule, P ( s ) = s − , withtotal size of the network, N = 1000. We ﬁx the weight of internal edges at w in = 1 and changethe resolution of compartments by tuning the weight of external edges, w out = 0 . − . γ was chosen. Itis evident that the polymer ﬂow operator surpasses all conventional operators without thebackground, as well as the polymer modularity everywhere below w out ≈ .

5. Qualitativelysimilar behaviour was demonstrated by the traditional non-backtracking operator withoutthe background, when it was compared to other operators in [34]. Therefore, our analy-ses (i) underscores the importance of taking into account the contact probability (polymerbackground) when dealing with polymer graphs, and (ii) recapitulates eﬃciency of non-backtracking walks in resolving communities in sparse networks.It is worth noting that the abrupt fall in performance of the polymer ﬂow operatorcoincides with the leveling of its amount of isolated eigenvalues at zero, see Fig. 3(d). Valuesaround w out ≈ . w out into the average amount of inner, c in = N w in /

2, and outer, c out = N w out /

2, edges and plot them as functions of w out . As itis shown in Fig. 3(c)), the polymer ﬂow operator drops close to the theoretical detectabilitytransition for regular stochastic block models [48] (i.e. each node has exactly c in randomlinks with other nodes in its community and exactly c out randomly pointed links to nodesfrom the other community) c in − c out > √ c in + c out (12)For the stochastic block model the number of isolated eigenvalues of B exceeds the numberof communities by one [34]. However, in case of the polymer operator R the number ofisolated eigenvalues can be much larger and ”apparent” clusters might be formed ”locally”at the main diagonal due to the frozen linear connectivity, see Fig. 1(a). This is evidentfrom the Fig. 3(d), which shows that the number of isolated eigenvalues for the polymer ﬂowoperator can be of order of the amount of the segments ( N/λ ), if w out is suﬃciently low.Indeed, for the fractal globule probability of the edge between two distant segments of thesame type is s times smaller than probability of the link for two close monomers ( s = | k − m | is the genomic distance between segments k and m ). Due to the overall small number ofcontacts in the network, the polymer non-backtracking ﬂow ends up rationalizing them asseparate clusters.The value of γ cannot be chosen arbitrary since it characterizes optimal parameters ofstochastic blocks. Thus, one may propose the following iterative approach:(i) begin with the initial value γ = 1, for which we obtain the network partition;(ii) use the amount of inner and outer edges for estimating w in , w out ;(iii) recalculate γ according to (8);(iv) repeat the procedure iteratively until γ converges to γ opt .0 adjacencyn-LaplaceMN flowpoly modpoly nbt F r a c t i on o f c o rr e c t l y s o r t ed node s 𝑤𝑤 𝑜𝑜𝑜𝑜𝑜𝑜 𝑤𝑤 𝑜𝑜𝑜𝑜𝑜𝑜 𝑤𝑤 𝑜𝑜𝑜𝑜𝑜𝑜 i s o l a t ed e i gen v a l ue s step 𝑤𝑤 𝑜𝑜𝑜𝑜𝑜𝑜 = 0.1 𝑤𝑤 𝑜𝑜𝑜𝑜𝑜𝑜 = 0.4 (a) (b)(c) (d) 𝑤𝑤 𝑜𝑜𝑜𝑜𝑜𝑜 = 0.1 𝑤𝑤 𝑜𝑜𝑜𝑜𝑜𝑜 = 0.2 𝑤𝑤 𝑜𝑜𝑜𝑜𝑜𝑜 = 0.3 𝑤𝑤 𝑜𝑜𝑜𝑜𝑜𝑜 = 0.4 𝑤𝑤 𝑜𝑜𝑜𝑜𝑜𝑜 = 0.5 𝛾𝛾 Figure 3: (a) Comparison of performance of diﬀerent classical operators without background,polymer modularity and polymer non-backtracking ﬂow operators ( N = 1000 , P ( s ) = s − , w in =1 , λ = 100); (b) The iterative approach that can be used to determine the optimal value of γ forﬁve values of w out ; the true optimal values of γ calculated from (8) are shown by dash; (c) Themean numbers of inner c in and outer c out edges are calculated for each value of w out in order toestimate the detectability threshold for the corresponding regular network. (d) Amount of isolatedeigenvalues of the polymer ﬂow operator plotted against w out . Full spectra of the polymer ﬂowoperator for the two values of w out are shown in the insets. Results of this procedure are demonstrated in the Fig. 3(b) for ﬁve diﬀerent values of w out .It is seen that just several steps of iteration is suﬃcient to obtain a reasonable convergencetowards the theoretical values provided by (8). A drawback of this iterative procedure isthat at each step one needs to evaluate the spectrum of the operator 2 m × m , whichcould become a hard computational task for large and dense networks. As a reasonableapproximation to the optimal value of γ for the polymer ﬂow operator, one can evaluate γ opt similarly for the polymer modularity, which is smaller in size and symmetric. C. Polymer non-backtracking ﬂow resolves compartments in a single cell Hi-Cnetwork

To check robustness of the polymer non-backtracking ﬂow operator on real Hi-C datawe run it on a set of individual oocyte cells of mouse [32]. From the public repositorywe have taken the single cells Hi-C data on cis-contacts of 20 chromosomes from 13 singlecells (260 adjacency matrices, in total). While single cells matrices with suﬃciently largenumber of contacts are not sparse and can be split into compartments using conventionalmethods largely used for the bulk data (e.g., the leading eigenvectors of observed/expected1transformation of a population-averaged Hi-C map, [14]), here we take the cells with low tomoderate amount of contacts for the sake of comparative analyses of clustering performanceof diﬀerent spectral methods on sparse polymer graphs.Before proceeding with the analyses of compartments in single cells, the raw data mustbe preliminary processed. In order to extract compartmentalization signal from the maps,we have coarse-grained them to the resolution 200kb. At this resolution all ﬁner genomefolding structuring (like topologically-associated domains) is encoded within the coarse-grained blobs and does not communicate with two large-scale A and B compartments. Wenote, that, in principle, the method is applicable at higher resolution as well. However, thereare two important considerations. The non-backtracking operator is deﬁned on the edges,therefore, the leading eigenvectors need to be computed for much larger matrix than in caseof traditional operators, which are deﬁned on the nodes (e.g., modularity). This means thatthe computation time of the method is very sensitive to the resolution. Furthermore, oneneeds to be very careful with the overall network density: it decreases by several times upondecreasing of the bin size, so that one can occasionally cross the detectability limit (12).In each particular case the resolution for the annotation should be chosen with respect tothe sparsity of the experimental single cell contact maps. According to this logic, we havedecided to use the resolution 200 kb for the data of Flyamer et al.Most of the contacts in the cells have degeneracy 1 at the chosen resolution, however,several pairs of bins have more than 1 contact. To preserve this feature of enhanced connec-tivity, we consider the counts of contacts between the pairs as weights of the correspondingedges. Furthermore, the single-cell maps are noisy and some of really existing contacts getlost due to technical shortcomings of the experimental protocol. As long as the neighboringblobs in the chromatin chain are connected with probability 1, all lost contacts A i,i +1 needbe added to the adjacency matrix manually; we assign the weight 1 to such edges. We alsocleans the coarse-grained data from the self-edges, assigning A ii = 0.To determine the background model for our analyses we calculate the contact probability P ( s ) = N − s (cid:80) N − si =1 A i,i + s for each individual single cell and for the merged cell (summingsingle cells matrices), see Fig. 4a. Resulting dependence turns out to be fairly close to thefractal globule contact probability, P ( s ) ∼ s − α with α ≈ ≈ Drosophila cells [42]. As it was shown in previous Section, in order to ex-tract compartmentalization proﬁle overlaying a speciﬁc long-ranged folding, it is crucial toincorporate the respective background contact probability into the polymer model of thestochastic blocks.Having the background model determined, we construct the polymer non-backtrackingﬂow operator with the variable parameter γ and run the iterative clustering procedure toderive the optimal value γ . Similarly to the analyses on the benchmarks, see Fig. 3b, aswift convergence to the optimal value is observed here. The spectrum of the polymer ﬂowoperator for the cell 29749, chromosome 3 at γ ≈ . (c)(a) (b) 𝑠𝑠 −1 𝑠𝑠 𝑃𝑃 ( 𝑠𝑠 ) C o n t a c t p r o b a b ili t y o f s i n g l e c e ll s , (d) Figure 4: (a) The average contact probability P ( s ) of single cells (gray) and of the merged cell(solid, black) computed for logarithmically spaced bins with the logfactor 1.4; the fractal globulescaling P ( s ) ∼ s − is also shown by dashed line for comparison. (b) Annotation of active (red)and inactive (blue) compartmental domains for one of the contact maps (cell 29749, chromosome3, length N = 492, 200kb resolution) by the polymer non-backtracking ﬂow operator. Below themap the compartmental signal from the corresponding leading eigenvector of the polymer non-backtracking ﬂow matrix is shown. Inset: the full spectrum of the polymer ﬂow for the samecontact map. (c, d) Averaged proﬁles of the GC content (z-scores) plotted around the centers ofthe compartmental domains (active - red, inactive - blue) for the population of cells and for a poolof single cells. eigenvalues could be much larger than the number of compartments.The partition of the single cells in two compartments has been performed in the leadingeigenvector approximation of the diﬀerent operators. The boundaries of active and inactivedomains are determined according to the sign of the respective compartmental signal (seeFig. 4(b) and Supplementary Fig. S1 online). It is known that the gene density is higher inthe actively transcribed A compartment, thus, the fraction of GC letters in bins of activecompartmental domains needs to be larger than in inactive domains. To validate that theclusters found in single cells respond to the transcriptional domains and are biologicallysigniﬁcant, we calculate the GC content proﬁles around the centers of all A and B domainsseparately and then take the average of these proﬁles in each group. The types of the domainswere phased in accordance with the leading eigenvector of the bulk data (population Hi-Con embryonic stem cells was used [50]; the eigenvector was computed on the observed-over-expected map). We also plot analogous proﬁles for the leading eigenvector of the bulk data.In absence of direct annotation methods for single cells due to their sparsity, these twomeasures have been of use to approximate positions of the compartmental domains in singlecell Hi-C data [32].As expected, the GC content for the population-averaged map and the bulk E1 vectorboth have pronounced peaks at the center of A domains and symmetrical dips at the centerof B domains with the z -score amplitude equal to 0 . . A and for B ) of the polymernon-backtracking ﬂow fall symmetrically to zero at the same genomic distance, around 4 − (cid:104) l (cid:105) ≈ (cid:104) l (cid:105) ≈ α , we additionally runthe polymer non-backtracking for α = 3 /

2, which is the scaling exponent of the contactprobability for the ideal chain packing. Comparison of the two values of the parameter isdemonstrated in Supplementary Fig. S4 online: the proﬁles with α = 3 / α ≈ P ( s ) for the set of single cells, Fig. 4a, underscoring the importance ofneutralization on the appropriate average polymeric scaling before the clustering.Note that the partitions of the polymeric operators (non-backtracking, modularity) arevisibly much more adequate to apparent clustering of contacts in a particular cell (Supple-mentary Fig. S1 online). Despite the similarity in compartmental signals from the polymermodularity and from the polymer non-backtracking ﬂow, the sizes of modularity domainsare almost twice larger ( (cid:104) l (cid:105) ≈ ≈ − .

07 and stays negativethroughout the whole range of the compartmental interval. This is a consequence of sparsity,which results in a limited performance of all traditional spectral methods.

IV. CONCLUSION

In this paper we have developed theoretical grounds for spectral community detection insparse polymer networks. On the basis of suggested polymeric extension of the stochasticblock model, we have proposed the polymer non-backtracking ﬂow operator and have proventhat its leading eigenvector performs partitioning of a polymeric network into two clustersaccording the maximum entropy principle. The established connection with the modularityfunctional provides a computationally eﬃcient tool for the network partitioning and searchfor the optimal resolution parameter of the partition in polymer networks, which, however,is inferior to the non-backtracking in eﬃciency for sparse networks.The proposed theoretical framework is veriﬁed by extensive numerical simulations of poly-mer benchmarks, constructed in order to emulate compartmentalization in sparse chromatinnetworks. Comparative analyses of diﬀerent operators on the benchmark has suggested thatthe polymer ﬂow detects the communities up to the theoretical detectability limit, whileall other operators fail above it. At the same time, the amount of isolated eigenvalues ofthe polymer ﬂow operator can be larger than amount of true communities present in thenetwork, due to frozen linear connectivity that forces the chain to form ”blobs” along thechain contour. This result distinguishes the polymer system with thespect the canonicalstochastic block model, where the number of isolated eigenvalues of the non-backtrackingexactly matches the number of communities.Analyses of the single cell Hi-C data of mouse oocytes suggests that the non-backtrackingwalks eﬃciently split experimental sparse networks into biologically signiﬁcant communities,characterized by enrichment and depression of the genes density. The sizes of the compart-mental domains are fairly close to the sizes of the population-averaged domains. Comparison4with characteristics of the domains, inferred by other operators, underscores superiority ofthe non-backtracking walks in partitioning sparse polymer networks.In this study we have exploited for the polymer network analysis only the simplest spectralcharacteristics. More involved ones, e.g. spectral correlators and the level spacing distri-bution, carry additional information about the propagation of excitations in network. Thespectral statistics and non-ergodicity have been discussed in clustered networks in [11, 12].In the context of the gene interactions the spectral statistics has been discussed in [51] forthe matrices with the real spectrum. The non-backtracking matrices enjoy complex spec-trum hence the special means are required to analyze the level spacing in this case. Thecorresponding tool has been invented recently [52, 53], therefore, the spectral statistics ofthe polymer non-backtracking ﬂow operator certainly deserves a separate study.

Acknowledgments

We are grateful to Leonid Mirny, Mikhail Gelfand, Mikhail Tamm, Maxim Imakaev andNezar Abdennur for valuable discussions on the subject of the paper. KP, AG and SNacknowledge supports of the Foundation for the Support of Theoretical Physics and Math-ematics “BASIS” (respectively 17-1-2-27-8 for KP, 17-11-122-1 for AG and 19-1-1-48-1 forSN). This work was supported by grants RFBR 18-29-13013 (KP, AG, SN) and RSF 19-14-00016 (SVR, SU). The authors are grateful to CNRS for the organizational and ﬁnancialsupport of the publication.

Contributions

K.P. designed the study, performed the benchmark and single cell analyses and wrote themanuscript. A.G., S.N., S.V.R. and S.U. supervised the work and participated in writing.

Competing interests

The authors declare no competing ﬁnancial interests.

Methods. Quadratic form of the polymer non-backtracking operator

Let us consider a quadratic form involving the operator over the Potts spin variables g i , i = 1 , , ..., N and introduce the 2 m -dimensional (2 m is the number of edges in thenetwork) vector u , such as u i → j = g j . Then, R = u T R u = (cid:88) ( i,j ) ∈ E ( k,l ) ∈ E R i → j,k → l g j g l (13)5It can be shown that (13) coincides with the quadratic form of the generalized modularity.Let us consider the terms separately. The quadratic form of the ﬁrst, non-backtracking term,yields (cid:88) i → jk → l δ il (1 − δ jk ) d i − g j g l = (cid:88) i,j g i g j d i − A ij (cid:88) k (1 − δ A ik ) (1 − δ jk ) = (cid:88) ij A ij g i g j (14)where the sum over k enumerates the edges of the node i except of the edge ( i, j ) and, thus,equals d i −

1. Expanding the quadratic form of the second term similarly, we get γ (cid:88) i → jk → l ( d j d l ) − P jl g j g l = γ (cid:88) jl ( d j d l ) − P jl g j g l (cid:88) ik (cid:0) − δ A ij (cid:1) (1 − δ A kl ) = γ (cid:88) jl P jl g j g l (15)Collecting (14) and (15) together one arrives at R = u T R u = (cid:88) ij ( A ij − γP ij ) g i g j ; P ij = 1 | i − j | α (16)which is the quadratic form of the generalized modularity functional, proportional to theentropy of the polymer SBM. [1] Zachary, W. W. An information ﬂow model for conﬂict and ﬁssion in small groups. J. Anthro-pol. Res.

P. Natl.Acad. Sci. USA

P. Natl. Acad. Sci. USA

P.Roy. Soc. Lond. B Bio.

S477S481 (2004).[5] Harris, K. D., et al. Organization of cell assemblies in the hippocampus.

Nature

J. Neurosci.

Nature

Science

Statistical applications in genetics and molecular biology (2005).[10] Doyle, B., et al. Chromatin loops as allosteric modulators of enhancer-promoter interactions.

PLoS computational biology (2014).[11] Avetisov, V., Hovhannisyan, M., Gorsky, A., Nechaev, S., Tamm, M., & Valba, O. Eigenvaluetunneling and decay of quenched random network.

Phys. Rev. E [12] Avetisov, V., Gorsky, A., Nechaev, S., & Valba, O. Localization and non-ergodicity in clustered random networks, Journal of Complex Networks , CNZ026, https://doi.org/10.1093/comnet/cnz026 .[13] Dekker J., et al. Capturing chromosome conformation.

Science

Scientiﬁc reports

Statistical Physics of Macromolecules (American Instituteof Physics: New York, 1994).[17] Nagano, T., et al. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure.

Nature

Proceedings of the National Academy of Sciences

E6697-E6706 (2018).[19] Fortin, J.-P., & Hansen, K.D. Reconstructing A/B compartments as revealed by Hi-C usinglong-range correlations in epigenetic data.

Genome biology

180 (2015).[20] Dixon, J. R., et al. Topological domains in mammalian genomes identiﬁed by analysis ofchromatin interactions.

Nature

Cell

Scienceadvances eaaw1668 (2019).[23] Fortunato, S. Community detection in graphs.

Physics reports

PhysicalReview E

Phys-ical review E

Journal of Statistical Mechanics: Theory and Experiment

P10020 (2010).[27] Decelle, A., et al. Asymptotic analysis of the stochastic block model for modular networksand its algorithmic applications.

Physical Review E

Physica A: Statistical Mechanics and itsApplications

Publ. Math. Debrecen Phys. Rev. E

Soft Matter , Nature

J Stat Mech

P12021 (2012).[34] Krzakala, F., et al. Spectral redemption in clustering sparse networks.

Proceedings of the National Academy of Sciences

Physics-Uspekhi

99 (2018).[36] I. M. Lifshitz, Theory of ﬂuctuation levels in disordered systems, Sov. Phys. JETP,

Introduction to the theory of disorderedsystems (Wiley-Interscience: 1988)[38] Goh, K.-I. et al. Spectra and eigenvectors of scale-free networks.

Phys. Rev. E

Adv. Stud.Pure Math.

Scientiﬁc reports

Genome research

PLoS compu-tational biology (2008).[44] Grosberg, A. Yu., et al. Crumpled globule model of the three-dimensional structure of DNA.

EPL (Europhysics Letters)

373 (1993).[45] Grosberg, A. Yu., et al. The role of topological constraints in the kinetics of collapse ofmacromolecules.

Journal de physique

Proceedings of the National Academy of Sciences

Physical Review E

Naturemethods

119 (2018).[50] Bonev, B., et al. Multiscale 3D genome rewiring during mouse neural development.

Cell

SciRep. Phys. Rev. E

Physical Review X10.2,