essHi-C: Essential component analysis of Hi-C matrices

Abstract

Motivation: Hi-C matrices are cornerstones for qualitative and quantitative studies of genome folding, from its territorial organization to compartments and topological domains. The high dynamic range of genomic distances probed in Hi-C assays is reflected in an inherent stochastic background of the interaction matrices, which inevitably convolves the features of interest with largely aspecific ones. Results: Here we introduce and discuss essHi-C, a method to isolate the specific, or essential, component of Hi-C matrices from the aspecific portion of the spectrum that is compatible with random matrices. Systematic comparisons show that essHi-C improves the clarity of the interaction patterns, enhances the robustness against sequencing depth, allows the unsupervised clustering of experiments in different cell lines and recovers the cell-cycle phasing of single cells based on Hi-C data. Thus, essHi-C provides a means for isolating significant biological and physical features from Hi-C matrices.


Stefano Franzini, Marco Di Stefano, and Cristian Micheletti

SISSA, International School for Advanced Studies, Trieste, I-34136, Italy.
CNAG-CRG, Centre Nacional d'Análisi Genómica - Centre de Regulació Genómica, Barcelona, 08028, Spain.
* [email protected]

Introduction

Much of our current understanding of the structural-functional interplay in the genome is owed to the advancements fostered by chromosome conformation capture-based methods (3C [1]), which are now numerous [2, 3]. For instance, Hi-C experiments demonstrated that inter-chromosome interactions are suppressed compared to intra-chromosome ones, thus giving quantitative support to the earlier notion of chromosome territories [4]. Inspection of the interaction matrices, and their dominant eigenvector [5], revealed the existence of two main chromatin compartments. Analysis of the interaction patterns showed the presence of self-interacting domains termed topologically associating domains (TADs) [6, 7, 8], which may form complex nested structures [9].

The increasing Hi-C resolution has made it possible to compare interaction patterns of different samples. The long-term goal of such comparative studies is establishing which aspects of genome organisation vary across different stages of cell development [10] and cell fate [11, 12, 13, 14], are affected by differences in gene transcription [15, 16, 17], or are mis-regulated in disease-related phenotypic alterations [18] and cancer [19].

Owing to the importance of comparative analysis, it is increasingly crucial to cross-validate data gathered with different protocols, resolution, and sequencing depth [20, 21, 22, 23, 24], thus identifying the common and statistically significant features [25] in Hi-C matrices.

These observations pose, in turn, the more fundamental question of whether it is at all feasible to identify a priori the robust, significant features of a given Hi-C matrix without resorting to other terms of comparison, which might contain biases or even not be available at all. An a priori knowledge of the significant features of a Hi-C matrix would also enhance the capability of detecting meaningful differences and similarities with other Hi-C matrices.

In this study, we show that spectral analysis methods, which rely on the information content of eigenvectors and eigenvalues, are ideally suited to this endeavour. However, with only a few exceptions, spectral methods have so far focussed on the first one or two eigenvectors of Hi-C matrices, which are informative for the chromatin compartmentalisation.

Here, we introduce essHi-C (after 'essential Hi-C') to extend these considerations to the full spectrum of Hi-C matrices, which we regularise for genomic distance. We show that most of the spectrum is compatible with that of random matrices, and thus represents an aspecific component shared across chromosomes from different samples. Interestingly, by discounting this aspecific part of the spectrum, and retaining only what we term the essential component, we enhance the definition of chromosomes' architectural features and provide a significant advantage in readily picking up similarities of replicates and dissimilarities across different cell lines. Accordingly, heterogeneous sets of Hi-C matrices can be reliably grouped per cell line using unsupervised clustering, which can even pick up the elusive interaction signatures of distinct experimental protocols. Finally, we show that essential matrices are stable against variations of the sequencing depth and the amount of input material of Hi-C. We found that essHi-C is predictive of the features discernible in Hi-C matrices with deeper sequencing and is informative for the genome architecture in extremely low-input Hi-C datasets, like the ones from single-cell Hi-C experiments.

Methods

Dataset.

Our dataset consists of intra-chromosome Hi-C matrices from 79 experiments of 9 human cell lines, including five with normal karyotype (GM12878, IMR90, NHEK, HMEC, hESC) and four with cancerous karyotypes (T47D, K562, KBM7, SKBR3) (see Supplementary Table 1). The sra-toolkit (http://ncbi.github.io/sra-tools/) (v2.9.6) was used to fetch the Hi-C datasets from the public sequence read archive (SRA) (Supplementary Table 1) and convert them to FASTQ format after validation. The TADbit pipeline [26] (https://github.com/3DGenomes/tadbit) was used to (i) check the quality of the FASTQ files; (ii) map the paired-end reads to the H. sapiens reference genome (release GRCh38/hg38) using GEM [27], accounting for restriction enzyme cut-sites; (iii) remove non-informative reads using the default TADbit filtering options; (iv) merge datasets within each experiment when appropriate; (v) normalise each experiment using the OneD method [28] at 100 kbp (kilo-basepairs) resolution.

Observed over expected normalisation.

The obtained intra-chromosomal Hi-C matrices were subjected to the so-called observed over expected (OoE) normalisation [5] to discount the overall dependence of matrix entries, $A_{ij}$, on the corresponding genomic distance, $s = |i - j|$. Starting from a OneD-normalised matrix, $A$, the average interaction at any genomic distance $s$, $I(s)$, was computed. Each entry of the normalised matrix, $B$, is then defined as

$$B_{ij} = \frac{A_{ij}}{I(s = |i - j|)} .$$

Except when otherwise stated, all Hi-C matrices considered in their full form are intended to be OoE-normalised.

Random matrix ensemble.

As a null model for the OoE Hi-C matrices we considered the so-called Gaussian orthogonal ensemble, whose elements are symmetric matrices with entries drawn from a unitary Gaussian distribution, hence with zero mean and unit variance. On average, the salient spectral properties of the elements of this ensemble are as follows [29, 30]. The set of orthonormal eigenvectors samples uniformly the surface of a unit $(N-1)$-dimensional sphere, where $N$ is the linear size of the matrices. The generic component, $x$, of any eigenvector follows the same Gaussian distribution with zero mean and variance equal to $1/N$,

$$p(x) = \sqrt{\frac{N}{2\pi}}\, e^{-N x^2 / 2} . \tag{1}$$

The distribution of the eigenvalues, $\lambda$, is governed by Wigner's semicircular law:

$$p(\lambda) = \frac{2}{\pi \Lambda^2} \sqrt{\Lambda^2 - \lambda^2} , \tag{2}$$

with the interval $[-\Lambda, \Lambda]$ being the support of $\lambda$.

Essential component of Hi-C matrices.

One of our main results is that OoE Hi-C matrices have a spectrum largely consistent with that of random matrices, except for a limited set of eigenvectors with atypically large eigenvalues in modulus. Borrowing a terminology introduced in other contexts [31, 32], we refer to these eigenspaces as essential and we use their spectral summation to define the essential Hi-C matrix. Starting from an OoE-normalised matrix, $A$, the entries of its essential form are defined as

$$A^{\mathrm{ess}}_{ij} = \sum_{n=1}^{n^*} \lambda_n\, a^{(i)}_n a^{(j)}_n , \tag{3}$$

where $a^{(i)}_n$ is the $i$-th component of the $n$-th eigenvector of $A$ and $\lambda_n$ is the associated eigenvalue. The matrix $(a^{(i)}_n a^{(j)}_n)$ denotes the projector associated to the eigenvector $a_n$ (see Fig. 1D for examples). Once weighted by the corresponding eigenvalue, $\lambda_n a^{(i)}_n a^{(j)}_n$, it can be interpreted as the eigenspace, that is, the contribution of $a_n$ to the Hi-C contact pattern. The eigenspaces are ranked by decreasing modulus of the eigenvalues, so that the summation is restricted to the top $n^*$ essential spaces. In principle, $n^*$ could be assigned differently for each matrix. For simplicity we set $n^* = 10$ for all the applications in this study, except for the case of sparse single-cell Hi-C matrices, as the results do not vary appreciably upon including additional spaces, see Supplementary Fig. S1.

Measuring matrix similarity across replicates.

In our dataset, the largest set (34) of independent Hi-C experiments on the same cell line (replicates) pertains to GM12878 cells. The similarity of these a priori equivalent Hi-C measurements was assessed as follows. For each chromosome (including autosomes and X), 100 pairs of replicate Hi-C matrices were randomly picked. For any such pair, $P$ and $Q$, we next obtained the sum and difference matrices, $S = P + Q$ and $D = P - Q$. The similarity parameter $\gamma$ was computed as

$$\gamma = \frac{\langle S \rangle}{\sigma_D} , \tag{4}$$

where the angular brackets denote the average over all entries $S_{ij}$ with $i \ge j$ and $\sigma_D$ is the root mean-square value of all entries $D_{ij}$, again with $i \ge j$. OoE-normalised Hi-C matrices with good (poor) similarity thus yield values of $\gamma$ that are much larger (smaller) than 1.

Measuring matrix robustness across different sequencing depths.

We considered the set of GM12878 matrices from the experiment HIC003 [5], which have the highest available sequencing depth (see Supplementary Table 1), and used them as gold standard for comparisons with matrices at much lower depths. The consistency of a generic matrix with the gold standard (same chromosome) was measured in terms of the Spearman correlation coefficient of corresponding entries.
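The two measures defined above, the similarity parameter γ of eq. 4 and the Spearman correlation with a gold standard, can be illustrated with a minimal sketch. It assumes dense symmetric NumPy arrays and synthetic toy data; the function names are hypothetical and are not taken from the essHi-C package.

```python
import numpy as np
from scipy.stats import spearmanr


def lower_triangle(m):
    """Flatten the entries m[i, j] with i >= j."""
    rows, cols = np.tril_indices(m.shape[0])
    return m[rows, cols]


def similarity_gamma(p, q):
    """Eq. 4: mean of S = P + Q over the RMS of D = P - Q, both on i >= j."""
    s = lower_triangle(p + q)
    d = lower_triangle(p - q)
    return s.mean() / np.sqrt(np.mean(d ** 2))


def spearman_consistency(m, gold):
    """Spearman rank correlation of corresponding entries (i >= j)."""
    rho, _pvalue = spearmanr(lower_triangle(m), lower_triangle(gold))
    return rho


# Toy data (assumption): one shared symmetric "signal" plus independent noise.
rng = np.random.default_rng(0)
n = 50
signal = 1.0 + np.abs(rng.normal(size=(n, n)))
signal = (signal + signal.T) / 2


def noisy_replicate(scale):
    """Return the signal plus symmetric Gaussian noise of the given scale."""
    noise = rng.normal(scale=scale, size=(n, n))
    return signal + (noise + noise.T) / 2


rep_a, rep_b = noisy_replicate(0.1), noisy_replicate(0.1)
gamma = similarity_gamma(rep_a, rep_b)
rho_low_noise = spearman_consistency(noisy_replicate(0.1), signal)
rho_high_noise = spearman_consistency(noisy_replicate(1.0), signal)
print(f"gamma = {gamma:.1f}")
print(f"Spearman: {rho_low_noise:.2f} (low noise), {rho_high_noise:.2f} (high noise)")
```

In this toy setting the noise scale plays the role of a reduced sequencing depth: the noisier replicate correlates less well with the reference.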

Metric distance of essential Hi-C matrices.

For clustering purposes, the squared distance of two full OoE-normalised matrices, $A$ and $B$, was computed as the Euclidean distance of corresponding entries,

$$d^2(A, B) = {\sum_{i,j}}' |A_{ij} - B_{ij}|^2 , \tag{5}$$

where the prime denotes that the sum is taken over $i \ge j$. The squared distance of the matrices in the essential form is instead defined as: $d^2(A^{\mathrm{ess}}, B^{\mathrm{ess}}) =$ [...]

The genome-wide distance of all chromosomal matrices of two Hi-C experiments, $\alpha$ and $\beta$, is defined as:

$$d(\alpha, \beta) = \left[ \sum_k d^2(M^{\alpha}_k, M^{\beta}_k) \right]^{1/2} , \tag{8}$$

where index $k$ runs over chromosomes, $M^{\alpha}_k$ is the Hi-C matrix of chromosome $k$ from experiment $\alpha$ and, depending on the context, $d$ is either the plain Euclidean distance of eq. 5 or the essential one of eq. 6.

Figure 1: Spectral properties of Hi-C vs random matrices and comparison of full and essential Hi-C matrices. A Hi-C matrix of human chromosome 17 of cell line GM12878 (experiment HIC001) after OoE normalisation, see Methods. Grey bands correspond to the centromere and other regions removed by the TADbit filtering step (see Methods). Its eigenvalue distribution is shown in panel B along with the distribution for unitary random matrices of the same size (yellow). The outlier part of the Hi-C matrix spectrum, $|\lambda| > .7$, is highlighted in green. C Probability distributions for the components of Hi-C matrix eigenvectors of rank 1, 2, 10 and 100; the analogous distribution for random matrix eigenvectors is shown in yellow. The spectral projectors of the Hi-C eigenvectors are shown in panel D for the region of chr17 between 4.66 and 9.69Mb. Below them the amplitude of the eigenvectors' components is shown. E Full Hi-C matrix of chromosome 17 from cell line GM12878 (experiment HIC001) and its essential version; the genomic region chr17:4.66-6.96Mb is shown for clarity. F Other instances of full and essential matrices for the same genomic region for the same cell line (GM12878 from experiment HIC002) and a different one (IMR90 from the experiment HIC050), see also Supplementary Table 1. The differences of these matrices with the reference ones of panel E are shown in panel G.

Single cell Hi-C analysis.

The single-cell Hi-C (scHi-C) matrices in the haploid mouse embryonic stem cell (mESC) dataset of [33] were considered and analysed using the TADbit pipeline as for bulk Hi-C matrices, with the exception of the OneD normalisation, which is not tailored for scHi-C matrices. After discarding assays with missing data for one or more chromosomes, our dataset consisted of 320 complete single-cell assays covering three cellular stages labelled as G1, early-S, late-S/G2 in ref. [33], see Supplementary Table 2. Stages labelled as post-M and pre-M were not included because they contained zero and only one complete assay, respectively.

scHi-C matrices require a suitable ad hoc extension of the essHi-C analysis used for standard (bulk) Hi-C matrices because of their sparsity, which prevents taking meaningful OoE normalisations. For this preliminary application we identified the $n^* = 50$ top-ranking eigenspaces as the essential ones because they typically suffice to capture most of the trace of the non-normalised Hi-C matrices, see Supplementary Fig. S2. The same distance definition of eq. 6 was used for comparing essential scHi-C matrices and computing the ROC curves, where the G1, early-S, late-S/G2 labelling of [33] was used as gold standard.

The tripartite clustering of the essHi-C dataset was computed from the Ward dendrogram.

Results

Spectral comparison of Hi-C and random matrices

Throughout the study we analysed Hi-C interaction matrices after the observed-over-expected (OoE) normalisation. As shown in the instance of Fig. 1A, the OoE normalisation, which discounts genomic-distance biases [5], puts interactions at different sequence separations on an equal footing. The matrix in Fig. 1A pertains to chromosome 17 of cell line GM12878 and its spectral properties are illustrated in panels B-E. Specifically, panel B shows the probability distribution of the modulus of the eigenvalues, $p(|\lambda|)$, while the probability distributions of the components of eigenvectors of different rank are shown in panel C. The same panels show, by contrast, the analogous quantities computed for symmetric random matrices with the same linear size.

The two eigenvalue distributions match very closely throughout the range $|\lambda| \le .7$, which covers 95% of the Hi-C spectrum. Within the $|\lambda| \le$ [...] $|\lambda| > .7$, are highlighted in green in panel B and account for only a small fraction of the Hi-C matrix spectrum. The components of the associated highest-ranking eigenvectors have a manifestly non-Gaussian distribution. Thus, the top-ranking eigenvalues and eigenvectors are the only ones with properties markedly distinct from those found in random matrices. This fact holds in general, as it applies to different chromosomes and cell lines, see Supplementary Figs. S3 and S4, and thus establishes that the bulk of the Hi-C matrix spectrum is largely compatible with that of random matrices, except for the small set of outlier eigenvalues and associated eigenvectors.
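The comparison described above, OoE normalisation followed by a check of the spectrum against the Gaussian orthogonal ensemble, can be sketched on synthetic data. This is an illustration only: the helper names are hypothetical, the matrices are toy examples rather than Hi-C data, and the random-matrix convention used here (unit variance off the diagonal) is an assumption under which the semicircle edge sits at about $2\sqrt{N}$.

```python
import numpy as np


def ooe_normalise(a):
    """Divide each entry A_ij by I(s), the mean of A over the diagonal s = |i - j|."""
    n = a.shape[0]
    b = np.zeros((n, n), dtype=float)
    for s in range(n):
        mean_s = np.diagonal(a, offset=s).mean()
        if mean_s == 0:
            continue  # empty diagonal: leave the entries at zero
        idx = np.arange(n - s)
        b[idx, idx + s] = a[idx, idx + s] / mean_s
        b[idx + s, idx] = a[idx + s, idx] / mean_s
    return b


def goe_matrix(n, rng):
    """Symmetric Gaussian matrix; off-diagonal entries have zero mean, unit variance."""
    g = rng.normal(size=(n, n))
    return (g + g.T) / np.sqrt(2)


def outlier_fraction(matrix, edge):
    """Fraction of eigenvalues whose modulus exceeds `edge`."""
    eigenvalues = np.linalg.eigvalsh(matrix)
    return np.mean(np.abs(eigenvalues) > edge)


rng = np.random.default_rng(1)
n = 400

# A pure distance-decay matrix carries no specific signal: OoE maps it to all ones.
i, j = np.indices((n, n))
decay = 1.0 / (1.0 + np.abs(i - j))
flat = ooe_normalise(decay)

# For this convention the semicircle support is approximately [-2 sqrt(N), 2 sqrt(N)].
goe = goe_matrix(n, rng)
semicircle_edge = 2 * np.sqrt(n)
largest = np.abs(np.linalg.eigvalsh(goe)).max()

# Adding one strong rank-1 pattern to the noise produces a single outlier eigenvalue.
pattern = rng.choice([-1.0, 1.0], size=n)
structured = goe + 0.5 * np.outer(pattern, pattern)
fraction = outlier_fraction(structured, 1.1 * semicircle_edge)
print(f"largest |eigenvalue| / edge = {largest / semicircle_edge:.2f}")
print(f"outlier fraction = {fraction:.4f}")
```

The last step mirrors the qualitative picture of the text: a bulk compatible with the random ensemble, plus a small set of eigenvalues that stand out of it.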

Essential Hi-C matrices

Because the bulk of the spectrum of Hi-C matrices can be described by a statistical model informed solely by the linear size of the matrix, we discounted this aspecific component from the matrices so as to isolate their essential component. The latter, which we term essential Hi-C matrix or essHi-C for brevity, is obtained from the spectral summation of the $n^* = 10$ highest-ranking projectors, see Methods and Fig. 1D.

A comparison of a full Hi-C matrix and its essential component is provided in Fig. 1E. The data are for the same entry of Fig. 1A, chromosome 17 and cell line GM12878, but are presented for a chromosome portion only to aid visualization. Remarkably, the essential matrix not only retains the distinctive contact patterns, but it presents them with greater clarity and contrast. The intra- and inter-domain contact patterns, as well as the domain boundaries, are more clearly defined too.

Enhancement of specificity

The benefits of resorting to essential matrices, thus discounting the aspecific spectral component, further emerge from the comparative analysis of Fig. 1F, where the full and essential Hi-C matrices of panel A are compared with two other instances of chromosome 17, one from a biological replicate of the same GM12878 cell line, and one from the different IMR90 cell line.

The entry-by-entry subtraction of the matrices is presented in Fig. 1G and shows that contact pattern differences are sharper and more deeply marked for the essential matrices. Importantly, the difference matrix of GM12878 replicates presents as a uniform background while speckled patterns are clearly discernible for the full matrices of the different cell lines. The result holds generally and is not dependent on the considered cell types, see Supplementary Fig. S5.
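The essential matrices compared above are obtained via the spectral summation of eq. 3 in Methods. A minimal sketch of that summation follows, assuming a dense symmetric NumPy array; the function name `essential_matrix` and the synthetic test matrix (two planted eigenspaces plus weak symmetric Gaussian noise) are illustrative assumptions, not the interface of the essHi-C package.

```python
import numpy as np


def essential_matrix(a, n_star=10):
    """Eq. 3: sum of lambda_n * a_n a_n^T over the n_star eigenvalues largest in modulus."""
    eigenvalues, eigenvectors = np.linalg.eigh(a)  # a is assumed symmetric
    top = np.argsort(np.abs(eigenvalues))[::-1][:n_star]
    vectors = eigenvectors[:, top]
    # Scale each selected eigenvector (column) by its eigenvalue, then recombine.
    return (vectors * eigenvalues[top]) @ vectors.T


rng = np.random.default_rng(2)
n = 100

# Planted "specific" component: two orthonormal directions with large eigenvalues.
basis, _ = np.linalg.qr(rng.normal(size=(n, 2)))
signal = (20.0 * np.outer(basis[:, 0], basis[:, 0])
          - 10.0 * np.outer(basis[:, 1], basis[:, 1]))

# "Aspecific" component: weak symmetric Gaussian noise.
g = rng.normal(size=(n, n))
noise = 0.05 * (g + g.T) / np.sqrt(2)
full = signal + noise

essential = essential_matrix(full, n_star=2)
err_full = np.linalg.norm(full - signal)
err_ess = np.linalg.norm(essential - signal)
print(f"distance from planted signal: full = {err_full:.2f}, essential = {err_ess:.2f}")
```

Ranking by modulus rather than by signed value matters here, because the second planted eigenvalue is negative and would otherwise be discarded.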

Robustness across replicates and varying sequencing depths

The enhancement of the contact pattern specificity is conveyed in Fig. 2A in terms of the similarity parameter $\gamma$. The latter quantifies how similar corresponding entries in two matrices are relative to their inherent statistical uncertainty, see Methods. Fig. 2A shows the distribution of $\gamma$ computed for randomly picked pairs of GM12878 matrices for four chromosomes. These correspond to chromosomes 1, 13, 19 and 21, which were chosen for their diverse gene content and length.

The typical values of $\gamma$ for full matrices are of order unity and have a mild increasing dependence on chromosome length. Using, instead, the essHi-C version yields a dramatic boost of the similarity parameter by about one order of magnitude, and with no particular bias or dependence on chromosome length (Fig. 2A).

We next examined the impact of the sequencing depth, taking as gold standard the full Hi-C matrices with the largest sequencing depth in the GM12878 dataset, see Supplementary Table 1. The Spearman's rank correlation coefficients of corresponding entries of the gold standard and other matrices with lower depths are presented in Fig. 2B, again for chromosomes 1, 13, 19 and 21.

Even though the gold standard is constituted by full Hi-C matrices, the highest correlation across all four chromosomes and all lower depths is observed for the essential matrices, not the full ones. The result is general in that it holds for other chromosomes too, see Supplementary Fig. S6.

In addition, the correlation coefficient of the essential matrices has a visibly milder dependence on sequencing depth, the slope of the interpolating line being much smaller than in the full case.

Figure 2: Matrix robustness across replicates and sequencing depths. A Box-whisker plots of the consistency parameter $\gamma$, see Methods, measured for distinct pairs of full and essential matrices for the GM12878 cell line. Four chromosomes with large differences of length and gene density are considered. Boxplots show: central line, median; box limits, 75th and 25th percentiles; whiskers, 1.5 times the interquartile range; outliers beyond this range are shown as individual points. B Spearman rank correlation coefficient measured for corresponding entries of a high sequencing depth matrix (gold standard HIC003, 100% depth) and matrices at lower depths. Data are for GM12878 and the same four chromosomes as in panel A.

Unsupervised clustering of different cell lines

As an application of the essHi-C analysis, we performed the unsupervised clustering of a heterogeneous dataset of 79 matrices, comprising (non-uniformly) 9 distinct cell lines and 3 different Hi-C techniques (Supplementary Table 1 and Methods).

The two clusterings obtained by using the full and essential matrices are shown as dendrograms in Fig. 3A,C. They present major and striking differences.

First, although the dendrograms are drawn using the same unit for the Ward score, the branches for the essential matrices are more than twice as long (2. [...] 6, which only marginally improves with respect to the random reference (AUC = 0. [...] AUC = 0. [...] 90, 0.82 and 0.91, respectively, which are all significant. Some of these methods, including spectral ones, were purposely devised for the comparative analysis of Hi-C matrices. The near-optimal essHi-C performance is thus appealing, as it is natively formulated as an enhancement method for individual matrices which can be adopted in comparative contexts too, as shown here.

By analysing the quality of different numbers of groupings within the essHi-C dendrogram, using the Dunn index, one discovers that the optimal number of clusters is 13, which is larger than the number of cell lines (9).

Figure 3: Clustering of different cell lines. A Ward dendrogram of the dataset of 79 Hi-C experiments, covering different cell lines, based on the genome-wide distance of full Hi-C matrices. The pairwise distance matrix, with entries normalised to the maximum, is shown in panel B. Analogous quantities, but computed for the corresponding essential Hi-C matrices, are shown in panels C and D. The dashed line in the dendrogram of panel C marks the Ward score of the optimal (Dunn) subdivision in 13 clusters. The distances computed for the full and essential matrices, and other methods, were used to compute the ROC curves of panel E, where true positives correspond to instances of the same cell lines. F Section of the dendrogram in D regarding the IMR90 cell line, which shows the correlation with the restriction enzymes used.

Figure 4: Cell-phasing of single-cell Hi-C matrices. A The distances of single-cell matrices from [33] in their full and essential forms were used to compute the ROC curves, where true positives correspond to instances with the same G1, early-S, late-S/G2 labelling of [33]. B The bands correspond to the tripartite clustering assignment of the essHi-C matrices. The entries follow the time ordering of [33].

All clusters except one contain one cell type only. However, while all experiments of cell types K562, hESC, SKBR3, and KBM7 are contained within a single cluster, other cell-type ensembles display a richer internal structure: this is apparent especially for IMR90 experiments, which are sharply divided into two clusters (Fig. 3F). Upon closer inspection one finds that this division is not an arbitrary one, but corresponds to different restriction enzymes and experimental techniques: one cluster contains in-situ experiments using MboI, while the other contains dilution Hi-C experiments using HindIII.

Other subdivisions within cell lines are not as clear-cut and cannot be explained only in terms of different experimental methodologies. However, one interesting case is given by the only mixed cluster, containing NHEK and HMEC experiments: both cell lines are epithelial samples, which may explain their similarity; moreover, all experiments within the cluster share the same methodology (in-situ, MboI). On the other hand, the single isolated NHEK experiment is a dilution Hi-C, and uses the HindIII restriction enzyme.
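The clustering procedure discussed in this section, entry-wise distances per chromosome (eq. 5), combined genome-wide (eq. 8) and fed to Ward hierarchical clustering, can be sketched as follows. The sketch makes several assumptions: it uses synthetic matrices for two hypothetical cell lines, it applies the plain distance of eq. 5 to whatever matrices it is given (the essential-form distance of eq. 6 is not reproduced here), and it relies on SciPy's `linkage` with a precomputed condensed distance matrix rather than on the essHi-C package.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def squared_distance(a, b):
    """Eq. 5: sum of squared entry differences over i >= j."""
    rows, cols = np.tril_indices(a.shape[0])
    return np.sum((a[rows, cols] - b[rows, cols]) ** 2)


def genome_wide_distance(exp_a, exp_b):
    """Eq. 8: each experiment is a list of per-chromosome matrices, in the same order."""
    return np.sqrt(sum(squared_distance(ma, mb) for ma, mb in zip(exp_a, exp_b)))


def symmetric(rng, n, scale=1.0):
    """Random symmetric matrix used to build the toy experiments."""
    g = rng.normal(scale=scale, size=(n, n))
    return (g + g.T) / 2


# Toy dataset (assumption): 2 cell lines x 3 replicates, 2 chromosomes each.
rng = np.random.default_rng(3)
n_bins, n_chrom = 20, 2
experiments = []
for _line in range(2):
    base = [symmetric(rng, n_bins) for _ in range(n_chrom)]
    for _rep in range(3):
        experiments.append([m + symmetric(rng, n_bins, scale=0.1) for m in base])

# Pairwise genome-wide distance matrix between experiments.
n_exp = len(experiments)
dist = np.zeros((n_exp, n_exp))
for a in range(n_exp):
    for b in range(a + 1, n_exp):
        dist[a, b] = dist[b, a] = genome_wide_distance(experiments[a], experiments[b])

# Ward linkage on the condensed distances, then cut the tree into two clusters.
tree = linkage(squareform(dist), method="ward")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Ward linkage is only well defined for Euclidean distances, which is the case for eq. 8 since it is the Euclidean norm of the concatenated entry differences.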

Single-cell Hi-C matrices

Lastly, we applied the essHi-C analysis to an entirely distinct Hi-C context, namely single-cell Hi-C (scHi-C). We considered the scHi-C set of ref. [33], which covers different cell-cycle stages of the mESC mouse embryonic cell line. In ref. [33] this data set was time-ordered a posteriori with an elegant dimensional reduction procedure, which was instrumental to enhance the features of the inherently sparse matrices. To account for the sparsity of scHi-C matrices we performed the essential analysis using $n^* = 50$ eigenspaces, see Methods and Supplementary Fig. S2.

The set of scHi-C matrices, in their plain form, cannot be clustered in a meaningful time-ordered way, as shown by the near-diagonal trend (blue line) in the ROC plot of Fig. 4 (AUC=0.55). The essHi-C matrices, instead, show a noticeable and significant correlation, AUC=0.68. Indeed, the same metrics and clustering procedure adopted for the ensemble Hi-C dataset of Fig. 3 return primary partitions that are in very good accord with the time-ordered cellular stages proposed by [33], see Fig. 4B.

Discussion

We presented a systematic spectral analysis of Hi-C matrices and established a general strategy to isolate their essential component from the largely aspecific complementary one.

The starting point of the analysis was the comparison of the spectral properties of OoE-normalised Hi-C matrices with those of random matrices. For the latter we considered the Gaussian orthogonal ensemble, which is commonly taken as reference for the statistical spectral properties of systems with many degrees of freedom with complex interactions.

The comparison presented in Fig. 1 demonstrated that random matrices are viable terms of reference for Hi-C matrices. In fact, the distributions of their eigenvalues and eigenvector components are largely superposable, except for a small subset of outlier eigenspaces. It is this part of the spectrum, which stands out from the statistically dominated background, that subsumes the specific features of the Hi-C data. We used it to define the essential Hi-C matrix via the spectral summation that is algorithmically implemented in the essHi-C package.

The gist of the essHi-C analysis is different from approaches addressing data noise in Hi-C matrices at the local or bin-wise level. The latter, in fact, are designed to remove biological biases [36, 28] or numerical imbalance [37, 38] from Hi-C matrices, while essHi-C discounts the aspecific component by isolating the spectral, hence generally non-local, properties that differ from random matrices.

The resulting enhanced specific content of the essential matrices is illustrated by the clearer and sharper features of essential matrices (Fig. 1) and, especially, by the comparison of different instances of Hi-C matrices from the same or different cell lines (Fig. 1G). The subtraction of matrices of the same cell line is noticeably more uniform and less noisy for the essential matrices compared to the full ones. In addition, the subtraction of the essHi-C matrices of different cell lines provides a neater highlighting of the different features, which are instead convolved with noise in full matrices.

We focussed on two applications of the essHi-C analysis, chosen for their relevance and challenging nature. Firstly, we compared full Hi-C matrices obtained at high sequencing depth with matrices at lower depth, both in the full and essential forms, see Fig. 2. The comparison demonstrated that essential matrices have a significant boost of correlation with the highest-depth reference matrices. In fact, the correlation of the essential matrices is only modestly impacted by the decrease of the sequencing depth. These results provide a striking illustration of the significant potential of the essHi-C analysis for isolating specific interaction features that would require a major increase of sequencing depth to be discerned in plain matrices.

Secondly, we carried out the unsupervised clustering of a heterogeneous ensemble of Hi-C matrices covering several cell lines, see Fig. 3. Good correspondence between cell lines and the unsupervised hierarchical subdivisions is observed only for the essHi-C matrices, not the full ones. Furthermore, essHi-C based subdivisions of the IMR90 cell line correlate with the different restriction enzymes used in the Hi-C assays for the two subsets. This unexpected result shows that different experimental probes can be reflected in sufficiently distinct contact propensities, and that these can be picked up using essHi-C analysis.

Overall, the results show that essHi-C matrices are better suited than full matrices to isolate significant contact patterns, which ought to be useful also in contexts where contact propensities are used for chromosome modelling, both to generate mean-field genome structures [39, 40] and to highlight the cell-to-cell variability [41, 42].

Finally, to illustrate the prospective potential of essHi-C analysis we discussed a preliminary application to single-cell matrices, focussing on the set of ref. [33]. The ROC curves in Fig. 4 show that the time ordering of Nagano et al. cannot be recovered from the full scHi-C matrices. This is consistent with the fact that a dimensional reduction of scHi-C matrices was needed to obviate the sparsity of the two-dimensional full matrices and establish their time-ordering. It is therefore significant and appealing that, once the matrices are cast in their essential form, a clear correlation with the time ordering of ref. [33] emerges, and the main cellular phases are recovered, see Fig. 4. This fact suggests that essHi-C analysis might be profitably used in place of the dimensional reduction step for a more direct determination of time-ordering or other scHi-C applications.

More in general, our results further emphasize the advantages offered by essHi-C analysis across very different contexts.

Conclusion

We presented a systematic spectral analysis of Hi-C matrices and established a general strategy to identify and separate their essential component from the largely aspecific complementary one.

By analysing essential Hi-C matrices in different contexts, we established numerous advantages over the use of spectral-complete (full) ones: from improving the sharpness and clarity of the specific interactions to enhancing the robustness against sequencing depth, allowing the unsupervised clustering of different cell lines and the cell-phasing of single-cell assays.

The results open numerous perspectives for using essHi-C analysis to optimally isolate biologically and physically relevant information from Hi-C matrices. Beyond the applications considered here, we expect our tool to be useful in comparative contexts where variations of chromosome compartmentalization could be picked up with enhanced reliability and hence better related to epigenomic changes [43] or cell differentiation [11, 12, 13].

The essHi-C software package is made freely available for academic use and can be accessed at https://github.com/stefanofranzini/essHIC.

Funding

This work has been supported by the Italian Ministry for University.

References

[1] J. Dekker. Capturing chromosome conformation. Science, 295(5558):1306–1311, feb 2002.
[2] Stefan Grob and Giacomo Cavalli. Technical review: a hitchhiker's guide to chromosome conformation capture. In Plant Chromatin Dynamics, pages 233–246. Springer, 2018.
[3] Nadine Übelmesser and Argyris Papantonis. Technologies to study spatial genome organization: beyond 3C. Briefings in Functional Genomics, 18(6):395–401, 2019.
[4] T. Cremer and C. Cremer. Chromosome territories, nuclear architecture and gene regulation in mammalian cells. Nature Reviews Genetics, 2:292, 2001.
[5] E. Lieberman-Aiden, N. L. van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling, I. Amit, B. R. Lajoie, P. J. Sabo, M. O. Dorschner, R. Sandstrom, B. Bernstein, M. A. Bender, M. Groudine, A. Gnirke, J. Stamatoyannopoulos, L. A. Mirny, E. S. Lander, and J. Dekker. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950):289–293, oct 2009.
[6] T. Sexton, E. Yaffe, E. Kenigsberg, F. Bantignies, B. Leblanc, M. Hoichman, H. Parrinello, A. Tanay, and G. Cavalli. Three-dimensional folding and functional organization principles of the Drosophila genome. Cell, 148:458, 2012.
[7] Jesse R. Dixon, Siddarth Selvaraj, Feng Yue, Audrey Kim, Yan Li, Yin Shen, Ming Hu, Jun S. Liu, and Bing Ren. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 485(7398):376–380, apr 2012.
[8] Elphège P. Nora, Bryan R. Lajoie, Edda G. Schulz, Luca Giorgetti, Ikuhiro Okamoto, Nicolas Servant, Tristan Piolot, Nynke L. van Berkum, Johannes Meisig, John Sedat, Joost Gribnau, Emmanuel Barillot, Nils Blüthgen, Job Dekker, and Edith Heard. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature, 485(7398):381–385, apr 2012.
[9] James Fraser, Carmelo Ferrai, Andrea M Chiariello, Markus Schueler, Tiago Rito, Giovanni Laudanno, Mariano Barbieri, Benjamin L Moore, Dorothee CA Kraemer, Stuart Aitken, Sheila Q Xie, Kelly J Morris, Masayoshi Itoh, Hideya Kawaji, Ines Jaeger, Yoshihide Hayashizaki, Piero Carninci, Alistair RR Forrest, Colin A Semple, Josée Dostie, Ana Pombo, and Mario Nicodemi. Hierarchical folding and reorganization of chromosomes are linked to transcriptional changes in cellular differentiation. Molecular Systems Biology, 11(12):852, dec 2015.
[10] Hui Zheng and Wei Xie. The role of 3D genome organization in development and cell differentiation. Nature Reviews Molecular Cell Biology, page 1, 2019.
[11] Boyan Bonev, Netta Mendelson Cohen, Quentin Szabo, Lauriane Fritsch, Giorgio L Papadopoulos, Yaniv Lubling, Xiaole Xu, Xiaodan Lv, Jean-Philippe Hugnot, Amos Tanay, et al. Multiscale 3D genome rewiring during mouse neural development. Cell, 171(3):557–572, 2017.
[12] Ralph Stadhouders, Enrique Vidal, François Serra, Bruno Di Stefano, François Le Dily, Javier Quilez, Antonio Gomez, Samuel Collombet, Clara Berenguer, Yasmina Cuartero, et al. Transcription factors orchestrate dynamic interplay between genome topology and gene regulation during cell reprogramming. Nature Genetics, 50(2):238–249, 2018.
[13] Jonas Paulsen, Tharvesh M Liyakat Ali, Maxim Nekrasov, Erwan Delbarre, Marie-Odile Baudement, Sebastian Kurscheid, David Tremethick, and Philippe Collas. Long-range interactions between topologically associating domains shape the four-dimensional genome during differentiation. Nature Genetics, 51(5):835–843, 2019.
[14] Satish Sati, Boyan Bonev, Quentin Szabo, Daniel Jost, Paul Bensadoun, Francois Serra, Vincent Loubiere, Giorgio Lucio Papadopoulos, Juan-Carlos Rivera-Mulia, Lauriane Fritsch, et al. 4D genome rewiring during oncogene-induced and replicative senescence. Molecular Cell, 78(3):522–538.e9, 2020.
[15] Job Dekker, Marc A. Marti-Renom, and Leonid A. Mirny. Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nature Reviews Genetics, 14(6):390–403, may 2013.

[16] Tom Sexton and Giacomo Cavalli. The role of chromosome domains in shaping the functional genome. Cell, 160(6):1049–1059, mar 2015.
[17] Anthony D. Schmitt, Ming Hu, Inkyung Jung, Zheng Xu, Yunjiang Qiu, Catherine L. Tan, Yun Li, Shin Lin, Yiing Lin, Cathy L. Barr, and Bing Ren. A compendium of chromatin contact maps reveals spatially active regions in the human genome. Cell Reports, 17(8):2042–2059, nov 2016.
[18] Darío G. Lupiáñez, Katerina Kraft, Verena Heinrich, Peter Krawitz, Francesco Brancati, Eva Klopocki, Denise Horn, Hülya Kayserili, John M. Opitz, Renata Laxova, Fernando Santos-Simarro, Brigitte Gilbert-Dussardier, Lars Wittler, Marina Borschiwer, Stefan A. Haas, Marco Osterwalder, Martin Franke, Bernd Timmermann, Jochen Hecht, Malte Spielmann, Axel Visel, and Stefan Mundlos. Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell, 161(5):1012–1025, may 2015.
[19] Peter Hugo Lodewijk Krijger and Wouter De Laat. Regulation of disease-associated gene expression in the 3D genome. Nature Reviews Molecular Cell Biology, 17(12):771, 2016.
[20] Jean-Philippe Fortin and Kasper D. Hansen. Reconstructing A/B compartments as revealed by Hi-C using long-range correlations in epigenetic data. Genome Biology, 16(1), aug 2015.
[21] Tao Yang, Feipeng Zhang, Galip Gürkan Yardımcı, Fan Song, Ross C. Hardison, William Stafford Noble, Feng Yue, and Qunhua Li. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Research, 27(11):1939–1949, aug 2017.
[22] John C. Stansfield, Kellen G. Cresswell, Vladimir I. Vladimirov, and Mikhail G. Dozmorov. HiCcompare: an R-package for joint normalization and comparison of Hi-C datasets. BMC Bioinformatics, 19(1), jul 2018.
[23] Galip Gürkan Yardımcı, Hakan Ozadam, Michael EG Sauria, Oana Ursu, Koon-Kiu Yan, Tao Yang, Abhijit Chakraborty, Arya Kaul, Bryan R Lajoie, Fan Song, et al. Measuring the reproducibility and quality of Hi-C data. Genome Biology, 20(1):57, 2019.
[24] Jingtian Zhou, Jianzhu Ma, Yusi Chen, Chuankai Cheng, Bokan Bao, Jian Peng, Terrence J. Sejnowski, Jesse R. Dixon, and Joseph R. Ecker. Robust single-cell Hi-C clustering by convolution- and random-walk–based imputation. Proceedings of the National Academy of Sciences, 116(28):14011–14018, jun 2019.
[25] Oana Ursu, Nathan Boley, Maryna Taranova, Y X Rachel Wang, Galip Gurkan Yardimci, William Stafford Noble, and Anshul Kundaje. GenomeDISCO: a concordance score for chromosome conformation capture experiments using random walks on contact map graphs. Bioinformatics, 34(16):2701–2707, mar 2018.
[26] François Serra, Davide Baù, Mike Goodstadt, David Castillo, Guillaume J. Filion, and Marc A. Marti-Renom. Automatic analysis and 3D-modelling of Hi-C data using TADbit reveals structural features of the fly chromatin colors. PLOS Computational Biology, 13(7):e1005665, jul 2017.
[27] Santiago Marco-Sola and Paolo Ribeca. Efficient Alignment of Illumina-Like High-Throughput Sequencing Reads with the GEnomic Multi-tool (GEM) Mapper. Current Protocols in Bioinformatics, 50(1):11–13, 2015.
[28] Enrique Vidal, François le Dily, Javier Quilez, Ralph Stadhouders, Yasmina Cuartero, Thomas Graf, Marc A Marti-Renom, Miguel Beato, and Guillaume J Filion. OneD: increasing reproducibility of Hi-C samples with abnormal karyotypes. Nucleic Acids Research, 46(8):e49–e49, 2018.
[29] Sean O'Rourke, Van Vu, and Ke Wang. Eigenvectors of random matrices: A survey. Journal of Combinatorial Theory, Series A, 144:361–442, nov 2016.
[30] Giacomo Livan, Marcel Novaes, and Pierpaolo Vivo. Introduction to Random Matrices. Springer International Publishing, 2018.

[31] Andrea Amadei, Antonius BM Linssen, and Herman JC Berendsen. Essential dynamics of proteins. Proteins: Structure, Function, and Bioinformatics, 17(4):412–425, 1993.
[32] Cristian Micheletti. Comparing proteins by their internal dynamics: Exploring structure–function relationships beyond static structural alignments. Physics of life reviews, 10(1):1–26, 2013.
[33] Takashi Nagano, Yaniv Lubling, Csilla Várnai, Carmel Dudley, Wing Leung, Yael Baran, Netta Mendelson Cohen, Steven Wingett, Peter Fraser, and Amos Tanay. Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature, 547(7661):61–67, 2017.
[34] Milena Rondón-Lagos, Ludovica Verdun Di Cantogno, Caterina Marchiò, Nelson Rangel, Cesar Payan-Gomez, Patrizia Gugliotta, Cristina Botta, Gianni Bussolati, Sandra R Ramírez-Clavijo, Barbara Pasini, et al. Differences and homologies of chromosomal alterations within and between breast cancer cell lines: a clustering analysis. Molecular Cytogenetics, 7(1):8, 2014.
[35] Koon-Kiu Yan, Galip Gürkan Yardımcı, Chengfei Yan, William S Noble, and Mark Gerstein. HiC-spector: a matrix library for spectral and reproducibility analysis of Hi-C contact maps. Bioinformatics, 33(14):2199–2201, 2017.
[36] Eitan Yaffe and Amos Tanay. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nature Genetics, 43(11):1059–1065, oct 2011.
[37] Maxim Imakaev, Geoffrey Fudenberg, Rachel Patton McCord, Natalia Naumova, Anton Goloborodko, Bryan R Lajoie, Job Dekker, and Leonid A Mirny. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nature Methods, 9(10):999–1003, sep 2012.
[38] P. A. Knight and D. Ruiz. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis, 33(3):1029–1047, oct 2012.
[39] Marie Trussart, François Serra, Davide Baù, Ivan Junier, Luis Serrano, and Marc A Marti-Renom. Assessing the limits of restraint-based 3D modeling of genomes and genomic domains. Nucleic Acids Research, 43(7):3465–3477, 2015.
[40] François Serra, Marco Di Stefano, Yannick G Spill, Yasmina Cuartero, Michael Goodstadt, Davide Baù, and Marc A Marti-Renom. Restraint-based three-dimensional modeling of genomes and genomic domains. FEBS letters, 589(20):2987–2995, 2015.
[41] Luca Giorgetti, Rafael Galupa, Elphège P Nora, Tristan Piolot, France Lam, Job Dekker, Guido Tiana, and Edith Heard. Predictive polymer modeling reveals coupled fluctuations in chromosome conformation and transcription. Cell, 157(4):950–963, 2014.
[42] Harianto Tjong, Wenyuan Li, Reza Kalhor, Chao Dai, Shengli Hao, Ke Gong, Yonggang Zhou, Haochen Li, Xianghong Jasmine Zhou, Mark A Le Gros, et al. Population-based 3D genome structure analysis reveals driving forces in spatial genome organization. Proceedings of the National Academy of Sciences, 113(12):E1663–E1672, 2016.
[43] R. Vilarrasa-Blasi, P. Soler-Vila, N. Verdaguer-Dot, N. Russinol, M. Di Stefano, V. Chapaprieta, G. Clot, I. Farabella, P. Cusco, X. Agirre, F. Prosper, R. Beekman, S. Bea, D. Colomer, I. Gut, H. Stunnenberg, E. Campo, M.A. Marti-Renom, and J.I. Martin-Subero. Dynamics of genome architecture and chromatin function during human B cell differentiation and neoplastic transformation. Nature Communications, in press, 2020.