[PDF] Identifying 3D Genome Organization in Diploid Organisms via Euclidean Distance Geometry

Abstract

The spatial organization of the DNA in the cell nucleus plays an important role for gene regulation, DNA replication, and genomic integrity. Through the development of chromosome conformation capture experiments (such as 3C, 4C, Hi-C) it is now possible to obtain the contact frequencies of the DNA at the whole-genome level. In this paper, we study the problem of reconstructing the 3D organization of the genome from such whole-genome contact frequencies. A standard approach is to transform the contact frequencies into noisy distance measurements and then apply semidefinite programming (SDP) formulations to obtain the 3D configuration. However, neglected in such reconstructions is the fact that most eukaryotes including humans are diploid and therefore contain two copies of each genomic locus. We prove that the 3D organization of the DNA is not identifiable from distance measurements derived from contact frequencies in diploid organisms. In fact, there are infinitely many solutions even in the noise-free setting. We then discuss various additional biologically relevant and experimentally measurable constraints (including distances between neighboring genomic loci and higher-order interactions) and prove identifiability under these conditions. Furthermore, we provide SDP formulations for computing the 3D embedding of the DNA with these additional constraints and show that we can recover the true 3D embedding with high accuracy from both noiseless and noisy measurements. Finally, we apply our algorithm to real pairwise and higher-order contact frequency data and show that we can recover known genome organization patterns.

Full PDF

IIDENTIFYING 3D GENOME ORGANIZATION IN DIPLOIDORGANISMS VIA EUCLIDEAN DISTANCE GEOMETRY ∗ ANASTASIYA BELYAEVA † , KAIE KUBJAS ‡ , LAWRENCE J. SUN † , AND

CAROLINEUHLER † . Abstract.

The spatial organization of the DNA in the cell nucleus plays an important rolefor gene regulation, DNA replication, and genomic integrity. Through the development of chromo-some conformation capture experiments (such as 3C, 4C, Hi-C) it is now possible to obtain thecontact frequencies of the DNA at the whole-genome level. In this paper, we study the problemof reconstructing the 3D organization of the genome from such whole-genome contact frequencies.A standard approach is to transform the contact frequencies into noisy distance measurements andthen apply semideﬁnite programming (SDP) formulations to obtain the 3D conﬁguration. However,neglected in such reconstructions is the fact that most eukaryotes including humans are diploid andtherefore contain two copies of each genomic locus. We prove that the 3D organization of the DNA isnot identiﬁable from distance measurements derived from contact frequencies in diploid organisms.In fact, there are inﬁnitely many solutions even in the noise-free setting. We then discuss variousadditional biologically relevant and experimentally measurable constraints (including distances be-tween neighboring genomic loci and higher-order interactions) and prove identiﬁability under theseconditions. Furthermore, we provide SDP formulations for computing the 3D embedding of the DNAwith these additional constraints and show that we can recover the true 3D embedding with high ac-curacy from both noiseless and noisy measurements. Finally, we apply our algorithm to real pairwiseand higher-order contact frequency data and show that we can recover known genome organizationpatterns.

Key words.

3D genome organization; Hi-C; diploid organisms; Euclidean distance geometry;semideﬁnite programming, systems of polynomial equations.

AMS subject classiﬁcations.

1. Introduction.

It is now well established that the spatial organization of thegenome in the cell nucleus plays an important role for cellular processes including generegulation, DNA replication, and the maintenance of genomic integrity [11, 39, 40].Notably, a recent study [43] showed a causal link between three-dimensional (3D)genome organization and gene regulation, where gene repositioning was induced andsubsequent changes in gene expression were observed. This motivates the developmentof methods to reconstruct the 3D structure of the genome to study its functions.The genetic information in cells is contained in the DNA, which is organized intochromosomes and packed into the cell nucleus. Chromosome conﬁrmation capturetechniques (such as 3C, 4C, Hi-C, Capture-C) have enabled the interrogation of thecontact frequencies between pairs of genomic loci at the whole-genome scale [12, 37, 24,20]. In Hi-C, for example, interacting chromosome regions are crosslinked (i.e., frozen), ∗ Submitted.

Funding:

Anastasiya Belyaeva was supported by an NSF Graduate Research Fellowship(1122374), the Abdul Latif Jameel World Water and Food Security Lab (J-WAFS) at MIT andthe MIT J-Clinic for Machine Learning and Health. Kaie Kubjas was supported by the EuropeanUnion’s Horizon 2020 research and innovation programme: Marie Sk lodowska-Curie grant agree-ment No. 748354, research carried out at LIDS, MIT and Team PolSys, LIP6, Sorbonne Univer-sity. Caroline Uhler was partially supported by NSF (DMS-1651995), ONR (N00014-17-1-2147 andN00014-18-1-2765), IBM, and a Simons Investigator Award. † Laboratory for Information and Decision Systems, Department of Electrical Engineering andComputer Science, and Institute for Data, Systems and Society, Massachusetts Institute of Technol-ogy, Cambridge, MA ([email protected], [email protected], [email protected]). ‡ Department of Mathematics and Systems Analysis, Aalto University, Espoo, Finland(kaie.kubjas@aalto.ﬁ). 1 a r X i v : . [ q - b i o . GN ] J a n A. BELYAEVA, K. KUBJAS, L. J. SUN AND C. UHLER

Fig. 1 . Schematic of the diploid genome. Nucleus with green, blue and red curves depictingthree homologous pairs of chromosomes. In the unphased setting, the measured distance betweenloci i and j corresponds to the sum of the four distances (denoted in purple) between two pairs ofhomologous loci x i , y i and x j , y j . the DNA is then fragmented, the crosslinked fragments are ligated, and paired-endsequencing is applied to the ligation products and mapped to a reference genome [24].By binning the genome and ascribing each read pair into the corresponding bin, oneobtains a contact frequency matrix between genomic loci that is commonly of the size10 × .Diﬀerent computational approaches for reconstructing the 3D genome organiza-tion from contact frequency data have been considered. Distance-based approachesconvert contact frequencies F ij into spatial distances D ij and ﬁnd a Euclidean embed-ding of the points in 3D [14, 46, 23, 35]. Ensemble methods such as MCMC5C andBACH [36, 19] learn a set of possible 3D structures by deﬁning a probabilistic modelfor contact frequencies and generating an ensemble of structures via MCMC sam-pling. Other ensemble methods include molecular dynamics simulations that modelDNA as a polymer and output an ensemble of 3D structures [25, 27, 13, 32]. Finally,statistical methods have also been proposed that directly model contact counts in-stead of distances, using for example the Poisson distribution [42], and maximize thelog-likelihood of the data to infer the 3D genome organization.Almost all existing methods make the simplifying assumption that the genomeis haploid, when in fact most organisms of interest including humans are diploid, i.e.there are two copies of each chromosome known as homologous chromosomes . Forexample, human cells contain two copies of 23 chromosomes each. The challenge isthat the contact frequency data from chromosome conformation capture experimentsis generally unphased , meaning that the copies of each chromosome cannot be dis-tinguished. As a result, if the DNA is modeled as a string of beads containing twocopies of each bead i for 1 ≤ i ≤ n , then the measured contact frequencies result in an n × n matrix, from which we would like to infer the 3D embedding of 2 n points. Thisproblem cannot be solved by classical methods for 3D genome reconstruction methodssuch as those mentioned above. With signiﬁcant experimental eﬀorts, phased datacan be obtained [7, 8] and used in order to reconstruct the 3D genome organization [4].However, such data is rare and costly.In this paper, we provide a computational method for inferring the 3D diploidorganization of the genome without relying on phased data. In particular, we considera distance-based approach and use Euclidean distance geometry to obtain the 3Ddiploid structure of the genome. The precise mathematical problem considered inthis paper is as follows and illustrated in Figure 1. DNA is modeled as a string ofbeads, that contains two copies of each bead i for 1 ≤ i ≤ n . We would like to infer the D GENOME ORGANIZATION IN DIPLOID ORGANISMS x i ∈ R and y i ∈ R . Sincefor unphased data, the two copies of each bead cannot be distinguished, the problemis to identify the 3D conﬁguration (2 n × x , . . . x n , y , . . . y n ∈ R (up to translation and rotation), from the composite distance measurements D ij ,1 ≤ i = j ≤ n ( n × n matrix), corresponding to the sum of the distances betweeneither copy of bead i and j , i.e., D ij = k x i − x j k + k x i − y j k + k y i − x j k + k y i − y j k . In the haploid or phased setting, this problem boils down to the standard Euclid-ean distance geometry problem. This problem has a long history: in the classicalsetting with no missing values, this problem can be solved via the classical multidimen-sional scaling (cMDS) algorithm that is based on spectral decomposition followed bydimensionality reduction; see [9] for an overview. Other approaches for the Euclideanembedding and completion problems, including in the presence of missing values, arenon-convex formulations [15, 28] as well as semideﬁnite relaxations [1, 16, 5, 26, 44, 45].A naive approach in the unphased diploid setting is to assume that the fourdistances that make up our measured composite distance D ij are equal and solve thecorresponding Euclidean embedding problem. However, it is evident from single-cellimaging studies that the four distances in D ij can be wildly diﬀerent [3, 30]. Hence thisapproach cannot provide realistic embeddings. While, a simple dimension argument(6 n variables versus (cid:0) n (cid:1) constraints) suggests that the 3D genome conﬁguration isuniquely identiﬁable, one of the main results of our paper is that the 3D diploidgenome conﬁguration is not identiﬁable from unphased data. In fact, we show thatthere are inﬁnitely many conﬁgurations that satisfy the constraints imposed by D ij ,even in the noiseless setting (section 2, Theorem 2.1).We therefore consider additional biologically relevant and experimentally mea-surable constraints and study identiﬁability of the 3D diploid structure under theseconstraints. First, we take into account distances between neighboring beads, i.e. k x i − x i +1 k and k y i − y i +1 k on each chromosome. While we show that this yields uniqueidentiﬁability for conﬁgurations in 2D, there are still inﬁnitely many conﬁgurationsin 3D, which is of primary interest for genome modeling (section 3, Proposition 3.1,Proposition 3.2). To obtain identiﬁability in 3D, we consider adding constraints basedon contact frequencies between three or more loci simultaneously. The measurementof such higher-order contact frequencies has recently been enabled by experimentalassays such as SPRITE [33], C-walks [31] and GAM [2]. We prove that this infor-mation can be used to uniquely identify the 3D genome organization from unphaseddata in the noiseless setting (section 4, Theorem 4.1).Finally, we provide an SDP formulation for obtaining the 3D diploid conﬁgura-tion from noisy measurements (section 5) and show based on simulated data that ouralgorithm has good performance and that it is able to recover known genome orga-nization patterns when applied to real contact frequency data collected from humanlymphoblastoid cells (section 6).

2. Unidentiﬁability from pairwise distance constraints.

In the remainderof the paper we denote the true but unknown coordinates of the homologous loci by x ∗ i and y ∗ i and the corresponding noiseless distances by D ∗ ij while the symbols x i and y i denote the variables that we want to solve for. While from a biological perspectivethe relevant setting is when x i , y i ∈ R , results that hold more generally will be statedin R d . The main result of this section is Theorem 2.1, which characterizes the set ofsolutions given by the constraints D ∗ ij in dimension d ≤

3. In particular, it establishes

A. BELYAEVA, K. KUBJAS, L. J. SUN AND C. UHLER non-identiﬁability of the 3D genome structure from pairwise distance measurementsin the diploid unphased setting.

Theorem

Let d ≤ and n ≥ d + 3 . Then ( x , . . . , x n , y , . . . , y n ) ∈ ( R d ) n satisﬁes (2.1) D ∗ ij = k x i − x j k + k x i − y j k + k y i − x j k + k y i − y j k for all ≤ i = j ≤ n if and only if it satisﬁes (2.2) x i + y i = x ∗ i + y ∗ i and k x i k + k y i k = k x ∗ i k + k y ∗ i k for all ≤ i ≤ n up to translations and rotations in R d and permutations of x i and y i . As a consequence, the measurements D ∗ ij identify the location of each pair ofhomologous loci ( x i , y i ) up to a sphere with center ( x ∗ i + y ∗ i ) / k x ∗ i − y ∗ i k / x i , y i lie opposite to each other anywhere on this sphere. Unless x ∗ i = y ∗ i for all i , i.e., all spheres have radius 0, this set is inﬁnite in dimensions d > k x i − y i k within each homologous pair is ﬁxed given the pairwisedistances D ∗ ij . This result is used to prove Lemma 2.4. Lemma

Let ( x , . . . , x n , y , . . . , y n ) ∈ ( R d ) n satisfy (2.3) x i + y i = x ∗ i + y ∗ i and k x i k + k y i k = k x ∗ i k + k y ∗ i k for all ≤ i ≤ n. Then k x i − x j k + k x i − y j k + k y i − x j k + k y i − y j k = D ∗ ij for all ≤ i = j ≤ n. Proof.

Observe that for each pair x i , y i satisfying the equations (2.3), it holdsthat D ∗ ij = 2 · ( k x ∗ i k + k y ∗ i k ) + 2 · ( k x ∗ j k + k y ∗ j k ) − x ∗ i + y ∗ i ) · ( x ∗ j + y ∗ j )= 2 · ( k x i k + k y i k ) + 2 · ( k x j k + k y j k ) − x i + y i ) · ( x j + y j )= k x i − x j k + k x i − y j k + k y i − x j k + k y i − y j k . This completes the proof.Next we will show that the distance between homologous pairs is uniquely deter-mined by the D ∗ ij Lemma

Let d ≤ and n ≥ d + 3 . Then for each ≤ i ≤ n the quantity k x i − y i k is identiﬁable from the constraints imposed by the D ∗ ij , i.e., for any solution ( x , . . . , x n , y , . . . , y n ) ∈ ( R d ) n to the equations deﬁned by the D ∗ ij in (2.1) , thequantity k x i − y i k is constant. The constraint d ≤ n ≥ d + 3 isnecessary for unique identiﬁability of the distance between homologous pairs of loci. Proof.

Without loss of generality we assume that i = 1 and show that k x − y k isequal to some constant. First, we perform a shift on the solution so that x = − y = v .Since shifts preserve distances, they in particular preserve the equality constraints(2.1). Hence, D ∗ j = k v − x j k + k v − y j k + k − v − x j k + k − v − y j k . D GENOME ORGANIZATION IN DIPLOID ORGANISMS D ∗ j = 4 k v k + 2( k x j k + k y j k ) . Let j = k be both not equal to 1. Then substituting the above leads to D ∗ j + D ∗ k − D ∗ jk = 8 k v k + 2( x j + y j ) · ( x k + y k ) . Deﬁning T jk := D ∗ j + D ∗ k − D ∗ jk and s j := √ x j + y j ), this is equivalent to T jk − k v k = s j · s k . Let T be the ( d + 1) × ( d + 1) submatrix of T satisfying T ij = T i +1 ,j + d +2 , i.e. therows of T correspond to the rows 2 , , . . . , d +2 of T and the columns of T correspondto the columns d +3 , d +4 , . . . , d +3 of T . We now show that for generic conﬁgurationsdet( T ) = 0. Since det( T ) can be written as a polynomial in the coordinates x i and y i , then det( T ) = 0 for generic conﬁgurations as long as it does not identically vanish.Hence it suﬃces to present one conﬁguration where det( T ) is nonzero. For d ≤ T has full rank, then the matrix determinant lemma implies that(2.4) det( T − J k v k ) = (1 − k v k T ( T ) − ) det( T ) , where denotes the all ones vector. Note that the scalar T T is ﬁxed and(det T ) = 0. Furthermore, since T − J k v k is formed from the dot products between d -dimensional vectors, it has rank at most d and therefore det( T − J k v k ) = 0 due to T − J k v k being a ( d +1) × ( d +1) matrix. Hence, (1 − k v k T ( T ) − ) det( T ) = 0,which is a linear equation in terms of k v k . As a consequence, it has a unique solutionfor k v k and thus the distance between the homologous pair x , y is ﬁxed as long as n ≥ d + 3.We next characterize all solutions to the constraints imposed by the D ∗ ij . Lemma

Let d ≤ and n ≥ d + 3 . Let ( x , . . . , x n , y , . . . , y n ) ∈ ( R d ) n be asolution to k x i − x j k + k x i − y j k + k y i − x j k + k y i − y j k = D ∗ ij for all ≤ i = j ≤ n. Then x i + y i = x ∗ i + y ∗ i and k x i k + k y i k = k x ∗ i k + k y ∗ i k for all ≤ i ≤ n up to translations and rotations in R d and permutations of x i and y i .Proof. Without loss of generality we perform a translation on the solution suchthat x = − y = v for some vector v . By Lemma 2.3 the quantity k x k − y k k isconstant for each 1 ≤ k ≤ n and thus also k v k is constant. Since for any j = 1 itholds that D ∗ j = 4 k v k + 2( k x j k + k y j k ), also k x j k + k y j k is constant and hence k x i k + k y i k = k x ∗ i k + k y ∗ i k for all 1 ≤ i ≤ n .Similarly to the proof of Lemma 2.3, if we deﬁne T jk = D ∗ j + D ∗ k − D ∗ jk and s j = √ x j + y j ), we ﬁnd that T jk − k v k = s j · s k . A. BELYAEVA, K. KUBJAS, L. J. SUN AND C. UHLER

Fig. 2 . Distance constraints between neighboring beads. Green and blue curves depict twohomologous pairs of chromosomes. For the green curves distances between neighboring genomicregions are shown by black lines.

Because we have access to the diagonal constraints now, this relationship holds forall j, k and not just j = k . Thus T − J k v k is a symmetric ( n − × ( n −

1) matrixadmitting a rank d factorization. Let S be the matrix formed with the vectors s j . Wethen have T − J k v k = SS T . There is a result on rank factorizations of symmetricmatrices that any other factorization T − J k v k = S S T satisﬁes S = S Q for someorthogonal matrix Q [22, Proposition 3.2]. Thus for any other solution s j , we have s j = s j Q , implying all solutions are simply orthogonal transformations of each other(rotations, reﬂections, etc.)In summary, we have shown that once we have ﬁxed x + y = 0 via translation,then the quantities x j + y j are unique up to orthogonal transformations and thequantities k x j k + k y j k are unique.

3. Distance constraints between neighboring loci.

In section 2, we showedthat the 3D genome conﬁguration is not identiﬁable from pairwise distance constraintsavailable from typical (unphased) contact frequency maps. In order to gain identiﬁa-bility, we next consider adding other biological constraints to the problem formulationthat are generally available or can be measured. In particular, since DNA can beviewed as a string of connected beads, we use the distance between adjacent beads asan additional constraint. The distance between neighboring beads can be derived em-pirically for example from imaging studies [29, 21]; see also our experimental resultsin section 6. The additional mathematical constraints are: k x i − x i +1 k = k x ∗ i − x ∗ i +1 k and k y i − y i +1 k = k y ∗ i − y ∗ i +1 k for 1 ≤ i ≤ n − , where x ∗ , x ∗ , . . . , x ∗ n and y ∗ , y ∗ , . . . , y ∗ n correspond to consecutive beads on homologouschromosomes; see Figure 2.In this section we show the following results: under the additional distance con-straints between neighboring loci, we prove that identiﬁability can be obtained in the2D setting (Proposition 3.1). However, in the 3D setting we prove that there arestill inﬁnitely many 3D conﬁgurations even with these additional distance constraints(Proposition 3.2).For the proofs of Proposition 3.1 and Proposition 3.2 we recall from Theorem 2.1that ( x i , y i ) and ( x ∗ i , y ∗ i ) are diametrically opposite points on the same sphere. Denotethe i -th sphere by S i and let it have center c i and radius r i . Then k c i − x i k = r i and2 c i − x i = y i . Proposition

For n ≥ and generic ( x ∗ , . . . , x ∗ n , y ∗ , . . . , y ∗ n ) ∈ ( R ) n , there D GENOME ORGANIZATION IN DIPLOID ORGANISMS is a unique point ( x , . . . , x n , y , . . . , y n ) ∈ ( R ) n satisfying the equations (3.1) x i + y i = x ∗ i + y ∗ i and k x i k + k y i k = k x ∗ i k + k y ∗ i k for ≤ i ≤ n, k x i − x i +1 k = k x ∗ i − x ∗ i +1 k and k y i − y i +1 k = k y ∗ i − y ∗ i +1 k for ≤ i ≤ n − . Proof.

We have y = 2 c − x and y = 2 c − x . Plugging this into k y − y k = k y ∗ − y ∗ k gives k y ∗ − y ∗ k = k (2 c − x ) − (2 c − x ) k = k (2 c − c ) − ( x − x ) k = k c − c k + k x − x k − c − c ) · ( x − x ) . The quantities k c − c k and k x − x k are ﬁxed. This implies that the quantity(2 c − c ) · ( x − x ) is ﬁxed. Since we know k x − x k and c = c holds by genericness,then there are two possible angles for x − x (this is where we use the 2D constraint)and thus that there are two possible solutions for x − x .Because x , x are constrained to lie on circles, the solutions for x are the in-tersection points of the ﬁrst circle and the second circle translated by x − x andthe solutions for x are the intersection points of the second circle and the ﬁrst circletranslated by x − x . Hence each solution for x − x leads to at most two possiblesolutions for ( x , x ). In turn this implies there are at most four solutions for x .We now investigate the four solutions. The ﬁrst two solutions are obtained bytranslating the circle centered at c by x ∗ − x ∗ and intersecting it with the circlecentered at c , see Figure 3. One of the two solutions is x ∗ . The other two solutionsare reﬂections of these two solutions over the line from c to c .Let x ∗ , x ∗ , c , c be ﬁxed. They determine four possible solutions for x . Wewill show that these four solutions are diﬀerent from the four solutions we get fromconsidering x ∗ , x ∗ , c , c for generic x ∗ , c (apart from x ∗ ).If either of the reﬂected solutions over the line from c to c coincides with oneof the four original solutions, then we can perturb c away from the line from c to c to change these solutions. If the solution that is the intersection point of the circlecentered at c and the translation by x ∗ − x ∗ of the circle centered at c (diﬀerentfrom x ∗ ) coincides with one of the four original solutions, then we can perturb x ∗ . Fig. 3 . Identiﬁability in the 2D setting with neighboring distance constraints. Two solutionsfor x are obtained by translating the circle centered at c by x ∗ − x ∗ (this new circle is colored blue)and intersecting it with the circle centered at c . The other two solutions are obtained by reﬂectingthe blue circle over the line through c and c (this new circle is colored green) and intersecting itwith the circle centered at c . The true solution for x is colored black and the three alternativesolutions for x are colored red. A. BELYAEVA, K. KUBJAS, L. J. SUN AND C. UHLER

This changes x ∗ − x ∗ and hence the second intersection point of the circle centered at c and the translation by x ∗ − x ∗ of the circle centered at c .A similar argument can be used to show that x , . . . , x n − have unique solutions.Given a unique solution for x , there are two solutions for x if and only if x ∗ lieson the line from c to c . This is however not a generic conﬁguration. A similarargument applies for x n .Despite having uniqueness in 2D, we do not have uniqueness in 3D as shown inthe following proposition. Proposition

Let n ∈ N . For generic ( x ∗ , . . . , x ∗ n , y ∗ , . . . , y ∗ n ) ∈ ( R ) n , thereare inﬁnitely many points ( x , . . . , x n , y , . . . , y n ) ∈ ( R ) n satisfying equations (3.1) .Proof. If n = 1, then x ∗ , y ∗ can be chosen randomly with the constraint that x ∗ = y ∗ . Then x and y can be any points on the sphere S deﬁned by x ∗ , y ∗ . Nowassume that n ≥

2. Fix any x ∗ , y ∗ such that x ∗ = y ∗ . Choose two circles C and C on the sphere S deﬁned by x ∗ , y ∗ that intersect at two points one of which is x ∗ .The circle C is the intersection of S and another sphere T . Let x ∗ be the centerof the sphere T . Let C be the circle on S that consists of points antipodal to C .Then C is also an intersection of S and another sphere T . Let y ∗ be the center ofthe sphere T . We use the same procedure to construct x ∗ and y ∗ from x ∗ and y ∗ , x ∗ and y ∗ from x ∗ and y ∗ etc.The only condition on x ∗ and y ∗ is x ∗ = y ∗ , hence ( x ∗ , y ∗ ) is a generic pointin R × R . The condition that C is a circle on the sphere S containing x ∗ isequivalent to x ∗ being any point in R outside the line through x ∗ and y ∗ . Similarly,the condition that C is a circle on the sphere S containing x ∗ is equivalent to y ∗ being any point in R outside the line through x ∗ and y ∗ . The condition that C and C intersect at two diﬀerent points of S is equivalent to the normal vector of thetangent plane of S at x ∗ and the normal vectors of the planes deﬁned by C and C being linearly independent. Hence ( x ∗ , x ∗ , y ∗ , y ∗ ) is a generic point in ( R ) . Similararguments can be used to show that ( x ∗ , x ∗ , . . . , x ∗ n , y ∗ , y ∗ , . . . , y ∗ n ) is a generic pointin ( R ) n .Now consider points x n and y n in an ε -neighborhood of x ∗ n and y ∗ n . Consider thespheres that are centered at x n and y n and have radii k x ∗ n − − x ∗ n k and k y ∗ n − − y ∗ n k .The intersections of these spheres with S n − give circles ˜ C n − and ˜ C n − that areperturbations of circles C n − and C n − . In particular, the intersection of the circle˜ C n − and the circle ˜ C n − that consists of points antipodal to ˜ C n − consists of twopoints for ε small enough. Choosing x n − to be the intersection point correspondingto x ∗ n − and y n − its antipodal gives points x n − , y n − satisfying k x n − − x n k = k x ∗ n − − x ∗ n k and k y n − − y n k = k y ∗ n − − y ∗ n k .Assuming that ε is small enough, then x n − and y n − are in small neighborhoodsof x ∗ n − and y ∗ n − , and we can continue the same procedure to ﬁnd x n − and y n − from x n − and y n − , x n − and y n − from x n − and y n − etc. In particular, we canﬁnd x , . . . , x n − , y , . . . , y n − satisfying equations (3.1) for every x n and y n in an ε -neighborhood of x ∗ n and y ∗ n .The previous proposition suggests that there are two degrees of freedom for choos-ing x , . . . , x n , y , . . . , y n on each homologous pair and thus that ﬁnite identiﬁabilityrequires two additional algebraically independent constraints per homologous pair.Similarly this suggests that unique identiﬁability requires three additional algebrai-cally independent constraints per homologous pair, where each endpoint of a chromo-some needs to be included in at least one of the additional constraints. D GENOME ORGANIZATION IN DIPLOID ORGANISMS (a) (b) Fig. 4 . Higher-order distance constraints. (a) Three loci x i , x i , x i , located on the samechromosome, are depicted. In the phased setting, the higher-order distance D x i x i x i is deﬁnedas the sum of the distances (pink dashed lines) of the three loci x i , x i , x i to their centroid (pinkpoint). Green and blue curves depict two diﬀerent chromosomes. (b) Illustrates the deﬁnition of D i i i in the unphased setting. Green, blue and red curves depict neighborhoods around threehomologous loci ( x i , y i ) , ( x i , y i ) and ( x i , y i ) . From these homologous loci 8 possible higher-order distances can be deﬁned (colored dashed lines) based on the 8 centroids depicted in the ﬁgure.The higher-order distance D i i i is deﬁned as the minimum of these 8 distances (achieved here bythe three black dashed line segments).

4. Identiﬁability from higher-order contact constraints.

In section 3, weshowed that considering distances between neighboring beads only yields identiﬁabil-ity in 2D but not in 3D. In the following, we consider adding further constraints thatare becoming widely available from experimental data, namely higher-order contactfrequencies between three or more loci as measured by experimental assays such asSPRITE [33], C-walks [31] and GAM [2]. We express these constraints mathematicallyby letting F ∈ R m × m ×···× m be a contact frequency tensor, where F x i ,x i ,...,x ik mea-sures the contact frequency between loci i , i , . . . , i k with coordinates x i , x i , . . . , x i k .In the unphased setting, we can only measure a combination of contact frequenciesover the homologous loci { x i , y i } × { x i , y i } × . . . × { x i k , y i k } , which we denoteby F i i ...i k . In addition, as for 2-way interactions, we turn contact frequencies into“distances” by deﬁning D i i ...i k := 1 /F i i ...i k .In the following, we provide our interpretation of distances in the higher-ordersetting. For simplicity, we ﬁrst describe the higher-order distances for three loci inthe phased setting. Since F x i x i x i counts how often the three loci come together,we interpret D x i x i x i as the sum of the distances of the three loci x i , x i , x i totheir centroid (Figure 4a). We next provide a generalization to the unphased setting.For three homologous loci ( x i , y i ) , ( x i , y i ) and ( x i , y i ), their contact frequencycan be formed by 8 possible triples, namely ( x i , x i , x i ), ( x i , x i , y i ), ( x i , y i , x i ),( y i , x i , x i ), ( x i , y i , y i ), ( y i , y i , x i ), ( y i , x i , y i ), and ( y i , y i , y i ). We will as-sume that one of the triples constitutes the majority of the observed contact frequencycount and hence we deﬁne D i i i as the minimum over all 8 higher order distances.This is illustrated in Figure 4b. Generalizing from three to k loci, our higher-order0 A. BELYAEVA, K. KUBJAS, L. J. SUN AND C. UHLER distance deﬁnition then becomes D i i ...i k = min z ij ∈{ x ij ,y ij }  k X j =1 k z i j − ( z i + . . . + z i k ) /k k  . In the following, we prove our main result; namely we show that the distanceconstraints of order 3 (3-way distances) together with the previously considered pair-wise distance constraints and distance constraints among consecutive beads resultsin unique identiﬁability of the 3D genome conﬁguration (Theorem 4.1). In fact, onlyvery few order 3 distance constraints are required for unique identiﬁability. As weshow in Theorem 4.1 it is suﬃcient that the ﬁrst and last bead of each chromosome becontained in an order 3 distance constraint. This is a reasonable constraint given thatmethods such as SPRITE, C-walks and GAM measure higher-order interactions overthe whole genome. These insights are of interest experimentally since they suggestthat the methods can restrict the measurement of such higher order constraints toﬁrst and last beads of each chromosome, known as telomeres.

Theorem

Let m be the number of chromosome pairs, let n , n , . . . , n m bethe number of domains on chromosomes , , . . . , m and deﬁne n = n + n + . . . + n m .Let I ⊆ [ n ] × [ n ] × [ n ] be such that each of , n , n + 1 , n + n , . . . , n + n + . . . + n m − + 1 , n (labels of domains at the beginning and at the end of each chromosome) iscontained in at least one triple in I . Let x ∗ , . . . , x ∗ n , y ∗ , . . . , y ∗ n ∈ R be ﬁxed such that min z ∗ i ∈{ x ∗ i ,y ∗ i } for i = k ,k ,k  X j ∈{ k ,k ,k } k z ∗ j − ( z ∗ k + z ∗ k + z ∗ k ) / k  = 0 for ( k , k , k ) ∈ I. Consider the polynomial system: (4.1) x i + y i = x ∗ i + y ∗ i and k x i k + k y i k = k x ∗ i k + k y ∗ i k for ≤ i ≤ n, k x i − x i +1 k = k x ∗ i − x ∗ i +1 k and k y i − y i +1 k = k y ∗ i − y ∗ i +1 k for i ∈ [ n ] \{ n , n + n , . . . , n } , min z i ∈{ x i ,y i } for i = k ,k ,k  X j ∈{ k ,k ,k } k z j − ( z k + z k + z k ) / k  = 0 for ( k , k , k ) ∈ I. Then for generic x ∗ , . . . , x ∗ n , y ∗ , . . . , y ∗ n , this system has a unique solution in ( R ) n . To prove Theorem 4.1, we will need two lemmas. Lemma 4.2 states that for aﬁxed solution ( x ∗ , y ∗ ) on a sphere S and given distances between solutions on S and S , there are ﬁnitely many solutions ( x , y ) on the sphere S . Lemma 4.3 is anextension of Lemma 4.2. It states that if one has ﬁnitely many solutions on a sphere S i , then given distances between neighboring beads, there are ﬁnitely many solutionson any sphere connected to S i . Lemma

Let x ∗ , x ∗ , y ∗ , y ∗ ∈ R be ﬁxed. Consider the polynomial system: (4.2) x + y = x ∗ + y ∗ , k x k + k y k = k x ∗ k + k y ∗ k , k x ∗ − x k = k x ∗ − x ∗ k and k y ∗ − y k = k y ∗ − y ∗ k . D GENOME ORGANIZATION IN DIPLOID ORGANISMS For generic x ∗ , x ∗ , y ∗ , y ∗ , this system has ﬁnitely many solutions in ( R ) n .Proof. The ﬁrst two equations of (4.2) say that x , y and x ∗ , y ∗ are pairs ofantipodal points on the same sphere. We denote this sphere by S . The third equationsays that x is the same distance from x ∗ as x ∗ is from x ∗ . Hence x must lie onthe circle C x that is the intersection of S and the sphere centered at x ∗ and withradius k x ∗ − x ∗ k . The last equation says that y must lie on the circle C y that is theintersection of S and the sphere centered at y ∗ with radius k y ∗ − y ∗ k . We considerthe circle C x that consists of antipodal points to the circle C y on the sphere S .The intersection of the circles C x and C x gives the solutions for x . Unless the twocircles are equal, they intersect at at most two points. Since y is antipodal to x ,then for each x there is a unique y . The circles coincide if and only if x ∗ , y ∗ andthe center of S are collinear. Lemma

Let x ∗ , . . . , x ∗ n , y ∗ , . . . , y ∗ n ∈ R be ﬁxed. Consider the polynomialsystem: x i + y i = x ∗ i + y ∗ i and k x i k + k y i k = k x ∗ i k + k y ∗ i k for ≤ i ≤ n, k x ∗ − x k = k x ∗ − x ∗ k , k y ∗ − y k = k y ∗ − y ∗ k , k x i − x i +1 k = k x ∗ i − x ∗ i +1 k and k y i − y i +1 k = k y ∗ i − y ∗ i +1 k for ≤ i ≤ n − . For generic x ∗ , . . . , x ∗ n , y ∗ , . . . , y ∗ n , this system has ﬁnitely many solutions in ( R ) n − .Proof. By Lemma 4.2, there are ﬁnitely many antipodal pairs ( x , y ) ∈ R × R on S such that k x ∗ − x k = k x ∗ − x ∗ k and k y ∗ − y k = k y ∗ − y ∗ k . Similarly, for each ofthese antipodal pairs ( x , y ) ∈ R × R on S , there are ﬁnitely many antipodal pairs( x , y ) ∈ R × R on S satisfying k x − x k = k x ∗ − x ∗ k and k y − y k = k y ∗ − y ∗ k etc. Proof of Theorem . We recall that the ﬁrst line of the polynomial system (4.1)gives that x i , y i are antipodal points on a sphere S i . Consider a triple ( k , k , k ) ∈ I that contains 1 and the equation on the last line of the polynomial system (4.1)corresponding to this triple. This equation gives that z k , z k , z k , where z i ∈ { x i , y i } ,coincide. Hence z k , z k , z k lie on the intersection of S k , S k , S k . Generically, if theintersection of three spheres is non-empty in R , then it consists of two points P and P . This gives four possible solutions for x , y : the points P, P and their antipodalson S . By Lemma 4.3, there are ﬁnitely many solutions for x , . . . , x n , y , . . . , y n given these ﬁxed solutions x , y on S . In the next two paragraphs we will show thatgenerically these ﬁnitely many solutions do not contain antipodal points on any ofthe spheres S , . . . , S n .If there are two antipodal solutions on S i , then we may assume that they comeeither from the same solution on S or antipodal solutions on S , because we canperturb S k , S k , S k slightly to change the other pair of solutions. First we will showthat generically a solution for x i on S i does not give a pair of antipodal solutions for x i +1 on S i +1 . If this was the case, then both the solution for x i and its antipodalwould have to lie on the plane that is perpendicular to the line through the antipodalpair of solutions for x i +1 on S i +1 . This plane contains the centers of S i and S i +1 .Hence for a solution for x i , there is only one antipodal pair on solutions on S i +1 .Thus for a generic distance between the solutions on S i and S i +1 , a solution on S i does not give an antipodal pair of solutions on S n +1 .Secondly, suppose that two diﬀerent solutions on S i give a pair of antipodalsolutions on S i +1 . We will show that when we perturb the distance between solutions2 A. BELYAEVA, K. KUBJAS, L. J. SUN AND C. UHLER on S i and S i +1 , then we do not get an antipodal pair anymore. Let x i and x i be twodiﬀerent solutions on S i that give solutions x i +1 and 2 c i +1 − x i +1 on S i +1 . Hence k c i +1 − x i +1 − x i k = k x i +1 − x i k . We want to show that generically k c i +1 − ( x i +1 + (cid:15) ) − x i k = k x i +1 + (cid:15) − x i k , where x i +1 + (cid:15) is the perturbed solution. Indeed, using the identity k x i +1 − x i k = k c i +1 − x i +1 − x i k gives k c i +1 − ( x i +1 + (cid:15) ) − x i k − k x i +1 + (cid:15) − x i k = 2 (cid:15) ( x i + x i − c i +1 ) . This quantity is equal to zero if and only if (cid:15) = 0 or c i +1 is the middle point of theline segment from x i to x i . This is generically not the case.Using a triple ( k , k , k ) ∈ I containing n and the equation for this triple, weget four possible solutions for x n , y n . Generically, only one of them coincides withthe ﬁnitely many solutions on S n that we get from the solutions on S , becauseperturbing the spheres slightly (with keeping the coinciding points ﬁxed) perturbsthe second intersection point of the three spheres and we know that generically theﬁnitely many points do not contain antipodal points.The unique solution on S n comes from one solution on each of the spheres S , . . . , S n − : If this was not the case then two diﬀerent solutions on S i give the samesolution on S i +1 . By the proof of Proposition 3.1, the dot product ( c i − c i +1 ) · ( x i − x i +1 )is ﬁxed. Hence for a ﬁxed x i +1 , all possible solutions for x i lie on a hyperplane andthis hyperplane is perpendicular to c i − c i +1 . Therefore, if two solutions on S i give thesame solution on S i +1 , then they lie on a hyperplane perpendicular to c i − c i +1 . Byslightly perturbing the sphere S i +1 , this is not the case anymore, and hence genericallya solution on S i +1 comes from a unique solution on S i .

5. Algorithms and implementation.

So far, we derived a theoretical frame-work to establish when we have unique and ﬁnite identiﬁability of the 3D conﬁgurationin the noiseless setting. However, a unique solution does not necessarily mean thatwe can ﬁnd it eﬃciently, as in many cases ﬁnding the solution may be NP-hard.In addition, we have so far not yet considered the noisy setting. In this section, weshow how to construct an optimization formulation to determine the 3D conﬁgurationeﬃciently.We frame the 3D reconstruction problem as a Euclidean embedding problem,where the coordinates x , . . . x n , y , . . . y n ∈ R are inferred from distances. Similarto ChromSDE [46], we formulate all distances in terms of entries in the Gram matrix G , which tracks the dot products between the 2 n genomic regions. Namely, lettingthe column/row i of G correspond to x i and the column/row n + i correspond to itshomologous locus y i , then the distances are given by k x i − x j k = G i,i + G j,j − G i,j , k x i − y j k = G i,i + G n + j,n + j − G i,n + j and k y i − y j k = G n + i,n + i + G n + j,n + j − G n + i,n + j . It is natural to work with the Gram matrix G , since it is rotation invari-ant. By imposing the constraint P i,j G i,j = 0 we can also ﬁx the translational axis.Also the additional distance constraints that we introduced in the previous sections(Theorem 2.1, Proposition 3.1, Theorem 4.1) can be represented as linear constraintsin terms of entries in G as follows: • Pairwise distances: g ij ( G ) := G i,i + G j,j + G n + i,n + i + G n + j,n + j − G i,j − G n + i,j − G i,n + j − G n + i,n + j • Distances between homologous pairs: g ii ( G ) := G i,i + G n + i,n + i − G i,n + i D GENOME ORGANIZATION IN DIPLOID ORGANISMS • Distances between neighboring beads: g i + ( G ) := G i,i + G i +1 ,i +1 − G i,i +1 • Distances of order 3 (can be generalized to higher orders): g ijk ( G ) := min l ( g ijkl : l = 1 , . . . , g ijk ( G ) := G i,i + G j,j + G k,k − G i,j − G i,k − G j,k ,g ijk ( G ) := G i,i + G j,j + G n + k,n + k − G i,j − G i,n + k − G j,n + k ,g ijk ( G ) := G i,i + G n + j,n + j + G k,k − G i,n + j − G i,k − G n + j,k ,g ijk ( G ) := G i,i + G n + j,n + j + G n + k,n + k − G i,n + j − G i,n + k − G n + j,n + k ,g ijk ( G ) := G n + i,n + i + G j,j + G k,k − G n + i,j − G n + i,k − G j,k ,g ijk ( G ) := G n + i,n + i + G j,j + G n + k,n + k − G n + i,j − G n + i,n + k − G j,n + k ,g ijk ( G ) := G n + i,n + i + G n + j,n + j + G k,k − G n + i,n + j − G n + i,k − G n + j,k ,g ijk ( G ) := G n + i,n + i + G n + j,n + j + G n + k,n + k − G n + i,n + j − G n + i,n + k − G n + j,n + k . Our objective is to determine a rank 3 solution of G , satisfying the above con-straints. However, this optimization problem is non-convex due to the rank constraint,and we instead consider the standard relaxation: we minimize the trace of the Grammatrix as an approximation to matrix rank [17]. The resulting optimization problemthen becomes the following semideﬁnite program (SDP):(5.1) minimize G tr( G )subject to g ii ( G ) = D ∗ ii , ≤ i ≤ n,g ij ( G ) = D ∗ ij , ≤ i < j ≤ n,g i + ( G ) = D ∗ i + , i ∈ Ω ,g ijk ( G ) = D ∗ ijk , ( i, j, k ) ∈ Ω , X ≤ i,j ≤ n G i,j = 0 ,G (cid:23) . Here, D ∗ ii denote the distances between homologous pairs computed from the pairwisedistances using Lemma 2.3, D ∗ ij denote the pairwise distances, D ∗ i + denote the dis-tances between neighboring beads, and D ∗ ijk denote the distances between three loci(while one could also consider 4 or higher order distance constraints, in our imple-mentation we only used 3-way distance constraints since higher-order contacts are ex-tremely sparse). The index set Ω = [2 n ] \{ n , n + n , . . . , n, n + n , n + n + n , . . . , n } corresponds to all beads that are not the last bead on a chromosome. The index setΩ ⊆ [ n ] corresponds to all triples of beads with non-zero contact frequencies.In the noisy setting, which is relevant for biological data, we replace the equalityconstraints by penalties in the loss function. Namely, using D ∗ for the noiseless and D for the noisy distances, we replace the equality constraints of the form g ( G ) = D ∗ by4 A. BELYAEVA, K. KUBJAS, L. J. SUN AND C. UHLER adding ( g ( G ) − D ) to the objective function. For the higher-order distance constraintsof the form D ∗ ijk = min( g ijk ( G ) , . . . , g ijk ( G )) for ( i, j, k ) ∈ Ω we use slack variablesand a convex relaxation using an atomic norm that combines the ‘ - and ‘ -norms.More precisely, we propose the use of the following transformation in the noisy setting, D ijk + λ ijkl = g ijkl ( G ) + s ijkl for l = 1 , , . . . , , where λ ijkl , s ijkl ≥ i, j, k, l act as slack variables. In general, for each triple( i, j, k ) we want one of the λ ijkl to be close to 0 and the sum over all s ijkl to be small.Naively this can be done by placing P s ijkl + P λ ijkl into the objective function.However, this would not enforce for each ( i, j, k ) at least one λ ijkl to be close to 0.Instead we propose to use X ( i,j,k ) ∈ Ω , ≤ l ≤ s ijkl + vuuut X ( i,j,k ) ∈ Ω  X ≤ l ≤ λ ijkl  . The ‘ -norm will push down the P l λ ijkl for each ( i, j, k ), while the ‘ norm will driveat least one of the λ ijkl to zero, which is precisely the desired behavior. The quantity qP i,j,k ( P l λ ijkl ) is an atomic norm as deﬁned in [6] with the set of atoms A = { ( λ ijkl ) : X i,j,k X l λ ijkl ! = 1 and X i,j,k λ ijkl ijk = 1 for l ijk = 1 , . . . , , ( i, j, k ) ∈ Ω } . Then the optimization problem in the noisy setting becomes:(5.2) minimize

G,s,λ ρ tr( G ) + X ≤ i ≤ n ( g ii ( G ) − D ii ) + X ≤ i

6. Evaluation on synthetic and real data.6.1. Synthetic data.

We start by testing our method on simulated data. Forthis we construct three diﬀerent types of 3D structures: (a) a Brownian motion modelusing a standard normal distribution to generate successive points; (b) points sam-pled uniformly along a spiral with random translations sampled uniformly within(0 , .

5) range and orientations sampled uniformly within ( − π , π ); (c) points sampleduniformly in a unit sphere. Performance of our method in the noiseless setting.

For the 1D settingwe deduced in section 2 that the pairwise distance constraints by themselves are suf-ﬁcient to identify the underlying 3D conﬁguration. For the 2D setting we proved insection 3 that knowing additionally the distances between neighboring beads leads touniqueness. We here perform simulations in 3D since this is the biologically relevantsetting. These results are depicted in Figure 5 with additional examples in FigureSM1. The input to our algorithm are the pairwise distances (which are summed overhomologs), all 3-way distances, the distances between homologous loci, and the dis-tances between neighboring beads. In the noiseless setting considered here we solvethe SDP formulation in Equation (5.1). Figure 5 and Figure SM1 show that the trueand reconstructed structures highly overlap, thereby indicating that our optimizationformulation is able to recover the 3D structure of the full diploid genome in the noise-less setting. When the 3-way distance constraints are removed, the reconstructionsare less aligned with the true structures. This is shown in Figure 6, where we measure6

A. BELYAEVA, K. KUBJAS, L. J. SUN AND C. UHLER the root-mean-square deviation (RMSD) between true and reconstructed 3D coordi-nates over 20 trials. In line with our theoretical results, these experimental results inthe noiseless setting indicate the importance of higher-order contact frequencies forrecovering the 3D diploid conﬁguration, especially when the number of chromosomesis larger. (a) (b) (c)

Fig. 5 . Examples of true and reconstructed points on simulated data. (a) Brownian motionmodel. (b) Spirals. (c) Random points in a sphere. We generate six chromosomes with in totalof 120 domains, corresponding to three homologous pairs with 20 domains per chromosome in thenoiseless setting. Solid lines / points correspond to true 3D coordinates and dashed lines / unﬁlledpoints to reconstructions via our method. Each color represents a diﬀerent chromosome. (a) (b) (c)

Fig. 6 . Performance of our method in the noiseless setting. Root-mean-square deviation(RMSD) between true and reconstructed structure computed with and without higher-order distanceconstraints. Simulated data was generated using a Brownian motion model with (a) one (b) two and(c) three chromosomes. Mean and standard deviation over trials are shown. Performance of our method in the noisy setting.

Next, we considernoisy distance observations D ij = D ∗ ij (1 + δ ) and noisy 3-way distance observations D i i ...i k = D ∗ i i ...i k (1+ δ ) by sampling δ uniformly within ( − (cid:15), (cid:15) ) as in [46], where (cid:15) isa given noise level. For our simulations we sample a maximum of 1000 3-way distanceconstraints. As shown in Figure SM2, we observe that the number of constraintsdoes not have a major eﬀect on the reconstruction accuracy. While for all simulationsshown in this section, we set the tuning parameter ρ = 0 . ρ .In Figure 7 we numerically assess the accuracy of our predicted structure for theBrownian motion model for diﬀerent number of chromosomes (one, two, or three) anddiﬀerent number of domains per chromosome (10 or 20) by computing the Spearmancorrelation between reconstructed and true pairwise distances, similar to [46]. Asexpected, Figure 7 shows that when the noise level increases, then the Spearmancorrelation between the original and reconstructed conﬁguration decreases. For the D GENOME ORGANIZATION IN DIPLOID ORGANISMS (a) (b) (c)

Fig. 7 . Performance of our method in the noisy setting. Spearman correlation under diﬀerentnoise levels for (a) one, (b) two and (c) three chromosomes. Simulated data was generated usinga Brownian motion model where each chromosome has or domains. Mean and standarddeviation over trials are shown. We apply our al-gorithm to the problem of reconstructing the diploid genome from contact frequencydata derived from experiments. We obtain pairwise and 3-way contact frequenciescollected via SPRITE in human lymphoblastoid cells from [33]. Since we aim to re-construct the whole diploid genome, which consists of approximately 6 billion basepairs, for computational reasons we bin the contact frequencies in the SPRITE datasetinto 10 Mega-base pair (Mb) regions. While some previous studies considered higherresolutions, the majority of the studies [4, 19, 36, 42, 46] did not attempt to recon-struct the whole diploid genome and focused only on reconstructing one chromosome,thus enabling them to consider higher resolutions.After ﬁltering out regions with a small number of total contacts, we obtain 514unphased points on the chromosomes. We convert the pairwise contact frequenciesto pairwise distances using the previously observed relationship D ij = F − / ij [36]and use Lemma 2.3 to obtain the distances between homolog pairs from this data.As in our simulations in the noisy setting, we randomly sample 1000 3-way distanceconstraints from all nonzero 3-way contact frequencies (for the transformation from3-way contact frequencies to 3-way distances, see section 4). Finally, we obtain thedistances between neighboring 10Mb beads by empirically evaluating the 3D recon-structions under diﬀerent input distances; see section SM4 and Figure SM4, FigureSM5.Using the pairwise constraints, homolog-homolog constraints, neighboring beadconstraints, and 3-way distance constraints, we solve the SDP problem in Equa-tion (5.2) for the noisy setting and analyze the corresponding 3D coordinates. Ourdiploid reconstruction is shown in Figure 8a. We compare this diploid genome recon-struction to the 3D structure obtained via ChromSDE, shown in Figure 8b obtainedunder the assumption that the observed contact frequencies and the correspondingdistances are a sum of four equal quantities, i.e., k x i − x j k , k x i − y j k , k y i − x j k , and k y i − y j k are equal. In Figure SM6, we show that the reconstruction obtained usingChromSDE with equal distances does not recapitulate known biology as described inthe following paragraphs.Experimental (imaging) studies have shown that chromosomes are organized bysize within the nucleus, with small chromosomes in the interior and larger chromo-somes on the periphery [3]. We colored each chromosome according to its size and8 A. BELYAEVA, K. KUBJAS, L. J. SUN AND C. UHLER chr1chr2chr3chr4chr5chr6chr7chr8chr9chr10chr11chr12chr13chr14chr15chr16chr17chr18chr19chr20chr21chr22chrX (a) chr1chr2chr3chr4chr5chr6chr7chr8chr9chr10chr11chr12chr13chr14chr15chr16chr17chr18chr19chr20chr21chr22chrX (b) N o r m a li z e d c h r o m o s o m e s i z e (c)(d) (e) Fig. 8 .

3D Diploid Genome Reconstruction. Estimated 3D positions of all chromosomes andtheir corresponding homologs at 10Mb resolution. 3D positions obtained using (a) our method and(b) using ChromSDE with chromosomes colored according to chromosome number. (c) Whole diploidorganization obtained via our method, colored by chromosome size. (d) Mean chromosome size asthe distance from the center increases. (e) The number of A compartments as the distance from thecenter increases. computed the mean chromosome size versus distance away from the center. Theresults of the 3D conﬁguration obtained using our method are shown in Figures 8cand 8d and recapitulate prior studies: smaller chromosomes are preferentially locatedin the center, whereas larger chromosomes are preferentially on the periphery; see alsosection SM6, Figure SM7. This is especially apparent for chromosomes 2 and 4, whichare some of the largest chromosomes, and in our reconstruction they are located onthe periphery as expected.Experimental studies on the spatial organization of the genome have also shownthat the center of the nucleus is enriched in active compartments (known as A com-partments), while the periphery contains inactive compartments (known as B com-partments) [38]. From previously published data on the location of A and B compart-ments along the genome in human lymphoblastoid cells [34], we counted the numberof A compartments per 10Mb bin. Then dividing our 3D reconstruction into con-centric circles of increasing radius away from the center, we found the mean numberof A compartments in each concentric circle. Figure 8e shows that with increasingdistance away from the center, the number of A compartments decreases. Thus, ourreconstruction recovers the experimentally observed trend for A compartments to bepreferentially located near the nucleus center. As shown in Figures SM8 and SM9 insection SM7, we note that our results are robust to the choice of the tuning parameter ρ resulting in biologically plausible conﬁgurations independent of the choice of ρ .Currently, many studies such as [35] simply ignore the fact that the genome isdiploid and infer the 3D genome organization as if the data was collected from ahaploid organism, assuming that the homologous loci have the same 3D structure. D GENOME ORGANIZATION IN DIPLOID ORGANISMS

7. Discussion.

In this paper, we proved that for a diploid organism the 3Dgenome structure is not identiﬁable from pairwise distance measurements alone. Thisimplies that applying any algorithm for the reconstruction of the 3D genome structurefrom typical chromosome conformation capture data for a diploid organism can resultin any of the inﬁnitely many conﬁgurations with the same pairwise contact frequencies.We showed that unique idenﬁability is obtained using distance constraints betweenneighboring genomic loci as well as 3-way distance constraints in addition to thepairwise distance constraints that can be obtained from typical contact frequencydata. Distances between neighboring genomic loci can be obtained empirically, e.g.from imaging studies, while 3-way distance constraints can be obtained from the mostrecently developed sequencing-based methods for obtaining contact frequencies suchas SPRITE [33], C-walks [31] and GAM [2]. We also presented SDP formulationsfor determining the 3D genome reconstruction both in the noiseless and the noisysetting. Finally, we applied our algorithm to contact frequency data from humanlymphoblastoid cells collected using SPRITE and showed that our results recapitulateknown biological trends; in particular, in the 3D conﬁguration identiﬁed using ourmethod, the small chromosomes are preferentially situated in the interior of the cellnucleus, while the larger chromosomes are preferentially situated at the periphery ofthe cell nucleus. In addition, in the 3D conﬁguration identiﬁed using our method thenumber of A domains is higher in the interior versus the periphery, which is in linewith experimental results. Our work shows the importance of higher-order contactfrequencies that can be measured using SPRITE [33], C-walks [31] and GAM [2] forobtaining the 3D organization of the genome in diploid organisms. This is particularlyrelevant for the reconstruction of cancer genomes, where copy number variations arefrequent and hence the genome may contain even more than two copies of each locus.We conjecture that identiﬁability of the 3D genome structure can also be achievedby replacing the higher-order contact constraints by distance constraints to the centerof the cell nucleus. Such constraints are also biologically relevant, since these distancescan be measured via imaging experiments, or inferred by measuring whether a partic-ular locus is in a lamin-associated domain or a telomere, both of which tend to lie atthe boundary of the cell nucleus [10, 18, 41]. Another future research direction is thedevelopment of specialized solvers to enable reconstruction of the genome at higherresolution. In this study we used a 10Mbp resolution due to the computational con-straints imposed by SDP solvers. Finally, the theoretical results in this paper build onthe assumption that distances are inverses of square roots of pairwise and higher-ordercontact frequencies. An interesting future research direction is to develop a methodfor estimating the map between higher-order contact frequencies and distances, andthen prove identiﬁability as well as build reconstruction algorithms for these diﬀerentmaps.

Acknowldegements.

We thank Mohab Safey El Din for helpful discussions.

REFERENCES A. BELYAEVA, K. KUBJAS, L. J. SUN AND C. UHLER[1]

A. Y. Alfakih, A. Khandani, and H. Wolkowicz , Solving Euclidean distance matrix com-pletion problems via semideﬁnite programming , Computational Optimization and Appli-cations, 12 (1999), pp. 13–30.[2]

R. A. Beagrie, A. Scialdone, M. Schueler, D. C. A. Kraemer, M. Chotalia, S. Q. Xie,M. Barbieri, I. de Santiago, L.-M. Lavitas, M. R. Branco, et al. , Complex multi-enhancer contacts captured by genome architecture mapping , Nature, 543 (2017), p. 519.[3]

A. Bolzer, G. Kreth, I. Solovei, D. Koehler, K. Saracoglu, C. Fauth, S. M¨uller,R. Eils, C. Cremer, M. R. Speicher, et al. , Three-dimensional maps of all chromosomesin human male ﬁbroblast nuclei and prometaphase rosettes , PLoS Biology, 3 (2005), p. e157.[4]

A. G. Cauer, G. Yardimci, J.-P. Vert, N. Varoquaux, and W. S. Noble , Inferring diploid3D chromatin structures from Hi-C data , bioRxiv:644294, (2019).[5]

L. Cayton and S. Dasgupta , Robust Euclidean embedding , in Proceedings of the 23rd Inter-national Conference on Machine Learning, ICML ’06, New York, NY, USA, 2006, ACM,pp. 169–176.[6]

V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky , The convex geometry oflinear inverse problems , Foundations of Computational Mathematics, 12 (2012), pp. 805–849.[7] . G. P. Consortium et al. , An integrated map of genetic variation from 1,092 human genomes ,Nature, 491 (2012), pp. 56–65.[8] . G. P. Consortium et al. , A global reference for human genetic variation , Nature, 526 (2015),pp. 68–74.[9]

T. F. Cox and M. A. A. Cox , Multidimensional scaling , Chapman and Hall / CRC, 2000.[10]

L. Crabbe, A. J. Cesare, J. M. Kasuboski, J. A. J. Fitzpatrick, and J. Karlseder , Humantelomeres are tethered to the nuclear envelope during postmitotic nuclear assembly , CellReports, 2 (2012), pp. 1521–1529.[11]

J. Dekker , Gene regulation in the third dimension , Science, 319 (2008), pp. 1793–1794.[12]

J. Dekker, K. Rippe, M. Dekker, and N. Kleckner , Capturing chromosome conformation ,Science, 295 (2002), pp. 1306–1311.[13]

M. Di Pierro, B. Zhang, E. L. Aiden, P. G. Wolynes, and J. N. Onuchic , Transferablemodel for chromosome architecture , Proceedings of the National Academy of Sciences, 113(2016), pp. 12168–12173.[14]

Z. Duan, M. Andronescu, K. Schutz, S. McIlwain, Y. J. Kim, C. Lee, J. Shendure,S. Fields, C. A. Blau, and W. S. Noble , A three-dimensional model of the yeast genome ,Nature, 465 (2010), pp. 363–367, http://dx.doi.org/10.1038/nature08973.[15]

H. Fang and D. P. O’Leary , Euclidean distance matrix completion problems , OptimizationMethods and Software, 27 (2012), pp. 695–717.[16]

M. Fazel, H. Hindi, and S. P. Boyd , Log-det heuristic for matrix rank minimization with ap-plications to Hankel and Euclidean distance matrices , in Proceedings of the 2003 AmericanControl Conference, vol. 3, IEEE, 2003, pp. 2156–2162.[17]

M. Fazel, H. Hindi, S. P. Boyd, et al. , A rank minimization heuristic with application tominimum order system approximation , in Proceedings of the American Control Conference,vol. 6, Citeseer, 2001, pp. 4734–4739.[18]

L. Guelen, L. Pagie, E. Brasset, W. Meuleman, M. B. Faza, W. Talhout, B. H. Eu-ssen, A. de Klein, L. Wessels, W. de Laat, et al. , Domain organization of humanchromosomes revealed by mapping of nuclear lamina interactions , Nature, 453 (2008),pp. 948–951.[19]

M. Hu, K. Deng, Z. Qin, J. Dixon, S. Selvaraj, J. Fang, B. Ren, and J. S. Liu , Bayesianinference of spatial organizations of chromosomes , PLoS Computational Biology, 9 (2013),p. e1002893.[20]

J. R. Hughes, N. Roberts, S. McGowan, D. Hay, E. Giannoulatou, M. Lynch,M. De Gobbi, S. Taylor, R. Gibbons, and D. R. Higgs , Analysis of hundreds of cis-regulatory landscapes at high resolution in a single, high-throughput experiment , NatureGenetics, 46 (2014), pp. 205–212.[21]

R. Jungmann, M. S. Avenda˜no, J. B. Woehrstein, M. Dai, W. M. Shih, and P. Yin , Mul-tiplexed 3d cellular super-resolution imaging with dna-paint and exchange-paint , NatureMethods, 11 (2014), p. 313.[22]

N. Krislock , Semideﬁnite facial reduction for low-rank Euclidean distance matrix completion ,PhD thesis, University of Waterloo, 2010, http://hdl.handle.net/10012/5093.[23]

A. Lesne, J. Riposo, P. Roger, A. Cournac, and J. Mozziconacci ,

3D genome reconstruc-tion from chromosomal contacts , Nature Methods, 11 (2014), pp. 1141–1143.[24]

E. Lieberman-Aiden, N. L. v. Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling,I. Amit, B. R. Lajoie, P. J. Sabo, M. O. Dorschner, R. Sandstrom, B. Bernstein,

D GENOME ORGANIZATION IN DIPLOID ORGANISMS M. A. Bender, M. Groudine, A. Gnirke, J. Stamatoyannopoulos, L. A. Mirny, E. S.Lander, and J. Dekker , Comprehensive mapping of long-range interactions reveals fold-ing principles of the human genome , Science, 326 (2009).[25]

E. Lieberman-Aiden, N. L. Van Berkum, L. Williams, M. Imakaev, T. Ragoczy,A. Telling, I. Amit, B. R. Lajoie, P. J. Sabo, M. O. Dorschner, R. Sandstrom,B. Bernstein, M. A. Bender, M. Groudine, A. Gnirke, J. Stamatoyannopoulos,L. A. Mirny, E. S. Lander, and J. Dekker , Comprehensive mapping of long-range in-teractions reveals folding principles of the human genome , science, 326 (2009), pp. 289–293.[26]

F. Lu, S. Keles¸, S. J. Wright, and G. Wahba , Framework for kernel regularization withapplication to protein clustering , Proceedings of the National Academy of Sciences, 102(2005), pp. 12332–12337.[27]

L. A. Mirny , The fractal globule as a model of chromatin architecture in the cell , Chromosomeresearch, 19 (2011), pp. 37–51.[28]

B. Mishra, G. Meyer, and R. Sepulchre , Low-rank optimization for distance matrix com-pletion , in 2011 50th IEEE Conference on Decision and Control and European ControlConference, 2011, pp. 4455–4460.[29]

I. M¨uller, S. Boyle, R. H. Singer, W. A. Bickmore, and J. R. Chubb , Stable morphology,but dynamic internal reorganisation, of interphase human chromosomes in living cells ,PloS One, 5 (2010), p. e11560.[30]

G. Nir, I. Farabella, C. P. Estrada, C. G. Ebeling, B. J. Beliveau, H. M. Sasaki, S. H.Lee, S. C. Nguyen, R. B. McCole, S. Chattoraj, et al. , Walking along chromosomeswith super-resolution imaging, contact maps, and integrative modeling , PLoS Genetics, 14(2018), p. e1007872.[31]

P. Olivares-Chauvet, Z. Mukamel, A. Lifshitz, O. Schwartzman, N. O. Elkayam,Y. Lubling, G. Deikus, R. P. Sebra, and A. Tanay , Capturing pairwise and multi-waychromosomal conformations using chromosomal walks , Nature, 540 (2016), pp. 296–300.[32]

Y. Qi and B. Zhang , Predicting three-dimensional genome organization with chromatin states ,PLoS computational biology, 15 (2019), p. e1007024.[33]

S. A. Quinodoz, N. Ollikainen, B. Tabak, A. Palla, J. M. Schmidt, E. Detmar, M. M.Lai, A. A. Shishkin, P. Bhat, Y. Takei, et al. , Higher-order inter-chromosomal hubsshape 3D genome organization in the nucleus , Cell, 174 (2018), pp. 744–757.[34]

S. S. P. Rao, M. H. Huntley, N. C. Durand, E. K. Stamenova, I. D. Bochkov, J. T.Robinson, A. L. Sanborn, I. Machol, A. D. Omer, E. S. Lander, et al. , A 3D mapof the human genome at kilobase resolution reveals principles of chromatin looping , Cell,159 (2014), pp. 1665–1680.[35]

L. Rieber and S. Mahony , miniMDS: 3D structural inference from high-resolution Hi-C data ,Bioinformatics, 33 (2017), pp. i261–i266.[36] M. Rousseau, J. Fraser, M. A. Ferraiuolo, J. Dostie, and M. Blanchette , Three-dimensional modeling of chromatin structure from interaction frequency data using Markovchain Monte Carlo sampling , BMC Bioinformatics, 12 (2011), p. 414.[37]

M. Simonis, P. Klous, E. Splinter, Y. Moshkin, R. Willemsen, E. de Wit, B. vanSteensel, and W. de Laat , Nuclear organization of active and inactive chromatin do-mains uncovered by chromosome conformation capture–on-chip (4C) , Nature Genetics, 38(2006), pp. 1348–1354.[38]

T. J. Stevens, D. Lando, S. Basu, L. P. Atkinson, Y. Cao, S. F. Lee, M. Leeb, K. J.Wohlfahrt, W. Boucher, A. O’Shaughnessy-Kirwan, et al. ,

3D structures of individ-ual mammalian genomes studied by single-cell Hi-C , Nature, 544 (2017), p. 59.[39]

C. Uhler and G. V. Shivashankar , Chromosome intermingling: mechanical hotspots forgenome regulation , Trends in Cell Biology, 27 (2017), pp. 810–819.[40]

C. Uhler and G. V. Shivashankar , The regulation of genome organization and gene expres-sion by nuclear mechanotransduction , Nature Reviews Molecular Cell Biology, 18 (2017),pp. 717–727.[41]

B. Van Steensel and A. S. Belmont , Lamina-associated domains: links with chromosomearchitecture, heterochromatin, and gene repression , Cell, 169 (2017), pp. 780–791.[42]

N. Varoquaux, F. Ay, W. S. Noble, and J.-P. Vert , A statistical approach for inferring the3D structure of the genome , Bioinformatics, 30 (2014), pp. i26–i33.[43]

H. Wang, X. Xu, C. M. Nguyen, Y. Liu, Y. Gao, X. Lin, T. Daley, N. H. Kipniss,M. La Russa, and L. S. Qi , CRISPR-mediated programmable 3D genome positioning andnuclear organization , Cell, 175 (2018), pp. 1405–1417.[44]

K. Q. Weinberger, F. Sha, Q. Zhu, and L. K. Saul , Graph Laplacian regularization for large-scale semideﬁnite programming , in Advances in Neural Information Processing Systems,2007, pp. 1489–1496. A. BELYAEVA, K. KUBJAS, L. J. SUN AND C. UHLER[45]

L. Zhang, G. Wahba, and M. Yuan , Distance shrinkage and Euclidean embedding via reg-ularized kernel estimation , Journal of the Royal Statistical Society: Series B, 78 (2016),pp. 849–867.[46]

Z. Zhang, G. Li, K.-C. Toh, and W.-K. Sung ,

3D Chromosome Modeling with Semi-DeﬁniteProgramming and Hi-C Data , Journal of Computational Biology, 20 (2013).

UPPLEMENTARY MATERIALS: IDENTIFYING 3D GENOMEORGANIZATION IN DIPLOID ORGANISMS VIA EUCLIDEANDISTANCE GEOMETRY ∗ ANASTASIYA BELYAEVA † , KAIE KUBJAS ‡ , LAWRENCE J. SUN † , AND

CAROLINEUHLER † . SM1. Simulations: reconstructions in the noiseless setting.

Figure SM1shows additional reconstructions of simulated data in the noiseless setting. The truestructures are consistently recovered under diﬀerent data generation models.

Fig. SM1 . Additional examples of true and reconstructed points on simulated data. True pointswere generated using Brownian motion model (ﬁrst row), spirals (second row) and random pointsin a sphere (third row). We generated six chromosomes, corresponding to three homologous pairswith 20 domains per chromosome in the noiseless setting. Solid lines / points correspond to true3D coordinates and dashed lines / unﬁlled points to reconstructions via our method. Each colorrepresents a diﬀerent chromosome. ∗ Submitted.

Funding:

Anastasiya Belyaeva was supported by an NSF Graduate Research Fellowship(1122374), the Abdul Latif Jameel World Water and Food Security Lab (J-WAFS) at MIT andthe MIT J-Clinic for Machine Learning and Health. Kaie Kubjas was supported by the EuropeanUnion’s Horizon 2020 research and innovation programme: Marie Sk lodowska-Curie grant agree-ment No. 748354, research carried out at LIDS, MIT and Team PolSys, LIP6, Sorbonne Univer-sity. Caroline Uhler was partially supported by NSF (DMS-1651995), ONR (N00014-17-1-2147 andN00014-18-1-2765), IBM, and a Simons Investigator Award. † Laboratory for Information and Decision Systems, Department of Electrical Engineering andComputer Science, and Institute for Data, Systems and Society, Massachusetts Institute of Technol-ogy, Cambridge, MA ([email protected], [email protected], [email protected]). ‡ Department of Mathematics and Systems Analysis, Aalto University, Espoo, Finland(kaie.kubjas@aalto.ﬁ). SM1 a r X i v : . [ q - b i o . GN ] J a n M2 A. BELYAEVA, K. KUBJAS, L. J. SUN AND C. UHLER

SM2. Simulations: impact of the number of 3-way distance constraints.

Figure SM2 shows the impact of the number of 3-way distance constraints on the so-lution in the noisy setting. We explored the impact of the number of 3-way distanceconstraints speciﬁcally when the number of chromosomes is higher (three chromo-somes) since higher-order constraints seem to play a more critical role in that setting.We evaluate the performance when 500, 1000 or all (4060 for 30 domains) 3-way dis-tance constraints are used. Figure SM2 shows that the choice of the number of 3-waydistance constraints has little impact on the accuracy of reconstruction, so we used1000 3-way distance constraints (or all possible triplets if that number was smaller)for the simulations and the real data analysis. (a) (b)

Fig. SM2 . The impact of the number of 3-way distance constraints in the noisy setting. Boxplotsshowing Spearman correlation and root-mean-square deviation (RMSD) for diﬀerent number of 3-way distance constraints over trials. Simulated data was generated using Brownian motion modelwith three chromosomes, where each chromosome had domains. Noise level of . was added.We used ρ = 0 . to solve the SDP. Green triangles and lines indicate the mean and medianperformance respectively. SM3. Simulations: impact of the tuning parameter ρ . Figure SM3 ex-plores the impact of the tuning parameter ρ in the noisy setting. Figure SM3 showsthat the choice of ρ has little impact on the accuracy of the reconstruction. For thesimulations in the noisy setting and for real data we chose ρ = 0 . (a) (b) Fig. SM3 . The impact of ρ in the noisy setting. Boxplots showing Spearman correlation androot-mean-square deviation (RMSD) for diﬀerent values of ρ over trials. Simulated data wasgenerated using a Brownian motion model with one chromosome and domains per chromosomeas well as noise level of . . We used the maximum number of triplet tensor constraints ( ) tosolve the SDP. Green triangles and lines indicate the mean and median performance respectively. UPPLEMENTARY MATERIALS: 3D GENOME ORGANIZATION IN DIPLOID ORGANISMS

SM3

SM4. Distance between neighboring beads.

We consider diﬀerent valuesfor the distance between neighboring beads as input to our algorithm. If the distancebetween neighboring beads is chosen to be too small, the resulting 3D diploid recon-struction of the data results in the homologous loci x , . . . x n (copy A) and y , . . . y n (copy B) being completely separated as shown in Figure SM4a. We gradually in-creased the values for the distance between neighboring beads and quantiﬁed theseparation between x , . . . x n and y , . . . y n as follows: We obtained the hyperplaneseparating x , . . . x n and y , . . . y n by ﬁtting a support-vector machine (SVM) clas-siﬁer. Next, we identiﬁed the k points among x , . . . x n as well as among y , . . . y n that are closest to the hyperplane and computed their centroids. Figure SM4b showsthe sum of the distances of the two centroids to the separating hyperplane, therebyquantifying the separation of the copy A points from the copy B points. This distanceshould approach 0 as the copy A and copy B points come closer together. Indeed,Figure SM4b shows that using a parameter of 0.65, the distance between the k closestpoints stabilizes close to 0 and thus we used 0.65 as the distance between neighboringbeads. (a) (b)(c) (d) Fig. SM4 . Empirical choice of parameter for the distance between neighboring beads. (a) The3D genome reconstruction with parameter for the distance between neighboring beads set to 0.5. Thehomologous loci x , . . . x n (copy A) and y , . . . y n (copy B), colored by red and blue are completelyseparated. (b) The distance of centroids corresponding to k closest points to the SVM hyperplaneseparating copy A from copy B (red and blue points) for diﬀerent parameter settings. The blackdashed line corresponds to the chosen parameter of 0.65. (c) Confusion matrix quantifying howoften points clustered via k -means (predicted label) were assigned their true label (copy A or copy B)when parameter of 0.5 was used. (d) Same as (c) for the chosen parameter 0.65. Higher confusionacross labels indicates that points belonging to copy A and copy B are not clearly separated, asdesired. We provide additional quantiﬁcation regarding the separation of points in copyM4

A. BELYAEVA, K. KUBJAS, L. J. SUN AND C. UHLER

A and copy B by clustering the 3D structure using k -means into two clusters andcomputing a confusion matrix, where the true labels are given by copy A and copy B.If the points x , . . . x n and y , . . . y n are completely separated, then k -means wouldresult in near perfect accuracy of separation of all points into copy A and copy B.Figure SM4c shows that this is indeed the case when using a distance parameter of0.5. For the chosen parameter of 0.65, the confusion matrix is shown in Figure SM4d,reinforcing the observation that indeed copy A and copy B are getting mixed.We note that our observations are robust to the exact choice of the distancebetween neighboring beads. In Figure SM5 we show the resulting 3D reconstructionas well as chromosome size and A compartment trends when using a parameter of 0.7as the distance between neighboring beads. chr1chr2chr3chr4chr5chr6chr7chr8chr9chr10chr11chr12chr13chr14chr15chr16chr17chr18chr19chr20chr21chr22chrX (a) N o r m a li z e d c h r o m o s o m e s i z e (b)(c) (d) Fig. SM5 .

3D diploid genome reconstruction with a diﬀerent parameter (0.7 instead of 0.65) forthe distance between neighboring beads. (a) Estimated 3D positions of all chromosomes and theircorresponding homologs with chromosomes colored according to chromosome number. (b) Wholediploid organization obtained via our method, colored by chromosome size. (c) Mean chromosomesize as the distance from the center increases. (d) The number of A compartments as the distancefrom the center increases.

SM5. Comparison with ChromSDE.

We compare our whole genome recon-struction to the reconstruction inferred by ChromSDE [SM3]. Since ChromSDE doesnot account for the fact that the measured contact frequencies and corresponding ob-served distances are a sum of four diﬀerent distances, i.e. k x i − x j k , k x i − y j k , k y i − x j k , and k y i − y j k , we converted frequencies to distances using D ij = F − / ij andused D ij / k x i − y i k = ∞ . Giventhe described distance constraints, we solved the SDP for the Gram matrix and ob- UPPLEMENTARY MATERIALS: 3D GENOME ORGANIZATION IN DIPLOID ORGANISMS

SM5tained the 3D coordinates using eigenvector decomposition, similar to our method.Figure SM6 shows the corresponding solution and quantiﬁcation of the mean chro-mosome size and number of A compartments as the radius from the center increases.The computed 3D diploid genome conﬁguration obtained via ChromSDE does notrecapitulate that chromosome size increases with distance away from the center andthat the number of A compartments decreases with distance away from the center. N o r m a li z e d c h r o m o s o m e s i z e (a) (b) (c) Fig. SM6 .

3D diploid genome reconstruction with ChromSDE. ChromSDE was run using adistance matrix where the distances for k x i − x j k , k x i − y j k , k y i − x j k , and k y i − y j k were set to D ij / . (a) Estimated 3D positions of all chromosomes and their corresponding homologs at 10Mbresolution colored by chromosome size. (b) Mean chromosome size as the distance from the centerincreases. (c) The number of A compartments as the distance from the center increases. SM6. Analysis of 3D diploid genome reconstruction.

We provide furtheranalysis of the 3D diploid genome reconstruction obtained using our algorithm fromcontact frequency data. Figure SM7 shows that chromosome size is correlated withthe distance of the chromosome to the center of the cell nucleus, whihc is in line knownbiological trends.

Fig. SM7 . Chromosome size (normalized by the size of the largest chromosome) versus themean distance of the chromosome and its homolog away from the center.

SM7. Real data: impact of the tuning parameter ρ . Figure SM8 ex-plores the impact of the tuning parameter ρ on the results of the real data analy-sis. We compute the RMSD between the 3D genome reconstruction computed with ρ = 0 . ρ ∈ (0 . , . , . , . , . , ,

1) to quantify how much the 3D structurechanges when increasing ρ . As shown in Figure SM8, the RMSD is low across dif-ferent choices of the tuning parameter. In Figure SM9, we provide the 3D genomeM6 A. BELYAEVA, K. KUBJAS, L. J. SUN AND C. UHLER reconstruction and trends for mean number of A compartments and mean chromo-some size versus the distance away from the center for the 3D reconstruction computedwith ρ = 10 since the RMSD was the highest for this choice of the tuning parameter.We observe that the trends remain the same also for this choice of ρ . Fig. SM8 . The impact of ρ in real data. Root-mean-square deviation (RMSD) between the 3Dgenome reconstruction computed with ρ = 0 . and the 3D genome reconstructions computedwith other choices of ρ ∈ (0 . , . , . , . , . , , . chr1chr2chr3chr4chr5chr6chr7chr8chr9chr10chr11chr12chr13chr14chr15chr16chr17chr18chr19chr20chr21chr22chrX (a) N o r m a li z e d c h r o m o s o m e s i z e (b)(c) (d) Fig. SM9 .

3D diploid genome reconstruction with ρ = 10 . Estimated 3D positions of all chro-mosomes and their corresponding homologs at 10Mb resolution. Chromosomes are colored accordingto (a) chromosome number and (b) chromosome size. (c) Mean chromosome size as the distancefrom the center increases. (d) The number of A compartments as the distance from the centerincreases. SM8. Haploid distance matrices.

To show that modeling the diploid aspectof the genome is critical and provides valuable information regarding the 3D organiza-tion of the genome, we randomly labeled each homolog of a particular chromosome tocorrespond to either copy A or copy B of the chromosome and computed the Euclidean

UPPLEMENTARY MATERIALS: 3D GENOME ORGANIZATION IN DIPLOID ORGANISMS

SM7distances between all loci belonging to copy A, i.e. || x i − x j || to obtain one haploiddistance matrix and the Euclidean distances between all loci belonging to copy B,i.e. || y i − y j || to obtain the second haploid distance matrix. Figure SM10a and Fig-ure SM10b show the haploid distance matrices where the points 1 , . . . , n are assignedto copy A and points n + 1 , . . . , n are assigned to copy B. We were interested incomparing the two haploid distance matrices to see whether the two haploid matriceswere the same or if modeling the diploid aspect of the genome also allowed us tolearn about each homolog. Inspection of the two matrices reveals that the distancesare diﬀerent in these two haploid matrices, suggesting that modeling the genomeas a diploid structure is important and indeed provides additional information. Wequantify the diﬀerence by computing the Spearman correlation between the distancematrices over 100 diﬀerent samplings of assignments of chromosomes to either copyA or B. Figure SM10c shows the histogram of the calculated Spearman correlationswith mean Spearman correlation of 0.08. (a) (b) (c) Fig. SM10 . Haploid distance matrices. (a) Haploid distance matrix for points belonging tocopy A and (b) its corresponding homologous haploid distance matrix for points belonging to copyB. (c) Spearman correlation between haploid distance matrices over 100 diﬀerent assignments ofeach homolog to either copy A or copy B.

REFERENCES[SM1]

Z. Zhang, G. Li, K.-C. Toh, and W.-K. Sung ,

3D Chromosome Modeling with Semi-DeﬁniteProgramming and Hi-C Data3D Chromosome Modeling with Semi-DeﬁniteProgramming and Hi-C Data