NGS Based Haplotype Assembly Using Matrix Completion

Abstract

We apply matrix completion methods to haplotype assembly from NGS reads and develop three new algorithms: HapSVT, HapNuc, and HapOPT. A mathematical model converts the reads to an incomplete matrix whose unknown entries are then estimated. The completed matrix is quantized and decoded in order to estimate the haplotypes. These algorithms are compared to state-of-the-art algorithms using simulated data as well as real fosmid data. It is shown that the SNP missing rate and the haplotype block length of the proposed HapOPT are better than those of HapCUT2, with comparable accuracy in terms of reconstruction rate and switch error rate. A MATLAB program implementing the proposed algorithms is freely available at https://github.com/smajidian/HapMC.


Sina Majidian, Mohammad Hossein Kahaei


Introduction

A Single Nucleotide Polymorphism (SNP) is a genetic variation with a frequency greater than 1% in the population. In diploid organisms, genomes are organized into pairs of chromosomes, a paternal and a maternal copy. The sequence of SNPs on each copy of a pair of chromosomes is called a haplotype. A genotype is the conflation of the two haplotypes on the homologous chromosomes. An SNP is called homozygous if the pair of alleles at its locus consists of two identical nucleotides, and heterozygous otherwise.

From the evolutionary point of view, an SNP arises as a consequence of mutation. However, since the mutation rate is low, several mutations at the same locus rarely occur. Thus, it is usual to assume that the majority of SNPs are bi-allelic, meaning that each SNP takes just two of the four possible nucleotides, i.e., A, T, C, and G [1]. We adopt this assumption in this work. Haplotypes are widely used in Genome Wide Association Studies (GWAS), clinical genetics, linkage analysis, drug design, and personalized medicine [2].

To extract a haplotype, one may use the following three approaches, of which the last two are mathematical:

1) Applying experimental methods to every single individual, which is costly and therefore not desirable [2].
2) Haplotype phasing, wherein the haplotypes are inferred from the genotypes of multiple individuals. Examples are a method based on the maximum parsimony assumption [3] and statistical methods such as SHAPEIT, developed based on the Hidden Markov Model [1,4]. With this approach, the haplotype of an individual cannot be found separately, and low-frequency and de novo variants remain a challenge [2].
3) Estimating haplotypes from Next Generation Sequencing (NGS) reads, i.e., the nucleotide sequences of fragments. With this approach, known as haplotype assembly, haplotyping of a single individual becomes feasible.

In this regard, HapCUT2 [5], HapTree [6], and HapSAT [7] are three well-known methods developed based on probabilistic models. These methods are sensitive to the selected model and thus fragile to model error. A recent method for haplotype assembly is AltHap [8], which has shown accurate results compared to H-PoP [9], SCGD [10], and HapTree [6]. H-PoP is a heuristic algorithm originating from the Balanced Optimal Partition (BOP) optimization model, which benefits from the Minimum Error Correction (MEC) as well as the maximum fragments cut approaches [11]. SDhaP [12] is another heuristic method, based on correlation clustering and non-convex optimization, which does not guarantee reaching the global optimum.

The contribution of this article is threefold. First, haplotype assembly is mathematically formulated based on matrix completion methods. Second, three new algorithms are proposed: Haplotype assembly based on Singular Value Thresholding (HapSVT), Haplotype assembly based on Nuclear norm minimization (HapNuc), and Haplotype assembly based on OPTSPACE (HapOPT). Third, in the Results section, these algorithms are compared to some benchmark methods in terms of the reconstruction rate and the switch error rate.

Model of Haplotypes

To exploit the NGS reads as raw data, a computational model is needed. For this purpose, similar to [10], we first convert a sequence of nucleotides, which can be either a read or a haplotype, into a sequence of numbers. The SNP nucleotides are converted to 1 and −1. As an example, consider the β2AR gene [3], for which the maternal and paternal haplotypes of an individual are denoted by $\mathbf{h}_m$ and $\mathbf{h}_p$, respectively. The corresponding codewords based on the above modeling are presented in the last column of Table 1.

Tab. 1: Haplotypes of β2AR genes and their corresponding codewords.

| | Nucleotides | Codewords |
|---|---|---|
| Alleles | G/A C/A G/A C/G T/C T/C T/C G/A C/G G/A | |
| $\mathbf{h}_m$ | A C G G C C C G G G | { -1, 1, 1, -1, -1, -1, -1, 1, -1, 1 } |
| $\mathbf{h}_p$ | G C A C T T T A C G | { 1, 1, -1, 1, 1, 1, 1, -1, 1, 1 } |

Next, assuming that each read has been aligned to the reference genome, the non-SNP sites of each read are omitted. Then, the reads are coded using the procedure described in Table 1 and padded with zeros to the length $l$, as shown for 10 aligned reads in Table 2. In this example, the 1st row becomes { -1 1 1 0 0 0 0 0 0 0 } with 3 sites of ±1.

Tab. 2: Example of aligned reads for β2AR genes and the considered codewords.

| Reads | Nucleotides | Codewords |
|---|---|---|
| 1 | A C G | -1 1 1 0 0 0 0 0 0 0 |
| 2 | G G C C | 0 0 1 -1 -1 -1 0 0 0 0 |
| 3 | G G G G | 0 0 1 -1 0 0 0 0 -1 1 |
| 4 | G C A C T T | 1 1 -1 1 1 1 0 0 0 0 |
| 5 | A C T A C G | 0 0 -1 0 0 1 1 -1 1 1 |
| 6 | G C T T | 1 1 0 0 1 1 0 0 0 0 |
| 7 | C C G | -1 1 0 0 -1 0 0 0 0 1 |
| 8 | A C C C C | -1 1 0 0 -1 -1 -1 0 0 0 |
| 9 | G C T A C | 1 0 0 1 0 0 1 -1 1 0 |
| 10 | A C C G | 0 0 -1 1 0 0 0 0 1 1 |

Without loss of generality, by representing the codewords of Table 2 by the vectors $\mathbf{r}_i$, $i = 1, \ldots, N$, we form the read matrix $\mathbf{R}$, where $N$ is the number of reads. In fact, $\mathbf{R}$ is an incomplete version of a rank-2 matrix whose rows are the maternal and paternal haplotypes. At this stage, we may utilize matrix completion methods to complete this low-rank matrix. To do so, by estimating the zero entries of $\mathbf{R}$, we obtain the completed matrix $\mathbf{H}$, which has the same dimension as $\mathbf{R}$, i.e., $N \times l$, where $l$ is the haplotype length. According to Table 2, these matrices are given by (1) and (2).

$$
\mathbf{R} = \begin{bmatrix}
-1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 & -1 & 1 \\
1 & 1 & -1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 0 & 0 & 1 & 1 & -1 & 1 & 1 \\
1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
-1 & 1 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & 1 \\
-1 & 1 & 0 & 0 & -1 & -1 & -1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 & -1 & 1 & 0 \\
0 & 0 & -1 & 1 & 0 & 0 & 0 & 0 & 1 & 1
\end{bmatrix} \tag{1}
$$

$$
\mathbf{H} = \begin{bmatrix}
-1 & 1 & 1 & -1 & -1 & -1 & -1 & 1 & -1 & 1 \\
-1 & 1 & 1 & -1 & -1 & -1 & -1 & 1 & -1 & 1 \\
-1 & 1 & 1 & -1 & -1 & -1 & -1 & 1 & -1 & 1 \\
1 & 1 & -1 & 1 & 1 & 1 & 1 & -1 & 1 & 1 \\
1 & 1 & -1 & 1 & 1 & 1 & 1 & -1 & 1 & 1 \\
1 & 1 & -1 & 1 & 1 & 1 & 1 & -1 & 1 & 1 \\
-1 & 1 & 1 & -1 & -1 & -1 & -1 & 1 & -1 & 1 \\
-1 & 1 & 1 & -1 & -1 & -1 & -1 & 1 & -1 & 1 \\
1 & 1 & -1 & 1 & 1 & 1 & 1 & -1 & 1 & 1 \\
1 & 1 & -1 & 1 & 1 & 1 & 1 & -1 & 1 & 1
\end{bmatrix} \tag{2}
$$

From $\mathbf{H}$, one can observe that only two of its rows are different, and thus the desired haplotypes are given by

$$
\mathbf{h}_m = \begin{bmatrix} -1 & 1 & 1 & -1 & -1 & -1 & -1 & 1 & -1 & 1 \end{bmatrix}, \tag{3}
$$

$$
\mathbf{h}_p = \begin{bmatrix} 1 & 1 & -1 & 1 & 1 & 1 & 1 & -1 & 1 & 1 \end{bmatrix}. \tag{4}
$$

These vectors can then be decoded to the sequence of nucleotides using the first row of Table 1. To the best of our knowledge, no algorithm has been reported to distinguish between the maternal and paternal haplotypes, and therefore $\mathbf{h}_p$ and $\mathbf{h}_m$ may be interchanged.

It should be noted that the above example is an error-free case meant to clarify the data modeling procedure, and it can be solved trivially. In the erroneous case, which is the subject of our work, $\mathbf{R}$ is an incomplete version of $\mathbf{H} + \mathbf{N}$, where $\mathbf{N}$ denotes the noise matrix [8].

Proposed Methods
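Before turning to the algorithms, the read-matrix preparation of the previous section (the first block of the pipeline, as in Tables 1 and 2) can be illustrated with a minimal sketch. It is written in Python rather than the authors' MATLAB. The names `ALLELES`, `encode_read`, and `read_matrix`, the 0-based site indexing, and the dictionary representation of a read are assumptions made for illustration only. The rule that the first listed allele maps to 1 and the second to −1 is inferred from the h_m row of Table 1.

```python
# Allele pair at each SNP site of Table 1 (first allele -> +1, second -> -1).
# This mapping is inferred from the h_m row of Table 1; it is an assumption.
ALLELES = ["GA", "CA", "GA", "CG", "TC", "TC", "TC", "GA", "CG", "GA"]


def encode_read(calls, alleles=ALLELES):
    """Return the length-l codeword of one read.

    `calls` maps a 0-based SNP site index to the nucleotide observed there.
    Sites that the read does not cover stay 0, i.e. unknown.
    """
    codeword = [0] * len(alleles)
    for site, base in calls.items():
        first, second = alleles[site]
        if base == first:
            codeword[site] = 1
        elif base == second:
            codeword[site] = -1
        else:
            raise ValueError(f"{base!r} is not an allele of site {site}")
    return codeword


def read_matrix(reads, alleles=ALLELES):
    """Stack the codewords of all reads into the N x l read matrix R."""
    return [encode_read(calls, alleles) for calls in reads]


# Reads 1-3 of Table 2, written as {site: nucleotide}.
reads = [
    {0: "A", 1: "C", 2: "G"},
    {2: "G", 3: "G", 4: "C", 5: "C"},
    {2: "G", 3: "G", 8: "G", 9: "G"},
]
R = read_matrix(reads)
```

With these three reads, `R` holds the first three codewords of Table 2. Paired-end reads such as read 3 are handled naturally because the covered sites need not be contiguous.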

We present three new algorithms for haplotype assembly whose general block diagram is illustrated in Fig 1. The goal is to estimate $\mathbf{h}_p$ and $\mathbf{h}_m$ from the noisy reads. The first two blocks have been explained above.

Fig. 1: Block diagram of the proposed algorithms.

In the third block, we receive an incomplete matrix $\mathbf{R}$ with a few known entries, where the set of indices of the known entries is denoted by $\Omega$ [10]. Then, we estimate the unknown entries based on the rank assumption. Mathematically, this is modeled by the following optimization problem:

$$
\min_{\mathbf{H}} \sum_{(i,j) \in \Omega} \left( H_{ij} - R_{ij} \right)^2 \quad \text{subject to} \quad \operatorname{rank}(\mathbf{H}) = 2. \tag{5}
$$

It is worth mentioning that we have considered not only the case of all-heterozygous variants but also the case of both heterozygous and homozygous variants. This is an advantage of this work over some other methods that are restricted to heterozygous variants. In the all-heterozygous case, the two haplotypes are the negative of each other, i.e., $\mathbf{h}_p = -\mathbf{h}_m$, and thus the rank of $\mathbf{H}$ in (5) becomes one.

To solve (5), the nuclear norm minimization, Singular Value Thresholding (SVT), and OPTSPACE methods have already been reported [13], based on which we introduce three new algorithms called HapSVT, HapNuc, and HapOPT.

Haplotype assembly based on Singular Value Thresholding (HapSVT)

To explain the proposed HapSVT algorithm, we first introduce the SVT, which is based on the Singular Value Decomposition (SVD) [14], defined for the read matrix $\mathbf{R}$ as

$$
\mathbf{R} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^H, \quad \boldsymbol{\Sigma} = \operatorname{diag}(\sigma_i), \quad i = 1, \ldots, r, \tag{6}
$$

where $(\cdot)^H$ denotes the Hermitian transpose, and $\mathbf{U}$ and $\mathbf{V}$ have orthonormal columns with dimensions $N \times r$ and $l \times r$, respectively. By applying the singular value shrinkage operator $\mathcal{D}_\tau(\cdot)$ to $\mathbf{R}$, we obtain

$$
\mathcal{D}_\tau(\mathbf{R}) = \mathbf{U}\, \mathcal{D}_\tau(\boldsymbol{\Sigma})\, \mathbf{V}^H, \tag{7}
$$

where

$$
\mathcal{D}_\tau(\boldsymbol{\Sigma}) = \operatorname{diag}\big( \max\{ \sigma_i - \tau, 0 \} \big). \tag{8}
$$

It is worth noting that $\mathcal{D}_\tau(\mathbf{R})$ is the minimizer of the optimization problem

$$
\min_{\mathbf{Z}} \ \tfrac{1}{2} \| \mathbf{R} - \mathbf{Z} \|_F^2 + \tau \| \mathbf{Z} \|_*, \tag{9}
$$

where $\| \cdot \|_F$ is the Frobenius norm and $\| \cdot \|_*$ is the nuclear norm, i.e., the sum of the singular values.

To perform the matrix completion part shown in Fig 1, we apply the SVT iteratively in two steps. In the first step, starting with the initial matrix $\mathbf{Y}_0 = \mathbf{R}$, the singular value shrinkage operator is applied as

$$
\mathbf{X}_k = \mathcal{D}_\tau(\mathbf{Y}_{k-1}). \tag{10}
$$

Then, in the second step, the difference between the projected matrix $\mathbf{X}_k$ and the initial matrix is compensated for on the known entries using

$$
\mathbf{Y}_k = \mathbf{Y}_{k-1} + \delta\, \mathcal{P}_\Omega(\mathbf{R} - \mathbf{X}_k), \tag{11}
$$

for $k = 1, 2, \ldots$, where $\delta$ is the step size and $\mathcal{P}_\Omega(\cdot)$ is an operator which keeps the entries of the matrix corresponding to $\Omega$ unchanged and sets the other entries to zero. The iterations continue until the condition $\| \mathcal{P}_\Omega(\mathbf{X}_k - \mathbf{R}) \|_F < \epsilon \| \mathbf{R} \|_F$ is satisfied, and the last $\mathbf{X}_k$ is reported as the completed matrix $\mathbf{H}$.

To extract $\mathbf{h}_p$ and $\mathbf{h}_m$, we compute the reduced row echelon form of $\mathbf{H}$ and, by using the first two pivot positions, obtain two independent rows of $\mathbf{H}$. Then, in order to acquire the paternal and maternal haplotypes, the entries are quantized to 1 and −1. The procedure of the HapSVT algorithm is given in Algorithm 1.
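The iteration (10)-(11) and the final quantization can be sketched in a few lines. This is a hedged Python/NumPy illustration, not the authors' MATLAB code. The function names `shrink`, `svt_complete`, and `quantize`, the defaults for `eps` and `max_iter`, and the convention that zeros in `R` mark the unknown entries (so that Ω is the set of nonzero positions) are assumptions. The values of τ and δ are left to the caller because the text does not fix them, and the RREF-based row selection is omitted.

```python
import numpy as np


def shrink(Y, tau):
    """Singular value shrinkage D_tau(Y): soft-threshold the singular values."""
    U, s, Vh = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vh


def svt_complete(R, tau, delta, eps=1e-6, max_iter=1000):
    """Complete R by iterating X_k = D_tau(Y_{k-1}), Y_k = Y_{k-1} + delta * P_Omega(R - X_k).

    Zeros in R are treated as unknown entries (assumption of this sketch).
    """
    R = np.asarray(R, dtype=float)
    omega = R != 0                       # known entries
    Y = R.copy()                         # Y_0 = R
    X = np.zeros_like(R)
    for _ in range(max_iter):
        X = shrink(Y, tau)
        residual = np.where(omega, R - X, 0.0)   # P_Omega(R - X_k)
        # Stop once the fit on the known entries is within the tolerance.
        if np.linalg.norm(residual) < eps * np.linalg.norm(R):
            break
        Y = Y + delta * residual
    return X


def quantize(H):
    """Map every entry of the completed matrix to +1 or -1 by its sign."""
    return np.where(H > 0, 1, -1)
```

A call such as `quantize(svt_complete(R, tau, delta))` then yields a ±1 matrix from which two distinct rows can be picked as haplotype estimates. Whether the iteration converges for a given read matrix depends on τ, δ, and the fraction of known entries, so `max_iter` acts as a safeguard.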

Algorithm 1: Haplotype assembly using SVT (HapSVT).

input : N aligned reads
output: Haplotypes h_m, h_p

/* Read Matrix Preparation */
Convert the sequences of nucleotides (reads) to sequences of numbers.
Add zeros to each read to construct the vectors r_i with the length of l.
Construct the read matrix R (N × l).

/* Matrix Completion (SVT) */
Initialize Y_0 = R, k = 0.
repeat
    k = k + 1
    X_k = D_τ(Y_{k−1})
    Y_k = Y_{k−1} + δ P_Ω(R − X_k)
until ‖P_Ω(X_k − R)‖_F < ε ‖R‖_F
H = X_k

/* Reduced Row Echelon Form (RREF) Calculation */
[H_r, p] = RREF(H^T)

/* Haplotype Extraction */
H_q = 2 ∗ (H > 0) − 1
h_p = H_q(p(1), :)
h_m = H_q(p(2), :)
Convert the entries of h_m and h_p to the nucleotides.

Haplotype assembly based on Nuclear norm minimization (HapNuc)

A popular method for matrix completion is based on relaxing the non-convex rank function to a convex function. Since the number of nonzero singular values determines the rank of a matrix, an approximation of the rank function is defined by the sum of the singular values, known as the nuclear norm [15]. In this way, the optimization problem is cast as

$$
\min_{\mathbf{H}} \| \mathbf{H} \|_* \quad \text{subject to} \quad \| \mathcal{P}_\Omega(\mathbf{H} - \mathbf{R}) \|_F < \epsilon. \tag{12}
$$

This problem can be solved easily using CVX, a MATLAB-based package [16]. It has been shown that nuclear norm minimization has strong mathematical guarantees of achieving the optimal solution [15, 17, 18]. To develop the new HapNuc algorithm, we replace the SVT part of Algorithm 1 with nuclear norm minimization.

Haplotype assembly based on OPTSPACE (HapOPT)

Another method for matrix completion is known as OPTSPACE [19], in which, unlike the two previous methods, we assume that the rank of the desired matrix $\mathbf{H}$ is known. OPTSPACE consists of the following three steps: a) trimming, b) projection, and c) cleaning, as explained below.

a) In the trimming step, those columns of $\mathbf{R}$ with degrees larger than $2|\Omega|/l$ are set to zero, where $|\cdot|$ denotes the cardinality of a set and $l$ is the haplotype length. The degree of a column (or a row) is the number of its known entries. This step is also performed for the rows of $\mathbf{R}$ with degrees larger than $2|\Omega|/N$, where $N$ is the number of reads.

b) The trimmed $\mathbf{R}$ obtained from Step (a) is projected to the space of rank-$r$ matrices using

$$
\mathcal{P}(\mathbf{R}) = \frac{N l}{|\Omega|}\, \mathbf{U}\, \mathcal{P}_r(\boldsymbol{\Sigma})\, \mathbf{V}^H, \tag{13}
$$

where $\mathcal{P}_r(\boldsymbol{\Sigma}) = \operatorname{diag}(\sigma_1, \ldots, \sigma_r)$ and $\mathbf{U}$ and $\mathbf{V}$ are given by (6).

c) The cleaning step is performed by solving the following optimization problem,

$$
\min_{\mathbf{X} \in \mathbb{R}^{N \times r},\ \mathbf{Y} \in \mathbb{R}^{l \times r}} \ \min_{\mathbf{S} \in \mathbb{R}^{r \times r}} \sum_{(i,j) \in \Omega} \left( R_{ij} - (\mathbf{X} \mathbf{S} \mathbf{Y}^H)_{ij} \right)^2, \tag{14}
$$

which contains two minimization parts. The inner part results in a function in terms of $\mathbf{X}$ and $\mathbf{Y}$. To solve the outer minimization part, we use a gradient-based iterative method whose initial matrices are computed from Step (b), i.e., $\mathbf{X}_0 = \mathbf{U}$ and $\mathbf{Y}_0 = \mathbf{V}$. This iterative method then leads to the optimal solution $\mathbf{H} = \mathbf{X}_{opt} \mathbf{S}_{opt} \mathbf{Y}_{opt}^H$.

To finalize the third new algorithm, HapOPT, we replace the SVT part of Algorithm 1 with the above three steps.

Results

Using extensive simulations, we compare the performance of the proposed HapSVT, HapNuc, and HapOPT algorithms with that of three recent benchmark algorithms: AltHap [8], HapCUT2 [5], and SDhaP [12]. It has already been shown that these algorithms outperform some other algorithms such as RefHap [20], SCGD [10], HapTree [6], and H-PoP [9]. For comparison purposes, a well-known criterion is the reconstruction rate, defined as [21]

$$
rr = 1 - \frac{1}{l} \min \left\{ HD\big( \hat{\mathbf{h}}_m, \mathbf{h}_m \big),\ HD\big( \hat{\mathbf{h}}_p, \mathbf{h}_p \big) \right\}, \tag{15}
$$

where $\hat{\mathbf{h}}_p$ and $\hat{\mathbf{h}}_m$ are the reconstructed haplotypes, which are compared to the known maternal and paternal haplotypes, $\mathbf{h}_m$ and $\mathbf{h}_p$. Moreover, $HD(\cdot, \cdot)$ is the augmented Hamming distance between two vectors, which counts the number of non-identical sites using

$$
HD(\mathbf{a}, \mathbf{b}) = \sum_{j=1}^{l} D\big( a(j), b(j) \big), \tag{16}
$$

where $D(\cdot, \cdot)$ is defined as

$$
D(a, b) = \begin{cases} 0, & a = b, \\ 1, & \text{otherwise}. \end{cases} \tag{17}
$$

As another criterion for performance evaluation, we use the SWitch Error Rate (SWER), defined as the number of switches divided by the haplotype length [22]. A switch happens when the parental origin of an allele differs from that of the previous allele. For example, consider $\mathbf{h}_p = [1, 1, 1, 1]$ and $\mathbf{h}_m = [-1, -1, -1, -1]$ as the ground truth haplotypes and $\hat{\mathbf{h}}_p = [1, 1, -1, -1]$ and $\hat{\mathbf{h}}_m = [-1, -1, 1, 1]$ as the estimated haplotypes; here a single switch occurs between the second and third SNPs.

Simulated data

First, we use the simulated data [21] generated based on real human haplotypes in the HapMap project. This dataset, which contains different read matrices with various error rates and coverage values originating from different haplotype lengths, has been widely used in previous studies [10, 23, 24]. We choose the longest available haplotype from the dataset, with the length of l = 700. The coverage value of the NGS paired-end reads varies from c = 3 to its greatest value c = 10. The average numbers of reads are N = 561, 936, and 1873 for coverage values of c = 3, 5, and 10, respectively. The number of SNPs covered in each read is a constant value equal to 7.4. Also, 10% (and 20%) of the entries of the read matrix are contaminated by noise with uniform distribution. The results are averaged over 100 independent trials of the experiment.

Table 3 shows the reconstruction rates for different coverage values and error rates. The corresponding SWERs are given in Table 4. In this case, HapCUT2 is not examined, since it needs the Variant Call Format (VCF) file, which is not available for this simulated dataset [21]. As seen in both Tables 3 and 4, the proposed HapOPT algorithm outperforms the others in terms of the reconstruction rate as well as the SWER. Recall that SDhaP solves a non-convex optimization problem using a heuristic technique with the gradient descent algorithm, which does not guarantee reaching the global optimum. Furthermore, increasing the coverage value yields better performance, i.e., a lower SWER and a higher reconstruction rate.
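The two criteria reported in Tables 3 and 4 can be made concrete with a small sketch. It is a Python illustration under stated assumptions, not the evaluation code behind the tables: the function names are hypothetical, haplotypes are plain lists of ±1, the reconstruction rate is read as one minus the smaller of the two Hamming distances divided by l (the normalisation is an assumption where the extracted formula is ambiguous), and homozygous sites are skipped when counting switches.

```python
def hamming(a, b):
    """Number of sites at which the two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def reconstruction_rate(est_m, est_p, true_m, true_p):
    """1 - min{HD(est_m, true_m), HD(est_p, true_p)} / l (assumed reading)."""
    l = len(true_m)
    return 1 - min(hamming(est_m, true_m), hamming(est_p, true_p)) / l


def switch_error_rate(est, true_p, true_m):
    """Number of switches in `est` divided by the haplotype length."""
    # True where the estimated allele agrees with the paternal haplotype.
    # Homozygous sites carry no phase information and are skipped (assumption).
    origins = [e == p for e, p, m in zip(est, true_p, true_m) if p != m]
    switches = sum(1 for prev, cur in zip(origins, origins[1:]) if prev != cur)
    return switches / len(true_p)


# The four-SNP example used to introduce the switch error rate.
true_p, true_m = [1, 1, 1, 1], [-1, -1, -1, -1]
est_p, est_m = [1, 1, -1, -1], [-1, -1, 1, 1]
swer = switch_error_rate(est_p, true_p, true_m)
rr = reconstruction_rate(est_m, est_p, true_m, true_p)
```

For this example the estimate follows the paternal haplotype for two SNPs and the maternal one afterwards, so one switch is counted over four sites.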

Tab. 3: Reconstruction rates for different algorithms on simulated data [21]. The best values are in boldface.

| coverage | error rate (%) | SDhaP | AltHap | HapOPT (Proposed) | HapSVT (Proposed) | HapNuc (Proposed) |
|---|---|---|---|---|---|---|
| 3 | 10 | 97.87 | 99.04 | | | |

Tab. 4: SWERs for different algorithms on simulated data [21]. The best values are in boldface.

| coverage | error rate (%) | SDhaP | AltHap | HapOPT (Proposed) | HapSVT (Proposed) | HapNuc (Proposed) |
|---|---|---|---|---|---|---|
| 3 | 10 | 0.070 | 0.038 | | | |

Real fosmid data

We evaluate the proposed algorithms on the sequence data of the individual NA12878, generated using a fosmid-based approach [20]. The coverage of this dataset is c = 3 and the average read length is 40 kb; hence, it is a low-coverage, long-read dataset. For evaluation purposes, we consider the trio-phased haplotype from the GATK resource bundle as the ground truth, containing 1.3 million heterozygous variants in common with the fosmid dataset [22,25]. This dataset has already been used in several studies [5, 8, 22].

In the simulated dataset used in the last section, each read overlaps at least one other read, while for the real data these overlaps do not necessarily occur. In this situation, our algorithm uses the overlaps for haplotype estimation, and as a result, the output of each algorithm is a set of disjoint parts of the whole haplotype, called haplotype blocks. To summarize the length of these blocks, we consider their mean and also the AN50, defined as the median of block lengths in base pairs weighted by the proportion of correctly estimated alleles [6]. Also, we define the SNP Missing Rate (SMR) for each chromosome as the ratio of the number of missing SNPs in the estimates to the haplotype length [26]. The results on the real fosmid data are shown in Table 5. One can see that both HapOPT and AltHap achieve lower SNP missing rates than HapCUT2 and SDhaP. Moreover, HapOPT and AltHap have a better span in terms of AN50.

Tab. 5: Mean and AN50 of haplotype block lengths for different algorithms on real fosmid data.

| Chr. | SDhaP SMR | SDhaP Mean | SDhaP AN50 (kb) | HapCUT2 SMR | HapCUT2 Mean | HapCUT2 AN50 (kb) | AltHap SMR | AltHap Mean | AltHap AN50 (kb) | HapOPT (Proposed) SMR | HapOPT (Proposed) Mean | HapOPT (Proposed) AN50 (kb) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 6.2 | 71.5 | 254 | 6.7 | 71.1 | 229 | 6.2 | 72.7 | 234 | 6.2 | 72.7 | 234 |
| 2 | 6.9 | 68.6 | 241 | 8.3 | 68.3 | 219 | 6.9 | 69.7 | 223 | 6.9 | 69.7 | 223 |
| 3 | 8.1 | 69.7 | 218 | 8.6 | 69.3 | 195 | 8.0 | 70.6 | 204 | 8.0 | 70.6 | 204 |
| 4 | 10.0 | 63.4 | 192 | 10.4 | 63.1 | 172 | 9.9 | 64.6 | 177 | 9.9 | 64.6 | 177 |
| 5 | 8.2 | 69.5 | 219 | 8.8 | 69.0 | 206 | 8.2 | 70.3 | 210 | 8.2 | 70.3 | 210 |
| 6 | 7.3 | 82.4 | 243 | 7.9 | 81.9 | 224 | 7.3 | 84.0 | 236 | 7.3 | 84.0 | 236 |
| 7 | 7.2 | 69.7 | 222 | 7.6 | 69.5 | 207 | 7.1 | 71.0 | 212 | 7.1 | 71.0 | 212 |
| 8 | 7.8 | 75.6 | 229 | 8.3 | 75.2 | 207 | 7.7 | 76.8 | 220 | 7.7 | 76.8 | 220 |
| 9 | 7.0 | 79.6 | 249 | 7.5 | 79.2 | 230 | 6.9 | 80.9 | 235 | 6.9 | 80.9 | 235 |
| 10 | 6.8 | 83.9 | 238 | 7.3 | 83.4 | 217 | 6.7 | 84.9 | 220 | 6.7 | 84.9 | 220 |
| 11 | 7.1 | 77.1 | 234 | 7.5 | 76.8 | 225 | 7.0 | 78.3 | 228 | 7.0 | 78.3 | 228 |
| 12 | 6.4 | 73.4 | 262 | 7.3 | 73.0 | 241 | 6.7 | 74.1 | 249 | 6.7 | 74.1 | 249 |
| 13 | 10.2 | 69.1 | 203 | 10.7 | 68.7 | 186 | 10.1 | 70.3 | 191 | 10.1 | 70.3 | 191 |
| 14 | 6.5 | 77.5 | 259 | 7.0 | 77.1 | 238 | 6.3 | 78.4 | 246 | 6.3 | 78.4 | 246 |
| 15 | 6.0 | 73.7 | 251 | 6.4 | 73.2 | 228 | 5.9 | 74.1 | 234 | 5.9 | 74.1 | 234 |
| 16 | 3.8 | 96.6 | 345 | 4.2 | 96.2 | 317 | 3.7 | 97.9 | 327 | 3.7 | 97.9 | 327 |
| 17 | 3.9 | 70.8 | 323 | 4.5 | 70.4 | 305 | 3.9 | 71.5 | 310 | 3.9 | 71.5 | 310 |
| 18 | 7.1 | 75.3 | 228 | 7.6 | 74.9 | 216 | 7.0 | 76.0 | 223 | 7.0 | 76.0 | 223 |
| 19 | 3.1 | 90.8 | 374 | 3.5 | 90.4 | 345 | 3.0 | 93.8 | 360 | 3.0 | 93.8 | 360 |
| 20 | 4.3 | 92.4 | 314 | 4.8 | 92.0 | 297 | 4.2 | 93.7 | 304 | 4.2 | 93.7 | 304 |
| 21 | 6.6 | 81.1 | 252 | 7.0 | 80.8 | 242 | 6.4 | 82.4 | 242 | 6.4 | 82.4 | 242 |
| 22 | 2.7 | 123.7 | 445 | 3.2 | 123.2 | 425 | 2.6 | 123.9 | 426 | 2.6 | 123.9 | 426 |

To assess the accuracy of the different algorithms, the corresponding reconstruction rates [5, 22] are presented in Fig 2. Moreover, we have considered both short and long SWERs [5, 22]. By a long switch, we mean that the parental origin does not change for at least two SNPs after the switch; if two switches occur one after the other, we count them as a short switch. These two metrics are reported on the real fosmid data in Figs 3 and 4.
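The distinction between short and long switches can be sketched as follows. This Python snippet is one possible reading of the definition above and is not taken from the evaluation scripts of [5, 22]: two switches at adjacent SNPs are counted together as one short switch, and every other switch as a long one. The function name `count_switches`, the ±1 list representation, and the skipping of homozygous sites are assumptions.

```python
def count_switches(est, true_p, true_m):
    """Return (short, long) switch counts of `est` against the true haplotype pair."""
    # True where the estimated allele agrees with the paternal haplotype;
    # homozygous sites are skipped because they carry no phase information.
    origins = [e == p for e, p, m in zip(est, true_p, true_m) if p != m]
    # Positions at which the parental origin changes with respect to the previous SNP.
    positions = [i for i in range(1, len(origins)) if origins[i] != origins[i - 1]]

    short = long_ = 0
    i = 0
    while i < len(positions):
        back_to_back = i + 1 < len(positions) and positions[i + 1] == positions[i] + 1
        if back_to_back:
            short += 1   # a single flipped SNP: two switches in a row
            i += 2
        else:
            long_ += 1   # the origin stays changed for at least two SNPs
            i += 1
    return short, long_


# One isolated flipped SNP (site 2) followed by a lasting change from site 5 on.
true_p, true_m = [1] * 8, [-1] * 8
est = [1, 1, -1, 1, 1, -1, -1, -1]
short, long_ = count_switches(est, true_p, true_m)
```

Dividing each count by the haplotype length, in analogy with the SWER definition, would give the corresponding rates; that normalisation is likewise an assumption here.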

Tab. 6: Runtime of HapOPT, HapCUT2, AltHap, and SDhaP on real fosmid data.

| | SDhaP | AltHap | HapCUT2 | HapOPT (Proposed) |
|---|---|---|---|---|
| Runtime (Minutes) | 5 | 10 | 18 | 355 |

Fig. 2: Reconstruction rate of HapOPT, HapCUT2, AltHap, and SDhaP on real fosmid data.

Fig. 3: Short SWER of HapOPT, HapCUT2, AltHap, and SDhaP on real fosmid data.

Fig. 4: Long SWER of HapOPT, HapCUT2, AltHap, and SDhaP on real fosmid data.

From the above results, one can observe that HapOPT outperforms SDhaP and AltHap in terms of the reconstruction rate as well as the long and short SWERs, with a reasonable runtime as reported in Table 6. Note that although HapCUT2 achieves the best accuracy, its SNP missing rate is greater than that of HapOPT. On the whole, these results show that HapOPT is a promising tool for haplotype assembly, with the best SNP missing rate and good accuracy in terms of reconstruction rate and SWER.

Conclusion

We have exploited matrix completion methods, including SVT, nuclear norm minimization, and OPTSPACE, for haplotype estimation. This led to the development of the new HapSVT, HapNuc, and HapOPT algorithms. Our experimental comparison on simulated data revealed that HapOPT is more accurate than SDhaP and AltHap in terms of reconstruction rate and switch error rate. Also, the results on real noisy fosmid data showed that the accuracy of HapOPT is better than that of SDhaP and AltHap and is comparable to that of HapCUT2 in terms of the reconstruction rate and the short and long SWERs. Moreover, it was shown that HapOPT outperforms the recently proposed algorithms HapCUT2 and SDhaP in terms of the mean, SNP missing rate, and AN50 of the haplotype block length. Furthermore, the proposed algorithm is not restricted to the heterozygous assumption, as commonly considered in peer algorithms. On the whole, we can conclude that using the proposed HapOPT, the haplotype is reconstructed more completely and continuously with acceptable accuracy. Also, the proposed optimization problem is capable of estimating haplotypes for different ploidy levels. Our future research direction is to work on polyploids.

Availability of data and materials

The MATLAB program of the proposed algorithms is publicly available at https://github.com/smajidian/HapMC. The simulated datasets consisting of read matrices and true haplotypes used in this work can be downloaded from https://github.com/smajidian/HapMC/raw/master/data/Simulated_data.mat.zip. The fosmid dataset for NA12878 is taken from [22, 25]. The fragment files can be downloaded from https://github.com/smajidian/HapMC/raw/master/data/phasing-matrices.zip and the ground truth haplotypes are available at https://github.com/smajidian/HapMC/raw/master/data/validation.zip.

References

[1] Delaneau O, Marchini J, Zagury JF. A linear complexity phasing method for thousands of genomes. Nat Methods, 9(2):179–181, 2012.
[2] Snyder MW, Adey A, Kitzman JO, Shendure J. Haplotype-resolved genome sequencing: experimental methods and applications. Nat Rev Genet, 16(6):344–358, 2015.
[3] Wang L, Xu Y. Haplotype inference by maximum parsimony. Bioinformatics, 19(14):1773–1780, 2003.
[4] O'Connell J, Sharp K, Shrine N, Wain L, Hall I, Tobin M, et al. Haplotype estimation for biobank-scale datasets. Nat Genet, 48(7):817–820, 2016.
[5] Edge P, Bafna V, Bansal V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res, 27(5):801–812, 2017.
[6] Berger E, Yorukoglu D, Peng J, Berger B. HapTree: A novel Bayesian framework for single individual polyplotyping using NGS data. PLoS Comput Biol, 10(3):e1003502, 2014.
[7] Mousavi SR, Khodadadi I, Falsafain H, Nadimi R, Ghadiri N. Maximum likelihood model based on minor allele frequencies and weighted max-sat formulation for haplotype assembly. J Theor Biol, 350:49–56, 2014.
[8] Hashemi A, Banghua Z, Vikalo H. Sparse tensor decomposition for haplotype assembly of diploids and polyploids. BMC Genomics, 19(Suppl 4):191, 2018.
[9] Xie M, Wu W, Wang J, Jiang T. H-PoP and H-PoPG: Heuristic partitioning algorithms for single individual haplotyping of polyploids. Bioinformatics, 32(24):3735–3744, 2016.
[10] Cai C, Sanghavi S, Vikalo H. Structured low-rank matrix factorization for haplotype assembly. IEEE J Sel Top Signal Process, 10(4):647–657, 2016.
[11] Xie M, Wu M, Wang J, Jiang T. A fast and accurate algorithm for single individual haplotyping. BMC Syst Biol, 6(Suppl 2):S8, 2012.
[12] Das S, Vikalo H. SDhaP: Haplotype assembly for diploids and polyploids via semi-definite programming. BMC Genomics, 16(1):260, 2015.
[13] Davenport MA, Romberg J. An overview of low-rank matrix recovery from incomplete observations. IEEE J Sel Top Signal Process, 10(4):608–622, 2016.
[14] Cai JF, Candes EJ, Shen Z. A singular value thresholding algorithm for matrix completion. SIAM J Optim, 20(4):1956–1982, 2010.
[15] Candes EJ, Tao T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans Inf Theory, 56(5):2053–2080, 2010.
[16] Grant M, Boyd S. CVX: Matlab software for disciplined convex programming. 2013. Available from: http://cvxr.com/cvx
[17] Candes EJ, Recht B. Exact matrix completion via convex optimization. Found Comput Math, 9(6):717, 2009.
[18] Recht B, Fazel M, Parrilo PA. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev Soc Ind Appl Math, 52(3):471–501, 2010.
[19] Keshavan RH, Montanari A, Oh S. Matrix completion from a few entries. IEEE Trans Inf Theory, 56(6):2980–2998, 2010.
[20] Duitama J, McEwen GK, Huebsch T, Palczewski S, Schulz S, Verstrepen K, et al. Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of single individual haplotyping techniques. Nucleic Acids Res, 40(5):2041–2053, 2011.
[21] Geraci F. A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem. Bioinformatics, 26(18):2217–2225, 2010.
[22] Kuleshov V. Probabilistic single-individual haplotyping. Bioinformatics, 30(17):379–385, 2014.
[23] Deng F, Cui W, Wang L. A highly accurate heuristic algorithm for the haplotype assembly problem. BMC Genomics, 14:S2, 2013.
[24] Chen ZZ, Deng F, Wang L. Exact algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics, 29(16):1938–1945, 2013.
[25] DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet, 43(5):491–498, 2011.
[26] Motazedi E, Finkers R, Maliepaard C, de Ridder D. Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study.