[PDF] A Graph Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction

Abstract

Reconstructing components of a genomic mixture from data obtained by means of DNA sequencing is a challenging problem encountered in a variety of applications including single individual haplotyping and studies of viral communities. High-throughput DNA sequencing platforms oversample mixture components to provide massive amounts of reads whose relative positions can be determined by mapping the reads to a known reference genome; assembly of the components, however, requires discovery of the reads' origin -- an NP-hard problem that the existing methods struggle to solve with the required level of accuracy. In this paper, we present a learning framework based on a graph auto-encoder designed to exploit structural properties of sequencing data. The algorithm is a neural network which essentially trains to ignore sequencing errors and infers the posteriori probabilities of the origin of sequencing reads. Mixture components are then reconstructed by finding consensus of the reads determined to originate from the same genomic component. Results on realistic synthetic as well as experimental data demonstrate that the proposed framework reliably assembles haplotypes and reconstructs viral communities, often significantly outperforming state-of-the-art techniques.

Full PDF

AA Graph Auto-Encoder for Haplotype Assembly and Viral QuasispeciesReconstruction

Ziqi Ke and Haris Vikalo

Department of Electrical and Computer EngineeringThe University of Texas at [email protected], [email protected]

Abstract

Reconstructing components of a genomic mixture from dataobtained by means of DNA sequencing is a challenging prob-lem encountered in a variety of applications including singleindividual haplotyping and studies of viral communities. High-throughput DNA sequencing platforms oversample mixturecomponents to provide massive amounts of reads whose rel-ative positions can be determined by mapping the reads to aknown reference genome; assembly of the components, how-ever, requires discovery of the reads’ origin – an NP-hardproblem that the existing methods struggle to solve with therequired level of accuracy. In this paper, we present a learningframework based on a graph auto-encoder designed to exploitstructural properties of sequencing data. The algorithm is aneural network which essentially trains to ignore sequencingerrors and infers the posterior probabilities of the origin ofsequencing reads. Mixture components are then reconstructedby ﬁnding consensus of the reads determined to originate fromthe same genomic component. Results on realistic syntheticas well as experimental data demonstrate that the proposedframework reliably assembles haplotypes and reconstructsviral communities, often signiﬁcantly outperforming state-of-the-art techniques. Source codes and datasets are publiclyavailable at https://github.com/WuLoli/GAEseq.

Genetic makeup of a biological sample, inferred by means ofDNA sequencing, will help determine an individual’s suscep-tibility to a broad range of chronic and acute diseases, supportthe discovery of new pharmaceutical products, and personal-ize and improve the delivery of health care. However, beforethe promises of personalized medicine come to fruition, ef-ﬁcient methods for accurate inference of genetic variationsfrom massive DNA sequencing data must be devised.Information about variations in an individual genome isprovided by haplotypes, ordered lists of single nucleotidepolymorphisms (SNPs) on the individual’s chromosomes(Schwartz 2010). High-throughput DNA sequencing tech-nologies generate massive amounts of reads that sample anindividual genome and thus enable studies of genetic varia-tions (Schwartz 2010; Clark 2004; Sabeti 2002). Haplotype

Copyright c (cid:13) reconstruction, however, remains challenging due to limitedlengths of reads and presence of sequencing errors (Hashemi2018). Particularly difﬁcult is the assembly of haplotypesin polyploids, organisms with chromosomes organized in k -tuples with k > , where deep coverage is typically requiredto achieve desired accuracy. This implies high cost and oftenrenders existing haplotype assembly techniques practicallyinfeasible (Motazedi 2018).A closely related problem to haplotype assembly is thatof reconstructing viral communities. RNA viruses such ashepatitis, HIV, and Ebola, are characterized by high muta-tion rates which give rise to communities of viral genomes,the so-called viral quasispecies. Determining genetic diver-sity of a virus is essential for the understanding of its originand mutation patterns, and the development of effective drugtreatments. Reconstructing viral quasispecies (i.e., viral hap-lotypes , as we refer to them for convenience) is even morechallenging than haplotype assembly (Ahn 2018) since thenumber of constituent strains in a community is typically un-known, and its spectra (i.e., strain frequencies) non-uniform.Existing methods often approach haplotype assembly asthe task of grouping sequencing reads according to their chro-mosomal origin into as many clusters as there are chromo-somes. Separation of reads into clusters is rendered challeng-ing by their limited lengths and the presence of sequencingerrors (Hashemi 2018); such artifacts create ambiguities re-garding the origin of the reads. The vast majority of existinghaplotype assembly methods attempt to remove the afore-mentioned ambiguity by altering or even discarding the data,leading to minimum SNP removal (Lancia 2001), maximumfragments cut (Duitama 2010), and minimum error correction(MEC) score (Lippert 2002) optimization criteria. Majorityof haplotype assembly methods developed in recent years arefocused on optimizing the MEC score, i.e., determining thesmallest possible number of nucleotides in sequencing readsthat should be altered such that the resulting dataset is con-sistent with having originated from k haplotypes ( k denotesthe ploidy of an organism) (Xie 2016; Pirola 2015; Kuleshov2014; Patterson 2015; Bonizzoni 2016). These include thebranch-and-bound scheme (Wang 2005), an integer linearprogramming formulation in (Chen 2013), and a dynamic This work was funded in part by the NSF grant CCF 1618427. a r X i v : . [ q - b i o . GN ] N ov rogramming framework in (Kuleshov 2014). All these tech-niques attempt to ﬁnd exact solution to the MEC score min-imization problem; the resulting high complexity has moti-vated search for computationally efﬁcient heuristics. Theyinclude the greedy algorithm in (Levy 2007) and methodsthat compute posterior joint probability of the alleles in a hap-lotype sequence via MCMC (Bansal 2008) and Gibbs (Kim2007) sampling. A max-cut algorithm for haplotype assemblyin (Bansal 2008) is motivated by the clustering interpreta-tion of the problem. The efﬁcient algorithm proposed there,HapCUT, has recently been upgraded as HapCUT2 (Edge2017). In (Aguiar 2012), a novel ﬂow-graph approach to hap-lotype assembly was proposed, demonstrating performancesuperior to state-of-the-art methods. More recent methods in-clude a greedy max-cut approach in (Duitama 2011), convexoptimization framework in (Das 2015), and a communication-theoretic motivated algorithm in (Puljiz 2016).Haplotype assembly for polyploids ( k > ) is more chal-lenging than that for diploids ( k = 2 ) due to a much largerspace of possible solutions to be searched. Among the afore-mentioned methods, only HapCompass (Aguiar 2012), SD-haP (Das 2015) and BP (Puljiz 2016) are capable of solvingthe haplotype assembly problem for k > . Other techniquesthat can handle reconstruction of haplotypes for both diploidand polyploid genomes include a Bayesian method HapTree(Berger 2014), a dynamic programming method H-PoP (Xie2016) shown to be more accurate than the techniques in(Aguiar 2012; Berger 2014; Das 2015), and the matrix factor-ization schemes in (Cai 2016; Hashemi 2018).On another note, a number of viral quasispecies recon-struction methods were proposed in recent years. Exam-ples include ShoRAH (Zagordi 2011) and ViSpA (Astro-vskaya 2011) that perform read clustering and read-graphpath search, respectively, to identify distinct viral compo-nents. QuasiRecomb (T¨opfer 2013) casts the problem asthe decoding in a hidden Markov model while QuRe (Pros-peri 2012) formulates it as a combinatorial optimization.PredictHaplo (Prabhakaran 2014) employs non-parametricBayesian techniques to automatically discover the numberof viral strains in a quasispecies. More recently, aBayesQR(Ahn 2017) approached viral quasispecies reconstructionwith a combination of hierarchical clustering and Bayesianinference while (Ahn 2018) relies on tensor factorization.In this paper, we propose a ﬁrst ever neural network-basedlearning framework, named GAEseq, to both haplotype as-sembly and viral quasispecies reconstruction problems. Theframework aims to estimate the posterior probabilities of theorigins of sequencing reads using an auto-encoder whosedesign incorporates salient characteristics of the sequencingdata. Auto-encoders (Fukushima 1975) are neural networksthat in an unsupervised manner learn a low-dimensional rep-resentation of data; more speciﬁcally, they attempt to performa dimensionality reduction while robustly capturing essen-tial content of high-dimensional data (Goodfellow 2016).Auto-encoders have shown outstanding performance in a va-riety of applications across different ﬁelds including naturallanguage processing (Socher 2011), collaborative ﬁltering(Rianne 2017), and information retrieval (Thomas 2016), toname a few. Typically, auto-encoders consist of two blocks: an encoder and a decoder. The encoder converts input datainto the so-called codes while the decoder reconstructs theinput from the codes. The act of copying the input data tothe output would be of little interest without an importantadditional constraint – namely, the constraint that the dimen-sion of codes is smaller than the dimension of the input. Thisenables auto-encoders to extract salient features of the in-put data. For both the single individual and viral haplotypereconstruction problems, the salient features of data are theorigins of sequencing reads. In our work, we propose a graphauto-encoder architecture with an encoder featuring a soft-max function placed after the dense layer that follows graphconvolutional layers (Masci 2011; Rianne 2017); the softmaxfunction acts as an estimator of the posterior probabilitiesof the origins of sequencing reads. The decoder assembleshaplotypes by ﬁnding the consensus sequence for each com-ponent of the mixture, thus enabling end-to-end solution tothe reconstruction problems. Let H denote a k × n haplotype matrix where k is the numberof (single individual or viral) haplotypes and n is the haplo-type length. Furthermore, let R denote an m × n SNP frag-ment matrix whose rows correspond to sequencing reads andcolumns correspond to SNP positions. Matrix R is formed byﬁrst aligning reads to a reference genome and then identify-ing and retaining only the information that the reads provideabout heterozygous genomic sites. One can interpret R asbeing obtained by sparsely sampling an underlying groundtruth matrix M , where the i th row of M is the haplotypesampled by the i th read. The sampling is sparse because thereads are much shorter than the haplotypes; moreover, thereads may be erroneous due to sequencing errors. Following(Ahn 2018) , we formalize the sampling operation as [ P Ω ( M )] ij = (cid:26) M ij , ( i, j ) ∈ Ω0 , otherwise (1)where Ω denotes the set of informative entries in R , i.e.,the set of ( i, j ) such that the j th SNP is covered by the i th read, and P Ω is the projection operator denoting the sam-pling of haplotypes by reads. Sequencing is erroneous andthus [ P Ω ( R )] ij may differ from [ P Ω ( M )] ij ; in particular,given sequencing error rate p , [ P Ω ( R )] ij = [ P Ω ( M )] ij withprobability − p .Since each read samples one of the haplotypes, R = P Ω ( U H ) where U denotes the m × k matrix indicatingorigins of the reads in R . In particular, each row of ma-trix U is one of the k -dimensional standard unit vectors e ( k ) i , ≤ i ≤ k , with in the i th position and the remainingentries . If i th read samples j th haplotype, the i th row of U is e ( k ) j . If the origins of reads were known, each haplotypecould be reconstructed by ﬁnding consensus of reads whichsample that particular haplotype. We think of the assemblyas a two-step procedure: given the SNP fragment matrix R we ﬁrst identify the read origin indicator matrix U and thenuse U to reconstruct the haplotype matrix H .o characterize the performance of haplotype assemblymethods we rely on two metrics: the minimum error correc-tion (MEC) score, which can be traced back to (Lippert 2002),and the correct phasing rate, also referred to as reconstruc-tion rate. The MEC score is deﬁned as the smallest numberof observed entries in R that need to be altered (i.e., cor-rected) such that the resulting data is consistent with havingoriginated from k distinct haplotypes, i.e.,MEC = m (cid:88) i =1 min j =1 , ,...,k HD ( R i : , H j : ) , (2)where HD ( · , · ) denotes the Hamming distance between itsarguments (sequences, evaluated only over informative en-tries), R i : denotes the i th row of R and H j : denotes the j th row of H . The correct phasing rate (CPR) is deﬁned asCPR = 1 − kn (min k (cid:88) i =1 HD ( H i : , M ( H i : ))) , (3)where M is the one-to-one mapping from the set of recon-structed haplotype to the set of true haplotype (Hashemi2018), i.e., mapping that determines the best possible matchbetween the two sets of haplotypes. To characterize perfor-mance of methods for reconstruction of viral quasispecieswith generally a priori unknown number of components, inaddition to correct phasing rate we also quantify recall rate ,deﬁned as the fraction of perfectly reconstructed componentsin a population (i.e., recall rate = T PT P + F N ), and predictedproportion , deﬁned as the ratio of the estimated and the truenumber of components in a genomic mixture (Ahn 2018).To assemble haplotypes from a set of reads we design andemploy a graph auto-encoder. Fig. 1 (b) shows the entireend-to-end pipeline that takes the collection of erroneousreads and generates reconstructed haplotypes. First, the SNPfragment matrix R is processed by the graph encoder to inferthe read origin indicator matrix U ; then, a haplotype decoderreconstructs matrix H . The graph auto-encoder is formalizedin the next section. Graph auto-encoders are a family of auto-encoders speciﬁ-cally designed for learning on graph-structured data (Rianne2017; Thomas 2016). In this paper, we design graph auto-encoders for the assembly of the components of a genomicmixture. As in conventional auto-encoder structures, the de-veloped architecture consists of two parts: the graph encoderand the decoder. The graph encoder Z = f ( R, A ) takes theSNP fragment matrix R and the m × n graph adjacency ma-trix A as inputs, and outputs the m × k node embeddingmatrix Z . Note that we impose constraints on the node em-bedding matrix so that the salient features extracted by agraph auto-encoder approximate the read origin indicatormatrix U . Such a constraint does not prevent efﬁcient train-ing of the auto-encoders via backpropagation. The decoder ˆ R = g ( Z ) is utilized to reconstruct the SNP fragment matrix R and the haplotype matrix H from the node embedding ma-trix Z ; this implies that the decoder is essentially capable ofimputing the unobserved entries in the SNP fragment matrix. To numerically represent information in the SNP fragmentmatrix R , we encode its entries R ij using a set of discretevalues – one for each of possible nucleotides – where themapping between nucleotides and the discrete values canbe decided arbitrarily. To this end, we may simply representthe nucleotides A, C, G and T by , , and , respectively;non-informative entries in each row of R , i.e., SNP posi-tions not covered by a read, are represented by . Note thatthe SNP fragment matrix can be represented by an undi-rected bipartite graph G = ( V, E, W ) where the set of readnodes r i ∈ A with i ∈ { , ..., m } and the set of SNP nodes s j ∈ B with j ∈ { , ..., n } together form the set of vertices V , i.e., A ∪ B = V . The weights w ∈ { , , , } = W as-signed to edges ( r i , w, s j ) ∈ E are the discrete values usedto represent nucleotides. With this model in place, we canrephrase the graph encoder as Z = f ( R, A , A , A , A ) ,where A w ∈ { , } m × n represents the graph adjacency ma-trix for a nucleotide encoded by w . Equivalently, A w has1’s for the entries whose corresponding positions in R areencoded by w . Since we are interested in imputing the unob-served entries based on the observed entries in R instead ofsimply copying the observed entries to ˆ R , it is beneﬁcial toreformulate the decoder as ˆ R = g ( Z, R ) . In other words, theauto-encoder is trained to learn from the observed entries inorder to determine origin of reads, impute unobserved entriesof R , and reconstruct haplotypes in the genomic mixture. Recall the interpretation that the SNP fragment matrix R isobtained by erroneously sampling an underlying ground truthmatrix M . This motivates development of a speciﬁc graphencoder architecture, motivated by the ideas of the design in(Rianne 2017), that is capable of detecting origin of sequenc-ing reads in R via estimating the posterior probabilities ofthe origin of each read.Let D r denote an m × m diagonal read degree matrixwhose entries indicate the number of SNPs covered by eachread, and let D s denote an n × n diagonal SNP degree matrixwhose entries indicate the number of reads covering eachSNP. We facilitate exchange of messages between read nodesand SNP nodes in the graph, initiating it from the set of readnodes A ; doing so helps reduce the dimensions of weightsand biases since the number of reads m is far greater thanthe haplotype length n . Note that the dimension of messageskeeps reducing during the message passing procedure.The messages from read nodes to SNP nodes are M (1) = σ ( (cid:88) w =1 D − s A Tw RW (1) w + B (1) w ) , (4)where W (1) w and B (1) w denote the weights and biases of theﬁrst convolutional layer for the nucleotide encoded with w ,respectively, σ denotes an element-wise activation functionsuch as ReLU ( · ) = max( · , , and ( · ) T denotes the transposeof a matrix. The dimension of both W (1) w and B (1) w is n × c (1) ,where c (1) denotes the message length after the ﬁrst messagepassing step.igure 1: (a) Segment of the SNP fragment matrix. Non-zero entries represent SNP information provided by sequencing reads;labels - indicate the four nucleotides. Zero entries in a row indicate that the read does not cover corresponding SNP. In thisillustration, the ﬁrst two rows represent reads originating from the same haplotype; the third and fourth reads both originatedfrom another haplotype; and so on. (b) The pipeline from the SNP fragment matrix to haplotypes via a graph auto-encoder.Figure 2: A forward pass through the graph auto-encoder consisting of a stacked graph encoder that passes messages betweenread and SNP nodes and constructs approximate read origin indicator matrix via the softmax function. Decoder reconstructshaplotypes and SNP fragment matrix.The messages from SNP nodes to read nodes are M (2) = σ ( (cid:88) w =1 D − r A w M (1) W (2) w + B (2) w ) , (5)where W (2) w and B (2) w denote the weights and biases of thesecond convolutional layer for the nucleotide encoded with w , respectively. The dimension of both W (2) w and B (2) w is c (1) × c (2) , where c (2) denotes the message length after thesecond message passing step.Repeating message passing and stacking the convolutionallayers leads to formation of a deep model. The read nodes to SNP nodes layer is readily generalized as M (2 i +1) = σ ( (cid:88) w =1 D − s A Tw M (2 i ) W (2 i +1) w + B (2 i +1) w ) , (6)where i ∈ { , , , ... } and M (2 i ) = R for i = 0 . The dimen-sion of M (2 i ) is m × c (2 i ) . Furthermore, the SNP nodes toread nodes layer is generalized as M (2 i ) = σ ( (cid:88) w =1 D − r A w M (2 i − W (2 i ) w + B (2 i ) w ) , (7)where i ≥ . The dimension of M (2 i − is n × c (2 i − . Notethat the messages are passed from read nodes to SNP nodeswhen the subscript of M is odd, and otherwise traverse in theopposite direction.quation (6) and (7) specify the graph convolutional layerwhile the dense layer is deﬁned as O = σ ( M ( l ) W d + B d ) , (8)where O denotes the output of the dense layer, W d and B d are the weights and biases of the dense layer, respectively, M ( l ) is the output of the last graph convolutional layer, and l represents the number of graph convolutional layers. Thedimension of W d is c ( l ) × k and the dimension of O and B d is m × k , where k denotes the ploidy (i.e., the number ofcomponents in a genomic mixture).To ﬁnd Z which approximates the read origin indicatormatrix U (i.e., Z with each row close in the l -norm senseto a k -dimensional standard basis vector), we employ thesoftmax function Z ij = e βO ij (cid:80) kj =1 e βO ij , (9)where in our experiments we set β to 200. Having estimatedread origins by the node embedding matrix Z , the reads canbe organized into k clusters. This enables straightforwardreconstruction of haplotypes by determining the consensussequence for each cluster. Thus far, we have conveniently been representing alleles asthe numbers in { , , , } . It is desirable, however, that inthe deﬁnition of a loss function the distance between nu-merical values representing any two alleles is identical, nomatter which pair of alleles is considered; this ensures theloss function relates to the MEC score – the metric of interestin haplotype assembly problems. Following (Ahn 2018), wedeﬁne the loss function of the auto-encoder as the squaredFrobenius norm of the difference between a one-hot SNPfragment matrix R and the reconstructed matrix ˆ R = Z H at the informative positions, i.e., L = ||P Ω ( R − Z H ) || F ,where R ∈ { , } m × n and H ∈ { , } k × n are formedby substituting discrete values w ∈ { , , , } by the setof four dimensional standard basis vectors e (4) i , ≤ i ≤ .With such a notational convention, the proposed loss func-tion approximates the MEC score; it only approximates thescore, rather than coincides with it, because Z is an approx-imation of the read-origin matrix U . Therefore, the graphauto-encoder is trained to approximately minimize the MECscore. Fig. 2 illustrates the data processing pipeline that takesas inputs reads in the SNP fragment matrix and produces thematrix of haplotypes as well as imputes missing entries in theSNP fragment matrix. The proposed graph auto-encoders forhaplotype assembly and viral quasispecies reconstruction areformalized as Algorithm 1 and Algorithm 2, respectively. Forthe viral quasispecies reconstruction problem, the number ofclusters k is typically unknown; detailed strategy based on(Ahn 2018) for the automated inference of k can be found inSupplementary Document B. The hyper-parameters of GAEseq are determined by train-ing on 5 synthetic triploid datasets with coverage 30 × and Algorithm 1

Graph auto-encoder for haplotype assembly Input:

SNP fragment matrix R , the number of experi-ments n exp and the number of haplotpyes k Output:

Reconstructed haplotypes H while n exp (cid:54) = 0 do

4: Initialize W ( i ) w , B ( i ) w , W d and B d using Xavier initializa-tion where w ∈ { , , , } and i ∈ { , } for n epoch = 1 to do M (1) ← σ ( (cid:80) w D − s A Tw RW (1) w + B (1) w ) M (2) ← σ ( (cid:80) w D − r A w M (1) W (2) w + B (2) w ) O ← σ ( M (2) W d + B d ) Z ij ← e βOij (cid:80) kj =1 e βOij with β = 200

10: Calculate H by majority voting11: L ← ||P Ω ( R − Z H ) || F

12: Record reconstructed haplotypes and the MEC score13: Update W ( i ) w , B ( i ) w , W d and B d using Adam Optimizerwhere w ∈ { , , , } and i ∈ { , } end for n exp ← n exp − end while Output the reconstructed haplotypes H corresponding tothe lowest MEC score Algorithm 2

Graph auto-encoder for viral quasispecies re-construction Input:

SNP fragment matrix R , the number of experi-ments n exp , the MEC improvement rate threshold η andthe estimated initial number of components k Output:

Reconstructed viral haplotypes H and the in-ferred frequencies Initial τ ← , MECﬂag ← and k τ ← k while τ = 0 or k τ = k τ − do for k ∈ { k τ , k τ + 1 } do

6: Run Algorithm with k end for if MECimpr ( k τ ) ≤ η then k τ +1 ← (cid:98) ( k τ + max { , k i } ) / (cid:99) , { i ∈{ , · · · , τ - } : k i ≤ k τ } ; MECﬂag ← else if MECﬂag = 0 then k τ +1 ← k τ else k τ +1 ← (cid:98) ( k τ + min k i ) / (cid:99) , { i ∈ { , · · · , τ - } : k i > k τ } end if end if τ ← τ + 1 end while Output the viral quasispecies H with k = k τ + 1 and theinferred frequenciesvalidated on different 5 synthetic triploid datasets with thesame coverage. The results reported in this section are ob-tained on test data. Detailed description of the computationalplatform and the choice of hyper-parameters can be found inSupplementary Document A.able 1: Performance comparison on biallelic Solanum Tuberosum semi-experimental data.MEC CPRCoverage Mean SD Mean SD15 GAEseq H-PoP 28.700 32.667 0.783 0.066AltHap 59.100 28.125 0.709 0.05425 GAEseq

AltHap 92.600 83.649 0.756 0.06835 GAEseq

H-PoP 41.700 53.971 0.823 0.094AltHap 164.000 101.583 0.754 0.093Figure 3: The precision-recall curves for Solanum Tuberosum semi-experimental data with coverage 15 × , 25 × and 35 × Table 2: Performance comparison of GAEseq, PredictHap, TenSQR and aBayesQR on a real HIV-1 5-virus-mix data. Geneswhere all the strains are perfectly reconstructed are denoted as boldface. p17 p24 p2-p6 PR RT RNase int vif vpr vpu gp120 gp41 nefGAEseq PredProp HXB

100 100 100 100 100 100 100

100 96.2 96.7 100CPR .

100 100 100 100 100 100 100 JR − SCF

100 100 100 100 100 100 100

100 99.9 100 99.3CPR NL −

100 100 100 100 100 100 100

100 100 100 99.8CPR

Y U

100 100 100 100 100 100 100

100 99.6 100 98.1PredictHap PredProp HXB

100 100 100 .

100 100 100

100 99.8 100 JR − SCF

100 100 100

100 99.7 100 100CPR NL −

100 100 100

100 100 100 100CPR

Y U

100 100 100

100 98.6 100 100TenSQR PredProp HXB

100 99.2

100 100 100 100 .

100 98.0

100 100 100 100 JR − SCF

100 100

100 100 100 100

100 98.3 97.7 99.8CPR NL −

100 99.5

100 100 100 100

100 99.8 99.5 99.7CPR

Y U

100 100 100 100

100 94.9 100 98.6aBayesQR PredProp 1 1 HXB

100 100 .

100 100

92 96.5 98.9 95.5CPR JR − SCF

100 100

100 100 NL −

100 100

100 99.8

100 96.3 98.8 100CPR

Y U

100 100

100 0 98.6 99.2 .1 Performance comparison on biallelicSolanum Tuberosum semi-experimental data

We ﬁrst evaluate performance of GAEseq on realistic simula-tions which, for convenience and to distinguish from perhapsmore rich synthetic and experimental data discussed in sup-plementary documents, we refer to as ”semi-experimentaldata”. The semi-experimental data is obtained by simulatingmutations, shotgun sequencing procedure, read alignmentand SNP calling steps in a ﬁctitious experiment on a singleindividual

Solanum Tuberosum (polyploid with k = 4 ). De-tails on how exactly the semi-experimental data is generatedand processed can be found in Supplementary Document C.We compare the performance of GAEseq on this data withpublicly available software HapCompass (Aguiar 2012), analgorithm that relies on graph-theoretic models to performhaplotype assembly, H-PoP (Xie 2016), a dynamic program-ming method, and AltHap (Hashemi 2018), a method basedon tensor factorization. The performance of different methodsis evaluated in terms of the MEC score and CPR. All the con-sidered softwares were executed with their default settings,i.e. we follow instructions in the papers they were originallyproposed; there are no parameter tuning steps required forthese methods. We report the MEC scores and CPR achievedby the considered algorithms in Table 1. For each sequenc-ing coverage, the mean and standard deviation (SD) of theadopted metrics are evaluated over 10 samples. As shown inthe table, GAEseq achieves the lowest average MEC score aswell as the lowest standard deviation of the MEC score at allsequencing coverage settings. Moreover, GAEseq achievesthe highest average CPR at all coverage settings. Note thatthe MEC score increases with sequencing coverage sincehigher coverage implies more reads. The results demonstratethat the adopted graph abstraction enables GAEseq to achievehigh accuracy of the reconstruction task by learning poste-rior probabilities of the origins of reads. Fig. 3 shows theprecision-recall curves for data with coverage 15 × , 25 × and35 × . Note that GAEseq performs very accuratly at high se-quencing coverage while its performance deteriorates at lowcoverage. An extended version of Table 1 with additionalcoverage settings is in Supplementary Document C.We further test the performance of GAEseq on simulatedbiallelic diploid, polyallelic triploid and tetraploid data, andon real Solanum Tuberosum data; in addition to H-Pop, Al-tHap and HapCompass, comparisons on diploid data alsoinclude performance of HapCUT2 (Edge 2017). GAEseqoutperforms all the considered algorithms by achieving lowerMEC score and higher CPR. Further details can be found inSupplementary Document D and E. The real HIV-1 data with pairwise distances between . − . and relative frequencies between and is an in vitro viral population of 5 known HIV-1 strains gener-ated by Illumina’s MiSeq Benchtop Sequencer (Di 2014).These reads are then aligned to the HIV- HXB referencegenome. According to (Di 2014), we remove reads of lengthlower than 150bp and mapping quality scores lower than 60 for better results. We compare the performance of GAE-seq on gene-wise reconstruction of the HIV population tothat of other state-of-the-art methods such as PredictHaplo(Prabhakaran 2014), TenSQR (Ahn 2018) and aBayesQR(Ahn 2017), following their default settings. For fair bench-marking, we use the same dataset as (Ahn 2018) which iswhy the results of our benchmarking tests match those in(Ahn 2018). The correct phasing rate and the inferred strainfrequencies are evaluated for all reconstructed strains be-cause the ground truth for the 5 HIV-1 strains is available at(https://bmda.dmi.unibas.ch/software.html). Following (Ahn2018), we evaluate predicted proportion by setting the pa-rameter η needed to detect the number of HIV-1 strains to . . The results in Table 2 show that GAEseq perfectlyreconstructs all 5 HIV-1 strains in 8 genes while other meth-ods correctly reconstruct components in 5 or 6 genes. Thisdemonstrates that GAEseq’s inference of read origins basedon posterior probabilities enables high accuracy of the recon-struction tasks. Regarding the 5 genes where GAEseq andother methods do not achieve perfect reconstruction (p24,vpu, gp120, gp41, nef): closer examination of viral strainsreconstructed by various methods suggests translocationsof short viral segments within those 5 genes in the “goldstandard” dataset created by (Di 2014). Those short transloca-tion cause mismatch between the actual ground truth and thesequences (Di 2014) generated. Further results on reconstruc-tion of HIV viral communities can be found in SupplementDocument F. In this article, we introduce auto-encoders to the problem ofreconstructing components of a genomic mixture from high-throughput sequencing data that is encountered in haplotypeassembly and analysis of viral communities. In particular, agraph auto-encoder is trained to group together reads thatoriginate from the same component of a genomic mixtureand impute missing information in the SNP fragment matrixby learning from the available data. The graph convolutionalencoder attempts to discover origin of the reads while thedecoder aims to reconstruct haplotypes and impute missinginformation, effectively correcting sequencing errors. Studieson semi-experimental data show that GAEseq can achievesigniﬁcantly lower MEC scores and higher CPR than thecompeting methods. Benchmarking tests on simulated andexperimental data demonstrate that GAEseq maintains goodperformance even at low sequencing coverage. Studies onreal HIV-1 data illustrate that GAEseq outperforms existingstate-of-the-art methods in viral quasispecies reconstruction.

References [Schwartz 2010] Schwartz, R. 2010. Theory and algorithms for thehaplotype assembly problem.

Communications in Info. & Sys. vol.10, no. 1, 2010, 23–38.[Clark 2004] Clark, A. G. 2004. The role of haplotypes in candidategene studies.

Genet Epidemiol. vol. 27, no. 4, 2004, 321–33.[Sabeti 2002] Sabeti, P.; Reich, D.; Higgins, J.; Levine, H.; andRichter, D. 2002. Detecting recent positive selection in the hu-man genome from haplotype structure.

Nature , 419(6909):832–37,2002.Lancia 2001] Lancia, G.; Bafna, V.; Istrail, S.; Lippert, R.; andSchwartz, R. 2001. SNPs problems, complexity, and algorithms.

European symposium on algorithms , vol. 1, Springer: 2001.182–93.[Duitama 2010] Duitama, J.; Huebsch, T.; McEwen, G.; Suk, E.; andHoehe, MR. 2010. Refhap: a reliable and fast algorithm for singleindividual haplotyping.

Proceedings of the First ACM InternationalConference on Bioinformatics and Computational Biology , ACM:2010. 160–169.[Lippert 2002] Lippert, R.; Schwartz, R.; Lancia, G.; and Istrail, S.2002 Algorithmic strategies for the single nucleotide polymorphismhaplotype assembly problem.

Brief Bioinformatics , 2002, 3(1):23–31.[Xie 2016] Xie, M.; Wu, Q.; Wang, J.; and Jiang, T. 2016. H-PoPand H-PoPG: Heuristic partitioning algorithms for single individualhaplotyping of polyploids.

Bioinformatics , 2016, 32(24):3735–44.[Pirola 2015] Pirola, Y.; Zaccaria, S.; Dondi, R.; Klau, G.; Pisanti,N.; and Bonizzoni, P. 2015. Hapcol: accurate and memory-efﬁcienthaplotype assembly from long reads.

Bioinformatics , 2015, 32(11),1610–1617.[Kuleshov 2014] Kuleshov, V. 2014. Probabilistic single-individualhaplotyping.

Bioinformatics , 2014, 30(17):379–85.[Patterson 2015] Patterson, M.; Marschall, T.; Pisanti, N.; Van, IerselL.; and Stougie, L. 2015. Whatshap: weighted haplotype assem-bly for future-generation sequencing reads.

J Comput Biol. , 2015,22(6):498–509.[Bonizzoni 2016] Bonizzoni, P.; Dondi, R.; Klau, G.; Pirola, Y.;Pisanti, N.; and Zaccaria, S. 2016. On the minimum error cor-rection problem for haplotype assembly in diploid and polyploidgenomes.

J Comput Biol. , 2016, 23(9):718–36.[Edge 2017] Edge, P.; Bafna, V.; and Bansal, V. 2017. Hapcut2:robust and accurate haplotype assembly for diverse sequencingtechnologies.

Genome Res. , 2017; 27(5):801–12.[Aguiar 2012] Aguiar, D., and Istrail, S. 2012. Hapcompass: a fastcycle basis algorithm for accurate haplotype assembly of sequencedata.

J Comput Biol.

Nucleic Acids Res. , 2011, 40(5), 2041–2053.[Das 2015] Das, S.; and Vikalo, H. 2015. SDhaP: haplotype as-sembly for diploids and polyploids via semi-deﬁnite programming.

BMC Genomics , 2015; 16(1):260.[Puljiz 2016] Puljiz, Z.; and Vikalo, H. 2016. Decoding genetic vari-ations: communications-inspired haplotype assembly.

IEEE/ACMTrans Comput Biol Bioinform (TCBB) , 2016; 13(3):518–30.[Berger 2014] Berger, E.; Yorukoglu, D.; Peng, J.; and Berger, B.2014. Haptree: A novel Bayesian framework for single individ-ual polyplotyping using NGS data.

PLoS Comput Biol.

Bioinformatics (Oxford, England) , 34(13), i23–i31.[Wang 2005] Wang, R.; Wu, L.; Li, Z.; and Zhang, X. 2005. Haplo-type reconstruction from SNP fragments by minimum error correc-tion.

Bioinformatics , 2005, 21(10):2456–2462.[Chen 2013] Chen, Z.; Deng, F.; and Wang, L. 2013. Exact algo-rithms for haplotype assembly from whole-genome sequence data.

Bioinformatics , 2013, 29(16):1938–1945.[Levy 2007] Levy, S.; Sutton, G.; Ng, P.; Feuk, L.; and Halpern, A.2007 The diploid genome sequence of an individual human.

PLoSBiol. , 2007, 5(10):254. [Bansal 2008] Bansal, V.; Halpern, A.; Axelrod, N.; and Bafna, V.2008. An MCMC algorithm for haplotype assembly from whole-genome sequence data.

Genome Res. , 2008; 18(8):1336–46.[Kim 2007] Kim, J.; Waterman M.; and Li, L. 2007. Diploid genomereconstruction of ciona intestinalis and comparative analysis withciona savignyi.

Genome Res.

MIT Press .[Lippert 2002] Lippert, R.; Schwartz, R.; Lancia, G.; and Istrail, S.2002. Algorithmic strategies for the single nucleotide polymorphismhaplotype assembly problem.

Brieﬁngs in bioinformatics , 3(1), 23–31.[Motazedi 2018] Motazedi, E.; Finkers, R.; Maliepaard, C.; and deRidder, D. 2018. Exploiting next-generation sequencing to solvethe haplotyping puzzle in polyploids: a simulation study.

Brieﬁngsin bioinformatics , 19(3), 387–403.[Cai 2016] Cai, C.; Sanghavi, S.; and Vikalo, H. 2016. Structuredlow-rank matrix factorization for haplotype assembly.

IEEE J SelTop Sign Proc.

BMC Genomics , 2018; 19(4), 191.[Rianne 2017] Rianne van den Berg, Thomas N. Kipf andMax Welling 2017. Graph Convolutional Matrix Completion. arXiv:1706.02263 .[Thomas 2016] Kipf, Thomas N. and Max Welling 2016. VariationalGraph Auto-Encoders.

CoRR , abs/1611.07308.[Di 2014] Di, F.; T¨opfer, A.; Rey, M.; Prabhakaran, S.; Duport, Y.;Leemann, C.; Schmutz, S.; Campbell, N. K.; Joos, B.; and Lecca,M. R. 2014. Full-length haplotype reconstruction to infer the struc-ture of heterogeneous virus populations.

Nucleic acids research ,42(14), e115.[Ahn 2017] Ahn, S.; and Vikalo, H. 2017. aBayesQR: A bayesianmethod for reconstruction of viral populations characterized by lowdiversity. In

International Conference on Research in Computa-tional Molecular Biology , pages 353–369. Springer.[Zagordi 2011] Zagordi, O.; Bhattacharya, A.; Eriksson, N.; andBeerenwinkel, N. 2011. Shorah: estimating the genetic diversityof a mixed sample from next-generation sequencing data.

BMCbioinformatics , 12(1), 119.[Astrovskaya 2011] Astrovskaya, I.; Tork, B.; Mangul, S.; West-brooks, K.; M˘andoiu, I.; Balfe, P.; Zelikovsky, A. 2011. Inferringviral quasispecies spectra from 454 pyrosequencing reads.

BMCbioinformatics , 12(6), 1.[Prosperi 2012] Prosperi, M. C; and Salemi, M. 2012. Qure: soft-ware for viral quasispecies reconstruction from next-generationsequencing data.

Bioinformatics , 28(1), 132–133.[Prabhakaran 2014] Prabhakaran, S.; Rey, M.; Zagordi, O.; Beeren-winkel, N.; and Roth, V. 2014. Hiv haplotype inference using apropagating dirichlet process mixture model.

IEEE/ACM Trans. onComput. Biol. Bioinform. (TCBB) , 11(1), 182–191.[T¨opfer 2013] T¨opfer, A.; Zagordi, O.; Prabhakaran, S.; Roth, V.;Halperin, E.; and Beerenwinkel, N. 2013. Probabilistic inference ofviral quasispecies subject to recombination.

Journal of Computa-tional Biology , 20(2), 113–123.[Socher 2011] Socher, R.; Pennington, J.; Huang, Eric H.; Ng, An-drew Y.; and Manning, Christopher D. 2011. Semi-supervisedRecursive Autoencoders for Predicting Sentiment Distributions.

Conference on Empirical Methods in Natural Language Processing ,EMNLP(11), 151-161.Fukushima 1975] Fukushima K. C. 1975. a self-organizing multi-layered neural network.

Biol Cybern , 20(3-4), 121-36.[Masci 2011] Masci J.; Meier U.; and Ciresan D. 2011. Stackedconvolutional auto-encoders for hierarchical feature extraction.

Ar-tiﬁcial Neural Networks and Machine Learning ICANN , 529.

Supplementary Document A : Computationalsettings

The models were implemented on a 3.70GHz Intel i7-8700K proces-sor, 2 NVIDIA GeForce GTX 1080Ti computer graphics cards and32GB RAM. Randomness of the initial weights of the auto-encodermay cause the neural network to remain in a local minimum dur-ing training. To overcome this, we run GAEseq multiple times andchoose the result with the lowest MEC score. Given a SNP fragmentmatrix, we run GAEseq 200 times, train 200 models and get 200reconstructed haplotype matrix candidates; the algorithm selectsthe candidate corresponding to the lowest MEC score automatically.In the GAEseq software, users can specify how many times to runthe algorithm; the software will run automatically as opposed tomanually running GAEseq multiple times.As for hyperparameters, the number of graph convolutional lay-ers in the auto-encoder was set to 2 (in particular, one read-nodes-to-SNP-nodes layer and one SNP-nodes-to-read-nodes layer); di-mension of messages reduces linearly during message passing be-tween the layers. For example, if the number of reads is m , thehaplotype length is n , the ploidy is k and we use 2 graph con-volutional layers and a dense layer, the dimensions of the weightand bias matrix of the ﬁrst and second layer are n × (cid:6) ( n − n − k ) (cid:7) and (cid:6) ( n − n − k ) (cid:7) × (cid:6) ( n − × n − k ) (cid:7) , respectively. The dimen-sions of the weight and bias matrix of the denser layer are set to (cid:6) ( n − × n − k ) (cid:7) × k and m × k , respectively. We use the Adamoptimizer (Diederik 2015), set the step size to 0.0001, and set allother parameters to their default values in Tensorﬂow. We also useXavier initialization (Xavier 2010) with default settings and thenumber of epoches set to 100. In our studies, we found that an archi-tecture with two graph convolutional layers achieves signiﬁcantlybetter results than state-of-the-art haplotype assembly methods. Toensure the model generalizes well to unobserved entries, we addeddropout regularization to message passing between the layers; foreach layer, the dropout probability is 0.1. For all experiments onviral quasispecies reconstruction, the initial number of clusters k is set to 2. Supplementary Document B : Determining thenumber of components in a viral quasispecies

When reconstructing haplotypes sampled by a collection of sequenc-ing reads, GAEseq requires as input the number of haplotypes, k .While the ploidy of an individual organism in the haplotype as-sembly problem is known a priori, cardinality of a viral communityneeds to be estimated. To determine k , we examine the improvementrate of the MEC score deﬁned asMECimpr ( k ) = MEC ( k ) − MEC ( k + 1) MEC ( k ) . (10)Recall that the MEC score is deﬁned as the smallest number of theobserved entries in R that need to be altered such that the resultingdata is consistent with having originated from k distinct haplotypes.The score decreases monotonically with k ; however, once k reachesthe actual number of components, the improvement rate of the MECscore (MECimpr) saturates. To ﬁnd the saturation point, we compareMECimpr with a pre-deﬁned threshold. Following (Ahn 2018), thenumber of components is determined via binary search. Speciﬁcally,starting from an initial k , the number of components is updated as k τ ← k τ − until MECimpr ( k τ ) ≤ η ; at this point, the number ofcomponents starts to decrease as k τ +1 ← (cid:98) ( k τ + max { , k i } ) / (cid:99) where { i ∈ { , · · · , τ − } : k i ≤ k τ } . Once MECimpr ( k τ ) > η ,the number of components increases again as k τ +1 ← (cid:98) ( k τ +min k i ) / (cid:99) where { i ∈ { , · · · , τ − } : k i > k τ } . If k τ = k τ − ,the search procedure stops by assigning k τ +1 ← k τ + 1 whichis the estimated number of strains. The recommended choice ofthe threshold η is discussed in (Ahn 2017) where the estimation ofthe number of components via MECimpr was demonstrated to berobust with respect to the choice of the threshold. Supplementary Document C : Performancecomparison on biallelic Solanum Tuberosumsemi-experimental data

The semi-experimental data is obtained by simulating mutations,shotgun sequencing procedure, read alignment and SNP callingsteps in an experiment on a single individual

Solanum Tuberosum .In particular, we use

Haplogenerator (Motazedi 2018) to generatehaplotypes by introducing independent mutations that follow thelognormal distribution of a randomly selected genome region from

Solanum Tuberosum chromosome 5 (Potato Genome SequencingConsortium 2011) of length 5000 bp. The mean distance betweenneighboring SNPs and the standard deviation (SD) are set to 21bp and 27 bp, respectively, as previously suggested by (Motazedi2018). Due to

Haplogenerator ’s limitations, we constrain mutationsto transitions and do not consider transversions (i.e., mutationsare constrained to be between A and C and between G and T). × bp-long Illumina’s MiSeq reads of inner distance 50 bpand standard deviation 10 bp are generated to uniformly samplehaplotypes using ART software (Huang 2012) with default setting.Following this step, the generated reads are aligned to the referencegenome using the BWA-MEM algorithm (Li 2009); the reads havingmapping quality score lower than 60 or being shorter than 70 bp arediscarded. SNPs are called if, at any given site, the abundance of aminor allele exceeds a predetermined threshold; the SNP fragmentmatrix is formed by collecting all such heterozygous sites. Sevendifferent sets of semi-experimental data obtained by sampling atvarying coverage (10 × , 15 × , 20 × , 25 × , 30 × , 35 × and 40 × ) aregenerated; each set consists of 10 samples. We ﬁrst generate genomeregions of length 5000 bp by partitioning the Solanum Tuberosum chromosome 5 and then randomly select 70 among them (generatedhaplotypes and reads are different for each sample). The sequencingerror rate is automatically set by the built-in quality proﬁles of

ART inferred from large amounts of recalibrated sequencing data (Huang2012). Table 1 shows the performance comparison of GAEseq,AltHap, HapCompass and H-PoP on biallelic Solanum Tuberosumsemi-experimental data.

Supplementary Document D : Performancecomparison on simulated biallelic diploid data andpolyallelic triploid and tetraploid data.

To further test GAEseq, we evaluate its performance on syntheticdata. Once again we use

Haplogenerator (Motazedi 2018) to gen-erate haplotypes of a randomly synthesized reference genome oflength 5000 bp. The mean distance between neighboring SNPs andthe standard deviation (SD) are set to 5 bp and 3 bp respectively,creating haplotype blocks of length about 500. All the possiblemutations were allowed and set to be equally likely, leading to notonly biallelic but also polyallelic SNPs in the synthesized haplotypedata. Illumina’s MiSeq read generation, read alignment and SNPcalling procedures are implemented following the same procedureas in the case of semi-experimental data from Section 3.1. The datasynthesized in this fashion consists of different sets, each with able 3: Performance comparison of GAEseq, AltHap, HapCompass and H-PoP on biallelic Solanum Tuberosum semi-experimental data. MEC CPRCoverage Mean SD Mean SD10 GAEseq H-PoP 19.700 25.254 0.803 0.086AltHap 64.100 32.953 0.727 0.07215 GAEseq

H-PoP 28.700 32.667 0.783 0.066AltHap 59.100 28.125 0.709 0.05420 GAEseq

H-PoP 30.500 37.023 0.791 0.078AltHap 82.100 56.658 0.737 0.06825 GAEseq

AltHap 92.600 83.649 0.756 0.06830 GAEseq

HapCompass 306.800 187.934 0.796 0.081H-PoP 34.200 32.798 0.879 0.088AltHap 263.000 499.659 0.762 0.13335 GAEseq

H-PoP 41.700 53.971 0.823 0.094AltHap 164.000 101.583 0.754 0.09340 GAEseq

HapCompass 208.000 176.699 0.833 0.070H-PoP 30.4 28.487 0.823 0.102AltHap 195.8 281.641 0.762 0.084

10 samples, as we explore different ploidy (k = 2, 3 and 4) andsequencing coverage (5 × , 10 × , 15 × , 20 × , 25 × , 30 × , 35 × and40 × ).For the diploid synthetic data sets, we represent an allele by 0 ifit coincides with the corresponding reference allele and by 1 if it isan alternative allele. SNP positions with only alternative alleles areremoved. In addition to H-PoP, HapCompass and AltHap, we alsocompare GAEseq with HapCUT2 (Edge 2017); by design, use ofHapCUT2 is limited to haplotype assembly of diploids. The metricsof performance are the previously introduced MEC score and CPR.Table 2 shows the mean and standard deviation of the MEC scoreand CPR for diploid data. The results are evaluated over 10 samplesfor each combination of ploidy and coverage. GAEseq achievesthe lowest average MEC score and the lowest standard deviation ofthe MEC score for almost all coverage settings; its performance isfollowed by those of H-PoP, HapCompass, HapCut2 and AltHap.The average CPR achieved by GAEseq is very close to 1 for allcoverage settings, indicating that GAEseq is able to near-perfectlyreconstruct haplotypes of diploid species even when the coverage isvery low; its performance is followed by those of H-PoP, HapCut2,HapCompass and AltHap. When the coverage is × , the averageCPR achieved by GAEseq is while it is approximately . , . , . and . for H-PoP, HapCut2, HapCompass andAltHap, respectively.For the polyploid synthetic data sets, both H-PoP and HapCom-pass are restricted to reconstruction of biallelic haplotypes and are not applicable to the assembly of polyallelic ones. Furthermore,recall that HapCUT2 can only be applied to diploid haplotypes.We therefore limit performance comparison of GAEseq on poly-ploid synthetic data to only AltHap; Tables 3 and 4 illustrate themean and standard deviation of the MEC score and CPR for triploidand tetraploid data, respectively. The results are evaluated over 10samples for each combination of ploidy and coverage. As can beseen in these tables, GAEseq outperforms AltHap for all ploidy andcoverage settings. As shown in Table 3, GAEseq performs well ontriploid data, achieving average CPR and relatively small stan-dard deviation even for the low coverage of × ; at the same time,performance of AltHap deteriorates rapidly with increased ploidy,achieving average CPR while GAEseq achieves . at cov-erage × . As illustrated in Table 4, in applications to tetraploiddata the performance of GAEseq starts to gracefully deteriorate– when the coverage is × , GAEseq achieves average CPR ofapproximately while in the same scenario AltHap achieves av-erage CPR of approximately . When the coverage is increasedto × , GAEseq achieves average CPR of approximately . while AltHap achieves average CPR of approximately . . Supplementary Document E : Performancecomparison on real Solanum Tuberosum data

We further test the performance of GAEseq on real potato data(accession SRR6173308) at

Solanum Tuberosum chromosome 5 able 4: Performance comparison of GAEseq, HapCut2, HapCompass, H-PoP and AltHap on simulated biallelic diploid data.MEC CPRCoverage Mean SD Mean SD5 GAEseq

HapCUT2 110.500 23.922 0.975 0.006HapCompass 87.500 25.903 0.965 0.010H-Pop 40.000 30.551 0.989 0.011AltHap 884.200 659.565 0.699 0.20410 GAEseq

HapCUT2 213.600 63.132 0.980 0.005HapCompass 159.600 58.329 0.974 0.005H-Pop 34.600 6.736 0.997 0.004AltHap 583.900 948.344 0.796 0.21815 GAEseq

HapCUT2 339.800 59.066 0.978 0.003HapCompass 268.300 67.003 0.971 0.005H-Pop 47.900 9.539 0.998 0.002AltHap 342.900 379.213 0.852 0.16920 GAEseq

HapCUT2 519.400 57.386 0.972 0.010HapCompass 408.000 81.067 0.961 0.018H-Pop 129.700 191.788 0.989 0.030AltHap 668.400 579.261 0.787 0.20125 GAEseq

HapCUT2 613.000 157.786 0.977 0.006HapCompass 460.700 97.637 0.968 0.007H-Pop 85.700 17.192 0.998 0.003AltHap 1151.600 649.058 0.743 0.15030 GAEseq

HapCUT2 685.300 180.714 0.979 0.006HapCompass 591.600 150.400 0.968 0.009H-Pop 98.000

HapCUT2 827.600 202.643 0.978 0.006H-Pop 702.200 180.647 0.968 0.007H-Pop 107.900

HapCUT2 1015.400 219.442 0.977 0.006HapCompass 896.500 204.603 0.965 0.008H-Pop 124.500

AltHap 1073.300 1099.181 0.847 0.184 (Potato Genome Sequencing Consortium 2011). The 10 samplesof real potato data are generated by ﬁrst randomly selecting 10genome regions of length varying from 5032 to 7573 and thenaligning the Illumina HiSeq 2000 paired-end reads to the selectedgenome regions. After the read alignment step using the BWA-MEMalgorithm (Li 2009), the SNP calling step is implemented to createthe SNP fragment matrix. Reads having mapping quality score lowerthan 60 or shorter than 70 bp are discarded. Since the ground truthhaplotypes are not available for this dataset, we only evaluate theperformance of GAEseq and the competing methods in terms ofthe MEC score. Table 5 compares the performance of GAEseq,AltHap, HapCompass and H-PoP averaged over 10 selected regionsof the real Solanum Tuberosum data. As can be seen from the table, GAEseq outperforms all the competing schemes in terms of boththe average MEC score and its standard deviation, achieving 379.8average MEC score. GAEseq is followed by H-PoP and AltHapwhile HapCompass achieves the highest average MEC score.

Supplementary Document F : Further results onreconstruction of HIV viral communities

Table 6 shows the gene-wise reconstruction results on the real HIV-1data that include inferred frequencies (omitted from Table 2 in themain paper for brevity).We further evaluate the performance of GAEseq on the 4036bplong gag-pol region. Following (Ahn 2018), we divide the gag-polregion into overlapping blocks, reconstruct the viral components in able 5: Performance comparison of GAEseq and AltHap on simulated polyallelic triploid data.MEC CPRCoverage Mean SD Mean SD5 GAEseq

AltHap 1908.500 237.324 0.559 0.05910 GAEseq

AltHap 1769.300 948.754 0.760 0.09115 GAEseq

AltHap 1058.100 864.563 0.796 0.12320 GAEseq

AltHap 1287.100 578.507 0.682 0.07025 GAEseq

AltHap 1430.200 757.482 0.775 0.09330 GAEseq

AltHap 2133.200 1082.576 0.729 0.10935 GAEseq

AltHap 2928.700 869.617 0.723 0.07540 GAEseq

AltHap 2943.600 1113.480 0.737 0.104Table 6: Performance comparison of GAEseq and AltHap on simulated polyallelic tetraploid data.MEC CPRCoverage Mean SD Mean SD5 GAEseq

AltHap 2641.700 410.159 0.544 0.05610 GAEseq

AltHap 2807.200 938.668 0.658 0.07515 GAEseq

AltHap 2742.500 1055.672 0.718 0.08120 GAEseq

AltHap 1929.700 1008.766 0.729 0.06325 GAEseq

AltHap 1987.100 1091.893 0.779 0.08430 GAEseq

AltHap 2265.200 1277.366 0.759 0.05135 GAEseq

AltHap 3906.400 1131.654 0.747 0.05640 GAEseq

AltHap 3775.300 1036.702 0.762 0.075Table 7: Performance comparison of GAEseq, AltHap, Hap-Compss and H-PoP on the real Solanum Tuberosum data.MECMean SDGAEseq

HapCompass 2726 2393.7H-PoP 409.5 282.24AltHap 742.1 469.5 each block independently, and combine the results to reconstructthe full region of interest. Speciﬁcally, the region is divided into asequence of blocks of length 500bp where the consecutive blocksoverlap by 250bp. We run GAEseq to perform reconstruction of viral components in each of the total 18 blocks and merge the resultsto retrieve the entire region of interest. Particularly, the mismatchesbetween strains reconstructed on two consecutive blocks in the over-lapping region are corrected based on majority voting using readsthat are covering the mismatched positions and are assigned to thealigned strains. Following this procedure, GAEseq perfectly recon-structed all of 5 HIV-1 strains in the gag-pol region, achieving 100%Reconstruction Rate for all 5 strains and Predicted Proportion of 1on 355241 remained paired-end reads. The frequencies of 5 HIV-1 strains are estimated as . , . , . , . and . by counting the proportion of reads assigned to the samestrain; these results are consistent with the frequencies estimated byaBayesQR and TenSQR softwares. a b l e : P e rf o r m a n cec o m p a r i s on s o f GA E s e q , T e n S Q R , a B a y e s Q R a nd P r e d i c t H a pon a r ea l H I V - - v i r u s - m i xd a t a . p17p24p2 - p6 P RR T R N a s e i n t v i f vp r vpugp120gp41n e f GA E s e q P r e d P r op . . C P R H X B ( . ) . ( . ) ( ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) . ( . ) . ( . ) ( . ) C P R . ( . ) . ( . ) ( ) ( ) ( . ) ( . ) ( . ) ( . ) ( . ) . ( . ) . ( . ) ( . ) . ( . ) C P R J R − S C F ( . ) ( . ) ( ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) . ( . ) ( ) . ( . ) C P R N L − ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) . ( . ) C P R Y U ( . ) ( . ) ( ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) . ( . ) ( ) . ( . ) P r e d i c t H a p P r e d P r op . . . . . . . . C P R H X B ( . ) ( ) ( . ) ( . ) ( . ) . ( . ) ( . ) ( . ) ( . ) . ( . ) ( ) ( ) ( ) C P R . ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) . ( . ) ( . ) ( . ) ( ) . ( . ) ( . ) . ( . ) C P R J R − S C F ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) . ( . ) ( . ) ( . ) C P R N L − ( ) . ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) C P R Y U ( . ) ( ) ( . ) ( . ) ( . ) ( ) ( ) ( ) ( . ) ( . ) . ( . ) ( . ) ( . ) T e n S Q R P r e d P r op . . . . . . C P R H X B ( . ) . ( . ) () . ( . ) . ( . ) ( . ) ( . ) ( . ) ( . ) . ( . ) . ( ) . ( . ) ( ) C P R . ( . ) ( . ) ( . ) ( . ) . ( . ) ( . ) ( . ) ( ) ( . ) . ( ) . ( . ) ( . ) . ( ) C P R J R − S C F ( . ) ( ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) . ( . ) . ( . ) . ( ) C P R N L − ( ) . ( . ) ( . ) ( . ) . ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) . ( . ) . ( . ) . ( . ) C P R Y U ( . ) . ( . ) ( . ) . ( . ) . ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) . ( . ) ( . ) . ( . ) a B a y e s Q R P r e d P r op11 . . . . C P R H X B ( . ) . ( . ) ( . ) ( . ) . ( . ) ( . ) . ( . ) ( . ) ( . ) . ( ) ( . ) ( ) . ( . ) C P R . ( . ) . ( ) ( . ) ( . ) . ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) . ( . ) . ( . ) . ( . ) C P R J R − S C F ( . ) . ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( . ) ( ) . ( . ) . ( . ) . ( . ) . ( . ) C P R N L − ( . ) ( . ) ( . ) ( . ) . ( . ) ( ) ( . ) . ( . ) ( . ) ( . ) . ( ) . ( . ) ( . ) C P R Y U ( . ) . ( . ) ( . ) ( . ) . ( . ) ( . ) . ( . ) . ( ) ( . ) ( . ) ( ) . ( . ) . ( ) P r e d i c t e d P r opo r ti on ( P r e d P r op ) a nd C o rr ec t P h a s i ng R a t e ( C P R ( % ))f o r GA E s e q , P r e d i c t H a p l o , T e n S Q R a nd a B a y e s Q R a pp li e d t o r ec on s t r u c ti ono f H I V - HX B , H I V - . , H I V - J R - C SF , H I V - N L - a nd H I V - YU f o r a ll

13 g e n e s o f t h e H I V - a t a s e t . F r e qu e n c i e s a r e r e po r t e d i np a r e n t h e s i s . eferences [Diederik 2015] Diederik P. Kingma and Jimmy Ba 2015. Adam: AMethod for Stochastic Optimization. arXiv:1412.6980 [cs.LG] .[Xavier 2010] Xavier Glorot and Yoshua Bengio 2010. Understand-ing the difﬁculty of training deep feedforward neural networks. Proceedings of the thirteenth international conference on artiﬁcialintelligence and statistics , 249-256.[Ahn 2018] Ahn, S.; Ke, Z.; and Vikalo, H. (2018). Viral quasis-pecies reconstruction via tensor factorization with successive readremoval.

Bioinformatics (Oxford, England) , 34(13), i23–i31.[Ahn 2017] Ahn, S.; and Vikalo, H. 2017. aBayesQR: A bayesianmethod for reconstruction of viral populations characterized by lowdiversity. In

International Conference on Research in Computa-tional Molecular Biology , pages 353–369. Springer.[Motazedi 2018] Motazedi, E.; Finkers, R.; Maliepaard, C.; and deRidder, D. 2018. Exploiting next-generation sequencing to solvethe haplotyping puzzle in polyploids: a simulation study.

Brieﬁngsin bioinformatics , 19(3), 387–403.[Potato Genome Sequencing Consortium 2011] Potato Genome Se-quencing Consortium 2011. Genome sequence and analysis of thetuber crop potato.

Nature , 475, 189–195.[Huang 2012] Huang, W.; and Li, L.; Myers, J. R. and Marth, G. T.2012. ART: a next-generation sequencing read simulator.

Bioinfor-matics , 28(4), 593–594.[Li 2009] Li, H.; and Durbin, R. 2009. Fast and accurate short readalignment with burrows-wheeler transform.

Bioinformatics , 25(14),1754–1760.[Edge 2017] Edge, P.; Bafna, V.; and Bansal, V. 2017. Hapcut2:robust and accurate haplotype assembly for diverse sequencingtechnologies.