[PDF] Estimating heterozygosity from a low-coverage genome sequence, leveraging data from other individuals sequenced at the same sites

Abstract

High-throughput shotgun sequence data makes it possible in principle to accurately estimate population genetic parameters without confounding by SNP ascertainment bias. One such statistic of interest is the proportion of heterozygous sites within an individual's genome, which is informative about inbreeding and effective population size. However, in many cases, the available sequence data of an individual is limited to low coverage, preventing the confident calling of genotypes necessary to directly count the proportion of heterozygous sites. Here, we present a method for estimating an individual's genome-wide rate of heterozygosity from low-coverage sequence data, without an intermediate step calling genotypes. Our method jointly learns the shared allele distribution between the individual and a panel of other individuals, together with the sequencing error distributions and the reference bias. We show our method works well, first by its performance on simulated sequence data, and secondly on real sequence data where we obtain estimates using low coverage data consistent with those from higher coverage. We apply our method to obtain estimates of the rate of heterozygosity for 11 humans from diverse world-wide populations, and through this analysis reveal the complex dependency of local sequencing coverage on the true underlying heterozygosity, which complicates the estimation of heterozygosity from sequence data. We show filters can correct for the confounding by sequencing depth. We find in practice that ratios of heterozygosity are more interpretable than absolute estimates, and show that we obtain excellent conformity of ratios of heterozygosity with previous estimates from higher coverage data.

Full PDF

EEstimating heterozygosity from a low-coverage genome sequence,leveraging data from other individuals sequenced at the same sites

An Investigation Submitted to

Genetics

Katarzyna Bryc ∗ Nick Patterson † David Reich ∗ November 12, 2018 ∗ Department of Genetics, Harvard Medical School, Boston, MA 02115 † Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, MA 021421 a r X i v : . [ q - b i o . P E ] D ec unning Head: Estimating heterozygosityKey Words: Heterozygosity, Low-coverage sequence dataCorresponding Author:Katarzyna BrycDepartment of GeneticsHarvard Medical School77 Ave. Louis PasteurBoston, MA 02115(617) 432-1101 (ph.)(617) 432-6306 (fax) [email protected] bstract High-throughput shotgun sequence data makes it possible in principle to accurately estimatepopulation genetic parameters without confounding by SNP ascertainment bias. One suchstatistic of interest is the proportion of heterozygous sites within an individual’s genome,which is informative about inbreeding and eﬀective population size. However, in many cases,the available sequence data of an individual is limited to low coverage, preventing the con-ﬁdent calling of genotypes necessary to directly count the proportion of heterozygous sites.Here, we present a method for estimating an individual’s genome-wide rate of heterozygos-ity from low-coverage sequence data, without an intermediate step calling genotypes. Ourmethod jointly learns the shared allele distribution between the individual and a panel ofother individuals, together with the sequencing error distributions and the reference bias.We show our method works well, ﬁrst by its performance on simulated sequence data, andsecondly on real sequence data where we obtain estimates using low coverage data consis-tent with those from higher coverage. We apply our method to obtain estimates of therate of heterozygosity for 11 humans from diverse world-wide populations, and through thisanalysis reveal the complex dependency of local sequencing coverage on the true underlyingheterozygosity, which complicates the estimation of heterozygosity from sequence data. Weshow ﬁlters can correct for the confounding by sequencing depth. We ﬁnd in practice thatratios of heterozygosity are more interpretable than absolute estimates, and show that weobtain excellent conformity of ratios of heterozygosity with previous estimates from highercoverage data. 3

NTRODUCTION

Heterozygosity, or the fraction of nucleotides within an individual that diﬀer between thechromosomes they inherit from their parents, is a crucial number for understanding humanvariation. Estimating this simple statistic from any type of sequence data is confoundedby sequencing errors, mapping errors, and imperfect power for detecting polymorphisms.Obtaining an unbiased estimate is especially diﬃcult for ancient genomes where the sequenceshave a higher error rate, or in cases of low-coverage sequence data where there is low powerto detect heterozygous sites, or for hybrid capture where there may be additional biases dueto the oligonucleotides used for ﬁshing out sequences of interest.Several methods for estimating individual heterozygosity have been proposed (

Johnson and

Slatkin

Hellmann et al.

Jiang et al.

Lynch

Haubold et al.

Haubold et al. et al.

Haubold et al. mlRho , an implementation of a method that jointly infers θ , the scaled mutation rate, and ρ , the scaled recombination rate for a shotgun sequencedgenome. However, they examined performance of their method at 10X coverage and asmall sequence error rate of 4 × − , which is about four times lower than encounteredcurrently in real data ( Shendure and Ji Meyer et al.

MATERIALS AND METHODS

We apply our method to read data at sites with a target minimum coverage (for example, ≥

5X coverage) for the sequenced diploid individual of interest, aligned to some referencegenome of known sequence. We also use sequence read data from n other individuals likewisealigned to the reference.Let a be the unknown diploid genotype of our target individual at some position in the5enome, and c be the aligned reference allele. Then the allele distribution in other individualswill depend on g = ( a, c ). Let x = ( x , x , . . . , x n ) be the vector of alleles by taking onerandomly sampled read from each of the n individuals. Let w be the observed alleles fromthe reads for our individual. Both w and x are observed quantities for a given position inthe genome for our individual, and we are interested in modeling the joint probability of w , x as the sum of the joint probabilities conditional on g .We assume conditional independence of w , x on the true unobserved genotype g . Thisassumption holds if the allele frequency spectrum of the panel of individuals depends only onthe true underlying genotypic state of our individual, and not the allele counts we observe,and likewise the allele count distribution depends only on the true underlying genotypicstate and not on the the alleles observed in the other individuals. From this conditionalindependence property, we then derive: P ( w , x | g ) = P ( w | g ) P ( x | g ) (1) P ( w , x ) = (cid:88) g P ( g ) P ( w , x | g ) (2)= (cid:88) g P ( w | g ) P ( x | g ) P ( g ) (3) P ( w , x , g ) = P ( w | g ) P ( x | g ) P ( g ) (4)which will later provide the leverage to infer the most likely values for the above probabilities,including P ( g ), which gives us the genomic rate of heterozygosity.For every site which has both suﬃcient coverage in our individual and for which we havecomplete information of the panel, we add this site to the corresponding bin of observedalleles w and panel x . This full matrix would be inconveniently large, so to simplify thedata matrix of counts, we polarize our allele counts with respect to the reference, restrictingto bi-allelic SNPs, which constitute the majority of sites. We denote the reference allele as0, and allow only a single other variant per site, summarizing the observed alleles from thereads by the number of non-reference alleles. Thus, we denote genotypes as g ∈ { , , } ,which we will refer to as the homozygous ancestral, heterozygous, and homozygous derived6tates, respectively. If, for example, we consider only sites with a coverage of 4, then w ∈{ (4 , , (3 , , (2 , , (1 , , (0 , } . We can also easily represent x as a vector of 0’s and 1’sreferring to reference or non-reference allele present in the randomly sampled reads, forexample, x = (0 , , , ,

1) where the length of x is determined by the number of individualswe sample.We create a count matrix N of dimension || w || × || x || corresponding to the number ofobserved sites with each particular combination of w and x . The counts of the number ofloci where the individual is w and the panel of individuals individuals comprise the alleles x is represented by the corresponding row and column entry in the matrix N .From the matrix N we estimate the true values of P ( g ) , P ( w | g ), and P ( x | g ) using theEM algorithm. Let Y obs be the observed counts of alleles in the matrix N w , x . Let Y mis be N w , x ,g , the missing or unobserved counts of the alleles with the true parameter state g . Thenthe likelihood of the data is: L = (cid:88) w , x N w , x log P ( w , x ) (5)If the hidden variable, g , corresponding to the true underlying genotypic state were observed,the log-likelihood would be: L (cid:48) = (cid:88) w , x ,g N w , x ,g log P ( w , x , g ) (6)But this would require ﬁtting || g || diﬀerent parameters per observed data point (i.e., countentry of N w , x ). This would require ﬁtting three times as many parameters as there are datapoints. However, by relying on our conditional independence from equation (4) above wecan reduce the number of parameters to be ﬁtted from the data.By EM theory, the Q function Q ( P, ˆ P ) is given by: Q ( P, ˆ P ) = E post L (cid:48) ( ˆ P )= (cid:88) w , x ,g ˆ N w , x ,g log ˆ P ( w , x , g )7here ˆ N w , x ,g is the expected value of N w , x ,g , which in our case derives from the multinomialdistribution, under the posterior distribution calculated with the old parameters P .The estimates for ˆ P that maximize Q , also derived from the MLE estimates for themultinomial distribution, are: ˆ P ( w | g ) = (cid:80) x ˆ N w , x ,g (cid:80) w , x ˆ N w , x ,g ˆ P ( x | g ) = (cid:80) w ˆ N w , x ,g (cid:80) w , x ˆ N w , x ,g ˆ P ( g ) = (cid:80) w , x ˆ N w , x ,g N Further, by Bayes theorem this expands to,ˆ N w , x ,g = N w , x · ˆ P ( w , x , g )ˆ P ( w , x ) = N w , x · ˆ P ( w | g ) ˆ P ( x | z ) ˆ P ( g ) (cid:80) z ˆ P ( w | g ) ˆ P ( x | g ) ˆ P ( g )By basic EM theory these re-estimated values of ˆ P will generate a non-decreasing se-quence of values for the log likelihood L . Finally, we obtain the parameter of interestˆ P ( g = 1) after convergence. Implementation

In practice, without constraining the parameters ˆ P ( w | g ) we reach localbut not consistently global likelihood maxima, which do not necessarily correspond to thegenotypic state parameters we wish to obtain. To improve the ability of the EM to achieveglobal maxima, we ﬁt Beta-Binomial distributions (eﬀectively an over-dispersed Binomialdistribution) to the probabilities of number of non-reference alleles P ( w | g ) for each possiblegenotypic state g . This constraint, as well as the choice of reasonable starting parameters forthe EM initialization, in practice improves our convergence to global maximum correspondingto the homozygous ancestral, heterozygous and homozygous derived genotypic states.Like the Beta distribution, the MLE estimates for the Beta-Binomial distribution do nothave a closed form, though they can be found using direct numerical optimization [such asa ﬁxed-point iteration or a Newton-Raphson iteration]. However, instead, we estimate the8wo parameters ( α, β ) using MOM estimators for the Beta-Binomial, by setting:ˆ α = ( n − ¯ x − s / ¯ x )¯ x ( s / ¯ x + ¯ x/n − n ˆ β = ( n − ¯ x − s / ¯ x )( n − ¯ x )( s / ¯ x + ¯ x/n − n In the case of under-dispersed data, it is possible to obtain MOM estimates that areinvalid. Though unlikely to occur in the read data, for this contingency, we instead ﬁt aBinomial distribution to the data.There are several challenges in implementing the EM for our problem. The ﬁrst is thatthe likelihood is poorly deﬁned when any of the parameters we are interested in estimatingapproach 0, as then the likelihood also goes to zero. So to avoid this situation, we add a“prior” (cid:15) to the likelihood calculation which adds a small count value in the step calculatingthe parameters to avoid probabilities reaching 0. So instead, we calculate the posterior: L (cid:48) = ˆ N p,x,z log ( P ( p, x, z )) + L (cid:48) rather than the maximum likelihood estimation, hence, we obtain a MAP (maximum a pos-teriori) estimate, which is a Bayesian method that incorporates a prior over the distributionto be estimated (in this case, a small uniform prior). We choose a small prior (less than intotal counting one site across all possible matrices) that does not impact our estimates. Ingeneral, our estimates are robust to choice of this prior, within a range examined of 1e-10 to1e-50, and we will continue to refer to our method as an EM implementation though in factwe use a non MLE method. In eﬀect, our equations for each step remain the same, exceptthat in the M-step of the EM (where we estimate the parameters) we instead estimate theMAP using the prior. Speciﬁcally, we estimate: P osterior = (cid:88) i,j ( N i,j · log ( (cid:88) z P i,j,z ) + (cid:15) · (cid:88) z log ( P i,j,z ))In practice, we set (cid:15) to 1e–20 which does not alter estimates of the probabilities whilepreventing ill-deﬁned likelihoods. 9s with all likelihood calculations, our probabilities approach very small numbers. Toavoid numerical error due to underﬂow of small likelihoods and parameter estimates, weimplement the EM storing all probabilities and likelihoods in the log form.Lastly, likelihood maximization occurs on an arbitrary base, so to avoid numerical issuesdue to any remaining underﬂow of the likelihood calculation, we compute a factor F at thestart of the EM. For each iteration, we compute the likelihood of the data minus this constantfactor, which is a standard practice and does not aﬀect the computation of the maximum.This is equivalent to calculating the log odds: L = (cid:32)(cid:88) i,j N i,j · log ( P i,j ) (cid:33) − ( F )= (cid:32)(cid:88) i,j N i,j · log ( P i,j ) (cid:33) − (cid:32)(cid:88) i,j N i,j · log ( F i,j ) (cid:33) = (cid:88) i,j N i,j · ( log ( P i,j ) − log ( F i,j ))= (cid:88) i,j N i,j · log ( P i,j F i,j )for some constants F i,j . In practice, we set F i,j to be the likelihood at initialization ofthe EM. We then iterate the EM until both the change in parameters and the change in thelikelihood is smaller than our chosen threshold, which in practice we set as 1e–50.It should be noted that any form for tallying read counts may be used, including theallele proﬁle used in ( Haubold et al.

Proof of principle 1: Application to simulated data

We generated simulated sequencedata and applied our EM method for estimating heterozygosity to assess the accuracy of ourestimation procedure. 10 enerating coalescent simulations

We generated 100 replicate datasets of sequence datausing MaCS (

Chen et al.

Gutenkunst et al.

Adding simulated error

We simulate sequence data from the true genotypes by adding errorsto reads. First, for all variable loci in the target individual, we randomly choose which alleleis on a read, then adds errors to the each read (with high error rate of 0.002) to generatethe total number of derived reads for the individual at the locus out of the total sequencingdepth. For each other sequenced chromosome, we add errors with a lower error rate of 0.0001(since we assume the panel is composed of higher-quality genomes), then add a count for theﬁnal simulated locus in the appropriate hash bin. Lastly, for each invariant locus, we adderrors to the target individuals reads, and at a lower rate, adds errors to the other sequencedchromosomes, and input these counts into the hash bin. With an error rate of 0.001 we adderrors to the chimpanzee chromosome which inverts the ancestral and derived reads.

Proof of principle 2: Downsampling high-coverage genomes

To assess the eﬃcacy ofour method at lower coverages on real sequence data, we obtain estimates of heterozygosityfor a San individual sequenced to higher coverage using Illumina’s Genome Analyzer IIxnext-generation sequencing technology, which we then downsample to varying levels lowcoverage. We use this dataset of sequence reads to explore the ability of our method toperform on low-coverage sequence data, and the lower bound of coverage at which we areable to obtain accurate estimates of heterozygosity. We compare the performance of ourmethod to the estimate of θ obtained from MlRho ( Haubold et al. pplication to 11 world-wide human genomes

We align sequence data from 11 humangenomes from world-wide populations and an archaic Denisovan genome to the chimpanzeereference genome to avoid introducing human-reference-population biases. For details on thepopulations, samples and the sequencing performed, see (

Meyer et al.

RESULTS

Simulation results

We obtain accurate estimates of heterozygosity across a variety ofcoverage levels (5X to 30X) (see Figure 1). We note a tiny bias of 0.3% (in relative terms)from the true rate for 5X coverage read data, but with higher coverage this bias goes tozero. [Figure 1 about here.]12 ownsampling results

Figure 2 illustrates that our EM estimation method and

MlRho give consistent estimates of heterozygosity for the HGDP San individual starting at about10X coverage and higher. However, at lower coverage (about 4X) our method signiﬁcantlyoutperforms

MlRho , giving a slightly biased, but nearly convergent, estimate.[Figure 2 about here.]

Heterozygosity estimates for 11 present-day and Denisovan genomes

We presentour initial estimates of heterozygosity, downsampled to three diﬀerent depths, for each se-quencing coverage bin (normalized by individual mean sequence coverage) in Figure 3A.Our estimates of heterozygosity are consistent, independent of count matrix (i.e., downsam-pling) size, as would be expected from our simulated downsampling results shown in Figure2. However, we ﬁnd a strong signal that the estimates of heterozygosity are correlated tosequencing coverage of the region. We note that this is not an artifact of the larger amountof data available at higher coverage, since each bin is calculated after being downsampledto the same depth. Instead, the U-shaped curves in Figure 3B indicate that the apparentnext-generation sequencing coverage is dependent on properties of the underlying genomicsequence. In particular, we ﬁnd that regions of lower coverage and higher coverage (relativeto the mean sequencing depth) show higher heterozygosity.[Figure 3 about here.]We witnessed increased heterozygosity at regions of higher coverage, which we suspectedwas due to perceived genetic diversity due to cryptic segmental duplications. To explore thishypothesis, we restricted our analyses to regions of the genome which have been identiﬁedas unlikely to contain segmental duplications, available on the Eichler Laboratory web-site ( http://eichlerlab.gs.washington.edu/database.html ). We ﬁnd that this ﬁlterstrongly reduces the eﬀect of regions with likely segmental duplications on our estimates ofheterozygosity (Figure 3C), conﬁrming that unidentiﬁed segmental duplications, which result13n a net higher sequencing coverage of the region, result in an high estimate of heterozygos-ity for such regions. Roving these regions with known segmental duplications reduces thiseﬀect at regions with higher sequencing coverage. However, the increase in heterozygosityat higher coverage still is present even after this correction (see Figure 3C), suggesting thatthere may be further individual or population-speciﬁc duplications remaining.Using only data that passed the segmental duplication ﬁlter, we obtain estimates forthe sequenced genomes on the same set of regions, restricting to regions with sequencingcoverage between 20X and 40X. Using the EM, we estimate the total genome-wide fraction ofheterozygosity for each individual, and we also can extract estimates of the allelic distributionof heterozygous and homozygous sites (Figure 4). We present the absolute estimates weobtain in Table 1, as well as the ratio of heterozygosity in the Denisova genome relative tothe other individuals. We ﬁnd the highest estimates of heterozygosity for the San Africanindividual, and next highest estimates of heterozygosity for other African individuals fromMandenka, Yoruba, Mbuti, and Dinka populations. The next highest levels of heterozygosityare in individuals from European populations (French, Sardinian), followed by East Asianpopulations (Dai, Han). We ﬁnd the lowest estimates of heterozygosity in the individualsfrom Melanesia (Papuan) and from a Native American population (Karitiana).[Table 1 about here.][Figure 4 about here.]

DISCUSSION

We have shown that our heterozygosity estimation method both performs well in low cov-erage simulated sequence data and provides consistent estimates on real low-coverage datadownsampled from higher coverage. In particular, our method outperforms other methods ondata that has been sequenced at less than 10X coverage, and provides reasonable estimatesfor as low as 4X coverage. 14ur estimates for 11 world-wide human genomes and the archaic Denisovan genome pro-vide important insights into the distribution of heterozygosity across human populations.Furthermore, our results show that estimates of heterozygosity are strongly aﬀected by ge-nomic properties such as copy-number variability, and these properties aﬀect sequencingcoverage. Hence, we show that the heterozygosity is not independent of sequencing coverageeven within one genome, and is elevated in both regions with low coverage (relative to themean sequencing depth) as well as regions with high coverage. This is an unexpected resultif one assumes a the “Lander-Waterman” Poisson distribution of read depth (

Lander and

Waterman

Weber and

Myers

Meyer et al. relative diversity previously documented in these popu-lations, and with other patterns of genetic diversity such as decay of linkage disequilibirium(

Jakobsson et al. Li et al. Jakobsson et al.

ACKNOWLEDGMENTS

KB gratefully acknowledges that this investigation was support by the National Institutesof Health under Ruth L. Kirschstein National Research Service Award

LITERATURE CITED

Chen, G. , P. Marjoram , and

J. Wall , 2009 Fast and ﬂexible simulation of DNAsequence data. Genome research (1) : Gutenkunst, R. , R. Hernandez , S. Williamson , and

C. Bustamante , 2009 Infer-ring the joint demographic history of multiple populations from multidimensional SNPfrequency data. PLoS genetics (10) : e1000695. Haubold, B. , P. Pfaffelhuber , and

M. Lynch , 2010 mlRho–a program for estimat-ing the population mutation and recombination rates from shotgun-sequenced diploidgenomes. Molecular Ecology (s1) : Hellmann, I. , Y. Mang , Z. Gu , P. Li , M. Francisco , A. Clark , and

R. Nielsen ,2008 Population genetic analysis of shotgun assemblies of genomic sequences from mul-tiple individuals. Genome research (7) : akobsson, M. , S. W. Scholz , P. Scheet , J. R. Gibbs , J. M. VanLiere , H. C.Fung , Z. A. Szpiech , J. H. Degnan , K. Wang , R. Guerreiro , J. M. Bras , J. C.Schymick , D. G. Hernandez , B. J. Traynor , J. Simon-Sanchez , M. Matarin , A. Britton , J. van de Leemput , I. Rafferty , M. Bucan , H. M. Cann , J. A.Hardy , N. A. Rosenberg , and

A. B. Singleton , 2008 Genotype, haplotype andcopy-number variation in worldwide human populations. Nature (7181) : Jiang, R. , S. Tavar´e , and

P. Marjoram , 2009 Population genetic inference from rese-quencing data. Genetics (1) : Johnson, P. and

M. Slatkin , 2006 Inference of population genetic parameters in metage-nomics: a clean look at messy data. Genome research (10) : Lander, E. and

M. Waterman , 1988 Genomic mapping by ﬁngerprinting random clones:a mathematical analysis. Genomics (3) : Li, J. Z. , D. M. Absher , H. Tang , A. M. Southwick , A. M. Casto , S. Ramachan-dran , H. M. Cann , G. S. Barsh , M. Feldman , L. L. Cavalli-Sforza , and

R. M.Myers , 2008 Worldwide human relationships inferred from genome-wide patterns ofvariation. Science (5866) : Lynch, M. , 2008 Estimation of nucleotide diversity, disequilibrium coeﬃcients, and muta-tion rates from high-coverage genome-sequencing projects. Molecular biology and evolu-tion (11) : Meyer, M. , M. Kircher , M. Gansauge , H. Li , F. Racimo , S. Mallick , J. Schraiber , F. Jay , K. Pr¨ufer , C. de Filippo , and others , 2012 A high-coverage genome sequence from an archaic denisovan individual. Science.

Shendure, J. and

H. Ji , 2008 Next-generation DNA sequencing. Nature biotechnol-ogy (10) : Weber, J. and

E. Myers , 1997 Human whole-genome shotgun sequencing. Genome17esearch (5) : ist of Figures A: Each dataset has been downsampled to diﬀerent coverage levels, denotedby symbol color and shape. The red line corresponds to True=Estimated, orperfect estimation of heterozygosity. B: Run-by-run diﬀerences between trueand estimated heterozygosity rates, stratiﬁed by downsampling coverage. They-axis shows the percent error from the true value of heterozygosity. . . . . 202 Our EM Heterozygosity estimates (red) and

MlRho estimates on the regions ofa San individual genome sequenced to 30-45X, and experimentally downsam-pled. At higher coverage, both methods converge to an estimate of 7 . × − .We note that our estimates for 4X and 5X coverage are much more accuratethan MlRho . Results for less than 4X coverage were not possible to obtainfrom

MlRho . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 Estimates of heterozygosity for each of the 11 present-day human genomesand Denisova, where each indvixual is denoted by a unique color. Relativecoverage is deﬁned as the lower bound of the sequencing bin, divided by themean sequencing depth for the individual. A: Heterozygosity estimates areconsistent across downsampling levels. Downsampling to 5X, 10X, and 20Xlevels is denoted by line type. Each individuals is denoted by line color. B: All individuals show an increase in estimated heterozygosity at higher (andlower) relative coverage. C: Eﬀect of removing known regions with segmen-tal duplications. Estimates of heterozygosity are shown for a sample of ﬁveof the individuals. Without ﬁltering, estimates for each bin are shown withsolid lines. After exclusion of regions within known CNV and segmental du-plications, the heterozygosity estimates display a ﬂatter distribution (dottedlines). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 Inferred distribution of homozygous ancestral (red), heterozygous (green), andhomozygous derived (blue) sites for the San HGDP individual. The y-axis ispresented on a log scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2319 lllll l llllllllllllll ll ll ll l lll ll ll ll l lll lll lll lll lll lll ll lll ll lll ll lllll ll l l lll lll llll lll ll lll ll l ll . . . . True Heterozygosity (%) E s t i m a t ed H e t e r o zy go s i t y ( % ) l

5X coverage10X coverage20X coverage30X coverage A l l

5X 10X 20X 30X . . . . . . . Coverage % E rr o r f r o m t r ue he t e r o zy go s i t y B Figure 1: True versus estimated rates of heterozygosity for 100 simulated read datasets. A: Each dataset has been downsampled to diﬀerent coverage levels, denoted by symbolcolor and shape. The red line corresponds to True=Estimated, or perfect estimation ofheterozygosity. B: Run-by-run diﬀerences between true and estimated heterozygosity rates,stratiﬁed by downsampling coverage. The y-axis shows the percent error from the true valueof heterozygosity. 20 l l l l l l H e t e r o zy go s i t y E s t i m a t e ( % ) Coverage l l l l lll

HeterozygosityMlRho

Figure 2: Our EM Heterozygosity estimates (red) and

MlRho estimates on the regions of aSan individual genome sequenced to 30-45X, and experimentally downsampled. At highercoverage, both methods converge to an estimate of 7 . × − . We note that our estimatesfor 4X and 5X coverage are much more accurate than MlRho . Results for less than 4Xcoverage were not possible to obtain from

MlRho .21 .50 0.75 1.00 1.25 1.500.00000.00050.00100.00150.0020

Consistent across downsamplings

Relative coverage H e t e r o zy go s i t y E s t i m a t e Downsampled to 5XDownsampled to 10XDownsampled to 20X A Effect of sequence coverage

Relative coverage H e t e r o zy go s i t y E s t i m a t e ( D o w n s a m p l ed t o X ) DenisovaKaritianaPapuanHanDaiFrench SardinianDinkaYorubaMbutiMandenkaSan B Effect of segmental duplication filter

Relative coverage H e t e r o zy go s i t y E s t i m a t e ( D o w n s a m p l ed t o X ) No filteringAfter fiter removing CNV regions C Figure 3: Estimates of heterozygosity for each of the 11 present-day human genomes andDenisova, where each indvixual is denoted by a unique color. Relative coverage is deﬁned asthe lower bound of the sequencing bin, divided by the mean sequencing depth for the individ-ual. A: Heterozygosity estimates are consistent across downsampling levels. Downsamplingto 5X, 10X, and 20X levels is denoted by line type. Each individuals is denoted by line color. B: All individuals show an increase in estimated heterozygosity at higher (and lower) relativecoverage. C: Eﬀect of removing known regions with segmental duplications. Estimates ofheterozygosity are shown for a sample of ﬁve of the individuals. Without ﬁltering, estimatesfor each bin are shown with solid lines. After exclusion of regions within known CNV andsegmental duplications, the heterozygosity estimates display a ﬂatter distribution (dottedlines). 22 N u m be r o f s i t e s ( k b ) − − + + Homozygous ancestralHeterozygousHomozygous derived

Figure 4: Inferred distribution of homozygous ancestral (red), heterozygous (green), andhomozygous derived (blue) sites for the San HGDP individual. The y-axis is presented on alog scale. 23 ist of Tablesist of Tables