[PDF] Reducing pervasive false positive identical-by-descent segments detected by large-scale pedigree analysis

Abstract

Analysis of genomic segments shared identical-by-descent (IBD) between individuals is fundamental to many genetic applications, from demographic inference to estimating the heritability of diseases, but IBD detection accuracy in non-simulated data is largely unknown. In principle, it can be evaluated using known pedigrees, as IBD segments are by definition inherited without recombination down a family tree. We extracted 25,432 genotyped European individuals containing 2,952 father-mother-child trios from the 23andMe, Inc. dataset. We then used GERMLINE, a widely used IBD detection method, to detect IBD segments within this cohort. Exploiting known familial relationships, we identified a false positive rate over 67% for 2-4 centiMorgan (cM) segments, in sharp contrast with accuracies reported in simulated data at these sizes. Nearly all false positives arose from the allowance of haplotype switch errors when detecting IBD, a necessity for retrieving long (> 6 cM) segments in the presence of imperfect phasing. We introduce HaploScore, a novel, computationally efficient metric that scores IBD segments proportional to the number of switch errors they contain. Applying HaploScore filtering to the IBD data at a precision of 0.8 produced a 13-fold increase in recall when compared to length-based filtering. We replicate the false IBD findings and demonstrate the generalizability of HaploScore to alternative data sources using an independent cohort of 555 European individuals from the 1000 Genomes project. HaploScore can improve the accuracy of segments reported by any IBD detection method, provided that estimates of the genotyping error rate and switch error rate are available.

Full PDF

11 Reducing pervasive false positive identical-by-descent segmentsdetected by large-scale pedigree analysis

Eric Y. Durand , † , Nicholas Eriksson , Cory Y. McLean , † † These authors contributed equally to this work.Corresponding author: Cory Y. McLean, [email protected] a r X i v : . [ q - b i o . P E ] F e b Abstract

IBD segments are regions of DNA between two individuals that were inherited from a recent sharedcommon ancestor. IBD segments can be detected on high-density genetic data such as that produced bygenome-wide genotyping arrays or whole genome sequencing.Detecting the presence and distribution of IBD segments between individuals is fundamental to manygenetic applications . Long-range phasing uses IBD segments to resolve haplotype phasing inaccuracies.IBD segments have been used to identify disease genes and estimate the heritability of traits andcommon diseases . The lengths and distribution of IBD segments within and across populations havebeen used to infer demographic history and identify regions under natural selection .All methods for IBD detection ultimately try to detect a similarity between haplotypes that is statisti-cally unlikely to occur in the absence of IBD sharing. Hidden Markov models have been used extensivelyfor probabilistic IBD segment detection . However, these methods scale quadratically withinput sample sizes and are thus not suitable for IBD detection in population-scale datasets (reviewed in ).Non-probabilistic IBD detection methods use a “hash-and-extend” methodology that is conceptually sim-ilar to BLAST : identical or nearly identical short haplotype match “seeds” are detected eﬃciently andthe seeds are extended to adjacent sites subject to heuristic constraints. These non-probabilistic meth-ods have the advantage that they are able to scale to much larger datasets than probabilistic methods.Implementations include GERMLINE , fastIBD , and ReﬁnedIBD . GERMLINE and ReﬁnedIBDuse short windows of sites as seeds, whereas fastIBD uses small segments of the inferred haplotype graphas seeds.These three methods diﬀer in the way that detected candidate segments are chosen to be kept as trueIBD segments: fastIBD uses haplotype frequency, ReﬁnedIBD uses a combination of segment geneticlength and a likelihood ratio test, and GERMLINE uses segment length. The probabilistic reﬁnementmethods of fastIBD and ReﬁnedIBD require a haplotype graph to be generated. Consequently, bothfastIBD and ReﬁnedIBD perform haplotype phasing in addition to IBD detection. Haplotype phasinghas superlinear computational complexity . Current computer memory capacity constraints limit thenumber of individuals who can be phased together to tens of thousands of individuals. Thus, computingall pairwise IBD requires splitting the cohort into multiple smaller batches, all of which must be comparedto each other, each time being phased anew. This makes the detection of all pairwise IBD segments in acohort of over 100,000 individuals computationally infeasible using these methods. Since GERMLINE usessegment length to reﬁne IBD segments, it does not perform genotype phasing. Consequently, detectionof all pairwise IBD segments can be performed on large cohorts by phasing each individual once and thenusing GERMLINE to detect IBD.IBD detection accuracy is typically assessed on simulated data, as true IBD segments can then beknown precisely . However, accurate simulation of population demography is diﬃcult , andsimulation parameters directly aﬀect the estimated precision and recall of IBD detection algorithms.With a large number of father-mother-child trios, IBD detection accuracy can be estimated on non-simulated data by examining concordance between reported IBD segments in the child and his or herparents.In this work, we analyze the accuracy of IBD segments reported by GERMLINE since its decouplingof phasing and IBD detection make it feasible for IBD detection on population-scale datasets. We usea large cohort of trios to assess IBD segment accuracy on non-simulated data. We perform a detailedexamination of discrepant segments and present a method that substantially improves accuracy whileremaining computationally tractable for population-scale datasets. Finally, we replicate the ﬁndingsusing an independent cohort of individuals from the 1000 Genomes project. Results

Non-simulated data show substantial inaccuracy in short reported IBD seg-ments

To analyze IBD detection accuracy on non-simulated data, we examined IBD segments detected ina cohort of 25,432 individuals of European ancestry that includes 2,952 distinct father-mother-childtrios (the “23andMe cohort”,

Material and Methods ). By focusing speciﬁcally on segments reportedbetween a trio child and an individual who is not a parent of that child (henceforth called “child-other”segments), IBD accuracy can be quantiﬁed: by the deﬁnition of IBD, if a child-other segment is true,at least one of the child’s parents must also share a segment IBD with the individual (henceforth called“parent-other” segments) that encompasses the child-other segment.GERMLINE reported a total of 18,125,797 child-other segments in the 23andMe cohort on chro-mosome 21. After ﬁltering artifactual IBD segments reported in regions of low site density, 13,307,562child-other segments were retained for analysis. Only 14% of these child-other segments were encompassedby a parent-other segment (

Figure 1A , Supplementary Figure S1A ). Another 25% of child-other seg-ments have a partial parent-other segment in which at least one segment end is truncated (

Figure 1A , Supplementary Figure S1B ). Segment ends imply the presence of opposite homozygote genotypesbetween the individuals. Opposite homozygote sites that terminate a parent-other segment exclude thepossibility of child-other IBD at those sites. To determine whether truncated segment ends representedfalse child-other IBD or genotyping error in parent-other regions, Illumina GenCall scores were examinedat the opposite homozygote sites truncating 128,656 randomly selected partial parent-other segments.Considering GenCall scores of ≥ . , over 95% of opposite homozygote sitesanalyzed (122,364/128,656) have conﬁdent genotype calls in both the parent and other individual. Thisresult indicates that the vast majority of disagreements between child-other and parent-other segmentsrepresent false positive IBD in the child rather than false negative IBD in the parent ( Figure 1B ).The remaining 61% of child-other segments have no corresponding parent-other segment (

Figure 1A , Supplementary Figure S1C ). All segments in this subset were analyzed to determine whether theyrepresented false positive child-other segments or false negative parent-other segments by examining thenumber of parent-other opposite homozygote sites in the region. Nearly 98% of these child-other segmentshave at least one opposite homozygote site in the parent (

Figure 1C ). Given a 95% accuracy rate forparent-other opposite homozygote sites (

Figure 1B ), the probability that a region containing N oppositehomozygote sites is actually a false negative parent-other IBD segment was calculated as (1 − . N . Theexpected fraction of false negative parent-other segments in this subset is 0 . . Supplementary Figure S1 ). Segments were segregated by genetic and physicallengths and the average segment overlap of all segments in each bin was calculated (

Figure 2A ). Geneticlength is a more reliable indicator of average segment overlap than physical length and segments longerthan 6 cM generally show a high degree of overlap. However, the average overlap drops rapidly as segmentlength is reduced (

Figure 2A ).IBD accuracy was estimated by considering child-other segments with substantial parent-other seg-ment overlap as true IBD. Because precise determination of IBD endpoints from genotype data is dif-ﬁcult , a threshold of 80% segment overlap was used to classify a segment as true IBD. Using thiscriterion, more than 67% of all reported segments shorter than 4 cM are false positive child segments( Figure 2B ). Figure 2C – F show the IBD segment overlap distributions segregated by genetic length.Most 2–3 cM segments are erroneous ( Figure 2C ), and only segments longer than 5 cM have a negligi-ble number of false positives (

Figure 2F ). Indeed, when ﬁltering solely by genetic length, all segmentsshorter than 5 cM must be discarded in order to achieve a precision value of 0 . Supplementary Fig-ure S2 ). However, because there are many more short segments, eliminating all segments shorter than5 cM eliminates 99% of all true IBD segments, a dramatic loss in recall (

Supplementary Figure S2 ).In the next section, we investigate the properties of true IBD segments of all lengths, and contrast themwith erroneous segments.

Overly permissive diplotype matching causes reported segment inaccuracy

IBD segments are shared between two individual haplotypes. Thus, if the phase of each individualgenotype was known, IBD detection algorithms could in principle analyze each individual haplotypeindependently. However, for individuals without a genotyped pedigree, genotypes have to be phasedstatistically, where switch errors occur at an appreciable frequency (

Supplementary Figure S4 ). Ex-amination of only haplotypes in the presence of switch errors is known to reduce power to detect IBD,especially for long segments , since they are likely to harbor more switch errors than short segments.Thus, GERMLINE (and many other IBD detection methods) matches IBD segments between individualdiplotypes, trying to allow for a moderate number of switches between individuals’ haplotypes. In prac-tice, this is achieved by allowing haplotype match seeds to extend until an opposite homozygous site ismet. There are two potential issues with this approach that could lead to inconsistent segment reportingbetween parent and child and are explored further below.Detection of child-other segments with a truncated or absent corresponding parent-other segmentcould arise from the haplotype matching between the child and the other individual, but a switch er-ror in the parent causing the corresponding haplotype to not match between the parent and the otherindividual. To investigate this potential error source, all 2,952 trios were trio-phased using the laws ofMendelian inheritance and then IBD detection was performed as before. Trio-phasing ensures that chil-dren and parents are phased essentially perfectly (i.e., up to recombination events), eliminating haplotypediscrepancies between parent and child as a source of segment discrepancies. The number and accuracyof child-other segments using trio-phased data is nearly identical to that of BEAGLE-phased data, show-ing that parent-child haplotype discrepancies contribute a negligible amount toward discrepant segments( Supplementary Figure S5 ).Alternatively, child-other segments with no corresponding parent-other segment could be false re-ported IBD between the child and the other individual due to overly permissive diplotype matching. Toexamine this possibility, each full 100-site window in all 13,307,562 child-other segments was analyzed(63,542,380 total windows) to see whether the window satisﬁed the diplotype match criterion and thehaplotype match criterion between the child and the other individual and between the parent and theother individual. The analysis was segregated by windows contained within corresponding parent-othersegments (likely true IBD) and windows that are not contained within corresponding parent-other seg-ments (false IBD). The diplotype match criterion is satisﬁed in the child in 97.6% of windows containedwithin parent-other segments (

Table 1A ) and in 97.5% of windows not contained within parent-othersegments (

Table 1B ). Roughly 67.7% of windows contained within both child-other and parent-othersegments satisfy the haplotype match criterion for IBD in the child (

Table 1A ), consistent with trueIBD given the window size and empirical switch error rate (

Supplementary Figure S4 ). In contrast,only 44.2% of windows not contained within a parent-other segment satisfy the haplotype match criterionfor IBD in the child (

Table 1B ), a substantial reduction (binomial

P < − ).The poor precision in short segments is thus due to the allowance of diplotype-only matches withinthe IBD detection algorithm. However, allowing diplotype-only matches is necessary for detection oflong segments due to imperfect haplotype phasing . The substantial reduction in windows matchinghaplotypes in regions of false IBD suggests a haplotype-based metric that is robust to switch errors couldimprove precision of reported IBD without the loss of recall incurred by haplotype-only IBD detectionmechanisms. A haplotype-based metric to identify true IBD segments

IBD is fundamentally a property of haplotypes, not diplotypes. Consequently, true IBD should appearconsistent with haplotype matches, modulo expected genotyping and switch errors. We introduce Hap-loScore as a measure of haplotype IBD likelihood: given a genotyping error rate per site (cid:15) and a switcherror rate per site σ , the HaploScore for a candidate IBD segment S isHaploScore( S ) = 1 | S | (cid:16) n g (cid:15) + n s σ (cid:17) ,where | S | is the number of genotyped sites in S and n g and n s are the number of genotyping and switcherrors, respectively, that together minimize the score while reconciling the segment as matching acrossa single haplotype in both individuals. Conceptually, HaploScore is a measure of the ratio of observedand expected genotyping and switch errors. In segments falsely reported as IBD, a larger-than-expectednumber of genotyping and switch errors may be required to reconcile the segments as matching acrossindividual haplotypes, and their HaploScores will be large.Genotyping and switch error rates per site were estimated from the data to be (cid:15) = 0 . σ = 0 . Material and Methods ). Using those parameters, HaploScore was calculated on all segments shorterthan 6 cM. To investigate whether HaploScore behaves diﬀerently between true and false IBD, we plotteda heat map of IBD segment overlap as a function of segment genetic length and HaploScore values.HaploScore eﬀectively discriminates true and false IBD segments at all lengths (

Figure 3A ). Indeed,the relationship between HaploScore and mean segment overlap is nearly monotonic, drawing a clearboundary between segments with at least 80% overlap and others at all genetic lengths.In addition, we assessed the power of HaploScore as a binary classiﬁer to decide if an IBD segmentis true. We varied a HaploScore threshold from 0 to 22 (the maximum observed HaploScore value onchromosome 21), and classiﬁed segments with a HaploScore value smaller than the threshold as true IBD.We then computed the true positive and false positive rates at each HaploScore threshold. HaploScoreperformed well as a binary classiﬁer at all genetic lengths, achieving an area under the receiver operatingcharacteristic curve (AUC) greater than 0 . Figure 3B ). At all levelsof precision, power increased as segment length increased, owing at least in part to the general positivecorrelation between segment length and number of sites in the segment. Importantly, and in sharpcontrast with length-based ﬁltering (

Supplementary Figure S2 ), HaploScore-based ﬁltering retainsmany segments shorter than 5 cM at a precision of 0 . Figure 3C ). Recall of HaploScore-based ﬁlteringat 0 . .

19, a 13-fold increase compared to length-based ﬁltering.

Robustness of results to HaploScore parameter variation

HaploScore is a function of two parameters: the genotyping error rate (cid:15) and the switch error rate σ .However, only the ratio of the two parameters aﬀects the behavior of the score. In order to assess therobustness of HaploScore to varying parameters, a grid search was performed in which (cid:15) was ﬁxed at0 . σ was varied three orders of magnitude from (cid:15)/

100 to 10 (cid:15) , and the AUC was computed at eachgrid point (

Figure 4A ). As expected, performance was strongest when the ratio of the parameters wasnear its true value. However, the performance degradation was modest across the wide range of parameterratio values examined, with the AUC dropping by less than 2% at worst.

Robustness of results to true IBD deﬁnition

In all analyses above, true IBD segments were deﬁned as child-other segments that have at least 80%parent-other segment overlap. To assess the robustness of HaploScore to diﬀerent true IBD deﬁnitions,a grid search was performed in which the deﬁnition of true IBD was varied from 10% to 100% parent-other segment overlap in increments of 10%. The AUC was computed at each grid point (

Figure 4B ).Performance was generally stable for all segment lengths and true IBD deﬁnitions, with the exception of5–6 cM segments at 100% overlap, where performance degraded appreciably. This is likely due at leastin part to the inherent bias for longer segments to have more sites at which premature truncation ofdetected IBD segments can arise from genotyping or switch errors.

Robustness of results to genome-wide IBD identiﬁcation

To conﬁrm that the results presented are not due to particular genomic features of chromosome 21,chromosome 10 was analyzed on the full cohort using the same parameters ( (cid:15) = 0 . σ = 0 . Supplementary Figure S6 ). In addition, IBDsegments were examined on all autosomes in the subset of all individuals comprising the 2,952 unrelatedtrios. No substantial deviations in performance were observed (not shown).

Filtering spurious reported IBD segments using HaploScore

HaploScore can be used to ﬁlter out spurious segments reported by an IBD detection algorithm asan eﬃcient post-processing step. The reduced power to detect short segments requires more stringentHaploScore threshold values for shorter segments to achieve a similar precision value as for longer segments(

Figure 3 ). Since HaploScore provides a way to rank segments, the trade-oﬀ between precision and recallcan be tuned to the needs of the particular downstream application.HaploScore threshold values to ensure particular average overlap values of resultant segments weregenerated (see

Material and Methods ) and three separate ﬁltering results are shown in

Figure 5 .Notably, more stringent ﬁltering parameters have the largest eﬀect on short segments and have nearlythe same eﬀect as lenient ﬁltering parameters for segments over 5 cM (

Figure 5 ). This result is intuitive,as the short reported segments are enriched for false positives (

Figure 2C – F ). Robustness of results to alternate individuals and genotyping platforms

To assess the robustness of the ﬁndings in an alternative population, a cohort of 555 European individualsincluding 52 father-mother-child trios genotyped as part of the 1000 Genomes project were analyzed(the “1000 Genomes cohort”, Supplementary Table S1 ). Individuals in the 1000 Genomes cohort weregenotyped on the Illumina HumanOmni2.5-Quad v1-0 B SNP array and as such provide an independentsample set from which to assess the generalizability of our results to additional individuals and alternativegenotyping platforms.GERMLINE reported a total of 6,585 child-other segments on chromosome 21 in the 1000 Genomescohort. After ﬁltering artifactual IBD segments reported in regions of low site density, 5,770 child-othersegments were retained for analysis. The number of child-other segments detected in the 1000 Genomescohort is much smaller than in the 23andMe cohort (5,770 versus 13,307,562 candidate segments) sincethe 1000 Genomes cohort is much smaller. However, the rate of candidate segment detection is similar: inthe 1000 Genomes cohort, there are 5,770 segments for 52 ×

552 child-other pairs, resulting in an averageof 5770 / (52 × .

20 child-other segments per trio. In the 23andMe cohort, the corresponding rateis 13307562 / (2952 × .

18 child-other segments per trio.Analyses of child-other segments detected in the 1000 Genomes cohort were performed analogouslyto those in the 23andMe cohort. Only 12% of child-other segments were encompassed by a parent-othersegment, 20% of child-other segments have a partial parent-other segment in which at least one segmentend is truncated, and the remaining 68% of child-other segments have no corresponding parent-othersegment (

Supplementary Figure S7A ). Analysis of truncated segments in the 1000 Genomes cohortalso strongly suggests that false child-other IBD accounts for most discrepant segments, as 92% of oppositehomozygote sites that truncate the 1,174 truncated segments have conﬁdent genotype calls in both theparent and other individual (

Supplementary Figure S7B ). Finally, in the 68% of child-other segmentsthat have no corresponding parent-other segment, over 99% contain at least one opposite homozygotesite in the parent (

Supplementary Figure S7C ). Taken together, these results show that the 1000Genomes cohort is also rife with false positive IBD, and despite the diﬀerent genotyping platform used,the error proﬁle in the 1000 Genomes cohort is qualitatively very similar to that in the 23andMe cohort(

Figure 1 ).Examination of the relationship between segment length and segment overlap in the 1000 Genomescohort indicates similar general trends as those discovered in the 23andMe cohort (compare

Supple-mentary Figure S8 and

Figure 2 ), though the smaller number of segments makes the results morenoisy. Comparison of all 44,542 full 100-site windows in the 5,770 child-other segments shows that overlypermissive diplotype matching causes false reported IBD segments: the diplotype match criterion is satis-ﬁed in 97.1% of windows contained within parent-other segments and in 96.5% of windows not containedwithin parent-other segments, whereas the haplotype match criterion is satisﬁed in 68.7% of windowscontained within parent-other segments but in only 51.1% of windows not contained within parent-othersegments (

Supplementary Table S2 ), a substantial reduction (binomial

P < − ).Finally, the performance of HaploScore in segregating true and false reported IBD was analyzed inthe 1000 Genomes cohort. The switch error rate was estimated from the data to be σ = 0 .

003 and thegenotyping error rate was estimated to be (cid:15) = 0 . Supplementary Figure S9 and

Figure 3 ). The smallnumber of child-other segments analyzed in the 1000 Genomes cohort causes somewhat noisy results, butthe eﬀectiveness of HaploScore as a discriminator between true and false positive IBD is readily apparent.

Discussion

The usage of IBD segments in genetic analyses will become increasingly common as the number ofindividuals with their genetic composition known increases. Due to the inherently quadratic nature ofIBD detection between all pairs of individuals in a cohort, non-probabilistic methods are required to keepthe computational burden as low as possible. However, eﬀective ﬁltering methods are required to ensurereported IBD segments are accurate.Using the laws of Mendelian inheritance is an eﬀective way to avoid modeling complex demographichistory when evaluating the accuracy of population genetics methods including IBD detection and localancestry inference . By using known familial relationships in a large set of trios, we were able toanalyze the accuracy of IBD segments reported by GERMLINE on non-simulated data. We found asurprisingly large number of false positive short segments and showed that these false positives arosedue to the diplotype-based IBD detection mechanism introduced to detect long IBD segments in thepresence of phasing switch errors . We introduced a haplotype-based metric, HaploScore, that eﬀectivelydiscriminates between true and false reported IBD segments. We also investigated a likelihood-ratio-basedmetric, but found it less eﬀective than HaploScore ( Supplementary Text ).Importantly, HaploScore can be computed eﬃciently using dynamic programming (in O( | S | ) timeper segment, see Material and Methods ). This suggests a strategy for accurate IBD detection inpopulation-scale datasets: detect candidate segments using a non-probabilistic IBD detection methodwith relatively permissive parameters and then cull true segments using HaploScore ﬁltering. In addition,HaploScore can be applied as a post-processing step to existing genotyping- and sequencing-based IBDsegments, provided that an estimate of the switch error rate and the genotyping error rate are available.Achieving optimal HaploScore performance in a diﬀerent population cohort or when using an alterna-tive genotyping platform depends on being able to accurately estimate the genotyping and switch errorrates of the data. Genotyping error rates can be estimated in any cohort by methods such as repeatgenotyping . While accurate determination of switch error rates currently requires trios or orthogonalanalysis methods such as phased sequencing , the robustness of HaploScore to substantial variationsin the parameter ratio indicates that it should be extensible to non-European populations, genotypingplatforms of diﬀerent marker density, or even sequencing-based assays. Indeed, we demonstrated therobustness and generalizability of HaploScore by analyzing an independent cohort of 555 European in-dividuals from the 1000 Genomes project who were genotyped on a chip nearly twice as dense as the23andMe chip. While the smaller sample size of the 1000 Genomes cohort produced noisier results, allmajor ﬁndings of the analysis of the 23andMe cohort were replicated in the 1000 Genomes cohort.Python code implementing HaploScore ﬁltering is freely available ( https://github.com/23andMe/ibd ). Material and Methods

Cohort description

The 23andMe cohort analyzed comprises 25,432 customers of 23andMe, Inc., a personal genetics company,who were genotyped on the Illumina HumanOmniExpress+ BeadChip as part of the 23andMe PersonalGenome Service. The chip contains roughly 1,000,000 sites genome-wide . Individuals were selected forhaving >

97% European ancestry as described previously . The 23andMe cohort includes 2,952 distinctfather-mother-child trios identiﬁed by IBD sharing . Parent-child relationships were deﬁned as havingat least 85% of the genetic length of the genome shared on at least one haplotype and no more than 10%of the genetic length of the genome shared on both haplotypes. Parent-parent relationships were deﬁnedas having at most 20% of the genetic length of the genome shared on at least one haplotype.The 1000 Genomes cohort analyzed comprises 555 individuals from ﬁve European populations whowere genotyped on the Illumina HumanOmni2.5-Quad v1-0 B SNP array as described previously (sam-ples available at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20120131_omni_genotypes_and_intensities/Omni25_genotypes_2141_samples.b37.vcf.gz ). The 1000 Genomescohort includes 52 distinct father-mother-child trios identiﬁed within the 1000 Genomes project (meta-data available at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20130606_sample_info/20130606_sample_info.txt ) and which we validated independently by IBD sharing ( SupplementaryTable S1 ). All members of the 1000 Genomes cohort were veriﬁed to not be present in the 23andMecohort.

IBD detection

The 23andMe cohort

Genotypes of all individuals included in the 23andMe cohort were phased using BEAGLE version 3.3.1in batches of 8,000–9,000 individuals as described previously . In each batch, we excluded sites withminor allele frequency < . P < − , call rate < × , algorithm with the parameters -bits 100 -err hom 2 -err het 0 -w extend-min m 2 -map , corresponding to the empirical genotyping and switch error rates ofthe data (see HaploScore parameter estimation below). The genetic map used was generated bythe Phase II HapMap and lifted over to NCBI Build GRCh37 coordinates using the UCSC GenomeBrowser liftOver tool (available at http://hapmap.ncbi.nlm.nih.gov/downloads/recombination/2011-01_phaseII_B37/genetic_map_HapMapII_GRCh37.tar.gz ). To omit clearly artifactual candidateIBD segments arising from sequence assembly gaps and platform eﬀects, candidate segments were ﬁlteredby site density . Segments with a site density (measured in sites/cM) in the lowest 10% of all 1 cMwindows on the chromosome were omitted. All remaining candidate IBD segments were retained. The 1000 Genomes cohort

Genotypes of all 555 individuals in the 1000 Genomes cohort were phased using BEAGLE version 3.3.1in a single batch. Windows of 3,000 sites that overlapped by 100 sites were stitched together as describedpreviously . Sites that were not polymorphic in the 555 individuals examined, had a 1000-Genomes-reported Hardy-Weinberg equilibrium P < − , or a call rate within the 555 individuals examined <

95% were excluded, resulting in 23,142 sites on chromosome 21. GenCall genotype scores were set to0 for all sites not called in each individual.Candidate IBD segments were identiﬁed and ﬁltered identically to those found in the 23andMe cohortdescribed above.

HaploScore description and computational complexity

HaploScore provides a metric by which to rank the likelihood that a stretch of DNA is inherited IBDbetween two individuals or not. Let (cid:15) and σ denote the probability of a genotyping error and a switcherror at any given site, respectively. The HaploScore for a candidate IBD segment S isHaploScore( S ) = 1 | S | (cid:16) n g (cid:15) + n s σ (cid:17) (1)where | S | is the number of genotyped sites in S and n g and n s are the number of genotyping and switcherrors, respectively, that together minimize the score while reconciling the segment as matching across asingle haplotype in both individuals.0Finding the HaploScore (i.e. the optimal values of n g and n s subject to the constraints) can beviewed as ﬁnding the minimum-cost path through the directed acyclic graph (DAG) described below( Supplementary Figure S10 ).Let G be a DAG with a single source node and a single sink node. Between the source and the sink,the graph has | S | levels, one per genotyped site in segment S . Each of these | S | levels has four nodes,one for each possible haplotype conﬁguration. Each node in level l has four outgoing directed edges, oneto each node in level l + 1. Below, we use the same notation for nodes and their weights.At any level l , let h ( i,j ) l , i, j ∈ { , } denote the four possible haplotype conﬁgurations of an IBDmatch. The nodes are weighted as follows: h ( i,j ) l = (cid:40) i in ﬁrst individual matches haplotype j in second individual at position l ,1 /(cid:15) otherwise. (2)Let e ( i,j ) , ( u,v ) l denote the weight of the edge between nodes h ( i,j ) l and h ( u,v ) l +1 . Edges are weighted asfollows: e ( i,j ) , ( u,v ) l =  i = u and j = v ,1 /σ if i = u and j (cid:54) = v ,1 /σ if i (cid:54) = u and j = v ,2 /σ if i (cid:54) = u and j (cid:54) = v . (3)The weights of the four edges from the source node to the nodes in the ﬁrst level, as well as theweights from the nodes in level | S | to the sink node, are set to 0. The cost of a path in G is deﬁned asthe sum of the weights of the edges and nodes it traverses.HaploScore( S ) is equal to the smallest of all path costs from the source to the sink. It can be eﬃcientlycomputed using dynamic programming by noting that the smallest cost from the source to level l + 1 inthe graph can easily be inferred from the smallest cost from the source to level l . Let C ( i,j ) l denote thesmallest cost from the source to haplotype conﬁguration ( i, j ) at level l . Then, C ( u,v ) l +1 = min i,j,u,v (cid:16) C ( i,j ) l + e ( i,j ) , ( u,v ) l + h ( u,v ) l +1 (cid:17) . (4)The minimum cost to reach level l , C ∗ l , is then C ∗ l = min i,j C ( i,j ) l . (5)The above equations clearly show that computing HaploScore( S ) involves 16 comparisons at eachgenotyped site in S . Thus, the complexity of computing HaploScore( S ) is at most 16 | S | . Performancecan be further improved when ﬁltering by HaploScore by terminating computation as soon as a segment’sHaploScore becomes too high to satisfy the maximum value threshold. HaploScore parameter estimation

HaploScore uses two parameters, the genotyping error rate per site (cid:15) and the switch error rate persite σ . Analyses of genotyping chip accuracy and internal comparisons between genotype and whole-genome sequencing data verify that the genotyping error rate is <

1% (not shown). To estimate theempirical switch error rate per site, all 2,952 trios in the 23andMe cohort were trio-phased using thelaws of Mendelian inheritance and the results for all children were compared to their BEAGLE-phasedhaplotypes, assuming that the trio-phased haplotypes represented the true phase. The average per-site1switch error rate ranged from 0 . . Supplementary Figure S4 ).The switch error rate calculation process described above was performed independently on the 1000Genomes cohort. A total of 3,629 switch errors were detected in the 52 trio children over 23,142 sites.This corresponds to an individual switch error rate per site of 3629 / (52 × . HaploScore threshold matrix generation

A matrix of HaploScore thresholds was generated in the following manner: all segments were binned bygenetic length in 0 . Supplementary File 1 ). Acknowledgments

We thank the customers of 23andMe who contributed the genetic data that made this research possibleand are grateful to the employees of 23andMe for creating and supporting the resources necessary forthis research. We also thank members of the 23andMe research team for insightful comments. This workwas supported by the National Human Genome Research Institute of the National Institutes of Health(grant number R44HG006981).

References

1. Browning, S. R. & Browning, B. L. Identity by descent between distant relatives: detection andapplications.

Annual Review of Genetics , 617–633 (2012).2. Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation.

NatureGenetics , 1068–1075 (2008).3. Krawitz, P. M. et al. Identity-by-descent ﬁltering of exome sequence data identiﬁes PIGV mutationsin hyperphosphatasia mental retardation syndrome.

Nature Genetics , 827–829 (2010).4. Gusev, A. et al. DASH: a method for identical-by-descent haplotype mapping uncovers associationwith recent variation.

The American Journal of Human Genetics , 706–717 (2011).5. Jonsson, T. et al. A mutation in APP protects against Alzheimer’s disease and age-related cognitivedecline.

Nature , 96–99 (2012).6. Visscher, P. M. et al.

Assumption-free estimation of heritability from genome-wide identity-by-descentsharing between full siblings.

PLOS Genetics , e41 (2006).7. Zuk, O., Hechter, E., Sunyaev, S. R. & Lander, E. S. The mystery of missing heritability: geneticinteractions create phantom heritability. Proceedings of the National Academy of Sciences of theUnited States of America , 1193–1198 (2012).8. Palamara, P. F., Lencz, T., Darvasi, A. & Pe’er, I. Length distributions of identity by descent revealﬁne-scale demographic history.

The American Journal of Human Genetics , 809–822 (2012).29. Gusev, A. et al. The architecture of long-range haplotypes shared within and across populations.

Molecular Biology and Evolution , 473–486 (2012).10. Ralph, P. & Coop, G. The geography of recent genetic ancestry across Europe. PLOS Biology ,e1001555 (2013).11. Albrechtsen, A. et al. Relatedness mapping and tracts of relatedness for genome-wide data in thepresence of linkage disequilibrium.

Genetic Epidemiology , 266–274 (2009).12. Han, L. & Abney, M. Using identity by descent estimation with dense genotype data to detectpositive selection. European Journal of Human Genetics , 205–211 (2013).13. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkageanalyses.

The American Journal of Human Genetics , 559–575 (2007).14. Browning, S. R. & Browning, B. L. High-resolution detection of identity by descent in unrelatedindividuals. The American Journal of Human Genetics , 526–539 (2010).15. Han, L. & Abney, M. Identity by descent estimation with dense genome-wide genotype data. GeneticEpidemiology , 557–567 (2011).16. Palin, K., Campbell, H., Wright, A. F., Wilson, J. F. & Durbin, R. Identity-by-descent-based phasingand imputation in founder populations using graphical models. Genetic Epidemiology , 853–860(2011).17. Brown, M. D., Glazner, C. G., Zheng, C. & Thompson, E. A. Inferring coancestry in populationsamples in the presence of linkage disequilibrium. Genetics , 1447–1460 (2012).18. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment searchtool.

Journal of Molecular Biology , 403–410 (1990).19. Gusev, A. et al.

Whole population, genome-wide mapping of hidden relatedness.

Genome Research , 318–326 (2009).20. Browning, B. L. & Browning, S. R. A fast, powerful method for detecting identity by descent. TheAmerican Journal of Human Genetics , 173–182 (2011).21. Browning, B. L. & Browning, S. R. Improving the accuracy and eﬃciency of identity-by-descentdetection in population data. Genetics , 459–471 (2013).22. Williams, A. L., Patterson, N., Glessner, J., Hakonarson, H. & Reich, D. Phasing of many thousandsof genotyped samples.

The American Journal of Human Genetics , 238–251 (2012).23. Browning, B. L. & Browning, S. R. Eﬃcient multilocus association testing for whole genome associ-ation studies using localized haplotype clustering. Genetic Epidemiology , 365–375 (2007).24. Fan, J.-B. et al. Highly parallel SNP genotyping. In

Cold Spring Harbor Symposia on QuantitativeBiology , vol. 68, 69–78 (Cold Spring Harbor Laboratory Press, 2003).25. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 humangenomes.

Nature , 56–65 (2012).26. Pasaniuc, B. et al.

Analysis of Latino populations from GALA and MEC studies reveals genomicloci with biased local ancestry estimation.

Bioinformatics , 1407–1415 (2013).327. Pompanon, F., Bonin, A., Bellemain, E. & Taberlet, P. Genotyping errors: causes, consequences andsolutions. Nature Reviews Genetics , 847–846 (2005).28. Voskoboynik, A. et al. The genome sequence of the colonial chordate, Botryllus schlosseri.

Elife ,e00569 (2013).29. Hinds, D. A. et al. A genome-wide association meta-analysis of self-reported allergy identiﬁes sharedand allergy-speciﬁc susceptibility loci.

Nature Genetics , 907–911 (2013).30. Henn, B. M. et al. Cryptic distant relatives are common in both isolated and cosmopolitan geneticsamples.

PLOS ONE , e34267 (2012).31. Browning, S. R. & Browning, B. L. Rapid and accurate haplotype phasing and missing-data inferencefor whole-genome association studies by use of localized haplotype clustering. The American Journalof Human Genetics , 1084–1097 (2007).32. Frazer, K. A. et al. A second generation human haplotype map of over 3.1 million SNPs.

Nature , 851–861 (2007).33. Kent, W. J. et al.

The human genome browser at UCSC.

Genome Research , 996–1006 (2002).34. Zhuang, Z., Gusev, A., Cho, J. & Pe’er, I. Detecting Identity by Descent and Homozygosity Mappingin Whole-Exome Sequencing Data. PLOS ONE , e47618 (2012).35. Paynter, R. A. et al. Accuracy of multiplexed Illumina platform-based single-nucleotide polymorphismgenotyping compared between genomic and whole genome ampliﬁed DNA collected from multiplesources.

Cancer Epidemiology Biomarkers & Prevention , 2533–2536 (2006).4 Figures

Reported child segments N u m b e r o f se g m e n t s ( m illi on s ) ! parent segment overlap < 100% 0% < parent segment overlap < 80% 0% parent segment overlap A B C N u m b e r o f se g m e n t s Number of opposite homozygotes

0 5 10 15 20 "

25 8,062,142 segments absent in both parents

61% 8% 14% 17%

0% 20% 40% 60% 80% 100%

Truncated segments

Mean: 8.6 Median: 7

Figure 1. Analysis of child-other segments in parents. A.

The majority of child-other segmentsare not detected in either parent. Parent segment overlap is calculated as the percentage of sites in thechild-other segment that are included in the parent-other segment. B. Truncation points forparent-other segments are nearly always conﬁdently-genotyped opposite homozygote sites, consistentwith false positive IBD in the child. The opposite homozygote site causing truncation of theparent-other segment was examined in a randomly-selected subset of all 3,371,616 segments with partialparent overlap. C. Child-other segments with no corresponding parent-other segments contain manyparent-other opposite homozygotes in the region, also consistent with false positive IBD in the child.For each of these child-other segments, the number of opposite homozygote sites present between theparent and the other individual at that segment location is calculated separately for each parent, andthe smaller is chosen as the number of opposite homozygotes in the region.5 ! " : / ; , - . < *=**=)*=!*="*= , ; - : , > , - . ? @ , A ; B ! " A , / : / ? - *=" *=% (cid:1) *=(C,3>,-.1?@,A2;B*)***!***"*** D E > F , A ? G : , > , - . : . ? E : ; - H : !I"106 *=" *=% (cid:1) *=(C,3>,-.1?@,A2;B*)**!**"** D E > F , A ? G : , > , - . : . ? E : ; - H : "I *=" *=% (cid:1) *=(C,3>,-.1?@,A2;B*)*!*"* D E > F , A ? G : , > , - . : . ? E : ; - H : *=" *=% (cid:1) *=(C,3>,-.1?@,A2;B*$)*)$!*!$ D E > F , A ? G : , > , - . : . ? E : ; - H : $I%106 ! " Figure 2. Accuracy of child-other IBD segments reported by GERMLINE. A.

Heat map ofthe mean fraction of reported child-other IBD segments contained in a corresponding parent-othersegment, binned by two measures of segment length. For each child-other segment, the fraction of thesegment also reported as an IBD segment between the parent and the other individual is calculated.Shown in each bin is the mean of the segment fractions calculated for all segments in the bin. B. Thefraction of child-other segments that are true IBD as a function of segment length. True IBD segmentsare deﬁned as having at least 80% of their sites encompassed by a parent-other segment.

C–F.

Histograms of child-other segment counts binned by segment overlap for segments of 2–3 cM ( C ), 3–4cM ( D ), 4–5 cM ( E ), and 5–6 cM ( F ). Note the scale changes on the y-axes: though the fraction of truesegments of length < ! " - + : ’ ’ ( , ? ’ . @ ’ ( ) , A ’ : - C : D ’ , ? * ) * A ’ , : ) ’ $E%,+1,03;>!2 G : ’ + * ? * ( $E%,+1 !" Figure 3. Improving detection of true IBD segments using HaploScore. A.

Heat map of themean fraction of reported IBD segments found in parents, binned by segment genetic length andHaploScore. Calculations are performed as in

Figure 2A . B. Receiver operating characteristic forreported IBD segments of various lengths, discriminating by HaploScore. True IBD is deﬁned asin

Figure 4. HaploScore is robust to a wide range of input parameters. A.

AUCs for a range ofgenotyping to switch error rate ratios. We varied the switch error rate σ relative to the genotyping errorrate (cid:15) . For each value of σ , we evaluated the resulting AUC discriminating by HaploScore, where wedeﬁned true positive segments as having a segment overlap of at least 0 . B. AUCs for a range ofsegment overlap values required to classify a segment as a true positive. For each of ten diﬀerentsegment overlap thresholds (0 . , . . . , . ! " (cid:1) %&’()(*+,-.()/*0-1,23%& ! %& %& " %& %& %& : ( ; - < = - > ( / ( ) * > ?,<;(-*0;(>0<.@-

Histogram of number ofsegments reported after ﬁltering at three diﬀerent HaploScore thresholds, t ∈ { . , . , . } . Eachthreshold t corresponds to a genetic-length-speciﬁc array of maximal HaploScores allowed to retain allsegments with mean segment overlap of at least t , as described in the HaploScore threshold matrixgeneration section of

Material and Methods . Note that the y-axis is on a log scale.9

Tables

Table 1. Haplotype and diplotype window matches in child-other segments. A.

Counts ofwindow types in windows contained within a corresponding parent-other segment. B. Counts of windowtypes in windows that are not contained within a corresponding parent-other segment. A Child Diplo Child Haplo Child Both

Total

Par None Par Diplo

Par Haplo

Par Both

Total B Child Diplo Child Haplo Child Both

Total

Par None

Par Diplo

Par Haplo

Par Both

Total

Par, parent; Diplo, diplotype match only; Haplo, haplotype match only.0

Supplementary Note

Logarithm of Odds (LOD) segment scoring

In this section, we describe an alternative scoring for potential IBD segments that is similar in spirit tothe LOD score used in ReﬁnedIBD . Speciﬁcally, for a given segment S shared between two individuals i and i , we compute its LODscore as follows:LODscore( S ) = log  P r (cid:16) G ( S ) obs , G ( S ) obs | IBD (cid:17)

P r (cid:16) G ( S ) obs , G ( S ) obs | no IBD (cid:17)  , (6)where G ( S ) obs (resp. G ( s ) obs ) is the observed genotype of individual i (resp. i ) over segment S , and P r (cid:16) G ( S ) obs , G ( S ) obs | IBD (cid:17) (resp.

P r (cid:16) G ( S ) obs , G ( S ) obs | no IBD (cid:17) ) is the pseudo-likelihood of observing G ( S ) obs and G ( S ) obs conditioned on individuals i and i being IBD over segment S (resp. not being IBD).The pseudo-likelihood is computed as follows: P r (cid:16) G ( s ) obs , G ( s ) obs | IBD , (cid:15) (cid:17) = S (cid:89) i =1 (cid:88) G true (cid:88) G true P r (cid:16) G ( i ) true , G ( i ) true | IBD (cid:17)

P r (cid:16) G ( i ) true | G ( i ) obs , (cid:15) (cid:17) P r (cid:16) G ( i ) true | G ( i ) obs , (cid:15) (cid:17) ,where (cid:15) is the genotyping error rate, S is the number of markers in the IBD segment, and G ( i ) true j is thetrue genotype of individual j at position i . The probability of genotypes ( G ( i ) true , G ( i ) true ) as a function ofthe IBD state (0, 1 or 2 alleles shared IBD at position i ) is given in Supplementary Table S3 . We notethat

Supplementary Table S3 was derived elsewhere . The probability of observing a genotype giventhe true genotype and the genotyping error rate is given in Supplementary Table S4 . Two genotypesare considered IBD if they either share one or two alleles IBD (IBD1 and IBD2 in

SupplementaryTable S4 ), and we give equal prior probabilities to the two conﬁgurations.We assessed the performance of LODscore by computing its AUC for various segment sizes. We notethat even though the LODscore has power to ﬁlter out false IBD segments, its AUC is generally lowerthan the HaploScore detailed in the main text (

Supplementary Figure S11 ). Reasons for the lowerpower of LODscore may arise in part from two issues: 1) LODscore assumes each site is independent andthus ignores correlation between adjacent markers, and 2) LODscore ignores available phase information.Both issues could be alleviated by explicitly incorporating linkage disequilibrium between adjacent sitesand switch errors into the model. However, because of the strong performance of HaploScore, we did notexplore these research avenues further.1

Supplementary Figures A Child Mother Father x x x x x x x x x

Child Mother Father x x x x x x x x

Child Mother Father x x x x x x x x x x x x

B C

Figure S1. Choosing the parent through which child-other IBD segments have beentransmitted.

The genome is represented as a horizontal gray line. Assayed sites compatible with IBDbetween the listed individual and a hypothetical other individual (not pictured) are indicated as verticalblack lines. Assayed sites incompatible with IBD (e.g., opposite homozygote sites) are indicated as redcrosses. Orange boxes indicate reported IBD segments between the listed individual and thehypothetical other individual (not pictured). A. The unambiguous case in which one parent has acorresponding IBD segment and the other parent does not. Here, the father would be selected as theparent for analysis. B. The case where each parent has an IBD segment that partially overlaps the childsegment. The parent selected for analysis is determined by the fraction of sites shared IBD. In this case,despite the longer physical length of the father’s segment, the mother would be selected since hersegment overlap (5 of 9 sites) is larger than the father’s (3 of 9 sites). C. The case where neither parenthas a reported IBD segment. The father would be selected as the parent for analysis, since his genotypecontains fewer opposite homozygote sites in the child IBD region.2

Figure S2. Receiver operating characteristics of child-other IBD segments discriminatingby genetic length. A.

True positive rate vs. false positive rate when discriminating by minimumgenetic length. B. Precision vs. recall when discriminating by minimum genetic length. Values for fourparticular minimum genetic length criteria are marked on each plot.3

Figure S3. Length distribution of child-other IBD segments.

Heat map shows the number ofsegments in each bin segregating by the genetic and physical lengths of the segments. Axes identical tothose in

Figure 2A .4 ! "! ()*+& ,-.+/0123+43&!!!5 ( : / ; + < = / : > ) + ** ? * + * : ! "! ()*+" ! "! ()*+’ ! "! ()*+ ! "! ()*+@ ! "! ()*+$ ! &! "! ’! ()*+B ! &! "! ’! ()*+C ! &! "! ’! ()*+A ! &! "! ’! ()*+&! ! &! "! ’! ()*+&& ! &! "! ’! ()*+&" ! &! "! ’!!%!!%!!&!%!!"!%!!’!%!! ()*+&’ ! &! "!!%!!"@ ()*+& ! &! "!!%!!’! ()*+&@ ! &! "!!%!!’! ()*+&$ ! &! "!!%!!%!!&!%!!"!%!!’!%!! ()*+&B ! &! "!!%!!"A ()*+&C ! ()*+&A ! @ &! &@ "!!%!!’" ()*+"! ! ()*+"& ! ()*+"" Figure S4. Switch errors in BEAGLE-phased data occur at a nearly-constant rate acrosschromosomes.

Switch error positions were detected in 2,952 trio children by comparingBEAGLE-phased haplotypes with trio-phased haplotypes and assuming trio-phased data was truth.The average individual switch error rate was calculated at each site by dividing the total number ofswitch errors at that site by 2,952. Red lines plot the cumulative switch error rate scaled by the numberof sites on the chromosome, to facilitate inter-chromosomal comparison. Numbers in the top left of eachgraph indicate the average per-site switch error rate for the chromosome. Black dashed lines indicatethe expected individual cumulative switch errors per site assuming a constant switch error rate at eachsite on the chromosome.5

Figure S5. IBD segment overlap and HaploScore performance on chromosome 21 usingtrio-phased trios. A.

Heat map of the mean fraction of reported IBD segments found in parents,binned by two measures of segment length. B. The fraction of child-other segments that are true IBDas a function of segment length. True IBD segments are deﬁned as having at least 80% of their sitesencompassed by a parent-other segment. C. Heat map of the mean fraction of reported IBD segmentsfound in parents, binned by segment genetic length and HaploScore. D. Receiver operatingcharacteristic for reported IBD segments of various lengths, discriminating by HaploScore. The fourpanels are analogous to

Figure 2A , B and Figure 3A , B , respectively, using trio-phased data for all2,952 trios. The similarity of this ﬁgure and the main text ﬁgure panels indicates that haplotypephasing errors do not contribute substantially to the estimates of IBD accuracy.6 Figure S6. IBD segment overlap and HaploScore performance on chromosome 10. A.

Heatmap of the mean fraction of reported IBD segments found in parents, binned by two measures ofsegment length. B. The fraction of child-other segments that are true IBD as a function of segmentlength. True IBD segments are deﬁned as having at least 80% of their sites encompassed by aparent-other segment. C. Heat map of the mean fraction of reported IBD segments found in parents,binned by segment genetic length and HaploScore. D. Receiver operating characteristic for reportedIBD segments of various lengths, discriminating by HaploScore. The four panels are analogousto

Figure 2A , B and Figure 3A , B , respectively, calculated on chromosome 10 here.7 A B C N u m b e r o f se g m e n t s (t hou sa nd s ) Reported child segments ! parent segment overlap < 100% 0% < parent segment overlap < 80% 0% parent segment overlap 1,174 segments truncated in parent Not confident in parent or other Confident in parent and other 0 5 10 15 20 "

25 3,912 segments absent in both parents

68% 6% 12% 14% N u m b e r o f se g m e n t s Number of opposite homozygotes

Mean: 10.4 Median: 8

0% 20% 40% 60% 80% 100%

Truncated segments

Figure S7. Analysis of child-other segments in parents in the 1000 Genomes cohort.

Thisﬁgure is analogous to

Figure 1 but performed on the 1000 Genomes cohort. A. The majority ofchild-other segments are not detected in either parent. B. Truncation points for parent-other segmentsare nearly always conﬁdently-genotyped opposite homozygote sites, consistent with false positive IBD inthe child. The opposite homozygote site causing truncation of the parent-other segment was examinedin all 1,174 segments with partial parent overlap. C. Child-other segments with no correspondingparent-other segments contain many parent-other opposite homozygotes in the region, also consistentwith false positive IBD in the child. For each of these child-other segments, the number of oppositehomozygote sites present between the parent and the other individual at that segment location iscalculated separately for each parent, and the smaller is chosen as the number of opposite homozygotesin the region.8 ! " : / ; , - . < *=**=)*=!*="*= , ; - : , > , - . ? @ , A ; B ! " A , / : / ? - *=" *=% (cid:1) *=(C,3>,-.1?@,A2;B*$**)***)$**!***!$**"***"$** D E > F , A ? G : , > , - . : !H"106 *=" *=% (cid:1) *=(C,3>,-.1?@,A2;B*$*)**)$*!**!$*"**"$* D E > F , A ? G : , > , - . : "H *=" *=% (cid:1) *=(C,3>,-.1?@,A2;B*!* D E > F , A ? G : , > , - . : *=" *=% (cid:1) *=(C,3>,-.1?@,A2;B*! D E > F , A ? G : , > , - . : $H%106 ! " Figure S8. Accuracy of child-other IBD segments reported by GERMLINE in the 1000Genomes cohort.

This ﬁgure is analogous to

Figure 2 but performed on the 1000 Genomes cohort. A. Heat map of the mean fraction of reported child-other IBD segments contained in a correspondingparent-other segment, binned by two measures of segment length as described in

Figure 2A . B. Thefraction of child-other segments that are true IBD as a function of segment length. True IBD segmentsare deﬁned as having at least 80% of their sites encompassed by a parent-other segment as in

Figure 2B . C–F.

Histograms of child-other segment counts binned by segment overlap for segments of2–3 cM ( C ), 3–4 cM ( D ), 4–5 cM ( E ), and 5–6 cM ( F ). Note the scale changes on the y-axes: thoughthe fraction of true segments of length < ! " - + : ’ ’ ( , ? ’ . @ ’ ( ) , A ’ : - C : D ’ , ? * ) * A ’ , : ) ’ $E%,+1,03;>%2 G : ’ + * ? * ( $E%,+1 !" Figure S9. Improving detection of true IBD segments using HaploScore in the 1000Genomes cohort.

This ﬁgure is analogous to

Figure 3 but performed on the 1000 Genomes cohort. A. Heat map of the mean fraction of reported IBD segments found in parents, binned by segmentgenetic length and HaploScore. Calculations are performed as in

Figure 2A . B. Receiver operatingcharacteristic for reported IBD segments of various lengths, discriminating by HaploScore. True IBD isdeﬁned as in

Figure 2B . The dashed black line indicates the no-discrimination line. The area undereach curve is parenthesized in its legend entry. C. Precision-recall plot for child-other segments binnedby segment length.0 h (1 , h (1 , h (1 , h (2 , h (2 , h (2 , h (2 , h (1 , h (1 , L h (1 , L h (2 , L h (2 , L / / / h (1)2 = h (1)2 ) /✏ h (1)2 = h (2)2 ) /✏ h (2)2 = h (1)2 ) /✏ h (2)2 = h (2)2 ) /✏ src snk h (1 , L h (1 , L h (2 , L h (2 , L / / / Figure S10. Graph illustrating the HaploScore computation.

The graph has one source, onesink and one level per genotyped site in the IBD segment. At each level l , the graph contains four nodes,indicating the haplotype conﬁguration at site l . Each node has weight 0 if the two corresponding allelesare the same, or 1 /(cid:15) if they are diﬀerent. Each node in level l has four outgoing directed edges, one toeach of the four nodes in level l + 1. The edge weights are 0, 1 /σ , or 2 /σ , depending on whether 0, 1 or2 switch errors are necessary to explain the transition. For clarity, some edges are omitted in this ﬁgure.The source node src has four outgoing directed edges with weight 0, one to each of the four nodes inlevel 1. Each node in level L has one outgoing directed edge to the sink node snk with weight 0.1 Figure S11. Receiver operating characteristic for reported IBD segments of variouslengths, discriminating by LODscore.

True positive IBD segments are deﬁned as having at least80% of their sites encompassed by a parent-other segment. The area under each curve is parenthesizedin its legend entry.2

Supplementary Tables

Table S1. Characteristics of the individuals in the 1000 Genomes cohort.Population a Total trios Total individuals Reported relationships

IBS 50 150 All triosCEU 2 104 Two trios, 98 unrelated b GBR 0 101 One mother-daughter pair, one unknownsecond order relationship, 97 unrelatedTSI 0 100 One sibling pair, 98 unrelatedFIN 0 100 All unrelated a IBS, Iberian populations from Spain; CEU, Utah residents with ancestry from northern and westernEurope; GBR, British from England and Scotland; TSI, Toscans from Italy; FIN, Finnish from Finland. b Reportedly unrelated NA06989 and NA12155 share 35 cM on chromosome 21.3

Table S2. Haplotype and diplotype window matches in child-other segments of 1000Genomes data. A.

Counts of window types in windows contained within a correspondingparent-other segment. B. Counts of window types in windows that are not contained within acorresponding parent-other segment. A Child Diplo Child Haplo Child Both

Total

Par None Par Diplo

Par Haplo

50 167 170

Par Both

817 169 7,702

Total B Child Diplo Child Haplo Child Both

Total

Par None

Par Diplo

Par Haplo

90 281 387

Par Both

Total

Par, parent; Diplo, diplotype match only; Haplo, haplotype match only.4

Table S3. Genotype probabilities for a pair of individuals for diﬀerent IBD states. ( G , G ) IBD0 IBD1 IBD2

AA, BB 2 p q p p p AA, AB 4 p q p q p q p q + pq pqp represents the allele frequency of allele A and q (= 1 − p ) represents the allele frequency of allele B.5 Table S4. Observed genotype probabilities with genotyping errors. G true = AA G true = AB G true = BBG obs = AA (1 − (cid:15) ) (1 − (cid:15) ) (cid:15) (cid:15) G obs = AB − (cid:15) ) (cid:15) (1 − (cid:15) ) + (cid:15) − (cid:15) ) (cid:15)G obs = BB (cid:15) (1 − (cid:15) ) (cid:15) (1 − (cid:15) ))