Reducing pervasive false positive identical-by-descent segments detected by large-scale pedigree analysis
11 Reducing pervasive false positive identical-by-descent segmentsdetected by large-scale pedigree analysis
Eric Y. Durand , † , Nicholas Eriksson , Cory Y. McLean , † † These authors contributed equally to this work.Corresponding author: Cory Y. McLean, [email protected] a r X i v : . [ q - b i o . P E ] F e b Abstract
Analysis of genomic segments shared identical-by-descent (IBD) between individuals is fundamental tomany genetic applications, from demographic inference to estimating the heritability of diseases, butIBD detection accuracy in non-simulated data is largely unknown. In principle, it can be evaluated usingknown pedigrees, as IBD segments are by definition inherited without recombination down a family tree.We extracted 25,432 genotyped European individuals containing 2,952 father-mother-child trios from the23andMe, Inc. dataset. We then used GERMLINE, a widely used IBD detection method, to detect IBDsegments within this cohort. Exploiting known familial relationships, we identified a false positive rateover 67% for 2–4 centiMorgan (cM) segments, in sharp contrast with accuracies reported in simulateddata at these sizes. Nearly all false positives arose from the allowance of haplotype switch errors whendetecting IBD, a necessity for retrieving long ( > Introduction
IBD segments are regions of DNA between two individuals that were inherited from a recent sharedcommon ancestor. IBD segments can be detected on high-density genetic data such as that produced bygenome-wide genotyping arrays or whole genome sequencing.Detecting the presence and distribution of IBD segments between individuals is fundamental to manygenetic applications . Long-range phasing uses IBD segments to resolve haplotype phasing inaccuracies.IBD segments have been used to identify disease genes and estimate the heritability of traits andcommon diseases . The lengths and distribution of IBD segments within and across populations havebeen used to infer demographic history and identify regions under natural selection .All methods for IBD detection ultimately try to detect a similarity between haplotypes that is statisti-cally unlikely to occur in the absence of IBD sharing. Hidden Markov models have been used extensivelyfor probabilistic IBD segment detection . However, these methods scale quadratically withinput sample sizes and are thus not suitable for IBD detection in population-scale datasets (reviewed in ).Non-probabilistic IBD detection methods use a “hash-and-extend” methodology that is conceptually sim-ilar to BLAST : identical or nearly identical short haplotype match “seeds” are detected efficiently andthe seeds are extended to adjacent sites subject to heuristic constraints. These non-probabilistic meth-ods have the advantage that they are able to scale to much larger datasets than probabilistic methods.Implementations include GERMLINE , fastIBD , and RefinedIBD . GERMLINE and RefinedIBDuse short windows of sites as seeds, whereas fastIBD uses small segments of the inferred haplotype graphas seeds.These three methods differ in the way that detected candidate segments are chosen to be kept as trueIBD segments: fastIBD uses haplotype frequency, RefinedIBD uses a combination of segment geneticlength and a likelihood ratio test, and GERMLINE uses segment length. The probabilistic refinementmethods of fastIBD and RefinedIBD require a haplotype graph to be generated. Consequently, bothfastIBD and RefinedIBD perform haplotype phasing in addition to IBD detection. Haplotype phasinghas superlinear computational complexity . Current computer memory capacity constraints limit thenumber of individuals who can be phased together to tens of thousands of individuals. Thus, computingall pairwise IBD requires splitting the cohort into multiple smaller batches, all of which must be comparedto each other, each time being phased anew. This makes the detection of all pairwise IBD segments in acohort of over 100,000 individuals computationally infeasible using these methods. Since GERMLINE usessegment length to refine IBD segments, it does not perform genotype phasing. Consequently, detectionof all pairwise IBD segments can be performed on large cohorts by phasing each individual once and thenusing GERMLINE to detect IBD.IBD detection accuracy is typically assessed on simulated data, as true IBD segments can then beknown precisely . However, accurate simulation of population demography is difficult , andsimulation parameters directly affect the estimated precision and recall of IBD detection algorithms.With a large number of father-mother-child trios, IBD detection accuracy can be estimated on non-simulated data by examining concordance between reported IBD segments in the child and his or herparents.In this work, we analyze the accuracy of IBD segments reported by GERMLINE since its decouplingof phasing and IBD detection make it feasible for IBD detection on population-scale datasets. We usea large cohort of trios to assess IBD segment accuracy on non-simulated data. We perform a detailedexamination of discrepant segments and present a method that substantially improves accuracy whileremaining computationally tractable for population-scale datasets. Finally, we replicate the findingsusing an independent cohort of individuals from the 1000 Genomes project. Results
Non-simulated data show substantial inaccuracy in short reported IBD seg-ments
To analyze IBD detection accuracy on non-simulated data, we examined IBD segments detected ina cohort of 25,432 individuals of European ancestry that includes 2,952 distinct father-mother-childtrios (the “23andMe cohort”,
Material and Methods ). By focusing specifically on segments reportedbetween a trio child and an individual who is not a parent of that child (henceforth called “child-other”segments), IBD accuracy can be quantified: by the definition of IBD, if a child-other segment is true,at least one of the child’s parents must also share a segment IBD with the individual (henceforth called“parent-other” segments) that encompasses the child-other segment.GERMLINE reported a total of 18,125,797 child-other segments in the 23andMe cohort on chro-mosome 21. After filtering artifactual IBD segments reported in regions of low site density, 13,307,562child-other segments were retained for analysis. Only 14% of these child-other segments were encompassedby a parent-other segment (
Figure 1A , Supplementary Figure S1A ). Another 25% of child-other seg-ments have a partial parent-other segment in which at least one segment end is truncated (
Figure 1A , Supplementary Figure S1B ). Segment ends imply the presence of opposite homozygote genotypesbetween the individuals. Opposite homozygote sites that terminate a parent-other segment exclude thepossibility of child-other IBD at those sites. To determine whether truncated segment ends representedfalse child-other IBD or genotyping error in parent-other regions, Illumina GenCall scores were examinedat the opposite homozygote sites truncating 128,656 randomly selected partial parent-other segments.Considering GenCall scores of ≥ . , over 95% of opposite homozygote sitesanalyzed (122,364/128,656) have confident genotype calls in both the parent and other individual. Thisresult indicates that the vast majority of disagreements between child-other and parent-other segmentsrepresent false positive IBD in the child rather than false negative IBD in the parent ( Figure 1B ).The remaining 61% of child-other segments have no corresponding parent-other segment (
Figure 1A , Supplementary Figure S1C ). All segments in this subset were analyzed to determine whether theyrepresented false positive child-other segments or false negative parent-other segments by examining thenumber of parent-other opposite homozygote sites in the region. Nearly 98% of these child-other segmentshave at least one opposite homozygote site in the parent (
Figure 1C ). Given a 95% accuracy rate forparent-other opposite homozygote sites (
Figure 1B ), the probability that a region containing N oppositehomozygote sites is actually a false negative parent-other IBD segment was calculated as (1 − . N . Theexpected fraction of false negative parent-other segments in this subset is 0 . . Supplementary Figure S1 ). Segments were segregated by genetic and physicallengths and the average segment overlap of all segments in each bin was calculated (
Figure 2A ). Geneticlength is a more reliable indicator of average segment overlap than physical length and segments longerthan 6 cM generally show a high degree of overlap. However, the average overlap drops rapidly as segmentlength is reduced (
Figure 2A ).IBD accuracy was estimated by considering child-other segments with substantial parent-other seg-ment overlap as true IBD. Because precise determination of IBD endpoints from genotype data is dif-ficult , a threshold of 80% segment overlap was used to classify a segment as true IBD. Using thiscriterion, more than 67% of all reported segments shorter than 4 cM are false positive child segments( Figure 2B ). Figure 2C – F show the IBD segment overlap distributions segregated by genetic length.Most 2–3 cM segments are erroneous ( Figure 2C ), and only segments longer than 5 cM have a negligi-ble number of false positives (
Figure 2F ). Indeed, when filtering solely by genetic length, all segmentsshorter than 5 cM must be discarded in order to achieve a precision value of 0 . Supplementary Fig-ure S2 ). However, because there are many more short segments, eliminating all segments shorter than5 cM eliminates 99% of all true IBD segments, a dramatic loss in recall (
Supplementary Figure S2 ).In the next section, we investigate the properties of true IBD segments of all lengths, and contrast themwith erroneous segments.
Overly permissive diplotype matching causes reported segment inaccuracy
IBD segments are shared between two individual haplotypes. Thus, if the phase of each individualgenotype was known, IBD detection algorithms could in principle analyze each individual haplotypeindependently. However, for individuals without a genotyped pedigree, genotypes have to be phasedstatistically, where switch errors occur at an appreciable frequency (
Supplementary Figure S4 ). Ex-amination of only haplotypes in the presence of switch errors is known to reduce power to detect IBD,especially for long segments , since they are likely to harbor more switch errors than short segments.Thus, GERMLINE (and many other IBD detection methods) matches IBD segments between individualdiplotypes, trying to allow for a moderate number of switches between individuals’ haplotypes. In prac-tice, this is achieved by allowing haplotype match seeds to extend until an opposite homozygous site ismet. There are two potential issues with this approach that could lead to inconsistent segment reportingbetween parent and child and are explored further below.Detection of child-other segments with a truncated or absent corresponding parent-other segmentcould arise from the haplotype matching between the child and the other individual, but a switch er-ror in the parent causing the corresponding haplotype to not match between the parent and the otherindividual. To investigate this potential error source, all 2,952 trios were trio-phased using the laws ofMendelian inheritance and then IBD detection was performed as before. Trio-phasing ensures that chil-dren and parents are phased essentially perfectly (i.e., up to recombination events), eliminating haplotypediscrepancies between parent and child as a source of segment discrepancies. The number and accuracyof child-other segments using trio-phased data is nearly identical to that of BEAGLE-phased data, show-ing that parent-child haplotype discrepancies contribute a negligible amount toward discrepant segments( Supplementary Figure S5 ).Alternatively, child-other segments with no corresponding parent-other segment could be false re-ported IBD between the child and the other individual due to overly permissive diplotype matching. Toexamine this possibility, each full 100-site window in all 13,307,562 child-other segments was analyzed(63,542,380 total windows) to see whether the window satisfied the diplotype match criterion and thehaplotype match criterion between the child and the other individual and between the parent and theother individual. The analysis was segregated by windows contained within corresponding parent-othersegments (likely true IBD) and windows that are not contained within corresponding parent-other seg-ments (false IBD). The diplotype match criterion is satisfied in the child in 97.6% of windows containedwithin parent-other segments (
Table 1A ) and in 97.5% of windows not contained within parent-othersegments (
Table 1B ). Roughly 67.7% of windows contained within both child-other and parent-othersegments satisfy the haplotype match criterion for IBD in the child (
Table 1A ), consistent with trueIBD given the window size and empirical switch error rate (
Supplementary Figure S4 ). In contrast,only 44.2% of windows not contained within a parent-other segment satisfy the haplotype match criterionfor IBD in the child (
Table 1B ), a substantial reduction (binomial
P < − ).The poor precision in short segments is thus due to the allowance of diplotype-only matches withinthe IBD detection algorithm. However, allowing diplotype-only matches is necessary for detection oflong segments due to imperfect haplotype phasing . The substantial reduction in windows matchinghaplotypes in regions of false IBD suggests a haplotype-based metric that is robust to switch errors couldimprove precision of reported IBD without the loss of recall incurred by haplotype-only IBD detectionmechanisms. A haplotype-based metric to identify true IBD segments
IBD is fundamentally a property of haplotypes, not diplotypes. Consequently, true IBD should appearconsistent with haplotype matches, modulo expected genotyping and switch errors. We introduce Hap-loScore as a measure of haplotype IBD likelihood: given a genotyping error rate per site (cid:15) and a switcherror rate per site σ , the HaploScore for a candidate IBD segment S isHaploScore( S ) = 1 | S | (cid:16) n g (cid:15) + n s σ (cid:17) ,where | S | is the number of genotyped sites in S and n g and n s are the number of genotyping and switcherrors, respectively, that together minimize the score while reconciling the segment as matching acrossa single haplotype in both individuals. Conceptually, HaploScore is a measure of the ratio of observedand expected genotyping and switch errors. In segments falsely reported as IBD, a larger-than-expectednumber of genotyping and switch errors may be required to reconcile the segments as matching acrossindividual haplotypes, and their HaploScores will be large.Genotyping and switch error rates per site were estimated from the data to be (cid:15) = 0 . σ = 0 . Material and Methods ). Using those parameters, HaploScore was calculated on all segments shorterthan 6 cM. To investigate whether HaploScore behaves differently between true and false IBD, we plotteda heat map of IBD segment overlap as a function of segment genetic length and HaploScore values.HaploScore effectively discriminates true and false IBD segments at all lengths (
Figure 3A ). Indeed,the relationship between HaploScore and mean segment overlap is nearly monotonic, drawing a clearboundary between segments with at least 80% overlap and others at all genetic lengths.In addition, we assessed the power of HaploScore as a binary classifier to decide if an IBD segmentis true. We varied a HaploScore threshold from 0 to 22 (the maximum observed HaploScore value onchromosome 21), and classified segments with a HaploScore value smaller than the threshold as true IBD.We then computed the true positive and false positive rates at each HaploScore threshold. HaploScoreperformed well as a binary classifier at all genetic lengths, achieving an area under the receiver operatingcharacteristic curve (AUC) greater than 0 . Figure 3B ). At all levelsof precision, power increased as segment length increased, owing at least in part to the general positivecorrelation between segment length and number of sites in the segment. Importantly, and in sharpcontrast with length-based filtering (
Supplementary Figure S2 ), HaploScore-based filtering retainsmany segments shorter than 5 cM at a precision of 0 . Figure 3C ). Recall of HaploScore-based filteringat 0 . .
19, a 13-fold increase compared to length-based filtering.
Robustness of results to HaploScore parameter variation
HaploScore is a function of two parameters: the genotyping error rate (cid:15) and the switch error rate σ .However, only the ratio of the two parameters affects the behavior of the score. In order to assess therobustness of HaploScore to varying parameters, a grid search was performed in which (cid:15) was fixed at0 . σ was varied three orders of magnitude from (cid:15)/
100 to 10 (cid:15) , and the AUC was computed at eachgrid point (
Figure 4A ). As expected, performance was strongest when the ratio of the parameters wasnear its true value. However, the performance degradation was modest across the wide range of parameterratio values examined, with the AUC dropping by less than 2% at worst.
Robustness of results to true IBD definition
In all analyses above, true IBD segments were defined as child-other segments that have at least 80%parent-other segment overlap. To assess the robustness of HaploScore to different true IBD definitions,a grid search was performed in which the definition of true IBD was varied from 10% to 100% parent-other segment overlap in increments of 10%. The AUC was computed at each grid point (
Figure 4B ).Performance was generally stable for all segment lengths and true IBD definitions, with the exception of5–6 cM segments at 100% overlap, where performance degraded appreciably. This is likely due at leastin part to the inherent bias for longer segments to have more sites at which premature truncation ofdetected IBD segments can arise from genotyping or switch errors.
Robustness of results to genome-wide IBD identification
To confirm that the results presented are not due to particular genomic features of chromosome 21,chromosome 10 was analyzed on the full cohort using the same parameters ( (cid:15) = 0 . σ = 0 . Supplementary Figure S6 ). In addition, IBDsegments were examined on all autosomes in the subset of all individuals comprising the 2,952 unrelatedtrios. No substantial deviations in performance were observed (not shown).
Filtering spurious reported IBD segments using HaploScore
HaploScore can be used to filter out spurious segments reported by an IBD detection algorithm asan efficient post-processing step. The reduced power to detect short segments requires more stringentHaploScore threshold values for shorter segments to achieve a similar precision value as for longer segments(
Figure 3 ). Since HaploScore provides a way to rank segments, the trade-off between precision and recallcan be tuned to the needs of the particular downstream application.HaploScore threshold values to ensure particular average overlap values of resultant segments weregenerated (see
Material and Methods ) and three separate filtering results are shown in
Figure 5 .Notably, more stringent filtering parameters have the largest effect on short segments and have nearlythe same effect as lenient filtering parameters for segments over 5 cM (
Figure 5 ). This result is intuitive,as the short reported segments are enriched for false positives (
Figure 2C – F ). Robustness of results to alternate individuals and genotyping platforms
To assess the robustness of the findings in an alternative population, a cohort of 555 European individualsincluding 52 father-mother-child trios genotyped as part of the 1000 Genomes project were analyzed(the “1000 Genomes cohort”, Supplementary Table S1 ). Individuals in the 1000 Genomes cohort weregenotyped on the Illumina HumanOmni2.5-Quad v1-0 B SNP array and as such provide an independentsample set from which to assess the generalizability of our results to additional individuals and alternativegenotyping platforms.GERMLINE reported a total of 6,585 child-other segments on chromosome 21 in the 1000 Genomescohort. After filtering artifactual IBD segments reported in regions of low site density, 5,770 child-othersegments were retained for analysis. The number of child-other segments detected in the 1000 Genomescohort is much smaller than in the 23andMe cohort (5,770 versus 13,307,562 candidate segments) sincethe 1000 Genomes cohort is much smaller. However, the rate of candidate segment detection is similar: inthe 1000 Genomes cohort, there are 5,770 segments for 52 ×
552 child-other pairs, resulting in an averageof 5770 / (52 × .
20 child-other segments per trio. In the 23andMe cohort, the corresponding rateis 13307562 / (2952 × .
18 child-other segments per trio.Analyses of child-other segments detected in the 1000 Genomes cohort were performed analogouslyto those in the 23andMe cohort. Only 12% of child-other segments were encompassed by a parent-othersegment, 20% of child-other segments have a partial parent-other segment in which at least one segmentend is truncated, and the remaining 68% of child-other segments have no corresponding parent-othersegment (
Supplementary Figure S7A ). Analysis of truncated segments in the 1000 Genomes cohortalso strongly suggests that false child-other IBD accounts for most discrepant segments, as 92% of oppositehomozygote sites that truncate the 1,174 truncated segments have confident genotype calls in both theparent and other individual (
Supplementary Figure S7B ). Finally, in the 68% of child-other segmentsthat have no corresponding parent-other segment, over 99% contain at least one opposite homozygotesite in the parent (
Supplementary Figure S7C ). Taken together, these results show that the 1000Genomes cohort is also rife with false positive IBD, and despite the different genotyping platform used,the error profile in the 1000 Genomes cohort is qualitatively very similar to that in the 23andMe cohort(
Figure 1 ).Examination of the relationship between segment length and segment overlap in the 1000 Genomescohort indicates similar general trends as those discovered in the 23andMe cohort (compare
Supple-mentary Figure S8 and
Figure 2 ), though the smaller number of segments makes the results morenoisy. Comparison of all 44,542 full 100-site windows in the 5,770 child-other segments shows that overlypermissive diplotype matching causes false reported IBD segments: the diplotype match criterion is satis-fied in 97.1% of windows contained within parent-other segments and in 96.5% of windows not containedwithin parent-other segments, whereas the haplotype match criterion is satisfied in 68.7% of windowscontained within parent-other segments but in only 51.1% of windows not contained within parent-othersegments (
Supplementary Table S2 ), a substantial reduction (binomial
P < − ).Finally, the performance of HaploScore in segregating true and false reported IBD was analyzed inthe 1000 Genomes cohort. The switch error rate was estimated from the data to be σ = 0 .
003 and thegenotyping error rate was estimated to be (cid:15) = 0 . Supplementary Figure S9 and
Figure 3 ). The smallnumber of child-other segments analyzed in the 1000 Genomes cohort causes somewhat noisy results, butthe effectiveness of HaploScore as a discriminator between true and false positive IBD is readily apparent.
Discussion
The usage of IBD segments in genetic analyses will become increasingly common as the number ofindividuals with their genetic composition known increases. Due to the inherently quadratic nature ofIBD detection between all pairs of individuals in a cohort, non-probabilistic methods are required to keepthe computational burden as low as possible. However, effective filtering methods are required to ensurereported IBD segments are accurate.Using the laws of Mendelian inheritance is an effective way to avoid modeling complex demographichistory when evaluating the accuracy of population genetics methods including IBD detection and localancestry inference . By using known familial relationships in a large set of trios, we were able toanalyze the accuracy of IBD segments reported by GERMLINE on non-simulated data. We found asurprisingly large number of false positive short segments and showed that these false positives arosedue to the diplotype-based IBD detection mechanism introduced to detect long IBD segments in thepresence of phasing switch errors . We introduced a haplotype-based metric, HaploScore, that effectivelydiscriminates between true and false reported IBD segments. We also investigated a likelihood-ratio-basedmetric, but found it less effective than HaploScore ( Supplementary Text ).Importantly, HaploScore can be computed efficiently using dynamic programming (in O( | S | ) timeper segment, see Material and Methods ). This suggests a strategy for accurate IBD detection inpopulation-scale datasets: detect candidate segments using a non-probabilistic IBD detection methodwith relatively permissive parameters and then cull true segments using HaploScore filtering. In addition,HaploScore can be applied as a post-processing step to existing genotyping- and sequencing-based IBDsegments, provided that an estimate of the switch error rate and the genotyping error rate are available.Achieving optimal HaploScore performance in a different population cohort or when using an alterna-tive genotyping platform depends on being able to accurately estimate the genotyping and switch errorrates of the data. Genotyping error rates can be estimated in any cohort by methods such as repeatgenotyping . While accurate determination of switch error rates currently requires trios or orthogonalanalysis methods such as phased sequencing , the robustness of HaploScore to substantial variationsin the parameter ratio indicates that it should be extensible to non-European populations, genotypingplatforms of different marker density, or even sequencing-based assays. Indeed, we demonstrated therobustness and generalizability of HaploScore by analyzing an independent cohort of 555 European in-dividuals from the 1000 Genomes project who were genotyped on a chip nearly twice as dense as the23andMe chip. While the smaller sample size of the 1000 Genomes cohort produced noisier results, allmajor findings of the analysis of the 23andMe cohort were replicated in the 1000 Genomes cohort.Python code implementing HaploScore filtering is freely available ( https://github.com/23andMe/ibd ). Material and Methods
Cohort description
The 23andMe cohort analyzed comprises 25,432 customers of 23andMe, Inc., a personal genetics company,who were genotyped on the Illumina HumanOmniExpress+ BeadChip as part of the 23andMe PersonalGenome Service. The chip contains roughly 1,000,000 sites genome-wide . Individuals were selected forhaving >
97% European ancestry as described previously . The 23andMe cohort includes 2,952 distinctfather-mother-child trios identified by IBD sharing . Parent-child relationships were defined as havingat least 85% of the genetic length of the genome shared on at least one haplotype and no more than 10%of the genetic length of the genome shared on both haplotypes. Parent-parent relationships were definedas having at most 20% of the genetic length of the genome shared on at least one haplotype.The 1000 Genomes cohort analyzed comprises 555 individuals from five European populations whowere genotyped on the Illumina HumanOmni2.5-Quad v1-0 B SNP array as described previously (sam-ples available at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20120131_omni_genotypes_and_intensities/Omni25_genotypes_2141_samples.b37.vcf.gz ). The 1000 Genomescohort includes 52 distinct father-mother-child trios identified within the 1000 Genomes project (meta-data available at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20130606_sample_info/20130606_sample_info.txt ) and which we validated independently by IBD sharing ( SupplementaryTable S1 ). All members of the 1000 Genomes cohort were verified to not be present in the 23andMecohort.
IBD detection
The 23andMe cohort
Genotypes of all individuals included in the 23andMe cohort were phased using BEAGLE version 3.3.1in batches of 8,000–9,000 individuals as described previously . In each batch, we excluded sites withminor allele frequency < . P < − , call rate < × , algorithm with the parameters -bits 100 -err hom 2 -err het 0 -w extend-min m 2 -map
Genotypes of all 555 individuals in the 1000 Genomes cohort were phased using BEAGLE version 3.3.1in a single batch. Windows of 3,000 sites that overlapped by 100 sites were stitched together as describedpreviously . Sites that were not polymorphic in the 555 individuals examined, had a 1000-Genomes-reported Hardy-Weinberg equilibrium P < − , or a call rate within the 555 individuals examined <
95% were excluded, resulting in 23,142 sites on chromosome 21. GenCall genotype scores were set to0 for all sites not called in each individual.Candidate IBD segments were identified and filtered identically to those found in the 23andMe cohortdescribed above.
HaploScore description and computational complexity
HaploScore provides a metric by which to rank the likelihood that a stretch of DNA is inherited IBDbetween two individuals or not. Let (cid:15) and σ denote the probability of a genotyping error and a switcherror at any given site, respectively. The HaploScore for a candidate IBD segment S isHaploScore( S ) = 1 | S | (cid:16) n g (cid:15) + n s σ (cid:17) (1)where | S | is the number of genotyped sites in S and n g and n s are the number of genotyping and switcherrors, respectively, that together minimize the score while reconciling the segment as matching across asingle haplotype in both individuals.0Finding the HaploScore (i.e. the optimal values of n g and n s subject to the constraints) can beviewed as finding the minimum-cost path through the directed acyclic graph (DAG) described below( Supplementary Figure S10 ).Let G be a DAG with a single source node and a single sink node. Between the source and the sink,the graph has | S | levels, one per genotyped site in segment S . Each of these | S | levels has four nodes,one for each possible haplotype configuration. Each node in level l has four outgoing directed edges, oneto each node in level l + 1. Below, we use the same notation for nodes and their weights.At any level l , let h ( i,j ) l , i, j ∈ { , } denote the four possible haplotype configurations of an IBDmatch. The nodes are weighted as follows: h ( i,j ) l = (cid:40) i in first individual matches haplotype j in second individual at position l ,1 /(cid:15) otherwise. (2)Let e ( i,j ) , ( u,v ) l denote the weight of the edge between nodes h ( i,j ) l and h ( u,v ) l +1 . Edges are weighted asfollows: e ( i,j ) , ( u,v ) l = i = u and j = v ,1 /σ if i = u and j (cid:54) = v ,1 /σ if i (cid:54) = u and j = v ,2 /σ if i (cid:54) = u and j (cid:54) = v . (3)The weights of the four edges from the source node to the nodes in the first level, as well as theweights from the nodes in level | S | to the sink node, are set to 0. The cost of a path in G is defined asthe sum of the weights of the edges and nodes it traverses.HaploScore( S ) is equal to the smallest of all path costs from the source to the sink. It can be efficientlycomputed using dynamic programming by noting that the smallest cost from the source to level l + 1 inthe graph can easily be inferred from the smallest cost from the source to level l . Let C ( i,j ) l denote thesmallest cost from the source to haplotype configuration ( i, j ) at level l . Then, C ( u,v ) l +1 = min i,j,u,v (cid:16) C ( i,j ) l + e ( i,j ) , ( u,v ) l + h ( u,v ) l +1 (cid:17) . (4)The minimum cost to reach level l , C ∗ l , is then C ∗ l = min i,j C ( i,j ) l . (5)The above equations clearly show that computing HaploScore( S ) involves 16 comparisons at eachgenotyped site in S . Thus, the complexity of computing HaploScore( S ) is at most 16 | S | . Performancecan be further improved when filtering by HaploScore by terminating computation as soon as a segment’sHaploScore becomes too high to satisfy the maximum value threshold. HaploScore parameter estimation
HaploScore uses two parameters, the genotyping error rate per site (cid:15) and the switch error rate persite σ . Analyses of genotyping chip accuracy and internal comparisons between genotype and whole-genome sequencing data verify that the genotyping error rate is <
1% (not shown). To estimate theempirical switch error rate per site, all 2,952 trios in the 23andMe cohort were trio-phased using thelaws of Mendelian inheritance and the results for all children were compared to their BEAGLE-phasedhaplotypes, assuming that the trio-phased haplotypes represented the true phase. The average per-site1switch error rate ranged from 0 . . Supplementary Figure S4 ).The switch error rate calculation process described above was performed independently on the 1000Genomes cohort. A total of 3,629 switch errors were detected in the 52 trio children over 23,142 sites.This corresponds to an individual switch error rate per site of 3629 / (52 × . HaploScore threshold matrix generation
A matrix of HaploScore thresholds was generated in the following manner: all segments were binned bygenetic length in 0 . Supplementary File 1 ). Acknowledgments
We thank the customers of 23andMe who contributed the genetic data that made this research possibleand are grateful to the employees of 23andMe for creating and supporting the resources necessary forthis research. We also thank members of the 23andMe research team for insightful comments. This workwas supported by the National Human Genome Research Institute of the National Institutes of Health(grant number R44HG006981).
References
1. Browning, S. R. & Browning, B. L. Identity by descent between distant relatives: detection andapplications.
Annual Review of Genetics , 617–633 (2012).2. Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation.
NatureGenetics , 1068–1075 (2008).3. Krawitz, P. M. et al. Identity-by-descent filtering of exome sequence data identifies PIGV mutationsin hyperphosphatasia mental retardation syndrome.
Nature Genetics , 827–829 (2010).4. Gusev, A. et al. DASH: a method for identical-by-descent haplotype mapping uncovers associationwith recent variation.
The American Journal of Human Genetics , 706–717 (2011).5. Jonsson, T. et al. A mutation in APP protects against Alzheimer’s disease and age-related cognitivedecline.
Nature , 96–99 (2012).6. Visscher, P. M. et al.
Assumption-free estimation of heritability from genome-wide identity-by-descentsharing between full siblings.
PLOS Genetics , e41 (2006).7. Zuk, O., Hechter, E., Sunyaev, S. R. & Lander, E. S. The mystery of missing heritability: geneticinteractions create phantom heritability. Proceedings of the National Academy of Sciences of theUnited States of America , 1193–1198 (2012).8. Palamara, P. F., Lencz, T., Darvasi, A. & Pe’er, I. Length distributions of identity by descent revealfine-scale demographic history.
The American Journal of Human Genetics , 809–822 (2012).29. Gusev, A. et al. The architecture of long-range haplotypes shared within and across populations.
Molecular Biology and Evolution , 473–486 (2012).10. Ralph, P. & Coop, G. The geography of recent genetic ancestry across Europe. PLOS Biology ,e1001555 (2013).11. Albrechtsen, A. et al. Relatedness mapping and tracts of relatedness for genome-wide data in thepresence of linkage disequilibrium.
Genetic Epidemiology , 266–274 (2009).12. Han, L. & Abney, M. Using identity by descent estimation with dense genotype data to detectpositive selection. European Journal of Human Genetics , 205–211 (2013).13. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkageanalyses.
The American Journal of Human Genetics , 559–575 (2007).14. Browning, S. R. & Browning, B. L. High-resolution detection of identity by descent in unrelatedindividuals. The American Journal of Human Genetics , 526–539 (2010).15. Han, L. & Abney, M. Identity by descent estimation with dense genome-wide genotype data. GeneticEpidemiology , 557–567 (2011).16. Palin, K., Campbell, H., Wright, A. F., Wilson, J. F. & Durbin, R. Identity-by-descent-based phasingand imputation in founder populations using graphical models. Genetic Epidemiology , 853–860(2011).17. Brown, M. D., Glazner, C. G., Zheng, C. & Thompson, E. A. Inferring coancestry in populationsamples in the presence of linkage disequilibrium. Genetics , 1447–1460 (2012).18. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment searchtool.
Journal of Molecular Biology , 403–410 (1990).19. Gusev, A. et al.
Whole population, genome-wide mapping of hidden relatedness.
Genome Research , 318–326 (2009).20. Browning, B. L. & Browning, S. R. A fast, powerful method for detecting identity by descent. TheAmerican Journal of Human Genetics , 173–182 (2011).21. Browning, B. L. & Browning, S. R. Improving the accuracy and efficiency of identity-by-descentdetection in population data. Genetics , 459–471 (2013).22. Williams, A. L., Patterson, N., Glessner, J., Hakonarson, H. & Reich, D. Phasing of many thousandsof genotyped samples.
The American Journal of Human Genetics , 238–251 (2012).23. Browning, B. L. & Browning, S. R. Efficient multilocus association testing for whole genome associ-ation studies using localized haplotype clustering. Genetic Epidemiology , 365–375 (2007).24. Fan, J.-B. et al. Highly parallel SNP genotyping. In
Cold Spring Harbor Symposia on QuantitativeBiology , vol. 68, 69–78 (Cold Spring Harbor Laboratory Press, 2003).25. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 humangenomes.
Nature , 56–65 (2012).26. Pasaniuc, B. et al.
Analysis of Latino populations from GALA and MEC studies reveals genomicloci with biased local ancestry estimation.
Bioinformatics , 1407–1415 (2013).327. Pompanon, F., Bonin, A., Bellemain, E. & Taberlet, P. Genotyping errors: causes, consequences andsolutions. Nature Reviews Genetics , 847–846 (2005).28. Voskoboynik, A. et al. The genome sequence of the colonial chordate, Botryllus schlosseri.
Elife ,e00569 (2013).29. Hinds, D. A. et al. A genome-wide association meta-analysis of self-reported allergy identifies sharedand allergy-specific susceptibility loci.
Nature Genetics , 907–911 (2013).30. Henn, B. M. et al. Cryptic distant relatives are common in both isolated and cosmopolitan geneticsamples.
PLOS ONE , e34267 (2012).31. Browning, S. R. & Browning, B. L. Rapid and accurate haplotype phasing and missing-data inferencefor whole-genome association studies by use of localized haplotype clustering. The American Journalof Human Genetics , 1084–1097 (2007).32. Frazer, K. A. et al. A second generation human haplotype map of over 3.1 million SNPs.
Nature , 851–861 (2007).33. Kent, W. J. et al.
The human genome browser at UCSC.
Genome Research , 996–1006 (2002).34. Zhuang, Z., Gusev, A., Cho, J. & Pe’er, I. Detecting Identity by Descent and Homozygosity Mappingin Whole-Exome Sequencing Data. PLOS ONE , e47618 (2012).35. Paynter, R. A. et al. Accuracy of multiplexed Illumina platform-based single-nucleotide polymorphismgenotyping compared between genomic and whole genome amplified DNA collected from multiplesources.
Cancer Epidemiology Biomarkers & Prevention , 2533–2536 (2006).4 Figures
Reported child segments N u m b e r o f se g m e n t s ( m illi on s ) ! parent segment overlap < 100% 0% < parent segment overlap < 80% 0% parent segment overlap A B C N u m b e r o f se g m e n t s Number of opposite homozygotes
0 5 10 15 20 "
25 8,062,142 segments absent in both parents
61% 8% 14% 17%
0% 20% 40% 60% 80% 100%
Truncated segments
Mean: 8.6 Median: 7
Figure 1. Analysis of child-other segments in parents. A.
The majority of child-other segmentsare not detected in either parent. Parent segment overlap is calculated as the percentage of sites in thechild-other segment that are included in the parent-other segment. B. Truncation points forparent-other segments are nearly always confidently-genotyped opposite homozygote sites, consistentwith false positive IBD in the child. The opposite homozygote site causing truncation of theparent-other segment was examined in a randomly-selected subset of all 3,371,616 segments with partialparent overlap. C. Child-other segments with no corresponding parent-other segments contain manyparent-other opposite homozygotes in the region, also consistent with false positive IBD in the child.For each of these child-other segments, the number of opposite homozygote sites present between theparent and the other individual at that segment location is calculated separately for each parent, andthe smaller is chosen as the number of opposite homozygotes in the region.5 ! " : / ; , - . < *=**=)*=!*="*= , ; - : , > , - . ? @ , A ; B ! " A , / : / ? - *=" *=% (cid:1) *=(C,3>,-.1?@,A2;B*)***!***"*** D E > F , A ? G : , > , - . : . ? E : ; - H : !I"106 *=" *=% (cid:1) *=(C,3>,-.1?@,A2;B*)**!**"** D E > F , A ? G : , > , - . : . ? E : ; - H : "I *=" *=% (cid:1) *=(C,3>,-.1?@,A2;B*)*!*"* D E > F , A ? G : , > , - . : . ? E : ; - H : *=" *=% (cid:1) *=(C,3>,-.1?@,A2;B*$)*)$!*!$ D E > F , A ? G : , > , - . : . ? E : ; - H : $I%106 ! " Figure 2. Accuracy of child-other IBD segments reported by GERMLINE. A.
Heat map ofthe mean fraction of reported child-other IBD segments contained in a corresponding parent-othersegment, binned by two measures of segment length. For each child-other segment, the fraction of thesegment also reported as an IBD segment between the parent and the other individual is calculated.Shown in each bin is the mean of the segment fractions calculated for all segments in the bin. B. Thefraction of child-other segments that are true IBD as a function of segment length. True IBD segmentsare defined as having at least 80% of their sites encompassed by a parent-other segment.
C–F.
Histograms of child-other segment counts binned by segment overlap for segments of 2–3 cM ( C ), 3–4cM ( D ), 4–5 cM ( E ), and 5–6 cM ( F ). Note the scale changes on the y-axes: though the fraction of truesegments of length < ! " - + : ’ ’ ( , ? ’ . @ ’ ( ) , A ’ : - C : D ’ , ? * ) * A ’ , : ) ’ $E%,+1,03;>!2 G : ’ + * ? * ( $E%,+1 !" Figure 3. Improving detection of true IBD segments using HaploScore. A.
Heat map of themean fraction of reported IBD segments found in parents, binned by segment genetic length andHaploScore. Calculations are performed as in
Figure 2A . B. Receiver operating characteristic forreported IBD segments of various lengths, discriminating by HaploScore. True IBD is defined asin
Figure 2B . The dashed black line indicates the no-discrimination line. The area under each curve isparenthesized in its legend entry. C. Precision-recall plot for child-other segments binned by segmentlength.7
Figure 4. HaploScore is robust to a wide range of input parameters. A.