[PDF] Coverage statistics for sequence census methods

Abstract

Background: We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce the notion of the shape of a coverage function, which can be used to detect abberations in coverage. The probability theory underlying these problems is essential for constructing models of current high-throughput sequencing experiments, where both sample preparation protocols and sequencing technology particulars can affect fragment length distributions. Results: We show that regardless of fragment length distribution and under the mild assumption that fragment start sites are Poisson distributed, the fragments produced in a sequencing experiment can be viewed as resulting from a two-dimensional spatial Poisson process. We then study the jump skeleton of the the coverage function, and show that the induced trees are Galton-Watson trees whose parameters can be computed. Conclusions: Our results extend standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provide a null model for detecting deviations from random coverage in high-throughput sequence census based experiments. By focusing on fragments, we are also led to a new approach for visualizing sequencing data that should be of independent interest.

Full PDF

CCoverage statistics for sequence census methods

Steven N. Evans, Valerie Hower and Lior PachterOctober 24, 2018

Abstract

Background:

We study the statistical properties of fragment coverage in genome sequencingexperiments. In an extension of the classic Lander-Waterman model, we consider the eﬀect ofthe length distribution of fragments. We also introduce the notion of the shape of a coveragefunction, which can be used to detect abberations in coverage. The probability theory under-lying these problems is essential for constructing models of current high-throughput sequencingexperiments, where both sample preparation protocols and sequencing technology particularscan aﬀect fragment length distributions.

Results:

We show that regardless of fragment length distribution and under the mild assump-tion that fragment start sites are Poisson distributed, the fragments produced in a sequencingexperiment can be viewed as resulting from a two-dimensional spatial Poisson process. Wethen study the jump skeleton of the the coverage function, and show that the induced trees areGalton-Watson trees whose parameters can be computed.

Conclusions:

Our results extend standard analyses of shotgun sequencing that focus on cover-age statistics at individual sites, and provide a null model for detecting deviations from randomcoverage in high-throughput sequence census based experiments. By focusing on fragments, weare also led to a new approach for visualizing sequencing data that should be of independentinterest.

The classic “Lander-Waterman model” [15] provides statistical estimates for the read coveragein a whole genome shotgun (WGS) sequencing experiment via the Poisson approximation to theBinomial distribution. Although originally intended for estimating the extent of coverage whenmapping by ﬁngerprinting random clones, the Lander-Waterman model has served as an essentialtool for estimating sequencing requirements for modern WGS experiments [17]. Although it makesa number of simplifying assumptions (e.g. ﬁxed fragment length and uniform fragment selection )that are violated in actual experiments, extensions and generalizations [19, 18] have continued tobe developed and applied in a variety of settings.The advent of “high-throughput sequencing”, which refers to massively parallel sequencingtechnologies has greatly increased the scope and applicability of sequencing experiments. With theincreasing scope of experiments, new statistical questions about coverage statistics have emerged.In particular, in the context of sequence census methods , it has become important to understandthe shape of coverage functions, rather than just coverage statistics at individual sites.1 a r X i v : . [ q - b i o . GN ] A p r equence census methods [20] are experiments designed to assess the content of a mixture ofmolecules via the creation of DNA fragments whose abundances can be used to infer those of theoriginal molecules. The DNA fragments are identiﬁed by sequencing, and the desired abundancesinferred by solution of an inverse problem. An example of a sequence census method is ChIP-Seq.In this experiment, the goal is to determine the locations in the genome where a speciﬁc proteinbinds. An antibody to the protein is used to “pull down” fragments of DNA that are bound viaa process called chromatin immunoprecipitation (abbreviated by ChIP). These fragments form the“mixture of molecules” and after purifying the DNA, the fragments are determined by sequencing.The resulting sequences are compared to the genome, leading to a coverage function that records,at each site, the number of sequenced fragments that contained it. As with many sequence censusmethods, “noise” in the experiment leads to random sequenced fragments that may not correspondto bound DNA, and therefore it is necessary to identify regions of the coverage function that deviatefrom what is expected according to a suitable null model.The purpose of this paper is not to develop methods for the analysis of ChIP-Seq (or any othersequence census method), but rather to present a null model for the shape of a coverage functionthat is of general utility. That is, we propose a deﬁnition for the shape of a fragment coveragefunction, and describe a random instance assuming that fragments are selected at random from agenome, with lengths of fragments given by a known distribution. The distinction between our workand previous statistical studies of sequencing experiments, is that we go beyond the description ofcoverage at a single location, to a description of the change in coverage along a genome. We begin by explaining what we mean by a coverage function . Given a genome modeled as a stringof ﬁxed length N , a coverage function is a function f : { , . . . , N } −→ Z ≥ . The interpretationof this function, is that f ( i ) is the number of sequenced fragments obtained from a sequencingexperiment that cover position i in the genome. It is important to note that N is typically large;for example, the human genome consists of approximately 2 . N is verylarge, we replace the ﬁnite set { , . . . , N } with R , and re-deﬁne a coverage function to be a function f : R −→ Z ≥ . This helps to simplify our analysis.We next introduce an object that describes a sequence coverage function’s shape. Our approachis motivated by recent applications of topology including persistent homology [2, 21] and the use ofcritical points in shape analysis [1, 5, 6]. For a given coverage function f : R −→ Z ≥ , we will deﬁnea rooted tree, which is a particular type of directed graph with all the directed edges pointing awayfrom the root. This tree T f is based on the upper-excursion sets of f : U h := { ( x, f ( x )) | f ( x ) ≥ h } , h ∈ Z ≥ and keeps track of how the sets U h evolve as h decreases. Long paths in T f representfeatures of the coverage function that persist through many values of h .Speciﬁcally, for each h ∈ Z ≥ , let C h denote the set of connected components of the upper-excursion set U h . We deﬁne the rooted tree T f = ( V, E ) as follows • Vertices in V correspond to the connected components in the collection { C h } h ∈ Z ≥ • ( i, j ) ∈ E provided their corresponding connected components c i ∈ C h i and c j ∈ C h j with h i < h j satisfy h i = h j − c j ⊂ c i .Note that the root of T f corresponds to the single connected component in C . The tree T f is verysimilar to a contour tree [1, § T f correspond to (equivalence classes of) local extrema of f . Eachlocal maximum of f yields the birth of a new connected component as we sweep down through h ∈ Z ≥ while a local minimum of f merges connected components. Since we do not require f to have distinct critical values (as is frequently assumed), the vertices in T f can have arbitrarydegrees, as is depicted in Figure 1C.In the sequel, we will use the following equivalent characterization that can be found in [7, § f : R −→ Z ≥ with f ( a ) = f ( b ) = 0 and f ( x ) > x ∈ ( a, b ), weform an integer-valued sequence x , . . . , x n that records the changes in height of f on the interval[ a, b ]. The sequence x , . . . , x n consists of the y values that f travels through from x := f ( a ) = 0to x n := f ( b ) = 0 and satisﬁes x = x n = 0 ,x i > < i < n, | x i − x i − | = 1 for 1 ≤ i ≤ n. Such a sequence is called a lattice path excursion away from

0. Next, we deﬁne an equivalencerelation on the set { , , . . . , n } by setting i ≡ j ⇐⇒ x i = x j = min i ≤ k ≤ j x k . The equivalence classes under this relation are in 1 : 1 correspondence with the connected compo-nents in the upper-excursion sets of f | [ a,b ] . One equivalence class is { , n } , and if { i , . . . , i p } isan equivalence class with 0 < i < i < . . . < i p then x i − = x i − , whereas x i q − = x i q + 1 for2 ≤ q ≤ p . Conversely, any index i with x i − = x i − T f | [ a,b ] as the set { } ∪ { i | x i − = x i − } . Two indices i < i are adjacent in T f | [ a,b ] provided x i = x i + 1 and x k ≥ x i for i ≤ k ≤ i . Figure 1 gives an example of a coveragefunction together with its lattice path excursion (0 , , , , , , , , , , , , , , , ,

0) and rootedtree. The minimal elements of each equivalence class in Figure 1B are depicted with red squares.Figure 1: A coverage function (A) with its lattice path excursion (B) and rooted tree (C).3

Planar Poisson processes from sequencing experiments

In order to model random coverage along the genome, we use a Poisson process to give randomstarting locations to the fragments. Speciﬁcally, suppose that we have a stationary Poisson pointprocess on R with intensity ρ . At each point of the Poisson point process we lay down an intervalthat has that point as its left end-point. The lengths of the successive intervals are independentand identically distributed with common distribution µ . We will use the notation X for a coveragefunction built from this process and X t for the height at a point t .Let t , t , · · · be the left-end points and l , l , · · · be the corresponding lengths of intervals. Theinterval given by ( t i , l i ) will cover a nucleotide t provided t i ≤ t and t i + l i ≥ t . We can view thispictorially by plotting points { ( t j , l j ) } in the plane. Then X t —the number of intervals covering t —is the number of points in the triangular region below. We now recall the deﬁnition of a two- l = t − t t ( t, l )-plane Figure 2: A two dimensional view of a sequencing experiment.dimensional Poisson process and refer the reader to [10, § § σ -algebra B ( R ). A random countable subset Π of R is called a non-homogeneous Poisson process with mean measure Γ if, for all Borel subsets A , therandom variables N ( A ) := A ∩ Π) satisfy:1. N ( A ) has the Poisson distribution with parameter Γ( A ), and2. If A , · · · , A k are disjoint Borel subsets of R , then N ( A ) , · · · , N ( A k ) are independent ran-dom variables.The following theorem is a consequence of [14, Proposition 12.3]. Theorem 3.0.1.

The collection { ( t i , l i ) } of points obtained as described above is a non-homogeneousPoisson process with mean measure ρ m ⊗ µ . Here m is Lebesgue measure on R . We compute the expected value E [ X t ] = ρ m ⊗ µ (wedge) : ρ m ⊗ µ (wedge) = ρ (cid:90) t −∞ (cid:90) ∞ t − u µ ( dv ) du = ρ (cid:90) t −∞ µ (( t − u, ∞ )) du = ρ (cid:90) ∞ µ (( s, ∞ )) ds. .1 Fragment lengths have the exponential distribution We treat the simplest case ﬁrst, namely the case where the distribution µ of fragment lengths isexponential with rate λ . Then, we have µ (( s, ∞ )) = P { l > s } = e − λs , and E ( X t ) = ρ (cid:90) ∞ e − λs ds = ρλ . Claim 1.

The process X is a stationary, time-homogeneous Markov process.Proof. It is clear that X is stationary because of the manner in which it is constructed from a Poissonprocess on R that has a distribution which is invariant under translations in the t direction; thatis, the random set { ( t i , l i ) } has the same distribution as { ( t i + t, l i ) } for any ﬁxed t ∈ R . Since µ isexponential, it is memoryless, meaning for any interval length l with an exponential distribution P { l > a + b | l > a } = P { l > b } . This means that probability that an interval covers t knowing that it covers t is the same as theprobability that an interval starting at t covers t . Thus, the probability that X t = k given X t for t ≤ t only depends on the value of X t . Indeed, in terms of time, P { X t = k | X t = k (cid:48) } dependsonly on t − t .More speciﬁcally, X is a birth-and-death process with birth rate β ( k ) = ρ in all states k anddeath rate δ ( k ) = kλ in state k ≥

1. Note that as the exponential distribution is the onlydistribution with the memoryless property, we lose the Markov property when µ is not exponential.To build the tree of §

2, we are interested in the jumps of the coverage function f ( t ) = X t . Wehence consider the jump chain of X — a discrete-time Markov chain with transition matrix P ( i, j ) =  , if i = 0 and j = 1 , ρρ + iλ , if i ≥ j = i + 1 , iλρ + iλ , if i ≥ j = i − , , otherwise . Suppose now we have a lattice path excursion starting at 0. Given a vertex v of the associated treeat height k , we are interested in the number of oﬀspring (at height k + 1) of this vertex. Suppose i is the minimal equivalence class representative for vertex v , and suppose [ i ] = { i , i , · · · , i n } with i < i < · · · < i n . Then, we have x i r = k for 0 ≤ r ≤ n , x i r +1 = k + 1 for 0 ≤ r ≤ n − x i n +1 = k −

1, and x t > k for i < t < i n with t (cid:54) = some i r . From the Markov property, for0 ≤ j ≤ n , P { x i j +1 = k + 1 | x i j = k } = ρρ + λk and P { x i j +1 = k − | x i j = k } = λkρ + λk . The resultingtree is a Galton-Watson tree with generation-dependent oﬀspring distributions (see [8, 9, 12, 13]for more on Galton-Watson trees). Indeed, we have P { a vertex at height k has n oﬀspring } = (cid:18) ρρ + λk (cid:19) n λkρ + λk , which is the probability of n failures before the ﬁrst success in a sequence of independent Bernoullitrials where the probability of success equals λkρ + λk .5 .2 Fragment lengths have a general distribution Suppose that we have a general distribution µ for the fragment lengths. We observe X at some ﬁxed“time” – which might as well be 0 because of stationarity, and ask for the conditional probabilitygiven X that the next jump of X will be upwards. We know from the above that if µ is exponentialwith rate λ , then conditional on X = k this is ρ/ ( ρ + kλ ).Let T denote the time until the next segment comes along. This random variable has anexponential distribution with rate ρ and is independent of X [4, § X = k ,the two-dimensional Poisson point process must have k points in the region A := { ( t, l ) : −∞ < t ≤ , − t < l < ∞} . T ( t, l )-plane T Figure 3: A wedge from the planar Poisson process.Conditionally, these k points in A have the same distribution as k points chosen at random in A according to the probability measure ρ m ⊗ µ ( B ) ρ m ⊗ µ ( A ) for B ⊂ A However, in order that the next jump after 0 is upwards, the two-dimensional Poisson point processmust have no points in the orange region { ( t, l ) : −∞ < t ≤ , − t < l < T − t } as these segments end before time T . This leaves the k points lying in the blue region B T := { ( t, l ) : −∞ < t ≤ , T − t ≤ l < ∞} , which occurs with probability (cid:16) ρ (cid:82) ∞ T µ (( u, ∞ )) duρ (cid:82) ∞ µ (( u, ∞ )) du (cid:17) k . Thus, conditional on X = k , the probabilitythat the next jump will be upwards is (cid:90) ∞ (cid:18) (cid:82) ∞ t µ (( u, ∞ )) du (cid:82) ∞ µ (( u, ∞ )) du (cid:19) k ρe − ρt dt. p ( k ) for this quantity. A reasonable approximation to the jump skeleton Z of X is to take itbe a discrete-time Markov chain on the nonnegative integers with transition probabilities P ( i, j ) =  , if i = 0 and j = 1 ,p ( i ) , if i ≥ j = i + 1 , − p ( i ) , if i ≥ j = i − , , otherwise . The resulting tree is then a Galton-Watson tree with generation dependent oﬀspring distributions,where P { a vertex at height k has n oﬀspring } = p ( k ) n (1 − p ( k )) . Example 3.2.1.

Suppose µ is the point mass at L (that is, all segment lengths are L ). Then µ (( u, ∞ )) = (cid:40) , u < L , u ≥ L , and (cid:90) ∞ t µ (( u, ∞ )) du = (cid:40)(cid:82) Lt du = L − t, t < L , t ≥ L. This gives p ( k ) = (cid:90) L ( L − t ) k L k ρe − ρt dt = (cid:90) w k ρe − ρ ( L − Lw ) Ldw = θe − θ (cid:90) w k e θw dw for k ≥ , where θ := ρL = E [ X ] . We integrate by parts and ﬁnd that p ( k ) = θe − θ q ( k ) where q ( k ) = w k e θw θ (cid:12)(cid:12)(cid:12)(cid:12) w =1 w =0 − kθ (cid:90) w k − e θw dw = e θ θ − kθ q ( k − for k ≥ , which yields the recursion p ( k ) = 1 − kθ p ( k − , k ≥ , with p (1) = 1 − θ + e − θ θ . Solving explicitly, we obtain p ( k ) = k !  k (cid:88) j =0 ( − k − j j ! θ k − j + ( − k − e − θ θ k  for k ≥ . Discussion

Our observation that randomly sequenced fragments from a genome form a planar Poisson processin ( position, length ) coorindates has implications beyond the coverage function analysis performedin this paper. For example we have found that the visualization of sequencing data in this novelform is useful for quickly identifying instances of sequencing bias by eye, as it is easy to “see”deviations from the Poisson process. An example is shown in Figure 4 where fragments from anIllumina sequencing experiment are compared with an idealized simulation (where the fragmentsare placed uniformly at random). Speciﬁcally, paired-end reads from an RNA-Seq experimentconducted on a GAII sequencer were mapped back to the genome and fragments inferred from theread end locations. Bias in the sequencing is immediately visible, likely due to non-uniform PCRampliﬁcation [11] and other eﬀects. We hope that others will ﬁnd this approach to visualizingfragment data of use.Figure 4: (A) Fragments from a sequencing experiment shown in the ( t, l ) plane. (B) The spatialPoisson process resulting from fragments with the same length distribution as (A) but with positionsampled uniformly at random.The “shape” we have proposed for coverage functions was motivated by persistence ideas fromtopological data analysis (TDA). In the context of TDA, our setting is very simple (1-dimensional),however unlike what is typically done in TDA, we have provided a detailed probabilistic analysisthat can be used to construct a null hypothesis for coverage-based test statistics. For example, weenvision computing test statistics [16] based on the trees constructed from coverage functions andcomparing those to the statistics expected from the Galton-Watson trees. It should be interestingto perform similar analyses with high-dimensional generalizations for which we believe many of ourideas can be translated. There are also biological applications, for example in the analysis of pooledexperiments where fragments may be sequenced from diﬀerent genomes simultaneously.Indeed, we believe that the study of sequence coverage functions that we have initiated maybe of use in the analysis of many sequence census methods. The number of proposed protocolshas exploded in the past two years, as a result of dramatic drops in the price of sequencing. Forexample, in January 2010, the company Illumina announced a new sequencer, the HiSeq 2000,that they claim “changes the trajectory of sequencing” and can be used to sequence 25Gb per8ay. Although technologies such as the HiSeq 2000 were motivated by human genome sequencinga surprising development has been the fact that the majority of sequencing is in fact being usedfor sequence census experiments [20]. The vast amounts of sequence being produced in the contextof complex sequencing protocols, means that a detailed probabilistic understanding of randomsequencing is likely to become increasingly important in the coming years.

SNE is supported in part by NSF grant DMS-0907630 and VH is funded by NSF fellowship DMS-0902723. We thank Adam Roberts for his help in making Figure 4.

LP proposed the problem of understanding the random behaviour of coverage functions in thecontext of sequence census methods. VH investigated the jump skeleton based on ideas fromtopological data analysis. SE developed the probability theory and identiﬁed the relevance ofTheorem 3.0.1. SNE, VH and LP worked together on all aspects of the paper and wrote themanuscript.

References [1] S. Biasotti, D. Giorgi, M. Spagnuolo, and B. Falcidieno. Reeb graphs for shape analysis andapplications.

Theoretical Computer Science , 392(1-3):5 – 22, 2008. Computational AlgebraicGeometry and Applications.[2] Gunnar Carlsson. Topology and data.

Bull. Amer. Math. Soc. (N.S.) , 46(2):255–308, 2009.[3] Hamish Carr, Jack Snoeyink, and Ulrike Axen. Computing contour trees in all dimensions.

Comput. Geom. , 24(2):75–94, 2003. Special issue on the Fourth CGC Workshop on Computa-tional Geometry (Baltimore, MD, 1999).[4] D. J. Daley and D. Vere-Jones.

An introduction to the theory of point processes . SpringerSeries in Statistics. Springer-Verlag, New York, 1988.[5] Mark de Berg and Marc van Kreveld. Trekking in the Alps without freezing or getting tired.

Algorithmica , 18(3):306–323, 1997. First European Symposium on Algorithms (Bad Honnef,1993).[6] Herbert Edelsbrunner, John Harer, and Afra Zomorodian. Hierarchical Morse-Smale com-plexes for piecewise linear 2-manifolds.

Discrete Comput. Geom. , 30(1):87–107, 2003. ACMSymposium on Computational Geometry (Medford, MA, 2001).[7] Steven N. Evans.

Probability and real trees , volume 1920 of

Lecture Notes in Mathematics .Springer, Berlin, 2008. Lectures from the 35th Summer School on Probability Theory held inSaint-Flour, July 6–23, 2005. 98] Dean H. Fearn. Galton-Watson processes with generation dependence. In

Proceedings ofthe Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. California,Berkeley, Calif., 1970/1971), Vol. IV: Biology and health , pages 159–172, Berkeley, Calif.,1972. Univ. California Press.[9] I. J. Good. The joint distribution for the sizes of the generations in a cascade process.

Proc.Cambridge Philos. Soc. , 51:240–242, 1955.[10] Geoﬀrey R. Grimmett and David R. Stirzaker.

Probability and random processes . OxfordUniversity Press, New York, third edition, 2001.[11] K Hansen, SE Brenner, and S Dudoit. Biases in illumina transcriptome sequencing caused byrandom hexamer priming.

Nucleic Acids Research , 2010.[12] Theodore E. Harris.

The theory of branching processes . Dover Phoenix Editions. Dover Pub-lications Inc., Mineola, NY, 2002. Corrected reprint of the 1963 original [Springer, Berlin;MR0163361 (29

J. Appl. Probability , 11:174–178, 1974.[14] Olav Kallenberg.

Foundations of modern probability . Probability and its Applications (NewYork). Springer-Verlag, New York, second edition, 2002.[15] ES Lander and MS Waterman. Genomic mapping by ﬁngerprinting random clones: a mathe-matical analysis.

Genomics , 2:231–239, 1988.[16] FA Matsen. A geometric approach to tree shape statistics.

Systematic Biology , 4:652–661,2006.[17] JL Weber and EW Myers. Human whole-genome shotgun sequencing.

Genome Research ,7:401–409, 1997.[18] MC Wendl. A general coverage theory for shotgun DNA sequencing.

Journal of ComputationalBiology , 13:1177–1196, 2006.[19] MC Wendl and W Brad Barbazuk. Extension of Lander-Waterman theory for sequencingﬁltered DNA libraries.

BMC Bioinformatics , 6:245, 2005.[20] B Wold and RM Myers. Sequence census methods for functional genomics.

Nature Methods ,5:19–21, 2008.[21] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology.