
Statistical Modeling and Inference in Genetics

 
 

Abstract


Given the long mathematical history and tradition in genetics, and particularly in population genetics, it is not surprising that model-based statistical inference has always been an integral part of statistical genetics, and vice versa. Since the big data revolution due to novel sequencing technologies, statistical genetics has further relied heavily on numerical methods for inference. In this chapter we give a brief overview of the foundations of statistical inference, including both the frequentist and Bayesian schools, and introduce analytical and numerical methods commonly applied in statistical genetics. A particular focus is put on recent approximate techniques that now play an important role in several fields of statistical genetics. We conclude by applying several of the algorithms introduced to hidden Markov models, which have been used very successfully to model processes along chromosomes. Throughout we strive for the impossible task of making the material accessible to readers with limited statistical background, while hoping that it will also constitute a worthy refresher for more advanced readers. Readers who already have a solid statistical background may safely skip the first introductory part and jump directly to Section 1.2.3.

1.1 Statistical Models and Inference

Statistical inference offers a formal approach to characterizing a random phenomenon using observations, either by providing a description of a past phenomenon, or by giving some predictions about future phenomena of a similar nature. This is typically done by estimating a vector of parameters θ from a vector of observations or data 𝒟, using the formal framework and laws of probability. The interpretation of probabilities is a somewhat contentious issue, with multiple competing interpretations. Specifically, probabilities can be seen as the frequencies with which specific events occur in a repeatable experiment (the frequentist interpretation; Lehmann and Casella, 2006), or as reflecting the uncertainty or degree of belief about the state of a random variable (the Bayesian interpretation; Robert, 2007). In frequentist statistics, only 𝒟 is thus considered a random variable, while in Bayesian statistics both 𝒟 and θ are considered random variables.

The goal of this chapter is not, however, to enter any debate about the validity of the two competing schools of thought. Instead, our aim is to introduce the most commonly used inference methods of both schools. Indeed, most researchers in statistical genetics, including ourselves, choose their approaches pragmatically based on computational considerations rather than strong philosophical grounds. Yet, the two schools differ slightly in their language. To keep the introduction succinct and consistent, we introduce the basic concepts of statistical modeling first from the Bayesian point of view. The main differences with respect to the frequentist view are then discussed below.
1.1.1 Statistical Models

1.1.1.1 Independence Assumptions

The first step in statistical inference is to specify a statistical model, which consists of identifying all relevant variables 𝒟, θ and formulating the joint probability distribution P(𝒟, θ) of their interaction, usually under some simplifying assumptions.¹ It is hard to overestimate the importance of this step: ignoring a variable makes the strong assumption that this variable is independent or conditionally independent of all variables considered. By focusing on a summary statistic or subset T(𝒟) of the data 𝒟, for instance, it is implied that T(𝒟) contains all information about θ present in 𝒟. Similarly, all variables not included in θ are assumed to be independent of 𝒟 conditioned on θ. A third type of assumption that is often made is to consider specific variables to be conditionally independent of each other. This is particularly relevant in hierarchical models, where the probability distribution of one parameter depends on the values of other hierarchical parameters.

¹ To keep the notation simple, we will denote by P(⋅) the probability of both discrete and continuous variables. Also, we will typically assume the continuous case when describing general concepts and thus use integrals instead of sums.

Example 1.1 (Allele frequencies). We strive to illustrate all concepts in this chapter through a limited number of compelling scenarios that we revisit frequently. One of these is the problem of inferring the frequency f of the derived allele at a bi-allelic locus from DNA sequence data. While f may denote the frequency of either of the two alleles, we will assume here, without loss of generality, that the two alleles can be polarized into the ancestral and derived allele, where the latter arose from the former through a mutation. Consider now DNA sequence data d = {d₁, ..., dₙ} obtained for n diploid individuals with sequencing errors at rate ε. Obviously, f could easily be calculated if all genotypes were known. However, using a statistical model that properly accounts for genotyping uncertainty, a hierarchical parameter such as f can be estimated from much less data (and hence sequencing depth) than would be necessary to accurately infer all n genotypes. An appropriate statistical model with parameters θ = {f, ε} and data 𝒟 = d might look as follows:

$$P(d, f, \varepsilon) = P(d \mid f, \varepsilon)\, P(f, \varepsilon) = \left[ \prod_{i=1}^{n} \sum_{g_i} P(d_i \mid g_i, \varepsilon)\, P(g_i \mid f) \right] P(f, \varepsilon). \tag{1.1}$$

Here, the sum runs over all possible values of the unknown genotypes gᵢ.

The model introduced in Example 1.1 makes the strong assumption that the only relevant variables are the sequencing data d, the unknown genotypes g = {g₁, ..., gₙ}, the sequencing error rate ε and the allele frequency f. In addition, the model makes the conditional independence assumptions P(dᵢ | gᵢ, ε, f, d₋ᵢ) = P(dᵢ | gᵢ, ε): the sequencing data dᵢ obtained for individual i is independent of f and of the sequencing data of all other individuals d₋ᵢ when conditioning on a particular genotype gᵢ.

Variables may also become conditionally dependent, as do, for instance, f and ε once specific data is considered in the above model. Undeniably, the data d constrains ε and f: observing around 5% of derived alleles, for instance, is only compatible with f = 0 if ε ≈ 0.05, but not with a much lower ε. This also highlights that statistical dependence in a model never implies causality in nature. Indeed, the allele frequency does not causally affect the error rate of the sequencing machine, yet in the model the two variables f and ε are dependent as they are connected through the data d. Importantly, therefore, a statistical model is not a statement about causality, but only about (conditional) independence assumptions. An excellent discussion on this is given in Barber (2012, Ch. 2).
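To make equation (1.1) concrete, the following minimal sketch evaluates its likelihood part numerically. It is an illustration, not the chapter's reference implementation: it anticipates the Hardy–Weinberg form of P(gᵢ | f) introduced in Example 1.2 below, and it assumes a simple symmetric per-read error model with binomially distributed read counts for P(dᵢ | gᵢ, ε), neither of which has been specified in the text so far; all function and variable names are ours.

import numpy as np
from scipy.stats import binom

def log_likelihood(f, eps, n_derived, depth):
    """log P(d | f, eps) = sum_i log sum_g P(d_i | g, eps) P(g | f), cf. eq. (1.1).

    n_derived, depth: integer numpy arrays giving, per individual, the number
    of reads carrying the derived allele and the total read depth.
    Genotypes g in {0, 1, 2} count the copies of the derived allele.
    """
    g = np.array([0, 1, 2])
    # P(g | f): Hardy-Weinberg genotype probabilities (see Example 1.2)
    p_geno = binom.pmf(g, 2, f)                            # shape (3,)
    # Assumed error model: a read reports the derived allele with
    # probability g/2, flipped independently with probability eps
    p_read = (g / 2) * (1 - eps) + (1 - g / 2) * eps       # shape (3,)
    # Assumed read-count model: P(d_i | g, eps) binomial in the depth
    lik = binom.pmf(n_derived[:, None], depth[:, None], p_read[None, :])
    # Marginalize the unknown genotype of each individual, as in eq. (1.1)
    return np.log(lik @ p_geno).sum()

The marginalization over gᵢ is done per individual before taking logs, exactly as the product-of-sums structure of equation (1.1) prescribes.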
Figure 1.1 (a) Directed acyclic graph (DAG) representing the independence assumptions of Example 1.1 as given in equation (1.1). Observed data is shown as squares; unknown variables as circles. (b) The same DAG in plate notation, where a plate replicates the inside quantities as many times as specified in the plate (here n times).

It is often helpful to illustrate the specific independence assumptions of a statistical model graphically using a so-called directed acyclic graph (DAG; Barber, 2012; Koller and Friedman, 2009). In a DAG, each variable xᵢ is a node, and any variable xⱼ from which a directed edge points to xᵢ is considered a parental variable of xᵢ. A DAG for the model of Example 1.1 is given in Figure 1.1, from which the independence assumptions of the model are easily read:

(1) Each variable in the DAG is assumed to be independent of any variable not included in the DAG.
(2) Each variable is assumed not to be independent of its parental variables. In our case, for instance, we assume that the data dᵢ of individual i is not independent of the genotype gᵢ, nor of the sequencing error rate ε.
(3) Each pair of variables a, b connected as a → x → b or a ← x → b is independent when conditioning on x. In our example, all dᵢ are independent of f and of all other dⱼ, j ≠ i, when conditioning on gᵢ and ε.
(4) If variables a, b are connected as a → x ← b, x is called a collider; conditioning on it, a and b become dependent. In our example, ε and gᵢ are thus not independent as soon as specific data dᵢ is considered. The same holds for ε and f, unless we additionally condition on all gᵢ (a numerical sketch of this effect follows at the end of this section).

Let us recall at this point that a frequentist would discuss the above concepts with a slightly different vocabulary.

1.1.1.2 Probability Distributions

Once independence assumptions are set, explicit assumptions on the probability distributions have to be made. We note that this is not a requirement for so-called nonparametric statistical approaches. However, we will not consider these here because most nonparametric approaches are either restricted to hypothesis testing or only justified when sample sizes are very large, while many problems in genetics have to be solved with limited data. Instead, we will focus on parametric modeling and assume that the observations 𝒟 were generated from parameterized probability distributions P(𝒟 | θ) with unknown parameters θ but known function P, which thus needs to be specified.

Example 1.2 (Allele frequency). For the model of Example 1.1 given in equation (1.1), two probability functions have to be specified: P(dᵢ | gᵢ, ε) and P(gᵢ | f). For the latter, we might be willing to assume that genotypes are in Hardy–Weinberg equilibrium (Hardy, 1908; Weinberg, 1908), such that

$$P(g \mid f) = \binom{2}{g} f^{g} (1-f)^{2-g},$$

where the genotype g ∈ {0, 1, 2} counts the copies of the derived allele.
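As a numerical illustration of the collider effect in point (4) above, the following sketch grids the joint posterior P(f, ε | d) under uniform priors and checks that it does not factorize into the product of its marginals, even though f and ε are independent a priori. It reuses the log_likelihood sketch given after Example 1.1, and the read counts are simulated, purely hypothetical data.

import numpy as np

# Hypothetical data: 50 individuals at depth 10, roughly 6% derived reads
rng = np.random.default_rng(1)
depth = np.full(50, 10)
n_derived = rng.binomial(depth, 0.06)

# Joint posterior P(f, eps | d) on a grid, under uniform priors
# (log_likelihood is the sketch defined after Example 1.1)
fs = np.linspace(0.0, 0.5, 101)
epss = np.linspace(0.001, 0.2, 101)
logpost = np.array([[log_likelihood(f, e, n_derived, depth) for e in epss]
                    for f in fs])
post = np.exp(logpost - logpost.max())
post /= post.sum()

# If f and eps were independent given d, the joint would equal the
# product of its marginals; the reported gap is clearly non-zero
marg = post.sum(axis=1, keepdims=True) * post.sum(axis=0, keepdims=True)
print("max |joint - product of marginals|:", np.abs(post - marg).max())

Large values of ε trade off against small values of f (and vice versa) in explaining the same fraction of derived reads, which is exactly the dependence induced by conditioning on the collider d.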
