Detecting differentially methylated regions in bisulfite sequencing data using quasi-binomial mixed models with smooth covariate effect estimates
Kaiqiong Zhao, Karim Oualkacha, Lajmi Lakhal-Chaieb, Aurélie Labbe, Kathleen Klein, Sasha Bernatsky, Marie Hudson, Inés Colmegna, Celia M.T. Greenwood
DDetecting differentially methylated regions inbisulfite sequencing data using quasi-binomialmixed models with smooth covariate effectestimates
Kaiqiong Zhao , , Karim Oualkacha , Lajmi Lakhal-Chaieb , Aur´elie Labbe ,Kathleen Klein , Sasha Bernatsky , , Marie Hudson , , In´es Colmegna , , Celia M.T.Greenwood , , , Department of Epidemiology, Biostatistics and Occupational Health,McGill University Lady Davis Institute for Medical Research, Jewish General Hospital D´epartement de Math´ematiques, Universit´e du Qu´ebec `a Montr`eal D´epartement de Math´ematiques et de Statistique, Universit´e Laval D´epartement des Sciences de la D´ecision, HEC Montr`eal Department of Medicine, McGill University The Research Institute of the McGill University Health Centre Department of Human Genetics, McGill University Gerald Bronfman Department of Oncology, McGill UniversityJanuary 18, 2021
Abstract
Identifying disease-associated changes in DNA methylation can help to gain abetter understanding of disease etiology. Bisulfite sequencing technology allowsthe generation of methylation profiles at single base of DNA. We previously de-veloped a method for estimating smooth covariate effects and identifying differ-entially methylated regions (DMRs) from bisulfite sequencing data, which copeswith experimental errors and variable read depths; this method utilizes the bi-nomial distribution to characterize the variability in the methylated counts.However, bisulfite sequencing data frequently include low-count integers andcan exhibit over or under dispersion relative to the binomial distribution. Wepresent a substantial improvement to our previous work by proposing a quasi-likelihood-based regional testing approach which accounts for multiplicative andadditive sources of dispersion. We demonstrate the theoretical properties of the a r X i v : . [ s t a t . M E ] J a n esulting tests, as well as their marginal and conditional interpretations. Sim-ulations show that the proposed method provides correct inference for smoothcovariate effects and captures the major methylation patterns with excellentpower. Conceptually, the emergence of a disease phenotype is believed to stem from thecombined effects of genetic predisposition and environmental exposures (Ober andVercelli, 2011). A plausible mechanism behind this gene-environment interplay isepigenetic modification, which regulates gene activity through modifications of DNAaccessibility. Epigenetics may explain how exposures leave heritable marks on thegenome that impact disease susceptibility (Jaenisch and Bird, 2003). Therefore, in-creased understanding of epigenetic-disease association could lead to novel insightsinto disease causation and possible therapies (Feinberg, 2007).The most studied epigenetic mark is DNA methylation, which involves the cova-lent addition of a methyl group to a cytosine nucleotide. DNA methylation, in themammalian genomes, occurs predominantly at cytosine-guanine dinucleotides (i.e.CpG sites) (Lister et al., 2009). Methylation of CpG-rich promoters can silence geneexpression by preventing transcriptional factor binding to DNA (Choy et al., 2010).More generally, DNA methylation has the potential to activate or repress gene ex-pression, depending on whether the mark inactivates a positive or negative regulatoryelement (Jones, 1999). Known or suspected drivers behind methylation alterationsinclude genetic variations (McRae et al., 2014), environmental toxins (Hanson andGluckman, 2008), external stressors (Dolinoy et al., 2007) and aging (Horvath, 2013).There is also evidence that localized abnormal methylation is strongly linked to manydiseases, including breast cancer (Hu et al., 2005), autism spectrum disorder (Dun-away et al., 2016), and systemic autoimmune disease (Kato et al., 2005).High-resolution, large-scale measurement of DNA methylation is now possible withrecent advances in bisulfite sequencing (BS-seq) protocol, which is implemented eithergenome-wide or in targeted regions. Although whole-genome bisulfite sequencing(WGBS) allows a comprehensive characterization of the methylation landscape, it isinefficient for large-scale studies as only 20% or less of CpGs are thought to havevariable methylation across individuals or tissues (Ziller et al., 2013). On the other2and, Targeted Custom Capture Bisulfite Sequencing (TCCBS) platform enables acomprehensive yet cost-effective interrogation of functional CpGs in disease-targetedtissues or cells (Allum et al., 2015). This approach has been successfully used toidentify novel disease-associated epigenetic variants (Shao et al., 2019; Allum et al.,2019; Ziller et al., 2016). In this work, we aim to improve sensitivity to detect, amongall the regions targeted by TCCBS, differentially methylated regions (DMRs) that areassociated with phenotypes or traits.Like other sequencing experiments, the raw data from TCCBS are short sequencereads. After proper alignment and data processing, the methylation level at a singlecytosine can be summarized as a pair of counts: the number of methylated readsand the total number of reads covering the site, i.e. read depth. Such data possessseveral challenges for statistical analysis. Typically, read depth varies drasticallyacross sites and individuals, which leads to measures with wide-ranging precision andmany missing values (Sims et al., 2014). Additional statistical challenges are createdby the strong spatial correlations observed in methylation levels at neighboring CpGsites (Hansen et al., 2012; Rackham et al., 2017; Korthauer et al., 2018; Shokoohi et al.,2018), as well as the possibility of data errors, arising from excessive or insufficientbisulfite treatment or other aspects of the sequencing processes (Cheng and Zhu,2013; Lakhal-Chaieb et al., 2017). Furthermore, in addition to the trait of interest(e.g. disease or treatment group), other factors, such as age (Horvath, 2013), batcheffects (Leek et al., 2010), or cell-type mixture proportions (for mixed tissue samples)(McGregor et al., 2016) have effects on methylation levels. Hence, it is desirable toadjust methylation signals for multiple covariates simultaneously.To detect truly differentially methylated regions without finding false associations,it is crucial to accurately account for the sources of variability across individuals. Weran into this issue in a recent analysis of methylation profiles and anti-citrullatedprotein antibodies (ACPA). Figure 1 (A) illustrates methylation proportions in atargeted region for samples from this study. (A full description of the study, referredto as the ACPA dataset, is in Section 2.2). Clearly, dispersion is much larger betweensamples in the blue group. In panel (C), it can be seen that p-values testing formethylation differences, assuming a binomial mean-variance relationship are much toosmall. In contrast, allowing for dispersion through a quasi-binomial model providesp-values in line with null expectation for this region. As such, the restrictive mean-variance relationship implied by a binomial generalized linear model (GLM) may not3A) . . . . . . Genomic Position M e t h y l a t i on p r opo r t i on ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllllllll llllllll lll llll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllll llllllllllll lll llll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllllllll lllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllll lllllll lllllll lll lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllll ll lllllll lll llll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllllllllllllll lll ll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllll ll llll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll llll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllllllllll llllll ll ll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllll ll l llllllll lll llll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllll ll lllllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllll lllllll lll lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllll lll llllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllll llllll lll lll ll l l lllll lll llllllll lllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllll llllll llll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllllllllllllllllllll lll lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllll llllllllllll lll llll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllll llllll lll llll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllll lll llll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllllll l lllll lll lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllll lll lll lll llll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllllllllllllllllllll lll lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll llll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllll lllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllllllllllllllllllll lll llll l l llll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllllllllllll lll lll llll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllllllllllllllllllll lll llll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllll llllllllllll ll lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll llll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllll llllllll lll llll l l lllll lll lllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllll lllll l l lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllllllllll llllllll lll lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllll llllllllllll lll lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllll llllllllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllll llllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllllllllllllllllllll lll lll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllllllllllllllllll l lll lll lll ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllllllllllllllllllll lll lll l ll l l lllll lll llllllll llllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllll ll llllllllllllllllllllllllllllllllllll lll ll l l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllllllllllllll lll lll l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllll lllllll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllllllllllllllll lllllll lll lll l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllll l lll lllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllll l llll llllll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllllllllll lllll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllll lll l lll lll lll l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllll lllllllll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllll lllllllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllllllllllllllllllllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllll llllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllllllllllllllllllllll ll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllllllllllllll llllllll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllllllllllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllllllll llllllllll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllllllllllllll lll lll l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllllllllll llll lllllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllllllllllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllllll ll l llllll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllll llll lll llll lll lll l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllllllllllllll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllll llllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllll lll ll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllllllllllllllll lllllll lll lll l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllll llllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllll lllllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllllllllllllll lll lll l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllllllll lll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllll llllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllll l l lllllllllllll lll lll l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllll ll l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllllllllllllll ll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllllllllllllllllllllllll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllllllllllllll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllllllllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllll llllll lll lll l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllll llllllll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllllllllllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllllllllllllllllllllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllll lllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllllllllllllll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllll lllllllll lll lll l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllll l llllllllllllllll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllllllll lllllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllll lllllllllll lll lll l l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllllllll lllllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll llllllllllllllllllll llllllll lll lll l l l l l llll l ll llllllll ll llllllllll llllllllllllllllllllllllllllllll lllllllllllllllllll ll lllllll lllllllllllllllllllllllllllll lll ll ControlCase (B) (C) ll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllll
Dispersion Estimates
Genomic Positions l llll lll ll llll lll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llll ll ll ll l ll llll ll lllllllllllll . . . . . . Single−site P−values
Observed P−value E m p i r i c a l P − v a l ue l lll l lll ll lll l lll llll ll lllllllll lllll ll llllll ll lllll ll lllll lllllll llllllll l lll lllllll l ll l llll ll ll llll ll ll ll l ll l lll ll llll ll lll l l ll BinomialQuasi−binomial
Figure 1.
Illustration of observed dispersion in a targeted region that underwent bisul-fite sequencing. (A) Observed methylation proportions in one region for two groupsof samples (yellow and blue); data are fully described in Section 2.2. (B) Estimateddispersion for each CpG site from a single-site quasi-binomial GLM. (C) Single-sitep-values for methylation difference between the two groups. Horizontal axis are thep-values estimated from either binomial (ignoring dispersion) or quasi-binomial (ac-counting for dispersion) GLMs. Vertical axis shows the empirical p-values computedfrom 199 permutations; the empirical p-value is a benchmark for valid statistical tests.(Single-site beta-binomial regression models generate similar dispersion estimate pat-tern and p-value distribution to quasi-binomial GLM). adequately accommodate the data variability, and thus can lead to inflation of falsepositives. This is known as over or underdispersion, i.e. data presenting greater or4ower variability than assumed by a GLM model.Moving in this direction, we have developed a SmOoth ModeliNg of BisUlfiteSequencing (SOMNiBUS) method to detect DMRs in targeted bisulfite sequencingdata (Zhao et al., 2020). The method provides a general framework of analysis, andsimultaneously addresses regional testing, estimation of multiple covariate effects, ad-justment for read depth variability and experimental errors. Specifically, Zhao et al.(2020) proposed a hierarchical binomial regression model, which allows covariate ef-fects to vary smoothly along genomic position. A salient feature of SOMNiBUS is itsone-stage nature. Several existing methods first smooth methylation data and then,in a second stage, estimate covariate effects based on the smoothed data (Hansenet al., 2012; Lakhal-Chaieb et al., 2017; Hebestreit et al., 2013), and this two-stageframework could lead to biased uncertainty estimates. In contrast, SOMNiBUS col-lapses smoothing and testing steps into a single step, and achieves accurate statisticaluncertainty assessment of DMRs. That said, its underlying binomial assumption maybe overly restrictive and is only applicable when data exhibit variability levels thatare similar to those anticipated based on a binomial distribution (such as data frominbred animal or cell line experiments). In this work, we propose an extension ofSOMNiBUS, which maintains all the good properties of the standard SOMNiBUS,and at the same time explicitly allows the variability in regional methylation countsto exceed or fall short of what binomial model permits.The importance of accounting for dispersion in BS-seq data has been well recog-nized in analysis of single CpG sites. Faced with dispersion in discrete data analysis,one commonly used option is to convert the methylated and total counts to propor-tions. In this way, testing of differentially methylated single CpG sites can be donevia the two sample t-test (Hansen et al., 2012) or beta regression (Hebestreit et al.,2013), both of which allow direct computation of (within-group) sample variation.However, this conversion loses information, since it fails to distinguish between noisyand accurate measurements (Wu et al., 2015), often as a consequence of the stochas-ticity of read depth, and also disregards the discrete nature of the data (Lea et al.,2015). On the other hand, there are approaches for DNA methylation analysis thatdirectly model counts while accounting for dispersion. These count-based approachesuse either additive overdispersion models, or multiplicative under- or overdispersionmodels to describe the variation driving the dispersion (Browne et al., 2005). In amultiplicative model, one includes a multiplicative scale factor, i.e. the dispersion5arameter, in the variance of the binomial response. Thus, the dispersion inflatesor deflates the variance estimates of the covariate effect by the multiplicative factor.Such approaches include the quasi-binomial regression model (Akalin et al., 2012) andthe beta-binomial regression model (Dolzhenko and Smith, 2014; Feng et al., 2014;Park et al., 2014; Park and Wu, 2016). In contrast, additive overdispersion methodsadd a subject-level random effect (RE) to capture the extra-binomial variation amongindividual observations. Both ABBA (Rackham et al., 2017) and MACAU (Lea et al.,2015), that use binomial mixed effect models fall in this category. An advantage ofthe multiplicative approach, particularly the quasi-binomial model, is that it natu-rally allows for both overdispersion and underdispersion, whereas the additive modelonly allows overdispersion. On the other hand, the additive overdispersion approachlinks directly with a multilevel model and can be readily extended to analyze datawith a hierarchical or clustering structure.(A) (B) ll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllll
Genomic Positions D i s pe r s i on E s t i m a t e s Multiplicative dispersion without subject−level RE ll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllll
Genomic Positions D i s pe r s i on E s t i m a t e s Multiplicative dispersion with subject−level RE
Figure 2.
A byproduct of introducing a subject-level RE, on top of a multiplicativedispersion parameter, to a model with smooth covariate effects is a regional dispersionpattern of varying degree.
Estimated dispersion for each CpG site obtained from asingle-site quasi-binomial GLM, for two simulated regional methylation datasets: (A)data were simulated from a multiplicative-dispersion-only model ( φ = 3 , σ = 0), and(B) data were simulated from a model with both a multiplicative dispersion and asubject-level RE ( φ = 3 , σ = 3); see Section 2.1 for detailed model formulations andnotation definitions. The challenge of accounting for dispersion when detecting DMRs is further com-plicated by several factors. Firstly, even within a small genomic region, different CpG6ites may exhibit different levels of dispersion and strong spatial correlation (Figure1 B). Hence, a multiplicative dispersion model with a common dispersion parameterdoes not adequately capture the dispersion heterogeneity across loci (Figure 2 A). Inaddition, challenges are presented by the complex correlation structure in the regionalmethylation data. Apart from the spatial correlations among neighboring CpGs,there are additional correlations among methylation measurements on the same sub-ject. Ignoring this within-subject correlation could lead to overestimation of precisionand invalid statistical tests (Cui et al., 2016). One means to accommodate such acorrelation structure is to add a subject-level RE that can also capture the overdisper-sion induced by independent variation across different subjects. Furthermore, whenmodeling discrete data with a hierarchical structure, extra non-structural specific ran-dom dispersion can arise, beyond that introduced by the subject-level RE (Breslowand Clayton, 1993; Molenberghs et al., 2007; Vahabi et al., 2019), and thus, often,parametric distributions with restrictive mean-variance relations poorly describe theoutcomes for individual subjects (i.e. the conditional distribution of outcome giventhe RE) (Molenberghs et al., 2010, 2012; Ivanova et al., 2014). Hence, properly ad-dressing both multiplicative and additive sources of dispersion in methylation data isessential for making reliable inference at the region level.
Table 1.
List of existing DNA methylation analytical methods and our proposal withtheir capabilities.
Method regional one-stage count-based read-depthvariability adjust forconfounding within-subjectcorrelation non-structuraldispersion varying levelsof dispersionacross loci experimentalerrorsdSOMNiBUS (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) (cid:88)
SOMNiBUS (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) (cid:88)
BSmooth (cid:88) (cid:88) – (cid:88) (cid:88) SMSC (cid:88) (cid:88) – (cid:88) (cid:88) (cid:88) dmrseq (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) Biseq (cid:88) (cid:88) – (cid:88) (cid:88) (cid:88) (cid:88) GlobalTest (cid:88) (cid:88) (cid:88) NA † NA † NA † ABBA (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) (cid:88) (cid:88)
MACAU (cid:88) (cid:88) (cid:88) (cid:88) NA ‡ (cid:88) (cid:88)(cid:88) – : These three methods are of a two-stage nature. Their smoothing stage indeed accounts for read-depth variability, but theirtesting stage, which relies on t-test or beta regression, ignores the read-depth variability. † : GlobalTest treats methylation levels at multiple loci as covariates and trait of interest as outcome. It is not necessary forGlobalTest to account for the three features on covariance structure of methylation across samples and loci. ‡ : MACAU is a single-site method and within-subject correlation is irrelevant when analyzing individual sitesone at a time. combination of asubject-specific RE (i.e an additive overdispersion) and a multiplicative dispersion.The RE term accounts for between-sample heterogeneity, and at the same time en-ables flexible dispersion patterns in a region (Figure 2 B), which is highly plausible inmethylation data (Figure 1 B). The multiplicative dispersion, on the other hand, ex-plicitly allows the variability in individual subject’s methylation levels to exceed or fallshort of what binomial distribution assumes, and thus captures the extra dispersionthat cannot explained by RE. In addition, our approach accounts for possible dataerrors in the observed methylated counts. Specifically, we assume that the observedread counts arise from an unobserved latent true methylation state compounded byerrors. We then build a specialized expectation-maximization (EM) algorithm for the8uasi-binomial mixed model to make inference about DMRs in the presence of dataerrors. Here we present our model for describing regional methylation data. Details on thealgorithm, and the inference method for the model, are provided in Section 4.We consider DNA methylation measures over a targeted genomic region from N independent samples. Let m i be the number of CpG sites for the i th sample, i = 1 , , . . . N . We write t ij for the genomic position (in base pairs) for the i th sampleat the j th CpG site, j = 1 , , . . . , m i . Methylation levels at a site are quantifiedby the number of methylated reads and the total number of reads. We define X ij as the total number of reads aligned to CpG j from sample i . We denote the true methylation status for the k th read obtained at CpG j of sample i as S ijk , where k = 1 , , . . . X ij . For a single DNA strand read, S ijk is binary and we define S ijk = 1if the corresponding read is methylated and S ijk = 0 otherwise. We additionallydenote the true methylated counts at CpG j for sample i with S ij = (cid:80) X ij k =1 S ijk ,summing over all reads aligned to position t ij . Furthermore, we assume that we havethe information on P covariates for the N samples, denoted as Z i = ( Z i , Z i , . . . Z P i ),for i = 1 , , . . . N .We propose a quasi-binomial mixed effect model to describe the relationship be-tween methylated counts, S ij for j = 1 , , . . . m i , and the sample-level covariates Z i .Specifically,log π ij − π ij = β ( t ij ) + β ( t ij ) Z i + β ( t ij ) Z i + . . . + β P ( t ij ) Z P i + u i , (1) u i iid ∼ N (0 , σ ) V ar( S ij | u i ) = φX ij π ij (1 − π ij ) (2)where π ij = E ( S ij | u i ) /X ij is the individual’s methylation proportion (i.e. the con-ditional mean), β ( t ij ) and { β p ( t ij ) } Pp =1 are functional parameters for the interceptand covariate effects on π ij , and σ is the random effect variance. In this model,we assume the underlying proportion of methylated reads for the i th sample at the9 th CpG site, π ij , depends on covariates Z i and on nearby methylation patternsthrough a logit link function. In addition, each π ij incorporates a subject-specificrandom intercept (i.e. an additive overdispersion) u i that is normally distributedand independent across samples. The inclusion of u i allows for sample heterogeneityin baseline methylation patterns, and at the same time accounts for the correlationamong methylation measurements taken on the same sample. Moreover, we assumethe variance of S ij for individual samples to be a product of a multiplicative dispersionparameter φ and a known mean-variance function implied by a binomial distribution( V ( π ij ) = X ij π ij (1 − π ij )).Both the random effects u = ( u , u , . . . u N ) T and the multiplicative dispersionparameter φ capture extra-binomial dispersion. However, they address two differentaspects of dispersion: u models the variation that is due to independent noise acrosssamples, while φ aims to relax the assumption of the conditional distribution of S ij given u such that it is not confined to a binomial distribution. In fact, our modelgeneralizes the binomial-based model in Zhao et al. (2020) by introducing both theadditive dispersion term u and multiplicative dispersion term φ . Specially, imposing φ = 1 in model (1) leads to an additive-dispersion-only model and σ = 0 correspondsto a multiplicative-dispersion-only model. When σ = 0 and φ = 1, our model reducesto the binomial-based model in Zhao et al. (2020). A key feature of the mixed effect model in (1) is that the regression coefficients β p ( t ij ) need to be interpreted conditional on the value of random effect u i . Forexample, β p ( t ij ) describes how an individual’s methylation proportions in a regiondepend on covariate Z p . If one desires estimates of such covariate effects on theaverage population, it is more appropriate to determine the marginal model impliedby (1). After applying a cumulative Gaussian approximation to the logistic functionand taking an expectation over u i , it can be shown that the marginal mean, π Mij , hasthe form π Mij = E ( S ij ) /X ij ≈ g (cid:32) P (cid:88) p =0 a β p ( t ij ) Z pi (cid:33) , (3)where g ( x ) = 1 / (1 + exp( − x )), Z i ≡
1, and the constant a = (1 + c σ ) − / with c = √ . /π ; see detailed derivations in Appendix A.1. The approximation in (3)10s quite accurate with errors ≤ . Z p through a logistic link with attenu-ated regression coefficients aβ p ( t ij ). Although the smooth covariate effect parameters β p ( t ij ) have no marginal interpretation, they do have a strong relationship to theirmarginal counterparts. Hence, the results from hypothesis testing H : β p ( t ij ) = 0describe the significance of the covariate effect on both the population-averaged andan individual’s DNA methylation levels across a region.Similarly, the marginal variance of S ij does not coincide with its conditional coun-terpart as shown in (2). Specifically, our mixed effect model implies a marginalvariance of S ij defined as V ar( S ij ) ≈ X ij π (cid:63)ij (1 − π (cid:63)ij ) (cid:8) φ + σ ( X ij − φ ) π (cid:63)ij (1 − π (cid:63)ij )+ σ / − π (cid:63)ij ) (cid:2) σ π (cid:63)ij (1 − π (cid:63)ij )( X ij − φ − / (cid:3)(cid:9) , (4)where π (cid:63)ij = g − (cid:16)(cid:80) Pp =0 β p ( t ij ) Z pi (cid:17) ; see detailed derivations in Appendix A.2. Notethat π (cid:63)ij is the mean methylation proportion when setting random effects u i to zeroand is related to the marginal mean π Mij via π (cid:63)ij = g (cid:0) g − (cid:0) π Mij (cid:1) /a (cid:1) . Equation (4)illustrates that, under the dSOMNiBUS model, the marginal variance of methy-lated counts at a CpG site is approximately the variance of the binomial modelmultiplied by a dispersion factor φ (cid:63) = φ + σ ( X ij − φ ) π (cid:63)ij (1 − π (cid:63)ij ) + σ / − π (cid:63)ij ) (cid:2) σ π (cid:63)ij (1 − π (cid:63)ij )( X ij − φ − / (cid:3) , which depends on the combined effect of φ , the multiplicative dispersion for the conditional variance given the RE, and σ ,the variance of the subject-level RE. Notably, the marginal dispersion factor φ (cid:63) alsodepends on genomic position t ij via the dependence of π (cid:63)ij on t ij . Consequently,our dSOMNiBUS model in (1) naturally allows dispersion levels to vary across loci,whereas a multiplicative-dispersion-only model (i.e. σ = 0) can only accommodateconstant dispersion in a region, as illustrated in Figure 2. It is also clear from Equation(4) that an additive-dispersion-only model (i.e., φ = 1) only allows for overdispersion,and the combination of additive and multiplicative dispersion naturally accounts forboth over- and underdispersion. 11 .1.2 Dealing with possible measurement errors in methylated counts In the presence of experimental errors, the true methylation data, S ij are unknownand one only observes Y ij . We assume the following error mechanism P ( Y ijk = 1 | S ijk = 0) = p P ( Y ijk = 1 | S ijk = 1) = p . (5)Here, these two parameters capture errors; p is the rate of false methylation calls,and 1 − p is the rate of false non-methylation calls. These rates are assumed tobe constant across all reads and positions. The error parameters p and p can beestimated by looking at raw sequencing data at CpG sites known in advance to bemethylated or unmethylated (Wreczycka et al., 2017). We assume hereafter that p and p are known. The methodology details on how to make inference about covariateeffects β p ( t ij ) and estimate dispersion parameters φ and σ , in the presence of dataerrors, are described in Section 4.3. We first apply our approach to targeted bisulfite sequencing data from a rheumatoidarthritis study (Shao et al., 2019). Participants were sampled from the CARTa-GENE cohort ( ), a population-based cohort includ-ing 43,000 general population subjects aged 40 to 69 years in Quebec, Canada. Thestudy aims to investigate association between DNA methylation and the levels ofanti-citrullinated protein antibodies (ACPA), a marker of rheumatoid arthritis (RA)risk that often presents prior to any clinical manifestations (Forslind et al., 2004).Firstly, the serum ACPA levels were measured for a randomly sampled 3600 in-dividuals from the CARTaGENE cohort, based upon which individuals were classi-fied as either ACPA positive or ACPA negative. Then, the whole blood samples ofthe ACPA positive individuals, and a selected subset of age-sex-and-smoking-status-matched ACPA negative individuals were sent for Targeted Custom Capture BisulfiteSequencing. Specifically, the sequencing used an immune targeted panel that coversthe majority of genomic regions with relevance to RA and blood cells. Cell type pro-portions in the blood samples were also measured at the time of the sampling (Shao12t al., 2019).Using this sampling approach, two batches of data, referred to as data 1 and data2, were collected in 2017 and 2019, respectively. Notably, the classification criteriafor ACPA status are slightly different between data 1 and 2. When sampling data1, subjects with serum ACPA levels greater than 20 optical density (OD) units werecalled as ACPA postive and samples with ACPA levels less than 20 OD were definedas ACPA negative. After data cleaning, data 1 consisted of 69 ACPA positive subjectsand 68 ACPA negative subjects. In contrast, the sampling of data 2 was based onmore extreme cutoffs for ACPA levels, and resulted in 60 ACPA positive subjects(ACPA levels ≥
60 OD) and 60 ACPA negative subjects (ACPA levels <
20 OD).This change in decision is reflected in the different distributions of serum ACPA levelsbetween data 1 and 2, as shown in Supplementary Figure S1. Average sequence readdepths in targeted regions were 5 and 35 in data 1 and 2, respectively (SupplementaryFigure S2), due to improvement in the sequencing protocols implemented between thetwo experiments.In this article, we restricted our attention to regions with at least 50 CpG sites.In addition, we excluded regions with more than 95% CpGs having median readdepth 0 or having median methylation proportion as 0. Overall, we analyzed 10,759regions in dataset 1 and 12,983 regions in dataset 2. We excluded the samples whoreported a diagnosis of RA before the CARTaGENE study started. Subjects withmissing information on cell type proportions were also removed from our analysis.Supplementary Table S1 presents the sample characteristics in data 1 and 2.We apply our approach to both data 1 and 2, with the aim to identify the differ-entially methylated regions that show association with ACPA, after adjustment forage, sex, smoking status and cell type composition. Specifically, we assumed no dataerrors in the datasets ( p = 1 − p = 0). We used natural cubic splines to expandthe smooth terms in the model, and its rank L p was approximately as the number ofCpGs in a region divided by 10 for β ( t ), and divided by 20 for β p ( t ) , p ≥ Figure 3 presents the distribution of estimated multiplicative dispersion φ and additivedispersion σ for all test regions in dataset 1 and 2. Overall, widespread overdisper-sion is observed; 98.5% regions show multiplicative dispersion φ greater than 1 and131.2% regions show additive dispersion σ greater than 0.05. The Pearson correlationcoefficient between the estimated φ and σ is − . φ > σ > . D a t a D a t a f ^ s ^ count (B) (C) l l l og ( f ^ ) f ^ l l −20−10010 0.0050.55Data 1 Data 2 l og ( s ^ ) s ^ Figure 3.
Distribution of the estimated multiplicative dispersion parameter φ andadditive dispersion parameter σ , for all test regions in dataset 1 and 2. Panel (A)shows the 2-dimensional histogram for (cid:98) φ and (cid:98) σ , where the color intensity representsthe number of regions with a particular combination of values of (cid:98) φ and (cid:98) σ . Panels (B)and (C) show the rotated kernel density plots (i.e. violin plots) for (cid:98) φ and (cid:98) σ (in anatural logarithmic scale), separately. Figure 4 shows quantile-quantile (QQ) plots for the regional p-values for the effectof ACPA on the 292 regions of Chromosome 18 in the two datasets. Detailed in-ference steps are given in Section 4. The results are compared among four different14pproaches: (1) dSOMNiBUS which models both the multiplicative and additive dis-persion, (2) the multiplicative-dispersion-only model, (3) the additive-dispersion-onlymodel, and (4) the standard SOMNiBUS which ignores any extra-binomial variation.Figure 4 reveals that, when ignoring either type of dispersion, the distribution ofregional p-values is biased away from what would be expected under the null. Theinclusion of both multiplicative and additive dispersion is important for correct typeI error control. lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
Data 1 Data 20.0 0.5 1.0 1.5 2.0 2.5 0.0 0.5 1.0 1.5 2.0 2.50.02.55.07.510.0
Expected - log ( p ) O b s e r v ed - l og ( p ) l l l l dSOMNiBUS multiplicative−dispersion−only model additive−dispersion−only model SOMNiBUS Figure 4.
QQ plot for regional p-values, obtained from models addressing differenttypes of dispersion.
To test DMRs, we propose a region-based statistic with a F limiting distribution;see details in Section 4.4.3. To test the validity of our inference, we compare ourregional p-values to bootstrap-based p-values, whose null distribution is constructedby parametric bootstraps (Davison and Hinkley, 1997) and does not rely on anydistributional assumptions. Figure 5 shows the distributions of bootstrap-based andour analytical p-values for the targeted regions on chromosome 18, demonstratingthat our inference method generates p-values in line with the bootstrap-based results.Thus, dSOMNiBUS provides accurate tests for DMRs without requiring extensive15 l ll lll ll l lll l lll lll lll lll ll ll lll l ll l l lll ll ll l l lll ll l ll llll l ll ll ll llll ll lll lll lll l l l llll ll ll l l l lll ll ll lll ll ll lll l ll ll lll l l lll l lll lll lll l ll ll ll lll lllll lll ll l ll ll lll ll lll l ll ll l l lll ll llll ll l lll lll lll lll ll l ll l lll ll l l ll ll ll lll ll l lll ll llll llll l ll l ll ll l ll ll l llll ll l ll ll lll ll l llll l ll lll l lll lll ll l l
Data 1 Data 20.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.000.000.250.500.751.00
Analytical p−values P e r m u t a t i on ba s ed p − v a l ue s f r o m pa r a m e t r i c boo t s t r ap Figure 5.
Comparison between the observed regional p values from our approach andthe permulation-based p values from parametric bootstrap. computational time.
We conducted simulations to assess the proposed inference of smooth covariate effects,and to compare the performance of our method with five existing methods:
BiSeq (Hebestreit et al., 2013),
BSmooth (Hansen et al., 2012),
SMSC (Lakhal-Chaieb et al.,2017), dmrseq (Korthauer et al., 2018) and
GlobalTest (Goeman et al., 2006), interms of type I error and power. Detailed descriptions of these five methods aregiven in Supplementary Section 3.2. We also made special modifications for theimplementations of
BSmooth , SMSC and dmrseq , which are primarily designed forWGBS data, to make them as appropriate as possible for targeted regions. see detailsin Supplementary Section 3.1.
We adopt similar simulation parameters as described in Zhao et al. (2020), and simu-lated methylation regions with 123 CpG sites under various settings. We first gener-ated the read depth X ij by adding Bernoulli random variables (with proportion 0.5)to a pre-specified regional read-depth pattern (Supplementary Figure S3). In thisway, the spatial correlation of read depth observed in real data was well preserved in16he simulated data. The rest of simulation parameters were defined in Table 2. Table 2.
Simulation settings for the functional parameters β p ( t ), sample size N , errorparameters p and p , multiplicative parameter φ and RE variances σ . Simulation Possible valuesparameters β p ( t ) Scenario 1: three covariates: Z ∼ Bernoulli (0 . Z ∼ Bernoulli (0 .
58) and Z ∼ Bernoulli (0 . β ( t ) , β ( t ) and β ( t ) and intercept β ( t ), shown in the red curves in Figure 7.Here, Z is the null covariate with effect β ( t ) ≡ Z ∼ Bernoulli (0 . β ( t ) , β ( t )), which yield methylation proportion parametersas depicted in Figure 6. N p , p ) (0 . , . † or (0 , φ (1 , σ (0 , , , u i i.i.d ∼ N (0 , σ ) for i = 1 , , . . . N † the value 0 .
003 was reported by Prochenka et al. (2015) as insufficient Bisulfite conversion rate and 0 . Simulate dispersed-binomial counts.
Given the values of { Z , . . . Z P } , { β p ( t ) , p =0 , , . . . P } and { u i , i = 1 , , . . . N } under each setting, the individual’s methylationproportion, π ij , can be readily calculated from the mean model in (1). We thengenerated the true methylation counts S ij from a beta-binomial distribution withproportion parameter µ = π ij , correlation parameter ρ = φ − X ij − n = X ij . Specifically, S ij were drawn from the following probability mass function P ( S ij = k | µ, ρ, n ) = (cid:32) nk (cid:33) B ( k + α, n − k + β ) B ( α, β )where α = µ (1 − ρ ) /ρ, β = (1 − µ )(1 − ρ )(1 − µ ) /ρ , and B ( · , · ) is the beta function.The variance of S ij can be thus derived as V ar( S ij ) = [1 + ( n − ρ ] [ nµ (1 − µ )] = φX ij π ij (1 − π ij ) , which coincides with our assumed mean-variance relationship in (2). We then gener-ated the observed methylated counts Y ij according to the error model in (5), whichimplies Y ij | S ij ∼ Binomial( S ij , p ) + Binomial( X ij − S ij , p ) . . . . . . . Genomic Position p Figure 6.
The 15 simulation settings of methylation parameters π ( t ) and π ( t ) inScenario 2. Here, π ( t ) and π ( t ) denote the methylation parameters for samples with Z = 0 and Z = 1 at position t , respectively. Under this scenario, π ( t ) (red dotted-dashed curve) is fixed across settings, whereas π ( t )s (black solid lines) vary acrosssettings corresponding to different degrees of closeness between methylation patternsin the two groups. Under each scenario and setting, we generated data sets with sample sizes N =100, each 1000 times. We then applied dSOMNiBUS along with methods BiSeq , dmrseq , BSmooth , SMSC and
GlobalTest to the simulated data sets. For our ap-proach dSOMNiBUS , we used cubic splines with dimension L p = 5 to parameterize thesmooth terms of interest. We also assumed that the correct values of error parameters p and p were known. Figure 7 presents the estimates of the functional parameters β ( t ) , β ( t ) , β ( t ) and β ( t ) over 1000 simulations, obtained from dSOMNiBUS; here, data were generatedunder Scenario 1, with multiplicative dispersion parameter φ = 3, RE variance σ = 3,and error parameters p = 0 .
003 and 1 − p = 0 .
1. Figure 7 demonstrates that theproposed method provides unbiased curve estimates for smooth covariate effects whenthe regional methylation counts exhibit extra-parametric variation and are measured18 − − − − b (t) Genomic Position − . . . b (t) Genomic Position b (t) Genomic Position − . . . b (t) Genomic Position
Figure 7.
Estimates of smooth covariate effects (gray) over the 1000 simulations inScenario 1, using dSOMNiBUS. The red curves are the true functional parameters usedto generate the data. Data were generated with error using φ = 3 and σ = 3. with errors.Figure 8 and 9 demonstrate the performance of the proposed pointwise confidenceinterval (CI) estimates (Section 4.4.2 ) and regional test (Section 4.4.3), respectively.The results from dSOMNiBUS are compared to the multiplicative-dispersion-onlymodel and the additive-dispersion-only model. Figure 8 displays the empirical cover-age probabilities of the analytical 95% CIs for β ( t ), under different settings of φ and σ . Figure 9 shows the QQ plots for the regional p-values when the null hypothesis H : β ( t ) = 0 is correct. The results show that ignoring the presence of additive dis-persion (i.e. the multiplicative-dispersion-only model) leads to substantial estimationbias, poor CI coverage probabilities and highly inflated type I errors. Although theadditive-dispersion-only model provides relatively accurate pointwise CIs, the distri-butions of its regional p-values are biased away from what would be expected underthe null, when multiplicative dispersion φ >
1. Overall, dSOMNiBUS provides point-19 = s = s = s = f = f = Genomic Positions P o i n t w i s e C I c o v e r age dSOMNiBUS multiplicative−dispersion−only model additive−dispersion−only model Figure 8.
Empirical coverage probability of the analytical 95% CIs for β ( t ) over 1000simulations, under different vales of φ and σ . The empirical coverage probabilitiesare defined as the percentage of simulations where the analytical CIs cover the truevalue of β ( t ). Data were generated with error, under simulation Scenario 1. Theresults from dSOMNiBUS (green) and the additive-dispersion-only model (purple) areindistinguishable in all settings but σ = 0 and φ = 3 and dSOMNiBUS (green) andthe multiplicative-dispersion-only model (orange) are indistinguishable when σ = 0. wise CIs attaining their nominal levels, and region-based statistics whose distributionunder the null is well calibrated, regardless of the types and degrees of dispersion thatdata exhibit. Similar results were observed when data were generated without error(Supplementary Figures S5 and S6). Figures 10 and 11 further demonstrate the performance of the proposed regional test,when compared with the existing methods
GlobalTest , dmrseq , BSmooth , SMSC , and
BiSeq . Here, data were simulated with error parameters p = 0 .
003 and 1 − p = 0 . = s = s = s = f = f = Expected - log ( p ) O b s e r v ed - l og ( p ) dSOMNiBUS multiplicative−dispersion−only model additive−dispersion−only model Figure 9.
QQ plot for regional p-values for the test H : β ( t ) = 0, obtained fromdSOMNiBUS, the multiplicative-dispersion-only model and the additive-dispersion-onlymodel. Data were simulated with error, under simulation Scenario 1. When φ = 1, theresults from dSOMNiBUS (green) and the additive-dispersion-only model (purple) areindistinguishable. When σ = 0, the lines for the multiplicative-dispersion-only model(orange) and dSOMNiBUS (green) are indistinguishable. Z . Because we estimated the empirical regional p-values for BSmooth and
SMSC bypermutations, both methods are able to control type I errors, under all settings of φ and σ . Both BiSeq and dmrseq show deflated type I error rate when σ = 0and inflated type I error rate when σ >
0. The distributions of p-values from
GlobalTest are well calibrated when the within subject correlation σ >
0, but areslightly biased away from the uniform distribution when σ = 0. When σ = 0 and φ = 3, dSOMNiBUS provides slightly conservative type I errors; this bias vanisheswhen the data were generated without error (Supplementary Figures S7). Figure 11shows the powers of the six methods for detecting DMRs under the 15 settings ofmethylation patterns displayed in Figure 6. Here, methylation difference is definedas the maximum difference between π ( t ) and π ( t ) in the region. When data exhibitneither additive nor multiplicative dispersion, dSOMNiBUS and BSmooth provide the21 = s = s = s = f = f = Expected - log ( p ) O b s e r v ed - l og ( p ) dSOMNiBUS GlobalTest dmrseq BSmooth SMSC BiSeq Figure 10.
QQ plot for regional p-values for the test H : β ( t ) = 0, obtained fromdSOMNiBUS, GlobalTest, dmrseq, BSmooth, SMSC, and BiSeq. Data were simulatedwith error, under simulation Scenario 1. highest power, followed by dmrseq , BiSeq , GlobalTest , and
SMSC . When σ = 0 and φ = 3, BSmooth and dmrseq are more powerful than other methods. When thereare correlations among methylation measurements on the same subject, i.e. σ >
0, dSOMNiBUS clearly outperforms the five alternative methods; this superiorityremains when the data were generated without error (Supplementary Figures S8).In summary, dSOMNiBUS exhibits greater power to detect DMRs, while correctlycontrolling type I error rates, especially when the regional methylation counts exhibit(additive) extra-binomial variation.
We have proposed and evaluated a novel method, called dSOMNiBUS, for estimatingsmooth covariate effects for BS-seq data. We demonstrate that our model, whichincorporates both multiplicative and additive sources of data dispersion, provides aplausible representation of realistic dispersion trends in regional methylation data. In22 = s = s = s = f = f = Methylation Differences P o w e r dSOMNiBUS GlobalTest dmrseq BSmooth SMSC BiSeq Figure 11.
Powers to detect DMRs using the six methods for the 15 simulation settingsin Scenario 2 under different levels of maximum methylation differences between π ( t )and π ( t ) in the region, calculated over 100 simulations. addition, dSOMNiBUS simultaneously accounts for experimental errors, estimationof multiple covariate effects, and flexible dispersion patterns in a region. Also, weprovide a formal inference for smooth covariate effects and construct a region-basedstatistic for the test of DMRs, where outcomes might be contaminated by errorsand/or exhibit extra-parametric variations. Results from simulations and real dataapplications show that the new method captures important underlying methylationpatterns with excellent power, provides accurate estimates of covariate effects, andcorrectly quantifies the underlying uncertainty in the estimates. The method hasbeen implemented in the R package SOMNiBUS , which has been submitted to R Bio-conductor.Our model captures dispersion in the regional count data via the combination ofa subject-specific RE and a multiplicative dispersion. The latter aims to capture theextra random dispersion beyond that introduced by the subject-to-subject variation.An alternative way to add multiplicative despersion might be to add locus-specificREs. Such model would avoid the problem of estimating φ , but would result insubstantially increased number of REs, in which case our Laplace approximation isunlikely to provide well-founded inference (Shun and McCullagh, 1995). In addition,23uch a model only captures overdispersion. In contrast, our quasi-binomial mixedeffect model provides an adequate representation of any kind of dispersion withoutmuch increase in computational complexity.An extension worth exploring in the future is to model the dispersion parameter φ as a function of covariates. For example, the methylation variation across cancersamples has been found to be higher than for normal samples (Hansen et al., 2011;Schoofs et al., 2013). Identification of such disease-associated methylation variationchanges might provide further insights into the biological mechanisms. This extensionwould also allow modelling of the hypothesis that some individuals are more sensitiveto their environment (Meaney and Szyf, 2005).Our proposed methods can also be applied to other types of next-generation se-quencing data. For example, allele-specific gene expression (ASE) measured fromRNA-seq data are quantified by the numbers of reads originating from the two allelesfor that site (Fan et al., 2020). Such data share a similar structure to bisulfite se-quencing data and could be analyzed by dSOMNiBUS. From the methodology pointof view, our proposal of combining quasi likelihood with random effects can be gen-erally applied to any type of count data for a more comprehensive representation ofdispersion. In this section, we present the methodology details on how to make inference aboutcovariate effects β p ( t ij ) and simultaneously estimate the additive and multiplicativedispersion parameters φ and σ in our smoothed quasi-binomial mixed model (1). Westart with the case where true methylation counts S ij are available, and determine thecomplete data marginal quasi-likelihood function in Section 4.1. Then we describethe estimating algorithms for the complete and contaminated data in Sections 4.2and 4.3, respectively. We additional estimate the pointwise CIs for covariate effects β p ( t ij ) and obtain tests of hypotheses for these effects in Section 4.4.24 .1 Laplace-approximated marginal quasi-likelihood function In model (1), the function parameters β p ( t ij ) can be represented by the coefficientsof chosen spline bases of rank L p , β p ( t ij ) = (cid:80) L p l =1 α pl B ( p ) l ( t ij ) , for p = 0 , , . . . P. Here (cid:110) B ( p ) l ( · ) (cid:111) L p l =1 denotes the spline basis, and α p = ( α p , . . . α pL p ) T ∈ R L p are thecoefficients to be estimated. In this way, we can write the conditional mean in (1) ina compact way as g − ( π ) = X ( B ) α + X (1) u , where π = ( π , . . . π m , π , . . . π m , . . . π Nm N ) T ∈ [0 , M with M = (cid:80) Ni =1 m i , α =( α , α , . . . α p ) T ∈ R K with K = (cid:80) Pp =0 L p , and u = ( u , u , . . . u N ) T . X ( B ) is thespanned design matrix for α of dimension M × K , stacked with elements B ( p ) l ( t ij ) × Z pi with Z i ≡ X (1) is a random effect model matrix of dimension M × N , with element1 if the corresponding CpG site in the row belongs to the sample in the column, and0 otherwise. If we write the overall spanned design matrix X = (cid:2) X ( B ) , X (1) (cid:3) ∈R M × ( K + N ) and B = ( α T , u T ) T , the conditional mean can be further simplified as g − ( π ) = X B . To impose the assumption that the true covariate effect function is more likely to besmooth than jumpy, we add a smoothness penalty for each β p ( t ), p = 0 , , . . . P . Thetotal amount of such penalty is an aggregate from all smooth terms, i.e. L Smooth = P (cid:88) p =0 λ p (cid:90) (cid:0) β (cid:48)(cid:48) p ( t ) (cid:1) dt = P (cid:88) p =0 λ p α p T A p α p = α T A λ α , (6)where A p (cid:48) s are L p × L p positive semidefinite matrices with the ( l, l (cid:48) ) element A p ( l, l (cid:48) ) = (cid:82) B ( p ) (cid:48)(cid:48) l ( t ) B ( p ) (cid:48)(cid:48) l (cid:48) ( t ) dt, which are fixed quantities given the specified set of bases. Theweights λ p , i.e. the smoothing parameters, are positive parameters which establisha tradeoff between the closeness of the curve to the data and the smoothness of thefitted curves. A λ is a K × K positive semidefinite block diagonal matrix of the form A λ = Diag { λ A , λ A , . . . , λ P A P } . 25 andom-effect view of the smoothness penalty. As justified in Wahba (1983)and Silverman (1985), employing such smoothing penalty (6) during fitting is equiv-alent to imposing random effects for spline coefficients α . Specifically, α is assumedto follow a (degenerate) multivariate normal distribution with precision matrix A λ , α ∼ M V N ( , A λ − ) , where A λ − is the pseudoinverse of A λ . From a Bayesian viewpoint, imposing smooth-ness is equivalent to specifying a prior distribution on function roughness. Thisrandom-effect formulation of the smooth curve estimation problem opens up the pos-sibility of estimating λ and φ using marginal (quasi-)likelihood maximization. Inaddition, under such a formulation, it requires no extra effort to estimate the ‘actual’RE term u in our model (1), once the inference procedure for α is well established.In the rest of inference steps, we treat α as random effects. We first consider specifying the conditional “distribution” of S given the values ofREs B . Following the notion of extended quasi-likelihood (McCullagh and Nelder,1989, Section 9.6), we define the following conditional quasi-likelihood qL ( S | B ) ( B , φ ) ∝ exp (cid:40) − φ (cid:88) i,j d ij ( S ij , π ij ) − M φ (cid:41) , (7)where d ij ( S ij , π ij ) = − (cid:90) π ij S ij /X ij S ij − X ij π ij π ij (1 − π ij ) dπ ij is the quasi-deviance function corresponding to a single observation. It can be eas-ily checked that this quasi-likelihood exhibits the properties of log-likelihood, withrespect to B . Such properties approximately hold for the dispersion parameter φ , provided that φ be small and κ r = O ( φ r − ), where κ r is the rth-order cumu-lant of S | B (Efron, 1986; Jørgensen, 1987; McCullagh and Nelder, 1989). Let ql ( S | B ) ( B , φ ) = log (cid:2) qL ( S | B ) ( B , φ ) (cid:3) denote the conditional log-quasi-likelihood. Itshould be noted that the integral inside ql ( S | B ) ( B , φ ) rarely needs to be evaluatedfor the estimation of B , because the inference described later only requires the com-putation of its first and second derivatives, i.e. ∂ql ( S | B ) ( B , φ ) ∂ B = 1 φ X T ( S − Λ X π ) , ql ( S | B ) ( B , φ ) ∂ B ∂ B T = − φ X T W X , where Λ X ∈ R M × M is the diagonal matrix with values of read-depths, and W is theweight matrix whose diagonal is X ij π ij (1 − π ij ). For notational simplicity, we write Θ = ( λ , σ ) for the parameters involved in thecovariance structure of random effects B . Combining the conditional ‘distribution’ S | B with the marginal distribution of B , we obtain the following joint log-quasi-likelihood of the observed data S and unobserved random effects B q(cid:96) ( S , B ) ( B , φ, Θ ) = ql ( S | B ) ( B , φ ) − α T A λ α − σ u T u (cid:124) (cid:123)(cid:122) (cid:125) − φ B T Σ Θ B + 12 log {| A λ | + } + N (cid:0) /σ (cid:1)(cid:124) (cid:123)(cid:122) (cid:125) / {| Σ Θ /φ | + } , (8)where Σ Θ = diag { φ A λ , φ/σ I N } ∈ R ( K + N ) × ( K + N ) , and |•| + denotes the generalizeddeterminant of a matrix, i.e. the product of its non-zero eigenvalues. Here we intro-duce the scaling by φ in Σ Θ merely for later convenience, and this allows us to factorout the dispersion parameter φ in the penalized quasi-score in (12). In such way, thepoint estimates of random effects B are independent of the estimate of φ .This joint log-quasi-likelihood is composed of three parts: 1) the outcome ‘distri-bution’ depending on B and φ , 2) multiple quadratic penalties for B depending onregularization parameters Θ , and 3) fixed regularized terms for Θ . Our goals are toestimate the variance component parameters Θ , the dispersion parameter φ , and alsopredict the values of random effects B . When φ = 1, this fits a generalized linearmixed model (GLMM). A legitimate (quasi-)likelihood is the marginal ‘density’ evaluated at the observeddata S only, which is obtained by integrating out random effects B from the jointquasi-likelihood of S and B , qL M ( φ, Θ ) = (cid:90) exp (cid:8) q(cid:96) ( S , B ) ( B , φ, Θ ) (cid:9) d B . (9)27onceptually, maximizing qL M ( φ, Θ ) yields the maximum quasi-likelihood estimatorsfor Θ , and φ . However, the analytical solutions for this high-dimensional integral arenot easy to find, and an approximation approach is needed.As in Wood (2011), we use the Laplace approximation to evaluate the integralinside the marginal quasi-likelihood. Let (cid:98) B Θ be the value of B maximizing the jointquasi-likelihood q(cid:96) ( S , B ) ( B , φ, Θ ) given the values of variance component parameters Θ , i.e. (cid:98) B Θ = argmax (cid:26) ql ( S | B ) ( B , φ ) − φ B T Σ Θ B (cid:27) , (10)where terms not dependent on B have been dropped from the joint quasi-likelihood.The objective function in (10) is often referred to as the penalized (quasi-)likelihood.A second-order Taylor expansion of q(cid:96) ( S , B ) ( B , φ, Θ ), around (cid:98) B (the subscript Θ hasbeen dropped for notational simplicity), gives q(cid:96) ( S , B ) ( B , φ, Θ ) ≈ q(cid:96) ( S , B ) ( (cid:98) B , φ, Θ ) − (cid:16) B − (cid:98) B (cid:17) T H (cid:98) B (cid:16) B − (cid:98) B (cid:17) , where H (cid:98) B = −∇ B q(cid:96) ( S , B ) ( (cid:98) B , φ, Θ ) = 1 φ (cid:16) X T (cid:99) W X + Σ Θ (cid:17) . Therefore, the marginal quasi-likelihood in (9) can be approximately written as qL M ( φ, Θ ) ≈ exp (cid:110) q(cid:96) ( S , B ) ( (cid:98) B , φ, Θ ) (cid:111) (cid:90) exp (cid:26) − (cid:16) B − (cid:98) B (cid:17) T H (cid:98) B (cid:16) B − (cid:98) B (cid:17)(cid:27) d B ≈ exp (cid:110) q(cid:96) ( S , B ) ( (cid:98) B , φ, Θ ) (cid:111) √ π K + N (cid:12)(cid:12)(cid:12)(cid:12) X T (cid:99) W X + Σ Θ φ (cid:12)(cid:12)(cid:12)(cid:12) / ∝ φ − M/ exp (cid:32) − (cid:80) i,j (cid:98) d ij φ (cid:33) exp (cid:18) − φ (cid:98) B T Σ Θ (cid:98) B (cid:19) | Σ Θ /φ | / (cid:12)(cid:12)(cid:12)(cid:12) X T (cid:99) W X + Σ Θ φ (cid:12)(cid:12)(cid:12)(cid:12) − / . (11)In equation (11), (cid:98) d ij = d ij ( S ij , (cid:98) π ij ), where (cid:98) π ij = g − ( X ( l, ) (cid:98) B ) and l is the row inthe model matrix X corresponding to CpG j for sample i . We denote this Laplace-approximated marginal quasi-likelihood in (11) as qL Laplace ( φ, Θ ; (cid:98) B ) and simply writeLaplace( φ, Θ ; (cid:98) B ) = log[ qL Laplace ( φ, Θ ; (cid:98) B )], which depends on Θ via the dependenceof Σ Θ and (cid:98) B (and thus (cid:99) W and (cid:98) d ) on Θ .28 .2 Estimation algorithm for the complete data The essence of estimating Θ , B , and φ , is to optimize the Laplace-approximatedmarginal quasi-likelihood in (11). Note that such approximation requires calculatingthe maximum of the penalized quasi-likelihood in (10), (cid:98) B , along with its correspondingHessian H (cid:98) B , which is only feasible for given values of the penalty parameters Θ . Todisentangle the complicated dependence of (cid:98) B on Θ , we adopt a nested-optimizationstrategy proposed by Wood (2011). Specifically, the algorithm has an outer iterationfor updating Θ and φ , with each iterative step supplementing with an inner iterationto estimate random effects B corresponding to the current Θ , as summarized inAlgorithm 1. This Section proceeds with the detailed description of each step inAlgorithm 1. Algorithm 1:
Algorithm to find ( (cid:98) B , (cid:98) φ, (cid:98) Θ ) = argmax B ,φ, Θ (cid:96) ( S, B ) ( B , φ, Θ )using data { S , Z , Y } Initialize Θ (0) , φ (0) ; Choose ε = 10 − ; Set s = 0; repeat Step 1. Solve U ( B ; Θ ( s ) ) = (12) to obtain B ( s ) ;Step 2. Newton’s update for the Laplace-approximated marginallikelihood (log( φ ) , log( Θ )) ( s +1) = (log( φ ) , log( Θ )) ( s ) − (cid:104) ∇ Laplace( B ( s ) ) (cid:105) − ∇ Laplace( B ( s ) ); s ← s + 1; until (cid:107) B ( s ) − B ( s − (cid:107) < ε ;Return Θ ( s ) , B ( s ) , φ ( s ) ;Step 3: Calculate (cid:98) φ F le using B ( s ) B given the current ΘGiven the estimates of penalty parameters Θ , (cid:98) B can be computed as the solution to U ( B ) = 1 φ (cid:8) X T ( S − Λ X π ) − Σ Θ B (cid:9) = , (12)where U ( B ) is the quasi-score for the penalized quasi-likelihood in (10) with respectto B . We use the Newton’s method to solve these system of nonlinear equations.29pecifically we compute the gradient of U ( B ), ∇ U ( B ) = − X T W X + Σ Θ φ , and a single update from step l to step l + 1 for B thus takes the form B ( l +1) = B ( l ) + (cid:0) X T W X + Σ Θ (cid:1) − (cid:104) X T (cid:0) S − Λ X π ( l ) (cid:1) − Σ Θ B ( l ) (cid:105) . We then iteratively update B until convergence, which constitutes iteration Step 1 inAlgorithm 1. The outer iteration, which aims to maximize the Laplace-approximated marginalquasi-likelihood in (11), is also achieved by a Newton’s method. Wood (2011) has de-rived the derivatives and Hessian of Laplace( φ, Θ ; (cid:98) B ) with respect to ρ = (log( Θ ) , log( φ )),using a mixture of implicit and direct differentiations. We denote these first and sec-ond derivatives as ∇ Laplace( ρ ; (cid:98) B ) and ∇ Laplace( ρ ; (cid:98) B ), respectively. Relying on thework of Wood (2011), the maximization in the outer iteration can be readily achievedvia ρ ( s +1) = ρ ( s ) − (cid:104) ∇ Laplace (cid:16) ρ ( s ) ; (cid:98) B ( s ) (cid:17)(cid:105) − ∇ Laplace (cid:16) ρ ( s ) ; (cid:98) B ( s ) (cid:17) . (13)Here, (cid:98) B ( s ) are the estimated mean parameters given the current Θ ( s ) , obtained fromthe inner iteration in Section 4.2.1. Each update in (13) constitutes iteration Step 2in Algorithm 1. We iterate between the Step 1 and Step 2 until convergence to obtain (cid:98) B , (cid:98) Θ and (cid:98) φ . φ using the moment-based estimator As described in the previous section, the dispersion parameter φ can be estimated aspart of the outer iteration of the marginal quasi-likelihood maximization. We referto this estimator as likelihood-based dispersion estimator, denoted as (cid:98) φ Lik .In generalized linear models, it is common to estimate φ by dividing Pearson’s lack-of-fit statistic by the residual degrees of freedom, and this is known as the moment-based scale/dispersion estimator. We can apply the similar ideas here. Instead of30sing (cid:98) φ Lik , we take one step further and estimate φ using the final estimate (cid:98) B (andthus (cid:98) π ). Specifically, Pearson’s dispersion estimator can be written as (cid:98) φ P = 1 M − τ (cid:88) i,j (cid:32) S ij − X ij (cid:98) π ij (cid:112) X ij (cid:98) π ij (1 − (cid:98) π ij ) (cid:33) . Here τ is the effective degrees of freedom (Wood, 2017), defined as τ = trace ( F ) , with F = (cid:16) X T (cid:99) W X + Σ (cid:98) Θ (cid:17) − X T (cid:99) W X . (14)However, (cid:98) φ P can be unstable at finite sample sizes, especially when a few Pearsonresiduals are huge (Farrington, 1995; Fletcher, 2012). For example, in our model, (cid:98) π ij close to 0 can lead to a huge Pearson residual, even though the deviance d ij ( S ij , (cid:98) π ij ) in(7) is modest. Therefore, we adopt an improved version of the Pearson estimator, i.e.the Fletcher estimator (Fletcher, 2012), which is designed to mitigate this problem.The Fletcher’s dispersion estimator (cid:98) φ F le is defined as (cid:98) φ F le = (cid:98) φ P a , where a ij = 1 − (cid:98) π ij X ij (cid:98) π ij (1 − (cid:98) π ij ) ( S ij − X ij (cid:98) π ij ) and a = 1 M (cid:88) i,j a ij . If the mean model is adequate, then approximately we have( M − τ ) (cid:98) φ F le φ ∼ χ M − τ (15)(McCullagh, 1985; Fletcher, 2012). Therefore, (cid:98) φ F le provides an unbiased estimatorfor φ , which is also confirmed by simulation results as shown in Supplementary FigureS9. In contrast, the estimation using (cid:98) φ Lik can be considerably biased (Supplemen-tary Figure S9). Hence, we calculate the moment-based estimate for the dispersionparameter, which constitutes the Step 3 in Algorithm 1.
In the presence of experimental errors, the true methylation data, S ij are unknownand one only observes Y ij , which is assumed to be a mixture of binomial counts arisingfrom both the truly methylated and truly unmethylated reads. When S ij is modeledby a parametric distribution, like in Zhao et al. (2020), the EM algorithm (Dempsteret al., 1977) provides accurate estimation of the smooth covariate effects even thoughthe true methylation data are missing. Motivated by the work of Elashoff and Ryan(2004), we propose an extension of the EM algorithm with special treatment for themultiplicative dispersion parameter φ , to the case of quasi-likelihood-based analyses.31 .3.1 Expectation-Solving algorithm Elashoff and Ryan (2004) proposed an extension of the EM algorithm, called Expectation-Solving (ES) algorithm, to accommodate missing (or mis-measured) data when a nat-ural set of estimating equations exists for the complete data setting. Specifically, theE step computes the conditional expectation of the estimating equations given theobserved data, and S step solves these expected estimating equations.To apply the ES algorithm to our case, we need to evaluate the conditional ex-pectation of three sets of estimating equations: U ( B ; Θ ( s ) , S ) = 1 φ (cid:2) X T (cid:0) S − Λ X π ( s ) (cid:1) − Σ Θ ( s ) B (cid:3) = ∇ Θ Laplace( Θ , φ ; B ( s ) , S ) = 1 φ (cid:88) i,j (cid:40) S ij − X ij π ( s ) ij π ( s ) ij (1 − π ( s ) ij ) × dπ ( s ) ij d Θ (cid:41) + f ( Θ , φ ; B ( s ) ) = ∇ φ Laplace( Θ , φ ; B ( s ) , S ) = 1 φ (cid:88) i,j (cid:90) π ( s ) ij S ij /X ij S ij − X ij π ij π ij (1 − π ij ) dπ ij + f ( Θ , φ ; B ( s ) ) = , for B , Θ and φ , respectively. Here, Θ ( s ) , B ( s ) , and π ( s ) are estimates from the previousiterations, f ( · ) and f ( · ) denote the components that are independent of S . E step for B and S . The estimating equations for B and Θ are linear in thelatent methylated counts S , and thus their expectations equal U ( B ; Θ ( s ) , η (cid:63) ) and ∇ Θ Laplace( Θ , φ ; B ( s ) , η (cid:63) ), respectively. Here, η (cid:63) ∈ R M are the conditional expecta-tions of S given Y evaluated at the trial estimates ( B (cid:63) , Θ (cid:63) ), and for our model, takethe form η (cid:63)ij = E ( S ij | Y ij ; B (cid:63) , Θ (cid:63) ) = Y ij p π (cid:63)ij p π (cid:63)ij + p (1 − π (cid:63)ij ) + ( X ij − Y ij ) (1 − p ) π (cid:63)ij (1 − p ) π (cid:63)ij + (1 − p )(1 − π (cid:63)ij ) , (16)where π (cid:63)ij = g − ( X ( l, ) B (cid:63) ) and l is the row in the model matrix X corresponding toCpG j for sample i . These expected estimating equations can then be solved usingthe direct nested iteration method in Algorithm 1. E step for φ . However, the estimating equation for φ is not linear in the unknownmethylated counts S ; see details in Appendix B.1. Therefore, the closed-form exactexpression for E S | Y ; B (cid:63) , Θ (cid:63) ( ∇ φ Laplace( Θ , φ ; B ( s ) , S )) is not available, and the E-S al-gorithm cannot be readily applied to estimating φ from the contaminated data. To32ircumvent this problem, we propose a direct method to estimate φ without under-going the E-S iteration. φ Specifically, we estimate φ by exploiting its relationship with the dispersion for theobserved outcome Y , denoted as φ Yij , which is defined as φ Yij = V ar( Y ij | u i ) X ij π Yij (1 − π Yij ) , with π Yij = E ( Y ij | u i ) = π ij p + (1 − π ij ) p . Based on our assumed mean-variance relationship (2) and error model (5), we canexpress φ Yij in terms of φ , π ij and error parameters p and p , φ Yij = 1 + ( φ −
1) ( π Yij − p )( p − π Yij ) π Yij (1 − π Yij ) ; (17)see detailed derivations in Appendix B.2. Although we assume a constant dispersion φ for the true outcome S , the observed outcome Y implied by our error model,possesses dispersion parameter φ Yij varying with each CpG site, when φ (cid:54) = 1.Directly running the nested iteration method (Algorithm 1) on the observed data { Y, Z, X } reports a constant dispersion estimate (cid:98) φ Y and (cid:98) π Yij for all i and j , along withother useful estimates. We assume that (cid:98) φ Y is an estimate for the mean of individualdispersions φ Yij , i.e.1 M (cid:88) i,j φ Yij = 1 + ( φ −
1) 1 M (cid:88) i,j ( π Yij − p )( p − π Yij ) π Yij (1 − π Yij ) ; (18)empirical results show that this is a reasonable assumption, as shown in Supple-mentary Figure S11. We then propose to estimate φ by plugging in the error-proneoutcome-related estimates (cid:98) φ Y and (cid:98) π Yij to the relation in (18): (cid:98) φ = ( (cid:98) φ Y − (cid:34) M (cid:88) i,j ( (cid:98) π Yij − p )( p − (cid:98) π Yij ) (cid:98) π Yij (1 − (cid:98) π Yij ) (cid:35) − + 1 . We propose a hybrid ES algorithm to estimate our model using the error-prone out-comes Y . We first estimate φ using the aforementioned plug-in approach and thenestimate B and Θ using ES iterations assuming φ is fixed and known; detailed steps33re summarized in Algorithm 2. We denote the final estimates from our algorithm as (cid:98) φ , (cid:98) B and (cid:98) Θ . The components of (cid:98) α inside the vector of (cid:98) B leads to estimates of thefunctional parameters β p ( t ), for p = 0 , , . . . , P : (cid:91) β p ( t ) = (cid:110) B ( p ) ( t ) (cid:111) T { (cid:99) α p } , where t is a genomic position lying within the range of the input positions { t ij } , and B ( p ) ( t ) = ( B ( p )1 ( t ) , B ( p )2 ( t ) , . . . B ( p ) L p ( t )) T ∈ R L p is a column vector with nonrandomquantities obtained from evaluating the set of basis functions { B ( p ) l ( · ) } l at position t . Algorithm 2:
A hybrid ES algorithm to estimate the smoothed quasi-binomial mixed model with error-prone outcomes.Step 1: run Algorithm 1 on { Y, Z, X } ; return (cid:98) π Y , (cid:98) φ Y , (cid:98) B , and (cid:98) Θ ;Step 2: calculate the plug-in estimator (cid:98) φ ;Step 3: E-S iterations with φ fixed at (cid:98) φ to estimate B and Θ ; specificallyInitialize Θ (0) = (cid:98) Θ , , B (0) = (cid:98) B ; Choose ε = 10 − ; Set (cid:96) = 0; repeat • E step: η ( (cid:96) ) ij = E ( S ij | Y ij ; B ( (cid:96) ) ); • S step: ( B ( (cid:96) ) , Θ ( (cid:96) ) ) = argmax B , Θ (cid:96) ( B , Θ ) (cid:16) B , Θ ; η ( (cid:96) ) ij , (cid:98) φ (cid:17) . Specifically repeat • Solve U ( B ; Θ ( s ) ; η ( (cid:96) ) ) = to obtain B ( s ) using data η ( (cid:96) ) ij ; • Newton’s update for the Laplace approximated marginal likelihoodevaluated at data η ( (cid:96) ) ij :(log Θ ) ( s +1) = (log Θ ) ( s ) − (cid:104) ∇ Θ Laplace( B ( s ) ) (cid:105) − ∇ Θ Laplace( B ( s ) ); s ← s + 1; until (cid:107) B ( s ) − B ( s − (cid:107) < ε ; (cid:96) ← (cid:96) + 1; until (cid:107) B ( (cid:96) ) − B ( (cid:96) − (cid:107) < ε ;Return Θ ( (cid:96) ) , B ( (cid:96) ) ; We then estimate the pointwise confidence intervals (CI) for the smoothed covariateeffects { β ( t ) , β ( t ) , . . . , β P ( t ) } , and obtain tests of hypotheses for these effects. Note34hat the inference is carried out conditional on the values of variance componentparameters Θ and dispersion parameter φ , i.e. the uncertainty in estimating them isnot accounted for. As did in Elashoff and Ryan (2004), we can re-express the E step as the solution tothe following M-dimensional estimating equation: U (2) ( S ) = S − (cid:98) η = , where (cid:98) η are the conditional expectations in (16) evaluated at the current estimate (cid:98) π . In this way, the overall ES algorithm can be viewed as solving an expanded set ofequations of dimension K + N + M , whose first K + N components are U ( B ) = in (12) and whose second M components are U (2) ( S ) = .Under this formulation, we use the established theory for estimating equations(Lindsay, 1982; Heyde and Morton, 1996; Small et al., 2003), and propose a model-based variance estimator for (cid:98) B . Specifically, under correct specification of the firsttwo moments of S , the asymptotic variance of (cid:98) B can be written as V ar( (cid:98) B ) = (cid:2) ( − D ) − (cid:3) ( B , B ) , where D is the first order derivative of the expanded estimating equations for B and S , and [ • ] ( B , B ) stands for the matrix block corresponding to B . In our case, D takesthe form D = − φ X T W X + 1 φ Σ Θ − φ X T W δ X − I M . Here, W δ is a diagonal matrix with elements X ij δ ij , where δ ij = Y ij p p (cid:2) p π ij + p (1 − π ij ) (cid:3) + ( X ij − Y ij ) (1 − p )(1 − p ) (cid:2) (1 − p ) π ij + (1 − p )(1 − π ij ) (cid:3) , and reduces to a zero matrix when p = 1 − p = 0. Then, the asymptotic varianceof (cid:98) B can be simplified as V ar( (cid:98) B ) = (cid:2) X T ( W − W δ ) X + Σ Θ (cid:3) − φ. (19)Therefore, the desired variance estimator of (cid:98) B can be obtained by plugging in thefinal estimates (cid:98) B , (cid:98) Θ and (cid:98) φ into equation (19).35 .4.2 Confidence interval estimation Let (cid:98) V denote the aforementioned variance estimator and (cid:98) V p be the diagonal blocksof (cid:98) V corresponding to α p , with dimensions L p × L p . We then immediately have theestimated variance of (cid:91) β p ( t ): (cid:100) V ar( (cid:91) β p ( t )) = (cid:110) B ( p ) ( t ) (cid:111) T (cid:98) V p (cid:110) B ( p ) ( t ) (cid:111) . Therefore, theconfidence interval for β p ( t ) at significance level ν can be approximately estimatedby (cid:91) β p ( t ) ± Z ν/ (cid:113)(cid:100) V ar( (cid:91) β p ( t )) , for any t in the range of interest, where Z ν/ is ν/ We can also construct a region-wide test of the null hypothesis H : β p ( t ) = 0 , for any t in the genomic interval . This test depends on the association between covariate Z p and methylation levelsacross the region, after adjustment for all the other covariates, and the null hypothesisis equivalent to H : α p = . We propose the following region-based F statistic T p = (cid:99) α p T (cid:110) (cid:98) V p (cid:111) − (cid:99) α p τ p , where { (cid:98) V p } − denotes inverse if (cid:98) V p is nonsigular; for singular (cid:98) V p , the inverse isreplaced by the Moore-Penrose inverse { (cid:98) V p } − . Here, τ p is the effective degrees offreedom (EDF) for smooth term β p ( t ), which depends on the magnitude of smoothingparameter λ and random effect variances σ . Motivated by the work of Wood (2013),we define the EDF τ p as τ p = b p (cid:88) l = a p (2 F − F F ) ( l,l ) , for p = 0 , , . . . P, where a p = (cid:80) p − m =0 L m + 1 if p > a p = 1 if p = 0, b p = (cid:80) pm =0 L m for any p , and( • ) ( l,l ) stands for the l th leading diagonal element of a matrix. F is the smoothingmatrix of our model, as defined in (14), which can be viewed as the matrix mappingthe pseudo data to its predicted mean.Let V p = (cid:98) V p · φ/ (cid:98) φ be the variance estimator for α p when the dispersion parameter φ is known. Zhao et al. (2020) have shown the following asymptotic results under thenull (cid:99) α p T { V p } − (cid:99) α p ∼ χ τ p . T p asymptotically follows a F distributionwith degrees of freedom τ p and M − τ , i.e. T p ∼ F τ p ,M − τ . References
Akalin, A., Kormaksson, M., Li, S., Garrett-Bakelman, F. E., Figueroa, M. E., Mel-nick, A., and Mason, C. E. (2012). methylkit: a comprehensive r package for theanalysis of genome-wide dna methylation profiles.
Genome biology , 13(10):1–9.Allum, F., Hedman, ˚A. K., Shao, X., Cheung, W. A., Vijay, J., Gu´enard, F., Kwan,T., Simon, M.-M., Ge, B., Moura, C., et al. (2019). Dissecting features of epigeneticvariants underlying cardiometabolic risk using full-resolution epigenome profilingin regulatory elements.
Nature communications , 10(1):1–13.Allum, F., Shao, X., Gu´enard, F., Simon, M.-M., Busche, S., Caron, M., Lambourne,J., Lessard, J., Tandre, K., Hedman, ˚A. K., et al. (2015). Characterization of func-tional methylomes by next-generation capture sequencing identifies novel disease-associated variants.
Nature communications , 6(1):1–12.Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linearmixed models.
Journal of the American statistical Association , 88(421):9–25.Browne, W. J., Subramanian, S. V., Jones, K., and Goldstein, H. (2005). Variancepartitioning in multilevel logistic models that exhibit overdispersion.
Journal ofthe Royal Statistical Society: Series A (Statistics in Society) , 168(3):599–613.Cheng, L. and Zhu, Y. (2013). A classification approach for dna methylation profilingwith bisulfite next-generation sequencing data.
Bioinformatics , 30(2):172–179.Choy, M.-K., Movassagh, M., Goh, H.-G., Bennett, M. R., Down, T. A., and Foo,R. S. (2010). Genome-wide conserved consensus transcription factor binding motifsare hyper-methylated.
BMC genomics , 11(1):519.Cui, S., Ji, T., Li, J., Cheng, J., and Qiu, J. (2016). What if we ignore the ran-dom effects when analyzing rna-seq data in a multifactor experiment.
Statisticalapplications in genetics and molecular biology , 15(2):87–105.37avison, A. C. and Hinkley, D. V. (1997).
Bootstrap methods and their application .Number 1. Cambridge university press.Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood fromincomplete data via the em algorithm.
Journal of the Royal Statistical Society.Series B (Statistical Methodology) , pages 1–38.Dolinoy, D. C., Huang, D., and Jirtle, R. L. (2007). Maternal nutrient supplementa-tion counteracts bisphenol a-induced dna hypomethylation in early development.
Proceedings of the National Academy of Sciences , 104(32):13056–13061.Dolzhenko, E. and Smith, A. D. (2014). Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite se-quencing experiments.
BMC bioinformatics , 15(1):215.Dunaway, K. W., Islam, M. S., Coulson, R. L., Lopez, S. J., Ciernia, A. V., Chu, R. G.,Yasui, D. H., Pessah, I. N., Lott, P., Mordaunt, C., et al. (2016). Cumulativeimpact of polychlorinated biphenyl and large chromosomal duplications on dnamethylation, chromatin, and expression of autism candidate genes.
Cell reports ,17(11):3035–3048.Efron, B. (1986). Double exponential families and their use in generalized linearregression.
Journal of the American Statistical Association , 81(395):709–721.Elashoff, M. and Ryan, L. (2004). An em algorithm for estimating equations.
Journalof Computational and Graphical Statistics , 13(1):48–65.Fan, J., Hu, J., Xue, C., Zhang, H., Susztak, K., Reilly, M. P., Xiao, R., and Li, M.(2020). Asep: Gene-based detection of allele-specific expression across individualsin a population by rna sequencing.
PLoS Genetics , 16(5):e1008786.Farrington, C. (1995). Pearson statistics, goodness of fit, and overdispersion in gen-eralised linear models. In
Statistical Modelling , pages 109–116. Springer.Feinberg, A. P. (2007). Phenotypic plasticity and the epigenetics of human disease.
Nature , 447(7143):433. 38eng, H., Conneely, K. N., and Wu, H. (2014). A bayesian hierarchical model todetect differentially methylated loci from single nucleotide resolution sequencingdata.
Nucleic acids research , 42(8):e69–e69.Fletcher, D. (2012). Estimating overdispersion when fitting a generalized linear modelto sparse data.
Biometrika , 99(1):230–237.Forslind, K., Ahlm´en, M., Eberhardt, K., Hafstr¨om, I., and Svensson, B. (2004).Prediction of radiological outcome in early rheumatoid arthritis in clinical practice:role of antibodies to citrullinated peptides (anti-ccp).
Annals of the rheumaticdiseases , 63(9):1090–1095.Goeman, J. J., Van De Geer, S. A., and Van Houwelingen, H. C. (2006). Testingagainst a high dimensional alternative.
Journal of the Royal Statistical Society:Series B (Statistical Methodology) , 68(3):477–493.Hansen, K. D., Langmead, B., and Irizarry, R. A. (2012). Bsmooth: from wholegenome bisulfite sequencing reads to differentially methylated regions.
Genomebiology , 13(10):R83.Hansen, K. D., Timp, W., Bravo, H. C., Sabunciyan, S., Langmead, B., McDonald,O. G., Wen, B., Wu, H., Liu, Y., Diep, D., et al. (2011). Increased methylationvariation in epigenetic domains across cancer types.
Nature genetics , 43(8):768.Hanson, M. A. and Gluckman, P. D. (2008). Developmental origins of health anddisease: new insights.
Basic & clinical pharmacology & toxicology , 102(2):90–93.Hebestreit, K., Dugas, M., and Klein, H.-U. (2013). Detection of significantly differ-entially methylated regions in targeted bisulfite sequencing data.
Bioinformatics ,29(13):1647–1653.Heyde, C. and Morton, R. (1996). Quasi-likelihood and generalizing the em algorithm.
Journal of the Royal Statistical Society: Series B (Methodological) , 58(2):317–327.Horvath, S. (2013). Dna methylation age of human tissues and cell types.
Genomebiology , 14(10):3156. 39u, M., Yao, J., Cai, L., Bachman, K. E., Van Den Brˆule, F., Velculescu, V., andPolyak, K. (2005). Distinct epigenetic changes in the stromal cells of breast cancers.
Nature genetics , 37(8):899–905.Hudson, M., Bernatsky, S., Colmegna, I., Lora, M., Pastinen, T., Klein Oros, K.,and Greenwood, C. M. (2017). Novel insights into systemic autoimmune rheumaticdiseases using shared molecular signatures and an integrative analysis.
Epigenetics ,12(6):433–440.Ivanova, A., Molenberghs, G., and Verbeke, G. (2014). A model for overdispersedhierarchical ordinal data.
Statistical Modelling , 14(5):399–415.Jaenisch, R. and Bird, A. (2003). Epigenetic regulation of gene expression: how thegenome integrates intrinsic and environmental signals.
Nature genetics , 33:245.Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995).
Continuous univariate distri-butions . John Wiley & Sons, Ltd.Jones, P. A. (1999). The dna methylation paradox.
Trends in Genetics , 15(1):34–37.Jørgensen, B. (1987). Exponential dispersion models.
Journal of the Royal StatisticalSociety: Series B (Methodological) , 49(2):127–145.Kato, T., Iwamoto, K., Kakiuchi, C., Kuratomi, G., and Okazaki, Y. (2005). Geneticor epigenetic difference causing discordance between monozygotic twins as a clueto molecular basis of mental disorders.
Molecular psychiatry , 10(7):622–630.Korthauer, K., Chakraborty, S., Benjamini, Y., and Irizarry, R. A. (2018). Detectionand accurate false discovery rate control of differentially methylated regions fromwhole genome bisulfite sequencing.
Biostatistics .Lakhal-Chaieb, L., Greenwood, C. M., Ouhourane, M., Zhao, K., Abdous, B., andOualkacha, K. (2017). A smoothed em-algorithm for dna methylation profiles fromsequencing-based methods in cell lines or for a single cell type.
Statistical applica-tions in genetics and molecular biology , 16(5-6):333–347.Lea, A. J., Tung, J., and Zhou, X. (2015). A flexible, efficient binomial mixed modelfor identifying differential dna methylation in bisulfite sequencing data.
PLoS ge-netics , 11(11):e1005650. 40eek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E.,Geman, D., Baggerly, K., and Irizarry, R. A. (2010). Tackling the widespread andcritical impact of batch effects in high-throughput data.
Nature Reviews Genetics ,11(10):733–739.Lindsay, B. (1982). Conditional score functions: some optimality results.
Biometrika ,69(3):503–512.Lister, R., Pelizzola, M., Dowen, R. H., Hawkins, R. D., Hon, G., Tonti-Filippini, J.,Nery, J. R., Lee, L., Ye, Z., Ngo, Q.-M., et al. (2009). Human dna methylomes atbase resolution show widespread epigenomic differences. nature , 462(7271):315.McCullagh, P. (1985). On the asymptotic distribution of pearson’s statistic in linearexponential-family models.
International Statistical Review/Revue Internationalede Statistique , pages 61–67.McCullagh, P. and Nelder, J. A. (1989). Generalized linear models 2nd edition chap-man and hall.
London, UK .McGregor, K., Bernatsky, S., Colmegna, I., Hudson, M., Pastinen, T., Labbe, A.,and Greenwood, C. M. (2016). An evaluation of methods correcting for cell-typeheterogeneity in dna methylation studies.
Genome biology , 17(1):84.McRae, A. F., Powell, J. E., Henders, A. K., Bowdler, L., Hemani, G., Shah, S.,Painter, J. N., Martin, N. G., Visscher, P. M., and Montgomery, G. W. (2014).Contribution of genetic variation to transgenerational inheritance of dna methyla-tion.
Genome biology , 15(5):R73.Meaney, M. J. and Szyf, M. (2005). Environmental programming of stress responsesthrough dna methylation: life at the interface between a dynamic environment anda fixed genome.
Dialogues in clinical neuroscience , 7(2):103.Molenberghs, G., Verbeke, G., and Dem´etrio, C. G. (2007). An extended random-effects approach to modeling repeated, overdispersed count data.
Lifetime dataanalysis , 13(4):513–531.Molenberghs, G., Verbeke, G., Dem´etrio, C. G., Vieira, A. M., et al. (2010). A fam-ily of generalized linear models for repeated measures with normal and conjugaterandom effects.
Statistical science , 25(3):325–347.41olenberghs, G., Verbeke, G., Iddi, S., and Dem´etrio, C. G. (2012). A combined betaand normal random-effects model for repeated, overdispersed binary and binomialdata.
Journal of Multivariate Analysis , 111:94–109.Ober, C. and Vercelli, D. (2011). Gene–environment interactions in human disease:nuisance or opportunity?
Trends in genetics , 27(3):107–115.Park, Y., Figueroa, M. E., Rozek, L. S., and Sartor, M. A. (2014). Methylsig: a wholegenome dna methylation analysis pipeline.
Bioinformatics , 30(17):2414–2422.Park, Y. and Wu, H. (2016). Differential methylation analysis for bs-seq data undergeneral experimental design.
Bioinformatics , 32(10):1446–1453.Prochenka, A., Pokarowski, P., Gasperowicz, P., Kosi´nska, J., Stawi´nski, P., Zbie´c-Piekarska, R., Sp´olnicka, M., Branicki, W., and P(cid:32)loski, R. (2015). A cautionary noteon using binary calls for analysis of dna methylation.
Bioinformatics , 31(9):1519–1520.Rackham, O. J., Langley, S. R., Oates, T., Vradi, E., Harmston, N., Srivastava, P. K.,Behmoaras, J., Dellaportas, P., Bottolo, L., and Petretto, E. (2017). A bayesianapproach for analysis of whole-genome bisulphite sequencing data identifies disease-associated changes in dna methylation.
Genetics , pages genetics–116.Schoofs, T., Rohde, C., Hebestreit, K., Klein, H.-U., G¨ollner, S., Schulze, I., Lerdrup,M., Dietrich, N., Agrawal-Singh, S., Witten, A., et al. (2013). Dna methylationchanges are a late event in acute promyelocytic leukemia and coincide with lossof transcription factor binding.
Blood, The Journal of the American Society ofHematology , 121(1):178–187.Shao, X., Hudson, M., Colmegna, I., Greenwood, C. M., Fritzler, M. J., Awadalla, P.,Pastinen, T., and Bernatsky, S. (2019). Rheumatoid arthritis-relevant dna methyla-tion changes identified in acpa-positive asymptomatic individuals using methylomecapture sequencing.
Clinical epigenetics , 11(1):110.Shokoohi, F., Stephens, D. A., Bourque, G., Pastinen, T., Greenwood, C. M., andLabbe, A. (2018). A hidden markov model for identifying differentially methylatedsites in bisulfite sequencing data.
Biometrics .42hun, Z. and McCullagh, P. (1995). Laplace approximation of high dimensional inte-grals.
Journal of the Royal Statistical Society: Series B (Methodological) , 57(4):749–760.Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting.
Journal of the Royal Statistical Society: SeriesB (Methodological) , 47(1):1–21.Sims, D., Sudbery, I., Ilott, N. E., Heger, A., and Ponting, C. P. (2014). Sequenc-ing depth and coverage: key considerations in genomic analyses.
Nature ReviewsGenetics , 15(2):121–132.Small, C. G., Christopher, G., Wang, J., et al. (2003).
Numerical methods for non-linear estimating equations , volume 29. Oxford University Press on Demand.Vahabi, N., Kazemnejad, A., and Datta, S. (2019). A joint overdispersed marginalizedrandom-effects model for analyzing two or more longitudinal ordinal responses.
Statistical Methods in Medical Research , 28(1):50–69.Wahba, G. (1983). Bayesian “confidence intervals” for the cross-validated smooth-ing spline.
Journal of the Royal Statistical Society: Series B (Methodological) ,45(1):133–150.Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likeli-hood estimation of semiparametric generalized linear models.
Journal of the RoyalStatistical Society: Series B (Statistical Methodology) , 73(1):3–36.Wood, S. N. (2013). On p-values for smooth components of an extended generalizedadditive model.
Biometrika , 100(1):221–228.Wood, S. N. (2017).
Generalized additive models: an introduction with R . CRC press.Wreczycka, K., Gosdschan, A., Yusuf, D., Gruening, B., Assenov, Y., and Akalin, A.(2017). Strategies for analyzing bisulfite sequencing data.
Journal of biotechnology ,261:105–115.Wu, H., Xu, T., Feng, H., Chen, L., Li, B., Yao, B., Qin, Z., Jin, P., and Conneely,K. N. (2015). Detection of differentially methylated regions from whole-genome43isulfite sequencing data without replicates.
Nucleic acids research , 43(21):e141–e141.Zhao, K., Oualkacha, K., Lakhal-Chaieb, L., Labbe, A., Klein, K., Ciampi, A., Hud-son, M., Colmegna, I., Pastinen, T., Zhang, T., et al. (2020). A novel statisticalmethod for modeling covariate effects in bisulfite sequencing derived measures ofdna methylation.
Biometrics .Ziller, M. J., Gu, H., M¨uller, F., Donaghey, J., Tsai, L. T.-Y., Kohlbacher, O.,De Jager, P. L., Rosen, E. D., Bennett, D. A., Bernstein, B. E., et al. (2013).Charting a dynamic dna methylation landscape of the human genome.
Nature ,500(7463):477.Ziller, M. J., Stamenova, E. K., Gu, H., Gnirke, A., and Meissner, A. (2016). Targetedbisulfite sequencing of the dynamic dna methylome.
Epigenetics & chromatin ,9(1):1–9.
A Marginal interpretations for dSOMNiBUS
A.1 Marginal mean
The latent variable representation of the logistic mixed effect model in (1) is S (cid:63)ijk = η ij + (cid:15) ij + u i S ijk = , if S (cid:63)ijk ≥ , if S (cid:63)ijk < S (cid:63)ijk is the unobserved latent variable, η ij = (cid:80) Pp =0 β p ( t ij ) Z pi is the linear pre-dictor calculated from all the fixed effect, (cid:15) ij are iid error terms following a logisticdistribution, and u i is the subject-specific random effect as defined in Section 2.1. Inaddition, the error term (cid:15) ij and RE u i are mutually independent. Specifically, thecumulative distribution function (cdf) for (cid:15) takes the form g ( x ) = 1 / (1 + exp( − x )).The calculation of marginal mean π Mij = P ( η ij + (cid:15) ij + u i ≥
0) requires integration overthe joint distribution of (cid:15) ij and u i , which has no closed-form solution. Instead, we44an approximate the logistic cdf g ( x ) by a normal cdf (Johnson et al., 1995, p. 119),which will lead to a more analytically tractable solution. Specifically, we have g ( x ) ≈ Φ( cx ) , with c = √ . /π, where Φ( x ) is the cdf of the standard normal distribution. For any x value, themaximum absolute difference of this approximation is 0.00948.Therefore, we can approximately view (cid:15) ij as a normal random variable, (cid:15) ij ∼ N (0 , /c ). Since (cid:15) ij and u i are independent, we have (cid:15) ij + u i ∼ N (0 , /c + σ ). Themarginal mean can be thus derived as π Mij = P ( (cid:15) ij + u i ≥ − η ij ) = P (cid:32) (cid:15) ij + u i (cid:112) /c + σ ≥ − η ij (cid:112) /c + σ (cid:33) ≈ Φ (cid:32) η ij (cid:112) /c + σ (cid:33) ≈ g (cid:32) η ij (cid:112) c σ . (cid:33) A.2 Marginal variance
We will use the mixed effect model formulation in (1) to derive the marginal variance.Using the law of total variance, the marginal variance of S ij is the sum of two parts: V ar ( S ij ) = E { V ar ( S ij | u i ) } + V ar { E ( S ij | u i ) } = φX ij E { π ij (1 − π ij ) } + X ij V ar ( π ij ) , (20)where π ij = g ( η ij + u i ) is the conditional mean dependent on u i . The exact closed-form formula does not exist for either E ( π ij ) or V ar ( π ij ). Nevertheless, we can workon the second-order Taylor expansion of π ij around u i = 0, i.e. π ij = g ( η ij + u i ) ≈ g ( η ij ) + g (cid:48) ( η ij ) u i + g (cid:48)(cid:48) ( η ij ) u i /
2. Thus, we have E ( π ij ) ≈ g ( η ij ) + g (cid:48)(cid:48) ( η ij ) σ / V ar ( π ij ) ≈ E (cid:40)(cid:20) g (cid:48) ( η ij ) u i + g (cid:48)(cid:48) ( η ij )2 (cid:0) u i − σ (cid:1)(cid:21) (cid:41) = σ [ g (cid:48) ( η ij )] + σ g (cid:48)(cid:48) ( η ij )] , and E (cid:0) π ij (cid:1) ≈ σ [ g (cid:48) ( η ij )] + σ g (cid:48)(cid:48) ( η ij )] + (cid:20) g ( η ij ) + g (cid:48)(cid:48) t ( η ij )2 σ (cid:21) . Substituting theabove approximations into (20) yields the results in equation (4).45 Estimate φ from the contaminated data B.1 No exact expression available for the E step for φ Once evaluated the integral in the quasi-deviance d ij ( S ij , π ij ) (7), the estimatingequation for φ takes the form ∇ φ Laplace( Θ , φ ; B ( s ) , S ) = 1 φ (cid:88) i,j (cid:90) π ( s ) ij S ij /X ij S ij − X ij π ij π ij (1 − π ij ) dπ ij + f ( Θ , φ ; B ( s ) )= 1 φ (cid:88) i,j (cid:110) ( X ij − S ij ) log(1 − π ( s ) ij ) + S ij log( π ( s ) ij ) − ( X ij − S ij ) log(1 − S ij /X ij ) − S ij log( S ij /X ij ) (cid:9) + f ( Θ , φ ; B ( s ) ) . This estimating equation is not linear in terms of the unknown methylated counts S .Thus, replacing S ij by η (cid:63)ij = E ( S ij | Y ij ; B (cid:63) , Θ (cid:63) ) does not necessarily provide an accu-rate estimate for E S | Y ; Θ (cid:63) , B (cid:63) ( ∇ φ Laplace( Θ , φ ; B ( s ) , S )), and the exact expression forthis expectation is not readily available from the first two moments of the distributionof S ij . B.2 The relation between φ Yij and φ All the expectation and variance in this section are conditional on the values ofrandom effects u i . For notational simplicity, we drop u i from all the derivations inthis section.The variance of Y ij depends on its mean π Yij as well as the joint probability P ( Y ijk =1 , Y ijk (cid:48) = 1), i.e. observing methylated signals at both the k th and k (cid:48) th reads, where k, k (cid:48) = 1 , , . . . X ij and k (cid:54) = k (cid:48) : V ar( Y ij ) = E ( Y ij ) − [ E ( Y ij )] = E X ij (cid:88) k =1 Y ijk − X ij ( π Yij ) = X ij (cid:88) k =1 E ( Y ijk ) + 2 X ij (cid:88) k =1 k − (cid:88) k (cid:48) =1 E ( Y ijk Y ijk (cid:48) ) − X ij ( π Yij ) = X ij π Yij − X ij ( π Yij ) + 2 X ij (cid:88) k =1 k − (cid:88) k (cid:48) =1 P ( Y ijk = 1 , Y ijk (cid:48) = 1) . (21)46y the law of total probability, we have P ( Y ijk = Y ijk (cid:48) = 1) = (cid:88) s =0 1 (cid:88) s =0 P ( S ijk = s , S ijk (cid:48) = s ) P ( Y ijk = Y ijk (cid:48) = 1 | S ijk = s , S ijk (cid:48) = s ) . Joint distribution of the bivariate outcomes ( S ijk , S ijk (cid:48) ) . Note that, underour assumed mean-variance relationship in (2), S ijk and S ijk (cid:48) are not necessarilyindependent. Define a ijkk (cid:48) = P ( S ijk = 1 , S ijk (cid:48) = 1). The joint probability massfunction of ( S ijk , S ijk (cid:48) ) can be thus written as P ( S ijk = 1 , S ijk (cid:48) = 1) = a ijkk (cid:48) P ( S ijk = 1 , S ijk (cid:48) = 0) = π ij − a ijkk (cid:48) P ( S ijk = 0 , S ijk (cid:48) = 1) = π ij − a ijkk (cid:48) P ( S ijk = 0 , S ijk (cid:48) = 0) = 1 − π ij + a ijkk (cid:48) . We now can write the probability of observing two methylated reads as P ( Y ijk = Y ijk (cid:48) = 1) = p (1 − π ij + a ijkk (cid:48) ) + 2 p p ( π ij − a ijkk (cid:48) ) + p a ijkk (cid:48) . Here, we assume that given the true methylation states S ijk and S ijk (cid:48) , the observedmethylation states Y ijk and Y ijk (cid:48) are independent. Derive the values of a ijkk (cid:48) . From first principle, we can express the variance of S ij = (cid:80) X ij k =1 S ijk , V ar( S ij ) = X ij (cid:88) k =1 V ar( S ijk ) + 2 X ij (cid:88) k =1 k − (cid:88) k (cid:48) =1 C ov( S ijk , S ijk (cid:48) )= X ij π ij (1 − π ij ) + 2 X ij (cid:88) k =1 k − (cid:88) k (cid:48) =1 E ( S ijk S ijk (cid:48) ) − X ij (cid:88) k =1 k − (cid:88) k (cid:48) =1 E ( S ijk ) E ( S ijk (cid:48) )= X ij π ij (1 − π ij ) + 2 X ij (cid:88) k =1 k − (cid:88) k (cid:48) =1 P ( S ijk = 1 , S ijk (cid:48) = 1) − X ij ( X ij − π ij = X ij π ij (1 − π ij ) + 2 X ij (cid:88) k =1 k − (cid:88) k (cid:48) =1 a ijkk (cid:48) − X ij ( X ij − π ij .
47n the other hand, we have V ar( S ij ) = φX ij π ij (1 − π ij ). Equating these two quantitiesgives 2 X ij (cid:88) k =1 k − (cid:88) k (cid:48) =1 a ijkk (cid:48) = ( φ − X ij π ij (1 − π ij ) + X ij ( X ij − π ij Derive V ar ( Y ij ) and φ Y . Now, we can plug the expression of P ( Y ijk = Y ijk (cid:48) = 1)in (21) and write V ar( Y ij ) in terms of φ V ar( Y ij ) = X ij π Yij − X ij ( π Yij ) + 2 X ij (cid:88) k =1 k − (cid:88) k (cid:48) =1 (cid:2) p (1 − π ij + a ijkk (cid:48) ) + 2 p p ( π ij − a ijkk (cid:48) ) + p a ijkk (cid:48) (cid:3) = X ij π Yij − X ij ( π Yij ) + X ij ( X ij − (cid:8) p (1 − π ij ) + 2 p p π ij (cid:9) + 2( p − p ) X ij (cid:88) k =1 k − (cid:88) k (cid:48) =1 a ijkk (cid:48) . = X ij π Yij − X ij ( π Yij ) + X ij ( X ij − (cid:8) p (1 − π ij ) + 2 p p π ij (cid:9) +( p − p ) (cid:8) ( φ − X ij π ij (1 − π ij ) + X ij ( X ij − π ij (cid:9) = X ij π Yij (1 − π Yij ) + ( p − p ) ( φ − X ij π ij (1 − π ij )The multiplicative dispersion parameter for the mis-measured outcome Y is thus φ Yij = V ar( Y ij ) X ij π Yij (1 − π Yij ) = 1 + ( φ − π ij (1 − π ij ) π Yij (1 − π Yij ) ( p − p ) . Plugging in π ij = π Yij − p0
47n the other hand, we have V ar( S ij ) = φX ij π ij (1 − π ij ). Equating these two quantitiesgives 2 X ij (cid:88) k =1 k − (cid:88) k (cid:48) =1 a ijkk (cid:48) = ( φ − X ij π ij (1 − π ij ) + X ij ( X ij − π ij Derive V ar ( Y ij ) and φ Y . Now, we can plug the expression of P ( Y ijk = Y ijk (cid:48) = 1)in (21) and write V ar( Y ij ) in terms of φ V ar( Y ij ) = X ij π Yij − X ij ( π Yij ) + 2 X ij (cid:88) k =1 k − (cid:88) k (cid:48) =1 (cid:2) p (1 − π ij + a ijkk (cid:48) ) + 2 p p ( π ij − a ijkk (cid:48) ) + p a ijkk (cid:48) (cid:3) = X ij π Yij − X ij ( π Yij ) + X ij ( X ij − (cid:8) p (1 − π ij ) + 2 p p π ij (cid:9) + 2( p − p ) X ij (cid:88) k =1 k − (cid:88) k (cid:48) =1 a ijkk (cid:48) . = X ij π Yij − X ij ( π Yij ) + X ij ( X ij − (cid:8) p (1 − π ij ) + 2 p p π ij (cid:9) +( p − p ) (cid:8) ( φ − X ij π ij (1 − π ij ) + X ij ( X ij − π ij (cid:9) = X ij π Yij (1 − π Yij ) + ( p − p ) ( φ − X ij π ij (1 − π ij )The multiplicative dispersion parameter for the mis-measured outcome Y is thus φ Yij = V ar( Y ij ) X ij π Yij (1 − π Yij ) = 1 + ( φ − π ij (1 − π ij ) π Yij (1 − π Yij ) ( p − p ) . Plugging in π ij = π Yij − p0 p − p0