Segmentation and genome annotation algorithms
Maxwell W. Libbrecht (0000-0003-2502-0262)†*, Rachel C.W. Chan (0000-0003-1009-6379)†, Michael M. Hoffman (0000-0002-4517-1562)
School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
Department of Computer Science, University of Toronto, Toronto, ON, Canada
Princess Margaret Cancer Centre, Toronto, ON, Canada
Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
Vector Institute for Artificial Intelligence, Toronto, ON, Canada
† These authors contributed equally to this work. * maxwell [email protected], [email protected]
Abstract
Segmentation and genome annotation (SAGA) algorithms are widely used to understand genome activity and gene regulation. These algorithms take as input epigenomic datasets, such as chromatin immunoprecipitation-sequencing (ChIP-seq) measurements of histone modifications or transcription factor binding. They partition the genome and assign a label to each segment such that positions with the same label exhibit similar patterns of input data. SAGA algorithms discover categories of activity such as promoters, enhancers, or parts of genes without prior knowledge of known genomic elements. In this sense, they generally act in an unsupervised fashion like clustering algorithms, but with the additional simultaneous function of segmenting the genome. Here, we review the common methodological framework that underlies these methods, review variants of and improvements upon this basic framework, catalogue existing large-scale reference annotations, and discuss the outlook for future work.
Background and motivation
High-throughput sequencing technology has enabled numerous techniques for genome-scale measurement of chemical and physical properties of chromatin and associated molecules in individual cell types. Using sequencing assays, the Encyclopedia of DNA Elements (ENCODE) Project, the Roadmap Epigenomics Project, and myriad individual researchers have generated thousands of such datasets. These datasets quantify various facets of gene regulation such as genome-wide transcription-factor binding, histone modifications, open chromatin, and RNA transcription. Each dataset measures a particular activity at billions of positions, and the collection of datasets does so in hundreds of samples across a variety of species and tissues. Transforming these quantifications of diverse properties into a holistic understanding of each part of the genome requires effective means for summarization. Segmentation and genome annotation (SAGA) algorithms (Box 1) have emerged as the predominant way to summarize activity at each position of the genome, distilling complex data into an interpretable précis of genomic activity.

SAGA algorithms take as input a collection of genomic datasets, such as ChIP-seq measurements of histone modifications or of transcription factor binding (Figure 1).

January 5, 2021 1/24

Box 1: Terminology

SAGA.
We define a segmentation and genome annotation (SAGA) algorithm as a procedure that:

1. assigns to each position of a whole genome a label (“genome annotation”),
2. from a set of multiple (≥3) classes,
3. by
   (a) integrating multiple independent observations at each position, and
   (b) modeling dependence between adjacent positions (“segmentation”).

Previously, researchers have used several other terms to describe this task, including “segmentation” [1], “chromatin state annotation” [2], and “semi-automated genome annotation” [3].
Assay.
An experiment that produces a measurement at each genomic position, such as chromatin immunoprecipitation-sequencing (ChIP-seq) or assay for transposase-accessible chromatin-sequencing (ATAC-seq).
Label.
One of a finite set of classes assigned to each genomic segment that shares similar activity. Other terms include “state” or “chromatin state”.
Sample.
A population of cells on which one can perform an assay, such as a primary tissue sample or a cell line. Other terms include “cell type”, “epigenome”, or “biosample”.
Fig 1. Overview of segmentation and genome annotation (SAGA).
First, preprocessing transforms genomic assay sequencing reads into signal datasets. Second, with signal datasets as input, a SAGA algorithm partitions the genome and assigns an integer label to each segment, yielding an annotation. Third, a researcher interprets the labels, assigning a biological interpretation to each.

The SAGA task is to use the input datasets to partition the genome into segments and assign a label to each segment. SAGA algorithms perform this task in a way that leads to positions with the same label having similar patterns in the input data.
Year  Name or description                                 References
      HMMSeg                                              [1]
2010  Chromatin colors                                    [4]
2010  Chromatin states model                              [5]
2012  ChromHMM                                            [2, 6–8]
2012  Segway                                              [6, 9–11]
2013  TreeHMM                                             [12]
2015  Spectacle                                           [13]
2015  hiHMM                                               [14]
2015  Ensembl Regulatory Build (with Segway, ChromHMM)    [15]
2015  EpiCSeg                                             [16]
2015  Segway+GBR                                          [3, 17]
2016  IDEAS                                               [18, 19]
2017  GenoSTAN                                            [20]
2017  diHMM                                               [21]
2018  iSeg                                                [22]
2018  StatePaintR                                         [23]
2019  RT States                                           [24]
2019  ConsHMM                                             [25]
2020  modHMM                                              [26]
2020  SPIN                                                [27]
2020  SegRNA                                              [28]
Table 1. Timeline of selected SAGA methods.
The first SAGA methods were developed in the 2000s, but their usage has increased recently, thanks to the wide availability of genomic datasets (Table 1). Large-scale genomic profiling projects such as ENCODE [29] and Roadmap Epigenomics [7] produced SAGA annotations as a primary output. Researchers have developed a large variety of SAGA strategies with the goal of improving upon the basic SAGA framework.

In this review, we summarize the main strategies used by most SAGA methods. Then, we discuss differences between methods, the challenges they face, and the outlook for future work.
Input data
Experimental assays used for input data
SAGA methods typically use as input a number of different experimental datasets, each describing some local property of the genome [30]. Such properties might include chromatin accessibility or presence of some DNA-binding protein. Although input data initially came from microarray methods such as tiling arrays [31], they now usually come from sequence census assays [32].

A standard collection of input datasets might measure histone modifications or DNA-binding proteins (using assays like ChIP-seq [33] or cleavage under targets and release using nuclease (CUT&RUN) [34]), and chromatin accessibility (using assays like deoxyribonuclease-sequencing (DNase-seq) [35, 36] or ATAC-seq [37]). Supplying a SAGA algorithm with datasets that measure chromatin activity yields an output annotation that captures the regulatory state of chromatin. Creating these chromatin activity annotations has served as the predominant use of SAGA methods thus far.

Less frequently, researchers have gone beyond measurements of chromatin and DNA-binding proteins and have used SAGA methods for other kinds of data. The output annotation summarizes the input datasets, so the choice of input greatly influences the annotation’s content and its subsequent interpretation. SAGA methods can work for any sort of dense linear signal along the genome. Individual studies have applied them to DNA replication timing data [3, 24, 27], interspecies comparative genomics data [25], and RNA-seq data [28]. Other studies have even found ways to incorporate non-linear chromatin 3D genome organization data into the SAGA framework [3, 27].
Signal representation of genomic assays
Most genomic assay data so far has come from bulk samples of cells. These data depict a noisy mixture of sampling an assayed property from the many cells within the population. These cells may represent subpopulations of slightly different types, or cells within different cell cycle stages. Thus, each subpopulation might have different characteristics in the assayed properties. In the mixture of cell subpopulations, only frequently sampled properties will rise above background noise. By comparison, less frequently sampled properties, seen in a minority of cells, may remain indistinguishable from background noise.

Often, the property examined by an epigenomic assay is exhibited or not exhibited by some position of a single chromosome in a single cell, with no gradations between the extremes. For example, at some nucleotide of one chromosome in a single cell, an interrogated histone modification is either present or it is not. A single diploid cell has two copies of the chromosome. Thus, at that position, each eudiploid cell can have only 0, 1, or 2 instances of the histone modification.

Summing or averaging discrete counts over a population of cells leads to a representation of the assay data called “signal”. Signal appears as a continuous-scale measurement. Signal arises, however, only from the aggregation of position-specific properties, which in each cell may have only a small number of potential ordinal values at the moment of observation.

Unlike epigenomic assays, transcriptomic assays can measure any number of transcript copies of one position per cell. Despite similar data representations, one must avoid the temptation to interpret epigenomic signal intensity as one might interpret transcriptomic signal intensity. For a transcriptomic assay, greater signal intensity might reflect a greater “level” of some transcriptional property within each cell. For an epigenomic assay, greater signal intensity indicates primarily that a higher number of cells within a sample have the property of interest.

In both the epigenomic and transcriptomic cases, it remains difficult or impossible to untangle the contribution to higher signal intensity that arises from the frequency of molecular activity within each cell of a subpopulation from that arising from the composition of subpopulations within a whole bulk population. Improvements in single-cell assays, however, may enable SAGA algorithms on data from single cells in the near future (see “Outlook for future work”).
Preprocessing of input data
SAGA methods generally use a signal representation of the input data. This signal representation originates from raw experimental data, such as sequencing reads, by way of a preprocessing procedure. For simplicity, we describe the steps of preprocessing as if a human analyst conducted them all individually, although some SAGA software packages might perform some steps without manual intervention.

Required preprocessing for all SAGA methods:

1. The analyst transforms the experimental data into raw numeric signal data.
   • For sequencing data, the analyst:
     1. aligns each sequencing read to the reference genome,
     2. may choose to extend each read to an estimated length of the DNA fragment it begins, and
     3. computes the number of reads per base or extended reads per base for each genomic position [9, 10].
   • For microarray data, the analyst:
     1. acquires microarray signal intensity for the experimental sample and for a control sample, and
     2. computes the ratio of experimental intensity to control intensity.

2. The analyst chooses units to represent the strength of activity at each position and may perform further transformation of the raw numeric signal data into these units.
   • For sequencing data, the analyst commonly uses one of:
     – read count (no transformation),
     – fold enrichment of observed data relative to a control [6], or
     – −log Poisson p-values indicating the likelihood of statistically significant peaks relative to control [19].
     The latter two units can mitigate experimental artifacts because they compare to a control experiment such as a ChIP input control.
   • For microarray data, the analyst commonly performs log transformation of the intensity ratios [4, 38, 39].

Optional preprocessing, or preprocessing required only for specific SAGA methods:

3. The analyst may normalize data to harmonize signal across cell types [40]. Normalization proves especially important when annotating multiple cell types (see “Annotating multiple cell types”).

4. To prevent large outlier signal values from dominating the results, the analyst may transform signals using one of three variance-stabilizing transformations of each signal value x:
   • asinh x [9],
   • log(x + pseudocount) [19], or
   • an empirical variance-stabilizing transformation [41].

5. The analyst may downsample 1-bp resolution signal into bins (see “Spatial resolution”). This involves computing one of:
   • average read count,
   • reads per million mapped reads fold enrichment [42],
   • total count of reads [16, 43, 44], or
   • maximum count of reads of each bin [6, 18].
   Binning greatly decreases the computational cost of the SAGA algorithm and can improve the data’s statistical properties.

6. The analyst may binarize numeric signal data into presence/absence values [2, 12, 21, 45, 46]. Binarizing signal simplifies analysis by avoiding issues related to the choice of units, but eliminates all but 1 bit of information about signal intensity per bin.
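To make the variance stabilization and binning steps concrete, here is a minimal Python sketch. The toy read counts, bin size, and pseudocount are illustrative only, not drawn from any particular pipeline.

```python
import math

def asinh_transform(signal):
    """Variance-stabilizing asinh transform of each signal value (step 4)."""
    return [math.asinh(x) for x in signal]

def log_transform(signal, pseudocount=1.0):
    """log(x + pseudocount) transform (step 4); the pseudocount avoids log(0)."""
    return [math.log(x + pseudocount) for x in signal]

def bin_signal(signal, bin_size, how="mean"):
    """Downsample 1-bp signal into fixed-size bins (step 5)
    by average, total, or maximum read count."""
    aggregate = {"mean": lambda c: sum(c) / len(c), "sum": sum, "max": max}[how]
    return [aggregate(signal[i:i + bin_size])
            for i in range(0, len(signal), bin_size)]

raw = [0, 0, 4, 4, 100, 2, 2, 0]        # toy per-base read counts
binned = bin_signal(raw, bin_size=4)    # [2.0, 26.0]
stabilized = asinh_transform(binned)    # compresses the outlier-heavy bin
```

Note how the asinh transform shrinks the bin dominated by the outlier count of 100 far more than the low-signal bin, which is the point of step 4.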
Missing data

Genomic assays almost always cannot produce signal for every region of the whole genome. Regions where an assay cannot provide reliable information about the interrogated property constitute “missing data” for that assay. Missing data in sequencing assays may arise due to unmappable sequences, which occur when repetitive reads do not uniquely map to a region [47, 48]. Missing data in microarray assays comes from regions covered by no microarray probes. There are three main ways to treat regions of missing data: (1) treating missing data as 0-valued data, (2) decreasing the model resolution, averaging over available data so that the missing data has limited impact, or (3) statistically marginalizing over the missing data [9, 49].

When analyzing coordinated assays across multiple cell types, researchers may have to contend with having no data on some properties within a subset of cell types. This represents another kind of missing data: one with an entire dataset missing rather than only data at specific positions. Researchers can impute [21] entire missing datasets through tools such as ChromImpute [50], PREDICTD [51], or Avocado [52]. Alternatively, they can use a SAGA model with built-in capability for handling the missing datasets [12].
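As a sketch of options (1) and (3), assume a toy signal vector that marks missing positions with None. Marginalization in a probabilistic model reduces to letting a missing observation contribute a likelihood factor of 1:

```python
def zero_fill(signal):
    """Option (1): treat missing data (None) as 0-valued data."""
    return [0.0 if x is None else x for x in signal]

def emission_factor(x, emission_prob):
    """Option (3): marginalize over missing data in a probabilistic model.
    A missing observation contributes a factor of 1 to the likelihood,
    equivalent to summing the emission probability over all values x could take."""
    return 1.0 if x is None else emission_prob(x)

signal = [1.2, None, 0.7]
filled = zero_fill(signal)                      # [1.2, 0.0, 0.7]
factor = emission_factor(None, lambda x: 0.25)  # 1.0
```

Zero-filling biases the model toward low-signal labels at unmappable regions, whereas marginalization lets the neighboring positions determine the label there.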
Hidden Markov model (HMM) formulation
Many SAGA methods rely on an HMM, a probabilistic model of the relationships between sequences of observed events and the unobservable hidden states which generate the observed events. The structure of HMMs, and similar models such as dynamic Bayesian networks (DBNs) [53], naturally reflects the SAGA task of clustering observed data generated by processes that act on sequences of genomic positions.
Simple HMM example
As an illustration of a simple HMM, consider a dog, Rover, and his owner, Thomas. Thomas is 5 years old and too short to see out of the windows in his home. Rover can leave the house through his dog door and loves taking walks, playing indoors, and napping. Every morning, he will either wait by the door for Thomas, play with his squeaky toys, or sleep in. Whichever action he takes depends on the weather he sees outdoors. For example, on rainy days Rover will more likely nap or play with his toys indoors.

Thomas must infer the state of the weather outside, hidden to him, based on the behavior he observes from Rover. Thomas knows the general weather patterns outside near his home—for example, that rainy weather likely continues across multiple days.

This scenario fits well into an HMM framework. It has a sequence of observations (Rover’s behavior) generated by hidden, non-independent unobservables (the weather outside). One would like to infer the sequence of hidden unobservables based on the sequence of observations.
Mathematical formulation
Formally, we can define an HMM over time t ∈ {1, . . . , T} as follows [54, 55]. Let the sequence of observed events X = {X_t}_{t=1}^T consist of each observed event X_t at every time t. Let the sequence of hidden states Q = {Q_t}_{t=1}^T consist of each hidden state Q_t at every time t. Each Q_t takes on a value q_t from a set of m possible hidden state values (Figure 2a).

Fig 2. Two representations of an HMM. (a) Conditional dependence diagram representation of an unrolled HMM with sequence of hidden states {Q_t}_{t=1}^T and sequence of observations {X_t}_{t=1}^T. In this representation, each node represents a hidden discrete (white rectangle) or observed continuous (grey circle) random variable. For every index t, each hidden random variable Q_t takes on some value q_t; similarly, each observed variable X_t takes on some value x_t. X_t may represent either scalar or vector observations. Solid arcs represent conditional dependence relationships between random variables. (b) State transition diagram representation of Rover and Thomas’s weather example. In this representation, each node represents a potential value of the hidden variable Q_t. The hidden variable takes on values r (rainy) or ¬r (not rainy) on any given day t. Solid arcs represent transitions between hidden states, which have transition probabilities A.

Under the Markov assumption, the probability of realizing state value q_{t+1} at the next time step t + 1 depends only on the current state value q_t:

P(Q_{t+1} = q_{t+1} | Q_t = q_t, Q_{t−1} = q_{t−1}, . . . , Q_1 = q_1) = P(Q_{t+1} = q_{t+1} | Q_t = q_t).

We define the transition probability A(q_{t+1} | q_t) = P(Q_{t+1} = q_{t+1} | Q_t = q_t), which reflects the frequency of moving from state q_t to state q_{t+1}.

We define the emission probability B(x_t | q_t) = P(X_t = x_t | Q_t = q_t) as the probability that the observable X_t is x_t if the present hidden state Q_t = q_t. Specifically, we assume that B(x_t | q_t) depends only on Q_t = q_t, such that

P(X_t = x_t | Q_t = q_t, Q_{t−1} = q_{t−1}, . . . , Q_1 = q_1) = P(X_t = x_t | Q_t = q_t).

Finally, we define the hidden state probability at the first time step as π(q_1) = P(Q_1 = q_1).
We can fully define an HMM M = (A, B, π) by specifying all of A, B, and π.

In the case of Rover and Thomas, we have m = 2 possible hidden states (rainy, not-rainy) and 3 possible observations (Rover is napping, playing indoors, or waiting by the door). To Thomas, the hidden variable Q_t captures the weather outside, while the observed variable X_t captures Rover’s behavior. The probability of the state of the weather outside changing on a day-to-day basis is defined by the transition probabilities A (Figure 2b). The probability of Rover’s behavior, given the weather, is defined by the emission probabilities B.
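The Rover model can be written out explicitly as M = (A, B, π). The specific probability values below are invented for illustration, since the example gives no numbers:

```python
states = ["rainy", "not_rainy"]

# Transition probabilities A(q_{t+1} | q_t): rainy weather tends to persist.
A = {"rainy":     {"rainy": 0.7, "not_rainy": 0.3},
     "not_rainy": {"rainy": 0.2, "not_rainy": 0.8}}

# Emission probabilities B(x_t | q_t): Rover naps or plays indoors when it rains.
B = {"rainy":     {"nap": 0.5, "play": 0.4, "wait_by_door": 0.1},
     "not_rainy": {"nap": 0.1, "play": 0.2, "wait_by_door": 0.7}}

# Initial hidden state probabilities pi(q_1).
pi = {"rainy": 0.4, "not_rainy": 0.6}

def joint_probability(hidden, observed):
    """P(Q = hidden, X = observed | M)
    = pi(q_1) B(x_1 | q_1) * prod_t A(q_t | q_{t-1}) B(x_t | q_t)."""
    p = pi[hidden[0]] * B[hidden[0]][observed[0]]
    for t in range(1, len(hidden)):
        p *= A[hidden[t - 1]][hidden[t]] * B[hidden[t]][observed[t]]
    return p

p = joint_probability(["rainy", "rainy"], ["nap", "play"])
# 0.4 * 0.5 * 0.7 * 0.4 = 0.056
```

Each row of A and B sums to 1, as required for a valid probability distribution over next states and observations.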
Algorithms for inference, decoding, and training

Inference
The main task one uses HMMs for is to quantify how well some predicted sequence of hidden states fits the observed data. Other common tasks, like decoding or training, serve as variations of, or build on, this inference task.

In HMM inference, we can compute the likelihood of any sequence of hidden states Q. We use the sequence of observed events X and the model probabilities M to compute the likelihood function P(X | Q, M). The likelihood function is the probability that our predicted sequence of hidden states produced our observed sequence of events. We often compute likelihoods using the forward-backward algorithm [56, 57].
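Relatedly, the forward half of the forward-backward algorithm suffices to compute the total likelihood P(X | M), summing the joint probability over every hidden path. A sketch over the Rover model (probability values invented), checked against brute-force enumeration:

```python
from itertools import product

states = ["rainy", "not_rainy"]
A = {"rainy": {"rainy": 0.7, "not_rainy": 0.3},
     "not_rainy": {"rainy": 0.2, "not_rainy": 0.8}}
B = {"rainy": {"nap": 0.5, "play": 0.4, "wait_by_door": 0.1},
     "not_rainy": {"nap": 0.1, "play": 0.2, "wait_by_door": 0.7}}
pi = {"rainy": 0.4, "not_rainy": 0.6}

def forward_likelihood(X):
    """P(X | M) via the forward recursion:
    alpha_t(q) = B(x_t | q) * sum_q' alpha_{t-1}(q') A(q | q')."""
    alpha = {q: pi[q] * B[q][X[0]] for q in states}
    for x in X[1:]:
        alpha = {q: B[q][x] * sum(alpha[p] * A[p][q] for p in states)
                 for q in states}
    return sum(alpha.values())

def brute_force_likelihood(X):
    """The same quantity by explicitly summing over every hidden path;
    exponential in len(X), so useful only as a correctness check."""
    total = 0.0
    for path in product(states, repeat=len(X)):
        p = pi[path[0]] * B[path[0]][X[0]]
        for t in range(1, len(X)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][X[t]]
        total += p
    return total
```

The forward recursion computes in O(T m²) time what naive enumeration computes in O(m^T) time, which is what makes inference over whole genomes practical.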
Viterbi decoding

Given a sequence of observed events X, we often wish to know the maximum likelihood sequence of corresponding hidden states Q. For example, if Thomas observes that in the past 3 mornings, Rover slept, played, and then slept again, what weather sequence outside is most likely?

To answer this question, we decode the optimal sequence of hidden states q* such that q* = arg max_Q P(Q | X, M). The Viterbi algorithm [58] provides an efficient solution for this problem, making it unnecessary to compare the likelihood for every possible sequence of hidden states.
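A compact Viterbi sketch over the Rover model (probability values invented), answering Thomas’s three-morning question:

```python
states = ["rainy", "not_rainy"]
A = {"rainy": {"rainy": 0.7, "not_rainy": 0.3},
     "not_rainy": {"rainy": 0.2, "not_rainy": 0.8}}
B = {"rainy": {"nap": 0.5, "play": 0.4, "wait_by_door": 0.1},
     "not_rainy": {"nap": 0.1, "play": 0.2, "wait_by_door": 0.7}}
pi = {"rainy": 0.4, "not_rainy": 0.6}

def viterbi(X):
    """Most likely hidden-state sequence q* = arg max_Q P(Q, X | M)."""
    # delta[q]: probability of the best path ending in state q at time t
    delta = {q: pi[q] * B[q][X[0]] for q in states}
    backpointers = []
    for x in X[1:]:
        new_delta, ptr = {}, {}
        for q in states:
            best_prev = max(states, key=lambda p: delta[p] * A[p][q])
            ptr[q] = best_prev
            new_delta[q] = delta[best_prev] * A[best_prev][q] * B[q][x]
        backpointers.append(ptr)
        delta = new_delta
    # Trace the best path backward from the most likely final state
    path = [max(states, key=delta.get)]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]
```

With these invented probabilities, viterbi(["nap", "play", "nap"]) decodes three rainy days: napping and indoor play are both much likelier when it rains, and rain tends to persist.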
Training

Usually, we do not know the model parameters (A, B, π) and must learn them from data. We define training as the process of learning these parameters, and training data as the sequence of observations upon which we learn. An efficient algorithm that finds the globally optimal parameter values for some training data does not exist. Instead, researchers commonly train HMMs using expectation-maximization (EM) [59] algorithms such as the Baum-Welch algorithm [60], which find a local optimum. Other reviews [54] describe inference and training methods in more detail.
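For concreteness, here is a compact, illustrative Baum-Welch loop for a discrete-emission HMM over the Rover model (starting probabilities invented). The E step runs forward-backward; the M step re-estimates (A, B, π) from expected counts. Each iteration cannot decrease P(X | M), but only a local optimum is guaranteed:

```python
states = ["rainy", "not_rainy"]
symbols = ["nap", "play", "wait_by_door"]
A = {"rainy": {"rainy": 0.7, "not_rainy": 0.3},
     "not_rainy": {"rainy": 0.2, "not_rainy": 0.8}}
B = {"rainy": {"nap": 0.5, "play": 0.4, "wait_by_door": 0.1},
     "not_rainy": {"nap": 0.1, "play": 0.2, "wait_by_door": 0.7}}
pi = {"rainy": 0.4, "not_rainy": 0.6}

def forward(X, A, B, pi):
    """alpha[t][q] = P(x_1..x_t, Q_t = q | M)."""
    alpha = [{q: pi[q] * B[q][X[0]] for q in states}]
    for x in X[1:]:
        alpha.append({q: B[q][x] * sum(alpha[-1][p] * A[p][q] for p in states)
                      for q in states})
    return alpha

def backward(X, A, B):
    """beta[t][q] = P(x_{t+1}..x_T | Q_t = q, M)."""
    beta = [{q: 1.0 for q in states}]
    for x in reversed(X[1:]):
        beta.insert(0, {q: sum(A[q][p] * B[p][x] * beta[0][p] for p in states)
                        for q in states})
    return beta

def baum_welch_step(X, A, B, pi):
    """One EM iteration: E step (forward-backward), M step (expected counts)."""
    T = len(X)
    alpha, beta = forward(X, A, B, pi), backward(X, A, B)
    Z = sum(alpha[-1][q] for q in states)          # P(X | M)
    gamma = [{q: alpha[t][q] * beta[t][q] / Z for q in states} for t in range(T)]
    xi = [{(p, q): alpha[t][p] * A[p][q] * B[q][X[t + 1]] * beta[t + 1][q] / Z
           for p in states for q in states} for t in range(T - 1)]
    new_pi = dict(gamma[0])
    new_A = {p: {q: sum(x[p, q] for x in xi) / sum(g[p] for g in gamma[:-1])
                 for q in states} for p in states}
    new_B = {q: {o: sum(g[q] for t, g in enumerate(gamma) if X[t] == o) /
                    sum(g[q] for g in gamma) for o in symbols} for q in states}
    return new_A, new_B, new_pi

X = ["nap", "nap", "wait_by_door", "wait_by_door", "nap"]
before = sum(forward(X, A, B, pi)[-1][q] for q in states)
for _ in range(5):
    A, B, pi = baum_welch_step(X, A, B, pi)
after = sum(forward(X, A, B, pi)[-1][q] for q in states)  # likelihood improved
```

Production SAGA tools add numerical scaling (working in log space) and handle long, multivariate sequences; this sketch omits both for brevity.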
HMMs for SAGA

We can readily apply the HMM formalization to genomic data for use in SAGA methods. Instead of time, we define the dynamic axis t in terms of physical position along a chromosome. Each position t refers to a single base pair or, in the case of lower-resolution models, a fixed-size region (see “Spatial resolution”). The observation at each genomic position usually represents genomic signal (see “Input data”). Each position’s hidden state represents its label (see “Understanding labels”). As a result, decoding the most probable sequence of hidden states reveals the most probable sequence of labels across the genome. We call this resulting sequence of labels an annotation.

Many SAGA methods use an HMM structure [2, 9, 12, 16, 21, 24, 39, 42], or generalizations thereof. For example, DBNs are generalizations of HMMs that can model connections between variables over adjacent time steps. Methods such as Segway [9] use a DBN model in their approach to the SAGA problem. This can make it easier to extend the model to tasks such as semi-supervised, instead of unsupervised, annotation [61].
Understanding labels

SAGA methods are unsupervised. The labels they produce usually begin as integer designations without any essential meaning. Ideally, each label corresponds to a particular category of genomic element. To make this correspondence explicit, we must assign a biological interpretation, such as “Enhancer” or “Transcribed gene”, to each label.

Usually, one assigns labels to biological interpretations in a post-processing step. In post-processing, a researcher compares each label to known biological phenomena and assigns an interpretation that matches the researcher’s understanding of molecular biology. For example, a label characterized by the histone modification H3K36me3 (associated with transcription) and enriched in annotated gene bodies might have the interpretation “Transcribed”. A label characterized by H3K27ac and H3K4me1, both histone modifications canonically associated with active enhancers, might have the interpretation “Enhancer” [29].

The interpretation process provides an opportunity to discover new categories of genomic elements. For example, one SAGA study found that their model consistently produces a label corresponding to transcription termination sites. Previously, none had described a distinctive epigenetic signature for transcription termination [6].

Manual interpretation proves time-consuming for human analysts. Applying SAGA to multiple cell types independently exacerbates this problem (see “Annotating multiple cell types”).

Two existing methods automate the label interpretation process: expert rules and machine learning. In both cases, an interpretation program considers the information that a researcher would use for interpretation. This includes examining the relationship between labels and individual input data properties. It also includes reviewing colocalization of labels with features in previously created annotations. These annotations may have come from SAGA approaches or other manual or automated methods.

In the expert rule approach, an analyst designs rules about what properties a given label must have to receive a particular interpretation. The analyst then applies these rules to assign interpretations to labels from all models [15].

In the machine learning approach, one trains a classifier on previous manual annotations. The classifier then learns a model that assigns interpretations to labels given their properties [11]. One analysis [11] found that automatic interpretation agreed with manual interpretation for 77% of labels, compared to 19% expected by chance.
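A toy sketch of the expert-rule approach follows. The thresholds, assay names, and rule ordering are hypothetical, not taken from [15]; they echo the canonical associations mentioned above (H3K36me3 with transcription, H3K27ac plus H3K4me1 with enhancers):

```python
def interpret_label(mean_signal):
    """Assign a biological interpretation to one label given its mean signal
    per assay (dict of assay name -> enrichment). All thresholds hypothetical."""
    def high(assay):
        return mean_signal.get(assay, 0.0) > 1.0

    if high("H3K36me3"):
        return "Transcribed"
    if high("H3K27ac") and high("H3K4me1"):
        return "Enhancer"
    if high("H3K4me3"):
        return "Promoter"
    return "Quiescent"

# Mean signal per assay for each integer label, as a model might report it
labels = {
    0: {"H3K36me3": 2.3},
    1: {"H3K27ac": 1.9, "H3K4me1": 1.4},
    2: {},
}
interpretations = {k: interpret_label(v) for k, v in labels.items()}
# {0: 'Transcribed', 1: 'Enhancer', 2: 'Quiescent'}
```

Because the rules are fixed, the same script can be run over labels from many independently trained models, which is exactly what makes the approach attractive for large multi-sample projects.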
Spatial resolution
Baroque music often employs a musical architecture known as “ternary form”. Specifically, pieces of this structure follow a general “ABA” pattern, whereupon the second “A” section recapitulates the first with some variation. Each section contains multiple musical “sentences”, which may repeat or vary. Just like linguistic sentences, each musical sentence contains clusters of notes, or motifs, between “breaths” in the musical articulation. Finer examination of the motifs shows they contain a few notes and chords each. Finer examination of the notes themselves shows they behave just like isolated phonemes in speech, with little meaning on their own.

The genome resembles a musical composition in that one observes different behaviors at different scales. The scale of genomic behavior one wishes to observe influences the choice of SAGA method and the parameters chosen for the method. To observe nucleosome-scale behavior such as genes, promoters, and enhancers, one desires ∼10^2 bp segments. To describe behavior on the scale of topological domains [62], one desires segments of length 10^5 bp to 10^6 bp [1, 3, 17].

The most important parameter influencing segment length is the underlying resolution of the SAGA method. As noted above (see “Input data”), most SAGA methods downsample data into bins. To observe nucleosome-scale segment lengths (∼10^2 bp), one should use 100 bp to 200 bp resolution [2, 9, 18]. To observe domain-scale segment lengths (∼10^5 bp), one should use ∼10^4 bp resolution [3, 4, 27]. Segway [9] and RoboCOP [63] provide some of the few SAGA methods optimized for single-base resolution inference, and can identify behavior on a 1 bp scale. While most existing SAGA methods handle data at just one genomic scale, there exist methods capable of learning from data at multiple genomic scales [21].

Limitations of the experimental data itself influence the choice of SAGA model resolution. Spatial imprecision in ChIP-seq data gives it an inherent resolution of about 10^2 bp. More precise assays such as ChIP-exo [64] and ChIP-nexus [65] can approach 1 bp resolution. Conversely, assays like DNA adenine methyltransferase identification (DamID) and Repli-seq have a coarser resolution of ≥100 bp.

The desired scale may also influence the choice of input data. When aiming to annotate at the domain scale, one should include data with activity at this scale, such as replication time data and Hi-C data [3, 4, 24, 27]. The inclusion of long-range contact information from Hi-C data poses a challenge because standard algorithms for HMMs cannot be used for a probabilistic model that includes long-range dependencies. Therefore, one must instead use alternative approaches such as graph-based regularization [3] or approximate inference [27].

SAGA methods model segment length through their transition parameters. HMM models assume a geometric distribution in determination of a segment’s length [66]. Related DBN methods can include constraints to tune segment length further. Constraints include the enforcement of a minimum or maximum segment length [9]. Enforcement of a minimum segment length ensures that one does not obtain segments shorter than the effective resolution of the underlying data or biological phenomena. Probabilistic models often additionally use a prior distribution on the transition parameters during training to encourage them to produce shorter or longer segment lengths.
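Because HMM transitions are memoryless, a label’s run lengths follow a geometric distribution: with self-transition probability p, the mean segment length is 1/(1 − p) bins. A small illustrative calculation (the probability values are invented):

```python
def expected_segment_length(self_transition_prob, bin_size=1):
    """Mean segment length (in bp) implied by self-transition probability p.
    Under an HMM, run lengths are geometric, with mean 1 / (1 - p) bins."""
    return bin_size / (1.0 - self_transition_prob)

expected_segment_length(0.5)                 # 2.0: two 1-bp bins on average
expected_segment_length(0.99, bin_size=200)  # roughly 20,000 bp segments
```

This is why resolution and transition parameters jointly set the scale of an annotation: the same self-transition probability yields 100-fold longer segments at 100-fold coarser bins.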
Choosing the number of labels
Most SAGA methods require the user to define the number of labels. Using more labels increases the granularity of the resulting annotation at the cost of added complexity. Typically, the number of labels ranges from 5–20, with more recent work favoring 10–15 labels.

One might think to choose the number of labels automatically with a statistical approach. The Akaike information criterion (AIC), Bayesian information criterion (BIC), and factorized information criterion (FIC) [67] measure the statistical support for a particular number of labels. Instead of a fixed number of labels, one may give the model flexibility to choose the number of labels during training and include a hyperparameter that encourages it to choose a higher or lower number [14]. Or one might define labels according to local minima in an optimization based on a network model of assays [46]. One could even exhaustively assign a separate label to every observed presence/absence pattern in binary data [43].

In practice, however, researchers rarely use these statistical approaches for determining the number of labels. Optimizing an information criterion does not necessarily yield the most interpretable annotation. Interpretability reigns supreme in most SAGA applications. End users find annotations most useful when they have about 5–20 labels, for two reasons. First, most can only articulate that many known distinctions between classes of genomic elements. Second, even if one could find meaningful distinctions between a large number of labels, few using the resulting annotations could keep fine distinctions between such a large number of patterns in their working memory [68]. Even if a statistical approach supported the use of 50 labels, the complexity of such an annotation would make it impractical for most users.
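For concreteness, a sketch of scoring candidate numbers of labels with BIC. The parameter count is a simplified guess for a Gaussian HMM, and the fitted log-likelihoods are hypothetical numbers chosen for illustration:

```python
import math

def bic(log_likelihood, n_params, n_observations):
    """Bayesian information criterion; lower values indicate better support."""
    return n_params * math.log(n_observations) - 2.0 * log_likelihood

def n_hmm_params(n_labels, n_tracks):
    """Simplified free-parameter count for a Gaussian HMM with one mean and
    one variance per label and track; exact counts vary by model."""
    return (n_labels * (n_labels - 1)    # transition rows (each sums to 1)
            + (n_labels - 1)             # initial label distribution
            + 2 * n_labels * n_tracks)   # means and variances

# Hypothetical fitted log-likelihoods for three candidate numbers of labels
candidates = {10: -5.02e6, 15: -5.00e6, 25: -4.99e6}
n_bins = 15_000_000    # e.g. a 3 Gbp genome at 200-bp resolution
scores = {k: bic(ll, n_hmm_params(k, n_tracks=6), n_bins)
          for k, ll in candidates.items()}
best = min(scores, key=scores.get)
```

With these hypothetical numbers, BIC prefers the largest model: the genome supplies so many observations that the complexity penalty barely matters. This illustrates the point above, that statistical support and interpretability can disagree.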
Annotating multiple cell types
There now exist epigenomics datasets describing hundreds of biological samples (Figure 3a). Researchers have correspondingly adapted SAGA methods to work with many samples simultaneously.

We use the term “sample” to refer to some population of cells on which one can perform an epigenomic assay. A sample could correspond to a primary tissue sample, a cell line, cells affected by some perturbation such as drug or disease, or even cells from different species.

The simplest approach for annotating multiple samples involves simply training a separate model on each sample [11] (Figure 3b). The large number of models produced by this approach necessitates using an automated label interpretation process (see “Understanding labels”).

Two categories of approaches aim to share information across samples. The first, “horizontal sharing” or “concatenated” approaches, share information between samples to inform the label-training process. The second, “vertical sharing” or “stacked” approaches, share position-specific information to inform the label assignment of each position.
Fig 3. Annotating multiple cell types. (a) Datasets generated by the ENCODE and Roadmap Epigenomics consortia as of 2019. The black cells represent the datasets actually generated out of a larger number of potential combinations of cell type and assay type. (b) Annotating 6 datasets from 3 different samples: 3 from cell type A, 2 from cell type B, and 1 from cell type C. Colored letters over signal data indicate data associated with those samples. One can use three different SAGA strategies with this collection of datasets. Independent: performing training and inference completely independently on each sample. This yields a different annotation for each sample. Concatenated (horizontal sharing): training a single model across all cell types. This yields one annotation per sample with a shared label set. Each sample must have the same datasets, necessitating imputation of any missing datasets. Stacked (vertical sharing): performing training and inference on datasets from all samples. This yields a single pan-cell-type annotation.

Horizontal sharing: emphasizing similarities across samples for learning labels
The simplest way to remove the need for interpreting multiple models is to apply a single model across many samples. To do this, one can treat each sample as a separate copy of the genome, appended after the first one in a “concatenated” approach (Figure 3b). One performs concatenated training and inference little differently than if the data from different samples pertained to different chromosomes in the same genome. Because all samples share a single concatenated model, researchers need only perform post-processing interpretation once.

The concatenated approach has wide usage [6, 7, 69], but has two downsides. First, concatenated SAGA requires that every sample has data from the same assays. In practice, this criterion often does not hold true. This means that, unless these assays are imputed or treated as missing (see “Vertical sharing: emphasizing similarities across samples in positional information”), one must exclude the data from an assay that is missing in even a single sample. In a simple concatenated approach, one cannot annotate a sample which lacks even one dataset present in the others.

Second, data from different samples can have artifactual differences or batch effects. Applying the same model across multiple cell types assumes that all datasets from the same assay type have similar statistical properties. When this assumption fails, label distributions can vary wildly across samples and biologically implausible sample-specific labels can emerge. Data normalization can help abate the problem of different statistical properties between samples, but usually does so incompletely. This problem is particularly significant when using continuous signal. In contrast, binarizing the data (see “Input data”) can mask some experimental biases.

One might expect that concatenated annotation would benefit training by increasing the amount of training data.
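To make the concatenated layout concrete, here is a minimal sketch in plain Python (toy data and function names invented for illustration, not any specific tool's interface): each sample's binned signal matrix is appended along the genome axis, so a single model treats the samples as extra chromosomes.

```python
# Hedged sketch of the "concatenated" data layout for multi-sample SAGA.
# Each sample's data is a (positions x assays) matrix; all samples must
# share the same assay set for concatenation to make sense.

def concatenate_samples(samples):
    """samples: dict mapping sample name -> list of rows (one row per
    genomic bin, one column per assay). Returns (rows, index), where
    index records which sample and bin each row came from, so per-sample
    annotations can be recovered after inference."""
    assay_counts = {len(rows[0]) for rows in samples.values()}
    if len(assay_counts) != 1:
        # the key limitation of the concatenated approach:
        # every sample needs data from the same assays
        raise ValueError("all samples must have the same set of assays")
    rows, index = [], []
    for name, sample_rows in samples.items():
        rows.extend(sample_rows)
        index.extend((name, i) for i in range(len(sample_rows)))
    return rows, index

# three samples, two assays each, over a toy 3-bin genome
data = {
    "A": [[1.0, 0.2], [0.8, 0.1], [0.0, 0.9]],
    "B": [[0.9, 0.3], [0.7, 0.0], [0.1, 1.1]],
    "C": [[0.2, 0.8], [0.1, 0.7], [1.2, 0.0]],
}
rows, index = concatenate_samples(data)
assert len(rows) == 9        # 3 samples x 3 bins: one long pseudo-genome
assert index[3] == ("B", 0)  # row 3 maps back to sample B, bin 0
```

A single model fit to `rows` then yields one shared label set, and `index` splits the resulting label sequence back into one annotation per sample.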
As it turns out, multiplying the amount of training data does not significantly aid the training process, as the types of labels vary little across samples. Most complex eukaryotic organisms studied with SAGA have very large genomes, and just one sample provides plenty of training data. In fact, for computational efficiency, researchers often train on just a fraction of the available samples [7], a fraction of the genome from a given sample [9], or both.
Vertical sharing: emphasizing similarities across samples in positional information
Another class of multi-sample SAGA methods shares position-specific information across samples as part of the annotation process. These methods take advantage of the non-independence of biological activity across samples at a genomic position. For example, if a given position has an active enhancer label in many samples, it is more likely to act as an active enhancer in a new sample.

The simplest type of vertical sharing approach learns a model on data from all samples jointly (Figure 3b). One can implement this “stacked” approach by adding datasets from all samples vertically into a single combined model. A stacked model captures patterns of activity across multiple cell types. For example, a stacked model, unlike an independent model, can find a label for an enhancer active in cell type A and cell type B but inactive in cell type C.

Although conceptually simple, the stacked approach tends not to work well for more than several cell types. Stacking fails with larger numbers of cell types because each pattern of activity requires its own label. Therefore, the number of labels must grow exponentially with the number of samples. A simple stacked model that treats all assays as independent also ignores the relationship between assays on the same cell type, or the same assay type on different cell types.

A second approach uses a concatenated model that additionally learns a position-specific preference over the labels for each position. Through this preference, data from one sample can influence inference on another. Two implementations have applied variants of this hybrid horizontal-vertical sharing approach. First, TreeHMM [12] uses a cellular lineage tree as part of its input. For each genomic position, TreeHMM models statistical dependency between the label of a parent cell type and that of a child cell type. Second, IDEAS [18] uses a similar approach to TreeHMM, but dynamically identifies groups of related samples rather than using a fixed developmental structure.
The IDEAS model allows these sample groups to vary across positions, which allows for different relationships between samples in different genomic regions.

A third approach for vertical sharing uses a pairwise prior to transfer position-specific information between cell types [3, 17]. In other words, if position i and position j received the same label in many other samples, then they should be more likely to receive the same label in an additional sample. In contrast to the other methods in this section, the pairwise prior approach does not require the use of concatenated annotation. Therefore, the pairwise approach has the advantage of not requiring the same available data in all cell types.

A fourth approach imputes missing datasets in the target cell type, then applies any of the above annotation methods to the imputed data [50]. Imputation provides three advantages. First, it ensures that all target cell types have the same set of datasets. Second, one can conduct imputation entirely as a preprocessing step, allowing the use of any SAGA method afterward. Third, the imputation process can normalize some artifactual differences between datasets, making concatenated annotation more appropriate.

Vertical sharing approaches have the downside that one cannot understand the annotation of each sample in isolation. This arises because data from other samples influence the label assignments in a given sample. In particular, vertical sharing tends to mask differences between samples. For example, if some position has an enhancer label in many samples, vertical sharing approaches will annotate that position as an enhancer in a target cell type too, even with no enhancer-related data in the target cell type.

Using and visualizing SAGA annotations
Table 2. Existing large-scale human reference annotations.

Name                     | Method                | Labels | Samples | Datasets per sample | Reference
Roadmap (5 mark)         | ChromHMM concatenated | 15     | 127     | 5                   | [7]
Roadmap (6 mark)         | ChromHMM concatenated | 18     | 98      | 6                   | [7]
Roadmap (imputed)        | Impute+ChromHMM       | 25     | 127     | 5–12                | [7]
IDEAS                    | IDEAS                 | 20     | 127     | 5                   | [19]
Segway Encyclopedia      | Segway independent    | 14–33  | 164     | 3–126               | [11]
Segway domains           | Segway+GBR            | 5      | 8       | 12                  | [3]
Ensembl Regulatory Build | ChromHMM independent  | 25     | 89      | 6–10                | [15]
A number of resources can aid in the application of SAGA algorithms and annotations. Reference annotations exist for many cell types (Table 2). These obviate the need for a user of the annotation to actually run a SAGA method. Alternatively, if the user must run a SAGA algorithm on their own data, standardized protocols describe how to perform this process [8, 70].

Most users of SAGA annotations view them through one of three visualization strategies. The first, and most common, strategy displays individual annotations as
Fig 4. Visualizations of SAGA annotations. (a)
Genome browser display showing 164 cell type annotations for a 20 kbp region on human chromosome 15 (GRCh37/hg19) [71]. Each annotation has 9 labels: Promoter (red), Enhancer (orange), Transcribed (green), Permissive regulatory (yellow), Bivalent (purple), Facultative heterochromatin (light blue), Constitutive heterochromatin (black), Quiescent (grey), and Low Confidence (light grey). (b)
Importance score (conservation-associated activity score (CAAS)) for the same region. Total height at each position indicates the position’s estimated importance. Height of a given color band denotes the contribution towards importance of the associated label. (c)
Genome-wide visualization of the SAGA annotation for 164 samples aggregated over GENCODE [72] protein-coding gene components. Rows: the 9 labels of the annotation. Columns: gene components, including 10 kbp flanking regions upstream and downstream. Each cell shows the enrichment of the row’s label at a position along the column’s component. Figures derived from [11].

individual rows or “tracks” on a genome browser (Figure 4a). In each row, the browser displays the segments of that annotation for a region of one chromosome, usually indicating the label by color. Popular genome browsers for displaying segmentations include the University of California, Santa Cruz (UCSC) Genome Browser [73], the Washington University in St. Louis (WashU) Epigenome Browser [74], and Ensembl [75].

A second visualization strategy integrates annotations of all samples (Figure 4b). This visualization stacks all labels for a given position on top of one another and scales the vertical axis by an estimate of functional importance of that position. Two methods can estimate this importance: Epilogos (https://epilogos.altius.org/), which emphasizes rare activity, and the CAAS, which measures activity that is correlated with evolutionary conservation [11].

A third visualization strategy aggregates information about each label across the entire genome. This shows the enrichment of each label at positions of known significance, such as gene components (Figure 4c) or curated enhancers. Tools such as Segtools [76] and deepTools [77] can create these visualizations.

SAGA annotations can provide valuable reference datasets for other analyses and tools. The assignment of one and only one label from a small set to every mappable part of the genome greatly eases downstream analyses.
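The log2(observed/expected) overlap enrichment plotted in aggregation displays like Figure 4c reduces to a short calculation. A toy sketch in plain Python (made-up bin counts, not Segtools' or deepTools' actual implementation): under independence, the expected overlap between a label and a gene component is the product of their coverages divided by the genome size.

```python
import math

def overlap_enrichment(label_bins, component_bins, genome_size):
    """log2(observed/expected) overlap between a label and a component.
    Bins are modeled as sets of genomic bin indices, a toy stand-in
    for genomic intervals."""
    observed = len(label_bins & component_bins)
    expected = len(label_bins) * len(component_bins) / genome_size
    return math.log2(observed / expected)

genome = 100                    # a 100-bin toy genome
label = set(range(0, 10))       # label covers bins 0-9
component = set(range(5, 15))   # gene component covers bins 5-14
# observed overlap = 5 bins; expected = 10 * 10 / 100 = 1 bin
assert overlap_enrichment(label, component, genome) == math.log2(5.0)
```

Real tools work with base-pair intervals rather than bin sets, but the observed-versus-expected logic is the same.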
SAGA annotations summarize genomic activity in a much simpler way than the individual input datasets, and even than processed versions of the input datasets such as peak calls. Most SAGA annotations are in the tab-delimited browser extensible data (BED) format (https://genome.ucsc.edu/FAQ/FAQformat.html). This makes it easy to remix SAGA annotations with other datasets using powerful software such as BEDTools [78]. SAGA annotations form building blocks for methods for integrative analysis of genomic data such as CADD [79].
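For example, summarizing the genomic coverage of each label in a BED-formatted annotation takes only a few lines. A minimal sketch (the records below are invented for illustration): the first four BED columns give chromosome, start, end, and label, with zero-based half-open coordinates.

```python
# Hedged sketch: per-label coverage from a BED-formatted SAGA annotation.
from collections import defaultdict

def label_coverage(bed_lines):
    """Total bases covered by each label in a BED-formatted annotation."""
    coverage = defaultdict(int)
    for line in bed_lines:
        chrom, start, end, label = line.rstrip("\n").split("\t")[:4]
        coverage[label] += int(end) - int(start)  # BED is half-open
    return dict(coverage)

bed = [
    "chr15\t0\t200\tEnhancer",
    "chr15\t200\t1000\tQuiescent",
    "chr15\t1000\t1200\tEnhancer",
]
assert label_coverage(bed) == {"Enhancer": 400, "Quiescent": 800}
```

Because every position carries exactly one label, these coverages partition the annotated genome, which is what makes such downstream summaries so simple.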
Conclusions and outlook for future work
SAGA methods provide a powerful and flexible tool for analyzing genomic datasets. These methods promise to continue to play an important role as researchers generate more datasets. Despite the large existing literature, future work could still address many challenges.
Alternate scales and data types.
Nucleosome-scale annotations (100 bp to 1000 bp segments) of histone modifications have wide usage. While annotations of different data types or at different length scales exist, they are less widely used. Currently, there exist reference domain annotations with segments of length 10^4 bp to 10^7 bp for only a small number of samples [3, 4, 42, 80], and few or no annotations at other scales.

Data preprocessing.

Genome annotations would improve with better processing and normalization of input datasets. Representations such as fold enrichment used by existing methods seem primitive compared to more rigorous quantification schemes used in RNA-seq analysis such as transcripts per million (TPM). One could also improve SAGA preprocessing by more frequently incorporating information from multi-mapping reads [81].
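For reference, the TPM scheme mentioned above normalizes counts by feature length before scaling to a fixed total, so values are comparable across features and libraries; a minimal sketch with invented counts:

```python
# Transcripts per million (TPM): divide each count by feature length,
# then rescale the length-normalized rates to sum to one million.
def tpm(counts, lengths_kb):
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

values = tpm([100, 200], [1.0, 4.0])  # gene 2 has more reads but is 4x longer
assert abs(sum(values) - 1e6) < 1e-3  # TPM values always total one million
assert values[0] > values[1]          # length normalization reverses the ranking
```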
Confidence estimates.
Most methods do not report any measure of confidence in their predictions. Two types of confidence would prove useful. First, one would often like to know the level of confidence that a position in some sample has label X and not label Y. Second, in many cases one would like to have confidence in a differential labeling between two samples: that cell type A and cell type B have different labels. Two methods work towards a solution for the second problem [82, 83], but there remains much room for further work.
Continuous representations.
Existing SAGA methods output a discrete annotation, assigning a single label to each position. In this discrete approach, annotations cannot easily represent varying strength in activity of genomic elements or elements that simultaneously exhibit multiple types of activity. A continuous annotation approach analogous to the topic models used for text document classification might address this limitation [84].
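The difference can be illustrated with toy numbers (a hypothetical weight vector over labels at one position, not any method's actual output): discretization keeps only the strongest label, while a continuous representation preserves mixed activity.

```python
# One position with mixed activity: mostly enhancer-like, partly transcribed.
posterior = {"Promoter": 0.05, "Enhancer": 0.55, "Transcribed": 0.40}

# A discrete annotation keeps only the argmax label...
discrete = max(posterior, key=posterior.get)
assert discrete == "Enhancer"  # the substantial Transcribed signal is lost

# ...while a continuous annotation keeps the full weight vector.
continuous = posterior
assert abs(sum(continuous.values()) - 1.0) < 1e-9
```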
Single cell data.
Existing SAGA methods use data from bulk samples of cells. Increasing availability of data from single-cell assays necessitates the development of methods that can leverage this additional information.
Pan-cell-type annotation.
The semantics of genome annotations correspond poorly to the way most molecular biologists conceptualize genomic elements. Most existing annotations are cell-type-specific: the annotation states that a given locus acts as an active enhancer in cell type A. In contrast, researchers often state that a given locus “is an enhancer”.

Other annotations, such as those of protein-coding genes, serve as a pan-cell-type characterization. Each gene has a fixed location, and only its expression varies across samples.

There exists a need for pan-cell-type epigenome annotations. Such an annotation would define fixed intervals for regulatory elements such as promoters, enhancers, and insulators, and it would specify in which samples each element is active. Specifically targeting this task in the SAGA model could improve results over pan-cell-type annotations assembled from multiple cell-type-specific SAGA models [11].
Author contributions
Conceptualization, M.W.L. and M.M.H.; Funding acquisition, M.M.H.; Investigation, M.W.L., R.C.W.C., and M.M.H.; Visualization, M.W.L. and R.C.W.C.; Writing — original draft, M.W.L., R.C.W.C., and M.M.H.; Writing — review & editing, M.W.L. and M.M.H.
Acknowledgments
This work was supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2015-03948 to M.M.H.).
Competing interests
The authors declare no competing interests.
Acronyms
AIC Akaike information criterion
ATAC-seq assay for transposase-accessible chromatin-sequencing
BED browser extensible data
BIC Bayes information criterion
CAAS conservation-associated activity score
ChIP-seq chromatin immunoprecipitation-sequencing
CUT&RUN cleavage under targets and release using nuclease
DamID DNA adenine methyltransferase identification
DNase-seq deoxyribonuclease-sequencing
DBN dynamic Bayesian network
EM expectation-maximization
ENCODE Encyclopedia of DNA Elements
FIC factorized information criterion
HMM hidden Markov model
SAGA segmentation and genome annotation
TPM transcripts per million
UCSC University of California, Santa Cruz
WashU Washington University in St. Louis
References
1. Day N, Hemmaplardh A, Thurman RE, Stamatoyannopoulos JA, Noble WS. Unsupervised segmentation of continuous genomic data. Bioinformatics. 2007;23(11):1424–1426.
2. Ernst J, Kellis M. ChromHMM: automating chromatin-state discovery and characterization. Nature Methods. 2012;9(3):215–216.
3. Libbrecht MW, Ay F, Hoffman MM, Gilbert DM, Bilmes JA, Noble WS. Joint annotation of chromatin state and chromatin conformation reveals relationships among domain types and identifies domains of cell-type-specific expression. Genome Research. 2015;25(4):544–557.
4. Filion GJ, van Bemmel JG, Braunschweig U, Talhout W, Kind J, Ward LD, et al. Systematic protein location mapping reveals five principal chromatin types in Drosophila cells. Cell. 2010;143(2):212–224.
5. Ernst J, Kellis M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature Biotechnology. 2010;28(8):817–825.
6. Hoffman MM, Ernst J, Wilder SP, Kundaje A, Harris RS, Libbrecht M, et al. Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Research. 2012;41(2):827–841.
7. Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518(7539):317–330.
8. Ernst J, Kellis M. Chromatin-state discovery and genome annotation with ChromHMM. Nature Protocols. 2017;12(12):2478–2492.
9. Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Noble WS. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature Methods. 2012;9(5):473–476.
10. Chan RC, Libbrecht MW, Roberts EG, Bilmes JA, Noble WS, Hoffman MM. Segway 2.0: Gaussian mixture models and minibatch training. Bioinformatics. 2018;34(4):669–671.
11. Libbrecht MW, Rodriguez OL, Weng Z, Bilmes JA, Hoffman MM, Noble WS. A unified encyclopedia of human functional DNA elements through fully automated annotation of 164 human cell types. Genome Biology. 2019;20(1):180.
12. Biesinger J, Wang Y, Xie X. Discovering and mapping chromatin states using a tree hidden Markov model. BMC Bioinformatics. 2013;14(Suppl 5):S4.
13. Song J, Chen KC. Spectacle: fast chromatin state annotation using spectral learning. Genome Biology. 2015;16(1):33.
14. Sohn KA, Ho JW, Djordjevic D, Jeong Hh, Park PJ, Kim JH. hiHMM: Bayesian non-parametric joint inference of chromatin state maps. Bioinformatics. 2015;31(13):2066–2074.
15. Zerbino DR, Wilder SP, Johnson N, Juettemann T, Flicek PR. The Ensembl regulatory build. Genome Biology. 2015;16(1):56.
16. Mammana A, Chung HR. Chromatin segmentation based on a probabilistic model for read counts explains a large portion of the epigenome. Genome Biology. 2015;16(1):151.
17. Libbrecht MW, Hoffman MM, Bilmes JA, Noble WS. Entropic graph-based posterior regularization. In: Proceedings of the International Conference on Machine Learning; 2015. p. 1992–2000.
18. Zhang Y, An L, Yue F, Hardison RC. Jointly characterizing epigenetic dynamics across multiple human cell types. Nucleic Acids Research. 2016;44(14):6721–6731.
19. Zhang Y, Hardison RC. Accurate and reproducible functional maps in 127 human cell types via 2D genome segmentation. Nucleic Acids Research. 2017;45(17):9823–9836.
20. Zacher B, Michel M, Schwalb B, Cramer P, Tresch A, Gagneur J. Accurate promoter and enhancer identification in 127 ENCODE and Roadmap Epigenomics cell types and tissues by GenoSTAN. PLoS One. 2017;12(1):e0169249.
21. Marco E, Meuleman W, Huang J, Glass K, Pinello L, Wang J, et al. Multi-scale chromatin state annotation using a hierarchical hidden Markov model. Nature Communications. 2017;8:15011.
22. Girimurugan SB, Liu Y, Lung PY, Vera DL, Dennis JH, Bass HW, et al. iSeg: an efficient algorithm for segmentation of genomic and epigenomic data. BMC Bioinformatics. 2018;19(1):131.
23. Coetzee SG, Ramjan Z, Dinh HQ, Berman BP, Hazelett DJ. StateHub-StatePaintR: rapid and reproducible chromatin state evaluation for custom genome annotation. F1000Research. 2020;7(214):214.
24. Poulet A, Li B, Dubos T, Rivera-Mulia JC, Gilbert DM, Qin ZS. RT States: systematic annotation of the human genome using cell type-specific replication timing programs. Bioinformatics. 2019;35(13):2167–2176.
25. Arneson A, Ernst J. Systematic discovery of conservation states for single-nucleotide annotation of the human genome. Communications Biology. 2019;2(1):248.
26. Benner P, Vingron M. ModHMM: A modular supra-Bayesian genome segmentation method. Journal of Computational Biology. 2020;27(4):442–457.
27. Wang Y, Zhang Y, Zhang R, van Schaik T, Zhang L, Sasaki T, et al. SPIN reveals genome-wide landscape of nuclear compartmentalization. bioRxiv. 2020;2020.03.09.982967.
28. Mendez M, FANTOM Consortium Main Contributors, Scott MS, Hoffman MM. Unsupervised analysis of multi-experiment transcriptomic patterns with SegRNA identifies unannotated transcripts. bioRxiv. 2020;2020.07.28.225193.
29. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74.
30. Zitnik M, Nguyen F, Wang B, Leskovec J, Goldenberg A, Hoffman MM. Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities. Information Fusion. 2019;50:71–91.
31. ENCODE Project Consortium, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447(7146):799–816.
32. Wold B, Myers RM. Sequence census methods for functional genomics. Nature Methods. 2007;5(1):19–21.
33. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, et al. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129(4):823–837.
34. Skene PJ, Henikoff S. An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites. eLife. 2017;6:e21856.
35. Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, et al. High-resolution mapping and characterization of open chromatin across the genome. Cell. 2008;132(2):311–322.
36. Hesselberth JR, Chen X, Zhang Z, Sabo PJ, Sandstrom R, Reynolds AP, et al. Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nature Methods. 2009;6(4):283–289.
37. Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods. 2013;10(12):1213–1218.
38. Schuettengruber B, Ganapathi M, Leblanc B, Portoso M, Jaschek R, Tolhuis B, et al. Functional anatomy of polycomb and trithorax chromatin landscapes in Drosophila embryos. PLoS Biology. 2009;7(1):e1000013.
39. Kharchenko PV, Alekseyenko AA, Schwartz YB, Minoda A, Riddle NC, Ernst J, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature. 2011;471(7339):480–485.
40. Xiang G, Keller CA, Giardine B, An L, Li Q, Zhang Y, et al. S3norm: simultaneous normalization of sequencing depth and signal-to-noise ratio in epigenomic data. Nucleic Acids Research. 2020;48(8):e43.
41. Bayat F, Libbrecht M. Variance-stabilized units for sequencing-based genomic signals. bioRxiv. 2020;2020.01.31.929174.
42. Larson JL, Huttenhower C, Quackenbush J, Yuan GC. A tiered hidden Markov model characterizes multi-scale chromatin states. Genomics. 2013;102(1):1–7.
43. Taudt A, Nguyen MA, Heinig M, Johannes F, Colome-Tatche M. chromstaR: Tracking combinatorial chromatin state dynamics in space and time. bioRxiv. 2016;038612.
44. Zehnder T, Benner P, Vingron M. Predicting enhancers in mammalian genomes using supervised hidden Markov models. BMC Bioinformatics. 2019;20(1):157.
45. Hamada M, Ono Y, Fujimaki R, Asai K. Learning chromatin states with factorized information criteria. Bioinformatics. 2015;31(15):2426–2433.
46. Zhou J, Troyanskaya OG. Probabilistic modelling of chromatin code landscape reveals functional diversity of enhancer-like chromatin states. Nature Communications. 2016;7:10528.
47. Derrien T, Estellé J, Sola SM, Knowles DG, Raineri E, Guigó R, et al. Fast computation and applications of genome mappability. PLoS One. 2012;7(1):e30377.
48. Karimzadeh M, Ernst C, Kundaje A, Hoffman MM. Umap and Bismap: quantifying genome and methylome mappability. Nucleic Acids Research. 2018;46(20):e120.
49. Lian H, Thompson WA, Thurman R, Stamatoyannopoulos JA, Noble WS, Lawrence CE. Automated mapping of large-scale chromatin structure in ENCODE. Bioinformatics. 2008;24(17):1911–1916.
50. Ernst J, Kellis M. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues. Nature Biotechnology. 2015;33(4):364–376.
51. Durham TJ, Libbrecht MW, Howbert JJ, Bilmes J, Noble WS. PREDICTD parallel epigenomics data imputation with cloud-based tensor decomposition. Nature Communications. 2018;9(1):1402.
52. Schreiber J, Durham T, Bilmes J, Noble WS. Avocado: a multi-scale deep tensor factorization method learns a latent representation of the human epigenome. Genome Biology. 2020;21(1):81.
53. Dean T, Kanazawa K. A model for reasoning about persistence and causation. Computational Intelligence. 1989;5(2):142–150.
54. Bilmes JA. What HMMs can do. IEICE Transactions on Information and Systems. 2006;89(3):869–891.
55. Yoon BJ. Hidden Markov models and their applications in biological sequence analysis. Current Genomics. 2009;10(6):402–415.
56. Ferguson JD. Variable duration models for speech. Proceedings of Symposium on the Application of Hidden Markov Models to Text and Speech. 1980; p. 143–179.
57. Levinson SE. Continuously variable duration hidden Markov models for automatic speech recognition. Computer Speech & Language. 1986;1(1):29–45.
58. Viterbi A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory. 1967;13(2):260–269.
59. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological). 1977;39(1):1–22.
60. Baum LE, Petrie T, Soules G, Weiss N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics. 1970;41(1):164–171.
61. Chan RC, McNeil M, Roberts EG, Mendez M, Libbrecht MW, Hoffman MM. Semi-supervised segmentation and genome annotation. bioRxiv. 2020;2020.01.30.926923.
62. Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485(7398):376–380.
63. Mitra S, Zhong J, MacAlpine D, Hartemink AJ. RoboCOP: Jointly computing chromatin occupancy profiles for numerous factors from chromatin accessibility data. bioRxiv. 2020;2020.06.03.132001.
64. Rhee HS, Pugh BF. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell. 2011;147(6):1408–1419.
65. He Q, Johnston J, Zeitlinger J. ChIP-nexus enables improved detection of in vivo