Multiple Testing in Nonparametric Hidden Markov Models: An Empirical Bayes Approach
Kweku Abraham [email protected]
Université Paris-Saclay, CNRS, Laboratoire de Mathématiques d'Orsay, 91405 Orsay, France
Ismaël Castillo [email protected]
Sorbonne Université, Laboratoire de Probabilités, Statistique et Modélisation, 4 Place Jussieu, 75005 Paris, France
Elisabeth Gassiat [email protected]
Université Paris-Saclay, CNRS, Laboratoire de Mathématiques d'Orsay, 91405 Orsay, France
Abstract
Given a nonparametric Hidden Markov Model (HMM) with two states, the question of constructing efficient multiple testing procedures is considered, treating one of the states as an unknown null hypothesis. A procedure is introduced, based on nonparametric empirical Bayes ideas, that controls the False Discovery Rate (FDR) at a user-specified level. Guarantees on power are also provided, in the form of a control of the true positive rate. One of the key steps in the construction requires supremum-norm convergence of preliminary estimators of the emission densities of the HMM. We provide the existence of such estimators, with convergence at the optimal minimax rate, for the case of a HMM with J ≥ 2 states.

Keywords: efficient multiple testing, hidden Markov models, false discovery rate, true discovery rate, supremum norm estimation, minimax rate
1. Introduction
We consider the problem of multiple testing in a hidden Markov model (HMM) setting. Given data (X_i : i ≤ N) whose distribution is governed by an unknown categorical variable θ = (θ_i : i ≤ N) drawn from a Markov chain, one seeks to test the null hypotheses H_{0,i} : θ_i = 0 against the alternatives H_{1,i} : θ_i ≠ 0 simultaneously for i = 1, ..., N. In seeking procedures with optimal properties with respect to multiple testing measures of risk, for example with controlled False Discovery Rate (FDR) and maximal 'power' as measured by the True Discovery Rate (TDR), it is natural to consider thresholding based on the probabilities of the θ_i's being zero conditional on the observations X_i, i = 1, ..., N (see Section 2.1). These conditional probabilities are simply posterior probabilities in the Bayesian world, and smoothing probabilities in the latent variables vocabulary. They will (mainly) be called ℓ-values in this work.

A first such procedure is one that rejects all coordinates whose ℓ-value is below a user-specified level t, see e.g. Efron et al. (2001); Efron (2007a). From the Bayesian point of view, this is directly related to the Bayes factor for testing the individual coordinate. The procedure we consider in this paper is still one based on ℓ-value thresholding, but with a data-dependent threshold chosen in such a way that the expected false discovery rate conditional on the data is equal or very close to t (Müller et al., 2004; Sun and Cai, 2009); this typically yields an FDR close to t, and, being less 'conservative' than the previous procedure, enjoys certain optimality properties.

© Kweku Abraham, Ismaël Castillo and Elisabeth Gassiat. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/.
There are other alternatives, such as so-called q-value procedures (Storey, 2003), that are based on controlling 'marginal' versions of the FDR.

Of course, it is rare that the model parameters are known in practice, so that instead of the 'oracle' procedures described above (so called because the ℓ-values depend on the parameters), calculations are based on first estimating these in the chosen modeling: the 'empirical' Bayes method. We consider here a nonparametric HMM setting, with unknown parameters corresponding to emission densities and to characteristics of the underlying Markov chain. The key question addressed in this paper is this: what is the impact of the estimation step on the FDR?
More precisely, one would like to understand whether the discussed thresholding procedures still (asymptotically) maintain multiple testing optimality properties when the parameters are estimated. We also note that, being in a nonparametric setting, the loss function chosen to measure the quality of estimation may have more influence over the plug-in operation compared to the parametric situation. Our main contributions can be summarised as follows.

• Our first main results, Theorems 2 and 3, show theoretically that in the nonparametric HMM setting an empirical Bayesian procedure attains the target FDR level and enjoys TDR optimality. The proofs of these two theorems are partly based on a result in De Castro et al. (2017), which shows how control of plug-in estimators propagates to give control of ℓ-value errors. A key step is to have good supremum-norm estimators, in contrast to the L²-norm estimators previously found in the literature.

• Our second main results, which are both key to obtaining the first and also of independent interest, concern supremum-norm estimation of emission densities in nonparametric HMMs. We provide estimators, and prove in Theorems 4 and 5 that the supremum-norm risk of these estimators achieves the parametric convergence rate N^{−1/2} for discrete observations (where the set of possible values is countable), and the convergence rate (N/log N)^{−s/(2s+1)}, familiar from the classical i.i.d. density estimation setting and also proved to be optimal in the HMM context (see Proposition 6), for Hölder densities with regularity s.

Our key question connects with the frequentist analysis of the behaviour of empirical Bayes procedures, a topic currently under rapid development: we briefly review some connections at the end of Section 1.3. It has previously been considered in an i.i.d.
setting in Sun and Cai (2007), in a graph setting (with a q-value based procedure) in Rebafka et al. (2019), and, most pertinently, in a parametric HMM setting in Sun and Cai (2009), wherein it is argued that first estimating the model parameters leaves the FDR asymptotically unchanged. Parametric modeling of HMMs is known, however, to lead to poor results in many applications, as shown for instance in Yau et al. (2011). We draw attention also to the extensive simulations conducted and discussed in Wang et al. (2019) for real valued observations, and in Su and Wang (2020) for count data. These latter two works demonstrate empirically that the FDR and TDR are badly impacted by parametric modeling in case of misspecification, while nonparametric empirical Bayes methods as considered here closely match the optimal behaviour of oracle ℓ-value procedures.

A further advantage of modelling the HMM densities nonparametrically is that it ensures our results allow for fairly arbitrary distributions under the null hypothesis. In contrast, many common multiple testing procedures – including the original Benjamini–Hochberg procedure – assume that the null distribution is known. One can of course adjust such procedures to use an estimated null hypothesis, but there are so far only a few settings in which it has been proved that this plug-in step has no negative effect on the desired properties of the procedures. We refer to the recent work by Roquain and Verzelen (2020) for more discussion concerning this issue.

Finally, we note that as well as enabling the plug-in results which yield control of the FDR, estimating the emission densities in terms of the supremum norm is useful in its own right. Indeed, practically speaking, results of this type justify that plots of density estimators will be visually close to the original density.
Such estimators can also be helpful for identifying change points, estimating level sets, and constructing confidence bands for uncertainty quantification. Let us now place these results in the broader multiple testing and HMM contexts.
Multiple testing.
The problem of identifying relevant variables among a large number of possible candidates is ubiquitous with high dimensional data: indeed, multiple testing methods are very popular in the analysis of genomic data, in astrostatistics, and in imaging, to name just a few practical applications. Since the seminal work of Benjamini and Hochberg (1995), controlling the FDR has been the goal of much of the extensive literature on the subject.

Early works tended to assume i.i.d. data. Efron (2007b) noted that ignoring dependence and using methods designed for FDR control with independent data could result in either too conservative or too liberal procedures, showing that dependence must be carefully taken into account. A number of works, including those of Benjamini and Yekutieli (2001), Farcomeni (2007), Finner et al. (2007) and Wu (2008), have shown that under certain assumptions on the dependence structure, some multiple testing procedures designed for the independent case (such as the step-up Benjamini–Hochberg procedure) still control the FDR below a given target level. Such procedures, although having guaranteed FDR even under dependence, may suffer from being too conservative.

The control of power in dependent data settings is less developed. Some works in this direction include those of Xie et al. (2011) and of Heller and Rosset (2020) which consider the 'general two group model', wherein the θ_i's are independent and identically distributed, but for each i the distribution of X_i given θ may depend on the whole vector θ and not only on θ_i. In some settings, such as with genetic data, allowing for the θ_i's themselves to be dependent can however be more natural, and the HMM model for X considered here allows for a natural local structure of θ – while still remaining tractable – by modelling it as a Markov chain.

Hidden Markov models.
HMMs have been widely used for applications as varied as speech modelling, computational finance and gene prediction since works of Baum, Petrie and coauthors introduced practical algorithms and proved parametric estimation rates in a discrete data setting (Petrie, 1967; Baum and Petrie, 1966; Baum et al., 1970). Later works, including those of Bickel et al. (1998) and of Douc and Matias (2001), extended these proofs to allow parametric modelling of the emission distributions.

Recently, Gassiat et al. (2016) opened the possibility that consistency holds also when the emission densities are modelled nonparametrically by proving identifiability under mild conditions. Anandkumar et al. (2012) introduced in the parametric case a spectral method which was then generalised in De Castro et al. (2016) and Lehéricy (2018) to indeed give consistency at a usual rate in the nonparametric setting. These nonparametric works however focus on L²-estimation, and do not immediately generalise to give rate-optimal supremum norm estimation: indeed, attempting to apply a typical wavelet method of estimating individual coefficients at a parametric rate and aggregating, one runs into an alignment issue arising from the fact that the emission densities are identifiable only up to a permutation. An insight of the current work is that returning to the spectral method and using a kernel based estimator allows this issue to be bypassed.

Consider a hidden Markov model (HMM), in which the observations X = (X_n)_{n ≤ N} satisfy

    X_n | θ ∼ f_{θ_n},  1 ≤ n ≤ N,    θ = (θ_n)_{n ≤ N} ∼ Markov(π, Q),    (1)

and, conditional on θ, the entries of X are independent. The vector θ of 'hidden states' takes values in {0, 1}^N (we will later also consider the case where θ takes values in {1, ..., J}^N for some J ≥ 2) and Markov(π, Q) denotes a Markov chain of initial distribution π = (π_0, π_1) and 2 × 2 transition matrix Q. The 'emission densities' f_0, f_1 are probability densities with respect to some common dominating measure µ on a measurable space X. For simplicity we will assume that µ is either Lebesgue measure on ℝ or counting measure on Z ⊂ ℝ; our results adapt straightforwardly to the d-dimensional setting, and in principle versions should hold for more general measure spaces (see the discussion in Section 4.4). We use H = {Q, π, f_0, f_1} to denote a generic set of parameters for the HMM. We denote by Π_H the law of (X, θ) in (1), and by extension also the marginal laws of X and θ. We write E_H to denote the expectation operator associated to Π_H.

The goal of multiple testing is to provide a procedure ϕ = ϕ(X) which identifies well for which i we have signal (θ_i ≠ 0). We will measure the performance of ϕ through the false discovery rate (FDR) and the true discovery rate (TDR). Defining the false discovery proportion (FDP) at θ as

    FDP_θ(ϕ) := Σ_{i=1}^N 1{θ_i = 0, ϕ_i = 1} / (1 ∨ Σ_{i=1}^N ϕ_i),    (2)

the FDR at θ is given by

    FDR_θ(ϕ) := E[FDP_θ(ϕ(X)) | θ].    (3)

We consider the average false discovery rate for θ generated according to the 'prior' law Π_H:

    FDR_H(ϕ) := E_{θ∼Π_H} FDR_θ(ϕ) ≡ E_{(X,θ)∼Π_H} FDP_θ(ϕ),    (4)

and we define the 'posterior FDR' as the FDR obtained by drawing θ from its posterior:

    postFDR_H(ϕ) = postFDR_H(ϕ; X) := E_H[FDP_θ(ϕ) | X].    (5)

The true discovery rate is defined as the expected proportion of signals which are detected by a procedure:

    TDR_H(ϕ) := E_H[ Σ_{i=1}^N 1{θ_i = 1, ϕ_i = 1} / (1 ∨ Σ_{i=1}^N θ_i) ].    (6)

Bayesian formulation and latent variable formulation.
Let P denote the "true" distribution of the data X arising from model (1). If, in (1), the distribution of θ is interpreted as a "prior" distribution (it is of course an "oracle prior", as π, Q are components of the unknown "true" parameter H = (π, Q, f_0, f_1)), the distribution of X = (X_n)_{n ≤ N} in the (oracle) Bayesian setting is simply the true distribution P. Of course, one may also avoid the Bayesian vocabulary and simply view model (1) as a latent variable model: under such a point of view, ℓ-values are known as smoothing probabilities and θ | X is simply a conditional distribution. We find it convenient to nevertheless use Bayesian terminology. Partially this is in accordance with classical decision theory, wherein Bayesian terminology is commonly used for describing optimal classifiers (indeed, as Storey (2003) observed, "classical classification theory seems to be a bridge between Bayesian modeling and hypothesis testing"). It is also helpful preparation for considering a setting where θ is fixed and non-random, as discussed next.

Connection with frequentist analysis of Bayesian procedures.
Recent years have seen notable progress on providing frequentist validations of the use of posterior distributions for inference, with most results concerning the estimation task, and more recently also uncertainty quantification and confidence sets (Ghosal and van der Vaart, 2017). One can consider using the HMM model (1) not because one believes θ is genuinely random with a Markov structure, but rather as a way to model some block structure of a fixed true θ, wherein neighbouring coordinates of X have a higher chance of coming from the same distribution. The first results in this spirit in a multiple testing setting were obtained recently for sparse sequences (without block structure) in Castillo and Roquain (2020). We plan to investigate in further work the Bayesian procedure studied in this paper for structured sequences of fixed θ where the HMM modeling will then be a Bayesian prior.

We also note that the results we obtain below still constitute a (partial) frequentist Bayes validation, in the following sense. Consider a standard Bayesian approach where θ is viewed as parameter and given a HMM prior, but not the other parameters (f_0, f_1, π, Q), which are estimated separately. Then Theorems 2 and 3 below prove that if the true (frequentist) data generating distribution is some nonparametric HMM, then the empirical Bayes procedure derived from the posterior on θ behaves consistently from the multiple testing viewpoint: its FDR is controlled with optimality guarantees on the TDR. It is a less strong frequentist analysis than under an arbitrary fixed θ, but it validates the frequentist use of the procedure assuming that the data comes from some (fairly arbitrary) non-parametric HMM: this still allows one to capture many typical signals with varied latent densities.

In Section 2 we introduce our multiple testing procedure and establish its asymptotic performance in Theorems 2 and 3.
Section 3 is devoted to the estimation of the emission densities, with asymptotic supremum norm control established in Theorems 4 and 5. We also give in Proposition 6 a lower bound for the estimation of Hölder emission densities with regularity s in the HMM context. Finally, Proposition 7 gives examples of how to overcome the 'label switching' issue, present in the HMM setting as for mixture models, in order to know which estimator corresponds to the null state and which to the alternative. This allows us to avoid the assumption, common to many multiple testing methods, that the distribution of the data under the null is known.

In Section 4, we provide a detailed discussion of our assumptions and comparisons of our results with the literature. We also explain how the rates of convergence of our emission density estimators can be understood as minimax rates of convergence in supremum norm. Proofs of the main theorems are given in Section 5. Intermediate results useful for these proofs are given in Appendices A and B. Appendix C gives a proof of a minimax lower bound. For the reader's convenience, the notation introduced throughout the paper is gathered in Appendix D.
2. The Empirical Bayesian Procedure
We analyse an empirical Bayesian approach to the multiple testing problem, based on thresholding by the posterior (smoothing) probabilities, here called the 'ℓ-values' and also known in the literature as the 'local indices of significance' (Efron et al., 2001; Efron, 2007a; Sun and Cai, 2009):

    ℓ_i(X) ≡ ℓ_{i,H}(X) = Π_H(θ_i = 0 | X).    (7)

In the 'oracle' setting (where the parameter H is known), it is well known that the optimal (weighted) classification procedure is an ℓ-value thresholding procedure; that is, it is ϕ_{λ,H} for some λ, where

    ϕ_{λ,H}(X) = (1{ℓ_{i,H}(X) < λ})_{i ≤ N}.    (8)

It has been shown in Sun and Cai (2009) that this class of procedures (possibly with data-driven thresholds) is also optimal in a multiple testing sense, in that a procedure making false discoveries at a pre-specified rate and maximising a suitable notion of the multiple testing power is necessarily an ℓ-value thresholding procedure.

The FDR is the expectation of the posterior FDR, so that using the latter (which is observable) to choose the threshold is a natural approach. When the parameter H is unobserved, we use an estimator Ĥ = (Q̂, π̂, f̂_0, f̂_1) instead (to be constructed later), and so we are led to the procedure ϕ_{λ̂,Ĥ}, where

    λ̂ = λ̂(Ĥ, t) := sup{λ : postFDR_{Ĥ}(ϕ_{λ,Ĥ}) ≤ t}.    (9)

We also note an alternative characterisation of the threshold λ̂. In view of the definitions
The thresh-old ˆ λ can therefore equivalently be expressed, as in Sun and Cai (2009), as ˆ λ = ˆ ℓ ( ˆ K +1) , withˆ ℓ ( i ) denoting the i th order statistic of { ℓ i, ˆ H : 1 ≤ i ≤ N } , where ˆ K is defined by1ˆ K ˆ K X i =1 ˆ ℓ ( i ) ≤ t < K + 1 ˆ K +1 X i =1 ˆ ℓ ( i ) . (11)(By convention the left inequality automatically holds in the case ˆ K = 0, and we defineˆ ℓ ( N +1) := ∞ so that the right inequality automatically holds in the case ˆ K = N .) Note thatˆ K is well defined and unique, by monotonicity of the average of nondecreasing numbers.This monotonicity also makes clear the following dichotomy:postFDR ˆ H ( ϕ λ, ˆ H ) ≤ t ⇐⇒ λ ≤ ˆ λ. (12)If there are no ties, the procedure ϕ ˆ λ, ˆ H necessarily rejects ˆ K of the null hypotheses.In the case of ties, it may reject fewer, and to avoid potential conservativity, we thereforeconsider a slightly adjusted procedure ˆ ϕ . Definition 1.
Define ϕ̂ = ϕ̂(t) to be a procedure rejecting exactly K̂ of the hypotheses with the smallest ℓ̂_i values, choosing arbitrarily in case of ties, where K̂ is defined by (11). We write Ŝ for the rejection set Ŝ = {i ≤ N : ϕ̂_i = 1}, and we note that by construction we have |Ŝ| = K̂ and

    {i : ℓ̂_i(X) < λ̂} ⊆ Ŝ ⊆ {i : ℓ̂_i(X) ≤ λ̂}.

We make the following assumptions on the parameters. The assumptions are not particularly restrictive, and are discussed in detail in Section 4.1.
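To make the rule concrete, the data-driven count K̂ of display (11) and the rejection set of Definition 1 can be sketched in a few lines. This assumes the empirical ℓ-values have already been computed; the function name and the toy values below are ours, for illustration only.

```python
import numpy as np

def reject_set(ell, t):
    """Rejection rule of Definition 1 (a sketch, not the paper's code).

    Sorts the plug-in l-values and returns the largest K whose average of
    the K smallest l-values is at most t (display (11)), together with the
    indices of the rejected hypotheses."""
    ell = np.asarray(ell, dtype=float)
    order = np.argsort(ell, kind="stable")       # ties broken by original index
    avgs = np.cumsum(ell[order]) / np.arange(1, len(ell) + 1)
    below = np.nonzero(avgs <= t)[0]             # running averages are nondecreasing
    K = 0 if below.size == 0 else int(below[-1] + 1)
    return K, order[:K]

# toy example: the procedure rejects exactly the K-hat smallest l-values
K, S = reject_set([0.01, 0.90, 0.04, 0.30, 0.02], t=0.05)
# here K == 3 and S contains indices 0, 4, 2 (posterior FDR ~ 0.023 <= 0.05)
```

Because the running averages of sorted ℓ-values are nondecreasing, taking the last index where the average is still below t implements exactly the dichotomy (12).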
Assumption A.
1. There exists a constant ν > 0 such that max_{j=0,1} E_{X∼f_j}(|X|^ν) < ∞.
2. There exists x* ∈ ℝ ∪ {±∞} such that either f_1(x)/f_0(x) → ∞ as x ↑ x*, or f_1(x)/f_0(x) → ∞ as x ↓ x*, where we take the conventions that 1/0 = ∞ and 0/0 = 0. [If µ is counting measure on Z and x* ∉ {±∞}, the limits are interpreted to mean that f_1(x*) > f_0(x*) = 0.]

¹ We define the order statistics so that repeats are allowed: the order statistics are defined by the fact that {ℓ_i, i ≤ N} = {ℓ_{(j)}, j ≤ N} as a multiset (∀x ∈ ℝ, #{i : ℓ_i = x} = #{i : ℓ_{(i)} = x}) and ℓ_{(1)} ≤ ℓ_{(2)} ≤ ··· ≤ ℓ_{(N)}.

Assumption B.
1. The matrix Q has full rank (i.e. its two rows are distinct), and δ := min_{i,j} Q_{i,j} > 0.
2. The Markov chain is stationary: the initial distribution π = (π_0, π_1) is the invariant distribution for Q.

Throughout we will write

    f_π(x) = π_0 f_0(x) + π_1 f_1(x)    (13)

for the marginal distribution of each X_i, i ≤ N, under Assumption B; note that necessarily min(π_0, π_1) ≥ δ under the assumption. We note the following illustrative examples of pairs of densities with respect to the Lebesgue measure µ = dx which satisfy both parts of Assumption A.

Examples. i. f_j(x) = φ(x − µ_j), where φ is the density of a standard normal random variable and µ_0 ≠ µ_1.

ii. f_0 is the density of any normal random variable, and f_1 is the density of any Cauchy random variable, or any other distribution with polynomial tails.

iii. f_0, f_1 are compactly supported densities, and the support of f_1 is not a subset of the support of f_0.

iv. f_0, f_1 are the densities of Beta random variables, f_j(x) = c_j x^{α_j−1}(1−x)^{β_j−1} 1{x ∈ [0,1]} for a normalising constant c_j, and α_0 > α_1 or β_0 > β_1 (or both).

Our main result shows that for suitably chosen Ĥ = (Q̂, π̂, f̂_0, f̂_1), the procedure ϕ̂ achieves an FDR upper bounded by the level t chosen by the user, at least asymptotically. The existence of estimators with suitable consistency properties is shown in the next section under mild further assumptions. Here ‖·‖ denotes the usual Euclidean norm for vectors (and later also the corresponding operator norm for matrices), ‖·‖_F denotes the Frobenius matrix norm ‖A‖_F = (Σ_{ij} A_{ij}²)^{1/2}, and ‖·‖_∞ denotes the L^∞ (supremum) norm on functions taking values in ℝ.

Theorem 2.
Grant Assumptions A and B. Suppose that for some u > ν⁻¹ and some sequence ε_N such that ε_N (log N)^u → 0, the estimators Q̂, π̂ and f̂_j, j = 0, 1, satisfy

    Π_H( max{‖Q̂ − Q‖_F, ‖π̂ − π‖, ‖f̂_0 − f_0‖_∞, ‖f̂_1 − f_1‖_∞} > ε_N ) → 0,  as N → ∞.    (14)

Then for ϕ̂ the multiple testing procedure of Definition 1 we have FDR_H(ϕ̂) → min(t, π_0).

As alluded to, the construction of ϕ̂ suggests it should have close to optimal power, and the following result shows that this is indeed true under an extra condition on the distribution of (f_1/f_0)(X_1). The extra condition is only used to prove a property of the limiting ℓ-values, so that a version of Theorem 3 may also hold in the discrete setting – see the discussion in Section 4.4. As is common in the literature (again see Section 4.4), the precise notion of power is given by the marginal true discovery rate (mTDR), the average proportion of true signals which a testing procedure discovers:

    mTDR_H(ϕ) = E_H #{i : θ_i = 1, ϕ_i = 1} / E_H #{i : θ_i = 1}.    (15)

The marginal FDR is defined correspondingly:

    mFDR_H(ϕ) = E_H #{i : θ_i = 0, ϕ_i = 1} / E_H #{i : ϕ_i = 1},    (16)

with the convention that 0/0 = 0; the mFDR is close to the FDR for many procedures, including ϕ̂ (as is implied by ideas in the proof of the following result).

Theorem 3.
In the setting of Theorem 2, additionally grant that the distribution function of the random variable (f_1/f_0)(X_1) is continuous and strictly increasing. Then the procedure ϕ̂ of Theorem 2 satisfies the following as N → ∞:

    mTDR_H(ϕ̂) = sup{mTDR_H(ψ) : mFDR_H(ψ) ≤ mFDR_H(ϕ̂)} + o(1)
               = sup{mTDR_H(ψ) : mFDR_H(ψ) ≤ t} + o(1).

The suprema are over all multiple testing procedures ψ satisfying the bound on their mFDR, including oracle procedures allowed knowledge of the parameters H.

The essence of the proof of Theorem 2 is to show that ℓ̂_i ≈ ℓ_i for most i ≤ N (see Lemma 9, in Section 5.1) and that consequently postFDR_H(ϕ̂) is close to postFDR_{Ĥ}(ϕ̂). The latter, thanks to our definition of λ̂, is close to t.

In proving Theorem 3, there is no a priori control of the power analogous to the bound postFDR_{Ĥ}(ϕ̂) ≤ t, hence we cannot simply argue by symmetry. Instead, one shows that λ̂ concentrates around some λ* ∈ (0, 1); since ℓ̂_i ≈ ℓ_i, it follows that mTDR_H(ϕ̂) ≈ mTDR_H(ϕ_{λ*,Ĥ}) ≈ mTDR_H(ϕ_{λ*,H}) and similarly that mFDR_H(ϕ̂) ≈ mFDR_H(ϕ_{λ*,H}) ≈ t. Known optimality results for the class (ϕ_{λ,H} : λ ≥ 0) show that mTDR_H(ϕ_{λ*,H}) is the largest among procedures with mFDR at most mFDR_H(ϕ_{λ*,H}) ≈ t (see Lemma 21), so that the same is approximately true of mTDR_H(ϕ̂).

See Section 5.1 for the proofs.
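Since the ℓ-values of (7) are just HMM smoothing probabilities, they can be computed in O(N) time by the classical scaled forward–backward recursions. The following is a minimal numpy sketch of this step (our own illustration, not the implementation used in the paper), taking plug-in parameters (π̂, Q̂, f̂_0, f̂_1) and returning the empirical ℓ-values:

```python
import numpy as np

def ell_values(X, pi, Q, f0, f1):
    """l-values (smoothing probabilities of the null state) for a two-state
    HMM via scaled forward-backward recursions; a sketch under plug-in
    parameters, with f0, f1 the (estimated) emission densities."""
    B = np.stack([f0(X), f1(X)], axis=1)          # B[n, j] = f_j(X_n)
    N = len(X)
    alpha = np.zeros((N, 2)); beta = np.ones((N, 2)); c = np.zeros(N)
    alpha[0] = pi * B[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for n in range(1, N):                          # normalised forward pass
        alpha[n] = (alpha[n - 1] @ Q) * B[n]
        c[n] = alpha[n].sum(); alpha[n] /= c[n]
    for n in range(N - 2, -1, -1):                 # normalised backward pass
        beta[n] = Q @ (B[n + 1] * beta[n + 1]) / c[n + 1]
    post = alpha * beta
    post /= post.sum(axis=1, keepdims=True)        # P(theta_n = j | X)
    return post[:, 0]                              # l_n = P(theta_n = 0 | X)
```

As a sanity check, when the rows of Q are equal the chain is i.i.d. and each ℓ_n reduces to the marginal probability π_0 f_0(X_n)/f_π(X_n).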
3. Supremum Norm Estimation of Emission Densities
Of course, Theorems 2 and 3 are only useful if one can estimate H at an appropriate rate in the specified norms, and the results of this section ensure that this is indeed possible in a wide range of nonparametric settings. Estimation is possible not only in the two-state setting, and since estimation results are of independent interest we assume in this section that the data X is drawn from model (1) for Q a J × J matrix and π a distribution on {1, ..., J}, with the state vector θ taking values in {1, ..., J}^N, for some known J ≥ 2. In the J-state estimation setting we instead use the following conditions, designed to ensure a spectral estimation method works.

Assumption B'.
The matrix Q is full rank, the J-state Markov chain (θ_n)_{n∈ℕ} is irreducible and aperiodic, and θ_1 follows the invariant distribution. [This is weaker than Assumption B in general, but equivalent in the two-state setting.]

Assumption C.
The density functions f_1, ..., f_J are linearly independent. [In the two-state setting it suffices to assume f_0 ≠ f_1, which is implied by Assumption A.]

Under these assumptions, in a parametric setting a variant of a typical regularity condition suffices to show that estimation is possible at a parametric rate, so that our theorems offer a new proof of the results of Sun and Cai (2009): see Section 4.2. Of greater interest here, though, is that Theorem 2 also allows for a nonparametric setting. As noted already, this is a major improvement for applications – see for instance Yau et al. (2011), Wang et al. (2019) and Su and Wang (2020). Estimating the Markov parameters Q and π consistently up to a permutation at a polynomial rate has already been proved possible (see (De Castro et al., 2017, Appendix C)), and we therefore focus on estimation, in the supremum norm, of the emission densities themselves. Note first of all that in a discrete setting estimation is possible at a parametric rate.

Theorem 4.
Assume that the dominating measure µ is the counting measure on Z. Let M_N be a sequence tending to infinity, arbitrarily slowly. Under Assumptions B' and C, there exist estimators f̂_1, ..., f̂_J and a permutation τ such that

    Π_H( ‖f̂_j − f_{τ(j)}‖_∞ ≥ M_N N^{−1/2} ) → 0.

The proof is a simplification of that of Theorem 5 (to follow) and so is sketched only: see Appendix B.4.

For the remainder of this section we assume that the functions f_1, ..., f_J are densities with respect to the Lebesgue measure on ℝ, µ = dx. We demonstrate that consistent estimation of these densities in the supremum norm is indeed possible at a near-minimax rate in the nonparametric setting, under the following typical smoothness condition.

Assumption D. f_1, ..., f_J belong to C^s(ℝ) for some s > 0, where for C(ℝ) denoting all bounded continuous functions from ℝ to itself (equipped with the usual supremum norm ‖·‖_∞) and writing j = ⌊s⌋ for the integer part of s, C^s(ℝ) denotes the usual space of (locally) Hölder-continuous functions

    C^s(ℝ) = {f : f^{(j)} ∈ C^{s−j}(ℝ)},  s ≥ 1,
    C^s(ℝ) = {f ∈ C(ℝ) : sup_{0<|x−y|≤1} ( |f(x) − f(y)| / |x−y|^s ) < ∞},  s ∈ (0, 1],

equipped with the usual norm

    ‖f‖_{C^s} = ‖f^{(⌊s⌋)}‖_{C^{s−⌊s⌋}} + Σ_{0≤i<⌊s⌋} ‖f^{(i)}‖_∞,  s ≥ 1,
    ‖f‖_{C^s} = ‖f‖_∞ + sup_{0<|x−y|≤1} |f(y) − f(x)| / |y−x|^s,  0 < s < 1.

The results also extend in the usual way to Besov spaces, e.g. using results from (Giné and Nickl, 2016, Chapter 4).
Theorem 5.
Grant Assumptions B', C and D. Suppose L → ∞ as N → ∞, and L^{max(5,(J+3)/2)} r_N → 0, where r_N = (N/log N)^{−s/(1+2s)}. Then there exist estimators f̂_j, 1 ≤ j ≤ J (continuous so that the supremum below is measurable) and a permutation τ such that, for some C > 0,

    Π_H( ‖f̂_j − f_{τ(j)}‖_∞ ≥ C L r_N ) → 0.    (17)

Convergence in expectation also holds: for some C' > 0,

    E_H ‖f̂_j − f_{τ(j)}‖_∞ ≤ C' L r_N.    (18)

The proof is given in Section 5.2. The parameter L has the interpretation of the dimension of a matrix used in the construction of the estimators (see Algorithm 1) and it can be chosen to diverge arbitrarily slowly, so that the upper bound is arbitrarily close to the following lower bound. Such a lower bound is familiar from the i.i.d. setting, but does not automatically apply in the current setting. Indeed, the mixture components in a nonparametric mixture model are not identifiable, so that our assumptions necessarily exclude the i.i.d. subcase of a HMM. The content of the following proposition is that these assumptions do not, however, make estimation easier than having i.i.d. samples from each of the emission densities. We refer to Appendix C for a formal statement and proof.

Proposition 6 (informal statement). The rate r_N = (N/log N)^{−s/(1+2s)} is a lower bound for the minimax supremum-norm estimation rate for the emission densities in a two-state nonparametric HMM.

The algorithm solving Theorem 5 uses a 'spectral' method similar to those of Anandkumar et al. (2012); Lehéricy (2018); De Castro et al. (2017). However, De Castro et al. (2017) and Lehéricy (2018) expand in terms of orthonormal basis functions, and use particular properties of L²-projections which do not straightforwardly adapt to the L^∞ setting. Here, we instead consider spectral kernel density estimation: see Algorithm 1 for a description of the estimating procedure.
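We do not reproduce Algorithm 1 here, but the kernel-smoothing building block it relies on is standard: in the i.i.d. case a kernel estimator with bandwidth of order (N/log N)^{−1/(2s+1)} already attains a supremum-norm error of order r_N. The following self-contained numerical illustration is ours (arbitrary Gaussian kernel and standard normal target; this is not the paper's spectral estimator):

```python
import numpy as np

def kde(grid, X, h):
    """Gaussian-kernel density estimator evaluated on a grid.
    Illustrates only the kernel-smoothing step, not Algorithm 1."""
    u = (grid[:, None] - X[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(X) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
N, s = 20_000, 1.0                                 # Lipschitz smoothness s = 1
X = rng.standard_normal(N)                         # i.i.d. draws from the target
h = (N / np.log(N)) ** (-1 / (2 * s + 1))          # sup-norm-rate bandwidth order
grid = np.linspace(-3.0, 3.0, 201)
truth = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)
sup_err = np.max(np.abs(kde(grid, X, h) - truth))  # small, of order r_N
```

The point of the spectral step in Algorithm 1 is to recover the individual emission densities from such smoothed quantities despite the data being an unlabelled mixture along the chain.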
This approach allows us to directly estimate the values of the density functions at each point x and bypass the need for projection properties.

Finally, note that Theorems 4 and 5 only show that one may estimate the parameters consistently up to a permutation. While this is generally sufficient for estimation purposes, since the labelling of the states is usually of no relevance, any multiple testing procedure targeting FDR control necessarily treats the null and the alternative differently, so it is essential that we can identify which of our estimators corresponds to the null state. We will therefore also require the following condition.

Condition E.
There exist estimators f̂_1, ..., f̂_J in Theorem 5 (or Theorem 4) for which the permutation τ is the identity.
It suffices that there exist {f̂_1, ..., f̂_J, τ} as in Theorem 5 for which the permutation τ can be estimated consistently by some τ̂, since we can then define f̌_j = f̂_{τ̂(j)}. We give two illustrative assumptions, each plausible in the original two-state FDR setting, under which Condition E holds. A version of the following proposition also holds under such assumptions in the discrete setting, using Theorem 4 in place of Theorem 5 in the proof, which is given at the end of Section 5.2.

Proposition 7.
In the setting of Theorem 2, grant Assumptions A, B and D. Then Condition E is verified, and there exist estimators Q̂, π̂, f̂_0, f̂_1 satisfying (14) for any rate ε_N slower than r_N = (N/log N)^{−s/(1+2s)}, under either of the following assumptions:
1. π_0 > π_1.
2. For some known x∗ ∈ R ∪ {+∞}, f_1(x)/f_0(x) → ∞ as x ↑ x∗.
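As a toy illustration of the first condition of Proposition 7: when π_0 > π_1, the estimated permutation τ̂ amounts to declaring the state with the larger estimated stationary probability to be the null. The sketch below (function and variable names are ours, not the paper's) just sorts the estimated states accordingly:

```python
import numpy as np

def align_null_state(pi_hat, f_hats):
    """Under the assumption pi_0 > pi_1, estimate the permutation tau-hat
    by declaring the state with the larger estimated stationary
    probability to be the null (state 0), and relabel the estimated
    emission densities accordingly."""
    tau = np.argsort(-np.asarray(pi_hat))   # states by descending stationary prob.
    return np.asarray(pi_hat)[tau], [f_hats[j] for j in tau]

# toy example: the spectral estimators came out with the labels swapped
pi_hat = np.array([0.3, 0.7])
pi_aligned, f_aligned = align_null_state(pi_hat, ["f_hat_0", "f_hat_1"])
```

Consistency of π̂ then makes this relabelling correct with probability tending to 1, which is exactly the mechanism used in the proof.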
4. Discussion
Generality of the assumptions.
Assumptions A to D and Condition E are not restrictive, so that Theorems 2, 4 and 5 hold in typical nonparametric settings (we discuss the extra assumption of Theorem 3 in Section 4.4).
Assumption A2 is a signal strength assumption, without which the proofs remain valid only for large enough values of t. It is known that weak signals are a case requiring special attention for multiple testing, discussed for example in a different setting in Heller and Rosset (2020).
The full rank assumption on Q in Assumption B' is necessary even for identifiability up to a permutation in the two-state case (with nonparametric emission densities). For J >
Implementing the method.
Our proposed method for estimating the emission densities can be implemented through Algorithm 1. Then, given estimators of the parameters, efficient computation of ℓ-values is easily done using the forward–backward algorithm for HMMs. Indeed, the empirical Bayes multiple testing procedure is implemented in Sun and Cai (2009), Wang et al. (2019) and Su and Wang (2020). [These works use mixture models with unknown number of components to estimate the emission densities, either via fully Bayesian methods or via penalized maximum likelihood (using the EM algorithm).]
Bickel et al. (1998) prove a central limit theorem for the maximum likelihood estimator of the model parameter (which we denote, say, by h) under standard regularity conditions, so that it may be estimated at a parametric rate up to label switching. To these, adding the condition that the parametrisation map h ↦ (f_{1,h}, ..., f_{J,h}) is Lipschitz continuous with respect to the Euclidean norm and the supremum norm (at least on a neighbourhood of the true parameter), we arrive at the following.
Proposition 8.
In a parametric model satisfying mild regularity conditions, Assumptions B' and C are enough to ensure that there exist estimators Q̂, π̂, f̂_1, ..., f̂_J such that for some permutation τ and any M_N → ∞,

max( ‖Q̂ − Q‖_F, ‖π̂ − π‖, ‖f̂_1 − f_{τ(1)}‖_∞, ..., ‖f̂_J − f_{τ(J)}‖_∞ ) < M_N N^{−1/2},

with probability tending to 1.
We note that many common parametric families, including Gaussian models, exponential models and Poisson models, satisfy a suitable regularity condition (this can be seen by using standard formulae for exponential families to calculate the derivative of the parametrisation map and bounding).
Under an assumption akin to those of Proposition 7 to ensure that a version of Condition E holds, we see that Theorems 2 and 3 apply in a parametric setting. Except perhaps for the regularity condition, our assumptions are weaker than those of Sun and Cai (2009) (after adapting Theorem 3 slightly – see Section 4.4), so that we slightly generalise their main results even in the parametric setting.
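To make the implementation discussion above concrete, here is a minimal numerical sketch (function names, the toy Gaussian emissions and all constants are ours, not the paper's) of the two computational steps once plug-in estimates of (Q, π, f_0, f_1) are available: computing ℓ-values by the scaled forward–backward recursions, and then rejecting the smallest ℓ-values while their running mean, i.e. the posterior FDR, stays below t:

```python
import numpy as np

def l_values(X, Q, pi, dens):
    """Smoothing probabilities P(theta_i = 0 | X_1..X_N), computed with the
    scaled forward-backward recursions from plug-in estimates of the
    transition matrix Q, the stationary law pi and the emission densities
    (`dens` is a list of vectorised density functions, one per state)."""
    N, J = len(X), len(pi)
    B = np.column_stack([f(X) for f in dens])       # emission likelihoods
    alpha, beta = np.zeros((N, J)), np.ones((N, J))
    alpha[0] = pi * B[0]
    alpha[0] /= alpha[0].sum()
    for n in range(1, N):                           # forward pass (rescaled each step)
        alpha[n] = (alpha[n - 1] @ Q) * B[n]
        alpha[n] /= alpha[n].sum()
    for n in range(N - 2, -1, -1):                  # backward pass (rescaled each step)
        beta[n] = Q @ (B[n + 1] * beta[n + 1])
        beta[n] /= beta[n].sum()
    post = alpha * beta                             # per-step scalings cancel row-wise
    post /= post.sum(axis=1, keepdims=True)
    return post[:, 0]                               # ell-values

def eb_reject(ell, t):
    """Reject the hypotheses with the smallest ell-values, the number of
    rejections chosen maximal so that the average rejected ell-value
    (the posterior FDR) stays at most t."""
    order = np.argsort(ell)
    running_mean = np.cumsum(ell[order]) / np.arange(1, len(ell) + 1)
    k = int(np.sum(running_mean <= t))  # running mean of sorted values is nondecreasing
    reject = np.zeros(len(ell), dtype=bool)
    reject[order[:k]] = True
    return reject

# toy run: Gaussian emissions N(0,1) (null) and N(3,1) (alternative)
rng = np.random.default_rng(0)
Q = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])                       # stationary law of Q
dens = [lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi),
        lambda x: np.exp(-(x - 3) ** 2 / 2) / np.sqrt(2 * np.pi)]
ell = l_values(rng.normal(size=200), Q, pi, dens)   # here all data is drawn under the null
rej = eb_reject(ell, t=0.05)
```

By construction the average ℓ-value over the rejection set is at most t whenever anything is rejected, mirroring the data-dependent threshold λ̂ of the text.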
The constants of Theorem 5 depend only on quantitative measures (as listed below) of the degree to which Assumptions B', C and D hold, so that a uniform version of (18),

sup_{H∈H} E_H ‖f̂_j − f_{τ(j)}‖_∞ ≤ C′ L² r_N,

holds if the following bounds are satisfied on the set H (and similarly for (17)). The estimators f̂_1, ..., f̂_J do not depend on knowledge of the bound M < ∞, so the result is adaptive in these quantities (though recall that the smoothness s is assumed known – see also the discussion of adaptation in Section 4.4).
• sup_{H∈H} κ(Q) ≤ M, where κ(Q) = ‖Q‖ ‖Q^{−1}‖, the condition number, measures how far Q is from having less than full rank.
• inf_{H∈H} γ_ps ≥ M^{−1}, where γ_ps denotes the pseudo-spectral gap of the matrix Q as defined in Paulin (2015). This bound quantitatively measures how far the chain θ is from being reducible or periodic, and is only used to control the mixing time of the chain θ. It can therefore be replaced by any assumption ensuring a uniform bound on the mixing time; in particular, in the two-state case of Section 2, the chain θ is necessarily reversible and it suffices to assume a uniform lower bound on the absolute spectral gap γ∗, defined by γ∗ = 1 − sup{|λ| : λ an eigenvalue of Q, λ ≠ 1} when the eigenvalue 1 of Q has multiplicity 1.
• inf_{H∈H} min_j π_j > M^{−1}. This too measures how far the chain is from being reducible.
• sup_{H∈H} max_j ‖f_j‖_{C^s} ≤ M.
• sup_{H∈H} max(L, 1/C) ≤ M, where (C, L) are the constants, depending on H, from Lemma 23 in Appendix B. Denoting by σ_J(A) the J-th largest singular value of a matrix A, these constants control σ_J(O^L), where O^L = (E[h_l(X_1) | θ_1 = j])_{l≤L, j≤J} for some suitably chosen functions h_l, l ≤ L. The lemma shows that h_1, ..., h_L can be chosen in a universal way such that max(L, 1/C) < ∞ whenever f_1, ..., f_J are linearly independent, so these constants quantitatively measure the linear independence of these functions.
In the case J = 2, a sufficient (but not necessary) condition for such a uniform bound to hold is that P_{X∼f_0}(X ∈ A) ≠ P_{X∼f_1}(X ∈ A) for some known set A: one constructs the estimators f̂_0, f̂_1 using, in Algorithm 1, L = 2, h_1 = 1, h_2 = 1_A.
• inf_{H∈H} c ≥ M^{−1}, where c = c(H) is the constant of Lemma 25 in Appendix B. The lemma shows that this constant is positive whenever f_1, ..., f_J are distinct, and so it provides a quantitative measure of the degree of distinctness of these functions. In view of the proof, a sufficient (but not necessary) condition for such a uniform bound to hold is that f_1, ..., f_J can uniformly be separated at a point, i.e. that the set H is such that inf_{H∈H} sup_{x∈R} min_{j≠j′} |f_j(x) − f_{j′}(x)| > M^{−1}.
In what follows, we use for example C = C(H) to denote any constant which depends only on the above bounds (i.e. on M < ∞). We note that the set H over which the upper bound is uniform (under the sufficient conditions of the last two items, with A = [−1, 1], say) is nonempty; for the multiple testing results one works over a smaller set I ⊂ H. In particular, in addition to the above constraints, one needs to add the following conditions.
• sup_{H∈I} (max_j E_{X∼f_j} |X|^ν) < ∞ for some ν = ν(I) > 0.
• inf_{H∈I} Π_H((f_1/f_0)(X_1) > u) > 0 for each u > 0.
• inf_{H∈I} min_{i,j} Q_{ij} >
0. [This is in fact implied already by the bounds on the π_j and on the pseudo-spectral gap, since for Theorem 2 we are in the two-state setting.]
We write C = C(I) to denote any constant which depends only on M and these quantities.

Weakening the assumption of Theorem 3.
Theorem 3 remains true if we replace the assumption on (f_1/f_0)(X_1) with the following; see Lemma 17 for a proof that this new condition holds under the assumptions of Theorem 3.

Condition F.
Viewing the sample (X_n : 1 ≤ n ≤ N) as coming from a bi-infinite HMM (X_n : n ∈ Z), grant that the distribution function of

ℓ_i^∞(X) := Π_H(θ_i = 0 | (X_n)_{n∈Z}) (19)

is continuous and strictly increasing on [0, 1]. (Under the assumptions of Theorem 3, the distribution function of ℓ_i^∞ in fact has a strictly positive derivative.)
In the discrete context (that is, when the X_i's take discrete values), understanding when the distribution of the variables ℓ_i^∞ has a density with respect to Lebesgue measure is known to be hard, since it is mostly still an open problem for the closely related stationary filter Φ_i^∞(X) := Π_H(θ_i = 0 | (X_n)_{n≤i}); see Blackwell (1957), Bárány and Kolossváry (2015) and references therein.
Of particular interest, though, is the fact that this new condition is only about the continuity of the distribution function, not about its absolute continuity. Continuity is a weaker property that could be easier to understand and could hold in much more generality, so that Condition F opens up the possibility that a version of Theorem 3 may hold even in certain discrete settings. Indeed, simulations in Su and Wang (2020) suggest that the conclusions of the theorem hold there: they compare various multiple testing procedures and provide empirical evidence that the TDR of the empirical Bayes multiple testing method using nonparametric modelling of HMMs roughly matches that of an oracle thresholding procedure and is the best among the procedures they compare.

Use of marginal FDR and TDR in Theorem 3.
The proof of Theorem 3 in fact shows, after some minor adjustments, that

TDR_H(ϕ̂) ≥ TDR_H(ϕ_{λ_max,H}) − o(1),

where λ_max = λ_max(t, H) is chosen maximal such that FDR_H(ϕ_{λ_max,H}) ≤ t, so that ϕ̂ is (asymptotically) optimal for the TDR when restricting to the class of procedures whose TDR and FDR asymptotically coincide with their marginal equivalents. Heller and Rosset (2020) show in a non-Markovian setting that the procedure maximising the TDR among all procedures with controlled FDR is not in this class, but their results leave open the possibility that Theorem 3 remains true with the full FDR and TDR. Indeed, a main conclusion of their work is that the class (ϕ_{λ,H} : λ ≥ 0) (or rather, the equivalent of this class for their setting) is optimal for the problem of maximising TDR with controlled FDR provided one allows data-driven thresholds – such as λ̂ – whereas the current proof of Theorem 3 uses that for mTDR optimality with mFDR control it suffices to consider the class for non-random thresholds. Furthermore, the difference between the FDR and TDR of the optimal procedure and their marginal versions in the setting of Heller and Rosset (2020) manifests itself for weak signals, so that our signal strength assumption may suffice to rule out such differences.

Adaptation.
The estimator we construct for Theorem 5 uses knowledge of the smoothness s. One can adjust the arguments of Lehéricy (2018) to show that a careful application of Lepskii's method allows adaptation up to a maximum smoothness s_max < ∞ – and indeed state-by-state adaptation, wherein each state is estimated at a rate adapting to its smoothness parameter s_j, rather than requiring s_j = s for all j. As usual, the rough idea is to construct estimators f̂_j^L, j ≤ J, for each L ≤ L_max and use ‖f̂_j^L − f̂_j^{L_max}‖_∞ as a proxy for the bias, so that one can make a suitable bias–variance tradeoff. In the HMM setting, as noted in Lehéricy (2018), one must also use f̂_j^{L_max} to "align" the estimators f̂_j^L up to a single permutation τ rather than needing a different permutation τ_L for each level L; one can show using the triangle inequality that this alignment is successful for all large enough L ≤ L_max with probability tending to 1.

General measure spaces.
The proofs of Theorems 2 and 3 essentially only use the assumption that µ is Lebesgue measure on R or counting measure on Z in showing Lemma 12, so that versions of these theorems continue to hold on general (metric) measure spaces after adjusting Assumption A appropriately. Theorem 4 readily generalises to µ being any discrete measure of known support. The proof of Theorem 5 uses kernel density estimation techniques, and in principle it should be possible to prove a version of this result in any setting where kernel-type estimators with suitable properties exist – for example, using results from Cleanthous et al. (2020), on manifolds.
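The bias-proxy idea in the Adaptation paragraph above can be sketched in a stylised scalar form (the rule, the constant C and the toy bias/noise sequences below are illustrative choices of ours, not the paper's calibrated procedure):

```python
import numpy as np

def lepskii_choice(est, noise, C=2.0):
    """Toy scalar Lepskii rule: the distance to the highest-resolution
    estimator est[-1] serves as a bias proxy, and we return the smallest
    resolution index at which this proxy is dominated by C times the
    (increasing) noise level.  Returns 0 if no index qualifies."""
    est = np.asarray(est, dtype=float)
    noise = np.asarray(noise, dtype=float)
    bias_proxy = np.abs(est - est[-1])
    return int(np.argmax(bias_proxy <= C * noise))  # first index where it balances

# deterministic toy example: bias 2^{-L} shrinks, noise 0.01 * 2^{L/2} grows
L = np.arange(10)
est = 1.0 + 2.0 ** (-L)          # idealised estimators of the target value 1.0
noise = 0.01 * 2.0 ** (L / 2)    # idealised noise levels
chosen = lepskii_choice(est, noise)
```

The chosen index sits where the (unknown) bias and the (known) noise level are of comparable size, which is the bias–variance tradeoff the Adaptation paragraph refers to.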
5. Proofs
The following lemma isolates part of the proof of Theorems 2 and 3, showing that ℓ̂_i(X) converges to ℓ_i(X) at a rate slightly slower than the convergence rate ε_N of the estimators Ĥ.

Lemma 9.
In the setting of Theorem 2, define ε′_N = ε_N (log N)^u, and recall that by definition u > ν^{−1} and by assumption ε′_N → 0, where ν is the parameter of Assumption A. Then

max_{i≤N} Π_H(|ℓ̂_i(X) − ℓ_i(X)| > ε′_N) → 0, as N → ∞. (20)

Consequently, there exists δ_N → 0 such that

Π_H(#{i ≤ N : |ℓ̂_i(X) − ℓ_i(X)| > ε′_N} > N δ_N) → 0.

Proof
We begin by showing that Π_H(|ℓ̂_i(X) − ℓ_i(X)| > M ε′_N) → 0 for each i ≤ N, for some constant M = M(I). [Recall that a constant M(I) depends only on certain bounds for the parameter H = (Q, π, f_0, f_1) as described in Section 4.3.]
Let (E_N)_N be a sequence of events with probability tending to 1 on which

max( ‖Q̂ − Q‖_F, ‖π̂ − π‖, max_{j∈{0,1}} ‖f̂_j − f_j‖_∞ ) ≤ ε_N,

and define

δ = min_{i,j} Q_{i,j}, δ̂ = min_{i,j} Q̂_{i,j}, ρ = (1 − 2δ)/(1 − δ), ρ̂ = (1 − 2δ̂)/(1 − δ̂).

Then Proposition 2.2 of De Castro et al. (2017) yields that for some C depending only on a lower bound for δ,

|ℓ̂_i(X) − ℓ_i(X)| ≤ C { ρ^{i−1} ‖π̂ − π‖ + [(1 − ρ)^{−1} + (1 − ρ̂)^{−1}] ‖Q̂ − Q‖_F + Σ_{n=1}^N ((ρ̂ ∨ ρ)^{|n−i|} / f_π(X_n)) max_{j=0,1} |f̂_j(X_n) − f_j(X_n)| }. (21)

(The proposition there is stated with c∗(x) := min_{j=0,1} Σ_k Q_{j,k} f_k(x) in place of f_π(x), but we note that c∗(x) so defined is lower bounded by δ f_π(x). Also note that De Castro et al. (2017) assume that f_0, f_1 are densities with respect to Lebesgue measure, but this assumption is not used in the proof of Proposition 2.2 therein.) Recalling we assumed that δ is (strictly) positive, we see that on E_N, for N large enough, we have δ̂ > δ̃ = δ/2 and ρ̂ < ρ̃ = (1 + ρ)/2, so that we may replace ρ, ρ̂ and δ, δ̂ in (21) by ρ̃ < 1 and δ̃ > 0. On the event E_N, choosing the constant M = M(δ̃, ρ̃, C) = M(I) large enough, we see by a union bound that

Π_H(|ℓ̂_i(X) − ℓ_i(X)| > M ε′_N) ≤ Π_H(E_N^c) + Π_H( ε_N Σ_{n=1}^N ρ̃^{|n−i|} / f_π(X_n) > ε′_N ).

For κ > 0 define S_{κ,i} = {n ≤ N : |n − i| ≤ κ log N}. We can split the terms in S_{κ,i} from those in S_{κ,i}^c to see, for C′ = 2 Σ_{n=0}^∞ ρ̃^n < ∞, that

Σ_{n≤N} ρ̃^{|n−i|} / f_π(X_n) ≤ C′ [ max_{n∈S_{κ,i}} (1/f_π(X_n)) + ρ̃^{κ log N} max_{n≤N} (1/f_π(X_n)) ],

so that, again appealing to a union bound, it suffices to show

Π_H( max_{n∈S_{κ,i}} (1/f_π(X_n)) > C′ (ε′_N/ε_N) ) → 0, and (22)

Π_H( ρ̃^{κ log N} max_{n≤N} (1/f_π(X_n)) > C′ (ε′_N/ε_N) ) → 0. (23)

Lemma 12 (in Appendix A.1) tells us that for any a > ν^{−1}, with ν the parameter of Assumption A, we have Π_H(max_{i≤R} 1/f_π(X_i) > R^a) → 0 as R → ∞. By stationarity of the process X, taking R = |S_{κ,i}| ≤ 2κ log N + 1, we deduce that

Π_H( max_{n∈S_{κ,i}} 1/f_π(X_n) > (2κ log N + 1)^a ) → 0.

Recalling that ε′_N/ε_N = (log N)^u, we see that (22) holds for all κ if u > a. Next we apply Lemma 12 with R = N to see that

Π_H( max_{n≤N} (1/f_π(X_n)) > N^a ) → 0.

Noting that ρ̃^{−κ log N} = N^{κ log(1/ρ̃)} and choosing κ > a (log 1/ρ̃)^{−1} yields (23). This concludes the proof that for some constant M and each i ≤ N, Π_H(|ℓ̂_i(X) − ℓ_i(X)| > M ε′_N) → 0.
To see that max_{i≤N} Π_H(|ℓ̂_i(X) − ℓ_i(X)| > ε′_N) → 0, we note that by initially considering ε′_N defined with some u′ < u in place of u, we can remove the constant M. Thanks to stationarity of the HMM X, we further note that

max_{i≤N} Π_H( max_{n∈S_{κ,i}} 1/f_π(X_n) > (2κ log N + 1)^a ) → 0,

and since the remaining bounds do not depend on i, we deduce (20).
Finally, defining

δ_N = ( max_{i≤N} Π_H(|ℓ̂_i(X) − ℓ_i(X)| > ε′_N) )^{1/2},

we appeal to Markov's inequality to see that

Π_H(#{i ≤ N : |ℓ̂_i(X) − ℓ_i(X)| > ε′_N} > N δ_N) ≤ (N δ_N)^{−1} Σ_{i=1}^N Π_H(|ℓ̂_i(X) − ℓ_i(X)| > ε′_N) ≤ δ_N^{−1} max_{i≤N} Π_H(|ℓ̂_i(X) − ℓ_i(X)| > ε′_N) = δ_N,

which tends to zero, concluding the proof.

Proof [Proof of Theorem 2] Write t̂ = postFDR_Ĥ(ϕ̂) and recall we write Ŝ for the rejection set of ϕ̂. We have, for any sequences of positive numbers ε′_N and of events F_N,

|FDR_H(ϕ̂) − E_H t̂| = |E_{X∼Π_H}[postFDR_H(ϕ̂) − postFDR_Ĥ(ϕ̂)]|
≤ E_H[ Σ_{i=1}^N |ℓ_i(X) − ℓ̂_i(X)| 1{i ∈ Ŝ} / (1 ∨ |Ŝ|) ]
≤ ε′_N + Π_H(F_N^c) + E_H[ 1_{F_N} Σ_{i=1}^N 1{|ℓ_i(X) − ℓ̂_i(X)| > ε′_N} / (1 ∨ |Ŝ|) ],

where we have used that |ℓ_i(X) − ℓ̂_i(X)| ≤ 1 for each i. Lemma 13 in Appendix A.1 shows that E_H[t̂] → min(t, π_0), so that it is enough to show the right side tends to zero for suitable ε′_N and F_N.
Lemma 14 tells us that Π(|Ŝ| > aN) → 1 for some a > 0. Combining with Lemma 9 by a union bound, we deduce that for suitably chosen ε′_N → 0, δ_N → 0 and a > 0, we have Π_H(F_N^c) → 0 for

F_N = { #{i ≤ N : |ℓ̂_i(X) − ℓ_i(X)| > ε′_N} ≤ N δ_N } ∩ { |Ŝ| > aN }.

Then

E_H[ 1_{F_N} Σ_{i=1}^N 1{|ℓ_i(X) − ℓ̂_i(X)| > ε′_N} / (1 ∨ |Ŝ|) ] ≤ N δ_N / (aN) → 0,

yielding the result.
The following lemma, mentioned already in the sketch proof in Section 2.2, will help us in proving Theorem 3.
Under the assumptions of Theorem 3, define λ∗ ∈ (t, 1] implicitly by

E[ℓ_i^∞(X) | ℓ_i^∞(X) < λ∗] = min(t, π_0),

where ℓ_i^∞ is as in (19) (by stationarity the conditional expectation does not depend on i). Such a λ∗ exists; it satisfies, for ε > 0,

E[ℓ_i^∞(X) | ℓ_i^∞(X) < λ∗ − ε] < min(t, π_0),
E[ℓ_i^∞(X) | ℓ_i^∞(X) < λ∗ + ε] > t if t < π_0;

and we have λ̂ → λ∗ in probability as N → ∞. (24)

Proof
Lemma 17 (in Appendix A.2) tells us that under the assumptions of Theorem 3, the distribution function of ℓ_i^∞ is continuous and strictly increasing. Lemma 18 then tells us that the same is true of the map λ ↦ E[ℓ_i^∞ | ℓ_i^∞ < λ], and that E[ℓ_i^∞ | ℓ_i^∞ < t] < t. Noting also that E[ℓ_i^∞ | ℓ_i^∞ < 1] = E[ℓ_i^∞] = π_0 (since ℓ_i^∞ < 1 almost surely), we deduce the existence of λ∗ ∈ (t, 1] by the intermediate value theorem. Strict monotonicity of the conditional expectation implies the claimed inequalities when conditioning on ℓ_i^∞ < λ∗ − ε and on ℓ_i^∞ < λ∗ + ε.
For the convergence in probability, we show for ε > 0 that, with probability tending to 1, postFDR_Ĥ(ϕ_{λ∗−ε,Ĥ}) < t. We omit the almost identical proof that for t < π_0 we have postFDR_Ĥ(ϕ_{λ∗+ε,Ĥ}) > t. From these two bounds one deduces that λ̂ ∈ (λ∗ − ε, λ∗ + ε), implying (24).
By Lemma 19, there exist ξ_N, δ_N →
0, such that with probability tending to 1,

#{i : 1 ≤ i ≤ N, |ℓ̂_i(X) − ℓ_i^∞(X)| > ξ_N} ≤ N δ_N,

and we observe that

postFDR_Ĥ(ϕ_{λ∗−ε,Ĥ}) = Σ_i ℓ̂_i 1{ℓ̂_i < λ∗−ε} / (1 ∨ Σ_i 1{ℓ̂_i < λ∗−ε})
≤ Σ_i ℓ̂_i 1{ℓ̂_i < λ∗−ε, |ℓ̂_i − ℓ_i^∞| ≤ ξ_N} / (1 ∨ Σ_i 1{ℓ̂_i < λ∗−ε, |ℓ̂_i − ℓ_i^∞| ≤ ξ_N}) + #{i : |ℓ̂_i − ℓ_i^∞| > ξ_N} / #{i : ℓ̂_i < λ∗−ε}
≤ Σ_i ℓ_i^∞ 1{ℓ_i^∞ < λ∗−ε+ξ_N} / (1 ∨ Σ_i 1{ℓ_i^∞ < λ∗−ε−ξ_N, |ℓ̂_i − ℓ_i^∞| ≤ ξ_N}) + ξ_N + #{i : |ℓ̂_i − ℓ_i^∞| > ξ_N} / #{i : ℓ̂_i < λ∗−ε}.

Since λ∗ > t, (the proof of) Lemma 14 implies that for some c > 0 and all ε > 0 small enough, #{i : ℓ̂_i < λ∗−ε} > cN with probability tending to 1. We also lower bound the denominator in the first term of the final line by #{i : ℓ_i^∞ < λ∗−ε−ξ_N} − #{i : |ℓ̂_i − ℓ_i^∞| > ξ_N}; for ε, ξ_N, c′ small enough note that #{i : ℓ_i^∞ < λ∗−ε−ξ_N} > c′N with probability tending to 1 by ergodicity (i.e. applying Lemma 20 with g(x) = 1{x < λ∗−ε−ξ} for some ξ > ξ_N), using that Π(ℓ_i^∞ < λ∗−ε−ξ_N) > 0. It follows that for an event C_N of probability tending to 1, postFDR_Ĥ(ϕ_{λ∗−ε,Ĥ}) is upper bounded by

1_{C_N^c} + [ Σ_i ℓ_i^∞ 1{ℓ_i^∞ < λ∗−ε+ξ_N} / Σ_i 1{ℓ_i^∞ < λ∗−ε−ξ_N} ] ( 1 + O( #{i : |ℓ̂_i − ℓ_i^∞| > ξ_N} / #{i : ℓ_i^∞ < λ∗−ε−ξ_N} ) ) + ξ_N + δ_N/c
≤ Σ_i ℓ_i^∞ 1{ℓ_i^∞ < λ∗−ε+ξ_N} / Σ_i 1{ℓ_i^∞ < λ∗−ε−ξ_N} + o_p(1).

Again using the ergodicity result Lemma 20, we have that, for fixed ξ > 0,

N^{−1} Σ_{i=1}^N ℓ_i^∞ 1{ℓ_i^∞ < λ∗−ε+ξ} → E_H[ℓ_i^∞ 1{ℓ_i^∞ < λ∗−ε+ξ}] in probability,
N^{−1} Σ_{i=1}^N 1{ℓ_i^∞ < λ∗−ε−ξ} → Π_H(ℓ_i^∞ < λ∗−ε−ξ) > 0 in probability,

so that, for N large enough, with probability tending to 1,

Σ_{i=1}^N ℓ_i^∞ 1{ℓ_i^∞ < λ∗−ε+ξ_N} / Σ_{i=1}^N 1{ℓ_i^∞ < λ∗−ε−ξ_N} ≤ Σ_{i=1}^N ℓ_i^∞ 1{ℓ_i^∞ < λ∗−ε+ξ} / Σ_{i=1}^N 1{ℓ_i^∞ < λ∗−ε−ξ} ≤ E_H[ℓ_i^∞ 1{ℓ_i^∞ < λ∗−ε+ξ}] / Π_H(ℓ_i^∞ < λ∗−ε−ξ) + o_p(1).

Finally we note that

E_H[ℓ_i^∞ 1{ℓ_i^∞ < λ∗−ε+ξ}] / Π_H(ℓ_i^∞ < λ∗−ε−ξ) = E_H[ℓ_i^∞ | ℓ_i^∞ < λ∗−ε+ξ] · Π_H(ℓ_i^∞ < λ∗−ε+ξ) / Π_H(ℓ_i^∞ < λ∗−ε−ξ).

Uniformly in ξ satisfying 0 < ξ < ε/2, we have by monotonicity

E_H[ℓ_i^∞ | ℓ_i^∞ < λ∗−ε+ξ] ≤ E_H[ℓ_i^∞ | ℓ_i^∞ < λ∗−ε/2] < min(t, π_0).

Observe also that Π_H(λ∗−ε−ξ ≤ ℓ_i^∞ < λ∗−ε+ξ) → 0 as ξ → 0 by continuity of the distribution function of ℓ_i^∞, while Π_H(ℓ_i^∞ < λ∗−ε−ξ) is bounded away from zero for λ∗−ε−ξ bounded away from zero. It follows that by choosing ξ = ξ(ε) small enough we may ensure that

E_H[ℓ_i^∞ | ℓ_i^∞ < λ∗−ε+ξ] · Π_H(ℓ_i^∞ < λ∗−ε+ξ) / Π_H(ℓ_i^∞ < λ∗−ε−ξ) < min(t, π_0).

We conclude, as claimed, that postFDR_Ĥ(ϕ_{λ∗−ε,Ĥ}) < min(t, π_0) with probability tending to 1.

Proof [Proof of Theorem 3] Define λ∗ as in Lemma 10. In the case λ∗ = 1, one shows that ϕ̂ rejects all but o_p(N) of the hypotheses. It follows that, asymptotically, its mTDR is close to that of the procedure which rejects all null hypotheses, which trivially has the best mTDR of any procedure. We omit the proof details in this case and henceforth assume that λ∗ < 1, or equivalently (in view of Lemma 10) that t < π_0.
We compare ϕ̂ to the 'oracle' procedure ϕ_{λ∗,H}, which we will argue has optimal multiple testing properties. For ε_N > 0 we have the pointwise bound

1{ℓ_i < λ∗} ≤ 1{λ∗−ε_N ≤ ℓ_i < λ∗} + 1{ℓ̂_i < λ̂} + 1{λ̂ < λ∗ − ε_N/2} + 1{ℓ̂_i − ℓ_i > ε_N/2}.

Lemma 10 tells us that λ̂ tends to λ∗ in probability, so that Π(λ̂ < λ∗ − ε_N/2) → 0 for ε_N tending to zero slowly enough, and Lemma 9 tells us that #{i : |ℓ̂_i − ℓ_i| > ε_N/2}/N → 0 in
0, so that a Taylor expansion yieldsmFDR H ( ˆ ϕ ) ≤ E { i : θ i = 0 , ℓ i < λ ∗ } + o ( N ) E { i : ℓ i < λ ∗ } − o ( N ) ≤ mFDR H ( ϕ λ ∗ ,H ) + o (1) . Define g ( x ) = sup { mTDR H ( ψ ) : mFDR H ( ψ ) ≤ x } . Trivially mTDR H ( ˆ ϕ ) ≤ g (mFDR H ( ˆ ϕ )),and hence the following chain of equalities (justified below) proves the first claim of the the-orem: mTDR H ( ˆ ϕ ) ≥ mTDR H ( ϕ λ ∗ ,H ) − o (1) ≥ g (cid:0) mFDR H ( ϕ λ ∗ ,H ) (cid:1) − o (1) ≥ g (cid:0) mFDR H ( ˆ ϕ ) − o (1) (cid:1) − o (1) ≥ g (cid:0) mFDR H ( ˆ ϕ ) (cid:1) − o (1) . The first line was proved above. The second is a consequence of an optimality propertyfor the class ( ϕ λ,H : λ ∈ [0 , g given by Lemma 22.It remains to prove the second claim of the theorem. This will follow, with the samearguments as above, from proving that mFDR H ( ϕ λ ∗ ,H ) ≥ t − o (1). Observe that, usingLemma 19 as above, one can show E [ X i ≤ N ℓ i { ℓ i < λ ∗ } ] = E X i ≤ N [ ℓ ∞ i { ℓ ∞ i < λ ∗ } ] + o ( N ) E [ X i ≤ N { ℓ i < λ ∗ } ] = E [ X i ≤ N { ℓ ∞ i < λ ∗ } ] + o ( N ) . braham, Castillo and Gassiat Stationarity of the HMM implies that E [ X i ≤ N ℓ ∞ i { ℓ ∞ i < λ ∗ } ] = N E [ ℓ ∞ | ℓ < λ ∗ ]Π( ℓ < λ ∗ ) ,E [ X i ≤ N { ℓ ∞ i < λ } ] = N Π( ℓ ∞ < λ ∗ ) , and hence by definition of λ ∗ (recall we have assumed t < π ) E X i ≤ N ( ℓ ∞ i − t ) { ℓ ∞ i < λ ∗ } = 0 . Returning to the ℓ –values themselves and using also Lemma 17 to see that Π( ℓ ∞ < λ ∗ ) > N − E [ X i ≤ N ( ℓ i − t ) { ℓ i < λ ∗ } ] → ,N − E X i ≤ N { ℓ i < λ ∗ } → Π( ℓ ∞ < λ ∗ ) > , and we may rearrange to see that mFDR H ( ϕ λ ∗ ,H ) ≥ t − o (1). We construct the estimators of Theorem 5 using a spectral kernel density estimation method.Let K be a bounded Lipschitz-continuous function, supported in [ − , K L ( x, y ) = 2 L K (2 L ( x − y )) ,K L [ f ]( x ) = Z K L ( x, y ) f ( y ) d y, (25)then we have, for any f ∈ C s ( R ), k f − K L [ f ] k ∞ ≤ C k f k C s − Ls . 
(26)Note that such a function, a ‘bounded convolution kernel of order s ’, exists, see Tsybakov(2009) (in particular, to ensure K is Lipschitz, one builds the kernel using a Gegenbauerbasis with parameter α > C = C ( H ), max j k K L [ f j ] k ∞ ≤ k K k ∞ max j k f j k ∞ ≤ C (27)since R − | K ( x ) | d x ≤ k K k ∞ . [Recall that a constant C = C ( H ) depends only on certainbounds for the parameter H = ( Q, π, f , . . . , f J ) as described in Section 4.3. In fact, aswith the above, we allow such a constant to also depend on the kernel K since this kernelcan be chosen independent of H . Similarly, we will permit such a constant C to depend onthe functions h , . . . , h L and the sets D N of Algorithm 1.]The premise of the estimation algorithm comes from the following lemma, which adaptsideas found in Anandkumar et al. (2012) and Lehéricy (2018). ultiple Testing in Nonparametric HMMs Lemma 11.
For L ∈ N, let h_1, ..., h_L be arbitrary functions. Define, for data X from the HMM (1),

M^x ≡ M^{x,L} := (E_H[h_l(X_1) K_L(x, X_2) h_m(X_3)])_{l,m≤L} ∈ R^{L×L}, (28)
P ≡ P^L := (E_H[h_l(X_1) h_m(X_3)])_{l,m≤L} ∈ R^{L×L}, (29)
D^x ≡ D^{x,L} := diag( K_L[f_j](x) : j ≤ J ) ∈ R^{J×J}, (30)
O ≡ O^L := (E_H[h_l(X_1) | θ_1 = j])_{l≤L, j≤J} ∈ R^{L×J}. (31)

Then M^x = O diag(π) Q D^x Q O^⊺ and P = O diag(π) Q² O^⊺.
If V ∈ R^{L×J} is such that V^⊺ P V is invertible (it suffices to assume P V has rank J, which holds under the assumption that P has rank J if the columns of V consist of orthonormal right singular vectors of P, or any other orthonormal basis of the column space of P), then the matrix

B^x ≡ B^{x,L} := (V^⊺ P V)^{−1} V^⊺ M^x V (32)

satisfies

B^x = (Q O^⊺ V)^{−1} D^x (Q O^⊺ V), (33)

so that the matrices (B^x : x ∈ R) are diagonalisable simultaneously, with B^x having eigenvalues (D^x_{jj} : j ≤ J) = (K_L[f_j](x) : j ≤ J).

Proof
Conditioning on (θ_1, θ_2, θ_3), we see

M^x_{l,m} = Σ_{a,b,c} Π_H(θ_1 = a, θ_2 = b, θ_3 = c) E_H[h_l(X_1) K_L(x, X_2) h_m(X_3) | θ_1 = a, θ_2 = b, θ_3 = c]
= Σ_{a,b,c} π_a Q_{a,b} Q_{b,c} O_{l,a} O_{m,c} E_{X∼f_b}[K_L(x, X)]
= (O diag(π) Q D^x Q O^⊺)_{l,m},

and similarly we have

P = (Σ_{a,b,c} π_a Q_{a,b} Q_{b,c} O_{l,a} O_{m,c})_{l,m} = O diag(π) Q² O^⊺.

Next, note that if V^⊺ P V is invertible then so is Q O^⊺ V (since V^⊺ P V = (V^⊺ O diag(π) Q)(Q O^⊺ V), and a product AB of square matrices is invertible if and only if each of A and B is). The result (33) then follows from the expressions for P and M^x.

Lemma 11 suggests estimating the eigenvalues K_L[f_j](x) of B^x by using empirical versions of V, P and M^x, an idea which is implemented in the following algorithm. The algorithm requires as inputs functions h_1, ..., h_L and sets D_N with certain properties; the existence of suitable inputs is discussed in the remarks thereafter. We introduce notation for the "eigen-separation" of a matrix B ∈ R^{J×J} with eigenvalues λ_1, ..., λ_J:

sep(B) = min_{i≠j} |λ_i − λ_j|. (34)

Recall that σ_J(B) denotes the J-th largest singular value of B.

Algorithm 1
Kernel density estimator

input:
• Data (X_n : n ≤ N + 2) drawn from the HMM (1).
• Functions h_1, ..., h_L, uniformly bounded, such that O = (E[h_l(X_1) | θ_1 = j])_{l≤L, j≤J} is of rank J, with σ_J(O) bounded away from 0 uniformly in N, at least for N large enough.
• Finite sets D_N ⊆ {(a, u) ∈ R^{J(J−1)/2} × R^{J(J−1)/2} : Σ |a_i| ≤ 1} such that max_{(a,u)∈D_N} sep(B^{a,u}) is bounded away from 0 uniformly in N, at least for N large enough, where B^{a,u} = Σ a_i B^{u_i} for B^x as in Lemma 11 for some V.

estimate the matrices P, (M^x, x ∈ R) of Lemma 11 by taking empirical averages:

P̂ = P̂^L = ( N^{−1} Σ_{n≤N} h_l(X_n) h_m(X_{n+2}) )_{l,m≤L},
M̂^x = M̂^{x,L} = ( N^{−1} Σ_{n≤N} h_l(X_n) K_L(x, X_{n+1}) h_m(X_{n+2}) )_{l,m≤L}.

Let V̂ = V̂^L ∈ R^{L×J} be a matrix of orthonormal right singular vectors of P̂ (fail if P̂ is of rank less than J).

set, for x ∈ R and for a, u ∈ R^{J(J−1)/2},

B̂^x = B̂^{x,L} := (V̂^⊺ P̂ V̂)^{−1} V̂^⊺ M̂^x V̂,  B̂^{a,u} := Σ a_i B̂^{u_i}.

choose R̂ with normalised columns diagonalising B̂^{â,û}, where (â, û) ∈ argmax_{D_N} sep(B̂^{a,u}) (fail if B̂^{â,û} is not diagonalisable).

output (f̂_j : j ≤ J), where, defining f̃_j^L(x) = (R̂^{−1} B̂^x R̂)_{jj}, we set

f̂_j(x) = f̃_j^L(x) if |f̃_j^L(x)| ≤ N^α, and f̂_j(x) = N^α sign(f̃_j^L(x)) otherwise,

for some α > 0 and L such that 2^L ≍ (N/log N)^{1/(1+2s)}. [The in-probability result (17) also holds for f̃_j^L.]

Remarks. i. For notational convenience, we have considered observing N + 2 data points X_1, ..., X_{N+2} so that we can form N triples of consecutive observations; the proofs go through for the original N data points by adjusting constants.

ii. Under Assumption C, h_1, ..., h_L can be chosen without knowledge of the parameters, for example by letting L tend to infinity arbitrarily slowly and taking the h_l to be indicator functions of the first L sets of a countable collection generating the Borel σ-algebra (see Lemma 23, in Appendix B.1). In principle, L = J is sufficient to achieve O of rank J, but without further assumptions the appropriate functions h_1, ..., h_J will necessarily depend on the unknown parameters. In the case J = 2, it suffices to assume in addition to the other conditions of Theorem 5 that P_{X∼f_0}(X ∈ A) ≠ P_{X∼f_1}(X ∈ A) for some known A, by taking h_1 = 1, h_2 = 1_A.

iii. Lemma 11 implies that the condition on D_N is independent of V provided V is such that V^⊺ P V is invertible. Lemma 25, the proof of which uses only that f_1, ..., f_J are distinct, shows that the choice V = V̂ is suitable with probability tending to 1 and that D_N can be chosen independent of the parameters, for example by taking a cartesian product of increasing dyadic sets of rationals. In the case J = 2, the description of the algorithm simplifies, in that necessarily â = 1 ∈ R. A corresponding simplification also works in the general J-state case if one is willing to assume that there exists x ∈ R for which the values f_j(x), j ≤ J, are all distinct, in that one may define R̂ as diagonalising B̂^x̂ where x̂ maximises sep(B̂^x) over x in (some finite increasing sieve in) R.

iv. Lemmas 24 and 26 prove that with probability tending to 1, P̂ has rank J and B̂^{â,û} is diagonalisable, and hence that the outputs f̂_j are well-defined.

v. Since the f_j are assumed Hölder continuous, and satisfy tail bounds, one could in fact calculate f̂_j(x) only for x in some finite set, then construct estimators f̌_j via interpolation, in order to ease computation.

Proof [Proof of Theorem 5] Construct f̂_j, f̃_j^L using Algorithm 1.
Continuity of these func-tions follows from continuity of the map x ˆ B x , which in turn follows from that of themap x ˆ M x , proved in Lemma 24. Observe also that k f j k ∞ < ∞ for all j ≤ J , so thatfor N large enough that k f j k ∞ ≤ N a we have k ˆ f j − f τ ( j ) k ∞ ≤ k ˜ f Lj − f τ ( j ) k ∞ , hence for the in-probability result it suffices to prove (17) with ˜ f j = ˜ f Lj in place of ˆ f j .For a constant c >
0, define the event

A = {‖ˆP − P‖ ≤ cLr_N, ‖ˆM_x − M_x‖ ≤ cLr_N ∀x ∈ R}. (35)

This is indeed a measurable event, and for suitable c = c(κ, H) it has probability at least 1 − N^{−κ}, by Lemma 24, which also tells us that ˆV^⊺PˆV is invertible on A and that, defining ˜B_x := (ˆV^⊺PˆV)^{−1}ˆV^⊺M_xˆV, we have, for some C depending on H and on the constant c of event A,

1_A sup_{x∈R} ‖˜B_x − ˆB_x‖ ≤ CLr_N.

Lemma 11 tells us (on A) that ˜B_x = (QO^⊺ˆV)^{−1}D_x QO^⊺ˆV, and we write ˜R for a matrix whose columns are those of (QO^⊺ˆV)^{−1}, but scaled to have unit Euclidean norm, which thus diagonalises ˜B_x for all x. By Lemma 31, ‖ˆR − ˜R_τ‖ ≤ CL^{1/2}r_N on A for some permutation τ, where ˜R_τ is obtained by permuting the columns of ˜R according to τ. Next we apply Lemma 32 with T = R, A_x = ˜B_x, ˆA_x = ˆB_x, R = ˜R. Noting that ‖˜R^{−1}‖ ≤ C′L^{1/2} and κ(˜R) ≤ C′L for some constant C′ = C′(H) (see Lemma 34b), and that the constant λ_max of the lemma is bounded by a constant depending only on H (see (27)), we deduce that

sup_x max_j |˜f_j^L(x) − K_L[f_{τ(j)}](x)| ≤ c′[Lr_N + L^{1/2}·L^{1/2}r_N] ≤ c″Lr_N,

for some constants c′, c″. The in-probability result (17) follows, since the choice of L ensures by (26) that ‖f_{τ(j)} − K_L[f_{τ(j)}]‖_∞ ≤ C″Lr_N on A for some C″, so that for a suitable constant C,

Π_H(‖˜f_j − f_{τ(j)}‖_∞ > CLr_N) ≤ Π(A^c) ≤ N^{−κ} → 0.

For the in-expectation result (18), observe that by truncating at ±N^α we have ensured that

E_H ‖ˆf_j − f_{τ(j)}‖_∞ ≤ CLr_N + 2N^α Π_H(A^c).

Choosing c = c(κ, H) in the definition of the event A corresponding to some κ ≥ s/(1+2s) + α concludes the proof.

Proof [Proof of Proposition 7] Let ˆf_0, ˆf_1, ˆQ, ˆπ be estimators which satisfy

Π_H(‖ˆf_0 − f_{τ(0)}‖_∞ + ‖ˆf_1 − f_{τ(1)}‖_∞ + ‖ˆQ − Q_{σ,σ}‖_F + ‖ˆπ − π_σ‖ > Cε_N) → 0 (36)

for some permutations τ, σ and a constant C >
0, with Q_{σ,σ} defined by permuting the rows and columns of Q, and π_σ defined similarly. The existence of suitable ˆf_0, ˆf_1 is given by Theorem 5, and the existence of suitable ˆQ, ˆπ is proved by results in (De Castro et al., 2017, Appendix C) (and by arguments as in (De Castro et al., 2016, Section 8.6) to accelerate the possibly slow rate). Moreover, the estimators of De Castro et al. (2017) are constructed using a spectral method, so that one may in fact assume σ = τ. [One could also "align" σ and τ by hand, by noting that by ergodicity the invariant density f_π can be estimated at the rate r_N using a standard kernel density estimator, and permuting rows and columns of ˆQ and ˆπ so that Σ_i ˆπ_i ˆf_i is close to this kernel density estimator; linear independence of the f_i ensures that this alignment method works.]

Next, under the assumption π_0 > π_1, define ˇf_j = ˆf_{ˆτ(j)}, ˇQ = ˆQ_{ˆτ,ˆτ} and ˇπ = ˆπ_ˆτ, where ˆτ(0) = 1 − ˆτ(1) = 1{ˆπ_1 > ˆπ_0}. Consistency of ˆπ implies that ˆτ consistently estimates the permutation τ = σ of (36), hence

Π_H(‖ˇf_0 − f_0‖_∞ + ‖ˇf_1 − f_1‖_∞ + ‖ˇQ − Q‖_F + ‖ˇπ − π‖ > Cε_N) ≤ Π_H(ˆτ ≠ τ) + Π_H(‖ˆf_0 − f_{τ(0)}‖_∞ + ‖ˆf_1 − f_{τ(1)}‖_∞ + ‖ˆQ − Q_{τ,τ}‖_F + ‖ˆπ − π_τ‖ > Cε_N) → 0.

For the second case, we want to define ˆτ(0) = 1{limsup_{x↑x*} (ˆf_0/ˆf_1)(x) > 1} and proceed similarly, but the compact support of K means that ˆf_0(x) = ˆf_1(x) = 0 for x > L^{−1} + max_k X_k, and the right side may be strictly smaller than x*. Instead, noting that necessarily Π_H(X_1 ≤ x*) > 0 and (without loss of generality) x* >
0, we set X̃_n = X_n 1{X_n ≤ x*} and define

ˆτ(0) = 1 − ˆτ(1) = 1{ˆf_0(M_N) > ˆf_1(M_N)}, M_N = max_{i ≤ log²N} X̃_i;

note that by construction we have ˆf_{ˆτ(1)}(M_N) ≥ ˆf_{ˆτ(0)}(M_N). We show that ‖ˆf_{ˆτ(1)} − f_0‖_∞ > Cε_N on an event A_N of probability tending to 1; it will follow from (36) that ˆτ ≡ ˆτ^{−1} = τ on A_N, and the result will follow.

The variables X̃_i, i ≤ N, have a density with respect to the measure defined by adding an atom at 0 to Lebesgue measure. Let u be as in Theorem 2, so that u > 1 + ν^{−1} and ε_N(log N)^{2u} → 0, for ν as in Assumption A. The proof of Lemma 12 shows that with probability tending to 1 we have f_π(M_N) ≥ min_{i ≤ log²N} f_π(X̃_i) ≥ (log N)^{−2u}. We also note that M_N ↑ x* almost surely, so that f_1(M_N) > 2f_0(M_N), and hence f_1(M_N) ≥ f_π(M_N) > 4Cε_N, for all N large enough. Let A_N be an event of probability tending to 1 on which

f_1(M_N) > 4Cε_N, f_1(M_N) > 2f_0(M_N), ‖ˆf_0 − f_{τ(0)}‖_∞ ≤ Cε_N, ‖ˆf_1 − f_{τ(1)}‖_∞ ≤ Cε_N,

whose existence we have just demonstrated. On A_N we have both ˆf_0(M_N) ≥ f_{τ(0)}(M_N) − Cε_N and ˆf_1(M_N) ≥ f_{τ(1)}(M_N) − Cε_N, hence

ˆf_{ˆτ(1)}(M_N) = max(ˆf_0(M_N), ˆf_1(M_N)) ≥ max_j (f_j(M_N) − Cε_N) = f_1(M_N) − Cε_N > f_0(M_N) + Cε_N,

so that ‖ˆf_{ˆτ(1)} − f_0‖_∞ > Cε_N on A_N as claimed.

Acknowledgments
The authors would like to thank Étienne Roquain, Gloria Buritica and Ramon van Handel for fruitful discussions about this work. K.A. was supported in this work by grants from the Fondation Mathématique Jacques Hadamard (FMJH). I.C. and E.G. would like to acknowledge support for this project from the Institut Universitaire de France. I.C. is partly supported by the ANR-17-CE40-0001 grant (BASICS).
Appendix A. Auxiliary Results for Section 2
A.1 Lemmas for Theorem 2
Recall that f_π = π_0 f_0 + π_1 f_1 is the density of each X_i, i ≤ N, in the HMM model (1). Lemma 12.
Under Assumption A we have, for any a > 1 + ν^{−1}, Π_H(max_{i≤R} 1/f_π(X_i) > R^a) → 0 as R → ∞. Proof
For A = R^a, B = R^b with a, b > 0, we have

Π_H(max_{i≤R} 1/f_π(X_i) > A) ≤ R Π_H(f_π(X_1) < A^{−1}) ≤ R ∫_{−B}^{B} 1{f_π(x) < A^{−1}} f_π(x) dµ(x) + R Π_H(|X_1| > B) ≤ R µ([−B, B])/A + R Π_H(|X_1| > B).

Since f_π is a mixture of the densities f_0, f_1, an application of Markov's inequality yields

Π_H(|X_1| > B) ≤ max_j P_{X∼f_j}(|X| > B) ≤ B^{−ν} max_j E_{X∼f_j}|X|^ν,

which is at most a constant times B^{−ν} by the assumption. Choosing b > 1/ν, we have R Π_H(|X_1| ≥ B) → 0 for B = R^b. Since µ is equal either to Lebesgue or to counting measure, µ([−B, B]) ≤ 2B + 1 ≤ 3B. Then R µ([−B, B])/A ≤ 3R^{1+b−a}, which tends to zero for a > 1 + b, so that any a > 1 + ν^{−1} is permissible.

For the following two lemmas recall the definition ˆS = ˆS(t) = {i : ˆϕ_i = 1}, where ˆϕ is as in Definition 1, so that ˆK = |ˆS| is characterised by

(1/ˆK) Σ_{i=1}^{ˆK} ˆℓ_{(i)} ≤ t < (1/(ˆK+1)) Σ_{i=1}^{ˆK+1} ˆℓ_{(i)},

where, by convention, the left inequality holds if ˆK = 0, and ˆℓ_{(N+1)} = ∞ so that the right inequality holds if ˆK = N. Recall the definition

ˆt := postFDR_ˆH(ˆϕ) = (1/ˆK) Σ_{i=1}^{ˆK} ˆℓ_{(i)}.

Lemma 13.
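The characterisation of ˆK and ˆt recalled above translates directly into code. The following is a minimal numerical sketch (function and variable names are illustrative, not from the paper); it uses the fact that the running averages of the sorted ℓ-values are non-decreasing, so that ˆK is simply the number of running averages that are at most t.

```python
import numpy as np

def lvalue_procedure(l_hat, t):
    """Reject the K smallest l-values, where K is the largest integer such
    that the average of the K smallest l-values is at most t (K = 0 if none).
    Returns the rejection indicators phi and the achieved postFDR t_hat."""
    order = np.argsort(l_hat)
    running_avg = np.cumsum(l_hat[order]) / np.arange(1, len(l_hat) + 1)
    # averages of the K smallest values are non-decreasing in K, so the
    # set {K : average of K smallest <= t} is an initial segment of 1..N
    K = int(np.sum(running_avg <= t))
    phi = np.zeros(len(l_hat), dtype=int)
    phi[order[:K]] = 1
    t_hat = running_avg[K - 1] if K > 0 else 0.0
    return phi, t_hat
```

For instance, with ℓ̂ = (0.01, 0.02, 0.5, 0.9) and t = 0.1, the first two hypotheses are rejected and ˆt = 0.015: the average of the two smallest ℓ-values is below t, while including a third would push the average above t.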
In the setting of Theorem 2, E_H ˆt → min(t, π_0). Proof
Since 0 ≤ ˆt ≤ 1, it is enough to show that ˆt → min(t, π_0) in probability. By Lemma 15, we have

(1/N) Σ_{i=1}^N ˆℓ_i(X) → π_0 in probability. (37)

By monotonicity of the averages of increasing numbers, we have

ˆt ≤ (1/N) Σ_{i=1}^N ˆℓ_{(i)} = (1/N) Σ_{i=1}^N ˆℓ_i,

and by construction we note also that ˆt ≤ t, hence ˆt ≤ min(t, π_0) + o_p(1).

To obtain a matching lower bound, we decompose relative to the event C = {ˆK = N}. Observe, using (37), that

ˆt 1_C = 1_C (1/N) Σ_{i=1}^N ˆℓ_i ≥ 1_C π_0 − o_p(1).

By definition of ˆK we also have

t 1_{C^c} < (1/(ˆK+1)) Σ_{i=1}^{ˆK+1} ˆℓ_{(i)} 1_{C^c} = (ˆK/(ˆK+1)) ˆt 1_{C^c} + (ˆℓ_{(ˆK+1)}/(ˆK+1)) 1_{C^c},

hence, since ˆℓ_{(ˆK+1)} ≤ 1 on C^c,

ˆt 1_{C^c} > ((ˆK+1)/ˆK) t 1_{C^c} − (ˆℓ_{(ˆK+1)}/ˆK) 1_{C^c} > t 1_{C^c} − 1/ˆK.
By Lemma 14, ˆK → ∞ in probability for any t > 0, so that the above display implies ˆt 1_{C^c} > t 1_{C^c} − o_p(1), and hence

ˆt > t 1_{C^c} + π_0 1_C − o_p(1) ≥ min(t, π_0) − o_p(1),

proving the lower bound.

Recall the definition of constants C = C(I) from Section 4.3. Lemma 14.
In the setting of Theorem 2, for all t > 0, there exists a = a(t, I) > 0 such that Π_H(|ˆS| > aN) → 1. Proof
The definition of ˆλ trivially implies ˆλ ≥ t, so that {i : ˆℓ_i < t} ⊆ {i : ˆℓ_i < ˆλ} ⊆ ˆS. For A ∈ ℕ write

ℓ′_i(X) := Π_H(θ_i = 0 | X_{i−A}, …, X_{i+A}), A < i ≤ N − A.

By Lemma 16, there exist A = A(t) and events G_N of probability tending to 1 on which

#{i : A < i ≤ N − A, |ˆℓ_i(X) − ℓ′_i(X)| > t/2} ≤ N δ_N,

for some δ_N →
0. On G_N, we observe that

#{i ≤ N : ˆℓ_i < t} ≥ #{i : A < i ≤ N − A, ℓ′_i < t/2} − N δ_N,

hence it suffices to show that there exists c > 0 such that #{i : A < i ≤ N − A, ℓ′_i < t/2} > cN with probability tending to 1.

By ergodicity (i.e. applying Lemma 20 with g(x) = 1{x < t/2}) we have, for any ε > 0,

Π_H( #{i : A < i ≤ N − A, ℓ′_i < t/2} > (N − 2A)(Π_H(ℓ′_i < t/2) − ε) ) → 1,

hence it suffices to show that Π_H(ℓ′_i < t/2) > 0. Fix i satisfying A < i ≤ N − A. In view of Assumption A, assume without loss of generality that there exists x* ∈ R ∪ {±∞} such that f_1(x)/f_0(x) → ∞ as x ↑ x*. Expanding ℓ′_i over the possible values of the neighbouring states θ_{i−A}, …, θ_{i+A}, and using that the entries of Q are bounded below by δ, we deduce that for some u = u(t, δ) > 0,

Π_H(ℓ′_i < t/2) ≥ Π_H(f_1(X_i)/f_0(X_i) > (1 − t)/(tδ)) ≥ π_1 P_{X∼f_1}(x* − u ≤ X ≤ x*) > 0,

as required. Lemma 15.
In the setting of Theorem 2, (1/N) Σ_{i=1}^N ˆℓ_i(X) → π_0 in probability as N → ∞. Proof
It is required to prove, for ε > 0,

Π_H( |(1/N) Σ_{i=1}^N ˆℓ_i(X) − π_0| > ε ) → 0.

By Lemma 16, defining

ℓ′_i(X) = Π_H(θ_i = 0 | X_{i−A}, …, X_{i+A}), A < i ≤ N − A,

there exists A = A(ε) for which, with probability tending to 1,

#{i : A < i ≤ N − A, |ˆℓ_i(X) − ℓ′_i(X)| > ε/4} ≤ N δ_N.

On the event on which the last line holds we can decompose:

|(1/N) Σ_{i=1}^N ˆℓ_i(X) − π_0| ≤ 2A/N + ε/4 + δ_N + (1/N) |Σ_{i=A+1}^{N−A} (ℓ′_i(X) − π_0)|.

Finally, by ergodicity of ℓ′_i(X) (see Lemma 20) we have

Π_H( (1/N) |Σ_{i=A+1}^{N−A} (ℓ′_i(X) − π_0)| > ε/4 ) ≤ Π_H( (1/(N − 2A)) |Σ_{i=A+1}^{N−A} (ℓ′_i(X) − E_H[ℓ′_i(X)])| > ε/4 ) → 0,

where we have used that E_H[ℓ′_i(X)] = E_H Π_H(θ_i = 0 | X_{i−A}, …, X_{i+A}) = Π_H(θ_i = 0) = π_0. The result follows.
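The quantities ˆℓ_i and ℓ_i appearing in Lemmas 13–16 are smoothing probabilities of the form Π_H(θ_i = 0 | X_1, …, X_N), computable by the forward–backward recursions of Baum et al. (1970). The following is a minimal sketch for a two-state chain (the names and the per-step normalisation scheme are illustrative choices, not from the paper; the emission densities are passed in as functions):

```python
import numpy as np

def smoothing_probs(x, Q, pi, f0, f1):
    """Normalised two-state forward-backward recursion.
    Q: 2x2 transition matrix, pi: initial distribution,
    f0, f1: emission densities (multiplicative constants cancel).
    Returns l_i = P(theta_i = 0 | X_1, ..., X_N) for each i."""
    e = np.column_stack([f0(x), f1(x)])          # emission likelihoods
    n_obs = len(x)
    alpha = np.zeros((n_obs, 2))
    beta = np.ones((n_obs, 2))
    alpha[0] = pi * e[0]
    alpha[0] /= alpha[0].sum()
    for n in range(1, n_obs):                    # forward pass
        a = (alpha[n - 1] @ Q) * e[n]
        alpha[n] = a / a.sum()
    for n in range(n_obs - 2, -1, -1):           # backward pass
        b = Q @ (beta[n + 1] * e[n + 1])
        beta[n] = b / b.sum()
    post = alpha * beta
    return (post / post.sum(axis=1, keepdims=True))[:, 0]
```

Normalising α_n and β_n at each step leaves the smoothing probabilities unchanged while avoiding numerical underflow. As a sanity check, when f_0 = f_1 the observations carry no information and each ℓ_i reduces to the stationary weight of state 0.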
Lemma 16.
For A ∈ ℕ, define

ℓ′_i(X) = Π_H(θ_i = 0 | X_{i−A}, …, X_{i+A}), A < i ≤ N − A.

For any fixed ε > 0, there exist A = A(ε) and δ_N → 0 such that

#{i : A < i ≤ N − A, |ˆℓ_i(X) − ℓ′_i(X)| > ε} ≤ N δ_N,

with probability tending to 1. A similar result holds in the limit A → ∞; see Lemma 19 below. Proof
Essentially, this is a consequence of Lemma 9 and exponential mixing, hence forgetfulness, of the Markov chain θ. Precisely, Lemma 9 tells us that there exist events G_N of probability tending to 1 on which

#{i ≤ N : |ˆℓ_i(X) − ℓ_i(X)| > ε′_N} ≤ N δ_N,

for some ε′_N →
0; in particular, ε′_N < ε/2 for N large. Next, we apply (Cappé et al., 2005, Proposition 4.3.23iii). Our Assumption B implies that Assumption 4.3.24 therein holds, so by the consequent Lemma 4.3.25 one sees that the ρ(y) in the proposition can be replaced by ρ = (1 − 2δ)/(1 − δ). Applying the proposition with j = k − A yields

|Π_H(θ_k = 0 | X_1, …, X_n) − Π_H(θ_k = 0 | X_{k−A}, …, X_n)| < ρ^A, k > A.

Any two-state Markov chain is reversible, hence by time-reversal we similarly obtain

|Π_H(θ_k = 0 | X_{k−A}, …, X_n) − Π_H(θ_k = 0 | X_{k−A}, …, X_{k+A})| < ρ^A,

and hence

|ℓ_k(X) − ℓ′_k(X)| < 2ρ^A, A < k ≤ N − A.

Choose A = A(ε) so that 4ρ^A < ε/
2; then, on G_N and for N large, an application of the triangle inequality yields

#{i : A < i ≤ N − A, |ˆℓ_i(X) − ℓ′_i(X)| > ε} ≤ N δ_N,

and the result follows.

A.2 Lemmas for Theorem 3
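The forgetting property driving Lemma 16 above (and Lemma 19 below) can also be observed numerically: the smoothing probability computed from the window X_{i−A}, …, X_{i+A}, restarting the filter from the stationary law, approaches the full-sample value geometrically fast in A. A small simulation sketch follows (all parameter values, density choices and names are illustrative assumptions, not from the paper):

```python
import numpy as np

def smoothing_probs(x, Q, pi, f0, f1):
    # normalised two-state forward-backward recursion; multiplicative
    # constants in the emission densities cancel under normalisation
    e = np.column_stack([f0(x), f1(x)])
    n_obs = len(x)
    alpha = np.zeros((n_obs, 2))
    beta = np.ones((n_obs, 2))
    alpha[0] = pi * e[0]
    alpha[0] /= alpha[0].sum()
    for n in range(1, n_obs):
        a = (alpha[n - 1] @ Q) * e[n]
        alpha[n] = a / a.sum()
    for n in range(n_obs - 2, -1, -1):
        b = Q @ (beta[n + 1] * e[n + 1])
        beta[n] = b / b.sum()
    post = alpha * beta
    return (post / post.sum(axis=1, keepdims=True))[:, 0]

# simulate a two-state HMM with (hypothetical) Gaussian emissions
rng = np.random.default_rng(0)
Q = np.array([[0.8, 0.2], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
N = 300
theta = np.zeros(N, dtype=int)
theta[0] = rng.choice(2, p=pi)
for n in range(1, N):
    theta[n] = rng.choice(2, p=Q[theta[n - 1]])
x = rng.normal(3.0 * theta, 1.0)
f0 = lambda y: np.exp(-y ** 2 / 2)           # state-0 emission, N(0, 1)
f1 = lambda y: np.exp(-(y - 3.0) ** 2 / 2)   # state-1 emission, N(3, 1)

l_full = smoothing_probs(x, Q, pi, f0, f1)

def window_error(A):
    # l'_i uses only X_{i-A}, ..., X_{i+A}, restarting the filter from pi
    diffs = [abs(smoothing_probs(x[i - A:i + A + 1], Q, pi, f0, f1)[A]
                 - l_full[i]) for i in range(A, N - A)]
    return max(diffs)

errors = {A: window_error(A) for A in (1, 15)}
```

With these assumed parameters the window error for A = 15 should be far smaller than for A = 1, in line with the geometric forgetting bound ρ^A used in the proofs.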
We may concretely define ℓ∞_i as the almost sure limit

ℓ∞_i(X) = lim_{K→∞} Π_H(θ_i = 0 | X_{−K}, …, X_K); (38)

this limit is well defined by a standard martingale convergence theorem. Lemma 17.
In the setting of Theorem 2, assume that the distribution function of the variable f_1(X_1)/f_0(X_1) is continuous and strictly increasing on (0, ∞). Then the distribution function of ℓ∞_i(X) is continuous and strictly increasing on [0, 1].

Note that atomicity of ℓ_i(X) relates to that of f_1(X_i)/f_0(X_i), rather than that of X_i itself, since for example the distribution of ℓ_1 is atomic when N = 1 if Π_H(f_1(X_1)/f_0(X_1) = c) > 0 for some c. It is therefore unsurprising that the key properties of the distribution of ℓ∞_i(X) depend on the distribution of the ratio f_1(X_i)/f_0(X_i). Proof
Let G_0 denote the distribution function of (f_1/f_0)(X) when X ∼ f_0 µ, and G_1 the distribution function of (f_1/f_0)(X) when X ∼ f_1 µ. Define the stationary filter sequence (Φ∞_i(X))_{i∈Z} by

Φ∞_i(X) := Π_H(θ_i = 0 | (X_n : n ∈ Z, n ≤ i)). (39)

Using the usual forward–backward equations, see Baum et al. (1970), and taking almost-sure limits, one obtains the following forward equation: for each i,

Φ∞_i(X) = [(1 − p)Φ∞_{i−1}(X) + q(1 − Φ∞_{i−1}(X))] f_0(X_i) / { [(1 − p)f_0(X_i) + p f_1(X_i)]Φ∞_{i−1}(X) + [q f_0(X_i) + (1 − q) f_1(X_i)](1 − Φ∞_{i−1}(X)) },

where p = Q_{01} and q = Q_{10}, leading to

Φ∞_i(X) = [(1 − p)Φ∞_{i−1}(X) + q(1 − Φ∞_{i−1}(X))] / { [1 − p + p(f_1/f_0)(X_i)]Φ∞_{i−1}(X) + [q + (1 − q)(f_1/f_0)(X_i)](1 − Φ∞_{i−1}(X)) }. (40)

That is, if we define A(Φ) = (1 − p)Φ + q(1 − Φ), then

Φ∞_i(X) = A(Φ∞_{i−1}(X)) / { A(Φ∞_{i−1}(X)) + (f_1/f_0)(X_i)(1 − A(Φ∞_{i−1}(X))) }. (41)

Since, conditional on Φ∞_{i−1}(X), X_i has distribution [A(Φ∞_{i−1}(X)) f_0(x) + (1 − A(Φ∞_{i−1}(X))) f_1(x)] µ, we deduce that (Φ∞_i(X))_{i∈Z} is a stationary Markov chain with transition kernel K(Φ, dΦ′) given by

K(Φ, dΦ′) = ∫ δ_{g(Φ,x)}(dΦ′) [(Φ(1 − p) + (1 − Φ)q) f_0(x) + (Φp + (1 − Φ)(1 − q)) f_1(x)] dµ(x) = ∫ δ_{g(Φ,x)}(dΦ′) [A(Φ) f_0(x) + (1 − A(Φ)) f_1(x)] dµ(x),

where

g(Φ, x) = A(Φ) / { A(Φ) + (f_1/f_0)(x)(1 − A(Φ)) }.

Then, for each t ∈ (0, 1),

Π_H(Φ∞_i(X) ≤ t | Φ∞_{i−1}(X)) = Π_H( (f_1/f_0)(X_i) ≥ [A(Φ∞_{i−1}(X)) / (1 − A(Φ∞_{i−1}(X)))] (1/t − 1) | Φ∞_{i−1}(X) ).

Recall that π_0 G_0 + π_1 G_1 is assumed to be continuous and strictly increasing on (0, +∞), and that π_0 > π_1 >
0, so that G_0 and G_1 are both continuous, and on the set where G_0 is not strictly increasing, G_1 is strictly increasing, and vice versa. We deduce that

Π_H(Φ∞_i(X) ≤ t | Φ∞_{i−1}(X)) = A(Φ∞_{i−1}(X)) [1 − G_0( [A(Φ∞_{i−1}(X)) / (1 − A(Φ∞_{i−1}(X)))] (t^{−1} − 1) )] + (1 − A(Φ∞_{i−1}(X))) [1 − G_1( [A(Φ∞_{i−1}(X)) / (1 − A(Φ∞_{i−1}(X)))] (t^{−1} − 1) )].

Then Φ∞_i(X) has, conditionally on Φ∞_{i−1}(X), a continuous and strictly increasing distribution function on (0, 1), and the same is then true unconditionally of Φ∞_i(X), since for all t,

Π_H(Φ∞_i(X) ≤ t) = E_H[ Π_H(Φ∞_i(X) ≤ t | Φ∞_{i−1}(X)) ].

That is, Φ∞_i(X) has (conditionally on Φ∞_{i−1}(X) and unconditionally) no atoms and support (0, 1). Next, for each i,

ℓ∞_i(X) = (1 − p)Φ∞_i(X) ℓ∞_{i+1}(X) / A(Φ∞_i(X)) + p Φ∞_i(X)(1 − ℓ∞_{i+1}(X)) / (1 − A(Φ∞_i(X))). (42)

Let C(Φ) = Φ(1 − Φ) / [A(Φ)(1 − A(Φ))] and notice that for any p, q ∈ (0,
1) there exists a = a(p, q) < 1 such that for all Φ ∈ (0, 1), |1 − p − q| C(Φ) ≤ a. Then an easy recursion yields

ℓ∞_i(X) = p Φ∞_i(X) / (1 − A(Φ∞_i(X))) + Σ_{k≥1} (1 − p − q)^k C(Φ∞_i(X)) C(Φ∞_{i+1}(X)) ⋯ C(Φ∞_{i+k−1}(X)) p Φ∞_{i+k}(X) / (1 − A(Φ∞_{i+k}(X))).

Indeed, since for any Φ ∈ (0, 1), |1 − p − q| C(Φ) ≤ a(p, q) <
1, the series converges almost surely. We see that for each i, ℓ∞_i(X) is a function of (Φ∞_k(X))_{k≥i}, and we have

ℓ∞_i(X) = p Φ∞_i(X) / (1 − A(Φ∞_i(X))) + (1 − p − q) C(Φ∞_i(X)) ℓ∞_{i+1}(X).

It follows that for all t,

Π_H(ℓ∞_i(X) ≤ t | Φ∞_{i−1}(X)) = E_H[ Π_H( (1 − p − q) C(Φ∞_i(X)) ℓ∞_{i+1}(X) ≤ t − p Φ∞_i(X) / (1 − A(Φ∞_i(X))) | Φ∞_i(X) ) | Φ∞_{i−1}(X) ]. (43)

Define the function F_ℓ by

F_ℓ(t; Φ∞_{i−1}(X)) = Π_H(ℓ∞_i(X) ≤ t | Φ∞_{i−1}(X));

note that by stationarity F_ℓ does not depend on i. Then by (43), if (1 − p − q) >
0, we have

F_ℓ(t; Φ∞_{i−1}(X)) = E_H[ F_ℓ( [(1 − p − q) C(Φ∞_i(X))]^{−1} (t − p Φ∞_i(X) / (1 − A(Φ∞_i(X)))); Φ∞_i(X) ) | Φ∞_{i−1}(X) ];

that is, for any t and any Φ ∈ (0, 1),

F_ℓ(t; Φ) = ∫ F_ℓ( [(1 − p − q) C(x)]^{−1} (t − p x / (1 − A(x))); x ) K(Φ, dx). (44)

Similarly, if (1 − p − q) <
0, defining the function ˜F_ℓ by ˜F_ℓ(t, Φ) = lim_{s→t, s<t} F_ℓ(s; Φ), one obtains an analogous identity (45). The absence of atoms and the full support of Φ∞_i(X) on (0, 1) (both conditionally on Φ∞_{i−1}(X) and unconditionally) implies, together with equations (44) and (45), that whatever the sign of (1 − p − q), the function t ↦ E_H[F_ℓ(t; Φ∞_{i−1}(X))] is continuous and strictly increasing, which is to say that the distribution function of ℓ∞_i(X) is continuous and strictly increasing.

Lemma 18. Under the conditions of Theorem 3, writing ℓ∞_i(X) = Π_H(θ_i = 0 | (X_n)_{n∈Z}), the function m defined by m(λ) = E[ℓ∞_i(X) | ℓ∞_i(X) < λ] is continuous and strictly increasing on (0, 1], and m(λ) < λ for all λ ∈ (0, 1].

Proof For any random variable U and any a < b such that P(U < a) > 0, we have

E[U | U < b] = E[U | U < a] P(U < a | U < b) + E[U | a ≤ U < b] P(U ≥ a | U < b) = E[U | U < a](1 − P(U ≥ a | U < b)) + E[U | a ≤ U < b] P(U ≥ a | U < b),

hence

E[U | U < b] − E[U | U < a] = [P(a ≤ U < b) / P(U < b)] (E[U | a ≤ U < b] − E[U | U < a]). (46)

Note now that E[U | U < a] < a: indeed, if V is distributed as (U − a) | U < a, then V ≤ 0 and V is strictly negative with positive probability, hence E[V] < 0. [For U = ℓ∞_i this yields that m(λ) < λ as claimed.] We similarly note that E[U | a ≤ U < b] ≥ a, so that, using also that U is bounded, so that E[U | U < a] ≥ −c for some c < ∞,

0 < E[U | a ≤ U < b] − E[U | U < a] < b + c.

Consequently, returning to (46), to see that E[U | U < x] is strictly increasing on {x : P(U < x) > 0} it suffices to show that P(a ≤ U < b) > 0 for all such a < b, and to show that it is continuous it suffices to show that P(a ≤ U < b) → 0 as b − a → 0. Taking U = ℓ∞_i, we conclude by Lemma 17, which tells us that the distribution function of ℓ∞_i is continuous and strictly increasing, and also implies that Π_H(ℓ∞_i < λ) > 0 for all λ > 0.

Lemma 19. Recall the definition ℓ∞_i(X) = Π_H(θ_i = 0 | (X_n : n ∈ Z)).
There exist δ_N, ξ_N, ξ′_N → 0 such that, with probability tending to 1,

#{i : 1 ≤ i ≤ N, |ℓ_i(X) − ℓ∞_i(X)| > ξ_N} ≤ N δ_N,
#{i : 1 ≤ i ≤ N, |ˆℓ_i(X) − ℓ∞_i(X)| > ξ′_N} ≤ N δ_N.

Proof Define ℓ′_i(X) = Π_H(θ_i = 0 | X_{i−A_N}, …, X_{i+A_N}). As in Lemma 16, we may argue using (Cappé et al., 2005, Proposition 4.3.23iii) that, for a suitable sequence A_N → ∞ satisfying A_N/N → 0,

#{i ≤ N : |ℓ_i(X) − ℓ′_i(X)| > 2ρ^{A_N}} ≤ 2A_N.

Recalling from (38) that ℓ∞_i(X) is formally defined as an almost sure limit of ℓ′_i(X) as A_N → ∞, so that ℓ′_i → ℓ∞_i in probability also, this proves the first bound. The second bound follows similarly after an appeal to Lemma 9.

Lemma 20 (Ergodic theorems). The sequences ℓ′_i and ℓ∞_i, defined for A ∈ ℕ by

ℓ′_i(X) = Π_H(θ_i = 0 | X_{i−A}, …, X_{i+A}), A < i ≤ N − A,
ℓ∞_i(X) = Π_H(θ_i = 0 | (X_n : n ∈ Z)),

are ergodic, so that for any bounded function g,

(1/N) Σ_{i=1}^N g(ℓ′_i) → E_π[g(ℓ′)], a.s. (hence also in probability),

and similarly for ℓ∞_i.

Proof These are standard ergodicity results for functions of Markov chains; see for example (Durrett, Chapter 6). In the case of ℓ′_i one can also note that g(ℓ′_i(X)) is a function of the Markov chain (θ_{i−A}, …, θ_{i+A}, X_{i−A}, …, X_{i+A}) to reduce to the ergodic theorem for Markov chains themselves.

Lemma 21. In the setting of Theorem 2, define the class (ϕ_{λ,H} : λ ∈ [0, 1]) as in (8), and define the mTDR and mFDR as in (15) and (16). Then for each λ ∈ (0, 1] we have

mTDR_H(ϕ_{λ,H}) = sup{mTDR_H(ψ) : mFDR_H(ψ) ≤ mFDR_H(ϕ_{λ,H})}.

Remarks. i. A version of this result in the HMM setting originates in Sun and Cai (2009), but to avoid a monotonicity property needed therein we instead adapt the proof of (Rebafka et al., 2019, Lemma 9.2) (see also the proof of (Cai et al., Theorem 1)).
The proof is valid for ℓ-value procedures in any (correctly specified) model, not just the hidden Markov model (1).

ii. The result does not in general hold for λ = 0, since mFDR_H(ψ) = 0 whenever E_H[Σ_{i≤N} ℓ_i(X)ψ_i(X)] = 0, so that if Π_H(ℓ_i(X) = 0) > 0, the test ψ defined by ψ_i(X) = 1{ℓ_i(X) = 0} has mFDR_H(ψ) = 0 and mTDR_H(ψ) = Π(ℓ_i(X) = 0 | θ_i = 1) > 0, which may exceed mTDR_H(ϕ_{0,H}).

iii. In general, {mFDR_H(ϕ_{λ,H}) : λ ∈ [0, 1]} is a proper subset of [0, 1], so that ϕ_{λ,H} need not be optimal for every threshold. In particular, the supremum of the set is generally strictly smaller than one, and, especially in discrete data settings, there may be jump discontinuities in the function λ ↦ mFDR_H(ϕ_{λ,H}). The first of these does not cause any issues, since mTDR_H(ϕ_{1,H}) = 1 = sup_ψ mTDR_H(ψ), while Lemma 22 overcomes the issues raised in the second case in the setting of Theorem 3.

Proof Fix λ > 0; write ϕ for ϕ_{λ,H} and let a = mFDR_H(ϕ). Observe that for any multiple testing procedure ψ, mFDR_H(ψ) ≤ a if and only if

E_H Σ_{i≤N} (ℓ_i − a) ψ_i ≤ 0,

with equality in one implying equality in the other. It follows that if mFDR_H(ψ) ≤ a then

E_H Σ_{i≤N} (ℓ_i − a)(ϕ_i − ψ_i) ≥ 0. (47)

We note also that a < λ. Indeed, if ϕ ≡ 0 almost surely then this is true by definition (recall the convention that 0/0 = 0 in the definition (16) of the mFDR). Otherwise, there exists k such that ϕ_k = 1 (and hence ℓ_k < λ) with positive probability; then U = (ℓ_k − λ)ϕ_k satisfies U ≤ 0 and U < 0 with positive probability, which together imply that E[U] < 0, so that

E_H Σ_{i≤N} (ℓ_i − λ) ϕ_i < 0.

We next show that, for all i,

(ℓ_i − a)(ϕ_i − ψ_i) ≤ [(λ − a)/(1 − λ)] (1 − ℓ_i)(ϕ_i − ψ_i).
(48) Indeed, if ϕ_i = 1, then ℓ_i < λ, so that

ℓ_i − a < [(1 − ℓ_i)/(1 − λ)] (λ − a),

and multiplying by ϕ_i − ψ_i ≥ 0 yields (48). If ϕ_i = 0, then ℓ_i ≥ λ > a, so that

ℓ_i − a ≥ [(1 − ℓ_i)/(1 − λ)] (λ − a),

and multiplying by ϕ_i − ψ_i ≤ 0 again yields (48). Recall a < λ < 1, so that also (1 − λ)/(λ − a) > 0; we deduce from (47) and (48) that

E_H Σ_{i≤N} (1 − ℓ_i)(ϕ_i − ψ_i) ≥ 0.

Finally, by definition,

mTDR_H(ϕ) = E[Σ_{i≤N} (1 − ℓ_i)ϕ_i] / (N π_1), mTDR_H(ψ) = E[Σ_{i≤N} (1 − ℓ_i)ψ_i] / (N π_1),

hence mTDR_H(ϕ) ≥ mTDR_H(ψ) as claimed.

Lemma 22. In the setting of Theorem 3, define the map

g : x ↦ sup{mTDR(ψ) : mFDR(ψ) ≤ x},

where the supremum is taken over multiple testing procedures ψ. Then for sequences x_N, y_N such that |x_N − y_N| → 0, we have |g(x_N) − g(y_N)| → 0 as N → ∞. [Note that g depends implicitly on N, so that this does not simply say that g is continuous.]

Proof Prompted by Lemma 21, we focus on tests ψ of the form ϕ_{λ,H}, λ ∈ [0, 1], and define, for N ≥ 1,

λ_N = sup{λ : mFDR_H(ϕ_{λ,H}) ≤ x_N}, µ_N = sup{λ : mFDR_H(ϕ_{λ,H}) ≤ y_N}.
Then E X i ≤ N ( ℓ i − a ) { ℓ i < λ ′ } = E X i ≤ N ( ℓ i − a ) { ℓ i < λ } + E X i ≤ N ( ℓ i − a ) { λ ≤ ℓ i < λ ′ } . The first term on the right equals zero, and we show that the second is strictly positive, forlarge N . Indeed, by Lemma 19 there exists a sequence ξ N → E { i : | ℓ i − ℓ ∞ i | >ξ N } /N → N → ∞ ; then the term in question is lower bounded by ( λ − a ) multipliedby E { i : λ + ξ N ≤ ℓ ∞ i < λ ′ − ξ N } − E { i : | ℓ i − ℓ ∞ i | > ξ N } . Lemma 17 tells us that under the assumptions of Theorem 3 the distribution function of ℓ ∞ i is strictly increasing, so that for N large enough that λ + ξ N < λ ′ − ξ N the first termon the right of the latest display is of order N and the second is of smaller order, so thatindeed the difference is positive, proving (49).Finally we prove that, as a consequence of the fact that | λ N − µ N | → 0, we have | mTDR H ( ϕ λ N ,H ) − mTDR H ( ϕ µ N ,H ) | → 0. Since also | ˜ λ N − λ N | → | ˜ µ N − µ N | → 0, thesame proof will imply that each of mTDR H ( ϕ λ N ,H ) , mTDR H ( ϕ ˜ λ N ,H ), mTDR H ( ϕ µ N ,H )and mTDR H ( ϕ ˜ µ N ,H ) differ by at most o (1), allowing us to conclude.Assume for notational convenience that λ N ≥ µ N . The denominator in the expressionsdefining each of the mTDR’s is E { i : θ i = 1 } = N π , and we see thatmTDR H ( ϕ λ N ,H ) = mTDR H ( ϕ µ N ,H ) + E { i : θ i = 1 , µ N ≤ ℓ i < λ N } N π . ultiple Testing in Nonparametric HMMs As used above, by Lemma 19 there exists a sequence ξ N → E { i : | ℓ i − ℓ ∞ i | >ξ N } /N → N → ∞ . Lemma 17 tells us that the distribution function of ℓ ∞ i is continuous– and hence uniformly continuous – and we see that N − E { i : θ i = 1 , µ N ≤ ℓ i < λ N }≤ Π H ( µ N − ξ N ≤ ℓ ∞ < λ N + ξ N ) + N − E { i : | ℓ i − ℓ ∞ i | > ξ N } → , as N → ∞ , proving the claim. Appendix B. Auxiliary Results for the upper bounds of Section 3 B.1 Well-definedness of the EstimatorsLemma 23. 
In the setting of Theorem 5, there exist (h_l)_{l∈ℕ} (not depending on H), uniformly supremum-norm bounded, such that O_L = (E[h_l(X_1) | θ_1 = j])_{l≤L, j≤J} ∈ R^{L×J} satisfies σ_J(O_L) ≥ C, uniformly in L ≥ L_0, for some C, L_0 depending on the parameters f_j, j ≤ J.

Proof For L > L′, σ_J(O_L) ≥ σ_J(O_{L′}) because O_{L′} is a submatrix of O_L, see e.g. (Stewart and Sun, 1990, Chapter 1, Theorem 4.4). So it suffices to show that σ_J(O_L) > 0 for some L. Choose a countable family of sets A = {A_1, A_2, …} generating the Borel σ–algebra on R, for example A = {(−∞, q) : q ∈ Q}, and let h_l = 1_{A_l}. Suppose for a contradiction that σ_J(O_L) = 0 for all L ∈ ℕ, or, put another way, that the J vectors (⟨h_l, f_j⟩)_{l≤L} ∈ R^L, j ≤ J, are linearly dependent for all L ∈ ℕ, so that there exist a_{L,1}, …, a_{L,J} ∈ [−1, 1] for which Σ_j |a_{L,j}| = 1 and Σ_j a_{L,j} ⟨h_l, f_j⟩ = 0 for all l ≤ L. By Bolzano–Weierstrass, there is a sequence L_n → ∞ such that, for each j ≤ J, a_{L_n,j} converges to some a_{∞,j}, and note that necessarily (a_{∞,j})_{j≤J} is not the zero vector. For each l ∈ ℕ, we have that

⟨h_l, Σ_{j≤J} a_{∞,j} f_j⟩ = Σ_{j≤J} a_{∞,j} ⟨h_l, f_j⟩ = lim_{n→∞} Σ_{j≤J} a_{L_n,j} ⟨h_l, f_j⟩ = 0.

Since {h_l : l ∈ ℕ} generates the Borel σ–algebra, it follows that Σ_j a_{∞,j} f_j corresponds to the zero measure hence, since it is a continuous function, is the zero function, contradicting that the functions f_j, j ≤ J, are linearly independent.

Lemma 24. Under the assumptions of Theorem 5, define ˆP and (ˆM_x, ˆB_x, x ∈ R) as in Algorithm 1 for L such that L ≍ (N/log N)^{1/(1+2s)}. Then

a. The map x ↦ ˆM_x is continuous. For any κ > 0, there exists c = c(κ, H) such that the event

A = {‖ˆP − P‖ ≤ cLr_N, sup_{x∈R} ‖ˆM_x − M_x‖ ≤ cLr_N}

(is measurable and) has probability at least 1 − N^{−κ} for N large.

b.
On A, for N large enough, ˆP has rank J, and the matrices

˜B_x = (ˆV^⊺PˆV)^{−1} ˆV^⊺M_xˆV, x ∈ R, (50)

are well defined.

c. On A, for some C > 0 depending on both the constant c of A and on H, we have for N large enough

sup_{x∈R} max(‖ˆB_x‖, ‖˜B_x‖) ≤ CL^{1/2}, (51)
sup_{x∈R} ‖˜B_x − ˆB_x‖ ≤ CLr_N. (52)

Proof Lemma 28 and Lemma 29 together tell us that, for suitable c = c(κ, H),

Π(‖ˆP − P‖ ≤ cLr_N, sup_{x∈Q} ‖ˆM_x − M_x‖ ≤ cLr_N) ≥ 1 − N^{−κ}.

[In fact a union bound yields this with 2N^{−κ} in place of N^{−κ}, but the factor 2 can be removed by initially considering some κ′ > κ.] We prove the claimed continuity of the map x ↦ ˆM_x; it will follow that

{sup_{x∈Q} ‖ˆM_x − M_x‖ ≤ cLr_N} = {sup_{x∈R} ‖ˆM_x − M_x‖ ≤ cLr_N},

which implies measurability and the probability bound for A. This continuity results from the assumed Lipschitz continuity of K. Indeed, if Λ is the Lipschitz constant for K, observe that if |x − y| < δ then for any n,

|K_L(x, X_{n+1}) − K_L(y, X_{n+1})| ≤ sup_{t∈R} |K_L(x, t) − K_L(y, t)| ≤ sup_{|u−v|<Lδ} L|K(u) − K(v)| ≤ L²Λδ,

hence, for some C = C(H),

‖ˆM_x − ˆM_y‖ ≤ L max_l ‖h_l‖_∞ max_{n≤N} |K_L(x, X_{n+1}) − K_L(y, X_{n+1})| ≤ CL³|x − y|.

Next, in view of the assumption on O made in the algorithm, Lemma 33 implies that σ_J(P) is bounded away from zero for large N, and consequently, by Lemma 34a, on A and for N large we have that ˆP is of rank J and that ˆV^⊺PˆV is invertible (recall that Lr_N → 0, so that on A, ‖ˆP − P‖ < σ_J(P)/4 for N large). Then ˜B_x is well defined for each x ∈ R and can be expressed as (QO^⊺ˆV)^{−1} D_x QO^⊺ˆV. It follows, using Lemma 34b and eq. (27), that on A, for a constant c = c(H) and any x ∈ R, we have

‖˜B_x‖ ≤ κ(QO^⊺ˆV) max_j |K_L[f_j](x)| ≤ cL^{1/2}

for N large. Finally, Lemma 34c tells us that on A, for any x ∈ R and for N large enough that cLr_N < σ_J(P)/4, we have ‖˜B_x − ˆB_x‖ ≤
C [ ‖ˆM_x − M_x‖/σ_J(P) + ‖M_x‖ ‖ˆP − P‖/σ_J(P)² ], ∀x ∈ R.

Noting that ‖M_x‖ ≤ cL for some c = c(H) by Lemma 33, we deduce (52). The bound for ‖ˆB_x‖ then follows from the bound for ‖˜B_x‖ by the triangle inequality.

Lemma 25. Recall that sep(B) denotes the eigen-separation of a matrix B, in that if B has eigenvalues λ_1, …, λ_J then sep(B) = min_{j≠j′} |λ_j − λ_j′|. On the event A of Lemma 24, let L ≍ (N/log N)^{1/(1+2s)} and define B_{a,u} ≡ ˜B_{a,u} as in Algorithm 1 for V = ˆV:

˜B_{a,u} = Σ_i a_i ˜B_{u_i}, ˜B_x = (ˆV^⊺PˆV)^{−1} ˆV^⊺M_xˆV.

Let D⁰_N be an increasing sequence of finite sets consisting of dyadic rationals whose union ∪_N D⁰_N is dense in R. Define

D_N = {(a, u) ∈ (D⁰_N)^{J(J−1)/2} × (D⁰_N)^{J(J−1)/2} : Σ_i |a_i| ≤ 1}.

Then there exists a constant c, depending only on f_1, …, f_J and positive when they are all distinct, such that, on A,

max{sep(˜B_{a,u}) : (a, u) ∈ D_N} ≥ c,

for all N large.

Remark. Recall, as remarked after Algorithm 1, that proving this result for V = ˆV implies it holds for any V such that B_x = (V^⊺PV)^{−1}(V^⊺M_xV) is well-defined.

Proof In view of Lemma 11, ˜B_{a,u}, being a linear combination of simultaneously diagonalisable matrices, is diagonalisable for any a, u, with eigenvalues

(Σ_i a_i K_L[f_j](u_i))_{j≤J}.

Recall that ‖K_L[f_j] − f_j‖_∞ → 0 for L = L(N) → ∞ by (26). It follows by the triangle inequality that

max_{D_N} |Σ_i a_i (K_L[f_j](u_i) − f_j(u_i))| → 0,

hence

max_{D_N} sep(˜B_{a,u}) = max_{D_N} min_{j≠j′} |Σ_i a_i K_L[f_j − f_j′](u_i)| > (1/2) max_{D_N} min_{j≠j′} |Σ_i a_i (f_j(u_i) − f_j′(u_i))|, (53)

for N large, provided this latter quantity is strictly positive. Next, let U_N be a sequence of sets, increasing to R, such that sup_{u∈U_N} min_{d∈D⁰_N} |u − d| → 0.
Observe that, since $f \in C^s(\mathbb{R})$, the map
$$(a, u) \mapsto \min_{j \ne j'} \Big| \sum_i a_i \big( f_j(u_i) - f_{j'}(u_i) \big) \Big|$$
is uniformly continuous on $\mathbb{R}^{J(J-1)/2} \times \mathbb{R}^{J(J-1)/2}$, so that
$$\max_{(a,u) \in \mathcal{D}_N} \min_{j \ne j'} \Big| \sum_i a_i \big( f_j(u_i) - f_{j'}(u_i) \big) \Big| \ge \frac 1 2 \sup_a \sup_{u \in U_N} \min_{j \ne j'} \Big| \sum_i a_i \big( f_j(u_i) - f_{j'}(u_i) \big) \Big| \qquad (54)$$
for $N$ large, provided this latter quantity is strictly positive. The supremum on the right can be extended: while at first we must take the supremum over $a$ such that $\sum |a_i| \le 1$ and over $u \in U_N$, the result remains true taking the supremum instead over all $u \in \mathbb{R}^{J(J-1)/2}$, using that $f_j(u) \to 0$ as $u \to \pm\infty$. We now prove that
$$\sup_{a,u} \min_{j \ne j'} \Big| \sum_i a_i \big( f_j(u_i) - f_{j'}(u_i) \big) \Big| > 0.$$
Choose for each pair $j \ne j'$ some $x \in \mathbb{R}$ such that $f_j(x) \ne f_{j'}(x)$, and collect these $x$ into the vector $u$. For each $j \ne j'$, the set $\{ v \in \mathbb{R}^{J(J-1)/2} : \langle v, (f_j(u_i) - f_{j'}(u_i))_i \rangle = 0 \}$ is a proper subspace of $\mathbb{R}^{J(J-1)/2}$, so the union over these $J(J-1)/2$ subspaces is a strict subset of $\mathbb{R}^{J(J-1)/2}$ (for example, it has Lebesgue measure zero) and we may choose $a$ in the complement of the union. Scale invariance means that moreover we may assume $a$ satisfies $\sum_i |a_i| = 1$. Then $|\sum_i a_i (f_j(u_i) - f_{j'}(u_i))| > 0$ for all $j \ne j'$, as required.

Finally, combining also with (53) and (54), we deduce that
$$\max\{ \mathrm{sep}(\tilde B_{a,u}) : (a,u) \in \mathcal{D}_N \} \ge \frac 1 4 \sup_{a,u} \min_{j \ne j'} \Big| \sum_i a_i \big( f_j(u_i) - f_{j'}(u_i) \big) \Big| > 0,$$
concluding the proof. □

Lemma 26. In the setting of Theorem 5, let $A$ be the event of Lemma 24. Define $\hat B_x = \hat B_{x, L_1, L}$ and $\hat B_{a,u}$ as in Algorithm 1 for $2^L \asymp (N/\log N)^{1/(1+2s)}$. There exists a constant $c = c(H) > 0$ such that on the event $A$ we have
$$\mathrm{sep}(\hat B_{\hat a, \hat u}) > c, \qquad (55)$$
for $N$ large, and, defining $\tilde B_{a,u} = \sum a_i \tilde B_{u_i}$ for $\tilde B_x$ as in (50), we also have
$$\mathrm{sep}(\tilde B_{\hat a, \hat u}) > c. \qquad (56)$$
Note that (55) implies in particular that $\hat B_{\hat a, \hat u}$ has $J$ distinct eigenvalues and so is diagonalisable.

Proof By Lemma 24, on $A$ the matrices $\hat B_x, \tilde B_x$ are well-defined and bounded up to a constant by $L_1^{1/2}$, and satisfy for some $C = C(H)$
$$\sup_x \|\tilde B_x - \hat B_x\| \le C L_1^2 r_N, \qquad \sup_x \max\big(\|\tilde B_x\|, \|\hat B_x\|\big) \le C L_1^{1/2}.$$
By the triangle inequality, we deduce that
$$\|\hat B_{a,u}\| \le \sum |a_i| \|\hat B_{u_i}\| \le \sup_x \|\hat B_x\| \le C L_1^{1/2},$$
and similarly $\|\tilde B_{a,u}\| \le C L_1^{1/2}$. Let $(a_N, u_N) \in \mathrm{argmax}_{\mathcal{D}_N}(\mathrm{sep}(\tilde B_{a,u}))$ and recall by assumption that $\mathrm{sep}(\tilde B_{a_N, u_N}) > c$ uniformly in $N$ large enough, for some $c > 0$. [Recall that taking $V = \hat V$ in Algorithm 1, and hence replacing the $B_x$ defined therein with $\tilde B_x$, is valid on $A$.] We apply the Ostrowski–Elsner theorem (Theorem 35) to $A = \hat B_{a,u}$, $B = \tilde B_{a,u}$ to see, for a constant $C = C(H)$, that for any $a, u$ we have
$$\min_\tau \max_j |\lambda_{\tau(j)}(\tilde B_{a,u}) - \lambda_j(\hat B_{a,u})| \le C L_1^{(J-1)/(2J)} (L_1^2 r_N)^{1/J},$$
where $\lambda_j$, $j \le J$, are maps taking matrices to their eigenvalues. This last expression tends to zero as $N \to \infty$ (since by assumption $L_1^{(J+3)/2} r_N \to 0$), and in particular it is smaller than $\mathrm{sep}(\tilde B_{a_N, u_N})/5$ for $N$ large.

By the triangle inequality we deduce that on $A$,
$$\mathrm{sep}(\hat B_{a_N, u_N}) \ge \mathrm{sep}(\tilde B_{a_N, u_N}) - 2 \sup_{a,u} \min_\tau \max_j |\lambda_{\tau(j)}(\tilde B_{a,u}) - \lambda_j(\hat B_{a,u})| \ge (3/5)\, \mathrm{sep}(\tilde B_{a_N, u_N}).$$
It follows by definition of $\hat a, \hat u$ that
$$\mathrm{sep}(\hat B_{\hat a, \hat u}) \ge \mathrm{sep}(\hat B_{a_N, u_N}) \ge (3/5)\, \mathrm{sep}(\tilde B_{a_N, u_N}),$$
proving (55). Applying the triangle inequality again, we conclude that
$$\mathrm{sep}(\tilde B_{\hat a, \hat u}) \ge (1/5)\, \mathrm{sep}(\tilde B_{a_N, u_N}),$$
proving (56). □

B.2 Concentration of Empirical Estimators

We note the following concentration results for Markov chains, adapted as in (De Castro et al., 2016, Proposition 13) from results of Paulin (2015), which will allow us to control the errors of the empirical estimators $\hat P$ and $\hat M_x$.
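As a concrete illustration of what these empirical estimators compute, the following minimal sketch simulates a two-state HMM and forms $\hat P$ and $\hat M_x$ in the spirit of Algorithm 1. All specifics here (the transition matrix, Gaussian emissions, interval indicators playing the role of the $h_l$, and a triangular kernel) are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-state HMM: transition matrix Q with stationary law pi,
# Gaussian emissions f_0 = N(0,1) and f_1 = N(3,1). (Made-up parameters.)
Q = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])  # stationary distribution of Q
N = 5000
theta = np.empty(N + 2, dtype=int)
theta[0] = rng.choice(2, p=pi)
for n in range(1, N + 2):
    theta[n] = rng.choice(2, p=Q[theta[n - 1]])
X = rng.normal(loc=3.0 * theta, scale=1.0)

# Bounded "witness" functions h_1, ..., h_{L1}: here, interval indicators.
edges = [-np.inf, 0.0, 1.5, 3.0, np.inf]
H = np.stack([((X >= a) & (X < b)).astype(float)
              for a, b in zip(edges[:-1], edges[1:])])  # shape (L1, N+2)
L1 = H.shape[0]

# hat P: entries (1/N) sum_n h_i(X_n) h_j(X_{n+2}).
P_hat = H[:, :N] @ H[:, 2:N + 2].T / N

# hat M_x: entries (1/N) sum_n h_i(X_n) K_L(x, X_{n+1}) h_j(X_{n+2}),
# with a dyadic kernel K_L(x, y) = 2^L K(2^L (x - y)), K triangular.
def K_L(x, y, L=3):
    u = 2.0 ** L * (x - y)
    return 2.0 ** L * np.maximum(1.0 - np.abs(u), 0.0)

def M_hat(x):
    w = K_L(x, X[1:N + 1])
    return (H[:, :N] * w) @ H[:, 2:N + 2].T / N

M0 = M_hat(0.0)
```

Since the indicators $h_l$ sum to one pointwise, the entries of $\hat P$ sum to exactly 1, and $M_0$ has nonnegative entries summing to a kernel density estimate of $f_\pi(0)$; the concentration results below quantify how fast such matrices converge to their population counterparts.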
The pseudo-spectral gap of a chain is defined in Paulin (2015), wherein it is noted that its reciprocal is equivalent to the mixing time. The bracketing numbers $N_{[]}(\mathcal{T}, \|\cdot\|_{L^2(P)}, \varepsilon)$ are defined as the smallest number of pairs of functions $(\underline f, \bar f)$ of $L^2(P)$-distance at most $\varepsilon$ such that every $g \in \mathcal{T}$ is bracketed by one of the pairs, where $(\underline f, \bar f)$ brackets $g$ if $\underline f \le g \le \bar f$ pointwise.

Lemma 27. Let $Y$ be a stationary Markov chain taking values in $\mathcal{Y}$ with pseudo-spectral gap $\gamma_{ps} > 0$, with law denoted $P$. Let $\mathcal{T}$ be some countable class of real-valued measurable functions on $\mathcal{Y}$. Assume there exist $\sigma, b > 0$ such that for all $t \in \mathcal{T}$, $\|t\|_{L^2(P)} \le \sigma$ and $\|t\|_\infty \le b$. Suppose that the $L^2(P)$ bracketing entropy
$$H_{[]}(\mathcal{T}, \|\cdot\|_{L^2(P)}, \varepsilon) := \log N_{[]}(\mathcal{T}, \|\cdot\|_{L^2(P)}, \varepsilon)$$
is upper bounded by some $\bar H(\varepsilon)$, achievable using brackets of $L^\infty$-diameter at most $b$. Then for fixed $t \in \mathcal{T}$ we have
$$P\Big( \Big| \sum_{n=1}^N \big(t(Y_n) - E t(Y_1)\big) \Big| \ge x \Big) \le 2 \exp\Big( - \frac{x^2 \gamma_{ps}}{8(N + 1/\gamma_{ps})\sigma^2 + 20 b x} \Big), \qquad (57)$$
and there exists $C > 0$ depending only on a lower bound for $\gamma_{ps}$ such that
$$P\Big( \sup_{t \in \mathcal{T}} \sum_{n=1}^N \big(t(Y_n) - E t\big) \ge C [A + \sigma \sqrt{N x} + b x] \Big) \le \exp(-x), \qquad (58)$$
where
$$A = \sqrt{N} \int_0^\sigma \sqrt{\bar H(u) \wedge N}\, \mathrm{d}u + (b + \sigma)\bar H(\sigma).$$

Proof For the first claim, see (Paulin, 2015, Theorem 3.4) (but note there is an updated version of the paper on arXiv). For the second, observe that the proof of the same theorem gives the following bound for the Laplace transform of $S = \sum (t(Y_n) - Et)/b$:
$$E \exp(\lambda S) \le \exp\Big( \frac{2(N + 1/\gamma_{ps})(\sigma^2/b^2)}{\gamma_{ps}} \,\lambda^2 \Big(1 - \frac{10\lambda}{\gamma_{ps}}\Big)^{-1} \Big). \qquad (59)$$
One now appeals to (Massart, 2007, Theorem 6.8) and the consequent Corollary 6.8. While the theorem is stated for independent random variables, the proof uses this condition only when applying Lemma 6.6 of the same reference, a version of which holds also in the current setting thanks to (59). □

Lemma 28.
In the setting of Theorem 5, and defining $P, \hat P$ as in Algorithm 1, for any $\kappa > 0$ there exists $C = C(\kappa, H)$ such that
$$\Pi_H\Big( \|\hat P - P\| > C L_1 (N/\log N)^{-1/2} \Big) \le N^{-\kappa}.$$

Proof Noting that $Y_n = (X_n, X_{n+1}, X_{n+2}, \theta_n, \theta_{n+1}, \theta_{n+2})$ defines a stationary Markov chain, we apply (57) to deduce that
$$\Pi\Big( \Big| \frac 1 N \sum_{n=1}^N h_{ij}(Y_n) - E[h_{ij}] \Big| > C \Big(\frac{\log N}{N}\Big)^{1/2} \Big) \le 2\exp\Big( - \frac{C^2 \gamma_{ps} N \log N}{8(N + 1/\gamma_{ps}) \mathrm{Var}_\pi(h_{ij}) + 20\, C (N \log N)^{1/2} \|h_{ij}\|_\infty} \Big),$$
where $h_{ij}(Y_n) = h_i(Y_{n,1}) h_j(Y_{n,3})$ and where $\gamma_{ps}$ is the pseudo-spectral gap of the chain $(Y_n)$. We note that $\mathrm{Var}_\pi(h_{ij}) \le \|h_{ij}\|_\infty^2 \le \|h_i\|_\infty^2 \|h_j\|_\infty^2$ is bounded by assumption. The pseudo-spectral gap is also bounded: by (Paulin, 2015, Proposition 3.4), its reciprocal is controlled up to a constant by the mixing time of the Markov chain $(Y_n)$, which is equal to the mixing time of the chain $(\theta_n, \theta_{n+1}, \theta_{n+2})_n$. This latter quantity is bounded since the assumption that $Q$ is irreducible and aperiodic on a finite state space implies that $\theta$ mixes exponentially, at a rate governed (again, in view of (Paulin, 2015, Proposition 3.4)) by the pseudo-spectral gap of $Q$ itself and $\min_j \pi_j$.

We deduce that for a constant $c = c(H)$ we have
$$\Pi\Big( \Big| \frac 1 N \sum_n h_{ij}(Y_n) - E[h_{ij}] \Big| > C \Big(\frac{\log N}{N}\Big)^{1/2} \Big) \le 2\exp(-C^2 c \log N).$$
For any $\kappa > 0$, choosing $C = C(\kappa, c)$ large enough, this last probability is smaller than $N^{-\kappa}$; the claim follows upon bounding the operator norm $\|\hat P - P\|$ by $L_1$ times the maximal entrywise deviation, together with a union bound over $i, j \le L_1$. □

Lemma 29. In the setting of Theorem 5, define $M_x = M_{x, L_1, L}$ and $\hat M_x = \hat M_{x, L_1, L}$ as in Algorithm 1, recall that we choose $L$ such that $2^L \asymp (N/\log N)^{1/(1+2s)}$, and recall we assumed that $L_1^2 r_N \to 0$. For any $\kappa > 0$ there exists $C = C(\kappa, H)$ such that
$$\Pi_H\Big( \sup_{x \in \mathbb{Q}} \|\hat M_x - M_x\| \ge C L_1 (N/\log N)^{-s/(1+2s)} \Big) \le N^{-\kappa}.$$
Proof As in Lemma 28, we note that the pseudo-spectral gap of the chain $Y_n = (X_n, X_{n+1}, X_{n+2}, \theta_n, \theta_{n+1}, \theta_{n+2})$ is bounded away from zero provided the same is true of $\min_j \pi_j$ and the pseudo-spectral gap of $Q$ itself, which holds by Assumption B'. We apply Lemma 27 to the family $\mathcal{T} = \{\pm h_i \otimes K_L(x, \cdot) \otimes h_j : i, j \le L_1, x \in \mathbb{Q}\}$. Recall we assume that $\max(\|h_l\|_\infty : l \le L_1)$ is bounded independently of $L_1$. Lemma 30 implies the bracketing entropy bound
$$H_{[]}(\mathcal{T}, \|\cdot\|_{L^2(\Pi_H)}, \varepsilon) \le \bar H(\varepsilon) = C \log(L_1 2^L \varepsilon^{-1}), \qquad \varepsilon \le \sigma,$$
where we may take
$$b = 2^{L+2} \max_i (\|h_i\|_\infty^2) \|K\|_\infty = C_1 2^L, \qquad \sigma^2 \le 2^L \max_i (\|h_i\|_\infty^4) \|f_\pi\|_\infty \int K(z)^2 \, \mathrm{d}z = C_2^2\, 2^L;$$
to bound $\sigma$ we have substituted $z = 2^L(x-y)$ into $\int K_L(x,y)^2 f_\pi(y)\,\mathrm{d}y \le \|f_\pi\|_\infty \int 2^{2L} K(2^L(x-y))^2 \, \mathrm{d}y$. An application of Jensen's inequality yields the standard bound
$$\int_0^x \sqrt{\log(1/u)}\,\mathrm{d}u \le x \sqrt{\log(e/x)} \le x\big(1 + \sqrt{\log(1/x)}\big). \qquad (60)$$
Performing suitable substitutions, we deduce that
$$\int_0^\sigma \sqrt{\log(L_1 2^L/u)}\,\mathrm{d}u = L_1 2^L \int_0^{\sigma/(L_1 2^L)} \sqrt{\log(1/v)}\,\mathrm{d}v \le \sigma\big(1 + \sqrt{\log(L_1 2^L/\sigma)}\big) \le C \sqrt{2^L L},$$
for some constant $C$, since by assumption $L_1^2 r_N \to 0$, which implies that $\log(L_1) \le \log N \asymp L$. Noting that $(b + \sigma)\bar H(\sigma) \le C 2^L L$ for some $C$, we deduce that
$$\Pi_H\Big( \sup_{t \in \mathcal{T}} \sum_{n=1}^N \big(t(Y_n) - E_H t\big) \ge C\big[\sqrt{N 2^L}\big(\sqrt L + \sqrt{\kappa \log N}\big) + 2^L (L + \kappa \log N)\big] \Big) \le \exp(-\kappa \log N).$$
Since $2^L \asymp (N/\log N)^{1/(1+2s)}$, we find, bounding the operator norm by $L_1$ times the maximum of the entries, that, as claimed,
$$\Pi_H\Big( \sup_{x \in \mathbb{Q}} \|\hat M_x - M_x\| \ge C L_1 (N/\log N)^{-s/(1+2s)} \Big) \le N^{-\kappa}. \qquad (61)$$
□

Lemma 30. Let $\mathcal{T} = \{ h_i \otimes K_L(t, \cdot) \otimes h_j : i, j \le L_1, t \in \mathbb{R} \}$. Then we have the following bound for the bracketing numbers:
$$N_{[]}(\mathcal{T}, \|\cdot\|_{L^2(\Pi_H)}, \varepsilon) \le C L_1^2 \max(2^L \varepsilon^{-1}, 1)^8, \qquad (62)$$
for some constant $C > 0$. This bound is achieved with brackets whose $L^\infty$-diameter is at most $2^{L+2}\|K\|_\infty \max_i \|h_i\|_\infty^2$.
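The quantile-based bracketing device used in the proof below (for the class $\mathcal{U}$ of half-line indicators) can be sketched numerically. Here a large simulated sample stands in for $\Pi_H$, and all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# A large sample standing in for the law of X_1 (illustrative choice).
sample = rng.normal(size=200_000)

eps = 0.1
R = int(np.ceil(eps ** -2))  # R = ceil(eps^{-2}) cells of (roughly) equal mass
cuts = np.quantile(sample, np.linspace(0, 1, R + 1)[1:-1])
x = np.concatenate([[-np.inf], cuts, [np.inf]])  # x_0 < x_1 < ... < x_R

def lower(k, t):  # lower bracket: indicator of (-inf, x_{k-1})
    return t < x[k - 1]

def upper(k, t):  # upper bracket: indicator of (-inf, x_k)
    return t < x[k]

# Each bracket has (empirical) L2-diameter P(X in [x_{k-1}, x_k))^{1/2} ~ R^{-1/2} <= eps.
diams = [np.sqrt(np.mean(upper(k, sample) & ~lower(k, sample)))
         for k in range(1, R + 1)]

# Any indicator 1_{(-inf, u)} is bracketed: pick k with x_{k-1} <= u < x_k.
u = 0.37
k = int(np.searchsorted(x, u, side='right'))
bracketed = (np.all(lower(k, sample) <= (sample < u))
             and np.all((sample < u) <= upper(k, sample)))
```

This mirrors the proof's count $N_{[]}(\mathcal{U}, \|\cdot\|_{L^2(\Pi_H)}, \varepsilon) \le \lceil \varepsilon^{-2} \rceil$: here $\varepsilon = 0.1$ gives $R = 100$ brackets, each of $L^2$-diameter about $\varepsilon$.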
Proof The kernel $K$ is assumed to be bounded, continuous, and supported in $[-1, 1]$, with $K(1) = K(-1) = 0$. Let $\mathcal{U} = \{ \mathbb{1}_{(-\infty, u)} : u \in \mathbb{R} \}$ and let $\mathcal{V} = \{ \mathbb{1}_{(a,b]} - \mathbb{1}_{(c,d]} : a, b, c, d \in \mathbb{R} \}$. We show that
$$L_1^{-2}\, N_{[]}(\mathcal{T}, \|\cdot\|_{L^2(\Pi_H)}, 2\varepsilon \|K_L\|_\infty) \le N_{[]}(\mathcal{V}, \|\cdot\|_{L^2(\Pi_H)}, \varepsilon) \le N_{[]}(\mathcal{U}, \|\cdot\|_{L^2(\Pi_H)}, \varepsilon/4)^4. \qquad (63)$$
The first inequality follows from the fact that, given brackets $[\underline v_k, \bar v_k]$, $k \le N_\mathcal{V}$, of $L^2(\Pi_H)$-diameter $\varepsilon$ for $\mathcal{V}$, we can define
$$\underline t_{ikj} = \|K_L\|_\infty\, h_i \otimes \underline v_k \otimes h_j, \qquad \bar t_{ikj} = \|K_L\|_\infty\, h_i \otimes \bar v_k \otimes h_j,$$
to obtain brackets $[\underline t_{ikj}, \bar t_{ikj}]$, $i, j \le L_1$, $k \le N_\mathcal{V}$, for $\mathcal{T}$ whose $L^2(\Pi_H)$-diameter is at most $2\|K_L\|_\infty \varepsilon$. For the second inequality, observe that any $v \in \mathcal{V}$ can be written in the form $(u_1 - u_2) - (u_3 - u_4)$ for $u_1, u_2, u_3, u_4 \in \mathcal{U}$. Then, given brackets $[\underline u_k, \bar u_k]$, $k \le N_\mathcal{U}$, for $\mathcal{U}$, it follows that $[\underline v_{ijkl}, \bar v_{ijkl}]$, $i, j, k, l \le N_\mathcal{U}$, form brackets for $\mathcal{V}$, where
$$\underline v_{ijkl} = (\underline u_i - \bar u_j) - (\bar u_k - \underline u_l), \qquad \bar v_{ijkl} = (\bar u_i - \underline u_j) - (\underline u_k - \bar u_l),$$
and the $L^2(\Pi_H)$-diameter of such a bracket, if $\|\bar u_k - \underline u_k\|_{L^2(\Pi_H)} = (E[\bar u_k - \underline u_k]^2)^{1/2} \le \varepsilon/4$ for each $k$, is at most
$$\big( E_{\Pi_H}\big[ (\bar v_{i_1 i_2 i_3 i_4} - \underline v_{i_1 i_2 i_3 i_4})^2 \big] \big)^{1/2} = \Big( E_{\Pi_H}\Big[ \Big( \sum_{j=1}^4 (\bar u_{i_j} - \underline u_{i_j}) \Big)^2 \Big] \Big)^{1/2} \le \sum_{j=1}^4 \big( E[\bar u_{i_j} - \underline u_{i_j}]^2 \big)^{1/2} \le \varepsilon,$$
by the triangle inequality in $L^2$.

It remains to bound $N_{[]}(\mathcal{U}, \|\cdot\|_{L^2(\Pi_H)}, \varepsilon)$. One argues as in the proof of the Glivenko–Cantelli theorem: let $R = \lceil \varepsilon^{-2} \rceil$, set $x_0 = -\infty$, $x_R = \infty$, and choose $x_k$ such that $\Pi_H(X_1 \in [x_{k-1}, x_k)) = R^{-1} \le \varepsilon^2$. [This is possible because the distribution of $X_1$ has a density, so is non-atomic, but the proof would require only minor adjustments to accommodate distributions with atoms.] Define
$$\underline u_k = \mathbb{1}_{(-\infty, x_{k-1})}, \qquad \bar u_k = \mathbb{1}_{(-\infty, x_k)}, \qquad 1 \le k \le R,$$
and note that any $u \in \mathcal{U}$ is contained in one of the brackets $[\underline u_k, \bar u_k]$. The $L^2(\Pi_H)$-diameter of such a bracket is at most
$$\big( \Pi_H\{ X_1 \in [x_{k-1}, x_k) \} \big)^{1/2} = R^{-1/2} \le \varepsilon.$$
It follows that $N_{[]}(\mathcal{U}, \|\cdot\|_{L^2(\Pi_H)}, \varepsilon) \le \lceil \varepsilon^{-2} \rceil$.

The bracketing bound (62) follows for a suitable constant $C$ by substituting into (63), upon noting that $\lceil \varepsilon^{-2} \rceil \le 2\max(\varepsilon^{-2}, 1)$ and that $\|K_L\|_\infty = 2^L \|K\|_\infty$. Finally, we note that $\bar u_i(x) - \underline u_i(x) \in [0, 1]$ for all $x \in \mathbb{R}$, hence $\bar v_{ijkl} - \underline v_{ijkl} \le 4$ (noting that in the decomposition $v = (u_1 - u_2) - (u_3 - u_4)$ we may assume $u_2 \le u_1$, $u_4 \le u_3$, and carefully considering the consequences). The brackets for $\mathcal{T}$ have $L^\infty$-diameter at most $2^{L+2}\|K\|_\infty \max_i \|h_i\|_\infty^2$ as a consequence. □

B.3 Matrix Approximation Theory Arguments

Lemma 31. Define $A$ as in Lemma 24. In the setting of Theorem 5, define $\hat R$ as in Algorithm 1 for $2^L \asymp (N/\log N)^{1/(1+2s)}$, and define $\tilde R$ to have columns equal to the normalised columns of $Q O^\top \hat V$. Then, on $A$, $\hat R$ is well-defined and
$$\|\hat R - \tilde R_\tau\| \le \|\hat R - \tilde R_\tau\|_F \le C L_1^{7/2} r_N,$$
for some $C = C(H)$ and some permutation $\tau$, where $\tilde R_\tau$ is obtained by permuting the columns of $\tilde R$ according to $\tau$.

Proof Lemma 26 tells us that on $A$ the matrix $\hat B_{\hat a, \hat u}$ is diagonalisable, so that $\hat R$ is well defined, and moreover that
$$\min\big( \mathrm{sep}(\hat B_{\hat a, \hat u}), \mathrm{sep}(\tilde B_{\hat a, \hat u}) \big) > c,$$
for some constant $c = c(H) > 0$. Now we apply (Anandkumar et al., 2012, Lemma C.3), which says, as a consequence of the Bauer–Fike theorem, that if
$$\varepsilon = \kappa(\tilde R)\, \mathrm{sep}(\tilde B_{\hat a, \hat u})^{-1} \|\hat B_{\hat a, \hat u} - \tilde B_{\hat a, \hat u}\|$$
is smaller than $1/2$, then
$$\|\hat R - \tilde R_\tau\| \le \|\hat R - \tilde R_\tau\|_F \le J^{1/2}(J-1) \|\tilde R^{-1}\| \varepsilon.$$
By construction $\sum |\hat a_i| \le 1$, hence by the triangle inequality and Lemma 24, on $A$ we have
$$\|\hat B_{\hat a, \hat u} - \tilde B_{\hat a, \hat u}\| \le \sum_i |\hat a_i| \|\hat B_{\hat u_i} - \tilde B_{\hat u_i}\| \le \sup_x \|\hat B_x - \tilde B_x\| \le C L_1^2 r_N,$$
for some $C = C(H)$. By Lemma 34b, we have $\kappa(\tilde R) \le C L_1$ and $\|\tilde R^{-1}\| \le C L_1^{1/2}$. We deduce that $\varepsilon \to 0$ on $A$ (since $L_1^3 r_N \to 0$ by assumption), hence $\varepsilon$ is smaller than $1/2$ for large $N$, and the result follows. □

One could directly use the Ostrowski–Elsner theorem (Theorem 35) to obtain a version of Theorem 5 with a suboptimal estimation rate.
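The mechanism behind Lemma 31 — recovering the common diagonalising matrix from an eigendecomposition of a perturbed linear combination, with error controlled Bauer–Fike-style by the perturbation over the eigen-separation — can be sketched numerically. Everything below (the matrices, eigenvalues, weights, and noise level) is an illustrative toy, not the paper's construction:

```python
import numpy as np

# A well-conditioned R with unit-norm columns (illustrative numbers).
J = 3
R = np.array([[1.0, 0.3, 0.2],
              [0.1, 1.0, 0.3],
              [0.2, 0.1, 1.0]])
R /= np.linalg.norm(R, axis=0)
Rinv = np.linalg.inv(R)

# Simultaneously diagonalisable matrices B_1, B_2 (shared eigenvectors R).
lam1 = np.array([1.0, 2.0, 3.0])
lam2 = np.array([0.5, -0.2, 0.9])
B1 = R @ np.diag(lam1) @ Rinv
B2 = R @ np.diag(lam2) @ Rinv

# Linear combination with |a_1| + |a_2| <= 1; its eigenvalues
# 0.7*lam1 + 0.3*lam2 = (0.85, 1.34, 2.37) are well separated.
a = (0.7, 0.3)
B = a[0] * B1 + a[1] * B2

rng = np.random.default_rng(0)
B_hat = B + 1e-4 * rng.normal(size=(J, J))  # perturbed version, as for the empirical matrix

w, V = np.linalg.eig(B_hat)
order = np.argsort(np.real(w))
V = np.real(V[:, order])                    # sort columns by eigenvalue
V /= np.linalg.norm(V, axis=0)
V *= np.sign(np.sum(V * R, axis=0))         # resolve per-column sign ambiguity

err = np.linalg.norm(V - R)  # small: recovery up to permutation and sign
```

The recovered eigenvalues sit near $(0.85, 1.34, 2.37)$ and `err` is of the order of the perturbation divided by the separation, which is the qualitative content of the lemma.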
We here go through the slightly circuitous route of using Theorem 35 to prove an eigen-separation condition (i.e. Lemma 26) and deducing Lemma 31, because we may then apply the following lemma, adapted from (Anandkumar et al., 2012, Lemma C.4), to obtain a near-minimax rate instead.

Lemma 32. Suppose $(A_t : t \in T)$ are $J \times J$ matrices simultaneously diagonalised by a matrix $R$ with unit-norm columns:
$$R^{-1} A_t R = \mathrm{diag}(\lambda_{t,1}, \dots, \lambda_{t,J}), \qquad t \in T.$$
Let $\hat R$ be a matrix such that for some permutation $\tau$ of $\{1, \dots, J\}$ we have
$$\|\hat R - R_\tau\| := \varepsilon_R \le (1/2)\|R^{-1}\|^{-1},$$
where $R_\tau$ is obtained by permuting the columns of $R$ according to $\tau$. Assume $\lambda_{\max} := \sup_t \max_j |\lambda_{t,j}| < \infty$. For matrices $(\hat A_t : t \in T)$, write $\varepsilon_A := \sup_t \|A_t - \hat A_t\|$, and define $\hat\lambda_{t,j} = e_j^\top \hat R^{-1} \hat A_t \hat R e_j$. Then
$$\sup_t \max_j |\hat\lambda_{t,j} - \lambda_{t,\tau(j)}| \le 4\kappa(R)\big[ \varepsilon_A + \lambda_{\max} \|R^{-1}\| \varepsilon_R \big].$$

Proof Let $\hat\zeta_j^\top$ be the $j$th row of $\hat R^{-1}$, let $\hat\xi_j$ be the $j$th column of $\hat R$, and define $\zeta_j, \xi_j$ correspondingly with respect to the matrix $R_\tau$ obtained by permuting the columns of $R$ according to $\tau$. Then $\lambda_{t,\tau(j)} = \zeta_j^\top A_t \xi_j$ and $\hat\lambda_{t,j} = \hat\zeta_j^\top \hat A_t \hat\xi_j$, and we have
$$|\hat\lambda_{t,j} - \lambda_{t,\tau(j)}| = |\hat\zeta_j^\top \hat A_t \hat\xi_j - \zeta_j^\top A_t \xi_j| = |\hat\zeta_j^\top \hat A_t (\hat\xi_j - \xi_j) + \hat\zeta_j^\top (\hat A_t - A_t)\xi_j + (\hat\zeta_j^\top - \zeta_j^\top) A_t \xi_j| \le \|\hat\zeta_j\| \|\hat A_t\| \|\hat\xi_j - \xi_j\| + \|\hat\zeta_j\| \|\xi_j\| \varepsilon_A + \|A_t \xi_j\| \|\hat\zeta_j - \zeta_j\|.$$
Using Lemma 36, we have that
$$\|\hat R^{-1} - R_\tau^{-1}\| \le \|R^{-1}\|^2 \varepsilon_R / (1 - \|R^{-1}\|\varepsilon_R),$$
and we further note the following:

• $\|\zeta_j^\top\| = \|e_{\tau(j)}^\top R^{-1}\| \le \|R^{-1}\|$, and $\|\hat\zeta_j^\top - \zeta_j^\top\| \le \|\hat R^{-1} - R_\tau^{-1}\| \le \|R^{-1}\|^2 \varepsilon_R/(1 - \|R^{-1}\|\varepsilon_R)$, so that also $\|\hat\zeta_j^\top\| \le \|\zeta_j^\top\| + \|\hat\zeta_j - \zeta_j\| \le \|R^{-1}\|/(1 - \|R^{-1}\|\varepsilon_R)$.
• $\|\xi_j\| \le \|R\|$, and $\|\hat\xi_j - \xi_j\| \le \|\hat R - R_\tau\| = \varepsilon_R$.
• $\|A_t\| = \|R\,\mathrm{diag}(\lambda_{t,\cdot})\,R^{-1}\| \le \kappa(R)\lambda_{\max}$, and $\|\hat A_t\| \le \|A_t\| + \varepsilon_A \le \kappa(R)\lambda_{\max} + \varepsilon_A$.
• $\|A_t \xi_j\| = |\lambda_{t,\tau(j)}| \|\xi_j\| \le \lambda_{\max}\|R\|$.

Then, continuing the inequalities from the display and using that $\|R^{-1}\|\varepsilon_R \le 1/2$ (so that $(1 - \|R^{-1}\|\varepsilon_R)^{-1} \le 2$), we have
$$|\hat\lambda_{t,j} - \lambda_{t,\tau(j)}| \le 2\|R^{-1}\|\big[ (\kappa(R)\lambda_{\max} + \varepsilon_A)\varepsilon_R + \|R\|\varepsilon_A \big] + 2\lambda_{\max}\|R\|\|R^{-1}\|^2 \varepsilon_R \le (1 + 2\kappa(R))\varepsilon_A + 4\lambda_{\max}\kappa(R)\|R^{-1}\|\varepsilon_R,$$
where we have also used $2\|R^{-1}\|\varepsilon_R \le 1$. Taking the supremum over $t \in T$ concludes the proof, since necessarily $1 + 2\kappa(R) \le 3\kappa(R) < 4\kappa(R)$. □

Lemma 33. Define $O = O_{L_1}$, $P = P_{L_1}$, $(M_x = M_{x, L_1, L} : x \in \mathbb{R})$ as in Lemma 11 for functions $(h_l)_{l \le L_1}$ satisfying a sup-norm bound uniformly in $L_1$, and assume that $\sigma_J(O) \ge c > 0$ uniformly in $L_1 \ge L_0$ for some $L_0 = L_0(H)$ (for example, by choosing $(h_l : l \le L_1)$ as in Lemma 23). Then
$$\kappa(O) \le C L_1^{1/2}, \qquad \sigma_J(P) \ge c', \qquad \|M_x\| \le C' L_1,$$
for some constants $c', C, C' > 0$, uniformly in $L_1 \ge L_0$ and all $L$.

Proof Given the assumed bound on $\sigma_J(O)$, to control $\kappa(O)$ it remains to bound $\|O\|$, since one has the standard expression $\kappa(O) := \|O\|\|O^{-1}\| \equiv \|O\|/\sigma_J(O)$.
Then it suffices to note, using Cauchy–Schwarz and the fact that $|\langle f_j, h_l \rangle| = |\int h_l(x) f_j(x)\,\mathrm{d}x| \le \|h_l\|_\infty$, that
$$\|O\|^2 = \sup_{\|v\| = 1} \sum_j \Big( \sum_l v_l \langle f_j, h_l \rangle \Big)^2 \le \max_l \|h_l\|_\infty^2\, J L_1. \qquad (64)$$
Next, Assumption B' implies $\sigma_J(Q) > 0$ and $\sigma_J(\mathrm{diag}(\pi)) = \min_j \pi_j > 0$. Using submultiplicativity of $\sigma_J$ (see Lemma 36) and the expression $P = O\,\mathrm{diag}(\pi) Q^2 O^\top$ (from Lemma 11), we have
$$\sigma_J(P) = \sigma_J(O\,\mathrm{diag}(\pi) Q^2 O^\top) \ge \sigma_J(O)\,\sigma_J(\mathrm{diag}(\pi))\,\sigma_J(Q)^2\,\sigma_J(O^\top) \ge c'(H) > 0.$$
For $M_x$, the expression $M_x = O\,\mathrm{diag}(\pi) Q D_x Q O^\top$ from Lemma 11 similarly yields
$$\|M_x\| \le \|O\|^2 \|Q\|^2 \max_j |K_L[f_j](x)|.$$
Recalling that $\|K_L[f_j]\|_\infty$ is bounded (see (27)), we deduce the result. □

The following collects several useful results from De Castro et al. (2017) and Anandkumar et al. (2012).

Lemma 34. Assume $\sigma_J(O) \ge c > 0$ uniformly in $L_1 \ge L_0$, so that by Lemma 33 we also have $\sigma_J(P) > 0$ and $\kappa(O) \le C L_1^{1/2}$ for some $C$. On the event $B = \{\|\hat P - P\| < \sigma_J(P)/3\}$, for $L_1 \ge L_0$ and $N$ large enough, we have the following.

a. $\sigma_J(\hat P) > 2\sigma_J(P)/3 > 0$. Writing $\hat V$ and $V$ for matrices of orthonormal right singular vectors of $\hat P$ and $P$ respectively, we have $\sigma_J(\hat V^\top V) \ge \sqrt 3/2$, and consequently $\hat V^\top P \hat V$ is invertible.

b. $\kappa(Q O^\top \hat V) \le C L_1^{1/2}$, $\|\tilde R^{-1}\| \le C' L_1^{1/2}$, and $\kappa(\tilde R) \le C'' L_1$, where $\tilde R$ is the matrix whose columns are those of $Q O^\top \hat V$ rescaled to have unit norm.

c. For any $x \in \mathbb{R}$, and an absolute constant $C$,
$$\|\tilde B_x - \hat B_x\| \le C\Big[ \frac{\|\hat M_x - M_x\|}{\sigma_J(P)} + \frac{\|M_x\| \|\hat P - P\|}{\sigma_J(P)^2} \Big].$$

Proof We throughout use various basic properties of $\sigma_J$ and $\kappa$, which are summarised in Lemma 36 below.

a. By Lemma 33, $\sigma_J(P) > 0$. The result then follows from standard approximation theory. In particular, (Anandkumar et al., 2012, Lemma C.1, part 2) tells us that $\sigma_J(\hat P) > 2\sigma_J(P)/3 > 0$.
That $\sigma_J(V^\top \hat V) \ge \sqrt 3/2$ on $B$ is given by (Anandkumar et al., 2012, Lemma C.1, part 3), and submultiplicativity of $\sigma_J$ yields
$$\sigma_J(\hat V^\top P \hat V) = \sigma_J(\hat V^\top (V V^\top) P (V V^\top) \hat V) \ge \sigma_J(V^\top \hat V)^2\, \sigma_J(V^\top P V) \ge (3/4)\,\sigma_J(P) > 0,$$
which implies invertibility of $\hat V^\top P \hat V$.

b. Observe that
$$\kappa(Q O^\top \hat V) = \frac{\|Q O^\top \hat V\|}{\sigma_J(Q O^\top \hat V)} \le \frac{\|Q O^\top\|}{\sigma_J(Q O^\top V)\,\sigma_J(V^\top \hat V)}.$$
We have $\sigma_J(Q O^\top V) = \sigma_J(Q O^\top)$, and we deduce by part a that $\kappa(Q O^\top \hat V) \le (4/3)^{1/2}\kappa(Q O^\top) \le 2\kappa(Q)\kappa(O)$. Assumption B' implies $\kappa(Q) < \infty$. For $\tilde R$, see (Anandkumar et al., 2012, Lemma C.5), which tells us that $\|\tilde R^{-1}\| \le \kappa(Q O^\top \hat V)$ and $\kappa(\tilde R) \le \kappa(Q O^\top \hat V)^2$.

c. One decomposes
$$\|\tilde B_x - \hat B_x\| \le \|(\hat V^\top \hat P \hat V)^{-1}\| \|\hat V^\top (M_x - \hat M_x) \hat V\| + \|\hat V^\top M_x \hat V\| \|(\hat V^\top \hat P \hat V)^{-1} - (\hat V^\top P \hat V)^{-1}\|,$$
then uses Lemma 36 with $\hat A = \hat V^\top \hat P \hat V$, $A = \hat V^\top P \hat V$, noting that in part a we showed $\|(\hat V^\top P \hat V)^{-1}\| \equiv \sigma_J(\hat V^\top P \hat V)^{-1} \le (4/3)\,\sigma_J(P)^{-1}$. See the proof of (De Castro et al., 2017, Lemma F.4, on p. 28), which adapts to the current setting. □

Theorem 35 (Ostrowski–Elsner; e.g. (Stewart and Sun, 1990, Chapter IV, Theorem 1.4)). For a matrix $U \in \mathbb{R}^{J \times J}$, write $(\lambda_i(U) : i \le J)$ for the eigenvalues of $U$. Then for matrices $A, B \in \mathbb{R}^{J \times J}$ we have
$$\min_\tau \max_j |\lambda_{\tau(j)}(A) - \lambda_j(B)| \le (2J - 1)\,(\|A\| + \|B\|)^{(J-1)/J}\, \|A - B\|^{1/J}, \qquad (65)$$
where the minimum is over permutations $\tau$.

Lemma 36. Let $A$ and $\hat A$ be matrices such that $A$ is invertible and $\|A - \hat A\| < \|A^{-1}\|^{-1}$. Then $\hat A$ is invertible and
$$\|\hat A^{-1} - A^{-1}\| \le \frac{\|A^{-1}\|^2 \|A - \hat A\|}{1 - \|A^{-1}\|\|A - \hat A\|}.$$
We also have the following: $\kappa(A) = \kappa(A^\top)$; $\sigma_J(A) = \sigma_J(A^\top)$; $\sigma_J(A) = \sigma_J(A W^\top)$ for any matrix $W$ whose columns are orthonormal and whose domain is $\mathbb{R}^J$; $\sigma_J(AB) \ge \sigma_J(A)\sigma_J(B)$; and $\kappa(AB) \le \kappa(A)\kappa(B)$ for matrices $A, B$.
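The inverse-perturbation inequality in Lemma 36 is easy to check numerically; the following minimal sketch draws random matrices (sizes and scales arbitrary, purely illustrative) and verifies the bound whenever its hypothesis $\|A - \hat A\| < \|A^{-1}\|^{-1}$ holds:

```python
import numpy as np

rng = np.random.default_rng(3)

def opnorm(M):
    return np.linalg.norm(M, 2)  # spectral (operator) norm

checks = []
for _ in range(50):
    A = rng.normal(size=(4, 4)) + 4 * np.eye(4)  # comfortably invertible
    E = 0.05 * rng.normal(size=(4, 4))           # small perturbation
    A_hat = A + E
    n_inv, n_E = opnorm(np.linalg.inv(A)), opnorm(E)
    if n_inv * n_E < 1:                          # hypothesis of Lemma 36
        lhs = opnorm(np.linalg.inv(A_hat) - np.linalg.inv(A))
        rhs = n_inv ** 2 * n_E / (1 - n_inv * n_E)
        checks.append(lhs <= rhs + 1e-12)
```

The inequality follows from the identity $\hat A^{-1} - A^{-1} = \hat A^{-1}(A - \hat A)A^{-1}$ together with a Neumann-series bound on $\|\hat A^{-1}\|$, so every draw satisfying the hypothesis passes the check.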
Proof For the first claim, see (Stewart and Sun, 1990, Chapter III, Theorem 2.5). The other results can be found in Chapter I.4 of the same reference. □

B.4 Sketch Proof of Theorem 4

The arguments used to prove Theorem 5 work also in this discrete setting, given the following observations and slight adaptations. To ease notation, we assume that $f_j(x) = 0$ for all $x < 0$ and all $1 \le j \le J$. We make the following definitions, which correspond to taking $h_l = \mathbb{1}_l$, i.e. $h_l(x) = \mathbb{1}\{x = l\}$, and replacing $K_L(x, y)$ by $\mathbb{1}\{x = y\}$:
$$M_x = M_{x, L_1} = \big( \Pi_H(X_1 = l, X_2 = x, X_3 = m) \big)_{l, m \le L_1}, \quad x \in \mathbb{N},$$
$$P = P_{L_1} = \big( \Pi_H(X_1 = l, X_3 = m) \big)_{l, m \le L_1},$$
$$O = O_{L_1} = \big( \Pi_H(X_1 = l \mid \theta_1 = j) \big)_{l \le L_1, j \le J},$$
$$D_x = \mathrm{diag}\big( (\Pi_H(X_1 = x \mid \theta_1 = j))_{j \le J} \big) \equiv \mathrm{diag}((O_{xj})_j).$$
The proof of Lemma 11 is unchanged with these adjusted definitions, and we adapt the definitions in Algorithm 1 correspondingly:
$$\hat M_x = \Big( \frac 1 N \sum_{n \le N} \mathbb{1}_l(X_n)\,\mathbb{1}_x(X_{n+1})\,\mathbb{1}_m(X_{n+2}) \Big)_{l, m \le L_1}, \qquad \hat P = \Big( \frac 1 N \sum_{n \le N} \mathbb{1}_l(X_n)\,\mathbb{1}_m(X_{n+2}) \Big)_{l, m \le L_1},$$
$$\hat B_x = (\hat V^\top \hat P \hat V)^{-1} \hat V^\top \hat M_x \hat V,$$
for $\hat V$ comprising right singular vectors of $\hat P$.

Observe that the proofs of Lemmas 23 and 33 work in the current setting for the current choice of the $h_l$ [indeed, thanks to the disjoint supports of $h_l, h_m$ for $l \ne m$, one can improve the bound in eq. (64) to $\|O_{L_1}\| \le J^{1/2}$], and similarly a version of Lemma 25 holds by choosing $\mathcal{D}_N = A_N \times U_N$ for sequences of finite sets $A_N \subset \mathbb{R}^{J(J-1)/2}$, $U_N \subset \mathbb{N}^{J(J-1)/2}$ such that $\cup_N U_N = \mathbb{N}^{J(J-1)/2}$ and $\cup_N A_N$ is dense in $\{a \in \mathbb{R}^{J(J-1)/2} : \sum_i |a_i| \le 1\}$.

Next, note that a version of the Glivenko–Cantelli theorem gives control over $\sup_{x \in \mathbb{N}} \|\hat M_x - M_x\|$ for our new definitions of $\hat M_x, M_x$; we give here a slightly indirect proof of this fact by reusing the machinery of Lemma 29. Indeed, inspecting the proof of Lemma 30, one deduces that
$$H_{[]}(\mathcal{T}, \|\cdot\|_{L^2(\Pi_H)}, \varepsilon) \le C\log\big( L_1 \max(\varepsilon^{-1}, 1) \big)$$
for $\mathcal{T} = \{ \mathbb{1}_i \otimes \mathbb{1}_x \otimes \mathbb{1}_j : x \in \mathbb{N}, i, j \le L_1 \}$. It follows, in view of the standard bound (see (60))
$$\int_0^x \sqrt{\log(1/u)}\,\mathrm{d}u \le x\big(1 + \sqrt{\log(1/x)}\big),$$
and recalling as in Lemma 28 that the chain $Y_n = (X_n, X_{n+1}, X_{n+2}, \theta_n, \theta_{n+1}, \theta_{n+2})$ has pseudo-spectral gap bounded away from zero by Assumption B', that Lemma 27, applied with $b = \sigma = 1$, yields
$$\Pi\big( \sup_{x \in \mathbb{N}} |\hat M_{x,ij} - M_{x,ij}| > C(N^{-1/2} + N^{-1/2}\sqrt u + N^{-1} u) \big) \le \exp(-u).$$
We note that $\|\hat M_x - M_x\| \le L_1 \max_{ij} |\hat M_{x,ij} - M_{x,ij}|$. Combining with Lemma 28, for any $c_N \to \infty$, we may choose suitable $L_1 \to \infty$ and $u \to \infty$ to deduce
$$\Pi_H\big( \|\hat P - P\| \le c_N N^{-1/2}, \ \sup_{x \in \mathbb{N}} \|\hat M_x - M_x\| \le c_N N^{-1/2} \big) \to 1.$$
The rest of the proof exactly mirrors that of Theorem 5.

Appendix C. Proof of the Lower Bound

For the lower bound we consider, for simplicity, the (in view of the multiple testing application) most relevant case $J = 2$. Let $\sigma_2$ denote the set of all permutations of $\{0, 1\}$. Define, for $s, R > 0$ and $C^s$ as in Assumption D,
$$C^s(R) = \Big\{ f \in C^s : f \ge 0, \ \int_{\mathbb{R}} f = 1, \ \|f\|_{C^s} \le R \Big\}.$$

Parameters. The unknown parameters are $H = (Q, \pi, f)$, where $f = (f_0, f_1)$ denotes the vector of emission densities. Denoting by $P_{f_i}$ the distribution with density $f_i$ on $\mathbb{R}$, $i = 0, 1$, the law of $X = (X_1, \dots, X_N)$ is
$$\Pi_H = \Pi_H^{(N)} = \sum_{v \in \{0,1\}^N} w_v \bigotimes_{j=1}^N P_{f_{v_j}},$$
where $w_v$ denotes the probability under the Markov chain of observing the successive sequence of states $(v_1, \dots, v_N) \in \{0,1\}^N$, that is,
$$w_{(v_1, \dots, v_N)} = \pi_{v_1} Q_{v_1, v_2} \cdots Q_{v_{N-1}, v_N}.$$

Class $\mathcal{H}_{sep}$ of well-separated parameters. Let $\mathcal{F}_{sep}$ be a class of pairs $f = (f_0, f_1)$ that are well-separated in the following sense, for a (small) $d > 0$:
$$\mathcal{F}_{sep} = \big\{ f = (f_0, f_1) \in C^s(R)^2 : |(f_1 - f_0)(0)| \ge d, \ |P_{f_1}([-1,1]) - P_{f_0}([-1,1])| \ge d \big\}. \qquad (66)$$
We define, for given $Q, \pi$,
$$\mathcal{H}_{sep} = \mathcal{H}_{sep}(Q, \pi, R, d, s) = \{ H = (Q, \pi, f) : f \in \mathcal{F}_{sep} \}. \qquad (67)$$

Minimax risk.
For $f = (f_0, f_1)$ and $g = (g_0, g_1)$ two pairs of real functions, denote
$$\rho(f, g) = \min_{\varphi \in \sigma_2} \big( \|g_{\varphi(0)} - f_0\|_\infty + \|g_{\varphi(1)} - f_1\|_\infty \big). \qquad (68)$$
The loss $\rho$ is a pseudo-metric, verifying the axioms of a distance except that one can have $\rho(f, g) = 0$ for $f \ne g$. We note that one could also consider the equivalent loss obtained by replacing the sum in (68) with a maximum.

Let us consider the minimax risk
$$\mathcal{R}_n = \mathcal{R}_n(\mathcal{H}_{sep}) = \inf_{T = (T_0, T_1)} \sup_{H \in \mathcal{H}_{sep}} E_H[\rho(T, f)]. \qquad (69)$$
Since $E[\min(X, Y)] \le \min(EX, EY)$, one notes that
$$\mathcal{R}_n \le \inf_{T = (T_0, T_1)} \sup_{H \in \mathcal{H}_{sep}} \Big[ \min_{\varphi \in \sigma_2} \big( E_H\|T_{\varphi(0)} - f_0\|_\infty + E_H\|T_{\varphi(1)} - f_1\|_\infty \big) \Big]. \qquad (70)$$
In view of Section 4.3 (and constructing $\hat f_0, \hat f_1$ using $L_1 = 2$, $h_1 = \mathbb{1}$, $h_2 = \mathbb{1}_{[-1,1]}$ in Algorithm 1), Theorem 5 provides a procedure for which the last quantity is bounded from above by (any rate slower than) $r_N = (N/\log N)^{-s/(2s+1)}$. The next result provides the corresponding minimax lower bound. Note that the lower bound in Proposition 37 is pointwise in $Q$ and $\pi$, and thus continues to hold if $\pi, Q$ are allowed to vary in some set.

Proposition 37. Consider $J = 2$ classes, and fix both $\pi = (\pi_0, \pi_1) \in [0,1]^2$ and $Q$ a $2 \times 2$ transition matrix. Given $s, R, d > 0$, let $\mathcal{H}_{sep}$ be as in (67) and let $\mathcal{R}_n = \mathcal{R}_n(\mathcal{H}_{sep})$ be as in (69). Then there exists $C = C(s, R) > 0$ such that, for $N$ large enough,
$$\mathcal{R}_n(\mathcal{H}_{sep}) \ge C \Big( \frac{\log N}{N} \Big)^{\frac{s}{2s+1}}.$$

Proof We reduce the estimation problem to a classification problem in a standard way. Suppose the two sets of densities $\{f_0^{(m)}, 0 \le m \le M\}$ and $\{f_1^{(m)}, 0 \le m \le M\}$ are such that, for some $0 < s_0, s_1 < C_0/4$,
$$\min\{ \|f_1^{(i)} - f_0^{(j)}\|_\infty : 0 \le i, j \le M \} \ge C_0, \qquad (71)$$
$$\min\{ \|f_0^{(i)} - f_0^{(j)}\|_\infty : 0 \le i, j \le M, i \ne j \} \ge 2s_0, \qquad (72)$$
$$\min\{ \|f_1^{(i)} - f_1^{(j)}\|_\infty : 0 \le i, j \le M, i \ne j \} \ge 2s_1. \qquad (73)$$
It follows that the family of functions $f^{(m)} = (f_0^{(m)}, f_1^{(m)})$ is $2(s_0 + s_1)$-separated in terms of $\rho$, since for $m \ne m'$,
$$\rho(f^{(m)}, f^{(m')}) \ge \min\Big( \|f_0^{(m)} - f_0^{(m')}\|_\infty + \|f_1^{(m)} - f_1^{(m')}\|_\infty,\ \|f_1^{(m)} - f_0^{(m')}\|_\infty + \|f_0^{(m)} - f_1^{(m')}\|_\infty \Big) \ge \min(2(s_0 + s_1), 2C_0) = 2(s_0 + s_1) =: 2S.$$
For a given estimator $T$ of $f \in \{f^{(0)}, \dots, f^{(M)}\}$, let $j^*(T)$ be the index $j$ such that $f^{(j)}$ is the closest to $T$ in the $\rho$ pseudo-distance. Since the family $(f^{(m)}, m \in \{0, \dots, M\})$ is $2S$-separated, we have $\rho(T, f^{(m)}) \ge S\,\mathbb{1}\{j^*(T) \ne m\}$. Writing $H_m = (Q, \pi, f^{(m)})$, we have
$$\sup_{H \in \mathcal{H}_{sep}} E_H[\rho(T, f)] \ge \max_{0 \le m \le M} E_{H_m}\big[ \rho(T, f^{(m)}) \big] \ge S \max_{0 \le m \le M} \Pi_{H_m}[j^*(T) \ne m] \ge S\, p_{e,M}, \qquad (74)$$
where $p_{e,M} = \inf_\psi \max_{0 \le m \le M} \Pi_{H_m}[\psi \ne m]$, with the infimum being over all classifiers $\psi$. Taking the infimum with respect to $T$ in (74), one obtains $\mathcal{R}_n(\mathcal{H}_{sep}) \ge S\, p_{e,M}$.

Lemma 39 shows that, in order to bound $p_{e,M}$ from below, it suffices to bound the divergences $\mathrm{KL}(\Pi_{H_m}, \Pi_{H_0})$ from above, where $\mathrm{KL}(P, Q)$ denotes the Kullback–Leibler divergence between distributions $P$ and $Q$ with densities $p, q$:
$$\mathrm{KL}(P, Q) = E_P\Big[ \log\Big( \frac p q \Big) \Big]. \qquad (75)$$
By convexity of the map $(x, y) \mapsto x\log(x/y)$, writing $v = (v_j) \in \{0,1\}^N$, one obtains
$$\mathrm{KL}(\Pi_{H_m}, \Pi_{H_0}) \le \sum_{v \in \{0,1\}^N} w_v\, \mathrm{KL}\Big( \bigotimes_{j=1}^N P_{f^{(m)}_{v_j}}, \bigotimes_{j=1}^N P_{f^{(0)}_{v_j}} \Big).$$
For a given $v \in \{0,1\}^N$, let $n_i(v)$, $i = 0, 1$, denote the number of elements of $v$ equal to $i$. The tensorisation property of the KL divergence implies
$$\mathrm{KL}\Big( \bigotimes_{j=1}^N P_{f^{(m)}_{v_j}}, \bigotimes_{j=1}^N P_{f^{(0)}_{v_j}} \Big) = n_0(v)\,\mathrm{KL}\big( P_{f_0^{(m)}}, P_{f_0^{(0)}} \big) + n_1(v)\,\mathrm{KL}\big( P_{f_1^{(m)}}, P_{f_1^{(0)}} \big),$$
where $n_0(v), n_1(v)$ are both at most $N$.

Let us now choose functions $f_0^{(m)}, f_1^{(m)}$ satisfying eqs. (71) to (73), for which we have good control over $\mathrm{KL}(P_{f_j^{(m)}}, P_{f_j^{(0)}})$, $j = 0, 1$, $1 \le m \le M$.
For $\phi$ the standard normal density and $g_{m,A}$ defined as in Lemma 38, set
$$f_0^{(m)}(x) = g_{m, A_0}(x), \ m \ge 1, \qquad f_0^{(0)}(x) = r\phi(rx),$$
$$f_1^{(m)}(x) = g_{m, A_1}(x - 2/r), \ m \ge 1, \qquad f_1^{(0)}(x) = r\phi(r(x - 2/r)),$$
where we choose
$$A_0 = c_0 \Big( \frac{\log N}{N} \Big)^{\frac{s}{2s+1}}, \qquad A_1 = c_1 \Big( \frac{\log N}{N} \Big)^{\frac{s}{2s+1}}, \qquad M = \Big\lceil \Big( \frac{N}{\log N} \Big)^{\frac{1}{2s+1}} \Big\rceil,$$
with $r, c_0, c_1$ small, but fixed, positive constants. Note firstly that for $r, c_0, c_1$ small enough (and $N$ large enough), each pair $(f_0^{(m)}, f_1^{(m)})$ is in $\mathcal{F}_{sep}$ for some $d > 0$, $R > 0$. Indeed, examining the definition of $g_{m,A}$ from Lemma 38, we see for all $0 \le m \le M$ that
$$|f_1^{(m)}(0) - f_0^{(m)}(0)| \ge r|\phi(0) - \phi(2)| - A_0 - A_1,$$
which is bounded away from zero for $N$ large; that $P_{f_0^{(m)}}([-1,1]) \ge 2r\phi(r)$ up to a vanishing correction; and that, writing $\bar\Phi(x) := \int_x^\infty \phi(u)\,\mathrm{d}u \le \phi(x)/x$ for $x > 0$,
$$P_{f_1^{(m)}}([-1,1]) = \int_{-r-2}^{r-2} \phi \le \bar\Phi(2 - r) \le \frac{\phi(2-r)}{2-r},$$
so that the second separation condition in (66) holds for a suitable $d$ when $r$ is small. We further note by Lemma 38 that, for suitable $c_0, c_1, r, d, R$, we have both that (72) and (73) hold for $s_0 = A_0/2$, $s_1 = A_1/2$, and also that
$$\mathrm{KL}\big( P_{f_0^{(m)}}, P_{f_0^{(0)}} \big) \le \frac{C A_0^2}{M} \le C_d\, c_0^2\, \frac{\log N}{N}, \qquad \mathrm{KL}\big( P_{f_1^{(m)}}, P_{f_1^{(0)}} \big) \le \frac{C A_1^2}{M} \le C_d\, c_1^2\, \frac{\log N}{N}.$$
Putting the previous bounds together leads to
$$\mathrm{KL}(\Pi_{H_m}, \Pi_{H_0}) \le N \cdot \mathrm{KL}\big( P_{f_0^{(m)}}, P_{f_0^{(0)}} \big) + N \cdot \mathrm{KL}\big( P_{f_1^{(m)}}, P_{f_1^{(0)}} \big) \le C_d\, \big[ c_0^2 + c_1^2 \big] \log N.$$
In particular, one can bound from above
$$\frac 1 M \sum_{m=1}^M \mathrm{KL}(\Pi_{H_m}, \Pi_{H_0}) \le C_d (c_0^2 + c_1^2) \log N \le (\log M)/8,$$
provided $c_0, c_1$ are small enough constants, and we deduce by Lemma 39 that $p_{e,M} := \inf_\psi \max_{1 \le m \le M} \Pi_{H_m}[\psi \ne m]$ is greater than a positive constant. Finally, recalling (74), we have $\mathcal{R}_n(\mathcal{H}_{sep}) \ge S\, p_{e,M}$, with $2S = 2(s_0 + s_1) = A_0 + A_1$. The proposition follows by definition of $A_0, A_1$. □

Lemma 38. Let $\psi$ be a $C^\infty$ function with support in $(-1/2, 1/2)$ such that $\|\psi\|_\infty = 1$ and $\int_{\mathbb{R}} \psi = 0$. Let $\phi(\cdot)$ denote the standard normal density and, for $m \in \{1, \dots, M\}$ and some $A, r > 0$ and integer $M \ge 1$, set
$$g(x) = r\phi(rx), \qquad g_{m,A}(x) = r\phi(rx) + A\psi(Mx - m + 1/2).$$
Then for $s, R > 0$, the functions $g_{m,A}$ are densities belonging to $C^s(R)$ provided $A M^s \le R/2$ and $r, A$ are small enough,
$$\|g_{m,A} - g_{p,A}\|_\infty = A \quad (\text{for all } m \ne p),$$
and, for $P_g$ the distribution with density $g$ on $\mathbb{R}$, and some $C = C(r) > 0$, any $m \in \{1, \dots, M\}$,
$$\mathrm{KL}(P_{g_{m,A}}, P_g) \le C A^2/M.$$

Proof For the statement on supremum norms, it suffices to note that the functions $x \mapsto \psi(Mx - m + 1/2)$, $m = 1, \dots, M$, have disjoint supports. For the KL bounds, one expands the logarithm to second order in a neighbourhood of 0. □

Lemma 39. For a family of points $(H_m)_{1 \le m \le M}$ in $\mathcal{H}$ with $M \ge 2$, let
$$p_{e,M} = \inf_\psi \max_{1 \le m \le M} \Pi_{H_m}[\psi \ne m], \qquad (76)$$
where the infimum is over all possible measurable $\psi$ taking values in $\{1, \dots, M\}$. Suppose, for $\alpha < 1/8$,
$$\frac 1 M \sum_{m=1}^M \mathrm{KL}(\Pi_{H_m}, \Pi_{H_0}) \le \alpha \log M.$$
Then
$$p_{e,M} \ge \frac{\sqrt M}{\sqrt M + 1}\Big( 1 - 2\alpha - \sqrt{\frac{2\alpha}{\log M}} \Big).$$

Proof This follows from combining Proposition 2.3 and (the proof of) Theorem 2.5 in Tsybakov (2009). □

Appendix D. Notation

We give notation assuming, as in Section 3, that there are a (known) number $J$ of hidden states $\{1, \dots, J\}$ (recall that $J = 2$ for Section 2 and the proofs of results therein, with hidden states labelled 0 and 1, and the notation is adapted accordingly).

HMM parameters.

$X = (X_n)_{n \le N}$ (or $(X_n)_{n \le N+2}$ for convenience, or $(X_n)_{n \in \mathbb{N}}$ for some of the proofs and lemmas): the data, drawn from the HMM (1).

$\theta = (\theta_n)_{n \le N}$: the vector of hidden states, taking values in $\{1, \dots, J\}^N$.

$Q, \pi$: the transition matrix of $\theta$ and its stationary (and initial) distribution.

$\mu$: a dominating measure on the space $\mathcal{X} = \mathbb{R}$ (equipped with the usual Borel $\sigma$-algebra) in which $X_1$ takes values. Throughout, we take $\mu$ to equal Lebesgue measure on $\mathbb{R}$ or counting measure on $\mathbb{Z} \subset \mathbb{R}$.

$f_1, \dots, f_J$: the emission densities, i.e. $f_j$ is the density of $X_1$ conditional on $\theta_1 = j$.
$f_\pi$: the density of $X_1$; this is only used in the two-state case, so $f_\pi = \pi_0 f_0 + \pi_1 f_1$.

$H = (Q, \pi, f_1, \dots, f_J)$; $\hat H = (\hat Q, \hat\pi, \hat f_1, \dots, \hat f_J)$.

$\Pi_H, E_H$: the law of $X$ for parameter $H$ and the associated expectation operator.

$\mathcal{H}, \mathcal{I}$: see Section 4.3. [Also note that $C = C(H)$ is allowed to depend on the kernel $K$, the functions $(h_l)_{l \in \mathbb{N}}$ and the sets $D_N$, since these can be chosen universally.]

$\nu, x^*$: constants as in Assumption A.

$\delta$: a lower bound for $\min_{i,j} Q_{ij}$.

Multiple testing.

FDP, FDR, postFDR, mFDR, mTDR: see eqs. (2) to (5), (15) and (16) (also (10) for an alternative characterisation of postFDR).

$\ell_i \equiv \ell_i(X) \equiv \ell_{i,H}(X) = \Pi_H(\theta_i = 0 \mid X)$; $\hat\ell_i = \ell_{i,\hat H}$; $\ell_i' = \Pi_H(\theta_i = 0 \mid X_{i-A}, \dots, X_{i+A})$ for some $A$; $\ell_i^\infty = \Pi_H(\theta_i = 0 \mid (X_n)_{n \in \mathbb{Z}})$.

$\Phi_i^\infty = \Pi_H(\theta_i = 0 \mid (X_n : n \in \mathbb{Z}, n \le i))$.

$\varphi_{\lambda, H} = (\mathbb{1}\{\ell_{i,H} < \lambda\})_{i \le N}$.

$\hat\lambda = \sup\{\lambda : \mathrm{postFDR}_{\hat H}(\varphi_{\lambda, \hat H}) \le t\}$.

$\lambda^*$: the solution to $E[\ell_i^\infty \mid \ell_i^\infty < \lambda^*] = \min(t, \pi_0)$.

$\hat\varphi \equiv \hat\varphi(t) = \varphi_{\hat\lambda, \hat H}$ when there are no ties in $\ell$-values, and is defined by Definition 1 when there may be ties.

$\hat S = \{i : \hat\varphi_i = 1\}$, $\hat K = |\hat S|$.

Estimation.

$h_1, \dots, h_{L_1}$, where $L_1$ is either constant or diverges slowly to infinity: bounded functions which "witness" the linear independence of $f_1, \dots, f_J$ (see Algorithm 1 and Lemma 23).

$K, K_L, \dots$: a convolution kernel, see (25).

$M_x \equiv M_{x, L_1, L} = \big( E_H[h_i(X_1) K_L(x, X_2) h_j(X_3)] \big)_{i,j \le L_1} \in \mathbb{R}^{L_1 \times L_1}$.

$P \equiv P_{L_1} = \big( E_H[h_i(X_1) h_j(X_3)] \big)_{i,j \le L_1} \in \mathbb{R}^{L_1 \times L_1}$.

$O = O_{L_1} = \big( E_H[h_i(X_1) \mid \theta_1 = a] \big)_{i \le L_1, a \le J} \in \mathbb{R}^{L_1 \times J}$.

$D = D_x = \mathrm{diag}((K_L[f_j](x))_{j \le J})$, i.e. the diagonal matrix whose diagonal entries are $D_{jj} = K_L[f_j](x)$.

$V = V_{L_1} \in \mathbb{R}^{L_1 \times J}$: a matrix such that $V^\top P V$ is invertible.
Specifically, we either take V to equal a matrix of orthonormal right singular vectors of P (so that σ_J(V⊺PV) = σ_J(P)) or, on the event of Lemma 24, to equal V̂ (defined in Algorithm 1).
B_x = B_{x,L} = [V⊺PV]⁻¹ V⊺M_xV ≡ [QO⊺V]⁻¹ D_x QO⊺V.
M̂_x, P̂, Ô, V̂, B̂_x empirical versions of M_x, P, O, V, B_x (see Algorithm 1, p. 24).
B̂_x = [V̂⊺P̂V̂]⁻¹ V̂⊺M̂_xV̂, B̃_x = [V̂⊺PV̂]⁻¹ V̂⊺M_xV̂.
B̂_{a,u} = Σ_i a_i B̂_{u_i} and B̃_{a,u} = Σ_i a_i B̃_{u_i} for a, u ∈ ℝ^{J(J−1)/2} such that Σ_i |a_i| ≤ 1.
sep(B) = min_{i≠j} |λ_i − λ_j| the "eigen-separation" of a matrix B ∈ ℝ^{J×J} with eigenvalues λ_1, . . . , λ_J.
â, û, D_N: see Algorithm 1, p. 24.
R̂ a matrix of normalised columns diagonalising B̂_{â,û}; R̃ a matrix whose columns are those of QO⊺V̂ but scaled to have unit Euclidean norm (which therefore diagonalises B̃_{a,u} for any a, u).
A = {‖P̂ − P‖ ≤ cL r_N, ‖M̂_x − M_x‖ ≤ cL r_N for all x ∈ ℝ} the event of Lemma 24.
C^s the usual Hölder space (see Assumption D), equipped with the usual norm ‖·‖_{C^s}.

Minimax lower bound.

C^s(R) the subspace of C^s consisting of densities with Hölder norm bounded by R.
σ₂ the set of all permutations of {0, 1}.
ρ(f, g) = min_{ϕ∈σ₂} (‖g_{ϕ(0)} − f_0‖_∞ + ‖g_{ϕ(1)} − f_1‖_∞), for f = (f_0, f_1), g = (g_0, g_1).
F_sep = {f = (f_0, f_1) ∈ C^s(R)² : |(f_0 − f_1)(0)| ≥ d, |P_{f_0}([−1, 1]) − P_{f_1}([−1, 1])| ≥ d}.
H_sep = {H = (Q, π, f) : f ∈ F_sep}, for some arbitrary (fixed) Q, π.

Miscellaneous.

‖·‖, ‖·‖_F, ‖·‖_∞ the Euclidean norm on vectors (or the corresponding operator norm on matrices), the Frobenius norm on matrices, and the L^∞ (supremum) norm on functions taking values in ℝ.
σ_j(A) the j-th largest singular value of a matrix A.
κ(A) = σ_1(A)/σ_J(A) = ‖A‖ ‖A⁻‖ for a matrix with smaller dimension J, the condition number of the matrix A.
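The data-driven threshold λ̂ defined in the multiple-testing notation above has a simple finite-sample description: since the postFDR of a rejection set can be characterised as the average of the ℓ-values it contains, thresholding at λ̂ amounts to rejecting the k smallest ℓ-values for the largest k whose running mean stays below t. A minimal Python sketch, assuming a tie-free vector of (estimated) ℓ-values; the function name is illustrative, not from the paper:

```python
import numpy as np

def ell_value_rejections(ell, t):
    """Reject the k hypotheses with smallest ell-values, where k is the
    largest integer such that the average of the k smallest ell-values
    (the posterior FDR of that rejection set) is at most t.
    Returns a boolean rejection vector; assumes no ties in ell."""
    order = np.argsort(ell)
    # Running mean of the sorted ell-values; nondecreasing, so the set
    # {k : running_mean[k-1] <= t} is an initial segment.
    running_mean = np.cumsum(ell[order]) / np.arange(1, len(ell) + 1)
    below = np.nonzero(running_mean <= t)[0]
    reject = np.zeros(len(ell), dtype=bool)
    if below.size:
        k = below[-1] + 1          # largest k with running mean <= t
        reject[order[:k]] = True   # equivalently: reject iff ell_i < lambda_hat
    return reject
```

For instance, with ℓ-values (0.01, 0.5, 0.02, 0.2, 0.9) and t = 0.1, the three smallest ℓ-values are rejected, since their average 0.077 is below t while adding the fourth would push the average above t.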
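The identity B_x = [V⊺PV]⁻¹V⊺M_xV ≡ [QO⊺V]⁻¹D_xQO⊺V above implies that the eigenvalues of B_x are exactly the entries K_L[f_j](x) of D_x, which is what makes the quantities K_L[f_j](x) recoverable by diagonalisation. A small numerical sanity check, assuming the population factorisations P = O diag(π) Q² O⊺ and M_x = O diag(π) Q D_x Q O⊺ suggested by the definitions above; all parameter values are synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
L, J = 5, 2

# Synthetic "population" quantities (illustrative, not the paper's values):
O = rng.normal(size=(L, J))               # O_{ia} = E[h_i(X_1) | theta_1 = a]
Q = np.array([[0.8, 0.2], [0.4, 0.6]])    # transition matrix
pi = np.array([2 / 3, 1 / 3])             # its stationary distribution
D = np.diag([0.3, 1.1])                   # D_x = diag(K_L[f_j](x))

# Factorised forms of P and M_x consistent with the definitions above:
P = O @ np.diag(pi) @ Q @ Q @ O.T         # E[h_i(X_1) h_j(X_3)]
M = O @ np.diag(pi) @ Q @ D @ Q @ O.T     # E[h_i(X_1) K_L(x, X_2) h_j(X_3)]

# V: top-J orthonormal right singular vectors of P
V = np.linalg.svd(P)[2][:J].T             # L x J

# B_x = [V^T P V]^{-1} V^T M_x V; its eigenvalues recover diag(D_x)
B = np.linalg.solve(V.T @ P @ V, V.T @ M @ V)
eigvals = np.sort(np.linalg.eigvals(B).real)
# eigvals ~ [0.3, 1.1]
```

The check works because, under these factorisations, B_x is the conjugation of D_x by QO⊺V, so its spectrum equals the diagonal of D_x whenever QO⊺V is invertible.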
o(1), o_p(1) the usual little-oh notation: a_N = o(1) if a_N → 0 as N → ∞; a_N = o_p(1) if a_N → 0 in probability as N → ∞.
C^s(ℝ) the usual space of locally Hölder smooth functions, equipped with the usual Hölder norm ‖·‖_{C^s(ℝ)} (see Assumption D). Note that since we consider density functions, we could equivalently use the space of globally Hölder smooth functions.
r_N = (N/log N)^{−s/(1+2s)}.
ε_N some rate of consistency of estimators in (14).
N_[], H_[]: the bracketing numbers/entropy, wherein N_[](T, ‖·‖_{L²(P)}, ε) is the smallest number of pairs of functions (f, f̄) such that every g ∈ T is bracketed by one of the pairs, where (f, f̄) brackets g if f ≤ g ≤ f̄ pointwise, and H_[](T, ‖·‖_{L²(P)}, ε) := log N_[](T, ‖·‖_{L²(P)}, ε).

References

G. Alexandrovich, H. Holzmann, and A. Leister. Nonparametric identification and maximum likelihood estimation for hidden Markov models. Biometrika, 103(2):423–434, 2016.
A. Anandkumar, D. J. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In Proceedings of the 25th Annual Conference on Learning Theory, pages 33.1–33.34, 2012.
B. Bárány and I. Kolossváry. On the absolute continuity of the Blackwell measure. J. Stat. Phys., 159(1):158–171, 2015.
L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist., 37:1554–1563, 1966.
L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41:164–171, 1970.
Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B, 57(1):289–300, 1995.
Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Ann. Statist., 29(4):1165–1188, 2001.
P. J. Bickel, Y. Ritov, and T. Rydén. Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. Ann. Statist., 26(4):1614–1635, 1998.
D. Blackwell. The entropy of functions of finite-state Markov chains. In Transactions of the first Prague conference on information theory, statistical decision functions, random processes held at Liblice near Prague from November 28 to 30, 1956, pages 13–20. Publishing House of the Czechoslovak Academy of Sciences, Prague, 1957.
T. T. Cai, W. Sun, and W. Wang. Covariate-assisted ranking and screening for large-scale two-sample inference. J. R. Stat. Soc. Ser. B. Stat. Methodol., 2019.
O. Cappé, E. Moulines, and T. Rydén. Inference in hidden Markov models. Springer Series in Statistics. Springer, New York, 2005. With Randal Douc's contributions to Chapter 9 and Christian P. Robert's to Chapters 6, 7 and 13, with Chapter 14 by Gersende Fort, Philippe Soulier and Moulines, and Chapter 15 by Stéphane Boucheron and Elisabeth Gassiat.
I. Castillo and E. Roquain. On spike and slab empirical Bayes multiple testing. Ann. Statist., 48(5):2548–2574, 2020.
G. Cleanthous, A. G. Georgiadis, G. Kerkyacharian, P. Petrushev, and D. Picard. Kernel and wavelet density estimators on manifolds and more general metric spaces. Bernoulli, 26(3):1832–1862, 2020.
Y. De Castro, E. Gassiat, and C. Lacour. Minimax adaptive estimation of nonparametric hidden Markov models. J. Mach. Learn. Res., 17:Paper No. 111, 43, 2016.
Y. De Castro, E. Gassiat, and S. Le Corff. Consistent estimation of the filtering and marginal smoothing distributions in nonparametric hidden Markov models. IEEE Trans. Inform. Theory, 63(8):4758–4777, 2017.
R. Douc and C. Matias. Asymptotics of the maximum likelihood estimator for general hidden Markov models. Bernoulli, 7(3):381–420, 2001.
R. Durrett. Probability—theory and examples, volume 49 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2019.
B. Efron. Size, power and false discovery rates. Ann. Statist., 35(4):1351–1377, 2007a.
B. Efron. Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc., 102(477):93–103, 2007b.
B. Efron, R. Tibshirani, J. D. Storey, and V. Tusher. Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc., 96(456):1151–1160, 2001.
A. Farcomeni. Some results on the control of the false discovery rate under dependence. Scand. J. Statist., 34(2):275–297, 2007.
H. Finner, T. Dickhaus, and M. Roters. Dependency and false discovery rate: asymptotics. Ann. Statist., 35(4):1432–1455, 2007.
E. Gassiat, A. Cleynen, and S. Robin. Inference in finite state space non parametric hidden Markov models and applications. Stat. Comput., 26(1-2):61–71, 2016.
S. Ghosal and A. van der Vaart. Fundamentals of nonparametric Bayesian inference, volume 44 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2017.
E. Giné and R. Nickl. Mathematical foundations of infinite-dimensional statistical models. Cambridge Series in Statistical and Probabilistic Mathematics, [40]. Cambridge University Press, New York, 2016.
R. Heller and S. Rosset. Optimal control of false discovery criteria in the two-group model. J. R. Stat. Soc. Ser. B. Stat. Methodol., 2020. To appear.
L. Lehéricy. Nonasymptotic control of the MLE for misspecified nonparametric hidden Markov models. ArXiv eprint 1807.03997, 2018. URL https://arxiv.org/pdf/1807.03997.
L. Lehéricy. State-by-state minimax adaptive estimation for nonparametric hidden Markov models. J. Mach. Learn. Res., 19:Paper No. 39, 46, 2018.
P. Massart. Concentration inequalities and model selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003, with a foreword by Jean Picard.
P. Müller, G. Parmigiani, C. Robert, and J. Rousseau. Optimal sample size for multiple testing: the case of gene expression microarrays. J. Amer. Statist. Assoc., 99(468):990–1001, 2004.
D. Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electron. J. Probab., 20:no. 79, 32, 2015.
T. Petrie. Probabilistic functions of finite-state Markov chains. Proc. Nat. Acad. Sci. U.S.A., 57:580–581, 1967.
T. Rebafka, E. Roquain, and F. Villers. Graph inference with clustering and false discovery rate control. ArXiv eprint 1907.10176, 2019. URL https://arxiv.org/abs/1907.10176.
E. Roquain and N. Verzelen. On using empirical null distributions in Benjamini–Hochberg procedure. ArXiv eprint 1912.03109, 2020. URL https://arxiv.org/pdf/1912.03109.
G. W. Stewart and J. G. Sun. Matrix perturbation theory. Computer Science and Scientific Computing. Academic Press, Inc., Boston, MA, 1990.
J. D. Storey. The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Statist., 31(6):2013–2035, 2003.
W. Su and X. Wang. Hidden Markov model in multiple testing on dependent count data. J. Stat. Comput. Simul., 90(5):889–906, 2020.
W. Sun and T. T. Cai. Oracle and adaptive compound decision rules for false discovery rate control. J. Amer. Statist. Assoc., 102(479):901–912, 2007.
W. Sun and T. T. Cai. Large-scale multiple testing under dependence. J. R. Stat. Soc. Ser. B Stat. Methodol., 71(2):393–424, 2009.
A. B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, translated by Vladimir Zaiats.
A. van der Vaart. Asymptotic statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998.
X. Wang, A. Shojaie, and J. Zou. Bayesian hidden Markov models for dependent large-scale multiple testing. Comput. Statist. Data Anal., 136:123–136, 2019.
W. B. Wu. On false discovery control under dependence. Ann. Statist., 36(1):364–380, 2008.
J. Xie, T. T. Cai, J. Maris, and H. Li. Optimal false discovery rate control for dependent data. Stat. Interface, 4(4):417–430, 2011.
C. Yau, O. Papaspiliopoulos, G. O. Roberts, and C. Holmes. Bayesian non-parametric hidden Markov models with applications in genomics. J. R. Stat. Soc. Ser. B Stat. Methodol., 73(1):37–57, 2011.