Calibrating the scan statistic: finite sample performance vs. asymptotics
Guenther Walther∗ and Andrew Perry†
Department of Statistics, 390 Serra Mall, Stanford University, Stanford, CA
[email protected]
2019
Abstract
We consider the problem of detecting an elevated mean on an interval with unknown location and length in the univariate Gaussian sequence model. Recent results have shown that using scale-dependent critical values for the scan statistic makes it possible to attain asymptotically optimal detection simultaneously for all signal lengths, thereby improving on the traditional scan, but this procedure has been criticized for losing too much power for short signals. We explain this discrepancy by showing that these asymptotic optimality results will necessarily be too imprecise to discern the performance of scan statistics in a practically relevant way, even in a large sample context. Instead, we propose to assess the performance with a new finite sample criterion. We then present three calibrations for scan statistics that perform well across a range of relevant signal lengths: The first calibration uses a particular adjustment to the critical values and is therefore tailored to the Gaussian case. The second calibration uses a scale-dependent adjustment to the significance levels and is therefore applicable to arbitrary known null distributions. The third calibration restricts the scan to a particular sparse subset of the scan windows and then applies a weighted Bonferroni adjustment to the corresponding test statistics. This calibration is also applicable to arbitrary null distributions and in addition is very simple to implement.
Keywords and phrases.
Scan statistic, multiscale inference.
MSC 2000 subject classifications.
Primary 62G10; secondary 62G32.
∗ Work supported by NSF grants DMS-1501767 and DMS-1916074.
† Work supported by a Summer Undergraduate Research Grant from the Vice Provost for Undergraduate Education at Stanford University.

Introduction
There has been a considerable amount of recent work that requires a good solution to the following problem: We observe

(1)  Y_i = f_n(i) + Z_i,  i = 1, ..., n,

where the Z_i are i.i.d. N(0,1) and f_n(i) = μ_n 1(i ∈ I_n) with I_n = (j_n, k_n], 0 ≤ j_n < k_n ≤ n. The task is to decide whether a signal is present, i.e. to test μ_n = 0 vs. μ_n > 0, when both the amplitude μ_n and the support I_n are unknown. This is the prototypical model for the problem of detecting objects in noisy data and for certain goodness-of-fit tests, see the references below. The standard approach to this problem is based on the scan statistic (generalized likelihood ratio statistic)

(2)  Scan_n(Y_n) = max_I T_I(Y_n),

where for intervals I = (j, k], 0 ≤ j < k ≤ n, we write

T_I(Y_n) := ( Σ_{i ∈ I} Y_i ) / √|I| = ( Σ_{i=j+1}^{k} Y_i ) / √(k − j),

see e.g. Glaz et al. (2001) or Arias-Castro et al. (2005). Naus and Wallenstein (2004) observed that the scan statistic is sensitive for the detection of signals with very short support |I_n| at the expense of signals with moderate and large support. A heuristic explanation of this effect is as follows: There are about n/w disjoint intervals I with length |I| = w. Since the corresponding T_I(Z_n) are independent N(0,1), their maximum concentrates around √(2 log(n/w)). This remains true for the maximum over all (overlapping) intervals of length w because the corresponding T_I(Z_n) are strongly correlated with the set of n/w independent ones. Hence the distribution of Scan_n(Z_n) is dominated by the small intervals with |I| ≈ 1 and concentrates around √(2 log n); see Siegmund and Venkatraman (1995) for a formal proof.
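As a concrete illustration of (2) and of the concentration of Scan_n(Z_n) around √(2 log n), here is a minimal Monte Carlo sketch. This is our own illustrative code, not the authors' implementation, and the function names are ours:

```python
import numpy as np

def scan_stat(y):
    """Traditional scan statistic (2): max over all intervals I = (j, k]
    of T_I(y) = sum_{i in I} y_i / sqrt(|I|)."""
    n = len(y)
    c = np.concatenate(([0.0], np.cumsum(y)))   # c[k] = y_1 + ... + y_k
    best = -np.inf
    for w in range(1, n + 1):                   # scan over window lengths |I| = w
        t = (c[w:] - c[:-w]) / np.sqrt(w)       # T_I for all intervals of length w
        best = max(best, t.max())
    return best

rng = np.random.default_rng(0)
n = 1000
null_draws = [scan_stat(rng.standard_normal(n)) for _ in range(20)]
# first-order theory predicts concentration near sqrt(2 log n) = 3.72 for n = 1000
print(np.mean(null_draws), np.sqrt(2 * np.log(n)))
```

The cumulative-sum trick makes each window length a single vectorized pass, so the full scan costs O(n²) evaluations of T_I overall.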
This heuristic suggests that the scan does not provide an optimal aggregation of the evidence for the various interval lengths w, and that there may be a 'free lunch': since max_{I: |I|=w} T_I(Z_n) concentrates around √(2 log(n/w)), it may be possible to increase T_I(Y_n) by √(2 log n) − √(2 log(n/|I|)) without noticeably changing the null distribution of max_I T_I(Z_n), thereby increasing the power at larger scales and remedying the problem described by Naus and Wallenstein (2004). This idea was formalized by Dümbgen and Spokoiny (2001), who introduced the statistic

(3)  DS_n(Y_n) = max_I ( T_I(Y_n) − √(2 log(en/|I|)) )

and established various optimality results. These results show that DS_n leads to asymptotically optimal detection in the model (1) for all scales |I_n|, in the sense that no other statistical test can improve on DS_n even if the scale |I_n| of the signal were known in advance. In other words, while scanning over locations leads to an unavoidable multiple testing penalty, there is no further material price to pay in the asymptotic minimax framework for scanning over multiple scales |I| when using DS_n. In contrast, it was shown in Chan and Walther (2013) that inference using Scan_n will be suboptimal on all but the smallest scales. The asymptotic optimality of DS_n across all scales has made this statistic a popular choice for a range of problems involving the detection of signals or testing goodness-of-fit, see e.g. Dümbgen and Walther (2008), Rohde (2008), Frick et al. (2014), König et al. (2018). However, DS_n has been criticized by Siegmund (2017) for losing too much detection power on small scales. Indeed, Figure 1 shows a plot of the critical values for T_I(Y_n) as a function of |I|, for various methods for aggregating these statistics when n = 10. The green line shows the critical values resulting from the calibration DS_n.
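The statistic (3) differs from the plain scan only through the scale-dependent penalty term; a minimal sketch under the same conventions as the scan sketch above (again our own illustrative code, not the authors'):

```python
import numpy as np

def ds_stat(y):
    """Dumbgen-Spokoiny statistic (3): max_I [ T_I(y) - sqrt(2 log(en/|I|)) ].
    The penalty is largest for short intervals, which is what shifts power
    from the smallest scales toward the larger ones."""
    n = len(y)
    c = np.concatenate(([0.0], np.cumsum(y)))
    best = -np.inf
    for w in range(1, n + 1):
        penalty = np.sqrt(2 * np.log(np.e * n / w))  # scale-dependent penalty
        t = (c[w:] - c[:-w]) / np.sqrt(w) - penalty
        best = max(best, t.max())
    return best
```

Under the null the penalized maximum is stochastically bounded rather than growing like √(2 log n), which is what permits scale-dependent critical values in the first place.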
Compared to the black line resulting from Scan_n, it is clear that obtaining smaller critical values (and hence more power) at larger scales comes at the price of larger critical values (and thus reduced power) at small scales. For example, if n = 10 then the exact critical value at |I| = 1 is 4.14 with Scan_n and 5.09 with DS_n. So in order to declare a discovery, T_I(Y_n) needs to clear the 4σ threshold when using the calibration Scan_n, but it needs to exceed the 5σ threshold with the calibration DS_n. This is arguably an unacceptable price to pay if the concern is about the discovery of a signal on a small scale |I_n|, which is typically the case in applications using the model (1), as pointed out by Siegmund (2017). It is also informative to examine this discrepancy from the angle of the multiple testing problem which lies at the heart of model (1). Looking at different sample sizes, one finds that the critical value for Scan_n is 4.71 when n = 10 and 5.21 for n = 10. Thus the penalty incurred by using DS_n on the small scales is equivalent to the multiple testing penalty incurred when increasing the sample size by a factor of between 10 and 100. Furthermore, simulations and theoretical considerations show that this discrepancy between DS_n and Scan_n will not disappear asymptotically. Therefore, these finite sample results clearly favor Scan_n over DS_n, despite the strong theoretical support for the latter. In Section 2 we explain why the practical performance of DS_n can be markedly inferior to that of Scan_n despite the asymptotic optimality results for DS_n. We introduce an exact finite sample criterion that is a more useful measure of the performance of a scan statistic. We then present three calibrations that also satisfy these optimality results (and hence allow detection thresholds that are substantially lower than those for Scan_n at larger scales) without paying a material price at smaller scales.
The first calibration uses a particular adjustment to the critical values of the T_I(Y_n) and is therefore tailored to the Gaussian case in (1). The second calibration involves an adjustment to the significance levels of the T_I(Y_n) and is therefore not specific to the Gaussian case but can be applied to arbitrary null distributions of Z_n. The third calibration restricts the scan to a particular sparse approximating set of intervals I and then simply applies a weighted Bonferroni adjustment to those T_I(Y_n). This calibration is also applicable for arbitrary null distributions of Z_n and furthermore has the advantage that the adjustment via Bonferroni removes the need to approximate critical values by simulation or by analytical approximation. For each of these three calibrations, the key to achieving the desired performance is a slight relenting on the demand for optimality at the largest scales |I_n| ≈ n, which is arguably not of practical concern. In Section 4 we examine the theoretical and practical performance of these calibrations in terms of the new finite sample criterion. Section 5 briefly discusses how these calibrations extend to other settings such as observations from densities and the multivariate case.

Siegmund and Venkatraman (1995) show that
(4)  Scan_n(Z_n) = √(2 log n) + O_p( log log n / √(log n) ) = √(2 log n) + o_p(1).

(The procedures in Sections 3 and 4 use interval lengths |I| that are bounded by n, but the critical values are similar and the conclusions described here are not sensitive to that upper bound.) While Scan_n(Z_n) is the maximum of ∼ n²/2 standard normals, the first order term √(2 log n) shows that it behaves like the maximum of only ∼ n i.i.d. standard normals, the reason being that many T_I(Z_n) are strongly correlated since the intervals I overlap. Arias-Castro et al. (2005) show that if ‖f_n‖ := √(|I_n|) μ_n ≥ √((2 + ε) log n) for ε > 0, then Scan_n will be asymptotically powerful, i.e. the probabilities of type I and of type II error both go to zero asymptotically. They generalize the setting (1) to multivariate and geometrically defined signals and show that these detection problems give rise to similar detection boundaries of the form √(D log n), which determine the amplitude of the signal that is detectable. To appreciate the importance of the constant D, note that √(2 log n) is essentially the Bonferroni adjusted critical value for n independent z-tests. Hence the threshold √(D log n) = √(2 log(n^{D/2})) corresponds to a multiple testing problem whose difficulty is determined by n^{D/2} independent z-tests. For this reason Arias-Castro et al. (2005) call D the exponent of effective dimension. A key insight of Dümbgen and Spokoiny (2001) is that employing critical values that depend on the scale |I_n| allows the detection of even smaller amplitudes, so the above detection boundary can in fact be improved upon: Translating their methodology into the setting (1), one can show that if

(5)  ‖f_n‖ ≥ √( (2 + ε_n) log(en/|I_n|) )

with ε_n → 0 not too fast, namely ε_n √(log(en/|I_n|)) → ∞, then DS_n has asymptotic power 1. Thus, if e.g.
|I_n|/n = n^{−1/2}, then the critical multiplier for log n in the detection boundary can be reduced from 2 to 1 if one uses DS_n in place of Scan_n. This multiplier cannot be reduced further, as reliable detection becomes asymptotically impossible for any procedure if '+ε_n' is replaced by '−ε_n' in (5), as can be seen from the lower bounds given by Dümbgen and Spokoiny (2001) in the case of small scales |I_n| and by Dümbgen and Walther (2008) in the case of large scales. While the above results provide a precise characterization of the detection boundary, we will now argue that these asymptotic results will necessarily be too imprecise to discern practically relevant performance characteristics of these max-type statistics with an appropriate level of precision, even in the large sample context. This is illustrated with the following theorem:

Theorem 1
If there exists a sequence of critical values {κ_n} for which Scan_n is asymptotically powerful against {f_n} in the model (1) with |I_n| ≤ n^p for a certain p > 0, then Scan_n is also asymptotically powerful for any sequence of critical values κ̃_n = κ_n + O(1) with O(1) ≥ 0.

Hence if the critical values κ_n result in an asymptotically powerful test, then so will the critical values κ_n + 100 (say), even though the latter are clearly not a useful choice even if the sample size is enormous. As explained in the proof of the theorem, analogous conclusions hold for related test statistics such as DS_n, or when considering asymptotic optimality with a fixed significance level. The upshot of the theorem is that asymptotic optimality statements such as (5) will characterize optimal critical values only to a precision of O(1), while the null distribution (4) concentrates at a rate of o(1) around √(2 log n). Importantly, it is the o(1) term that determines relevant performance characteristics of the statistic. To see this heuristically, note that the √(2 log n) term arises as a multiple testing adjustment for ∼ n independent test statistics. Suppose we greatly increase the multiple testing problem by a factor of k. Since √(2 log(kn)) ≤ √(2 log n) + log k/√(2 log n) = √(2 log n) + o(1), this increase is subsumed in the o(1) term and will therefore be overlooked by the asymptotic theory. But it is well known in statistical practice that a Bonferroni-like correction by such a large factor will typically affect the power of the statistic in a way that is quite relevant for inference. In summary, while the asymptotic optimality theory allows one to derive the fundamental difficulty of the detection problem in terms of the optimal detection threshold, such as (5), and this provides a necessary condition for the large sample optimality of a test statistic, the above considerations show that these conditions are not sufficient for a good performance of the test statistic, even in the large sample setting.
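The effect described in the preceding paragraph is easy to check numerically. The snippet below (an illustration we add here, with k = 100 chosen arbitrarily) confirms that inflating the number of tests by a factor k shifts the first-order threshold √(2 log n) by less than log k/√(2 log n), an o(1) amount:

```python
import numpy as np

k = 100  # inflate the multiple testing problem by this factor
for n in [10**4, 10**6, 10**8]:
    base = np.sqrt(2 * np.log(n))          # first-order threshold sqrt(2 log n)
    shift = np.sqrt(2 * np.log(k * n)) - base
    bound = np.log(k) / base               # sqrt(2 log(kn)) <= sqrt(2 log n) + log(k)/sqrt(2 log n)
    print(f"n=1e{int(np.log10(n))}: shift={shift:.3f} <= bound={bound:.3f}")
```

The shift shrinks as n grows, yet a 100-fold Bonferroni correction plainly matters for power at any realistic sample size, which is exactly the point of the theorem.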
This raises the question of how one should evaluate the performance of statistics such as Scan_n and DS_n in a practically informative way. One option would be to develop a refined asymptotic theory. But there are reasons to doubt whether this would be informative. For example, Dümbgen and Spokoiny (2001) introduce a refined version of the statistic DS_n which employs an iterated logarithm. An inspection of this refinement suggests that it should improve the performance on small scales, but our simulations show that it is nearly indistinguishable from DS_n for the sample sizes considered here, presumably because the asymptotics set in too slowly. For this reason we suspect that a refined asymptotic optimality theory may likewise not sufficiently illuminate the performance of a statistic. Therefore we propose a different approach which focuses on the finite sample performance of the statistic: Define ℓ_min(n, |I_n|) as the smallest ℓ₂ norm ‖f_n‖ := √(|I_n|) μ_n that the statistic is able to reliably detect in model (1), i.e. with power 80% at the 10% significance level. Then we solve for e_n(|I_n|) in the equation

ℓ_min(n, |I_n|) = √( 2 e_n(|I_n|) log(en/|I_n|) ).

Note the formal similarity to the exponent of effective dimension D described above, which measures the fundamental difficulty of the detection problem. But in contrast to the exponent of effective dimension, the realized exponent e_n(|I_n|) is a function of the test statistic and shows how close the test statistic comes to attaining the asymptotic detection boundary in the finite sample situation at hand. Importantly, e_n(|I_n|) can be readily computed via Monte Carlo, see Section 4.
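A rough Monte Carlo sketch of this computation follows. The implementation choices here (bisection over the amplitude μ, signal placed at the start of the sequence, modest repetition counts) are our own, not the authors':

```python
import numpy as np

def mc_power(stat, crit, n, length, mu, reps, rng):
    """Monte Carlo power of the test 'stat(Y) > crit' against a signal of
    amplitude mu on the first 'length' observations."""
    hits = 0
    for _ in range(reps):
        y = rng.standard_normal(n)
        y[:length] += mu
        hits += stat(y) > crit
    return hits / reps

def realized_exponent(stat, n, length, alpha=0.1, target=0.8, reps=300, seed=0):
    """Estimate e_n(|I_n|): bisect for the smallest amplitude mu detected with
    power 'target' at level alpha, then solve
    ||f_n||^2 = length * mu^2 = 2 * e_n * log(e*n/length) for e_n."""
    rng = np.random.default_rng(seed)
    # critical value: (1-alpha)-quantile of the null distribution of stat
    crit = np.quantile([stat(rng.standard_normal(n)) for _ in range(reps)], 1 - alpha)
    lo, hi = 0.0, 20.0
    for _ in range(12):                      # bisection on mu
        mid = (lo + hi) / 2
        if mc_power(stat, crit, n, length, mid, reps, rng) >= target:
            hi = mid
        else:
            lo = mid
    return length * hi**2 / (2 * np.log(np.e * n / length))
```

Passing any scan-type statistic as stat yields the corresponding entry of a realized-exponent comparison; the single-coordinate test stat = lambda y: float(y[0]) serves as a cheap sanity check of the machinery.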
e_n(|I_n|) is therefore well suited to compare various test statistics to the benchmark given by the traditional and popular statistic Scan_n. The asymptotic detection threshold (5) suggests that a statistic that performs well for the detection problem (1) should have a realized exponent e_n(|I_n|) close to 1 for all scales |I_n|. As an example, for n = 10 we find that Scan_n has a realized exponent e_n(1) = 1.41, while its realized exponent at the largest scales is close to 2, see Section 4. The poor performance at the largest scales is not surprising since it was shown in Chan and Walther (2013) that Scan_n will (asymptotically) attain the optimal threshold (5) only at the smallest scales. DS_n is designed to attain asymptotic optimality across all scales. Its realized exponent at the largest scales shows indeed a considerable improvement over Scan_n, but the price for this improvement is the disappointing performance at the important small scales: e_n(1) = 1.80. This means that the power loss of DS_n as compared to Scan_n is equivalent to increasing the size of the multiple testing problem by a factor of n^{1.80−1.41}, validating the criticism of DS_n that was stated in the Introduction. This immediately raises the question whether this outcome represents an unavoidable trade-off, or whether it is possible to construct a statistic that attains the advantageous performance of DS_n at larger scales without sacrificing performance at small scales. In the next section we will answer this question in the affirmative by presenting three different approaches for calibrating the various scales of such max-type statistics: the first approach calibrates the critical values, the second calibrates the significance levels, and the third one introduces a weighted Bonferroni scheme on a sparse subset of the scan windows which is very simple to implement.

Three ways of calibrating scan statistics for good performance
For each of the following three calibrations, the key to achieving a good finite sample performance is to give up some power at the largest scales |I_n| ≈ n. As explained in the Introduction, those scales are typically not of concern, and it turns out that a rather small sacrifice in power there will produce a considerable improvement in finite sample performance. The first calibration is a simplified version of a standardization used by Sharpnack and Arias-Castro (2016) in the context of proving a limiting distribution. We therefore call this correction to the critical values of the scan the
Sharpnack-Arias-Castro calibration (6)
SAC_n(Y_n) = max_I ( T_I(Y_n) − √( 2 log[ (en/|I|) (1 + log |I|) ] ) ).

This statistic is similar to DS_n given in (3), but the factor (1 + log |I|) increases the penalty for larger scales and therefore transfers some power to smaller scales. The resulting critical values for T_I(Y_n) are therefore

√( 2 log[ (en/|I|) (1 + log |I|) ] ) + q_n(α),

where q_n(α) is the (1−α)-quantile of SAC_n(Z_n), which is obtained by simulation, see Section 4. These critical values are plotted as a function of |I| in Figure 1 (blue line) for sample size n = 10. For comparison, the plot also shows the critical values resulting from the calibration DS_n (green line), which uses the correction term √(2 log(en/|I|)) added to the (1−α)-quantile of DS_n(Z_n), as well as those resulting from Scan_n (black line), which do not depend on the scale |I| and are simply given by the (1−α)-quantile of Scan_n(Z_n). The plot shows that while DS_n has critical values that are substantially smaller than those of Scan_n for most of the scales |I| (and therefore DS_n has more power there), it also has considerably larger critical values and hence inferior performance at the smallest scales. In contrast, the critical values of SAC_n are only slightly larger than those of Scan_n at the smallest scales while still being considerably smaller at the other scales. Therefore the simple correction term in SAC_n arguably produces a useful improvement over the traditional scan. We note that the correction term in SAC_n is specific to the Gaussian tails of the model (1). In order to use SAC_n in different settings it is therefore necessary to transform the statistic such that it has (sub)Gaussian tails under the null hypothesis. While this is possible in many cases, e.g.
Rivera and Walther (2013) show how to do this for likelihood ratio statistics by employing the square root of the log likelihood ratio, it is desirable to devise a calibration that is applicable to more general null distributions of the test statistic. To this end we propose to calibrate the significance level rather than the critical value, following the idea of the blocked scan introduced in Walther (2010). The blocked scan groups intervals of roughly the same length (e.g. having length within a factor of 2) into a block. Then all intervals within a block are assigned the same critical value. The significance level for each block is set so that the resulting test performs well across all scales. It turns out that this can be achieved by assigning a significance level that is proportional to a harmonic sequence. In more detail:
Figure 1: Critical values as a function of the length of the scan window for
Scan_n (black line), DS_n (green), SAC_n (blue) and BlockedScan_n (orange). Sample size is n = 10 and significance level is 10%. The critical values were simulated with Monte Carlo simulations, using the same largest window length of about n for all four procedures. The bottom plot zooms in on the smallest window lengths.

We define the first block to comprise the smallest intervals with length up to about log n, namely |I| < 2^{s_n}, where s_n := ⌈log₂ log n⌉. From there on we use powers of 2 to group interval lengths: the Bth block contains all intervals with length |I| ∈ [2^{B+s_n−2}, 2^{B+s_n−1}), B = 2, ..., B_max := ⌊log₂ n⌋ − s_n + 1. The choice of B_max means that the largest intervals we consider have length about n; we found that the results in this paper are not sensitive to this endpoint, and all simulations reported in this paper use this endpoint for the maximum in (2), (3), (6). Now we let the significance level of the Bth block decrease like a harmonic sequence: BlockedScan_n(Y_n) rejects if for any block B:

max_{I ∈ Bth block} T_I(Y_n) > c_{B,n}( α̃/B ),

where c_{B,n}(α) is the (1−α)-quantile of max_{I ∈ Bth block} T_I(Z_n), which is obtained by simulation, and α̃ is set so that BlockedScan_n(Z_n) has overall level α, see Walther (2010). Letting the significance level of the Bth block decrease with B like a harmonic sequence, rather than letting it increase as in the original proposal for the blocked scan in Walther (2010), has the same effect as using the correction term of SAC_n instead of that of DS_n for the critical values: asymptotic optimality at the largest scales |I| ≈ n will be lost, but one gains a notably better performance at the smallest scales. This can be seen in Figure 1, which shows that the critical values c_{B,n} (orange line) mimic those of
SAC_n. Note that the critical values of all four statistics considered so far need to be approximated either analytically or by simulation. Our third proposal avoids the effort that comes with such an approximation if the null distribution of T_I(Z_n) is known, as is the case for model (1) where it is standard normal. In order to explain the idea it is helpful to first review how the critical values of the above four statistics can be approximated. Note that the scan in (2) is defined as the maximum over ∼ n²/2 intervals. Hence the simulation of the null distribution becomes computationally infeasible for larger n, which has motivated analytical approximations to quantiles such as in Siegmund and Venkatraman (1995). Alternatively, work developed in the last 15 years has shown that one can effectively approximate this maximum by evaluating T_I only on a judiciously chosen set of intervals I. Importantly, it suffices to use only about O(n) intervals, hence critical values can be readily simulated via Monte Carlo. Our third proposal exploits the sparsity of such a collection of intervals not only for computation, but also for inference. The idea is that if the collection is sparse enough, then a simple union bound may produce critical values for the local test statistics that will result in a good performance. Note that this requires striking a delicate balance in order to guarantee good power: the collection of intervals has to be rich enough so it can provide a good approximation for an arbitrary interval I, but it also has to be sparse enough so that a Bonferroni adjustment will not unduly diminish the power of the local statistics. We will demonstrate below that applying a weighted Bonferroni adjustment to the approximating set of intervals introduced by Walther (2010) (see Rivera and Walther (2013) for the univariate version used here) results in a test that does indeed perform nearly as well as SAC_n and BlockedScan_n.
We call this calibration the Bonferroni scan. For integers ℓ ≥ 0 define m_ℓ := 2^ℓ and d_ℓ := ⌈ m_ℓ / √(2 log(en/m_ℓ)) ⌉. Then we approximate intervals with lengths in (m_ℓ, m_{ℓ+1}] with intervals from the collection

J_ℓ := { (j, k] : j, k ∈ { i d_ℓ, i = 0, 1, 2, ... } and m_ℓ ≤ k − j < 2 m_ℓ },

which is essentially the collection given in Rivera and Walther (2013) but indexed backwards. The spacing d_ℓ is of the order m_ℓ (2 log(en/m_ℓ))^{−1/2}, and this rate guarantees both a good enough approximation as well as a sufficiently sparse representation. This is an important difference to other approximation schemes introduced in the literature, such as the ones given in Arias-Castro et al. (2005) or in Sharpnack and Arias-Castro (2016), although it may be possible to modify those to produce comparable results. Now we proceed as with the blocked scan above: We group intervals into blocks and then assign a significance level that is proportional to a harmonic sequence. So we define

Bth block := ∪_{ℓ=0}^{s_n−1} J_ℓ  if B = 1,  and  J_{B+s_n−2}  if B = 2, ..., B_max.

Hence the Bth block contains intervals of exactly the same lengths as for the blocked scan, but the endpoints of the intervals are thinned out with the spacing d_ℓ. In contrast to the blocked scan, we can directly find the critical value for the Bth block by using a Bonferroni adjustment: If v_n(α) denotes the (1−α)-quantile of T_I(Z_n), then the critical value for the Bth block is

v_n( α / ( #(Bth block) · B · Σ_{i=1}^{B_max} 1/i ) ),

and BonferroniScan_n(Y_n) rejects if for any block B the statistic max_{I ∈ Bth block} T_I(Y_n) exceeds that critical value.
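Since its critical values come from a closed-form union bound, the Bonferroni scan is indeed short to implement. The sketch below follows the construction above; the rounding of d_ℓ, the restriction of interval lengths to multiples of d_ℓ, and the grid conventions are our own simplifications rather than the paper's exact choices:

```python
import numpy as np
from statistics import NormalDist

def bonferroni_scan(y, alpha=0.1):
    """Bonferroni scan sketch: evaluate T_I only on a sparse approximating set
    (endpoints on a grid with spacing d_l), group intervals into blocks, and
    reject if any block maximum exceeds its weighted Bonferroni critical value
    v_n(alpha / (#block * B * sum_i 1/i)). Returns True iff it rejects."""
    n = len(y)
    c = np.concatenate(([0.0], np.cumsum(y)))
    s = int(np.ceil(np.log2(np.log(n))))               # block 1: lengths < 2**s
    bmax = int(np.floor(np.log2(n))) - s + 1
    h = sum(1.0 / i for i in range(1, bmax + 1))       # harmonic normalizer
    for b in range(1, bmax + 1):
        ells = range(0, s) if b == 1 else [b + s - 2]
        tmax, count = -np.inf, 0
        for ell in ells:
            m = 2 ** ell
            d = max(1, int(np.ceil(m / np.sqrt(2 * np.log(np.e * n / m)))))
            for w in range(m, min(2 * m, n + 1)):      # lengths covered by J_ell
                if w % d:
                    continue                           # keep both endpoints on the grid {0, d, 2d, ...}
                j = np.arange(0, n - w + 1, d)         # left endpoints
                t = (c[j + w] - c[j]) / np.sqrt(w)
                count += len(t)
                tmax = max(tmax, t.max())
        if count and tmax > NormalDist().inv_cdf(1 - alpha / (count * b * h)):
            return True
    return False
```

No simulation is needed to run this test: by construction its level is at most α for any sample size.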
The number of intervals in the Bth block, #(Bth block), provides the Bonferroni adjustment within each block, while the weighted Bonferroni adjustment across blocks is given by B^{−1}, standardized by Σ_{i=1}^{B_max} 1/i. Therefore it follows immediately that this test has level at most α. Of course one may also evaluate all of the above calibrations on this approximating set of intervals. We denote the resulting procedures with the superscript app. Using this approximating set allows for a much faster evaluation of the statistic as well as simulation of the critical values, while the power is comparable or even slightly better than when using all intervals; this is presumably because the small loss due to approximating the intervals is compensated by the smaller critical values due to the fewer intervals that the statistic maximizes over. In order to provide a fair comparison of the Bonferroni scan to the other four procedures considered so far we simulated their critical values on the approximating set. Figure 2 shows a comparison of the critical values for sample size n = 10. The plot shows that while the critical values of BonferroniScan_n are not as small as those of SAC_n or the blocked scan, they are still competitive, which is remarkable given that they were obtained with a Bonferroni correction. We note that the critical values were obtained by simulation, which is possible for the sample size n = 10 because the approximating set has close to O(n) intervals rather than O(n²). The Bonferroni scan can be readily employed for even much larger sample sizes since it does not require any simulations to find its critical values.

This section compares the performance of the three calibrations with the traditional scan in terms of the realized exponent e_n(|I_n|). As explained in Section 2, the realized exponent is a standardized measure of the smallest ℓ₂ norm that the calibration can reliably detect in the model (1).
Therefore, a calibration that performs well for this task should have a small realized exponent, and ideally e_n(|I_n|) should be close to the bound 1 given by the asymptotic detection boundary (5) for all signal lengths |I_n|. We investigate this both theoretically and experimentally for the calibrations considered so far.

Figure 2: Critical values as a function of the length of the scan window for Scan_n^app (black line), DS_n^app (green), SAC_n^app (blue), BlockedScan_n^app (orange) and the Bonferroni scan (magenta). Sample size is n = 10 and significance level is 10%. The critical values were simulated with Monte Carlo simulations, using the same largest window length of about n for all four procedures. The bottom plot zooms in on the smallest window lengths.

[Table: e_n(|I_n|) by signal length |I_n| for Scan_n^app, DS_n^app, SAC_n^app, BlockedScan_n^app and the Bonferroni scan, for sample size n = 10.]

Theorem 2
Each of the calibrations DS_n, DS_n^app, SAC_n, SAC_n^app, BlockedScan_n, BlockedScan_n^app and the Bonferroni scan has a realized exponent e_n(|I_n|) that satisfies

e_n(|I_n|) ≤ 1 + b / √( log(en/|I_n|) )  for all |I_n| ∈ [1, n^p], p ∈ (0, 1),

where b = b(p) is a constant. We note that the traditional scan
Scan_n does not satisfy such a bound, as can be seen from the results in Chan and Walther (2013). Furthermore, as explained in the proof, it is possible to sharpen the result of Theorem 2 by deriving the appropriate value of b for each calibration. Such a result would provide a theoretical explanation of why DS_n has an inferior performance for small |I_n| and therefore complement existing optimality results that are not sensitive enough to discern this effect, as was explained in Section 2. We did not pursue this further since our focus is the exact evaluation of e_n(|I_n|) given below. In order to analyze the finite sample performance of these calibrations we evaluated their realized exponents with simulations. We evaluated all calibrations on the approximating set, for the reasons explained at the end of Section 3. We first computed the realized exponent e_n(|I_n|) for sample size n = 10. Table 4 shows e_n(|I_n|) for a representative selection of interval lengths |I_n| for the various calibrations discussed in the previous section. The column with signal length |I_n| = 1 confirms the poor performance of DS_n discussed in the Introduction: the realized exponent is 1.80, while that of the traditional scan is 1.41. In contrast, the performance of SAC_n and of the blocked scan is only slightly inferior at the smallest scales, while they increasingly outperform the traditional scan from signal length 15 onwards. The Bonferroni scan does not perform quite as well as SAC_n and the blocked scan, but it is still competitive, providing a clear improvement over the traditional scan at moderate and large signal lengths without incurring the large penalty at small scales that DS_n does.
We feel that this makes the Bonferroni scan anattractive choice since unlike all the other calibrations, it does not require an ancillary approximation to obtainits critical values and it is therefore particularly simple to use.We computed the realized exponents also for sample size n = 10 and arrived at the same conclusions, seeTable 4 and the discussion in Section 2.Comparing the table for sample size n = 10 with that for n = 10 shows that e n ( | I n | ) converges veryslowly to its asymptotic bound of 1, with many values of e n ( | I n | ) being closer to 2 rather than 1 even when n = 10 . This shows the merit of evaluating the performance with the finite sample criterion e n ( | I n | ) rather thanby establishing an asymptotic result as is typically done in the literature.11ignal length | I n | Scan appn DS appn SAC appn
The results of this paper were derived for the Gaussian sequence model (1) because this allows us to focus on the main ideas. However, the conclusions and methodology can be carried over to other settings where scan statistics are used. For example, one case of particular interest in the literature is the setting where one observes an (in)homogeneous Poisson process and the problem is to detect an interval where the intensity is elevated compared to a known baseline, i.e. one is looking for an interval with an unusually large number of events; see Glaz et al. (2001). Conditioning on the total number of observed events eliminates certain nuisance parameters and shows that the problem is equivalent to testing whether n i.i.d. observations arise from a known density f (which w.l.o.g. can be taken to be the uniform density on [0,1)) versus the alternative that f is elevated on an interval I:

(7)  f_{r,I}(x) = ( r 1(x ∈ I) + 1(x ∈ I^c) ) / ( r F(I) + F(I^c) ) · f(x),

so the problem becomes testing r = 1 vs. r > 1; see Loader (1991) and Rivera and Walther (2013). The results in the latter paper suggest that the heuristics, methodology and optimality results in the density/intensity model (7) are quite analogous to those in the Gaussian sequence model (1). In particular, it was pointed out in Rivera and Walther (2013) that the square root of the log likelihood ratio statistic has a subgaussian tail, which makes it possible to transfer the methodology from the Gaussian sequence model, with the empirical measure playing the role of the interval length |I_n|/n. The approximating set in Section 3 will omit the first block, as it is well known that at least n observations are necessary for sensible inference in the density setting.

Some of the most important applications of scan statistics concern the multivariate setting.
In that situation it is particularly important to evaluate the performance with a finite sample criterion such as e_n(|I_n|) because of the large-scale multiple testing problem involved: while there are on the order of n^2 intervals that contain distinct subsets of n points sampled from U[0,1], there are on the order of n^{2d} distinct axis-parallel rectangles that contain distinct subsets of n points sampled from U[0,1]^d. Nevertheless, Arias-Castro et al. (2005) show (for observations on a regular d-dimensional grid) that for many relevant classes of scanning windows, such as axis-parallel rectangles and balls, the effective dimension of the multiple testing problem is essentially linear in the sample size, i.e. the problem is not fundamentally more difficult than in the univariate model (1). Moreover, it is shown in Walther (2010) that if one employs scale-dependent critical values (the blocked scan) rather than the traditional scan as in Arias-Castro et al. (2005), then it is possible to overcome the 'curse of dimensionality': if the signal is supported on a lower-dimensional marginal, then the d-dimensional blocked scan has essentially the same asymptotic detection power as an optimal lower-dimensional test would have, so there is no fundamental penalty for scanning in the higher dimensional space. The optimality results of Arias-Castro et al. (2005) and Walther (2010) are asymptotic. Since a multivariate setting amplifies the problems with the usefulness of asymptotic results described in Section 2, it is of interest to reexamine the methodology with a finite sample criterion such as e_n(|I_n|). We note that all of the methodology developed in this paper can be carried over to the multivariate setting. In fact, the approximating set in Section 3 is essentially the univariate version of the d-dimensional approximating set introduced in Walther (2010).

Proof of Theorem 1:
For simplicity we write Y_n(I) := T_I(Y_n) = Σ_{i∈I} Y_i / √|I|, and likewise for Z_n(I). Model (1) gives

(8)  Y_n(I) = Z_n(I) + μ_n |I ∩ I_n| / √|I|.

Hence

max_I Y_n(I) = max( max_{I: I∩I_n=∅} Y_n(I),  max_{I: |I∩I_n|/√(|I||I_n|) ≥ 3/4} Y_n(I),  max_{I: 0 < |I∩I_n|/√(|I||I_n|) < 3/4} Y_n(I) )

(9)  ≤ max( max_I Z_n(I),  max_{I: |I∩I_n|/√(|I||I_n|) ≥ 3/4} Z_n(I) + μ_n √|I_n|,  max_{I: 0 < |I∩I_n|/√(|I||I_n|) < 3/4} Z_n(I) + (3/4) μ_n √|I_n| ).

We will show below that

(10)  max_{I: |I∩I_n|/√(|I||I_n|) ≥ 3/4} Z_n(I) ≤_d R,

where R is a universal random variable whose support is the real line, and that

(11)  A_n := max_{I: I∩I_n ≠ ∅} Z_n(I)  satisfies  A_n − √(2 log n) →_p −∞.

Therefore (9) gives

IP_{f_n}( Scan_n(Y_n) > κ_n ) ≤ IP( max_I Z_n(I) > κ_n ) + IP( R > κ_n − μ_n √|I_n| ) + IP( A_n > κ_n − μ_n √|I_n| ).

Since Scan_n(Y_n) is asymptotically powerful against {f_n}, the LHS converges to 1 and the first term on the RHS converges to 0. Hence, writing b_n := μ_n √|I_n| − κ_n and observing that κ_n ≥ √(2 log n) eventually by (2) in Kabluchko (2011):

1 ≤ lim inf_n ( IP(R > −b_n) + IP(A_n − √(2 log n) > −b_n) ).

This implies b_n → ∞, since the support of R is the real line and A_n − √(2 log n) →_p −∞. But then we have for any sequence ˜κ_n = κ_n + O(1):

(12)  IP_{f_n}( Scan_n(Y_n) > ˜κ_n ) ≥ IP_{f_n}( Y_n(I_n) > ˜κ_n )
(13)  = IP( N(0,1) + μ_n √|I_n| > ˜κ_n )  by (8)
(14)  → 1  since μ_n √|I_n| − κ_n → ∞,

and IP( Scan_n(Z_n) > ˜κ_n ) ≤ IP( Scan_n(Z_n) > κ_n ) → 0 if ˜κ_n ≥ κ_n. Hence Scan_n is also asymptotically powerful against {f_n} when employing the critical values ˜κ_n. In fact, this is also true for any sequence of critical values ˜κ_n = √(2 log n) + O(1) with O(1) ≥ 0 (say), as can be seen by instead taking b_n := μ_n √|I_n| − √(2 log n) above and using (4).

We note that the above proof uses the fact that the type I error probability goes to 0. Some related optimality results in the literature follow the approach of Dümbgen and Spokoiny (2001) and establish asymptotic power 1 at a fixed significance level. The main conclusion of this theorem, namely that asymptotic optimality leaves a leeway of size O(1) for the critical value, will typically hold also in that setting and for related statistics, as can be seen by inspecting the proofs.

It remains to prove (10) and (11). In order to prove (10) we will show that

(15)  max_{I: |I∩I_n|/√(|I||I_n|) ≥ 3/4} Z_n(I) ≤_d R_1 + R_2 + R_3,

where the R_i are independent universal random variables and the support of R_1 is the real line; hence the support of R_1 + R_2 + R_3 also equals the real line.

It is straightforward to check that the condition on I implies |I ∩ I_n| ≥ (9/16)|I_n| and |I| ≤ (16/9)|I_n| ≤ 2|I_n|. Therefore there exist intervals S_n and L_n, each having integers as endpoints and depending only on I_n, such that |S_n| ≥ |I_n|/16, |L_n| ≤ 5|I_n| and S_n ⊂ I ⊂ L_n for every such I. (Let S_n be the smallest such interval whose midpoint equals that of I_n, and construct L_n by moving each endpoint of I_n outward by 2|I_n|, then intersect the resulting interval with (0,n].) Hence we can write I as the union of three disjoint intervals I = I_left ∪ S_n ∪ I_right, where I_left and I_right may be empty and |I_left|, |I_right| ≤ 2|I_n|. So

Z_n(I) = Σ_{i∈I_left} Z_i / √|I| + Σ_{i∈S_n} Z_i / √|I| + Σ_{i∈I_right} Z_i / √|I|.

The three terms are independent.
As for the middle term, |S_n|/|I| ∈ [1/32, 1] implies

max_I Σ_{i∈S_n} Z_i / √|I| = max_I √(|S_n|/|I|) · Z_n(S_n) ≤ √(1/32) Z_n(S_n) 1(Z_n(S_n) < 0) + Z_n(S_n) 1(Z_n(S_n) ≥ 0) =_d √(1/32) Z 1(Z < 0) + Z 1(Z ≥ 0) =: R_1,

where Z ∼ N(0,1). Clearly the support of R_1 is the real line. As for the third term, Σ_{i∈I_right} Z_i is the sum of the first |I_right| Z_i's following the right endpoint of S_n. Using |I_right| ≤ 2|I_n| and |I| ∈ [|I_n|/16, 2|I_n|] we get

max_I Σ_{i∈I_right} Z_i / √|I| ≤_d max_{k=0,1,...,2|I_n|} max_{f_k ∈ [|I_n|/16, 2|I_n|]} Σ_{i=1}^{k} Z_i / √f_k ≤_d sup_{0≤t≤2|I_n|} sup_{f_t ∈ [|I_n|/16, 2|I_n|]} W(t)/√f_t, where W(t) is a Brownian motion,
= √16 · sup_{0≤t≤2|I_n|} W(t)/√|I_n|, since the sup is ≥ 0 a.s., =: R_2.

Up to a constant factor, the distribution of R_2 is that of a standard normal conditioned to be positive. The first term is bounded analogously, proving (15) since stochastic ordering is preserved when taking sums of independent random variables.

As for (11), write I_n = (j_n, k_n] and note that I ∩ I_n ≠ ∅ implies that I ⊂ I_n, or the left endpoint j_n ∈ I, or the right endpoint k_n ∈ I. Hence

(16)  max_{I: I∩I_n ≠ ∅} Z_n(I) ≤ max( max_{I⊂I_n} Z_n(I),  max_{I: j_n∈I} Z_n(I),  max_{I: k_n∈I} Z_n(I) ).

By Lemma 1 of Chan and Walther (2013), max_{I⊂I_n} Z_n(I) ≤_d L + √(2 log(e|I_n|)) for some universal random variable L. So if |I_n| ≤ n^p with p < 1, then max_{I⊂I_n} Z_n(I) − √(2 log n) →_p −∞. Next,

max_{I: j_n∈I} Z_n(I) = max_{a ∈ {0,...,j_n−1}} max_{b ∈ {j_n+1,...,n}} ( Σ_{i=a+1}^{j_n} Z_i + Σ_{i=j_n+1}^{b} Z_i ) / √(b − a) ≤ max_{a ∈ {0,...,j_n−1}} |Σ_{i=a+1}^{j_n} Z_i| / √(j_n − a) + max_{b ∈ {j_n+1,...,n}} |Σ_{i=j_n+1}^{b} Z_i| / √(b − j_n).

These two terms are independent since they involve disjoint sets of Z_i's. Each term is ≤_d max_{1≤k≤n} |Σ_{i=1}^{k} Z_i| / √k =: L_n, hence max_{I: j_n∈I} Z_n(I) ≤_d L_n + L'_n, where L_n and L'_n are i.i.d. copies, since stochastic ordering is preserved when taking sums of independent random variables. By a theorem of Darling and Erdős (1956), L_n = O_p(√(log log n)). The third term in (16) is bounded analogously. (11) follows from (16) and the above bounds. □

Proof of Theorem 2:
We will first establish the claim for DS_n, SAC_n and BlockedScan_n. For any signal f_n from model (1) with support I_n = (j_n, k_n] we have T_{j_n k_n}(Y_n) = √(k_n − j_n) μ_n + T_{j_n k_n}(Z_n) = ‖f_n‖ + T_{j_n k_n}(Z_n) by (8). So if a generic test assigns critical values c_n^{jk}(α) to the T_{jk}(Y_n), then

IP_{f_n}( test rejects ) ≥ IP_{f_n}( T_{j_n k_n}(Y_n) > c_n^{j_n k_n}(α) ) = 1 − Φ( c_n^{j_n k_n}(α) − ‖f_n‖ )  since T_{j_n k_n}(Z_n) ∼ N(0,1)
(17)  ≥ 0.8  provided ‖f_n‖ ≥ c_n^{j_n k_n}(α) + z_{0.8},

where z_{0.8} denotes the 80th percentile of N(0,1). Since this inequality holds uniformly for such f_n, it follows that the smallest detectable ℓ₂ norm for this generic test, ℓ_min(n, |I_n|), is not larger than c_n^{j_n k_n}(α) + z_{0.8}. So if the c_n^{jk}(α) satisfy

(18)  c_n^{jk}(α) − √(2 log(en/(k−j))) ≤ b  for all 0 ≤ j < k ≤ n with k − j ≤ n^p

for some number b, then

e_n(|I_n|) = ℓ²_min(n, |I_n|) / (2 log(en/|I_n|)) ≤ 1 + 2(b + z_{0.8}) / √(2 log(en/|I_n|)) + (b + z_{0.8})² / (2 log(en/|I_n|)) ≤ 1 + ( √2 (b + z_{0.8}) + (b + z_{0.8})² ) / √(log(en/|I_n|)).

Hence the claim of the theorem follows for a particular calibration c_n^{jk}(α) once (18) is established for that calibration. We will now check this condition for the calibrations used by DS_n, SAC_n and BlockedScan_n. It follows from Thm. 2.1 in Dümbgen and Spokoiny (2001) that DS_n(Z_n) = O_p(1). Hence its critical value for T_{jk} is given by c_n^{jk}(α) = √(2 log(en/(k−j))) + κ_n(α) with κ_n(α) = O(1), so (18) holds. Next we check this condition for SAC_n. Since the penalty term in SAC_n is larger than that in DS_n, we obtain SAC_n(Z_n) = O_p(1) and hence the (1−α)-quantile of SAC_n(Z_n) also stays bounded in n. So (18) holds for SAC_n provided that

√( 2 log[ (en/(k−j)) (1 + log(k−j)) ] ) ≤ √(2 log(en/(k−j))) + O(1)  for all 0 ≤ j < k ≤ n with k − j ≤ n^p.

But this is a consequence of k − j ≤ n^p: using √(x+y) ≤ √x + y/(2√x) one sees that the first square root is not larger than

√(2 log(en/(k−j))) + log(1 + log(k−j)) / √(2 log(en/(k−j))) ≤ √(2 log(en/(k−j))) + √(2/(1−p))

eventually, since log(1 + log(k−j)) = O(log log n) while log(en/(k−j)) ≥ (1−p) log n. As an aside, (17) also shows how much
SAC_n loses on the largest scales k − j ≈ n when compared to DS_n: there the DS penalty is √(2 log(en/(k_n − j_n))) ≈ √2, while the SAC penalty is approximately √(2 log log n). So in order to achieve 80% power via the condition ‖f_n‖ ≥ c_n^{j_n k_n}(α) + z_{0.8} in (17), some constant ‖f_n‖ is sufficient for DS_n, while SAC_n needs ‖f_n‖ to grow with n, but only at the very slow rate √(2 log log n).

Furthermore, proceeding similarly as in Chan and Lai (2006) and Kabluchko (2011), it is possible to replace the inequalities in (17) with an asymptotic expansion. The critical values for DS_n are c_n^{jk}(α) = √(2 log(en/(k−j))) + κ_n(α), while the above approximation to the SAC penalty shows that for small intervals k − j ≪ n the critical values of SAC_n are approximately √(2 log(en/(k−j))) + log log(e(k−j)) / √(2 log(en/(k−j))) + ˜κ_n with ˜κ_n < κ_n. Plugging these critical values into the expansion for the smallest detectable ℓ₂ norm shows that for |I_n| ≪ n we obtain e_n(|I_n|) ≈ 1 + b_DS / √(log(n/|I_n|)) + o( 1/√(log(n/|I_n|)) ) in the case of DS_n, and likewise for SAC_n. Importantly, b_DS > b_SAC, and b_DS is also larger than the corresponding constants for the blocked scan and the Bonferroni scan. Hence such an expansion would provide a theoretical explanation of the superior performance of these three calibrations when compared to DS_n and would therefore complement existing optimality results that are not sensitive enough to discern this effect, as explained in Section 2. We leave the rigorous demonstration of such an optimality theory open.

Continuing with the proof, we next check (18) for the blocked scan. By the union bound,

α = IP( ∪_{B=1}^{B_max} { max_{(j,k] ∈ Bth block} T_{jk}(Z_n) > c_{B,n}(˜α/B) } ) ≤ Σ_{B=1}^{B_max} ˜α/B ≤ ˜α (log B_max + 1) ≤ ˜α (log log n + 2),

so

(19)  ˜α/B ≥ α / ( (log n)(log log n + 2) ).
In the case B ≥ 2, the proofs of Theorems 6.1 and 2.1 in Dümbgen and Spokoiny (2001) show that for some constant C (which is universal in this context) and for every S ≥ 1:

(20)  IP( max_{(j,k] ∈ Bth block} T_{jk}(Z_n) > √( 2 log(en/(2^{B−1} s_n)) + S log log(e²n/(2^{B−1} s_n)) ) ) ≤ C exp( (C − S/C) log log(e²n/(2^{B−1} s_n)) ) ≤ C ( log(e²n/(2^{B−1} s_n)) )^{−2}  by choosing S ≥ C(C+2).

In the case B = 1 the inequality (20) also holds for such S, as can be checked by applying the union bound, the Gaussian tail bound, and the fact that there are not more than n s_n ≤ n log n intervals in the first block. Since we need to establish (18) only for interval lengths k − j ≤ n^p, we need only consider block indices B ≤ log₂(n^p/s_n) + 2. For those B, (20) is not larger than

C ( log(e²n/n^p) )^{−2} ≤ C ( (1−p) log n )^{−2} ≤ α / ( (log n)(log log n + 2) )  eventually.

Comparing with (19) we immediately obtain

c_{B,n}(˜α/B) ≤ √( 2 log(en/(2^{B−1} s_n)) + S log log(e²n/(2^{B−1} s_n)) ).

Now (18) follows, because every interval (j,k] belongs to a block with some index B. Then k − j < 2^{B−1} s_n, and so the critical value c_{B,n}(˜α/B) for T_{jk}(Y_n) is not larger than

√( 2 log(en/(k−j)) + S log log(e²n/(k−j)) ) ≤ √(2 log(en/(k−j))) + O(1).

Next we will establish the claim for the four calibrations that use the approximating set of intervals. The sparsity of the approximating set makes it straightforward to establish (18) for the null distribution, as will be seen below. On the other hand, we now have to account for the approximation error incurred by not being able to match the support of the signal exactly. The approximating set is constructed such that there is a bound on the error that is of the size needed:

Proposition 1.
For every I = (j,k] ⊂ (0,n] there exists an interval J in the approximating collection such that

|I ∩ J| / √(|I||J|) ≥ √( 1 − 2/√(2 log(en/(|I|∧|J|))) ) ≥ 1 − 1/√(2 log(en/(|I|∧|J|))) − 1/log(en/(|I|∧|J|)).

The proof of the proposition is below. So if f_n(i) = μ_n 1(i ∈ I_n) is a signal from model (1), then there exists an interval J_n in the approximating set such that

T_{J_n}(Y_n) = ( |I_n ∩ J_n| / √|J_n| ) μ_n + T_{J_n}(Z_n) = ( |I_n ∩ J_n| / √(|J_n||I_n|) ) ‖f_n‖ + T_{J_n}(Z_n)

by (8) and since ‖f_n‖ = √|I_n| μ_n. So if ‖f_n‖ = √(2 log(en/|I_n|)) + b_n, then we get with Proposition 1

T_{J_n}(Y_n) ≥ ( 1 − 1/√(2 log(en/|I_n|)) − 1/log(en/|I_n|) ) ( √(2 log(en/|I_n|)) + b_n ) + T_{J_n}(Z_n)
≥ √(2 log(en/|I_n|)) + b_n ( 1 − 1/√(2(1−p) log(en)) ) − 1 − √( 2/((1−p) log(en)) ) + T_{J_n}(Z_n)  since |I_n| ≤ n^p
(21) ≥ √(2 log(en/|J_n|)) + b_n ( 1 − 1/√(2(1−p) log(en)) ) − 1 − (1 + √e) √( 2/((1−p) log(en)) ) + T_{J_n}(Z_n)  as |I_n|/|J_n| ∈ [1/2, 2].

Now we can proceed as before: suppose a generic test on the approximating set assigns critical values ˜c_n^{jk}(α) to the T_{jk}(Y_n) that satisfy

(22)  ˜c_n^{jk}(α) − √(2 log(en/(k−j))) ≤ ˜b  for all (j,k] in the approximating set with k − j ≤ n^p.

Set b_n such that the sum of the middle three terms in (21) equals ˜b + z_{0.8}. Then, denoting the endpoints of J_n by J_n = (j_n, k_n]:

IP_{f_n}( test rejects ) ≥ IP_{f_n}( T_{J_n}(Y_n) > ˜c_n^{j_n k_n}(α) ) ≥ IP( √(2 log(en/|J_n|)) + ˜b + z_{0.8} + T_{J_n}(Z_n) > ˜c_n^{j_n k_n}(α) ) ≥ 0.8

by (22). Hence the smallest detectable ℓ₂ norm for this generic test is not larger than √(2 log(en/|I_n|)) + b_n. Since the above choice for b_n implies b_n → ˜b + 1 + z_{0.8}
, the claim of the theorem will follow as before upon verifying (22).

As an aside, we note that in the case without an approximating set the bound on the smallest detectable ℓ₂ norm has the term b + z_{0.8} in place of b_n ≈ ˜b + 1 + z_{0.8}, with b from (18) and ˜b from (22). Hence using the approximating set adds 1 to that bound but allows the use of ˜b, and ˜b ≤ b because there are fewer intervals in the approximating set to maximize over. We found in our simulations that using the approximating set typically results in slightly more power.

(22) clearly holds for DS_n^app, SAC_n^app and
BlockedScan_n^app, since (18) holds for their counterparts DS_n, SAC_n and BlockedScan_n, and the former statistics cannot be larger than their counterparts, as they maximize over a subset of the intervals that their counterparts maximize over. Therefore the claim of the theorem follows for these three calibrations.

We note that we established (18) by appealing to Theorem 2.1 in Dümbgen and Spokoiny (2001), which rests on their Theorem 6.1. The assumptions of that theorem are difficult to check in general, as exemplified by the proof of their Theorem 2.1. The sparsity of the approximating set makes it possible to alternatively establish (22) directly with a simple application of the union bound and the Gaussian tail bound, similarly to the following derivation for the Bonferroni scan: w_B := 2^{B−1} s_n is an upper bound on the interval length in the Bth block. It follows from (23) that

#(Bth block) ≤ 4n log(en)  if B = 1,   #(Bth block) ≤ (8n/w_B) log(en/w_B)  if B ≥ 2.

Hence the Gaussian tail bound gives

v_n( α / ( #(Bth block) · B · Σ_{i=1}^{B_max} 1/i ) ) = Φ^{−1}( 1 − α / ( #(Bth block) · B · Σ_{i=1}^{B_max} 1/i ) )
≤ √( 2 log( #(Bth block) · B · Σ_{i=1}^{B_max} (1/i) / α ) )
≤ √( 2 log(n/w_B) + 2 log( 8 (log en) B Σ_{i=1}^{B_max} (1/i) / α ) )
≤ √( 2 log(n/w_B) + 2 log( 12 (log en)² (log log n + 3/2) / α ) )  since B ≤ B_max ≤ (3/2) log n and Σ_{i=1}^{B_max} 1/i ≤ log B_max + 1/(2 B_max) + 0.58 ≤ log log n + 3/2
≤ √(2 log(en/w_B)) + log( 12 (log en)² (log log n + 3/2) / α ) / √(2 log(en/w_B))
≤ √(2 log(en/w_B)) + O(1)  if w_B ≤ n^p, p < 1,

which establishes (22) for the Bonferroni scan.

It remains to prove Proposition 1: Let ℓ be such that |I| ∈ [m_ℓ, 2m_ℓ). Elementary considerations show that one can pick J ∈ J_ℓ such that |I △ J| ≤ 2 d_ℓ. Hence

|I ∩ J| / √(|I||J|) = √( (|I| − |I\J|)/|I| ) · √( (|J| − |J\I|)/|J| ) = √( 1 − α |I △ J|/|I| ) · √( 1 − (1−α) |I △ J|/|J| )  where α = |I\J| / |I △ J|
≥ √( 1 − |I △ J| / min(|I|,|J|) )  since (1 − αx)(1 − (1−α)x) ≥ 1 − x
≥ √( 1 − 2 d_ℓ / m_ℓ ).

If m_ℓ > √(2 log(en/m_ℓ)), then 2 d_ℓ / m_ℓ ≤ 2/√(2 log(en/m_ℓ)) + 2/m_ℓ ≤ 2/√(2 log(en/(|I|∧|J|))) and the first inequality of the claim follows. If m_ℓ ≤ √(2 log(en/m_ℓ)), then d_ℓ = ⌈ m_ℓ / √(2 log(en/m_ℓ)) ⌉ = 1, so by the definition of J_ℓ we can take J := I and the first inequality of the claim also holds. The second inequality of the proposition follows from √(1−x) ≥ 1 − x/2 − x²/2 for x ∈ (0,1]. □

The proof of the theorem uses the following bounds on the number of intervals in the approximating set: there are no more than n/d_ℓ possible left endpoints for intervals in J_ℓ, and for each left endpoint there are no more than m_ℓ/d_ℓ right endpoints, hence

#J_ℓ ≤ (n/d_ℓ)(m_ℓ/d_ℓ) = n m_ℓ / d_ℓ² ≤ (2n/m_ℓ) log(en/m_ℓ) = 2n 2^{−ℓ} log(e 2^{−ℓ} n).

Therefore

(23)  #(Bth block) ≤ Σ_{ℓ=0}^{log₂ s_n} 2n 2^{−ℓ} log(e 2^{−ℓ} n) ≤ 4n log(en)  if B = 1,  and  #(Bth block) = #J_{B−2+log₂ s_n} ≤ 2n 2^{2−B} s_n^{−1} log(e 2^{2−B} n / s_n)  if B ≥ 2,

since 2^{B−2} s_n ≤ n. □

References
Arias-Castro, E., Donoho, D. and Huo, X. (2005). Near-optimal detection of geometric objects by fast multiscale methods.
IEEE Trans. Inform. Theory, 2402–2425.

Chan, H.P. and Lai, T.L. (2006). Maxima of asymptotically Gaussian random fields and moderate deviation approximations to boundary crossing probabilities of sums of random variables with multidimensional indices. Ann. Probab., 80–121.

Chan, H.P. and Walther, G. (2013). Detection with the scan and the average likelihood ratio. Statistica Sinica, 409–428.

Darling, D.A. and Erdős, P. (1956). A limit theorem for the maximum of normalized sums of independent random variables. Duke Math. J., 143–154.

Dümbgen, L. and Spokoiny, V.G. (2001). Multiscale testing of qualitative hypotheses. Ann. Statist., 124–152.

Dümbgen, L. and Walther, G. (2008). Multiscale inference about a density. Ann. Statist., 1758–1785.

Frick, K., Munk, A. and Sieling, H. (2014). Multiscale change point inference. J. R. Stat. Soc. Ser. B, 495–580.

Glaz, J., Naus, J. and Wallenstein, S. Scan Statistics.
Springer Series in Statistics. Springer-Verlag, New York, 2001.

Kabluchko, Z. (2011). Extremes of the standardized Gaussian noise.
Stochastic Processes and their Applications, 515–533.

König, C., Munk, A. and Werner, F. (2018). Multidimensional multiscale scanning in exponential families: Limit theory and statistical consequences. arXiv:1802.07995.

Loader, C. R. (1991). Large-deviation approximations to the distribution of the scan statistics.
Adv. Appl. Prob., 751–771.

Naus, J. I. and Wallenstein, S. (2004). Multiple window and cluster size scan procedures. Meth. Comp. Appl. Probab., 389–400.

Rivera, C. and Walther, G. (2013). Optimal detection of a jump in the intensity of a Poisson process or in a density with likelihood ratio statistics. Scand. J. Stat., 752–769.

Rohde, A. (2008). Adaptive goodness-of-fit tests based on signed ranks. Ann. Statist., 1346–1374.

Sharpnack, J. and Arias-Castro, E. (2016). Exact asymptotics for the scan statistic and fast alternatives. Elect. J. Statist., 2641–2684.

Siegmund, D. (2017). Personal communication.

Siegmund, D. and Venkatraman, E.S. (1995). Using the generalized likelihood ratio statistic for sequential detection of a change-point. Ann. Statist., 255–271.

Walther, G. (2010). Optimal and fast detection of spatial clusters with scan statistics. Ann. Statist. 38