An Asymptotic Theory of Joint Sequential Changepoint Detection and Identification for General Stochastic Models
IEEE TRANSACTIONS ON INFORMATION THEORY, 2021
Alexander G. Tartakovsky, Senior Member, IEEE
Abstract—The paper addresses a joint sequential changepoint detection and identification/isolation problem for a general stochastic model, assuming that the observed data may be dependent and non-identically distributed, the prior distribution of the change point is arbitrary, and the post-change hypotheses are composite. The developed detection–identification theory generalizes the changepoint detection theory developed by Tartakovsky (2019) to the case of multiple composite post-change hypotheses, when one has not only to detect a change as quickly as possible but also to identify (or isolate) the true post-change distribution. We propose a multi-hypothesis change detection–identification rule and show that it is nearly optimal, minimizing moments of the delay to detection as the probability of a false alarm and the probabilities of misidentification go to zero.
Index Terms—Asymptotic Optimality; Changepoint Detection–Identification Problems; Expected Detection Delay; General Stochastic Models; Moments of the Delay to Detection.
I. INTRODUCTION

In many applications, one needs not only to detect an abrupt change as quickly as possible but also to provide a detailed diagnosis of the occurred change, that is, to determine which type of change is in effect. For example, the problem of detection and diagnosis is important for rapid detection and isolation of intrusions in large-scale distributed computer networks, target detection with radar, sonar and optical sensors in a cluttered environment, detecting terrorists' malicious activity, fault detection and isolation in dynamic systems and networks, and integrity monitoring of navigation systems, to name a few (see [20, Ch. 10] for an overview and references). In other words, there are several kinds of changes that can be associated with several different post-change distributions, and the goal is to detect the change and to identify which distribution corresponds to the change. As a result, the problem of changepoint detection and diagnosis is a generalization of the quickest change detection problem [6], [13], [15], [16], [20] to the case of $N > 1$ post-change hypotheses, and it can be formulated as a joint change detection and identification problem. In the literature, this problem is usually called change detection and isolation. The detection–isolation problem has been considered in both Bayesian and minimax settings.

The work was supported in part by the Russian Science Foundation under grant 18-19-00452 at the Moscow Institute of Physics and Technology. A. G. Tartakovsky is a Deputy Head of the Space Informatics Laboratory at the Moscow Institute of Physics and Technology, Russia, and President of AGT StatConsult, Los Angeles, California, USA; e-mail: [email protected]. Manuscript received March 23, 2020; revised October 26, 2020; accepted February 22, 2021. Copyright (c) 2020 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].
In 1995, Nikiforov [8] suggested a minimax approach to the change detection–isolation problem and showed that the multihypothesis version of the CUSUM rule is asymptotically optimal when the average run length (ARL) to a false alarm and the mean time to false isolation become large. Several versions of the multihypothesis CUSUM-type and SR-type procedures, which have minimax optimality properties in the classes of rules with constraints imposed on the ARL to a false alarm and conditional probabilities of false isolation, were proposed by Nikiforov [9], [10] and Tartakovsky [22]. These rules asymptotically minimize maximal expected delays to detection and isolation as the ARL to a false alarm is large and the probabilities of wrong isolation are small. Dayanik et al. [2] proposed an asymptotically optimal Bayesian detection–isolation rule assuming that the prior distribution of the change point is geometric. In all these papers, the optimality results were restricted to the case of independent and identically distributed (i.i.d.) observations (in pre- and post-change modes with different distributions) and simple post-change hypotheses. In many practical applications, the i.i.d. assumption is too restrictive. The observations may be either non-identically distributed or dependent or both, i.e., non-i.i.d. Also, in a variety of applications, the pre-change distribution is known, but the post-change distribution is rarely known completely. A more realistic situation is parametric uncertainty, when the parameter of the post-change distribution is unknown, since a putative parameter value is rarely representative. Lai [5] provided a certain generalization for the non-i.i.d. case and composite hypotheses for a specific loss function. See Chapter 10 in Tartakovsky et al. [20] for a detailed overview.

One of the most challenging and important versions of the change detection–isolation problem is the multidecision and multistream detection problem, when it is necessary not only to detect a change as soon as possible but also to identify the streams where the change happens with given probabilities of misidentification. Specifically, there are $N$ data streams, and the change occurs in some of them at an unknown point in time. It is necessary to detect the change in distribution as soon as possible and indicate which streams are "corrupted." Both the rates of false alarms and misidentification should be controlled by given (usually low) levels. In the following, we will refer to this problem as the Multistream Sequential Change Detection–Identification problem.

In this paper, we address a simplified multistream detection–identification scenario where the change can occur only in a single stream and we need to determine in which stream. We assume that the observations in streams can have a very general structure, i.e., they can be dependent and non-identically distributed. We focus on a semi-Bayesian setting, assuming that the change point is random with a given (prior) distribution. However, we do not suppose that there is a prior distribution on the post-change hypotheses. We generalize the asymptotic Bayesian theory developed by Tartakovsky [23] for a single post-change hypothesis (for a single stream). Specifically, we show that under certain conditions (related to the law of large numbers for the log-likelihood processes) the proposed multihypothesis detection–identification rule asymptotically minimizes the trade-off between positive moments of the detection delay and the false alarm/misclassification rates expressed via the weighted probabilities of false alarm and false identification. The key assumption in the general asymptotic theory is the stability property of the log-likelihood ratio processes in streams between the "change" and "no-change" hypotheses, which can be formulated in terms of the law of large numbers and rates of convergence of the properly normalized log-likelihood ratios and their adaptive versions in the vicinity of the true parameter values.

The rest of the paper is organized as follows. In Section II, we describe the general stochastic model treated in the paper. In Section III, we introduce the mixture-based change detection–identification rule. In Section IV, we formulate the asymptotic optimization problems in the class of changepoint detection–identification rules with constraints imposed on the probabilities of false alarm and wrong identification. In Section V, we obtain upper bounds on the probabilities of false alarm and misidentification as functions of thresholds.
In Section VI, we derive asymptotic lower bounds for moments of the detection delay in the class of rules with given probabilities of false alarm and misidentification, and in Section VII, we prove asymptotic optimality of the proposed mixture detection–identification rule as the probabilities of false alarm and misidentification go to zero. In Section VIII, we consider an example that illustrates the general results. Section IX concludes.

II. THE GENERAL STOCHASTIC MODEL
Suppose there are $N$ independent data streams $\{X_n(i)\}_{n \ge 1}$, $i = 1, \dots, N$, observed sequentially in time subject to a change at an unknown time $\nu \in \{0, 1, 2, \dots\}$, so that $X_1(i), \dots, X_\nu(i)$ are generated by one stochastic model and $X_{\nu+1}(i), X_{\nu+2}(i), \dots$ by another model when the change occurs in the $i$-th stream. We will assume that the change in distributions may happen only in one stream, and it is not known which stream is affected, i.e., we are interested in a "multisample slippage" changepoint model (given $\nu$ and that the $i$-th stream is affected with the parameter $\theta_i$) for which the joint density $p(\mathbf{X}^n \mid H_{\nu,i}, \theta_i)$ of the data $\mathbf{X}^n = (\mathbf{X}^n(1), \dots, \mathbf{X}^n(N))$, $\mathbf{X}^n(i) = (X_1(i), \dots, X_n(i))$, observed up to time $n$ is of the form
\[
p(\mathbf{X}^n \mid H_{\nu,i}, \theta_i) = p(\mathbf{X}^n \mid H_\infty) = \prod_{t=1}^{n} \prod_{\ell=1}^{N} g_\ell(X_t(\ell) \mid \mathbf{X}^{t-1}(\ell)) \quad \text{for } \nu \ge n,
\]
\[
p(\mathbf{X}^n \mid H_{\nu,i}, \theta_i) = \prod_{t=1}^{\nu} g_i(X_t(i) \mid \mathbf{X}^{t-1}(i)) \times \prod_{t=\nu+1}^{n} f_{i,\theta_i}(X_t(i) \mid \mathbf{X}^{t-1}(i)) \times \prod_{j \in \mathcal{N} \setminus \{i\}} \prod_{t=1}^{n} g_j(X_t(j) \mid \mathbf{X}^{t-1}(j)) \quad \text{for } \nu < n, \tag{1}
\]
where $H_{\nu,i}$ denotes the hypothesis that the change occurs at time $\nu$ in stream $i$, $g_i(X_t(i) \mid \mathbf{X}^{t-1}(i))$ and $f_{i,\theta_i}(X_t(i) \mid \mathbf{X}^{t-1}(i))$ are the conditional pre- and post-change densities in the $i$-th data stream, respectively (with respect to some sigma-finite measure), and $\mathcal{N} = \{1, 2, \dots, N\}$. In other words, all components $X_t(\ell)$, $\ell \in \mathcal{N}$, have conditional densities $g_\ell(X_t(\ell) \mid \mathbf{X}^{t-1}(\ell))$ before the change occurs, and $X_t(i)$ has conditional density $f_{i,\theta_i}(X_t(i) \mid \mathbf{X}^{t-1}(i))$ after the change occurs in the $i$-th stream, while the rest of the components $X_t(j)$, $j \in \mathcal{N} \setminus \{i\}$, retain the conditional densities $g_j(X_t(j) \mid \mathbf{X}^{t-1}(j))$. The parameters $\theta_i \in \Theta_i$, $i = 1, \dots, N$, of the post-change distributions are unknown. The event $\nu = \infty$ and the corresponding hypothesis $H_\infty : \nu = \infty$ mean that there never is a change.
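For concreteness, here is a minimal simulation sketch of model (1) in its simplest special case: independent i.i.d. Gaussian streams with a mean shift in a single stream. The Gaussian choice, the parameter values, and all names are ours; the theory itself covers dependent, non-identically distributed observations.

```python
import numpy as np

def simulate_streams(n, N, nu, affected, theta, rng):
    """Simulate model (1) in the i.i.d. Gaussian special case:
    every stream is N(0,1) before the change (playing the role of g_l),
    and from observation nu+1 onward the affected stream is N(theta,1)
    (playing the role of f_{i,theta_i})."""
    X = rng.standard_normal((n, N))      # pre-change observations in all streams
    if nu < n:
        X[nu:, affected] += theta        # X_{nu+1}, X_{nu+2}, ... get the mean shift
    return X

rng = np.random.default_rng(0)
X = simulate_streams(n=200, N=3, nu=100, affected=1, theta=0.7, rng=rng)
```

Row `t` of `X` (0-based) holds the observation vector $X_{t+1} = (X_{t+1}(1), \dots, X_{t+1}(N))$, so the first post-change observation is row `nu`, matching the convention that $X_{\nu+1}(i)$ is the first post-change observation.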
Notice that model (1) implies that $X_{\nu+1}(i)$ is the first post-change observation under hypothesis $H_{\nu,i}$.

Regarding the change point $\nu$, we assume that it is a random variable independent of the observations with prior distribution $\pi_k = P(\nu = k)$, $k = 0, 1, 2, \dots$, with $\pi_k > 0$ for $k \in \{0, 1, 2, \dots\} = \mathbb{Z}_+$. We will also assume that the change point may take negative values, which means that the change has occurred by the time the observations became available. However, the detailed structure of the distribution $P(\nu = k)$ for $k = -1, -2, \dots$ is not important. The only value which matters is the total probability $q = P(\nu \le -1)$ of the change being in effect before the observations become available, so we set $P(\nu \le -1) = P(\nu = -1) = \pi_{-1}$, $\pi_{-1} \in [0, 1)$. Therefore, in what follows we assume that $\nu \in \{-1, 0, 1, \dots\} = \mathbb{Z}$ and the prior distribution of the change point is defined on $\mathbb{Z}$.

III. THE DETECTION–IDENTIFICATION RULE
A changepoint detection–identification rule is a pair $\delta = (d, T)$, where $T$ is a stopping time (with respect to the filtration $\{\mathcal{F}_n = \sigma(\mathbf{X}^n)\}_{n \in \mathbb{Z}_+}$) associated with the time of alarm on the change, and $d = d_T \in \mathcal{N}$ is a decision on which stream is affected (or which post-change distribution is true), made at time $T$.

It follows from (1) that for an assumed value of the change point $\nu = k$, stream $i \in \mathcal{N}$, and post-change parameter value $\theta_i \in \Theta_i$ in the $i$-th stream, the likelihood ratio (LR) $\mathrm{LR}_{i,\theta_i}(k,n) = p(\mathbf{X}^n \mid H_{k,i}, \theta_i)/p(\mathbf{X}^n \mid H_\infty)$ between the hypotheses $H_{k,i}$ and $H_\infty$ for the observations accumulated by time $n$ has the form
\[
\mathrm{LR}_{i,\theta_i}(k,n) = \prod_{t=k+1}^{n} L_{i,\theta_i}(t), \quad i \in \mathcal{N}, \ n > k \tag{2}
\]
($k = -1, 0, 1, \dots$), where $L_{i,\theta_i}(t) = f_{i,\theta_i}(X_t(i) \mid \mathbf{X}^{t-1}(i))/g_i(X_t(i) \mid \mathbf{X}^{t-1}(i))$. We suppose that $L_{i,\theta_i}(0) = 1$, so that $\mathrm{LR}_{i,\theta_i}(-1,n) = \mathrm{LR}_{i,\theta_i}(0,n)$. Define the average (over the prior $\pi_k$) LR statistics
\[
\Lambda^{\pi}_{i,\theta_i}(n) = \sum_{k=-1}^{n-1} \pi_k \, \mathrm{LR}_{i,\theta_i}(k,n), \quad i \in \mathcal{N}. \tag{3}
\]
Let $W_i$, $\int_{\Theta_i} \mathrm{d}W_i(\theta_i) = 1$, $i \in \mathcal{N}$, be mixing measures and, for $k < n$ and $i \in \mathcal{N}$, define the LR mixtures
\[
\mathrm{LR}_{i,W}(k,n) = \int_{\Theta_i} \mathrm{LR}_{i,\theta_i}(k,n) \, \mathrm{d}W_i(\theta_i) \tag{4}
\]
and the statistics
\[
\Lambda^{\pi}_{i,W}(n) =
\begin{cases}
\sum_{k=-1}^{n-1} \pi_k \, \mathrm{LR}_{i,W}(k,n), & i \in \mathcal{N}, \\
P(\nu \ge n), & i = 0;
\end{cases} \tag{5}
\]
\[
\bar\Lambda^{\pi,W}_{ij}(n) = \frac{\Lambda^{\pi}_{i,W}(n)}{\sum_{k=-1}^{n-1} \pi_k \sup_{\theta_j \in \Theta_j} \mathrm{LR}_{j,\theta_j}(k,n)}, \quad i,j \in \mathcal{N},\ i \ne j,\ n \ge 1; \qquad
\bar\Lambda^{\pi,W}_{i0}(n) = \frac{\Lambda^{\pi}_{i,W}(n)}{P(\nu \ge n)}, \quad i \in \mathcal{N},\ n \ge 1, \tag{6}
\]
where in the statistic $\Lambda^{\pi}_{i,W}(n)$ defined in (5) the value $i = 0$ corresponds to the hypothesis $H_0$ that there is no change (in the first $n$ observations). Write $\mathcal{N}_0 = \{0, 1, \dots, N\}$.
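For intuition, the statistics (5)–(6) can be computed by brute force in the i.i.d. Gaussian special case with point mixing measures $W_i$ (a point mass at a known post-change mean, so the supremum in (6) is trivial) and a geometric prior $\pi_k = \rho(1-\rho)^k$ on $\{0, 1, \dots\}$ with $\pi_{-1} = 0$, whose tail is $P(\nu \ge n) = (1-\rho)^n$. All names and numeric choices below are ours, not the paper's.

```python
import numpy as np

def mixture_stats(X, theta, rho):
    """Compute Lambda_i(n) of (5) and the ratio against the 'no change'
    hypothesis (j = 0 in (6)) for N independent Gaussian streams:
    N(0,1) pre-change, N(theta[i],1) post-change in stream i; W_i is a
    point mass at theta[i] and pi_k = rho*(1-rho)^k, k >= 0.
    Returns (Lam, Lam0) with Lam[n-1, i] = Lambda_i(n) and
    Lam0[n-1, i] = Lambda_i(n) / P(nu >= n)."""
    n_max, N = X.shape
    theta = np.asarray(theta, dtype=float)
    logL = X * theta - 0.5 * theta ** 2            # log L_{i,theta_i}(t)
    cum = np.vstack([np.zeros(N), np.cumsum(logL, axis=0)])
    Lam = np.empty((n_max, N))
    for n in range(1, n_max + 1):
        k = np.arange(n)                           # candidate change points 0..n-1
        pi_k = rho * (1.0 - rho) ** k
        Lam[n - 1] = pi_k @ np.exp(cum[n] - cum[k])  # sum_k pi_k LR_i(k, n)
    tail = (1.0 - rho) ** np.arange(1, n_max + 1)    # P(nu >= n)
    return Lam, Lam / tail[:, None]

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 2))
X[50:, 0] += 1.0                                   # change at nu = 50 in stream 0
Lam, Lam0 = mixture_stats(X, theta=[1.0, 1.0], rho=0.1)
```

Thresholding the ratios `Lam[n-1, i] / Lam[n-1, j]` and `Lam0[n-1, i]` as in the rule (7)–(8) defined next yields the stopping time and the identification decision; after the change, the statistic of the affected stream grows exponentially while the others stay small.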
For the set of positive thresholds $A = (A_{ij})$, $j \in \mathcal{N}_0 \setminus \{i\}$, $i \in \mathcal{N}$ (writing $A_i = A_{i0}$ for brevity), the change detection–identification rule $\delta_A = (d_A, T_A)$ is defined as follows:
\[
T_A = \min_{\ell \in \mathcal{N}} T_A^{(\ell)}, \qquad d_A = i \ \text{ if } \ T_A = T_A^{(i)}, \tag{7}
\]
where the Markov times $T_A^{(i)}$, $i \in \mathcal{N}$, are given by
\[
T_A^{(i)} = \inf\bigl\{ n \ge 1 : \bar\Lambda^{\pi,W}_{ij}(n) \ge A_{ij} \ \ \forall\, j \in \mathcal{N}_0 \setminus \{i\} \bigr\}. \tag{8}
\]
In definitions of stopping times we always set $\inf\{\varnothing\} = \infty$, i.e., $T_A^{(i)} = \infty$ if there is no such $n$. If $T_A = T_A^{(i)}$ for several values of $i$, then any of them can be taken.

IV. OPTIMIZATION PROBLEMS AND ASSUMPTIONS
Let $E_{k,i,\theta_i}$ and $E_\infty$ denote expectations under the probability measures $P_{k,i,\theta_i}$ and $P_\infty$, respectively, where $P_{k,i,\theta_i}$ corresponds to model (1) with an assumed value of the parameter $\theta_i \in \Theta_i$, change point $\nu = k$, and affected stream $i \in \mathcal{N}$. Define the probability measure $P^{\pi}_{i,\theta_i}(\mathcal{A} \times \mathcal{K}) = \sum_{k \in \mathcal{K}} \pi_k P_{k,i,\theta_i}(\mathcal{A})$, under which the change point $\nu$ has distribution $\pi = \{\pi_k\}_{k \in \mathbb{Z}}$ and the model for the observations is of the form (1), and let $E^{\pi}_{i,\theta_i}$ denote the corresponding expectation.

For $r \ge 1$, $\nu = k \in \mathbb{Z}$, $\theta_i \in \Theta_i$, and $i \in \mathcal{N}$, introduce the risk associated with the conditional $r$-th moment of the detection delay
\[
R^{r}_{k,i,\theta_i}(\delta) = E_{k,i,\theta_i}\bigl[(T - k)^r;\ d = i \mid T > k\bigr], \tag{9}
\]
where for $k = -1$ we set $T - k = T$, but not $T + 1$, as well as the integrated (over the prior $\pi$) risk associated with the moments of the delay to detection
\[
\bar R^{r}_{i,\theta_i}(\delta) := E^{\pi}_{i,\theta_i}\bigl[(T - \nu)^r;\ d = i \mid T > \nu\bigr]
= \frac{E^{\pi}_{i,\theta_i}[(T - \nu)^r;\ d = i,\ T > \nu]}{P^{\pi}_{i,\theta_i}(T > \nu)}
= \frac{\sum_{k=-1}^{\infty} \pi_k E_{k,i,\theta_i}[(T - k)^r;\ d = i,\ T > k]}{\sum_{k=-1}^{\infty} \pi_k P_{k,i,\theta_i}(T > k)}
\]
\[
= \frac{\sum_{k=-1}^{\infty} \pi_k R^{r}_{k,i,\theta_i}(\delta)\, P_{k,i,\theta_i}(T > k)}{P(\nu \le 0) + \sum_{k=1}^{\infty} \pi_k P_\infty(T > k)}
= \frac{\sum_{k=-1}^{\infty} \pi_k R^{r}_{k,i,\theta_i}(\delta)\, P_\infty(T > k)}{1 - \mathrm{PFA}_\pi(\delta)}, \tag{10}
\]
where
\[
\mathrm{PFA}_\pi(\delta) = P^{\pi}_{i,\theta_i}(T \le \nu) = \sum_{k=0}^{\infty} \pi_k P_{k,i,\theta_i}(T \le k) = \sum_{k=0}^{\infty} \pi_k P_\infty(T \le k) \tag{11}
\]
is the weighted probability of false alarm. Note that in (10) and (11) we used the equality $P_{k,i,\theta_i}(T \le k) = P_\infty(T \le k)$, which holds since the event $\{T \le k\}$ belongs to the sigma-algebra $\mathcal{F}_k = \sigma(\mathbf{X}^k)$ and, hence, depends only on the first $k$ observations, whose distribution corresponds to the measure $P_\infty$. This implies, in particular, that
\[
P^{\pi}_{i,\theta_i}(T > \nu) = 1 - \mathrm{PFA}_\pi(\delta) = P(\nu \le 0) + \sum_{k=1}^{\infty} \pi_k P_\infty(T > k).
\]
Also, introduce
\[
\mathrm{PFA}^{\pi}_i(\delta) = P^{\pi}_{i,\theta_i}(T \le \nu;\ d = i) = \sum_{k=0}^{\infty} \pi_k P_{k,i,\theta_i}(T \le k;\ d = i) = \sum_{k=0}^{\infty} \pi_k P_\infty(T \le k;\ d = i), \tag{12}
\]
the weighted probability of false alarm on the event $\{d = i\}$, i.e., the probability of raising the alarm with the decision $d = i$ that there is a change in the $i$-th stream when there is no change. The loss associated with wrong identification is reasonable to measure by the maximal probabilities of wrong decisions (misidentification)
\[
\mathrm{PMI}^{\pi}_{ij}(\delta) = \sup_{\theta_i \in \Theta_i} P^{\pi}_{i,\theta_i}(d = j;\ T < \infty \mid T > \nu), \tag{13}
\]
$i, j = 1, \dots, N$, $i \ne j$. Note that
\[
P^{\pi}_{i,\theta_i}(d = j;\ T < \infty \mid T > \nu) = \frac{P^{\pi}_{i,\theta_i}(d = j;\ \nu < T < \infty)}{P^{\pi}_{i,\theta_i}(T > \nu)}
= \frac{\sum_{k=-1}^{\infty} \pi_k P_{k,i,\theta_i}(d = j;\ k < T < \infty)}{1 - \mathrm{PFA}_\pi(\delta)}.
\]
Define the class of change detection–identification rules $\delta$ with constraints on the probabilities of false alarm $\mathrm{PFA}^{\pi}_i(\delta)$ and the probabilities of misidentification $\mathrm{PMI}^{\pi}_{ij}(\delta)$:
\[
\mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \bigl\{\delta : \mathrm{PFA}^{\pi}_i(\delta) \le \alpha_i,\ i \in \mathcal{N},\ \text{and}\ \mathrm{PMI}^{\pi}_{ij}(\delta) \le \beta_{ij},\ i, j \in \mathcal{N},\ i \ne j\bigr\}, \tag{14}
\]
where $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_N)$ and $\boldsymbol{\beta} = (\beta_{ij})_{i,j \in \mathcal{N}, i \ne j}$ are the sets of prescribed probabilities $\alpha_i \in (0,1)$ and $\beta_{ij} \in (0,1)$. Ideally, we would be interested in finding an optimal rule $\delta_{\mathrm{opt}} = (d_{\mathrm{opt}}, T_{\mathrm{opt}})$ that solves the optimization problem
\[
\bar R^{r}_{i,\theta_i}(\delta_{\mathrm{opt}}) = \inf_{\delta \in \mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta})} \bar R^{r}_{i,\theta_i}(\delta) \quad \forall\, \theta_i \in \Theta_i,\ i \in \mathcal{N}.
\]
However, this problem is intractable for arbitrary values of $\alpha_i \in (0,1)$ and $\beta_{ij} \in (0,1)$. For this reason, we will focus on the asymptotic problem, assuming that the given probabilities $\alpha_i$ and $\beta_{ij}$ approach zero.
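As a computational aside on the weighted PFA (11): for a geometric prior $\pi_k = \rho(1-\rho)^k$, $k \ge 0$, one has $\sum_{k \ge T} \pi_k = (1-\rho)^T$, so $\mathrm{PFA}_\pi(\delta) = \sum_k \pi_k P_\infty(T \le k) = E_\infty[(1-\rho)^T]$. The sketch below (ours; the stopping time is an arbitrary toy first-passage time, not the rule $\delta_A$) checks the two expressions against each other on the empirical distribution of simulated stopping times.

```python
import numpy as np

rng = np.random.default_rng(5)
rho, reps, horizon = 0.05, 10000, 400

# Toy stopping time: first time a Gaussian random walk exceeds 5.
# Any integer-valued stopping time works for this identity.
walks = rng.standard_normal((reps, horizon)).cumsum(axis=1)
crossed = walks >= 5.0
T = np.where(crossed.any(axis=1), crossed.argmax(axis=1) + 1, horizon + 1)

# Direct form of (11): sum_k pi_k P(T <= k), truncated at the horizon.
ks = np.arange(horizon + 1)
pi = rho * (1.0 - rho) ** ks
p_le = (T[:, None] <= ks).mean(axis=0)        # empirical P(T <= k)
pfa_direct = (pi * p_le).sum()

pfa_closed = ((1.0 - rho) ** T).mean()        # E[(1-rho)^T]
```

On the same sample the two estimates agree up to the negligible truncated tail mass $(1-\rho)^{\,\mathrm{horizon}+1}$, which is why such closed forms are convenient for calibrating thresholds by simulation.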
To be more specific, we will be interested in proving that the proposed detection–identification rule $\delta_A = (d_A, T_A)$ defined in (7)–(8) is first-order uniformly asymptotically optimal in the following sense:
\[
\lim_{\alpha_{\max} \to 0,\, \beta_{\max} \to 0} \frac{\inf_{\delta \in \mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta})} \bar R^{r}_{i,\theta_i}(\delta)}{\bar R^{r}_{i,\theta_i}(\delta_A)} = 1 \quad \text{for all } \theta_i \in \Theta_i \text{ and } i \in \mathcal{N}, \tag{15}
\]
where $A = A(\boldsymbol{\alpha}, \boldsymbol{\beta})$ is the set of suitably selected thresholds such that $\delta_A \in \mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta})$. Hereafter $\alpha_{\max} = \max_{i \in \mathcal{N}} \alpha_i$ and $\beta_{\max} = \max_{i,j \in \mathcal{N}, i \ne j} \beta_{ij}$.

In addition, we will prove that the rule $\delta_A = (d_A, T_A)$ is uniformly pointwise first-order asymptotically optimal in the sense of minimizing the conditional risk (9) for all change point values $\nu = k \in \mathbb{Z}$, i.e.,
\[
\lim_{\alpha_{\max} \to 0,\, \beta_{\max} \to 0} \frac{\inf_{\delta \in \mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta})} R^{r}_{k,i,\theta_i}(\delta)}{R^{r}_{k,i,\theta_i}(\delta_A)} = 1 \quad \text{for all } \theta_i \in \Theta_i,\ k \in \mathbb{Z},\ i \in \mathcal{N}. \tag{16}
\]
It is also of interest to consider the class of detection–identification rules
\[
\mathcal{C}^{\star}_\pi(\alpha, \bar{\boldsymbol{\beta}}) = \bigl\{\delta : \mathrm{PFA}_\pi(\delta) \le \alpha,\ \mathrm{PMI}^{\pi}_i(\delta) \le \bar\beta_i,\ i \in \mathcal{N}\bigr\} \tag{17}
\]
($\bar{\boldsymbol{\beta}} = (\bar\beta_1, \dots, \bar\beta_N)$) with constraints on the total probability of false alarm $\mathrm{PFA}_\pi(\delta)$ (defined in (11)), regardless of which decision $d = i$ is made under hypothesis $H_\infty$, and on the misidentification probabilities
\[
\mathrm{PMI}^{\pi}_i(\delta) = \sup_{\theta_i \in \Theta_i} P^{\pi}_{i,\theta_i}(d \ne i;\ T < \infty \mid T > \nu), \quad i \in \mathcal{N}.
\]
Obviously,
$\mathrm{PFA}_\pi(\delta) = \sum_{i=1}^{N} \mathrm{PFA}^{\pi}_i(\delta)$ and $\mathrm{PMI}^{\pi}_i(\delta) = \sum_{j \in \mathcal{N} \setminus \{i\}} \mathrm{PMI}^{\pi}_{ij}(\delta)$.

In this paper, we consider only a fixed number of hypotheses $N$. The large-scale (Big Data) case where $N \to \infty$ at a certain rate (which requires a different definition of the false alarm and misidentification rates) will be considered elsewhere.

In the following, we assume that the mixing measures $W_i$, $i = 1, \dots, N$, satisfy the condition
\[
W_i\{\vartheta \in \Theta_i : |\vartheta - \theta_i| < \varkappa\} > 0 \quad \text{for any } \varkappa > 0 \text{ and any } \theta_i \in \Theta_i. \tag{18}
\]
By (2), for the assumed values of $\nu = k$, $i \in \mathcal{N}$, and $\theta_i \in \Theta_i$, the log-likelihood ratio (LLR) $\lambda_{i,\theta_i}(k, k+n) = \log \mathrm{LR}_{i,\theta_i}(k, k+n)$ of the observations accumulated by time $k + n$ is
\[
\lambda_{i,\theta_i}(k, k+n) = \sum_{t=k+1}^{k+n} \log L_{i,\theta_i}(t), \quad n \ge 1,
\]
and the LLR between the hypotheses $H_{k,i}$ and $H_{k,j}$ of the observations accumulated by time $k + n$ is
\[
\lambda_{i,\theta_i;\, j,\theta_j}(k, k+n) = \log \frac{p(\mathbf{X}^{k+n} \mid H_{k,i}, \theta_i)}{p(\mathbf{X}^{k+n} \mid H_{k,j}, \theta_j)} \equiv \lambda_{i,\theta_i}(k, k+n) - \lambda_{j,\theta_j}(k, k+n), \quad n \ge 1.
\]
For $j = 0$, we set $\lambda_{0,\theta_0}(k, k+n) = 0$, so that $\lambda_{i,\theta_i;\, 0,\theta_0}(k, k+n) = \lambda_{i,\theta_i}(k, k+n)$.

To study asymptotic optimality, we need certain constraints imposed on the prior distribution $\pi = \{\pi_k\}$ and on the asymptotic behavior of the decision statistics as the sample size increases (i.e., on the general stochastic model). For $\varkappa > 0$, let $\Gamma_{\varkappa,\theta_i} = \{\vartheta \in \Theta_i : |\vartheta - \theta_i| < \varkappa\}$ and, for $0 < I_{ij}(\theta_i, \theta_j) < \infty$, $j \in \mathcal{N}_0 \setminus \{i\}$, $i \in \mathcal{N}$, define
\[
p_{M,k}(\varepsilon; i, \theta_i; j, \theta_j) = P_{k,i,\theta_i}\left\{ \frac{1}{M} \max_{1 \le n \le M} \lambda_{i,\theta_i;\, j,\theta_j}(k, k+n) \ge (1 + \varepsilon)\, I_{ij}(\theta_i, \theta_j) \right\},
\]
\[
\Upsilon_r(\varkappa, \varepsilon; i, \theta_i) = \sum_{n=1}^{\infty} n^{r-1} \sup_{k \in \mathbb{Z}_+} P_{k,i,\theta_i}\left\{ \frac{1}{n} \inf_{\vartheta \in \Gamma_{\varkappa,\theta_i}} \lambda_{i,\vartheta}(k, k+n) < I_i(\theta_i) - \varepsilon \right\}, \tag{19}
\]
where $I_{i0}(\theta_i, \theta_0) = I_i(\theta_i)$, so that
\[
p_{M,k}(\varepsilon; i, \theta_i; 0, \theta_0) = p_{M,k}(\varepsilon; i, \theta_i) = P_{k,i,\theta_i}\left\{ \frac{1}{M} \max_{1 \le n \le M} \lambda_{i,\theta_i}(k, k+n) \ge (1 + \varepsilon)\, I_i(\theta_i) \right\}.
\]
Regarding the model (1) for the observations, we assume that the following two conditions are satisfied (for the local LLRs in the data streams):

$C_1$. There exist positive and finite numbers $I_i(\theta_i)$, $\theta_i \in \Theta_i$, $i \in \mathcal{N}$, and $I_{ij}(\theta_i, \theta_j)$, $\theta_j \in \Theta_j$, $j \in \mathcal{N} \setminus \{i\}$, $\theta_i \in \Theta_i$, $i \in \mathcal{N}$, such that for any $\varepsilon > 0$
\[
\lim_{M \to \infty} p_{M,k}(\varepsilon; i, \theta_i; j, \theta_j) = 0 \quad \text{for all } k \in \mathbb{Z}_+,\ \theta_i \in \Theta_i,\ \theta_j \in \Theta_j,\ j \in \mathcal{N} \setminus \{i\},\ i \in \mathcal{N}. \tag{20}
\]

$C_2$. For any $\varepsilon > 0$ and some $r \ge 1$,
\[
\lim_{\varkappa \to 0} \Upsilon_r(\varkappa, \varepsilon; i, \theta_i) < \infty \quad \text{for all } \theta_i \in \Theta_i,\ i \in \mathcal{N}. \tag{21}
\]
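Conditions $C_1$ and $C_2$ are, at heart, law-of-large-numbers requirements on the normalized LLRs. A quick numerical illustration (ours) for the i.i.d. Gaussian case $g = \mathcal{N}(0,1)$, $f = \mathcal{N}(\theta, 1)$, where $\lambda_{i,\theta}(k, k+n)/n$ converges to $I_i(\theta) = \theta^2/2$ under the post-change measure:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n = 1.0, 100000
x = theta + rng.standard_normal(n)         # post-change N(theta, 1) sample
llr_terms = theta * x - 0.5 * theta ** 2   # log f(x)/g(x) for a Gaussian mean shift
# By the SLLN, (1/n) * lambda(k, k+n) = llr_terms.mean() should be close
# to I(theta) = theta^2 / 2 = 0.5, which is what C1-C2 formalize.
running_mean = llr_terms.mean()
```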
Note that condition $C_1$ holds whenever $\lambda_{i,\theta_i;\, j,\theta_j}(k, k+n)/n$ converges almost surely (a.s.) to $I_{ij}(\theta_i, \theta_j)$ under $P_{k,i,\theta_i}$, i.e., for all $\theta_i \in \Theta_i$,
\[
P_{k,i,\theta_i}\left\{ \lim_{n \to \infty} \frac{1}{n}\, \lambda_{i,\theta_i;\, j,\theta_j}(k, k+n) = I_{ij}(\theta_i, \theta_j) \right\} = 1. \tag{22}
\]
Regarding the prior distribution $\pi_k = P(\nu = k)$, we assume that it is fully supported (i.e., $\pi_k > 0$ for all $k \in \mathbb{Z}_+$, $\pi_{-1} < 1$, and $\pi_\infty = 0$) and that the following two conditions are satisfied:

$CP_1$. For some $0 \le \mu < \infty$,
\[
\lim_{n \to \infty} \frac{1}{n} \left| \log \sum_{k=n+1}^{\infty} \pi_k \right| = \mu. \tag{23}
\]

$CP_2$. If $\mu = 0$, then in addition
\[
\sum_{k=0}^{\infty} \pi_k |\log \pi_k|^r < \infty \quad \text{for some } r \ge 1. \tag{24}
\]

The class of prior distributions satisfying conditions $CP_1$ and $CP_2$ will be denoted by $\mathbf{C}(\mu)$. Note that if $\mu > 0$, then the prior distribution has an exponential right tail; in this case, condition (24) holds automatically. For example, the geometric prior $\pi_k = \rho(1-\rho)^k$, $k \in \mathbb{Z}_+$, has the tail $\sum_{k > n} \pi_k = (1-\rho)^{n+1}$ and hence satisfies $CP_1$ with $\mu = |\log(1-\rho)|$. If $\mu = 0$, the distribution has a heavy tail, i.e., belongs to the model with a vanishing hazard rate. However, we cannot allow this distribution to have a too heavy tail, which is guaranteed by condition $CP_2$. A typical heavy-tailed prior distribution that satisfies both condition $CP_1$ with $\mu = 0$ and condition $CP_2$ for all $r \ge 1$ is a discrete Weibull-type distribution with shape parameter $0 < \varkappa < 1$. Constraint (24) is often guaranteed by finiteness of the $r$-th moment, $E[\nu^r] < \infty$.

To obtain the lower bounds for the moments of the detection delay, we need only the right-tail condition (20). However, to establish the asymptotic optimality of the rule $\delta_A$, both the right-tail and left-tail conditions (20) and (21) are needed.

V. UPPER BOUNDS ON THE PROBABILITIES OF FALSE ALARM AND MISIDENTIFICATION OF THE DETECTION–IDENTIFICATION RULE $\delta_A$

Let $\widetilde P^{\pi,n}_{i,\theta_i}(\mathcal{A}) = P^{\pi}_{i,\theta_i}(\mathcal{A},\, \nu < n)$ denote the measure $P^{\pi}_{i,\theta_i}$ restricted to the event $\{\nu < n\}$.
The distribution $\widetilde P^{\pi,n}_{i,\theta_i}(\mathbf{X}^n \in \cdot\,)$ has density
\[
f^{\pi,n}_{i,\theta_i}(\mathbf{X}^n) = \sum_{k=-1}^{n-1} \left[ \pi_k \prod_{t=1}^{k} g_i(X_t(i) \mid \mathbf{X}^{t-1}(i)) \prod_{t=k+1}^{n} f_{i,\theta_i}(X_t(i) \mid \mathbf{X}^{t-1}(i)) \right] \times \prod_{j \in \mathcal{N} \setminus \{i\}} \prod_{t=1}^{n} g_j(X_t(j) \mid \mathbf{X}^{t-1}(j)),
\]
where $\prod_{t=1}^{0} g_i(X_t(i) \mid \mathbf{X}^{t-1}(i)) = 1$. Write
\[
f^{\pi,n}_{i,W}(\mathbf{X}^n) = \int_{\Theta_i} f^{\pi,n}_{i,\theta_i}(\mathbf{X}^n) \, \mathrm{d}W_i(\theta_i).
\]
Next, define the statistic $\widetilde\Lambda^{\pi,W}_{i,j,\theta_j}(n) = \Lambda^{\pi}_{i,W}(n)/\Lambda^{\pi}_{j,\theta_j}(n)$ and the measure
\[
\widetilde P^{\pi,n}_{\ell,W}(\mathcal{A}) = \int_{\Theta_\ell} \widetilde P^{\pi,n}_{\ell,\theta_\ell}(\mathcal{A}) \, \mathrm{d}W_\ell(\theta_\ell).
\]
Denote by $P|_{\mathcal{F}_n}$ the restriction of a measure $P$ to the sigma-algebra $\mathcal{F}_n$. Obviously,
\[
\widetilde\Lambda^{\pi,W}_{i,j,\theta_j}(n) = \left.\frac{\mathrm{d}\widetilde P^{\pi,n}_{i,W}}{\mathrm{d}\widetilde P^{\pi,n}_{j,\theta_j}}\right|_{\mathcal{F}_n}, \quad i \ne j,
\]
and hence the statistic $\widetilde\Lambda^{\pi,W}_{i,j,\theta_j}(n)$ is a $(\widetilde P^{\pi,n}_{j,\theta_j}, \mathcal{F}_n)$-martingale with unit expectation for all $\theta_j \in \Theta_j$. Therefore, by the Wald–Doob identity, for any stopping time $T$ and all $\theta_j \in \Theta_j$,
\[
\widetilde E^{\pi}_{i,\theta_i}\bigl[\widetilde\Lambda^{\pi,W}_{j,i,\theta_i}(T)\, 1\!\mathrm{l}\{\mathcal{A}, T < \infty\}\bigr] = \widetilde E^{\pi}_{j,W}\bigl[1\!\mathrm{l}\{\mathcal{A}, T < \infty\}\bigr] = \widetilde P^{\pi,T}_{j,W}(\mathcal{A} \cap \{T < \infty\}), \tag{25}
\]
where $\widetilde E^{\pi}_{j,W}$ and $\widetilde E^{\pi}_{j,\theta_j}$ stand for the operators of expectation under $\widetilde P^{\pi,T}_{j,W}$ and $\widetilde P^{\pi,T}_{j,\theta_j}$, respectively.

The following theorem establishes the upper bounds for the PFA and PMI of the proposed detection–identification rule $\delta_A$. Note that these bounds are valid in the most general case: neither the conditions on the model ($C_1$, $C_2$) nor those on the prior distribution ($CP_1$, $CP_2$) are required.

Theorem 1.
Let $\delta_A$ be the changepoint detection–identification rule defined in (7)–(8). The following upper bounds for the PFA and PMI of the rule $\delta_A$ hold:
\[
\mathrm{PFA}^{\pi}_i(\delta_A) \le (1 + A_i)^{-1}, \quad i \in \mathcal{N}, \tag{26}
\]
\[
\mathrm{PFA}_\pi(\delta_A) \le \sum_{i=1}^{N} (1 + A_i)^{-1}, \tag{27}
\]
and
\[
\mathrm{PMI}^{\pi}_{ij}(\delta_A) \le \frac{1 + A_i}{A_i A_{ji}}, \quad i, j \in \mathcal{N},\ i \ne j, \tag{28}
\]
\[
\mathrm{PMI}^{\pi}_i(\delta_A) \le \frac{1 + A_i}{A_i} \sum_{j \in \mathcal{N} \setminus \{i\}} A_{ji}^{-1}, \quad i \in \mathcal{N}. \tag{29}
\]
Thus, if $\alpha_{\max} < 1 - \pi_{-1}$, then
\[
A_i = \frac{1 - \alpha_i}{\alpha_i} \ \text{ and } \ A_{ij} = \frac{1}{(1 - \alpha_j)\beta_{ji}} \ \text{ imply } \ \delta_A \in \mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta}), \tag{30}
\]
and if $A_i = A$ for $i \in \mathcal{N}$ and $A_{ij} = A_j$ for $j \in \mathcal{N} \setminus \{i\}$, then
\[
A = \frac{N(1 - \alpha/N)}{\alpha} \ \text{ and } \ A_j = \frac{N - 1}{(1 - \alpha/N)\bar\beta_j} \ \text{ imply } \ \delta_A \in \mathcal{C}^{\star}_\pi(\alpha, \bar{\boldsymbol{\beta}}). \tag{31}
\]
Proof:
Using the Bayes rule, the notation (2)–(6), and the fact that $\mathrm{LR}_{i,W}(k, n) = 1$ for $k \ge n$, we obtain
\[
P(\nu = k \mid \mathcal{F}_n) = \frac{\pi_k \mathrm{LR}_{i,W}(k, n)}{\sum_{j=-1}^{\infty} \pi_j \mathrm{LR}_{i,W}(j, n)} = \frac{\pi_k \mathrm{LR}_{i,W}(k, n)}{\sum_{j=-1}^{n-1} \pi_j \mathrm{LR}_{i,W}(j, n) + P(\nu \ge n)},
\]
so that
\[
P(\nu \ge n \mid \mathcal{F}_n) = \sum_{k=n}^{\infty} P(\nu = k \mid \mathcal{F}_n) = \frac{P(\nu \ge n)}{\Lambda^{\pi}_{i,W}(n) + P(\nu \ge n)} = \frac{1}{\bar\Lambda^{\pi,W}_{i0}(n) + 1}.
\]
Next, obviously,
\[
\mathrm{PFA}^{\pi}_i(\delta_A) = P^{\pi}_{i,\theta_i}\bigl(T^{(i)}_A \le \nu,\ T_A = T^{(i)}_A\bigr) \le P^{\pi}_{i,\theta_i}\bigl(T^{(i)}_A \le \nu\bigr).
\]
Therefore, taking into account that $P^{\pi}_{i,\theta_i}(T^{(i)}_A \le \nu) = E^{\pi}_{i,\theta_i}[P(T^{(i)}_A \le \nu \mid \mathcal{F}_{T^{(i)}_A})]$ and that $\bar\Lambda^{\pi,W}_{i0}(T^{(i)}_A) \ge A_i$ on $\{T^{(i)}_A < \infty\}$, we obtain
\[
\mathrm{PFA}^{\pi}_i(\delta_A) \le E^{\pi}_{i,\theta_i}\bigl[(1 + \bar\Lambda^{\pi,W}_{i0}(T^{(i)}_A))^{-1};\ T^{(i)}_A < \infty\bigr] \le \frac{1}{1 + A_i},
\]
and the inequalities (26) follow. Inequality (27) follows immediately from the fact that $\mathrm{PFA}_\pi(\delta) = \sum_{i=1}^{N} \mathrm{PFA}^{\pi}_i(\delta)$.

To prove the upper bound (28), note that $\widetilde\Lambda^{\pi,W}_{j,i,\theta_i}(n) \ge \bar\Lambda^{\pi,W}_{ji}(n)$ for all $n \ge 1$ and $\theta_i \in \Theta_i$, and that $\bar\Lambda^{\pi,W}_{ji}(T^{(j)}_A) \ge A_{ji}$ on $\{T^{(j)}_A < \infty\}$, so that
\[
P^{\pi}_{i,\theta_i}(d_A = j,\ \nu < T_A < \infty) = P^{\pi}_{i,\theta_i}\bigl(T_A = T^{(j)}_A,\ \nu < T^{(j)}_A < \infty\bigr) = \widetilde P^{\pi,T_A}_{i,\theta_i}\bigl(T_A = T^{(j)}_A,\ T^{(j)}_A < \infty\bigr)
\]
\[
= \widetilde E^{\pi}_{i,\theta_i}\left[ \frac{\bar\Lambda^{\pi,W}_{ji}(T^{(j)}_A)}{\bar\Lambda^{\pi,W}_{ji}(T^{(j)}_A)}\, 1\!\mathrm{l}\{T_A = T^{(j)}_A,\ T^{(j)}_A < \infty\} \right]
\le A_{ji}^{-1}\, \widetilde E^{\pi}_{i,\theta_i}\bigl[\bar\Lambda^{\pi,W}_{ji}(T^{(j)}_A)\, 1\!\mathrm{l}\{T_A = T^{(j)}_A,\ T^{(j)}_A < \infty\}\bigr]
\]
\[
\le A_{ji}^{-1}\, \widetilde E^{\pi}_{i,\theta_i}\bigl[\widetilde\Lambda^{\pi,W}_{j,i,\theta_i}(T^{(j)}_A)\, 1\!\mathrm{l}\{T_A = T^{(j)}_A,\ T^{(j)}_A < \infty\}\bigr] \quad \forall\, \theta_i \in \Theta_i,
\]
where, by equality (25), the last term is equal to $A_{ji}^{-1} \widetilde P^{\pi,T_A}_{j,W}(\{T_A = T^{(j)}_A\} \cap \{T^{(j)}_A < \infty\})$. This yields
\[
P^{\pi}_{i,\theta_i}(d_A = j,\ \nu < T_A < \infty) \le A_{ji}^{-1}\, \widetilde P^{\pi,T_A}_{j,W}\bigl(\{T_A = T^{(j)}_A\} \cap \{T^{(j)}_A < \infty\}\bigr) \le A_{ji}^{-1} \quad \text{for all } \theta_i \in \Theta_i.
\]
Since $P^{\pi}_{i,\theta_i}(d_A = j \mid T_A > \nu) = P^{\pi}_{i,\theta_i}(d_A = j,\ \nu < T_A < \infty)/P^{\pi}_{i,\theta_i}(T_A > \nu)$, the bound (28) follows, and summing (28) over $j \in \mathcal{N} \setminus \{i\}$ yields (29).
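The change-of-measure step behind (25) rests on the elementary fact that a likelihood ratio has unit expectation under the null measure. A Monte Carlo sanity check (ours; a single-stream Gaussian LR with parameters chosen to keep the LR variance moderate, so the sample mean is stable):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 0.5, 10, 200000
x = rng.standard_normal((reps, n))      # data generated under P_infty (no change)
# LR(0, n) = exp(sum_t (theta*x_t - theta^2/2)); by change of measure,
# E_infty[LR(0, n)] = 1 for every n.
lr = np.exp((theta * x - 0.5 * theta ** 2).sum(axis=1))
lr_mean = lr.mean()
```

Note that for larger `theta` or `n` the LR variance explodes and a naive Monte Carlo mean becomes unreliable, which is one reason the paper works with bounds rather than simulation in the general case.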
Remark 1. Typically, the upper bounds (26)–(29) for the PFA and PMI are not tight but rather conservative, especially when the overshoots over the thresholds are large (i.e., when the hypotheses $H_i$ and $H_\infty$ are not close). Unfortunately, in the general non-i.i.d. case, an improvement of these bounds is not possible. In the i.i.d. case, where the observations are independent and identically distributed with common pre-change density $g_i(x)$ and common post-change density $f_i(x)$ in the $i$-th stream (i.e., when the post-change hypotheses are simple), it is possible to obtain asymptotically accurate approximations using renewal theory, similarly to how it was done in [20, Th. 7.1.5, p. 327] for the PFA in the single-stream case.

VI. LOWER BOUNDS ON THE MOMENTS OF THE DETECTION DELAY IN THE CLASSES $\mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta})$ AND $\mathcal{C}^{\star}_\pi(\alpha, \bar{\boldsymbol{\beta}})$

For $i \in \mathcal{N}$, define
\[
\Psi_i(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \max\left\{ \frac{|\log \alpha_i|}{I_i(\theta_i) + \mu},\ \max_{j \in \mathcal{N} \setminus \{i\}} \frac{|\log \beta_{ji}|}{\inf_{\theta_j \in \Theta_j} I_{ij}(\theta_i, \theta_j)} \right\} \tag{32}
\]
and
\[
\Psi^{\star}_i(\alpha, \bar{\boldsymbol{\beta}}) = \max\left\{ \frac{|\log \alpha|}{I_i(\theta_i) + \mu},\ \max_{j \in \mathcal{N} \setminus \{i\}} \frac{|\log \bar\beta_j|}{\inf_{\theta_j \in \Theta_j} I_{ij}(\theta_i, \theta_j)} \right\}. \tag{33}
\]
The following theorem establishes asymptotic lower bounds on the moments of the detection delay $R^{r}_{k,i,\theta_i}(\delta)$ and $\bar R^{r}_{i,\theta_i}(\delta)$ ($r \ge 1$) in the classes of detection–identification rules $\mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta})$ and $\mathcal{C}^{\star}_\pi(\alpha, \bar{\boldsymbol{\beta}})$ defined in (14) and (17), respectively. These bounds will be used in the next section for proving the asymptotic optimality of the detection–identification rule $\delta_A$ with suitable thresholds.

Theorem 2.
Let, for some $\mu \ge 0$, the prior distribution belong to the class $\mathbf{C}(\mu)$. Assume that for some positive and finite numbers $I_i(\theta_i)$ ($\theta_i \in \Theta_i$, $i \in \mathcal{N}$) and $I_{ij}(\theta_i, \theta_j)$ ($\theta_i \in \Theta_i$, $\theta_j \in \Theta_j$, $i \in \mathcal{N}$, $j \in \mathcal{N} \setminus \{i\}$) condition $C_1$ holds. If $\inf_{\theta_j \in \Theta_j} I_{ij}(\theta_i, \theta_j) > 0$ for all $j \in \mathcal{N} \setminus \{i\}$, then for all $r \ge 1$, $\theta_i \in \Theta_i$, and $i \in \mathcal{N}$,
\[
\liminf_{\alpha_{\max}, \beta_{\max} \to 0}\ \inf_{\delta \in \mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta})} \frac{R^{r}_{k,i,\theta_i}(\delta)}{[\Psi_i(\boldsymbol{\alpha}, \boldsymbol{\beta})]^r} \ge 1 \quad \text{for all } k \in \mathbb{Z}, \tag{34}
\]
\[
\liminf_{\alpha_{\max}, \beta_{\max} \to 0}\ \inf_{\delta \in \mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta})} \frac{\bar R^{r}_{i,\theta_i}(\delta)}{[\Psi_i(\boldsymbol{\alpha}, \boldsymbol{\beta})]^r} \ge 1, \tag{35}
\]
and
\[
\liminf_{\alpha_{\max}, \beta_{\max} \to 0}\ \inf_{\delta \in \mathcal{C}^{\star}_\pi(\alpha, \bar{\boldsymbol{\beta}})} \frac{R^{r}_{k,i,\theta_i}(\delta)}{[\Psi^{\star}_i(\alpha, \bar{\boldsymbol{\beta}})]^r} \ge 1 \quad \text{for all } k \in \mathbb{Z}, \tag{36}
\]
\[
\liminf_{\alpha_{\max}, \beta_{\max} \to 0}\ \inf_{\delta \in \mathcal{C}^{\star}_\pi(\alpha, \bar{\boldsymbol{\beta}})} \frac{\bar R^{r}_{i,\theta_i}(\delta)}{[\Psi^{\star}_i(\alpha, \bar{\boldsymbol{\beta}})]^r} \ge 1, \tag{37}
\]
where $\Psi_i(\boldsymbol{\alpha}, \boldsymbol{\beta})$ and $\Psi^{\star}_i(\alpha, \bar{\boldsymbol{\beta}})$ are defined in (32) and (33), respectively.

Proof: We only provide the proof of the asymptotic lower bounds (34) and (35). The proof of (36) and (37) is essentially similar.
Notice that the proof can be split into two parts: if we show that, on the one hand, for any rule $\delta \in \mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta})$, as $\alpha_{\max} \to 0$ and $\beta_{\max} \to 0$,
\[
R^{r}_{k,i,\theta_i}(\delta) \ge \left[ \max_{j \in \mathcal{N} \setminus \{i\}} \frac{|\log \beta_{ji}|}{\inf_{\theta_j \in \Theta_j} I_{ij}(\theta_i, \theta_j)} \right]^r (1 + o(1)) \quad \forall\, k \in \mathbb{Z} \tag{38}
\]
and
\[
\bar R^{r}_{i,\theta_i}(\delta) \ge \left[ \max_{j \in \mathcal{N} \setminus \{i\}} \frac{|\log \beta_{ji}|}{\inf_{\theta_j \in \Theta_j} I_{ij}(\theta_i, \theta_j)} \right]^r (1 + o(1)), \tag{39}
\]
and, on the other hand,
\[
R^{r}_{k,i,\theta_i}(\delta) \ge \left( \frac{|\log \alpha_i|}{I_i(\theta_i) + \mu} \right)^r (1 + o(1)) \quad \forall\, k \in \mathbb{Z} \tag{40}
\]
and
\[
\bar R^{r}_{i,\theta_i}(\delta) \ge \left( \frac{|\log \alpha_i|}{I_i(\theta_i) + \mu} \right)^r (1 + o(1)), \tag{41}
\]
where $o(1) \to 0$, then, obviously, combining inequalities (38) and (40) yields (34), and combining (39) and (41) yields (35). The detailed proof of inequalities (38)–(41) is postponed to the Appendix.

VII. ASYMPTOTIC OPTIMALITY
The following proposition, whose proof is given in the Appendix, establishes first-order asymptotic approximations to the moments of the detection delay of the detection–identification rule $\delta_A$ when the thresholds $A_{ij}$ go to infinity, regardless of the PFA and PMI constraints. Write $A_{\min} = \min_{i \in \mathcal{N},\, j \in \mathcal{N}_0 \setminus \{i\}} A_{ij}$.

Proposition 1.
Let $r \ge 1$ and let the prior distribution of the change point belong to the class $\mathbf{C}(\mu)$. Assume that for some $0 < I_i(\theta_i) < \infty$, $\theta_i \in \Theta_i$, $i \in \mathcal{N}$, and $0 < I_{ij}(\theta_i, \theta_j) < \infty$, $\theta_i \in \Theta_i$, $\theta_j \in \Theta_j$, $i \in \mathcal{N}$, $j \in \mathcal{N} \setminus \{i\}$, the right-tail and left-tail conditions $C_1$ and $C_2$ are satisfied, and that $\inf_{\theta_j \in \Theta_j} I_{ij}(\theta_i, \theta_j) > 0$ for all $j \in \mathcal{N} \setminus \{i\}$, $i \in \mathcal{N}$. Then, for all $0 < m \le r$, $\theta_i \in \Theta_i$, and $i \in \mathcal{N}$, as $A_{\min} \to \infty$,
\[
R^{m}_{k,i,\theta_i}(\delta_A) \sim [\Psi_i(A, \theta_i, \mu)]^m \quad \text{for all } k \in \mathbb{Z} \tag{42}
\]
and
\[
\bar R^{m}_{i,\theta_i}(\delta_A) \sim [\Psi_i(A, \theta_i, \mu)]^m, \tag{43}
\]
where
\[
\Psi_i(A, \theta_i, \mu) = \max\left\{ \frac{\log A_i}{I_i(\theta_i) + \mu},\ \max_{j \in \mathcal{N} \setminus \{i\}} \frac{\log A_{ij}}{\inf_{\theta_j \in \Theta_j} I_{ij}(\theta_i, \theta_j)} \right\}. \tag{44}
\]
Hereafter we use the standard notation $x_a \sim y_a$ as $a \to a_0$ if $\lim_{a \to a_0}(x_a/y_a) = 1$.

In order to prove this proposition we need the following lemma, whose proof is given in the Appendix. For $i = 1, \dots, N$, define
\[
\lambda_{i,W}(k, k+n) = \log \mathrm{LR}_{i,W}(k, k+n), \qquad \lambda^{\pi}_i(n) = \log\left[ \sum_{k=-1}^{n-1} \pi_k \sup_{\theta_i \in \Theta_i} \mathrm{LR}_{i,\theta_i}(k, n) \right],
\]
\[
\widetilde\Psi_i(A, \pi_k, \theta_i, \mu, \varepsilon) = \max\left\{ \frac{\log(A_i/\pi_k)}{I_i(\theta_i) + \mu - \varepsilon},\ \max_{j \in \mathcal{N} \setminus \{i\}} \frac{\log(A_{ij}/\pi_k)}{\inf_{\theta_j \in \Theta_j} I_{ij}(\theta_i, \theta_j) - \varepsilon} \right\}, \tag{45}
\]
\[
M_i(A) = M_i(A, \pi_k, \theta_i, \mu, \varepsilon) = 1 + \bigl\lfloor \widetilde\Psi_i(A, \pi_k, \theta_i, \mu, \varepsilon) \bigr\rfloor,
\]
where $\lfloor y \rfloor$ is the greatest integer less than or equal to $y$.

Lemma 1.
Let $r \ge 1$ and let the prior distribution of the change point satisfy condition (23). Then, for a sufficiently large $A_{\min}$, any $0 < \varepsilon < J_{ij}(\theta_i, \mu)$, and all $k \in \mathbb{Z}$,
\[
E_{k,i,\theta_i}\bigl[(T_A - k)^+\bigr]^r \le \bigl[\widetilde\Psi_i(A, \pi_k, \theta_i, \mu, \varepsilon)\bigr]^r + r\, 2^{r-1} \sum_{n=M_i(A)}^{\infty} n^{r-1}\, P_{k,i,\theta_i}\left\{ \frac{1}{n} \inf_{\vartheta \in \Gamma_{\varkappa,\theta_i}} \lambda_{i,\vartheta}(k, k+n) < I_i(\theta_i) - \varepsilon \right\}, \tag{46}
\]
where $T_A - k = T_A$ for $k = -1$, $x^+ = \max(0, x)$, and
\[
J_{ij}(\theta_i, \mu) = \min\Bigl\{ I_i(\theta_i) + \mu,\ \min_{j \in \mathcal{N} \setminus \{i\}} \inf_{\theta_j \in \Theta_j} I_{ij}(\theta_i, \theta_j) \Bigr\}.
\]
Theorems 1 and 2 and Proposition 1 allow us to conclude that the detection–identification rule $\delta_A$ is asymptotically first-order optimal in the classes $\mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta})$ and $\mathcal{C}^{\star}_\pi(\alpha, \bar{\boldsymbol{\beta}})$ as $\alpha_{\max}, \beta_{\max} \to 0$.

Theorem 3.
Let $r \ge 1$ and let the prior distribution of the change point belong to the class $\mathbf{C}(\mu)$. Assume that for some $0 < I_i(\theta_i) < \infty$, $\theta_i \in \Theta_i$, $i \in \mathcal{N}$, and $0 < I_{ij}(\theta_i, \theta_j) < \infty$, $\theta_i \in \Theta_i$, $\theta_j \in \Theta_j$, $i \in \mathcal{N}$, $j \in \mathcal{N} \setminus \{i\}$, the right-tail and left-tail conditions $C_1$ and $C_2$ are satisfied, and that $\inf_{\theta_j \in \Theta_j} I_{ij}(\theta_i, \theta_j) > 0$ for all $j \in \mathcal{N} \setminus \{i\}$, $i \in \mathcal{N}$.

(i) If the thresholds $A_i$, $i \in \mathcal{N}$, and $A_{ij}$, $j \in \mathcal{N} \setminus \{i\}$, $i \in \mathcal{N}$, are selected so that $\mathrm{PFA}^{\pi}_i(\delta_A) \le \alpha_i$, $\mathrm{PMI}^{\pi}_{ij}(\delta_A) \le \beta_{ij}$, and $\log A_i \sim |\log \alpha_i|$, $\log A_{ij} \sim |\log \beta_{ji}|$ as $\alpha_{\max}, \beta_{\max} \to 0$, in particular as $A_i = (1 - \alpha_i)/\alpha_i$ and $A_{ij} = [(1 - \alpha_j)\beta_{ji}]^{-1}$, then $\delta_A$ is first-order asymptotically optimal as $\alpha_{\max}, \beta_{\max} \to 0$ in the class $\mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta})$, minimizing the moments of the detection delay up to order $r$: for all $0 < m \le r$, $\theta_i \in \Theta_i$, and $i \in \mathcal{N}$,
\[
\inf_{\delta \in \mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta})} R^{m}_{k,i,\theta_i}(\delta) \sim R^{m}_{k,i,\theta_i}(\delta_A) \sim \max\left\{ \frac{|\log \alpha_i|}{I_i(\theta_i) + \mu},\ \max_{j \in \mathcal{N} \setminus \{i\}} \frac{|\log \beta_{ji}|}{\inf_{\theta_j \in \Theta_j} I_{ij}(\theta_i, \theta_j)} \right\}^m
\]
as $\alpha_{\max}, \beta_{\max} \to 0$ for all $k \in \mathbb{Z}$ (47), and
\[
\inf_{\delta \in \mathcal{C}_\pi(\boldsymbol{\alpha}, \boldsymbol{\beta})} \bar R^{m}_{i,\theta_i}(\delta) \sim \bar R^{m}_{i,\theta_i}(\delta_A) \sim \max\left\{ \frac{|\log \alpha_i|}{I_i(\theta_i) + \mu},\ \max_{j \in \mathcal{N} \setminus \{i\}} \frac{|\log \beta_{ji}|}{\inf_{\theta_j \in \Theta_j} I_{ij}(\theta_i, \theta_j)} \right\}^m
\]
as $\alpha_{\max}, \beta_{\max} \to 0$ (48).

(ii) If the thresholds $A_i = A$ and $A_{ij} = A_j$, $j \in \mathcal{N} \setminus \{i\}$, $i \in \mathcal{N}$, are selected so that $\mathrm{PFA}_\pi(\delta_A) \le \alpha$, $\mathrm{PMI}^{\pi}_i(\delta_A) \le \bar\beta_i$, and $\log A \sim |\log \alpha|$, $\log A_j \sim |\log \bar\beta_j|$ as $\alpha, \bar\beta_{\max} \to 0$, in particular as $A = N(1 - \alpha/N)/\alpha$ and $A_j = [(N-1)^{-1}(1 - \alpha/N)\bar\beta_j]^{-1}$, then $\delta_A$ is first-order asymptotically optimal as $\alpha, \bar\beta_{\max} \to 0$ in the class $\mathcal{C}^{\star}_\pi(\alpha, \bar{\boldsymbol{\beta}})$, minimizing the moments of the detection delay up to order $r$: for all $0 < m \le r$, $\theta_i \in \Theta_i$, and $i \in \mathcal{N}$,
\[
\inf_{\delta \in \mathcal{C}^{\star}_\pi(\alpha, \bar{\boldsymbol{\beta}})} R^{m}_{k,i,\theta_i}(\delta) \sim R^{m}_{k,i,\theta_i}(\delta_A) \sim \max\left\{ \frac{|\log \alpha|}{I_i(\theta_i) + \mu},\ \max_{j \in \mathcal{N} \setminus \{i\}} \frac{|\log \bar\beta_j|}{\inf_{\theta_j \in \Theta_j} I_{ij}(\theta_i, \theta_j)} \right\}^m
\]
as $\alpha, \bar\beta_{\max} \to 0$ for all $k \in \mathbb{Z}$ (49), and
\[
\inf_{\delta \in \mathcal{C}^{\star}_\pi(\alpha, \bar{\boldsymbol{\beta}})} \bar R^{m}_{i,\theta_i}(\delta) \sim \bar R^{m}_{i,\theta_i}(\delta_A) \sim \max\left\{ \frac{|\log \alpha|}{I_i(\theta_i) + \mu},\ \max_{j \in \mathcal{N} \setminus \{i\}} \frac{|\log \bar\beta_j|}{\inf_{\theta_j \in \Theta_j} I_{ij}(\theta_i, \theta_j)} \right\}^m
\]
as $\alpha, \bar\beta_{\max} \to 0$ (50).

Proof:
Proof of (i). Setting $\log A_i\sim|\log\alpha_i|$ and $\log A_{ij}\sim|\log\beta_{ji}|$ in (42) yields, as $\alpha_{\max},\beta_{\max}\to0$,
\[
\mathcal{R}^m_{k,i,\theta_i}(\delta_A) \sim \max\left\{\frac{|\log\alpha_i|}{I_i(\theta_i)+\mu},\ \max_{j\in\mathcal{N}\setminus\{i\}}\frac{|\log\beta_{ji}|}{\inf_{\theta_j\in\Theta_j}I_{ij}(\theta_i,\theta_j)}\right\}^m, \quad i\in\mathcal{N}. \quad (51)
\]
In particular, $\log A_i\sim|\log\alpha_i|$ and $\log A_{ij}\sim|\log\beta_{ji}|$ if $A_i=(1-\alpha_i)/\alpha_i$ and $A_{ij}=[(1-\alpha_j)\beta_{ji}]^{-1}$, and, by Theorem 1, $\mathrm{PFA}^\pi_i(\delta_A)\le\alpha_i$ and $\mathrm{PMI}_{ij}(\delta_A)\le\beta_{ij}$ with this choice of thresholds (see (30)). Comparing the asymptotic approximations (51) with the lower bounds (34) in Theorem 2 completes the proof of (47). The proof of (48) is similar.

Proof of (ii). Setting $\log A_i=\log A\sim|\log\alpha|$ and $\log A_{ij}=\log A_j\sim|\log\bar\beta_j|$ in (42) yields, as $\alpha_{\max},\bar\beta_{\max}\to0$,
\[
\mathcal{R}^m_{k,i,\theta_i}(\delta_A) \sim \max\left\{\frac{|\log\alpha|}{I_i(\theta_i)+\mu},\ \max_{j\in\mathcal{N}\setminus\{i\}}\frac{|\log\bar\beta_j|}{\inf_{\theta_j\in\Theta_j}I_{ij}(\theta_i,\theta_j)}\right\}^m, \quad i\in\mathcal{N}. \quad (52)
\]
In particular, $\log A\sim|\log\alpha|$ and $\log A_j\sim|\log\bar\beta_j|$ if $A=N(1-\alpha/N)/\alpha$ and $A_j=[(N-1)(1-\alpha/N)\bar\beta_j]^{-1}$, and, by Theorem 1, $\mathrm{PFA}^\pi(\delta_A)\le\alpha$ and $\mathrm{PMI}_i(\delta_A)\le\bar\beta_i$ with this choice of thresholds (see (31)). Comparing the asymptotic approximations (52) with the lower bounds (36) in Theorem 2 completes the proof of (49). The proof of (50) is similar.

Remark 2.
If the prior distribution $\pi = \pi^{\alpha_{\max},\beta_{\max}}$ depends on the PFA constraint $\alpha_{\max}$ and the PMI constraint $\beta_{\max}$ and $\mu_{\alpha_{\max},\beta_{\max}}\to0$ as $\alpha_{\max},\beta_{\max}\to0$, then a modification of the preceding argument can be used to show that the assertions of Theorem 3 hold with $\mu=0$.

Note that conditions (20) are satisfied if
\[
\frac{1}{n}\lambda_{i,\theta_i;j,\theta_j}(k,k+n) \xrightarrow[n\to\infty]{\mathsf{P}_{k,i,\theta_i}\text{-a.s.}} I_{ij}(\theta_i,\theta_j)
\]
(see Lemma B.1 in [18, p. 243]). Assume also that for some positive and finite numbers $I_{0,i}(\theta_i)$, $i\in\mathcal{N}$,
\[
-\frac{1}{n}\lambda_{i,\theta_i}(k,k+n) \xrightarrow[n\to\infty]{\mathsf{P}_\infty\text{-a.s.}} I_{0,i}(\theta_i).
\]
In particular, in the i.i.d. case, these conditions hold with
\[
I_{ij}(\theta_i,\theta_j) \equiv K_{ij}(\theta_i,\theta_j) = \int \left(\log\frac{f_{i,\theta_i}(x)}{f_{j,\theta_j}(x)}\right) f_{i,\theta_i}(x)\,\mathrm{d}x, \qquad
I_{0,i}(\theta_i) \equiv K_{0,i}(\theta_i) = \int \left(\log\frac{g_i(x)}{f_{i,\theta_i}(x)}\right) g_i(x)\,\mathrm{d}x
\]
being the Kullback–Leibler information numbers. Then $I_{ij}(\theta_i,\theta_j) = I_i(\theta_i) + I_{0,j}(\theta_j) > I_i(\theta_i)$. Therefore, if the prior distribution of the change point is heavy-tailed (i.e., $\mu=0$) and the PFA constraints are smaller than the PMI constraints, $\alpha_i\le\beta_{ji}$, $\alpha\le\bar\beta_j$, which is typical in many applications, then the asymptotics (48) and (50) reduce to
\[
\inf_{\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)}\bar{\mathcal{R}}^m_{i,\theta_i}(\delta) \sim \left(\frac{|\log\alpha_i|}{I_i(\theta_i)}\right)^m \sim \bar{\mathcal{R}}^m_{i,\theta_i}(\delta_A) \quad (53)
\]
(as $\alpha_{\max},\beta_{\max}\to0$) and
\[
\inf_{\delta\in\mathbb{C}^\star_\pi(\alpha,\bar{\boldsymbol\beta})}\bar{\mathcal{R}}^m_{i,\theta_i}(\delta) \sim \left(\frac{|\log\alpha|}{I_i(\theta_i)}\right)^m \sim \bar{\mathcal{R}}^m_{i,\theta_i}(\delta_A) \quad (54)
\]
(as $\alpha,\bar\beta_{\max}\to0$).

Consider now the fully Bayesian setting where not only the prior distribution $\pi=\{\pi_k\}_{k\in\mathbb{Z}}$ of the change point $\nu$ is given, but also the prior distribution $p=\{p_i\}_{i\in\mathcal{N}}$ of the hypotheses, $\mathsf{P}(H_i)=p_i$, $i\in\mathcal{N}$, is specified.
Then, in place of the maximal probabilities of misidentification (13), one can consider the following average probabilities of misidentification:
\[
\mathrm{PMI}^{\pi,W}_i(\delta) = \mathsf{P}^{\pi,W}_i(d\ne i, T<\infty\,|\,T>\nu) = \int_{\Theta_i}\mathsf{P}^\pi_{i,\theta_i}(d\ne i, T<\infty\,|\,T>\nu)\,\mathrm{d}W_i(\theta_i), \qquad
\mathrm{PMI}^{\pi,W,p}(\delta) = \sum_{i=1}^N p_i\,\mathrm{PMI}^{\pi,W}_i(\delta),
\]
and the risk associated with the detection delay is measured by $\bar{\mathcal{R}}^r_{\pi,W,p}(\delta) = \mathsf{E}^{\pi,W,p}[(T-\nu)^r\,|\,T>\nu]$ (in place of (10)). Here
\[
\mathsf{P}^{\pi,W}_i(\mathcal{A}\times\mathcal{K}) = \sum_{k\in\mathcal{K}}\pi_k\int_{\Theta_i}\mathsf{P}_{k,i,\theta_i}(\mathcal{A})\,\mathrm{d}W_i(\theta_i), \qquad
\mathsf{P}^{\pi,W,p}(\mathcal{A}\times\mathcal{K}) = \sum_{i=1}^N p_i\,\mathsf{P}^{\pi,W}_i(\mathcal{A}\times\mathcal{K}),
\]
and $\mathsf{E}^{\pi,W,p}$ is the expectation under the measure $\mathsf{P}^{\pi,W,p}$. It follows from Theorem 1 that for the rule $\delta_A$ with $A_i=A$, $A_{ij}=A_j$, $i\in\mathcal{N}$, $j\in\mathcal{N}$, we have
\[
\mathrm{PMI}^{\pi,W}_i(\delta_A) \le \frac{(1+A)(N-1)}{A\,A_i}, \quad i\in\mathcal{N}, \qquad
\mathrm{PMI}^{\pi,W,p}(\delta_A) \le \frac{(1+A)(N-1)}{A}\sum_{i=1}^N\frac{p_i}{A_i}.
\]
Introduce the class of detection–identification rules
\[
\bar{\mathbb{C}}_{\pi,W,p}(\alpha,\beta) = \Big\{\delta: \mathrm{PFA}^\pi(\delta)\le\alpha \ \text{and}\ \mathrm{PMI}^{\pi,W,p}(\delta)\le\beta\Big\}
\]
for which the weighted probability of false alarm does not exceed $\alpha\in(0,1)$ and the average probability of misidentification does not exceed $\beta\in(0,1)$. Note that $\delta_A\in\bar{\mathbb{C}}_{\pi,W,p}(\alpha,\beta)$ whenever $A=N(1-\alpha/N)/\alpha$ and $A_{ij}=A_0=[(N-1)(1-\alpha/N)\beta]^{-1}$. Using Theorem 3, it is easy to prove that the rule $\delta_A$ is first-order asymptotically optimal in the fully Bayesian setting in class $\bar{\mathbb{C}}_{\pi,W,p}(\alpha,\beta)$. Specifically, the following theorem holds.

Theorem 4.
Let $r\ge1$, let the prior distribution of the change point belong to class $\mathbf{C}(\mu)$, and let $p=\{p_i\}_{i\in\mathcal{N}}$ be the prior distribution of the hypotheses that the change occurs in the $i$th data stream. Assume that for some $0<I_i(\theta_i)<\infty$, $\theta_i\in\Theta_i$, $i\in\mathcal{N}$, and $0<I_{ij}(\theta_i,\theta_j)<\infty$, $\theta_i\in\Theta_i$, $\theta_j\in\Theta_j$, $i\in\mathcal{N}$, $j\in\mathcal{N}\setminus\{i\}$, the right-tail and left-tail conditions $\mathbf{C}_1$ and $\mathbf{C}_2$ are satisfied, and that $\inf_{\theta_j\in\Theta_j}I_{ij}(\theta_i,\theta_j)>0$ for all $j\in\mathcal{N}\setminus\{i\}$, $i\in\mathcal{N}$. If the thresholds $A_i=A$ and $A_{ij}=A_0$, $j\in\mathcal{N}\setminus\{i\}$, $i\in\mathcal{N}$, in rule $\delta_A$ are selected so that $\mathrm{PFA}^\pi(\delta_A)\le\alpha$, $\mathrm{PMI}^{\pi,W,p}(\delta_A)\le\beta$ and $\log A\sim|\log\alpha|$, $\log A_0\sim|\log\beta|$ as $\alpha,\beta\to0$, in particular as $A=N(1-\alpha/N)/\alpha$ and $A_0=[(N-1)(1-\alpha/N)\beta]^{-1}$, then $\delta_A$ is first-order asymptotically optimal as $\alpha,\beta\to0$ in class $\bar{\mathbb{C}}_{\pi,W,p}(\alpha,\beta)$, minimizing moments of the detection delay up to order $r$: for all $0<m\le r$,
\[
\inf_{\delta\in\bar{\mathbb{C}}_{\pi,W,p}(\alpha,\beta)}\bar{\mathcal{R}}^m_{\pi,W,p}(\delta) \sim \bar{\mathcal{R}}^m_{\pi,W,p}(\delta_A) \sim \max\{\gamma(p,W,\mu)|\log\alpha|,\ \gamma(p,W)|\log\beta|\}^m \quad \text{as } \alpha,\beta\to0, \quad (55)
\]
where
\[
\gamma(p,W,\mu) = \sum_{i=1}^N p_i\int_{\Theta_i}\frac{\mathrm{d}W_i(\theta_i)}{I_i(\theta_i)+\mu}, \qquad
\gamma(p,W) = \sum_{i=1}^N p_i\int_{\Theta_i}\frac{\mathrm{d}W_i(\theta_i)}{\min_{j\in\mathcal{N}\setminus\{i\}}\inf_{\theta_j\in\Theta_j}I_{ij}(\theta_i,\theta_j)}.
\]

Remark 3.
First-order approximations (47)–(50) and (55) for the moments of the detection delay are usually not accurate. In the general non-i.i.d. case it is difficult, if at all possible, to obtain more accurate higher-order approximations. Higher-order approximations for the expected detection delays ($m=1$) can be obtained in the i.i.d. case using nonlinear renewal theory and the techniques developed in [3, Th 3.3] and [20, Th 4.3.4, Th 7.1.5].

VIII. AN EXAMPLE: DETECTION OF SIGNALS WITH UNKNOWN AMPLITUDES
Suppose there is an $N$-channel sensor system and we are able to observe the output vector $X_n=(X_n(1),\dots,X_n(N))$, $n=1,2,\dots$. The observations $X_n(i)$ in the $i$th channel have the form
\[
X_n(i) = \theta_i S_n(i)\,\mathbb{1}_{\{n>\nu\}} + \xi_n(i), \quad n\ge1,\ i=1,\dots,N,
\]
where $\theta_i$ is an unknown intensity or amplitude ($\theta_i>0$) of a deterministic signal $S_n(i)$ (e.g., the signal $S_n(i)=\cos(\omega_i n)$) and $\{\xi_n(i)\}_{n\in\mathbb{Z}_+}$, $i\in\mathcal{N}$, are mutually independent noises, which are stable Gaussian AR($p$) processes obeying the recursions
\[
\xi_n(i) = \sum_{t=1}^{p}\varrho_{i,t}\,\xi_{n-t}(i) + w_n(i), \quad n\ge1. \quad (56)
\]
Here $\{w_n(i)\}_{n\ge1}$, $i\in\mathcal{N}$, are mutually independent i.i.d. Gaussian sequences with mean zero and standard deviation $\sigma>0$. The coefficients $\varrho_{i,1},\dots,\varrho_{i,p}$ and the variance $\sigma^2$ are known.

A signal may appear in only one channel and should be detected and isolated quickly, i.e., the number of the channel where the signal appears should be identified along with detection.

Define $\widetilde S_{i,n} = S_n(i) - \sum_{t=1}^{p_n}\varrho_{i,t}S_{n-t}(i)$ and $\widetilde X_{i,n} = X_n(i) - \sum_{t=1}^{p_n}\varrho_{i,t}X_{n-t}(i)$, where $p_n=p$ if $n>p$ and $p_n=n$ if $n\le p$. The LLRs have the form
\[
\lambda_{i,\theta_i}(k,k+n) = \frac{\theta_i}{\sigma^2}\sum_{t=k+1}^{k+n}\widetilde S_{i,t}\widetilde X_{i,t} - \frac{\theta_i^2}{2\sigma^2}\sum_{t=k+1}^{k+n}\widetilde S_{i,t}^2, \qquad
\lambda_{i,\theta_i;j,\theta_j}(k,k+n) = \lambda_{i,\theta_i}(k,k+n) - \lambda_{j,\theta_j}(k,k+n).
\]
Under the measure $\mathsf{P}_{k,i,\vartheta}$, $\vartheta\in\Theta_i$, the LLR $\lambda_{i,\theta_i;j,\theta_j}(k,k+n)$ is a Gaussian process (with independent, non-identically distributed increments) with mean and variance
\[
\mathsf{E}_{k,i,\vartheta}[\lambda_{i,\theta_i;j,\theta_j}(k,k+n)] = \frac{1}{2\sigma^2}\left[(2\theta_i\vartheta-\theta_i^2)\sum_{t=k+1}^{k+n}\widetilde S_{i,t}^2 + \theta_j^2\sum_{t=k+1}^{k+n}\widetilde S_{j,t}^2\right],
\]
\[
\mathrm{Var}_{k,i,\vartheta}[\lambda_{i,\theta_i;j,\theta_j}(k,k+n)] = \frac{1}{\sigma^2}\left[\theta_i^2\sum_{t=k+1}^{k+n}\widetilde S_{i,t}^2 + \theta_j^2\sum_{t=k+1}^{k+n}\widetilde S_{j,t}^2\right]. \quad (57)
\]
Let $\Theta_i=(0,\infty)$, $i\in\mathcal{N}$, and assume that
\[
\lim_{n\to\infty}\frac{1}{n}\sup_{k\in\mathbb{Z}_+}\sum_{t=k+1}^{k+n}\widetilde S_{i,t}^2 = Q_i,
\]
where $0<Q_i<\infty$.
This is typically the case in most signal processing applications, e.g., for the sequence of sine pulses $S_n(i)=\sin(\omega_i n+\phi_i)$ with frequency $\omega_i$ and phase $\phi_i$. Then for all $k\in\mathbb{Z}_+$ and $\theta_i,\theta_j\in(0,\infty)$,
\[
\frac{1}{n}\lambda_{i,\theta_i;j,\theta_j}(k,k+n) \xrightarrow[n\to\infty]{\mathsf{P}_{k,i,\theta_i}\text{-a.s.}} \frac{\theta_i^2Q_i+\theta_j^2Q_j}{2\sigma^2} = I_{ij}(\theta_i,\theta_j), \quad j\in\mathcal{N}\setminus\{i\},\ i\in\mathcal{N},
\]
\[
\frac{1}{n}\lambda_{i,\theta_i}(k,k+n) \xrightarrow[n\to\infty]{\mathsf{P}_{k,i,\theta_i}\text{-a.s.}} \frac{\theta_i^2Q_i}{2\sigma^2} = I_i(\theta_i), \quad i\in\mathcal{N},
\]
so that condition $\mathbf{C}_1$ holds. Furthermore, since all moments of the LLR are finite, condition $\mathbf{C}_2$ holds for all $r>0$. Indeed, using (57), we obtain
\[
I_i(\vartheta,\theta_i) := \lim_{n\to\infty}\frac{1}{n}\mathsf{E}_{k,i,\vartheta}[\lambda_{i,\theta_i}(k,k+n)] = (\vartheta\theta_i-\theta_i^2/2)\,Q_i/\sigma^2,
\]
and for any $\kappa>0$,
\[
\mathsf{P}_{k,i,\theta_i}\left(\frac{1}{n}\inf_{|\vartheta-\theta_i|<\kappa}\lambda_{i,\vartheta}(k,k+n) < I_i(\theta_i)-\varepsilon\right) \le \mathsf{P}_{k,i,\theta_i}\left(\sup_{\vartheta\in[\theta_i-\kappa,\theta_i+\kappa]}\left|\frac{1}{n}\lambda_{i,\vartheta}(k,k+n)-I_i(\vartheta,\theta_i)\right|>\varepsilon\right) = \mathsf{P}_{k,i,\theta_i}\big(|Y_{k,n}(\theta_i)|>\varepsilon\sqrt{n}\big),
\]
where
\[
Y_{k,n}(\theta_i) = \frac{\theta_i}{\sigma\sqrt{n}}\sum_{t=k+1}^{k+n}\widetilde S_{i,t}\,\eta_t(i), \quad n\ge1,
\]
and $\{\eta_t(i)\}_{t\ge1}$ is a sequence of standard zero-mean normal random variables. Hence $\{Y_{k,n}(\theta_i)\}_{n\ge1}$ is a sequence of normal random variables with mean zero and variance $\sigma^2_{i,n}=n^{-1}\sigma^{-2}\theta_i^2\sum_{t=k+1}^{k+n}\widetilde S_{i,t}^2$, which is asymptotic to $\theta_i^2Q_i/\sigma^2$. Thus, for a sufficiently large $n$ there exists $\delta_0>0$ such that $\sigma^2_{i,n}\le\delta_0+\theta_i^2Q_i/\sigma^2$, and we obtain that for all large $n$,
\[
\mathsf{P}_{k,i,\theta_i}\left(\frac{1}{n}\inf_{|\vartheta-\theta_i|<\kappa}\lambda_{i,\vartheta}(k,k+n) < I_i(\theta_i)-\varepsilon\right) \le \mathsf{P}\left(|\hat\eta|>\frac{\varepsilon\sqrt{n}}{\sqrt{\delta_0+\theta_i^2Q_i/\sigma^2}}\right),
\]
where $\hat\eta$ is a standard normal random variable.
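The whitening transformation $\widetilde S_{i,n}$, $\widetilde X_{i,n}$ and the LLR $\lambda_{i,\theta_i}(k,k+n)$ above can be sketched numerically. The following is a minimal illustration only, not the paper's implementation: the function names and the AR(1) example values are ours, and we assume the AR coefficients $\varrho_{i,t}$ and $\sigma$ are known, as in (56).

```python
import numpy as np

def whiten(z, rho):
    """AR(p) whitening: z~_n = z_n - sum_{t=1}^{p_n} rho_t z_{n-t},
    where only p_n = min(p, n) past values are available (0-based arrays)."""
    p = len(rho)
    zt = np.empty(len(z))
    for n in range(len(z)):
        pn = min(p, n)
        zt[n] = z[n] - sum(rho[t] * z[n - 1 - t] for t in range(pn))
    return zt

def llr(theta, S_t, X_t, sigma, k, n):
    """lambda_{i,theta}(k, k+n) for the Gaussian AR model:
    (theta/sigma^2) sum S~ X~  -  (theta^2/(2 sigma^2)) sum S~^2,
    summing over t = k+1, ..., k+n (0-based slice k:k+n)."""
    s, x = S_t[k:k + n], X_t[k:k + n]
    return (theta / sigma**2) * np.dot(s, x) - (theta**2 / (2 * sigma**2)) * np.dot(s, s)

# Illustrative check: with noiseless post-change data X_n = theta * S_n
# (change at k = 0), linearity gives X~ = theta * S~, so the LLR equals
# (theta^2 / (2 sigma^2)) * sum S~^2, i.e., roughly n * I_i(theta).
t = np.arange(1, 201)
S = np.cos(0.3 * t)
theta, sigma, rho = 0.8, 1.0, [0.5]
St, Xt = whiten(S, rho), whiten(theta * S, rho)
print(llr(theta, St, Xt, sigma, 0, 200))
```

In practice, the mixture statistics $\bar\Lambda^{\pi,W}_{ij}(n)$ would average $e^{\lambda_{i,\vartheta}}$ over the mixing distribution $W_i$ and over the change points $k$; the sketch only exhibits the per-$(k,\theta)$ ingredient.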
Therefore,
\[
\Upsilon_r(\kappa,\varepsilon;i,\theta_i) = \sum_{n=1}^{\infty} n^{r-1}\sup_{k\in\mathbb{Z}_+}\mathsf{P}_{k,i,\theta_i}\left(\frac{1}{n}\inf_{|\vartheta-\theta_i|<\kappa}\lambda_{i,\vartheta}(k,k+n) < I_i(\theta_i)-\varepsilon\right) \le \sum_{n=1}^{\infty} n^{r-1}\,\mathsf{P}\left(|\hat\eta|>\frac{\varepsilon\sqrt{n}}{\sqrt{\delta_0+\theta_i^2Q_i/\sigma^2}}\right),
\]
where the right-hand side is finite for all $r>0$ due to the finiteness of all moments of the normal distribution, so that condition $\mathbf{C}_2$ holds for all $r>0$.

Obviously, $\inf_{\theta_j\in(0,\infty)}I_{ij}(\theta_i,\theta_j) = \theta_i^2Q_i/(2\sigma^2) = I_i(\theta_i) > 0$. Therefore, by Theorem 3, the detection–identification rule $\delta_A$ is asymptotically first-order optimal with respect to all positive moments of the detection delay, and the asymptotic formulas (48) and (50) hold with
\[
\inf_{\theta_j\in(0,\infty)}I_{ij}(\theta_i,\theta_j) = I_i(\theta_i) = \frac{\theta_i^2Q_i}{2\sigma^2}.
\]
If $\max_{j\ne i}\beta_{ji}\ge\alpha_i$, $\max_{j\ne i}\bar\beta_j\ge\alpha$, and $\mu=0$, then the asymptotic formulas (53) and (54) hold.

Note that by condition $\mathbf{C}_2$ the rule $\delta_A$ is asymptotically optimal for almost arbitrary mixing distributions $W_i(\theta_i)$. In this example, it is most convenient to select the conjugate prior $W_i(\theta_i)=F(\theta_i/v_i)$, where $F(y)$ is the standard normal distribution function and $v_i>0$, in which case the decision statistics can be computed explicitly.

It is worth noting that this example arises in certain interesting practical applications, e.g., in multichannel/multisensor surveillance systems such as radars, sonars, and electro-optic/infrared sensor systems, which deal with detecting moving and maneuvering targets that appear at unknown times; it is then necessary to detect a signal from a randomly appearing target in clutter and noise with minimal average detection delay, as well as to identify the channel where it appears. See [1], [7], [14], [19]. Another challenging application area where the multichannel model is useful is cyber-security [17], [21], [24]. Malicious intrusion attempts in computer networks (spam campaigns, personal data theft, worms, distributed denial-of-service (DDoS) attacks, etc.)
incur significant financial damage and severely harm the integrity of personal information. It is therefore essential to devise automated techniques for detecting computer network intrusions as quickly as possible so that an appropriate response can be provided and the negative consequences for the users are eliminated. In particular, DDoS attacks typically involve many traffic streams resulting in a large number of packets aimed at congesting the target's server or network.

IX. CONCLUDING REMARKS
1. Since we do not specify a class of models for the observations, such as Gaussian, Markov, or HMM, and build the decision statistics on the LLR processes, we restrict the behavior of the LLRs, which is expressed by conditions $\mathbf{C}_1$ and $\mathbf{C}_2$ related to the law of large numbers for the LLR and to rates of convergence in the law of large numbers. As the example in Section VIII shows, these conditions hold for additive changes (in the mean) of an AR($p$) process driven by Gaussian noise. These conditions also hold in a variety of non-additive examples (detection of changes in the spectrum of time series such as AR($p$) and ARCH($p$) processes) as well as for a large class of homogeneous Markov processes [11], [12], [18, Sec 3.1, Ch 4] and for hidden Markov models with finite hidden state space [4].

2. While we focused on the multistream detection–identification problem (1), it should be noted that similar results also hold in the "scalar" detection–isolation problem, where the observations $\{X_n\}_{n\ge1}$ represent either a scalar process or a vector process all of whose components change at time $\nu$. Specifically, let $\{f_\theta(X_t|\mathbf{X}^{t-1}),\ \theta\in\Theta\}$ be a parametric family of densities and, for $i=1,\dots,N$ and $\Theta_i\subset\Theta$, consider the model
\[
p(\mathbf{X}^n|H_{\nu,i},\theta) = p(\mathbf{X}^n|H_\infty) = \prod_{t=1}^{n} f_{\theta_0}(X_t|\mathbf{X}^{t-1}) \quad\text{for } \nu\ge n,
\]
\[
p(\mathbf{X}^n|H_{\nu,i},\theta) = \prod_{t=1}^{\nu} f_{\theta_0}(X_t|\mathbf{X}^{t-1})\prod_{t=\nu+1}^{n} f_{\theta}(X_t|\mathbf{X}^{t-1}) \quad\text{for } \nu<n,\ \theta\in\Theta_i,
\]
where $\theta_0$ is the known pre-change parameter and $\theta$ is the unknown post-change parameter. In other words, there are $N$ types of change, and for the $i$th type of change the value of the post-change parameter $\theta$ belongs to a subset $\Theta_i$ of the parameter space $\Theta$. It is necessary to detect and isolate a change as rapidly as possible, i.e., to identify which type of change has occurred.
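For the scalar model just described, the basic building block is the likelihood ratio $LR_\theta(k,n)$, a simple product of density ratios. The sketch below computes it for an illustrative i.i.d. Gaussian mean-shift instance $f_\theta=\mathcal{N}(\theta,\sigma^2)$; this instance is our choice for concreteness, since the general model allows arbitrary conditional densities $f_\theta(X_t|\mathbf{X}^{t-1})$.

```python
import math

def lr_theta(x, k, n, theta, theta0=0.0, sigma=1.0):
    """LR_theta(k, n) = prod_{t=k+1}^{n} f_theta(X_t) / f_theta0(X_t)
    for the illustrative i.i.d. Gaussian case f_theta = N(theta, sigma^2).
    x[0] corresponds to X_1, so the product runs over indices k, ..., n-1."""
    log_lr = 0.0
    for t in range(k, n):
        log_lr += ((x[t] - theta0) ** 2 - (x[t] - theta) ** 2) / (2 * sigma ** 2)
    return math.exp(log_lr)

# Example: four observations exactly at the post-change mean theta = 1
# give log LR = n * theta^2 / 2 = 2, so LR = e^2.
print(lr_theta([1.0, 1.0, 1.0, 1.0], 0, 4, theta=1.0))
```

These per-$(k,\theta)$ values of $LR_\theta(k,n)$ are exactly what the statistics $\bar\Lambda^{\pi,W}_{ij}(n)$ defined next average over the change point $k$ and the mixing distributions $W_i$.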
The change detection–identification rule $\delta_A=(d_A,T_A)$ is defined as in (7), where the statistics $\bar\Lambda^{\pi,W}_{ij}(n)$ are modified as follows:
\[
\bar\Lambda^{\pi,W}_{ij}(n) = \frac{\sum_{k=-1}^{n-1}\pi_k\int_{\Theta_i}LR_\theta(k,n)\,\mathrm{d}W_i(\theta)}{\sum_{k=-1}^{n-1}\pi_k\sup_{\theta\in\Theta_j}LR_\theta(k,n)}, \quad i,j\in\mathcal{N}; \qquad
\bar\Lambda^{\pi,W}_{i}(n) = \frac{\sum_{k=-1}^{n-1}\pi_k\int_{\Theta_i}LR_\theta(k,n)\,\mathrm{d}W_i(\theta)}{\mathsf{P}(\nu\ge n)}, \quad i\in\mathcal{N},
\]
with the likelihood ratio
\[
LR_\theta(k,n) = \prod_{t=k+1}^{n}\frac{f_\theta(X_t|\mathbf{X}^{t-1})}{f_{\theta_0}(X_t|\mathbf{X}^{t-1})}.
\]
Write $\lambda_\theta(k,k+n)=\log LR_\theta(k,k+n)$ and $\lambda_{\theta,\theta^*}(k,k+n)=\lambda_\theta(k,k+n)-\lambda_{\theta^*}(k,k+n)$, where $\lambda_{\theta^*}(k,k+n)=0$ for $\theta^*=\theta_0$. Conditions $\mathbf{C}_1$ and $\mathbf{C}_2$ are also modified as follows.

$\mathbf{C}_1$. There exist positive and finite numbers $I(\theta,\theta_0)=I(\theta)$, $\theta\in\Theta_i$, $i\in\mathcal{N}$, and $I(\theta,\theta^*)$, $\theta^*\in\Theta_j$, $j\in\mathcal{N}\setminus\{i\}$, $\theta\in\Theta_i$, $i\in\mathcal{N}$, such that for any $\varepsilon>0$ and all $k\in\mathbb{Z}_+$, $\theta\in\Theta_i$, $\theta^*\in\Theta_j$, $j\in\mathcal{N}\setminus\{i\}$, $i\in\mathcal{N}$,
\[
\lim_{M\to\infty} p_{M,k}(\varepsilon;\theta,\theta^*) = 0, \quad\text{where}\quad
p_{M,k}(\varepsilon;\theta,\theta^*) = \mathsf{P}_{k,\theta}\left\{\frac{1}{M}\max_{1\le n\le M}\lambda_{\theta,\theta^*}(k,k+n) \ge (1+\varepsilon)I(\theta,\theta^*)\right\}.
\]
$\mathbf{C}_2$. For any $\varepsilon>0$ and some $r\ge1$, $\Upsilon_r(\varepsilon;\theta)<\infty$ for all $\theta\in\Theta_i$, $i\in\mathcal{N}$, where
\[
\Upsilon_r(\varepsilon;\theta) = \lim_{\kappa\to0}\sum_{n=1}^{\infty} n^{r-1}\sup_{k\in\mathbb{Z}_+}\mathsf{P}_{k,\theta}\left(\frac{1}{n}\inf_{\{\vartheta\in\Theta:\,|\vartheta-\theta|<\kappa\}}\lambda_\vartheta(k,k+n) < I(\theta)-\varepsilon\right).
\]
Essentially the same argument shows that all the previous results hold in this case too. In particular, the assertions of Theorem 3 remain correct: as $\alpha_{\max},\beta_{\max}\to0$, for all $\theta\in\Theta_i$ and all $i\in\mathcal{N}$,
\[
\inf_{\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)}\bar{\mathcal{R}}^m_{\theta}(\delta) \sim \bar{\mathcal{R}}^m_{\theta}(\delta_A) \sim \max\left\{\frac{|\log\alpha_i|}{I(\theta)+\mu},\ \max_{j\in\mathcal{N}\setminus\{i\}}\frac{|\log\beta_{ji}|}{\inf_{\theta^*\in\Theta_j}I(\theta,\theta^*)}\right\}^m,
\]
i.e., the detection–identification rule $\delta_A$ is asymptotically optimal to first order. Note also that, in general, these asymptotics do not reduce to (53) even when $\alpha_i=\beta_{ji}$; everything depends on the configuration of the hypotheses.

3.
All the previous results are easily generalized to the case where the change points differ across streams, i.e., $\nu=\nu_i$ with prior distributions $\pi^{(i)}_k=\mathsf{P}(\nu_i=k)$, assuming that condition (23) for $\pi_k=\pi^{(i)}_k$ holds with $\mu=\mu_i$, $i\in\mathcal{N}$. Then, in relations (32), (33), (44), (47), (48), (49), (50) and in the other relations where $\mu$ is present, the value of $\mu$ should simply be replaced with $\mu_i$.

4. For independent observations, as well as for many Markov and certain hidden Markov models, the decision statistics $\bar\Lambda^{\pi,W}_{ij}(n)$ can be computed effectively, so implementation of the proposed detection–identification rule is not an issue. Still, in general, the computational complexity and memory requirements of rule $\delta_A$ are high. To avoid this complication, rule $\delta_A$ can be modified into a window-limited version in which the summation in the statistics $\bar\Lambda^{\pi,W}_{ij}(n)$ over the potential change points $k$ is restricted to a sliding window of size $\ell$. Following the guidelines of [18, Ch 3, Sec 3.10] (where asymptotic optimality of mixture window-limited rules was established in the single-stream case), it can be shown that the window-limited version also has first-order asymptotic optimality properties as long as the size of the window $\ell(A)$ goes to infinity as $A\to\infty$ at such a rate that $\ell(A)/\log A\to\infty$ but $\log\ell(A)/\log A\to0$. The details are omitted.

5. If $\pi\in\mathbf{C}(\mu=0)$ or $\pi^{\alpha,\beta}$ depends on $\alpha,\beta$ and $\mu_{\alpha,\beta}\to0$ as $\alpha_{\max},\beta_{\max}\to0$, then an alternative detection–identification rule $\delta^*_A=(d^*,T^*_A)$, defined as in (7)–(8) but with the statistics $\bar\Lambda^{\pi,W}_{ij}(n)$ in the definition of $T^{(i)}_A$ replaced by the statistics
\[
R_{ij}(n) = \frac{\sum_{k=0}^{n-1}\int_{\Theta_i}LR_{i,\theta_i}(k,n)\,\mathrm{d}W_i(\theta_i)}{\sum_{k=0}^{n-1}\sup_{\theta_j\in\Theta_j}LR_{j,\theta_j}(k,n)}, \quad i,j\in\mathcal{N}; \qquad
R_i(n) = \sum_{k=0}^{n-1}\int_{\Theta_i}LR_{i,\theta_i}(k,n)\,\mathrm{d}W_i(\theta_i), \quad i\in\mathcal{N},
\]
is also asymptotically optimal to first order. Specifically, with a suitable selection of thresholds, the asymptotic approximations (53) and (54) hold for $\delta^*_A$.

6.
For practical purposes, it is more reasonable to consider a "frequentist" problem setup that does not use the prior distributions of the change point $\pi$ and of the hypotheses $p$. We believe that the most reasonable performance metric for false alarms is the maximal conditional local probability of a false alarm in a prespecified time window $\ell$, $\sup_{0\le k<\infty}\mathsf{P}_\infty(k\le T<k+\ell\,|\,T\ge k)$ (see, e.g., [18], [20] for a detailed discussion). The optimality results in the Bayesian problem obtained in this paper are of importance in the frequentist (minimax and pointwise) problem, which can be embedded into the Bayesian criterion with an asymptotically improper uniform distribution of the change point. See Pergamenchtchikov and Tartakovsky [11], [12] and Tartakovsky [18, Ch 4] for the single population.

ACKNOWLEDGEMENT
The author would like to thank the referees, whose comments improved the article.

APPENDIX: PROOFS
Proof of Theorem 2:
The proof is split into two parts.
Part 1: Proof of asymptotic inequalities (38) and (39). To prove (38) and (39), define
\[
M_{\beta_{ji}} = M_{\beta_{ji}}(\varepsilon,\theta_i,\theta_j) = (1-\varepsilon)\frac{|\log\beta_{ji}|}{I_{ij}(\theta_i,\theta_j)}
\]
and note first that, by the Chebyshev inequality, for every $\varepsilon\in(0,1)$ and $r>0$,
\[
\mathsf{E}_{k,i,\theta_i}[(T-k)^r; d=i; T>k] \ge M^r_{\beta_{ji}}\,\mathsf{P}_{k,i,\theta_i}\{T-k>M_{\beta_{ji}}, d=i, T>k\} = M^r_{\beta_{ji}}\,\mathsf{P}_{k,i,\theta_i}\{T-k>M_{\beta_{ji}}, d=i\}.
\]
Therefore, for all $\theta_j\in\Theta_j$ and $j\in\mathcal{N}\setminus\{i\}$,
\[
\mathsf{E}_{k,i,\theta_i}[(T-k)^r; d=i; T>k] \ge \left[\frac{(1-\varepsilon)|\log\beta_{ji}|}{I_{ij}(\theta_i,\theta_j)}\right]^r(1+o(1))
\]
whenever for all $\varepsilon\in(0,1)$ and all fixed $k\in\mathbb{Z}$,
\[
\lim_{\alpha_{\max},\beta_{\max}\to0}\ \inf_{\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)}\mathsf{P}_{k,i,\theta_i}\{T-k>M_{\beta_{ji}}, d=i\} = 1, \quad\text{(A.1)}
\]
and inequality (38) follows since $\varepsilon$ can be arbitrarily small and $\mathcal{R}^r_{k,i,\theta_i}(\delta)\ge\mathsf{E}_{k,i,\theta_i}[(T-k)^r; d=i; T>k]$. Recall that for $k=-1$ we set $T-k=T$ rather than $T+1$ everywhere. Note that $\mathcal{R}^r_{0,i,\theta_i}(\delta)\equiv\mathcal{R}^r_{-1,i,\theta_i}(\delta)$.

Analogously,
\[
\bar{\mathcal{R}}^r_{i,\theta_i}(\delta) \ge \mathsf{E}^\pi_{i,\theta_i}[(T-\nu)^r; d=i; T>\nu] \ge M^r_{\beta_{ji}}\,\mathsf{P}^\pi_{i,\theta_i}\{T-\nu>M_{\beta_{ji}}, d=i\},
\]
so that inequality (39) holds whenever
\[
\lim_{\alpha_{\max},\beta_{\max}\to0}\ \inf_{\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)}\mathsf{P}^\pi_{i,\theta_i}\{T-\nu>M_{\beta_{ji}}, d=i\} = 1. \quad\text{(A.2)}
\]
Hence, we now focus on proving equalities (A.1) and (A.2). Obviously,
\[
\mathsf{P}_{k,i,\theta_i}(T-k>M_{\beta_{ji}}, d=i) = \mathsf{P}_{k,i,\theta_i}(d=i) - \mathsf{P}_{k,i,\theta_i}(T-k\le M_{\beta_{ji}}, d=i) = 1 - \mathsf{P}_{k,i,\theta_i}(d\ne i) - \mathsf{P}_\infty(T\le k, d=i) - \mathsf{P}_{k,i,\theta_i}(k<T\le M_{\beta_{ji}}+k, d=i),
\]
where we used the fact that $\mathsf{P}_{k,i,\theta_i}(T\le k, d=i)=\mathsf{P}_\infty(T\le k, d=i)$. Write $\Pi_k=\mathsf{P}(\nu>k)$.
For any $\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)$ and $k\ge0$, we have
\[
\alpha_i \ge \mathrm{PFA}_i(\delta) = \sum_{t=0}^{\infty}\pi_t\,\mathsf{P}_\infty(T\le t, d=i) \ge \sum_{t=k}^{\infty}\pi_t\,\mathsf{P}_\infty(T\le t, d=i) \ge \mathsf{P}_\infty(T\le k, d=i)\,\Pi_{k-1}
\]
and
\[
\mathrm{PMI}_{ij}(\delta) = \sup_{\theta_i\in\Theta_i}\sum_{s=-1}^{\infty}\pi_s\,\mathsf{P}_{s,i,\theta_i}(d=j, T<\infty) \le \beta_{ij},
\]
so that, for any $\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)$,
\[
\mathsf{P}_\infty(T\le k, d=i) \le \alpha_i/\Pi_{k-1}, \quad k\in\mathbb{Z}_+, \quad\text{(A.3)}
\]
\[
\sup_{\theta_i\in\Theta_i}\mathsf{P}_{k,i,\theta_i}(d=j, T<\infty) \le \beta_{ij}/\pi_k, \quad k\in\mathbb{Z}, \quad\text{(A.4)}
\]
\[
\sup_{\theta_i\in\Theta_i}\mathsf{P}_{k,i,\theta_i}(d\ne i, T<\infty) \le \pi_k^{-1}\sum_{j\in\mathcal{N}\setminus\{i\}}\beta_{ij}. \quad\text{(A.5)}
\]
Therefore,
\[
\mathsf{P}_{k,i,\theta_i}(T-k>M_{\beta_{ji}}, d=i) \ge 1 - \alpha_i/\Pi_{k-1} - \pi_k^{-1}\sum_{j\in\mathcal{N}\setminus\{i\}}\beta_{ij} - \mathsf{P}_{k,i,\theta_i}(k<T\le M_{\beta_{ji}}+k, d=i).
\]
This inequality implies that to prove (A.1) we have to show that, as $\alpha_{\max},\beta_{\max}\to0$,
\[
\sup_{\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)}\mathsf{P}_{k,i,\theta_i}\{0<T-k\le M_{\beta_{ji}}, d=i\} \to 0. \quad\text{(A.6)}
\]
For the sake of brevity, we will write $\lambda_{i,j}(k,k+n)$ for the LLR $\lambda_{i,\theta_i;j,\theta_j}(k,k+n)$. Let $\mathcal{A}_{k,\beta}=\{k<T\le k+M_{\beta_{ji}}\}$ and, for $C>0$,
\[
\mathcal{B}_{k,\beta,i,j} = \{d=i,\ \mathcal{A}_{k,\beta}\}\setminus\Big\{\max_{k<n\le k+M_{\beta_{ji}}}\lambda_{i,j}(k,n)\ge C\Big\}.
\]
Next, multiplying both sides of inequality (A.7) by $\pi_k$ and summing over $k\in\mathbb{Z}$, we obtain
\[
\mathsf{P}^\pi_{i,\theta_i}\{0<T-\nu\le M_{\beta_{ji}}, d=i\} \le \beta_{ji}e^{(1-\varepsilon)|\log\beta_{ji}|} + \sum_{k=-1}^{\infty}\pi_k\,p_{M_{\beta_{ji}},k}(\varepsilon;i,\theta_i;j,\theta_j) \le \beta^{\varepsilon}_{ji} + \mathsf{P}(\nu>K_\beta) + \sum_{k=-1}^{K_\beta}\pi_k\,p_{M_{\beta_{ji}},k}(\varepsilon;i,\theta_i;j,\theta_j),
\]
where $K_\beta$ is an arbitrary integer that goes to infinity as $\beta_{\max}\to0$. Obviously, the first term goes to $0$ as $\beta_{\max}\to0$. The second term, $\mathsf{P}(\nu>K_\beta)$, goes to $0$ by conditions (23) and (24). The third term also goes to $0$ due to condition $\mathbf{C}_1$ and Lebesgue's dominated convergence theorem. Hence, for any $\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)$,
\[
\mathsf{P}^\pi_{i,\theta_i}\{0<T-\nu\le M_{\beta_{ji}}, d=i\} \to 0 \quad\text{as } \alpha_{\max},\beta_{\max}\to0.
\]
Finally, we have
\[
\mathsf{P}^\pi_{i,\theta_i}\{T-\nu>M_{\beta_{ji}}, d=i\} = \mathsf{P}^\pi_{i,\theta_i}(T>\nu, d=i) - \mathsf{P}^\pi_{i,\theta_i}\{0<T-\nu\le M_{\beta_{ji}}, d=i\},
\]
where
\[
\mathsf{P}^\pi_{i,\theta_i}(T>\nu, d=i) = \mathsf{P}^\pi_{i,\theta_i}(d=i\,|\,T>\nu)[1-\mathrm{PFA}^\pi(\delta)] \ge \left(1-\sum_{j\in\mathcal{N}\setminus\{i\}}\beta_{ij}\right)\left(1-\sum_{\ell=1}^{N}\alpha_\ell\right) \to 1 \quad\text{(A.8)}
\]
as $\alpha_{\max},\beta_{\max}\to0$ for any $\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)$. This yields (A.2), and therefore inequalities (39).

Part 2: Proof of asymptotic inequalities (40) and (41). Changing the measure $\mathsf{P}_\infty\to\mathsf{P}_{k,i,\theta_i}$ and using an argument similar to that used in Part 1 to obtain (A.7), with $M_{\beta_{ji}}$ replaced by
\[
N_{\alpha_i} = (1-\varepsilon)\frac{|\log\alpha_i|}{I_i(\theta_i)+\mu+\varepsilon},
\]
we obtain
\[
\mathsf{P}_{k,i,\theta_i}\{0<T-k\le N_{\alpha_i}, d=i\} \le e^{(1+\varepsilon)I_i(\theta_i)N_{\alpha_i}}\,\mathsf{P}_\infty\{0<T-k\le N_{\alpha_i}, d=i\} + \mathsf{P}_{k,i,\theta_i}\left\{\frac{1}{N_{\alpha_i}}\max_{1\le n\le N_{\alpha_i}}\lambda_{i,\theta_i}(k,k+n) \ge (1+\varepsilon)I_i(\theta_i)\right\}, \quad\text{(A.9)}
\]
where for all $\varepsilon,\varepsilon_1\in(0,1)$,
\[
e^{(1+\varepsilon)I_i(\theta_i)N_{\alpha_i}}\,\mathsf{P}_\infty\{0<T-k\le N_{\alpha_i}, d=i\} \le \exp\big\{-2\varepsilon|\log\alpha_i| + (\mu+\varepsilon_1)(k-1)\big\} := U_{\alpha_i,k}(\varepsilon,\varepsilon_1). \quad\text{(A.10)}
\]
Using (A.9) and (A.10), we obtain
\[
\sup_{\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)}\mathsf{P}_{k,i,\theta_i}\{0<T-k\le N_{\alpha_i}, d=i\} \le U_{\alpha_i,k}(\varepsilon,\varepsilon_1) + p_{N_{\alpha_i},k}(\varepsilon;i,\theta_i),
\]
where for every fixed $k\in\mathbb{Z}_+$ the value of $U_{\alpha_i,k}(\varepsilon,\varepsilon_1)$ tends to zero and also $p_{N_{\alpha_i},k}(\varepsilon;i,\theta_i)\to0$ as $\alpha_{\max}\to0$ by condition $\mathbf{C}_1$.
Hence, it follows that for every fixed $k\in\mathbb{Z}$,
\[
\lim_{\alpha_{\max}\to0}\ \sup_{\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)}\mathsf{P}_{k,i,\theta_i}\{0<T-k\le N_{\alpha_i}, d=i\} = 0. \quad\text{(A.11)}
\]
Next, we have
\[
\mathcal{R}^r_{k,i,\theta_i}(\delta) \ge \mathsf{E}_{k,i,\theta_i}[(T-k)^r, d=i, T>k] \ge N^r_{\alpha_i}\,\mathsf{P}_{k,i,\theta_i}(T-k>N_{\alpha_i}, d=i, T>k) = N^r_{\alpha_i}\,\mathsf{P}_{k,i,\theta_i}(T-k>N_{\alpha_i}, d=i) \ge N^r_{\alpha_i}\big[\mathsf{P}_{k,i,\theta_i}(T>k, d=i) - \mathsf{P}_{k,i,\theta_i}(0<T-k\le N_{\alpha_i}, d=i)\big],
\]
where the second inequality follows from the Chebyshev inequality, and
\[
\mathsf{P}_{k,i,\theta_i}(T>k, d=i) = 1 - \mathsf{P}_{k,i,\theta_i}(d\ne i) - \mathsf{P}_\infty(T\le k, d=i) \ge 1 - \alpha_i/\Pi_{k-1} - \pi_k^{-1}\sum_{j\in\mathcal{N}\setminus\{i\}}\beta_{ij}
\]
(see (A.3)–(A.5)). Therefore,
\[
\inf_{\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)}\mathcal{R}^r_{k,i,\theta_i}(\delta) \ge N^r_{\alpha_i}\Big[\inf_{\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)}\mathsf{P}_{k,i,\theta_i}(T>k, d=i) - \sup_{\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)}\mathsf{P}_{k,i,\theta_i}(0<T-k\le N_{\alpha_i}, d=i)\Big],
\]
where
\[
\lim_{\alpha_{\max}\to0}\ \inf_{\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)}\mathsf{P}_{k,i,\theta_i}(T>k, d=i) = 1, \quad\text{(A.12)}
\]
and, by (A.11), the second term on the right-hand side goes to $0$ for any fixed $k\in\mathbb{Z}$. It follows that for all fixed $k\in\mathbb{Z}$,
\[
\inf_{\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)}\mathcal{R}^r_{k,i,\theta_i}(\delta) \ge \left[\frac{(1-\varepsilon)|\log\alpha_i|}{I_i(\theta_i)+\mu+\varepsilon}\right]^r(1+o(1)),
\]
where $\varepsilon$ and $\varepsilon_1$ can be arbitrarily small, which implies inequality (40).

Next, define
\[
K_{\alpha_i} = K_{\alpha_i}(\varepsilon,\mu,\varepsilon_1) = \left\lfloor\frac{\varepsilon|\log\alpha_i|}{\mu+\varepsilon_1}\right\rfloor.
\]
Using inequalities (A.9) and (A.10), we obtain
\[
\mathsf{P}^\pi_{i,\theta_i}(0<T-\nu\le N_{\alpha_i}, d=i) = \sum_{k=-1}^{\infty}\pi_k\,\mathsf{P}_{k,i,\theta_i}(0<T-k\le N_{\alpha_i}, d=i) = \sum_{k=-1}^{K_{\alpha_i}}\pi_k\,\mathsf{P}_{k,i,\theta_i}(0<T-k\le N_{\alpha_i}, d=i) + \sum_{k=K_{\alpha_i}+1}^{\infty}\pi_k\,\mathsf{P}_{k,i,\theta_i}(0<T-k\le N_{\alpha_i}, d=i)
\]
\[
\le \sum_{k=-1}^{K_{\alpha_i}}\pi_k\,U_{\alpha_i,k}(\varepsilon,\varepsilon_1) + \sum_{k=-1}^{K_{\alpha_i}}\pi_k\,p_{N_{\alpha_i},k}(\varepsilon;i,\theta_i) + \sum_{k=K_{\alpha_i}+1}^{\infty}\pi_k \le \Pi_{K_{\alpha_i}} + \max_{-1\le k\le K_{\alpha_i}}U_{\alpha_i,k}(\varepsilon,\varepsilon_1) + \sum_{k=-1}^{K_{\alpha_i}}\pi_k\,p_{N_{\alpha_i},k}(\varepsilon;i,\theta_i) = \Pi_{K_{\alpha_i}} + U_{\alpha_i,K_{\alpha_i}}(\varepsilon,\varepsilon_1) + \sum_{k=-1}^{K_{\alpha_i}}\pi_k\,p_{N_{\alpha_i},k}(\varepsilon;i,\theta_i),
\]
where $T-k=T$ for $k=-1$. If $\mu>0$ then, by condition (23), $\log\Pi_{K_{\alpha_i}}\sim-\mu K_{\alpha_i}$ as $\alpha_{\max}\to0$, so $\Pi_{K_{\alpha_i}}\to0$. If $\mu=0$, this probability goes to $0$ as $\alpha_{\max}\to0$ as well since, by condition (24),
\[
\Pi_{K_{\alpha_i}} < \sum_{k=K_{\alpha_i}}^{\infty}\pi_k|\log\pi_k| \xrightarrow[\alpha_{\max}\to0]{} 0.
\]
Obviously, the second term $U_{\alpha_i,K_{\alpha_i}}(\varepsilon,\varepsilon_1)\to0$ as $\alpha_{\max}\to0$. By condition $\mathbf{C}_1$ and Lebesgue's dominated convergence theorem, the third term goes to $0$; therefore, all three terms go to zero as $\alpha_{\max},\beta_{\max}\to0$ for all $\varepsilon,\varepsilon_1>0$, so that $\mathsf{P}^\pi_{i,\theta_i}(0<T-\nu\le N_{\alpha_i}, d=i)\to0$ as $\alpha_{\max},\beta_{\max}\to0$. Since
\[
\mathsf{P}^\pi_{i,\theta_i}(T-\nu>N_{\alpha_i}, d=i) = \mathsf{P}^\pi_{i,\theta_i}(T>\nu, d=i) - \mathsf{P}^\pi_{i,\theta_i}(0<T-\nu\le N_{\alpha_i}, d=i)
\]
and, by (A.8), $\mathsf{P}^\pi_{i,\theta_i}(T>\nu, d=i)\to1$ as $\alpha_{\max},\beta_{\max}\to0$ for any $\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)$, it follows that $\mathsf{P}^\pi_{i,\theta_i}(T-\nu>N_{\alpha_i}, d=i)\to1$ as $\alpha_{\max},\beta_{\max}\to0$. Finally, by the Chebyshev inequality,
\[
\bar{\mathcal{R}}^r_{i,\theta_i}(\delta) \ge \mathsf{E}^\pi_{i,\theta_i}[(T-\nu)^r, d=i, T>\nu] \ge N^r_{\alpha_i}\,\mathsf{P}^\pi_{i,\theta_i}(T-\nu>N_{\alpha_i}, d=i),
\]
which implies that for any $\delta\in\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)$, as $\alpha_{\max},\beta_{\max}\to0$,
\[
\bar{\mathcal{R}}^r_{i,\theta_i}(\delta) \ge \left[\frac{(1-\varepsilon)|\log\alpha_i|}{I_i(\theta_i)+\mu+\varepsilon}\right]^r(1+o(1)).
\]
Owing to the fact that $\varepsilon$ and $\varepsilon_1$ can be arbitrarily small, inequality (41) follows.

Proof of Lemma 1:
For $k\in\mathbb{Z}_+$, define the exit times
\[
\tau^{(k)}_i(A) = \inf\big\{n\ge1:\ \lambda_{i,W}(k,k+n)-\lambda^\pi_j(k+n)\ge\log(A_{ij}/\pi_k)\ \ \forall j\in\mathcal{N}\setminus\{i\}\big\}, \quad i\in\mathcal{N},
\]
where $\lambda_{i,W}(k,k+n)=\log\Lambda_{i,W}(k,k+n)$ and $\lambda^\pi(k+n)=\log\mathsf{P}(\nu\ge k+n)=\log\Pi_{k+n-1}$.

Obviously, for any $n>k$ and $k\in\mathbb{Z}_+$,
\[
\log\bar\Lambda^{\pi,W}_{ij}(n) \ge \log\left(\frac{\pi_k\,LR_{i,W}(k,n)}{\sum_{\ell=-1}^{n-1}\pi_\ell\sup_{\theta_j\in\Theta_j}LR_{j,\theta_j}(\ell,n)}\right) = \lambda_{i,W}(k,n)-\lambda^\pi_j(n)+\log\pi_k,
\]
so for every set $A=(A_{ij})$ of positive thresholds $A_{ij}$ we have $(T_A-k)^+\le(T^{(i)}_A-k)^+\le\tau^{(k)}_i(A)$ and, hence, $\mathsf{E}_{k,i,\theta_i}[(T_A-k)^+]^r\le\mathsf{E}_{k,i,\theta_i}[(\tau^{(k)}_i(A))^r]$. Note that since we set $T_A-k=T_A$ for $k=-1$, it follows that $\mathsf{E}_{-1,i,\theta_i}[(T_A-k)^+]^r=\mathsf{E}_{0,i,\theta_i}[T_A]^r\le\mathsf{E}_{0,i,\theta_i}[(\tau^{(0)}_i(A))^r]$.

Setting $\tau=\tau^{(k)}_i(A)$ and $N=M_i(A)$ in inequality (A.1) in Lemma A1 in [18, p. 239], we obtain that the following inequality holds:
\[
\mathsf{E}_{k,i,\theta_i}\Big[\big(\tau^{(k)}_i(A)\big)^r\Big] \le [M_i(A)]^r + r2^{r-1}\sum_{n=M_i(A)}^{\infty}n^{r-1}\,\mathsf{P}_{k,i,\theta_i}\big(\tau^{(k)}_i(A)>n\big). \quad\text{(A.13)}
\]
Next, we have
\[
\mathsf{P}_{k,i,\theta_i}\big(\tau^{(k)}_i(A)>n\big) \le \mathsf{P}_{k,i,\theta_i}\left\{\frac{1}{n}\big[\lambda_{i,W}(k,k+n)-\lambda^\pi_j(k+n)\big] < \frac{1}{n}\log\left(\frac{A_{ij}}{\pi_k}\right)\ \text{for some } j\in\mathcal{N}\setminus\{i\}\right\} \le \mathsf{P}_{k,i,\theta_i}\left\{\frac{1}{n}\big[\lambda_{i,W}(k,k+n)-\log\Pi_{k+n-1}\big] < \frac{1}{n}\log\left(\frac{A_i}{\pi_k}\right)\right\}.
\]
Let
\[
\widetilde M_i(A_i) = 1+\left\lfloor\frac{\log(A_i/\pi_k)}{I_i(\theta_i)+\mu-\varepsilon}\right\rfloor.
\]
Clearly, for all $n\ge\widetilde M_i(A_i)$ the last probability does not exceed the probability
\[
\mathsf{P}_{k,i,\theta_i}\left\{\frac{1}{n}\lambda_{i,W}(k,k+n) < I_i(\theta_i)+\mu-\varepsilon-\frac{|\log\Pi_{k+n-1}|}{n}\right\}
\]
and, by condition $\mathbf{CP}$, for a sufficiently large value of $A_i$ there exists a small $\kappa$ such that
\[
\left|\mu-\frac{|\log\Pi_{k+\widetilde M_i(A_i)-1}|}{\widetilde M_i(A_i)}\right| < \kappa.
\]
Therefore, for all sufficiently large $n$,
\[
\mathsf{P}_{k,i,\theta_i}\big(\tau^{(k)}_i(A)>n\big) \le \mathsf{P}_{k,i,\theta_i}\left(\frac{1}{n}\lambda_{i,W}(k,k+n) < I_i(\theta_i)-\varepsilon+\kappa\right).
\]
Also,
\[
\lambda_{i,W}(k,k+n) \ge \inf_{\vartheta\in\Gamma_{\kappa,\theta_i}}\lambda_{i,\vartheta}(k,k+n) + \log W_i(\Gamma_{\kappa,\theta_i}),
\]
where $\Gamma_{\kappa,\theta_i}=\{\vartheta\in\Theta_i:|\vartheta-\theta_i|<\kappa\}$. Thus, for all sufficiently large $n$ and $A_{\min}$ for which $\kappa+|\log W_i(\Gamma_{\kappa,\theta_i})|/n<\varepsilon/2$, we have
\[
\mathsf{P}_{k,i,\theta_i}\big(\tau^{(k)}_i(A)>n\big) \le \mathsf{P}_{k,i,\theta_i}\left\{\frac{1}{n}\inf_{\vartheta\in\Gamma_{\kappa,\theta_i}}\lambda_{i,\vartheta}(k,k+n) < I_i(\theta_i)-\varepsilon+\kappa+\frac{1}{n}|\log W_i(\Gamma_{\kappa,\theta_i})|\right\} \le \mathsf{P}_{k,i,\theta_i}\left(\frac{1}{n}\inf_{\vartheta\in\Gamma_{\kappa,\theta_i}}\lambda_{i,\vartheta}(k,k+n) < I_i(\theta_i)-\varepsilon/2\right). \quad\text{(A.14)}
\]
Using (A.13) and (A.14) yields inequality (46), and the proof is complete.

Proof of Proposition 1:
By Theorem 1, the rule $\delta_A$ belongs to class $\mathbb{C}_\pi(\boldsymbol\alpha,\boldsymbol\beta)$ with
\[
\alpha_i = \frac{1}{1+A_i}; \qquad \beta_{ij} = \frac{1+A_i}{A_iA_{ji}}, \quad j\in\mathcal{N}\setminus\{i\},\ i\in\mathcal{N},
\]
and hence Theorem 2 implies (under condition $\mathbf{C}_1$) the asymptotic (as $A_{\min}\to\infty$) lower bounds
\[
\mathcal{R}^r_{k,i,\theta_i}(\delta_A) \ge [\Psi_i(A,\theta_i,\mu)]^r(1+o(1)) \quad \forall k\in\mathbb{Z} \quad\text{(A.15)}
\]
and
\[
\bar{\mathcal{R}}^r_{i,\theta_i}(\delta_A) \ge [\Psi_i(A,\theta_i,\mu)]^r(1+o(1)), \quad\text{(A.16)}
\]
which hold for all $r>0$, $\theta_i\in\Theta_i$, and $i\in\mathcal{N}$. Thus, to prove the validity of the asymptotic approximations (42) and (43) it suffices to show that, under the left-tail condition $\mathbf{C}_2$, for $0<m\le r$ and all $\theta_i\in\Theta_i$ and $i\in\mathcal{N}$ the following asymptotic upper bounds hold as $A_{\min}\to\infty$:
\[
\mathcal{R}^m_{k,i,\theta_i}(\delta_A) \le [\Psi_i(A,\theta_i,\mu)]^m(1+o(1)) \quad \forall k\in\mathbb{Z} \quad\text{(A.17)}
\]
and
\[
\bar{\mathcal{R}}^m_{i,\theta_i}(\delta_A) \le [\Psi_i(A,\theta_i,\mu)]^m(1+o(1)). \quad\text{(A.18)}
\]
It follows from inequality (46) in Lemma 1 that for any $0<\varepsilon<J_i(\theta_i,\mu)$,
\[
\mathsf{E}_{k,i,\theta_i}[(T_A-k)^r; d_A=i; T_A>k] \le \big[\widetilde\Psi_i(A,\pi_k,\theta_i,\mu,\varepsilon)\big]^r + r2^{r-1}\Upsilon_r(\kappa,\varepsilon;i,\theta_i), \quad\text{(A.19)}
\]
where $\Upsilon_r(\kappa,\varepsilon;i,\theta_i)$ is defined in (19). Similarly to (A.3), we have $\mathsf{P}_\infty(T_A\le k, d_A=i)\le[(1+A_i)\Pi_{k-1}]^{-1}$, so that
\[
\mathsf{P}_\infty(T_A\le k) \le \Pi_{k-1}^{-1}\sum_{i=1}^{N}\frac{1}{1+A_i},
\]
and hence
\[
\mathsf{P}_\infty(T_A>k) \ge 1-\Pi_{k-1}^{-1}\sum_{i=1}^{N}\frac{1}{1+A_i}.
\]
Using this inequality and inequality (A.19), we obtain
\[
\mathcal{R}^r_{k,i,\theta_i}(\delta_A) = \frac{\mathsf{E}_{k,i,\theta_i}[(T_A-k)^r; d_A=i; T_A>k]}{\mathsf{P}_\infty(T_A>k)} \le \frac{\left(\left\lfloor\frac{\log(A_i/\pi_k)}{I_i(\theta_i)+\mu-\varepsilon}\right\rfloor\right)^r + r2^{r-1}\Upsilon_r(\kappa,\varepsilon;i,\theta_i)}{1-\sum_{i=1}^{N}[(1+A_i)\Pi_{k-1}]^{-1}}. \quad\text{(A.20)}
\]
Since, by condition $\mathbf{C}_2$, $\Upsilon_r(\kappa,\varepsilon;i,\theta_i)<\infty$ for all $\theta_i\in\Theta_i$ and $i\in\mathcal{N}$, this implies the asymptotic upper bound (A.17). This completes the proof of the asymptotic approximation (42).

Next, using inequality (A.19), we obtain
\[
\mathsf{E}^\pi_{i,\theta_i}[(T_A-\nu)^r; d_A=i; T_A>\nu] = \sum_{k=-1}^{\infty}\pi_k\,\mathsf{E}_{k,i,\theta_i}[(T_A-k)^r; d_A=i; T_A>k] \le \sum_{k=-1}^{\infty}\pi_k\big[\widetilde\Psi_i(A,\pi_k,\theta_i,\mu,\varepsilon)\big]^r + r2^{r-1}\Upsilon_r(\kappa,\varepsilon;i,\theta_i).
\]
Recall that we set $T_A-k=T_A$ for $k=-1$. Applying this inequality together with the inequality
\[
1-\mathrm{PFA}^\pi(\delta_A) \ge 1-\sum_{i=1}^{N}\frac{1}{1+A_i}
\]
(see (26)) yields
\[
\bar{\mathcal{R}}^r_{i,\theta_i}(\delta_A) = \frac{\sum_{k=-1}^{\infty}\pi_k\,\mathsf{E}_{k,i,\theta_i}[(T_A-k)^r; d_A=i; T_A>k]}{1-\mathrm{PFA}^\pi(\delta_A)} \le \frac{\sum_{k=-1}^{\infty}\pi_k\big[\widetilde\Psi_i(A,\pi_k,\theta_i,\mu,\varepsilon)\big]^r + r2^{r-1}\Upsilon_r(\kappa,\varepsilon;i,\theta_i)}{1-\sum_{i=1}^{N}1/(1+A_i)}. \quad\text{(A.21)}
\]
By condition $\mathbf{C}_2$, $\Upsilon_r(\kappa,\varepsilon;i,\theta_i)<\infty$ for any $\varepsilon>0$ and any $\theta_i\in\Theta_i$ and, by condition (24), $\sum_{k=0}^{\infty}\pi_k|\log\pi_k|^r<\infty$. This implies that, as $A_{\min}\to\infty$, for all $0<m\le r$, all $\theta_i\in\Theta_i$, and all $i\in\mathcal{N}$, the following upper bound holds:
\[
\bar{\mathcal{R}}^r_{i,\theta_i}(\delta_A) \le \big[\widetilde\Psi_i(A,\pi_k=1,\theta_i,\mu,\varepsilon)\big]^r(1+o(1)).
\]
Since $\varepsilon$ can be arbitrarily small and $\lim_{\varepsilon\to0}\widetilde\Psi_i(A,\pi_k=1,\theta_i,\mu,\varepsilon)=\Psi_i(A,\theta_i,\mu)$, the upper bound (A.18) follows, and the proof of the asymptotic approximation (43) is complete.

REFERENCES

[1] P. A. Bakut, I. A. Bolshakov, B. M. Gerasimov, A. A. Kuriksha, V. G. Repin, G. P. Tartakovsky, and V. V. Shirokov,
Statistical Radar Theory. Moscow, USSR: Sovetskoe Radio, 1963, vol. 1 (G. P. Tartakovsky, Ed.), in Russian.
[2] S. Dayanik, W. B. Powell, and K. Yamazaki, “Asymptotically optimal Bayesian sequential change detection and identification rules,” Annals of Operations Research, vol. 208, no. 1, pp. 337–370, Jan. 2013.
[3] V. P. Dragalin, A. G. Tartakovsky, and V. V. Veeravalli, “Multihypothesis sequential probability ratio tests–Part II: Accurate asymptotic expansions for the expected sample size,” IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1366–1383, Apr. 2000.
[4] C.-D. Fuh and A. G. Tartakovsky, “Asymptotic Bayesian theory of quickest change detection for hidden Markov models,” IEEE Transactions on Information Theory, vol. 65, no. 1, pp. 511–529, Jan. 2019.
[5] T. L. Lai, “Sequential multiple hypothesis testing and efficient fault detection-isolation in stochastic systems,” IEEE Transactions on Information Theory, vol. 46, no. 2, pp. 595–608, Mar. 2000.
[6] G. Lorden, “Procedures for reacting to a change in distribution,” Annals of Mathematical Statistics, vol. 42, no. 6, pp. 1897–1908, Dec. 1971.
[7] J. Marage and Y. Mori, Sonar and Underwater Acoustics. London, UK and Hoboken, NJ: ISTE Ltd and John Wiley & Sons, 2013.
[8] I. V. Nikiforov, “A generalized change detection problem,” IEEE Transactions on Information Theory, vol. 41, no. 1, pp. 171–187, Jan. 1995.
[9] ——, “A simple recursive algorithm for diagnosis of abrupt changes in random signals,” IEEE Transactions on Information Theory, vol. 46, no. 7, pp. 2740–2746, Jul. 2000.
[10] ——, “A lower bound for the detection/isolation delay in a class of sequential tests,” IEEE Transactions on Information Theory, vol. 49, no. 11, pp. 3037–3046, Nov. 2003.
[11] S. Pergamenchtchikov and A. G. Tartakovsky, “Asymptotically optimal pointwise and minimax quickest change-point detection for dependent data,” Statistical Inference for Stochastic Processes, vol. 21, no. 1, pp. 217–259, Jan. 2018.
[12] ——, “Asymptotically optimal pointwise and minimax change-point detection for general stochastic models with a composite post-change hypothesis,” Journal of Multivariate Analysis, vol. 174, no. 4, pp. 1–20, Oct. 2019.
[13] M. Pollak, “Optimal detection of a change in distribution,” Annals of Statistics, vol. 13, no. 1, pp. 206–227, Mar. 1985.
[14] M. A. Richards, Fundamentals of Radar Signal Processing, 2nd ed. USA: McGraw-Hill Education Europe, 2014.
[15] A. N. Shiryaev, “On optimum methods in quickest detection problems,” Theory of Probability and its Applications, vol. 8, no. 1, pp. 22–46, Jan. 1963.
[16] ——, Optimal Stopping Rules, ser. Stochastic Modelling and Applied Probability. New York, USA: Springer-Verlag, 1978, vol. 8.
[17] A. G. Tartakovsky, “Rapid detection of attacks in computer networks by quickest changepoint detection methods,” in Data Analysis for Network Cyber-Security, N. Adams and N. Heard, Eds. London, UK: Imperial College Press, 2014, pp. 33–70.
[18] ——, Sequential Change Detection and Hypothesis Testing: General Non-i.i.d. Stochastic Models and Asymptotically Optimal Rules, ser. Monographs on Statistics and Applied Probability 165. Boca Raton, London, New York: Chapman & Hall/CRC Press, 2020.
[19] A. G. Tartakovsky and J. Brown, “Adaptive spatial-temporal filtering methods for clutter removal and target tracking,” IEEE Transactions on Aerospace and Electronic Systems, vol. 44, no. 4, pp. 1522–1537, Oct. 2008.
[20] A. G. Tartakovsky, I. V. Nikiforov, and M. Basseville, Sequential Analysis: Hypothesis Testing and Changepoint Detection, ser. Monographs on Statistics and Applied Probability 136. Boca Raton, London, New York: Chapman & Hall/CRC Press, 2015.
[21] A. G. Tartakovsky, A. S. Polunchenko, and G. Sokolov, “Efficient computer network anomaly detection by changepoint detection methods,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 1, pp. 4–11, Feb. 2013.
[22] A. G. Tartakovsky, “Multidecision quickest change-point detection: Previous achievements and open problems,” Sequential Analysis, vol. 27, no. 2, pp. 201–231, Apr. 2008.
[23] ——, “Asymptotic optimality of mixture rules for detecting changes in general stochastic models,” IEEE Transactions on Information Theory, vol. 65, no. 3, pp. 1413–1429, Mar. 2019.
[24] A. G. Tartakovsky, B. L. Rozovskii, R. B. Blažek, and H. Kim, “Detection of intrusions in information systems by sequential change-point methods,” Statistical Methodology, vol. 3, no. 3, pp. 252–293, Jul. 2006.
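As a numerical illustration of how the bounds above are used in practice, the following minimal Python sketch (not part of the paper; all function names are illustrative) inverts the false-alarm bound $\alpha_i = 1/(1+A_i)$ of Theorem 1 to pick a threshold meeting a target false-alarm probability, and evaluates the first-order term $\log A / (I_i(\theta_i)+\mu)$ of the expected-delay approximation, assuming the simplest case of a Gaussian mean shift $\theta$ with unit variance, for which the Kullback–Leibler information number is $I(\theta)=\theta^2/2$, and taking $\mu=0$ (no exponential prior decay).

```python
import math

def threshold_for_pfa(alpha):
    # Theorem 1 bounds the false-alarm probability by alpha_i = 1/(1 + A_i);
    # inverting gives the threshold A_i that meets a target alpha.
    return (1.0 - alpha) / alpha

def first_order_delay(A, kl_number, mu=0.0):
    # First-order term of the delay asymptotics: log A / (I_i(theta_i) + mu).
    return math.log(A) / (kl_number + mu)

theta = 1.0            # post-change Gaussian mean shift (unit variance)
kl = theta ** 2 / 2.0  # Kullback-Leibler information number I(theta)
for alpha in (1e-2, 1e-4, 1e-6):
    A = threshold_for_pfa(alpha)
    delay = first_order_delay(A, kl)
    print(f"alpha={alpha:.0e}  threshold A={A:.4g}  first-order delay={delay:.1f}")
```

The sketch makes the logarithmic tradeoff visible: each hundred-fold tightening of the false-alarm probability raises the threshold by a factor of roughly 100 but grows the first-order expected delay only additively, by $\log(100)/I(\theta)$.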