Bandit Change-Point Detection for Real-Time Monitoring High-Dimensional Data Under Sampling Control
Wanrong Zhang∗  Yajun Mei†
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332
September 28, 2020
Abstract
In many real-world problems of real-time monitoring high-dimensional streaming data, one wants to detect an undesired event or change quickly once it occurs, but under the sampling control constraint, in the sense that one might be able to only observe or use selected components of the data for decision-making per time step in resource-constrained environments. In this paper, we propose to incorporate multi-armed bandit approaches into sequential change-point detection to develop an efficient bandit change-point detection algorithm. Our proposed algorithm, termed Thompson-Sampling-Shiryaev-Roberts-Pollak (TSSRP), consists of two policies per time step: the adaptive sampling policy applies the Thompson Sampling algorithm to balance between exploration for acquiring long-term knowledge and exploitation for immediate reward gain, and the statistical decision policy fuses the local Shiryaev-Roberts-Pollak statistics to determine whether to raise a global alarm by sum shrinkage techniques. Extensive numerical simulations and case studies demonstrate the statistical and computational efficiency of our proposed TSSRP algorithm.
Keywords: adaptive sampling; change-point detection; Thompson sampling; Shiryaev-Roberts procedure

∗ W.Z. supported in part by an ARC-TRIAD fellowship from the Georgia Institute of Technology.
† Y.M. supported in part by NSF grants DMS-1613258 and DMS-1830344. W.Z. and Y.M. also supported in part by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR002378. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Introduction
Real-time monitoring high-dimensional streaming data under sampling control constraints appears in many important applications such as intrusion detection in computer networks [Bas99] and event detection in social networks [VBC+]. Suppose the system generates a K-dimensional random vector X_t = (X_{1,t}, ..., X_{K,t}) at each time step, but we can only observe q out of K components per time step. Here the component X_{k,t} can be either the raw data from local sensors or derived features such as wavelet coefficients or principal components. Initially, the system is in control in the sense that X_{k,t} follows a probability density function f_k. At some unknown time ν, an event may occur and change the distributions of a sparse subset of the components. Our goal is to design an efficient algorithm to adaptively decide which variables to sample at each time step, and when to raise a global alarm to indicate the possible occurrence of the change.

Monitoring high-dimensional streaming data has attracted much attention in the statistical process control (SPC) and sequential change-point detection literature in recent years, often without resource constraints; see [ZQ09, Li19, Li20, LXTP20, ZWZJ15]. In general, there are two distinct approaches to developing a statistically efficient scheme for complete data. The first one is to construct global-level monitoring statistics directly from likelihood functions, often within the sequential change-point detection framework; see [ZS12, XS13, WM15, CF15, Cha17]. The second one is to first construct local detection statistics for each individual data stream and then fuse them into a global monitoring statistic; see [PH08, TNB14]. Here we will extend the second approach to resource-constrained environments, which involve the additional challenge of designing sampling policies for real-time monitoring streaming data.

The problem of real-time monitoring high-dimensional streaming data under sampling control has been studied in the statistical process control literature in applied statistics; see [LMS15, XWL18, WXTL18, XZBL19]. These works follow the intuitive idea of proposing a sampling policy that introduces a compensation parameter for the unobserved variables to increase the chance of exploring them. Here we propose to borrow powerful tools from the multi-armed bandit (MAB) problem for efficient real-time monitoring of streaming data. Many efficient sampling algorithms have been developed for the MAB to balance the tradeoff between exploration for acquiring long-term knowledge and exploitation for immediate reward gain. See [LR85, Rob85, Sco10, GGW11, BCB+
12, AG12, CZKX18, Zha19] and references therein. On the other hand, our research is very different from the classical MAB because, besides sampling policies, we also need to conduct a statistical hypothesis test at each time step to decide whether a change has occurred.

This paper proposes a bandit change-point detection algorithm for efficient real-time monitoring of high-dimensional streaming data under sampling control. Our method, termed the Thompson-Sampling-Shiryaev-Roberts-Pollak (TSSRP) algorithm, incorporates the Thompson Sampling algorithm from the multi-armed bandit problem into the Shiryaev-Roberts-Pollak procedure from the sequential change-point detection literature. Our proposed TSSRP algorithm can balance between exploiting those observed local components that maximize the immediate detection performance and exploring not-yet-monitored local components that might provide new information to improve future detection performance. In particular, our TSSRP algorithm performs similarly to random sampling in the in-control state when no changes occur, but becomes a greedy sampling rule on the affected local components in the out-of-control state when a change occurs. Numerical simulations and case studies show the efficiency of our proposed TSSRP algorithm.

The remainder of this paper is organized as follows. In Section 2, we provide the mathematical formulation of our problem and also review the background of the multi-armed bandit problem and the sequential change-point detection problem. Next, we introduce our proposed method and develop its theoretical properties in Section 3. Then we evaluate the performance of our proposed algorithm through simulation studies and real data case studies in Sections 4 and 5, respectively. Concluding remarks are included in Section 6. We provide the proofs of the theorems in the Appendix. Additional experiments are presented in the Supplementary Materials.
In this section, we present the mathematical formulation of real-time monitoring high-dimensional streaming data in resource-constrained environments in Subsection 2.1. Then we provide a brief review of the Thompson Sampling algorithm for the multi-armed bandit problem in Subsection 2.2, followed by a review of the Bayesian approach for sequential change-point detection in Subsection 2.3.
Suppose we are monitoring K independent data streams in a system. Let X_{k,t} denote the observation of the k-th data stream at time t for t = 1, 2, ... and k = 1, 2, ..., K. Here each data stream X_{k,t} can be the raw data itself or its derived features such as wavelet coefficients or principal components. Samples from data streams are assumed mutually independent. Each data stream generates identically distributed data from a specific distribution f_{θ_k}. At some unknown change time ν ∈ {1, 2, ...}, an undesirable event occurs and changes the distributions of some data streams abruptly, in the sense of changing the values of the parameters θ_k. Conditional on the change time ν > 1, for those affected data streams, the observations X_{k,1}, ..., X_{k,ν−1} are independent and identically distributed (iid) with density f_{θ_{k,0}}, while X_{k,ν}, X_{k,ν+1}, ... are iid with another density f_{θ_{k,1}}, where θ_{k,0} ≠ θ_{k,1}. For those unaffected data streams, all observations X_{k,1}, X_{k,2}, ... are iid with density f_{θ_{k,0}}. Here we do not know which subset of the local streams changed, and we assume that the affected local streams are sparse. In practice, practitioners usually specify f_{θ_{k,1}} via the smallest magnitude of a change that one is interested in detecting.

Under sampling control, we can only observe or use selected partial data for decision making. We assume that only q < K data streams can be selected to collect data at each time t. Mathematically, let δ_{k,t} be the indicator function such that δ_{k,t} = 1 if and only if the k-th data stream X_{k,t} is selected at time step t. The resource constraint implies that Σ_{k=1}^{K} δ_{k,t} = q at each time step t = 1, 2, .... Imagine that we put q sensors onto K locations; then δ_{k,t} can be thought of as whether to put a sensor on the k-th data stream at time t. Let S_t denote the set of locations where δ_{k,t} = 1, and we refer to it as the sensor layout. Under our notation, the observed data can be represented as {X*_{k,t}} = {X_{k,t} δ_{k,t}}, k = 1, 2, ..., K.

For the change-point detection problem under sampling control, a statistical scheme consists of two policies per time step. One is the adaptive sampling policy δ that decides the observation locations δ_{k,t}, and the other is the statistical decision policy, often defined as a stopping time T, that raises an alarm based on the available observations {X*_{k,t}} = {X_{k,t} δ_{k,t}}, 1 ≤ k ≤ K, 1 ≤ t ≤ T. Our objective is to design a scalable and efficient statistical scheme (δ, T) that minimizes the detection delay

D(T) = sup_{1 ≤ ν < ∞} E_ν(T − ν | T ≥ ν),   (1)

subject to the false alarm constraint

E(T | ν = ∞) ≥ γ.
(2)

It is worth noting that our problem connects to the multi-armed bandit problem through its sampling policy; however, the two are fundamentally different due to different performance criteria.

In this subsection, we briefly review the multi-armed bandit (MAB) problem, first introduced by [Rob52], and one of its most popular algorithms, Thompson Sampling [Tho33]. Under a classical setting of the MAB, a gambler can play one of K slot machines (or arms) for K ≥
2, but she or he has no prior knowledge about which machine has a potentially higher reward. The only way to learn the rewards is to play the machines. The problem of interest is how the gambler decides which arm to play at each time step so as to maximize the total rewards over N plays.

One of the most popular MAB algorithms is Thompson Sampling, which is a natural Bayesian algorithm. It has been widely used for personalized advertisements and product recommendation [AG13], and its efficiency has been well demonstrated in many real-world applications, especially in the high-dimensional setting; see [Sco10], [CL11]. In particular, [AG12] showed that the Thompson Sampling algorithm asymptotically minimizes the expected regret.

The idea of Thompson Sampling is to sample arms based on the largest values of random realizations from the posterior distributions instead of the posterior means. Specifically, at each time step, after updating the posterior distribution of the mean θ_k for each arm, we randomly sample ˆθ_k from the posterior distribution of the k-th arm. Then we select the arm with the largest random realization, i.e., arg max_{1 ≤ k ≤ K} ˆθ_k. This gives better chances of sampling those arms with fewer observations, thereby balancing the tradeoff between exploration for acquiring long-term knowledge and exploitation for immediate reward gain.

In the multi-armed bandit problem, when we are allowed to observe q arms each time, it is natural to extend the original Thompson Sampling algorithm to select the q largest realizations. Such an approach often retains nice properties under reasonable conditions; see [AVW87, PT99, KCG16]. Thus we will adopt Thompson Sampling with the q largest realizations in our context.

We now review the Bayesian approach for the simplest sequential change-point detection problem, pioneered by [Shi63], as well as the corresponding limiting Bayes approach; see [Rob66, Pol85, Pol87].
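Before specializing to change-point detection, the q-largest-realizations Thompson Sampling rule reviewed in Subsection 2.2 can be made concrete with a minimal Beta-Bernoulli sketch. The arm count, reward means, and horizon below are hypothetical choices for illustration, not quantities from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

K, q, N = 10, 3, 2000                    # arms, arms played per round, rounds
true_means = np.linspace(0.1, 0.9, K)    # hypothetical Bernoulli reward means

# Beta(1, 1) priors; (alpha, beta) are the per-arm posterior parameters
alpha = np.ones(K)
beta = np.ones(K)

for t in range(N):
    theta_hat = rng.beta(alpha, beta)        # one posterior draw per arm
    chosen = np.argsort(theta_hat)[-q:]      # play the q largest realizations
    rewards = rng.binomial(1, true_means[chosen])
    alpha[chosen] += rewards                 # conjugate Beta posterior update
    beta[chosen] += 1 - rewards

plays = alpha + beta - 2.0                   # times each arm was played
```

After enough rounds the play counts concentrate on the arms with the highest reward means, while every arm keeps a nonzero chance of being drawn, which is exactly the exploration-exploitation balance described above.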
Consider the simplest univariate case where we observe a sequence of independent observations X_1, X_2, ..., whose distribution might change from f_{θ_0} to f_{θ_1} at some unknown time ν. Since the goal is to detect the change time ν quickly, the statistical procedure is defined as a stopping time T with respect to the observed data {X_t}_{t ≥ 1}, where {T = t} means that we raise the alarm at time t to indicate that a change has occurred up to time t.

Under the Bayesian formulation, it is assumed that the change-point ν has a geometric prior distribution:

P(ν = t) = p(1 − p)^{t−1} for t = 1, 2, 3, ...,   (3)

where 0 < p < 1. Conditional on ν, the pre-change observations X_1, ..., X_{ν−1} are iid with density f_{θ_0} and are independent of the post-change observations X_ν, X_{ν+1}, ..., which are iid with density f_{θ_1}. Assume that the cost per post-change observation is c > 0. Then the Bayesian formulation of sequential change-point detection is to find a statistical procedure T that minimizes the Bayes risk P(T < ν) + c E(T − ν)^+.

[Shi63] first solved this problem, and the Bayesian solution is to raise an alarm at the first time when the posterior probability that a change has occurred, i.e., P(ν ≤ t | X_1, ..., X_t), is greater than a certain threshold. Under the limiting Bayes framework, one considers the test statistic of the form P(ν ≤ t | X_1, ..., X_t) / [p(1 − P(ν ≤ t | X_1, ..., X_t))] as p goes to zero. This yields the so-called Shiryaev-Roberts procedure ([Rob66]) that raises an alarm at time

T_A = inf{t : R_t ≥ A},   (4)

where R_t is the Shiryaev-Roberts statistic defined as

R_t = Σ_{j=1}^{t} Π_{i=j}^{t} [f_{θ_1}(X_i) / f_{θ_0}(X_i)],   (5)

and the threshold A is a pre-specified constant. [Pol85, Pol87] showed that this procedure enjoys nice asymptotic minimax properties, i.e., it minimizes the worst detection delay in (1) up to an o(1) term subject to the false alarm constraint in (2), as γ goes to ∞.
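The statistic in (5) satisfies the recursion R_t = (1 + R_{t−1}) · f_{θ_1}(X_t)/f_{θ_0}(X_t) with R_0 = 0, which is how it is computed in practice. A small sketch for a unit-variance Gaussian mean shift; the means, threshold, and deterministic toy stream are illustrative choices, not taken from the paper:

```python
import numpy as np

def sr_alarm_time(xs, mu0=0.0, mu1=1.0, A=1e6):
    """Run the Shiryaev-Roberts recursion R_t = (1 + R_{t-1}) * LR_t and
    return the first t with R_t >= A (the stopping rule (4)); None if
    the stream ends without an alarm."""
    R = 0.0
    for t, x in enumerate(xs, start=1):
        # likelihood ratio of N(mu1, 1) against N(mu0, 1) at x
        lr = np.exp((mu1 - mu0) * x - 0.5 * (mu1 ** 2 - mu0 ** 2))
        R = (1.0 + R) * lr
        if R >= A:
            return t
    return None

# toy stream: in control (x = 0) for 100 steps, then a persistent shift
# to x = 1; R stays bounded pre-change and grows geometrically afterwards
xs = [0.0] * 100 + [1.0] * 50
alarm = sr_alarm_time(xs)
```

On this deterministic stream the pre-change recursion settles near its fixed point e^{−1/2}/(1 − e^{−1/2}) ≈ 1.54, and after the shift R grows by a factor e^{1/2} per step, so the alarm fires about 25 steps after the change.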
In this section, we present our proposed TSSRP algorithm for real-time monitoring high-dimensional streaming data under sampling control. For better presentation, we split this section into three subsections. We give the proposed methodology in Subsection 3.1 and discuss the choice of parameters and prior distribution in Subsection 3.2. In addition, we develop the theoretical properties of our proposed TSSRP algorithm in Subsection 3.3.
In the context of real-time monitoring high-dimensional streaming data under sampling control, a statistical procedure consists of two policies per time step: (1) the adaptive sampling policy to decide the observation locations; (2) the statistical decision policy to raise a global alarm based on the observed data. A common challenge in both components or policies is how to construct local statistics for each local stream that can guide us to make efficient decisions for both the adaptive sampling and statistical decision policies. Our proposed method's key novelty is to recursively update two-dimensional local statistics over time that allow us to conveniently implement Thompson Sampling.
We propose to recursively compute two-dimensional local statistics, denoted by R_{k,t} and L_{k,t}, at the k-th local data stream at each time step t = 1, 2, ..., where R_{k,t} mimics the classical Shiryaev-Roberts statistic:

R_{k,t} = (R_{k,t−1} + 1) · f_{θ_{k,1}}(X_{k,t}) / f_{θ_{k,0}}(X_{k,t}),  if δ_{k,t} = 1;
R_{k,t} = R_{k,t−1} + 1,  if δ_{k,t} = 0,   (6)

with initial value R_{k,t=0} = 0, and the statistic L_{k,t} mimics the likelihood ratio function:

L_{k,t} = L_{k,t−1} · f_{θ_{k,1}}(X_{k,t}) / f_{θ_{k,0}}(X_{k,t}),  if δ_{k,t} = 1;
L_{k,t} = L_{k,t−1},  if δ_{k,t} = 0,   (7)

with initial value L_{k,t=0} = 1.

At a high level, the local statistic R_{k,t} mainly provides evidence of how likely a local change has occurred, whereas the other local statistic L_{k,t} is related to the number of samples taken at a given local data stream. If δ_{k,t} = 1, i.e., if one takes an observation from that specific local data stream, then the updates of R_{k,t} and L_{k,t} follow the classical Shiryaev-Roberts or likelihood-ratio statistics, respectively. On the other hand, if δ_{k,t} = 0, i.e., if we do not take a local observation, then we recursively update R_{k,t} and L_{k,t} by adding or multiplying the constant 1, respectively. The intuition is to treat f_{θ_{k,1}}(X_{k,t}) / f_{θ_{k,0}}(X_{k,t}) as 1 if X_{k,t} is missing.

Note that the definition or computation of (R_{k,t}, L_{k,t}) depends on which local sensors are observed, i.e., the values of the sampling indicator variables δ_{k,t}, which will be defined in the next subsection.

Our proposed adaptive sampling policy is as follows. At each time step t = 0, 1, ..., we not only compute the observed two-dimensional local statistics (R_{k,t}, L_{k,t}), but also sample a randomized value ˜R_{k,t} from a pre-specified prior distribution G that can be treated as the "initial value." When G is a point mass density at 0, then ˜R_{k,t} ≡ 0 for k = 1, ..., K.
Otherwise, the values ˜R_{k,t} can differ across sensors, and can be thought of as prior knowledge of how likely a local data stream is to be affected by the change. Next, we compute a real-valued local statistic that determines the sampling policy:

R*_{k,t} = R_{k,t} + L_{k,t} · ˜R_{k,t}.   (8)

Finally, at time step t + 1, we follow Thompson Sampling to adaptively choose the local data streams with the q largest values of R*_{(k),t} in (8). Let l_{(k),t+1} denote the index corresponding to the k-th largest value; then the new sensor layout at time t + 1 will be S_{t+1} = {l_{(1),t+1}, ..., l_{(q),t+1}}.

Let us provide a high-level rationale for our proposed adaptive sampling policy. First, note that R*_{k,t} in (8) can be thought of as a randomized version of R_{k,t} that allows us to better balance the tradeoff between those local streams having larger observed R_{k,t} and those local streams having fewer observations. Second, our sampling policy is computationally efficient, since it is based on recursive updates of the two-dimensional local statistics (R_{k,t}, L_{k,t}). Finally, our sampling policy is the Thompson Sampling method under the limiting Bayes framework, since the larger the value of R*_{k,t}, the larger the realization of the posterior distribution of a local change. The following theorem summarizes this close relationship.

Theorem 1. Assume that the change time ν_k for the k-th data stream has a prior Geometric(p) distribution:

P(ν_k = 0) = Π_{k,0},   P(ν_k = t) = (1 − Π_{k,0}) p (1 − p)^{t−1} for t = 1, 2, 3, ...,   (9)

where the initial prior probability Π_{k,0} has a prior distribution G_p. Suppose that G_p / p → G in distribution. Then R*_{k,t} in (8) has the same distribution as

lim_{p→0} Π_{k,t} / [p (1 − Π_{k,t})],   (10)

where Π_{k,t} = P(ν_k ≤ t | X*_{k,1}, ..., X*_{k,t}) is the posterior probability that a local change has occurred, and X*_{k,t} = X_{k,t} δ_{k,t} denotes the observed data.
The detailed proof of Theorem 1 is presented in Appendix A.
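The per-stream recursions (6) and (7) and the sampling statistic (8) are cheap to implement. A sketch for unit-variance Gaussian streams; the shift size MU1 and the uniform prior G = U[0, 1] are illustrative choices consistent with, but not mandated by, the text:

```python
import math
import random

MU1 = 1.5  # user-specified smallest post-change mean shift of interest

def lr(x, mu0=0.0, mu1=MU1):
    """Likelihood ratio f_{theta_{k,1}}(x) / f_{theta_{k,0}}(x) for a
    unit-variance Gaussian mean shift mu0 -> mu1."""
    return math.exp((mu1 - mu0) * x - 0.5 * (mu1 ** 2 - mu0 ** 2))

def update(R, L, x, observed):
    """One step of (6) and (7) for a single stream: Shiryaev-Roberts /
    likelihood-ratio updates if observed, add/multiply 1 otherwise."""
    if observed:
        ratio = lr(x)
        return (R + 1.0) * ratio, L * ratio
    return R + 1.0, L

# sampling statistic (8): R* = R + L * R_tilde with R_tilde ~ G = U[0, 1]
R, L = update(0.0, 1.0, x=2.0, observed=True)       # observed step
R, L = update(R, L, x=None, observed=False)         # unobserved step
R_star = R + L * random.uniform(0.0, 1.0)
```

Note that the unobserved branch leaves L unchanged and adds 1 to R, matching the intuition of treating the missing likelihood ratio as 1.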
Our proposed global decision policy raises a global alarm based on the r largest values of the local statistics R_{k,t} in (6), and is defined as the stopping time

T = inf{t ≥ 1 : Σ_{k=1}^{r} R_{(k),t} ≥ A},   (11)

where r is a pre-specified parameter and A is a pre-specified constant chosen to satisfy the false alarm constraint in (2). Here R_{(1),t} ≥ ... ≥ R_{(k),t} ≥ ... ≥ R_{(K),t} denote the decreasing order statistics of the local statistics R_{k,t} in (6).

We should acknowledge that there are many other ways to make a global decision. For instance, for the local statistics in the summation, we could use the randomized version R*_{(k),t} or the logarithmic version log R_{(k),t}. Moreover, there are different ways to use shrinkage transformations to combine local statistics to raise a global alarm; see [Mei11] and [LZM19]. Based on our extensive simulation experience, the stopping time in (11) is stable and outperforms other stopping rules in most cases. The comparison with other stopping rules is deferred to Supplementary Material Section 3.
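The threshold A in (11) is in practice calibrated by Monte Carlo so that the in-control average run length meets the constraint in (2). A self-contained sketch of that calibration, using random (rather than adaptive) sampling as a simplified stand-in for the full policy; all sizes (K, q, r, the shift µ1 = 1.5, the simulation budget, and the bracket 50γ) are purely illustrative:

```python
import numpy as np

def in_control_run_length(A, K=10, q=5, r=3, mu1=1.5, horizon=500, rng=None):
    """Time until the top-r sum of the local statistics (6) first crosses A
    on purely in-control N(0, 1) data, sampling q of K streams uniformly at
    random each step (a simplification of the adaptive sampling policy)."""
    rng = rng or np.random.default_rng()
    R = np.zeros(K)
    for t in range(horizon):
        S = rng.choice(K, size=q, replace=False)
        x = rng.normal(0.0, 1.0, size=q)              # in-control data
        R += 1.0                                      # unobserved branch of (6)
        R[S] *= np.exp(mu1 * x - 0.5 * mu1 ** 2)      # observed branch of (6)
        if np.sort(R)[-r:].sum() >= A:
            return t + 1
    return horizon                                    # truncated run

def calibrate_threshold(gamma, n_rep=30, seed=0, tol=0.05):
    """Bisection for (approximately) the smallest A whose estimated
    in-control ARL is at least gamma; 50 * gamma is a loose upper bracket."""
    rng = np.random.default_rng(seed)
    lo, hi = 1.0, 50.0 * gamma
    while hi / lo > 1.0 + tol:
        mid = 0.5 * (lo + hi)
        arl = np.mean([in_control_run_length(mid, rng=rng) for _ in range(n_rep)])
        lo, hi = (mid, hi) if arl < gamma else (lo, mid)
    return hi

A = calibrate_threshold(gamma=50.0)
```

More replications and a longer horizon would be used in a real study; the point is only the bisection structure of the search.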
We summarize the proposed Thompson-Sampling-Shiryaev-Roberts-Pollak (TSSRP) procedure in Algorithm 1.

It is useful to highlight that our proposed TSSRP method is computationally scalable. First, it requires only 3K registers for retaining the relevant information (R_{k,t}, L_{k,t}, ˜R_{k,t}) about the K local processes: the first two concern the observed data regarding the local change, and the last concerns the prior knowledge of the local change. Second, since the two-dimensional local statistics (R_{k,t}, L_{k,t}) can be computed recursively and the "initial" values ˜R_{k,t} can be sampled directly from the prior distribution G at each time step, the computational cost of our TSSRP method is linear in the number K of local data streams. Thus our method can be easily implemented for real-time monitoring.

Algorithm 1: Thompson-Sampling-Shiryaev-Roberts-Pollak (TSSRP) algorithm
Parameters: the number r, the number of observed sensors q, a prior distribution G, and the stopping threshold A.
Input: K data streams.
Initialize:
Set R_{k,0} = 0, L_{k,0} = 1, and sample ˜R_{k,0} from G for all k = 1, 2, ..., K. Randomly sample q data streams as the initial layout S_1.
Algorithm: In each round t ← 1, 2, ..., do the following until reaching the stopping condition in (11):
(1) Based on the current sensor layout S_t, recursively update the two-dimensional local statistics (R_{k,t}, L_{k,t}) in (6) and (7).
(2) For each data stream k, sample the "initial" value ˜R_{k,t} from G, and calculate the local sampling statistic R*_{k,t} in (8).
(3) Order the local sampling statistics R*_{k,t}, k = 1, 2, ..., K, from the largest to the smallest, and let l_{(k),t} denote the variable index of the order statistic R*_{(k),t}.
(4) Update the sensor layout S_{t+1} = {l_{(1),t}, ..., l_{(q),t}} and proceed to the next iteration.

The TSSRP algorithm involves several parameters. Below we discuss the choice of these parameters.
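The steps of Algorithm 1 can be sketched compactly for unit-variance Gaussian streams with a mean shift 0 → µ1, vectorizing the recursions (6)-(7), the sampling statistic (8), and the stopping rule (11). The stream sizes, shift, and threshold in the toy example are illustrative; in practice A would be calibrated to the false alarm constraint (2):

```python
import numpy as np

def tssrp(X, q, r, A, mu1=1.5, rng=None):
    """Run the TSSRP loop on X of shape (T, K); only the q streams in the
    current sensor layout are read at each step. Returns the alarm time
    (1-based) from the stopping rule (11), or None if no alarm."""
    rng = rng or np.random.default_rng()
    T, K = X.shape
    R = np.zeros(K)                            # statistics (6), R_{k,0} = 0
    L = np.ones(K)                             # statistics (7), L_{k,0} = 1
    S = rng.choice(K, size=q, replace=False)   # random initial layout
    for t in range(T):
        lr = np.exp(mu1 * X[t, S] - 0.5 * mu1 ** 2)  # likelihood ratios
        R += 1.0                               # unobserved branch of (6)
        R[S] *= lr                             # observed branch: (R + 1) * lr
        L[S] *= lr                             # observed branch of (7)
        if np.sort(R)[-r:].sum() >= A:         # stopping rule (11)
            return t + 1
        R_star = R + L * rng.uniform(size=K)   # (8) with prior G = U[0, 1]
        S = np.argsort(R_star)[-q:]            # Thompson step: q largest
    return None

# toy example: 3 of 20 streams shift from mean 0 to mean 2 at time 1
rng = np.random.default_rng(7)
X = rng.normal(0.0, 1.0, size=(300, 20))
X[:, :3] += 2.0
alarm = tssrp(X, q=10, r=5, A=5000.0, rng=rng)
```

Adding 1 to every R before multiplying only the sampled coordinates reproduces both branches of (6) in one vectorized pass, which is what makes the cost linear in K per step.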
Choice of the prior distribution: In practice, we can choose the prior distribution according to our prior knowledge. For example, in a manufacturing process, we may know that certain production lines have a higher chance of being out of control. Alternatively, we can choose the uniform distribution if no prior knowledge is present, or use the point mass at 0 distribution, i.e., P(˜R_{k,t} = 0) = 1. In the latter case, the method reduces to a greedy sampling algorithm without randomization. In our numerical simulation studies in Section 4, we consider two different priors: one is Uniform(0, 1) and the other is the point mass at 0 prior, denoted P_0.

We note that [Pol85] also investigated the choice of the prior distribution G, but in a different context in which the randomized Shiryaev-Roberts statistic leads to an equalizer stopping time T, in the sense that E_ν(T − ν | T ≥ ν) is constant as a function of the candidate change-point ν. Unfortunately, it is generally challenging to find an explicit solution for such a prior distribution. Nevertheless, our purpose for the randomization is different from [Pol85]: ours is for balancing exploration and exploitation, while theirs is for an almost-minimax property.
Choice of r: Intuitively, the tuning parameter r in the stopping time (11) decides how many local sensors are involved in the final decision making. Thus an ideal choice should be a plausible approximation of the actual number of changed data streams. If r is much smaller than the actual number of changed data streams, our final decision will not utilize all the information that the data might provide. Meanwhile, if r is much larger, our final decision will involve unnecessary noisy local statistics and lead to poor performance. As [Mei11] suggested, when the total number of changed data streams is unknown, a smaller r might be more robust. Based on our extensive numerical experience, we found that it is better for r to be less than the total number q of sensors. Otherwise, the sum will include unobserved variables and thus degrade the performance.

Choice of A: The parameter A is the stopping threshold that controls the average run length (ARL) to false alarm of our method, which is analogous to controlling the type I error. In practice, the threshold A is usually determined by Monte Carlo simulation. One often uses the bisection method to find the smallest A such that the proposed method satisfies the false alarm constraint in (2). We adopt this approach in our paper to choose the parameter A.

In this subsection, we provide some theoretical properties of the TSSRP algorithm: Theorem 2 investigates the in-control average run length properties, whereas Theorem 3 and Theorem 4 provide a deeper understanding of the sensor layouts. First, we investigate the ARL to false alarm of the proposed TSSRP algorithm in the following theorem.
Theorem 2 (Average Run Length to False Alarm). E_∞(T) ≥ A/K. Moreover, E_∞(T) = O(A), where O(A)/A is bounded as A → ∞.

Theorem 2 provides guidance for selecting conservative upper and lower bounds for A. Specifically, K times the target ARL can serve as the upper bound in the bisection search to speed up the threshold-choosing procedure. The detailed proof of Theorem 2 is given in Appendix B. Unfortunately, it remains an open problem to derive bounds on the detection delays.

Next, we investigate the properties of the sensor layouts. Theorem 3 shows that the sensor layout will eventually go through all the data streams when the system is in control.
Theorem 3.
Let S_t be the sensor layout at time t. Then under H_0 : ν = ∞, there exists a t′ > t such that P(k ∈ S_{t′}) > 0, for each t > 0 and each k, 1 ≤ k ≤ K.

Theorem 3 implies that each variable has a chance to be explored, regardless of the sensor deployments in previous steps, when no changes occur. In other words, the algorithm performs similarly to random sampling in the in-control state.

Finally, Theorem 4 below implies that the sensor layout of the TSSRP algorithm will eventually converge to the changed data streams when the system is out of control.
Theorem 4.
Let S_t be the sensor layout at time t. Then under H_1 : ν < ∞, we have P(k ∈ S_t, ∀ t > t_0 | k ∈ S_{t_0}) > 0 for every changed data stream k and all t_0 > ν.

We have shown in Theorem 3 that the sensor layout will not stay on any unchanged data streams forever; the sensors will eventually be redistributed to the affected data streams at a certain time. Theorem 4 states that once a sensor is deployed to an out-of-control data stream, there is a nonzero probability that the sensor will stay on this data stream forever. This property ensures that the sensors will eventually keep monitoring the affected data streams. The proofs of Theorem 3 and Theorem 4 are given in Appendix C. In summary, our TSSRP algorithm admits the nice property that it does more exploration when the system is in control, and more exploitation when the system is out of control.

Table 1: Comparison of the detection delay between the TSSRP algorithm and the TRAS algorithm under different numbers of changed data streams for r = 10, when the data are independent multivariate Gaussian (standard errors in parentheses).

The number of changes   1              2              3              4              5
TSSRP(U[0,1], P_0)      12.766(0.184)  9.701(0.117)   8.280(0.093)   7.689(0.083)   7.185(0.074)
TRAS(∆ = 0.03)          27.028(0.422)  20.081(0.269)  16.419(0.207)  14.392(0.185)  12.686(0.153)
TRAS(∆ = 0.05)          27.787(0.344)  20.673(0.229)  17.422(0.181)  15.485(0.166)  14.382(0.153)
TRAS(∆ = 0.1)           44.926(0.281)  32.543(0.216)  27.734(0.166)  24.512(0.145)  22.400(0.133)

The number of changes   6              7              8              9              10
TSSRP(U[0,1], P_0)      6.713(0.063)   6.365(0.059)   6.155(0.054)   5.969(0.049)   5.874(0.047)
TRAS(∆ = 0.03)          11.39(0.133)   10.769(0.127)  10.027(0.110)  9.425(0.105)   8.867(0.095)
TRAS(∆ = 0.05)          12.959(0.132)  11.806(0.117)  11.103(0.112)  10.583(0.100)  9.907(0.090)
TRAS(∆ = 0.1)           20.871(0.121)  19.609(0.119)  18.775(0.109)  17.531(0.106)  17.114(0.098)
Intuitively, such a nice property comes from the structure of the local SRP statistics. When we observe a new data point, the recursive formula for the local statistics involves the likelihood ratio f_{θ_{k,1}}(x_{k,t}) / f_{θ_{k,0}}(x_{k,t}). Under the in-control state, the expectation of the increment of every local statistic at time t is one, i.e., E[R*_{k,t}] = t for all k, t. Thus our adaptive sampling is similar to random sampling under the in-control state. Under the out-of-control state, the likelihood ratio under the post-change distribution is more likely to be greater than one. Therefore, the local statistics of the affected data streams will increase faster than those of the unaffected data streams, which enables the sensor layout to converge to the affected local components.

In this section, we study the performance of the proposed TSSRP algorithm and compare it with the existing method in [LMS15]. The general setting for our simulations is as follows. We consider monitoring K = 100 independent data streams. We assume that q = 10 out of K = 100 data streams can be monitored at each time step. The nominal value for the ARL to false alarm is fixed as E_∞ T = γ = 1000. We follow the classic approach [XS13, XHW13, Mei10] and report the detection delay when the change occurs at time ν = 1. This is the most challenging setting because it is more difficult to detect a change that occurred at the very beginning.

We report our simulation results in two subsections, depending on the distribution of the data. Subsection 4.1 considers the efficiency of our proposed TSSRP algorithm when the data are from the Gaussian distribution, and Subsection 4.2 considers the robustness of our algorithm when the data are from the t distribution. In each simulation study, we evaluate the average detection delay based on 1000 Monte Carlo replications.

4.1 Gaussian Distributed Data

We monitor K = 100 independent Gaussian data streams whose pre-change distributions are N(0,
1) with a possible mean change in some local data streams. We consider different numbers of changed data streams, ranging from 1 to 10 local streams. Our proposed algorithm is constructed for the post-change mean lower bound µ_1 = 1.
5, and the true post-change mean is µ_{1,true} = 2. As mentioned in Section 3.2, the parameter r in the global stopping time defined in (11) ideally should be the number of changed data streams. The latter, however, is usually unknown in practice. Thus we report the simulation results of different global monitoring schemes under a large r (= 10) in the main body of the paper and a small r (= 3) in the supplementary material.

For our proposed TSSRP algorithm, we consider two choices of the prior G: the uniform distribution U[0,
1] and the point mass at 0 distribution, P_0 = P(˜R_{k,t} = 0) = 1. We refer to them as TSSRP(U[0,1]) and TSSRP(P_0), respectively. For comparison purposes, we compare our TSSRP algorithm with the baseline TRAS algorithm proposed by [LMS15]. The TRAS algorithm first constructs a local CUSUM statistic for each observed variable. For the unobserved variables, the local statistics are updated by adding a compensation parameter ∆. Next, the top-r local statistics are combined to determine whether to raise an alarm. The algorithm adaptively deploys the sensors to the data streams with the q largest local statistics at each time step. Since TRAS has a tuning parameter ∆, we present results obtained with three different choices ∆ = 0.03, 0.05, 0.1 for r = 10. Additional simulation results under r = 3 are deferred to Table 4 in the supplementary material. Besides the detection delays, the corresponding standard errors are also included in these tables to better understand the detection delay distribution. Table 1 shows three advantages of our algorithm when the data are drawn from an independent multivariate Gaussian distribution: (1) The TSSRP algorithm can significantly reduce the detection delay compared to the TRAS algorithm. To explain this result, we note that TRAS depends on a tuning parameter that is difficult to choose optimally. In contrast, TSSRP does not require such a tuning parameter. (2) The TSSRP algorithm is relatively stable across different priors. Driven by the classical Shiryaev-Roberts statistics, TSSRP(P_0) performs well, and TSSRP(U[0,1]) performs comparably, which implies that we can even improve the model by incorporating Thompson Sampling in constructing the local statistics. (3) The TSSRP algorithm is more stable than the TRAS algorithm when r varies. The performance of TSSRP is better for a large r; in contrast, TRAS performs better with a smaller r. This result agrees with our intuition: the order of the Shiryaev-Roberts statistic is the exponential of the order of the CUSUM statistic.
The sum of the top-r statistics mainly captures the maximum local statistics across all the data streams. Therefore, our TSSRP algorithm can reduce the effect of noise.

In this subsection, we study the robustness properties of our proposed TSSRP method when the distributional assumption does not hold. We apply the procedure designed for Gaussian distributions to data that follow t distributions. All the simulation settings are the same as those in the previous section, except that we simulate the samples from the t distribution with in-control mean zero, out-of-control mean two, and degrees of freedom df = 5. The stopping threshold A is determined by 1000 Monte Carlo simulations under the t distribution. The global statistics are constructed by summing up the top ten local statistics.

Figure 1: Experiments in assessing robustness with t-distributed data (average detection delay against the number of changed data streams for TSSRP(U[0,1]), TSSRP(P_0), TRAS(∆ = 0.03), and TRAS(∆ = 0.05)).

We summarize in Figure 1 the detection delays and the standard errors of the TSSRP under the uniform distribution prior U[0,
1] and the deterministic prior P_0, and of the TRAS under two choices of the tuning parameter. Our results suggest that both the TSSRP and the TRAS algorithms are robust when the distributional assumption is violated, and our TSSRP still outperforms the TRAS algorithm. This experiment further bolsters the stability of our algorithm, indicating that the hypothesized distributions can be quite far from the underlying distributions of the data, and our algorithm will still raise an alarm quickly.

In this section, we evaluate the performance of the TSSRP algorithm on a real case example: the hot forming process. We also give an additional solar flare detection example on the high-dimensional case in Section 1 of the supplementary material.

We consider the hot forming process example in [LJ10], where we want to detect anomalies in this physical system. Figure 2 (left) gives a two-dimensional (2-D) illustration of the hot forming process. [LJ10] identified the causal relationship among the five variables in this process: the final dimension of the workpiece X_1, the tension in the workpiece X_2, the material flow stress X_3, the temperature X_4, and the blank holding force X_5, which can be represented as a Bayesian network. All the variables are proven to follow the standard normal distribution when the system is under normal operating conditions.

Figure 2: Left: 2-D illustration of the hot forming process. Right: Bayesian network for the hot forming process.

[LJ10] gave the parameterization model of the Bayesian network, and Figure 2 (right) illustrates the dependence across the five variables:

X_i = Σ_{X_j ∈ p(X_i)} w_{j,i} X_j + ε_i,    (12)

where p(X) = {Y : Y → X ∈ Edge Set} denotes the parents of X; w_{j,i} is the weight of the edge X_j → X_i, which quantifies the causal influence from X_j to X_i; and ε_i ∼ N(0, σ_i²) is independent Gaussian noise.

In this study, we assume that the Bayesian network is unknown to us.
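For intuition, a linear Gaussian network of the form (12) can be simulated by sampling the variables in topological order. The sketch below uses a hypothetical chain-like edge set and made-up weights purely for illustration; the actual structure and parameterization come from [LJ10] and are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Edge weights w_{j,i} for X_i = sum_j w_{j,i} X_j + eps_i.
# Illustrative values only; [LJ10] gives the actual parameterization.
edges = {           # child: [(parent, weight), ...]
    2: [(1, 0.8)],
    3: [(2, 0.6)],
    4: [(3, 0.5), (1, 0.3)],
    5: [(4, 0.7)],
}

def sample_network(root_shift=0.0, n=1):
    """Draw n samples of (X_1,...,X_5), optionally shifting the root mean."""
    X = {}
    X[1] = root_shift + rng.normal(size=n)          # root variable
    for i in (2, 3, 4, 5):                          # topological order
        noise = rng.normal(size=n)                  # eps_i ~ N(0, sigma_i^2)
        X[i] = sum(w * X[j] for j, w in edges[i]) + noise
    return np.column_stack([X[i] for i in (1, 2, 3, 4, 5)])

in_control = sample_network(0.0, n=1000)      # normal operating conditions
out_of_control = sample_network(2.0, n=1000)  # root mean shifted by two
```

Note how a mean shift injected at a root propagates downstream through the edge weights, which is what makes monitoring only a subset of variables informative.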
We only use the network structure to generate data under different scenarios. We generate the changed root variables by setting the true mean change as two, and the remaining variables according to the Bayesian network. In the algorithm, we set the post-change mean as the smallest shift magnitude of interest, θ_{k,1} = 1, and set r = 2 as in [LMS15]. In each replicate, the changed data streams and the initial sensor layouts are selected randomly. We evaluate the detection delay D(T) as the average detection delay over all change possibilities.

Table 2 summarizes the detection delay comparisons between the TSSRP algorithm and the TRAS algorithm in the single-change and two-change cases. It shows that the performance of our TSSRP algorithm is better than that of the baseline TRAS algorithm in this hot forming process.

Table 2: Comparisons of the detection delay between the TSSRP algorithm and the TRAS algorithm under different numbers of changed root variables in the hot forming process example

The number of changes    1             2
TSSRP(U[0,1])            6.476(0.091)  5.286(0.065)
TRAS(∆ = 0.01)           8.405(0.134)  7.385(0.113)
TRAS(∆ = 0.1)            8.172(0.129)  7.092(0.116)
TRAS(∆ = 0.5)            9.937(0.146)  8.776(0.141)

We are also interested in empirically studying how the sensor layouts update over time under the in-control and out-of-control states, to validate Theorem 3 and Theorem 4. We summarize the average percentages of each data stream being observed under the two states in Table 3, based on 1000 replicates. Specifically, the percentage of being observed is the proportion of time steps at which a given data stream is sampled. Here, the out-of-control state is when two of the root variables change at the very beginning. The simulations further confirm that our TSSRP algorithm works similarly to random sampling
when the system is in control, and greedily selects the changed data streams when the system is out of control.

Table 3: Sensor layout distribution under the in-control and out-of-control states in the hot forming process example (percentage of each variable X_1, ..., X_5 being observed).

Processing high-velocity streams of high-dimensional data in resource-constrained environments is a big challenge. In this paper, we propose a bandit change-point detection approach to adaptively sample useful local components and determine a global stopping time for real-time monitoring of high-dimensional streaming data. Our proposed algorithm, termed the Thompson-Sampling Shiryaev-Roberts-Pollak (TSSRP) algorithm, can balance between exploiting those observed local components that maximize the immediate detection performance and exploring local components that have not been monitored but might accumulate new information to improve future detection performance. Our numerical simulations and case studies show that the TSSRP algorithm can significantly reduce the detection delay compared to existing methods.

This work can be extended in several directions. First, based on the numerical simulation studies, we conjecture that our proposed TSSRP algorithm is first-order asymptotically optimal under a general setting. Still, it remains an open problem to prove it, as it is highly non-trivial to analyze the expected detection delay of the proposed method. Second, instead of a fixed number of active sensors, one could consider changing the number of active sensors per time step, and increase the number of sensors if a change likely occurs. Third, it is also interesting to find an optimal value of the number of active sensors that can adaptively adjust to make the best use of the resources. Finally, one could extend our method to the situation when the data streams have a spatial or temporal correlation structure.
Appendix A Proof for Theorem 1
Proof.
Let us first investigate how to update the posterior probability Π_{k,t}. There are two subcases, depending on whether we take an observation from the k-th data stream or not. In the first case, when we do not take observations from the k-th stream, i.e., when δ_{k,t} = 0, we have the following recursive formula for the posterior probability Π_{k,t}:

Π_{k,t} = P(ν_k ≤ t | X*_{k,1}, ..., X*_{k,t−1})
        = P(ν_k ≤ t−1 | X*_{k,1}, ..., X*_{k,t−1}) + P(ν_k = t | ν_k > t−1) P(ν_k > t−1 | X*_{k,1}, ..., X*_{k,t−1})
        = Π_{k,t−1} + (1 − Π_{k,t−1}) p.

The second case is when we take observations from the k-th stream, i.e., when δ_{k,t} = 1. By Bayes' rule, we have the following recursive formula:

Π_{k,t} = P(ν_k ≤ t | X*_{k,1}, ..., X*_{k,t−1}, X_{k,t})
        = [P(ν_k ≤ t−1, X*_{k,1}, ..., X*_{k,t−1}, X_{k,t}) + P(ν_k = t, X*_{k,1}, ..., X*_{k,t−1}, X_{k,t})] / P(X*_{k,1}, ..., X*_{k,t−1}, X_{k,t})
        = [f_{θ_{k,1}}(X_{k,t}) Π_{k,t−1} + (1 − Π_{k,t−1}) p f_{θ_{k,1}}(X_{k,t})] / [f_{θ_{k,1}}(X_{k,t}) Π_{k,t−1} + (1 − Π_{k,t−1}) p f_{θ_{k,1}}(X_{k,t}) + (1 − Π_{k,t−1})(1 − p) f_{θ_{k,0}}(X_{k,t})],

where we use the fact that

P(X*_{k,1}, ..., X*_{k,t−1}, X_{k,t}) = P(X_{k,t}, ν_k < t | X*_{k,1}, ..., X*_{k,t−1}) + P(X_{k,t}, ν_k = t | X*_{k,1}, ..., X*_{k,t−1}) + P(X_{k,t}, ν_k > t | X*_{k,1}, ..., X*_{k,t−1})
= f_{θ_{k,1}}(X_{k,t}) Π_{k,t−1} + (1 − Π_{k,t−1}) p f_{θ_{k,1}}(X_{k,t}) + (1 − Π_{k,t−1})(1 − p) f_{θ_{k,0}}(X_{k,t}).

Let R*_{p,k,t} denote Π_{k,t} / (p (1 − Π_{k,t})). Then the recursive formula for R*_{p,k,t} is as follows:

R*_{p,k,t} = [f_{θ_{k,1}}(X_{k,t}) / ((1 − p) f_{θ_{k,0}}(X_{k,t}))] (R*_{p,k,t−1} + 1),  if δ_{k,t} = 1,
R*_{p,k,t} = [1 / (1 − p)] (R*_{p,k,t−1} + 1),                                       if δ_{k,t} = 0.

We consider the limiting Bayesian approach by letting p → 0, and we arrive at the updating rule for R*_{k,t} = lim_{p→0} R*_{p,k,t} as in (6) for R_{k,t}, except that the initial value R*_{k,t=0} is drawn from the prior distribution G whereas the initial value R_{k,t=0} is 0.
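The equivalence between the Π-recursion and the R-recursion, via R*_{p,k,t} = Π_{k,t}/(p(1 − Π_{k,t})), can be verified numerically. The sketch below is our own illustration under Gaussian pre- and post-change densities: it runs both recursions side by side on one simulated in-control stream and checks that they agree at every step.

```python
import math
import random

random.seed(0)
p, theta = 0.2, 1.0
f0 = lambda x: math.exp(-x * x / 2.0)             # N(0,1) density (constants cancel)
f1 = lambda x: math.exp(-(x - theta) ** 2 / 2.0)  # N(theta,1) density

Pi, R = 0.0, 0.0       # Pi_{k,0} = 0  corresponds to  R_{p,k,0} = 0
for t in range(100):
    observe = random.random() < 0.5               # the sampling indicator delta_{k,t}
    if not observe:
        Pi = Pi + (1.0 - Pi) * p
        R = (R + 1.0) / (1.0 - p)
    else:
        x = random.gauss(0.0, 1.0)                # in-control observation
        num = f1(x) * Pi + (1.0 - Pi) * p * f1(x)
        den = num + (1.0 - Pi) * (1.0 - p) * f0(x)
        Pi = num / den
        R = f1(x) / ((1.0 - p) * f0(x)) * (R + 1.0)
    # both recursions track the same quantity: R = Pi / (p (1 - Pi))
    assert abs(R - Pi / (p * (1.0 - Pi))) < 1e-6 * max(1.0, R)
```

The unnormalized densities suffice because the common Gaussian constant cancels in every ratio.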
By using (6), a proof by induction shows that R*_{k,t} = R_{k,t} + L_{k,t} R*_{k,t=0} for any time t = 1, 2, ···. Since R̃_{k,t} has the same prior distribution G as R*_{k,t=0}, relation (8) holds, completing the proof.

Appendix B Proof for Theorem 2
Proof.
A high-level argument is as follows. In the derivation of the ARL, we observe that Σ_{k=1}^K R_{k,t} − Kt is a martingale under the pre-change hypothesis. Moreover, we have Σ_{k=1}^r R_{(k),t} ≤ Σ_{k=1}^K R_{k,t}. By the optional sampling theorem, for the stopping time T defined in (11), we show that E_∞[Σ_{k=1}^K R_{k,T}/K] = E_∞[T], resulting in the lower bound. For the upper bound, we define a new stopping time based on any fixed data stream k: T' = inf{t ≥ 1 : R_{k,t} ≥ A}. Then we have E_∞[T] ≤ E_∞[T'] = O(A), where the last equality follows from Theorem 1 in [Pol87].

Below are the detailed arguments. We first prove the second part of the theorem. For any fixed k, 1 ≤ k ≤ K, we have R_{k,t} ≤ Σ_{k=1}^r R_{(k),t}. We define a new stopping time based on any fixed data stream k: T' = inf{t ≥ 1 : R_{k,t} ≥ A}. Then we have E_∞[T] ≤ E_∞[T'] = O(A), where the last equality follows from Theorem 1 in [Pol87].

To provide a lower bound on E_∞[T], we first show that Σ_{k=1}^K R_{k,t} − Kt is a martingale under the pre-change hypothesis:

E_∞[Σ_{k=1}^K R_{k,t+1} − K(t+1) | X*_{k,1}, ..., X*_{k,t}]
= E_∞[Σ_{k=1}^K (R_{k,t} + 1) exp(δ_{k,t+1} log(f_{θ_{k,1}}(X_{k,t+1}) / f_{θ_{k,0}}(X_{k,t+1}))) − K(t+1) | X*_{k,1}, ..., X*_{k,t}]
= Σ_{k=1}^K (R_{k,t} + 1) − K(t+1)
= Σ_{k=1}^K R_{k,t} − Kt,

where the middle equality holds because each likelihood-ratio factor has unit expectation under the pre-change distribution.
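As a sanity check, the identity just derived, E_∞[Σ_k R_{k,t}] = Kt regardless of the sampling rule, can be verified by a quick Monte Carlo under the pre-change (standard normal) distribution. The code below is our own illustrative sketch, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
K, q, T, reps, theta = 20, 5, 30, 2000, 0.5

totals = np.zeros(reps)
for m in range(reps):
    R = np.zeros(K)
    for t in range(T):
        obs = rng.choice(K, size=q, replace=False)    # any sampling rule works
        R += 1.0                                      # every statistic gains +1
        x = rng.normal(size=q)                        # pre-change data N(0,1)
        R[obs] *= np.exp(theta * x - theta**2 / 2.0)  # mean-one likelihood ratio
    totals[m] = R.sum()

# E_inf[exp(theta X - theta^2/2)] = 1 for X ~ N(0,1), so sum_k R_{k,t}
# has expectation K t at every t; compare the estimate with K*T.
print(totals.mean(), K * T)
```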
Hence, E_∞[Σ_{k=1}^K R_{k,T} − KT] = 0 for any stopping time T that is bounded above. Since |Σ_{k=1}^K R_{k,t}| < KA/r on {T > t}, we have lim inf_{t→∞} ∫_{{T>t}} |Σ_{k=1}^K R_{k,t} − Kt| dP_∞ = 0. Therefore, the martingale optional sampling theorem applies to yield E_∞[Σ_{k=1}^K R_{k,T} − KT] = 0. Observing that Σ_{k=1}^r R_{(k),t} ≤ Σ_{k=1}^K R_{k,t}, we have

E_∞[KT] = E_∞[Σ_{k=1}^K R_{k,T}] ≥ E_∞[Σ_{k=1}^r R_{(k),T}] ≥ A,    (13)

completing the proof.

Appendix C Proofs for Theorem 3 and Theorem 4

To prove Theorem 3 and Theorem 4, we first present a simple lemma and its proof.
Lemma 1.
Suppose that there exists a set U such that for all k ∈ U, there exists a t_0 > 0 such that for all t > t_0, we have P(k ∈ S_t) = 0. Then R*_{k',t} ≥ R*_{k,t} for all k' ∈ [K]\U, k ∈ U, and for all t > t_0.

Proof. We prove by contradiction. Suppose there exist a t > t_0, a k ∈ U, and a k' ∈ [K]\U such that R*_{k,t} > R*_{k',t}. We consider two cases. If k' ∉ S_t, then the sensors would first be deployed to the k-th data stream before the k'-th data stream, because the increments of the two processes are the same when they are unobserved. If k' ∈ S_t, then it indicates that R*_{k',t} > R*_{k,t}, which leads to a contradiction.

We will use Lemma 1 to prove Theorem 3 and Theorem 4.

Proof of Theorem 3.
Suppose that there exists a set U such that for each k ∈ U, there exists a t_0 > 0 such that for all t' > t_0, we have P(k ∈ S_{t'}) = 0. For any data stream in the set [K]\U, we only consider the case when the data is observed, because otherwise the increments are the same as those of any data stream in the set U. For each k ∈ U, the increment of the local statistic on the logarithmic scale is nearly zero, while the increment for each k' ∈ [K]\U is nearly Σ_{t=t_0}^T log(f_{θ_{k,1}}(X_{k,t}) / f_{θ_{k,0}}(X_{k,t})). Let Y_{k,t} = log(f_{θ_{k,1}}(X_{k,t}) / f_{θ_{k,0}}(X_{k,t})). It remains to prove that P_{f_{θ_{k,0}}}(inf_{T<∞} Σ_{t=t_0}^T Y_{k,t} ≥ 0) = 0. Define Z_n = Σ_{t=t_0}^{t_0+n} Y_{k,t} and τ_− = inf{n : Z_n ≤ 0}. Noticing that Y_{k,t} has negative drift, since E_{f_{θ_{k,0}}}[log(f_{θ_{k,1}}(X_{k,t}) / f_{θ_{k,0}}(X_{k,t}))] < 0, we have P_{f_{θ_{k,0}}}(inf_{n<∞} Z_n ≥ 0) = P_{f_{θ_{k,0}}}(τ_− = ∞) = 0. This means that R*_{k,t} for any k ∈ U will eventually become greater than R*_{k',t} for any k' ∈ [K]\U, which contradicts Lemma 1.

Proof of Theorem 4.
Suppose we observe the k-th data stream at time t_0 and it is out of control. Let Y_{k,t} = log(f_{θ_{k,1}}(X_{k,t}) / f_{θ_{k,0}}(X_{k,t})) and Z_n = Σ_{t=t_0}^{t_0+n} Y_{k,t} as in the proof of Theorem 3. Under the out-of-control state, we have E_{f_{θ_{k,1}}}[Y_{k,t}] > 0. By Corollary 8.44 in [Sie13], for τ_− = inf{n : Z_n ≤ 0}, we have P(τ_− = ∞) > 0, which implies P_{f_{θ_{k,1}}}(inf_{n<∞} Z_n ≥ 0) > 0. On this event, the local statistic of the k-th data stream is always greater than the local statistics of the other unobserved data streams. Therefore, we have P(k ∈ S_t, ∀ t > t_0 | k ∈ S_{t_0}) > 0.

References

[AG12] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In
Conference on Learning Theory , pages 39–1, 2012.[AG13] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual banditswith linear payoffs. In
International Conference on Machine Learning, pages 127–135, 2013. [AVW87] Venkatachalam Anantharam, Pravin Varaiya, and Jean Walrand. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays, part I: IID rewards.
IEEE Transactions on Automatic Control ,32(11):968–976, 1987.[Bas99] Tim Bass. Multisensor data fusion for next generation distributed intrusiondetection systems. In
Proceedings of the IRIS National Symposium on Sensorand Data Fusion , volume 24, pages 24–27. Citeseer, 1999.[BCB +
12] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012. [BN+
93] Michèle Basseville, Igor V. Nikiforov, et al.
Detection of abrupt changes: theoryand application , volume 104. Prentice Hall Englewood Cliffs, 1993.[CC +
19] Lynna Chu, Hao Chen, et al. Asymptotic distribution-free change-point de-tection for multivariate and non-euclidean data.
The Annals of Statistics ,47(1):382–414, 2019.[CF15] Haeran Cho and Piotr Fryzlewicz. Multiple-change-point detection for highdimensional time series via sparsified binary segmentation.
Journal of the RoyalStatistical Society: Series B (Statistical Methodology) , 77(2):475–507, 2015.[Cha17] Hock Peng Chan. Optimal sequential detection in multi-stream data.
TheAnnals of Statistics , 45(6):2736–2763, 2017.[CL11] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling.In
Advances in neural information processing systems , pages 2249–2257, 2011.[CZKX18] Yang Cao, Wen Zheng, Branislav Kveton, and Yao Xie. Nearly optimal adap-tive procedure for piecewise-stationary bandit: a change-point detection ap-proach. arXiv preprint arXiv:1802.03692 , 2018.[DEK +
06] Yu Ding, Elsayed A Elsayed, Soundar Kumara, J-C Lu, Feng Niu, and JianjunShi. Distributed sensing for quality and productivity improvements.
IEEETransactions on Automation Science and Engineering , 3(4):344–359, 2006.[GGW11] John Gittins, Kevin Glazebrook, and Richard Weber.
Multi-armed bandit allocation indices. John Wiley & Sons, 2011. [KCG16] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models.
The Journal of MachineLearning Research , 17(1):1–42, 2016.[Lai95] Tze Leung Lai. Sequential changepoint detection in quality control and dynam-ical systems.
Journal of the Royal Statistical Society. Series B (Methodological), pages 613–658, 1995. [Lai98] Tze Leung Lai. Information bounds and quick detection of parameter changes in stochastic systems.
IEEE Transactions on Information Theory , 44(7):2917–2929, 1998.[Li19] Jun Li. A two-stage online monitoring procedure for high-dimensional datastreams.
Journal of Quality Technology , 51(4):392–406, 2019.[Li20] Jun Li. Efficient global monitoring statistics for high-dimensional data.
Qualityand Reliability Engineering International , 36(1):18–32, 2020.[LJ10] Jing Li and Jionghua Jin. Optimal sensor allocation by integrating causalmodels and set-covering algorithms.
IIE Transactions , 42(8):564–576, 2010.[LMS15] Kaibo Liu, Yajun Mei, and Jianjun Shi. An adaptive sampling strategy for on-line high-dimensional process monitoring.
Technometrics , 57(3):305–319, 2015.[Lor71] Gary Lorden. Procedures for reacting to a change in distribution.
The Annalsof Mathematical Statistics , 42(6):1897–1908, 1971.[LR85] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive alloca-tion rules.
Advances in applied mathematics , 6(1):4–22, 1985.[LXTP20] Wendong Li, Dongdong Xiang, Fugee Tsung, and Xiaolong Pu. A diagnosticprocedure for high-dimensional data streams via missed discovery rate control.
Technometrics , 62(1):84–100, 2020.[LZM19] Kun Liu, Ruizhi Zhang, and Yajun Mei. Scalable sum-shrinkage schemes fordistributed monitoring large-scale data streams.
Statistica Sinica , 29:1–22,2019.[Mei10] Yajun Mei. Efficient scalable schemes for monitoring a large number of datastreams.
Biometrika , 97(2):419–433, 2010.[Mei11] Yajun Mei. Quickest detection in censoring sensor networks. In
Informa-tion Theory Proceedings (ISIT), 2011 IEEE International Symposium on , pages2148–2152. IEEE, 2011.[PH08] H Vincent Poor and Olympia Hadjiliadis.
Quickest detection . Cambridge Uni-versity Press, 2008.[Pol85] Moshe Pollak. Optimal detection of a change in distribution.
The Annals ofStatistics , pages 206–227, 1985.[Pol87] Moshe Pollak. Average run lengths of an optimal method of detecting a changein distribution.
The Annals of Statistics , 15(2):749–779, 1987.[PT99] Dimitrios G Pandelis and Demosthenis Teneketzis. On the optimality of thegittins index rule for multi-armed bandits with multiple plays.
Mathematical Methods of Operations Research, 50(3):449–461, 1999. [Rob52] Herbert Robbins. Some aspects of the sequential design of experiments.
Bulletinof the American Mathematical Society , 58(5):527–535, 1952.[Rob66] SW Roberts. A comparison of some control chart procedures.
Technometrics ,8(3):411–430, 1966.[Rob85] Herbert Robbins. Some aspects of the sequential design of experiments. In
Herbert Robbins Selected Papers , pages 169–177. Springer, 1985.[Sco10] Steven L Scott. A modern bayesian look at the multi-armed bandit.
AppliedStochastic Models in Business and Industry , 26(6):639–658, 2010.[Shi63] Albert N Shiryaev. On optimum methods in quickest detection problems.
The-ory of Probability & Its Applications , 8(1):22–46, 1963.[Sie13] David Siegmund.
Sequential analysis: tests and confidence intervals . SpringerScience & Business Media, 2013.[Tho33] William R Thompson. On the likelihood that one unknown probability exceedsanother in view of the evidence of two samples.
Biometrika , 25(3/4):285–294,1933.[TNB14] Alexander Tartakovsky, Igor Nikiforov, and Michele Basseville.
Sequential anal-ysis: Hypothesis testing and changepoint detection . Chapman and Hall/CRC,2014.[VBC +
14] Bimal Viswanath, M. Ahmad Bashir, Mark Crovella, Saikat Guha, Krishna P. Gummadi, Balachander Krishnamurthy, and Alan Mislove. Towards detecting anomalous user behavior in online social networks. In USENIX Security Symposium (USENIX Security 14), pages 223–238, 2014. [WM15] Yuan Wang and Yajun Mei. Large-scale multi-stream quickest change detection via shrinkage post-change estimation.
IEEE Transactions on InformationTheory , 61(12):6926–6938, 2015.[WXTL18] Andi Wang, Xiaochen Xian, Fugee Tsung, and Kaibo Liu. A spatial-adaptivesampling procedure for online monitoring of big data streams.
Journal of Qual-ity Technology , 50(4):329–343, 2018.[XHW13] Yao Xie, Jiaji Huang, and Rebecca Willett. Change-point detection for high-dimensional time series with missing data.
IEEE Journal of Selected Topics inSignal Processing , 7(1):12–27, 2013.[XS13] Yao Xie and David Siegmund. Sequential multi-sensor change-point detection.In
Information Theory and Applications Workshop (ITA), 2013, pages 1–20. IEEE, 2013. [XWL18] Xiaochen Xian, Andi Wang, and Kaibo Liu. A nonparametric adaptive sampling strategy for online monitoring of big data streams.
Technometrics ,60(1):14–25, 2018.[XZBL19] Xiaochen Xian, Chen Zhang, Scott Bonk, and Kaibo Liu. Online monitoringof big data streams: A rank-based sampling algorithm by data augmentation.
Journal of Quality Technology , pages 1–19, 2019.[YSK15] Shihao Yang, Mauricio Santillana, and Samuel C Kou. Accurate estimationof influenza epidemics using google search data via argo.
Proceedings of theNational Academy of Sciences , 112(47):14473–14478, 2015.[Zha19] Qing Zhao. Multi-armed bandits: Theory and applications to online learning innetworks.
Synthesis Lectures on Communication Networks , 12(1):1–165, 2019.[ZQ09] Changliang Zou and Peihua Qiu. Multivariate statistical process control usinglasso.
Journal of the American Statistical Association , 104(488):1586–1596,2009.[ZS12] Nancy R Zhang and David O Siegmund. Model selection for high-dimensional,multi-sequence change-point problems.
Statistica Sinica , pages 1507–1538,2012.[ZWZJ15] Changliang Zou, Zhaojun Wang, Xuemin Zi, and Wei Jiang. An efficient on-line monitoring method for high-dimensional data streams.
Technometrics, 57(3):374–387, 2015.
SUPPLEMENTARY MATERIAL
Here we present supplementary materials that help better understand our proposed method. We split them into three sections: an additional case study, additional simulation studies, and more global decision policies.
1. Additional Case Study: Solar Flare Data
We apply the proposed method to a real dataset collected by the Solar Dynamics Observatory, which illustrates the abrupt emergence of a solar flare. The Solar Dynamics Observatory generates 1.5 terabytes of data daily [XHW13]. Traditional image detection methods that require fully observable data cannot process such high-dimensional and high-velocity datasets on platforms with limited processing power. Besides, when a solar flare occurs, we are unsure about its location and the number of affected pixels. Considering the transmission bandwidth constraint and the processing power limit, we wish to leverage selected partial data to detect solar flare events. We apply the proposed TSSRP algorithm to this dataset to demonstrate its computational efficiency and monitoring capability.

The data is publicly available at http://nislab.ee.duke.edu/MOUSSE/index.html. It contains a total of n = 300 frames in the video. Each frame is of size 232 × 292 pixels, which results in D = 67744-dimensional streaming data. According to the original video, there are two obvious transient flares: one occurred at around t = 187, and the other at around t = 216. Figure 4 (a and c) shows snapshots at t = 210, before the second solar flare emerges, and Figure 4 (b and d) shows snapshots at t = 223, when the second solar flare is brightest. From the figures, we can see that the solar flare signal is sparse. Thus, it is reasonable to monitor this process by sampling only a small fraction of the data.

In this study, we first preprocess the data with the MOUSSE algorithm proposed by [XHW13] to obtain the residual data and remove the background. We verified the normality assumption for the residuals through the Kolmogorov-Smirnov test. Since there is no change-point within the first 100 frames, we treat the first 100 frames as historical observations. Then we standardize the residual data so that the first 100 frames have mean zero and standard deviation one. The threshold is determined by 1000 Monte Carlo simulations of the standard normal distribution, and the IC-ARL is set to be 2500.

We set r = 40 and q = 2000 for data monitoring, with the post-change mean θ_{k,1} set to a value below one. Figure 3 plots the monitoring statistic Σ_{k=1}^r R_{(k),t} against t, the index of time, based on the top-40 local statistics. The detection threshold is exp(10). From Figure 3, the two solar flares are clearly visible: the monitoring statistic goes up sharply when a solar flare occurs and goes down quickly when it ends. We detect the first solar flare at time t = 192 and the second one at time t = 217, compared to the results of the TRAS algorithm, where the two solar flares are detected at time t = 190 and time t = 221 in [LMS15]. Our algorithm is also comparable to the algorithm of [XHW13], which requires full observations.
Their algorithm detects the two solar flares at time t = 191 and time t = 217.

Figure 3: Monitoring statistics against time in the solar flare detection example.

Figure 4: Solar flare detection: the snapshots of the video for raw data and residual data at frame t = 210, when there is no solar flare (a and c), and at frame t = 223, when the solar flare reaches its peak (b and d). Panels: (a) raw data at t = 210; (b) raw data at t = 223; (c) residual at t = 210; (d) residual at t = 223.
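Throughout these studies, the stopping threshold is calibrated by Monte Carlo under the in-control distribution (here, 1000 repetitions with IC-ARL 2500). A generic sketch of such a calibration is below; the bisection routine, parameter values, and function names are our own illustration, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(3)

def run_length(logA, K=20, q=5, r=5, theta=1.0, t_max=2000):
    """Run length until the top-r sum of local Shiryaev-Roberts statistics
    exceeds exp(logA), simulated under the in-control N(0,1) distribution."""
    R = np.zeros(K)
    for t in range(1, t_max + 1):
        obs = rng.choice(K, size=q, replace=False)
        R += 1.0
        x = rng.normal(size=q)
        R[obs] *= np.exp(theta * x - theta**2 / 2.0)
        if np.log(np.sort(R)[-r:].sum()) > logA:
            return t
    return t_max

def calibrate(target_arl=500, n_mc=100, lo=1.0, hi=10.0):
    """Bisection on the log-threshold so the in-control ARL hits the target."""
    for _ in range(8):
        mid = 0.5 * (lo + hi)
        arl = np.mean([run_length(mid) for _ in range(n_mc)])
        lo, hi = (mid, hi) if arl < target_arl else (lo, mid)
    return 0.5 * (lo + hi)

# Example usage (slow): logA = calibrate(target_arl=500); threshold A = exp(logA)
```

Because the ARL estimate is itself noisy, more Monte Carlo repetitions are needed near the final bisection steps in practice.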
2. Additional Simulation Study when r = 3

Here we present our proposed TSSRP algorithm's properties when r = 3, i.e., when raising a global alarm based on the sum of the largest r = 3 local Shiryaev-Roberts statistics R_{k,t}.

Table 4: Comparisons of the detection delay between the TSSRP algorithm and the TRAS algorithm under r = 3 when the data is independent multivariate Gaussian distributed

The number of changes    1              2              3              4              5
TSSRP(U[0,1])            12.975(0.198)  9.669(0.118)   8.534(0.102)   7.672(0.083)   7.150(0.075)
TRAS(∆ = 0.03)           24.442(0.431)  17.15(0.271)   14.046(0.228)  12.126(0.185)  10.437(0.155)
TRAS(∆ = 0.05)           23.85(0.383)   17.28(0.255)   14.007(0.193)  12.389(0.174)  10.932(0.145)
TRAS(∆ = 0.1)            29.078(0.310)  21.780(0.216)  18.602(0.169)  16.452(0.148)  14.782(0.136)

The number of changes    6              7              8              9              10
TSSRP(U[0,1])            6.624(0.063)   6.312(0.059)   6.144(0.055)   5.875(0.050)   5.871(0.046)
TRAS(∆ = 0.03)           9.607(0.140)   8.945(0.127)   8.296(0.105)   7.972(0.097)   7.361(0.079)
TRAS(∆ = 0.05)           13.483(0.223)  9.207(0.110)   8.695(0.098)   8.136(0.089)   7.81(0.078)
TRAS(∆ = 0.1)            13.809(0.122)  12.944(0.110)  12.180(0.100)  11.718(0.095)  11.320(0.092)
3. More Global Decision Policies