Online detection of cascading change-points
ONLINE DETECTION OF A CASCADE OF MULTIPLE CHANGE-POINTS
Rui Zhang (a), Rui Yao (b), Yao Xie (a), Feng Qiu (b)
(a) School of Industrial and Systems Engineering (ISyE), Georgia Institute of Technology, Atlanta, GA
(b) Energy Systems Division, Argonne National Laboratory, Lemont, IL
ABSTRACT
We propose a novel method for the online detection of a cascading failure in a network using sequential measurement data. Our goal is to detect the failure as quickly as possible after it occurs. To achieve this goal, we propose a temporal diffusion network model that represents the dynamics of the potential changes. In cascading failures, once the failure affects a node, its measurements change from a pre-change distribution to an unknown post-change distribution. We develop sequential generalized likelihood ratio statistics that perform joint detection and estimation. Numerical experiments show that our method outperforms existing methods.
Index Terms: independent cascade model, likelihood ratio test, change-point detection
1. INTRODUCTION
Cascading failure is an important topic in the study of power systems. A power system consists of a large number of buses connected by lines, so a failure or anomaly in one component weakens the whole system, makes other components more vulnerable, and increases the risk of other components malfunctioning. Cascading failure is the process caused by the initial failure of one component, which propagates to cause the consecutive failure of other components [1]. Because cascading failures can lead to major blackouts, it is important to deploy fast detection and effective mitigation.

In this paper, we employ online monitoring to detect cascading failures in a power system using measurements from sensor networks. Specifically, we are interested in (1) the propagation of a failure in the power network, and (2) the change in the measurements' distribution after failure. In some cases, the failure of a component cannot be directly monitored, and the measurements can be leveraged to detect the changes in the system and infer the failure. Wiltshire and Dobson [2, 3] monitor measurements that can reflect the state of an area of the power system, while Hines [4] shows that autocorrelation in the frequency signal increases as the system approaches a critical slowdown and applies this property to detect a failure before it occurs. Rigatos [5] applies a neural fuzzy network to detect changes in measurements and identify the type of anomaly. These studies demonstrate the value of measurements in system monitoring and event detection, but there has been little work on using measurements for the online detection of failures.

The propagation of failures in a power system can be seen as an example of diffusion in a large network [6]. Diffusion networks have been widely studied in disease spreading [7], social networks [8], emergency detection, and other fields. Most of these works focus on inferring the latent network by observing multiple cascade events.
For example, Dobson and Ren [9, 10] use failure data in power systems to estimate the parameters of the propagation models. To perform online detection, we assume that the underlying network and the propagation model are given, and we infer the propagation of the failure in the network using the measurements.

A failure in the power system can be modeled as a change-point at which the distribution of measurements changes. Chen [11] examines a general setting with multiple change-points in multiple data streams, but does not utilize the graph topology of the data streams. Several works consider the propagation of change-points through the network. Raghavan [12] considers change-point propagation in a line-type network, Kurt [13] assumes that all possible propagation paths are equally likely, and Zou [14] partitions the network into several complete graphs to guarantee a result with connected change-points.

In our work, based on a generic model for change-point propagation, we propose a novel statistical approach for the online detection of cascading failures based on measurement data. Our main contributions are as follows. We model the latent failure propagation with the independent cascade model [15], a point process that propagates across the network through the edges. The independent cascade model can capture the property that the risk of a component failing increases as more of the components around it fail. Hoffmann [16] applied this model to infer the underlying graph. We focus instead on inferring the propagation path and on online monitoring, assuming that the graph is known and that the propagation duration follows an exponential model. Inferring the propagation path is of great use in improving resilience, forecasting, and failure localization. Based on our model, we propose a change-point detection algorithm that is pruned to be fast enough to monitor large power networks.
Generalized likelihood ratio statistics are applied, since we assume that the post-change distribution of the measurements and the true failure time (change-point) of each node are unknown. To the best of our knowledge, we are the first to apply the independent cascade model to change-point detection.
2. PROBLEM SETUP
Consider a graph $G = (\mathcal{V}, \mathcal{E})$, formed by a set of nodes $\mathcal{V} = \{1, 2, \ldots, N\}$ and a set of edges $\mathcal{E} \subset \mathcal{V} \times \mathcal{V}$. $\mathcal{V}$ is the set of components in the power network, and $\mathcal{E}$ can be constructed according to the physical network or the interaction graph [17, 18]. Assume $G$ is undirected, i.e., $(i, j) \in \mathcal{E} \Leftrightarrow (j, i) \in \mathcal{E}$. At each time instance, there is a measurement on each node: $X_{i,t} \in \mathbb{R}$ is the measurement of the $i$th node at time $t$, where $t = 1, \ldots, T$. $\tau_i^* \in \mathbb{R}_{>0} \cup \{\infty\}$ is the true failure time of the $i$th node, and $\tau_i^* = \infty$ means there is no failure on the $i$th node. $\tau^* = (\tau_1^*, \tau_2^*, \ldots, \tau_N^*)$ is the vector of all true failure times, which is unknown. We assume that the distributions of measurements change after the failure; therefore failures are the change-points.

Failure (change-point) propagation model:
In this work, we consider the propagation process of the network's failures, which is modeled by the independent cascade model. In this model, whenever a failure occurs on a node, it increases the neighboring nodes' tendency to fail. Mathematically, we define the influence of node $i$ on node $j$ as $\alpha_{i,j}$, which is positive. We do not assume a distribution for the first failure, i.e., the time of the first failure is adversarial. After the occurrence of the first failure, the distribution of the next failure is determined by the conditional hazard rate (intensity function) $\lambda_i(t)$:

$$\lambda_i(t) = \begin{cases} \sum_{j : (j,i) \in \mathcal{E},\ \tau_j^* < t} \alpha_{j,i} & \text{if } t \le \tau_i^*, \\ 0 & \text{if } t > \tau_i^*. \end{cases} \tag{1}$$

Define the history (filtration) up to time $t$ as $\mathcal{H}(t) = \{\tau_i^* : \tau_i^* \le t,\ i = 1, \ldots, N\}$. Then, given $\mathcal{H}(t)$, the conditional intensity function of the $i$th node is defined as:

$$\lambda_i(t) \triangleq \lim_{\Delta t \to 0} \frac{P\{\tau_i^* \in [t, t + \Delta t) \mid \mathcal{H}(t),\ \tau_i^* > t\}}{\Delta t}.$$

The distribution of $\tau^*$ is uniquely defined by the conditional hazard rate [15].

Fig. 1: Example of failure diffusion. Red circles: failed nodes. Yellow solid lines: possible paths for failure diffusion. Yellow dashed lines: paths with failed nodes at both ends. Yellow circles: nodes affected by failed neighbors. In this example, the failure starts at node one. Then all the neighbors of node one are affected, and node two and node four fail sequentially. As the failure diffuses, node three is surrounded by more and more failed nodes, and its hazard rate continues to increase.

Measurement model: We can use different measurements to detect a failure in the power network. Wiltshire [2] uses the difference between the true and the estimated voltage angle, Dobson [3] uses the area angle to monitor the outage in the power network, and Hines [4] applies the autocorrelation of the phase angle to detect the critical transition.
To simplify the study, we assume that the measurements at each node are independent, conditioned on the failure time. Before a change, they follow an i.i.d. standard normal distribution, and after a change they follow an i.i.d. normal distribution with an unknown mean and variance. That is,

$$X_{i,t} \overset{\text{i.i.d.}}{\sim} \begin{cases} \mathcal{N}(0, 1) & \text{if } t < \tau_i^*, \\ \mathcal{N}(\mu_i, \sigma_i^2) & \text{if } t \ge \tau_i^*. \end{cases} \tag{2}$$

Notice that we can always use a certain length of data as a warm start to compute the sample mean and variance of the pre-change distribution. Therefore we assume that the pre-change distribution is known and can be standardized.

Likelihood function: According to the above models, Proposition 2.1 gives the likelihood function in a time window $[0, T]$, given the measurements and failure times.

Proposition 2.1 According to the model defined by Equation (1) and Equation (2), the likelihood function for a given $\tau^*$ and $X_{i,t}$s in $[0, T]$ can be expressed as follows:

$$f(\tau_i^*, X_{i,t},\ \forall i = 1, \ldots, N,\ t = 1, \ldots, T) = \underbrace{\prod_{i=2}^{N} f\big(\tau_{(i)}^* \mid \tau_{(1)}^*, \ldots, \tau_{(i-1)}^*\big)}_{\text{failure propagation model}} \cdot \underbrace{\prod_{i=1}^{N} f\big(X_{i,t} \mid \tau_i^*\big)}_{\text{measurement model}} \tag{3}$$

where $\tau_{(i)}^*$ denotes the $i$th smallest failure time.

Our method can be applied in a more general setting. For the failure propagation model, one can choose different hazard functions; three models are provided in [15]. In our study, we chose the exponential model. For the measurement model, one can choose a different distribution, and the measurements can be high-dimensional.

3. TEST STATISTICS

We employ the same hypothesis test as in [14]. An alarm is raised when there are at least $\eta$ change-points. To perform the online detection, at each time instance $T$ we consider the following hypothesis test:

$$H_{0,\eta,T} : \tau_{(\eta)}^* > T, \qquad H_{1,\eta,T} : 0 \le \tau_{(\eta)}^* \le T.$$

Our test statistics are generalized likelihood ratio (GLR) statistics, since we assume that the mean and variance of the post-change distribution are unknown.
As shown in Equation (3), given the failure times and the measurements, the likelihood can be written in two parts: the likelihood of the failure propagation model and the likelihood of the measurement model. When monitoring the power network, the interval between two time instances is usually short; therefore, when searching for the possible $\tau$, we only search over integer $\tau_i$s.

Log likelihood of failure propagation model: We assume the $\alpha_{j,i}$s are known, since they can be estimated from the topology of the power grid and the power flow. Define $\mathcal{C}(i) = \{j \in \mathcal{V} \mid (j, i) \in \mathcal{E}\}$ to be the set of the $i$th node's neighbors. Given $\tau$, the log likelihood function for $[0, T]$ is shown in Proposition 3.1.

Proposition 3.1 Given $T$, failure times $\tau = (\tau_1, \ldots, \tau_N)$, graph $G$, and parameters $\alpha_{i,j}$ $\forall i, j \in \mathcal{V}$, the log likelihood function of the failure propagation model is:

$$\begin{aligned} \ell_{1,T} &= \log f(\tau_1, \tau_2, \ldots, \tau_N \mid \alpha_{i,j}\ \forall i, j = 1, \ldots, N) \\ &= \sum_{i : \tau_i \le T,\ \tau_i \ne \tau_{(1)}} \Big\{ \log\Big( \sum_{j \in \mathcal{C}(i)} \alpha_{j,i}\, \mathbb{I}(\tau_j < \tau_i) \Big) - \sum_{j \in \mathcal{C}(i)} \alpha_{j,i} (\tau_i - \tau_j)^+ \Big\} - \sum_{i : \tau_i > T} \sum_{j \in \mathcal{C}(i)} \alpha_{j,i} (T - \tau_j)^+, \end{aligned} \tag{4}$$

where $(\cdot)^+ = \max(\cdot, 0)$, and $\mathbb{I}(\cdot)$ is the indicator function.

To reduce the possibility of a false alarm raised by cases with disconnected change-points, we use a probabilistic propagation model; by contrast, Zou [14] reduces such cases by partitioning the graph into several fully connected sub-graphs.

Log likelihood of measurement model: Since we assume that the mean and variance of the post-change distribution are unknown, we estimate the $\mu_i$s and $\sigma_i^2$s by maximum likelihood estimation (MLE): $\hat{\mu}_i$, $\hat{\sigma}_i^2$. Therefore the log likelihood function of the measurements of the $i$th node, given $\tau_i$, is:

$$\begin{aligned} \ell_{2,i,T} &= \log f(X_{i,t}\ \forall t = 1, \ldots, T \mid \tau_i) \\ &= -\frac{1}{2} \Big( \sum_{t=1}^{T \wedge (\tau_i - 1)} X_{i,t}^2 + \sum_{t=\tau_i}^{T} \frac{(X_{i,t} - \hat{\mu}_i)^2}{\hat{\sigma}_i^2} \Big) - \frac{T}{2} \log(2\pi) - (T - \tau_i + 1)^+ \log(\hat{\sigma}_i), \end{aligned} \tag{5}$$

where $a \wedge b = \min(a, b)$.

Since we assume that the measurements at each node are independent given $\tau$, the log likelihood function of all measurements is the sum of the log likelihood functions of the individual nodes, i.e., $\ell_{2,T} = \sum_{i=1}^{N} \ell_{2,i,T}$.

Therefore, given failure times $\tau$ and measurements $X_{i,t}$s, the log likelihood at time $T$ is:

$$\ell_T(\tau, X_{i,t}\ \forall i = 1, \ldots, N,\ t = 1, \ldots, T) = \ell_{1,T} + \ell_{2,T}. \tag{6}$$

Notice that if $\tau_i > T$ for all $i$, Equation (4) equals 0 and $\ell_T$ is the sum of the log likelihoods of the standard normal distribution for the measurements on each node, according to Equation (5).

To perform the hypothesis test between $H_{0,\eta,T}$ and $H_{1,\eta,T}$, we need to search for the $\tau$ that maximizes the log likelihood in Equation (6). Define $U(\eta) = \{\tau : \sum_{i=1}^{N} \mathbb{I}(\tau_i \le T) \ge \eta\}$ and $L(\eta) = \{\tau : \sum_{i=1}^{N} \mathbb{I}(\tau_i \le T) \le \eta - 1\}$. Then, to detect at least $\eta$ change-points, we apply the following GLR test statistic, $\forall \eta = 1, \ldots, N$:

$$S_{\eta,T} = \max_{\tau \in U(\eta)} \ell_T(\tau) - \max_{\tau \in L(\eta)} \ell_T(\tau),$$

and the corresponding stopping time is

$$\Gamma = \inf\{T > 0 : S_{\eta,T} > b\},$$

for some preset threshold $b$.

4. ALGORITHM

In practice, we hope to detect the failure as soon as possible. Therefore we can restrict the search to propagation paths with at most $m$ nodes affected by the failure. However, the computation cost of the maximum likelihood under the alternative hypothesis is still high. For example, for a fully connected graph with $N$ nodes and time $T$, the computation cost is $O\big(T^m N! / ((N - m)!\, m!)\big)$. Therefore, a fast algorithm is needed to apply our detection procedure in practice. Pruning is often used in change-point detection algorithms [19, 20].

Sliding window: Instead of using all the data up to time $T$, we compute the test statistics with the data from $T - L + 1$ to $T$, where $L$ is the length of the sliding window.
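As a rough illustration of the measurement model term and the sliding window, the following Python sketch evaluates the per-node log likelihood of Equation (5) on the last $L$ samples of a single node and maximizes it over integer change times. It is a single-node simplification under stated assumptions, not the full statistic $S_{\eta,T}$: the propagation term of Equation (4) is omitted, and the function names, the `min_post` parameter, and the variance floor `var_floor` are hypothetical choices introduced here for numerical stability.

```python
import math


def measurement_loglik(x, tau, var_floor=1e-3):
    """Per-node log likelihood of the window x given change time tau (1-based).

    Samples with t < tau are treated as N(0, 1); samples with t >= tau are
    treated as N(mu, sigma^2) with mu and sigma^2 set to their MLEs.
    tau = len(x) + 1 means no change inside the window.
    """
    T = len(x)
    pre = x[: min(T, tau - 1)]
    post = x[tau - 1:] if tau <= T else []
    ll = -0.5 * sum(v * v for v in pre) - 0.5 * T * math.log(2 * math.pi)
    n = len(post)
    if n > 0:
        mu = sum(post) / n
        sq_dev = sum((v - mu) ** 2 for v in post)
        # The floor is an assumption of this sketch; it avoids log(0).
        var = max(sq_dev / n, var_floor)
        # 0.5 * n * log(var) equals n * log(sigma), as in Equation (5).
        ll -= 0.5 * sq_dev / var + 0.5 * n * math.log(var)
    return ll


def single_node_glr(x, window, min_post=2):
    """Best change-time log likelihood minus the no-change log likelihood."""
    w = x[-window:]  # sliding window: keep only the last `window` samples
    T = len(w)
    if T < min_post:
        return 0.0
    null = measurement_loglik(w, T + 1)
    # Require at least min_post post-change samples (assumption of this sketch).
    best = max(measurement_loglik(w, tau) for tau in range(1, T - min_post + 2))
    return best - null
```

In use, a value such as `single_node_glr(samples, window=100)` would be compared against a threshold at each time step; the complete procedure would add the propagation term and maximize jointly over the failure times of all nodes.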
Random sampling propagation paths: As stated above, the number of possible propagation paths in a fully connected network grows exponentially as the number of nodes increases. To reduce the computation cost, we propose a random sampling method. Define $\mathcal{F}$ as the failure set that contains the failed nodes, and $\mathcal{R} = \{j \notin \mathcal{F} : \exists i \in \mathcal{F},\ \alpha_{i,j} > 0\}$ as the risk set. Then, we generate the next possible failure points by randomly picking $P$ points in $\mathcal{R}$ without replacement with probability vector $p = (p_i)_{i \in \mathcal{R}}$, where $p_i = \tilde{p}_i \big(\sum_{j \in \mathcal{R}} \tilde{p}_j\big)^{-1}$ and $\tilde{p}_i = \sum_{j \in \mathcal{F}} \alpha_{j,i}$. With this scheme, we reduce the number of paths to $O(N P^m)$.

Thinning: Consider the likelihoods $\mathcal{L}_T = e^{\ell_T}$, $\mathcal{L}_{2,i,T} = e^{\ell_{2,i,T}}$, $\mathcal{L}_{1,T} = e^{\ell_{1,T}}$ and $\mathcal{L}_{2,T} = e^{\ell_{2,T}}$. Given a propagation path, we need to compute the maximum of $\mathcal{L}_T(\tau)$, which is the product of $\mathcal{L}_{1,T}$ and $\mathcal{L}_{2,T}$. We define the $q$th percentile of $\mathcal{L}_{2,i,T}$ for the $i$th node to be $l_{2,i,q}$. We also define a lower bound $l_1$ for $\mathcal{L}_{1,T}$. Instead of maximizing $\mathcal{L}_T(\tau)$ over all the possible choices, we maximize it only over the thinned set $\{\tau : \mathcal{L}_{2,i,T}(\tau_i) \ge l_{2,i,q},\ \forall i = 1, \ldots, N\} \cap \{\tau : \mathcal{L}_{1,T}(\tau) \ge l_1\}$. Specifically, given $\tau_i$, we try $\tau_j \in \{\tau_i + 1, \tau_i + 2, \ldots\} \cap \{\tau_j : \mathcal{L}_{2,j,T}(\tau_j) \ge l_{2,j,q}\}$, from the smallest to the largest, until $\mathcal{L}_{1,T} < l_1$, as shown in Figure 2. The computation cost of this step is $O(h)$, where $h$ depends on the topology of $G$, the $\alpha_{i,j}$s, $l_1$ and $q$. Moreover, we know that $h \le [L(1 - q)]^m$. The computation cost is reduced by exploiting the monotonicity of $e^{-x}$.

Fig. 2: Illustration of searching for $\tau_j$ given the last failure point $\tau_i$. We search $\tau_j$ from $\tau_i + 1$, and we have $\mathcal{L}_{1,T}(\tau_i + 4) > l_1$ and $\mathcal{L}_{1,T}(\tau_i + 5) < l_1$. By monotonicity of $e^{-x}$, we stop the search at $\tau_i + 5$.

With the above three strategies, we can reduce the computation cost to $O(N P^m h)$, which is linear in $N$, the size of the network.
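The random sampling step described above can be sketched in Python as follows. This is a minimal illustration, assuming the influences are stored in a dictionary keyed by `(j, i)` and that "without replacement" is realized by drawing one node at a time and renormalizing the remaining weights; the function name, the data layout, and that sequential interpretation are assumptions of this sketch rather than details given in the text.

```python
import random


def sample_next_failures(failed, alpha, num_samples, rng):
    """Draw up to num_samples candidate next-failure nodes from the risk set.

    failed: set of failed nodes (the failure set F).
    alpha: dict mapping (j, i) to the influence of node j on node i.
    rng: a random.Random instance, passed in for reproducibility.
    Each risk-set node i gets weight sum_{j in F} alpha[(j, i)]; nodes are
    drawn one at a time with probability proportional to the remaining weights.
    """
    weights = {}
    for (j, i), a in alpha.items():
        if j in failed and i not in failed and a > 0:
            weights[i] = weights.get(i, 0.0) + a

    picked = []
    while weights and len(picked) < num_samples:
        nodes = list(weights)
        node = rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]
        picked.append(node)
        del weights[node]  # without replacement: remove, then renormalize
    return picked
```

Calling this once per step of a candidate path, starting from each possible initial node and extending up to $m$ affected nodes, would produce on the order of $N P^m$ sampled paths.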
As shown in the following numerical examples, we can now compute the test statistics for a 300-bus power system quickly.

5. NUMERICAL EXAMPLES

Two commonly used performance metrics in change-point detection are average run length (ARL), which controls the false detection rate (a large ARL means a low false alarm rate), and expected detection delay (EDD). For a stopping time $\Gamma$, we define ARL as $E_0[\Gamma]$ and EDD as $E_1[(\Gamma - \tau_{(1)}^*)^+]$, where $E_i$ is the expectation with respect to the probability measure under hypothesis $H_i$. Both ARL and EDD increase with the threshold $b$, so raising the threshold lowers the false alarm rate at the cost of a longer detection delay. The trade-off between ARL and EDD is similar to the trade-off between false alarms and detection sensitivity. A good procedure should have a small EDD, given the same ARL.

In the experiments of this section, the pre-change distribution is $\mathcal{N}(0, 1)$ and the post-change distribution is $\mathcal{N}(1, \ldots)$.

Case I: Detect when the first change-point occurs. In Figure 3a, we show the results for a 300-bus power system (from MATPOWER [21]). In this large system, we apply the fast algorithm described in the previous section, with $L = 100$, $q = 0.\ldots$, $P = 1$, $l_1 = e^{-\ldots}$, and $m = 5$. The average computation time for each time step is less than 3 seconds. When the parameters are unknown, we compare our method with the generalized likelihood ratio (GLR) procedure. When the parameters are known, we compare our method with cumulative sum (CuSum). We also show the results of CuSum when $\mu$ is mis-specified ($\mu = 2, \ldots$).

We choose these baselines because CuSum is the optimal procedure [22] when the parameters are known, and GLR is a natural generalization of CuSum when the parameters are unknown. GLR has many good properties and is widely used when parameters are unknown [23]. Overall, our method shows the best performance, which is reasonable because our method is not only based on the likelihood ratio but also considers the likelihood of failure propagation.

Case II: Detect when there are at least $\eta$ change-points. Here we compare our proposed method with generalized multi-chart CuSum and S-CuSum [14], because these are the most well-known algorithms for tackling such problems. In this experiment, the graph is fully connected with 15 nodes. The parameters for the algorithm are $L = 100$, $q = 0.\ldots$, $P = 1$, $l_1 = e^{-\ldots}$, and $m = 5$. We set $\eta = 3$. To compute the ARL, we generate data with $\eta - 1$ affected nodes. The result in Figure 3b shows that our method outperforms both generalized multi-chart CuSum and S-CuSum.

Fig. 3: (a) Comparison of CuSum, generalized likelihood ratio, and the proposed method. (b) Comparison of generalized multi-chart CuSum, S-CuSum, and the proposed method.

6. CONCLUSION

In this paper, we proposed a fast algorithm for change-point detection that models the cascading failure as a temporal diffusion process in a network. Numerical experiments show that our proposed method outperforms the existing methods. With the current experimental setup, we tested our proposed method on an IEEE 300-bus system, which is considered relatively large, and our results suggest that the proposed algorithm can scale up to larger systems.

7. REFERENCES

[1] R. Yao, S. Huang, K. Sun, F. Liu, X. Zhang, and S. Mei, “A multi-timescale quasi-dynamic model for simulation of cascading outages,” IEEE Trans. Power Syst., vol. 31, no. 4, pp. 3189–3201, July 2016.
[2] Richard Andrew Wiltshire, Gerard Ledwich, and Peter O’Shea, “A Kalman filtering approach to rapidly detecting modal changes in power systems,” IEEE Transactions on Power Systems, vol. 22, no. 4, pp. 1698–1706, 2007.
[3] Ian Dobson, “New angles for monitoring areas,” in . IEEE, 2010, pp. 1–13.
[4] Paul D.H. Hines, Eduardo Cotilla-Sanchez, Benjamin O’Hara, and Christopher Danforth, “Estimating dynamic instability risk by measuring critical slowing down,” in . IEEE, 2011, pp. 1–5.
[5] G Rigatos, P Siano, and A Piccolo, “Neural network-based approach for early detection of cascading events in electric power systems,” IET Generation, Transmission & Distribution, vol. 3, no. 7, pp. 650–665, 2009.
[6] Jianping Cao, Dongliang Duan, Liuqing Yang, Qingpeng Zhang, Senzhang Wang, and Feiyue Wang, Social influence analysis in the big data era: a review, pp. 301–334, Cambridge University Press, 2016.
[7] Eli A Meirom, Constantine Caramanis, Shie Mannor, Ariel Orda, and Sanjay Shakkottai, “Detecting cascades from weak signatures,” IEEE Transactions on Network Science and Engineering, vol. 5, no. 4, pp. 313–325, 2017.
[8] Shuang-Hong Yang and Hongyuan Zha, “Mixture of mutually exciting processes for viral diffusion,” in International Conference on Machine Learning, 2013, pp. 1–9.
[9] Ian Dobson, “Estimating the propagation and extent of cascading line outages from utility data with a branching process,” IEEE Transactions on Power Systems, vol. 27, no. 4, pp. 2146–2155, 2012.
[10] Hui Ren and Ian Dobson, “Using transmission line outage data to estimate cascading failure propagation in an electric power system,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 55, no. 9, pp. 927–931, 2008.
[11] Yunxiao Chen and Xiaoou Li, “Compound sequential change point detection in multiple data streams,” arXiv preprint arXiv:1909.05903, 2019.
[12] Vasanthan Raghavan and Venugopal V Veeravalli, “Quickest change detection of a Markov process across a sensor array,” IEEE Transactions on Information Theory, vol. 56, no. 4, pp. 1961–1981, 2010.
[13] Mehmet Necip Kurt and Xiaodong Wang, “Multi-sensor sequential change detection with unknown change propagation pattern,” IEEE Transactions on Aerospace and Electronic Systems, 2018.
[14] Shaofeng Zou, Venugopal V Veeravalli, Jian Li, and Don Towsley, “Quickest detection of dynamic events in networks,” arXiv preprint arXiv:1807.06143, 2018.
[15] Manuel Gomez-Rodriguez, David Balduzzi, and Bernhard Schölkopf, “Uncovering the temporal dynamics of diffusion networks,” in International Conference on Machine Learning, 2011.
[16] Jessica Hoffmann and Constantine Caramanis, “Learning graphs from noisy epidemic cascades,” Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 3, no. 2, pp. 1–34, 2019.
[17] Paul D.H. Hines, Ian Dobson, Eduardo Cotilla-Sanchez, and Margaret Eppstein, “Dual graph and random chemistry methods for cascading failure analysis,” in . IEEE, 2013, pp. 2141–2150.
[18] Junjian Qi, Kai Sun, and Shengwei Mei, “An interaction model for simulation and mitigation of cascading failures,” IEEE Transactions on Power Systems, vol. 30, no. 2, pp. 804–819, 2014.
[19] Guillem Rigaill, “Pruned dynamic programming for optimal multiple change-point detection,” arXiv preprint arXiv:1004.0887, vol. 17, 2010.
[20] Oscar Hernan Madrid Padilla, Yi Yu, Daren Wang, and Alessandro Rinaldo, “Optimal nonparametric change point detection and localization,” arXiv preprint arXiv:1905.10019, 2019.
[21] Ray Daniel Zimmerman, Carlos Edmundo Murillo-Sánchez, and Robert John Thomas, “Matpower: Steady-state operations, planning, and analysis tools for power systems research and education,” IEEE Transactions on Power Systems, vol. 26, no. 1, pp. 12–19, 2010.
[22] Gary Lorden et al., “Procedures for reacting to a change in distribution,” The Annals of Mathematical Statistics, vol. 42, no. 6, pp. 1897–1908, 1971.
[23] Tze Leung Lai, “Sequential analysis: some classical problems and new challenges,”