A Change-Point Based Control Chart for Detecting Sparse Changes in High-Dimensional Heteroscedastic Data
AA Change-Point Based Control Chart forDetecting Sparse Changes in High-DimensionalHeteroscedastic Data
Zezhong Wang and Inez Maria Zwetsloot Department of Systems Engineering and Engineering Management, CityUniversity of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong * Corresponding author: Zezhong Wang, [email protected] 26, 2021
Total Words:4226 1 a r X i v : . [ s t a t . M E ] J a n Change-Point Based Control Chart for Detecting Sparse Changes inHigh-Dimensional Heteroscedastic DataAbstract
Because of the curse-of-dimensionality, high-dimensional processes present challenges totraditional multivariate statistical process monitoring (SPM) techniques. In addition, theunknown underlying distribution and complicated dependency among variables such asheteroscedasticity increase uncertainty of estimated parameters, and decrease the effective-ness of control charts. In addition, the requirement of sufficient reference samples limitsthe application of traditional charts in high dimension low sample size scenarios (small n ,large p ). More difficulties appear in detecting and diagnosing abnormal behaviors that arecaused by a small set of variables, i.e., sparse changes. In this article, we propose a change-point based control chart to detect sparse shifts in the mean vector of high-dimensionalheteroscedastic processes. Our proposed method can start monitoring when the numberof observations is a lot smaller than the dimensionality. The simulation results show itsrobustness to nonnormality and heteroscedasticity. A real data example is used to illustratethe effectiveness of the proposed control chart in high-dimensional applications. Supple-mentary material and code are provided online. Key Words:
Statistical process monitoring (SPM); High-dimensional control chart; Change-point; Sparse changes; Heteroscedasticity; Moving window2 . Introduction
In modern manufacturing systems, widely used sensor and internet technologies make itpossible to collect hundreds of measurements related to the final product and its productionover time. These measurements can be used to determine whether the process’ performanceis under control or whether corrective action is needed. Statistical process monitoring(SPM) tools for high-dimensional data are becoming popular for such applications.Detecting abnormality from high-dimensional data is a challenging task. Traditionalmultivariate control charts, such as the Hotelling’s T control chart and the MEWMAchart, are popular in monitoring multiple variables simultaneously. However, they arechallenged by the “curse-of-dimensionality” which affects parameter estimates and signaldetection ability significantly (Hastie et al., 2009).The accuracy of parameter estimates is an important factor of control charts’ per-formances, but the uncertainty of estimates, especially of the estimated covariance ma-trix, grows rapidly with dimensionality (Zhang et al., 2020). In conventional monitoringschemes, large amounts of in-control (IC) data are needed for getting reliable parameterestimates before monitoring can start (Zamba and Hawkins, 2006). One popular way toavoid directly estimating the convariance matrix is projecting the variables into a lowerdimensional subspace by principal components analysis (PCA), and then monitoring thePCA scores. De Ketelaere et al. (2015) give an overview of PCA-based high-dimensionalprocess monitoring techniques.PCA-based methods are generally designed under the assumption that the processfollows a multivariate normal distribution, this assumption is violated in many high-dimensional processes applications (Zhang et al., 2020). The underlying distribution isusually unknown and challenging to verify. This situation inspires the development of non-parametric control charts, such as the spatial sign based method proposed by Zou andTsung (2011).One drawback of the nonparametric method is that it requires sufficient IC observationsto estimate various parameters which affect the performance of control charts (Zou and3sung, 2011; Zou et al., 2012; Holland and Hawkins, 2014). They can not be directlyapplied in high dimension low sample size (HDLSS) scenarios where the number of variablesis much greater than the number of individual observations ( p > n ) , this is also referredto as short-run processes. Change-point based control charts have better performances inthis scenario (Li et al., 2014).It is rare that all variables change simultaneously in a process, a small subset of variablesare more likely to cause the abnormality. Under this sparsity assumption, the out-of-control(OC) signals can be easily buried in the noise of numerous variables (Shu and Fan, 2018).Sparsity also increases the difficulty to locate and identify assignable causes among variables(Wang and Jiang, 2009).One overlooked character in SPM solutions for high-dimensional data is heteroscedastic-ity (Hong et al., 2018). A straightforward definition of heteroscedasticity is the inequalityof error variance (Downs and Rocke, 1979). This character is deeply discussed in financialand economic fields, and is treated as a set of variances to be modeled (Engle, 2001). Itis rarely considered in manufacturing industries thought it would affect the accuracy ofparameters estimate. Consequently, the performances of control charts are fallible.To overcome the three mentioned challenges of monitoring high-dimensional processes,1) the HDLSS problem makes parameter estimation impossible, 2) sparsity of the change,and 3) heteroscedasticity in the data, we propose a change-point based control chart fordetecting sparse mean shifts in high-dimensional heteroscedastic data. The change-pointmethods allow us to deal naturally with the first challenge. We apply a supremum-basedtest to deal with the second and third challenges. The proposed method can start monitor-ing when the sample size is much smaller than the dimension ( p (cid:29) n ) . Another attractionis that this method is robust to both nonnormality and heteroscedasticity. We also pro-pose a post-signal diagnosis method to estimate the change-point and assignable causes.Simulation and comparison results show that the proposed method has good performancein detecting large and sparse shifts from high-dimensional heteroscedastic data streams.Application of the methods is illustrated by a real-data example.The structure of this article is as follows. Section 2 is a review of the related litera-4ure. In Section 3, we propose our novel change-point based control chart for detectingsparse shifts in the mean vector of possibly heteroscedastic non-normal data. Section 4shows the performance of the proposed method. In Section 5, we propose a post-signaldiagnosis procedure. Section 6 discusses a comparison with another method. In Section7, a manufacturing process example is used to showcase the proposed method. Section 8concludes.
2. Related Work
Quite some research has been done to adapt classical SPM methods to high-dimensionalmonitoring problems. Combining dimensionality reduction techniques with control chartsis one alternative to efficiently detect sparse changes in the high-dimensional setting. Wangand Jiang (2009), Zou and Qiu (2009), Zou et al. (2011), Capizzi and Masarotto (2011),Jiang et al. (2012), and Abdella et al. (2017) applied different penalized likelihood-basedvariable selection algorithms, such as LASSO, to screen out suspicious variables that possi-bly make the process deviate from its normal stage. Next, conventional control charts areapplied to the selected variables.These above discussed methods are efficient under the assumption of normal distributeddata, which is usually violated. Nonparametric methods are developed to fill this gap. Zouet al. (2015) proposed an online monitoring scheme based on a goodness-of-fit (GOF)test for monitoring nonnormal and i.i.d. observations. Shu and Fan (2018) proposed adistribution-free control chart by adapting a Minkowski distance based test.The sparsity also increases the difficulty for signal diagnosis, one benefit of variable-selection algorithm is that the tasks of monitoring and diagnosis are naturally integratedand conveniently solved simultaneously (Jiang et al., 2012). In Zhang et al. (2020), theyproposed a square-root LASSO based diagnosis framework. An effective tool to locatethe occurrence time of changes in data sequences is change-point estimate, it has beenwidely used in univariate, multivariate, and profile monitoring problems. Amiri and Al-lahyari (2012) reviewed the applications of change-point estimation techniques in process5onitoring systematically. This type of method not only supports practitioners to iden-tify assignable causes but also shows robustness to nonnormality. Therefore, change-pointbased methods are potentially useful in monitoring high-dimensional data streams.Zou et al. (2011) estimated the change-point in post-signal diagnosis phase. Li et al.(2014); Huang et al. (2014); Chen et al. (2016) also proposed change-point based controlcharts. The first one is developed from a two-sample location test proposed by Chenet al. (2010), which is a sum-of-square type test. It is more efficient in monitoring denseshifts. Huang et al. (2014) proposed a novel Reproducing Kernel Hilbert Space (RKHS)-based control chart which is robust to nonnormality and is able to detect a wide range ofprocess changes include the sparse changes. The method proposed by Chen et al. (2016)is developed from Bickel (1969), and only dense scenarios are considered in the simulationpart. One more attraction of the above change-point based schemes is that they are capableof starting monitoring with a small set of observations.The heteroscedasticity has been well studied in financial data. This type of data canbe modeled by autoregressive conditional heteroscedastic (ARCH) models (Engle, 1982) orgeneralized autoregressive conditional heteroskedastic (GARCH) models (Bollerslev, 1986).To monitor for shifts in heteroscedastic data, Schipper and Schmid (2001) and Bodnar(2009) proposed control charts based on univariate and multivariate GARCH models re-spectively. Both methods used the same strategy that is monitoring the fitting errorsbetween the target process and the observed process. It is challenging to apply the samestrategy to high-dimensional scenarios, because of the computational complexity (Frisén,2008). Another alternative strategy for monitoring heteroscedastic data is using methodsthat are robust to inconsistent covariance matrices.The above literature offers solutions for either sparse changes in normal data or HDLSSmethods for dense changes or method focused on heteroscedasticity. No method is able todo it all. Our objective is to develop a method to detect sparse changes in heteroscedasticand possibly non-normal data in the HDLSS scenarios. We will adapt the two-samplehypothesis tests of high-dimensional means proposed by Chang et al. (2017) to processmonitoring in Section 3, because it allows for different covariance matrices between two6amples.
3. Change-Point Based Control Chart
In this section we introduce our proposed method.
Given a series of independent and individual p -dimensional observations X i = ( X i , X i , ..., X ip ) (cid:48) , i = 1 , , ..., n , which follow a multivariate normal distribution, the change-point model is H : X i ∼ N p ( µ , Σ ) and H : X i ∼ N p ( µ , Σ ) , i ≤ τN p ( µ , Σ ) , i > τ, (1)where µ , µ , Σ , and Σ are the unknown IC and OC mean vectors and covariancematrices. The change-point τ , which indicates that the process changes after X τ , is alsounknown and needs to be estimated. If only a mean shift occurs, Σ is equal to Σ inEquation (1).Various high-dimensional hypothesis tests for the shifts in locations have been proposed,see, for example, Randles (2000); Zhang (2002); Chen et al. (2010); Saha et al. (2017). Thesetests have been integrated into process monitoring tools that were reviewed in Sections 1and 2. High-dimensional test statistics can be categorized as sum-of-squares-based teststatistic (Chen et al., 2010), and supremum-based test statistic (Randles, 2000). Theformer is powerful when there are many small differences between µ and µ , in otherwords, the signals are dense but weak. However, this type of method could be ineffectiveunder sparsity assumption, where the accumulation of all differences will not be greatlyinfluenced by a few large differences especially in a big data stream. For the case that afew variables have large shifts, the signals are sparse but strong, a supremum-based teststatistic should have better performance (Gregory et al., 2015).Our proposed control chart for detecting sparse mean shifts is based on the supremum-7ased-test statistics proposed by Chang et al. (2017). Chang et al. (2017) proposed fourtests. The first two are supremum-based test based on a non-studentized test statisticsand a studentized test statistic. The second two use the same two test statistics and addan initial screening step to reduce the dimension of the data. In this paper we focus onthe non-studentized method without screening. We have also implemented our proposedmethods using the studentized test statistics and discovered it shows inferior performancein all experiments. The detailed results of the studentized test statistics are therefor onlyreported in the supplementary material.To test for H vs. H in Equation (1), we partition the n observations into two sets, { X , ..., X k } and { X k +1 , ..., X n } at a split point k (3 ≤ k ≤ n − . Each subset containsat least observations in order to calculate the sample means and variances. Thereforethis methods can start monitoring with at least observations which is appropriate formonitoring short-run processes. As in Chang et al. (2017), the non-studentized statistics( N S ) with n observations and split point k are as follows, T NSn,k = max ≤ r ≤ p √ k ( n − k ) | ¯ X k,r − ¯ X n − k,r |√ n , (2)where ¯ X k,r and ¯ X n − k,r are the means of the r th variable ( X r ) in the pre-shift sample andpost-shift sample respectively. The monitoring statistic at time point n , referred as U NSn ,is the maximum value of T NSn,k over all split points k , U NSn = max ≤ k ≤ n − T NSn,k . (3)The change-point τ can be estimated directly by the corresponding k when U NSn exceedsthe control limits: ˆ τ NSn = arg k max ≤ k ≤ n − ( T NSn,k ) . (4)Though Chang et al. (2017) present mathematical expressions for the critical values,they still use a fully data driven bootstrap method to estimate them. Therefore, we alsouse the same strategy to determine the control limits based on a predefined False Alarm robability (FAP) . Note that for progressive monitors our statistic will be based on larger and larger samples(i.e. n increases with time). The efficiency of detecting sparse shifts decreases with thesample size ( n ), because the signals are more likely to be buried by the noises. Anotherpractical problem of our method is that the time and complexity of data processing multiplywith incoming observations, because n − iterations are needed to compute the controlstatistic from n observations. A third issue is that the statistic is influenced by the samplesize, and therefore, the control limits are dynamic. To reduce the detection delay andpromote the sensitivity of the proposed monitoring scheme, we add a moving window tothe proposed method.For a fixed window size W , the monitoring procedure starts with the first W observations ( X , .., X W ) . After collecting s new observations, s is the step size, it excludes the oldest s observations and moves to the second window ( X s , .., X W + s ) . An appropriate step sizecan speed up the computation and reduce the serial correlation, making the monitoringstatistics become approximately independent. The observation matrix at time point n ( n >W ) is X n = ( X n − W +1 , X n − W +2 , ..., X n ) . The window-based non-studentized statistics are T NSn,W,k ∗ = max ≤ r ≤ p √ k ∗ ( W − k ∗ ) | ¯ X k ∗ ,r − ¯ X W − k ∗ ,r |√ W , (5)where k ∗ (3 ≤ k ∗ ≤ W − is the split point inside the current window, and ¯ X k ∗ ,r is themean of first k ∗ observations in window W n . The corresponding charting statistic based onthis window is U NSn,W = max ≤ k ∗ ≤ W − T NSn,W,k ∗ . (6)Note that we observe U NSn,W each s time only. Estimating the change-point is alsostraightforward, if the control chart signals at time point n , the corresponding change-9oint estimate is ˆ τ NSn,W = n − W + arg k ∗ max ≤ k ∗ ≤ W − T NSn,W,k ∗ . (7)The statistic depends on the window size, which is a pre-specified fixed value. Consequently,the control limits are a constant value and they need to be simulated for different dimen-sionality and window sizes. Our method, referred by N S W , signals when U NSn,W > h p,W ,where h p,W is the control limit. In this paper, the control limits are determined by prespecifying the
F AP instead of the
ARL , because the
ARL is affected by the window size and autocorrelation. Moreover,simulation for
ARL s is more time consuming. We define the
F AP as the probability ofat lease one false alarm in the process in time to n (Chakraborti et al., 2008). We usea bootstrap method to get the control limits with a predefined F AP = α , as defined inAlgorithm 1.Step 1. Import B , n , W , s , and X p . Where B is the number of bootstrap samples, n is the monitoring interval, W is the window size and s is the step size.Step 2. Draw B bootstrap samples of size W with replacesment from X p .Step 3. Use Equation 6 to calculate the charting statistics U NS,bW,W , b = 1 , , ....B forevery bootstrap sample.Step 4. Find the control limits as the (1 − α ) Q empirical quantile from the pooled U NS,bW,W , b = 1 , , ....B respectively with Q = max { z ∈ Z | z ≤ ( n − Ws ) } +1 . Algorithm 1:
Bootstrap method for control limitsStep 4 in Algorithm 1 is valid because appropriate step sizes can make the sequential U NSn,W series an approximately independent process. Figure 1 shows the autocorrelation of U NSn,W with s = 1 , , , and W = 20 . When s = 5 , the autocorrelation is small for all lags.[Figure 1 near here.]In this paper, we set α = 0 . , B = 10000 , n = 100 , W = 20 , , , s = 5 , and X p isdrawn from a p -dimensional normal distribution. All codes for calculating the statistics,10ontrol limits and performances analysis are available on GitHub ( https://github.com/wyfwzz/Supplementary-code-for-A-change-point-based-control-chart...- ).
4. Performance Study
To evaluate the OC performances of the proposed
N S W method, we use simulation exper-iments. In these experiments, we vary the underlying distribution, dimensionality, windowsize, shift size, sparsity level and change-point. The detailed settings are as follow:1. Models: the IC process follows a p -dimension standardized normal distribution N p ( , I p ) .To test the robustness to heteroscedasticity, correlation, and nonnormality, we designfour OC models, where for each model { X i } τi =1 ∼ N p ( , I p ) and• Model I: { X i } ni = τ +1 ∼ N p ( µ , I p ) . This model is the baseline model which onlyinvolves a change in the mean vector;• Model II: { X i } ni = τ +1 ∼ N p ( µ , λ t I ) , where λ t controls the change of variance andis set equal to { . , . , ...., . , , . , ..., . } which is repeated until n . Thismodel is used to simulate heteroscedasticity, after τ , the variances vary withtime;• Model III: { X i } ni = τ +1 ∼ N p ( µ , Σ ) , where Σ = ( σ l,m ) (1 ≤ l,m ≤ p ) , and σ l,m =0 . | l − m | . This model incorporates the dependency among variables;• Model IV: { X i } ni = τ +1 ∼ t p, ( µ , Σ ) , where Σ = ( σ l,m ) (1 ≤ l,m ≤ p ) , and σ l,m =0 . | l − m | . This model is designed for testing the robustness to nonnormality;2. Dimensionality: p = 20 , , ;3. Window size: W = 20 , , ;4. Change point: τ = 10 , , ; 11. Sparsity level (the percentage of OC variables): v = 10% , , which means the first v × p variables change simultaneously after τ ;6. Shifts size: δ = 0 . , , . , , so that µ = ( δ vp , p − vp ) , where δ vp is a v × p size vectorwith values of δ and p − vp is a p − v × p size vector with values of . We run R = 1000 simulation runs for every scenario to test the performance of the proposedmethod. In every simulation run, the monitoring procedure will stop once a signal isdetected or when we run out of the predefined process with n = 100 observations. Theproportion of runs that are stopped by a signal to all runs represents the detecting powerof the proposed method. This metric is called the Detection Rate (DR) , where DR = (cid:80) Rj =1 I ( U NS,jn,W >h p,W | H ) R . When DR is close to , it indicates that the proposed method issensitive to change.The DR reflects the power of our method, in addition, a timely detection is also impor-tant. We define the Conditionally Expected Detection Delay (CED) which is the time be-tween the average stop point and the change point:
CED = (cid:80) Rj =1 min ( arg n ( U NS,jn,W >h p,W | H )) R − τ . CED is influenced by W , s , and τ . When W ≤ τ , the smaller CED reflects better per-formances. When
W > τ and
CED = W − τ , it means that the control chart can detecta signal as early as the first window. All simulation results for all experiments can befound in the supplementary materials, along with the simulation results of the studentizedmethod. We discuss the highlights in the rest of Section 4. Table 1 shows the DR and CED for the
N S W control chart under Model I with various p , W , δ , τ , and v . The N S W method has very low DR when the shift is smaller than . .The DR increases with the growth of window size. It also performs well in more sparsescenarios where v = 10% with large change and appropriate window size. The location of τ has no significant influence on DR unless it is very small ( τ = 10 ). It has better DR in12etecting larger changes of higher dimensional mean vector.[Table 1 near here.]The CED of the
N S W method is worth considering when DR ≥ . . When τ > W ,the proposed method can detect a signal with about extra observations after the changepoint. Large W can result in slower detection (large CED ) especially if τ is very small.This result confirms the conclusion that the proposed method works well when the numberof pre-shift and after-shift observations are balanced within one window. Therefore theoptimal window size need to be determined based on a trade-off between sensitivity anddetection delay. Consequently, we recommend the N S W method for detecting large andsparse shifts in high-dimensional mean vector with larger window size, and we select W = 40 for the remaining experiments. The baseline model only considers mean shifts in multivariate normal distributed data,which is often an invalid assumption in practice. This section focuses on the results fromModel II to IV to explore the robustness of our proposed method under heteroscedasticity,correlation, and nonnormality. The performance of our method is shown in Table 2, formore results, see the supplementary material.Compared with the baseline model, the DR approximately decrease by percent pointsin heteroscedastic model with small shift size, and DR = 1 when δ = 2 . This result isconsistent under different levels of sparsity. The CED results show that heteroscedasticityhas neglectable influence on the detection delay, and the
N S W method is robust to datawith unequal variances. The results from the other two models yield the same conclusionregarding of robustness. Therefore, we highly recommend to monitor a process with the N S W method when the underlying distribution is unknown or shows deviation from anormal distribution. [Table 2 near here.]13 . Diagnostic After a signal is obtained, it is of interest to diagnose the problem. Equations (4) and (7)give the change-point estimate which can be useful in post-signal diagnosis. More thanthat, locating the assignable causes, the suspicious variables, is worth considering. Onelimitation of the proposed supremum-based method is that the signal is directly causedby one variable, and the abnormal behavior in other variables is ignored. We propose thefollowing diagnostic procedure to select all suspicious variables: ˆ V NSn = arg r ( √ k ∗ ( W − k ∗ ) | ¯ X k ∗ ,r − ¯ X W − k ∗ ,r |√ W > h p,W | U NSn,W > h p,W , H ) , (8)where ˆ V NSn is the set of variables that all yield a signal. To evaluate the accuracy of the di-agnostic procedure, we compared ˆ V NSn with the predefined variable set V NSn = ( X , .., X vp ) .The Detection Rate of Variables (DRV) is defined as
DRV = P ( | ˆ V NSn ∩ V NSn || V NSn | ) = (cid:80) i ∈ ˆ V NSn I ( i ≤ p × v ) p × v , (9)where i is an element in ˆ V NSn (and the index of the variable that signals).Table 3 shows the post-signal diagnostic performances of the proposed method with p = 100 , τ = 25 , and W = 40 under different models. The Conditional Change-PointEstimate (CPE) in Equation 7 and
DRV are calculated from simulations. The
N S W method can estimate the change-point accurately under various setting of shift size andsparsity. There is no significant different performance between different models. It alsoshows good performance in finding the suspicious variables, especially when δ = 2 , and isable to detect approximately of the change variables correctly. One possible way toimprove the capability of locating assignable causes is adding a variable selection algorithmto the proposed method. [Table 3 near here.]14 . Comparison As discussed in Section 2, Chen et al. (2016) proposed a distribution-free EWMA controlchart (
DFEWMA ). The
DFEWMA method is also based on a change-point model and isrobust to nonnormality. It is efficient in detecting small and moderate shifts in locationparameters for unknown distributions. We will compare the
DFEWMA control chart withour proposed method, because it is one of the most comparable methods that we can findand for the
DFEWMA method there is the matlab code available online. Before introducingthe
DFEWMA chart, we first need to rewrite H in Equation 1 as X i ∼ N p ( µ , Σ ) , i = − m + 1 , ..., , , ..., τ,N p ( µ , Σ ) , i > τ . The difference is including m IC observations to compose a reference sample. After col-lecting X n , the charting statistic can be constructed as T n ( W, λ ) = (cid:80) pr =1 T n,r ( W, λ ) , where T n,r ( W, λ ) = n (cid:88) i = n − W +1 (1 − λ ) n − i R n,i,r − W ( m + n + 1) / (cid:112) W ( m + n + 1)( m + n − W ) / , (10)where W is the window size, λ is the smoothing parameter, R n,i,r is the rank of X i,r amongthe sample X − m +1 ,r , ..., X n,r . The data-dependent control limits are determined onlinerather than before monitoring. For more detailed properties of this method, see Chen et al.(2016).To make the DFEWMA method more comparable with the proposed
N S W methods,we use Model I and the same setting of parameters ( p, τ, W , and v ) as in Section 4, and theshift size are set at δ = 1 , . , . Next, IC observations are used as reference sample forthe
DFEWMA method. Table 4 shows the simulation results of the
N S W and DFEWMA methods, when τ = 50 and v = 10% , additional results can be found in the supplementarymaterial. Both methods are designed to have a F AP = 0 . with observations. Theempirical F AP are in brackets.
DFEWMA has larger empirical
F AP than the target value15f . . We can neither fix the problem nor change the comparer, since the codes of othermethods are unavailable.The DR of the DFEWMA chart is close to in all scenarios which outperforms the N S W method. We don’t know how much it is affected by the larger F AP . It has betterperformance with less sparsity (large v ), because it is a spatial-rank based statistic. Thesame explanation is also applicable to interpret the increase of DR with dimensionality.Because the DFEWMA chart uses increasing window size and moves with every new obser-vation instead of using fixed window size and steps as us, it can achieve smaller detectiondelay. The proposed
N S W method only outperforms in CED with W = 20 in lower di-mension scenarios. One more reason for the good performance of the DFEWMA methodis that it has reference observations. As already pointed out by Chen et al. (2016),the efficiency of the
DFEWMA chart relies on the large enough reference sample, which isunnecessary for our proposed method. An additional drawback of Chen et al. (2016) is theexcessive false alarm rate. If there is a large reference sample and the practitioner does notmind many false alarms we recommend the
DFEWMA method. For all other scenarios webelieve the practitioner should rather use our proposed method. It’s detection power is alittle smaller but monitoring can start quickly and false alarm is at the nominal level.[Table 4 near here.]
7. Case Study
In this section, we illustrate the proposed method by applying it to a real dataset from asemiconductor manufacturing process which is under constant surveillance via the moni-toring of signals/variables collected from sensors and or measurement points. This datasetis available online ( http://archive.ics.uci.edu/ml/datasets/SECOM ). It consists of variables each with records in chronological order. A classification label ( ± ) isgiven to indicate whether the product passes or fails, where − represents pass and isa failed one. Among these, observations are the IC sample, and the remaining observations compose the OC sample. 16ince the dataset contains constant values, null values and potential outliers, pre-processing is necessary. We remove the constant variables from both samples, leaving variables. The outliers in the IC sample are identified by Tukey’s fences (Tukey, 1977), andreplaced by the the variable median. The missing values in both samples are replaced byvariable median. After that, all the observations are standardized by the sample mean andstandard deviation from the IC observations. The proposed method is applied to monitorthe standard scores. Figure 2 shows the heteroscedastic standard scores of some variableswhen they are in control. [Figure 2 near here]The control limits are simulated by the proposed bootstrap method in Algorithm 1 basedon the IC sample with F AP = 0 . . Based on the assumption that the process starts froman IC status and then the mean vector changes at time τ , where τ = 10 , , . As shownin Table 5, the N S W method is efficient in monitoring real high-dimensional process. TheDR is equal to in all scenarios. When W > τ , it can always detect an out of control signalin the first window, and the
CP E is close to the real one. When
W < τ , observationsdetection delay is acceptable. These results confirm that the N S W method is robust tounknown distributions and heteroscedasticity.[Table 5 near here.]
8. Conclusion
Curse-of-dimensionality and unknown distributions increase the unreliability of parameterestimate, especially when a large amount of IC observations are unavailable. In analyzingOC performance, sparsity is another practical assumption, which challenges both signaldetection and diagnosis methods. In this research we proposed a change-point based con-trol chart for monitoring sparse changes in high-dimensional mean vector, specifically forHDLSS scenarios. A moving window is added to speed up the computation and increase thesensitivity. As shown by the experimentation results, the proposed
N S W chart is efficient17n detecting large sparse shifts. And it can achieve accurate estimation of the change-pointand the potential OC variables. It is robust to correlation, nonnormality, and heteroscedas-ticity, the latter is an important and often overlooked characteristic of high-dimensionalprocess data. More than that, the experiments give some reference in determining the op-timal window size, a tradeoff between sensitivity and false alarms needs to be considered.The real case study illustrates the robustness and practicability of the N S W method inapplication. Our proposed supremum-based method, the N S W method, is not as sensitiveas the spatial-rank-based DFEWMA chart in comparison. However the
DFEWMA needs alarge reference sample and shows excessive false alarms, both are not issues in our method.Future directions for research is to improve the sensitivity. One possible way is to adapt avariable selection algorithms before starting monitoring.
Supplementary Materials
The PDF file provides the studentized method and the additional tables of the detailedsimulation results.
Acknowledgement
The work of Inez M. Zwetsloot described in this paper was supported by a grant from theResearch Grants Council of the Hong Kong Special Administrative Region, China (ProjectNo. CityU 21215319).
Declaration of Interest Statement
Zezhong WANG is a Ph.D candidate in the City University of Hong Kong in the Depart-ment of Systems Engineering and Engineering Management. Her research interests includeStatistical Process Monitoring, in particular, the High-Dimensional Process Monitoring.18 eferences
Abdella, G. M., Al-Khalifa, K. N., Kim, S., Jeong, M. K., Elsayed, E. A., and Hamouda,A. M. (2017). Variable selection-based multivariate cumulative sum control chart.
Qualityand Reliability Engineering International , 33(3):565–578.Amiri, A. and Allahyari, S. (2012). Change point estimation methods for control chartpostsignal diagnostics: a literature review.
Quality and Reliability Engineering Interna-tional , 28(7):673–685.Bickel, P. J. (1969). A distribution free version of the smirnov two sample test in thep-variate case.
The Annals of Mathematical Statistics , 40(1):1–23.Bodnar, O. (2009). Application of the generalized likelihood ratio test for detecting changesin the mean of multivariate garch processes.
Communications in Statistics-Simulationand Computation , 38(5):919–938.Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity.
Journal ofeconometrics , 31(3):307–327.Capizzi, G. and Masarotto, G. (2011). A least angle regression control chart for multidi-mensional data.
Technometrics , 53(3):285–296.Chakraborti, S., Human, S., and Graham, M. (2008). Phase i statistical process controlcharts: an overview and some results.
Quality Engineering , 21(1):52–62.Chang, J., Zheng, C., Zhou, W.-X., and Zhou, W. (2017). Simulation-based hypothesis test-ing of high dimensional means under covariance heterogeneity.
Biometrics , 73(4):1300–1310.Chen, N., Zi, X., and Zou, C. (2016). A distribution-free multivariate control chart.
Tech-nometrics , 58(4):448–459.Chen, S. X., Qin, Y.-L., et al. (2010). A two-sample test for high-dimensional data withapplications to gene-set testing.
The Annals of Statistics , 38(2):808–835.19e Ketelaere, B., Hubert, M., and Schmitt, E. (2015). Overview of pca-based statisti-cal process-monitoring methods for time-dependent, high-dimensional data.
Journal ofQuality Technology , 47(4):318–335.Downs, G. W. and Rocke, D. M. (1979). Interpreting heteroscedasticity.
American Journalof Political Science , pages 816–828.Engle, R. (2001). Garch 101: The use of arch/garch models in applied econometrics.
Journal of Economic Perspectives , 15(4):157–186.Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of thevariance of united kingdom inflation.
Econometrica: Journal of the Econometric Society ,pages 987–1007.Fani, S., Pasha, M., and Moghadam, M. B. (2019). A new approach to optimal design ofcontrol charts based on change point estimation and total time on test plot.
Communi-cations in Statistics-Theory and Methods , 48(22):5571–5584.Frisén, M. (2008).
Financial surveillance , volume 71. Wiley Online Library.Gregory, K. B., Carroll, R. J., Baladandayuthapani, V., and Lahiri, S. N. (2015). A two-sample test for equality of means in high dimension.
Journal of the American StatisticalAssociation , 110(510):837–849.Hastie, T., Tibshirani, R., and Friedman, J. (2009).
The elements of statistical learning:data mining, inference, and prediction . Springer Science & Business Media.Holland, M. D. and Hawkins, D. M. (2014). A control chart based on a nonparametricmultivariate change-point model.
Journal of Quality Technology , 46(1):63–77.Hong, D., Balzano, L., and Fessler, J. A. (2018). Asymptotic performance of pca forhigh-dimensional heteroscedastic data.
Journal of multivariate analysis , 167:435–452.20uang, S., Kong, Z., and Huang, W. (2014). High-dimensional process monitoring andchange point detection using embedding distributions in reproducing kernel hilbert space.
IIE Transactions , 46(10):999–1016.Jiang, W., Wang, K., and Tsung, F. (2012). A variable-selection-based multivariate ewmachart for process monitoring and diagnosis.
Journal of Quality Technology , 44(3):209–230.Kim, S., Jeong, M. K., and Elsayed, E. A. (2019). A penalized likelihood-based qualitymonitoring via l2-norm regularization for high-dimensional processes.
Journal of QualityTechnology , pages 1–16.Li, Y., Liu, Y., Zou, C., and Jiang, W. (2014). A self-starting control chart forhigh-dimensional short-run processes.
International Journal of Production Research ,52(2):445–461.Randles, R. H. (2000). A simpler, affine-invariant, multivariate, distribution-free sign test.
Journal of the American Statistical Association , 95(452):1263–1268.Saha, E., Sarkar, S., and Ghosh, A. K. (2017). Some high-dimensional one-sample testsbased on functions of interpoint distances.
Journal of Multivariate Analysis , 161:83–95.Schipper, S. and Schmid, W. (2001). Control charts for garch processes.
Nonlinear Analysis:Theory, Methods & Applications , 47(3):2049–2060.Sheikhrabori, R., Aminnayeri, M., and Ayoubi, M. (2018). Maximum likelihood estimationof change point from stationary to nonstationary in autoregressive models using dynamiclinear model.
Quality and Reliability Engineering International , 34(1):27–36.Shu, L. and Fan, J. (2018). A distribution-free control chart for monitoring high-dimensional processes based on interpoint distances.
Naval Research Logistics (NRL) ,65(4):317–330.Tukey, J. W. (1977).
Exploratory data analysis , volume 2. Reading, MA.21ang, K. and Jiang, W. (2009). High-dimensional process monitoring and fault isolationvia variable selection.
Journal of Quality Technology , 41(3):247–258.Zamba, K. and Hawkins, D. M. (2006). A multivariate change-point model for statisticalprocess control.
Technometrics , 48(4):539–549.Zhang, C., Chen, N., and Wu, J. (2020). Spatial rank-based high-dimensional monitoringthrough random projection.
Journal of Quality Technology , 52(2):111–127.Zhang, J. (2002). Powerful goodness-of-fit tests based on the likelihood ratio.
Journal ofthe Royal Statistical Society: Series B (Statistical Methodology) , 64(2):281–294.Zou, C., Jiang, W., and Tsung, F. (2011). A lasso-based diagnostic framework for multi-variate statistical process control.
Technometrics , 53(3):297–309.Zou, C. and Qiu, P. (2009). Multivariate statistical process control using lasso.
Journal ofthe American Statistical Association , 104(488):1586–1596.Zou, C. and Tsung, F. (2011). A multivariate sign ewma control chart.
Technometrics ,53(1):84–97.Zou, C., Wang, Z., and Tsung, F. (2012). A spatial rank-based multivariate ewma controlchart.
Naval Research Logistics (NRL) , 59(2):91–110.Zou, C., Wang, Z., Zi, X., and Jiang, W. (2015). An efficient online monitoring method forhigh-dimensional data streams.
Technometrics , 57(3):374–387.22 a b l e : D R a nd C E D o f t h e N S W m e t h o dund e r M o d e l I w i t h v a r i o u ss e tt i n go f p , δ , W , τ , a nd v . D R C E D τ = τ = τ = τ = τ = τ = p W δ % % % % % % % % % % % % . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . DR and CED for the proposed
N S W method under Model I to IV with varioussetting of δ and v . When p = 100 , τ = 25 , and W = 40 . DR CEDM odel δ v = 10% v = 25% v = 10% v = 25% I(Baseline) 1 0.508 0.822 18.7 17.31.5 1 1 15.1 15.02 1 1 15.0 15.0II(Heteroscedastic) 1 0.408 0.712 19.7 18.61.5 0.999 1 15.1 15.02 1 1 15.0 15.0III(Dependent) 1 0.348 0.539 19.2 19.51.5 0.968 0.992 16.3 15.62 1 1 15.0 15.0IV(Nonnormal) 1 0.404 0.547 18.9 19.21.5 0.955 0.989 16.3 15.72 1 1 15.0 15.0Table 3: Post-signal diagnosis of the
N S W chart under Model I to Model IV with varioussetting of v . When p = 100 , τ = 25 , W = 40 . CP E DRVM odel δ v = 10% v = 25% v = 10% v = 25% I(Baseline) 1 25.4 25.1 0.105 0.0451.5 24.9 24.9 0.261 0.2632 24.9 24.9 0.805 0.814II(Heteroscedastic) 1 24.3 24.2 0.103 0.0431.5 24.6 24.6 0.263 0.2412 24.8 24.8 0.843 0.836III(Dependent) 1 25.2 24.7 0.179 0.0901.5 24.9 24.7 0.405 0.3382 24.8 24.8 0.828 0.807IV(Nonnormal) 1 24.8 24.8 0.200 0.1211.5 24.7 24.7 0.396 0.3502 24.8 24.8 0.818 0.82124able 4: The DR and CED of the
DFEWMA and
N S W charts with various combinationsof p , W , and δ . When m = 100 , v = 10% , τ = 50 . Between bracket the F AP = 0 . . DR CEDp W δ N S W DFEWMA
N S W DFEWMA
20 20 0 (0.015) (0.05)1.5 0.347 1 8.08 11.012 0.831 1 7.30 8.0730 0 (0.013) (0.042)1.5 0.715 1 11.21 11.202 0.991 1 7.51 8.2340 0 (0.01) (0.046)1.5 0.902 1 12.39 10.512 1 1 7.64 8.0350 20 0 (0.018) (0.047)1.5 0.506 1 8.54 7.192 0.962 1 6.36 5.4030 0 (0.013) (0.037)1.5 0.918 1 10.11 7.232 1 1 6.30 5.3040 0 (0.016) (0.043)1.5 0.992 1 10.50 7.142 1 1 6.21 5.64100 20 0 (0.004) (0.047)1.5 0.431 1 8.84 5.332 0.977 1 6.51 4.3430 0 (0.011) (0.047)1.5 0.966 1 9.65 5.332 1 1 5.61 4.2140 0 (0.003) (0.037)1.5 1 1 9.11 5.542 1 1 5.39 4.45Table 5: The DR and CED of the proposed
N S W method in signal detection, and the CP E in post-signal diagnosis with various setting of W and τ . W
20 30 40 τ
10 25 50 10 25 50 10 25 50 DR CED
CP E U NSn,W with various s , when W = 20 .Figure 2: The heteroscedasticity of standard scores of X , X , X , and X .• Figure 1. The autocorrelation of U NSn,W with various s , when W = 20 .• Figure 2. The heteroscedasticity of standard scores of X , X , X , and X338