The Sequential Normal Scores Transformation
Dr. William J. Conover
Department of Mathematics and Statistics, Texas Tech University

Dr. Víctor G. Tercero-Gómez
School of Engineering and Sciences, Tecnologico de Monterrey

Dr. Alvaro E. Cordero-Franco
Facultad de Ciencias Físico Matemáticas, Universidad Autonoma de Nuevo Leon

May 11, 2017
Abstract
The sequential analysis of series often requires nonparametric procedures, where the most powerful ones frequently use rank transformations. Re-ranking the data sequence after each new observation can become too intensive computationally. This led to the idea of sequential ranks, where only the most recent observation is ranked. However, difficulties finding, or approximating, the null distribution of the statistics may have contributed to the lack of popularity of these methods. In this paper, we propose transforming the sequential ranks into sequential normal scores, which are independent and asymptotically standard normal random variables. Thus original methods based on the normality assumption may be used.

A novel approach permits the inclusion of a priori information in the form of quantiles. It is developed as a strategy to increase the sensitivity of the scoring statistic. The result is a powerful, convenient method to analyze non-normal data sequences. Also, four variations of sequential normal scores are presented using examples from the literature. Researchers and practitioners might find this approach useful to develop nonparametric procedures to address new problems, extending the use of parametric procedures when distributional assumptions are not met. These methods are especially useful with large data streams where efficient computational methods are required.

Keywords: Big data, nonparametric, sequential hypothesis testing, sequential ranks, statistical process control.

AMS Subject Classifications: 62L10, 62Gxx.

1 Introduction
Many applications of statistics involve sequences of observations, where decisions are made at several time points along the sequence. According to Wald [1945], a sequential test of hypothesis is the name given to any procedure where, at the $n$-th trial of an experiment, a decision is made to accept the null hypothesis, accept the alternative hypothesis, or continue the experiment by making another observation. Hence, the analysis is sequential, and the time it takes to reach a statistical conclusion is random in nature. Dodge and Romig [1929] developed sequential tests with applications in quality inspection. Shewhart [1931] suggested the idea of online monitoring and continuous testing for statistical control. Wald [1945] conceived a general testing procedure named the sequential probability ratio test. Box [1957] proposed an evolutionary operation, a sequential method to carry on experiments. From these initial propositions, many statistical approaches have been created to solve a variety of problems. However, most of these statistical methods, when employed in practice, assume knowledge of the probability distribution of the observations, or that the observations can be averaged across groups so the sample mean is approximately normally distributed by the Central Limit Theorem (refer to Shewhart $\bar{X}$-charts as an example).

When the data are not normally distributed, the normal-based methods are often robust to the non-normality by the Central Limit Theorem, but robustness protects only the validity of the probability of a Type I error, and such methods can suffer from a loss of power. Nonparametric methods are extensively used in analysis of fixed-sample-size applications, with the well-known advantage of often increasing the power over parametric methods, especially in the presence of outliers or skewed data. The most popular and most powerful nonparametric methods usually involve replacing the observations by ranks, or scores based on ranks.
Nevertheless, a major problem arises when trying to adapt nonparametric rank methods to sequential data due to the intensive computation required if all the data are reranked each time a new nonparametric test is performed. Some solutions involve grouping the observations so the sequence is not observed continuously, but only at selected intervals [Ross et al., 2011]. However, this lessens the sensitivity of the analysis by effectively reducing the number of times the sequence is tested.

From Parent [1965], a reasonable solution to the computation problem involves using sequential ranks so only the most recent observation is ranked relative to the previous, unchanged, ranks. Parent notes that there is a one-to-one correspondence between the ordinary ranks and the sequential ranks, so no information is lost by using sequential ranks instead of ordinary ranks. In fact there is an advantage gained, in that sequential ranks are independent of each other while ordinary ranks have a built-in dependence.

Parent adapted sequential ranks to allow continuous testing of a sequence of observations by adapting the sequential probability ratio test, which has optimum properties as shown by Wald and Wolfowitz [1948]. The adaptation is awkward and unwieldy, which may explain why it has not become popular in usage. Reynolds [1975] solved the problem of sequential ranks being “bottom heavy”, in that small ranks are more likely to occur than large ranks, which occur only as the sequence accumulates more observations. He “standardized” the sequential ranks by dividing them at each stage by $(i+1)$, where $i$ represents the number of observations up to that point. This effectively “spread” the ranks over the unit interval $(0,1)$ at each stage in the analysis. However, the null distribution of the rank statistics was difficult to find, and so he showed it converged to a Markov process, and used the theory of Markov processes to obtain critical values for statistical tests.
This too is awkward and unwieldy and has not been widely accepted in practice.

We are suggesting (for the first time, perhaps) to take advantage of the relationship between ranks and the estimators of a cumulative probability to replace the ranks by normal scores in the sense of Van der Waerden [1952], where each sequential rank is substituted with the corresponding quantile of the standard normal distribution. Thus, the ranks are effectively replaced by numbers that appear to have come from a standard normal distribution. Because these sequential normal scores are independent of each other, and behave like normal random variables, the usual sequential methods based on normal observations may be applied as approximate methods, and special analytical methods are no longer needed. Changes in location or variation occurring in the original sequence will be reflected by the changes in shift or spread of the sequential normal scores.

This paper examines the theoretical validity of sequential normal scores and evaluates their versatility through the analysis of known real-life examples found in the literature on applications of statistical methods to the analysis of sequences of observations. Section 2 is meant for a wide audience of practitioners and academics alike. It describes and discusses four variations of the proposed method with their corresponding numerical examples that illustrate the flexibility in practical situations. Descriptions are straightforward, only a basic level of mathematics and statistical methods is required, and the discussion is centered around the application.

Section 3 is intended for a specialized audience. It demands a deeper knowledge of probability and statistical theory by offering a discussion on the mathematical development of the method and some of its properties, to establish confidence among users that the methods have a sound theoretical justification and to foster further applications.
Readers who are interested only in the application might want to skip this section. Finally, Section 5 provides a summary of the approach, its practical implications, and suggestions for future research.

2 Sequential Normal Scores
Common industrial applications of sequential tests include the statistical monitoring of individual or batched observations where some knowledge might, or might not, be available about the population quantiles. If available, known or estimated quantiles can be incorporated to increase the sensitivity of a statistic. Otherwise, a self-starting approach is required to build knowledge as it becomes available. Both situations are addressed in this section through four related models. The first two models address the self-starting situation, while the last two aim to incorporate existing knowledge about a quantile. An example is given immediately after a model is presented to illustrate and discuss applicability.
Let $X_1, X_2, \ldots$ be a sequence of independent identically distributed random variables with a continuous distribution function $F(x)$. The sequential rank $R_i$ of $X_i$, where $i$ stands for the observed order within the sequence, is defined by Parent [1965] as the rank of $X_i$ relative to the previous random variables in the sequence up to and including $X_i$. The sequential ranks of all the observations preceding $X_i$ remain unchanged. This can save considerable computing time when using rank-based nonparametric methods.

For reasons explained in Section 3 we will use

$$P_i = \frac{R_i - 0.5}{i} \tag{1}$$

to estimate $F(X_i)$. Sequential normal scores are obtained from $P_i$ using $Z_i = \Phi^{-1}(P_i)$, where $\Phi^{-1}$ stands for the inverse cumulative standard normal distribution function. As will be proven in Section 3, the sequence $\{Z_i : i = 1, 2, \ldots\}$ consists of mutually independent, asymptotically (as $i$ gets large) standard normal random variables.

As a short illustration of how sequential normal scores are obtained, consider the following observations on the first 10 random variables $X_i$ ($i = 1, 2, \ldots, 10$) in a sequence from an arbitrary continuous distribution function $F(x)$: 4.6, 5.1, 3.9, 4.4, 4.8, 6.6, 5.3, 8.3, 4.7, and 5.0. The sequential ranks $R_i$ are: 1, 2, 1, 2, 4, 6, 6, 8, 4, and 6. Note that $F(x)$ is assumed to be continuous so the probability of ties is zero, but if ties occur due to round-off, average ranks can be used in practice. Hence, the estimates of $F(X_i)$, or $P_i$, are: 0.5000, 0.7500, 0.1667, 0.3750, 0.7000, 0.9167, 0.7857, 0.9375, 0.3889, and 0.5500. The sequential normal scores $Z_i$ are then (rounded to four decimals): 0.0000, 0.6745, -0.9674, -0.3186, 0.5244, 1.3830, 0.7916, 1.5341, -0.2822, 0.1257. We will show in Section 3 that if the $X_i$'s in the sequence are independent, the sequential ranks are independent [Parent, 1965] and therefore the sequential normal scores are independent and approach in distribution the standard normal distribution as $i$ gets large.

2.1.1 Bearing example

When bearings slide over a lubricated surface, vibrations are steady and predictable, that is, in control. However, after a long period of use, microscopic fractures initiate a vicious cycle that increases vibrations exponentially until the bearings fatigue and break. Quick detection of a sustained change in bearing vibrations creates an advantage that might allow preventive maintenance to avoid costly machine repairs. To illustrate the application of sequential normal scores on real measurements, the procedure was applied to a data set from the IEEE PHM 2012 Data Challenge organized by the IEEE Reliability Society and FEMTO-ST Institute. During the challenge to estimate the remaining useful life of bearings, several experiments to accelerate degradation were carried out on a laboratory experimental platform called PRONOSTIA. Datasets and further information about the data challenge are available in Nectoux et al. [2012]. Vibrations were processed using a fast Fourier transform (FFT) and the results were summarized as RMS (root mean square) measurements, a convention used to provide an indication of the amount of energy spent on vibration. FFT and RMS measurements were calculated by Barraza-Barraza [2015], where time series models are used to characterize PRONOSTIA's bearing data. The specific data set used to create this example was taken from Nectoux et al. [2012], scenario 1, bearing 4. Here, the bearing was run to failure; hence, it is known, by design, that the measurements describe a movement from an in-control state to an out-of-control state.

Using a least squares approach over the first 1000 RMS observations, an MA(1) model was fitted to the first differences, and the independent errors were used for process monitoring. A CUSUM chart was constructed after transforming the errors into sequential normal scores. Even if the data are not normal, practitioners can rely on the normal approximation of the scores and proceed with a CUSUM chart set up for normally distributed observations. The monitored errors can be seen in Figure 1a, and the corresponding CUSUM chart applied to the sequential normal scores is shown in Figure 1b. A one-sided CUSUM for positive shifts with an allowance $k = 0.25$ and a decision interval $H = 7.267$ was used for monitoring to achieve an approximate in-control ARL of 500. A signal is triggered at observation 1082, just before vibrations start to exhibit an evident “violent” behavior.
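The transformation of equation (1) and the one-sided CUSUM used in this example can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names `sequential_ranks`, `sequential_normal_scores` and `upper_cusum` are hypothetical, the ranking rescans the history for clarity rather than speed, and the CUSUM is the textbook tabular recursion $C^+_i = \max(0, C^+_{i-1} + Z_i - k)$, assumed here to match the chart described above. The data are the ten illustrative observations from this section, not the bearing measurements.

```python
from statistics import NormalDist


def sequential_ranks(xs):
    """Rank each observation among itself and all earlier observations.

    Ties (e.g. from rounding) receive average ranks. This naive version
    rescans the history, so it costs O(i) per new observation.
    """
    ranks = []
    for i, x in enumerate(xs):
        earlier = xs[:i]
        below = sum(1 for y in earlier if y < x)
        tied = sum(1 for y in earlier if y == x)
        ranks.append(below + 1 + tied / 2)
    return ranks


def sequential_normal_scores(xs):
    """Z_i = Phi^{-1}((R_i - 0.5) / i), with i counted from 1 as in equation (1)."""
    inv_cdf = NormalDist().inv_cdf
    return [inv_cdf((r - 0.5) / i)
            for i, r in enumerate(sequential_ranks(xs), start=1)]


def upper_cusum(scores, k, h):
    """Textbook one-sided tabular CUSUM: C_i = max(0, C_{i-1} + z_i - k).

    Returns the CUSUM path and the 1-based index of the first C_i > h
    (None if the chart never signals).
    """
    c, path, signal = 0.0, [], None
    for i, z in enumerate(scores, start=1):
        c = max(0.0, c + z - k)
        path.append(c)
        if signal is None and c > h:
            signal = i
    return path, signal


# The ten illustrative observations from the text (not the bearing data).
x = [4.6, 5.1, 3.9, 4.4, 4.8, 6.6, 5.3, 8.3, 4.7, 5.0]
z = sequential_normal_scores(x)
print([round(v, 4) for v in z])

# Same chart constants as in the bearing example.
path, signal = upper_cusum(z, k=0.25, h=7.267)
print(signal)
```

Rounded to four decimals, the printed scores agree with the values listed in the text. A production version would presumably keep the history in an ordered structure so that each new rank is found without rescanning all previous observations.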
Figure 1: A CUSUM chart was used on sequential normal scores obtained from observed errors after fitting an ARIMA(0,1,1) model over a first set of 1000 in-control RMS vibration measurements from bearings at the IEEE PHM 2012 Data Challenge. A one-sided CUSUM with a reference value $k = 0.25$ and a decision interval $H = 7.267$ was used. (a) Monitored errors from bearing vibrations (errors, RMS increments minus model, against observations). (b) One-sided CUSUM using sequential normal scores ($C^+$ against observations, with decision interval $H$).

Let $\{X_{i,j} : i = 1, 2, \ldots;\ j = 1, \ldots, m\}$ be a sequence of independent identically distributed random variables with continuous distribution function $F(x)$. The second subscript refers to the fact that these random variables are grouped into batches (samples) of size $m$. For the first batch ($i = 1$) the $m$ random variables are ranked relative to the other random variables in that batch, and the ranks are denoted by $R_{1,j}$ for $j = 1, \ldots, m$. For all subsequent batches the sequential rank $R_{i,j}$ of $X_{i,j}$, where $i$ stands for the batch number and $i \ge 2$, is given by

$$R_{i,j}(X_{i,j}) = \sum_{k=1}^{i-1} \sum_{l=1}^{m} I(X_{k,l} \le X_{i,j}) + 1 \tag{2}$$

for $i \ge 2$; $j = 1, 2, \ldots, m$, where $I(X_{k,l} \le X_{i,j})$ is an indicator function.

This sequential rank is computed for each $j = 1, \ldots, m$ in batch $i$. Note that several of these sequential ranks within a batch may be equal to one another, but they do not change as more batches are observed. Again, this can save considerable computing time when using rank-based nonparametric methods.

For reasons explained in Section 3 we will use

$$P_{1,j} = \frac{R_{1,j} - 0.5}{m}, \tag{3}$$

and

$$P_{i,j} = \frac{R_{i,j} - 0.5}{m(i-1) + 1} \tag{4}$$

for $i > 1$, to estimate $F(X_{i,j})$. Sequential normal scores are obtained from $P_{i,j}$ using $Z_{i,j} = \Phi^{-1}(P_{i,j})$, where $\Phi^{-1}$ stands for the inverse cumulative standard normal distribution function. As will be proven in Section 3, the sequence $\{Z_{i,j} : i = 1, 2, \ldots;\ j = 1, 2, \ldots, m\}$ consists of mutually independent, asymptotically (as $i$ gets large) standard normal random variables. By comparing each random variable with only the previous batches plus itself, and not with the other random variables in the same batch, the independence of the sequential ranks is kept, and the sensitivity to detect a change is improved by not using comparisons within a batch that potentially comes from the alternative distribution.

2.2.1 Service time example

We use an example from Yang et al. [2012] and Yang and Arnold [2015] in which the efficiency of the service system in a bank is analyzed. The service process times for 10 counters were measured (in minutes) every 2 days for 30 days. Fifteen in-control samples of size $m = 10$ were obtained from a bank branch. Yang and Arnold [2015] show that the data appear to be right-skewed (see Figure 2a), hence a procedure based on the normal distribution is not recommended. Later, 10 new samples from a new automatic service system were collected. From the analysis carried out by Yang and Arnold [2015] there is indeed a change in the new samples. That is, the 10 new samples, which belong to an out-of-control state, showed a reduction in the variance. By performing a Phase I analysis, Yang and Arnold [2015] estimated the variance of the process from the first 15 samples, and carried out a test on the remaining 10 samples. The null hypothesis was rejected after the second sample.

Without using a Phase I analysis to estimate an in-control value of the variance to use as reference, sequential normal scores can be adapted to test for a variance change.
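The batch-wise transformation of equations (2) to (4), on which this example relies, can be sketched as follows. This is an illustrative outline under stated assumptions rather than the authors' code: the function name `batch_sequential_normal_scores` is hypothetical, average ranks are used for ties inside the first batch (an assumption carried over from the remark on ties in Section 2.1), and the two small batches are made-up numbers, not the bank data.

```python
from statistics import NormalDist


def batch_sequential_normal_scores(batches):
    """Return one list of normal scores per batch.

    First batch: ordinary ranks within the batch, P = (R - 0.5) / m   (eq. 3).
    Later batches: each value is ranked against all earlier batches only,
    P = (R - 0.5) / (m (i - 1) + 1)                                   (eqs. 2, 4).
    """
    inv_cdf = NormalDist().inv_cdf
    history = []   # every observation from the batches already processed
    scores = []
    for i, batch in enumerate(batches):
        if i == 0:
            m = len(batch)
            probs = []
            for x in batch:
                less = sum(1 for y in batch if y < x)
                equal = sum(1 for y in batch if y == x)  # counts x itself
                rank = less + (equal + 1) / 2            # average rank under ties
                probs.append((rank - 0.5) / m)
        else:
            n_prev = len(history)                        # equals m * (i - 1)
            probs = [(sum(1 for y in history if y <= x) + 1 - 0.5) / (n_prev + 1)
                     for x in batch]
        scores.append([inv_cdf(p) for p in probs])
        history.extend(batch)  # the batch joins the reference set only afterwards
    return scores


# Two made-up batches of size m = 3.
batches = [[4.6, 5.1, 3.9], [4.4, 4.8, 6.6]]
for row in batch_sequential_normal_scores(batches):
    print([round(v, 4) for v in row])
```

Adding a batch to `history` only after all of its members have been scored is what keeps observations in the same batch from being compared with one another.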
Scores $Z_{i,j}$ can be used in an optimal CUSUM for a downward shift in process variance by monitoring the statistic

$$C^-_i = \min\left(0,\; C^-_{i-1} + s_i^2 - k\right),$$

where $C^-_0 = 0$, $s_i^2$ corresponds to the sample variance (from the scores) in batch $i$, and

$$k = 2\log\left(\frac{\sigma_0}{\sigma_1}\right)\frac{\sigma_0^2\,\sigma_1^2}{\sigma_0^2 - \sigma_1^2},$$

where $\sigma_1^2$ and $\sigma_0^2$ correspond to the out-of-control and the in-control variance, respectively. A signal is given when $C^-_i < H$. From Hawkins and Olwell [2012] we obtain a value of $H = -1.645$ and an allowance $k = 0.793$ to achieve an ARL of 370.0, which is optimum for a reduction in the standard deviation from $\sigma_0 = 1$ to $\sigma_1 = 0.8$ (substituting these values into the expression for $k$ gives 0.793). The value 0.8 for the out-of-control standard deviation was selected arbitrarily to illustrate the approach. Original observations and the results from the CUSUM are plotted in Figure 2. It can be seen that a variance reduction was signaled at observation 22, that is, 7 observations after the actual change. Note that the natural CUSUM change-point statistic, the greatest observation with cumulative value $C^-$ equal to zero, is 16, the actual first out-of-control sample. In this example, the first batch was transformed using the ranks within the batch, and the subsequent batches were ranked using sequential ranks. Then the inverse of the standard normal cumulative distribution function was evaluated on the resulting $P_{i,j}$ as defined in equations (3) and (4).

If a priori information exists about the null distribution, such as that obtained from a Phase I analysis as done in Yang and Arnold [2015], historical data, or a target median, an adaptation of the sequential normal scores to incorporate quantile information can be used to increase the efficiency of a test procedure. This adaptation is illustrated in the following two subsections.

Figure 2: Optimal CUSUM chart for subgroup variance with $k = 0.793$ and $H = -1.645$, used on sequential normal scores obtained from service times in the bank example data set. (a) Original measurements from the bank service times example (service time in minutes against observations). (b) Optimal CUSUM for variance on sequential normal scores ($C^-$ against observations, with decision interval $H$).

Let $X_1, X_2, \ldots$ be a sequence of independent identically distributed random variables with a continuous distribution function $F(x)$. In contrast to Section 2.1 we will assume that a quantile $\theta$ and its corresponding probability $F(\theta)$ are known, or assumed to be known, such as $\theta$ = median.
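Before the formal definition, the idea of ranking separately on each side of a known quantile, as used in equation (5) of this subsection, can be previewed with a short sketch. It is only an illustration under stated assumptions: the function name `conditional_sequential_normal_scores` and its parameter names are hypothetical, the default `f_theta=0.5` corresponds to $\theta$ being the median, and the five observations and the value $\theta = 4.7$ are made-up inputs.

```python
from statistics import NormalDist


def conditional_sequential_normal_scores(xs, theta, f_theta=0.5):
    """Sequential normal scores conditional on a known quantile theta.

    Each observation is ranked only among the observations seen so far on
    its own side of theta (itself included); the rescaled rank is then
    mapped into (0, F(theta)) or (F(theta), 1) before applying Phi^{-1}.
    """
    inv_cdf = NormalDist().inv_cdf
    below, above = [], []   # observations <= theta and > theta seen so far
    scores = []
    for x in xs:
        if x <= theta:
            rank = sum(1 for y in below if y <= x) + 1
            below.append(x)                       # len(below) is now N_i^-
            p = f_theta * (rank - 0.5) / len(below)
        else:
            rank = sum(1 for y in above if y <= x) + 1
            above.append(x)                       # len(above) is now N_i^+
            p = f_theta + (1.0 - f_theta) * (rank - 0.5) / len(above)
        scores.append(inv_cdf(p))
    return scores


# Made-up observations with an assumed known median of 4.7.
x = [4.6, 5.1, 3.9, 4.4, 4.8]
print([round(v, 4) for v in conditional_sequential_normal_scores(x, theta=4.7)])
```

With $F(\theta) = 0.5$, every observation at or below $\theta$ receives a negative score and every observation above it a positive one, which is how the known quantile sharpens the scoring from the very first observation.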
Define the conditional sequential rank $R_{i|\theta}$ of $X_i$ differently depending on whether $X_i \le \theta$ or $X_i > \theta$. In particular, if $X_i \le \theta$ then the conditional sequential rank $R_{i|\theta}$ of $X_i$ is the rank of $X_i$ relative only to the previous random variables that were less than or equal to $\theta$, including $X_i$. On the other hand, if $X_i > \theta$ then the conditional sequential rank $R_{i|\theta}$ of $X_i$ is the rank of $X_i$ relative only to the previous random variables that were greater than $\theta$, including $X_i$.

Let $N^-_i$ be the number of random variables, of the first $i$ random variables in the sequence, that are less than or equal to $\theta$, and let $N^+_i$ be the number of random variables, of the first $i$ random variables in the sequence, that are greater than $\theta$. Then $N^-_i + N^+_i = i$. For reasons explained in Section 3 we will use

$$P_{i|\theta} = \begin{cases} F(\theta)\,\dfrac{R_{i|\theta} - 0.5}{N^-_i}, & \text{if } X_i \le \theta \\[2ex] F(\theta) + [1 - F(\theta)]\,\dfrac{R_{i|\theta} - 0.5}{N^+_i}, & \text{if } X_i > \theta \end{cases} \tag{5}$$

to estimate $F(X_i \mid \theta)$. Conditional sequential normal scores are obtained from $P_{i|\theta}$ using $Z_{i|\theta} = \Phi^{-1}(P_{i|\theta})$, where $\Phi^{-1}$ stands for the inverse cumulative standard normal distribution function. As will be proven in Section 3, the sequence $\{Z_{i|\theta} : i = 1, 2, \ldots\}$ consists of mutually independent, asymptotically (as $i$ gets large) standard normal random variables.

From Aichouni et al. [2014] we obtain a data set of compressive strength for ready mixed concrete. The compressive strength of 22 samples of concrete with a target specification of 350 Kgf/cm² was measured over a period of 30 days. As shown by Aichouni et al., the observations do not fit a normal distribution. They addressed the situation of non-normality by using a Johnson transformation. However, the use of a transformation in small data sets comes with the risk of over-fitting and of selecting a mistaken transformation function. By using conditional sequential normal scores, this risk is avoided. Nevertheless, Aichouni et al. did not evaluate whether the sampled measurements achieved the target value; they tested only for isolated changes (where the parametric scenario might be the best approach). To continue with the analysis, we evaluate whether the compressive strength is actually significantly larger than the median target. By defining the nominal value of 350 as the target process median, the analysis can be carried out by using conditional sequential normal scores from equation (5). Here, it is assumed that under the null distribution $F(350) = 0.5$. An allowance $k = 0.25$ and a control limit $H = 5.597$, used as parameters of a one-sided tabular CUSUM chart, achieve an in-control performance of 200 in terms of average run length (ARL). Even though we are monitoring a null median of 350, and this information is incorporated into the sequential normal scores, the monitored scores still have a mean value of zero and their behavior can be approximated with a standard normal distribution. As can be seen in Figure 3, an alarm is signaled at sample 21, which indicates that the mixture is providing more strength than the nominal specification. Because sequential normal scores provide a conservative approximation of the standard normal distribution, with a slightly smaller variance (see Subsection 3.2 for details), the true in-control ARL is at least the one specified by the CUSUM setup.

Figure 3: CUSUM obtained from sequential normal scores of compressive strength for Ready Mixed Concrete (Kgf/cm²). The average was used as an individual observation. An allowance of $k = 0.25$ and a decision interval of $H = 5.597$ were used. (a) Average compressive strength measurements (kgf/cm² against observations). (b) CUSUM chart on sequential normal scores (CUSUM against observations, with decision interval $H$).

Let $\{X_{i,j} : i = 1, 2, \ldots;\ j = 1, \ldots, m\}$ be a sequence of independent identically distributed random variables with a continuous distribution function $F(x)$. The second subscript refers to the fact that these random variables are grouped into batches (samples) of size $m$. In contrast to Section 2.2 we will assume that a quantile $\theta$ and its corresponding probability $F(\theta)$ are known, or assumed to be known, such as $\theta$ = median. For the first batch ($i = 1$) the random variables that are less than or equal to $\theta$ are ranked relative to only the other random variables in the first batch that are less than or equal to $\theta$, and the random variables that are greater than $\theta$ are ranked relative to only the other random variables in the first batch that are greater than $\theta$. These conditional ranks are denoted by $R_{1,j|\theta}$ for $j = 1, \ldots, m$. For all subsequent batches define the conditional sequential rank $R_{i,j|\theta}$ of $X_{i,j}$ differently depending on whether $X_{i,j} \le \theta$ or $X_{i,j} > \theta$. In particular, if $X_{i,j} \le \theta$, then the conditional sequential rank $R_{i,j|\theta}$ of $X_{i,j}$ is the rank of $X_{i,j}$ relative to only the random variables that were less than or equal to $\theta$ in the previous batches, including $X_{i,j}$, but no other random variables in the same batch $i$. On the other hand, if $X_{i,j} > \theta$ then the conditional sequential rank $R_{i,j|\theta}$ of $X_{i,j}$ is the rank of $X_{i,j}$ relative to only the random variables that were greater than $\theta$ in the previous batches, including $X_{i,j}$, but no other random variables from the same batch $i$.

Let $N^-_1$ be the number of random variables of the first batch in the sequence that are less than or equal to $\theta$, and $N^+_1$ be the number of random variables of the first batch that are greater than $\theta$. For $i > 1$, let $N^-_i$ be the number of random variables, of the first $i - 1$ batches, that are less than or equal to $\theta$, and let $N^+_i$ be the number of random variables of the first $i - 1$ batches that are greater than $\theta$. Then $N^-_1 + N^+_1 = m$ and $N^-_i + N^+_i = (i - 1)m$ for $i > 1$. For reasons explained in Section 3 we will use $P_{1,j|\theta}$ for $i = 1$ and $P_{i,j|\theta}$ for $i > 1$, as

$$P_{1,j|\theta} = \begin{cases} F(\theta)\,\dfrac{R_{1,j|\theta} - 0.5}{N^-_1}, & X_{1,j} \le \theta \\[2ex] F(\theta) + (1 - F(\theta))\,\dfrac{R_{1,j|\theta} - 0.5}{N^+_1}, & X_{1,j} > \theta \end{cases} \tag{6}$$

and, for $i \ge 2$,

$$P_{i,j|\theta} = \begin{cases} F(\theta)\,\dfrac{R_{i,j|\theta} - 0.5}{N^-_i + 1}, & X_{i,j} \le \theta \\[2ex] F(\theta) + (1 - F(\theta))\,\dfrac{R_{i,j|\theta} - 0.5}{N^+_i + 1}, & X_{i,j} > \theta. \end{cases} \tag{7}$$

Conditional sequential normal scores are obtained from $P_{i,j|\theta}$ using $Z_{i,j|\theta} = \Phi^{-1}(P_{i,j|\theta})$, where $\Phi^{-1}$ stands for the inverse cumulative standard normal distribution function. As proven in Section 3, the sequence $\{Z_{i,j|\theta} : i \ge 1;\ j = 1, \ldots, m\}$ consists of mutually independent, asymptotically (as $i$ gets large) standard normal random variables.

From Bakir and McNeal [2010] we obtained data that consist of GPA results from management major students of the Department of Business Administration at Alabama State University. Measurements were taken from Spring 2005 through Spring 2009. The research showed that GPAs maintained a desired target median level of 2.600 but were significantly below the higher target median of 2.800, which represented 70% of the maximum score of 4 points. The data did not satisfy the assumption of normality, and a nonparametric control chart based on signed ranks was proposed by the authors. The original data are plotted in Figure 4a. Using equations (6) and (7), observations are transformed into conditional sequential normal scores that, in turn, are evaluated using an EWMA statistic

$$U_i = \lambda \bar{Z}_i + (1 - \lambda) U_{i-1} \tag{8}$$

where $U_0 = 0$, and $\bar{Z}_i$ is the average of $Z_{i,j|\theta}$; $j = 1, \ldots, m$; as seen in Qiu [2013]. Significance is achieved when the plotted statistic surpasses the limits

$$\pm \rho \sqrt{\frac{\lambda}{2 - \lambda}\left[1 - (1 - \lambda)^{2i}\right]}\;\frac{\sigma}{\sqrt{m}}. \tag{9}$$

Here, $\sigma = 1$, the standard deviation of the standard normal distribution approximated by the sequential normal scores. Following the spc package in R, the xewma.arl function is used in a calibration process to obtain the value of the parameter $\rho = 2.714$ with a convenient λ = 0 . ARL ≈ .

Figure 4: EWMA control chart applied over sequential normal scores from the GPA data set example with a null median of 2.8. (a) Original GPA measurements (GPA against sample). (b) EWMA for subgroup observations on sequential normal scores ($U_i$ against sample, with EWMA, center line, and limit).

2.5 Application remarks

Sequential normal scores extend the use of parametric procedures to deal with observations from an unknown distribution. However, practitioners should be aware that such a transformation is only appropriate to deal with sustained changes. If isolated changes are a concern, as might be the case for a test for outliers, a parametric approach might be the best option.

Big data users and researchers might find the sequential normal scores transformation appealing due to its asymptotic behavior and the reduced computational effort required for its implementation. On one hand, as seen in Section 3, the transformation becomes close to an exact quantile transformation into normality. When a large data set is available, model-based approaches usually fail to fit the precise observed behavior. Nonparametric approaches, such as sequential normal scores, are more likely to provide null distributions with a better fit to the true unknown probability law behind the statistic used to evaluate the data. On the other hand, by following a sequential ranking scheme that avoids re-ranking previous observations, a large amount of computational time is saved. For instance, Figure 5 illustrates the time, in seconds, it takes for three nonparametric statistics used in the statistical process control literature, including sequential normal scores, to be calculated once, after a number of observations is available from a data stream. We considered the statistic of Hawkins and Deng [2010], which is a change-point statistic based on the Mann-Whitney statistic; and the statistic of Chowdhury et al. [2015], where a reference sample is evaluated against a monitored sample using the Lepage statistic. All measurements were carried out using a Hewlett-Packard PC, model 6200 PRO SFF, with an Intel Core i3-2120 processor, a 500 GB hard drive, 3 GB of memory and Windows 7 Pro. It can be seen that the statistic of Hawkins and Deng [2010] could not be computed for sample sizes bigger than 10 , whereas the statistic of Chowdhury et al. [2015] and the sequential normal scores statistic presented good performance in terms of computational time for data streams up to 10 . In each scenario the computation of sequential normal scores is at least 10 times faster than that of Chowdhury et al.'s [2015] statistic.

Figure 5: Computer times to compute a charting statistic over different numbers of observations from Hawkins and Deng [2010] (CC-MW), Chowdhury et al. [2015] (Lepage), and the sequential normal scores (SNS). Axes: time (seconds) against number of observations.

3 Theoretical framework
Consider a sequence $X_1, X_2, \ldots$ of independent random variables identically distributed according to a continuous distribution function $F(x)$. Note that $F(X_i)$ is a uniform $(0, 1)$ random variable for all $i \ge 1$. Sequential ranks $R_i = R(X_i)$ are defined by Parent [1965] as $R_1 = 1$ and, for $i \ge 2$,

$$R_i = \sum_{j=1}^{i-1} I(X_j \le X_i) + 1. \tag{10}$$

Note that the indicator variables $I(X_j \le X_i)$ are Bernoulli random variables for $1 \le j \le i - 1$, with mean $1/2$, variance $1/4$, and covariance $1/12$.

Consider $P_i = P(X_i)$ as an estimator of $F(X_i)$, where

$$P_i = \frac{R_i - 1 + a}{i - 1 + b}, \qquad b > 0;\ 0 < a < b. \tag{11}$$

Thus, $0 < P_i < 1$ and $Z_i = \Phi^{-1}(P_i)$ is a well-defined normal score, where $\Phi^{-1}$ is the inverse of the standard normal distribution function.

We would like to choose $a$ and $b$ so that the mean and variance of $Z_i$ are 0 and 1 respectively. The mean of $Z_i$ equals 0 if and only if the mean (and median) of $P_i$ is 0.5, due to the quantile-preserving property of monotonic transformations. The mean of $P_i$ is, from equation (10),

$$E(P_i) = \frac{\frac{i-1}{2} + a}{i - 1 + b} \tag{12}$$

which equals 0.5 if and only if $b = 2a$. The variance of $Z_i$ is a function of $b$, which is discussed in the next subsection. The variance of $P_i$ can be easily found to be

$$Var(P_i) = \frac{\frac{i-1}{4} + \frac{(i-1)(i-2)}{12}}{(i - 1 + b)^2} = \frac{i^2 - 1}{12\,(i - 1 + b)^2} \tag{13}$$

which is a function of $b$, and equals $1/12$ (the variance of $F(X)$) if and only if $b = 1 - i + \sqrt{i^2 - 1}$. As $i$ gets large, $b$ approaches 1, which suggests using $b = 1$ for finite sample sizes.

Unfortunately the variance is not preserved in the normal scores transformation, so we are forced to use numerical approximation methods to find values of $b$ that result in $Var(Z_i) = 1$. These results are summarized in Table 1.

Table 1 shows that the actual value of $b$ to obtain $Var(Z_i) = 1$ increases from $b = 0.465$ for $i = 2$ to $b = 0.915$ for $i = 5000$. The effect of using $b = 1$ as an approximation is slight (less than 2% error for the standard deviation of $Z_i$) for $i > 31$, and the approximation b = 0 . − . i (14) results in less than a 1% error for $i > 2$ (3.7% error for $i = 2$). If $b = 1$ is used for all values the result is a conservative estimate for the standard deviation of the sequential normal scores $Z_i$, so that is what we used in our examples. A slight increase in power results from using the approximation for $b$ when $i < 31$. Nevertheless, the increased power was not big enough to change the decisions obtained from the particular numerical examples shown in Section 2.

This approximation problem has been previously addressed. For example, Van der Waerden [1952] used $a = 1$, $b = 2$; Blom [1954] used $a = 5/8$, $b = 1.25$; Tukey [1962] used $a = 2/3$, $b = 4/3$; and Bliss et al. [1956] used $a = 1/2$, $b = 1$, which they called a “rankit”. An empirical study by Solomon and Sawilowsky [2009] compared these four approximations and concluded that “Rankit emerged as the most accurate method among all sample sizes and distributions, thus it should be the default selection for score normalization in the social and behavioral sciences” (p. 448). Note that all four approximations preserve a mean of 0 for the normal scores because $b = 2a$. However, all four approximations result in a variance less than 1.0, which we addressed in this section.

Table 1: Standard deviation of $Z_i$ when $b = 1$, and values of $b$ that give 1.0 for the standard deviation of $Z_i$. Also given is the standard deviation of $Z_i$ when the approximation for $b$ from equation (14) is used.

i      σ when b = 1   b to obtain σ = 1   b using approximation   σ using b from equation (14)
2      .675           .465                .428                    1.037
3      .790           .565                .560                    1.004
4      .844           .618                .626                    .996
5      .876           .652                .666                    .994
10     .938           .727                .745                    .995
20     .969           .778                .784                    .999
30     .979           .800                .798                    1.000
31     .980           .802                .798                    1.000
32     .980           .803                .799                    1.000
100    .994           .847                .816                    1.001
1000   .999           .896                .823                    1.001
5000   1.000          .915                .824                    1.000

3.3 Include a priori information about a known quantile

Let $X_1, X_2, \ldots$ be a sequence of independent observations, and let $F(\cdot)$ correspond to their common cumulative null distribution. Also, assume that $F(\theta)$ and $\theta$ are known, where the latter is a constant.
(To facilitate notation we will use $F(a)$ and $P(X \le a)$, without the subindex in the variable $X$, interchangeably to express the evaluation of the cumulative distribution function at a constant $a$ under the null hypothesis, and we will suppress the fact that $F(\theta)$ is known even though it is implicit in all of the probabilities of this section.)

Given a constant $a$, a conditional sequential normal scores statistic can be constructed by noting that

$$P(X \le a) = P(X \le a \mid X \le \theta)\,P(X \le \theta) + P(X \le a \mid X > \theta)\,P(X > \theta) \tag{15}$$

where $P(X \le a)$ can be estimated by using

$$\hat{P}(X_i \le a \mid X_i \le \theta) = \frac{\sum_{j=1}^{i-1} I(X_j \le a)\,I(X_j \le \theta) + 0.5}{\sum_{j=1}^{i-1} I(X_j \le \theta) + 1} \tag{16}$$

and

$$\hat{P}(X_i \le a \mid X_i > \theta) = \frac{\sum_{j=1}^{i-1} I(X_j \le a)\,I(X_j > \theta) + 0.5}{\sum_{j=1}^{i-1} I(X_j > \theta) + 1} \tag{17}$$

in equation (15). Equations (16) and (17) are maximum likelihood estimators with biasing constants 0.5 and 1 defined in the previous section. Hence, using the fact that $P(X \le \theta)$ is known, a cumulative probability can be better estimated by

$$\hat{P}(X \le X_i) = \begin{cases} P(X \le \theta)\,\dfrac{\sum_{j=1}^{i-1} I(X_j \le X_i)\,I(X_j \le \theta) + 0.5}{\sum_{j=1}^{i-1} I(X_j \le \theta) + 1}, & X_i \le \theta \\[2ex] P(X \le \theta) + P(X > \theta)\,\dfrac{\sum_{j=1}^{i-1} I(X_j \le X_i)\,I(X_j > \theta) + 0.5}{\sum_{j=1}^{i-1} I(X_j > \theta) + 1}, & X_i > \theta \end{cases} \tag{18}$$

Conditional sequential normal scores are then defined by evaluating $\Phi^{-1}(\hat{P}(X \le X_i))$.

The following Proposition states the independence of the sequence $\{P_i : i = 1, \ldots, n\}$ and the asymptotic distribution of the sequence $\{Z_i : i = 1, \ldots, n\}$ of the model in Section 2.1.

Proposition.
Let $X_1, X_2, \ldots, X_n$ be a sequence of i.i.d. random variables. Define $P_i$ and $Z_i$ as in Section 2.1. Then,

1. $\{P_i : i = 1, \ldots, n\}$ are mutually independent random variables.
2. $\{Z_i : i = 1, \ldots, n\}$ are independent, asymptotically standard normal random variables.

Proof. Since Theorem 1.1 in Barndorff-Nielsen [1963] states that the random variables $\{R_i(X_i) : i = 1, 2, \ldots\}$ are mutually independent, it follows by Theorem 4.6.12 in Casella and Berger [2002] that $\{P_i : i = 1, 2, \ldots\}$ are also mutually independent random variables. This in turn implies that the $Z_i$ are also mutually independent. By applying the Strong Law of Large Numbers, the Glivenko-Cantelli theorem implies that $P_i$ converges uniformly to a uniform (0,1) random variable, and thus $Z_i$ converges to a standard normal random variable, as $i$ goes to infinity.

It can be noted that the results of the Proposition apply to the models in Sections 2.2, 2.3 and 2.4 as well. The variations in the proof involve redefining sequential ranks in each section, but otherwise are trivial and are omitted here.

Corollary.
The results of the Proposition hold even if $P_{i,j}$ is given by equations (3) and (4) in Section 2.2, or if $P_{i|\theta}$ is given by equation (5) in Section 2.3, or if $P_{ij|\theta}$ is given by equations (6) and (7) in Section 2.4.

The previous results indicate that sequential normal scores are independent and that they approach a standard normal random variable as the series grows to infinity. However, practitioners are limited to a finite number of observations and might wonder how sequential normal scores perform under that restriction. Because this paper introduces the concept of sequential normal scores for the first time, the authors thought it would be beneficial to examine their behavior in more detail. Extensive Monte Carlo simulations were conducted to determine how well a standard normal distribution approximates the exact distribution of sequential normal scores, as defined in Section 2.1. Although an exact distribution can be obtained by using the fact that every ordering of the usual ranks of a random sequence of independent and identically distributed random variables is equally likely, and there is a one-to-one correspondence between the usual ranks and the sequential ranks, it was more convenient to use Monte Carlo sampling in this study.

First, the exact distribution of the $n$-th sequential rank $R_n$ is well known to be the uniform distribution over the integers 1 to $n$. But, after converting to $P_n$ and then to $Z_n$, how does the distribution of $Z_n$ compare with the standard normal distribution? Figure 6 shows the comparison for various values of $n$ from 2 to 1000.
It shows that the choice of using $b = 1$ for all values of $n$ does not appear to make an appreciable difference, as opposed to tailoring values of $b$ to make the variance closer to 1.0.

Second, the empirical distribution of the first $n$ sequential normal scores is compared with the standard normal distribution by averaging 1000 empirical distribution functions, with the results shown in Figure 7. These results show that the standard normal quantiles in the tails tend to be larger (in absolute value) than the exact quantiles, which will result in conservative tests, but the difference appears to be negligible for $n$ greater than 30. This agrees with our comparison of exact variances with the variance of the standard normal distribution in Section 3.2. Also, these exact distributions reveal a relatively large probability at $x = 0$, but this jump disappears almost completely as $n$ reaches 100 or more.

Finally, a single random sequence is evaluated at various lengths in Figure 8, and an Anderson-Darling goodness-of-fit test is applied. The resulting empirical distribution functions show some divergence from the standard normal distribution in the middle values of $x$, but good agreement in the tails of the distributions, where a good approximation is more important. The resulting p-values for this series are significant at $n = 100$ and $n = 300$, but the significance disappears for larger values of $n$.

[Figure 6 plot. Axis labels: Sequential normal scores, %.]
Figure 6: The exact distribution (solid line) of the $n$-th sequential normal score at different values of $n$, and the cumulative standard normal distribution (dotted line).
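The finite-sample comparison described in this section can be reproduced on a small scale. The sketch below is illustrative only and is not the authors' simulation code: it assumes the parametrization $P_i = (R_i - 1 + a)/(i - 1 + b)$ implied by equations (12) and (13), uses the rankit choice $a = 0.5$, $b = 1$ by default, ignores ties, and draws exponential data purely as an example of a non-normal continuous distribution. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal; inv_cdf is its quantile function


def sequential_normal_scores(xs, a=0.5, b=1.0):
    """Return Z_i = inv_cdf((R_i - 1 + a) / (i - 1 + b)) for each observation.

    R_i is the sequential rank of the i-th value among the first i values,
    computed as 1 + (number of earlier values strictly smaller). Ties are
    not treated specially in this sketch.
    """
    scores = []
    for i, x in enumerate(xs, start=1):
        rank = 1 + sum(1 for prev in xs[:i - 1] if prev < x)
        p = (rank - 1 + a) / (i - 1 + b)
        scores.append(STD_NORMAL.inv_cdf(p))
    return scores


def exact_sd(i, b=1.0):
    """Exact standard deviation of Z_i, using R_i uniform on 1..i and a = b/2."""
    a = b / 2.0
    zs = [STD_NORMAL.inv_cdf((r - 1 + a) / (i - 1 + b)) for r in range(1, i + 1)]
    # With b = 2a the scores are symmetric about 0, so the mean of Z_i is 0.
    return math.sqrt(sum(z * z for z in zs) / i)


def simulated_sd(i, reps=5000, seed=1):
    """Monte Carlo estimate of the standard deviation of Z_i (a = 0.5, b = 1)."""
    rng = random.Random(seed)
    last = []
    for _ in range(reps):
        xs = [rng.expovariate(1.0) for _ in range(i)]  # i.i.d. non-normal data
        last.append(sequential_normal_scores(xs)[-1])
    mean = sum(last) / reps
    return math.sqrt(sum((z - mean) ** 2 for z in last) / reps)


if __name__ == "__main__":
    for i in (2, 5, 10, 30):
        print(i, round(exact_sd(i), 3), round(simulated_sd(i), 3))
```

Under these assumptions, `exact_sd(2)` is approximately 0.674 and `exact_sd(5)` approximately 0.876, in line with the $b = 1$ column of Table 1, and the simulated values should agree with the exact ones up to Monte Carlo error regardless of the data distribution used.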
[Figure 7 plot. Panels: First 10, First 20, First 30, First 50, First 100, First 300, First 500, First 1000. Axis labels: Sequential normal scores, %.]
Figure 7: The expected value of the empirical distribution function (solid line) for different sequence lengths $n$, and the cumulative standard normal distribution (dotted line).

[Figure 8 plot. Panels: First 2, First 10, First 20, First 30, First 50, First 100, First 300, First 500, First 1000, each annotated with its Anderson-Darling p-value. Axis labels: Sequential normal scores, %.]
Figure 8: Empirical distributions of a sample path of sequential normal scores, the cumulative standard normal distribution and the p-value obtained from an Anderson-Darling test for $N(0, 1)$ at different moments in time.

A new sequential statistic based on normal scores, named sequential normal scores, was proposed as a link between parametric and nonparametric procedures that are sequential in nature, such as control charts used in online monitoring of data streams. The statistic extends the concept of sequential ranks into a modified version of normal scores to obtain a sequence of asymptotically standard normal and independent scores that can be analyzed with existing procedures originally created to deal with normal and independent observations. Four different versions of the statistic were presented to address different situations: individual or batched observations, with or without a priori knowledge about a quantile. For the case where sustained changes are a concern, the applicability of the proposed transformation is illustrated by applying control charts to sequential normal scores from different data sets available in the open literature. It is shown that sequential normal scores used with control charts are capable of detecting changes in a process, requiring only i.i.d. observations under the null hypothesis. Also, if a priori knowledge exists about a process quantile under the null assumption, which might be the case when a process is required to follow a target median, sequential normal scores can be adapted to incorporate the existing information in such a way that control charts with sequential normal scores become sensitive to changes from the very start of monitoring.

Big data users might be willing to analyze their data streams with the sequential normal scores transformation. If streams are very large, the fit to a normal distribution provided by sequential normal scores might be, in many cases, arguably better than the fit one might find from parametric models, unless the true underlying distribution is known. In addition, even though the low complexity of sequential normal scores makes them "computer friendly", if data streams are very large it is easy to restrict the comparisons to a moving window of reasonable size (e.g. 1000 or 5000) when determining the sequential ranks. This keeps both the memory usage and the number of operations needed to transform each observation constant.
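The moving-window idea can be sketched in a few lines. The example below is a minimal illustration under stated assumptions rather than a prescribed implementation: it uses the rankit form $(R - 0.5)/m$ (that is, $a = 0.5$, $b = 1$), ranks each new value against at most `window` of the most recent earlier observations, and does not handle ties. The function name and the `window` parameter are hypothetical.

```python
from collections import deque
from statistics import NormalDist


def windowed_sequential_normal_scores(stream, window=1000):
    """Yield one sequential normal score per observation in `stream`.

    Each new value is ranked only against the `window` most recent earlier
    values, so memory use and work per observation stay bounded.
    """
    std_normal = NormalDist()
    recent = deque(maxlen=window)  # the oldest value is dropped automatically
    for x in stream:
        m = len(recent) + 1  # size of the comparison set, including x itself
        rank = 1 + sum(1 for prev in recent if prev < x)
        yield std_normal.inv_cdf((rank - 0.5) / m)  # rankit form: a = 0.5, b = 1
        recent.append(x)


if __name__ == "__main__":
    data = [3.0, 1.0, 2.0, 5.0, 4.0]
    print([round(z, 3) for z in windowed_sequential_normal_scores(data, window=3)])
```

A generator is used so that scores can be produced as observations arrive, without storing the full stream. The linear scan over the window is the simplest choice; a sorted structure could reduce the per-observation cost for large windows, at the price of more bookkeeping.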
The proposed statistic offers a tool that makes nonparametric analysis readily available for practitioners and an approach that can be used by researchers to deal with new problems as they arise.

References
M Aichouni, AI Al-Ghonamy, and L Bachioua. Control charts for non-normal data: Illustrative example from the construction industry business. In Proceedings of the 16th International Conference on Mathematical and Computational Methods in Science and Engineering, 24, pages 71–76, Kuala Lumpur, Malaysia, 2014. WSEAS Press.

Saad T Bakir and Bob McNeal. Monitoring the level of students' GPAs over time. American Journal of Business Education (AJBE), 3(6):43–50, 2010.

Ole Barndorff-Nielsen. On the limit behaviour of extreme order statistics. The Annals of Mathematical Statistics, 34(3):992–1002, 1963.

Diana Barraza-Barraza. An adaptive ARX model to Estimate an Asset Remaining Useful Life. PhD thesis, Texas Tech University, 2015.

CI Bliss, Mary L Greenwood, and Edna Sakamoto White. A rankit analysis of paired comparisons for measuring the effect of sprays on flavor. Biometrics, 12(4):381–403, 1956.

Gunnar Blom. Transformations of the binomial, negative binomial, Poisson and χ² variables. Biometrika, 41(3/4):302–316, 1954.

George EP Box. Evolutionary operation: A method for increasing industrial productivity. Applied Statistics, pages 81–101, 1957.

George Casella and Roger L Berger. Statistical inference, volume 2. Duxbury Pacific Grove, CA, 2002.

Shovan Chowdhury, Amitava Mukherjee, and Subhabrata Chakraborti. Distribution-free Phase II CUSUM control chart for joint monitoring of location and scale. Quality and Reliability Engineering International, 31(1):135–151, 2015.

Harold French Dodge and HG Romig. A method of sampling inspection. Bell System Technical Journal, 8(4):613–631, 1929.

Douglas M Hawkins and Qiqi Deng. A nonparametric change-point control chart. Journal of Quality Technology, 42(2):165, 2010.

Douglas M Hawkins and David H Olwell. Cumulative sum charts and charting for quality improvement. Springer Science & Business Media, 2012.

Patrick Nectoux, Rafael Gouriveau, Kamal Medjaher, Emmanuel Ramasso, Brigitte Chebel-Morello, Noureddine Zerhouni, and Christophe Varnier. Pronostia: An experimental platform for bearings accelerated degradation tests. In IEEE International Conference on Prognostics and Health Management, PHM'12, pages 1–8. IEEE Catalog Number: CPF12PHM-CDR, 2012.

E.A. Parent. Sequential Ranking Procedures. PhD thesis, Department of Statistics, Stanford University, 1965. URL https://books.google.com.mx/books?id=pYFCAAAAIAAJ.

Peihua Qiu. Introduction to statistical process control. CRC Press, 2013.

Marion R Reynolds. A sequential signed-rank test for symmetry. The Annals of Statistics, 3:382–400, 1975.

Gordon J Ross, Dimitris K Tasoulis, and Niall M Adams. Nonparametric monitoring of data streams for changes in location and scale. Technometrics, 53(4):379–389, 2011.

Walter A. Shewhart. Economic control of quality of manufactured product. D. Van Nostrand Company, Inc., 1931.

Shira R Solomon and Shlomo S Sawilowsky. Impact of rank-based normalizing transformations on the accuracy of test scores. Journal of Modern Applied Statistical Methods, 8(2):448–462, 2009.

John W Tukey. The future of data analysis. The Annals of Mathematical Statistics, 33(1):1–67, 1962.

BL Van der Waerden. Order tests for the two-sample problem and their power. In Indagationes Mathematicae (Proceedings), volume 55, pages 453–458. Elsevier, 1952.

Abraham Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117–186, 1945.

Abraham Wald and Jacob Wolfowitz. Optimum character of the sequential probability ratio test. The Annals of Mathematical Statistics, 19:326–339, 1948.

Su-Fen Yang and Barry C Arnold. A new approach for monitoring process variance. Journal of Statistical Computation and Simulation, pages 1–17, 2015.

Su-Fen Yang, Tsung-Chi Cheng, Ying-Chao Hung, and Smiley W Cheng. A new chart for monitoring service process mean.