Deconstructing Principal Component Analysis Using a Data Reconciliation Perspective
Shankar Narasimhan*, Nirav Bhatt*
Systems & Control Group, Department of Chemical Engineering, Indian Institute of Technology Madras, Chennai 600036, India
*Corresponding authors. E-mail: [email protected], [email protected]
May 5, 2015
Abstract
Data reconciliation (DR) and Principal Component Analysis (PCA) are two popular data analysis techniques in process industries. Data reconciliation is used to obtain accurate and consistent estimates of variables and parameters from erroneous measurements. PCA is primarily used as a method for reducing the dimensionality of high dimensional data and as a preprocessing technique for denoising measurements. These techniques have been developed and deployed independently of each other. The primary purpose of this article is to elucidate the close relationship between these two seemingly disparate techniques. This leads to a unified framework for applying PCA and DR. Further, we show how the two techniques can be deployed together in a collaborative and consistent manner to process data. The framework has been extended to deal with partially measured systems and to incorporate partial knowledge available about the process model.
Keywords: Data reconciliation, Principal component analysis, Model identification, Estimation, Denoising
Data Reconciliation (DR) is a technique that was proposed in the early 1950s to derive accurate and consistent estimates of process variables and parameters from noisy measurements. This technique has been refined and developed over the past fifty years. Several books and book chapters have been written on this and related techniques [Romagnoli and Sanchez(1999), Veverka and Madron(1997), Narasimhan and Jordache(2000), Hodouin(2010), Bagajewicz(2001)]. The technique is now an integral part of simulation software packages such as ASPEN PLUS®, and several standalone software packages for data reconciliation, such as VALI and DATACON®, are also available and deployed in chemical and mineral process industries. The main benefits derived from applying DR are accurate estimates of all process variables and parameters which satisfy the process constraints such as material and energy balances. The derived estimates are typically used in retrofitting, optimization, and control applications. In order to apply DR, the following information is required.

(i) The constraints that have to be obeyed by the process variables and parameters must be defined. These constraints are usually derived from a first principles model using process knowledge, and consist of material and energy conservation equations including property correlations; they can also include equipment design equations and thermodynamic constraints.

(ii) The set of process variables that are measured must be specified. Additionally, the inaccuracies in these measurements must be specified in terms of the variances and covariances of the errors. This information is usually derived from sensor manuals or from historical data.

Another multivariate data processing technique that has become very popular in recent years is Principal Component Analysis (PCA) [Jolliffe(2002)]. This method is primarily used for reducing the dimensionality of data and to denoise them. It is also used in developing regression models when there is collinearity in the regressor variables [Davis et al.(1999)Davis, Piovoso, Kosanovich and Bakshi]. In chemical engineering, it has been used for process monitoring, fault detection, and diagnosis [Kourti and MacGregor(1995), Yoon and MacGregor(2001)]. Generally, PCA has been regarded as a data-driven multivariate statistical technique. In a recent paper, PCA was interpreted as a model identification technique that discovers the linear relationships between process variables [Narasimhan and Shah(2008)]. This interpretation of PCA is not well known, although other authors have previously alluded to it.

The purpose of this article is to establish the close connection between PCA and DR. Specifically, it is shown that PCA is a technique that discovers the underlying linear relationships between process variables while simultaneously reconciling the measurements with respect to the identified model. Exploring this connection further, it is shown that Iterative PCA is a method which simultaneously extracts the linear process model and the error-covariance matrix, and reconciles the measurements [Narasimhan and Shah(2008)]. Several benefits accrue from this interpretation:

(i) It shows that data reconciliation can be applied to a process purely using measured data, even if it is difficult to obtain a model and measurement error variances using a priori knowledge.
It thus expands the applicability of data reconciliation and related techniques.

(ii) PCA and IPCA can be used as techniques for obtaining a process model and a measurement error-covariance matrix from data. Since these are the two essential pieces of information required to apply DR, it is now possible to apply the rigorous and well developed companion techniques, such as gross error detection (GED), for fault diagnosis. This will eliminate the difficulties and deficiencies present in the current approach of using PCA for fault diagnosis.

Additional useful results presented in this paper include the interpretation of the process model obtained using PCA when only a subset of the process variables is measured. A modification of the PCA and IPCA techniques to incorporate partial knowledge of some of the process constraints is also proposed. The impact of incorrectly estimating the model order (the actual number of linear constraints) on the reconciled estimates is also discussed, leading to a recommendation for the practical application of PCA and for combining it with the tools of DR and GED.

The paper is organized as follows. Sections 2 and 3 introduce the background on DR and PCA, respectively. Model identification and data reconciliation using PCA for the case of a known error-covariance matrix is described in Section 4. For the case of unknown error covariances, Section 5 describes a procedure for simultaneous model identification, estimation of error covariances, and data reconciliation using IPCA. Section 6 extends PCA (IPCA) to partially measured systems and to the case of a partially known constraint matrix. Further, it discusses selection criteria for the model order when the model order is not known. Section 7 concludes the paper. The developed concepts are illustrated via a simulated flow process.

In this section, the application of DR to linear steady-state processes is discussed, including the case when a subset of the process variables is measured (also known as partially measured systems).
The objective of data reconciliation is to obtain better estimates of process measurements by reducing the effect of random errors in measurements. For this purpose, the relationships between different variables as defined by process constraints are exploited. We restrict our attention to linearly constrained processes which are operating under steady state. An example of such a process is a water distribution network, or a steam distribution network, with the flows of different streams being measured. We first describe the data reconciliation methodology for the case when the flows of all streams are measured.

Let $x(j) \in \mathbb{R}^n$ be an $n$-dimensional vector of the true values of the $n$ process variables corresponding to a steady-state operating point for each sample $j$. The samples $x(j)$, $j = 1, 2, \ldots, N$ can be drawn from the same steady state or from different steady states. These variables are related by the following linear relationships:

$$A x(j) = 0_{m \times 1}, \qquad (1)$$

where $A$ is an $(m \times n)$-dimensional matrix and $0_{m \times 1}$ is an $m$-dimensional vector with all elements equal to zero. In data reconciliation, $A$ is labelled as a "constraint matrix". Note that the rows of $A$ span an $m$-dimensional subspace of $\mathbb{R}^n$, while $x(j)$ lies in an $(n-m)$-dimensional subspace of $\mathbb{R}^n$ (orthogonal to the row space of $A$).

Let $y(j) \in \mathbb{R}^n$ be the measurements of the $n$ variables. The measurements are usually corrupted by random errors. Hence, the measurement model can be written as follows:

$$y(j) = x(j) + \epsilon(j), \qquad (2)$$

where $\epsilon(j)$ is an $n$-dimensional random error vector at sampling instant $j$. The following assumptions are made about the random errors:

$$(i)\ \epsilon(j) \sim \mathcal{N}(0, \Sigma_{\epsilon}), \quad (ii)\ E[\epsilon(j)\epsilon(k)^T] = 0, \ \forall j \neq k, \quad (iii)\ E[x(j)\epsilon(j)^T] = 0, \qquad (3)$$

where $E[\cdot]$ denotes the expectation operator. If the error variance-covariance matrix $\Sigma_{\epsilon}$ is known, then the reconciled estimates of $x(j)$ (denoted as $\hat{x}(j)$) can be obtained by minimizing the following objective function:

$$\min_{x(j)} \ (y(j) - x(j))^T \Sigma_{\epsilon}^{-1} (y(j) - x(j)) \quad \text{s.t.} \quad A x(j) = 0. \qquad (4)$$

The reconciled values of the variables are given by:

$$\hat{x}(j) = y(j) - \Sigma_{\epsilon} A^T (A \Sigma_{\epsilon} A^T)^{-1} A y(j) = W y(j), \qquad (5)$$

where $W = I - \Sigma_{\epsilon} A^T (A \Sigma_{\epsilon} A^T)^{-1} A$. Under the assumptions made regarding the measurement errors, it can be shown that the reconciled estimates obtained using the above formulation are maximum likelihood estimates. It can also be verified that the estimates $\hat{x}(j)$ satisfy the imposed constraints and are normally distributed with mean $x(j)$ and covariance $W \Sigma_{\epsilon} W^T$.

If all the measured samples are drawn from the same steady-state operating point, then DR can be applied to the average of the measured samples. However, if the samples are from different steady states, then DR is applied to each sample independently. For ease of comparison with PCA, we consider a set of $N$ samples (which could correspond to different steady-state operating periods) to which DR is applied. The set of $N$ samples is arranged in the form of an $(n \times N)$-dimensional data matrix $Y$ as

$$Y = [y(1), y(2), \ldots, y(N)] = X + E, \qquad (6)$$

where $X$ and $E$ are $(n \times N)$-dimensional matrices of the true values and the errors, respectively. The matrix $\hat{X}$ of reconciled estimates for the $N$ samples is given by

$$\hat{X} = W Y. \qquad (7)$$
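For readers who prefer a computational view, the following Python sketch implements the reconciliation step of Eqs. (5) and (7) for a toy problem. The single-balance constraint matrix, noise levels, and function names are illustrative assumptions and are not taken from the paper's example.

import numpy as np

def reconcile(Y, A, Sigma_e):
    """Reconciled estimates X_hat = W Y (Eqs. 5 and 7) for an (n x N) data matrix Y."""
    n = A.shape[1]
    SAt = Sigma_e @ A.T
    W = np.eye(n) - SAt @ np.linalg.solve(A @ SAt, A)   # W = I - Sigma_e A^T (A Sigma_e A^T)^-1 A
    return W @ Y

if __name__ == "__main__":
    # toy process with a single flow balance x1 - x2 - x3 = 0 (illustrative only)
    A = np.array([[1.0, -1.0, -1.0]])
    sd = np.array([0.1, 0.05, 0.05])
    Sigma_e = np.diag(sd**2)
    x_true = np.array([[10.0], [6.0], [4.0]])
    rng = np.random.default_rng(0)
    Y = x_true + sd[:, None] * rng.standard_normal((3, 5))
    X_hat = reconcile(Y, A, Sigma_e)
    print(np.abs(A @ X_hat).max())   # reconciled estimates satisfy the balance (up to round-off)

Because $W$ depends only on $A$ and $\Sigma_{\epsilon}$, the same matrix reconciles every column of $Y$ at once, which is exactly the form used in Eq. (7).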
The following example illustrates DR on the flow process shown in Figure 1.

Example 1

The flow process consists of six flow variables and four balance equations, i.e., $n = 6$ and $m = 4$. The flow balance equations for this process can be written as follows:

$$A F = 0, \qquad (8)$$

where $A$ (Eq. (9)) is the $(4 \times 6)$ flow balance matrix of the network shown in Figure 1, and

$$F = \begin{bmatrix} F_1 & F_2 & F_3 & F_4 & F_5 & F_6 \end{bmatrix}^T. \qquad (10)$$

In order to demonstrate the utility of DR in reducing noise in measurements, we assume that the flows of all six streams are measured, and noisy measurements are simulated as follows. First, noise-free values (true flow rates) that satisfy the constraints are generated, followed by the addition of noise to generate the noisy measurements.

1. The flow variables $F_1$ and $F_2$ are chosen as independent variables and base values are specified for these variables. For simulating data corresponding to different steady states, normally distributed random fluctuations are added to the base values of flow variables $F_1$ and $F_2$.

2. The true values of the dependent flow variables $F_3$, $F_4$, $F_5$, and $F_6$ are computed using the four flow balance equations in Eq. (8) for all steady states.

3. Noisy measurements are simulated by adding normally distributed random errors to the true values corresponding to different steady states.

The base values, the standard deviations of the fluctuations used for generating different steady states, and the standard deviations of the measurement errors are given in Table 1. A sketch of this simulation procedure is given below.

Table 1: Base values, standard deviation of fluctuation (SDF), and standard deviations of error (SDE) for the flow data.

Variable   Base values   SDF   SDE
F1         10            1     0.1
F2         10            2     0.08
F3         Eq. (8)             0.15
F4         Eq. (8)             0.2
F5         Eq. (8)             0.18
F6         Eq. (8)             0.1
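The three simulation steps above can be sketched as follows. The small series network, stream indices, and noise levels used here are purely illustrative assumptions; the actual topology of Figure 1 and the values of Table 1 are not reproduced.

import numpy as np

rng = np.random.default_rng(1)

# hypothetical series network F1 -> F2 -> F3 with balances F1 = F2 and F2 = F3
A = np.array([[1.0, -1.0,  0.0],
              [0.0,  1.0, -1.0]])
ind, dep = [0], [1, 2]                             # independent and dependent flow indices
base, sdf = np.array([10.0]), np.array([1.0])      # base value and fluctuation SD of the independent flow
sde = np.array([0.1, 0.08, 0.15])                  # measurement error SDs

N = 1000
X = np.zeros((3, N))
X[ind, :] = base[:, None] + sdf[:, None] * rng.standard_normal((len(ind), N))   # step 1
X[dep, :] = np.linalg.solve(A[:, dep], -A[:, ind] @ X[ind, :])                  # step 2: enforce the balances
Y = X + sde[:, None] * rng.standard_normal((3, N))                              # step 3: add measurement noise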
A sample of 1000 measurements is simulated and DR is applied to obtain the reconciled estimates corresponding to each sample. The root mean square error (RMSE) between the reconciled and true values over all 1000 samples is computed for each flow variable and reported in Table 2. The RMSE values of the measurements are also reported in the table for comparison. The RMSE values of the estimates of all variables are less than the RMSE values of the corresponding measurements, clearly indicating that the reconciled estimates of the variables are more accurate than the measurements.

Figure 1: Schematic of a flow process

Table 2: Root mean square errors in measurements and reconciled estimates of flow variables for Example 1

Variable   RMSE before DR   RMSE after DR
F1
F2
F3
F4
F5
F6

For reasons of cost and feasibility, in most processes only a subset of variables is measured. We refer to such a system as a partially measured system. Application of DR to a partially measured system provides reconciled values of the measured variables and estimates of the unmeasured variables. DR also provides diagnostics on which of the measured variables can be reconciled (known as redundant measured variables) and which of the unmeasured variables can be uniquely estimated (known as observable unmeasured variables). The details of the redundancy and observability concepts can be found in [Mah(1990)]. The method for applying DR to a partially measured system is described below.

The constraints for a partially measured system, Eq. (1), can be rewritten in terms of the measured variables (labelled $x_k$) and the unmeasured variables (labelled $x_u$) as:

$$A_k x_k + A_u x_u = 0, \qquad (11)$$

where $A_k$ and $A_u$ are the partitions of the $A$ matrix corresponding to $x_k$ and $x_u$, respectively. Then, by constructing a projection matrix $P$ such that $P A_u = 0$ and premultiplying Eq. (11) by it, we obtain a reduced set of constraints involving only the measured variables as [Crowe et al.(1983)Crowe, Garcia Campos and Hrymak]:

$$P A_k x_k = 0. \qquad (12)$$

Then, the reduced data reconciliation problem is to find the minimum of the objective defined in Eq. (4) subject to the constraints defined by Eq. (12). The reconciled values $\hat{x}_k(j)$ obtained from the measurements $y_k(j)$ are given as follows:

$$\hat{x}_k(j) = y_k(j) - \Sigma_{\epsilon} (P A_k)^T (P A_k \Sigma_{\epsilon} (P A_k)^T)^{-1} (P A_k) y_k(j). \qquad (13)$$

The estimates of the unmeasured variables can be obtained by substituting the estimates of the measured variables in Eq. (11) and solving these equations. The conditions under which unique estimates of the unmeasured variables can be obtained, and the identification of redundant and observable variables using linear algebraic techniques, are described more completely in the book by [Narasimhan and Jordache(2000)]. Alternatively, for flow processes, a graph-theoretic approach can be used for obtaining the reduced constraint matrix and for determining the redundant measured variables and the observable unmeasured variables.
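A minimal sketch of the projection step in Eqs. (11)-(13) is given below. The function names are hypothetical, and the left null space of $A_u$ is computed with scipy.linalg.null_space, which is only one of several ways to construct $P$.

import numpy as np
from scipy.linalg import null_space

def reduced_constraints(A, measured):
    """Reduced constraint matrix P A_k relating only the measured variables (Eq. 12)."""
    unmeasured = [j for j in range(A.shape[1]) if j not in measured]
    A_k, A_u = A[:, measured], A[:, unmeasured]
    P = null_space(A_u.T).T          # rows of P span the left null space of A_u, so P A_u = 0
    return P @ A_k

def reconcile_measured(Y_k, A_red, Sigma_e):
    """Reconciled estimates of the measured variables (Eq. 13), columnwise for Y_k."""
    SAt = Sigma_e @ A_red.T
    W = np.eye(A_red.shape[1]) - SAt @ np.linalg.solve(A_red @ SAt, A_red)
    return W @ Y_k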
Example 2

The following example illustrates the application of DR to a partially measured flow process. The graph-theoretic approach is used for the observability and redundancy classification of variables because it aids visualization and ease of understanding. Consider again the flow process given in Figure 1, with only the flows of streams 1, 2, and 5 being measured. In order to apply the graph-theoretic procedure, the process is represented as a graph, shown in Figure 2(b), in which an environment node is added to which all process inflows (stream 1) and process outflows (stream 5) are connected. In order to obtain the graph corresponding to the reduced reconciliation problem, we merge nodes which are connected by streams whose flows are not measured. Thus, merging nodes 1 and 4, 2 and 3, and 3 and 4, we get the reduced graph shown in Figure 2(c). The reduced graph contains only streams 1 and 5 (whose flows are measured), for which one flow balance can be written (either around node 1 or node E). The reduced DR problem is to obtain reconciled estimates of these two flows subject to the flow balance constraint. The same reduced reconciliation problem can be obtained using the projection technique described earlier. The measurements of the streams that appear in the reduced graph are redundant, while the measured flow of stream 2 that was eliminated during the merging process is non-redundant. It can also be deduced that the flows of streams 3, 4, and 6 are observable, because the original process graph does not contain any cycle which consists solely of unmeasured flows.

Figure 2: Flow network after merging variables for Example 2

The RMSE values of all the variables after DR are reported in Table 3. It shows that the reconciled estimates of the redundant variables $F_1$ and $F_5$ are more accurate than the measurements even with the partial measurements, while there is no improvement in the non-redundant flow $F_2$. Compared with the RMSE values of the estimates reported in Table 2 for the fully measured case, the estimates of all variables are less accurate due to the reduced information available about the process.

Table 3: Root mean square errors in the reconciled estimates of flow variables for Example 2

Variable   RMSE after DR
F1
F2
F3
F4
F5
F6

Principal component analysis (PCA) is one of the most widely used multivariate statistical techniques. The traditional view of PCA as a technique for dimensionality reduction is explained for the sake of continuity, before we describe how PCA can be used as a technique for identifying a steady-state linear model from data.

PCA is a linear transformation of a set of variables to a new set of uncorrelated variables called principal components (PCs) [Jolliffe(2002)]. The first PC is a new variable with the highest variance among all linear transformations of the original variables. The second PC is a new variable, orthogonal to the first PC (and hence uncorrelated with it), that has the next highest variance among all linear transformations of the original variables, and so on.
For reducing the dimensionality of multivariate data, the PCs with the highest variances are retained while the remaining PCs are discarded. The PCs can be obtained from the eigenvectors of the covariance matrix of the data for the given set of variables. The eigenvectors of the covariance matrix of the data can, in turn, be obtained using the singular value decomposition (SVD) of the data matrix as described below.

As defined in the preceding section, let $Y$ be the $(n \times N)$-dimensional data matrix. Let $S_y$ be the covariance matrix of $Y$, defined by

$$S_y = \frac{1}{N} Y Y^T. \qquad (14)$$

The SVD of the scaled data matrix can be written as

$$\mathrm{svd}\!\left(\frac{Y}{\sqrt{N}}\right) = U_1 S_1 V_1^T + U_2 S_2 V_2^T, \qquad (15)$$

where $U_1$ contains the orthonormal eigenvectors corresponding to the $p$ largest eigenvalues of $S_y$, while $U_2$ contains the orthonormal eigenvectors corresponding to the remaining $(n-p)$ smallest eigenvalues of $S_y$. The matrices $S_1$ and $S_2$ are diagonal matrices whose diagonal elements are the square roots of the eigenvalues of $S_y$. It is assumed that the eigenvalues and corresponding eigenvectors are ordered in decreasing order of the magnitudes of the eigenvalues. It can be proved that the first $p$ PCs are given by $U_1^T y(j)$, and the variances of these PCs are the corresponding eigenvalues.

If the objective is to reduce the dimensionality of the data, then only the values of the first $p$ PCs need to be computed for each observation $y(j)$ and stored. Several heuristics have been suggested in the literature to choose the number of PCs $p$ to be retained, such as the percentage of the total variance in the data captured, or the SCREE plot, which looks for sharp changes in the eigenvalues. More details on such heuristics can be found in the book by [Jolliffe(2002)].

Alternatively, PCA can be used as a denoising technique. The denoised estimates of the measurements obtained using PCA with $p$ retained PCs are given by

$$\hat{X} = \sqrt{N}\, U_1 S_1 V_1^T. \qquad (16)$$
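As a computational summary of Eqs. (14)-(16), the following sketch computes the denoised estimates and also returns the eigenvectors associated with the smallest eigenvalues, which are used for model identification later in the paper. The function name and return values are illustrative assumptions.

import numpy as np

def pca_denoise(Y, p):
    """Denoised estimates with p retained PCs (Eq. 16) via the SVD of Y/sqrt(N) (Eq. 15)."""
    n, N = Y.shape
    U, s, Vt = np.linalg.svd(Y / np.sqrt(N), full_matrices=False)
    X_hat = np.sqrt(N) * (U[:, :p] * s[:p]) @ Vt[:p, :]
    U2 = U[:, p:]                  # eigenvectors of S_y for the (n - p) smallest eigenvalues
    return X_hat, U2, s**2         # s**2 are the eigenvalues of S_y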
The viewpoint that is of importance in this article is that PCA is a method for discovering the underlying linear relationships between variables, while at the same time obtaining reconciled estimates of the variables that satisfy the identified linear relations. The use of PCA as a method for identifying a linear model is less well known, although it has been alluded to by different authors [Gertler et al.(1999)Gertler, Li, Huang and McAvoy, Huang(2001), Jolliffe(2002)]. This viewpoint is explored thoroughly in this work. We first analyze the case when all the process variables are measured and the measurement error-covariance matrix is known. The generalization to partially measured systems, and the estimation of the error-covariance matrix simultaneously with the linear constraint model from data, will be elucidated in the following sections. We also make the following additional assumptions regarding the observations. The reasons for these additional assumptions will be explained later in this section.

(A1) The data are drawn from at least $n$ distinct steady states.

(A2) The data used for model building do not contain any outliers or gross errors.

If there are $m$ linear relations between the measured process variables as defined by Eq. (1), then, as stated earlier, the true values of the variables lie in an $(n-m)$-dimensional subspace of $\mathbb{R}^n$. Therefore, the denoised estimates of the variables are also chosen to lie in an $(n-m)$-dimensional subspace. If we assume that the number of linear constraints $m$ relating the variables is known a priori, then we can apply PCA, choose the number of retained PCs $p$ to be equal to $n-m$, and obtain the estimates $\hat{X}$ given by Eq. (16). Based on the orthogonality of the PCs, it immediately follows that the eigenvectors corresponding to the smallest $m$ eigenvalues are orthogonal to these estimates, and therefore they provide an estimate of the rows of the constraint matrix $A$ (whose $m$ rows lie in an $m$-dimensional subspace orthogonal to the true values of the process variables); that is, the estimate of the constraint matrix is given by

$$\hat{A} = U_2^T. \qquad (17)$$

While data compression and denoising focus on the eigenvectors corresponding to the largest eigenvalues, which need to be retained, model identification is concerned with the eigenvectors corresponding to the smallest eigenvalues. The main questions to be answered are whether the estimate of the constraint matrix obtained using PCA is optimal, and whether the number of constraints can also be estimated from data without any prior knowledge.

The estimates of the process variables (which lie in a $p$-dimensional subspace) obtained using Eq. (16), and the corresponding estimate of the constraint matrix (orthogonal to the estimates) obtained using Eq. (17), can be shown to be optimal in the least-squares sense [Rao(1964)]. In other words, given a sample of observations, PCA identifies an optimal $p$-dimensional subspace in which the estimates lie, such that the sum of squared differences between the measured and estimated values is minimum. If we assume that the errors in the measurements obey the assumptions listed in Eq. (3), with the additional condition that the error-covariance matrix is a scaled identity matrix, i.e., $\Sigma_{\epsilon} = \sigma^2 I$, then the following result can be proved:

$$\Sigma_{Y|X,N} = E[S_y \,|\, X, N] = M_x + \sigma^2 I, \qquad (18)$$

where $M_x = \frac{1}{N} X X^T$ is an $(n \times n)$-dimensional matrix and $I$ is the $n$-dimensional identity matrix. Note that the notation $E[S_y \,|\, X, N]$ is used to emphasize that the expected value of the sample covariance matrix is considered under the assumption that the samples are drawn from the same set of steady states $X$, and the only variability in $S_y$ is due to the measurement errors. Eq. (18) leads to the following properties:
Property 1. The eigenvectors of $\Sigma_{Y|X,N}$ and $M_x$ are identical.
Property 2. The smallest $m$ eigenvalues of $\Sigma_{Y|X,N}$ are all equal to $\sigma^2$.

If data corresponding to at least $(n-m)$ linearly independent steady states are obtained, then the rank of the matrix $M_x$ will be $(n-m)$, with the last $m$ eigenvalues being equal to zero. The eigenvectors corresponding to the $m$ zero eigenvalues will be orthogonal to the true values and form a basis for the row space of the constraints. We can therefore use Property 1 to conclude that the eigenvectors corresponding to the $m$ smallest eigenvalues of $\Sigma_{Y|X,N}$ form a basis for the row space of the constraint matrix. Furthermore, Property 2 clearly provides a way of identifying the number of constraints without any a priori knowledge, by examining the eigenvalues of $\Sigma_{Y|X,N}$. The number of constraints (also referred to as the model order) is estimated to be equal to the number of small eigenvalues, all of which should be equal. (Since the number of constraints may not be known a priori, it is recommended that data be obtained from as many distinct steady-state operating conditions as the number of variables $n$.) Thus, if $\Sigma_{Y|X,N}$ is known, we can derive the number of constraints and the constraint matrix by applying PCA to this matrix. However, we can only obtain an unbiased estimate of $\Sigma_{Y|X,N}$ from the sample covariance matrix $S_y$. It has been proved that the eigenvectors of $S_y$ are asymptotically unbiased estimates of the eigenvectors of $\Sigma_{Y|X,N}$ if the errors (and hence the measurements) are normally distributed [Jolliffe(2002)]. Using this result, we can therefore conclude that if the errors in all measurements are normally distributed with identical variances (homoscedastic errors) and are mutually independent, then PCA can be used to derive an asymptotically unbiased estimate of the constraint matrix. It should be noted that the estimate of the constraint matrix derived using PCA can differ from the true constraint matrix (or the form we may desire) by a rotation. In other words, the estimated and true constraint matrices are related as

$$\hat{A} = Q A, \qquad (19)$$

where $Q$ is some non-singular matrix.

It can also be shown that the denoised estimates obtained using PCA, given by Eq. (16), are the reconciled values of the data matrix $Y$ corresponding to the constraint matrix $\hat{A}$ identified using PCA, under the assumption that $\Sigma_{\epsilon} = \sigma^2 I$. By substituting the estimated constraint model given by Eq. (17) in Eq. (5), we obtain

$$\hat{X} = (I - U_2 (U_2^T U_2)^{-1} U_2^T) Y. \qquad (20)$$

Substituting the SVD of the data matrix and using the property that the eigenvectors are orthonormal, we get

$$\hat{X} = \sqrt{N}\,(I - U_2 U_2^T)(U_1 S_1 V_1^T + U_2 S_2 V_2^T) \qquad (21)$$

$$\hat{X} = \sqrt{N}\, U_1 S_1 V_1^T. \qquad (22)$$

Based on the above analysis, it can be concluded that PCA is a method that simultaneously identifies the constraint model and obtains reconciled estimates with respect to the identified model purely from data, under the assumption that the measurement errors in different variables are independent and have the same variance. The number of constraints can also be obtained purely from the data. It should, however, be noted that while the estimates obtained using PCA satisfy the identified constraint matrix, they will not satisfy the true process constraints due to inaccuracies in the estimated constraint matrix. The larger the sample size of the data set, the more accurate will be the estimated constraint matrix, and the more closely will the PCA estimates match the reconciled estimates derived by applying DR using the true process constraints. It may also be verified that the reconciled estimates are invariant to a rotation of the constraint matrix.
Thus, the fact that the constraint matrix estimated using PCA differs from the desired form of the true constraint matrix (derived from first principles) by a rotation will not have an effect on the reconciled estimates.

The results in the preceding subsection were obtained under the assumption that the errors in the measurements of different variables have identical variances. In this subsection, we describe the identification of the constraint model using PCA when the measurement errors have different variances and may also, in general, be correlated. It is, however, assumed that the error-covariance matrix $\Sigma_{\epsilon}$ is known. Furthermore, the errors in the measurements of different samples are assumed to be independent and identically distributed with the same known covariance matrix. It may be noted that these assumptions regarding the measurement errors are the same as those made in DR.

Narasimhan and Shah [Narasimhan and Shah(2008)] described an approach for applying PCA for model identification from data when the error-covariance matrix is known. This approach is based on transforming the data using an appropriate matrix before applying PCA. The measurement errors corrupting the transformed data are independent and identically distributed (i.i.d.), and PCA can be applied to the transformed data matrix. The approach is as follows.

The Cholesky decomposition of $\Sigma_{\epsilon}$ is given by

$$\Sigma_{\epsilon} = L L^T, \qquad (23)$$

where $L$ is an $(n \times n)$-dimensional lower triangular matrix. Then, the noisy data matrix can be transformed as follows:

$$Y_s = L^{-1} Y = L^{-1} X + L^{-1} E, \qquad (24)$$

where $E$ is the $(n \times N)$-dimensional error matrix. Let $S_{y_s}$ be the covariance matrix of the transformed data, defined by

$$S_{y_s} = \frac{1}{N} Y_s Y_s^T. \qquad (25)$$

By taking expectations, we can show that

$$\Sigma_{Y_s|X,N} = E[S_{y_s} \,|\, X, N] = M_{x_s} + I, \qquad (26)$$

where $M_{x_s} = \frac{1}{N} L^{-1} X X^T L^{-T}$. Eq. (26) is similar to Eq. (18), and Properties 1 and 2 hold for the matrix $\Sigma_{Y_s|X,N}$. Therefore, PCA can be applied to the transformed data in order to obtain the constraint model and the reconciled estimates corresponding to the transformed data. The reconciled estimates and the constraint model corresponding to the original data can be derived using an inverse transformation as described below. Let the SVD of $Y_s$ be partitioned corresponding to the largest $(n-m)$ singular values and the remaining $m$ singular values as

$$Y_s = \underbrace{\sqrt{N}\, U_{1s} S_{1s} V_{1s}^T}_{\hat{X}_s} + \sqrt{N}\, U_{2s} S_{2s} V_{2s}^T. \qquad (27)$$

The first part on the right-hand side of the above equation corresponds to the reconciled estimates of the transformed data, while the second part contains the information about the constraints which relate the transformed variables. The reconciled estimates and the constraint matrix corresponding to the original data are given by

$$\hat{X} = L \hat{X}_s, \qquad (28)$$

$$\hat{A} = U_{2s}^T L^{-1}. \qquad (29)$$

Since the smallest $m$ eigenvalues of the $S_{y_s}$ matrix are all equal to 1, this provides a systematic method to determine the constraint model order. Due to finite sample sizes and numerical inaccuracies, the equality of the smallest $m$ eigenvalues can be checked numerically by testing whether the average of the smallest $m$ eigenvalues approximately equals unity.
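The whole procedure of Eqs. (23)-(29), including the model-order check based on near-unity eigenvalues, can be sketched as below. The function name and return values are illustrative assumptions.

import numpy as np

def pca_known_cov(Y, Sigma_e, m):
    """Identify m constraints and reconcile Y, given the error covariance (Eqs. 23-29)."""
    n, N = Y.shape
    L = np.linalg.cholesky(Sigma_e)                       # Eq. (23)
    Ys = np.linalg.solve(L, Y)                            # Eq. (24): Ys = L^-1 Y
    U, s, Vt = np.linalg.svd(Ys / np.sqrt(N), full_matrices=False)
    p = n - m
    Xs_hat = np.sqrt(N) * (U[:, :p] * s[:p]) @ Vt[:p, :]  # retained part of Eq. (27)
    X_hat = L @ Xs_hat                                    # Eq. (28)
    A_hat = np.linalg.solve(L.T, U[:, p:]).T              # Eq. (29): A_hat = U_2s^T L^-1
    return X_hat, A_hat, s**2                             # smallest m eigenvalues should be ~1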
Example 3

The flow process of Example 1 is again considered, along with the simulated data set of 1000 samples. We assume that the error-covariance matrix (used to simulate the measurements) is known, but neither the number of constraints nor the constraint matrix is assumed to be known. It may be noted that the error-covariance matrix is diagonal, with the diagonal elements being the error variances given in Table 1. The data matrix is transformed as defined by Eq. (24) and its SVD is obtained. The six ordered singular values obtained are 263.16, 21.83, 1.05, 1.04, 1.01, and 0.98. Examination of the singular values clearly indicates that there are four constraints, because the smallest four singular values are all nearly equal. The denoised estimates of the transformed data are obtained using the first two PCs, and the reconciled estimates of the original data are obtained using Eq. (28). The RMSE of the reconciled estimates of each variable is computed and reported in Table 4. Comparing the RMSE of the estimates obtained using PCA with those reported in Table 2, which were obtained by applying DR using the known process model, it is clear that they are nearly equal. Although the PCA estimates do not exactly satisfy the true flow balances given by Eq. (8), the maximum constraint residual is of the order of $10^{-}$. These results indirectly indicate that the constraint model obtained using PCA is a good estimate of the true process constraints.

Table 4: Root mean square errors in the reconciled estimates of flow variables using PCA for Example 3

Variable   RMSE using PCA
F1
F2
F3
F4
F5
F6

In Example 3, the eigenvectors of the transformed data covariance matrix corresponding to the smallest four eigenvalues are used to derive an estimate of the process constraint matrix using Eq. (29). The elements of the estimated constraint matrix do not seem to correspond with those of the constraint matrix derived from first principles, given by Eq. (9). However, it should be noted that an element-by-element comparison between the estimated and true constraint matrices cannot be made, because the estimated constraint matrix may differ from the first-principles model by a rotation matrix (see Eq. (19)). Thus, only the row spaces of the estimated and true process constraint matrices can be compared. Two criteria were proposed by [Narasimhan and Shah(2008)] for making such a comparison: (i) the subspace angle between the row subspaces of the estimated and true constraint matrices, and (ii) the sum of orthogonal distances of the row vectors of the estimated constraint matrix from the subspace defined by the rows of the true constraint matrix. The latter metric (denoted by '$\alpha$') is computed as follows:

$$\alpha = \sum_i \alpha_i, \quad \text{with} \qquad (31)$$

$$\alpha_i = \|\hat{A}_{i\cdot} - \hat{A}_{i\cdot} A^T (A A^T)^{-1} A\|, \qquad (32)$$

where $\hat{A}_{i\cdot}$ are the rows of the estimated constraint matrix. Hence, an $\alpha$ value near zero indicates that $\hat{A}$ is a good approximation of $A$.

For Example 3, the subspace angle between the row spaces of the estimated and true constraint matrices is 0.2377 degrees and $\alpha$ = 0.0555. These values suggest that a good estimate of the constraint matrix has been obtained using PCA. Our experience with larger examples indicates that the subspace angle is not a good criterion for comparison, because it does not clearly indicate the quality of the estimated constraint matrix, especially as the number of constraints increases.
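Both comparison criteria can be computed in a few lines. The sketch below uses scipy.linalg.subspace_angles for the angle between the row spaces and implements $\alpha$ as in Eqs. (31)-(32); the function name and the choice of reporting the largest principal angle in degrees are assumptions made for illustration.

import numpy as np
from scipy.linalg import subspace_angles

def compare_constraints(A_hat, A):
    """Largest principal angle (degrees) between row spaces, and alpha of Eqs. (31)-(32)."""
    theta = np.degrees(subspace_angles(A_hat.T, A.T).max())
    proj = A.T @ np.linalg.solve(A @ A.T, A)              # projector onto the row space of A
    alpha = sum(np.linalg.norm(row - row @ proj) for row in A_hat)
    return theta, alpha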
A practically useful method for comparing the estimated and true constraint matrices is proposed in this work. For this purpose, the process variables are partitioned into a set of dependent variables $x_D$ and a set of independent variables $x_I$, based on process knowledge. The number of dependent variables should be chosen equal to the number of constraints. The constraints given by Eq. (1) can then be rewritten as

$$A_D x_D + A_I x_I = 0, \qquad (33)$$

where $A_D$ and $A_I$ are the sub-matrices of $A$ corresponding to the dependent and independent variables, respectively. From Eq. (33), the regression matrix relating the dependent variables to the independent variables can be obtained as

$$x_D = -A_D^{-1} A_I x_I = R x_I. \qquad (34)$$

In a similar manner, the estimated regression matrix can be obtained from the estimated constraint matrix as

$$\hat{R} = -\hat{A}_D^{-1} \hat{A}_I. \qquad (35)$$

An element-by-element comparison can then be made between $R$ and $\hat{R}$.

For Example 3, if the flow variables $F_3$ to $F_6$ are chosen as the dependent variables, the true regression matrix $R$ and the regression matrix $\hat{R}$ derived from the estimated constraint matrix can be compared element by element; the maximum absolute difference between the elements is 0.0064. This again clearly shows that an accurate estimate of the constraint matrix is obtained using PCA.

In Section 4.1, it was established that PCA is a technique for obtaining the steady-state constraint model and reconciled estimates from measured data when the error-covariance matrix is known. In practice, the error variances and covariances are not easily available and may also change with time. If replicate measurements are available corresponding to one or more steady states, then the error-covariance matrix can be directly estimated from the data, provided the measurements corresponding to each steady-state operating period are clearly identified. However, the identification of steady operating periods from data is itself a challenging problem. Given these practical difficulties, the question is whether it is possible to simultaneously estimate the error-covariance matrix, the process constraint model, and the reconciled estimates purely from data, without the need for replicate measurements corresponding to a steady state. Surprisingly, it is possible to extract all this information from data. A method for this purpose, known as iterative PCA (IPCA), was recently proposed by [Narasimhan and Shah(2008)]. A brief description of this method follows.

In Section 4.1, it was shown how PCA can be used to obtain an estimate of the constraint matrix and reconciled estimates if the error-covariance matrix is given. In IPCA, this approach is iteratively combined with an algorithm for estimating the error-covariance matrix from data, given the constraint matrix, by solving the following optimization problem:

$$\min_{\Sigma_{\epsilon}}\ N \log |\hat{A} \Sigma_{\epsilon} \hat{A}^T| + \sum_{k=1}^{N} r^T(k) (\hat{A} \Sigma_{\epsilon} \hat{A}^T)^{-1} r(k). \qquad (38)$$

Under the assumptions made regarding the measurement errors (Eq. (3)), minimizing this objective function is identical to maximizing the likelihood function of the constraint residuals $r(k)$, defined by

$$r(k) = \hat{A} y(k). \qquad (39)$$

An updated estimate of the constraint matrix can be obtained using the estimated error-covariance matrix as described in Section 4.1, and the procedure is repeated until the estimates of the error-covariance matrix (and the constraint matrix) converge. At convergence, the smallest $m$ eigenvalues should all be equal to unity. Thus, the convergence test can be applied to check whether the average of the smallest $m$ eigenvalues is close to unity.

The method described in this section for estimating the measurement error-covariance matrix from constraint residuals is known as an indirect method in the area of data reconciliation.
Using indirect methods, the maximum number of elements of the symmetric error-covariance matrix that can be estimated is $m(m+1)/2$.
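A rough sketch of the IPCA iteration described above is given below, assuming a diagonal error-covariance matrix. The covariance step minimizes the negative log-likelihood of Eq. (38) numerically with scipy, which is only one of several possible ways to solve this sub-problem; the function names, the initial guess, and the iteration count are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

def scaled_pca(Y, Sigma_e, m):
    """One PCA pass with a known error covariance (Eqs. 23-29), returning A_hat and eigenvalues."""
    n, N = Y.shape
    L = np.linalg.cholesky(Sigma_e)
    U, s, _ = np.linalg.svd(np.linalg.solve(L, Y) / np.sqrt(N), full_matrices=False)
    return np.linalg.solve(L.T, U[:, n - m:]).T, s**2

def estimate_cov(Y, A_hat):
    """Diagonal Sigma_e maximizing the likelihood of the residuals r(k) = A_hat y(k) (Eq. 38)."""
    n, N = Y.shape
    R = A_hat @ Y
    def negloglik(log_var):
        S = A_hat @ np.diag(np.exp(log_var)) @ A_hat.T
        return N * np.linalg.slogdet(S)[1] + np.sum(R * np.linalg.solve(S, R))
    res = minimize(negloglik, np.zeros(n), method="Nelder-Mead")
    return np.diag(np.exp(res.x))

def ipca(Y, m, n_iter=20):
    """Alternate between the constraint and covariance estimation steps."""
    Sigma_e = np.eye(Y.shape[0])                  # initial guess
    for _ in range(n_iter):
        A_hat, eigvals = scaled_pca(Y, Sigma_e, m)
        Sigma_e = estimate_cov(Y, A_hat)
    return A_hat, Sigma_e, eigvals                # smallest m eigenvalues ~ 1 at convergence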
Example 4

The flow process of Example 1 is considered to demonstrate IPCA. It is assumed that the error-covariance matrix is diagonal, but its elements are unknown. We need to estimate six elements of $\Sigma_{\epsilon}$, along with the constraint matrix and the reconciled estimates. The number of constraints for this example is four, and it is possible to estimate a maximum of $4(4+1)/2 = 10$ elements of $\Sigma_{\epsilon}$; since only six variances need to be estimated, the problem is identifiable. IPCA is applied to the data matrix and the estimated standard deviations of the errors are given in Table 5. The results show that IPCA provides accurate estimates of the error variances. Further, the six ordered singular values obtained are 258.47, 21.53, 1.02, 1, 0.99, and 0.99, which clearly indicates that there are four constraints. The reconciled values and the constraint matrix can be estimated using Eqs. (28) and (29). The subspace angle between the row spaces of the estimated and true constraint matrices is 0.2373 degrees and $\alpha$ = 0.0509. The maximum constraint residual is of the order of $10^{-}$. The estimated regression matrix $\hat{R}_{ipca}$ is computed by considering the flow variables $F_3$ to $F_6$ as dependent variables; the maximum absolute difference between the elements of $R$ and $\hat{R}_{ipca}$ is 0.0063. The comparison of the estimated constraint matrix using the subspace angle, $\alpha$, and $\hat{R}_{ipca}$ shows that IPCA provides an accurate estimate of the constraint matrix without knowledge of the error-covariance matrix. Further, the RMSE values of the reconciled estimates of each variable are computed and reported in Table 5. The RMSE values show that the reconciled estimates are comparable to those obtained by applying DR using the known constraint model. The simulation results show that IPCA is a reliable data-driven method for obtaining the reconciled values, a constraint matrix, and estimates of the standard deviations from the data without any a priori knowledge.

Table 5: Standard deviation (SD) and root mean square errors in the reconciled estimates of flow variables using IPCA (Example 4)

Variable   Estimated SD   RMSE using IPCA
F1
F2
F3
F4
F5
F6

In the preceding sections, the close link between DR and PCA was established. Further, we have demonstrated that it is possible to identify the constraint model and the error-covariance matrix, if required, from data using PCA (or IPCA). The estimated model and error-covariance matrix can subsequently be used for obtaining more accurate estimates of the variables, by reconciling the measurements with respect to the identified model. In deriving these results, it was assumed that all process variables are measured, which is not valid in general. The connection between PCA and DR applied to a partially measured system is elucidated in this section.

We consider the process defined in Section 2.2. Let $n_p$ be the number of measured variables out of the $n$ process variables, and let $y_p(j)$ be the measurements of the $n_p$ variables at the $j$th sample, where $j = 1, \ldots, N$. The corresponding $(n_p \times N)$-dimensional measurement matrix $Y_p$ can be obtained by collating $y_p(j)$ for all $N$ samples. The objective is then to identify a linear model relating the measured variables and to obtain reconciled estimates of these variables.

If the measurement error-covariance matrix is known, then we can apply PCA as described in Section 4.1 to the data matrix $Y_p$ and estimate the number of constraints, the linear constraint matrix, and the reconciled values of the measured variables.
On the other hand, if the measurement error-covariance matrix is unknown, then we can apply IPCA as described in Section 5 to simultaneously estimate the error-covariance matrix, the constraint matrix, and the reconciled values. The key question is how these estimates derived using PCA or IPCA are related to the estimates obtained by applying DR to the partially measured system described in Section 2.2, which uses knowledge of the true process constraints and error-covariance matrix. It may be noted that by applying PCA or IPCA to the data matrix $Y_p$, it is possible to identify only the linear constraints relating the measured variables. The constraints relating only the measured variables can also be derived by applying a projection matrix to the known true process constraints, as described in Section 2.2. The reduced data reconciliation problem obtained after projecting out the unmeasured variables is identical to the DR problem for the completely measured process, with the only difference being that the constraint matrix relating the measured variables is the reduced balance matrix $P A_k$. Therefore, using arguments similar to those in Sections 4.1 and 5, we can derive the following result:

$$U_{2p}^T = Q\, P A_k, \qquad (41)$$

where $U_{2p}$ contains the eigenvectors of the covariance matrix of $Y_p$ corresponding to the smallest $m_p$ eigenvalues, $Q$ is an arbitrary non-singular rotation matrix, and $P A_k$ is the reduced constraint matrix. The number of constraints in the reduced reconciliation problem, $m_p$, can be shown to be equal to $(m - t)$, where $t$ is the rank of $A_u$ [Narasimhan and Jordache(2000)]. The number of constraints $m_p$ can also be obtained from the data, because the smallest $m_p$ eigenvalues should all be equal. The reconciled estimates corresponding to the identified model are given, as before, by

$$\hat{X}_p = \sqrt{N}\, U_{1p} S_{1p} V_{1p}^T. \qquad (42)$$

While the above discussion brings out the relationship between PCA and DR as applied to a partially measured system, it should be pointed out that, because there is no information regarding the unmeasured variables, PCA (or IPCA) can be used to estimate only the reduced constraint matrix that relates the measured variables. The true constraint matrix that relates both the measured and unmeasured variables obviously cannot be estimated from the measured data. Furthermore, using the first principles model, it is also possible to classify unmeasured variables as observable or unobservable based on the given measurement structure, which cannot be done using the data-driven approach. The classification of measured variables as redundant or non-redundant can, however, be performed using the data-driven approach, by examining the columns of the estimated reduced constraint matrix $\hat{A}_p$. If all the elements of a column of $\hat{A}_p$ are close to zero, then the corresponding measured variable does not participate in any of the reduced constraints and is therefore a non-redundant measured variable; a minimal check along these lines is sketched below.
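The redundancy check described above amounts to inspecting the column norms of the estimated reduced constraint matrix; a minimal sketch follows, in which the tolerance and function name are arbitrary illustrative choices.

import numpy as np

def classify_measured(A_hat_p, tol=1e-6):
    """Label each measured variable by whether it appears in any identified constraint."""
    col_norms = np.linalg.norm(A_hat_p, axis=0)
    return ["non-redundant" if c < tol else "redundant" for c in col_norms]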
The following example illustrates the application of PCA to a partially measured flow process.

Example 5

The flow process of Example 2 is considered here. We assume that the error covariances for the measured variables are known. The PCA approach described in Section 4.1 can then be applied to the available flow measurements. The three ordered singular values obtained are 169.84, 18.96, and 1.03. This indicates that there is one constraint among the measured variables. In the estimated constraint row, the coefficient corresponding to $F_2$ is almost zero, which indicates that $F_2$ is a non-redundant variable and its measurement cannot be reconciled. Comparing the RMSE values of the estimates obtained using PCA in Table 6 with the RMSE values of the measurements reported in Table 2, it is clear that PCA reconciles the flow variables $F_1$ and $F_5$, but not $F_2$. The maximum constraint residual is of the order of $10^{-}$, which indicates that the identified model is close to the true one.

Table 6: Root mean square errors in the reconciled estimates of flow variables for the partially measured system in Example 5

Variable   RMSE using PCA with the known SD
F1
F2
F5

In Section 4, it was shown how PCA can be used for simultaneous model identification and data reconciliation when no knowledge of the constraint matrix is available. In some cases, a subset of the constraint matrix (or of the linear relationships) may be known. In such a situation, it is required to identify only the remaining constraints that relate the measured variables. In this subsection, we extend the PCA and IPCA approaches so that the partial knowledge available about the process constraints can be fully exploited.

Let $A_g$ be the $(m_g \times n)$-dimensional known constraint matrix. For simplicity, we assume that the error-covariance matrix is known and is an identity matrix. The procedure described in this section can be generalized using the approach described in Sections 4 and 5 if the error-covariance matrix is not an identity matrix, or if it is unknown. We are interested in identifying from the measured data only the linear constraints other than those that are specified. We ensure that none of the given constraints, or any linear combination of them, can be identified, by first determining the component of the measured data that is orthogonal to the given constraints. The orthogonal component of the data is given by

$$Y_{proj} = [I - A_g^T (A_g A_g^T)^{-1} A_g]\, Y. \qquad (43)$$

PCA is applied to $Y_{proj}$ to estimate the remaining $(m - m_g)$ constraints. In order to identify the remaining constraints from the SVD of the projected data matrix, it should first be noted that the projected data matrix satisfies the given constraints and will therefore have a rank equal to $(n - m_g)$. The smallest $m_g$ singular values of the projected data matrix will be equal to zero, and the corresponding left singular vectors (eigenvectors of the covariance matrix of the projected data matrix) will be exact linear combinations of the given constraints. Out of the non-zero singular values, we consider the smallest $(m - m_g)$ singular values (all of which should be almost equal). The SVD of the projected data matrix is therefore partitioned corresponding to the largest $(n - m)$, the next largest $(m - m_g)$, and the remaining smallest $m_g$ singular values as

$$\mathrm{svd}(Y_{proj}) = U_{proj,1} S_{proj,1} V_{proj,1}^T + U_{proj,2} S_{proj,2} V_{proj,2}^T + U_{proj,3} S_{proj,3} V_{proj,3}^T. \qquad (44)$$

The transpose of $U_{proj,2}$ is an estimate of the remaining $(m - m_g)$ constraints. The complete constraint matrix can be constructed as

$$\hat{A} = \begin{bmatrix} U_{proj,2}^T \\ A_g \end{bmatrix}. \qquad (45)$$

The reconciled estimates are obtained using the estimated constraint matrix in Eq. (5).
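A sketch of the procedure of Eqs. (43)-(45) is given below, assuming an identity error-covariance matrix as in the text; the function name and the slicing convention for the singular vectors are assumptions made for illustration.

import numpy as np

def pca_with_known_constraints(Y, A_g, m):
    """Identify the remaining (m - m_g) constraints given A_g (Eqs. 43-45)."""
    n, N = Y.shape
    m_g = A_g.shape[0]
    proj = np.eye(n) - A_g.T @ np.linalg.solve(A_g @ A_g.T, A_g)
    Y_proj = proj @ Y                                  # Eq. (43)
    U, s, Vt = np.linalg.svd(Y_proj / np.sqrt(N), full_matrices=False)
    U2 = U[:, n - m:n - m_g]                           # next-largest (m - m_g) singular directions
    return np.vstack([U2.T, A_g])                      # Eq. (45): full estimated constraint matrix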
The following example demonstrates how partial knowledge of the constraints can be combined with PCA to obtain improved estimates of the variables as well as of the error-covariance matrix.

Example 6
The flow system of Example 1 is considered here. It is assumed that two of the flow balance constraints are known a priori; these two rows form the $(2 \times 6)$ known constraint matrix $A_g$. However, the error covariances and the number of constraints are unknown. IPCA is applied to the data matrix to estimate the measurement error variances and the remaining constraints. The six singular values obtained at convergence are 258.17, 21.51, 1.01, 0.99, 0, and 0. The eigenvectors corresponding to the last two zero singular values correspond to the given constraints. Two singular values are equal to one, and hence two additional constraints are correctly identified. The standard deviations of the measurement errors estimated using IPCA and the RMSE of the reconciled estimates are shown in Table 7. Comparing with the results reported in Table 5, it can be inferred that, by utilizing knowledge of the two known process constraints, IPCA is able to obtain slightly more accurate reconciled estimates. The subspace angle between the row spaces of the estimated and true constraint matrices is 0.0188 degrees and $\alpha$ = 0.0032. Further, the estimated regression matrix $\hat{R}_{par}$ is computed by considering the flow variables $F_3$ to $F_6$ as dependent variables; the maximum absolute difference between the elements of $R$ and $\hat{R}_{par}$ is $7.4346 \times 10^{-}$. These values indicate that the incorporation of the given constraints into the IPCA (PCA) method improves the estimation of the remaining constraints.

Table 7: Partially known constraint matrix: standard deviations (SD) and root mean square errors in the reconciled estimates of flow variables using IPCA (Example 6)

Variable   Estimated SD   RMSE using IPCA
F1
F2
F3
F4
F5
F6

If the assumptions made regarding the measurement errors hold in practice, and the data are obtained when the process is strictly at steady state, then the model order can be determined by examination of the singular values as described in Sections 4 and 5. If these assumptions do not hold, then it is difficult to precisely identify the model order from data when PCA or IPCA is applied. In such a scenario, the estimated model order $m_e$ may be either greater or less than the true model order $m$. If the estimated model order is greater than the true model order, we refer to it as overfitting, because the reconciled estimates are forced to satisfy additional constraints that are not valid. Conversely, if the estimated number of constraints is less than the actual number, we refer to it as underfitting. In this section, we investigate the effect of underfitting and overfitting the data due to incorrect estimation of the number of constraints.

We write the SVD of the data matrix in partitioned form as given by Eq. (15), where $U_2$ contains the eigenvectors of $S_y$ corresponding to the smallest $m$ eigenvalues. In the case of underfitting, the eigenvectors corresponding to the smallest $m_e < m$ eigenvalues, denoted as $U_{2s}$, will be chosen. This implies that $U_{2s}$ is a subset of $U_2$, and the columns of $U_{2s}$ will therefore be an asymptotically unbiased estimate of a subset of linear combinations of the rows of $A$. Therefore, in the limit as the sample size goes to infinity, the estimated constraint matrix will be orthogonal to the true values of the variables. That is,

$$U_{2s}^T X = 0. \qquad (48)$$

The reconciled estimates corresponding to the identified constraint matrix are given by

$$\hat{X} = Y - \sqrt{N}\, U_{2s} S_{2s} V_{2s}^T \qquad (49)$$

$$\quad = (I - U_{2s} U_{2s}^T)\, Y. \qquad (50)$$

Taking the expectation of the above equation and using Eq. (48), we can prove that

$$E[\hat{X}] = X. \qquad (51)$$
Eq. (51) indicates that the reconciled estimates are unbiased even if the estimated number of constraints is less than the actual number.

If the number of constraints estimated is greater than $m$, then the estimated constraint matrix is given by

$$\hat{A} = \begin{bmatrix} U_2^T \\ U_{1s}^T \end{bmatrix}, \qquad (52)$$

where $U_{1s}$ is a subset of $U_1$. In the limit as the sample size goes to infinity, the eigenvectors $U_2$ will be orthogonal to the true values, but the eigenvectors $U_{1s}$ span a subspace of the true data space. This implies that

$$\begin{bmatrix} U_2^T \\ U_{1s}^T \end{bmatrix} X = \begin{bmatrix} 0 \\ \alpha \end{bmatrix}. \qquad (53)$$

Substituting the above estimate of the constraint matrix in Eq. (20), the reconciled estimates are obtained as

$$\hat{X} = Y - [U_2 \ U_{1s}] \begin{bmatrix} U_2^T \\ U_{1s}^T \end{bmatrix} Y \qquad (54)$$

$$\quad = Y - (U_2 U_2^T + U_{1s} U_{1s}^T)\, Y. \qquad (55)$$

Taking the expectation of the above equation and using Eq. (53), we get

$$E[\hat{X}] = X - U_{1s}\,\alpha. \qquad (56)$$

This shows that overfitting introduces a bias in the reconciled values. Based on the above analysis, we make the recommendation that, in practice, it is better to be conservative in estimating the number of constraints and to avoid overfitting when using heuristics to determine the model order. The following example demonstrates the effect of underfitting and overfitting the data.

Example 7
In this example, we demonstrate the effect of the selection of the model order on the estimates of the constraints, the standard deviations, and the reconciled values. The settings of Example 4 are considered to demonstrate the concept. IPCA is applied to the data matrix under different choices of the model order. The singular values obtained at convergence of the method are reported for the different model orders in Table 8. Examination of the singular values shows that if the assumed model order is 5, which is greater than the true model order of 4, then the last five singular values are not all equal to unity. This clearly indicates that the assumed model order is incorrect. However, if the assumed model order is equal to or less than the true model order, then the number of unity singular values obtained is equal to the assumed model order. Based on these observations, a systematic method can be devised for determining the true model order using IPCA. We start with the least value of the model order that results in an identifiable problem, that is, the assumed value of $m$ should be such that $m(m+1)/2 \geq n$ (in this example it is 3, since we have to estimate 6 error variances). If the number of unity singular values obtained at convergence is equal to $m$, then we cannot conclude that the assumed model order is correct. Instead, we increment the assumed model order by unity and apply IPCA again. If the number of unity singular values obtained is not equal to the assumed model order, then this violates the theoretical result expected from IPCA. The true model order is one less than the assumed model order at which this violation is observed. Although in theory the true model order can be estimated exactly using this systematic procedure with IPCA (a sketch of the scan is given after the tables below), it may not work in practice if the noise in the measurements does not satisfy the assumptions made, or if there is mild nonlinearity in the process constraints, which we have not considered. In practical applications, when there is ambiguity in selecting the model order, it is better to choose a lower value. This is clearly shown by the RMSE values of the reconciled estimates obtained for this example for different model-order choices, reported in Table 9. Comparing the RMSE values, it can be seen that overfitting leads to poor reconciled estimates due to the bias introduced in the estimates. In contrast, underfitting only marginally increases the inaccuracy of the reconciled estimates. The $\alpha$ values computed for the various orders in Table 10 also indicate that overfitting leads to poor estimates of the constraint model.

Table 8: Singular values for different model-order selections

No.   m_e = 5 (overfitting)   m_e = 4 (perfect model order)   m_e = 3 (underfitting)
1     30194.45                258.47                          57191.37
2     1.98                    21.53                           41.35
3     1.00                    1.02                            2.54
4     0.16                    0.99                            1.01
5     0.11                    0.99                            0.99
6     0.07                    0.99                            0.99

Table 9: RMSE values for each flow variable for different model-order selections

Variable   m_e = 5 (overfitting)   m_e = 4 (perfect model order)   m_e = 3 (underfitting)
F1
F2
F3
F4
F5
F6

Table 10: $\alpha$ values for different model-order selections

m_e = 5 (overfitting)        0.9125
m_e = 4 (true model order)   0.0509
m_e = 3 (underfitting)       0.0492
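The model-order scan described in Example 7 can be sketched as follows. It assumes that an ipca function like the one sketched in Section 5 is in scope, and the tolerance used to count near-unity eigenvalues and the function name are arbitrary illustrative choices.

import numpy as np

def select_model_order(Y, m_start, m_max, tol=0.1):
    """Scan trial orders with IPCA until the unity-eigenvalue count no longer matches."""
    for m in range(m_start, m_max + 1):
        A_hat, Sigma_e, eigvals = ipca(Y, m)           # ipca as sketched in Section 5
        n_unity = int(np.sum(np.abs(np.sort(eigvals)[:m] - 1.0) < tol))
        if n_unity != m:                               # violation: the previous order was correct
            return m - 1
    return m_max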
Conclusions

PCA has primarily been regarded as a multivariate statistical technique that is useful for reducing the dimensionality of data as well as for denoising it. The main message that we have attempted to convey in this paper is that PCA is a method that can be used to derive the steady-state constraints of a linear process entirely from data, along with reconciliation of the measurements. Iterative PCA, which is a recent extension of PCA, derives both the error-covariance matrix of the measurements and the process constraint model simultaneously from data. Thus, the information necessary to reconcile data, and the reconciled estimates themselves, can be extracted from the data without any a priori process knowledge using PCA (or IPCA). We have also shown that if partial knowledge of the process constraints is available, then it can be exploited to improve the estimates. Further, the effect of the model order on the reconciled values has been studied, and it is shown that it is better to be conservative in fitting the number of constraints (i.e., to estimate fewer constraints) when the model order is unknown.

The perspective provided in this paper can be used to seamlessly integrate PCA or IPCA with data reconciliation (DR) and its companion technique of gross error detection (GED). This integration is depicted in Fig. 3. We propose that IPCA be used on historical (training) data to derive the process model and error-covariance matrix if they cannot be easily obtained using a priori process knowledge. The derived model and error-covariance matrix can then be used in the techniques of DR and GED for reconciling future measurements and for detecting gross errors in these measurements or in the model, as required.
Acknowledgment
The financial support to Dr. Nirav Bhatt from the Department of Science & Technology, India, through the INSPIRE Faculty Fellowship is acknowledged.
References

[Almasy and Mah(1984)] Almasy, G.A.; Mah, R.S.H. Estimation of measurement error variances from process data. Ind Eng Chem Process Des Dev 1984;23:779-784.

[Bagajewicz(2001)] Bagajewicz, M.J. Process Plant Instrumentation: Design and Upgrade. Technomic Publishing Company, Lancaster, 2001.
Figure 3: PCA & DR: A Unified Framework. Training data Y are used for model building with PCA/IPCA, yielding the constraint matrix A and error covariance Σ; these are then used for data reconciliation (DR) and gross error detection (GED) on test data y_new to obtain reconciled estimates x̂_new.