Dual Blind Denoising Autoencoders for Industrial Process Data Filtering
Saúl Langarica, Student Member, IEEE, and Felipe Núñez, Member, IEEE
Abstract—In an industrial internet setting, ensuring trustworthiness of process data is a must when data-driven algorithms operate in the upper layers of the control system. Unfortunately, the common situation in an industrial setting is to find time series heavily corrupted by noise and outliers. Typical methods for cleaning the data include the use of smoothing filters or model-based observers. In this work, a purely data-driven learning-based approach is proposed, based on a combination of convolutional and recurrent neural networks in an autoencoder configuration. Results show that the proposed technique outperforms classical methods in both a simulated example and an application using real process data from an industrial facility.
Index Terms—Autoencoders, process control.
This work was supported by ANID under grant ANID PIA ACT192013. S. Langarica and F. Núñez are with the Department of Electrical Engineering, Pontificia Universidad Católica de Chile, Av. Vicuña Mackenna 4860, Santiago, Chile 7820436. E-mail: [email protected], [email protected].

I. INTRODUCTION

The incorporation of industrial Internet of Things technologies into modern industrial facilities allows the real-time acquisition of an enormous amount of process data, typically in the form of time series, which represents an opportunity to improve performance by using data-driven algorithms for supervision, modeling and control [1]–[3].

Data-driven techniques, such as statistical or machine learning algorithms, are capable of dealing with the multivariate and intricate nature of industrial processes; however, they rely on the consistency and integrity of the data to work properly [3]. This imposes a limitation on the online application of these algorithms in real facilities, since process data are often highly corrupted with outliers and noise, caused by multiple factors such as environmental disturbances, human interventions, and faulty sensors. Consequently, there is a lack of data-driven applications operating online, and the vast majority of successful implementations use offline preprocessed data, simulations, or generate the database in a controlled environment, such as pilot-scale deployments in laboratories [4].

A typical approach for dealing with corrupted process data is the use of smoothing filters, like simple discrete low-pass filters, the Savitzky-Golay (SG) filter [5] or exponential moving average (EMA) filters [6]. The main drawback of these techniques is their univariate nature; hence, the redundancy and correlations among variables typically present in industrial processes are not exploited for denoising. Multivariate denoisers, which can exploit cross-correlation between signals, are a natural improvement over univariate filters. Approaches like Kalman [7] or particle filters [8] are the flagship techniques; however, they require the selection of a suitable model and a priori estimation of parameters, such as the covariance matrices in the Kalman filter, which are critical for good performance.

A different approach for denoising is the use of transforms, like wavelets [9] or Gabor [10], which exploit statistical properties of the noise so the signal can be thresholded in the transformed domain to preserve only the high-valued coefficients, and then, by applying the inverse transform, obtain a cleaner signal. A limitation of these approaches is the difficulty of knowing a priori the best basis for representing the signals; moreover, without knowledge of the noise nature, as is the case in real process data, it is hard to determine where to threshold the transformed data.

Learning-based denoising algorithms, like principal component analysis (PCA) [11], kernel PCA [12] or dictionary learning [13], solve some of the problems that fixed transforms have, by learning a suitable representation of the data in a transformed space. These approaches are also multivariate in nature, so the spatial correlations between signals are exploited; nevertheless, these algorithms were designed for static data, e.g., images, and hence important information from temporal correlations is not exploited at all.

Recently, denoising autoencoders (DAEs) [14], [15] have emerged as a suitable learning-based denoising technique that is multivariate in nature and is capable of learning complex nonlinear structures and relationships between variables. This represents a great advantage over traditional learning-based techniques when dealing with highly nonlinear data. Originally, DAEs emerged for image denoising, but the use of recurrent neural networks has allowed their application to denoising dynamical data, such as audio and video [16], [17]. However, unlike PCA, dictionary learning or fixed-transform techniques, DAEs are not blind, in the sense that to learn to denoise a signal, the clean version of the signal (the target) has to be known beforehand. In addition, information about the characteristics of the noise affecting the process is required to create realistic training examples. This is a great limitation for the use of DAEs in real-world applications where the clean version of the signal and the noise characteristics are unknown.

In this work, we propose the use of a dual blind denoising autoencoder (DBDAE) for multivariate time series denoising, which preserves all the advantages of DAEs and eliminates the necessity of knowing beforehand the noise characteristics and the clean version of the signals. The network structure is designed to exploit both spatial and temporal correlations of the input data, by combining recurrent and convolutional encoder networks. This dual encoding allows the network to reconstruct missing or faulty signals, adding robustness to online applications. Because of the predictive capabilities of the network, the phase delay is minimal compared to traditional techniques like low-pass filters, which is a critical advantage for real-time applications built on top, such as feedback controllers.

Consequently, the main contributions of this work are: i) the formulation of a new blind denoising technique based on DAEs that eliminates the necessity of knowing a priori information about the signals; and ii) a novel autoencoder architecture that enables the network to exploit both temporal and spatial correlations, which allows faulty signals to be reconstructed, in addition to the denoising capabilities.

The rest of this paper is organized as follows. Preliminaries of data filtering in industrial processes are given in Section II. In Section III the basics of autoencoders and DAEs are presented. Denoising and reconstruction based on the proposed DBDAE is presented and applied to a simulated dynamical system in Section IV. Section V shows an implementation of the DBDAE in a real industrial process. Finally, conclusions and directions for future research are presented in Section VI.

II. PRELIMINARIES
A. Notation and Basic Definitions
In this work, $\mathbb{R}$ denotes the real numbers, $\mathbb{Z}_{\geq 0}$ the non-negative integers, $\mathbb{R}^n$ the Euclidean space of dimension $n$, and $\mathbb{R}^{n \times m}$ the set of $n \times m$ matrices with real coefficients. For $a, b \in \mathbb{Z}_{\geq 0}$ we use $[a; b]$ to denote their closed interval in $\mathbb{Z}_{\geq 0}$. For a vector $v \in \mathbb{R}^n$, $v_i$ denotes its $i$th component. For a matrix $A \in \mathbb{R}^{n \times m}$, $A_i$ denotes its $i$th column and $A^i$ its $i$th row. For an $n$-dimensional real-valued sequence $\alpha : \mathbb{Z}_{\geq 0} \to \mathbb{R}^n$, $\alpha(t)$ denotes its $t$th element, and $\alpha_{[a;b]}$ denotes its restriction to the interval $[a; b]$, i.e., a sub-sequence. For a sub-sequence $\alpha_{[a;b]}$, $M(\alpha_{[a;b]}) \in \mathbb{R}^{n \times (b-a+1)}$ is a matrix whose $i$th column is equal to $\alpha(a + i - 1)$, with $i \in [1; b-a+1]$. The same notation applies for an $n$-dimensional finite-length sequence $\alpha : [0; \bar{T}] \to \mathbb{R}^n$, with the understanding that for a sub-sequence $\alpha_{[a;b]}$, $[a; b] \subseteq [0; \bar{T}]$ must hold. Given an $N$-dimensional sequence $\alpha$, we define its $T$-depth window as a matrix-valued sequence $\beta : \mathbb{Z}_{\geq 0} \to \mathbb{R}^{N \times T}$, where $\beta(t) = M(\alpha_{[t-T+1; t]})$.

B. Typical approaches for process data filtering
When dealing with real industrial time series, denoising methods can be classified into model-based and data-driven. In practice, however, model-based denoisers like Kalman or particle filters cannot be applied in the vast majority of cases, since obtaining an accurate model of these processes is often unfeasible. On the other hand, commonly used data-driven techniques, while much simpler, are still effective; hence, they are preferred in real implementations.

According to [18], the most used data-driven methods for denoising industrial time series are low-pass digital filters, such as the EMA, and the SG filter. Although the EMA filter is quite simple, large delays can be introduced in the denoised signal, since its convergence is exponential. For a given sequence $y$, with $y(t) \in \mathbb{R}$, the EMA filter constructs the smoothed version as

$\hat{y}(t) = \alpha \hat{y}(t-1) + (1-\alpha) y(t),$ (1)

where $\hat{y}(t)$ is the $t$th smoothed output of the EMA filter and $\alpha \in [0, 1)$ is the filter weight determining the importance given to the past output.

The SG filter outperforms common low-pass filters in preserving useful high-frequency information and in preventing extra delays [18]. The SG filter calculates the smoothed output $\hat{y}(t)$ using a local discrete convolution over the $2M+1$-sample sub-sequence $y_{[t-M; t+M]}$, where the convolution coefficients are derived from a least-squares polynomial smoothing. Because EMA and SG filters are the most used denoising techniques in industry, they will be used as a baseline for comparison with our DBDAE network.
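For concreteness, a minimal sketch of these two baselines is shown below using NumPy and SciPy; the filter parameters (the EMA weight $\alpha$, and the SG window length and polynomial order) are illustrative choices, not values used in this paper.

```python
# Baseline denoisers: the EMA filter of (1) and the Savitzky-Golay filter.
# Parameters below are illustrative, not the paper's tuned values.
import numpy as np
from scipy.signal import savgol_filter

def ema(y, alpha=0.9):
    """Exponential moving average: y_hat(t) = alpha*y_hat(t-1) + (1-alpha)*y(t)."""
    y_hat = np.empty_like(y, dtype=float)
    y_hat[0] = y[0]
    for t in range(1, len(y)):
        y_hat[t] = alpha * y_hat[t - 1] + (1 - alpha) * y[t]
    return y_hat

t = np.linspace(0, 10, 1000)
y = np.sin(t) + 0.3 * np.random.randn(t.size)           # noisy test signal
y_ema = ema(y, alpha=0.9)
y_sg = savgol_filter(y, window_length=31, polyorder=3)  # 2M+1 = 31 samples
```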
III. AUTOENCODERS

A. Background
Autoencoders (AEs) [19] are unsupervised neural networks trained to reconstruct their inputs at the output layer, passing through an intermediate layer normally of lower dimension. AEs can be regarded as a nonlinear generalization of PCA, aiming to encode input data in an intermediate lower-dimensional representation that preserves most of the information in the inputs. This intermediate representation is known as the latent space of the network.

Formally, for a given sequence $y$, the AE maps an input vector $y(t) \in \mathbb{R}^n$ to a latent representation $z \in \mathbb{R}^m$ with $m < n$. This mapping is done by a function $f_{\theta_E}$, which in the simplest case is a linear layer with $\sigma$ as an arbitrary activation function, namely,

$z = f_{\theta_E}(y(t)) = \sigma(W_{\theta_E} y(t) + b_{\theta_E}),$ (2)

where $W_{\theta_E} \in \mathbb{R}^{m \times n}$ and $b_{\theta_E} \in \mathbb{R}^m$ are trainable parameters of the network. In more complex approaches, $f_{\theta_E}$ can be chosen to be any type of layer, even a stack of multiple layers. After the encoding, the latent vector is mapped back to the input space by a second function $g_{\theta_D}$ known as the decoder:

$\hat{y}(t) = g_{\theta_D}(z) = \sigma(W_{\theta_D} z + b_{\theta_D}),$ (3)

where $W_{\theta_D} \in \mathbb{R}^{n \times m}$ and $b_{\theta_D} \in \mathbb{R}^n$.

In training, the parameters of the AE are found by solving the following optimization problem:

$\theta^* = \operatorname{argmin}_\theta \| \hat{y}(t) - y(t) \|^2,$ (4)

where $\theta$ accounts for all the trainable parameters. Figure 1 illustrates a simple AE network.

Fig. 1. Simple AE architecture where $f_{\theta_E}$ encodes the input data to a latent representation $z$ and then $g_{\theta_D}$ decodes $z$ to reconstruct the input.
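As an illustration of (2)–(4), a single-layer AE can be sketched in PyTorch as follows; the dimensions $n = 8$ and $m = 3$ and the sigmoid activations are placeholder choices for the example.

```python
# Minimal sketch of the AE in (2)-(4): a linear encoder f and decoder g
# trained to reconstruct the input. Dimensions are illustrative.
import torch
import torch.nn as nn

class SimpleAE(nn.Module):
    def __init__(self, n=8, m=3):
        super().__init__()
        self.encoder = nn.Linear(n, m)  # f_theta_E, Eq. (2)
        self.decoder = nn.Linear(m, n)  # g_theta_D, Eq. (3)

    def forward(self, y):
        z = torch.sigmoid(self.encoder(y))      # latent representation
        y_hat = torch.sigmoid(self.decoder(z))  # reconstruction
        return y_hat, z

ae = SimpleAE()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
y = torch.rand(64, 8)             # a batch of input vectors
y_hat, _ = ae(y)
loss = ((y_hat - y) ** 2).mean()  # reconstruction loss, Eq. (4)
opt.zero_grad(); loss.backward(); opt.step()
```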
B. Denoising Autoencoder

When the input of the AE is corrupted with noise and the target output is clean, the resulting latent space is more robust and learns richer features [20]. Additionally, the network also learns how to denoise corrupted inputs. This has led to many denoising applications in static data such as images [21] and, with the use of recurrent AEs [16], this technique has also been effectively applied to dynamic data [17]. However, DAEs work under the assumption that the clean version of the input data is available, as well as information about the noise. As mentioned before, this limits the applicability of DAEs to real-world data where, in general, neither of these requirements is fulfilled.

IV. PROCESS DATA FILTERING BASED ON DUAL BLIND DENOISING AUTOENCODERS
Blind denoising autoencoders (BDAEs) were first introduced in [22]. This particular type of AE makes it possible to denoise a signal with only a few assumptions on the noise and without access to the clean version of the signal, which makes them an appealing alternative for denoising real-world process data. However, in [22] this technique was applied only to static data. In the following, we propose a BDAE-based technique for denoising multivariate time series built on a dual convolutional-recurrent architecture; the proposed technique is also capable of masking faulty sensors, making it ideal for online processing of real industrial data.
A. General setup
Consider an underlying dynamical system generating data, which is sampled periodically by a set of sensors with period $\tau$. From the sensors' perspective, the plant can be modeled as a discrete-time dynamical system given by

$x(n+1) = F_n(x(n), w(n)),$ (5)

where $n$ denotes the time step, $x(n) \in \mathbb{R}^W$, with $W$ the order of the system, denotes the current value of the internal state, $x(n+1)$ is the future value of the state, $w(n)$ denotes state-space noise, and $F_n$ is a generally smooth nonlinear mapping governing the dynamics of the system.

The measurement (observation) process is given by

$\tilde{y}(n) = C_n(x(n), v(n)),$ (6)

where $\tilde{y}(n)$ denotes the set of variables measured by the sensors, $C_n$ is the output mapping, and $v(n)$ represents measurement noise. For a noise-free condition, we denote the measured variables at time step $n$ as

$y(n) = C_n(x(n), 0).$ (7)

Under this setup, the measurement process generates a sequence $\tilde{y} : \mathbb{Z}_{\geq 0} \to \mathbb{R}^N$, where without loss of generality $N \neq W$; therefore,

$\tilde{y}(n) = [\tilde{y}_1(n), \tilde{y}_2(n), \dots, \tilde{y}_N(n)]^T \in \mathbb{R}^N,$ (8)

where each $\tilde{y}_i(n)$ represents a real-valued measured variable of the underlying system at time instant $n$.
Let $\tilde{Y}$ be the $T$-depth window of $\tilde{y}$. Hence,

$\tilde{Y}(n) = [\tilde{y}(n-T+1), \tilde{y}(n-T+2), \dots, \tilde{y}(n)] \in \mathbb{R}^{N \times T}.$ (9)

In this setting, $\tilde{Y}_k(n) \in \mathbb{R}^N$ is a column vector containing the $N$ measurements at time instant $n-T+k$, and $\tilde{Y}^k(n) \in \mathbb{R}^T$ is a row vector containing the last $T$ samples of variable $k$. The AE will use $\tilde{Y}$ both for denoising and for reconstruction of faulty signals. Since the AE is a dynamical system itself that processes the data sequentially (due to the use of recurrent neural networks) to accomplish these two tasks, two timescales should be considered in the derivations. The process time refers to the time scale at which (5) evolves and is indexed by $n$. On the other hand, the AE's time refers to the internal time of the AE, which uses the elements of $\tilde{Y}(n)$ that were obtained at the process time scale but are processed at the computer's processor speed. These elements will be indexed by $j$.

Since the objective of the AE is to operate in real time, the AE's timescale has to be fast enough to process the measurements in $\tilde{Y}(n)$ before another set of measurements captured at the process timescale comes in. In this setup, we have two dynamical systems at work: the slow system (the process) and the fast system (the AE), which at each process time step gets triggered and iterates itself a larger number of steps that depends on the size of $\tilde{Y}(n)$ and the network architecture.

The problem to be solved by the DBDAE is twofold:

1) Denoise the current measurement $\tilde{y}(n)$ given only the noisy measurements contained in $\tilde{Y}(n)$. That is to say, approximate as closely as possible

$y(n) = C_n(x(n), 0).$ (10)

Unlike DAEs, in this case we do not have access to $y(n)$; therefore, the denoising of input signals must be done in a blind manner.

2) Reconstruct faulty signals. In this case $\tilde{Y}(n)$ contains a row $\tilde{Y}^k(n) \in \mathbb{R}^T$ of sensor measurements that is labeled as faulty, possibly by a previous fault-detection algorithm:

$\tilde{Y}^k(n) = [l, l, \dots, l] \in \mathbb{R}^T.$ (11)

Here $l$ is an arbitrary label indicating that sensor $k$ is faulty. The DBDAE has to learn to recognize this label and exploit both temporal and spatial correlations among the other, non-faulty measurements to approximate as closely as possible

$y(n) = C_n(x(n), 0).$ (12)

To recover $y(n)$ from $\tilde{Y}(n)$, possibly masking one or more signals in case of faulty sensors, the DBDAE has to deal mainly with three problems. The first is to obtain rich features in the latent space, from which reconstruction and denoising can be performed concurrently. The second is to find a mechanism that enables the network to reconstruct faulty input signals. The third is to find a criterion to constrain the autoencoder to learn only valuable information from the inputs, leaving out noise.
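To make the windowing notation concrete, the following sketch builds the $T$-depth window $\tilde{Y}(n)$ of (9) from a stream of measurements; the dimensions are illustrative.

```python
# Sketch of the T-depth window in (9): Y_tilde(n) stacks the last T samples
# of the N measured variables as columns, oldest first.
import numpy as np

N, T = 4, 32                       # illustrative dimensions
stream = np.random.randn(1000, N)  # y_tilde(0), y_tilde(1), ... as rows

def t_depth_window(stream, n, T):
    """Return Y_tilde(n) in R^{N x T}, columns y_tilde(n-T+1) ... y_tilde(n)."""
    return stream[n - T + 1 : n + 1].T  # shape (N, T)

Y_n = t_depth_window(stream, n=100, T=T)
assert Y_n.shape == (N, T)
```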
1) Obtaining rich latent features:
Obtaining rich features for multivariate time series is a challenging task, since not only spatial information matters, as in static data like images, but also time-dependent information needs to be taken into account. Therefore, to obtain rich features, both temporal correlations (within-sensor information) and spatial correlations (between-sensor information) must be present. To this end, we propose to use a dual autoencoder structure with two encoders, a convolutional and a recurrent encoder, and one recurrent decoder. In this case, at each process time step $n$, $\tilde{Y}(n)$ is processed independently in each encoder and then, by adding their encodings, rich latent features can be obtained.

The recurrent encoder generates its latent representation $h_R^{L_R}$, which is a $Q_{L_R}$-dimensional finite-length sequence of length $T$, by feeding the network iteratively with the elements of $\tilde{Y}(n)$ and then recursively feeding back the internal state of the network, namely,

$h_R^{L_R}(j) = f_{RE}(\tilde{Y}_j(n), h_R^{L_R}(j-1)),$ (13)

where $f_{RE}$ represents the recurrent network, with $L_R$ layers and $Q_l$ neurons in each layer $l$, which in our case is composed of Gated Recurrent Unit (GRU) layers [23]. The operations at each GRU layer $l$ are given by

$z^l(j) = \sigma(W_{z^l} p^l(j) + U_{z^l} h_R^l(j-1) + b_{z^l})$
$r^l(j) = \sigma(W_{r^l} p^l(j) + U_{r^l} h_R^l(j-1) + b_{r^l})$
$n^l(j) = \tanh(W_{n^l} p^l(j) + r^l(j) \circ (U_{n^l} h_R^l(j-1) + b_{n^l}))$
$h_R^l(j) = (1 - z^l(j)) \circ n^l(j) + z^l(j) \circ h_R^l(j-1),$ (14)

where $h_R^l(j) \in \mathbb{R}^{Q_l}$ is the hidden state of layer $l$ at the AE's time $j$; $r^l$, $z^l$ and $n^l$ are $Q_l$-dimensional finite-length sequences representing the values of the reset, update and new gates of layer $l$, respectively; $\sigma$ and $\tanh$ are the sigmoid and hyperbolic tangent activation functions; $\circ$ is the Hadamard product; and $p^l(j)$ is the input of layer $l$ at the AE's time $j$. For the first layer ($l = 1$) the input $p^1(j)$ is given by the input of the network $\tilde{Y}_j(n)$, and for the subsequent layers the input is given by $h_R^{l-1}(j)$, the current hidden state of the previous layer.

Finally, the latent representation of the recurrent encoder is $h_R^{L_R}(T) \in \mathbb{R}^{Q_{L_R}}$, which is the last hidden state of the last layer. The final output of the recurrent encoder is obtained by projecting $h_R^{L_R}(T)$ with a linear layer $S_1$ to obtain the dimension of the latent space, $Q$, namely,

$h_R(n) = W_{S_1} h_R^{L_R}(T) + b_{S_1},$ (15)

where $W_{S_1} \in \mathbb{R}^{Q \times Q_{L_R}}$ and $b_{S_1} \in \mathbb{R}^{Q}$ are trainable parameters of $S_1$. It should be noted that $h_R$ is indexed by $n$, the process time, despite its calculation involving $T$ iterations of the underlying recurrent network at a faster time scale. This latent vector is forced to contain information about the whole input sequence and, therefore, is able to capture the temporal information in the signals.

In the case of the convolutional encoder, one-dimensional convolutions are applied to the same input $\tilde{Y}(n)$. Each of the $N$ different measurement series $\tilde{Y}^i(n)$ that compose the input at time step $n$ is modeled as a different channel of the convolution, and by doing so, spatial correlations can be captured. The convolutional encoder is given by

$h_C^{L_C}(n) = f_C(\tilde{Y}(n)),$ (16)

where $f_C$ represents the convolutional encoder network with $L_C$ layers, in which the output of each layer $l$ is composed of $K_l$ concatenated output channels $C_{out_k}^l(n) \in \mathbb{R}^{z_l}$, with $k \in [1; K_l]$, given $I$ input channels.
Each of the resulting output channels for layer $l$ at process time step $n$ is given by

$C_{out_k}^l(n) = b_k^l + \sum_{i=1}^{I} W_k^l(i) \star p_i^l(n),$ (17)

where $b_k^l$ is a particular bias vector for channel $k$ at layer $l$, $W_k^l(i)$ is the $i$th channel of the $I$-channel kernel corresponding to output channel $k$, $p_i^l(n)$ is the $i$th channel of the input of layer $l$, and $\star$ is the cross-correlation operator. For the first layer ($l = 1$) the input $p_i^1(n)$ is given by the sequence of measurements of the $i$th sensor $\tilde{Y}^i(n)$, while for the subsequent layers ($l > 1$) $p_i^l(n)$ is given by an output channel of the previous layer $C_i^{l-1}(n)$. It is worth mentioning that the dimension of each output channel, $z_l$, depends both on the size of the input channels and on the parameters of the convolution, such as the padding, kernel size, stride and dilation.

Finally, after concatenating all the output channels of the last layer $L_C$, the convolutional encoder produces its own latent representation $h_C^{L_C}(n) \in \mathbb{R}^{z_{L_C} \times K_{L_C}}$, which is projected by two linear layers to match the requirements of the latent space dimension. The first linear layer $S_2$ transforms $h_C^{L_C}(n)$ into a vector $\hat{h}_C(n) \in \mathbb{R}^{K_{L_C}}$, and the second linear layer $S_3$ transforms this resulting vector to the required latent space dimension $Q$:

$\hat{h}_C(n) = (h_C^{L_C}(n))^T W_{S_2} + b_{S_2}$
$h_C(n) = W_{S_3} \hat{h}_C(n) + b_{S_3},$ (18)

where $W_{S_2} \in \mathbb{R}^{z_{L_C}}$, $b_{S_2} \in \mathbb{R}^{K_{L_C}}$, $W_{S_3} \in \mathbb{R}^{Q \times K_{L_C}}$ and $b_{S_3} \in \mathbb{R}^{Q}$ are trainable parameters of $S_2$ and $S_3$.

The final latent representation of the input at time step $n$ is just the sum of the latent representations of both encoders:

$h_E(n) = h_R(n) + h_C(n).$ (19)

For the decoding phase, we first project $h_E(n)$ with a linear layer $S_4$ to extract useful features for decoding:

$\hat{h}_E(n) = W_{S_4} h_E(n) + b_{S_4},$ (20)

where $W_{S_4} \in \mathbb{R}^{Q \times Q}$ and $b_{S_4} \in \mathbb{R}^{Q}$. Then, another recurrent GRU network $f_{RD}$, with $L_D$ layers and $\bar{Q}_l$ neurons in each layer $l$, receives the resulting vector from this projection as its initial state and begins to decode the information. The decoder hidden state, a $\bar{Q}_l$-dimensional finite-length sequence $h_D^{L_D}$ of length $T$, is generated as

$h_D^{L_D}(j) = f_{RD}(p_j(n), h_D^{L_D}(j-1)),$ (21)

where $p_j(n) \in \mathbb{R}^N$ is the input for each decoding step of the network. This input can be either the delayed target $\tilde{Y}_{j-1}(n)$ or the previous network estimate $\hat{Y}_{j-1}(n)$. The network estimates are calculated by projecting $h_D^{L_D}(j)$ with a linear layer $S_5$ as

$\hat{Y}_j(n) = W_{S_5} h_D^{L_D}(j) + b_{S_5},$ (22)

where $W_{S_5} \in \mathbb{R}^{N \times \bar{Q}_{L_D}}$ and $b_{S_5} \in \mathbb{R}^{N}$ are trainable parameters. Hence, the final output of the DBDAE for process time $n$ corresponds to $\hat{Y}_T(n)$. It should be noted that the decoder starts once $h_E(n)$ is available, which involves $T$ previous iterations of the encoder at the AE's processing rate. Therefore, care should be taken when interpreting the local time of the decoder.

Even though we have access to the entire decoder's target ($\tilde{Y}(n)$) during training and testing, and exposure bias [24] is not a problem, we found that gradually giving the network its own predictions $\hat{Y}_j(n)$ yields better results. This approach is called scheduled sampling [25] and consists of giving the network the target at $j-1$ as input for predicting the target at $j$ in early stages of the training phase and gradually, with an increasing probability $p$, feeding the network its own prediction at $j-1$ for predicting the target at $j$.
This gradual shift from using the target as the input in each step to using the network's own predictions improves the stability of the network, especially when the decoded sequence is long. On the other hand, a value $p = 1$, i.e., the network relying solely on its own predictions, forces $h_E(n)$ to retain all the information of the signals and to decode the entire sequence from it. In our case, $p$ was increased linearly as

$p = \min(1, k + cE),$ (23)

where $k$ and $c$ are parameters and $E$ is the number of training iterations.

Finally, the loss function used for training is the reconstruction error between the autoencoder's prediction for the whole sequence, $\hat{Y}(n)$, and the noisy measurements $\tilde{Y}(n)$:

$\text{Loss} = \frac{1}{NT} \sum_{k=1}^{N} \sum_{j=1}^{T} (\hat{Y}_j^k(n) - \tilde{Y}_j^k(n))^2.$ (24)

An illustration of the DBDAE is shown in Figure 2.
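The following PyTorch sketch summarizes the dual architecture described above: recurrent and convolutional encoders, summed latent vectors, and a GRU decoder with scheduled sampling. Layer counts, widths and kernel sizes are illustrative and do not reproduce the exact configuration used in the experiments.

```python
# Minimal sketch of the DBDAE with illustrative sizes (N sensors, window T,
# latent dimension Q). Not the paper's exact layer configuration.
import torch
import torch.nn as nn

class DBDAE(nn.Module):
    def __init__(self, N=8, T=32, Q=16, Q_R=64, K=32):
        super().__init__()
        self.T = T
        self.rnn_enc = nn.GRU(N, Q_R, num_layers=2, batch_first=True)
        self.S1 = nn.Linear(Q_R, Q)               # Eq. (15)
        self.conv_enc = nn.Sequential(            # Eqs. (16)-(17)
            nn.Conv1d(N, K, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(K, K, kernel_size=3, padding=1), nn.ReLU())
        self.S2 = nn.Linear(T, 1)                 # Eq. (18), over the time axis
        self.S3 = nn.Linear(K, Q)
        self.S4 = nn.Linear(Q, Q)                 # Eq. (20)
        self.rnn_dec = nn.GRU(N, Q, num_layers=1, batch_first=True)
        self.S5 = nn.Linear(Q, N)                 # Eq. (22)

    def encode(self, Y):                          # Y: (batch, T, N)
        _, h = self.rnn_enc(Y)                    # h: (layers, batch, Q_R)
        h_R = self.S1(h[-1])                      # temporal features, Eq. (15)
        c = self.conv_enc(Y.transpose(1, 2))      # (batch, K, T)
        h_C = self.S3(self.S2(c).squeeze(-1))     # spatial features, Eq. (18)
        return h_R + h_C                          # Eq. (19)

    def forward(self, Y, p=1.0):
        h = self.S4(self.encode(Y)).unsqueeze(0)  # decoder initial state
        y_prev = torch.zeros(Y.size(0), 1, Y.size(2))
        outs = []
        for j in range(self.T):
            # scheduled sampling: with probability p feed back the prediction,
            # otherwise feed the (noisy) target of the previous step
            if j > 0 and torch.rand(1).item() > p:
                y_prev = Y[:, j - 1 : j, :]
            out, h = self.rnn_dec(y_prev, h)
            y_prev = self.S5(out)                 # Eq. (22)
            outs.append(y_prev)
        return torch.cat(outs, dim=1)             # (batch, T, N)

model = DBDAE()
Y = torch.randn(4, 32, 8)          # a batch of noisy windows
Y_hat = model(Y, p=0.5)
loss = ((Y_hat - Y) ** 2).mean()   # reconstruction loss, Eq. (24)
```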
2) Enhancing the network to reconstruct faulty input signals:
To reconstruct faulty signals, in addition to requiring a latent space with rich latent features, a simple yet effective method is proposed, which consists of randomly choosing an input measurement sequence $\tilde{Y}^i(n)$ in each training iteration and replacing all its values by an arbitrary label $l$ indicating that the signal is faulty:

$\tilde{Y}^i(n) = [l, l, \dots, l] \in \mathbb{R}^T,$ (25)

where $i \in [1; N]$ is randomly chosen in each iteration. Then, similar to common denoising autoencoders, we enforce the network to reconstruct the original input without corrupting any of its values. Additionally, we can weight the loss function (usually MSE for autoencoders) to emphasize the reconstruction of the faulty signal. It is worth mentioning that, for stability reasons, and similar to the scheduled sampling technique, at the beginning of the training phase the probability of choosing one of the signals as faulty is small and is then gradually increased.
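A minimal sketch of this augmentation is shown below; the fault probability $\epsilon_f = 0.3$ and label $l = -1$ are illustrative values.

```python
# Sketch of the fault-masking augmentation of (25): with probability eps_f,
# one randomly chosen input series is replaced by the fault label l.
import torch

def mask_random_sensor(Y, eps_f=0.3, l=-1.0):
    """Y: (batch, T, N). Returns the masked window and the masked index (or None)."""
    if torch.rand(1).item() < eps_f:
        i = torch.randint(Y.size(2), (1,)).item()  # sensor chosen uniformly
        Y = Y.clone()
        Y[:, :, i] = l                             # Y_tilde^i(n) = [l, ..., l]
        return Y, i
    return Y, None
```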
3) Blind denoising by constraining reconstruction:
Similar to related works that use AEs for signal reconstruction, a great effort is put into obtaining rich features that contain the most important information of the signals, since the richer the features, the better the reconstruction. However, unlike AEs, we do not want a perfect reconstruction of the input, because that would mean reconstructing noise as well. Therefore, a criterion is needed to determine how rich the latent space features must be to capture only the important information of the signals, leaving out noise.

Intuitively, this could be achieved by using a small latent space dimension; by doing so, the network will be forced to retain only the most important information of the signal. However, it is not clear what the dimension should be. This of course depends on the application, and if an excessively small latent space dimension is selected, the network will not be able to learn complex patterns and reconstruction will be poor. On the other hand, if an excessively big dimension is selected, the network will learn to reconstruct noise and, again, the objective will not be achieved.

A possible option is to enforce sparsity in the network to improve denoising performance, as pointed out in [26]. This can be achieved by adding an L1 regularization term to the loss function [27]; however, it is unclear how strong the regularization should be. Again, if regularization is too strong, the network will not be able to learn complex patterns, and if too weak, the network will learn to reconstruct noise.

In order to overcome these uncertainties, we propose a novel heuristic to restrict the network to reconstruct only the input data, leaving out noise, while retaining complex patterns. This method is based on analyzing the evolution of the latent space structure during training using PCA.

PCA [11] is a linear technique that uses an orthogonal projection of a set of vectors onto a set of linearly uncorrelated vectors in the directions where variance is maximized. It is usually used as a dimensionality reduction technique; however, it has many other applications [28]. In this case, PCA is used to transform and reduce the dimensionality of the DBDAE latent space during training. The idea behind this is that, even though PCA is a linear technique, it is able to represent to some extent the orderly structure of the latent space of the DBDAE with few components. After a few initial training iterations of possibly chaotic behaviour (due to the random initialization of the network weights), the accuracy and explained variance of the principal components will gradually change as the DBDAE learns the dynamics of the system. However, when the DBDAE begins to overfit the data and starts learning noise as well, the orderly structure of the latent space suddenly becomes disordered due to the random nature of noise. This is reflected in a sudden increase in the reconstruction error of the principal components and a sudden fall of their explained variance. Therefore, the point at which the DBDAE begins to learn the noise can be identified as the point where the derivative of the reconstruction error is maximum or, equivalently, where the derivative of the explained variance is minimum. At this point, training must be stopped to obtain the clean version of the signals.

The complete algorithm for training the network is presented in Algorithm 1.
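A minimal sketch of this heuristic is given below, assuming the latent vectors collected during each epoch are available as rows of a matrix; the number of components $P$ and the dummy data are illustrative.

```python
# Sketch of the PCA-based stopping heuristic: after each epoch, fit a PCA with
# P components to the collected latent vectors, record its reconstruction
# error, and pick the checkpoint where the discrete derivative of that error
# peaks. P and the dummy latent data are illustrative.
import numpy as np
from sklearn.decomposition import PCA

def pca_reconstruction_error(H, P=3):
    """H: (num_windows, Q) latent vectors collected during one epoch."""
    pca = PCA(n_components=P).fit(H)
    H_rec = pca.inverse_transform(pca.transform(H))
    return np.mean((H - H_rec) ** 2)

# one entry per epoch (dummy latent matrices stand in for the real h_L lists)
errors = [pca_reconstruction_error(np.random.randn(200, 16)) for _ in range(5)]
best_epoch = int(np.argmax(np.diff(errors))) + 1  # where the derivative peaks
```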
Fig. 2. Dual blind denoising architecture composed of both a recurrent and a convolutional encoder that combine their resulting output features to form the DBDAE's latent vectors. The recurrent decoder decodes this latent vector and reconstructs a clean version of the input signals.
C. Simulated example
As a proof of concept, we applied the DBDAE to a simulated industrial process, the well-known quadruple tank configuration [29], which is a nonlinear multi-input multi-output system. The inputs are the voltages applied to the pumps, which vary in the range 0 to 10 volts, and the outputs are the four tank levels, which vary between 0 and 50 meters.

We excited the system with different inputs, such as PRBS, sinusoidal and stair signals, then added different types of noise to the inputs, and finally trained the network to denoise the signals. We tested the DBDAE on the two tasks it was designed for: blind denoising and reconstruction of faulty measurements. Results are presented in the following; an illustrative sketch of the simulation setup is shown below.
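The tank dynamics follow the well-known quadruple-tank model of [29], but the physical parameters, integration step and noise level in this sketch are illustrative values, not those used in our experiments.

```python
# Hedged sketch of a quadruple-tank test bed [29]. The parameters below
# (outlet areas a_i, cross-sections A_i, pump gains k_i, valve ratios
# gamma_i) are illustrative, not the paper's experimental values.
import numpy as np

g = 981.0                                   # gravity [cm/s^2]
a = np.array([0.071, 0.057, 0.071, 0.057])  # outlet areas [cm^2]
A = np.array([28.0, 32.0, 28.0, 32.0])      # tank cross-sections [cm^2]
k1, k2, g1, g2 = 3.33, 3.35, 0.7, 0.6       # pump gains and valve ratios

def step(h, v1, v2, dt=0.1):
    """One Euler step of the quadruple-tank level dynamics for h (4,)."""
    q = a * np.sqrt(2.0 * g * np.maximum(h, 0.0))  # outflow of each tank
    dh = np.array([
        (-q[0] + q[2] + g1 * k1 * v1) / A[0],
        (-q[1] + q[3] + g2 * k2 * v2) / A[1],
        (-q[2] + (1 - g2) * k2 * v2) / A[2],
        (-q[3] + (1 - g1) * k1 * v1) / A[3]])
    return h + dt * dh

h = np.array([12.0, 13.0, 5.0, 5.0])
v = np.random.uniform(0, 10, size=(5000, 2))    # random excitation (the paper
noisy = []                                      # uses PRBS/sinusoidal/stairs)
for v1, v2 in v:
    h = step(h, v1, v2)
    noisy.append(h + 0.2 * np.random.randn(4))  # sensor noise added
```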
1) Blind denoising:
For blind denoising, we corrupted the signals with different types of noise, namely, white noise, pink noise, Brownian noise, impulsive noise, and a combination of all of them, meaning that different sensors suffer from different types of noise. As mentioned before, as a performance baseline we used both an EMA and an SG filter.

For training the DBDAEs, we used the procedure described in the previous section. Figures 3 and 4 show the PCA reconstruction error and its derivative for different numbers of principal components while training the autoencoder. According to the PCA-based heuristic, the point where the network is optimal for denoising the signal is where the derivative of the PCA reconstruction error is maximum, i.e., when the DBDAE training error is 10. Figure 5 confirms that the root mean square error (RMSE) between the output of the DBDAE and the ground truth (noise-free signal) is minimum at the same point where the PCA-based heuristic indicates that training must stop. Denoising results in terms of RMSE for different types of noise are shown in Table I. The DBDAE outperforms the EMA and SG filters in all cases but impulsive noise, where the SG filter shows slightly better performance.

To illustrate the application of the PCA heuristic in a linear system, the procedure was repeated for the linearized version of the quadruple tank system. As expected, the linear characteristics of the system are reflected in the latent space as well. Figure 6 presents the derivative of the PCA reconstruction error; a clear maximum is observed when the training error is 7. Results indicate that the PCA-based heuristic can be applied to linear systems as well.

Fig. 3. PCA reconstruction error of the DBDAE latent space for different numbers of principal components, when denoising the quadruple tank system.

Fig. 4. Derivative of the PCA reconstruction error of the DBDAE latent space for different numbers of principal components, when denoising the quadruple tank system. The derivative is maximum when the training error is 10; at this point the network is able to denoise and reconstruct the input signals without reconstructing noise. The red dot indicates where the training procedure must be stopped.
Fig. 5. RMSE of the DBDAE output with respect to the clean (true) signal. The minimum is achieved when the DBDAE training error is 10, which matches the value given by the PCA reconstruction error method.
TABLE I
COMPARISON OF DENOISING RESULTS IN TERMS OF THE AVERAGE RMSE [m] FOR ALL THE TANK LEVELS.

Technique   | White Noise | Pink Noise | Brownian Noise | Combination | Impulsive noise
Original    | 2.05        | 1.98       | 1.88           | 1.52        | 1.52
Autoencoder |             |            |                |             |
Algorithm 1: Dual Blind Denoising Autoencoder Training Procedure

Given E: number of epochs; μ, σ: normalization parameters extracted from ˜Y; T: depth of the window; M: length of the training set; l: label value to mask a faulty sensor; ε_s, c_s: offset and slope of the scheduled sampling probability; ε_f, c_f: offset and slope of the sensor fault probability; P: number of PCA components to preserve; W: number of warming iterations before applying the criterion, to account for the random initialization of the weights.

procedure TRAINING
  Initialize the network weights θ randomly
  Normalize the training set: ˜Y = (˜Y − μ)/σ
  Initialize the fault and scheduled sampling probabilities ε_f and ε_s to their offsets
  Initialize the list pca_E where the PCA reconstruction error of the P selected components will be saved for each epoch
  Initialize the list checkpoints_L where the network's weights will be saved
  Initialize the list h_L where all hidden states for each epoch will be saved
  for e in E epochs do
    for n in [T, M] do
      Select from ˜Y a sliding window ˜Y(n)
      Select two random numbers ε_i, ε_j ∈ [0, 1]
      F ← True if ε_i < ε_f else False (simulate a faulty sensor)
      SS ← True if ε_j < ε_s else False (use scheduled sampling when predicting)
      if F is True then
        k ← a randomly selected sensor index ∈ [1, N]
        ˜Y^k(n) = [l, l, ..., l] ∈ R^T (labeled as faulty)
      end if
      Set the network gradients to zero
      Ŷ(n), h(n) = Net(˜Y(n))
      Save h(n) in h_L
      J(θ) = (1/NT) Σ_{k=1}^{N} Σ_{j=1}^{T} (Ŷ^k_j(n) − ˜Y^k_j(n))²
      Compute backpropagation
      Gradient descent: θ = θ − η ∇_θ J(θ)
    end for
    if e > W then
      Compute the PCA reconstruction error derivative preserving P components of PCA(h_L) and save it in pca_DE
    end if
    Clear h_L
    Save the Net weights in checkpoints_L
    ε_f = min(1, ε_f + c_f e)
    ε_s = min(1, ε_s + c_s e)
  end for
  weights = checkpoints_L(argmax(pca_DE))
  Load weights into Net
end procedure
Fig. 6. Derivative of the PCA reconstruction error of the DBDAE latent space for different numbers of principal components, when denoising the linearized version of the quadruple tank system. The derivative is maximum when the training error is 7. The red dot indicates where the training procedure must be stopped.
Fig. 7. DBDAE reconstruction of a faulty signal. The signal labeled as "Noisy" represents the input signal had the sensor been operating correctly.
2) Reconstruction:
The network was also evaluated for the reconstruction of faulty sensors when receiving noisy measurements. Figure 7 shows the reconstruction of one of the outputs when its respective sensor is faulty. It can be seen that the DBDAE is able to reconstruct the signal, from which it can be inferred that the network is capable of learning the dynamics and exploiting both temporal and spatial information while denoising. Table II summarizes the reconstruction results for all the process variables under different types of noise.
TABLE II
RECONSTRUCTION RESULTS IN TERMS OF RMSE FOR ALL THE PROCESS VARIABLES UNDER DIFFERENT TYPES OF NOISE. u_i: PROCESS INPUTS IN VOLTS; y_i: PROCESS OUTPUTS IN METERS.

Type of noise   | u1   | u2   | y1   | y2   | y3   | y4
White Noise     | 0.99 | 1.16 | 1.18 | 1.37 | 1.33 | 1.31
Pink Noise      | 1.69 | 1.59 | 1.94 | 2.02 | 1.60 | 1.77
Brownian Noise  | 1.81 | 1.70 | 1.71 | 1.93 | 1.65 | 1.62
Combination     | 1.29 | 1.34 | 1.37 | 1.49 | 1.30 | 1.48
Impulsive noise | 1.03 | 1.05 | 1.40 | 1.54 | 1.44 | 1.42
Fig. 8. Process and instrumentation diagram of the thickener under study. Time series are available for the following variables: feeding and discharge rates, flocculant addition rate, feeding and discharge density, and internal states of the thickener (mud, interface and clarity levels).
V. APPLICATION TO AN INDUSTRIAL PASTE THICKENER
A. Thickening process
Thickening is the primary method in mineral processing for producing high-density tailings slurries. The most common method generally involves a large thickener tank (see Figure 8) with a slowly turning raking system. Typically, the tailings slurry is added to the tank after the ore extraction process, along with a sedimentation-promoting polymer known as flocculant, which increases the sedimentation rate to produce thickened material discharged as underflow. In this context, the main control objectives are: 1) to stabilize the solids content in the underflow; 2) to improve the clarity of the overflow water; and 3) to reduce the flocculant consumption [3].

Due to its complexity and highly nonlinear dynamics, deriving a first-principles-based mathematical model is very challenging. Therefore, an appealing approach is to use data-driven modeling techniques. However, the sensors in charge of providing data are exposed to strong disturbances and noise, and an effective online preprocessing technique is needed. Hence, the thickener is an interesting real process on which to test the DBDAE.
B. Blind denoising
The DBDAE was trained with 12 months of real operational data from the industrial thickener to learn how to denoise 8 different variables. The PCA-based heuristic was used for restricting training of the network. Afterwards, the DBDAE was tested online as the preprocessing tool for a model predictive control implementation.

Figure 9 shows the derivative of the PCA reconstruction error for different training losses. As mentioned before, training should be stopped when this derivative is maximized, after an initial transient due to the random initialization of weights; the warming parameter in Algorithm 1 accounts for this phenomenon. Figures 10 and 11 present the denoising results for the input flow and the input solids concentration, two of the noisiest measurements present in the thickener operation. It can be appreciated that the DBDAE delivers a clean signal that follows the dynamics and does not present major lag.

Fig. 9. Derivative of the PCA reconstruction error for different numbers of principal components when training the DBDAE for denoising the thickener's signals. Two maxima are observed; however, the first should be discarded since it is due to the random initialization of the weights.

Fig. 10. Denoising of the input flow variable using the DBDAE.
C. Reconstruction
Reconstruction, or sensor masking, relies on an algorithm capable of detecting and labeling a faulty sensor. In the thickener implementation, an already existing statistical algorithm was used, which was artificially modified to label healthy signals as faulty, in order to evaluate the DBDAE. Figure 12 shows the results when reconstructing one of the process output signals after this signal was labeled as faulty. It can be seen that the reconstructed signal not only closely mimics the target signal, but is also a denoised version of it.

From this real implementation example, it can be seen that the DBDAE is able to both denoise time series from a real industrial process and reconstruct faulty signals on the fly.

Fig. 11. Denoising of the input solids concentration variable using the DBDAE.

Fig. 12. DBDAE reconstruction of the output solids concentration variable. The signal labeled as "Noisy" represents the input signal to the DBDAE had the sensor been operating correctly.

VI. CONCLUSION

In this work, a novel autoencoder architecture, coined the Dual Blind Denoising Autoencoder (DBDAE), was presented as an alternative for denoising and reconstructing time series from an underlying dynamical system. The dual architecture is based on two encoders: a recurrent network for exploiting temporal information and a convolutional network for exploiting spatial correlations. A PCA-based heuristic is proposed for constraining learning so that denoising can be done in a blind manner without knowledge of the real clean target signal.

Experimental results show that the DBDAE outperforms classical methods such as the EMA and SG filters in denoising tasks. A real implementation in an industrial thickener illustrates its capabilities when faced with real process data.
REFERENCES

[1] Y. Jiang, J. Fan, T. Chai, J. Li, and F. L. Lewis, "Data-driven flotation industrial process operational optimal control based on reinforcement learning," IEEE Trans. Ind. Informat., vol. 14, no. 5, pp. 1974–1989, May 2018.
[2] S. Langarica, C. Rüffelmacher, and F. Núñez, "An industrial internet application for real-time fault diagnosis in industrial motors," IEEE Trans. Autom. Sci. Eng., vol. 17, no. 1, pp. 284–295, 2020.
[3] F. Núñez, S. Langarica, P. Díaz, M. Torres, and J. C. Salas, "Neural network-based model predictive control of a paste thickener over an industrial internet platform," IEEE Trans. Ind. Informat., vol. 16, no. 4, pp. 2859–2867, 2020.
[4] Y. Zhang, Y. Yang, S. X. Ding, and L. Li, "Data-driven design and optimization of feedback control systems for industrial applications," IEEE Trans. Ind. Electron., vol. 61, no. 11, pp. 6409–6417, Nov 2014.
[5] A. Savitzky and M. J. E. Golay, "Smoothing and differentiation of data by simplified least squares procedures," Analytical Chemistry, vol. 36, no. 8, pp. 1627–1639, 1964.
[6] B. Alexander, T. Ivan, and B. Denis, "Analysis of noisy signal restoration quality with exponential moving average filter," May 2016, pp. 1–4.
[7] H. D. Hesar and M. Mohebbi, "ECG denoising using marginalized particle extended Kalman filter with an automatic particle weighting strategy," IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 3, pp. 635–644, May 2017.
[8] H. Kumar and A. Mishra, "Feedback particle filter based image denoiser," Oct 2017, pp. 320–325.
[9] D. L. Donoho, "De-noising by soft-thresholding," IEEE Transactions on Inf. Theory, vol. 41, no. 3, pp. 613–627, May 1995.
[10] N. Nezamoddini-Kachouie and P. Fieguth, "A Gabor based technique for image denoising," in Canadian Conference on Electrical and Computer Engineering, 2005, May 2005, pp. 980–983.
[11] S. Wold, K. Esbensen, and P. Geladi, "Principal component analysis," Chemometrics and Intelligent Laboratory Systems, vol. 2, no. 1, pp. 37–52, 1987.
[12] J. Im, D. W. Apley, and G. C. Runger, "Tangent hyperplane kernel principal component analysis for denoising," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 4, pp. 644–656, April 2012.
[13] L. Shao, R. Yan, X. Li, and Y. Liu, "From heuristic optimization to dictionary learning: A review and comprehensive comparison of image denoising algorithms," IEEE Transactions on Cybernetics, vol. 44, no. 7, pp. 1001–1013, July 2014.
[14] W. Yu and C. Zhao, "Robust monitoring and fault isolation of nonlinear industrial processes using denoising autoencoder and elastic net," IEEE Trans. Control Syst. Technol., pp. 1–9, 2019.
[15] Z. Sun and H. Sun, "Stacked denoising autoencoder with density-grid based clustering method for detecting outlier of wind turbine components," IEEE Access, vol. 7, pp. 13078–13091, 2019.
[16] C. R. A. Chaitanya, A. S. Kaplanyan, C. Schied, M. Salvi, A. Lefohn, D. Nowrouzezahrai, and T. Aila, "Interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder," ACM Trans. Graph., vol. 36, no. 4, pp. 98:1–98:12, Jul. 2017.
[17] E. Marchi, F. Vesperini, S. Squartini, and B. Schuller, "Deep recurrent neural network-based autoencoders for acoustic novelty detection," Intell. Neuroscience, vol. 2017, Jan. 2017.
[18] S. Xu, B. Lu, M. Baldea, T. F. Edgar, W. Wojsznis, T. Blevins, and M. Nixon, "Data cleaning in the process industries," Reviews in Chemical Engineering, vol. 31, no. 5, pp. 453–490, 2015.
[19] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[20] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th International Conference on Machine Learning, ser. ICML '08, 2008, pp. 1096–1103.
[21] A. Creswell and A. A. Bharath, "Denoising adversarial autoencoders," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 4, pp. 968–984, April 2019.
[22] A. Majumdar, "Blind denoising autoencoder," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 1, pp. 312–317, Jan 2019.
[23] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Oct. 2014, pp. 1724–1734.
[24] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, "Sequence level training with recurrent neural networks," arXiv e-prints, p. arXiv:1511.06732, Nov 2015.
[25] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," arXiv e-prints, p. arXiv:1506.03099, Jun 2015.
[26] B. Wen, S. Ravishankar, and Y. Bresler, "Structured overcomplete sparsifying transform learning with convergence guarantees and applications," International Journal of Computer Vision, vol. 114, no. 2, pp. 137–167, Sep 2015.
[27] K. Tan, W. Li, Y. Huang, and J. Yang, "A regularization imaging method for forward-looking scanning radar via joint L1-L2 norm constraint," July 2017, pp. 2314–2317.
[28] X. Deng, X. Tian, S. Chen, and C. J. Harris, "Nonlinear process fault diagnosis based on serial principal component analysis," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 3, pp. 560–572, March 2018.
[29] K. H. Johansson, "The quadruple-tank process: a multivariable laboratory process with an adjustable zero," IEEE Trans. Control Syst. Technol., vol. 8, no. 3, pp. 456–465, May 2000.