Recurrent autoencoder with sequence-aware encoding
Robert [email protected]
Institute of Applied Computer Science, Łódź University of Technology, Poland
Abstract
Recurrent Neural Networks (RNN) received a vast amount of attention over the last decade. Recently, the architectures of Recurrent AutoEncoders (RAE) have found many applications in practice. An RAE can extract the semantically valuable information, called the context, that represents a latent space useful for further processing. Nevertheless, recurrent autoencoders are hard to train, and the training process takes much time. In this paper, we propose an autoencoder architecture with sequence-aware encoding, which employs a 1D convolutional layer to improve its performance in terms of model training time. We show that the recurrent autoencoder with sequence-aware encoding outperforms a standard RAE in terms of training speed in most cases. The preliminary results show that the proposed solution dominates over the standard RAE and that the training process is an order of magnitude faster.
Recurrent Neural Networks (RNN) [21, 29] received a vast amount of attention over the last decade and found a wide range of applications such as language modelling [18, 23], signal processing [8, 22], and anomaly detection [19, 24]. The RNN is, in short, a feedforward neural network adapted to sequences of data; it can map sequences to sequences and achieves excellent performance on time series. Multiple RNN layers can be stacked to process long input sequences efficiently [12, 20]. Training a deep recurrent neural network (DRNN) is difficult because the gradients (in backpropagation through time [29]) either vanish or explode [3, 9]. This means that although an RNN can learn long dependencies, the training process may take much time or even fail. The problem was addressed by the Long Short-Term Memory (LSTM) [13] and by the newer and simpler Gated Recurrent Unit (GRU) [6]. Nevertheless, it is not easy to parallelise calculations in recurrent neural networks, which affects the training time.

A different and efficient approach was proposed by van den Oord et al. [26], who showed that stacked 1D convolutional layers can efficiently process long sequences handling tens of thousands of time steps. CNNs have also been widely applied in autoencoder architectures as a solution for problems such as outlier and anomaly detection [16, 14, 1], noise reduction [5], and more.

Autoencoders [4] are unsupervised algorithms trained to attempt to copy their input to their output. The desirable side effect of this approach is a latent representation (called the context or code) of the input data. The context is usually smaller than the input data, so that only the semantically valuable information is extracted. The Encoder-Decoder (Sequence-to-Sequence) architecture [7, 25] looks very much like an autoencoder and consists of two blocks, an encoder and a decoder, both containing a couple of RNN layers. The encoder takes the input data and generates the code (a semantic summary) used to represent the input. Later, the decoder processes the code and generates the final output. The encoder-decoder approach allows variable-length input and output sequences, in contrast to classic RNN solutions. There are several related attempts, including an interesting approach introduced by Graves [12] that was later successfully applied in practice in [2, 17]. The authors proposed a novel differentiable attention mechanism that allows the decoder to focus on the appropriate words at each time step. This technique improved the state of the art in neural machine translation (NMT) and was later applied even without any recurrent or convolutional layers [28]. Besides machine translation, there are multiple variants and applications of Recurrent AutoEncoders (RAE). In [10], the proposed generative model of a variational recurrent autoencoder (VRAE) learns a latent vector representation of the data and uses it to generate samples. Other variational autoencoders were introduced in [27, 11], where the authors apply convolutional layers and WaveNet to audio sequences. An interesting approach, the Feedback Recurrent AutoEncoder (FRAE), was presented in [30]. In short, the idea is to add a connection that provides feedback from the decoder to the encoder. This design allows efficient compression of sequences of speech spectrograms.

In this paper, we present an autoencoder architecture which employs a 1D convolutional layer in order to improve its performance in terms of training time and model accuracy.
We also propose a different interpretation of the context (the final hidden state of the encoder): we transform the context into a sequence that is passed to the decoder. This technical trick, even without changing other elements of the architecture, improves the performance of the recurrent autoencoder. We demonstrate the power of the proposed architecture on time series reconstruction. We perform a wide range of experiments on a dataset of generated signals, and the preliminary results are promising.

The contributions of this work can be enumerated as follows: (i) we propose a recurrent autoencoder with sequence-aware encoding that trains much faster than the standard RAE; (ii) we suggest an extension of the proposed solution which employs a 1D convolutional layer to make the solution more flexible; (iii) we show that this architecture performs very well on univariate and multivariate time series reconstruction.

In this section, we describe our approach and its variants. We also discuss the advantages and disadvantages of the proposed architecture and suggest possible solutions to its limitations.

Figure 1: Recurrent AutoEncoder (RAE) [7, 25], also called Encoder-Decoder or Sequence-to-Sequence.

The recurrent autoencoder generates an output sequence Y = (y^(0), y^(1), ..., y^(n_Y - 1)) for a given input sequence X = (x^(0), x^(1), ..., x^(n_X - 1)), where n_Y and n_X are the sizes of the output and input sequences respectively (they can be of the same or different size). Usually, X = Y to force the autoencoder to learn the semantic meaning of the data. First, the input sequence is encoded by the RNN encoder, and then the resulting fixed-size context variable C is decoded by the decoder (usually also an RNN); see Figure 1.

We propose a recurrent autoencoder architecture (Figure 2) in which the context C (the output of the final hidden state of the encoder) is interpreted as a sequence C', in contrast to other solutions where it is interpreted as a vector (Figure 1). The context C = (c_i)_{i=0}^{n_C - 1} is transformed into

C' = \big( (c_{i\lambda + j})_{j=0}^{\lambda - 1} \big)_{i=0}^{n_C/\lambda - 1}    (1)

where λ = n_C / n_X (λ ∈ N). Once the context is transformed (C' = (c'^(0), c'^(1), ..., c'^(n_X - 1))), the decoder decodes the sequence C' of m_{C'} = λ features. This technical trick in the data structure speeds up the training process (Section 3). Additionally, in this way we give the context a sequential meaning. The one easily solvable disadvantage of this solution is that the size of the context must be a multiple of the input sequence length, n_C = λ n_X, where n_C is the size of the context C.

Figure 2: Recurrent AutoEncoder with Sequence-aware encoding (RAES).
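In practice, the transformation in Equation (1) amounts to reshaping the encoder's final hidden state into a sequence of n_X steps with λ features each before decoding. The following minimal sketch in TensorFlow/Keras (the framework used in Section 3) illustrates the idea; the concrete sizes and the decoder width are illustrative assumptions, not the authors' exact configuration.

```python
import tensorflow as tf

# Illustrative sizes (assumptions, not the configuration from the paper).
n_x, m_x = 200, 1      # input sequence length and number of features
n_c = 200              # context size; here it must be a multiple of n_x
lam = n_c // n_x       # lambda = n_C / n_X, features per decoded step

inputs = tf.keras.Input(shape=(n_x, m_x))
# Encoder: a single GRU layer whose final hidden state is the context C.
context = tf.keras.layers.GRU(n_c)(inputs)                   # (batch, n_c)
# Sequence-aware encoding (Equation (1)): reinterpret C as a sequence C'
# of n_x steps with lam features each.
context_seq = tf.keras.layers.Reshape((n_x, lam))(context)
# Decoder: a GRU over C' followed by a time-distributed dense layer.
decoded = tf.keras.layers.GRU(64, return_sequences=True)(context_seq)
outputs = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(m_x))(decoded)

raes = tf.keras.Model(inputs, outputs, name="raes")
```

The only change with respect to a standard RAE is the reshape between encoder and decoder: the decoder consumes C' step by step instead of a single context vector.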
In order to address the limitation mentioned above (Section 2.1), we propose to add a 1D convolutional layer (and a max-pooling layer) to the architecture right before the decoder (Figure 3). This approach gives the ability to control the number of output channels (also called feature detectors or filters), defined as follows:

C''(i) = \sum_k \sum_l C'(i + k, l) \, w(k, l)    (2)

In this case, n_C does not have to be a multiple of n_X; thus, to obtain the desired output sequence of length n_Y, the number of filters should be equal to n_Y. Moreover, the output of the 1D convolutional layer, C'' = conv1D(C'), should be transposed so that each channel becomes an element of the sequence, as shown in Figure 3. Finally, the desired number of features in the output Y can be configured with the hidden state size of the decoder.

Figure 3: Recurrent AutoEncoder with Sequence-aware encoding and a 1D Convolutional layer (RAESC).

A different and simpler approach to overcoming the mentioned limitation is stretching C to the size of the decoder input and filling the gaps with averages.

The described variant is much simplified and is only an outline of the proposed recurrent autoencoder architecture (the middle part of it, to be more precise), which can be extended by adding pooling and recurrent layers or by using different convolution parameters (such as stride or dilation). Furthermore, in our view, this approach could easily be applied to other RAE architectures (such as [30, 11]).
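A minimal sketch of this RAESC variant is given below, under the same caveats as before: the kernel size, pooling width, decoder width, and the way C' is laid out as a single-channel sequence are our illustrative assumptions rather than details taken from the paper.

```python
import tensorflow as tf

# Illustrative sizes (assumptions); n_c need not be a multiple of n_x here.
n_x, m_x = 200, 8      # input sequence length and number of features
n_y = n_x              # desired output length = number of Conv1D filters
n_c = 300              # context size

inputs = tf.keras.Input(shape=(n_x, m_x))
context = tf.keras.layers.GRU(n_c)(inputs)                    # (batch, n_c)
# Interpret the context as a single-channel sequence C'.
c_seq = tf.keras.layers.Reshape((n_c, 1))(context)
# 1D convolution with n_y filters (Equation (2)) followed by max-pooling.
c_conv = tf.keras.layers.Conv1D(filters=n_y, kernel_size=3, padding="same")(c_seq)
c_pool = tf.keras.layers.MaxPooling1D(pool_size=2)(c_conv)    # (batch, n_c // 2, n_y)
# Transpose: each of the n_y channels becomes one decoder time step.
c_dec = tf.keras.layers.Permute((2, 1))(c_pool)               # (batch, n_y, n_c // 2)
decoded = tf.keras.layers.GRU(64, return_sequences=True)(c_dec)
outputs = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(m_x))(decoded)

raesc = tf.keras.Model(inputs, outputs, name="raesc")
```

Because the decoder's sequence length is fixed by the number of filters (n_y) and its per-step feature count by the pooled context length, the context size n_c can be chosen freely.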
In order to evaluate the proposed approach, we ran a set of experiments using a generated dataset of signals. We tested the following algorithms:

• the standard Recurrent AutoEncoder (RAE) [7, 25],
• RAE with Sequence-aware encoding (RAES),
• RAES with a Convolutional and max-pooling layer (RAESC).

The structure of the encoder and decoder is the same in all algorithms. Both the encoder and the decoder are a single GRU [6] layer, with an additional time-distributed fully connected layer at the output of the decoder. The algorithms were implemented in Python 3.7.4 with TensorFlow 2.3.0. The experiments were run on a desktop PC with an Intel i5-3570 CPU clocked at 3.4 GHz (256 KB L1, 1 MB L2, and 6 MB L3 cache) and a GTX 1070 graphics card. The test machine was equipped with 16 GB of 1333 MHz DDR3 RAM and ran Fedora 28 64-bit. The dataset contains 5000 sequences of length 200 with {1, 2, 4, 8} features. The dataset was shuffled and split into training and validation sets in the proportion 80:20. We trained the models with the Adam optimizer [15] in batches of size 100 and with the Mean Squared Error (MSE) loss function.
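For concreteness, a hedged sketch of this training setup is shown below. The signal generator is a hypothetical placeholder (the paper does not state how the signals were produced), and the model mirrors the RAES sketch from Section 2 with σ = 100% for univariate data.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for the generated signals: random sinusoids.
n_seq, n_x, m_x = 5000, 200, 1
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, n_x)
freq = rng.uniform(1.0, 10.0, size=(n_seq, m_x, 1))
phase = rng.uniform(0.0, 2.0 * np.pi, size=(n_seq, m_x, 1))
data = np.sin(2.0 * np.pi * freq * t + phase).transpose(0, 2, 1).astype("float32")

# RAES model with sigma = 100% (context size equal to the input size).
n_c = n_x * m_x
inputs = tf.keras.Input(shape=(n_x, m_x))
h = tf.keras.layers.GRU(n_c)(inputs)
h = tf.keras.layers.Reshape((n_x, n_c // n_x))(h)
h = tf.keras.layers.GRU(64, return_sequences=True)(h)
outputs = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(m_x))(h)
model = tf.keras.Model(inputs, outputs)

# Adam optimizer, MSE loss, batches of 100, 80:20 train/validation split.
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
model.fit(data, data, batch_size=100, epochs=100,
          validation_split=0.2, shuffle=True)
```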
In the first set of analyses, we investigated the impact of the context size and the number of features on performance. We noticed a considerable difference in training speed (the number of epochs needed to reach a plateau) between the classic approach and ours. To verify whether our approach has an advantage over the standard RAE, we performed tests with different sizes of the context n_C and different numbers of input features m_X. We set n_C proportionally to the size of the input and denote it as

\sigma = \frac{n_C}{m_X n_X}    (3)

For example, for the univariate sequences of length 200 used here, σ = 25% corresponds to a context of 50 values.

Figure 4 shows that training the standard RAE takes many more epochs than RAESC. In chart (a) the size of the context is set to σ = 25% and in (b) to σ = 100% of the input size. For σ = 25%, RAESC reaches a plateau after 20 epochs, while the standard RAE does not at all (its loss starts decreasing only after nearly 100 epochs). There is no RAES result in this plot because of the limitation mentioned in Section 2.1 (the size of the code was too small to fit the output sequence length). For σ = 100%, both RAESC and RAES reach the plateau in fewer than five epochs, while the standard RAE needs about 50 epochs (an order of magnitude more).

Figure 4: Loss as a function of epoch number for univariate data and σ = {25%, 100%}; panels: (a) σ = 25%, (b) σ = 100%.

Figure 5 shows the loss as a function of the number of epochs for two features in the input data. This experiment confirms that both RAES and RAESC dominate in terms of training speed, but a slight difference can be noticed in comparison with the univariate data (Figure 4). Here the RAE reaches a plateau in about 50 epochs in both cases, while RAES and RAESC do so after 20 epochs for σ = 25% and in about five epochs for σ = 100%.

Figure 5: Loss as a function of epoch number for two features (m_X = 2) and σ = {25%, 100%}.

Figure 6 presents the loss as a function of the number of epochs for 8 features. This figure is interesting in several ways when compared to the previous ones (Figures 4 and 5). Chart (a) shows that, for a much larger number of features and a relatively small context, the training of the RAES variant takes much longer. A similar observation can be made for RAESC, whose loss drops much faster than that of the standard RAE at the beginning of training but reaches the plateau at almost the same step. On the other hand, chart (b) shows that for a larger context the proposed solution dominates. The most striking fact to emerge from these results is that the RAE loss does not drop over the whole period.

Figure 6: Loss as a function of epoch number for m_X = 8 and σ = {25%, 100%}.

We also compared the RAE performance for different context sizes (σ). Figure 7 presents the loss as a function of time for 8 features. Each variant was tested for 100 epochs with the time limited to 5 minutes. As expected, we can clearly see that RAE converges much faster for smaller values of σ.

Figure 7: Loss as a function of time [s] for m_X = 8 (σ = 100% for RAES and σ = {3%, 6%, 12%, 25%, 50%, 100%} for RAE).

Finally, we measured the training time of each algorithm to confirm that the proposed solution converges faster than the standard RAE for the same size of context. Table 1 shows the median epoch time for different numbers of features and context sizes. The table shows that RAES is 14% faster than RAE for univariate data and about 33% faster than RAE for m_X = 8. The training of the RAESC algorithm takes slightly more time than RAE, but the difference is marginal (less than 2%) for m_X = 8.

Table 1: Epoch time [s] (median) for different numbers of features (m_X) and context sizes (σ).

m_X  algorithm  σ=25%  σ=50%  σ=100%
1    RAE         1.05   1.00    1.41
1    RAES        -      -       1.23
1    RAESC       0.97   0.98    1.64
2    RAE         1.00   1.47    3.69
2    RAES        -      1.29    3.10
2    RAESC       0.97   1.63    3.99
4    RAE         1.47   3.66   10.60
4    RAES        1.31   3.08    8.24
4    RAESC       1.60   3.97   10.95
8    RAE         3.65  10.51   35.01
8    RAES        3.14   8.28   26.28
8    RAESC       3.89  10.80   35.68

In this work, we proposed an autoencoder with sequence-aware encoding. We showed that this solution outperforms a standard RAE in terms of training speed (for the same size of context) in most cases. The experiments confirmed that the training of the proposed architecture is much faster than that of the standard RAE. The context size and the number of features in the input sequence have a high impact on training performance. Only for a relatively large number of features and a small context does the proposed solution achieve results merely comparable to the standard RAE; in the other cases our solution dominates and the training time is an order of magnitude shorter.

In our view these results constitute a good initial step toward further research. The proposed architecture was much simplified, and the use of different layers or hyperparameter tuning seems to offer great opportunities. We believe that the proposed solution has a wide range of practical applications, and it is worth confirming this in future work.

References

[1] J. An and S. Cho. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE, 2(1):1–18, 2015.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[3] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[4] H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4-5):291–294, 1988.
[5] H.-T. Chiang, Y.-Y. Hsieh, S.-W. Fu, K.-H. Hung, Y. Tsao, and S.-Y. Chien. Noise reduction in ECG signals using fully convolutional denoising autoencoders. IEEE Access, 7:60806–60813, 2019.
[6] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014.
[7] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[8] J. Ding and Y. Wang. WiFi CSI-based human activity recognition using deep recurrent neural network. IEEE Access, 7:174257–174269, 2019.
[9] K. Doya. Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on Neural Networks, 1(75):218, 1993.
[10] O. Fabius and J. R. van Amersfoort. Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581, 2014.
[11] C. Gârbacea, A. van den Oord, Y. Li, F. S. Lim, A. Luebs, O. Vinyals, and T. C. Walters. Low bit-rate speech coding with VQ-VAE and a WaveNet decoder. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 735–739. IEEE, 2019.
[12] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[14] T. Kieu, B. Yang, C. Guo, and C. S. Jensen. Outlier detection for time series with recurrent autoencoder ensembles. In IJCAI, pages 2725–2732, 2019.
[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[16] W. Liao, Y. Guo, X. Chen, and P. Li. A unified unsupervised Gaussian mixture variational autoencoder for high dimensional outlier detection. In , pages 1208–1217. IEEE, 2018.
[17] M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[18] T. Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur. Extensions of recurrent neural network language model. In , pages 5528–5531. IEEE, 2011.
[19] A. Nanduri and L. Sherry. Anomaly detection in aircraft data using recurrent neural networks (RNN). In , pages 5C2–1. IEEE, 2016.
[20] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), 2014.
[21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[22] S. Shahtalebi, S. F. Atashzar, R. V. Patel, and A. Mohammadi. Training of deep bidirectional RNNs for hand motion filtering via multimodal data fusion. In GlobalSIP, pages 1–5, 2019.
[23] Y. Shi, M.-Y. Hwang, X. Lei, and H. Sheng. Knowledge distillation for recurrent neural network language modeling with trust regularization. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7230–7234. IEEE, 2019.
[24] Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, and D. Pei. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2828–2837, 2019.
[25] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[26] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint, 2016.
[27] A. van den Oord, O. Vinyals, et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pages 6306–6315, 2017.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[29] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[30] Y. Yang, G. Sautière, J. J. Ryu, and T. S. Cohen. Feedback recurrent autoencoder. In