ID-CONDITIONED AUTO-ENCODER FOR UNSUPERVISED ANOMALY DETECTION

Detection and Classification of Acoustic Scenes and Events 2020, 2–3 November 2020, Tokyo, Japan

Sławomir Kapka

Samsung R&D Institute Poland, Artificial Intelligence, Warsaw, [email protected]
ABSTRACT
In this paper, we introduce the ID-Conditioned Auto-Encoder for unsupervised anomaly detection. Our method is an adaptation of the Class-Conditioned Auto-Encoder (C2AE) designed for open-set recognition. Assuming that the non-anomalous samples consist of distinct IDs, we apply a Conditioned Auto-Encoder with labels provided by these IDs. As opposed to C2AE, our approach omits the classification subtask and reduces the learning process to a single run. We simplify the learning process further by fixing a constant vector as the target for non-matching labels. We apply our method in the context of sounds for machine condition monitoring. We evaluate our method on the ToyADMOS and MIMII datasets from the DCASE 2020 Challenge Task 2, and we conduct an ablation study to indicate which steps of our method influence the results the most.
Index Terms — DCASE 2020 Challenge Task 2, unsupervised anomaly detection, machine condition monitoring, conditioned auto-encoder
1. INTRODUCTION
Unsupervised anomaly detection is the problem of detecting anomalous samples under the condition that only non-anomalous (normal) samples have been provided during the training phase. In this paper, we focus on unsupervised anomaly detection in the context of sounds for machine condition monitoring, i.e., detecting mechanical failure by listening.

Many techniques have been studied for detecting anomalous sounds. Among others, there are solutions based on SVMs [1, 2], sparse coding [3, 4], GMMs [5, 6], the Neyman-Pearson lemma [7, 8], signal processing [9, 10, 11], interpolation DNNs [12], and auto-encoders [8, 13, 14, 15]. Beyond sounds, many more techniques for anomaly detection based on deep learning can be found in the survey [16].

Unsupervised anomaly detection can be viewed as a special case of open-set recognition [17]. In fact, since during training we are provided only with normal samples and have to predict whether new samples are normal or anomalous, we can look at it as a binary classification problem with only one class given during the training phase.

Recently, the Class-Conditioned Auto-Encoder (C2AE) [18] has been introduced for the open-set recognition problem. According to the survey on open-set recognition [19], it is currently the state of the art for this problem. In Section 2, we introduce the ID-Conditioned Auto-Encoder (IDCAE), an adaptation of C2AE applicable to unsupervised anomaly detection.

Task 2 of this year's IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2020) [20, 21] focuses precisely on the unsupervised detection of anomalous sounds for machine condition monitoring. The data for this task comes from the ToyADMOS [22] and MIMII [23] datasets, which consist of sounds of six types of operating machines. In Section 3, we develop the model for this challenge, and we evaluate its performance via an ablation study in Section 4.
2. PROPOSED METHOD
Let us consider an arbitrary but fixed machine type, henceforth called the machine unless otherwise specified. We assume that we have various IDs of the machine, which is precisely the case in Task 2 of the DCASE 2020 Challenge. In the nomenclature of [18], we treat machines with different IDs as distinct classes.

Our system consists of three main parts:
• an encoder E : X → Z, which maps a feature vector X from the input space X to the code E(X) in the latent space Z,
• a decoder D : Z → X, which takes a code Z from Z and outputs a vector D(Z) of the same shape as the feature vectors from X,
• conditioning made of two functions Hγ, Hβ : Y → Z, which take a one-hot label l from Y and map it to vectors Hγ(l), Hβ(l) of the same size as the codes from Z.

During the feed-forward pass, the code Z is combined with Hγ(l), Hβ(l) to form H(Z, l) = Hγ(l) · Z + Hβ(l), which is an affine transformation of the latent space conditioned by l [24]. Thus, our whole system takes two inputs X, l from X and Y respectively and outputs D(H(E(X), l)).

Given an input X with some ID, we call the label corresponding to this ID the match and all other labels non-matches. We wish our system to reconstruct X faithfully if and only if it is a normal sample conditioned by the matching label.

Given an input X, we set the label l to the match with probability α or to a randomly selected non-match with probability 1 − α, where α is predefined. Thus, for a batch X_1, X_2, . . . , X_n, approximately an α fraction of samples will be conditioned by matches and 1 − α by non-matches. If l is the match, then the loss equals the difference between the system's output and X, that is ‖D(H(E(X), l)) − X‖. If l is a non-match, then the loss equals the difference between the system's output and some predefined constant vector C with the same shape as X, that is ‖D(H(E(X), l)) − C‖.
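The conditioned feed-forward pass and the match/non-match loss above can be sketched as follows. This is a minimal numpy illustration, not the trained networks of Section 3: the linear maps standing in for E, D, Hγ, Hβ, the toy sizes, and the values of `ALPHA` and `C` are all placeholders, and the L1 loss is one of the two norms the paper allows.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_LAT, N_IDS = 8, 4, 3   # toy sizes, not the paper's architecture
ALPHA, C = 0.5, 5.0            # illustrative mixing probability and constant target

# Placeholder linear "networks"; the real E, D, H_gamma, H_beta are MLPs.
W_enc = rng.normal(size=(D_IN, D_LAT))
W_dec = rng.normal(size=(D_LAT, D_IN))
W_gam = rng.normal(size=(N_IDS, D_LAT))
W_bet = rng.normal(size=(N_IDS, D_LAT))

def forward(x, one_hot):
    z = x @ W_enc                  # E(X)
    gamma = one_hot @ W_gam        # H_gamma(l)
    beta = one_hot @ W_bet         # H_beta(l)
    h = gamma * z + beta           # H(Z, l): affine (FiLM-style) conditioning
    return h @ W_dec               # D(H(E(X), l))

def loss(x, true_id):
    # With probability ALPHA condition on the match, else on a random non-match.
    if rng.random() < ALPHA:
        label = true_id
        target = x                     # matching label: reconstruct the input
    else:
        label = rng.choice([i for i in range(N_IDS) if i != true_id])
        target = np.full_like(x, C)    # non-match: reconstruct the constant C
    one_hot = np.eye(N_IDS)[label]
    return np.abs(forward(x, one_hot) - target).mean()   # L1 reconstruction loss

x = rng.normal(size=D_IN)
print(loss(x, true_id=1))
```

Swapping the final line of `loss` for a squared difference gives the squared-L2 variant mentioned below.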
In our setting, ‖·‖ is either the L1 norm or the square of the L2 norm.

Figure 1: Two exemplary spectrograms of some valve normal samples with their corresponding reconstructions. In the top row, we have the original spectrograms of the whole recordings. In the middle row, we have reconstructions with matching labels; we see that the reconstructions are rather faithful. In the bottom row, we have reconstructions with non-matching labels, and as expected the reconstructed spectrograms are quite off.

During inference, we always feed the network with matching labels. If a sample is non-anomalous, we expect the reconstruction to be faithful, resulting in a low reconstruction error (see Figure 1). If the sample is anomalous, there are two cases. If the sample is nothing like any sample seen during training, then the auto-encoder will generally be unable to reconstruct it, resulting in a high reconstruction error. However, if the sample resembles normal samples with other IDs, then the auto-encoder will try to reconstruct the vector C, again resulting in a high error (see Figure 2).

In general, the point of auto-encoder-based solutions for anomaly detection is to separate the distributions of reconstruction errors for normal and anomalous samples. In our case, the distribution of reconstruction errors for anomalies is shifted further to the right due to the samples that resemble samples with other IDs. This indicates that our method should work at least as well as a regular auto-encoder. Overall, the distribution of reconstruction errors for anomalies with matching labels should lie between the distributions of reconstruction errors for normal samples with matching and non-matching labels (see Figure 3).

An additional advantage of our approach is that feeding the network with more IDs may result in better performance. In fact, in Section 4 we show that this holds in the case of machine condition monitoring.
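The inference rule above (the anomaly score is the reconstruction error under the matching label) can be sketched as follows. The `forward` callable and the per-frame mean aggregation are illustrative assumptions, since the text does not fix how frame-level errors are pooled into a recording-level score.

```python
import numpy as np

def anomaly_score(frames, one_hot_match, forward, use_l1=True):
    """Score a recording: reconstruction error with the MATCHING label.

    frames: array of shape (n_frames, frame_dim) from one recording.
    one_hot_match: one-hot vector for the recording's machine ID.
    forward: trained model mapping (frame, label) -> reconstruction.
    """
    recon = np.stack([forward(f, one_hot_match) for f in frames])
    err = np.abs(recon - frames) if use_l1 else (recon - frames) ** 2
    # High score => far from a faithful reconstruction => likely anomalous.
    return err.mean()

# Tiny usage example with an identity "model": a perfect reconstructor
# yields score 0 for any input.
frames = np.ones((5, 8))
ident = lambda f, l: f
print(anomaly_score(frames, np.eye(3)[0], ident))  # prints 0.0
```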
3. MODEL

3.1. Features
In our model, we feed the network with fragments of normalised log-mel power spectrograms. The feature vector space X consists of vectors of shape (F, M), where F is the frame size and M is the number of mels. Given an audio signal, we first compute its Short-Time Fourier Transform with a window size of 1024 and a hop size of 512, we transform it to a power mel-scaled spectrogram with M mels, we take its logarithm with base 10, and we multiply it by 10. Finally, we standardize all spectrograms frequency-wise to zero mean and unit variance, and we sample frames of size F as input to our system.

As described in Section 2, our model consists of the encoder E, the decoder D and the conditioning Hγ, Hβ. In our case, all these components are fully connected neural networks. Thus, we have to flatten the feature vectors and reshape the output to (F, M) for the sake of dimension compatibility. The dense layers in E and D are followed by batch-norm and the relu activation function, while the dense layers in Hγ, Hβ are followed just by sigmoid activation functions. E has three hidden dense layers with 128, 64 and 32 units, followed by a latent dense layer with 16 units. D is made of four hidden dense layers. Hγ and Hβ both have a single hidden dense layer. We summarise the architecture in Table 1.

Table 1: The architecture of IDCAE (DenseBlock n stands for: Dense n, batch-norm, relu).

  Encoder (E)      Decoder (D)       Conditioning (Hγ, Hβ)
  Input (F, M)     Input             Input
  DenseBlock 128   DenseBlock        Dense, sigmoid
  DenseBlock 64    DenseBlock        Dense, sigmoid
  DenseBlock 32    DenseBlock
  Dense 16         DenseBlock
                   Dense F · M
                   Reshape (F, M)

Figure 2: Two exemplary spectrograms of some valve anomalous samples with their corresponding reconstructions. In the top row, we have the original spectrograms of the whole recordings. In the bottom row, we show the reconstructed spectrograms with matching labels. The reconstruction error is mostly boosted in the windows containing anomalous valve sounds.

Figure 3: Distributions of reconstruction errors for all valve samples from the test split of the development dataset.
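The feature pipeline above can be sketched as follows. The log-mel computation is only schematic here (a real implementation would use an STFT and a mel filterbank, e.g. from librosa, with window 1024 and hop 512); the frequency-wise standardization and random frame sampling follow the description in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
F, M = 10, 128                     # frame size and number of mels, as in the paper

# Stand-in for a power mel spectrogram of shape (time, mels); in practice this
# comes from |STFT|^2 (window 1024, hop 512) passed through a mel filterbank.
power_mel = rng.random((300, M)) + 1e-6

log_mel = 10.0 * np.log10(power_mel)           # dB-style log scaling

# Frequency-wise standardization: zero mean, unit variance per mel bin.
mean = log_mel.mean(axis=0, keepdims=True)
std = log_mel.std(axis=0, keepdims=True)
normed = (log_mel - mean) / std

def sample_frame(spec, frame_size, rng):
    """Randomly sample a contiguous frame of shape (frame_size, mels)."""
    start = rng.integers(0, spec.shape[0] - frame_size + 1)
    return spec[start:start + frame_size]

frame = sample_frame(normed, F, rng)
print(frame.shape)   # (10, 128)
```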
We fixed α, and set C = 5 (the boldface denotes that it is a constant vector with the value 5 everywhere) with a frame size F = 10 and M = 128 mels.

We train our models using the Adam optimizer with default parameters [25], setting the mean absolute error as the loss function. For each machine, we train the network for 100 epochs with exponential learning rate decay, multiplying the learning rate by 0.95 every 5 epochs. For every epoch, we randomly sample 300 frames from each spectrogram.

α, C and the number of mels M are hyperparameters of our model. Unfortunately, no single combination of these parameters works best for all the machines. We set F = 10 and performed a grid search over α, C and M, trying both the mean-square and the mean-absolute error. We selected the 3 models for each machine that maximize mAUC (the average of AUC and pAUC with p = 0.1) on the test split of the development dataset, and for each machine we built an ensemble by selecting 3 weights such that the weighted anomaly score maximizes mAUC. The ensemble is illustrated in Figure 4.
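The ensemble step above can be sketched as a grid search over convex weights for the three models' anomaly scores, keeping the weights that maximize mAUC. The `auc`/`pauc` helpers below follow the usual rank-based definitions (pAUC restricted to the top ⌊p·N⌋-scoring normal samples); treat them, and the `best_weights` grid, as an illustration rather than the official evaluation code.

```python
import numpy as np

def auc(neg, pos):
    """Standard AUC: probability that an anomalous score beats a normal one."""
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def pauc(neg, pos, p=0.1):
    """Partial AUC over FPR in [0, p]: compare against the hardest normals only."""
    k = max(1, int(np.floor(p * len(neg))))
    hardest = np.sort(neg)[-k:]            # top-scoring normal samples
    return auc(hardest, pos)

def mauc(neg, pos, p=0.1):
    return 0.5 * (auc(neg, pos) + pauc(neg, pos, p))

def best_weights(neg_scores, pos_scores, steps=10):
    """Grid-search convex weights for 3 models' scores, maximizing mAUC.

    neg_scores / pos_scores: arrays of shape (3, n) holding each model's
    anomaly scores for normal / anomalous samples respectively.
    """
    best = (-1.0, None)
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            w = np.array([i, j, steps - i - j]) / steps
            m = mauc(w @ neg_scores, w @ pos_scores)
            if m > best[0]:
                best = (m, w)
    return best

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, size=(3, 200))   # toy scores for normal samples
pos = rng.normal(1.5, 1.0, size=(3, 200))   # toy scores for anomalous samples
m, w = best_weights(neg, pos)
print(m, w)
```

The triangle plots of Figure 4 correspond to evaluating `mauc` at every point of such a weight grid.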
4. RESULTS
We develop and evaluate our system on the dataset from Task 2 of the DCASE 2020 Challenge [20], which consists of recordings of 6 different machine types (Toy Car, Toy Conveyor, Fan, Pump, Slider and Valve). The available data consists of a development dataset and an additional dataset. The development dataset contains 3 or 4 IDs per machine, while the additional dataset contains 3 IDs per machine. All results in this section are evaluated on the test split of the development dataset using the area under the receiver operating characteristic (ROC) curve (AUC) and the partial AUC (pAUC) with p = 0.1.

We conduct an ablation study starting from the DCASE baseline system and ending with the ensemble. We list the following steps, where each step is built upon the previous one:
Figure 4: Visualisation of the ensemble of the best three models for each machine type. The vertices indicate the mAUC of the individual models, and the interior points of the triangles indicate the mAUC of convex combinations of their anomaly scores. That is, for given weights, a new anomaly score is obtained as the linear combination of these weights with the corresponding anomaly scores, and the mAUC is computed from those scores. The red dots indicate the weights that maximize mAUC.

Table 2: Ablation study; boldface denotes the best scores, and underline denotes the greatest improvements.
              Toy Car     Toy Conveyor   Fan         Pump        Slider      Valve       Average
              AUC  pAUC   AUC  pAUC     AUC  pAUC   AUC  pAUC   AUC  pAUC   AUC  pAUC   AUC  pAUC
  Baseline
  Architect
  Scaler
  Condition
  AddDataset
  Ensemble

• Baseline – the DCASE baseline system.
• Architect – the encoder changed to a smaller one with 128, 64 and 32 units, the latent layer set to 16 units, the frame size set to 10, and training conducted as in subsection 3.4.
• Scaler – spectrograms standardised frequency-wise to zero mean and unit variance.
• Condition – conditioning added with the values of α and C from Section 3.
• AddDataset – training on the train splits from both the development and additional datasets, combining the labels from these datasets.
• Ensemble – the ensemble as described in subsection 3.5.

We summarise the results of the ablation study in Table 2. Even though the results vary across machines, we focus on the average scores, which are placed in the last two columns of the table. The architecture changes in Architect and the normalization in Scaler have a minor influence on the AUC and pAUC scores. Adding the conditioning layer in Condition significantly improves the pAUC score, which shows that IDCAE is advantageous over a standard AE. Moreover, as expected, adding more IDs as in AddDataset significantly improves both the AUC and pAUC scores. Finally, the ensemble in Ensemble, due to its nature, again significantly boosts both AUC and pAUC.
5. CONCLUSION
In this paper, we introduced a method to enhance auto-encoders designed for unsupervised anomaly detection. Assuming distinct IDs, we condition the latent space in order to force the auto-encoder to reconstruct samples faithfully if and only if they are non-anomalous and conditioned with matching labels. Our method significantly outperforms the DCASE baseline system. In the ablation study, we showed that the conditioning we introduced is crucial for the performance improvement, and that feeding the system with more IDs yields even better results.
6. REFERENCES

[1] S. Lecomte, R. Lengellé, C. Richard, F. Capman, and B. Ravera, "Abnormal events detection using unsupervised one-class SVM – application to audio surveillance and evaluation," 2011, pp. 124–129.
[2] F. Aurino, M. Folla, F. Gargiulo, V. Moscato, A. Picariello, and C. Sansone, "One-class SVM based approach for detecting anomalous audio events," 2014, pp. 145–151.
[3] Y. Kawaguchi, T. Endo, K. Ichige, and K. Hamada, "Non-negative novelty extraction: A new non-negativity constraint for NMF," 2018, pp. 256–260.
[4] R. Giri, A. Krishnaswamy, and K. Helwani, "Robust non-negative block sparse coding for acoustic novelty detection," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York University, NY, USA, October 2019, pp. 74–78.
[5] A. Ito, A. Aiba, M. Ito, and S. Makino, "Detection of abnormal sound using multi-stage GMM for surveillance microphone," 2009, vol. 1, pp. 733–736.
[6] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "Probabilistic novelty detection for acoustic surveillance under real-world conditions," IEEE Transactions on Multimedia, vol. 13, pp. 713–719, Sep. 2011.
[7] Y. Koizumi, S. Saito, H. Uematsu, and N. Harada, "Optimizing acoustic feature extractor for anomalous sound detection based on Neyman-Pearson lemma," 2017, pp. 698–702.
[8] Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, and N. Harada, "Unsupervised detection of anomalous sound based on deep learning and the Neyman-Pearson lemma," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 1, pp. 212–224, 2019.
[9] Y. Kawaguchi, R. Tanabe, T. Endo, K. Ichige, and K. Hamada, "Anomaly detection based on an ensemble of dereverberation and anomalous sound extraction," in ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 865–869.
[10] D. Conte, P. Foggia, G. Percannella, A. Saggese, and M. Vento, "An ensemble of rejecting classifiers for anomaly detection of audio events," 2012, pp. 76–81.
[11] R. Bardeli and D. Stein, "Uninformed abnormal event detection on audio," in Speech Communication; 10. ITG Symposium, 2012, pp. 1–4.
[12] K. Suefusa, T. Nishida, H. Purohit, R. Tanabe, T. Endo, and Y. Kawaguchi, "Anomalous sound detection based on interpolation deep neural network," in ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 271–275.
[13] Y. Kawaguchi and T. Endo, "How can we detect anomalies from subsampled audio signals?" 2017, pp. 1–6.
[14] D. Oh and I. Yun, "Residual error based anomaly detection using auto-encoder in SMD machine sound," Sensors, vol. 18, no. 5, p. 1308, Apr. 2018. [Online]. Available: http://dx.doi.org/10.3390/s18051308
[15] Y. Koizumi, S. Saito, M. Yamaguchi, S. Murata, and N. Harada, "Batch uniformization for minimizing maximum anomaly score of DNN-based anomaly detection in sounds," 2019, pp. 6–10.
[16] R. Chalapathy and S. Chawla, "Deep learning for anomaly detection: A survey," 2019.
[17] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult, "Toward open set recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1757–1772, 2013.
[18] P. Oza and V. M. Patel, "C2AE: Class conditioned auto-encoder for open-set recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[19] C. Geng, S.-J. Huang, and S. Chen, "Recent advances in open set recognition: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[20] http://dcase.community/challenge2020/task-unsupervised-detection-of-anomalous-sounds
[21] Y. Koizumi, Y. Kawaguchi, K. Imoto, T. Nakamura, Y. Nikaido, R. Tanabe, H. Purohit, K. Suefusa, T. Endo, M. Yasuda, and N. Harada, "Description and discussion on DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring," in arXiv e-prints: 2006.05822, June 2020, pp. 1–4. [Online]. Available: https://arxiv.org/abs/2006.05822
[22] Y. Koizumi, S. Saito, H. Uematsu, N. Harada, and K. Imoto, "ToyADMOS: A dataset of miniature-machine operating sounds for anomalous sound detection," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), November 2019, pp. 308–312. [Online]. Available: https://ieeexplore.ieee.org/document/8937164
[23] H. Purohit, R. Tanabe, T. Ichige, T. Endo, Y. Nikaido, K. Suefusa, and Y. Kawaguchi, "MIMII Dataset: Sound dataset for malfunctioning industrial machine investigation and inspection," in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), November 2019, pp. 209–213. [Online]. Available: http://dcase.community/documents/workshop2019/proceedings/DCASE2019Workshop_Purohit_21.pdf
[24] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville, "FiLM: Visual reasoning with a general conditioning layer," in