An Adaptive Approach for Anomaly Detector Selection and Fine-Tuning in Time Series
Hui Ye, Xiaopeng Ma, Qingfeng Pan, Huaqiang Fang, Hang Xiang, Tongzhen Shao
Hui Ye
Alibaba Inc., Beijing, [email protected]
Xiaopeng Ma
Alibaba Inc., Beijing, [email protected]
Qingfeng Pan
Alibaba Inc., Beijing, [email protected]
Huaqiang Fang
Alibaba Inc., Beijing, [email protected]
Hang Xiang
Alibaba Inc., Beijing, [email protected]
Tongzhen Shao
Alibaba Inc., Beijing, [email protected]
ABSTRACT
Anomaly detection in time series is a hotspot of time series data mining. Each anomaly detector has its own characteristics, which determine the types of anomalies it handles well; no single detector is optimal for all anomaly types. Moreover, a single detector is often not optimal even across different time windows of the same time series, which makes industrial deployment difficult. This paper proposes an adaptive model, ATSDLN (Adaptive Time Series Detector Learning Network), which selects an appropriate detector and its run-time parameters for anomaly detection based on the characteristics of the time series. We take the time series as the input of the model and learn a time series representation through an FCN. To realize adaptive selection of detectors and run-time parameters according to the input time series, the FCN outputs feed two sub-networks: the detector selection network and the run-time parameter selection network. In addition, the variable-width design of the last layer of the parameter selection sub-network and the introduction of transfer learning make the model more extensible. Experiments show that ATSDLN selects appropriate anomaly detectors and run-time parameters, is highly extensible, and transfers quickly. We investigate the performance of ATSDLN on public data sets; our method outperforms other methods in most cases, with higher effectiveness and better adaptation. We also report experimental results on public data sets to show how the model structure and transfer learning affect effectiveness.
KEYWORDS
Self-adaption, Anomaly Detection, Joint Learning Network, Transfer Learning, Time Series
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
DLP-KDD'19, August 5, 2019, Anchorage, AK, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6783-7/19/08...$15.00
https://doi.org/10.1145/3326937.3341253
ACM Reference Format:
Hui Ye, Xiaopeng Ma, Qingfeng Pan, Huaqiang Fang, Hang Xiang, and Tongzhen Shao. 2019. An Adaptive Approach for Anomaly Detector Selection and Fine-Tuning in Time Series. In DLP-KDD'19, August 5, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3326937.3341253
Internet-based services have strict requirements for continuous monitoring and timely anomaly detection; specifically, monitoring performance indicators and detecting performance anomalies are important. For example, e-commerce platforms need to monitor income indexes and broadcast alerts when an obvious income decrease happens. From the perspective of data science, key performance indexes are usually portrayed as time series, and potential faults in applications are portrayed as anomalies. An anomaly (an outlier) in a time series is a data point or a group of data points that differ significantly from the rest of the data points [8]. Due to the large number of performance indexes and anomalies, human monitoring of these indexes is impracticable, which leads to the demand for automated anomaly detection using machine learning and data mining techniques [6, 10-12]. Many fast and effective anomaly detectors have been designed to localize these anomalies [2], such as outlier detectors [1] and change point detectors [7]. Although anomaly detectors have proven effective in certain scenarios, applying them to internet-based services remains a great challenge [9]. Because of the large-scale distributed monitoring scenario and the complex trends of indicators, it is almost impossible to detect anomalies in all scenarios with one type of detector. To ensure the performance of the anomaly detection approach, expertise-based rules are required for detector selection and run-time parameter fine-tuning [9]. Furthermore, when a detection system is deployed online, the run-time parameters of anomaly detectors usually need to be adjusted according to real-time changes.

It is hard to propose one general approach to detect all types of anomalies: a significant decrease or increase can be detected directly by a static threshold, while continuous minor changes are detected more quickly by a change point detector. State-of-the-art detectors are usually designed to detect one type of anomaly [8]. When multi-detector voting over detection results is adopted, each detection needs to traverse all detectors and candidate run-time parameter combinations. The effect is greatly influenced by the data set and the voting rules, and it is very time-consuming, which does not meet the demands of industrial real-time monitoring scenarios. Our proposed framework, named ATSDLN, tackles the above challenges through an adaptive time series anomaly detector learning network.
Against the background of large industrial data scale, a complicated index system, and unusually large variety, time series data usually changes with the business: the same time series may differ greatly at different stages of a business project. On the other hand, influenced by commercial data and user behavior, peaks and low ebbs of traffic differ across holidays, daytime and night, big promotions, and so on, which causes natural differences in the data. If we do not consider self-adaption when doing anomaly detection, we cannot balance the false positive rate against the false negative rate. Therefore, choosing one universal detector to adapt to all data and scenarios is unworkable. Multi-detector fusion is a very effective way to improve time series anomaly detection, and it is usually conducted in the two stages as follows:
The anomaly detection stage: it is realized by selecting the appropriate detector for the input time series.
The alarm convergence stage: it is realized by taking the anomalies detected by each detector as input. Alarm convergence can be achieved with a voting method or with time series feature modeling.
• Voting method: absolute majority vote, relative majority vote, weighted vote, etc.
• Deep learning: time series modeling of the detected anomalies.
Both approaches are highly extensible and support the dynamic addition of anomaly detectors. The former adapts based on the original input time series, which is more flexible, and this study takes the former approach. As shown in the experimental section, a single detector is lower than our model in precision, recall, and F1, and its error rate is relatively high. The starting point of this study is to fix a sliding window size for the time series and to optimize the precision, recall, and false positive rate of anomaly detection by self-adaptively selecting the detector and run-time parameters for the time series in the current sliding window.

Since different detectors have their own characteristics, which determine the types of time series they handle well, it is natural to determine which detectors and run-time parameters are suitable from the features of the time series. We call this approach manual rule maintenance for detector and run-time parameter selection. The core work is to determine which features of the time series and which thresholds should be used for judgment (for instance, non-stationary time series with long-term trends can adopt dynamic thresholds). The advantage of such artificial rules is strong interpretability. However, the determination of these rules relies on manual experience, and it is difficult to enumerate them. As data accumulates, the rules become more and more difficult to maintain, and their anomaly coverage, correctness, versatility, and extensibility are also great challenges.

Fortunately, in the era of artificial intelligence, it is natural to think of using models to replace this labor. Generally speaking, time series classification using traditional machine learning methods (such as KNN with DTW) can achieve good results; on big data, however, deep learning tends to defeat traditional methods. Recently, a paper relevant to this research was published by Fawaz et al. [4], which demonstrated the feasibility of transfer learning across different time series data sets. The authors argue that an FCN can learn a good time series representation when the amount of data is sufficient, and that the features extracted by deep networks for time series are similar and transferable, much like CNN features for images. Moreover, one of the challenges of supervised learning is the need for a large amount of labeled data; real-world labeled data is not readily available because labeling is costly and time-consuming. This is, in essence, a problem transfer learning can address, so a transfer-learning-based solution becomes a good choice for the self-adaptive anomaly detection problem.
A new model, ATSDLN, is proposed in this paper. It realizes adaptive classification of time series anomaly detectors and run-time parameter selection by combining transfer learning and dynamic adaptive joint learning, and it is pre-trained on public data sets for transfer learning. Figure 1 is our framework diagram. The model supports multiple channels and can take the original time series, a prediction time series, or a residual sequence as input.

From the bottom to the top, the first part is the Fully Convolutional Network (FCN), which is made up of convolution layers and a global average pooling layer. As Figure 1 shows, transfer learning is applied to the FCN layers and fine-tuning to the FC layers, which initializes the network parameters better, speeds up training and convergence, and improves the performance of the time series classification model. The main function of this part is to learn a rich time series representation from a large amount of training data. Transfer learning strengthens the model's ability to extract time series representations and addresses the problems of sparse labeled samples and model portability.

The second part is composed of two sub-networks, both of which are supervised classification models. The left part is responsible for classifying the detector, while the right part is responsible for classifying the corresponding run-time parameters of the detector; the two parts learn jointly. The representation learned through the detector classification task is used as an input of the run-time parameter selection task, which assists the learning of the run-time parameters. As Figure 1 shows, the output p(x) uniquely determines a detector for the current time series, and q(x) is the run-time parameter set chosen for the current time series and selected detector. Because the sizes of the candidate run-time parameter sets differ across detectors, the width of the last layer of the right network follows the detector selected on the left, so the model supports flexible addition and deletion of detectors.
Figure 1: Overall network structure. The left branch represents the anomaly detector classification task, the right branch the run-time parameter fine-tuning task; blue layers are shared by the two sub-networks.

It follows that the selection of run-time parameters on the right side depends not only on the time series representation but also on the detector selected by the network on the left side. So in this part, the representation learned in the left detector classification task is shared with the run-time parameter selection task on the right and taken as its input to assist learning.

The third part, at the top, is the execution module for anomaly detection. It runs the detector and run-time parameters selected by the model on the time series to detect anomalies.
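To make the structure above concrete, here is a minimal Keras sketch of the FCN backbone with the two classification heads. The layer sizes, the number of detectors, the fixed upper bound on candidate parameter sets (the paper's last layer is variable-width per detector), and the single input channel are illustrative assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

N_DETECTORS = 5        # assumption: five candidate detectors (cf. Table 2)
MAX_PARAM_SETS = 10    # assumption: upper bound on candidate parameter sets
WINDOW = 200           # sliding-window length used in the experiments

inputs = tf.keras.Input(shape=(WINDOW, 1))           # one channel; more are possible

# FCN backbone (convolution blocks + global average pooling), in the spirit of [4]
x = layers.Conv1D(128, 8, padding="same", activation="relu")(inputs)
x = layers.Conv1D(256, 5, padding="same", activation="relu")(x)
x = layers.Conv1D(128, 3, padding="same", activation="relu")(x)
ts_repr = layers.GlobalAveragePooling1D()(x)          # time series representation

# Left sub-network: detector classification p(x)
det_hidden = layers.Dense(64, activation="relu")(ts_repr)
p_x = layers.Dense(N_DETECTORS, activation="softmax", name="detector")(det_hidden)

# Right sub-network: run-time parameter selection q(x), fed by the shared
# representation AND the detector-task representation, as described above.
param_in = layers.Concatenate()([ts_repr, det_hidden])
q_x = layers.Dense(MAX_PARAM_SETS, activation="softmax", name="parameters")(param_in)

model = tf.keras.Model(inputs, [p_x, q_x])
model.compile(optimizer="adam",
              loss={"detector": "sparse_categorical_crossentropy",
                    "parameters": "sparse_categorical_crossentropy"})
model.summary()
```

Note that the fixed-width parameter head here is a simplification; the paper lets the width of that last layer follow the selected detector.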
The following parts form the core components of the joint learning approach. The two sub-networks correspond to the anomaly detector classification task and the run-time parameter fine-tuning task, which means the network predicts the optimal detector and fine-tunes the run-time parameters simultaneously, without human interference. First, we collect some classical detectors that were proposed to detect anomalies in different contexts. Second, a new evaluation criterion is proposed to evaluate the performance of these detectors on each time series; this process also generates the labels of our two sub-tasks. Third, an adaptive model is trained to extract deep features of time series, which is crucial for the optimal detector prediction and run-time parameter fine-tuning tasks. Lastly, we transfer the representation learned from public data sets to other unseen data sets and evaluate the usability of transfer learning in time series anomaly detection.
We use sliding windows of different sizes over the Webscope S5 data set (https://research.yahoo.com/) as experimental samples, which contain outliers and change points, and use the UCR Time Series Classification Archive as the source data set for transfer learning. Webscope S5 is a labeled anomaly detection data set. It contains 367 time series, each of which has between 741 and 1680 data points at regular intervals. Each time series is accompanied by an indicator series with 1 if the observation is an anomaly, and 0 otherwise.
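A minimal sketch of how fixed-size window samples might be cut from one labeled S5 series; the column names `value` and `is_anomaly` and the stride are assumptions about the file layout, not details given in the paper.

```python
import numpy as np
import pandas as pd

def make_windows(csv_path, window=200, stride=200):
    """Slice one labeled series into fixed-size windows.

    Assumes a CSV with a `value` column and an `is_anomaly` indicator
    column (1 for anomalous points, 0 otherwise); adjust the names if
    the file layout differs."""
    df = pd.read_csv(csv_path)
    values = df["value"].to_numpy()
    flags = df["is_anomaly"].to_numpy()
    windows, window_labels = [], []
    for start in range(0, len(values) - window + 1, stride):
        windows.append(values[start:start + window])
        window_labels.append(flags[start:start + window])
    return np.stack(windows), np.stack(window_labels)
```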
UCR is a time series classification archive containing 128 data sets from different applications. The number of classes ranges from 2 to 60, and the data set sizes range from 20 to 8926 series. By traversing the candidate detectors and their combinations of operational parameters, the optimal detector and run-time parameters are selected for each time series and used as the training labels for supervised learning. Pre-training with transfer learning on the UCR time series classification archive then alleviates the limited volume of labeled training data.
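The traversal described above could look like the following sketch, where the detector interface, parameter grid, and scoring function are hypothetical placeholders for the paper's detectors and evaluation criterion.

```python
def best_combination(window, labels, detectors, param_grid, score_fn):
    """Traverse every (detector, parameter-set) combination on one window and
    return the pair with the highest score; this pair becomes the supervised
    label for the two sub-tasks. `detectors`, `param_grid`, and `score_fn`
    are placeholders, not the paper's exact detector set or criterion."""
    best, best_score = None, float("-inf")
    for name, detector in detectors.items():
        for params in param_grid[name]:
            predicted = detector(window, **params)   # 0/1 anomaly flags per point
            score = score_fn(labels, predicted)      # e.g. F1 penalized by Error
            if score > best_score:
                best, best_score = (name, params), score
    return best
```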
Experimental results are evaluated by comparing the detected anomalies to the true anomalies. Table 1 presents the evaluation measures we use, such as precision, recall, and Error. FP denotes the number of false positives, FN the number of false negatives, TP the number of true positives, and TN the number of true negatives. The proportion of true positives in anomaly detection is small, and precision cannot accurately express the level of the false alarm ratio; in particular, when the number of true positives is zero, precision is always zero, so there is no single ideal measure for assessing anomaly detection methods. In our situation, a high false alarm ratio causes alarm fatigue, which lowers the attention paid to monitoring alarms. However, because the number of true negatives is large, the false positive rate is not sensitive enough, as it grows very slowly. Therefore, we propose a new metric named Error, which is defined as FP/(TP+FP+FN).
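As a worked reference for Table 1, a small helper computing these metrics (including the proposed Error) from the confusion counts might look like this; the example counts at the end are made up.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Point-wise metrics from Table 1; Error penalizes false positives
    relative to all flagged or missed points."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    error = fp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall,
            "false_positive_rate": fpr, "f1": f1, "error": error}

# Example: 8 true positives, 3 false positives, 5 misses, 984 true negatives.
print(evaluation_metrics(tp=8, fp=3, fn=5, tn=984))
```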
According to the shape and context of time series anomalies, they can be summarized as outlier, mean-shift, cliff-type, deviating-trend, and new-shape; see Table 2 for details. The anomaly detectors used in ATSDLN are the same as those in EGADS. In addition, parameter fine-tuning is as important as selecting the most suitable detector. Detector parameters fall into two categories: the first is the common parameters needed by all detectors, including sliding window size, sensitivity, and number of historical samples; the second is the internal parameters required by each detector algorithm, such as the K multiple of the variance in KSigma, eps and minPts of DBScan, the confidence and drift range of the change point detector, the search radius of DTW similarity, etc.
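A sketch of how these two parameter groups could be organized as candidate grids; the concrete candidate values below are illustrative assumptions, not the ones used in the experiments.

```python
# Common run-time parameters shared by every detector (values are illustrative).
COMMON_PARAMS = {
    "window_size": [100, 200, 500],
    "sensitivity": ["low", "medium", "high"],
    "history_samples": [1000, 5000],
}

# Detector-specific parameters mentioned in the text; the candidate values
# here are assumptions, not the ones used in the paper.
DETECTOR_PARAMS = {
    "KSigma": {"k": [2.0, 3.0, 4.0]},
    "DBScan": {"eps": [0.5, 1.0], "minPts": [3, 5]},
    "ChangePoint": {"confidence": [0.95, 0.99], "drift_range": [0.05, 0.1]},
    "DTWSimilarity": {"search_radius": [5, 10]},
}
```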
Figure 2: Example of anomaly types.
Figure 3: Anomaly model performance on different detectors (each adaptively selecting its best run-time parameters).
As described above, the network is composed of two sub-networks, both of which are supervised classification models. The output p(x) uniquely determines a detector, and q(x) is the run-time parameter set corresponding to that detector. With the detector and run-time parameters determined, anomaly detection can be executed on the time series. The experimental evaluation uses the precision, recall, and Error described in the evaluation criteria above.

The main work of this paper is to select the appropriate detector and run-time parameters for a given time series. The length of the time series is called its window size. The window size is not only related to the business attributes but also influences the sensitivity of the adaptive detector selection: in theory, the smaller the window, the more sensitively the detector and parameters change. The traditional voting method relies more on the accumulation of time series data and adapts poorly. As shown in Figure 4, the horizontal axis is the window size of the time series and the vertical axis is the evaluation index computed from the anomaly detection results; the smaller the window, the worse the baseline effect. According to our experiments, the window size does not affect the performance of our model, so ATSDLN adapts well to different window sizes. To compare the performance of different experiments, we fix the window size at 200 points.
To explore the necessity of adaptively selecting detectors and operating parameters, we select 29 parameter combinations of the five detectors described above. The experiments are performed on the Yahoo public data set. The results are shown in Figure 5: the horizontal axis shows the 29 combinations of detectors and their parameters, and the vertical axis shows the performance under each combination.
Table 1: Evaluation Metrics
Metric                Description
Precision             TP / (TP + FP)
Recall                True positive rate, TP / (TP + FN)
False positive rate   FP / (FP + TN)
F1-score              2 * precision * recall / (precision + recall)
Error                 FP / (TP + FP + FN)
Table 2: Anomaly and detector type descriptions

anomaly type      description                           detector
outlier           significantly different points        KSigma / DBScan / LOF / Extreme Low Density [1, 8]
mean-shift        sustained inapparent deviation        CUSUM change point [7]
cliff-type        switch to another sustained value     KernelDensity change point / KSigma / SimpleThreshold
deviating-trend   not in line with the fitted trend     STL decomposition [3]
new-shape         not similar to other segments         DTW similarity [5]
Figure 4: Baseline performance on different window sizes.

No fixed detector and parameter combination is optimal across the whole time series. In addition, when the detector is fixed, the effects of different run-time parameters diverge widely.
Figure 3 compares the results of ATSDLN with single-detector models and shows that our method performs best. The method proposed in this paper is superior to any single detector with adaptively selected optimal parameters in precision, recall, and F1, and its false positive rate is also lower.
Table 3: Performance of different network architectures

model type      Precision  Recall   Error    F1
Baseline        0.0278     0.0022   0.7035   0.0040
LSTM-DNN        0.0940     0.5195   0.8335   0.1592
FCN-LSTM-DNN    0.4419     0.0023   0.0028   0.0045
ATSDLN
In this paper, we compare several network architectures to investigate our proposed model's effectiveness. The compared models are as follows:
• Baseline: the majority voting algorithm based on EGADS.
• LSTM-DNN: a hybrid neural network composed of Long Short-Term Memory (LSTM) and DNN layers.
• FCN-LSTM-DNN: the LSTM-DNN model with an additional CNN layer to capture features.
Table 3 shows that when multi-detector voting over detection results is adopted, each detection needs to traverse all detectors and candidate run-time parameter combinations, which is very time-consuming and does not meet the demands of industrial real-time monitoring scenarios. Moreover, compared with the voting algorithm, the neural network models achieve higher F1 scores, and different model structures affect the evaluation metrics; the CNN shows the better ability to extract abstract features in our task.
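For reference, a minimal sketch of the absolute-majority voting idea behind such a baseline; the per-detector flags and the simple interface are illustrative and do not reproduce the EGADS implementation.

```python
import numpy as np

def majority_vote(detector_flags, threshold=0.5):
    """Absolute-majority vote over per-detector anomaly flags.

    detector_flags: array of shape (n_detectors, n_points), entries in {0, 1},
    where 1 means the detector flagged that point as anomalous. A point is
    reported as an anomaly when more than `threshold` of detectors agree."""
    flags = np.asarray(detector_flags)
    votes = flags.mean(axis=0)           # fraction of detectors voting "anomaly"
    return (votes > threshold).astype(int)

# Hypothetical usage: three detectors scored the same 6-point window.
flags = [
    [0, 0, 1, 1, 0, 0],   # e.g. KSigma
    [0, 1, 1, 1, 0, 0],   # e.g. change point detector
    [0, 0, 0, 1, 0, 0],   # e.g. density-based detector
]
print(majority_vote(flags))  # -> [0 0 1 1 0 0] under the default majority rule
```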
As noted earlier, the model selects both the detector and its run-time parameters for the current time series. The time series representation learned through the detector classification task is used as an input of the run-time parameter selection task, which assists the learning of the run-time parameters.
Figure 5: Anomaly model performance on different parameter combinations.

Table 4: Effect of sharing layers

model type   Precision  Recall   Error    F1
NS-Model     0.0982     0.5146   0.8253   0.1649
SSR-Model    0.2366     0.3374   0.5212   0.2782
ATSDLN
This part discusses the influence of shared layers. The compared models are as follows:
• NS-Model: no sharing between the two sub-networks.
• SSR-Model: shares the shallow representation (the output layer of the FCN).
• ATSDLN: shares the task-specific representation (the layers of the detector classification task).
As shown in Table 4, sharing both the shallow and the task-specific representations of the time series between the detector selection network on the left and the run-time parameter selection network on the right gives the best effect. This is because the run-time parameters depend strongly on the detector: the representation learned by the detector classification task, which outputs the appropriate detector category, is used as input of the parameter classification task to assist parameter learning.
To address the shortage of labeled data and let the network extract temporal features and initialize better, we select several UCR data sets for comparative transfer learning experiments:
• ATSDLN: training without transfer learning.
• Transfer-1: transfer from FordA to our data.
• Transfer-2: transfer from Earthquakes to our data.
• Transfer-3: transfer from Coffee to our data.
Table 5: ATSDLN with transfer learning

model type   Precision  Recall   Error    F1
ATSDLN       0.3606     0.4512   0.4445   0.4010
Transfer-1
Transfer-2   0.3724     0.4350   0.4230   0.4013
Transfer-3   0.3613     0.4506   0.4434   0.4011

As described earlier, transfer learning is applied to the FCN layers and fine-tuning to the FC layers, which initializes the network parameters better, speeds up training and convergence, and improves the performance of the time series classification model. Table 5 shows that, in most cases, the pre-trained model improves the performance of the model.
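A rough sketch of this pre-train-then-fine-tune step, under the assumptions of the earlier Keras sketch: it presumes the `model` object defined there and a hypothetical weights file holding FCN weights pre-trained on a UCR data set such as FordA.

```python
import tensorflow as tf

# Assuming `model` is the two-headed Keras model sketched earlier and
# "pretrained_fcn.h5" holds FCN weights pre-trained on a UCR data set;
# both names are illustrative.
model.load_weights("pretrained_fcn.h5", by_name=True, skip_mismatch=True)

# Freeze the convolutional backbone and fine-tune only the fully connected
# heads, mirroring the FCN-transfer / FC-fine-tuning split described above.
for layer in model.layers:
    if isinstance(layer, tf.keras.layers.Conv1D):
        layer.trainable = False

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss={"detector": "sparse_categorical_crossentropy",
                    "parameters": "sparse_categorical_crossentropy"})
```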
This paper proposed a new model, ATSDLN, which realizes adaptive classification of time series anomaly detectors and run-time parameter selection by combining transfer learning and dynamic adaptive joint learning. The network is composed of two sub-tasks, anomaly detector classification and run-time parameter fine-tuning, both of which are supervised classification models. Because the sizes of the candidate run-time parameter sets differ across detectors, the width of the last layer of the run-time parameter fine-tuning network follows the detector selected by the left network, so the model supports flexible addition and deletion of detectors. Furthermore, because the run-time parameters depend strongly on the anomaly detector, sharing the shallow and task-specific representations of the time series between the detector classification network and the run-time parameter fine-tuning network gives the best effect. Moreover, we pre-trained the FCN layers on different data sets, and the results show that the transfer learning approach can improve the performance of our model. Experimental results show that ATSDLN alleviates the problem of low precision and a high false alarm ratio when the data pattern changes. ATSDLN has also been applied in our industrial scenarios. In the future, we will consider extracting global features of time series and alarm suppression.
REFERENCES
[1] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. LOF: identifying density-based local outliers. In ACM SIGMOD Record, Vol. 29. ACM, 93–104.
[2] Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly detection: A survey. ACM Computing Surveys (CSUR) 41, 3 (2009), 15.
[3] Robert B. Cleveland, William S. Cleveland, Jean E. McRae, and Irma Terpenning. 1990. STL: A seasonal-trend decomposition. Journal of Official Statistics 6, 1 (1990), 3–73.
[4] Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. 2018. Transfer learning for time series classification. In 2018 IEEE International Conference on Big Data. IEEE, 1367–1376.
[5] Tak-chung Fu. 2011. A review on time series data mining. Engineering Applications of Artificial Intelligence 24, 1 (2011), 164–181.
[6] Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Soderstrom. 2018. Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 387–395.
[7] Yoshinobu Kawahara, Takehisa Yairi, and Kazuo Machida. 2007. Change-point detection in time-series data based on subspace identification. In Seventh IEEE International Conference on Data Mining (ICDM 2007). IEEE, 559–564.
[8] Nikolay Laptev, Saeed Amizadeh, and Ian Flint. 2015. Generic and scalable framework for automated time-series anomaly detection. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1939–1947.
[9] Dapeng Liu, Youjian Zhao, Haowen Xu, Yongqian Sun, Dan Pei, Jiao Luo, Xiaowei Jing, and Mei Feng. 2015. Opprentice: towards practical and automatic anomaly detection through machine learning. In Proceedings of the 2015 Internet Measurement Conference. ACM, 211–224.
[10] Pankaj Malhotra, Anusha Ramakrishnan, Gaurangi Anand, Lovekesh Vig, Puneet Agarwal, and Gautam Shroff. 2016. LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148 (2016).
[11] Dominique T. Shipmon, Jason M. Gurevitch, Paolo M. Piselli, and Stephen T. Edwards. 2017. Time series anomaly detection: detection of anomalous drops with limited features and sparse examples in noisy highly periodic data. arXiv preprint arXiv:1708.03665 (2017).
[12] Haowen Xu, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, Ying Liu, Youjian Zhao, Dan Pei, Yang Feng, et al. 2018. Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications. In Proceedings of the 2018 World Wide Web Conference.