Detecting Anomalies in Software Execution Logs with Siamese Network
Shayan Hashemi and Mika Mäntylä
M3S Research Unit, ITEE, University of Oulu, Finland
[email protected], [email protected]

Abstract.
Logs are semi-structured text files that represent software's execution paths and states during its run-time. Anomalies in software logs therefore reflect anomalies in the software's execution path or state, so detecting them has become a notable concern in software engineering. We use LSTM like many prior works, and on top of LSTM, we propose a novel anomaly detection approach based on the Siamese network. This paper also provides an authentic validation of the approach on the Hadoop Distributed File System (HDFS) log dataset. To the best of our knowledge, the proposed approach outperforms other methods on the same dataset at the F1 score of 0.996. Additionally, we propose a hybrid model by combining the Siamese network with a traditional feedforward neural network to make end-to-end training possible, reducing engineering effort in setting up a deep-learning-based log anomaly detector, at merely a modest decay in the F1 score from 0.996 to 0.995. Furthermore, we examine our method's robustness to log evolutions by evaluating the model on synthetically evolved log sequences; we got the F1 score of 0.95 at the noise ratio of 20%. Finally, we dive deep into some of the side benefits of the Siamese network. Accordingly, we introduce a method of monitoring the evolution of logs without label requirements at run-time. Additionally, we present a visualization technique that facilitates human administration of log anomaly detection.

Keywords:
Log Analysis, Anomaly Detection, Siamese Network, Deep Learning
Log files are an unstructured text-based history of events that shed light on the software state during its execution. Each line of a log file indicates a different event and may hold different types of information such as log-type, timestamp, process ID, thread ID, and log message. Analyzing log events allows developers to extract helpful information about the software state during run-time. One of the log analysis applications is anomaly detection. Log anomaly detection may assist developers in software testing, debugging, or run-time monitoring.
Recently, deep learning has become the most predominant method in almost every machine learning problem. Furthermore, deep neural networks have been utilized to improve software testing, debugging, and stability. Going more in-depth, DNNs are used in applications such as software defect prediction [1], performance analysis [2], or prediction of reopened bugs [3]. Moreover, log anomaly detection is no exception, and DNNs have been widely utilized in this research area alongside other Machine Learning (ML) approaches.

There are two different approaches among the deep methods in log anomaly detection [4]. The first one is a binary classification task: it takes a sequence as input and outputs a binary value indicating if the sequence is an anomaly. The latter approach is sequence modeling, which trains only on the non-anomaly data and learns to model the system's normal behavior, resulting in predictions of low probabilities for anomalous behavior.

As non-anomaly data volume is significantly higher than anomaly data, sequence modeling is more common in log anomaly detection. However, training solely on non-anomaly data may result in models being unaware of anomaly events, making the approach unreliable in anomalous situations. Furthermore, since logs evolve due to software updates, models trained with non-anomaly data have limited capabilities to detect anomalous situations in evolved non-anomaly situations.

On the other hand, binary classification solves the previously mentioned problem by training the model on both anomaly and non-anomaly data. However, it comes with its own challenges; one of them is training on an unbalanced dataset. The obstacle comes into place when the proportion of anomaly to non-anomaly data is too small. More specifically, datasets contain dramatically fewer anomaly samples in comparison to non-anomaly ones.

Nonetheless, many solutions have been introduced to surmount the unbalanced data obstacle.
Oversampling and undersampling are two straightforward approaches that strive to equalize the number of samples in the two classes. Another way of dealing with unbalanced datasets is weighted training, which manipulates the cost function so that both classes' influences on the model's parameters are equal. However, setting training weights and oversampling may result in overfitting, while undersampling ignores a colossal proportion of negative samples during the training process. A more steady solution may be synthetic data generation, which eliminates the disadvantages of oversampling yet results in equilibrium; however, it requires innovative methods to generate legitimate and reliable samples. This paper proposes a new approach based on the Siamese network [5] to handle the unbalanced data in log anomaly detection.

The primary purpose of the Siamese network is similarity learning, and it is vastly used in one-shot learning such as face verification [6,7], signature verification [8,9], and visual object tracking [10,11,12]. Furthermore, it has also been proposed in the context of anomaly detection in video games [13]. The proposed Siamese-network-based model takes advantage of both non-anomaly and anomaly data while not demanding balanced training data.
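To illustrate the weighted-training alternative mentioned above, the sketch below computes per-class loss weights with a common inverse-frequency heuristic; this is an assumption for illustration, not necessarily the weighting scheme of any cited work, and `balanced_class_weights` is our own name.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency so that both
    classes contribute equally to the loss (a common heuristic,
    not the exact scheme of any cited work)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# 9 non-anomaly (0) samples vs. 1 anomaly (1) sample: the anomaly
# class receives 9x the weight, so both classes pull equally.
weights = balanced_class_weights([0] * 9 + [1])
print(weights)
```

With these weights, the summed influence of each class on the loss is identical (9 × 0.56 ≈ 1 × 5.0), which is exactly the equalization weighted training aims for.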
More in-depth, we attempt to learn an embedding function for log sequences that maps sequences of the same class (non-anomaly or anomaly) adjacent to each other while maximizing the distance between opposing classes' sequences. We also propose a sampling technique inspired by negative sampling [14] to generate pairs for the Siamese network's training process. The proposed algorithm significantly reduces the training costs of the Siamese network.

Furthermore, we evaluate the proposed method through various experiments. Accordingly, we examine the impact of different pair generation algorithms on the Siamese network, try different classifiers on top of the embedding neural network, and compare the best performer to state-of-the-art methods. Moreover, we evaluate our model's robustness on evolved log sequences and propose a method to monitor log evolutions at production time. Besides, we reveal a solution to visualize the embedded sequences to make human administration of log sequences possible. Finally, we construct a hybrid model by imposing the Siamese network on a feedforward neural network, investigating the Siamese network's positive impact.

The remainder of the paper is organized as follows: Section II is dedicated to explaining required knowledge, reviewing notable previous works, and discussing datasets. The Siamese network, the methodology, and the pair generation algorithms are explained in Section III. The preprocessing, dataset, and evaluation metric are discussed in Section IV. Section V comprises the reports of various experiments investigating the proposed method on deeper levels, while additional practical advantages are mentioned in Section VI. Finally, conclusion and future work proposals are offered in Section VII.
Log anomaly detection consists of multiple components, which are visualized in Figure 1. The figure illuminates four components that each log anomaly detector holds: preprocessor, log parser, log vectorizer, and classifier.

The first component is the preprocessor. As its name implies, its mission is to prepare log events for subsequent components. The preparations may include eliminating unnecessary information (such as IP addresses or invalid characters), extracting features from timestamps and log levels, and clustering logs based on their thread or process IDs. The preprocessor unit's output is passed to the next component, the log parser [16]. The log parser identifies the log message parameters and extracts templates. Log message event types can be inferred by matching a log message with the identified templates. Depending on the capabilities of the vectorizer, which is the next component, event parameters might be carried along with the event type. The log vectorizer produces vectors from event types and parameters (if any). The vectors may take the form of one-hot encodings, semantic vectors, or template IDs depending on the classifier's architecture. Then, vectors are given to the classifier, which is the last component. The classifier's goal is to distinguish anomalous vectors. Machine learning algorithms are quite prevalent for this component, as they have shown promising results in sequence modeling and classification.

Fig. 1: System log anomaly detector's architecture as in [15].
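The parse-then-vectorize steps above can be made concrete with a toy sketch. The templates, messages, and helper names (`TEMPLATES`, `to_event_id`, `one_hot`) are invented for illustration; real systems infer templates with a log parser such as Drain rather than hard-coding them.

```python
import re

# Invented templates, keyed to event IDs; "<*>" marks a parameter slot.
TEMPLATES = {
    "packet sent to <*>": 0,
    "first response received": 1,
    "packet response from <*>": 2,
}

def to_event_id(message):
    """Match a raw log message against the known templates by replacing
    parameter-looking tokens (here: IPv4 addresses) with a wildcard."""
    template = re.sub(r"\d+\.\d+\.\d+\.\d+", "<*>", message)
    return TEMPLATES[template]

def one_hot(event_id, n_events):
    """Vectorize an event ID as a one-hot vector."""
    vec = [0] * n_events
    vec[event_id] = 1
    return vec

session = ["packet sent to 192.168.1.50", "first response received"]
encoded = [one_hot(to_event_id(m), len(TEMPLATES)) for m in session]
print(encoded)  # [[1, 0, 0], [0, 1, 0]]
```

The resulting sequence of vectors is what the classifier component consumes.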
One of the most well-known and effective log anomaly detection methods is the PCA-based one mentioned in [17]. The method first forms a session-event matrix, similar to the document-term matrix in Natural Language Processing (NLP), where each cell indicates the number of occurrences of a particular event in an individual session. Next, the matrix is passed to a principal component analysis. Then the anomalies are detected by distinguishing the session vector's projection length in the residual space.

In another approach, [18] uses the session-event matrix and mines invariants that are satisfied by the majority of the sessions. Thus, anomalies occur in sessions that do not satisfy the mined invariants. While all the mentioned works focus on designing general-purpose algorithms, [19] presents a method that compares the log messages to a set of automata to calculate the workflow divergence, which is then used to label anomalies. However, it focuses on log anomaly detection in OpenStack's logs specifically.

As Deep Neural Networks have grown more mature in recent years, they have gained popularity in log anomaly detection research. Many approaches leverage different types of Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) [20] or Gated Recurrent Unit (GRU) [21], while others detect anomalies by making use of Convolutional Neural Networks (CNNs) [22].

DeepLog, as the most well-known log anomaly detection method, proposed in [23], uses a DNN in the classifier component. After parsing log events, DeepLog encodes the event types and parameters into vectors. Next, the model, which is based on LSTM, trains on data from non-anomaly execution only to predict the next log event given previous events.
After the training, the model predicts a low probability for some events in anomaly sequences, as it has trained on non-anomaly data only.

Although the methods mentioned before accurately detect log anomalies, [15] suggests that advances made by previous works are based on a closed-world assumption where logs are static, while, in real-world applications, logs are continuously evolving. Log evolutions are considered undoubtedly important these days, as many companies are continuously delivering software updates to their customers [24]. Thus, the authors of [15] suggest LogRobust, a novel method for log anomaly detection. LogRobust proposes a new vectorization technique called "semantic vectorization" to approximately compensate for the evolution of log messages. It also suggests utilizing an attention-based Bidirectional Long Short-Term Memory (Bi-LSTM) to handle the execution path evolutions. Furthermore, the authors present a technique to emulate log evolutions by applying noise.

LogAnomaly, explained in [25], presents another novel yet practical approach for vectorization called "template2vec" that takes synonyms and antonyms into account, making the vectorization process more reliable. Furthermore, LogAnomaly claims that it can detect sequential anomalies as well as quantitative ones. While every previously mentioned deep method applies LSTM to model log sequences (predict the next log event), LogAnomaly uses an LSTM on Term Frequency-Inverse Document Frequency (TF-IDF) vectors to construct a binary classifier. On the other end of the spectrum, [26] applies a CNN instead of an LSTM to form a binary classifier. The research also introduces an effective embedding method, called "log-key2vec", to transform one-hot encoded log events into vectors.
This method results in an efficient dimension reduction of the one-hot encoded vectors.

All previous deep-learning-based methods, regardless of their core components, obeyed one of the two previously mentioned approaches: they either applied binary classification or modeled the sequence. However, this paper presents a third option that utilizes the Siamese network to circumvent the previously mentioned challenges in a different manner. Harnessing the Siamese network's power, our method proposes a new approach to embed the log sequences into vectors, so that embedded sequence vectors of different classes are readily separable and classifiable in the new space.
In the area of log anomaly detection, many datasets exist. The LogHub data collection currently contains 16 different software log datasets [27]. The LogHub collection offers log datasets from various software types such as distributed file systems, operating systems, and web-based services. Additionally, six of them are labeled for the task of anomaly detection.

The labeled logs may be divided into two different categories based on their labeling approach. The first is the one-to-one labeled datasets, consisting of log sequences with a unique label for every log element. In the second category, named n-to-one, on the other hand, there is one label for each sequence of elements. It could also be interpreted as if all elements of an individual log sequence possess the same label (anomaly or non-anomaly).

Among the LogPai's labeled datasets, to the best of our knowledge, the Hadoop Distributed File System (HDFS) and OpenStack log datasets for anomaly detection, mentioned in [28,23], are the only n-to-one labeled datasets suitable for our research. Moreover, since almost all previously mentioned works utilize the labeled datasets on the sequence level to detect software log anomalies regardless of their approaches (binary classification or sequence modeling), these two datasets are prevalent among the related works. As this paper demands n-to-one labeled datasets, we strived to utilize the same datasets (HDFS and OpenStack) as the previous works. However, during our tests, after parsing the OpenStack dataset using the Drain log parsing algorithm, mentioned in [29], we noticed that it lacks a sufficient amount of unique sequences. More in-depth, we found only 11 unique sequences in the OpenStack dataset while there were 18,383 in the HDFS dataset. Hence, experiments are limited to the HDFS dataset only, which is also the case in some of the prior works such as [26].
As mentioned earlier, previous deep methods either train on non-anomaly events only or apply binary classification to detect anomalies. However, both of those approaches are prone to deficiencies.

In the first (non-anomaly events only) approach, the model would not encounter log events that only occur in an anomalous situation during training. For instance, in a distributed data storage solution, a hard drive failure event is not a regular event by any means. In the HDFS dataset, from the twenty-nine unique events, only twenty-two occur in non-anomaly situations. Needless to say, not training on a proportion of the input space may result in unexpected model behavior. In the latter (binary classification) approach, the model's training suffers from the unbalanced dataset. Although some solutions to the unbalanced data problem have been discussed, all of them are accompanied by their own limitations.

Throughout the rest of this section, we propose a novel approach based on the Siamese network, due to its excellent performance in one-shot learning problems [6,7,10,11,12] and its stability on unbalanced data [30]. Our proposed method takes advantage of both data classes without any sampling tricks or weighted training.
The Siamese network, illuminated in Figure 2, was initially invented to resolve the one-shot learning problem [5] by forming a similarity-based embedding function. It packs two neural networks with shared weights (they are indeed the same neural network and may be considered one; however, discriminating them makes the Siamese network's architecture more interpretable) and a similarity metric.

Fig. 2: The Siamese network's architecture.

During the training, at first, pairs of samples are passed to the neural networks. Next, the neural networks embed them into vectors. Then, the similarity between the vectors is measured. Lastly, the optimization process updates the weights of the neural networks with respect to the fact that similar pairs (same class) should hold high similarity values for their output vectors, while it is the contrary for dissimilar pairs (pairs from different classes). At the end of the training process, the model embeds same-class samples close to each other, while different-class samples are embedded away from each other. In this paper, we use the Siamese network to train a deep embedding neural network that transforms log sequences into vectors so that embedded vectors of sequences of the same class are close to each other while being apart from the other class.

After the Siamese network converges, we extract the embedding neural network and embed all training sequences into vectors. As the embedded vectors of different classes are well separated, they are excellent training data for an arbitrary classifier. So, we train a classifier to work on top of the embedding neural network to form an anomaly detection method. During the test time, the embedding neural network transforms the input sequences into vectors and passes them to the classifier to be classified as non-anomaly or anomaly sequences.

Since the invention of the Siamese network, different loss functions have emerged for it. One of them is the contrastive loss function, mentioned in [31]. It operates utilizing the Euclidean distance, ensuring enough space between embedded vectors of different classes while keeping vectors from the same class close to each other. However, during our experiments, we inquired about another loss function based on the sigmoid of the inner product and the cross-entropy loss function [32], which performed better than the contrastive loss.
Going more in-depth, we use the sigmoid function on the embedded sequences' inner product to construct a similarity measure. This measure may be formulated as:

sim(x1, x2) = σ(x1 · x2).
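A NumPy sketch of this similarity measure and the cross-entropy loss built on top of it (function and variable names are ours, not the paper's):

```python
import numpy as np

def similarity(x1, x2):
    """Sigmoid of the inner product of two embedded sequences,
    mirroring the similarity measure above."""
    return 1.0 / (1.0 + np.exp(-np.dot(x1, x2)))

def siamese_loss(x1, x2, y):
    """Cross-entropy over the similarity: y = 1 for a same-class
    pair, y = 0 for a pair from different classes."""
    s = similarity(x1, x2)
    return -(y * np.log(s) + (1 - y) * np.log(1 - s))

a = np.array([1.0, 2.0, -0.5])
# A vector has maximal inner product with itself, so the similar pair
# (a, a) with target 1 yields a near-zero loss; the opposite vector
# gives a near-zero similarity, so (a, -a) with target 0 does too.
print(siamese_loss(a, a, 1), siamese_loss(a, -a, 0))
```

During training, minimizing this loss pushes same-class embeddings toward high inner products and different-class embeddings toward low ones.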
On top of the similarity measure, we use the cross-entropy loss function. So, the final loss function may be formulated as:

J(x1, x2, y) = −(y · log(sim(x1, x2)) + (1 − y) · log(1 − sim(x1, x2))).

As the Siamese network requires its training input to be in pairs, a proper pair generation method is required. Generated training pairs must include two types of pairs in order to train the Siamese network. The first type is similar pairs, in which the entities are from the same class and the training target is one. The second type is dissimilar pairs, in which the entities are from different classes and the training target is zero. To shed more light, assume that A is an anomaly sequence, while N is a non-anomaly sequence. From the four possible pair permutations, (A, A) and (N, N) are considered similar pairs, while (A, N) and (N, A) are dissimilar ones. The following paragraphs describe two pair generation algorithms for training the Siamese network.

The first approach, which is quite straightforward, generates every possible pair. Going more in-depth, every sequence in the dataset pairs with all other sequences except for itself. The pseudo-code can be seen in Algorithm 1. Although this method is sensible and easy to implement, it is impractical for massive datasets. Alongside the quadratic growth of the pair quantity, this approach generates dramatically more similar pairs than dissimilar ones. We call this approach the "All" pair generation algorithm.

Algorithm 1: Generating pairs using the All algorithm
  GenerateAllPairs(D)
  inputs : The dataset D, which contains sequences denoted by s and targets denoted by t
  output: Pairs generated using the All algorithm
  foreach (s1, t1) ∈ D do
    foreach (s2, t2) ∈ D, s2 ≠ s1 do
      if t1 == t2 then addPair(s1, s2, 1)
      if t1 != t2 then addPair(s1, s2, 0)

The second approach focuses on training efficiency. In this approach, for each sequence within the dataset, we sample one sequence from the same class and K sequences from the different class, generating K + 1 pairs for each sequence. In other words, this approach samples a subset of all pairs instead of generating them all. The pseudo-code is observable in Algorithm 2. This method reduces training time and power consumption, making it feasible for training the Siamese network. We name this approach the "K Plus One (KPOne)" pair generation algorithm. As the K value increases, so does the computational effort. We noticed improvements in our experiments while increasing K until K = 3.

Algorithm 2: Generating pairs using the KPOne algorithm
  GenerateKPOnePairs(N, P, K)
  inputs : The data subsets N and P, which respectively contain negative (non-anomaly) and positive (anomaly) sequences, and the constant K, where K ∈ ℕ is the proportion of dissimilar to similar pairs
  output: Pairs generated using the KPOne algorithm
  foreach n ∈ N do
    sn = sampleSet(N); addPair(n, sn, 1);
    for 1 to K do
      sp = sampleSet(P); addPair(n, sp, 0);
  foreach p ∈ P do
    sp = sampleSet(P); addPair(p, sp, 1);
    for 1 to K do
      sn = sampleSet(N); addPair(p, sn, 0);

The number of samples generated in each epoch, and accordingly the computational cost, may vary significantly based on the choice of the pair generation algorithm. Assume that n_n and n_a are respectively the number of non-anomaly and anomaly samples within a dataset. The number of pairs generated by the All algorithm is

N_All = n_a^2 + n_n^2 + 2 n_a n_n − n_a − n_n,

while the number of generated pairs for the KPOne algorithm is

N_KPOne = K n_a + K n_n + n_a + n_n,

where K is the dissimilar sample count. Clearly, for large values of n_a and n_n, N_KPOne is dramatically smaller than N_All. So, the computational cost of the All pair generation algorithm is larger than that of KPOne.

In this section, we explain the dataset and the preprocessing steps and evaluate our architecture.
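Before moving on, the KPOne sampling described above can be sketched in a few lines of Python. The function name `kpone_pairs` is ours, and sampling details the paper leaves unspecified (such as excluding self-pairs) are simplified.

```python
import random

def kpone_pairs(negatives, positives, k, rng=random):
    """Sketch of KPOne pair generation: for every sequence, one
    similar pair (target 1) and K dissimilar pairs (target 0)."""
    pairs = []
    for n in negatives:
        pairs.append((n, rng.choice(negatives), 1))     # similar pair
        for _ in range(k):
            pairs.append((n, rng.choice(positives), 0))  # dissimilar pairs
    for p in positives:
        pairs.append((p, rng.choice(positives), 1))
        for _ in range(k):
            pairs.append((p, rng.choice(negatives), 0))
    return pairs

neg, pos = ["n0", "n1", "n2"], ["p0"]
pairs = kpone_pairs(neg, pos, k=3)
print(len(pairs))  # 16: (K + 1) pairs per sequence, i.e. 4 * (3 + 1)
```

The linear growth in the number of pairs, (K + 1)(n_a + n_n), is what makes this variant tractable compared to the quadratic All algorithm.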
We use the HDFS log dataset for anomaly detection as it has become a benchmark in the log anomaly detection task. It is widely used in previous works, making a fair and comprehensive comparison between ours and the existing state-of-the-art methods possible. As our research focuses on classification, we use the dataset's vectorized variant, provided by [27], since the input data is cleaned, processed, transformed into sequences, and prepared for classification. The dataset contains 575,061 sequences, 558,223 of them being non-anomaly sequences, while 16,838 are anomaly sequences.

Although the dataset is supposed to be ready for classification, we discovered many redundant sequences. Redundancy not only raises the required processing power for training but also compromises the authenticity of the evaluation, as some test samples may appear in the training set. So, our first and only step of preprocessing is to remove redundant sequences. After removing the redundant sequences, the dataset contains 4,124 unique anomaly and 14,259 unique non-anomaly sequences.
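The deduplication step is straightforward to express in Python; the helper below is our own sketch of it (a set of hashable keys keeps the first occurrence of each sequence).

```python
def deduplicate(sequences):
    """Remove redundant (duplicate) sequences while preserving order,
    mirroring the single preprocessing step described above."""
    seen = set()
    unique = []
    for seq in sequences:
        key = tuple(seq)  # lists are unhashable; tuples are
        if key not in seen:
            seen.add(key)
            unique.append(seq)
    return unique

data = [[5, 22, 5, 11], [3, 4], [5, 22, 5, 11]]
print(deduplicate(data))  # [[5, 22, 5, 11], [3, 4]]
```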
We split the data into a train and a test set (90% for training and 10% for testing). The train set is used for training the Siamese network and classifiers, while the test set is utilized for evaluating the system. However, before we start to generate pairs using the desired pair generation algorithm and train the Siamese network, we take a small proportion of the training data (equal to 3% of all data), generate all available pairs from it, and use it as the validation set. Then we start the training using pairs generated with the selected algorithm. The validation set's purpose is to find the most suitable neural network architecture and hyper-parameters and to control overfitting. After finding a proper architecture and hyper-parameters, the validation set serves no purpose; thus, it is merged into the training set for retraining the neural network. Figure 3 illuminates the data splitting and experiment processes, presenting an overall view of the whole process.
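The split described above can be sketched as follows; `split_dataset` is a hypothetical helper, not the paper's code, and it shuffles with a fixed seed for reproducibility.

```python
import random

def split_dataset(items, test_frac=0.10, val_frac=0.03, seed=0):
    """Sketch of the split: 10% of the data for testing, a slice equal
    to 3% of all data for validation, and the rest for training."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(1000)))
print(len(train), len(val), len(test))  # 870 30 100
```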
Fig. 3: Overall view of data splitting and experiments.

The nature of the anomaly detection task is unbalanced, meaning that there are significantly more negative samples in comparison to positive ones. In such circumstances, the binary classification accuracy is not a valid metric for measuring performance. So, we use another metric, called the "F1 score", to measure and compare performance. Suppose TP, TN, FP, and FN are respectively true positives, true negatives, false positives, and false negatives. The "precision" metric, formulated as

precision = TP / (TP + FP),

shows the accuracy of the model's positive predictions. On the other hand, the "recall" metric demonstrates the model's reliability in predicting all positive samples and is formulated as

recall = TP / (TP + FN).

Finally, the F1 score is the harmonic mean of precision and recall, simplified to

F1 = 2 · precision · recall / (precision + recall).

However, we multiply F1 scores by one hundred to expose more details in the results.

This section focuses on spotting a proper architecture for the embedding neural network, validating different pair generation algorithms, comparing different classifiers and other state-of-the-art methods, and introducing the low-cost and hybrid models.
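As a quick sanity check of the metric definitions above, a toy `f1_score` helper (ours, not a library function) computes the reported quantity:

```python
def f1_score(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1), matching the
    formulas above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 99 anomalies caught, 1 false alarm, 1 missed anomaly;
# multiplied by one hundred as in the reported results
print(round(f1_score(tp=99, fp=1, fn=1) * 100, 2))  # 99.0
```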
As our method's heart is the embedding neural network trained inside the Siamese network, we want the embedding neural network to perform at its best. Spotting an optimal architecture and hyper-parameters is a challenging step in deep learning projects. So, we need an algorithm to find a suitable architecture and hyper-parameters.
Method:
Multiple algorithms, such as Grid Search, Random Search, Bayesian Optimization, and Evolutionary Optimization, have been proposed for tuning neural network architectures and hyper-parameters. However, we choose the Hyperband algorithm [33] for its performance and computational efficiency to attain a solid architecture and hyper-parameters. The Hyperband algorithm was executed three times (to avoid local optima) with default parameters on all available pairs in the training set to minimize the Siamese loss on the validation set. In other words, Hyperband used the training set to train multiple different architectures and the validation set to compare the architectures and find the best performer.
Findings:
Table 1 contains the details of the embedding neural network's architecture and hyper-parameters found by the Hyperband algorithm. It shows that multiple layers of LSTMs are required to achieve decent results, as sequences in the HDFS log anomaly detection dataset are quite complicated.

Table 1: The embedding neural network's architecture found by the Hyperband algorithm, described layer by layer.
Property     | 1st       | 2nd  | 3rd  | 4th  | 5th   | 6th   | 7th
Layer Type   | Embedding | LSTM | LSTM | LSTM | Dense | Dense | Dense
Output Units | 11        | 192  | 192  | 64   | 348   | 640   | 64
Activation   | N.A.      | Tanh | Tanh | Tanh | ReLU  | ReLU  | Linear
As discussed before, generating pairs using the All pair generation algorithm is computationally expensive. Therefore, we proposed an algorithm for generating pairs to reduce the computational cost. In this experiment, we aim to compare the two different algorithms for pair generation.
Method:
We trained two models with the same architecture found in the previous experiment. One trains on pairs generated using the All pair generation algorithm, while the other one's training pairs are generated using the KPOne algorithm. We tried different values for K in the KPOne pair generation algorithm and found out that K = 3 works best in our use case. After the training, we compare the Siamese network's loss value and the classifiers' accuracy across the two models. It must be stated that the test loss value is calculated after the hyper-parameter optimization process in the previous section. In fact, the Hyperband algorithm neither trained on nor targeted any pairs containing any sequence from the test set.

Findings:
The results, available in Table 2, show that the All pair generation algorithm results in a lower error for the Siamese network. However, Table 3 (in the next subsection) demonstrates that the classification result differences are negligible. All in all, considering the computational cost (more than 3,000 times the generated pairs), the All algorithm might not be a fitting choice for many cases.

Table 2: The Siamese network's loss using different pair generation algorithms.
Algorithm | Train Loss | Test Loss | Generated Pairs
All ∼

After training the embedding neural network inside the Siamese network, a classifier is needed to classify the embedded sequences. In this experiment, we aim to evaluate different classifiers for this purpose.
Method:
We pick Logistic Regression (LR), Support Vector Machine (SVM), K Nearest Neighbours (KNN), and a multi-layer neural network as the candidate classifiers. The neural network classifier consists of two layers. The first one is activated using the Rectified Linear Unit (ReLU), while the second layer leverages the sigmoid activation function for binary classification. We embed all training sequences into vectors and train the classifiers on them. During the test time, each sequence is embedded using the embedding neural network and passed to the classifier for prediction.
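To illustrate how any off-the-shelf classifier can operate on the well-separated embeddings, here is a minimal k-nearest-neighbour sketch in NumPy; `knn_predict` and the toy embeddings are ours, not the paper's implementation.

```python
import numpy as np

def knn_predict(train_vecs, train_labels, query, k=3):
    """Minimal k-nearest-neighbour classifier over embedded
    sequence vectors: majority vote of the k closest points."""
    dists = np.linalg.norm(train_vecs - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return int(np.round(train_labels[nearest].mean()))

# Toy, well-separated embeddings, as the Siamese network is trained
# to produce: non-anomaly (0) near the origin, anomaly (1) far away.
vecs = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                 [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(vecs, labels, np.array([4.9, 5.2])))  # 1 (anomaly)
```

Because the embedding already separates the classes, even this simple voting rule suffices, which is consistent with the near-identical scores across classifiers reported below.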
Findings:
As Table 3 shows, all classifiers achieve outstanding results, with an F1 score of 99.39 or better. Achieving accurate and consistent results with different classifiers shows that the embedding neural network works precisely and as expected. Since the multi-layer neural network performs better than the other classifiers, we choose it as our Best performer for upcoming experiments.

This section evaluates the Best performer from the previous subsection against state-of-the-art deep log anomaly detection approaches.
Method:
We bring the results of the best performers from the previous experiments and select DeepLog [23], LogRobust [15], LogAnomaly [25], and CNNLog [26] as the competitors. We also train a neural network with an architecture equal to combining the embedding and classifier neural networks as a single unit. This neural network, called the Feedforward model, allows us to investigate if utilizing the Siamese network yields any benefit.

Table 3: The accuracy comparison between different classifiers on the embedding neural network trained using different pair generation algorithms. The first column of results is for the model trained using the All pair generation algorithm, while the second is for the one trained using KPOne.

Classifier             | F1 Score (All) | F1 Score (KPOne)
K Nearest Neighbours   | 99.39          | 99.39
Support Vector Machine | 99.57          | 99.51
Neural Network         | 99.62          | 99.51
Logistic Regression    | 99.39          | 99.39
Findings: Table 4 shows that our Best performer outperforms all previous works as well as its Feedforward rival. Our Best performer has the F score of 99.62, followed by LogRobust with the F score of 99 (the source gives no decimals, so the actual value could be anywhere between 98.50 and 99.49), CNNLog (F score 98.5), the Feedforward model (F score 97.28), DeepLog (F score 96), and LogAnomaly (F score 95). To the best of our knowledge, our Best performer achieves the best results reported on the HDFS log anomaly detection dataset so far. Moreover, the Siamese network outperforming its Feedforward rival shows that applying the Siamese network yields an increase in the F score.

Table 4: The comparison of the Best performer from our approaches and other state-of-the-art deep methods. It should be noted that the numbers in this table are not multiplied by one hundred. Since reported numbers in other works were not as precise as ours, we could not demonstrate more precise metrics for them.

Method | Precision | Recall | F Score
DeepLog [23] | 0.95 | 0.96 | 0.96
LogRobust [15] | 0.98 | 1.00 | 0.99
LogAnomaly [25] | 0.96 | 0.94 | 0.95
CNNLog [26] | 0.977 | 0.993 | 0.985
Best performer | 0.9931 | 0.9994 | 0.9962
Feedforward model | 0.9924 | 0.9539 | 0.9728

In previous experiments, we found an architecture offering state-of-the-art performance for anomaly detection on the HDFS dataset. However, training a model with that architecture is expensive and was done in an HPC environment. In this experiment, we endeavor to handcraft a new architecture that is less taxing to train. After all, the software industry might not have the possibility or time to train models in an HPC environment. Furthermore, experiments, development, and utilization are cheaper and faster for a low-cost model.
Finally, as the low-cost model is capable of computationally more efficient inference, it demands less computational power, making it economical, fast, and scalable at production time. However, despite all these benefits, the low-cost model sacrifices some accuracy.
Method:
With the goal of finding a suitable architecture, we first handcraft different architectures that are significantly less expensive to train than the architecture found by the Hyperband. Later, we train all models using the KPOne pair generation algorithm with k = 3. In the end, we choose the best architecture according to the F score. Alongside the F score, we record two additional metrics for both models. The first metric is the number of floating-point operations (FLOPS) for one forward pass of the neural network. FLOPS is an implicit indication of computational cost during both development and production. Additionally, we calculate the number of parameters for each model. The number of parameters specifies the amount of memory required to store and load the model and explicitly affects the training speed. Finally, we compare training time on a typical deep learning machine's hardware (a 14-core Intel Xeon CPU with an Nvidia Tesla P100 GPU).

Findings:
Table 5 shows the chosen handcrafted architecture. Table 6 compares the Best performer model and the low-cost model in computational cost, model size, and accuracy. The comparison sheds light on the fact that despite being computationally far more affordable (three times fewer floating-point operations, 30 times fewer parameters, and a training time reduced by a factor of 13), the low-cost architecture does not considerably compromise the F score, which drops only from 99.62 to 98.78. For example, the low-cost model could be retrained overnight on typical hardware, while this is not possible for the Best performer. This makes it suitable for environments where logs evolve rapidly but lower accuracy is tolerated.

As previous experiments indicate, the best performer architecture is the classifier neural network on top of the embedding neural network. However, since the classifier and the embedding function are both neural networks, we strive to train them together, making end-to-end training possible. The end-to-end architecture may reduce design and engineering efforts as the classifier and embedding neural networks train simultaneously.
Table 5: The handcrafted embedding neural network’s architecture found bycross-validation between ten different candidate models.
Property | 1st | 2nd | 3rd | 4th | 5th
Layer Type | Embedding | Bi-LSTM | Dense | Dense | Dense
Output Units | 24 | 64 (32 × 2) | 64 | 64 | 64
Activation | N.A. | Tanh | Leaky ReLU | Leaky ReLU | Linear
Table 6: The comparison of the low-cost architecture with the architecture found by Hyperband. The FLOPS column indicates the number of floating-point operations required for the embedding neural network to transform a sequence into a vector. The Parameters column reveals the number of trainable parameters in each architecture, and the Training time column lists the required training time for each architecture.
Architecture | F Score | FLOPS | Parameters | Training time
Best performer | 99.62 | 222K | 805K | 150h 42min
Low-cost | 98.78 | 71K | 27K | 11h 17min
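The roughly 27K parameter figure for the low-cost model in Table 6 can be sanity-checked analytically from the layer sizes in Table 5. The sketch below counts only the recurrent and dense layers; the embedding lookup table is excluded because the vocabulary size is not stated in this section.

```python
def lstm_params(input_dim, hidden, bidirectional=True):
    # Each LSTM direction has 4 gates, each with an input weight matrix
    # (input_dim x hidden), a recurrent weight matrix (hidden x hidden),
    # and a bias vector of length hidden.
    per_direction = 4 * (hidden * (input_dim + hidden) + hidden)
    return per_direction * (2 if bidirectional else 1)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

# Table 5's architecture: 24-dim embeddings into a Bi-LSTM with 32 units
# per direction, followed by three 64-unit dense layers.
total = (lstm_params(24, 32)
         + dense_params(64, 64)   # Bi-LSTM output is 32 * 2 = 64 units
         + dense_params(64, 64)
         + dense_params(64, 64))
print(total)  # 27072, consistent with the ~27K reported in Table 6
```

The same bookkeeping explains why the Bi-LSTM dominates the budget: its gate matrices account for more than half of the trainable parameters.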
Method:
Before training, we place the classifier network after the last component of the embedding neural network in the Siamese network. Therefore, the modified Siamese network has two outputs. The first one is the similarity indicator, while the second one is the predicted label for the first entry of the Siamese network. Accordingly, the modified Siamese network's loss is the cumulative loss of the Siamese similarity and the cross-entropy classification. Figure 4 visualizes the architecture of the modified Siamese network. Furthermore, to analyze the impact of the Siamese network on accuracy, we compare the Hybrid model with another model with the same architecture but without the Siamese similarity loss, i.e., the Feedforward model mentioned in 5.4.
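The cumulative loss described above can be written down in a framework-agnostic way. This is a minimal NumPy illustration rather than the actual training code: it assumes the similarity head is a dot product followed by a sigmoid (as in Figure 4) and that both terms are binary cross-entropies.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(p, target):
    # Binary cross-entropy for a single probability p and 0/1 target.
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p))

def hybrid_loss(emb_a, emb_b, same_class, logit_a, label_a):
    """Sketch of the modified Siamese network's cumulative loss: the
    dot-product similarity of the two embeddings (through a sigmoid) is
    scored against the pair label, and the classifier head's prediction
    for the first entry is scored against its anomaly label."""
    similarity = sigmoid(np.dot(emb_a, emb_b))
    return bce(similarity, same_class) + bce(sigmoid(logit_a), label_a)
```

Minimizing the summed loss trains the embedding and the classifier simultaneously, which is what makes the end-to-end setup possible.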
Findings:
Table 7 confirms that the Hybrid model performs better than the Feedforward model and is almost on a par with the separately trained embedding and classifier neural networks (the Best performer), both performing at the F score of 0.99.

This section notes some practical advantages that become possible with the Siamese network. The first two advantages are related to log evolution, and the last one is related to log visualization.
Fig. 4: The modified Siamese network's architecture for end-to-end training: two shared-weight embedding neural networks whose output vectors are compared through a dot product followed by a sigmoid, with a classification neural network attached to the first branch.

Table 7: The comparison of the end-to-end training model (training the classifier alongside the embedding neural network) with the Best performer and the Feedforward model.
Model | Precision | Recall | F Score
Hybrid Model | 0.9927 | 0.9878 | 0.9902
Best performer | 0.9931 | 0.9994 | 0.9962
Feedforward model | 0.9924 | 0.9539 | 0.9728
Software logs continually evolve due to different execution environments or developers' updates. Moreover, [15] performed an empirical study confirming how software logs evolve. As training deep learning models is dramatically power-consuming, it is not feasible to retrain the model for every minor software update or change in the execution environment. Accordingly, [15] introduces three methods for emulating log evolution synthetically by adding noise to log sequences. It is not rational to train models on synthetically generated data; however, synthetically generated data may help in evaluating and analyzing model performance on evolved logs.
Method:
In this experiment, we apply the three methods of adding noise to log sequences mentioned in [15] to imitate the evolution of log sequences. The methods comprise duplicating, removing, and shuffling one or multiple elements of a sequence. Since generating a noisy dataset is a random process, we ran each test five times and took the average of the results.
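The three noise operations can be sketched as follows. Since [15] does not fully disclose its noise parameters, choices such as the shuffle window size below are our own assumptions.

```python
import random

def evolve(sequence, noise_ratio, seed=0):
    """Synthetic log evolution in the spirit of [15]: duplicate, remove,
    or shuffle events in a log sequence. The number of operations is
    proportional to the noise ratio; exact parameters are assumed."""
    rng = random.Random(seed)
    seq = list(sequence)
    n_ops = max(1, int(len(seq) * noise_ratio))
    for _ in range(n_ops):
        op = rng.choice(["duplicate", "remove", "shuffle"])
        if op == "duplicate":
            i = rng.randrange(len(seq))
            seq.insert(i, seq[i])
        elif op == "remove" and len(seq) > 1:
            seq.pop(rng.randrange(len(seq)))
        else:
            # Shuffle a small window of adjacent events.
            i = rng.randrange(len(seq))
            j = min(len(seq), i + 3)
            window = seq[i:j]
            rng.shuffle(window)
            seq[i:j] = window
    return seq
```

Running this over the test set at different noise ratios produces evolved sequences like the ones evaluated in Table 8, without touching the training data.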
Findings: Table 8 shows the classifiers' evaluation on synthetically evolved log sequences with different noise ratios. Harnessing the power of the Siamese network, the classifiers maintained their accuracy formidably despite the evolutions. The F score drops from 0.99 to 0.92 in all classifiers when moving from a noise ratio of 0% to a noise ratio of 30%. In previous works, an F score of 0.95 was observed at a noise ratio of 20%. However, a direct comparison is not possible since the noise parameters were not fully disclosed, and our setup might differ from their experiments. Nevertheless, showing the F score of 0.95 at the 20% noise ratio in Table 8, our method's robustness appears to be as strong as that of previous works.

We confirmed in the previous section that the proposed model is considerably robust to log sequence evolution. However, if log sequences continue to evolve, the retraining process is inevitable. Since the retraining process is computationally expensive and time-consuming, we strive to find a solution that minimizes the number of retraining rounds. More in-depth, we seek a numeric value that represents the trained model's reliability on evolved sequences. Although the F score is the best measurement of reliability, we do not possess the sequence labels to calculate the F score at production time, as the incoming data is completely new. It is, of course, possible to label the production data later down the road, yet labels are not available at the exact moment of production. Hence, we require a new metric that indicates reliability without any labeling requirement.

Method:
Since the embedding neural network transforms sequences into vectors, we may exploit the distribution of the embedded vectors to monitor the evolution of log sequences. Thus, we introduce the fitness score as an indication of evolution in log sequences. To calculate the fitness score, the training sequences are embedded into vectors using the embedding neural network and modeled by fitting a Gaussian mixture. Accordingly, the fitness score is computed as the average log-likelihood of the embedded vectors of evolved sequences. The more the log sequences evolve, the lower the fitness value will be. Possessing such a metric, we may define a threshold and avoid the retraining process for trivial evolutions in production. Moreover, we may retrain the model as soon as the fitness score crosses the threshold. Needless to say, the threshold might vary from task to task or even dataset to dataset.

Table 8: The evaluation results of different classifiers on synthetically evolved datasets. The noise ratio indicates the fraction of the test set samples that are affected by synthetic log evolutions.

Classifier | Metric | 0% | 5% | 10% | 20% | 30%
K Nearest Neighbors | Precision | 99.76 | 97.78 | 96.47 | 94.02 | 90.87
K Nearest Neighbors | Recall | 99.03 | 98.16 | 97.68 | 96.61 | 94.77
K Nearest Neighbors | F score | 99.39 | 97.97 | 97.07 | 95.30 | 92.77
Support Vector Machine | Precision | 99.76 | 97.83 | 96.56 | 93.98 | 90.76
Support Vector Machine | Recall | 99.03 | 98.16 | 97.72 | 96.71 | 94.92
Support Vector Machine | F score | 99.39 | 97.99 | 97.14 | 95.32 | 92.78
Neural Network | Precision | 99.31 | 97.69 | 96.52 | 93.67 | 90.51
Neural Network | Recall | 99.94 | 97.72 | 97.72 | 96.71 | 94.86
Neural Network | F score | 99.62 | 97.92 | 97.11 | 95.16 | 92.68
Logistic Regression | Precision | 99.76 | 97.02 | 96.65 | 94.33 | 91.58
Logistic Regression | Recall | 99.03 | 98.11 | 97.58 | 96.51 | 94.62
Logistic Regression | F score | 99.39 | 98.02 | 97.11 | 95.86 | 93.07
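A minimal sketch of the fitness score, using randomly generated vectors as stand-ins for the embedded training sequences (the real pipeline would first embed the sequences with the trained network):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-ins for the embedded training sequences.
train_vecs = rng.normal(0.0, 1.0, size=(500, 8))

# Model the training embeddings with a Gaussian mixture.
gmm = GaussianMixture(n_components=2, random_state=0).fit(train_vecs)

def fitness_score(embedded, model=gmm):
    """Average per-sample log-likelihood under the mixture fitted to the
    training embeddings; lower scores signal log evolution."""
    return model.score(embedded)

# In-distribution embeddings versus drifted (evolved) embeddings.
in_dist = rng.normal(0.0, 1.0, size=(100, 8))
drifted = rng.normal(2.0, 1.0, size=(100, 8))
```

Comparing `fitness_score(in_dist)` against `fitness_score(drifted)` shows the drop that would trigger retraining once it crosses a chosen threshold.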
Findings:
We used the previously mentioned methods to imitate log evolution and recorded the fitness score as the evolution increased. Figure 5 visualizes the drop in fitness score as the evolution grows. This consistent drop suggests that the fitness score is a reliable indicator.
In previous experiments, we proposed multiple methods of evaluating the authenticity and reliability of the embedding neural network and the model. However, human supervision of AI systems can bring brighter insights. One of the best aids to human supervision is visualization. Visualizing the embedding neural network's output gives humans the ability to supervise it and allows manual analysis.
Method:
As the trained embedding network allows us to transform log sequences into vectors, we can use dimension reduction algorithms such as T-SNE [34], UMAP [35], and PCA [36] to reduce the dimensions of the embedded sequences so that they become visualizable and perceptible for humans. Accordingly, we embed all sequences from the train and test sets into vectors, reduce their dimensions, and plot the results on a canvas.

Fig. 5: The drop in fitness score as the noise ratio increases. It should be noted that positive scores are due to computing scores using the probability density function.
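The embed-then-reduce pipeline can be sketched with scikit-learn. UMAP is omitted here because it requires a third-party package, and the embeddings below are random stand-ins for the trained network's outputs:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-ins for embedded sequences of both classes.
normal = rng.normal(0.0, 1.0, size=(150, 16))
anomal = rng.normal(4.0, 1.0, size=(50, 16))
X = np.vstack([normal, anomal])

# Reduce the 16-dim embeddings to 2-D points for plotting.
pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, random_state=0, perplexity=30).fit_transform(X)
# The 2-D points can then be scattered on a canvas, colored by class,
# e.g. with matplotlib: plt.scatter(pca_2d[:, 0], pca_2d[:, 1], c=labels).
```

Any reducer that preserves the cluster structure will do; if the classes are separable in the embedding space, they remain visibly separable after reduction, as Figure 6 shows.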
Findings:
Figure 6 visualizes the embedded sequences using different dimension reduction methods. The embedded non-anomaly sequences are colored blue, while the anomalous ones are colored red. The figure demonstrates that embedded sequences of different classes (non-anomaly/anomaly) are readily separable regardless of the dimension reduction algorithm. This fact might explain the high accuracy among all the classifiers. (a) PCA (b) UMAP (c) T-SNE
Fig. 6: The visualization of vectors embedded by the embedding neural network using different dimension reduction algorithms.
This paper proposed a novel approach to detect anomalies in software execution logs using a Siamese network structure with LSTM layers. We compared the results with state-of-the-art deep-learning-based methods on the HDFS log anomaly detection dataset and showed that the proposed method achieves the best results on the aforementioned dataset. We conclude that the ability to achieve state-of-the-art performance is due to the Siamese network, as the Feedforward neural network with the same architecture offered a considerably lower F1-score (0.996 vs. 0.973). Furthermore, we proposed a novel algorithm for generating pairs to train the Siamese network, which reduces the training process's computational cost while maintaining accuracy. We also showed that the Siamese network achieves satisfactory results with smaller and cheaper neural networks as well. Moreover, we introduced multiple practical advantages of the Siamese network. We assessed the robustness of our model to log evolutions. Additionally, we introduced an unsupervised method for measuring log evolution. Finally, we visualized the embedding function's output vectors using dimension reduction algorithms to make the neural network's output more perceptible.

Although we introduced various applications for the Siamese network alongside anomaly detection, interesting future investigations remain. Future work may focus on side applications of the Siamese network, such as Root Cause Analysis. More computationally cost-efficient neural networks, such as CNNs, might be applied inside the Siamese neural network to further reduce the computational cost in future studies.
This work has been supported by the Academy of Finland (grant IDs 298020and 328058). Additionally, the authors gratefully acknowledge CSC – IT Centerfor Science, Finland, for their generous computational resources.
References
1. G. Esteves, E. Figueiredo, A. Veloso, M. Viggiato, and N. Ziviani, "Understanding machine learning software defect predictions," Automated Software Engineering, vol. 27, pp. 369–392, Dec 2020.
2. M. Velez, P. Jamshidi, F. Sattler, N. Siegmund, S. Apel, and C. Kästner, "Configcrusher: towards white-box performance analysis for configurable systems," Automated Software Engineering, vol. 27, pp. 265–300, Dec 2020.
3. X. Xia, D. Lo, E. Shihab, X. Wang, and B. Zhou, "Automatic, high accuracy prediction of reopened bugs," Automated Software Engineering, vol. 22, pp. 75–109, Mar 2015.
4. R. Chalapathy and S. Chawla, "Deep learning for anomaly detection: A survey," arXiv preprint arXiv:1901.03407, 2019.
5. J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, "Signature verification using a 'siamese' time delay neural network," in Advances in Neural Information Processing Systems, pp. 737–744, 1994.
6. S. Chopra, R. Hadsell, and Y. LeCun, "Learning a similarity metric discriminatively, with application to face verification," in , vol. 1, pp. 539–546, IEEE, 2005.
7. W. Wang, J. Yang, J. Xiao, S. Li, and D. Zhou, "Face recognition based on deep learning," in International Conference on Human Centered Computing, pp. 812–820, Springer, 2014.
8. S. Dey, A. Dutta, J. I. Toledo, S. K. Ghosh, J. Lladós, and U. Pal, "Signet: Convolutional siamese network for writer independent offline signature verification," arXiv preprint arXiv:1707.02131, 2017.
9. K. Ahrabian and B. BabaAli, "Usage of autoencoders and siamese networks for online handwritten signature verification," Neural Computing and Applications, vol. 31, no. 12, pp. 9321–9334, 2019.
10. Y. Zhang, L. Wang, J. Qi, D. Wang, M. Feng, and H. Lu, "Structured siamese network for real-time visual tracking," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 351–366, 2018.
11. L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, "Fully-convolutional siamese networks for object tracking," in European Conference on Computer Vision, pp. 850–865, Springer, 2016.
12. Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang, "Learning dynamic siamese network for visual object tracking," in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
13. B. Wilkins, C. Watkins, and K. Stathis, "Anomaly detection in video games," arXiv preprint arXiv:2005.10211, 2020.
14. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems 26 (C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds.), pp. 3111–3119, Curran Associates, Inc., 2013.
15. X. Zhang, Y. Xu, Q. Lin, B. Qiao, H. Zhang, Y. Dang, C. Xie, X. Yang, Q. Cheng, Z. Li, et al., "Robust log-based anomaly detection on unstable log data," in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 807–817, 2019.
16. J. Zhu, S. He, J. Liu, P. He, Q. Xie, Z. Zheng, and M. R. Lyu, "Tools and benchmarks for automated log parsing," in , pp. 121–130, IEEE, 2019.
17. W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan, "Detecting large-scale system problems by mining console logs," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pp. 117–132, 2009.
18. J.-G. Lou, Q. Fu, S. Yang, Y. Xu, and J. Li, "Mining invariants from console logs for system problem detection," in USENIX Annual Technical Conference, pp. 1–14, 2010.
19. X. Yu, P. Joshi, J. Xu, G. Jin, H. Zhang, and G. Jiang, "Cloudseer: Workflow monitoring of cloud infrastructures via interleaved logs," ACM SIGARCH Computer Architecture News, vol. 44, no. 2, pp. 489–502, 2016.
20. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
21. J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
22. Y. LeCun et al., "Lenet-5, convolutional neural networks," URL: http://yann.lecun.com/exdb/lenet, vol. 20, no. 5, p. 14, 2015.
23. M. Du, F. Li, G. Zheng, and V. Srikumar, "Deeplog: Anomaly detection and diagnosis from system logs through deep learning," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1285–1298, 2017.
24. M. Leppänen, S. Mäkinen, M. Pagels, V.-P. Eloranta, J. Itkonen, M. V. Mäntylä, and T. Männistö, "The highways and country roads to continuous deployment," IEEE Software, vol. 32, no. 2, pp. 64–72, 2015.
25. W. Meng, Y. Liu, Y. Zhu, S. Zhang, D. Pei, Y. Liu, Y. Chen, R. Zhang, S. Tao, P. Sun, et al., "Loganomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs," in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, vol. 7, pp. 4739–4745, 2019.
26. S. Lu, X. Wei, Y. Li, and L. Wang, "Detecting anomaly in big data system logs using convolutional neural network," in , pp. 151–158, IEEE, 2018.
27. S. He, J. Zhu, P. He, and M. R. Lyu, "Loghub: A large collection of system log datasets towards automated log analytics," 2020.
28. W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan, "Detecting large-scale system problems by mining console logs," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pp. 117–132, 2009.
29. P. He, J. Zhu, Z. Zheng, and M. R. Lyu, "Drain: An online log parsing approach with fixed depth tree," in , pp. 33–40, 2017.
30. D. Sun, Z. Wu, Y. Wang, Q. Lv, and B. Hu, "Risk prediction for imbalanced data in cyber security: A siamese network-based deep learning classification framework," in , pp. 1–8, 2019.
31. R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in , vol. 2, pp. 1735–1742, IEEE, 2006.
32. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
33. L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar, "Hyperband: A novel bandit-based approach to hyperparameter optimization," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6765–6816, 2017.
34. L. v. d. Maaten and G. Hinton, "Visualizing data using t-sne," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
35. L. McInnes, J. Healy, and J. Melville, "Umap: Uniform manifold approximation and projection for dimension reduction," arXiv preprint arXiv:1802.03426, 2018.
36. H. Abdi and L. J. Williams, "Principal component analysis,"