Robust and Transferable Anomaly Detection in Log Data using Pre-Trained Language Models
Harold Ott∗, Jasmin Bogatinovski∗, Alexander Acker, Sasho Nedelkoski, Odej Kao
Distributed and Operating Systems, TU Berlin, Berlin, Germany
Email: {jasmin.bogatinovski, alexander.acker, nedelkoski, odej.kao}@tu-berlin.de
∗ Equal contribution.
Abstract—Anomalies or failures in large computer systems, such as the cloud, have an impact on a large number of users that communicate, compute, and store information. Therefore, timely and accurate anomaly detection is necessary for reliability, security, safe operation, and mitigation of losses in these increasingly important systems. Recently, the evolution of the software industry has opened up several problems that need to be tackled, including (1) addressing software evolution due to software upgrades, and (2) solving the cold-start problem, where data from the system of interest is not available. In this paper, we propose a framework for anomaly detection in log data, as a major troubleshooting source of system information. To that end, we utilize pre-trained general-purpose language models to preserve the semantics of log messages and map them into log vector embeddings. The key idea is that these representations of the logs are robust and less sensitive to changes in the logs, and therefore result in better generalization of the anomaly detection models. We perform several experiments on a cloud dataset, evaluating different language models for obtaining numerical log representations, such as BERT, GPT-2, and XL. The robustness is evaluated by gradually altering log messages to simulate a change in semantics. Our results show that the proposed approach achieves high performance and robustness, which opens up possibilities for future research in this direction.
Index Terms—anomaly detection, log analysis, deep learning, language models, transfer learning
I. INTRODUCTION
Modern computer systems such as cloud platforms are a combination of complex multi-layered software and hardware. The complexity implies a high maintenance overhead for the operators of these systems, making manual operation cumbersome. In extreme cases, where system anomalies or failures happen, it can lead to SLA violations. Large service providers are aware of such cases and make the automation of operation and maintenance tasks a priority.
Recently, a plethora of methods were introduced to automate and provide scalable AI-driven solutions for a range of operational tasks, including anomaly detection and failure analysis [1]–[3]. The foundation of these methods is the system data. Although there are various data sources describing system behaviour, system logs are an omnipresent data source [3], [4]. They are one of the most used data sources for troubleshooting.
The evolution of the software industry opens up several problems that need to be tackled. The detection of abnormal system behaviour is one of them. When considering anomaly detection models for log data, two of the most important challenges are (1) addressing software evolution due to software upgrades, and (2) solving the cold-start problem [5]. In both cases, anomaly detection models have to be dynamically optimized and adapted to the new setting. Exposing the underlying properties of the log messages in a system-agnostic manner (e.g. semantics, length, etc.) arises as an important requirement for anomaly detection methods utilizing system logs.
On the contrary, many of the existing approaches are based on the invariant assumption, i.e. log templates never change. Furthermore, they rely on the assumption of capturing all possible variations of log messages. Approaches such as matching certain keywords (e.g. "error"), constructing a black-list of log events, or matching anomalous regular expressions are ineffective under the circumstances of constant system evolution.
They usually lead to many unnecessary alarms, a problem known as alarm fatigue.
To mitigate the drawbacks of the invariant assumption, we propose an anomaly detection framework capable of preserving the shared properties between the log messages. More specifically, we utilize transfer learning and deep language modeling to learn a robust, context-aware representation of the log messages. Whenever a new log line is introduced, the framework assigns a numerical-vector representation to it, utilizing prior information from all the previously presented log messages. As such, it is effective in reducing the cold-start problem the anomaly detection model faces after an update. Over time, the framework reuses the accumulated knowledge about the log messages to improve the performance and the underlying representation. In a nutshell, the framework provides a mechanism to transfer knowledge from previous log messages and automatically detect anomalies in logs affected by pre-processing noise and changes of log events caused by updates of the underlying software.
The contributions of this work are summarized as follows.
1) A general framework for learning context- and semantic-aware numerical log vector representations suitable for anomaly detection.
2) Comparison of three semantic-level general-purpose language embedding models for anomaly detection.
3) Comparison of two learning objectives for anomaly detection utilizing general language models.
4) A robust model transfer approach for reduction of the false positive rate after a software update.
5) A publicly available implementation of the method and the datasets.
The remainder of the paper is structured as follows. In Section II, we provide the related work for anomaly detection in log data. In Section III, we present the preliminaries and the proposed framework. Section IV summarizes the evaluation of the different language models and learning objectives, as well as model transfer. Section V concludes the paper.
II. RELATED WORK
As a major data type describing system behaviour, the literature recognizes sustained utilization of log data for anomaly detection in both industry and academia [3]–[7]. The work on anomaly detection from log data follows two general directions: supervised and unsupervised methods. In this work, our focus is on unsupervised learning approaches.
The unsupervised approaches have greater practical relevance, because labeling of log messages is an expensive procedure. A number of approaches have been developed using the log event count within a certain window to transform log messages into numerical representations. Xu et al. [6] proposed using the Principal Component Analysis (PCA) method on such vectors. It follows the standard machine learning technique of investigating the second norm of the lower principal components to decide if the log is normal or an anomaly.
There are several works for log anomaly detection that utilize deep learning approaches. For example, Zhang et al. [8] used an LSTM to predict subsequent log events based on a window of preceding events. The ability to correctly predict the next event is used to determine anomalous events. DeepLog [4] utilizes a similar method. It is claimed that robustness to novel events is achieved by a synonym/antonym database that is used to generate auxiliary samples. Vinayakumar et al. [9] trained a stacked LSTM to model the operation log samples of normal and anomalous events. However, the input to the unsupervised methods is a one-hot vector of logs representing the indices of the log templates. Therefore, they cannot cope with newly appearing log events.
Several studies [5] have leveraged NLP techniques to analyze log data, based on the idea that a log is a natural language sequence. Those works utilize word embeddings which are later averaged in order to represent the full log message. Non-learnable aggregation is a heuristic that often does not hold when going from words to sentences [10].
Different from all the above methods, we utilize state-of-the-art language models for obtaining numerical representations of log messages. This enables end-to-end trainable vector representations that can be used in various recurrent networks, e.g. the Bi-LSTM [11], for anomaly detection. The log representations are robust to semantic-invariant changes of the log messages, providing good generalization.
https://github.com/haraldott/anomaly detection main
III. ROBUST LOG ANOMALY DETECTION
The architecture of the framework is presented in Figure 2. It is composed of two phases: an offline training phase and online anomaly detection.
The training phase is composed of the following steps. First, the raw log messages from the system are preprocessed. This includes a transformation of the log into a template and a variable part (e.g., VM Creation took 8 seconds; template: "VM Creation took * seconds", variables: [8]). Each of the templates is then transformed into a numerical vector using language models such as BERT, GPT, and XL [12]–[14]. Utilizing the pre-trained embeddings from these general-purpose models aims to capture the semantic properties of the log messages, important for generalization over different data [5]. In the second step, we chain the embedding vectors through time in a recurrent neural network (Bi-LSTM [11]) that learns the normal system behaviour. It is then utilized for anomaly detection by detecting deviations from the expected system behaviour. This enables robust detection of sequential anomalies. It is important to note that the neural network is trained on pre-trained numerical representations; therefore, it largely facilitates the transfer of the learned model to new log data that can appear due to software upgrades or due to cold-start problems.
In the prediction phase, the log messages are transformed into log vectors via the same preprocessing steps as in training. Then, the sequences of such log vectors are provided as input to the anomaly detection model. The prediction from the sequential model is utilized to decide if the input sequence is normal or not. In the following, we describe each of the parts in detail.
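The template/variable split illustrated above can be sketched as follows. This is a minimal illustration, not the parser used in the framework: as a simplification, any purely numeric token is treated as a variable, whereas a real parser such as Drain builds a fixed-depth tree over the token stream.

```python
import re

def extract_template(message):
    """Split a raw log message into a template and its variable parts.

    Simplification for illustration: every purely numeric token is
    considered a variable and masked with '*'.
    """
    template_tokens, variables = [], []
    for tok in message.split():
        if re.fullmatch(r"\d+(\.\d+)?", tok):
            template_tokens.append("*")
            variables.append(tok)
        else:
            template_tokens.append(tok)
    return " ".join(template_tokens), variables

template, variables = extract_template("VM Creation took 8 seconds")
# template  -> "VM Creation took * seconds"
# variables -> ["8"]
```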
A. Log preprocessing and log vectorization
The raw log messages generated by the systems are noisy and semi-structured. To structure them and to obtain the information from the logs needed for the anomaly detection model, they need to be parsed [3]. Log parsing provides a mapping of the raw log messages into log templates, i.e. the log instructions in the source code. In this work, we adopt Drain [15], due to its speed and efficiency.
Next, the log templates are transformed into numerical vectors. Formally, we write a log vector (embedding) as w_i ∈ R^d, where d is the size of the vector embedding. The goal of the log vectorization is to preserve important properties of log messages and to distinguish normal from anomalous log messages.
Fig. 1. Log vector representation using invariant embeddings (left) and semantically-aware embeddings (right).
Fig. 2. Overview of the framework utilizing sentence-level pre-trained language models.
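The template-to-embedding table shown in Fig. 2 can be sketched as below. The `toy_embedding` function is a deterministic stand-in, purely for illustration: in the actual framework the d-dimensional vector comes from a sentence-level language model (BERT, GPT-2, or XL), not from a hash.

```python
import hashlib
import math

def toy_embedding(template, d=8):
    """Deterministic stand-in for a sentence-level language model.

    Maps each template to a reproducible unit vector of dimension d,
    so the rest of the pipeline can be exercised without loading a
    heavyweight pre-trained model.
    """
    digest = hashlib.sha256(template.encode()).digest()
    vec = [b / 255.0 - 0.5 for b in digest[:d]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Template -> embedding table, as depicted in Fig. 2.
templates = ["Took * seconds to build instance", "VM Fatal error"]
table = {t: toy_embedding(t) for t in templates}
```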
To better illustrate the importance of the log vectors, in Figure 1 we provide a visual comparison between normal and anomalous log messages when standard one-hot encoding is utilized against vector embeddings obtained from pre-trained methods. Improvements in the log vectorization translate to improvements in the robustness and generalization of the anomaly detection models.
To that end, we formalize two properties that a log vector embedding should possess. (1) Distinguishable: the log vectors should represent semantic differences between the log messages. For example, VM Create finished and VM Fatal error are templates with different semantics, even though they share common words. (2) Tolerance: the embeddings should represent the similarity between different templates with the same or very similar semantics. For example, VM Create finished is semantically very similar to VM Create completed.
To preserve both properties, we refer to the natural language models, where these properties are one of the major subjects of research. Exploiting general-purpose language models, which are pre-trained on large corpora of texts (e.g., Wikipedia), enables preserving general textual structures. We focus on utilizing sentence-level embeddings, in contrast to word-level embeddings. Sentence-level embeddings provide a direct and efficient mapping from log message to log vector, without any intermediate steps (e.g. averaging of word vectors).
B. Bi-LSTM for Sequential Anomaly Detection
Once obtained, the log vectors are grouped (by timestamp) into sequences of size δ (window size), i.e., the sequences are formed of consecutive log vector embeddings, w_i : w_{i+δ}. We define two learning tasks which are utilized to learn the normal sequences of log messages: (1) prediction of the log template as a class (classification, via minimization of the cross-entropy loss), and (2) prediction of the log vector (regression, via minimization of the mean squared error) of the log message that appears at the next position in the sequence, w_{i+δ+1}, given the w_i : w_{i+δ} sequence as input.
Figure 3 depicts the overview of the Bi-LSTM model used to optimize the objectives [11]. The input data is passed to the forward and backward layers of the Bi-LSTM. We selected this model for learning the sequences as it offers a two-sided view and improved properties for sequence learning, in comparison to single LSTM networks.
The output of the Bi-LSTM network is an abstract numerical representation of the input sequence, which is then utilized for optimizing the objective. The subsequent two linear layers apply a transformation to acquire the desired dimensions, i.e., d for regression and n (the number of classes) for classification. Finally, an activation function f is applied to the output of the last linear layer. We use cross-entropy for classification and mean squared error for regression.
Anomaly Detection using Multi-Class Classification. For this learning objective, we use all available log templates as target classes (n in total). The training is performed on the assumption that the data contains an abundance of normal log messages, while in the prediction phase, the input data contains normal and anomalous log templates.
One major issue in this setup is that the "closed-world" classification objective requires a priori knowledge of all log templates. However, during prediction, it is expected that novel log templates will emerge.
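The sequence construction described at the start of this subsection can be sketched as follows. The indexing convention is simplified to half-open slices: each window of δ consecutive embeddings is paired with the embedding that follows it, which is the prediction target for both learning tasks.

```python
def make_sequences(embeddings, delta):
    """Slice a stream of log vectors into (input window, target) pairs.

    Each window of `delta` consecutive embeddings is paired with the
    embedding at the next position, following the paper's
    w_{i:i+delta} -> w_{i+delta+1} scheme (half-open slice convention).
    """
    pairs = []
    for i in range(len(embeddings) - delta):
        window = embeddings[i:i + delta]
        target = embeddings[i + delta]
        pairs.append((window, target))
    return pairs

stream = [[0.1], [0.2], [0.3], [0.4], [0.5]]
pairs = make_sequences(stream, delta=2)
# first pair: ([[0.1], [0.2]], [0.3])
```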
To address the absence of all templates at the prediction phase, we apply a nearest template matching procedure to mitigate this limitation.

Fig. 3. Unfolded Bi-LSTM model used for anomaly detection of the embedding sequence.

In the template matching procedure, we calculate the distance between the embedding of the novel template and all of the known target embeddings. The novel template is assigned the class of the known target template with the smallest distance. To prevent matching on arbitrary novelties, a parameter maximal distance is introduced. When the minimal distance from the novel template to the set of known templates is greater than the maximal distance, the novel template is discarded and an anomaly label is directly assigned. The matching process is applied on w_{i+δ+1} in order to obtain t_{i+δ+1}.
After the matching process, the model predicts a probability distribution Pr[t_{i+δ+1} | w_{i:i+δ}] = (p(t) | ∀t ∈ T). It describes the probability of a template t ∈ T to occur as a successor of the templates w_{i:i+δ}. Due to the noise in the sequential appearance of the templates, we consider the top-k (out of |T|) templates with the highest probabilities as relevant candidates for the next template. If the actual template class t_{i+δ+1} is within the top-k predictions with the highest probability, we consider it as normal. Otherwise, it is labeled as an anomaly.
Anomaly Detection using Log Vector Regression. For the regression learning objective, the neural network is trained to minimize the mean squared error (MSE). The input of the network is a sequence of vector embeddings for the templates, while the corresponding target value for the sequence is the vector embedding of the next template. Compared to classification, the log vector embeddings for regression are always obtainable.
After the model is trained, the parameters for the anomaly detection model are calculated. The regression anomaly detection module has as a parameter the q-th percentile of the squared error for the training samples.
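Returning to the classification objective, the nearest-template matching and top-k decision described above can be sketched as below. This is an illustrative toy version: the distance metric (Euclidean here) and all names are assumptions, and the probability distribution would come from the Bi-LSTM rather than being hand-written.

```python
import math

def nearest_template(embedding, known, max_distance):
    """Match a (possibly novel) template embedding to the closest known
    class, or flag it as an anomaly outright.

    `known` maps class id -> embedding. If the smallest distance
    exceeds `max_distance`, the template is discarded and labeled
    anomalous directly (returned as None).
    """
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    cls, d = min(((c, dist(embedding, e)) for c, e in known.items()),
                 key=lambda pair: pair[1])
    return cls if d <= max_distance else None  # None -> anomaly

def is_normal(probabilities, actual_class, k):
    """Top-k decision rule: the sequence is normal iff the actual next
    template is among the k most probable predicted classes."""
    top_k = sorted(probabilities, key=probabilities.get, reverse=True)[:k]
    return actual_class in top_k

known = {0: [0.0, 0.0], 1: [1.0, 1.0]}
assert nearest_template([0.9, 1.1], known, max_distance=0.5) == 1
assert nearest_template([5.0, 5.0], known, max_distance=0.5) is None

probs = {0: 0.1, 1: 0.6, 2: 0.3}   # hypothetical model output
assert is_normal(probs, actual_class=2, k=2) is True
assert is_normal(probs, actual_class=0, k=2) is False
```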
The mean squared error between every target template at position i + δ + 1 in the training sequences and the neural network's predicted template vector is computed. Afterwards, the q-th percentile of the aggregated loss values of the training dataset is computed. To detect an anomaly when a novel sample from the test dataset is introduced, the squared error between the predicted template and the vector embedding of the nearest matched template is computed. The system then marks every log event whose embedding loss value is above the calculated q-th percentile as an anomaly, and as normal otherwise.
C. Model Transfer
Utilizing pre-trained general-purpose language models for extracting log representations and training the Bi-LSTM model allows the transfer of the model to new, unseen logs. The model transfer is achieved in the following way.
Let dataset A be the training dataset from already known log messages, and dataset B be a dataset from an updated or new service or system. After the preprocessing, the model is trained on dataset A. Then, the following steps are executed. First, every log event of dataset B is mapped to its nearest neighbour in dataset A, i.e. the embedding with the shortest cosine distance. In the case of classification, it gets assigned the same target class. Second, few-shot training on dataset B is executed. Finally, with the model adjusted on the training part of dataset B, the prediction phase on a test part of dataset B is executed as previously described for the classification and regression learning objectives.
The initial training on dataset A preserves semantic and contextual information from previous log messages. The few-shot training on dataset B allows the model to adapt to the specifics of dataset B and improve the anomaly detection results.
IV. EVALUATION
To demonstrate the usefulness of our framework for anomaly detection and the transferability of the models across different software deployments, we conducted two evaluation experiments. In the first experimental scenario, we investigate how effective the representations from sentence-level language models are for anomaly detection on 1) ground truth anomalies and 2) synthetic anomalies obtained via log alteration. In the second experimental scenario, we evaluate the transferability of the models during software updates.
A. Log Datasets
For our experiments, we utilize the CloudLab OpenStack log dataset available at Loghub [4]. It is composed of two sets of experiments. During the first set of experiments, the OpenStack instances were created and their runtime was monitored. The second set of experiments is similar to the former; however, occasionally anomalies were injected. We refer to the first dataset as the normal dataset and to the second one as the anomalous dataset. Furthermore, to evaluate the framework, we additionally manipulated the normal dataset and created two additional test sets, described in the following.
Log alteration. To evaluate the feasibility of sentence-level embeddings for anomaly detection in log data, we augmented our data with a synthetic dataset. We refer to this data as log alteration data. We identified two points of alteration in the log messages: semantic and contextual alteration. The alterations are applied to normal data. Therefore, the overall anomaly detection model should be robust against these alterations. Classifying such altered log messages as anomalies is considered a false alarm. For both the semantic and structural changes we identified 3 types of alteration, namely: deletion, swap and imputation.
For the semantic changes, we assume a log event to contain n tokens originating from the normal data. The deletion operation involves deleting l randomly selected words in the log message. The swap operation involves replacing l tokens with a random token. The imputation operation involves imputing l words at random positions of the original log event. The parameter l controls the intensity of the alteration. It is expected that log events with higher alteration intensity have a higher probability of being detected as anomalies compared to events that were altered with lower intensity.
For the structural changes, we assume a log sequence to contain m log templates originating from the normal data. The deletion operation involves deleting l random log events from the sequence. The swap operation involves moving the k templates appearing after a randomly selected index i to a randomly chosen index j, where j < i. The imputation operation for sequences involves selecting an index at position i and repeating it l times consecutively. The parameter l controls the number of imputations. It is expected that log sequences with higher alteration intensity have a higher probability of being detected as anomalies compared to sequences that were altered with lower intensity.

TABLE I: COMPARISON OF THE SENTENCE-LEVEL LANGUAGE EMBEDDING MODELS FOR THE TASK OF ANOMALY DETECTION.

learning objective | type of experiment | Precision (GPT-2 / XL / BERT) | Recall (GPT-2 / XL / BERT) | F1 (GPT-2 / XL / BERT)
regression         | semantic           | 0.88 / 0.21 / 0.43            | 1.00 / 0.63 / 1.00          | 0.94 / 0.31 / 0.56
regression         | sequential         | 0.79 / 0.32 / 0.49            | 1.00 / 0.61 / 1.00          | 0.87 / 0.42 / 0.66
classification     | semantic           | 0.24 / 0.26 / 0.37            | 0.70 / 1.00 / 1.00          | 0.36 / 0.41 / 0.54
classification     | sequential         | 0.31 / 0.36 / 0.50            | 0.70 / 1.00 / 1.00          | 0.43 / 0.53 / 0.67

Augmentation to Simulate a Different Dataset. Since the software is often updated and thus changed constantly by developers, log statements are also subject to change. To simulate the evolution of the system, we constructed an artificial dataset that simulates changed log messages. We constructed two datasets, referred to as dataset A and dataset B, in the following manner. We start with the normal data, which we refer to as dataset A. First, we randomly sample p% of the logs in A. Second, the sampled log lines are altered using the three semantic alteration techniques with additional word augmentation. The alteration parameters are set to random values in the range of 5-100% of the allowed values for the altering parameters. This allows simulating a different dataset. We refer to this altered dataset as dataset B. Finally, we create two versions of dataset B. If the alteration is not severe (e.g. 20% of the log messages are changed), the dataset is referred to as
B-different. The datasets A and B are used for transferring the contextual and semantic accumulated knowledge in the following way. The model is trained on dataset A for e epochs (60 in our study). Then part of dataset B is used to conduct few-shot training. The final evaluation is done on the task of anomaly detection on the second part of dataset B.
B. Semantic-level language embedding evaluation
This section presents the evaluation results. We first evaluate the sentence-level embedding capabilities of the different language models for anomaly detection, independent of the learning task. Namely, we compare BERT, XL-Transformers and GPT-2 on the regression-based approach and the classification-based approach for anomaly detection. Afterwards, the results of the evaluation using the model transfer learning approach are presented.
1) Regression-based anomaly detection:
TABLE I lists the results from the comparison of the three language models on the task of anomaly detection. We divided the experiments into two subsets according to the type of alteration. Semantic alteration is related to the semantic changes of the log messages, while sequential alteration is related to the structural changes of the log sequences, described previously. For the semantically altered log messages, GPT-2 yields better results compared to BERT and XL-Transformers with regard to all metrics. For the sequential altering of the log messages, there is a small drop in F1-score and precision for the GPT-2 embeddings. However, the same metrics increase for the BERT and XL embeddings. The results from both scenarios imply that GPT-2 and BERT embeddings are more robust when either semantic or sequential changes are considered.
2) Classification:
When considering the classification task, we conducted the same two separate experiments as in the case of regression. For semantically altered log messages the scores are reversed. More specifically, BERT shows the best results, followed by XL and GPT-2. The same pattern appears when considering the sequential learning scenario. A general comparison of the scores between the regression and classification tasks shows that the GPT-2 embeddings are highly affected by the optimization objective. The definition of the problem as a classification task is favourable when considering structural and sequential changes.
C. Model Transfer Evaluation
For the evaluation of model transfer, we conducted two experiments, for both the regression and classification learning objectives on the task of anomaly detection. The results are listed in TABLE II.
For the regression learning objective, when considering the large alteration, it can be seen that both GPT-2 and BERT perform well. However, when considering the small alteration, although the BERT embeddings still retain a high score, GPT-2 tends to produce weaker results. On the contrary, while performing well on the task of similar log messages, XL-Transformers fails when the changes of the log messages are drastic.
For the classification learning objective, when considering both large and small alterations, the model utilizing BERT tends to outperform the remaining two. Comparing XL-Transformers and GPT-2, it can be observed that the former outperforms the latter. Comparing the results across the learning objectives, it can be observed that the classification problem definition slightly outperforms the definition of the problem as a regression task.
D. Discussion
TABLE II: EVALUATION RESULTS FOR THE MODEL TRANSFER AFTER SOFTWARE UPDATES. THE PERCENTAGE OF ALTERED LOG MESSAGES IS p = 15%.

learning objective | type of experiment | Precision (GPT-2 / XL / BERT) | Recall (GPT-2 / XL / BERT) | F1 (GPT-2 / XL / BERT)
regression         | B-similar          | 0.23 / 0.45 / 0.58            | 0.05 / 0.70 / 0.70          | 0.08 / 0.55 / 0.63
regression         | B-different        | 0.94 / 0.18 / 0.52            | 1.00 / 0.47 / 1.00          | 0.97 / 0.26 / 0.68
classification     | B-similar          | 0.27 / 0.53 / 0.61            | 1.00 / 1.00 / 1.00          | 0.43 / 0.69 / 0.75
classification     | B-different        | 0.09 / 0.23 / 0.68            | 1.00 / 1.00 / 1.00          | 0.17 / 0.38 / 0.81

The good results from both the classification and regression learning objectives show that the framework is useful for anomaly detection in settings where the data evolves through time. When evaluating the different forms of alteration of the log messages and sequences of log messages, BERT, as a general-purpose language model with sentence-level embeddings, performs most consistently and robustly across the two learning objectives. It is followed by XL-Transformers and GPT-2, in that order. GPT-2 shows strong results in the experiments with the regression learning objective, but is not as competitive for the classification learning objective. Similar observations can be made for model transfer in settings where there are both small and large changes in the log messages.
Comparison of the different learning objectives shows that defining the learning task as a classification problem can produce better results compared to defining it as a regression problem. This is an interesting result of this study.
The plug-and-play strategy allows for testing different language models. As seen from the results, the different language models can highly influence the quality of the anomaly detection results, with different embeddings having strengths and weaknesses in different categories. Improving the NLP language models, e.g. via increasing the number of parameters [16], may result in even better performance.
V. CONCLUSION
This paper addresses the problem of log anomaly detection in large computer systems. We addressed the generalization problem for anomaly detection on unseen logs by introducing a plug-and-play framework that utilizes pre-trained language models for obtaining numerical, semantically-aware embeddings for log events. A Bi-LSTM neural network is used as a method for exploiting contextual properties of log messages in the task of anomaly detection. Empirically, we show that the proposed approach is robust to alterations in the log messages – scenarios frequently occurring in practice due to software updates and the deployment of new services or systems. The results show that the framework achieves high performance using state-of-the-art sentence-level language models. Furthermore, we show that not every representation is equally useful for anomaly detection. Some of the language models fail to generate log representations that can be separated by a learned decision boundary. The underlying learning objective is also very important for obtaining good results in the task of anomaly detection. The proposed approach opens new potential for anomaly detection not just from log data, but from other sources that have the notion of a distributed representation of an event, e.g., distributed tracing data. We believe that the method will motivate further research in the direction of developing pre-trained language models on log data. This would enhance the log representations, and thus improve the performance of the anomaly detection methods.

REFERENCES
[1] J. Bogatinovski and S. Nedelkoski, "Multi-source anomaly detection in distributed IT systems," arXiv preprint arXiv:2101.04977, 2021.
[2] S. Nedelkoski, J. Bogatinovski, A. Acker, J. Cardoso, and O. Kao, "Self-attentive classification-based anomaly detection in unstructured logs," 2020.
[3] J. Zhu, S. He, J. Liu, P. He, Q. Xie, Z. Zheng, and M. R. Lyu, "Tools and benchmarks for automated log parsing," IEEE, 2019, pp. 121–130.
[4] M. Du, F. Li, G. Zheng, and V. Srikumar, "DeepLog: Anomaly detection and diagnosis from system logs through deep learning," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2017, pp. 1285–1298.
[5] X. Zhang, Y. Xu, Q. Lin, B. Qiao, H. Zhang, Y. Dang, C. Xie, X. Yang, Q. Cheng, Z. Li et al., "Robust log-based anomaly detection on unstable log data," in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 807–817.
[6] W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan, "Detecting large-scale system problems by mining console logs," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, 2009, pp. 117–132.
[7] S. Nedelkoski, J. Bogatinovski, A. Acker, J. Cardoso, and O. Kao, "Self-supervised log parsing," Springer, 2020.
[8] K. Zhang, J. Xu, M. R. Min, G. Jiang, K. Pelechrinis, and H. Zhang, "Automated IT system failure prediction: A deep learning approach," 2016, pp. 1291–1300.
[9] R. Vinayakumar, K. P. Soman, and P. Poornachandran, "Long short-term memory based operation log anomaly detection," 2017, pp. 236–242.
[10] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[11] Z. Huang, W. Xu, and K. Yu, "Bidirectional LSTM-CRF models for sequence tagging," arXiv preprint arXiv:1508.01991, 2015.
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[13] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners."
[14] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," in Advances in Neural Information Processing Systems, 2019, pp. 5753–5763.
[15] P. He, J. Zhu, Z. Zheng, and M. R. Lyu, "Drain: An online log parsing approach with fixed depth tree," IEEE, 2017, pp. 33–40.
[16] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," arXiv preprint arXiv:2005.14165, 2020.