Deep Learning for Unsupervised Insider Threat Detection in Structured Cybersecurity Data Streams
Aaron Tuor, Samuel Kaplan, Brian Hutchinson, Nicole Nichols, Sean Robinson
Aaron Tuor, Samuel Kaplan, and Brian Hutchinson∗
Western Washington University, Bellingham, WA
Nicole Nichols and Sean Robinson
Pacific Northwest National Laboratory, Seattle, WA
Abstract
Analysis of an organization's computer network activity is a key component of early detection and mitigation of insider threat, a growing concern for many organizations. Raw system logs are a prototypical example of streaming data that can quickly scale beyond the cognitive power of a human analyst. As a prospective filter for the human analyst, we present an online unsupervised deep learning approach to detect anomalous network activity from system logs in real time. Our models decompose anomaly scores into the contributions of individual user behavior features for increased interpretability to aid analysts reviewing potential cases of insider threat. Using the CERT Insider Threat Dataset v6.2 and threat detection recall as our performance metric, our novel deep and recurrent neural network models outperform Principal Component Analysis, Support Vector Machine and Isolation Forest based anomaly detection baselines. For our best model, the events labeled as insider threat activity in our dataset had an average anomaly score in the 95.53 percentile, demonstrating our approach's potential to greatly reduce analyst workloads.
Introduction
Insider threat is a complex and growing challenge for employers. It is generally defined as any actions taken by an employee which are potentially harmful to the organization; e.g., unsanctioned data transfer or sabotage of resources. Insider threat may manifest in various and novel forms motivated by differing goals, ranging from a disgruntled employee subverting the prestige of an employer to advanced persistent threats (APT), orchestrated multi-year campaigns to access and retrieve intelligence data (Hutchins, Cloppert, and Amin 2011).

Cyber defenders are tasked with assessing a large volume of real-time data. These datasets are high velocity, heterogeneous streams generated by a large set of possible entities (workstations, servers, routers) and activities (DNS requests, logons, file accesses). With the goal of efficient utilization of human resources, automated methods for filtering system log data for an analyst have been the focus of much past and current research, this work included.

∗ Email: [email protected]. Phone: 360-650-4894. Address: 516 High Street, Bellingham, WA 98229.
We present an online unsupervised deep learning system to filter system log data for analyst review. Because insider threat behavior is widely varying, we do not attempt to explicitly model threat behavior. Instead, novel variants of Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs) are trained to recognize activity that is characteristic of each user on a network and concurrently assess whether user behavior is normal or anomalous, all in real time. With the streaming scenario in mind, the time and space complexity of our methods are constant as a function of stream duration; that is, no data is cached indefinitely and detections are made as rapidly as new data is fed into our DNN and RNN models. To aid analysts in interpreting system decisions, our model decomposes anomaly scores into a human readable summary of the major factors contributing to the detected anomaly (e.g. that the user copied an abnormally large number of files to removable media between 12am and 6am).

There are several key difficulties in applying machine learning to the cyber security domain (Sommer and Paxson 2010) that our model attempts to address. User activity on a network is often unpredictable over seconds to hours and contributes to the difficulty of finding a stable model of "normal" behavior. Our model trains continuously in an online fashion to adapt to changing patterns in the data. Also, anomaly detection for malicious events is particularly challenging because attackers often try to closely mimic typical behavior. We model the stream of system logs as interleaved user sequences with user metadata to provide precise context for activity on the network; this allows our model, for example, to identify what is truly typical behavior for the user, employees in the same role, employees on the same project team, etc. We assess the effectiveness of our models on the synthetic CERT Insider Threat v6.2 dataset (Lindauer et al. 2014; Glasser and Lindauer 2013), which includes system logs with line-level annotations of insider threat activity. The ground truth threat labels are used only for evaluation.
Prior Work
A frequent approach to insider threat detection is to frame the problem as an anomaly detection task. A comprehensive overview of anomaly detection provided by Chandola et al. (2012) concludes that anomaly detection techniques for online and multivariate sequences are underdeveloped; both issues are addressed in this paper. A real world system for anomaly detection in system logs should address the set of constraints given by the real time nature of the task and provide a set of features suitable for the application domain: concurrent tracking of multiple entities, analysis of structured multivariate data, adaptation to shifting distribution of activities, and interpretable judgments. While each work surveyed below addresses some subset of these components, our work addresses all of these constraints and features.

As mentioned above, it is common to approach tasks like intrusion detection or insider threat as anomaly detection. Carter and Streilein (2012) demonstrate a probabilistic extension of an exponentially weighted moving average for the application of anomaly detection in a streaming environment. This method learns a parametric statistical model that adapts to the changing distribution of streaming data. An advantage of our present approach using deep learning architectures is the ability to model a wider range of distributions with fewer underlying assumptions. Gavai et al. (2015) compare a supervised approach, from an expert-developed classifier, with an unsupervised approach using the Isolation Forest method at the task of detecting insider threat from network logs. They also aggregate information about which features contribute to the isolation of a point within the tree to produce motivation for why a user was flagged as anomalous. Considering this to be a reasonable approach, we include Isolation Forests as one of our baselines.

Researchers have also applied neural network-based approaches to cybersecurity tasks. Ryan et al. (1998) train a standard neural network with one hidden layer to predict the probabilities that each of a set of ten users created a distribution of Unix commands for a given day. They detect a network intrusion when the probability is less than 0.5 for all ten users of the network. Differing from our work, their input features are not structured, and they do not train the network in an online fashion. Early work on modeling normal user activity on a network using RNNs was performed by Debar et al. (1992). They train an RNN to convergence on a representative sequence of Unix command line arguments (from login to logout) and predict network intrusion when the trained network for that user does poorly at predicting the login to logout sequence. While this work partially addresses online training, it does not continuously train the network to take into account changing user habits over time. Veeramachaneni et al. (2016) present work using a neural network auto-encoder in an online setting. They aggregate numeric features over a time window from web and firewall logs which are fed to an ensemble of unsupervised anomaly detection methods: principal component reconstruction of the signal, auto-encoder neural network, and a multivariate probabilistic model over the feature space.
They additionally incorporate analyst feedback to continually improve with time, but do not explicitly model individual user activity over time.

Recurrent neural networks have, of course, been successfully applied to anomaly detection in various alternative domains; e.g., Malhotra et al. (2016) in the domain of signals from mechanical sensors for machinery such as engines and vehicles, Chauhan et al. (2015) in the domain of ECG heart data, and Marchi et al. (2015a; 2015b) in the acoustic signal processing domain. In contrast to the present work, these applications are not faced with the task of processing a multivariate combination of categorical and continuous features.

Figure 1: End to End System

System Description
Figure 1 provides an overview of our anomaly detection system. First, raw events from system user logs are fed into our feature extraction system, which aggregates their counts and outputs one vector for each user for each day. A user's feature vectors are then fed into a neural network, creating a set of networks, one per user. In one variant of our system, these are DNNs; in the other, they are RNNs. In either case, the different user models share parameters, but for the RNN they maintain separate hidden states. These neural networks are tasked with predicting the next vector in the sequence; in effect, they learn to model users' "normal" behavior. Anomaly is proportional to the prediction error, with sufficiently anomalous behavior being flagged for an analyst to investigate. The components of the system are described in greater detail below.
Feature Extraction
One practical consideration that a deep learning anomaly detection system must address is the transformation of system log lines from heterogeneous tracking sources into numeric features suitable as input. Our system extracts two kinds of information from these sources: categorical user attribute features and continuous "count" features. The categorical user features refer to attributes such as a user's role, department, and supervisor in the organization. See Table 1 for a list of categorical features used in our experiments (along with the number of distinct values in each category). In addition to these categorical features, we also accumulate counts of 408 "activities" a user has performed over some fixed time window (e.g. 24 hours). An example of a counted activity is the number of uncommon non-decoy file copies from removable media between the hours of 12:00 p.m. and 6:00 p.m. Figure 2 visually enumerates the set of count features: simply follow a path from right to left, choosing one item in each set along the way. The set of all such traversals is the set of count features. For each user u, for each time period t, the categorical values and activity counts are concatenated into a 414-dimensional numeric feature vector x^u_t.

Table 1: Categorical variables.
Categorical Var.    Values
Role                46
Project             366
Functional Unit     11
Department          23
Team                90
Supervisor          246

Figure 2: Enumeration of count features.
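To make the aggregation step concrete, the following is a minimal sketch (not the released safekit code) of how raw events might be rolled up into per-user, per-day count vectors; the event field names and the short activity list are hypothetical stand-ins for the 408 activity types enumerated in Figure 2.

```python
import numpy as np
from collections import defaultdict

# Hypothetical activity categories; the real system enumerates 408 of them
# (device x action x time-of-day combinations, per Figure 2).
ACTIVITIES = ["logon_12am_6am", "logon_6am_12pm",
              "file_copy_removable_12am_6am", "email_attach_uncommon_recipient"]
ACT_INDEX = {a: i for i, a in enumerate(ACTIVITIES)}

def aggregate_counts(events):
    """Aggregate raw log events into one count vector per (user, day).

    `events` is an iterable of dicts with hypothetical keys
    'user', 'day', and 'activity' (one of ACTIVITIES).
    Returns a dict mapping (user, day) -> count vector.
    """
    vectors = defaultdict(lambda: np.zeros(len(ACTIVITIES)))
    for ev in events:
        vectors[(ev["user"], ev["day"])][ACT_INDEX[ev["activity"]]] += 1
    return vectors

# Toy usage: two events for one user on one day.
demo = [{"user": "u001", "day": 3, "activity": "logon_12am_6am"},
        {"user": "u001", "day": 3, "activity": "file_copy_removable_12am_6am"}]
print(aggregate_counts(demo)[("u001", 3)])
```

In the full system, the resulting count vector would be concatenated with the encoded categorical attributes to form the 414-dimensional vector x^u_t.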
Structured Stream Neural Network
At the core of our system is one of two neural network models that map a series of feature vectors for a given user, one per day, to a probability distribution over the next vector in the user's sequence. This model is trained jointly over all users simultaneously and in an online fashion. First, we describe our DNN model, which does not explicitly model any temporal behavior, followed by the RNN, which does. We then discuss the remaining components for making predictions of structured feature vectors and identification of anomaly in the stream of feature vectors.
Deep Neural Network Model
Our model takes as input a series of T feature vectors x^u_1, x^u_2, ..., x^u_T for a user u and produces as output a series of T hidden state vectors h^u_1, h^u_2, ..., h^u_T (each to be later fed into the structured prediction network). In a DNN with L hidden layers (l = 1, ..., L), our final hidden state, the output of hidden layer L, h^u_t = h^u_{L,t}, is a function of x^u_t as follows:

h^u_{l,t} = g(W_l h^u_{l-1,t} + b_l)    (1)

where g is a non-linear activation function, typically ReLU, tanh, or the logistic sigmoid, and h^u_{0,t} = x^u_t. The trainable parameters are the L weight matrices (W) and L bias vectors (b).
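A minimal NumPy sketch of the stacked feed-forward mapping in Eqn. 1; the layer sizes and random weights are arbitrary illustrations, not the tuned model.

```python
import numpy as np

def dnn_hidden(x, weights, biases, g=np.tanh):
    """Compute h_L from input x via h_l = g(W_l h_{l-1} + b_l), with h_0 = x."""
    h = x
    for W, b in zip(weights, biases):
        h = g(W @ h + b)
    return h

# Toy example: 414-dim input, two hidden layers of size 100.
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.1, size=(100, 414)), rng.normal(scale=0.1, size=(100, 100))]
bs = [np.zeros(100), np.zeros(100)]
x_ut = rng.random(414)
h_ut = dnn_hidden(x_ut, Ws, bs)
print(h_ut.shape)  # (100,)
```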
Recurrent Neural Network Model

Like the DNN, the RNN model maps an input sequence x^u_1, x^u_2, ..., x^u_T to a hidden state sequence h^u_1, h^u_2, ..., h^u_T. Unlike the DNN, here the hidden state h^u_t is computed as a function of x^u_1, x^u_2, ..., x^u_t, and not of x^u_t alone. Conditioning h^u_t on a sequence rather than the current input alone allows us to capture temporal patterns in user behavior, and to build an increasingly accurate model of the user's behavior over time.

Figure 3: Unrolled LSTM network with N layers.

We use the popular Long Short-Term Memory (LSTM) RNN architecture (Hochreiter and Schmidhuber 1997), in which the hidden state h^u_t at time t is a function of a long-term memory cell, c^u_t. In a deep LSTM with L hidden layers, our final hidden state, the output of hidden layer L, h^u_t = h^u_{L,t}, depends on the input sequence and cell states as follows:

h^u_{l,t} = o^u_{l,t} ⊙ tanh(c^u_{l,t})    (2)
c^u_{l,t} = f^u_{l,t} ⊙ c^u_{l,t-1} + i^u_{l,t} ⊙ g^u_{l,t}    (3)
g^u_{l,t} = tanh(W^{(g,x)}_l h^u_{l-1,t} + W^{(g,h)}_l h^u_{l,t-1} + b^g_l)    (4)
f^u_{l,t} = σ(W^{(f,x)}_l h^u_{l-1,t} + W^{(f,h)}_l h^u_{l,t-1} + b^f_l)    (5)
i^u_{l,t} = σ(W^{(i,x)}_l h^u_{l-1,t} + W^{(i,h)}_l h^u_{l,t-1} + b^i_l)    (6)
o^u_{l,t} = σ(W^{(o,x)}_l h^u_{l-1,t} + W^{(o,h)}_l h^u_{l,t-1} + b^o_l)    (7)

where h^u_{0,t} = x^u_t, and c^u_{l,0}, h^u_{l,0} are set to zero vectors for all 1 ≤ l ≤ L. We use ⊙ and σ to denote element-wise multiplication and the (element-wise) logistic sigmoid function, respectively. Vector g^u_{l,t} is a hidden representation based on the current input and previous hidden state, while vectors f^u_{l,t}, i^u_{l,t} and o^u_{l,t} modulate how cell-state information is propagated across time, how the input is incorporated into the cell state, and how the hidden state relates to the cell state, respectively. The trainable parameters for the LSTM are the L weight matrices (W) and the L bias vectors (b); these weights are shared among all users.
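For concreteness, a single layer's LSTM update (Eqns. 2–7) can be sketched in NumPy as follows; dimensions and random weights are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step for one layer (Eqns. 2-7).

    `p` holds weight matrices W_{*,x}, W_{*,h} and biases b_* for the
    candidate (g), forget (f), input (i), and output (o) gates.
    """
    g = np.tanh(p["Wgx"] @ x + p["Wgh"] @ h_prev + p["bg"])
    f = sigmoid(p["Wfx"] @ x + p["Wfh"] @ h_prev + p["bf"])
    i = sigmoid(p["Wix"] @ x + p["Wih"] @ h_prev + p["bi"])
    o = sigmoid(p["Wox"] @ x + p["Woh"] @ h_prev + p["bo"])
    c = f * c_prev + i * g          # Eqn. 3
    h = o * np.tanh(c)              # Eqn. 2
    return h, c

# Toy dimensions: 414-dim input, 64-dim hidden state.
rng = np.random.default_rng(1)
dims = dict(x=414, h=64)
params = {}
for gate in "gfio":
    params[f"W{gate}x"] = rng.normal(scale=0.1, size=(dims["h"], dims["x"]))
    params[f"W{gate}h"] = rng.normal(scale=0.1, size=(dims["h"], dims["h"]))
    params[f"b{gate}"] = np.zeros(dims["h"])
h, c = lstm_step(rng.random(dims["x"]), np.zeros(dims["h"]), np.zeros(dims["h"]), params)
print(h.shape, c.shape)
```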
Probability Decomposition

Given the hidden state at time t−1, h^u_{t−1}, our model outputs the parameters θ for a probability distribution over the next observation, x^u_t. The anomaly for user u at time t, a^u_t, is then:

a^u_t = −log P_θ(x^u_t | h^u_{t−1})    (8)

This probability is complicated by the fact that our feature vectors, and thus the predictions our model makes, include six categorical variables in addition to the 408-dimensional count vector. Therefore, P_θ(x^u_t | h^u_{t−1}) is actually the joint probability over the count vector (x̂^u_t) and each of the categorical variables: role (R), project (P), functional unit (F), department (D), team (T) and supervisor (S). Let C = {R, P, F, D, T, S} denote the set of categorical variables; e.g., let R^u_t denote the role of user u at time t. Then

P_θ(x^u_t | h^u_{t−1}) = P_θ(x̂^u_t, R^u_t, ..., S^u_t | h^u_{t−1}).    (9)

For computational simplicity, we approximate this joint probability by assuming conditional independence:

P_θ(x^u_t | h^u_{t−1}) ≈ P_θ^(x̂)(x̂^u_t | h^u_{t−1}) ∏_{V ∈ C} P_θ^(V)(V^u_t | h^u_{t−1})    (10)

The seven parameter vectors, θ^(x̂) and θ^(V) for V ∈ C, are produced by seven single hidden layer neural networks:

θ^(x̂)_t = U'_x̂ tanh(U_x̂ h_{t−1} + b_x̂) + b'_x̂    (11)
θ^(V)_t = f(U'_V tanh(U_V h_{t−1} + b_V) + b'_V)    (12)

Here f denotes the softmax function. Two additional weight matrices (U) and two additional bias vectors (b) are introduced for each of the seven variables we are predicting. Like the LSTM weights, these parameters are shared among all users. The parametric forms for the conditional probabilities are described next.
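A rough sketch of the output heads in Eqns. 11–12: each of the seven parameter vectors is produced by a small single-hidden-layer network on top of the shared hidden state, with a softmax applied only for the categorical variables. All names, sizes, and weights below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def output_heads(h, heads):
    """Map a hidden state h to parameter vectors per Eqns. 11-12: an
    unconstrained vector for the counts, a softmax per categorical variable."""
    params = {}
    for name, (U, b, U2, b2, use_softmax) in heads.items():
        z = U2 @ np.tanh(U @ h + b) + b2
        params[name] = softmax(z) if use_softmax else z
    return params

# Toy heads: 64-dim hidden state, 408 count means, 46 roles (per Table 1).
rng = np.random.default_rng(5)
def make_head(out_dim, hid=32, softmax_out=False, d_h=64):
    return (rng.normal(scale=0.1, size=(hid, d_h)), np.zeros(hid),
            rng.normal(scale=0.1, size=(out_dim, hid)), np.zeros(out_dim),
            softmax_out)

heads = {"counts_mu": make_head(408), "role": make_head(46, softmax_out=True)}
out = output_heads(rng.random(64), heads)
print(out["counts_mu"].shape, round(out["role"].sum(), 3))  # (408,) and 1.0
```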
Conditional Probabilities

We model the conditional probabilities for the six categorical variables as discrete, while we model the conditional probability of the counts as continuous. For the discrete models, we use the standard approach: the probability of category k is simply the k-th element of vector θ^(V), whose dimension is equal to the number of categories. For example, there are 47 roles, so θ^(R) ∈ ℝ^47. Because we use a softmax output activation to produce θ^(V), the elements are non-negative and sum to one.

For the count vector, we use the multivariate normal density: P_θ^(x̂)(x̂^u_t | h^u_{t−1}) = N(x̂^u_t; μ, Σ). We consider two variants. In the first, our model outputs the mean vector μ (θ^(x̂) = μ) and we assume the covariance Σ to be the identity. With identity covariance, maximizing the log-likelihood of the true data is equivalent to minimizing the squared error ||x̂^u_t − μ||². In the second, we assume diagonal covariance, and our model outputs both the mean vector and the log of the diagonal of Σ. This portion of the model can be seen as a simplified Mixture Density Network (Bishop 1994).
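Combining Eqns. 8–12 with the parametric forms above, the anomaly score can be sketched as a sum of negative log probabilities: a diagonal-covariance Gaussian term for the counts plus one categorical term per attribute (identity covariance corresponds to fixing the log-variance to zero). The sketch below is illustrative, not the authors' implementation.

```python
import numpy as np

def gaussian_nll(x, mu, log_var):
    """Negative log density of a diagonal-covariance Gaussian."""
    return 0.5 * np.sum(log_var + (x - mu) ** 2 / np.exp(log_var)
                        + np.log(2 * np.pi))

def categorical_nll(probs, k):
    """Negative log probability of category k under a softmax output."""
    return -np.log(probs[k])

def anomaly_score(x_counts, mu, log_var, cat_probs, cat_labels):
    """Eqn. 10: -log P factorizes into a count term plus one term per
    categorical variable (role, project, functional unit, ...)."""
    score = gaussian_nll(x_counts, mu, log_var)
    for probs, k in zip(cat_probs, cat_labels):
        score += categorical_nll(probs, k)
    return score

# Toy usage with made-up model outputs.
rng = np.random.default_rng(2)
x = rng.poisson(2.0, size=408).astype(float)
mu, log_var = np.full(408, 2.0), np.zeros(408)   # log_var = 0 -> identity covariance
role_probs = np.full(46, 1 / 46)                 # uniform over the 46 roles in Table 1
print(anomaly_score(x, mu, log_var, [role_probs], [7]))
```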
Prediction Targets

We define two prediction target approaches, "next time step" and "same time step". Recall from Eqn. 8 that anomaly is inversely proportional to the log probability of the observation at time t given the hidden representation at time t−1; that is, given everything we know up to and including time t−1, predict the outcome at time t. This approach fits the normal paradigm for RNNs on sequential data; in our experiments, we will refer to this approach as "next time step" prediction.

However, it is common in the anomaly detection literature (Malhotra et al. 2016) to use an auto-encoder to detect anomaly. An auto-encoder is a parametric function trained to reproduce the input features as output. Its complexity is typically constrained to prevent it from learning the trivial identity function; instead, the network must exploit statistical regularities in the data to achieve low reconstruction error for commonly found patterns, at the expense of high reconstruction error for uncommon patterns (anomalous activity). Networks trained in this unsupervised fashion have been demonstrated to be very effective in several anomaly detection application domains (Markou and Singh 2003). In the context of our present application, both techniques may be applicable. Formally, we consider an alternative definition of anomaly:

â^u_t = −log P_θ(x^u_t | h^u_t)    (13)

That is, given everything we know up to and including time t, predict the input counts x^u_t. If x^u_t is anomalous, we are unlikely to produce a distribution that assigns a large density to it. We refer to this approach as "same time step" prediction.
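The two target definitions differ only in how inputs and targets are aligned along a user's day-by-day sequence; a toy sketch, assuming feature vectors stacked row-wise:

```python
import numpy as np

def make_pairs(X, mode="same"):
    """Return (inputs, targets) for one user's day-by-day matrix X.

    mode="same": reconstruct x_t from the hidden state at time t (Eqn. 13).
    mode="next": predict x_t from the hidden state at time t-1 (Eqn. 8).
    """
    if mode == "same":
        return X, X
    return X[:-1], X[1:]

X = np.arange(12.0).reshape(4, 3)   # 4 days, 3 toy features
for m in ("same", "next"):
    inp, tgt = make_pairs(X, m)
    print(m, inp.shape, tgt.shape)
```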
Detecting Insider Threat

Ultimately, the goal of our model is to detect insider threat. We assume the following conditions: our model produces anomaly scores, which are used to rank user-days from most anomalous to least; we then provide the highest ranked user-day pairs to analysts who judge whether the anomalous behavior is indicative of insider threat. We assume that there is a daily budget which imposes a maximum number of user-day pairs that can be judged each day, and that if an actual case of insider threat is presented to an analyst, he or she will correctly detect it.

Because our model is trained in an online fashion, the anomaly scores start out quite large (when the model knows nothing about normal behavior) and trend lower over time (as normal behavior patterns are learned). To place the anomaly score for user u at time t in the proper context, we compute an exponentially weighted moving average estimate of the mean and variance of these anomaly scores and standardize each score as it arrives.

One key feature of our model is that the anomaly score decomposes as the sum over the negative log probabilities of our variables; the continuous count random variable further decomposes over the sum of individual feature terms: (x_i − μ_i)²/σ_i². This allows us to identify which features are the largest contributors to any anomaly score; for example, our model could indicate that a particular user-day is flagged as anomalous primarily due to an abnormal number of emails sent with attachments to uncommon recipients between 12am and 6am. Providing insight into why a user-day was flagged may improve both the speed and accuracy of analysts' judgments about insider threat behavior.
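A hedged sketch of the running standardization and per-feature attribution described above; the smoothing factor and the specific attribution ranking are illustrative choices, not values from the paper.

```python
import numpy as np

class EWMAStandardizer:
    """Exponentially weighted running mean/variance for anomaly scores."""
    def __init__(self, alpha=0.05):
        self.alpha, self.mean, self.var = alpha, 0.0, 1.0

    def standardize(self, score):
        z = (score - self.mean) / np.sqrt(self.var + 1e-8)  # score in context
        self.mean = (1 - self.alpha) * self.mean + self.alpha * score
        self.var = (1 - self.alpha) * self.var + self.alpha * (score - self.mean) ** 2
        return z

def top_contributors(x, mu, sigma, k=3):
    """Rank count features by their contribution (x_i - mu_i)^2 / sigma_i^2."""
    contrib = (x - mu) ** 2 / sigma ** 2
    return np.argsort(-contrib)[:k]

# Toy usage: a sudden jump in the score and a single dominant feature.
std = EWMAStandardizer()
print([round(std.standardize(s), 2) for s in (30.0, 28.0, 29.0, 80.0)])
print(top_contributors(np.array([1., 0., 9.]), np.zeros(3), np.ones(3)))
```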
Online Training

In a standard training scenario for RNNs, individual or mini-batches of sequences are fed to the RNN, gradients of the training objective are computed via Back Propagation Through Time, and then weights are adjusted via a gradient-descent-like algorithm. For DNNs, individual or mini-batches of samples are fed into the DNN, and weights are updated with gradients computed by standard backpropagation. In either case, this process usually iterates over the fixed-size dataset until the model converges, and only then is the model applied to new data to make predictions. This approach faces a few key challenges in the online anomaly detection setting: 1) the dataset is streaming and effectively unbounded, and 2) the model is tasked with making predictions on new data as it learns. Attempting to shoehorn this scenario into a standard training setup is impractical: it is infeasible to store or repeatedly train on an unbounded streaming dataset, and periodically retraining the model on a fixed-size set of recent events risks excluding important past events.

To accommodate an online scenario, we make important adjustments to the standard training regimen. For DNNs, the primary difference is the restriction of observing each sample only once. For the RNN, the situation is more complicated. We train on multiple user sequences concurrently, backpropagating and adjusting weights each time we see a new feature vector from a user. Logically, this corresponds to training one RNN per user, where the weights are shared between all users but hidden state sequences are per-user. In practice, we accomplish this by training a single RNN with a supplementary data structure that stores a finite window of past inputs and hidden and cell states for each user. Each time a new feature vector for a user is fed into the model, the hidden and cell states for that user are used for context when calculating the forward pass and backpropagating error.
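The online regimen can be caricatured with a toy stand-in model that keeps one state per user, scores each new user-day before updating, and never revisits old data; the real system would use the shared-weight DNN/RNN (and stored LSTM hidden/cell states) in place of the running mean used here.

```python
import numpy as np

class OnlineUserModel:
    """Toy stand-in for the shared-weight model: a running per-feature mean
    per user serves as the 'prediction', squared error as the anomaly score."""
    def __init__(self, dim, lr=0.1):
        self.state = {}          # per-user "hidden state" (running mean)
        self.dim, self.lr = dim, lr

    def score_and_update(self, user, x):
        h = self.state.get(user, np.zeros(self.dim))
        anomaly = float(np.sum((x - h) ** 2))     # score BEFORE updating
        self.state[user] = h + self.lr * (x - h)  # one online update step
        return anomaly

model = OnlineUserModel(dim=3)
stream = [("u1", np.array([1., 0., 0.])), ("u2", np.array([0., 5., 0.])),
          ("u1", np.array([1., 0., 4.]))]
for user, x in stream:                 # each user-day is seen exactly once
    print(user, round(model.score_and_update(user, x), 2))
```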
Baseline Models
To assess the effectiveness of our DNN and RNN models, we compare against popular anomaly/novelty/outlier detection methods. Specifically, we compare against one-class support vector machine (SVM) (Schölkopf et al. 2001), isolation forest (Liu, Ting, and Zhou 2008) and principal component analysis (PCA) (Shyu et al. 2003) baselines. We use scikit-learn's implementation of one-class SVM and isolation forest, both included as part of its novelty and outlier detection functionality (Pedregosa et al. 2011). For the PCA baseline, we project the feature vector onto the first k principal components and then map it back into the original feature space. Anomaly is proportional to the error in this reconstruction. Hyperparameter k is tuned on the development set.
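As an illustration of the PCA baseline just described, a minimal scikit-learn sketch (the choice of k and the synthetic data are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_anomaly_scores(X_train, X_eval, k=5):
    """Project onto the first k principal components, reconstruct, and use
    squared reconstruction error as the anomaly score."""
    pca = PCA(n_components=k).fit(X_train)
    recon = pca.inverse_transform(pca.transform(X_eval))
    return np.sum((X_eval - recon) ** 2, axis=1)

rng = np.random.default_rng(3)
X_train = rng.normal(size=(1000, 20))
X_eval = np.vstack([rng.normal(size=(5, 20)), 10 + rng.normal(size=(1, 20))])
print(np.argmax(pca_anomaly_scores(X_train, X_eval)))  # the shifted row stands out
```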
Experiments

We assess the effectiveness of our model, which we implemented in TensorFlow (Abadi et al. 2015), on a series of experiments. In this section we describe the data used and the hyper-parameters tuned, and present our results and analysis.

Data
Given security and privacy concerns surrounding network data, real world datasets must undergo an anonymization process before being publicly released for research purposes. The anonymization process may obscure potentially relevant factors in system logs. In particular, user attribute metadata that may be available to a system administrator is typically absent in an open release dataset. We perform experiments on the synthetic CERT Insider Threat Dataset v6.2, which includes such categorical information.

CERT consists of event log lines from a simulated organization's computer network, generated with sophisticated user models. We use five sources of events: logon/logoff activity, http traffic, email traffic, file operations, and external storage device usage. Over the course of 516 days, 4,000 users generate 135,117,169 events (log lines). Among these are events manually injected by domain experts, representing five insider threat scenarios taking place. Additionally, user attribute metadata is included; namely, the six categorical attributes listed in Table 1.

Since this is an unsupervised task, no supervised training set is required. We therefore split the entire dataset chronologically into two subsets: development and test. The former subset (∼85% of the data) is used for model selection and hyper-parameter tuning, while the latter subset (∼15%) is reserved for evaluation.

Table 2: Dataset statistics.
                     Development      Test
Date Range           Days 1–418       Days 419–516
Total Events         111,457,667      23,659,502
Threat Events        192              236
Threat User-Days     27               20

http://scikit-learn.org/stable/modules/outlier_detection.html
Code will be available at https://github.com/pnnl/safekit

Tuning
We tune our models and baselines on the development set using random hyper-parameter search. For the DNN, we tune the number of hidden layers (between 1 and 6) and the hidden layer dimension (between 20 and 500). We fix the batch size to 256 samples (user-days) and the learning rate to 0.01. For the RNN, we tune the hidden layers and hidden layer dimension over the same ranges as the DNN, and also fix the learning rate to 0.01. The batch size is tuned (between 256 and 8092 samples); larger batch sizes speed up model training, which is more important for the RNN than the DNN. We also tune the number of time steps to back propagate over (between 3 and 40). When our inputs and outputs include the categorical variables, we additionally tune a hyper-parameter which determines the size of the input embedding vector of a category in relation to how many classes are in that category (between 0.25 and 1). Both neural network models use tanh for the hidden activation function and are trained using the ADAM (Kingma and Ba 2014) variant of gradient descent.

We also tune our baseline models. For the PCA baseline, we tune over the number of principal components (between 1 and 20). For the Isolation Forest baseline, we tune the number of estimators (between 20 and 300), the contamination (between 0 and 0.5), and whether we bootstrap (true or false). The max feature hyper-parameter is fixed at the default of 1.0 (use all features). For the SVM baseline, we tune the kernel (in the set {rbf, linear, poly, sigmoid}), ν (between 0 and 1) and whether to use the shrinking heuristic (true or false). For the polynomial kernel, we tune the degree (between 1 and 10), while for all other kernels we use the default values for the remaining hyper-parameters.

For all models, our tuning criterion is Cumulative Recall k (CR-k), which we define to be the sum of the recalls for all budgets up to and including k. For computational efficiency, we only evaluate budgets at increments of 25, so if we define R(i) to be the recall with a budget of i, CR-k is actually R(25) + R(50) + ··· + R(k). CR-k can be thought of as an approximation to an area under the recall curve. For each model, we picked the hyper-parameters that maximized CR-1000, for which the maximum value achievable is 40. Given the assumptions that 1) we have a fixed daily analyst budget which cannot be carried over from one day to the next, 2) true positives are rare, and 3) the cost of a missed detection is substantially larger than the cost of a false positive, we feel that recall-oriented metrics such as CR-k are a more suitable measurement of performance than precision-oriented ones.
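A hedged sketch of how CR-k might be computed from per-day anomaly scores and ground-truth labels; the data layout is an assumption, not the evaluation code used in the paper.

```python
import numpy as np

def daily_recall(scores_by_day, labels_by_day, budget):
    """Fraction of malicious user-days that fall within the top-`budget`
    anomaly scores of their own day."""
    hits = total = 0
    for scores, labels in zip(scores_by_day, labels_by_day):
        top = np.argsort(-np.asarray(scores))[:budget]
        hits += sum(labels[i] for i in top)
        total += sum(labels)
    return hits / max(total, 1)

def cumulative_recall(scores_by_day, labels_by_day, k, step=25):
    """CR-k = R(25) + R(50) + ... + R(k)."""
    return sum(daily_recall(scores_by_day, labels_by_day, b)
               for b in range(step, k + 1, step))

# Toy example: 2 days, 100 users per day, one malicious user-day on day 1.
rng = np.random.default_rng(4)
scores = [rng.random(100), rng.random(100)]
labels = [np.zeros(100, dtype=int), np.zeros(100, dtype=int)]
labels[0][7], scores[0][7] = 1, 5.0     # malicious and highly anomalous
print(cumulative_recall(scores, labels, k=100))
```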
Results

We present three sets of experimental results, each designed to answer a specific question about our model's performance.

First, we assess the effect of including or excluding the categorical variables in our model input and output. Table 3 shows the comparison between two LSTM models, differing only in whether they include or exclude the categorical information. It shows that while the difference is not huge, the model clearly performs better without the categorical information. While the original intention of including categorical features was to provide context to the model, we hypothesize that our dataset may be simple enough that such context is not necessary (or that the model does not need explicit context: it can infer it). It may also be that the added model complexity hinders trainability, leading to a net loss in performance. Because inclusion of categorical features adds computational complexity to the model and harms performance, all of the remaining experiments reported in this paper use count features only.

Table 3: Cumulative Recall (CR-k) for budgets of 400 and 1000, comparing the performance of diagonal covariance LSTM models with (LSTM-Diag-Cat) and without (LSTM-Diag) categorical features included.

Our second set of experiments is designed to determine which of the prediction modes works best for our task: "same time step" (Eqn. 13) or "next time step" (Eqn. 8). Table 4 shows these results, comparing two DNN and two LSTM models. The "same time step" approach yields better performance for both models, although the difference is more dramatic for the LSTM. Based on this result, we only use "same time step" for our remaining set of experiments. Interestingly, the DNN and LSTM perform equivalently. We suspect that the CERT dataset does not contain enough temporal patterns unfolding over multiple days to offer any real advantage to the LSTM, though we would expect it to offer advantages on real-world datasets.

Table 4: Cumulative Recall (CR-k) for daily budgets of 400 and 1000, comparing the diagonal covariance DNN and LSTM models predicting counts at the next time step (DNN-Diag-NextTime, LSTM-Diag-NextTime) vs. the current time step (DNN-Diag, LSTM-Diag).

Our final set of experiments is designed to assess the effect of covariance type for our continuous features (identity versus diagonal) and to contrast with our baseline models. Table 5 shows these results. Among the baselines, the Isolation Forest model is the strongest, giving the third best performance after DNN-Diag and LSTM-Diag. These results also show that diagonal covariance leads to better performance than identity covariance. One obvious advantage of diagonal covariance is that it is capable of more effectively normalizing the data (by accounting for trends in variance). Wondering how well the identity model would perform if the data were normalized ahead of time, we conducted a pilot study where the counts were standardized with an exponentially weighted moving average estimate of the mean and variance, and found no improvement for either the identity or diagonal covariance models. In contrast to a "global" normalization scheme, our diagonal covariance model is capable of conditioning the mean and variance on local context (when either "next time step" or the LSTM is used); for example, it might expect greater mean or variance in the number of emails sent on the day after an abnormally large number of emails were received. That said, it is not clear whether our data exhibits patterns that our models can take advantage of with this dynamic normalization.

Table 5: Cumulative Recall (CR-k) for daily budgets of 400 and 1000 for the baselines (Isolation Forest, SVM, PCA) and for the DNN and LSTM with diagonal (Diag) and identity (Ident) covariances. All results are based on count features only.

Analysis
We perform two analyses to better understand our system's behavior, using our best DNN model to illustrate. In the first, we look at the effect of time on the model's notion of anomaly. Because the model begins completely untrained, anomaly scores for all users are very high for the first few days. As the model sees examples of user behavior, it quickly learns what is "normal." Fig. 4 shows anomaly as a function of day (starting after the "burn in" period of the first few days, to keep the y-axis scale manageable). Percentile ranges are shown (computed over the users in the day), and malicious (insider threat) user-days are overlaid as red dots. Notice that all malicious events are above the 50th percentile for anomaly, with most above the 95th percentile.

Figure 4: Percentile ranges of user-day anomaly as a function of days for the DNN-Diag model. The vertical bar denotes the split between the development and test sets.

In our second analysis, we study the effect of daily budget on recall for the best DNN, the best LSTM and the three baseline models. Fig. 5 plots these recall curves. Impressively, with a daily budget of 425, DNN-Diag, LSTM-Diag and the Isolation Forest model all obtain 100% recall. It also shows that with our LSTM-Diag system, 90% recall can be obtained with a budget of only 250 (a 93.5% reduction in the amount of data analysts need to consider).
Figure 5: Test set recall curves (recall vs. daily budget, up to 1000) for DNN-Diag, LSTM-Diag, Isolation Forest, PCA, and SVM.
Conclusions
We have presented a system employing an online deep learning architecture that produces interpretable assessments of anomaly for the task of insider threat detection in streaming system user logs. Because insider threat takes new and different forms, it is not practical to explicitly model it; our system instead models "normal" behavior and uses anomaly as an indicator of potential malicious behavior. Our approach is designed to support the streaming scenario, allowing high volume streams to be filtered down to a manageable number of events for analysts to review. Further, our probabilistic anomaly scores also allow our system to convey why it felt a given user was anomalous on a given day (e.g. because the user had an abnormal number of file uploads between 6pm and 12am). We hope that this interpretability will improve human analysts' speed and accuracy.

In our evaluation using the CERT Insider Threat v6.2 dataset, our DNN and LSTM models outperformed three standard anomaly detection baselines (based on Isolation Forest, SVMs and PCA). When our probabilistic output model uses a context-dependent diagonal covariance matrix (as a function of the input) rather than a fixed identity covariance matrix, it provides better performance. We also contrasted two prediction scenarios: 1) probabilistically reconstructing the current input given a compressed hidden representation ("same time step") and 2) probabilistically predicting the next time step ("next time step"). In our experiments, we found that the first works slightly better.

There are many ways one could extend this work. First, we would like to apply this to a wider range of streaming tasks. Although our focus here is on insider threat, our underlying model offers a domain agnostic approach to anomaly detection. In our experiments, the LSTM performed equivalently to the DNN, but we suspect that the LSTM will yield superior performance when applied to large-scale real-world problems with more complicated temporal patterns.

Another promising angle is to explore different granularities of time. The current work aggregates features over individual users for each day; this has the potential to miss anomalous patterns happening within a single day. Again, our LSTM model has the greatest potential to generalize: the model could be applied to individual events / log lines, using its hidden state as memory to detect anomalous sequences of actions. Doing so would reduce or eliminate the "feature engineering" required for aggregate count-style features. It could also dramatically narrow the set of individual events an analyst must inspect to determine whether anomalous behavior constitutes insider threat.
Acknowledgments.
The research described in this paper is part of the Analysis in Motion Initiative at Pacific Northwest National Laboratory. It was conducted under the Laboratory Directed Research and Development Program at PNNL, a multi-program national laboratory operated by Battelle for the U.S. Department of Energy, and supported in part by the U.S. Department of Energy, Office of Science, Office of Workforce Development for Teachers and Scientists (WDTS) under the Visiting Faculty Program (VFP).
References

[Abadi et al. 2015] Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mané, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viégas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; and Zheng, X. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

[Bishop 1994] Bishop, C. 1994. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University.

[Carter and Streilein 2012] Carter, K. M., and Streilein, W. W. 2012. Probabilistic reasoning for streaming anomaly detection. In Proc. SSP, 377–380.

[Chandola, Banerjee, and Kumar 2012] Chandola, V.; Banerjee, A.; and Kumar, V. 2012. Anomaly detection for discrete sequences: A survey. IEEE TKDE.

[Chauhan and Vig 2015] Chauhan, S., and Vig, L. 2015. Anomaly detection in ECG time signals via deep long short-term memory networks. In Proc. DSAA, 1–7.

[Debar, Becker, and Siboni 1992] Debar, H.; Becker, M.; and Siboni, D. 1992. A neural network component for an intrusion detection system. In Proc. IEEE Symposium on Research in Security and Privacy, 240–250.

[Gavai et al. 2015] Gavai, G.; Sricharan, K.; Gunning, D.; Hanley, J.; Singhal, M.; and Rolleston, R. 2015. Supervised and unsupervised methods to detect insider threat from enterprise social and online activity data. Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications.

[Glasser and Lindauer 2013] Glasser, J., and Lindauer, B. 2013. Bridging the gap: A pragmatic approach to generating insider threat data. In Proc. SPW, 98–104.

[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.

[Hutchins, Cloppert, and Amin 2011] Hutchins, E. M.; Cloppert, M. J.; and Amin, R. M. 2011. Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains. Leading Issues in Information Warfare & Security Research.

[Kingma and Ba 2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Lindauer et al. 2014] Lindauer, B.; Glasser, J.; Rosen, M.; Wallnau, K. C.; and ExactData, L. 2014. Generating test data for insider threat detectors. Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications.

[Liu, Ting, and Zhou 2008] Liu, F. T.; Ting, K. M.; and Zhou, Z.-H. 2008. Isolation forest. In Proc. ICDM.

[Malhotra et al. 2016] Malhotra, P.; Ramakrishnan, A.; Anand, G.; Vig, L.; Agarwal, P.; and Shroff, G. 2016. LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148.

[Marchi et al. 2015a] Marchi, E.; Vesperini, F.; Eyben, F.; Squartini, S.; and Schuller, B. 2015a. A novel approach for automatic acoustic novelty detection using a denoising autoencoder with bidirectional LSTM neural networks. In Proc. ICASSP, 1996–2000.

[Marchi et al. 2015b] Marchi, E.; Vesperini, F.; Weninger, F.; Eyben, F.; Squartini, S.; and Schuller, B. 2015b. Non-linear prediction with LSTM recurrent neural networks for acoustic novelty detection. In Proc. IJCNN, 1–7.

[Markou and Singh 2003] Markou, M., and Singh, S. 2003. Novelty detection: a review, part 2: neural network based approaches. Signal Processing.

[Pedregosa et al. 2011] Pedregosa, F., et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research.

[Ryan, Lin, and Miikkulainen 1998] Ryan, J.; Lin, M.-J.; and Miikkulainen, R. 1998. Intrusion detection with neural networks. In Advances in Neural Information Processing Systems.

[Schölkopf et al. 2001] Schölkopf, B.; Platt, J. C.; Shawe-Taylor, J.; Smola, A. J.; and Williamson, R. C. 2001. Estimating the support of a high-dimensional distribution. Neural Computation.

[Shyu et al. 2003] Shyu, M.-L.; Chen, S.-C.; Sarinnapakorn, K.; and Chang, L. 2003. A novel anomaly detection scheme based on principal component classifier. In Proc. ICDM.

[Sommer and Paxson 2010] Sommer, R., and Paxson, V. 2010. Outside the closed world: On using machine learning for network intrusion detection. In