STAN: Synthetic Network Traffic Generation using Autoregressive Neural Models
Shengzhe Xu, Manish Marwah, Naren Ramakrishnan
{[email protected], [email protected], [email protected]}
Department of Computer Science, Virginia Tech, Arlington, USA
Micro Focus, CA, USA
September 2020
Abstract
Deep learning models have achieved great success in recent years. However, large amounts of data are typically required to train such models. While some types of data, such as images, videos, and text, are easier to find, data in certain domains is difficult to obtain. For instance, cybersecurity applications routinely use network traffic data which organizations are reluctant to share, even internally, due to privacy reasons. An alternative is to use synthetically generated data; however, most existing data generating methods lack the ability to capture the complex dependency structures that are usually prevalent in real data, by assuming independence either temporally or between attributes. This paper presents our approach, called
STAN, Synthetic Network Traffic Generation using Autoregressive Neural models, to generate realistic synthetic network traffic data. Our novel autoregressive neural architecture captures both temporal dependence and dependence between attributes at any given time. It integrates convolutional neural layers (CNN) with mixture density layers (MDN) and softmax layers to model both continuous and discrete variables. We evaluate the performance of
STAN by training it on both a simulated data set and a real network traffic data set. Multiple metrics are used to compare the generated data with real data and with data generated via several baseline methods. Finally, to answer the question – can real network traffic data be substituted with synthetic data to train models of comparable accuracy? – we consider two commonly used models for anomaly detection in such data, and compare F1/MSE measures of models trained on real data and on increasing proportions of generated data. The results show only a small decline in the accuracy of models trained solely on synthetic data.

Introduction
Cybersecurity has become a key concern for both private and public organizations, given the prevalence of cyber-threats and attacks. In fact, malicious cyber-activity cost the U.S. economy between $57 billion and $109 billion in 2016 [20], and worldwide yearly spending on cybersecurity reached $1.5 trillion in 2018 [17].

To gain insights into and counter cybersecurity threats, organizations need to sift through large amounts of network, host, and application data typically produced in an organization. Manual inspection of such data by security analysts to discover attacks is impractical due to its sheer volume; e.g., even a medium-sized enterprise can produce terabytes of network traffic data in a few hours. Automating the process through the use of machine learning tools is the only viable alternative. Recently, deep learning models have been successfully used for cybersecurity applications [4, 11], and given the large quantities of available data, deep learning methods appear to be a good fit.

However, although large amounts of data are apparently available for cybersecurity machine learning applications, the data is sensitive in nature and access to it can result in privacy violations; e.g., network traffic logs can reveal the web browsing behavior of users. Thus it is difficult to obtain such data to train models, even internally within an organization. To get around data privacy issues, there are three main approaches [1, 2]: 1) anonymization; 2) cryptographic methods; and 3) perturbation methods, such as differential privacy. However, 1) leaks private information in most cases, 2) is usually impractical for large data sets, and 3) degrades data quality, making it less suitable for machine learning tasks.

In this paper, we take an orthogonal approach by generating synthetic data that is realistic enough to replace real data in machine learning tasks. Specifically, we consider multivariate time-series data and, unlike prior work, capture both temporal dependence and attribute dependence.
Figure 1 illustrates our approach, called STAN: given real historical data, phase 1 trains a CNN-based autoregressive generative neural network that learns the joint distribution of the data. After the model is trained, phase 2 uses it to synthesize any amount of synthetic data following the joint distribution of the real data, without revealing any private information. Phase 3 applies the synthetic data in place of real data in machine learning tasks, where model performance is comparable to that of the model trained on real data. (Here, performance refers to a model evaluation metric such as precision, recall, F1-score, or mean squared error.)

To evaluate the performance of STAN, we use a real, publicly available network traffic data set. We compare our method with four selected baselines, using several metrics to evaluate the generated data. Finally, we compare the methods on two machine learning tasks – a classification task and a regression task used for detecting cybersecurity anomalies – trained on both real and synthetic data. We show comparable model performance after entirely substituting the real training data with our synthetic data: the F-1 score of the
classification task only drops by 4% (78% to 75%), while the mean squared error only increases by about 13% for the regression task.

Figure 1: STAN consists of three phases: Phase 1 learns a generative model from a given real training data set D_Historical; Phase 2 uses the trained model to sequentially generate synthetic data, D_Synth; Phase 3 uses the generated data, D_Synth, in place of real data, D_Real, to train machine learning models.

In summary, this paper makes the following key contributions:

• We designed and prototyped STAN, a novel tool that learns the joint distribution of multivariate time-series data – data typically used in cybersecurity applications – and then generates synthetic data from the learned distribution. Unlike prior work, STAN learns both temporal and attribute dependence. Our code is publicly available.
• STAN integrates convolutional neural layers (CNN) with mixture density layers (MDN) and softmax layers to model both continuous and discrete variables.
• We evaluated STAN on both simulated data and a real, publicly available network traffic data set, and compared it with four baselines.
• We built models for two cybersecurity machine learning tasks and showed that, when using only STAN-generated data for training, the performance of the models is comparable to using real data.
Related Work

Machine learning for cybersecurity
Over the past decades, machine learning has been applied to multiple tasks in cybersecurity, such as automatically detecting malicious activity and stopping attacks [6, 7]. Such machine learning approaches usually require a large amount of training data with specific features. However, training models using real user data leads to privacy exposure and ethics problems. Previous work on anonymizing real data [15] has either failed to provide satisfactory privacy protection, or degrades data quality too much for machine learning tasks.

Code: https://github.com/an-anonymous-repo/ANDS.git

Synthetic data generation and GAN models
Generating synthetic data to make up for the lack of real data is a common solution. Compared to modeling image data [13], learning distributions over multivariate time-series data poses more challenges. Multivariate data takes many forms in the real world, so the data usually has more complex dependencies (temporal and spatial) as well as heterogeneous attribute types (continuous or discrete).

Synthetic data generation models often treat each column as a random variable to model joint multivariate probability distributions. The modeled distribution is then used for sampling. Traditional modeling algorithms [3, 9, 18] are restricted in the distribution and data types they support, and due to computational issues, the dependability of the synthetic data generated by these models is extremely limited. Recently, GAN-based approaches have improved the performance and flexibility of data generation [14, 21]. However, they are still restricted to a static dependency structure, without considering the temporal dependence usually prevalent in real-world data. We are not aware of any prior work that models both temporal and between-attribute dependencies.
Autoregressive generative models [13, 19] have been successfully applied to signal data, image data, and natural language data. They generate data elements iteratively: previously generated elements are used as the input condition for generating the following data. Compared to GAN models, autoregressive models emphasize two factors during distribution estimation: 1) the importance of the time-sequential factor; and 2) an explicit and tractable density. In this paper, we apply the autoregressive idea to learn and generate multivariate time-series data.
Mixture density networks
Unlike discrete attributes, some continuous numeric attributes are relatively sparse and span a large value range. A Mixture Density Network [5] is a neural network architecture that learns a Gaussian mixture model (GMM) to predict continuous attribute distributions. This architecture makes it possible to integrate a GMM into a complex neural network architecture.
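To make the mixture-density idea concrete, here is a minimal NumPy sketch of an MDN head and its negative log-likelihood. The feature vector `z`, the weight matrices, and their shapes are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def mdn_params(z, W_alpha, W_mu, W_sigma):
    """Three parallel linear heads map features z to GMM parameters:
    mixture weights alpha (softmax), means mu, and standard
    deviations sigma (exp keeps them strictly positive)."""
    logits = z @ W_alpha
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                       # weights sum to one
    mu = z @ W_mu
    sigma = np.exp(z @ W_sigma)
    return alpha, mu, sigma

def gmm_nll(x, alpha, mu, sigma):
    """NLL(x) = -log sum_i alpha_i * N(x | mu_i, sigma_i)."""
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(np.sum(alpha * dens))

rng = np.random.default_rng(0)
z = rng.normal(size=8)                         # stand-in for learned features
W_alpha, W_mu, W_sigma = (0.1 * rng.normal(size=(8, 3)) for _ in range(3))
alpha, mu, sigma = mdn_params(z, W_alpha, W_mu, W_sigma)
loss = gmm_nll(0.5, alpha, mu, sigma)
```

The softmax on the weights and the exponential on the deviations guarantee a valid mixture for any feature vector, which is what lets the GMM sit at the end of a larger network.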
Problem Formulation

We assume the data to be generated is a multivariate time-series. Specifically, data set x contains n rows and m columns. Each row x(i,:) is an observation at time point i and each column x(:,j) is a random variable j, where i ∈ 1..n and j ∈ 1..m. Unlike typical tabular data, e.g., that found in relational database tables, and unstructured data, e.g., images, multivariate time-series data poses two main challenges: 1) the rows are generated by an underlying temporal process and are thus not independent, unlike tabular data; 2) the columns or attributes are not necessarily homogeneous, and comprise multiple data types such as numerical, categorical, or continuous, unlike, say, images.

Figure 2: STAN components. Left: the window CNN, which crops the context based on a sliding window and extracts features from the context. Middle: mixture density layers and softmax layers learn to predict the distributions of the various types of attributes. Right: the loss functions for the different kinds of layers.

The data x follows an unknown, high-dimensional joint distribution P(x), which is infeasible to estimate directly. The goal is to estimate P(x) by a generative model S which retains the dependency structure across rows and columns. Values in a column typically depend on other columns, and the temporal dependence of a row can extend to tens of prior rows. Once model S is trained, it can be used to generate an arbitrary amount of data, D_syn.

Another key challenge is evaluating the quality of the generated data, D_syn. Assuming a data set, D_historical, is used to train S, and an unseen test data set, D_test, is used to evaluate the performance of S, we use two criteria to compare D_syn with D_test: 1) similarity between a metric M evaluated on the two data sets, that is, M(D_test) ≈ M(D_syn); and 2) similarity in performance P when training the same machine learning task T with the real data, D_test, replaced by the synthetic data, D_syn, that is, P[T(D_test)] ≈ P[T(D_syn)].
We model the joint data distribution, P(x), using an autoregressive neural network. The model architecture, shown in Figure 2, combines CNN layers with a density mixture network [5]. The CNN captures temporal and spatial (between-attribute) dependencies, while the density mixture layer uses the learned representation to model the joint distribution. During the training phase, for each row, STAN takes the data window prior to it as input. Given this context, the network learns the conditional distribution for each attribute. Both continuous and discrete attributes can be modeled: a density mixture layer is used for continuous attributes, while a softmax layer is used for discrete attributes.

In the synthesis phase,
STAN sequentially generates each attribute in each row. Every generated attribute in a row, having been sampled from a conditional distribution over the prior context, serves as part of the next attribute's context.

P(x) denotes the joint probability of data x composed of n rows and m attributes. We can expand the data as a one-dimensional sequence x_1, ..., x_n, where each vector x_i represents one row comprising the m attributes x_{i,1}, ..., x_{i,m}. To estimate the joint distribution P(x), we write it as the product of conditional distributions over the rows. We start from the joint distribution factorization with no assumptions:

P(x) = ∏_{i=1}^{n} P(x_i | x_1, ..., x_{i−1})    (1)

Unlike unstructured data such as images, multivariate time-series data usually corresponds to underlying continuous processes in the real world and does not have exact starting and ending points. It is impractical to make a row probability P(x_i) depend on all prior rows as in Equation 1. Thus, a k-sized sliding window is utilized to restrict the context to only the k most recent rows. In other words, a row conditioned on the past k rows is independent of all remaining prior rows; that is, for i > k, we assume independence between x_i and x_1, ..., x_{i−k−1}.
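The windowed autoregressive factorization implies a simple sequential sampling loop: sample a row conditioned on the last k rows, then slide the window. A minimal NumPy sketch follows; `sample_next_row` is a hypothetical placeholder, standing in for STAN's learned conditional sampler.

```python
import numpy as np

rng = np.random.default_rng(1)
k, m = 5, 3                                    # window size, attribute count

def sample_next_row(window):
    """Hypothetical stand-in for STAN's conditional sampler; the real
    model draws each attribute from an MDN or softmax head conditioned
    on this k-row context."""
    return window.mean(axis=0) + 0.1 * rng.normal(size=window.shape[1])

window = rng.normal(size=(k, m))               # seed context (marginal sampling)
rows = []
for _ in range(100):                           # generate 100 synthetic rows
    new_row = sample_next_row(window)
    rows.append(new_row)
    window = np.vstack([window[1:], new_row])  # slide: drop the oldest row

D_synth = np.array(rows)
```

The only state carried between steps is the k-row window, which is what makes generating an arbitrary amount of data cheap once the model is trained.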
Algorithm 1: Model training process for each attribute j
Input: D_Historical, window size k, attribute type T_j
Output: STAN model S_stan
Construct window data: X_window_i = concatenate(X_{i−k}, ..., X_{i−1}); y_window_i = X_i
for epoch in 1 ... EPOCH do
    X_window_i *= Mask
    if T_j is continuous then
        P_gmm_pred = mdn(wcnn(X_window)); loss = NLL(P_gmm_pred, y_window)
    else
        P_softmax_pred = softmax(wcnn(X_window)); loss = cross_entropy(P_softmax_pred, y_window)
    end if
    Update S_stan with loss
end for

Algorithm 2: Data synthesis process
Input: trained STAN model S_stan
Output: D_Synth
Initialize context X_window by marginal sampling
while condition (target row number or time stamp) do
    X_window_i *= Mask
    P_pred = S_stan(X_window_i)
    y_sample = sample from distribution P_pred
    X_window_{i+1} = concatenate(X_window_i[1:, :], y_sample)    // slide window: drop oldest row, append new row
end while

Window convolutional layers (wcnn). The CNN layers, which we call window CNN since they operate on a sliding window of data, perform a two-dimensional convolution. For one row x_i, the layers capture a rectangular context above the row, as shown in Figure 2. STAN uses multiple convolutional layers that preserve the spatial and temporal resolution in the sliding time-window box; each number in Figure 2 denotes the number of 3×3 convolution filters, each followed by BN, ReLU, and M (masking), respectively.

Convolution mask
Based on which factorization is selected, we use mask A for Equation 4 or mask B for Equation 5.

Figure 3: Masks for the context-window convolution. (a) Mask A, for the conditional independence assumption between attributes in the same row. (b) Mask B, for no conditional independence assumption within the same row.
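Under this window semantics, the two masks can be sketched as boolean visibility arrays over a (k+1)-row context whose last row is the one being generated. The layout below is an assumption consistent with Figure 3, not the paper's exact implementation.

```python
import numpy as np

def context_mask(k, m, j, same_row=False):
    """Boolean visibility mask over a (k+1) x m context whose last row
    is the row being generated. Mask A (same_row=False) exposes only
    the k prior rows; mask B additionally exposes attributes 0..j-1 of
    the current row, matching the no-independence factorization."""
    mask = np.zeros((k + 1, m), dtype=bool)
    mask[:k, :] = True                         # the k-row history is visible
    if same_row:
        mask[k, :j] = True                     # earlier attributes, same row
    return mask

mask_a = context_mask(k=4, m=5, j=2, same_row=False)
mask_b = context_mask(k=4, m=5, j=2, same_row=True)
```

Multiplying the context by the mask before convolution zeroes out exactly the entries the chosen factorization forbids the model from seeing.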
Mixture density layer (mdn). This layer learns a conditional Gaussian mixture distribution. It consists of three parallel fully connected layers, modeling α_i, σ_i, and µ_i separately, where the parameter α_i represents the component weights of a Gaussian mixture model, and µ_i and σ_i are the mean and variance parameters of the Gaussian distribution components. The α_i outputs pass through a softmax, so that the weights of all the Gaussian mixture components sum to one.

Loss functions
We define loss functions for the mixture density layers and softmax layers separately. Note that the two losses have different scales, and while multitask learning has its advantages, we match each mixture density component or softmax component with an individual wcnn component.

Negative log-likelihood (NLL) loss is used for the mixture density layers, which predict a group of mixture density parameters α_i, σ_i, µ_i that compose a Gaussian mixture model, as in Equation 6. We use maximum likelihood to estimate the true distribution: the label of the input, i.e., the new row to be generated, should have the highest probability under the estimated distribution. Cross-entropy loss is used for the softmax layers.

NLL(x | α, µ, σ) = − log Σ_i α_i · N(x | µ_i, σ_i)    (6)

Baselines

We selected four different methods to serve as baselines for our method. These range from a basic Gaussian mixture model and a Bayesian network to two recent deep learning approaches that use GANs for synthetic data generation; for brevity, we refer to them as B1, B2, B3, and B4, respectively. We compare
STAN with these baselines and analyze the distribution factorization.
Gaussian Mixture (B1)
This assumes all attributes at a particular time step are independent of each other, and further that each row is independent. Thus the distribution can be factorized as follows:

P(x) = ∏_{i=1}^{n} P(x_i)    (7a)
P(x_i) = ∏_{j=1}^{m} P(x_{i,j})    (7b)

Bayesian Network (B2)
As a traditional statistical approach, a Bayesian network can learn limited temporal or attribute dependence based on domain knowledge from experts. For example, if x_{i,j1} depends on x_{i−1,j1} and x_{i,j2}, we can write the joint as a product of conditional distributions (Equation 8). The value P(x_{i,j1} | x_{i−1,j1}, x_{i,j2}) is the probability of the j1-th attribute of the i-th observation row, given the (i−1)-th row's j1-th attribute and the i-th row's j2-th attribute. Accounting for the boundary case and applying Bayes' rule (up to a normalizing constant), we rewrite the distribution as:

P(x) = ∏_{i=1}^{n} [ P(x_{i,j1} | x_{i,j2}, x_{i−1,j1}) · ∏_{j=1, j≠j1}^{m} P(x_{i,j}) ]
     = P(x_1) · ∏_{i=2}^{n} [ P(x_{i,j1}) · P(x_{i−1,j1} | x_{i,j1}) · P(x_{i,j2} | x_{i,j1}) ] · ∏_{j=1, j≠j1}^{m} P(x_{i,j})    (8)

WP-GAN (B3) [16] uses a GAN to specifically generate network traffic flow data, while
CTGAN (B4) [21] uses a GAN to generate general tabular data containing both discrete and continuous attributes. Both B3 and B4 assume attribute dependence at a given time step but ignore temporal dependence. Thus the joint distribution can be factorized only as in Equation 7a, while the factorization inside each row is intractable due to the GAN mechanism.
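For concreteness, B1's fully independent factorization (Equation 7) can be sketched as follows. For brevity, this sketch fits a single Gaussian per attribute rather than the full mixture B1 uses, but the factorization, and its blind spot, are the same.

```python
import numpy as np

rng = np.random.default_rng(2)
D_hist = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(1000, 2))

# Equation 7b: fit each attribute independently (a single Gaussian here,
# where B1 proper fits a Gaussian mixture per attribute).
mu = D_hist.mean(axis=0)
sd = D_hist.std(axis=0)

# Equation 7a: sample rows independently -- neither temporal dependence
# nor between-attribute dependence survives this factorization.
D_syn = rng.normal(loc=mu, scale=sd, size=(1000, 2))
```

The sampled rows reproduce each attribute's marginal but are i.i.d. across time and across columns, which is exactly the limitation the paper's experiments expose.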
The evaluation of generative models is challenging and subjective. We use multiple metrics to compare them: likelihood, distribution evaluation, domain knowledge rule tests, and performance comparison on machine learning tasks.
Likelihood fitness
The likelihood function measures the goodness of fit of a statistical model to a data sample. However, the intrinsic difference between explicit density methods (B1, B2, and STAN) and implicit density methods (B3 and B4) makes it challenging to compare them. [8] also claims that there is no fair way to directly compare the likelihood of GAN models. Thus, in this paper, we only compare the likelihood between the explicit density models: B1, B2, and STAN.

Distribution and JS divergence
Although the goal of our work is to model the joint distribution of a window of data, we also compare the marginal distributions of the individual attributes. As a quantitative metric, we calculate the Jensen-Shannon divergence between the distributions of the generated data D_syn and the real data D_test for each attribute.

Domain knowledge test
We use domain knowledge checks to evaluate the synthetic data quality. Since the application data set pertains to network traffic flow, we use several properties that such data needs to satisfy in order to be realistic [16].
Machine learning application task
The final goal of generating synthetic data is to build machine learning models without using any real data. To evaluate whether the generated data is able to replace real data in a model training process, we select two tasks that are used in cybersecurity anomaly detection. One is a classification task, the other regression; both are self-supervised.

The first task is predicting the protocol field in the network traffic data, while the second task is predicting the number-of-bytes field. In practice, once trained, these models are used to mark anomalies when the predicted value significantly differs from the actual one. We train a RandomForest model for the classification task and a neural network model for the regression task. For both tasks, we compare the cross-validation performance of models trained on real and on synthetically generated data.
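The substitution protocol used in the experiments, replacing a growing fraction of real training data with synthetic data while keeping the training-set size fixed, can be sketched as follows. The toy data and nearest-centroid classifier are stand-ins for the netflow features and the paper's RandomForest model.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    """Toy two-class stand-in for the protocol-prediction task."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 4)) + y[:, None]
    return X, y

def accuracy(Xtr, ytr, Xte, yte):
    """Nearest-centroid classifier as a lightweight stand-in for the
    paper's RandomForest model."""
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) <
            np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return float((pred == yte).mean())

X_real, y_real = make_data(1000)
X_syn, y_syn = make_data(1000)                 # stand-in for generated data
X_te, y_te = make_data(500)

results = {}
for frac_syn in (0.0, 0.5, 1.0):
    n_syn = int(frac_syn * len(X_real))        # rows to substitute
    Xtr = np.vstack([X_real[n_syn:], X_syn[:n_syn]])
    ytr = np.concatenate([y_real[n_syn:], y_syn[:n_syn]])
    results[frac_syn] = accuracy(Xtr, ytr, X_te, y_te)
```

Holding the total training size constant isolates the effect of the data source from the effect of simply having less data.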
Experiments

To demonstrate its effectiveness, we train and evaluate STAN on a real network traffic data set. However, to first experiment with some architectural variations, we use a simple simulated data set.
Simulated data

We built a simulated data set using a simple random process whose dependence can be explicitly controlled. We simulated a two-variable data distribution and sampled a 10,000-point data set (X, Y) from it: each x_i is a weighted combination of x_{i−1} and standard normal noise N, and each y_i is a weighted combination of x_i and N.

We applied a naive version of STAN, which passes the input directly to the mixture density layers, and B1 on the simulated data set. We evaluated the correlation coefficient R for both the temporal dependence, R(X_i, X_{i−1}), and the attribute dependence, R(X_i, Y_i). Figure 4 presents the (x_i, y_i) scatter plots from four data sources (the raw simulated data and the synthetic data).

Figure 4: (X_i, Y_i) scatter plots, with correlation coefficients R, for (a) the simulated raw data, (b) B1 synthesized data, (c) STAN mask A synthesized data, and (d) STAN mask B synthesized data.

Observation 1: Same-row attribute conditional independence provides a reasonable approximation.
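A minimal sketch of such a controlled-dependence process and its two correlation measurements; the coefficients 0.8 and 0.6 are illustrative assumptions, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
a, b = 0.8, 0.6      # illustrative weights -- assumptions, not the paper's
x = np.zeros(n)
y = np.zeros(n)
for i in range(1, n):
    x[i] = a * x[i - 1] + (1 - a) * rng.normal()   # temporal dependence
    y[i] = b * x[i] + (1 - b) * rng.normal()       # attribute dependence

r_temporal = np.corrcoef(x[1:], x[:-1])[0, 1]      # R(X_i, X_{i-1})
r_attr = np.corrcoef(x, y)[0, 1]                   # R(X_i, Y_i)
```

Because both dependencies are dialed in by hand, a generator's output can be checked directly against the known correlation structure.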
Data set
Network traffic data is typically a multivariate time-series. A common format is called netflow, where each row represents a unidirectional network traffic connection, or flow. We selected a netflow data set for our experiments because a large data set was publicly available, and further because it is a good representative format for network traffic data in general. Typically, each row consists of the following attributes: timestamp at the end of a flow (te), duration of the flow (td), packets exchanged in the flow (pkt), the corresponding number of bytes (byt), source IP address (sa), destination IP address (da), and protocol (pr). So one row x_i can be expressed as a tuple (te_i, byt_i, sa_i, da_i, pr_i, etc.). Table 1 shows typical attributes, their types, and examples.

We apply STAN to a publicly available benchmark netflow data set, UGR'16 [12], which contains large-scale traffic data captured by a Tier-3 ISP cloud service provider. First, we randomly select the April week-3 data to focus on. Second, we randomly select 90 users based on the per-user distribution of the number of traffic flows. Third, we extract one day's (Monday) data to be D_historical and another day (Tuesday) of the same user group and the same week to be D_test. Lastly, from a cybersecurity perspective, we are most interested in users with traffic between an organization and external IP addresses, rather than traffic within an organization. Following this strategy, we selected 1,531,126 samples for D_historical and 1,952,702 samples for D_test.

Attribute            Type         Example
timestamp            continuous   2016-04-11 00:02:15
duration             continuous   0.344
transport protocol   categorical  TCP
source IP address    categorical  85.201.196.53
source port          categorical  19925
dest. IP address     categorical  42.219.145.151
dest. port           categorical  80
bytes                numeric      11238
packets              numeric      11

Table 1: Overview of typical attributes in flow-based data.

Pre-processing
To ensure the trained model is a practical and robust tool for synthesizing network traffic flow data, we normalize the raw netflow data so that the neural network can process it; the predicted variable values can likewise be mapped back to their original form. Since this is regular data pre-processing, we provide the details in the supplemental material.
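Since the details are in the supplemental material, here is only a generic sketch of the kind of invertible pre-processing described: log-scaling a heavy-tailed numeric attribute and index-encoding a categorical one. The specific transforms are assumptions, not the paper's exact recipe.

```python
import numpy as np

# Heavy-tailed numeric attributes (e.g. byt) are compressed with an
# invertible log transform; categorical attributes (e.g. pr) are mapped
# to integer indices and back.
byt = np.array([64.0, 1500.0, 11238.0, 2_000_000.0])
byt_norm = np.log1p(byt)
byt_back = np.expm1(byt_norm)                  # inverse transform

protocols = ["TCP", "UDP", "ICMP"]
pr_to_idx = {p: i for i, p in enumerate(protocols)}
idx_to_pr = {i: p for p, i in pr_to_idx.items()}
pr_encoded = [pr_to_idx[p] for p in ["TCP", "UDP", "TCP"]]
pr_decoded = [idx_to_pr[i] for i in pr_encoded]
```

Invertibility matters because sampled values must be mapped back to valid netflow records, not left in the network's normalized space.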
Likelihood
For each data point (each row), we can directly calculate the row likelihood via the factorizations in Equations 4, 5, and 7b. Taking the attribute byt as an example, the negative log-likelihoods evaluated on the UGR16 validation set for B1, B2, and STAN are 4.85, 3.90, and 2.34, respectively. The per-attribute likelihood tables in the appendix show that STAN is the best over all the comparable attributes, both continuous and discrete.
Distribution and JS divergence
Figure 5 shows the per-attribute JS divergence of the marginal distributions of both the continuous and discrete attributes. STAN captures the marginal distribution well for most attributes. Even though B1 precisely models the marginal distribution of the training data set, it does not perform as well as STAN on the test data set. We believe this is because the marginal distribution over a day is non-stationary.
Observation 2:
STAN models the marginal distribution better than baseline B1.
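The per-attribute JS-divergence comparison can be sketched as follows, computing the divergence (in nats, bounded by ln 2) between histograms of an attribute in the real and synthetic data:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions,
    e.g. histograms of one attribute in D_test and D_syn."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)
    def kl(a, b):
        return float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(5)
real = rng.normal(0.0, 1.0, size=10_000)
synth = rng.normal(0.1, 1.0, size=10_000)      # slightly shifted marginal
bins = np.linspace(-5, 5, 51)
h_real, _ = np.histogram(real, bins=bins)
h_syn, _ = np.histogram(synth, bins=bins)
jsd = js_divergence(h_real, h_syn)
```

Unlike KL divergence, JS is symmetric and always finite, so it behaves well when one histogram has empty bins, which is common for sparse netflow attributes.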
Domain knowledge test
We employ the domain tests developed by [16] for netflow data. These are rules that need to be satisfied by generated flow-based network data. We highlight three tests here, summarized in Table 2.
STAN performs well in all three:

• Test 1: The selected UGR16 data set is captured by an ISP. Therefore, at least one IP address (source or destination) of each flow must belong to the ISP (starting with 42.219.XXX.XXX).
• Test 2: If the flow describes normal user behavior and the source port or destination port is 80 (HTTP) or 443 (HTTPS), the transport protocol must be TCP.
• Test 3: TCP and UDP packets have a minimum and maximum packet size. Therefore, we check the relationship between bytes and packets in each flow, requiring 42 · packets ≤ bytes, together with the corresponding upper bound.

Figure 5: JS divergence between the attribute marginal distributions of real and synthetic data.

         Test 1   Test 2   Test 3
STAN     100      99       81

Table 2: Passing percentage of domain knowledge tests.
Real application tasks
Finally, we test our synthetic data on two cybersecurity machine learning applications – one a classification task, the other regression. The goal is to determine whether it is possible to fully substitute real data with synthetic data for training machine learning models.

A series of models is trained on real test data. We start training with the complete D_test (real data) and successively decrease the amount of real data until no data from D_test is used. Another series of models is trained similarly; however, instead of simply removing a certain amount of data from D_test, we substitute the indicated amount with our synthetic data D_syn, so that the total amount of training data is unchanged. In the following two tasks, we use D_test, which is unseen and never used in the synthesizer training process. For the synthetic data D_synth, every synthesizer model generates five sets of synthetic data samples, so we can compute error bars. Five-fold cross-validation is used to get a robust estimate of the measurements.
Fig. 6 shows the F-1 scores achieved byRandom Forest models. There are six sets of models. ’Real-Data’: these arerandom forest models trained by reducing the real data; ’stan’: these are randomforest models trained by reducing the real data, but substituting the reduceddata by synthetic data generated by
STAN ; ’B1’ through ’B4’: these are similarto the ’stan’ models but obtained by substituting the reduced data by the fourbaselines respectively. The x-axis represents how much real data is used from100% down to 0%.If we only use real data, the F1 score drops from 0.78 down to 0.6 as theamount of data decreases. Clearly, with no real data, we are unable to train amodel. When we substitute real data with that generated by the baselines, theperformance drops even quicker, because they do a poor job of capturing thetemporal and attribute dependence. Even in the absence of any real data, datagenerated by
STAN results in an F1 score of 0.75, where the drop in performanceis only 4%. That is, the model built with only synthetic data retains 96% ofthe performance of the all real data trained model.Figure 6: F1-score of Protocol Forcasting Task14igure 7: Mean Square Error of byt
Value Forcasting Task
Task 2: byt value forecasting

This task follows a similar experimental setup to Task 1. Fig. 7 shows the mean squared error achieved by a neural network regression model. The plot shows that STAN and the Bayesian network (B2) outperform the other three baseline models. A Bayesian network built with domain knowledge typically performs better than GANs [21].
Observation 3:
Compared to B2, STAN achieves the same performance even without domain knowledge.
In our experiments, B2 is optimized specifically for the byt sequential values. However, STAN has two advantages over the Bayesian network. First, users do not need the domain knowledge required for a Bayesian network implementation. Second, there is no inherent expert bias, unlike in traditional Bayesian networks. Similar to the first task, the penalty for using only STAN-generated data (with no real data) is low: an increase of 13% in the mean squared error.
Observation 4:
Even with 0% real data, STAN models Task 1 and Task 2 with only a small drop in accuracy.
Conclusion

This paper presents the design and implementation of
STAN, a novel, flexible, and robust approach to learning the distribution of complex multivariate time-series data. Compared to existing approaches, STAN is novel in several aspects. First, STAN learns the joint distribution over both temporal dependence and attribute dependence. Second, STAN is flexible enough to generate data with any combination of continuous and discrete attributes. Furthermore, we perform a thorough evaluation of STAN, comparing it against four baselines using several performance measures as well as two cybersecurity machine learning tasks.

Our future work includes techniques to (1) build a complete system for learning and generating network traffic data, (2) explore the best update rate for re-learning the data synthesizer on the historical data D_historical, and (3) conduct more semantic and statistical checks on the fungibility of synthetic data with real data.

References

[1] Charu C. Aggarwal and Philip S. Yu.
A General Survey of Privacy-Preserving Data Mining Models and Algorithms, pages 11–52. Springer US, Boston, MA, 2008.
[2] Mohammad Al-Rubaie and J. Morris Chang. Privacy-preserving machine learning: Threats and solutions.
IEEE Security & Privacy, 17(2):49–58, 2019.
[3] Laura Aviñó, Matteo Ruffini, and Ricard Gavaldà. Generating synthetic but plausible healthcare record datasets. arXiv preprint arXiv:1807.01514, 2018.
[4] Daniel S. Berman, Anna L. Buczak, Jeffrey S. Chavis, and Cherita L. Corbett. A survey of deep learning methods for cyber security.
Information, 10(4):122, 2019.
[5] Christopher M. Bishop. Mixture density networks. 1994.
[6] Anna L. Buczak and Erhan Guven. A survey of data mining and machine learning methods for cyber security intrusion detection.
IEEE Communications Surveys & Tutorials, 18(2):1153–1176, 2015.
[7] Carlos A. Catania and Carlos García Garino. Automatic network intrusion detection: Current techniques and open issues.
Computers & Electrical Engineering, 38(5):1062–1072, 2012.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In
Advances in Neural Information Processing Systems, pages 2672–2680, Cambridge, MA, USA, 2014. MIT Press.
[9] Graham Cormode. Differentially private spatial decompositions. In
Data Engineering (ICDE), 2012 IEEE 28th International Conference, 2012.
[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[11] Donghwoon Kwon, Hyunjoo Kim, Jinoh Kim, Sang C. Suh, Ikkyun Kim, and Kuinam J. Kim. A survey of deep learning-based network anomaly detection.
Cluster Computing , pages 1–13, 2017.[12] Gabriel Maci´a-Fern´andez, Jos´e Camacho, Roberto Mag´an-Carri´on, PedroGarc´ıa-Teodoro, and Roberto Ther´on. Ugr ’16: A new dataset for theevaluation of cyclostationarity-based network idss.
Computers & Security ,73:411–424, 2018.[13] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixelrecurrent neural networks. arXiv preprint arXiv:1601.06759 , 2016.[14] Noseong Park, Mahmoud Mohammadi, Kshitij Gorde, Sushil Jajodia,Hongkyu Park, and Youngmin Kim. Data synthesis based on generativeadversarial networks.
Proceedings of the VLDB Endowment , 11(10):1071–1083, 2018.[15] Shukor Razak, Nur Hafizah, and Arafat Al-Dhaqm. Data anonymizationusing pseudonym system to preserve data privacy.
IEEE Access , 2020.[16] Markus Ring, Daniel Schl¨or, Dieter Landes, and Andreas Hotho. Flow-based network traffic generation using generative adversarial networks.
Computers & Security
Proceedings of theAAAI Conference on Artificial Intelligence , volume 33, pages 5049–5057,2019.[19] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals,Alex Graves, et al. Conditional image generation with pixelcnn decoders.In
Advances in neural information processing systems , pages 4790–4798,2016.[20] WhiteHouse. The cost of malicious cyber activity to the u.s. economy, 2018.[21] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veera-machaneni. Modeling tabular data using conditional gan. In
Advances inNeural Information Processing Systems , pages 7333–7343, 2019.17
Appendix: Simulated data
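The dependence measures analyzed below are Pearson correlation coefficients. As context, a minimal NumPy sketch (with illustrative series, not the paper's simulated data) of how R(X_i, X_{i-1}) and R(X_i, Y_i) can be computed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative series: x is strongly autocorrelated, y depends on x.
n = 10_000
x = np.empty(n)
x[0] = 0.0
for i in range(1, n):
    x[i] = 0.95 * x[i - 1] + rng.normal(0, 0.1)
y = 2.0 * x + rng.normal(0, 0.1, n)

# Temporal dependence: correlation of consecutive values X_i and X_{i-1}.
r_temporal = np.corrcoef(x[1:], x[:-1])[0, 1]
# Attribute dependence: correlation of same-row attributes X_i and Y_i.
r_attribute = np.corrcoef(x, y)[0, 1]

print(round(r_temporal, 2), round(r_attribute, 2))  # both close to 1 here
```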
We evaluated the correlation coefficient R for both the temporal dependence R(X_i, X_{i-1}) and the attribute dependence R(X_i, Y_i). Figure 8 presents scatter plots of X_i against X_{i-1} for four data sources (simulated data, B1 synthetic data, STAN with mask A synthetic data, and STAN with mask B synthetic data), and Figure 9 presents the corresponding plots for X_i and Y_i. Since mask A and mask B represent conditional independence and explicit dependence respectively, we summarize the observations as follows:

• Both conditional independence and explicit dependence provide a reasonable approximation of the temporal dependence. The R(X_i, X_{i-1}) of the simulated data, STAN mask A, and STAN mask B are 0.9.

• Conditional independence provides a reasonable approximation of same-row attribute dependence, while explicit dependence performs better. The R(X_i, Y_i) of the simulated data and STAN mask B are 0.9, while that of
STAN mask A is 0.7.

Figure 8: (X_i, X_{i-1}) scatter plots of the simulated data and the synthetic data, with correlation coefficients R. Panels: (a) simulated raw data, (b) B1 synthesized data, (c) STAN mask A synthesized data, (d) STAN mask B synthesized data.

Figure 9: (X_i, Y_i) scatter plots of the simulated data and the synthetic data, with correlation coefficients R. Panels as in Figure 8.

We also used the simulated data and the corresponding synthetic data to train two machine learning tasks (using the scikit-learn Python library). In these experiments, we trained models on the different synthetic data sets or on the simulated data, and tested model performance (mean squared error) on the simulated test data.

• T1: predict y_i given x_i (row attribute dependence).

• T2: predict x_{i+1} given x_i (temporal dependence).

Table 3 shows that a machine learning model trained only on synthetic data generated by STAN produces a test loss similar to that of a model trained on the (real) simulated data.

Training Data     MSE(T1)   MSE(T2)
Simulated data    0.010     0.01
B1                0.050     0.05
STAN mask A       0.013     0.01
STAN mask B       0.010     0.01

Table 3: Mean squared error of the two tasks

STAN Model on UGR16 Netflow data
Our models are trained on four Tesla P100 GPUs using the PyTorch toolbox. Of the different parameter update rules tried, Adam [10] gives the best convergence performance and is used for all experiments. The learning rates were manually set to the highest values that allowed fast convergence: 0.001 for the Gaussian mixture layers and 0.01 for the softmax layers. The batch sizes are also set manually. For UGR16, we use as large a batch size as permitted quick convergence; this corresponds to 512 time windows per batch. We use preprocessing to prepare data batches that can be trained in parallel, which accelerates both training and generation. The initial convolutional layer parameters are sampled from a uniform distribution.
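A minimal sketch of this per-layer learning-rate setup with Adam in PyTorch. The submodules `mdn_head` and `softmax_head` are hypothetical stand-ins, not the actual STAN architecture:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the model's output heads; the real STAN layers differ.
mdn_head = nn.Linear(64, 30)     # hypothetical Gaussian mixture (MDN) head
softmax_head = nn.Linear(64, 3)  # hypothetical softmax head (e.g., protocol)

# Adam with separate learning rates per parameter group, as described above:
# 0.001 for the Gaussian mixture layers, 0.01 for the softmax layers.
optimizer = torch.optim.Adam([
    {"params": mdn_head.parameters(), "lr": 0.001},
    {"params": softmax_head.parameters(), "lr": 0.01},
])

for group in optimizer.param_groups:
    print(group["lr"])
```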
The inputs to the neural model are pre-processed to facilitate training. The numerical attributes are min-max scaled; for the categorical attributes, we apply one-hot encoding. Specifically, for the protocol attribute we use a three-way softmax (TCP, UDP, and other). Source and destination port numbers are handled differently for well-known and other ports: ports up to 1024 are one-hot encoded with a softmax output, while higher ports are modeled as a numeric attribute. Instead of modeling the timestamps of individual flows, we model the time deltas between them.
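A sketch of this preprocessing, assuming simple helper functions (the function names and the 65535 port-scaling constant are illustrative choices, not taken from the paper):

```python
import numpy as np

def min_max_scale(x):
    """Min-max scale a numeric attribute to [0, 1]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

def encode_protocol(proto):
    """Three-way one-hot encoding: TCP, UDP, or other."""
    classes = {"TCP": 0, "UDP": 1}
    vec = np.zeros(3)
    vec[classes.get(proto, 2)] = 1.0
    return vec

def encode_port(port, well_known_max=1024):
    """Ports up to 1024 are categorical (softmax index); higher ports numeric.
    Returns (category_index_or_None, scaled_numeric_or_None)."""
    if port <= well_known_max:
        return port, None            # index into a softmax over well-known ports
    return None, port / 65535.0      # numeric, scaled by the maximum port number

def time_deltas(timestamps):
    """Model inter-arrival times rather than absolute timestamps."""
    ts = np.asarray(timestamps, dtype=float)
    return np.diff(ts, prepend=ts[0])

print(encode_protocol("UDP"))        # [0. 1. 0.]
print(encode_port(80))               # (80, None)
print(time_deltas([0.0, 1.5, 2.0]))  # [0.  1.5 0.5]
```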
For each data point (each row), we can directly calculate the row likelihood from the factorization equations. In our case, the explicit density generative models (B1, B2, and STAN) clearly define a distribution for each attribute, so we can evaluate the modeled distribution directly via the individual attribute distributions. For a more direct comparison, we discretize each continuous variable into 200 bins, chosen based on the variable value ranges and the data set size, and compute the negative log-likelihood for all baselines and attributes. Table 4 reports the per-attribute negative log-likelihood of STAN and the baselines B1 and B2, evaluated on the UGR16 validation set.

Model   bytes   packet   time duration   protocol
B1      4.85    3.78     1.81            0.341
B2      3.90    2.62     0.97            0.344
STAN

Table 4: Attribute negative log-likelihood of models evaluated on the UGR16 validation set
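An illustrative sketch of this binned negative log-likelihood evaluation (not the paper's evaluation code): samples drawn from a model are histogrammed into 200 bins, and validation values are scored against the resulting discrete distribution. The function name and Laplace smoothing are assumptions for the sketch:

```python
import numpy as np

def binned_nll(model_samples, validation_values, n_bins=200):
    """Mean negative log-likelihood of validation values under the binned
    distribution implied by samples drawn from a generative model."""
    lo = min(model_samples.min(), validation_values.min())
    hi = max(model_samples.max(), validation_values.max())
    counts, edges = np.histogram(model_samples, bins=n_bins, range=(lo, hi))
    # Laplace smoothing so empty bins do not yield infinite NLL.
    probs = (counts + 1.0) / (counts.sum() + n_bins)
    idx = np.clip(np.digitize(validation_values, edges) - 1, 0, n_bins - 1)
    return -np.mean(np.log(probs[idx]))

rng = np.random.default_rng(0)
good = rng.normal(0, 1, 100_000)   # samples from a well-matched model
bad = rng.uniform(-5, 5, 100_000)  # samples from a poorly-matched model
val = rng.normal(0, 1, 10_000)     # validation data
print(binned_nll(good, val) < binned_nll(bad, val))  # closer model scores lower
```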