STAN: Synthetic Network Traffic Generation using Autoregressive Neural Models
Shengzhe Xu, Manish Marwah, Naren Ramakrishnan
{[email protected], [email protected], [email protected]}
Department of Computer Science, Virginia Tech, Arlington, USA
Micro Focus, CA, USA
September 2020
Abstract
Deep learning models have achieved great success in recent years. However, large amounts of data are typically required to train such models. While some types of data, such as images, videos, and text, are easier to find, data in certain domains is difficult to obtain. For instance, cybersecurity applications routinely use network traffic data which organizations are reluctant to share, even internally, due to privacy reasons. An alternative is to use synthetically generated data; however, most existing data generating methods lack the ability to capture the complex dependency structures that are usually prevalent in real data, by assuming independence either temporally or between attributes. This paper presents our approach, called
STAN, Synthetic Network Traffic Generation using Autoregressive Neural models, to generate realistic synthetic network traffic data. Our novel autoregressive neural architecture captures both temporal dependence and dependence between attributes at any given time. It integrates convolutional neural layers (CNN) with mixture density layers (MDN) and softmax layers to model both continuous and discrete variables. We evaluate the performance of
STAN by training it on both a simulated data set and a real network traffic data set. Multiple metrics are used to compare the generated data with real data and with data generated via several baseline methods. Finally, to answer the question – can real network traffic data be substituted with synthetic data to train models of comparable accuracy? – we consider two commonly used models for anomaly detection in such data, and compare F1/MSE measures of models trained on real data and on increasing proportions of generated data. The results show only a small decline in the accuracy of models trained solely on synthetic data.

Introduction
Cybersecurity has become a key concern for both private and public organizations, given the prevalence of cyber-threats and attacks. In fact, malicious cyber-activity cost the U.S. economy between $57 billion and $109 billion in 2016 [20], and worldwide yearly spending on cybersecurity reached $1.5 trillion in 2018 [17].

To gain insights into and counter cybersecurity threats, organizations need to sift through large amounts of network, host, and application data typically produced in an organization. Manual inspection of such data by security analysts to discover attacks is impractical due to its sheer volume; e.g., even a medium-sized enterprise can produce terabytes of network traffic data in a few hours. Automating the process through the use of machine learning tools is the only viable alternative. Recently, deep learning models have been successfully used for cybersecurity applications [4, 11], and given the large quantities of available data, deep learning methods appear to be a good fit.

However, although large amounts of data are apparently available for cybersecurity machine learning applications, the data is sensitive in nature and access to it can result in privacy violations; e.g., network traffic logs can reveal the web browsing behavior of users. Thus it is difficult to obtain such data to train models, even internally within an organization. To get around data privacy issues, there are three main approaches [1, 2]: 1) anonymization; 2) cryptographic methods; and 3) perturbation methods, such as differential privacy. However, 1) leaks private information in most cases, 2) is usually impractical for large data sets, and 3) degrades data quality, making it less suitable for machine learning tasks.

In this paper, we take an orthogonal approach by generating synthetic data that is realistic enough to replace real data in machine learning tasks. Specifically, we consider multivariate time-series data and, unlike prior work, capture both temporal dependence and attribute dependence.
Figure 1 illustrates our approach, called STAN: given real historical data, phase 1 trains a CNN-based autoregressive generative neural network that learns the joint distribution of the data. After the model is trained, phase 2 uses it to synthesize any amount of synthetic data following the joint distribution of the real data, without revealing any private information. Phase 3 applies the synthetic data in place of real data in machine learning tasks, where model performance is comparable to that of the model trained on real data. (Here, performance refers to a model evaluation metric such as precision, recall, F1-score, or mean squared error.)

To evaluate the performance of STAN, we use a real, publicly available network traffic data set. We compare our method with four selected baselines, using several metrics to evaluate the generated data. Finally, we compare the methods on two machine learning tasks – a classification task and a regression task used for detecting cybersecurity anomalies – trained on both real and synthetic data. We show comparable model performance after entirely substituting the real training data with our synthetic data: the F-1 score of the
classification task only drops by 4% (78% to 75%), while the mean squared error only increases by about 13% for the regression task.

Figure 1: STAN consists of three phases: Phase 1 learns a generative model from a given real training data set D_Historical; Phase 2 uses the trained model to sequentially generate synthetic data, D_Synth; Phase 3 uses the generated data, D_Synth, in place of real data, D_Real, to train machine learning models.

In summary, this paper makes the following key contributions:

• We designed and prototyped STAN, a novel tool that learns the joint distribution of multivariate time-series data – data typically used in cybersecurity applications – and then generates synthetic data from the learned distribution. Unlike prior work, STAN learns both temporal and attribute dependence. Our code is publicly available.
• STAN integrates convolutional neural layers (CNN) with mixture density layers (MDN) and softmax layers to model both continuous and discrete variables.
• We evaluated STAN on both simulated data and a real, publicly available network traffic data set, and compared it with four baselines.
• We built models for two cybersecurity machine learning tasks and showed that, when using only STAN-generated data for training, the performance of the models is comparable to using real data.
Related Work

Machine learning for cybersecurity
Over the past decades, machine learning has been applied to multiple tasks in cybersecurity, such as automatically detecting malicious activity and stopping attacks [6, 7]. Such machine learning approaches usually require a large amount of training data with specific features. However, training models using real user data leads to privacy exposure and ethics problems. Previous work on anonymizing real data [15] has either failed to provide satisfactory privacy protection, or degrades data quality too much for machine learning tasks.

Code: https://github.com/an-anonymous-repo/ANDS.git

Synthetic data generation and GAN models
Generating synthetic data to make up for the lack of real data is a common solution. Compared to modeling image data [13], learning distributions over multivariate time-series data poses more challenges. Multivariate data takes many forms in the real world, so the data usually has more complex dependencies (temporal and spatial) as well as heterogeneous attribute types (continuous or discrete).

Synthetic data generation models often treat each column as a random variable to model joint multivariate probability distributions. The modeled distribution is then used for sampling. Traditional modeling algorithms [3, 9, 18] are restricted in the distribution and data types they support, and due to computational issues, the dependability of the synthetic data generated by these models is extremely limited. Recently, GAN-based approaches have improved the performance and flexibility of data generation [14, 21]. However, they are still restricted to a static dependency structure, without considering the temporal dependence usually prevalent in real-world data. We are not aware of any prior work that models both temporal and between-attribute dependencies.
Autoregressive generative models [13, 19] have been successfully applied to signal data, image data, and natural language data. They generate data elements iteratively: previously generated elements are used as the input condition for generating the following data. Compared to GAN models, autoregressive models emphasize two factors during distribution estimation: 1) the importance of the time-sequential factor; and 2) an explicit and tractable density. In this paper, we apply the autoregressive idea to learn and generate multivariate time-series data.
Mixture density networks
Unlike discrete attributes, some continuous numeric attributes are relatively sparse and span a large value range. A Mixture Density Network [5] is a neural network architecture that learns a Gaussian mixture model (GMM) to predict continuous attribute distributions. This architecture makes it possible to integrate a GMM into a complex neural network architecture.
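To make the mixture-density idea concrete, here is a minimal NumPy sketch of an MDN head and its negative log-likelihood. The feature vector `z`, the weight matrices, and their shapes are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def mdn_params(z, W_alpha, W_mu, W_sigma):
    """Three parallel linear heads map features z to GMM parameters:
    mixture weights alpha (softmax), means mu, and standard
    deviations sigma (exp keeps them strictly positive)."""
    logits = z @ W_alpha
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                       # weights sum to one
    mu = z @ W_mu
    sigma = np.exp(z @ W_sigma)
    return alpha, mu, sigma

def gmm_nll(x, alpha, mu, sigma):
    """NLL(x) = -log sum_i alpha_i * N(x | mu_i, sigma_i)."""
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(np.sum(alpha * dens))

rng = np.random.default_rng(0)
z = rng.normal(size=8)                         # stand-in for learned features
W_alpha, W_mu, W_sigma = (0.1 * rng.normal(size=(8, 3)) for _ in range(3))
alpha, mu, sigma = mdn_params(z, W_alpha, W_mu, W_sigma)
loss = gmm_nll(0.5, alpha, mu, sigma)
```

The softmax on the weights and the exponential on the deviations guarantee a valid mixture for any feature vector, which is what lets the GMM sit at the end of a larger network.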
Problem Formulation

We assume the data to be generated is a multivariate time-series. Specifically, data set x contains n rows and m columns. Each row x(i,:) is an observation at time point i and each column x(:,j) is a random variable j, where i ∈ 1..n and j ∈ 1..m. Unlike typical tabular data, e.g., that found in relational database tables, and unstructured data, e.g., images, multivariate time-series data poses two main challenges: 1) the rows are generated by an underlying temporal process and are thus not independent, unlike tabular data; 2) the columns or attributes are not necessarily homogeneous, and comprise multiple data types such as numerical, categorical, or continuous, unlike, say, images.

Figure 2: STAN components. Left: the window CNN, which crops the context based on a sliding window and extracts features from the context. Middle: mixture density layers and softmax layers learn to predict the distributions of the various types of attributes. Right: the loss functions for the different kinds of layers.

The data x follows an unknown, high-dimensional joint distribution P(x), which is infeasible to estimate directly. The goal is to estimate P(x) by a generative model S which retains the dependency structure across rows and columns. Values in a column typically depend on other columns, and the temporal dependence of a row can extend to tens of prior rows. Once model S is trained, it can be used to generate an arbitrary amount of data, D_syn.

Another key challenge is evaluating the quality of the generated data, D_syn. Assuming a data set, D_historical, is used to train S, and an unseen test data set, D_test, is used to evaluate the performance of S, we use two criteria to compare D_syn with D_test: 1) similarity between a metric M evaluated on the two data sets, that is, M(D_test) ≈ M(D_syn); and 2) similarity in performance P when training the same machine learning task T with the real data, D_test, replaced by the synthetic data, D_syn, that is, P[T(D_test)] ≈ P[T(D_syn)].
We model the joint data distribution, P(x), using an autoregressive neural network. The model architecture, shown in Figure 2, combines CNN layers with a density mixture network [5]. The CNN captures temporal and spatial (between-attribute) dependencies, while the density mixture layer uses the learned representation to model the joint distribution. During the training phase, for each row, STAN takes the data window prior to it as input. Given this context, the network learns the conditional distribution for each attribute. Both continuous and discrete attributes can be modeled: a density mixture layer is used for continuous attributes, while a softmax layer is used for discrete attributes.

In the synthesis phase,
STAN sequentially generates each attribute in each row. Every generated attribute in a row, having been sampled from a conditional distribution over the prior context, serves as part of the next attribute's context.

P(x) denotes the joint probability of data x composed of n rows and m attributes. We can expand the data as a one-dimensional sequence x_1, ..., x_n, where each vector x_i represents one row comprising the m attributes x_{i,1}, ..., x_{i,m}. To estimate the joint distribution P(x), we write it as the product of conditional distributions over the rows. We start from the joint distribution factorization with no assumptions:

P(x) = ∏_{i=1}^{n} P(x_i | x_1, ..., x_{i−1})    (1)

Unlike unstructured data such as images, multivariate time-series data usually corresponds to underlying continuous processes in the real world and does not have exact starting and ending points. It is impractical to make a row probability P(x_i) depend on all prior rows as in Equation 1. Thus, a k-sized sliding window is utilized to restrict the context to only the k most recent rows. In other words, a row conditioned on the past k rows is independent of all remaining prior rows; that is, for i > k, we assume independence between x_i and x_1, ..., x_{i−k−1}.
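The windowed autoregressive factorization implies a simple sequential sampling loop: sample a row conditioned on the last k rows, then slide the window. A minimal NumPy sketch follows; `sample_next_row` is a hypothetical placeholder, standing in for STAN's learned conditional sampler.

```python
import numpy as np

rng = np.random.default_rng(1)
k, m = 5, 3                                    # window size, attribute count

def sample_next_row(window):
    """Hypothetical stand-in for STAN's conditional sampler; the real
    model draws each attribute from an MDN or softmax head conditioned
    on this k-row context."""
    return window.mean(axis=0) + 0.1 * rng.normal(size=window.shape[1])

window = rng.normal(size=(k, m))               # seed context (marginal sampling)
rows = []
for _ in range(100):                           # generate 100 synthetic rows
    new_row = sample_next_row(window)
    rows.append(new_row)
    window = np.vstack([window[1:], new_row])  # slide: drop the oldest row

D_synth = np.array(rows)
```

The only state carried between steps is the k-row window, which is what makes generating an arbitrary amount of data cheap once the model is trained.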
Algorithm 1: Model training process for each attribute j
Input: D_Historical, window size k, attribute type T_j
Output: STAN model S_stan
Construct window data: X_window_i = concatenate(X_{i−k}, ..., X_{i−1}); y_window_i = X_i
for epoch in 1 ... EPOCH do
    X_window_i *= Mask
    if T_j is continuous then
        P_gmm_pred = mdn(wcnn(X_window)); loss = NLL(P_gmm_pred, y_window)
    else
        P_softmax_pred = softmax(wcnn(X_window)); loss = cross_entropy(P_softmax_pred, y_window)
    end if
    Update S_stan with loss
end for

Algorithm 2: Data synthesis process
Input: trained STAN model S_stan
Output: D_Synth
Initialize context X_window by marginal sampling
while condition (target row number or time stamp) do
    X_window_i *= Mask
    P_pred = S_stan(X_window_i)
    y_sample = sample from distribution P_pred
    X_window_{i+1} = concatenate(X_window_i[1:, :], y_sample)    // slide window: drop oldest row, append new row
end while

Window convolutional layers (wcnn). The CNN layers, which we call window CNN since they operate on a sliding window of data, perform a two-dimensional convolution. For one row x_i, the layers capture a rectangular context above the row, as shown in Figure 2. STAN uses multiple convolutional layers that preserve the spatial and temporal resolution in the sliding time-window box; each number in Figure 2 denotes the number of 3×3 convolution filters, each followed by BN, ReLU, and M (masking), respectively.

Convolution mask
Based on which factorization is selected, we use mask A for Equation 4 or mask B for Equation 5.

Figure 3: Masks for the context-window convolution. (a) Mask A, for the conditional independence assumption between attributes in the same row. (b) Mask B, for no conditional independence assumption within the same row.
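Under this window semantics, the two masks can be sketched as boolean visibility arrays over a (k+1)-row context whose last row is the one being generated. The layout below is an assumption consistent with Figure 3, not the paper's exact implementation.

```python
import numpy as np

def context_mask(k, m, j, same_row=False):
    """Boolean visibility mask over a (k+1) x m context whose last row
    is the row being generated. Mask A (same_row=False) exposes only
    the k prior rows; mask B additionally exposes attributes 0..j-1 of
    the current row, matching the no-independence factorization."""
    mask = np.zeros((k + 1, m), dtype=bool)
    mask[:k, :] = True                         # the k-row history is visible
    if same_row:
        mask[k, :j] = True                     # earlier attributes, same row
    return mask

mask_a = context_mask(k=4, m=5, j=2, same_row=False)
mask_b = context_mask(k=4, m=5, j=2, same_row=True)
```

Multiplying the context by the mask before convolution zeroes out exactly the entries the chosen factorization forbids the model from seeing.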
Mixture density layer (mdn). This layer learns a conditional Gaussian mixture distribution. It consists of three parallel fully connected layers, modeling α_i, σ_i, and µ_i separately, where the parameter α_i represents the component weights of a Gaussian mixture model, and µ_i and σ_i are the mean and variance parameters of the Gaussian distribution components. The α_i outputs pass through a softmax, so that the weights of all the Gaussian mixture components sum to one.

Loss functions
We define loss functions for the mixture density layers and softmax layers separately. Note that the two losses have different scales, and while multitask learning has its advantages, we match each mixture density component or softmax component with an individual wcnn component.

Negative log-likelihood (NLL) loss is used for the mixture density layers, which predict a group of mixture density parameters α_i, σ_i, µ_i that compose a Gaussian mixture model, as in Equation 6. We use maximum likelihood to estimate the true distribution: the label of the input, i.e., the new row to be generated, should have the highest probability under the estimated distribution. Cross-entropy loss is used for the softmax layers.

NLL(x | α, µ, σ) = − log Σ_i α_i · N(x | µ_i, σ_i)    (6)

Baselines

We selected four different methods to serve as baselines for our method. These range from a basic Gaussian mixture model and a Bayesian network to two recent deep learning approaches that use GANs for synthetic data generation; for brevity, we refer to them as B1, B2, B3, and B4, respectively. We compare
STAN with these baselines and analyze the distribution factorization.
Gaussian Mixture (B1)
This assumes all attributes at a particular time step are independent of each other, and further that each row is independent. Thus the distribution can be factorized as follows:

P(x) = ∏_{i=1}^{n} P(x_i)    (7a)
P(x_i) = ∏_{j=1}^{m} P(x_{i,j})    (7b)

Bayesian Network (B2)
As a traditional statistical approach, a Bayesian network can learn limited temporal or attribute dependence based on domain knowledge from experts. For example, if x_{i,j1} depends on x_{i−1,j1} and x_{i,j2}, we can write the joint as a product of conditional distributions (Equation 8). The value P(x_{i,j1} | x_{i−1,j1}, x_{i,j2}) is the probability of the j1-th attribute of the i-th observation row, given the (i−1)-th row's j1-th attribute and the i-th row's j2-th attribute. Accounting for the boundary case and applying Bayes' rule (up to a normalizing constant), we rewrite the distribution as:

P(x) = ∏_{i=1}^{n} [ P(x_{i,j1} | x_{i,j2}, x_{i−1,j1}) · ∏_{j=1, j≠j1}^{m} P(x_{i,j}) ]
     = P(x_1) · ∏_{i=2}^{n} [ P(x_{i,j1}) · P(x_{i−1,j1} | x_{i,j1}) · P(x_{i,j2} | x_{i,j1}) ] · ∏_{j=1, j≠j1}^{m} P(x_{i,j})    (8)

WP-GAN (B3) [16] uses a GAN to specifically generate network traffic flow data, while
CTGAN (B4) [21] uses a GAN to generate general tabular data containing both discrete and continuous attributes. Both B3 and B4 assume attribute dependence at a given time step but ignore temporal dependence. Thus the joint distribution can be factorized only as in Equation 7a, while the factorization inside each row is intractable due to the GAN mechanism.
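For concreteness, B1's fully independent factorization (Equation 7) can be sketched as follows. For brevity, this sketch fits a single Gaussian per attribute rather than the full mixture B1 uses, but the factorization, and its blind spot, are the same.

```python
import numpy as np

rng = np.random.default_rng(2)
D_hist = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(1000, 2))

# Equation 7b: fit each attribute independently (a single Gaussian here,
# where B1 proper fits a Gaussian mixture per attribute).
mu = D_hist.mean(axis=0)
sd = D_hist.std(axis=0)

# Equation 7a: sample rows independently -- neither temporal dependence
# nor between-attribute dependence survives this factorization.
D_syn = rng.normal(loc=mu, scale=sd, size=(1000, 2))
```

The sampled rows reproduce each attribute's marginal but are i.i.d. across time and across columns, which is exactly the limitation the paper's experiments expose.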
The evaluation of generative models is challenging and subjective. We use multiple metrics to compare them: likelihood, distribution evaluation, domain knowledge rule tests, and performance comparison on machine learning tasks.
Likelihood fitness
The likelihood function measures the goodness of fit of a statistical model to a data sample. However, the intrinsic difference between explicit density methods (B1, B2, and STAN) and implicit density methods (B3 and B4) makes it challenging to compare them. [8] also claims that there is no fair way to directly compare the likelihood of GAN models. Thus, in this paper, we only compare the likelihood between the explicit density models: B1, B2, and STAN.

Distribution and JS divergence
Although the goal of our work is to model the joint distribution of a window of data, we also compare the marginal distributions of the individual attributes. As a quantitative metric, we calculate the Jensen-Shannon divergence between the distributions of the generated data D_syn and the real data D_test for each attribute.

Domain knowledge test
We use domain knowledge checks to evaluate the synthetic data quality. Since the application data set pertains to network traffic flow, we use several properties that such data needs to satisfy in order to be realistic [16].
Machine learning application task
The final goal of generating synthetic data is to build machine learning models without using any real data. To evaluate whether the generated data is able to replace real data in a model training process, we select two tasks that are used in cybersecurity anomaly detection. One is a classification task, the other regression; both are self-supervised.

The first task is predicting the protocol field in the network traffic data, while the second task is predicting the number-of-bytes field. In practice, once trained, these models are used to mark anomalies when the predicted value significantly differs from the actual one. We train a RandomForest model for the classification task and a neural network model for the regression task. For both tasks, we compare the cross-validation performance of models trained on real and on synthetically generated data.
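The substitution protocol used in the experiments, replacing a growing fraction of real training data with synthetic data while keeping the training-set size fixed, can be sketched as follows. The toy data and nearest-centroid classifier are stand-ins for the netflow features and the paper's RandomForest model.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    """Toy two-class stand-in for the protocol-prediction task."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 4)) + y[:, None]
    return X, y

def accuracy(Xtr, ytr, Xte, yte):
    """Nearest-centroid classifier as a lightweight stand-in for the
    paper's RandomForest model."""
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) <
            np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return float((pred == yte).mean())

X_real, y_real = make_data(1000)
X_syn, y_syn = make_data(1000)                 # stand-in for generated data
X_te, y_te = make_data(500)

results = {}
for frac_syn in (0.0, 0.5, 1.0):
    n_syn = int(frac_syn * len(X_real))        # rows to substitute
    Xtr = np.vstack([X_real[n_syn:], X_syn[:n_syn]])
    ytr = np.concatenate([y_real[n_syn:], y_syn[:n_syn]])
    results[frac_syn] = accuracy(Xtr, ytr, X_te, y_te)
```

Holding the total training size constant isolates the effect of the data source from the effect of simply having less data.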
Experiments

To demonstrate its effectiveness, we train and evaluate STAN on a real network traffic data set. However, to first experiment with some architectural variations, we use a simple simulated data set.
Simulated data

We built a simulated data set using a simple random process whose dependence can be explicitly controlled. We simulated a two-variable data distribution and sampled a 10,000-point data set (X, Y) from it: each x_i is a weighted combination of x_{i−1} and standard normal noise N, and each y_i is a weighted combination of x_i and N.

We applied a naive version of STAN, which passes the input directly to the mixture density layers, and B1 on the simulated data set. We evaluated the correlation coefficient R for both the temporal dependence, R(X_i, X_{i−1}), and the attribute dependence, R(X_i, Y_i). Figure 4 presents the (x_i, y_i) scatter plots from four data sources (the raw simulated data and the synthetic data).

Figure 4: (X_i, Y_i) scatter plots, with correlation coefficients R, for (a) the simulated raw data, (b) B1 synthesized data, (c) STAN mask A synthesized data, and (d) STAN mask B synthesized data.

Observation 1: Same-row attribute conditional independence provides a reasonable approximation.
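A minimal sketch of such a controlled-dependence process and its two correlation measurements; the coefficients 0.8 and 0.6 are illustrative assumptions, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
a, b = 0.8, 0.6      # illustrative weights -- assumptions, not the paper's
x = np.zeros(n)
y = np.zeros(n)
for i in range(1, n):
    x[i] = a * x[i - 1] + (1 - a) * rng.normal()   # temporal dependence
    y[i] = b * x[i] + (1 - b) * rng.normal()       # attribute dependence

r_temporal = np.corrcoef(x[1:], x[:-1])[0, 1]      # R(X_i, X_{i-1})
r_attr = np.corrcoef(x, y)[0, 1]                   # R(X_i, Y_i)
```

Because both dependencies are dialed in by hand, a generator's output can be checked directly against the known correlation structure.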
Data set
Network traffic data is typically a multivariate time-series. A common format is called netflow, where each row represents a unidirectional network traffic connection, or flow. We selected a netflow data set for our experiments because a large data set was publicly available, and further because it is a good representative format for network traffic data in general. Typically, each row consists of the following attributes: timestamp at the end of a flow (te), duration of the flow (td), packets exchanged in the flow (pkt), the corresponding number of bytes (byt), source IP address (sa), destination IP address (da), and protocol (pr). So one row x_i can be expressed as a tuple (te_i, byt_i, sa_i, da_i, pr_i, etc.). Table 1 shows typical attributes, their types, and examples.

We apply STAN to a publicly available benchmark netflow data set, UGR'16 [12], which contains large-scale traffic data captured by a Tier-3 ISP cloud service provider. First, we randomly select the April week-3 data to focus on. Second, we randomly select 90 users based on the per-user distribution of the number of traffic flows. Third, we extract one day's (Monday) data to be D_historical and another day (Tuesday) of the same user group and the same week to be D_test. Lastly, from a cybersecurity perspective, we are most interested in users with traffic between an organization and external IP addresses, rather than traffic within an organization. Following this strategy, we selected 1,531,126 samples for D_historical and 1,952,702 samples for D_test.

Attribute            Type         Example
timestamp            continuous   2016-04-11 00:02:15
duration             continuous   0.344
transport protocol   categorical  TCP
source IP address    categorical  85.201.196.53
source port          categorical  19925
dest. IP address     categorical  42.219.145.151
dest. port           categorical  80
bytes                numeric      11238
packets              numeric      11

Table 1: Overview of typical attributes in flow-based data.

Pre-processing
To ensure the trained model is a practical and robust tool for synthesizing network traffic flow data, we normalize the raw netflow data so that the neural network can process it; the predicted variable values can likewise be mapped back to their original form. Since this is regular data pre-processing, we provide the details in the supplemental material.
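Since the details are in the supplemental material, here is only a generic sketch of the kind of invertible pre-processing described: log-scaling a heavy-tailed numeric attribute and index-encoding a categorical one. The specific transforms are assumptions, not the paper's exact recipe.

```python
import numpy as np

# Heavy-tailed numeric attributes (e.g. byt) are compressed with an
# invertible log transform; categorical attributes (e.g. pr) are mapped
# to integer indices and back.
byt = np.array([64.0, 1500.0, 11238.0, 2_000_000.0])
byt_norm = np.log1p(byt)
byt_back = np.expm1(byt_norm)                  # inverse transform

protocols = ["TCP", "UDP", "ICMP"]
pr_to_idx = {p: i for i, p in enumerate(protocols)}
idx_to_pr = {i: p for p, i in pr_to_idx.items()}
pr_encoded = [pr_to_idx[p] for p in ["TCP", "UDP", "TCP"]]
pr_decoded = [idx_to_pr[i] for i in pr_encoded]
```

Invertibility matters because sampled values must be mapped back to valid netflow records, not left in the network's normalized space.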
Likelihood
For each data point (each row), we can directly calculate the row likelihood via the factorizations in Equations 4, 5, and 7b. Taking the attribute byt as an example, the negative log-likelihoods evaluated on the UGR16 validation set for B1, B2, and STAN are 4.85, 3.90, and 2.34, respectively. The per-attribute likelihood tables in the appendix show that STAN is the best over all the comparable attributes, both continuous and discrete.
Distribution and JS divergence
Figure 5 shows the per-attribute JS divergence of the marginal distributions of both the continuous and discrete attributes. STAN captures the marginal distribution well for most attributes. Even though B1 precisely models the marginal distribution of the training data set, it does not perform as well as STAN on the test data set. We believe this is because the marginal distribution over a day is non-stationary.
Observation 2:
STAN models the marginal distribution better than baseline B1.
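The per-attribute JS-divergence comparison can be sketched as follows, computing the divergence (in nats, bounded by ln 2) between histograms of an attribute in the real and synthetic data:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions,
    e.g. histograms of one attribute in D_test and D_syn."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)
    def kl(a, b):
        return float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(5)
real = rng.normal(0.0, 1.0, size=10_000)
synth = rng.normal(0.1, 1.0, size=10_000)      # slightly shifted marginal
bins = np.linspace(-5, 5, 51)
h_real, _ = np.histogram(real, bins=bins)
h_syn, _ = np.histogram(synth, bins=bins)
jsd = js_divergence(h_real, h_syn)
```

Unlike KL divergence, JS is symmetric and always finite, so it behaves well when one histogram has empty bins, which is common for sparse netflow attributes.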
Domain knowledge test
We employ the domain tests developed by [16] for netflow data. These are rules that need to be satisfied by generated flow-based network data. We highlight three tests here, summarized in Table 2.
STAN performs well in all three:

• Test 1: The selected UGR16 data set is captured by an ISP. Therefore, at least one IP address (source or destination) of each flow must belong to the ISP (starting with 42.219.XXX.XXX).
• Test 2: If the flow describes normal user behavior and the source port or destination port is 80 (HTTP) or 443 (HTTPS), the transport protocol must be TCP.
• Test 3: TCP and UDP packets have a minimum and maximum packet size. Therefore, we check the relationship between bytes and packets in each flow, requiring 42 · packets ≤ bytes, together with the corresponding upper bound.

Figure 5: JS divergence between the attribute marginal distributions of real and synthetic data.

         Test 1   Test 2   Test 3
STAN     100      99       81

Table 2: Passing percentage of domain knowledge tests.
Real application tasks
Finally, we test our synthetic data on two cybersecurity machine learning applications – one a classification task, the other regression. The goal is to determine whether it is possible to fully substitute real data with synthetic data for training machine learning models.

A series of models is trained on real test data. We start training with the complete D_test (real data) and successively decrease the amount of real data until no data from D_test is used. Another series of models is trained similarly; however, instead of simply removing a certain amount of data from D_test, we substitute the indicated amount with our synthetic data D_syn, so that the total amount of training data is unchanged. In the following two tasks, we use D_test, which is unseen and never used in the synthesizer training process. For the synthetic data D_synth, every synthesizer model generates five sets of synthetic data samples, so we can compute error bars. Five-fold cross-validation is used to get a robust estimate of the measurements.
Fig. 6 shows the F-1 scores achieved byRandom Forest models. There are six sets of models. ’Real-Data’: these arerandom forest models trained by reducing the real data; ’stan’: these are randomforest models trained by reducing the real data, but substituting the reduceddata by synthetic data generated by
STAN ; ’B1’ through ’B4’: these are similarto the ’stan’ models but obtained by substituting the reduced data by the fourbaselines respectively. The x-axis represents how much real data is used from100% down to 0%.If we only use real data, the F1 score drops from 0.78 down to 0.6 as theamount of data decreases. Clearly, with no real data, we are unable to train amodel. When we substitute real data with that generated by the baselines, theperformance drops even quicker, because they do a poor job of capturing thetemporal and attribute dependence. Even in the absence of any real data, datagenerated by
STAN results in an F1 score of 0.75, where the drop in performanceis only 4%. That is, the model built with only synthetic data retains 96% ofthe performance of the all real data trained model.Figure 6: F1-score of Protocol Forcasting Task14igure 7: Mean Square Error of byt
Value Forcasting Task
Task 2: byt value forecasting

This task follows a similar experimental setup to Task 1. Fig. 7 shows the mean squared error achieved by a neural network regression model. The plot shows that STAN and the Bayesian network (B2) outperform the other three baseline models. A Bayesian network built with domain knowledge typically performs better than GANs [21].
Observation 3:
Compared to B2, STAN achieves the same performance even without domain knowledge.
In our experiments, B2 is optimized specifically for the byt sequential values. However, STAN has two advantages over the Bayesian network. First, users do not need the domain knowledge required for a Bayesian network implementation. Second, there is no inherent expert bias, unlike in traditional Bayesian networks. Similar to the first task, the penalty for using only STAN-generated data (with no real data) is low: an increase of 13% in the mean squared error.
Observation 4:
Even with 0% real data, STAN models Task 1 and Task 2 with only a small drop in accuracy.
Conclusion

This paper presents the design and implementation of
STAN, a novel, flexible, and robust approach to learning the distribution of complex multivariate time-series data. Compared to existing approaches, STAN is novel in several aspects. First, STAN learns the joint distribution over both temporal dependence and attribute dependence. Second, STAN is flexible enough to generate data with any combination of continuous and discrete attributes. Furthermore, we perform a thorough evaluation of STAN, comparing it against four baselines using several performance measures as well as two cybersecurity machine learning tasks.

Our future work includes techniques to (1) build a complete system for learning and generating network traffic data, (2) explore the best update rate for re-learning the data synthesizer on the historical data D_historical, and (3) conduct more semantic and statistical checks on the fungibility of synthetic data with real data.

References

[1] Charu C. Aggarwal and Philip S. Yu.
A General Survey of Privacy-Preserving Data Mining Models and Algorithms, pages 11–52. Springer US, Boston, MA, 2008.
[2] Mohammad Al-Rubaie and J. Morris Chang. Privacy-preserving machine learning: Threats and solutions.
IEEE Security & Privacy, 17(2):49–58, 2019.
[3] Laura Aviñó, Matteo Ruffini, and Ricard Gavaldà. Generating synthetic but plausible healthcare record datasets. arXiv preprint arXiv:1807.01514, 2018.
[4] Daniel S. Berman, Anna L. Buczak, Jeffrey S. Chavis, and Cherita L. Corbett. A survey of deep learning methods for cyber security.
Information, 10(4):122, 2019.
[5] Christopher M. Bishop. Mixture density networks. 1994.
[6] Anna L. Buczak and Erhan Guven. A survey of data mining and machine learning methods for cyber security intrusion detection.
IEEE Communications Surveys & Tutorials, 18(2):1153–1176, 2015.
[7] Carlos A. Catania and Carlos García Garino. Automatic network intrusion detection: Current techniques and open issues.
Computers & Electrical Engineering, 38(5):1062–1072, 2012.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In
Advances in Neural Information Processing Systems, pages 2672–2680, Cambridge, MA, USA, 2014. MIT Press.
[9] Graham Cormode. Differentially private spatial decompositions. In
Data Engineering (ICDE), 2012 IEEE 28th International Conference, 2012.
[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[11] Donghwoon Kwon, Hyunjoo Kim, Jinoh Kim, Sang C. Suh, Ikkyun Kim, and Kuinam J. Kim. A survey of deep learning-based network anomaly detection.
Cluster Computing , pages 1–13, 2017.[12] Gabriel Maci´a-Fern´andez, Jos´e Camacho, Roberto Mag´an-Carri´on, PedroGarc´ıa-Teodoro, and Roberto Ther´on. Ugr ’16: A new dataset for theevaluation of cyclostationarity-based network idss.
Computers & Security ,73:411–424, 2018.[13] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixelrecurrent neural networks. arXiv preprint arXiv:1601.06759 , 2016.[14] Noseong Park, Mahmoud Mohammadi, Kshitij Gorde, Sushil Jajodia,Hongkyu Park, and Youngmin Kim. Data synthesis based on generativeadversarial networks.
Proceedings of the VLDB Endowment , 11(10):1071–1083, 2018.[15] Shukor Razak, Nur Hafizah, and Arafat Al-Dhaqm. Data anonymizationusing pseudonym system to preserve data privacy.
IEEE Access , 2020.[16] Markus Ring, Daniel Schl¨or, Dieter Landes, and Andreas Hotho. Flow-based network traffic generation using generative adversarial networks.
Computers & Security
Proceedings of theAAAI Conference on Artificial Intelligence , volume 33, pages 5049–5057,2019.[19] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals,Alex Graves, et al. Conditional image generation with pixelcnn decoders.In
Advances in neural information processing systems , pages 4790–4798,2016.[20] WhiteHouse. The cost of malicious cyber activity to the u.s. economy, 2018.[21] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veera-machaneni. Modeling tabular data using conditional gan. In
Advances inNeural Information Processing Systems , pages 7333–7343, 2019.17
Appendix: Simulated data
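The dependence measures analyzed below are Pearson correlation coefficients. As context, a minimal NumPy sketch (with illustrative series, not the paper's simulated data) of how R(X_i, X_{i-1}) and R(X_i, Y_i) can be computed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative series: x is strongly autocorrelated, y depends on x.
n = 10_000
x = np.empty(n)
x[0] = 0.0
for i in range(1, n):
    x[i] = 0.95 * x[i - 1] + rng.normal(0, 0.1)
y = 2.0 * x + rng.normal(0, 0.1, n)

# Temporal dependence: correlation of consecutive values X_i and X_{i-1}.
r_temporal = np.corrcoef(x[1:], x[:-1])[0, 1]
# Attribute dependence: correlation of same-row attributes X_i and Y_i.
r_attribute = np.corrcoef(x, y)[0, 1]

print(round(r_temporal, 2), round(r_attribute, 2))  # both close to 1 here
```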
We evaluated the correlation coefficient R for both the temporal dependence R(X_i, X_{i-1}) and the attribute dependence R(X_i, Y_i). Figure 8 presents scatter plots of X_i against X_{i-1} for four data sources (simulated data, B1 synthetic data, STAN with mask A synthetic data, and STAN with mask B synthetic data), and Figure 9 presents the corresponding plots for X_i and Y_i. Since mask A and mask B represent conditional independence and explicit dependence respectively, we summarize the observations as follows:

• Both conditional independence and explicit dependence provide a reasonable approximation of the temporal dependence. The R(X_i, X_{i-1}) of the simulated data, STAN mask A, and STAN mask B are 0.9.

• Conditional independence provides a reasonable approximation of same-row attribute dependence, while explicit dependence performs better. The R(X_i, Y_i) of the simulated data and STAN mask B are 0.9, while that of
STAN mask A is 0.7.

Figure 8: (X_i, X_{i-1}) scatter plots of the simulated data and the synthetic data, with correlation coefficients R. Panels: (a) simulated raw data, (b) B1 synthesized data, (c) STAN mask A synthesized data, (d) STAN mask B synthesized data.

Figure 9: (X_i, Y_i) scatter plots of the simulated data and the synthetic data, with correlation coefficients R. Panels as in Figure 8.

We also used the simulated data and the corresponding synthetic data to train two machine learning tasks (using the scikit-learn Python library). In these experiments, we trained models on the different synthetic data sets or on the simulated data, and tested model performance (mean squared error) on the simulated test data.

• T1: predict y_i given x_i (row attribute dependence).

• T2: predict x_{i+1} given x_i (temporal dependence).

Table 3 shows that a machine learning model trained only on synthetic data generated by STAN produces a test loss similar to that of a model trained on the (real) simulated data.

Training Data     MSE(T1)   MSE(T2)
Simulated data    0.010     0.01
B1                0.050     0.05
STAN mask A       0.013     0.01
STAN mask B       0.010     0.01

Table 3: Mean squared error of the two tasks

STAN Model on UGR16 Netflow data
Our models are trained on four Tesla P100 GPUs using the PyTorch toolbox. Of the different parameter update rules tried, Adam [10] gives the best convergence performance and is used for all experiments. The learning rates were manually set to the highest values that allowed fast convergence: 0.001 for the Gaussian mixture layers and 0.01 for the softmax layers. The batch sizes are also set manually. For UGR16, we use as large a batch size as permitted quick convergence; this corresponds to 512 time windows per batch. We use preprocessing to prepare data batches that can be trained in parallel, which accelerates both training and generation. The initial convolutional layer parameters are sampled from a uniform distribution.
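A minimal sketch of this per-layer learning-rate setup with Adam in PyTorch. The submodules `mdn_head` and `softmax_head` are hypothetical stand-ins, not the actual STAN architecture:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the model's output heads; the real STAN layers differ.
mdn_head = nn.Linear(64, 30)     # hypothetical Gaussian mixture (MDN) head
softmax_head = nn.Linear(64, 3)  # hypothetical softmax head (e.g., protocol)

# Adam with separate learning rates per parameter group, as described above:
# 0.001 for the Gaussian mixture layers, 0.01 for the softmax layers.
optimizer = torch.optim.Adam([
    {"params": mdn_head.parameters(), "lr": 0.001},
    {"params": softmax_head.parameters(), "lr": 0.01},
])

for group in optimizer.param_groups:
    print(group["lr"])
```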
The inputs to the neural model are pre-processed to facilitate training. The numerical attributes are min-max scaled; for the categorical attributes, we apply one-hot encoding. Specifically, for the protocol attribute we use a three-way softmax (TCP, UDP, and other). Source and destination port numbers are handled differently for well-known and other ports: ports up to 1024 are one-hot encoded with a softmax output, while higher ports are modeled as a numeric attribute. Instead of modeling the timestamps of individual flows, we model the time deltas between them.
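A sketch of this preprocessing, assuming simple helper functions (the function names and the 65535 port-scaling constant are illustrative choices, not taken from the paper):

```python
import numpy as np

def min_max_scale(x):
    """Min-max scale a numeric attribute to [0, 1]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

def encode_protocol(proto):
    """Three-way one-hot encoding: TCP, UDP, or other."""
    classes = {"TCP": 0, "UDP": 1}
    vec = np.zeros(3)
    vec[classes.get(proto, 2)] = 1.0
    return vec

def encode_port(port, well_known_max=1024):
    """Ports up to 1024 are categorical (softmax index); higher ports numeric.
    Returns (category_index_or_None, scaled_numeric_or_None)."""
    if port <= well_known_max:
        return port, None            # index into a softmax over well-known ports
    return None, port / 65535.0      # numeric, scaled by the maximum port number

def time_deltas(timestamps):
    """Model inter-arrival times rather than absolute timestamps."""
    ts = np.asarray(timestamps, dtype=float)
    return np.diff(ts, prepend=ts[0])

print(encode_protocol("UDP"))        # [0. 1. 0.]
print(encode_port(80))               # (80, None)
print(time_deltas([0.0, 1.5, 2.0]))  # [0.  1.5 0.5]
```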
For each data point (each row), we can directly calculate the row likelihood from the factorization equations. In our case, the explicit density generative models (B1, B2, and STAN) clearly define a distribution for each attribute, so we can evaluate the modeled distribution directly via the individual attribute distributions. For a more direct comparison, we discretize each continuous variable into 200 bins, chosen based on the variable value ranges and the data set size, and compute the negative log-likelihood for all baselines and attributes. Table 4 reports the per-attribute negative log-likelihood of STAN and the baselines B1 and B2, evaluated on the UGR16 validation set.

Model   bytes   packet   time duration   protocol
B1      4.85    3.78     1.81            0.341
B2      3.90    2.62     0.97            0.344
STAN

Table 4: Attribute negative log-likelihood of models evaluated on the UGR16 validation set
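An illustrative sketch of this binned negative log-likelihood evaluation (not the paper's evaluation code): samples drawn from a model are histogrammed into 200 bins, and validation values are scored against the resulting discrete distribution. The function name and Laplace smoothing are assumptions for the sketch:

```python
import numpy as np

def binned_nll(model_samples, validation_values, n_bins=200):
    """Mean negative log-likelihood of validation values under the binned
    distribution implied by samples drawn from a generative model."""
    lo = min(model_samples.min(), validation_values.min())
    hi = max(model_samples.max(), validation_values.max())
    counts, edges = np.histogram(model_samples, bins=n_bins, range=(lo, hi))
    # Laplace smoothing so empty bins do not yield infinite NLL.
    probs = (counts + 1.0) / (counts.sum() + n_bins)
    idx = np.clip(np.digitize(validation_values, edges) - 1, 0, n_bins - 1)
    return -np.mean(np.log(probs[idx]))

rng = np.random.default_rng(0)
good = rng.normal(0, 1, 100_000)   # samples from a well-matched model
bad = rng.uniform(-5, 5, 100_000)  # samples from a poorly-matched model
val = rng.normal(0, 1, 10_000)     # validation data
print(binned_nll(good, val) < binned_nll(bad, val))  # closer model scores lower
```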