Data-Driven Network Intrusion Detection: A Taxonomy of Challenges and Methods
DYLAN CHOU,
Carnegie Mellon University
MENG JIANG,
University of Notre Dame

Data-driven methods have been widely used in network intrusion detection (NID) systems. However, there are currently a number of challenges derived from how the datasets are collected. Most attack classes in network intrusion datasets are minorities compared to normal traffic, and many datasets are collected in virtual machines or other simulated environments rather than real-world networks. These challenges undermine the performance of machine learning models for intrusion detection by fitting models such as random forests or support vector machines to unrepresentative "sandbox" datasets. This survey presents a carefully designed taxonomy highlighting eight main challenges and their solutions, and explores common datasets from 1999 to 2020. Trends in the distribution of challenges addressed over the past decade are analyzed, and future directions are proposed: expanding NID into cloud-based environments, devising scalable models for larger amounts of network intrusion data, and creating labeled datasets collected in real-world networks.

CCS Concepts: • General and reference → Surveys and overviews; • Networks → Network security; • Security and privacy → Intrusion/anomaly detection and malware mitigation; • Computing methodologies → Machine learning.

Additional Key Words and Phrases: Network intrusion detection, Big data, Cloud computing
ACM Reference Format:
Dylan Chou and Meng Jiang. 2020. Data-Driven Network Intrusion Detection: A Taxonomy of Challenges andMethods.
ACM Comput. Surv.
1, 1, Article 1 (September 2020), 38 pages. https://doi.org/10.1145/xxx
Network intrusion detection (NID) monitors a network for malicious activity or policy violations [117, 122]. During the last two decades, data-driven methods have been developed and deployed for NID systems [38, 155], most of which are machine learning models such as Naïve Bayes [128], Random Forests [44, 184], Adaboost [70], and Deep Neural Networks [76, 151]. A review paper in 2009 summarized the NID systems that were supported by anomaly detection algorithms [58, 90]. In this survey, we present a broader view of data-driven NID, including related work from the past ten years, and present a taxonomy of challenges and methods in data-driven NID research.

Since the advent of computer networks, e-commerce, and web services, there has been a greater need for cyber-security and countermeasures against network attacks. Interest in intrusion detection dates back to at least 1994, when it was described as a retrofit way to provide a sense of security by identifying unauthorized use, misuse, or abuse of computer systems [117].
Authors' addresses: Dylan Chou, [email protected], Carnegie Mellon University, Pittsburgh, PA, 15213; Meng Jiang, [email protected], University of Notre Dame, Notre Dame, Indiana, 46556.
© 2020 Association for Computing Machinery. 0360-0300/2020/9-ART1 https://doi.org/10.1145/xxx
ACM Comput. Surv., Vol. 1, No. 1, Article 1. Publication date: September 2020.

The concept of intrusion detection later became contextualized in cyber-security systems. The term "intrusion detection systems" describes the extraction of information from one or multiple computers in a network to identify attacks from external sources as well as misuse of resources in the network from internal sources [22].

Intrusion detection systems can be broadly categorized as either host-based or network-based. Host-based intrusion detection monitors system files and internal hardware while also identifying anomalies in network traffic. A network intrusion detection system is similar but focuses primarily on incoming network traffic [120].

There are two general behaviors in a network: normal and anomalous. Normal network behavior follows specific criteria in terms of traffic volume, applications on the network, and types of data exchanged. Network anomalies fall into two general categories: network failures, such as network congestion or file servers being down, and network security attacks, such as DDoS and other attacks conducted by a malicious agent [163].

Network intrusion detection systems aim to distinguish the norm from security-related anomalies and detect attacks on computer networks. Network intrusion detection methods can be anomaly-based, identifying malicious activity that departs from normally defined behavior on a network, or signature-based, identifying known attacks via pattern matching. Because signature-based detection relies on previously seen patterns, it is not as effective at detecting novel (zero-day) attacks, so anomaly detection is often used to detect them.
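The contrast between the two detection paradigms can be sketched as follows (a toy illustration of ours; the signature strings, the packet-size feature, and the z-score cutoff are hypothetical, not drawn from any cited system):

```python
import statistics

# Hypothetical signature database of previously observed attack patterns.
KNOWN_SIGNATURES = {"GET /admin.php?cmd=", "\x90\x90\x90\x90"}

def signature_based(payload):
    """Flags traffic only if it matches a previously seen attack pattern."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)

def anomaly_based(packet_size, baseline, z_cutoff=3.0):
    """Flags traffic whose size deviates strongly from the normal baseline,
    so it can catch zero-day attacks that match no known signature."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(packet_size - mu) / sigma > z_cutoff

normal_sizes = [500, 510, 495, 505, 498, 502, 507, 493]
print(signature_based("POST /unknown-0day-exploit"))  # zero-day: no signature match -> False
print(anomaly_based(5000, normal_sizes))              # but the flood stands out statistically -> True
```

The sketch shows why the two approaches are complementary: the signature check misses the unseen payload, while the statistical check flags the outlier without any attack-specific knowledge.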
Among the network intrusion detection surveys from the past decade, many have constructed taxonomies along with problem-solution frameworks for cloud-computing platforms. Jeong et al. [77] addressed anomaly teletraffic intrusion detection systems in Hadoop-based platforms, with a heavy focus on the methodology of statistical, machine learning, and knowledge-based models. Different attributes of big data (storage volume, velocity, variety, intrusion detection system, and cost) are associated with problems and technical solutions specific to Hadoop-based platforms, and a new platform was proposed for anomaly teletraffic intrusion detection systems on Hadoop. Modi et al. [111] followed a high-level introduction of intrusion detection to cloud-based systems (a common defense against these intrusions being firewalls) and identified differences between signature- and anomaly-based detection. Keegan et al. [82] inspected network intrusion detection datasets, approaches, cloud environments, and algorithms, along with their advantages and disadvantages in the literature.

Other authors focused primarily on network intrusion detection datasets rather than methods. Ring et al. [140] examined packet-based and flow-based data along with host log files. Data recording environments from the literature were compared, and a multitude of datasets, including some data repositories found on the Internet, were discussed along with their drawbacks. Ring presented a comprehensive overview of 34 datasets, their drawbacks, and how they may be related if one dataset was built off of another. Davis and Clark [35] studied intrusion detection features derived from network traffic along with data preprocessing methods, including clustering, filtering packets by high anomaly score or extracting subsets during traffic payload analysis, tracing TCP sessions, computing statistical features per connection, and creating separate datasets.

Some papers were method-specific: Resende and Drummond [136] provided a comprehensive review of random forest-based network intrusion detection, presenting a high-level overview of random forests and their building blocks, decision trees. Datasets and common evaluation metrics were reviewed, and the authors concluded that, in future work, random forests will be used more on unbalanced and dynamic data due to their ability to adapt to incremental learning problems.
General overviews of network intrusion detection definitions and infrastructures, along with taxonomies to classify different types of intrusion detection systems, have also been made. Poston's taxonomy [132] covered high-level definitions of the types of intrusion detection and the types of analysis that can be done on host-based and network-based intrusion detection; however, the taxonomy is fairly general, and the paper does not address future directions for intrusion detection research. Fernandes et al. [47] looked extensively into the categorization of intrusion detection systems as well as the pros and cons of data sources commonly used in network anomaly detection. Moustafa et al. [114] inspected the types of attacks that network intrusion detection systems are intended to fend off; the pros and cons of host and network intrusion detection types are charted, the methodologies are explained in visual diagrams, and there is a focus on the decision engine techniques in the scholarly articles collected. Other papers looked to a specific result after observing the challenges in each paper and comparing their machine learning methods: Buczak and Guven [20] organized their review by machine learning method, presenting the papers that use each method, the data used, the cyber approach (misuse or anomaly), and the number of times each paper was cited. Mitchell and Chen [110] broke down the classification of intrusion detection by system, collection process, techniques, models, analysis, and response. Many visuals are dedicated to their four defined types of intrusion detection: anomaly-based, signature-based, specification-based, and reputation-based. The most and least studied IDS techniques are analyzed, and future research directions are proposed, including repurposing existing work on wireless intrusion detection applications, combining multitrust (data from witnesses or third parties) with intrusion detection, and specification-based detection for cyber-physical systems. Ahmed et al. [3] analyzed four main categories of anomaly-based detection: clustering, classification, statistical, and information-theoretic. Each category is evaluated based on the computational complexity of approaches of that type, the most significant network attacks, and the output of each technique. Nagaraja and Kumar [119] summarized various studies over a span of six years and presented their techniques, year published, identification, and the dataset used; their conclusion was that the main research problem pertained to reducing high-dimensional data and that many of the intrusion attacks were SQL-based.

There have been survey papers as broad as scanning all network anomaly detection methods and as specific as cloud-based intrusion systems. Past surveys focused on foundational knowledge of network intrusion detection frameworks such as TCP connection features or virtual machine layers in hypervisor/host systems. Surveys have provided overviews of datasets or comparisons between specific machine learning methods, all while reflecting on past literature. Many authors present previous work with charts comparing different papers and discussing challenges with cloud computing, growing data, and other open issues. Challenges have been addressed in many of these surveys, but there is a lack of solutions presented under future directions. Mitchell and Chen [110] examined the most and least studied areas in wireless network intrusion detection to propose future research areas. There is less emphasis, however, on trends of research in network intrusion detection over time and on using such trends to motivate future directions. To balance this survey, we provide substantial background on past datasets along with the recently collected LITNET dataset from 2020, a general taxonomy identifying the main challenges, and a discussion of trends of research in data-driven NID over time as well as what they imply for future directions.
In Section 2, the history of data processing, cloud computing, the lack of specific network attack types, and general big data processing techniques are examined. Section 3 covers common datasets
from DARPA 1998 [88] to as recent as LITNET 2020 [125], along with their statistics in terms of how network attacks are distributed and how unbalanced the datasets are based on entropy. Section 4 addresses the high-level organization of the taxonomy and the details of each challenge and corresponding solutions/methods. Section 5 discusses the trends based on the articles collected that form the taxonomy and areas to look further into. Section 6 presents conclusions from the literature survey and taxonomy of data-driven network intrusion detection and reinforces future directions that researchers can look into.

Fig. 1. Fundamental Process of Data-Driven Anomaly Detection: input data flows through data processing (feature selection, feature normalization, feature discretization, and feature reduction) into a data-driven anomaly detection model (supervised, unsupervised, transfer, meta, or reinforcement learning), which outputs normal/anomalous label predictions that are then evaluated.
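The stages of Figure 1 can be sketched end to end (a minimal, self-contained toy of ours, not taken from any surveyed paper; the min-max normalization, centroid-distance model, and accuracy metric are stand-ins we chose for the data processing, model, and evaluation stages):

```python
# Toy walk-through of Figure 1: input data -> data processing ->
# data-driven model -> normal/anomalous predictions -> evaluation.

def min_max_normalize(rows):
    """Data processing: scale each feature into [0, 1] (feature normalization)."""
    cols = list(zip(*rows))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def fit_centroid(normal_rows):
    """Model fitting: centroid of traffic known to be normal."""
    cols = list(zip(*normal_rows))
    return [sum(c) / len(c) for c in cols]

def predict(row, centroid, radius=0.5):
    """Label a flow anomalous when it lies far from the normal centroid."""
    dist = sum((a - b) ** 2 for a, b in zip(row, centroid)) ** 0.5
    return "anomalous" if dist > radius else "normal"

def accuracy(preds, truth):
    """Evaluation: fraction of correctly labeled flows."""
    return sum(p == t for p, t in zip(preds, truth)) / len(preds)

# duration (s), bytes -- made-up flows: three normal, one flood-like
raw = [[1.0, 200], [1.2, 220], [0.9, 210], [30.0, 90000]]
truth = ["normal", "normal", "normal", "anomalous"]
norm = min_max_normalize(raw)
centroid = fit_centroid(norm[:3])           # fit on the normal flows only
preds = [predict(r, centroid) for r in norm]
print(preds, accuracy(preds, truth))
```

Each function corresponds to one box in the figure; real systems substitute far richer feature pipelines and learned models at each stage.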
The key purpose of anomaly detection systems is to separate anomalies from normal behavior. In computer networks, a network anomaly refers to circumstances where network operations deviate from normal network behavior [163]. Anomaly-based network intrusion detection methods are important for identifying novel intrusion attacks. The approaches presented in the literature were implemented to improve individual or multiple components in the process of anomaly detection from data, as detailed in Figure 1.

Shipments of one or more terabytes of data have been seen since June 2, 1986, when Teradata shipped a terabyte of data to Kmart. By the first half of the 2010s, data had already been accumulating in the zettabytes by volume worldwide. If a company needed to handle a large query, it would resort to a parallel database, and Hadoop was often sought after for its open-source technologies [18]. Processing such large amounts of data was overwhelming, so optimization methods were used to speed up preprocessing, along with reduction methods that removed redundant features and reduced the size of the data.

Data reduction removes large amounts of data to improve efficiency and reduce computational overhead. In network traffic, packets are exchanged and TCP connections are opened for the exchanges to be carried out. Because so many packets are sent and received in a typical network, Chen et al. [24] extracted only the first few packets of each TCP connection to mitigate the effects of large packet data. Similarly, Ficara et al. [49], another paper from 2010, sampled a portion of the payload in the network traffic to alleviate the load from large amounts of network data. These were intentional extractions of network data during the collection process. Beyond extraction of data during collection, other authors selected specific network features based on their importance. Tan et al. [158] aimed to address the heavy computation associated with anomaly intrusion detection systems by using linear discriminant analysis (LDA) and distance difference maps to select the most significant features. LDA finds an optimal projection matrix to project higher-dimensional features to lower dimensions. This feature reduction method was applied to payload-based anomaly intrusion detection. In 2013, Zhang and Wang [182] applied a simpler feature selection method that underwent a sequential search and sifted through the features in the feature domain, where a feature was added back if the accuracy of the Bayesian network detection model dropped after removing it. In 2016, Wang et al. [168] used ID3 decision tree theory to split nodes containing feature sets based on the feature that provides the largest information gain.

Aside from data reduction, another popular method in the first half of the 2010s to handle growing data was to run parallel processes and speed up big data processing. Hung et al. [72] recognized a substantial increase in the number of threats posed to networks. Because pattern matching is computationally expensive, Hung et al. presented a graphics processing unit technology to accelerate pattern-matching operations via parallel computation. Their proposed algorithm achieved maximal traffic processing speeds of 2 Gbits/second and can enhance the performance of network intrusion detection systems. Similarly, in 2015, Zheng et al. [191] inspected methods to speed up pattern matching for network intrusion detection. They introduced negative pattern matching, which reduces the number of lookups in ternary content-addressable memory (TCAM), along with exclusive pattern matching, which divides the rule set into subsets, each queried independently given some input.

Most recently, attention has been directed toward new technologies in the cloud and newer computational optimizations aside from parallelism. Cloud computing services allow for processing of large datasets, and a popular engine for big data processing is Apache Spark. Gupta and Kulariya [61] presented a framework where correlation-based and chi-square feature selection were applied to obtain the most important feature set, and logistic regression, support vector machines (SVMs), random forests, gradient boosted decision trees, and Naive Bayes from Apache Spark's MLlib library were used for network intrusion classification. In 2019, Hajimirzaei and Navimipour [63] used a combination of a multilayer perceptron (MLP) network, an artificial bee colony (ABC) algorithm, and a fuzzy clustering algorithm to detect network intrusions. The ABC algorithm, in particular, was used because it mimics the way bees search for food sources: an artificial system of onlooker bees that find a food source via the hive dance of surrounding bees, employed bees returning to the previous food source, and scout bees randomly searching for new food sources can be applied to optimization problems such as adjusting the weights and biases in the MLP. The environment was simulated in CloudSim and exemplifies a novel integration of machine learning techniques into the cloud, different from Gupta and Kulariya's work, which tested common machine learning methods.

Despite the recent spike in available data, there remains a lack of data on specific types of attacks, especially newer ones. Meta learning is useful for domain adaptation, the improvement of a model's performance when it is trained on a different task that is similar to a previous source task [144]; as a consequence of automated machine learning, meta learning observes how machine learning approaches perform on different tasks and learns from these experiences to invoke novel methods that are more data-driven [73]. Wang et al. [179] used meta learning to strike a balance between big and small sample classifications; meta learning can boost the performance of models trained on previous source data when handling new but small data. They implemented a random committee meta-learning algorithm where the base classifier was a random tree, and the random tree along with a Bayesian network were voted on to determine which would classify the data. Because network intrusion detection is a classification problem, voting involves summing predictions over the different classifiers. In the past year, Xu et al. [173] noted that network traffic can be treated as a time series. During the meta-training phase, every sample in the query set is compared to others in the sample set, and a delta score is calculated between two sample input time-series traffic flows, representing their difference. Meta-testing involved comparing
each sample in the test dataset (unclassified) with those in the set of data that has already been classified. Another method for tackling the issue of unbalanced data, where specific network intrusion attacks are less represented than others, is to transfer data from other sources and fit the model to those data, which can allow classifiers to perform better on smaller test datasets [189] given knowledge gained from other datasets.

Table 1. Vertical Comparisons of Common Datasets.

Dataset              Duration     Traffic Type        Method                     # Attacks    # Instances
KDDCup1999 [36]      N/A          Synthetic           Tcpdump                    N/A          4,898,430
NSL-KDD 2009 [51]    7 weeks      Synthetic           N/A                        N/A          125,973/22,544
UNSW NB15 IDS [124]  15-16 hours  Synthetic           Tcpdump/IXIA PerfectStorm  45           2,540,044
UGR'16 [48]          96 days      Real                Netflow                    600M         16.9M
CIDDS'17 [123]       4 weeks      Emulated            Netflow                    26           32M
CICDS'17 [52]        5 days       B profile sys.      User behavior              21           2,830,743
CSE-CIC-IDS2018 [53] 17 days      B/M profile system  CICFlowMeter               500          4,525,399
LITNET-2020 [125]    10 months    Real                Flow traces                7,394,481    39,603,674
MAWILab [89]         15 min/d     Real                Sample point collection    N/A          N/A

Table 2. Datasets and Papers that Used the Datasets for Evaluation.

KDDCup1999: 46 papers: [5, 23, 24, 28, 30, 31, 40, 45, 46, 54, 60, 61, 66, 68, 83–86, 92, 94, 95, 98, 105, 106, 109, 112, 115, 134, 141, 145, 151, 152, 157, 164, 165, 168, 169, 172, 174–176, 178–181]
NSL-KDD 2009: 30 papers: [32, 41, 56, 57, 65, 69, 71, 78, 79, 87, 96, 99, 118, 126, 127, 130, 135, 142, 143, 151, 156, 162, 167, 169, 174, 177, 182, 189, 190, 194]
UNSW NB15 IDS: 14 papers: [12, 16, 65, 69, 79, 83, 84, 115, 147, 154, 167, 175, 177, 183]
UGR'16: 2 papers: [103, 104]
CIDDS'17: 5 papers: [2, 119, 137, 140, 141]
CICDS'17: 14 papers: [4, 8, 26, 41, 42, 59, 65, 133, 150, 173, 183, 186, 187, 193]
CSE-CIC-IDS2018: 1 paper: [85]
LITNET-2020: 1 paper: [34]
MAWILab: 1 paper: [193]
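The time-series comparison used in this meta-learning setup can be sketched as follows (a heavily simplified sketch of ours; Xu et al.'s delta score is learned by a model, whereas here we substitute a plain Euclidean distance between flow time series and classify a query by its most similar labeled sample):

```python
def delta_score(flow_a, flow_b):
    """Stand-in delta score: Euclidean distance between two equal-length
    time series of per-interval packet counts (smaller = more similar)."""
    return sum((a - b) ** 2 for a, b in zip(flow_a, flow_b)) ** 0.5

def classify(query_flow, labeled_flows):
    """Meta-testing style step: compare the unlabeled query flow against
    every labeled sample and adopt the label of the most similar one."""
    best = min(labeled_flows, key=lambda lf: delta_score(query_flow, lf[0]))
    return best[1]

labeled = [
    ([10, 12, 11, 9, 10], "normal"),     # steady traffic
    ([10, 300, 450, 500, 480], "ddos"),  # sudden sustained spike
]
print(classify([11, 280, 430, 490, 475], labeled))  # -> ddos
```

Because the decision rests on pairwise similarity rather than per-class model fitting, such a scheme can classify attack types represented by only a handful of labeled flows.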
Table 1 compares the common datasets and Table 2 presents the papers that used each dataset.
This section presents six types of basic network attacks:
(1) Malicious attacks infiltrate a network and spread malware from infected devices to other devices in the network. One type of malicious attack is a botnet, where a network of infected devices connected to the Internet performs criminal activity as a group [10].
(2) Insider attacks, or insider threats, are malicious threats from people within an organization. These include user-to-root (U2R) attacks, where an attacker gains access to user accounts and then exploits a vulnerability that grants root access. Attackers may also flood a server with requests to shut it down. Port scanning is another insider attack, where insecure ports are found via scanning and targeted for future attacks [75].
(3) Password attacks involve a malicious entity gaining access to someone's password through different means, such as using a dictionary to decrypt an encrypted password, or brute force, which involves directly trying different username and password combinations until one works [10].
(4) Distributed attacks target a specific server or user, but also the surrounding infrastructure within the network. One example is a backdoor attack, where an attacker gains entry to a website through a vulnerable entry point, a "back door" [10].
(5) Distributed denial of service (DDoS) or denial of service (DoS) attacks flood a network with overloaded requests to deny other users access to network resources such as servers.
(6) Spam attacks use messaging systems to send out messages in large groups, where the messages may be phishing schemes.
The KDD Cup 1999 dataset was a version of the 1998 DARPA Intrusion Detection Evaluation Program data collected by MIT Lincoln Labs in their packet traces, and it is one of the most widely used datasets for network intrusion detection [36]. Lincoln Labs acquired roughly nine weeks of raw tcpdump data from a local area network (LAN) that simulates an environment similar to an air force LAN. The attacks fall into four main categories: denial of service such as a SYN flood, unauthorized access to a remote machine (R2L), unauthorized access to a local superuser (U2R), and probing such as port scanning [36]. Although the KDD Cup 1999 dataset is considered relatively large in that it contains 41 features and over 4.8 million rows of data, it runs into the issue of duplicates between training and testing data [153]. The data is missing some important features such as IP addresses, although basic TCP attributes such as the source and destination bytes are provided. Although the KDD Cup 1999 dataset does capture a good number of attacks, the data was collected on a synthetic network. In general, the data is outdated because it was made nearly two decades ago and has bias due to synthetic generation [37]. The attack classes are also unbalanced: the proportions of the traffic categories span several orders of magnitude, with classes such as Back, Phf, Pod, and Teardrop each making up well under one percent of the data.

The NSL-KDD 2009 dataset was made to resolve issues of possible biases due to duplicate data between the training and testing datasets of the KDD Cup 1999 [51]. The Canadian Institute for Cybersecurity and the University of New Brunswick were involved in collecting the dataset. However, NSL-KDD removed some redundant, more frequent records from the KDD Cup 1999 training set, which can still be important; in turn, this may introduce further biases, given that the data from the raw TCP dump should still be kept. An underlying issue with the NSL-KDD dataset is that it still contains data from a network dating back as early as 1998's DARPA dataset. However, the breakdown of the traffic is 51.88% normal and 48.12% anomalous, which is almost completely balanced. The entropy between normal and anomalous traffic is 0.999, extremely close to that of a balanced dataset.
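The reported entropy can be reproduced from the 51.88%/48.12% split (a minimal sketch; we assume the survey's balance measure is Shannon entropy in bits, whose maximum for a two-class split is 1.0):

```python
import math

def shannon_entropy(proportions):
    """Shannon entropy in bits; for a two-class split, 1.0 means perfectly balanced."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# NSL-KDD: 51.88% normal vs. 48.12% anomalous traffic
h = shannon_entropy([0.5188, 0.4812])
print(round(h, 3))  # -> 0.999
```

The same function applied to a heavily skewed split (e.g., [0.95, 0.05]) yields a much lower value, which is why entropy serves as a compact imbalance measure across the datasets in Section 3.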
The UNSW NB15 Intrusion Detection System dataset contains source files in the formats of pcap, BRO, Argus, and CSV, along with reports by Dr. Nour Moustafa [124]. The dataset was created with an IXIA traffic generator that had TCP connections to a total of three servers. Two of these servers were connected to a router that had a TCP dump and three clients, where the TCP dump resulted in pcap files. The third server was connected to a router with three clients as well. The two routers were separated by a firewall. An issue with the UNSW NB15 dataset is, again, the realness of its data. Its attack categories, such as Fuzzers, each form only a small minority of the traffic compared to normal flows.

The UGR'16 dataset was collected from several NetFlow v9 collectors in the network of a Spanish ISP by researchers from the University of Granada in Spain [48]. The data is split into a calibration and a training set, where long-term evolution and periodicity in the data are a major advantage over previous datasets. However, a major issue is that most of the network traffic is labeled as "background", which may be either anomalous or benign. Also, there is a mix of synthetically generated network attacks along with real-world network traffic, which is not of the same quality as traffic with no simulated components. The dataset was labeled based on the logs from the honeypot system in their set-up. Attack classes such as DoS constitute only a small fraction of the traffic relative to the background.

The CIDDS-001 dataset was collected in 2017 by four researchers [139] (two PhD students and two professors) affiliated with the Coburg University of Applied Sciences in Germany [123]. The data was part of the project WISENT, funded by the Bavarian Ministry for Economic Affairs. The dataset was intended as an evaluation dataset for anomaly-based intrusion detection systems. It is labeled and flow-based, with a small business environment emulated on OpenStack. For the infrastructure, on the Internet side there are three attackers and an external server, separated by a firewall from a server with three subnet layers: developer, office, and management. Four servers in the OpenStack environment contain the three subnet layers. DoS, brute force, and port scanning attacks were generated in the network. The first label attribute is the traffic class: normal, attacker, victim, suspicious, or unknown; the second label attribute is the attack type, and the third is an attack ID. Because the external server emulates a real network environment, the CIDDS-001 dataset is primarily used for benchmarking models. There are only three types of attacks, which reveals a lack of attack diversity in the data [138]. The class breakdown is heavily skewed: 89.8% of the flows are non-attacks, while brute force attacks account for only 0.023%.
The CICIDS dataset was also collected by the Canadian Institute for Cybersecurity and the University of New Brunswick [52]. The network traffic was generated by a proposed B-profile system in which abstract behaviors were derived for 25 users based on the HTTP, HTTPS, FTP, SSH, and email protocols. With regard to the victim and attacker network information, there was a firewall against the IPs 205.174.165.80 and 172.16.0.1 and a DNS server at 192.168.10.3. The attacker network comprises two IPs (Kali: 205.174.165.73, Win: 205.174.165.69). The victim network is composed of 2 web servers, 16 public machines, 6 Ubuntu servers, 5 Windows servers, and a Mac server. The data collection occurred over the course of 5 days: Monday was benign activity, Tuesday was brute force, Wednesday was DoS, and Thursday was web attacks, with the afternoon seeing botnet, port scan, and DDoS LOIT activity. The traffic is dominated by benign flows, with attack types such as infiltration forming only a tiny fraction of the data.

The CSE-CIC-IDS2018 dataset is a collaborative project between the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC) [53]. A notion of profiles is adopted to generate data systematically. First is the B-profile, which captures user behavior using machine and statistical learning techniques; M-profiles are human users or automated agents who may examine network scenarios. With the environment supported in AWS, the network topology includes an attack network of 50 machines, 5 departments holding 100 machines each, and a server with 30 machines. As in CICIDS, attack classes such as brute force make up only a small share of the traffic.

LITNET is a new annotated benchmark dataset whose data was collected by four professors, a PhD student, and two students at the Kaunas University of Technology (KTU) [125]. The infrastructure of the network is composed of nodes with communication lines connecting them.
The LITNET topology consists of senders and receivers, netflow senders (Cisco routers), and a netflow server. The netflow exporters were located in four cities in Lithuania, at Vilnius Gediminas Technical University, and at two KTU university nodes. The dataset contains real network attacks in a Lithuania-wide network with servers in four geographic locations within the country. The traffic is dominated by normal flows, with attack types such as Smurf accounting for only small fractions of the data.
Fig. 2. Hierarchical Chart of Categorized High-Level Challenges and Recent Methods to Resolve Them. The chart divides data-driven, anomaly-based network intrusion detection into challenges to research studies and challenges to technical models. The research-study challenge is the lack of real-world network data (e.g., only two real datasets, UGR and LITNET), met by collection in real network infrastructure and simulated realism. The technical-model challenges are grouped by features: noisy data (e.g., some observations, such as DDoS, having much higher flow rates), redundant data (e.g., duplicate instances per class), and weakly correlated data (e.g., low correlation per feature pair), addressed by feature normalization, density-based clustering, redundancy removal frameworks, feature selection, rough sets, autoencoders, increased dimensionality, and feature fusion; by labels: too few labeled data (e.g., a fraction of attacks unlabeled), addressed by semi-supervised learning, transfer learning, and adversarial sample generation; and by instances: imbalanced data (e.g., 80% normal, 20% attack), addressed by over- and under-sampling, genetic programming, optimal feature extraction, and Siamese neural networks; big data (e.g., ~1 million+ rows), addressed by data reduction, parallelism and multi-threading, and cloud computing; small data (e.g., ~1k rows), addressed by transfer learning and meta learning; and dynamic data (e.g., real-time updating data), addressed by stream data models, incremental learning, and reinforcement learning.
MAWILab is a database containing the dataset from the MAWI archive that records network traffic data between two endpoints [89]: one in Japan and another in the US. MAWILab's dataset has been contributed to since 2010 [50] and records 15 minutes of network traces each day. Labels of network traffic are generated from anomaly classifiers based on port numbers, TCP flags and ICMP codes, along with a taxonomy of traffic anomalies based on packet headers and connection patterns [108]. The graph on MAWILab's website divides the type of traffic over the course of 13 years based on byte and packet ratios. HTTP traffic used to be very common from 2007 to 2017, but sharply decreased at the end of 2017. Port scanning is uncommon, while "multiple points" was the second most dominant traffic type from 2007 to 2017. A spike in denial of service (DoS) data was collected between 2011 and 2012. Currently, the most common type of traffic by byte and packet ratio is multi points, then HTTP, then IPv6 tunneling and alpha flow. The number of anomalies from 2007 to 2020 ranged roughly between 100 and 200 at any time, with outliers as low as 50 and as high as 500 anomalies daily. Since the network traffic has been collected over the same link between two endpoints since 2007, MAWILab's network may not resemble most networks in use today. In addition, the labels fall into four broad categories: anomalous, suspicious, notice and benign. The labels depend on the anomaly classifiers, so there may be misclassified traffic.
Figure 2 presents a hierarchical chart of categorized high-level challenges and recent methods to resolve them. This section will discuss the challenges and introduce the methods in detail.
Figure 3 summarizes the articles collected for the taxonomy, where their publisher, month and year of publication, and topic are shown in the four bar charts.
Fig. 3. Plots of Distribution of Articles (Left to Right): Articles By Publisher, Month, Year, and Topic.
Challenge.
When network traffic data was initially being collected, even as early as the KDD Cup dataset from 1999, attacks were outdated and not representative of attacks carried out in the real world. Because using real-world networks to collect network traffic was costly, researchers looked to simulating realistic networks, with synthetic data generation or a simulated virtual network as an alternative. Initially, honeypots were used as a means of simulating a virtual network environment to attract attackers and gather traffic data. Honeypots are security resources that are meant to be misused by malicious attackers, with such attacks recorded in databases. They consist of a decoy, or an information source, and a security program that provides attack monitoring and detection [43]. These mechanisms can be used to collect network intrusion data with simulated realism while running in a virtual machine [39, 81], possibly containing more than one honeypot so as to resemble a distributed honeypot system and simulate a distributed network more accurately [102]. Then came the use of TCP dumps with IXIA traffic generation after IXIA's products on virtual network testing came out. Moustafa and Slay [116] created the UNSW-NB15 dataset by synthetically generating data with an IXIA traffic generator and then collecting pcap files extracted from a tcpdump. This synthetic data generation was improved upon in 2017 by Haider et al., who generated network traffic via IXIA Perfect Storm and collected the host's network logs during the simulation. This improved on the UNSW-NB15 dataset, which lacked the information on normal and synthetic data that comes from the operating system's log files. Haider et al. also verified the realism of their dataset through a Sugeno fuzzy inference engine [62]. Architectures of the main approaches to the creation of real-world network data are illustrated in Figure 4.
Collection in public network infrastructure.
In the past couple of years, researchers have looked to collect network traffic data in cloud environments due to the growing usage of cloud computing platforms such as Amazon Web Services and Google Cloud. A mixture of virtualization and cloud
(a) vNIDS [93]: Network traffic arrives into the detection system and is passed through three types of microservices (header-based, protocol-based and payload-based). Data is then passed into the vNIDS controller, which contains state management, responsible for detection state classification, and provision control, responsible for partitioning detection logic programs into header- and payload-based DLPs. (b) ISOT-CID [7]: Three hypervisor nodes (A, B, C) hold ten virtual machine instances, connected through routers to the ISOT cloud network. Zone A's hypervisor is connected to the internal network, and zone B's hypervisor is connected to a virtual LAN. (c) UGR'16 [103]: The topology of the network begins with the internet, to which two border routers, BR1 and BR2, are connected. The second border router is connected to the attacker network (five machines). The core network has five victim machines used in data collection, protected by two firewalls. The inner network holds 15 victim machines, with five machines placed in each of three distinct existing networks. (d) LITNET [34]: The top three connecting nodes are CITY2 (Klaipeda University), CITY3 (Siauliai University), and CITY4 (KTU Panevezys Faculty of Technologies and Business) from left to right. The middle three are CITY1 (Kaunas: Vytautas Magnus University and Kaunas University of Technology), KTU University 2, and CAPACITY (Vilnius Gediminas Technical University) from left to right. The lower-left router is KTU University 1, alongside a firewall and a netflow server. The four nodes KTU UNIVERSITY 1, CAPACITY, KTU UNIVERSITY 2 and CITY1, along with the firewall, are netflow exporters that capture new traffic. Fig. 4. Paradigms of Systems Used Towards Real-World Network Data Collection.
intrusion detection using hypervisors has been implemented as well, which can resolve the issue of small datasets by aggrandizing network traffic data. In 2018, Hongda et al. [93] combined network virtualization with software-defined networks to handle attack traffic. Their virtual network intrusion detection system (vNIDS) employed static program analysis to determine the detection states
(a) Feature normalization is depicted via a shifting of the standard deviation in the distribution of data: from a standard deviation of a, normalization scales feature data down so that all features follow the same scale, such as a standard normal distribution. (b) Feature Normalization + SVM and Fuzzy Rough Set [100]: Liu et al. applied the idea of a fuzzy rough set, which has lower and upper approximations to a target set, where inclusion could mean membership in the normal or anomalous group; anomalies are observed further outside the target set. (c) Feature Normalization + Autoencoder [69]: Following normalization, and within an ensemble machine learning model, Hsu et al. implemented an autoencoder, which takes in the input data (an input array) and maps it from input space into code space, followed by a decoding phase that reconstructs the data. It can remove anomalies by capturing the main, important features in the data and is often used for dimensionality reduction. (d) Feature Normalization + Self-Organizing Map [87]: The self-organizing map (SOM) figure illustrates how an input space is mapped to a 2D SOM lattice, where mapping is done to the nearest neighbors of a normal observation and neighbors further away may be more anomalous. (e) Density-Based Clustering [159]: Tang et al. used nearest-neighbor algorithms for outlier detection, which boils down to clustering and treating distant observations as anomalies. (f) Feature Normalization + SVM and Fuzzy Rough Set [100]: Liu et al. applied the idea of a fuzzy rough set to better distinguish noise, to which SVMs are traditionally known to be sensitive. Fig. 5. Paradigms to Distinguish Noisy Outliers in Data Collection.
to share. The prototype of vNIDS was implemented in CloudLab for flexibility in processing capacity and placement location. In the past year, Aldribi et al. [7] acknowledged the new challenging terrain that cloud computing provides for attackers. In turn, they implemented a new hypervisor-based
cloud network intrusion detection system using multivariate statistical change analytics to detect anomalies. Alongside further research into generating network traffic data in the cloud, the realism of past datasets was called into question because of their outdated attacks and synthetic traffic generation. One proposed solution involved gathering real-world network traffic from a university network such as the Lithuanian Research and Education Network [34] or from a real virtual network of a tier-3 ISP, as done for the UGR'16 dataset [103].
Challenge.
Some traffic data in datasets may contain outliers, which can come in the form of less frequent traffic classes. To combat noisy data, or data with outliers, feature normalization methods have been applied to scale features so that they have similar effects in the model and noise is not weighed differently than the rest of the data. In other instances, density-based feature selection was used to identify the most important features by finding overlapping and non-overlapping regions between feature probability distributions. Comparisons of methods for handling noise are highlighted in Figure 5.
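Both normalization styles discussed in this survey, standardization to zero mean and unit variance and logarithmic max-scaling, can be sketched in a few lines; the `packets` column below is a hypothetical example, not taken from any of the surveyed datasets:

```python
import numpy as np

def zscore(x):
    """Standardize a feature column to mean 0 and variance 1: (x - mean) / std."""
    return (x - x.mean()) / x.std()

def log_max_scale(x):
    """Rescale a heavy-tailed, non-negative feature by log(x + 1) / max(log(x + 1))."""
    logged = np.log(x + 1.0)
    return logged / logged.max()

# Hypothetical packet counts; the last value mimics a DDoS burst.
packets = np.array([10.0, 20.0, 50.0, 100000.0])
z = zscore(packets)
scaled = log_max_scale(packets)
```

The logarithmic variant compresses the extreme DDoS value into the same (0, 1] range as the rest, which is the motivation given for it in the systems surveyed here.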
Feature normalization.
Feature normalization methods can be applied to scale features so that they have similar effects in the model and noise is not weighed differently than the rest of the data. Statistical methods have been used to facilitate network anomaly classification. In 2015, Delahoz et al. [87] studied a probabilistic Bayesian self-organizing map model to perform unsupervised learning. To overcome the challenge of noise in the network data, they normalized continuous variables to have a mean of 0 and a variance of 1, a standard normal distribution. Categorical variables are encoded before being normalized: a categorical encoding is 1 if a feature is "activated" and 0 if not. Although normalization to a standard normal distribution via (x − x̄)/σ is one method, rescaling logarithmically is another option. Hsu et al. [69] developed an online intrusion detection system based on an autoencoder, SVM, and Random Forest ensemble, where noise was dealt with through feature normalization using the two normalization functions

ā = log(a + 1) / (log(a + 1))_max, (1)
ā = a / a_max. (2)

The functions rescale feature values to the proper range, where a is the original raw data value, a_max is the maximum value among all values under the same feature as a, and (log(a + 1))_max is the maximum of log(raw value + 1) over all logarithmic values under the same feature as a. Packets sent and received were two features that were extremely variable because certain attacks (DDoS) entail much larger amounts of traffic in the network, so those feature values are normalized by their logarithm divided by its maximum value (Eq. 1). Features with lower variance are normalized by division by their maximum value (Eq. 2). Specific to the sensitivity to noise innate in support vector machines (SVMs), Liu et al.
[100] worked towards mitigating the sensitivity that SVMs have to noise samples by applying a fuzzy membership that measures the distance between a sample and the hyperplane, as in SVM. The larger the distance, the smaller the weight coefficient for the sample. Each sample has a distinct effect on the optimized classification hyperplane, so outliers and noise (values with larger distance) do not impact the classifier plane as much, as they are assigned lower weights.
Density-based clustering.
In other instances, density-based clustering is used to group together data from the same class and identify outliers that are unusually distant from the observed clusters. Because of the scattered nature of denial-of-service (DoS) attacks in wireless sensor networks
(a) Correlation-based Feature Selection [86]: Features are lined up on the horizontal and vertical axes of a correlation map; the method chooses features that are highly correlated with a class but not correlated with each other. Genetic Algorithm [55]: Ganapathy et al. identified a trending feature selection method using a genetic algorithm with a fitness function and a decision tree, where features are removed and model fitness is evaluated so that the optimal feature set is obtained. (b) Swarm Optimization [31]: Chung and Wahid improved on standard swarm optimization by conducting a local weighted search to avoid premature "optimal" solutions. Particles are updated by evolutionary operators; depending on its fitness, each particle's location is updated until the final feature set is optimal. Fig. 6. Methods for Dealing with Redundant Features: Besides the above two methods, Autoencoder [6] and Fuzzy Rough Set [146] have also been used for reducing redundant features.
(WSNs), Shamshirband et al. [149] introduced an imperialist competitive algorithm (ICA) with density-based algorithms and fuzzy logic. Dense areas in data space are clusters, and low-density areas (noise) surround them. Density-based clustering can detect arbitrarily shaped clusters and handle noise. As network intrusion detection involves outlier detection, one may broaden the density-based approach to outlier detection. Tang and He [159] presented an effective density-based outlier detection method in which a relative density-based outlier score is assigned to observations as a means of distinguishing major clusters in a dataset from outliers. Similarly, Gu et al. [59] applied a density-based initial cluster center selection algorithm to a Hadoop-based hybrid feature selection method to mitigate outlier effects.
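The relative-density idea can be sketched in a few lines of NumPy. This toy scorer is my illustration of the general approach, not Tang and He's exact formulation: it compares each point's inverse mean k-NN distance with that of its neighbors, so isolated points receive scores well above 1:

```python
import numpy as np

def relative_density_scores(X, k=3):
    """Score each point by the ratio of its neighbors' average local density
    to its own density (inverse mean k-NN distance); scores well above 1
    flag likely outliers."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    n = len(X)
    dens = np.empty(n)
    nbrs = np.empty((n, k), dtype=int)
    for i in range(n):
        knn = np.argsort(D[i])[1:k + 1]          # k nearest, skipping the point itself
        nbrs[i] = knn
        dens[i] = 1.0 / (D[i, knn].mean() + 1e-12)
    return dens[nbrs].mean(axis=1) / dens

# A tight cluster plus one distant point standing in for an anomalous flow.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [10.0, 10.0]])
scores = relative_density_scores(X, k=3)
```

The distant point's own density is low while its neighbors' densities are high, so its ratio dominates.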
Challenge.
Some features in a network intrusion feature set may not contribute significantly to the predictive power of a model, so they may be removed based on feature importance. To handle redundant data, frameworks have been built to remove redundancies. Significant methods for handling redundant features in data are illustrated in Figure 6.
Feature removal frameworks.
The presence of data redundancies is a prevalent issue among network intrusion datasets, so researchers have developed frameworks in which specific data removal techniques are recommended. Initial feature removal methods were integrated into computational intelligence approaches over the course of the 2000s and into the 2010s. In 2013, Ganapathy et al. [55] wrote a review detailing a gradual feature removal method; a modified mutual information method that selects features to maximize information about outputs (maximize relevance between inputs and outputs); conditional random fields (CRFs) as a layered method, each layer representing an attack type; and genetic feature selection, where a set of trees is generated and the best set of features is extracted. Recent research appears to focus on integrating feature removal methods into a more streamlined model creation process. Bamakan et al. [13] proposed an effective intrusion detection framework where feature selection is embedded in the objective function combined with time-varying chaos particle swarm optimization (TVCPSO). The number of features
is weighed in the objective function through the term w_F · (1 − (1/n_F) Σ_{i=1}^{n_F} f_i), where w_F is an arbitrary weight and f_i is the i-th feature mask (1 if selected and 0 if not). They streamlined their weighted objective function approach in a flow chart where, with each iteration, the fitness of the particles is updated in particle swarm optimization and a chaotic search is done to find the global optimum. This year, Carrion et al. [104] addressed the lack of evaluation in network intrusion detection methods by providing a structured methodology that involves more rigorous feature selection and removal techniques. Including the steps by which feature selection or removal took place to arrive at a final accuracy, as they stated, can allow for easier replication and more reliable evaluation in the network intrusion detection literature.
Feature selection.
Feature selection can rule out redundant features and select a subset of the features in the data without significantly degrading the performance of the model [74]. The early 2010s saw an interest in filtering-based feature selection methods, as Koc et al. [86] applied the hidden Naïve Bayes (HNB) model to data with highly correlated features. Accompanying their HNB model was a filter-based feature selection model that is both correlation- and consistency-based and relies only on the statistical properties of the data. Correlation-based feature selection picks features that are highly correlated with a class; the consistency-based filter has an inconsistency criterion that specifies when to stop reducing the dimensionality of the data. After filter-based methods, there was interest in using forward selection for feature ranking via Random Forest by Aljarrah et al. [5]. Rather than finding the most optimal feature set, however, Elmasry et al. [41] recently argued that feature selection can be time-consuming due to its exhaustive search, and that evolutionary computation techniques may be applied to find near-optimal solutions in a shorter amount of time.
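A simple correlation-based redundancy filter in the spirit of these methods (a sketch, not Koc et al.'s exact algorithm; the feature names are hypothetical) greedily keeps a feature only if it is not highly correlated with any feature already kept:

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Greedy filter: keep a feature unless its absolute Pearson correlation
    with an already-kept feature exceeds the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Toy flow features: "dur_ms" is an exact rescaling of "dur", so it is redundant.
rng = np.random.default_rng(0)
dur = rng.normal(size=200)
pkts = rng.normal(size=200)
X = np.stack([dur, 2.0 * dur, pkts], axis=1)
kept = drop_correlated(X, ["dur", "dur_ms", "pkts"], threshold=0.9)
```

A consistency criterion, as in the HNB pipeline, would then decide when to stop shrinking the set further.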
Automatic feature extraction.
In the realm of automatic feature extraction, rough set theory and autoencoders are two important automation methods. Rough set theory ranks extracted features from network intrusion data and generalizes an information system by replacing the original attribute values with discrete ranges [9]; autoencoders are considered nonlinear generalizations of principal component analysis, which use an adaptive, multilayer "encoder" network to reduce data dimensionality [67]. The early 2010s saw research interest in rough set theory for feature selection. Because simplified swarm optimization (SSO) may find premature solutions, Chung and Wahid [31] improved its performance by conducting a local weighted search after SSO to produce more satisfactory solutions. They applied k-means clustering to continuous network data values and rough set theory to minimally sized subsets of the features. The goodness of the selected features is evaluated using a fitness function, given input data D, with |C| the number of features, |R| the length of a feature subset R, and γ_R the classification quality of feature set R:

α × γ_R(D) + β × (|C| − |R|) / |C|. (3)

Data is changing rapidly, and with the increasing presence of irrelevant features, Liu et al. [99] introduced a Gaussian mixture model to extract structural features in a network and identify anomalous and normal patterns, where redundant features were removed and important features were optimally selected using fuzzy rough set theory. Alongside irrelevant features and the age of big data, the speed at which a model's objective function converges slows down. Both fuzzy rough set methods and autoencoders have been devised to tackle the large volume of data. With uncertainty surrounding whether network traffic is normal or anomalous, Selvakumar et al.
[146] presented a fuzzy rough set attribute selection method in which a feature subset P is evaluated on its relevance to the data through the fuzzy rough degree of dependency γ′_P(D) of the decision attribute D on P. To handle growing data, as well as irrelevant data, Alqatf et al. [6] proposed the use of an autoencoder for feature learning and dimensionality reduction to extract the most important features and filter out those that are redundant. They then pass the reduced data into an SVM model for network traffic classification.
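The subset-evaluation trade-off in Eq. (3) is easy to express directly. In this sketch the weights alpha and beta and the feature counts are illustrative values, not those used in the cited papers:

```python
def subset_fitness(gamma_R, n_features, subset_size, alpha=0.9, beta=0.1):
    """Eq. (3): alpha * gamma_R(D) + beta * (|C| - |R|) / |C|.
    Rewards classification quality (gamma_R) while favoring smaller subsets."""
    return alpha * gamma_R + beta * (n_features - subset_size) / n_features

# With equal classification quality, the smaller subset scores higher.
full = subset_fitness(0.95, 41, 41)    # all 41 features kept (hypothetical count)
small = subset_fitness(0.95, 41, 10)   # a 10-feature subset
```

The alpha/beta balance controls how aggressively the search trades accuracy for compactness.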
[Figure 7 flow: input, 2D image data, split into 4 feature categories, grayscale conversion, fusion of the results of 4 CNN models, output.]
Fig. 7. Li et al. [96] proposed a multi-convolutional neural network (multi-CNN) fusion framework where the initial one-dimensional input is converted to a 121-dimensional feature vector after numeralization. The first part of the data, containing 90 features, is transformed into a 9-by-10 matrix; the second, third and fourth parts have 11, 9, and 10 features, respectively. Feature data is split into 4 feature categories (Host-based, Time-based, Content, Basic), then the 64-dimensional outputs from the last hidden layers of the four CNNs are combined into 256-dimensional data that is fed into a softmax layer and used as output for predictions.
Challenge.
The lack of strong correlation between features in data may make the construction of a model more challenging. Correlation can be artificially introduced by increasing the dimensionality of the data through data fusion or the introduction of new features.
Increase dimensionality.
Given one-dimensional feature data, Li et al. [96] augmented the data to two dimensions and performed data segmentation, where the split data was later fused back together for network intrusion classification. They split feature data into four separate parts based on features that are correlated with one another. The one-dimensional feature space is converted to grayscale; then the outputs from the four data components are merged and passed to the output layer of the multi-fusion CNN. The procedure is illustrated in Figure 7.
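The split-and-reshape step can be sketched as follows; the group sizes and matrix shapes here are toy values, not the exact 121-feature layout of Li et al.:

```python
import numpy as np

def to_grayscale_groups(x, groups, shapes):
    """Split a 1-D feature vector into index groups, min-max scale each group
    to [0, 255] "grayscale", and reshape each to the given 2-D shape."""
    images = []
    for idx, shape in zip(groups, shapes):
        g = x[idx].astype(float)
        span = g.max() - g.min()
        gray = np.zeros_like(g) if span == 0 else (g - g.min()) / span * 255.0
        images.append(gray.reshape(shape))
    return images

x = np.arange(120.0)                              # a toy 120-dimensional flow record
groups = [np.arange(0, 90), np.arange(90, 120)]   # hypothetical correlated groups
imgs = to_grayscale_groups(x, groups, [(9, 10), (5, 6)])
```

Each resulting matrix can then be fed to its own CNN branch before fusion.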
Challenge.
The data may also be imbalanced, with network intrusion attacks being disproportionately fewer than normal network activity. As discussed in Section 3 on common public datasets, most network intrusion datasets face considerable imbalance between normal and anomalous traffic, and especially among attack types. Figure 8 highlights the random oversampling and undersampling techniques used to handle minority and majority classes in network intrusion datasets and the main machine learning models implemented to handle unbalanced classes.
Over/under-sampling.
Oversampling increases samples from the minority class to balance the distribution of data between attacks and normal activity in a network. Undersampling removes samples from the majority class so that minority and majority classes become similar in size, preventing misclassification of underrepresented network attacks [192]. A collection of work was published last year on the use of over- or under-sampling to balance network intrusion
(a) Undersampling [109]: Mikhail et al. applied random undersampling, illustrated by randomly sampling fewer instances of the largest class to obtain equally sized class datasets; oversampling is another way to make the classes equal in size. (b) Synthetic Minority Over-sampling (SMOTE) [185]: Zhang et al. implemented a minority oversampling technique that weighs harder examples more heavily, so they are synthesized more; the synthesized points are generated from the more heavily weighed minority-class points. (c) Parallel Convolutional Neural Network (CNN) and Feature Fusion [186]: Zhang et al. used feature fusion and a parallel CNN. The top branch of convolutional layers is responsible for pixel-level classification; the lower branch of convolutional and pooling layers mitigates redundant features from majority-class samples via downsampling. Output feature maps are fused at different stages, and a global average pooling layer further reduces redundant features. (d) Siamese Neural Network [15]: Bedi et al. used a siamese neural network that accepts a pair of inputs into two identical sub-networks sharing the same weights. The networks extract feature representations that are passed into a fully convolutional network, resulting in a prediction of normal or anomalous traffic. Fig. 8. Highlighted Machine Learning and Sampling Methods for Unbalanced Labels.
datasets. Mikhail et al. [109] addressed the issue of minority attack classes by training an ensemble classifier with undersampled data and training each sub-ensemble. Gao et al. [56] noticed that the KDD Cup dataset they used had a large amount of user-to-root (U2R) attacks, so they changed the proportion of classes in the samples passed as input to their models. They used classification and regression trees (CARTs), where multiple trees were trained on samples adjusted by random undersampling, similar to Mikhail and others' work, in which a portion of normal traffic (a majority class) was sampled to address imbalances. Although minority sampling can produce more evenly proportioned classes for network intrusion detection, majority classes may be predicted with lower accuracy due to undersampling of majority classes or oversampling of minority classes. Zhang et al. [185] resolved this issue by combining weighted oversampling with
an ensemble boosting method. The weighted oversampling technique updates the weights associated with minority classes, and the classifier is forced to learn the misclassified majority-class observations.
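Random over-sampling, the simplest of the techniques above, can be sketched as follows; this is a toy illustration, not the referenced authors' pipelines:

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Resample every class with replacement up to the majority-class size,
    so each class contributes the same number of rows."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=target, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

# Toy 80% "normal" (0) vs 20% "attack" (1) split.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)
Xb, yb = oversample_minority(X, y)
```

Undersampling is the mirror image: replace `counts.max()` with `counts.min()` and sample without replacement.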
Optimal feature extraction.
Ranking features based on their importance can reduce a feature set to an optimal feature subset. Thaseen et al. [162] used a consistency-based feature selection method that determines whether the values and class labels of two observations match. Zhang et al. [188] aggregated time intervals of network traffic into subgroups to obtain more accurate information from five features: address count, packet count, port count, byte count and bytes per packet.
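Window-based aggregation of flow records into those five features might look like the following sketch; the record layout `(timestamp, src_addr, dst_port, packets, bytes)` is an assumption for illustration, not the cited authors' schema:

```python
from collections import defaultdict

def window_features(flows, window=60):
    """Aggregate flow records into fixed time windows and derive the five
    features: address count, packet count, port count, byte count, and
    bytes per packet. Assumed record layout:
    (timestamp, src_addr, dst_port, n_packets, n_bytes)."""
    buckets = defaultdict(list)
    for ts, addr, port, pkts, byts in flows:
        buckets[int(ts // window)].append((addr, port, pkts, byts))
    out = {}
    for w, recs in buckets.items():
        pkts = sum(r[2] for r in recs)
        byts = sum(r[3] for r in recs)
        out[w] = {
            "address_count": len({r[0] for r in recs}),
            "packet_count": pkts,
            "port_count": len({r[1] for r in recs}),
            "byte_count": byts,
            "bytes_per_packet": byts / pkts if pkts else 0.0,
        }
    return out

flows = [(1, "10.0.0.1", 80, 10, 1000),
         (2, "10.0.0.2", 443, 10, 1000),
         (61, "10.0.0.1", 80, 5, 500)]
feats = window_features(flows, window=60)
```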
Siamese neural network.
To combat the challenge of minority and majority classes in imbalanced datasets, Bedi et al. [15] employed a few-shot learning method called a Siamese Neural Network, first introduced by Bromley et al. [19]. Siamese neural networks compute the similarity between two input samples to determine how similar or dissimilar they are, so pairs of samples belonging to the same class, such as DoS-DoS, Normal-Normal or U2R-U2R, were considered most similar and labeled with a 1, whereas pairs from distinct classes were labeled with a 0. Traditional oversampling and undersampling were bypassed by pairing the siamese neural network with sampling an equal number of observations per network traffic class.
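Pair construction for such a siamese model can be sketched as follows; the class names and sample values are placeholders, not data from the cited work:

```python
import random

def make_pairs(samples_by_class, n_pairs, seed=0):
    """Build (sample, sample, label) training pairs: label 1 for two samples
    drawn from the same class (e.g., DoS-DoS), 0 for two different classes."""
    rng = random.Random(seed)
    labels = list(samples_by_class)
    pairs = []
    for i in range(n_pairs):
        if i % 2 == 0:                       # similar pair
            c = rng.choice(labels)
            pairs.append((rng.choice(samples_by_class[c]),
                          rng.choice(samples_by_class[c]), 1))
        else:                                # dissimilar pair
            a, b = rng.sample(labels, 2)
            pairs.append((rng.choice(samples_by_class[a]),
                          rng.choice(samples_by_class[b]), 0))
    return pairs

# Placeholder per-class sample ids, balanced across classes.
pairs = make_pairs({"normal": [1, 2], "dos": [3, 4]}, n_pairs=10)
```

Because pairs are generated rather than resampled, a rare class contributes as many training pairs as a common one.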
Feature fusion.
Feature fusion can combine different data so that, together, the result is balanced attention to features. Zhang et al. [186] implemented a parallel cross convolutional neural network that fused traffic flow features learned by two separate convolutional neural networks to make the network pay more attention to minority attack classes. After downsampling in the two neural networks, the number of channels was doubled in the output feature map, and a pooling layer was then applied to reduce the dimensionality of the data by combining the outputs of clusters in one layer into a single neuron in the next layer.
Genetic programming.
Genetic programming uses an agent to learn an optimal or near-optimal solution to a problem [171] and can be used in conjunction with machine learning models to evolve a model until its fitness is optimized for network intrusion detection. Le et al. [91] found genetic programming to perform well on imbalanced datasets when using accuracy as the fitness function: the number of true positives and true negatives over all classified observations.
Challenge.
Data may lack labels, particularly when network traffic is ambiguous or unlabeled. This poses another challenge between the stages of data preprocessing and model creation. Figure 9 illustrates transfer learning, adversarial sample generation, and deep learning paradigms used to resolve the issue of unlabeled data.
Unsupervised/Supervised Learning.
Initially, with a completely unsupervised learning approach, Casas et al. [23] used an unsupervised sub-space clustering method to detect network intrusions by aggregating traffic packets into multi-resolution traffic flows. With too few labeled data, researchers may look to semi-supervised learning: first perform unsupervised learning on unlabeled data to label it, then pass the labeled data to a supervised learning model. More recently, there has been more research on semi-supervised network intrusion detection methods. Khan et al. [84] proposed a semi-supervised model that initially classified unlabeled traffic as normal or anomalous with a probability score that was used as input for an unsupervised autoencoder to train on; the data was then passed into stacked pretrained neural layers with a softmax layer for classification. Through a more randomized approach, Ravi and Shalinie [135] proposed a semi-supervised learning model
that employed repeated random sampling and k-means to label data as different traffic types, then passed it through classifiers developed in related work.

Fig. 9. Illustration of main methods towards handling too few labels in network data. (a) Semi-supervised learning [84]: in the initial stage, Khan et al. used a deep neural network that predicts whether a traffic observation is normal or anomalous using a probability score. This score is used as an additional feature in the stacked autoencoder model with softmax layers. (b) Transfer learning [154]: Singla et al. follow the transfer learning heuristic in which a model pretrained on the source dataset is retrained on a target dataset, so that knowledge from the source dataset carries over, producing a target model used for anomaly prediction. (c) Adversarial sample generation [29]: among the methods for generating adversarial samples, Cheng et al. used a generative adversarial network (GAN) whose generator produces fake data (mutated samples) that is fed along with real data into a discriminator, which filters adversarial samples and outputs predicted labels. The loss from the predictions is used to refine the adversarial samples, which are again fed into the discriminator to distinguish real from fake data.
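The cluster-then-classify pattern described in this section (label unlabeled data with unsupervised learning, then fit a supervised model on the pseudo-labels) can be sketched with toy numbers. This is a minimal illustration, not Ravi and Shalinie's model; the data and the 1-D k-means are hypothetical simplifications.

```python
# Semi-supervised sketch: cluster unlabeled flows with 1-D k-means
# (k = 2), treat cluster assignments as pseudo-labels, then fit a simple
# supervised threshold classifier on those pseudo-labels.

unlabeled = [2.0, 2.2, 1.8, 9.0, 9.5, 8.8, 2.1, 9.2]  # e.g. log byte counts (hypothetical)

def kmeans_1d(xs, iters=10):
    c0, c1 = min(xs), max(xs)                  # simple initialization
    for _ in range(iters):
        a = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        b = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(a) / len(a), sum(b) / len(b)
    return c0, c1

c0, c1 = kmeans_1d(unlabeled)
pseudo = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in unlabeled]

# "Supervised" step: fit a midpoint threshold to the pseudo-labels.
threshold = (max(x for x, y in zip(unlabeled, pseudo) if y == 0)
             + min(x for x, y in zip(unlabeled, pseudo) if y == 1)) / 2

def classify(x):
    return int(x > threshold)

print(threshold, classify(3.0), classify(9.1))
```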
Transfer learning.
Transfer learning can compensate for the lack of labeled data via transfer of knowledge from other labeled data sources [101]. Singla et al. [154] examined the viability of transfer learning for imbalanced datasets; namely, the UNSW-NB15 dataset was split into labeled sub-datasets. Each sub-dataset was split into a source dataset and a target dataset, where the classifier was pretrained on the source dataset and then retrained on the target dataset to combat the lack of labeled data. Beyond the synthetic UNSW-NB15 dataset, a network type that has recently been explored is the consumer network, which doesn't have the firewalls or switches to deter network intrusion attacks. Patel et al. [129] proposed normalized entropy from features in payload, packet, and frame statistics to be funneled into training and testing datasets and passed into a one-class SVM for classification of traffic in consumer networks. Through the collection of packet capture data sent from programs on devices in a consumer network, a consumer network dataset was constructed to combat the lack of consumer network datasets. Patel and other researchers exhibited that exhaustively labeled datasets aren't necessary for accurate intrusion detection models.
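The pretrain-then-retrain recipe can be sketched with a toy perceptron, assuming hypothetical 2-feature flows. This is only an illustration of the heuristic, not Singla et al.'s network: weights learned on a larger "source" dataset initialize the model that is fine-tuned on a small "target" dataset.

```python
# Transfer learning sketch: pretrain a perceptron on source data, then
# fine-tune on a target dataset too small to train from scratch.

source = [((1.0, 0.2), 0), ((0.9, 0.1), 0), ((0.2, 1.0), 1), ((0.1, 0.9), 1)]
target = [((0.3, 1.1), 1), ((1.1, 0.3), 0)]   # few labels on its own (hypothetical)

def train(data, w, b, epochs=20, lr=0.1):
    for _ in range(epochs):
        for (x1, x2), y in data:
            pred = int(w[0] * x1 + w[1] * x2 + b > 0)
            err = y - pred
            w = [w[0] + lr * err * x1, w[1] + lr * err * x2]
            b += lr * err
    return w, b

w, b = train(source, [0.0, 0.0], 0.0)        # pretrain on the source dataset
w, b = train(target, w, b, epochs=5)         # retrain (fine-tune) on the target dataset

def predict(x1, x2):
    return int(w[0] * x1 + w[1] * x2 + b > 0)

print(predict(0.2, 1.0), predict(1.0, 0.2))  # 1 0
```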
Adversarial sample generation.
Adversarial sample generation is done to fool a machine learning model, especially a neural network, with adversarial samples, so that correctly classified data can be mistakenly identified as another class [80]. Using a random forest, Apruzzese et al. [11] used network flows to identify normal and botnet activity. The adversary is assumed to have already
compromised at least one machine in the network and deployed a bot to communicate with other machines through limited "Command and Control infrastructure." The attacker intends to trick the classifier by slightly increasing flow duration and exchanged bytes and packets. Instead of a base adversary that changed feature attributes in adversarial samples, Cheng [29] used a generative adversarial network (GAN) where a generator aims to refine the generation of fake data while a discriminator determines which network traffic flows are legitimate or anomalous, as Usama et al. did as well [164]. Similar to Apruzzese's adversarial sample generation, but in a more controlled environment, Aiken and Scott-Hayward [4] developed an adversarial testing tool called Hydra that behaves as an emulator for a system that launches attacks in a software-defined network, where a test manager sends traffic, evading the classifier by changing payload size and rate. In all such cases, adversarial sample generation not only offers more data to combat unlabeled data, but can develop defensive mechanisms for more robust NID systems.

Fig. 10. Paradigms of dynamic network data models. (a) Noorbehbahani et al. semi-supervised model [121]: a mixed-data self-organizing incremental network is trained and continuously updated with new data. The unsupervised learning takes place with both within-clustering and between-clustering. New inputs are fit into the incremental network during offline learning; when the model is online, the old cluster sets are updated and the old incremental network is used to classify new data. (b) Constantinides et al. self-organizing incremental neural network [32]: the detection system is initialized with a dataset containing k attack classes. Each attack class is modeled with two self-organizing incremental neural networks. The input vector per SVM is constructed from that SVM's positive n-SOINN and the negative n-SOINNs from the other classes. (c) Sethi et al. agent network [147]: reinforcement learning is employed by passing in an input feature vector that concatenates the predictions of k classifiers with the input feature vector as input for the agent network. The decision vector is composed with the classification vector (1 for attack, 0 for normal) in an XOR to form the agent network's output vector (action vector).
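The feature-perturbation style of attack described in this section, in the spirit of Apruzzese et al. [11] but not their code, can be sketched in a few lines: slightly inflating flow duration and byte counts pushes a malicious flow past a fixed decision boundary. The rule-style detector and the flow values below are hypothetical.

```python
# Adversarial sample sketch: perturb flow features so a naive detector
# misclassifies a botnet flow as normal.

def detector(flow):
    # hypothetical rule: short, low-volume flows are botnet-like
    return flow["duration"] < 10 and flow["bytes"] < 500  # True = flag as botnet

botnet_flow = {"duration": 2.0, "bytes": 300}
assert detector(botnet_flow)                      # caught before perturbation

adversarial = dict(botnet_flow)
adversarial["duration"] += 9.0                    # slightly pad the flow duration
adversarial["bytes"] += 250                       # exchange a few extra bytes

print(detector(adversarial))  # False: the perturbed flow evades detection
```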
Challenge.
Due to the changing landscape of new data being generated daily, adaptive models have become ever more important for dynamic data, especially as data has been growing exponentially for the past decade and the digital world now contains roughly 2.7 zettabytes [21]. Figure 10 summarizes the significant and novel dynamic network intrusion models developed recently.
Stream-based models.
Since dynamic data may come in the form of a stream, researchers have looked at specializing models for stream data. To resolve the issue of irrelevant data in dynamic streaming data, Thakran et al. [161] employed density- and partition-based clustering methods along with weighted attributes to handle noisy data in streams, which was used for outlier detection. For better real-time responsiveness from intrusion detection models, HewaNadungodage et al. [66] accelerated outlier detection with parallelized processing power from a graphics processing unit (GPU). Instead of improving upon the real-time speed at which outliers are detected, Noorbehbahani et al. [121] looked towards a more adaptive model that uses incremental learning, which still performs well with limited labels in streaming data. They implemented a mixed self-organizing map incremental neural network (MSOINN) and "within and between" clustering for offline and online learning. An initial cluster set from the network training data and an initial classification model are generated during the offline phase. Clusters are updated with the MSOINN clustering algorithm and new observations are classified with the current MSOINN model during the online phase of learning.
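As a much simpler stand-in for the streaming setting (not the MSOINN model above), the sketch below maintains a running mean and variance with Welford's incremental update and flags observations far from the current estimate, so the baseline adapts as the stream evolves. The traffic values are hypothetical.

```python
# Streaming anomaly-detection sketch with incremental statistics.

class StreamingDetector:
    def __init__(self, z=3.0):
        self.n, self.mean, self.m2, self.z = 0, 0.0, 0.0, z

    def update(self, x):
        # Welford's online update for mean and sum of squared deviations
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def is_anomaly(self, x):
        if self.n < 10:               # warm-up period before flagging
            return False
        std = (self.m2 / (self.n - 1)) ** 0.5
        return abs(x - self.mean) > self.z * std

det = StreamingDetector()
stream = [100 + (i % 5) for i in range(50)] + [900]   # steady traffic, then a spike
flags = []
for x in stream:
    flags.append(det.is_anomaly(x))   # score against the baseline so far
    det.update(x)                     # then fold the observation in
print(flags[-1])  # True: the spike is flagged against the learned baseline
```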
Reinforcement learning.
Reinforcement learning is a type of machine learning that learns a mapping, or policy, between the states of a system and the actions it can execute, given a notion of reward and punishment [107]. Through an adaptive approach, Bensefia and Ghoualmi [17] proposed integrating an adaptive artificial neural network with a learning classifier system that uses a reactive learning base to learn new attack patterns. There has recently been research on cloud environments and on applying reinforcement learning to changing data in the cloud by Sethi et al. [147], who applied reinforcement learning in a setting where a host network communicates with an agent network through VPN. Log generation from the virtual machine was provided to an agent that applied a deep Q-network and compared the model's result with the actual result from the administrator network, calculating the reward (a metric of how well the model did) and iterating until the reward was maximized.
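The reward-feedback loop can be illustrated with a toy tabular Q-learner. Sethi et al. use a deep Q-network on cloud logs; the sketch below only shows the idea of comparing the agent's action with ground truth to compute a reward. The traffic "states" and labels are hypothetical.

```python
import random

# Toy Q-learning sketch: the action is a label (0 = normal, 1 = attack)
# for a traffic state; reward is +1 when it matches the administrator's
# ground truth and -1 otherwise, and the table is updated toward reward.
random.seed(1)
truth = {"low_entropy": 0, "port_scan": 1, "syn_flood": 1}  # hypothetical states
q = {(s, a): 0.0 for s in truth for a in (0, 1)}
alpha, epsilon = 0.5, 0.2

for _ in range(500):
    s = random.choice(list(truth))
    a = random.choice((0, 1)) if random.random() < epsilon else max((0, 1), key=lambda act: q[(s, act)])
    reward = 1.0 if a == truth[s] else -1.0
    q[(s, a)] += alpha * (reward - q[(s, a)])      # one-step (bandit-style) update

policy = {s: max((0, 1), key=lambda act: q[(s, act)]) for s in truth}
print(policy)
```

After a few hundred iterations the greedy policy matches the ground-truth labeling, mirroring the iterate-until-reward-is-maximized loop described above.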
Incremental learning.
With data in dynamic environments, it is necessary that pretrained models be updated with new data in an incremental fashion without compromising classification performance on preceding data [131]. Addressing botnet intrusion attacks, Feilong Chen et al. [25] argued that botnet detection starts with the set of server IP addresses visited by past client machines, so an incremental least-squares support vector machine was implemented to be adaptive to feature and data evolution. Five years later, Meng-Hui Chen et al. [27] made a population-based incremental learning method that learned from evolved data through past experiences and applied collaborative filtering to automate classification, adapting to key features in the data. A shift towards more scalable applications came with the online incremental neural network, accompanied by a support vector machine, that Constantinides et al. [32] proposed.
Challenge.
The concomitant challenge with a growing network intrusion data repository is the continued lack of data on more current, diverse network attack types. As seen in Section 3, datasets have been riddled with a lack of evenly represented attack classes. Some datasets may be dominated by a specific attack, but other attack types can also be underrepresented, or all
attack types are minority classes. To resolve the issue of small amounts of data, specifically a lack of attack types, meta- and transfer-learning techniques have been explored. Novel machine learning models implementing the two techniques are highlighted in Figure 11.

Fig. 11. Novel small data transfer and meta learning. (a) Adaboost and error-correcting output code (ECOC) [1]: Abdelrahman et al. distribute a selected group of features among k Adaboost classifiers, and each class is encoded in a binary string of length k. Each classifier is applied to every data observation to obtain a new binary string, which is labeled with the traffic class closest to it (lowest Hamming distance, i.e., fewest differing bits). (b) Transfer learning with an LSTM network [160]: Tariq et al. remedied the small amount of time-series data for CAN bus network intrusions by collecting data on 2 CAN buses, dispersed across multiple convolutional LSTM 2D networks. The timesteps in the time series are transformed into a two-dimensional multivariate time series that the convolutional LSTMs were trained on. The outputs on the 2D data form a three-dimensional output, which is passed through a fully connected layer to produce final predictions of normal or anomalous.
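The ECOC decoding step in panel (a) of Figure 11 can be sketched in a few lines: each class has a length-k codeword, the k binary classifiers emit one bit each, and the predicted class is the codeword at minimum Hamming distance from the emitted string. The codewords and class names below are hypothetical, not Abdelrahman and Abraham's actual codes.

```python
# ECOC decoding sketch: pick the class whose codeword is nearest in
# Hamming distance to the k classifier outputs.

codewords = {            # hypothetical codes for k = 4 classifiers
    "normal": (1, 1, 1, 1),
    "dos":    (1, -1, 1, -1),
    "probe":  (-1, -1, 1, 1),
}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def decode(outputs):
    return min(codewords, key=lambda c: hamming(codewords[c], outputs))

print(decode((1, -1, -1, -1)))   # "dos": one bit flipped from dos's codeword
```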
Meta learning.
Meta-learning uses automated learning to improve the way in which a model learns from data. Typically, data is split into learning and prediction sets: the support set is in the learning set, and training and testing sets are in the prediction set. In "few-shot" learning, the aim is to reduce prediction error on unlabeled data given only a meager support set. Panda et al. [127] conducted learning with multiple classifiers, where ensembles of balanced nested dichotomies for multi-class problems were employed to handle multi-class datasets and make intelligent decisions in identifying network intrusions. A similar ensemble-based method using bagging and Adaboost was proposed by Abdelrahman and Abraham [1]. They implemented the meta-learning technique of error-correcting output code (ECOC), where, per attack class, a binary string of length k is made so that each bit is a classifier output, and the class with the closest string of outputs is returned and used for classification. As a direct response to the limited number of malicious samples in network data, Xu et al. [173] devised a few-shot meta-learning method that used a deep neural network and a feature extraction network. The few-shot detection begins with a comparison between feature sets extracted from two data flows and a delta score indicating how different the two input data flows are. During the meta-training phase, samples from query and sample sets are compared and average delta scores are calculated. During meta-testing, samples from the test set and support set are compared, and the predicted label for a sample is the one with the minimum average delta score in the support set.

Transfer learning.
Just as with the lack of labeled data, transferring knowledge from other data sources through transfer learning can resolve issues of a lack of data, specifically on attack
types. Because generating labels for data can be time-consuming, Zhao et al. [189] employed a heterogeneous feature-based transfer learning method to detect network anomalies, which was compared to other feature-based approaches such as HeMap and Correlation Alignment (CORAL). Rather than feature-based methods, mimic learning has been applied as a means of transfer learning by retraining a parent model, pretrained on private data, on public data to protect privately collected data and improve accuracy in the final model. Shafee et al. [148] transferred the knowledge from a privately trained model (a random forest that performed best during experimentation of the teacher model) to a public training setting, producing a shareable student model. More niche, Controller Area Networks (CANs) in vehicles were revealed to be easily exploited, and there was a lack of intrusion data on CANs. Thereby, Tariq et al. [160] recently collected CAN traffic data using two CAN buses and applied transfer learning to train a convolutional long short-term memory network on the new intrusion data.

Fig. 12. Novel methods to handle big volume of data. (a) Chen et al. DDoS multi-channel convolutional neural network (MC-CNN) incremental learning model [26]: network nodes and ports are monitored and the traffic is input into a database. Features are split into traffic, packet, and host levels, and the real-time data from the network along with the partitioned features are passed into a multi-channel CNN, where the top branch accepts traffic features and the lower branch takes in packet features. The top layer undergoes pooling to reduce parameter complexity in the fully connected layer; the branches are combined, passed to the fully connected layers, and a final prediction is output. (b) Lin and Hsieh CUDA hierarchical parallelism [97]: the Parallel Failure-less Aho-Corasick (PFAC) algorithm uses separate threads, each making one pass over the full input string, which can run in parallel. The PFAC state machine matches signature rules starting at the character in each position of the string. The input data buffers run on multiple threads, each performing a host-to-device transfer, pattern matching against signature-based network intrusion rules, and a device-to-host transfer that produces model results; post-processing takes place per dispatcher, where traffic packets are matched and alerted for the user. (c) Morfino and Rampone Apache Spark with decision tree [113]: Morfino et al. used Apache Spark's machine learning library, MLlib, which stores filter/map operations in a directed acyclic graph and uses "Catalyst" to optimize an efficient execution plan. The decision tree splits the distribution by features (e.g., flow duration, flow byte rate, maximum time between flows) until the splits divide observations into more homogeneous groups of normal and anomalous traffic.
Challenge.
For big data, processing such large amounts of data is overwhelming, so optimization methods have been devised to speed up preprocessing, such as reduction methods that remove redundant features and reduce the size of the data. Figure 12 depicts the paradigms from three pivotal methods handling large amounts of data using incremental learning, parallel processes, and Apache Spark for cloud computing.
Incremental learning.
To handle such large amounts of data, incremental learning may be applied to process it in increments. Chen et al. [26] implemented an incremental training method that repeatedly trained one convolutional layer and then added another layer to a convolutional neural network (CNN) as new data came in, until the target structure was achieved for the final CNN, optimizing training time.
Parallelism.
Parallel processing may be used to speed up the convergence time of model training on large amounts of data. In the early 2010s, Vasiliadis et al. [166] implemented a multi-parallel intrusion detection method housed in Nvidia's CUDA framework to inspect prodigious amounts of data in high-speed networks using three levels of units: multi-queue NICs, multiple CPUs, and multiple GPUs. A single CPU process follows an iterative sequence of acquiring, copying, and pattern matching data to a buffer in the GPU, then copying back to the CPU to carry out detection using plugins such as PCRE or packet header inspection. Looking beyond the specifics of hardware improvements with parallel computing on big data and towards the advent of cloud computing, Bandre and Nandimath [14] sought to handle the increase in data in distributed systems, particularly in Hadoop, by using a general-purpose graphics processing unit to hasten intrusion detection. Using a similar heuristic to Vasiliadis et al.'s, Lin and Hsieh [97] sped up intrusion detection on big data with hierarchical parallelism on three levels: parallelism across multiple GPUs, within a single GPU, and within the Aho-Corasick algorithm, a string-searching algorithm for matching traffic packets. All three approaches apply parallelism through CUDA and pattern match large amounts of traffic packets against a signature database; thus, parallelism in big data is a heavily data-driven, signature-based research area.
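The PFAC idea, one "thread" per starting position, each trying to match a signature beginning exactly there and failing out otherwise, can be sketched sequentially. The real implementation runs on CUDA with a state machine; the toy signature list below is hypothetical, and the loop over positions stands in for the independent threads.

```python
# Sequential sketch of PFAC-style multi-pattern matching: positions are
# independent, so each loop iteration could be a separate GPU thread.

signatures = ["she", "hers", "his"]          # toy signature database

def match_at(text, i):
    # work done by the "thread" assigned to starting position i
    return [s for s in signatures if text.startswith(s, i)]

def pfac_scan(text):
    hits = []
    for i in range(len(text)):               # each iteration = one thread
        for s in match_at(text, i):
            hits.append((i, s))
    return hits

print(pfac_scan("ushers"))   # [(1, 'she'), (2, 'hers')]
```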
Cloud computing.
With the advent of cloud computing platforms such as Amazon Web Services (AWS) and the Apache Software Foundation's ecosystem, using virtual services is not only available but fast. Current research interest appears to lie in implementing machine learning with the Apache services. Manzoor and Morgan [105] used Apache Storm to accelerate intrusion detection and employ a real-time support vector machine-based intrusion detection system; Faker and Dogdu [42] used Apache Spark to implement deep feed-forward neural network, random forest, and gradient boosting tree methods; most recently, Morfino and Rampone [113] used the MLlib library of Apache Spark to reduce training time for their highest-performing model, a decision tree, so they could fit the model to over 2 million rows of data and tackle SYN-DOS attacks.
Fig. 13. Yearly Articles By Topic.
Figure 13 displays the trends of research interests from 2010 to 2020 on data-driven NID methods.
Upon examining literature from the past decade on NID, there was already a pre-existing interest in big data research in 2010. This interest can be attributed to the large amounts of data on the Internet since 2010, as mentioned in Section 2, which continued growing through the past decade. 2019 saw the largest number of articles on big data, where researchers continued to study parallel processing techniques and incremental learning methods to handle processing large amounts of data.

In 2010, there was also effort put into resolving the challenge of small data. Although there were large amounts of data overall, data on different attack types was lacking, as exhibited in the dataset attack type breakdown and entropy analysis in Section 3. In general, the lack of network intrusion attack types, pertinent to the challenge of small data, comes from the typically short time frame in which intrusions take place. Small data issues were first researched in the early 2010s, particularly with meta-learning.

As for noisy data challenges, authors have done more extensive research since 2017 into methods that weigh some observations over others in network intrusion datasets. Although there haven't been many papers on handling noisy data, solutions to noisy data are well established, such as rescaling features or using density-based feature selection.

The majority of the research between 2010 and 2015 studied ways to work around big data and too little data. From 2015 to 2020, big data processing remained a popular research topic in the field of network intrusion detection. However, as expected, due to the changing environment of
the present-day databases, research addressing dynamic data issues has gone up. The lack of labeled data has seen more research proposing semi-supervised learning models since 2019. However, one area of research that hasn't seen much attention in the past decade is real-world network data. 2010 and 2011 saw some honeypot emulation of networks for data collection, and it was only recently, in 2020, that the LITNET dataset [125] was made and released as one of the first real-world network intrusion datasets.

Based on the development of network intrusion research over time, the realm of solutions to inconveniently large amounts of network data has been belabored over the past decade with the surge of the digital world, supported by Figure 13. Although weakly correlated data has recently been explored, it doesn't appear to be an issue as pressing as the lack of real-world network data. Since the start of the 2010s, the lack of real-world data on network intrusion attacks has been addressed, but with minimal amounts of research directed towards the challenge. There was initially a step towards real-world network intrusion data by emulating a realistic network environment with honeypots that would attract attackers, or by synthetic (IXIA) data generation.

Fig. 14. Related phrases (technologies) with "Big Data" and "Lack Real-World Data": (a) phrase distances from "Big Data"; (b) phrase distances from "Lack Real-World Data". Phrases include "New Hypervisor-based Cloud", "Real-world Academic Network", "Generate Labeled Datasets", "state-of-the-art Hacking", "Deploying Realistic Attacks", and "Up-to-date Network Flow".
However, simulated data may not be as valuable to fit and test a model on as data collected on a real-world network, due to possibly incorrect network attack models and behaviors in sandbox network environments. An issue with current research on applying models to network intrusion is that 46 of the papers in the taxonomy used the KDD Cup 1999 as an evaluation dataset for their models. Because it's synthetically generated, there's bias in the traffic patterns that
real-world traffic wouldn't have. A step towards more modern network attacks on a real-world network came with the LITNET dataset [125], collected in 2020 on a Lithuanian network covering nodes in four major Lithuanian cities, one of the first long-term (10 months) real-world network intrusion datasets produced and made available for researchers. Realism and availability are the two significant qualities that current network intrusion datasets should strive for, which will be a future goal for researchers interested in creating new real-world datasets. Figure 14 reinforces how the challenge of the lack of real-world data is highly associated with data collection in the cloud, with "New Hypervisor-based Cloud" methods and "Real-world Academic Network" having close and positive distances to the phrase "Lack of Real-World Data", but not as much with "state-of-the-art hacking methods", the "deployment of realistic attacks", or "up-to-date network flow data." Currency and realism in normal network traffic and attacks are problems confirmed through the word vector analyses in that figure. In turn, network intrusion research requires further data collection of realistic attacks in real-world networks.
Although traffic flows may be labelled manually by network security experts, real-world network traffic flow can easily grow into the millions. The UGR dataset from 2016 [48] was labeled using log files from the honeypot system used for data collection. Often experts may be the ones responsible for labeling traffic data, while other datasets, such as LITNET in 2020 [125], are less clear on how labeling took place. Labelling training data has been a roadblock for anomaly-based intrusion detection since the late 2000s [33]. Labeling traffic too scrupulously may go against privacy policies, so detection models tend to be updated whenever data becomes labeled, and manual labeling still occurs with offline learning [170]. To handle newly labeled data being fed into intrusion detection models, there should be further development of adaptive or incremental models, such as the online incremental neural network with SVM by Constantinides et al. [32]. Future research in labeling network data lies in devising more adaptive detection models for data annotation and developing paradigms and techniques for better, more efficient traffic data labeling. The phrase vector distances illustrated in Figure 14 depict that "Generate Labeled Datasets" is not strongly associated with "Lack of Real-World Data".
Specific to collecting data in real-world networks, consumer networks such as those at home, which are not armed with the same security resources as enterprise networks, lack datasets. Recently, Patel et al. [129] handled the natural entropy in detecting anomalies in a home network by collecting basic traffic features, such as packet size and source and destination ports, and analyzing feature entropy. Further data collection in consumer networks has yet to be seen but is a viable route for future research.
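The entropy analysis mentioned above can be sketched as a normalized Shannon entropy over a traffic feature. This is an illustration of the general technique, not Patel et al.'s exact pipeline, and the port lists are hypothetical: dividing by log2 of the number of distinct values keeps the score in [0, 1], so near-uniform features score near 1 and skewed features score near 0.

```python
import math
from collections import Counter

# Normalized Shannon entropy of a categorical traffic feature.
def normalized_entropy(values):
    counts = Counter(values)
    n = len(values)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    k = len(counts)
    return h / math.log2(k) if k > 1 else 0.0

uniform_ports = [80, 443, 22, 8080]          # every destination port distinct
skewed_ports = [80] * 97 + [22, 443, 8080]   # dominated by one port

print(normalized_entropy(uniform_ports))     # 1.0
print(round(normalized_entropy(skewed_ports), 2))  # 0.12
```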
Cloud computing platforms are typically associated with big data analytics, holding the resources to perform fast operations and processing on data. Aside from speeding up model convergence or reducing anomaly detection time with cloud computing, network intrusion in cloud environments has yet to be exhaustively researched. A hypervisor-based cloud network intrusion detection system based on statistical analytics was devised by Aldribi et al. [7], but more sophisticated attack methods have yet to be implemented, as Aldribi and others have noted the overtly regular pattern in the traffic data that was collected. Another trait of cloud environments is constantly changing data. Because a tremendous amount of data is stored on the cloud, developing machine learning for dynamic data in the cloud should be a future step in research. In 2020, Sethi et al. [147] applied a deep Q-learning reinforcement model to the cloud that is adaptable to changing data. Although there's been some work towards incorporating machine learning on dynamic data in the cloud, this research is still nascent and has potential to be studied further in the future
for network intrusion. Word vector distances in Figure 14 affirm that cloud-based applications and services are most closely associated with big data, although "Cloud Edge Computing" is less associated with big data in network intrusion detection systems. Applying edge computing to cloud computing environments housing large amounts of network data is a potential route of research in the upcoming years, speeding up detection time by bringing data storage and computation closer to the location where it is needed [64].
Machine learning models have been applied to nearly every challenge observed in the taxonomy constructed within this paper, except for applying parallel computing to big data, where a large amount of research pertains to parallelizing signature-based intrusion detection systems. Among the eight main challenges to technical models detailed in the taxonomy, big and dynamic data appear to be the main types that should be handled. Although big data can be combated using edge computing, which brings data storage closer to its intended location and speeds up processing time, parallelism in big data machine learning models could help researchers improve anomaly-based intrusion detection methods, as an emphasis is currently placed instead on signature-based techniques using CUDA. NID data and traffic are rapidly changing, and a natural approach to handling dynamic data is processing data in increments using incremental learning. Recently, Constantinides et al. [32] focused on scalability with incremental machine learning models. To keep the growth of their incremental self-organizing neural network commensurate with the growth of new data, a parameter n is used so that any node that is nearest in Euclidean distance to more than n input vectors (more than n "wins") passes a "win" to the node with more than n "wins". The aging parameter in the network also removes nodes that aren't updated, to maintain a manageable size. With the dearth of scalability research, researchers should continue in the future to study methods that enable incremental machine learning models to be more scalable in light of tremendous data growth.

Network intrusion detection has existed for a little over two decades, since network resources first began to be misused.
Although most data-driven network intrusion systems are signature-based, and anomaly-based intrusion detection has not been integrated at large scale due to high false positive rates, researchers continue to improve anomaly detection accuracy and performance in the literature because of its ability to detect novel network attacks. This paper introduces a general taxonomy of data-driven network intrusion detection methods based on a challenge-method heuristic and examines common public datasets used by papers in the taxonomy, performing entropy analysis and attack-type breakdowns on them to measure the imbalance between network traffic classes. Our focus is on the research trends gathered from the taxonomy-structured survey of network intrusion detection methods over the past decade. We conclude that, given the research trends over time, the areas requiring future research are big network data, streaming and changing data, and real-world network data collection and availability. Many solutions have been implemented for the other challenges specified in the taxonomy, but there remains a dearth of real-world network data, especially data on consumer networks. This survey provides a high-level overview of the background on network intrusion detection, common datasets, a taxonomy of important research areas, and future directions.
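The class-imbalance measurement referred to above is Shannon entropy computed over the label distribution of a dataset; the label counts below are hypothetical, chosen only to show the contrast between a balanced and an imbalanced traffic mix:

```python
import math
from collections import Counter

def class_entropy(labels) -> float:
    """Shannon entropy (in bits) of a label distribution.
    Lower values indicate stronger class imbalance (0 = one class only)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical traffic labels: in the imbalanced case, normal traffic
# dominates the attack classes, as in most NID datasets.
balanced   = ["normal"] * 50 + ["dos"] * 50
imbalanced = ["normal"] * 98 + ["dos"] * 1 + ["probe"] * 1

print(class_entropy(balanced))    # 1.0 bit — perfectly balanced two-class set
print(class_entropy(imbalanced))  # ≈ 0.16 bits — heavily imbalanced
```

Entropy near its maximum (log2 of the number of classes) signals balance, while values near zero flag the minority-attack-class problem discussed throughout this survey.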
REFERENCES
[1] Shaza Merghani Abdelrahman and Ajith Abraham. 2014. Intrusion detection using error correcting output code based ensemble. In . IEEE, 181–186.
[2] R. Abdulhammed, M. Faezipour, A. Abuzneid, and A. AbuMallouh. 2019. Deep and Machine Learning Approaches for Anomaly-Based Intrusion Detection of Imbalanced Network Traffic.
IEEE Sensors Letters
3, 1 (2019), 1–4.
[3] Mohiuddin Ahmed, Abdun Naser Mahmood, and Jiankun Hu. 2016. A survey of network anomaly detection techniques.
Journal of Network and Computer Applications
60 (2016), 19 – 31. https://doi.org/10.1016/j.jnca.2015.11.016[4] J. Aiken and S. Scott-Hayward. 2019. Investigating Adversarial Attacks against Network Intrusion Detection Systemsin SDNs. In . 1–7.[5] O. Y. Al-Jarrah, A. Siddiqui, M. Elsalamouny, P. D. Yoo, S. Muhaidat, and K. Kim. 2014. Machine-Learning-BasedFeature Selection Techniques for Large-Scale Network Intrusion Detection. In . 177–181.[6] M. Al-Qatf, Y. Lasheng, M. Al-Habib, and K. Al-Sabahi. 2018. Deep Learning Approach Combining Sparse AutoencoderWith SVM for Network Intrusion Detection.
IEEE Access
Computers & Security
88 (2020), 101646. https://doi.org/10.1016/j.cose.2019.101646[8] H. S. Alsaadi, R. Hedjam, A. Touzene, and A. Abdessalem. 2020. Fast Binary Network Intrusion Detection based onMatched Filter Optimization. In .195–199.[9] A. An, C. Chan, N. Shan, N. Cercone, and W. Ziarko. 1997. Applying knowledge discovery to predict water-supplyconsumption.
IEEE Expert
12, 4 (1997), 72–78.[10] Shahid Anwar, Jasni Mohamad Zain, Mohamad Fadli Zolkipli, Zakira Inayat, Suleman Khan, Bokolo Anthony, andVictor Chang. 2017. From intrusion detection to an intrusion response system: fundamentals, requirements, and futuredirections.
Algorithms
10, 2 (2017), 39.[11] G. Apruzzese and M. Colajanni. 2018. Evading Botnet Detectors Based on Flows and Random Forest with AdversarialSamples. In . 1–8.[12] M. Azizjon, A. Jumabek, and W. Kim. 2020. 1D CNN based network intrusion detection with normalization onimbalanced data. In .218–224.[13] Seyed Mojtaba Hosseini Bamakan, Huadong Wang, Tian Yingjie, and Yong Shi. 2016. An effective intrusion detectionframework based on MCLP/SVM optimized by time-varying chaos particle swarm optimization.
Neurocomputing . 1–6.[15] Punam Bedi, Neha Gupta, and Vinita Jindal. 2020. Siam-IDS: Handling class imbalance problem in Intrusion DetectionSystems using Siamese Neural Network.
Procedia Computer Science
171 (2020), 780–789.[16] Mustapha Belouch, Salah El Hadaj, and Mohamed Idhammad. 2018. Performance evaluation of intrusion detectionbased on machine learning using Apache Spark.
Procedia Computer Science
127 (2018), 1–6.[17] Hassina Bensefia and Nacira Ghoualmi. 2011. A new approach for adaptive intrusion detection. In . IEEE, 983–987.[18] Vinayak Borkar, Michael J Carey, and Chen Li. 2012. Inside" Big Data management" ogres, onions, or parfaits?. In
Proceedings of the 15th international conference on extending database technology . 3–14.[19] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a"siamese" time delay neural network. In
Advances in neural information processing systems . 737–744.[20] A. L. Buczak and E. Guven. 2016. A Survey of Data Mining and Machine Learning Methods for Cyber Security IntrusionDetection.
IEEE Communications Surveys Tutorials
18, 2 (2016), 1153–1176.[21] Mohamad Bydon, Clemens M Schirmer, Eric K Oermann, Ryan S Kitagawa, Nader Pouratian, Jason Davies, AshwiniSharan, and Lola B Chambless. 2020. Big Data defined: a practical review for neurosurgeons.
World Neurosurgery
Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and TelecommunicationSystems (Cat. No.PR00728) . IEEE, 466–473.[23] Pedro Casas, Johan Mazel, and Philippe Owezarski. 2012. Unsupervised Network Intrusion Detection Systems:Detecting the Unknown without Knowledge.
Computer Communications
35, 7 (2012), 772 – 783. https://doi.org/10.1016/j.comcom.2012.01.016[24] Chia-Mei Chen, Ya-Lin Chen, and Hsiao-Chung Lin. 2010. An efficient network intrusion detection.
ComputerCommunications
33, 4 (2010), 477 – 484. https://doi.org/10.1016/j.comcom.2009.10.010[25] Feilong Chen, Supranamaya Ranjan, and Pang-Ning Tan. 2011. Detecting bots via incremental LS-SVM learning withdynamic feature adaptation. In
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining . 386–394.
[26] Jinyin Chen, Yi-tao Yang, Ke-ke Hu, Hai-bin Zheng, and Zhen Wang. 2019. DAD-MCNN: DDoS Attack Detection via Multi-Channel CNN. In
Proceedings of the 2019 11th International Conference on Machine Learning and Computing (ICMLC’19) . Association for Computing Machinery, New York, NY, USA, 484–488. https://doi.org/10.1145/3318299.3318329[27] Meng-Hui Chen, Pei-Chann Chang, and Jheng-Long Wu. 2016. A population-based incremental learning approachwith artificial immune system for network intrusion detection.
Engineering Applications of Artificial Intelligence . 416–420.[29] A. Cheng. 2019. PAC-GAN: Packet Generation of Network Traffic using Generative Adversarial Networks. In . 0728–0734.[30] Zouhair Chiba, Noureddine Abghour, Khalid Moussaid, Amina [El Omri], and Mohamed Rida. 2018. A novel architecturecombined with optimal parameters for back propagation neural networks applied to anomaly network intrusiondetection.
Computers & Security
75 (2018), 36 – 58. https://doi.org/10.1016/j.cose.2018.01.023[31] Yuk Ying Chung and Noorhaniza Wahid. 2012. A hybrid network intrusion detection system using simplified swarmoptimization (SSO).
Applied Soft Computing
12, 9 (2012), 3014 – 3022. https://doi.org/10.1016/j.asoc.2012.04.020[32] Christos Constantinides, Stavros Shiaeles, Bogdan Ghita, and Nicholas Kolokotronis. 2019. A novel online incrementallearning intrusion prevention system. In . IEEE, 1–6.[33] Gabriela F Cretu, Angelos Stavrou, Michael E Locasto, Salvatore J Stolfo, and Angelos D Keromytis. 2008. Casting outdemons: Sanitizing training data for anomaly sensors. In . IEEE,81–95.[34] Robertas Damasevicius, Algimantas Venckauskas, Sarunas Grigaliunas, Jevgenijus Toldinas, Nerijus Morkevicius,Tautvydas Aleliunas, and Paulius Smuikys. 2020. LITNET-2020: An Annotated Real-World Network Flow Dataset forNetwork Intrusion Detection.
Electronics
9, 5 (May 2020), 800. https://doi.org/10.3390/electronics9050800[35] Jonathan J. Davis and Andrew J. Clark. 2011. Data preprocessing for anomaly based network intrusion detection: Areview.
Computers & Security . IEEE, 1–8.[38] Paul Dokas, Levent Ertoz, Vipin Kumar, Aleksandar Lazarevic, Jaideep Srivastava, and Pang-Ning Tan. 2002. Datamining for network intrusion detection. In
Proc. NSF Workshop on Next Generation Data Mining . 21–30.[39] L. Dongxia and Z. Yongbo. 2012. An Intrusion Detection System Based on Honeypot Technology. In , Vol. 1. 451–454.[40] Adel Sabry Eesa, Zeynep Orman, and Adnan Mohsin Abdulazeez Brifcani. 2015. A novel feature-selection approachbased on the cuttlefish optimization algorithm for intrusion detection systems.
Expert Systems with Applications
42, 5(2015), 2670 – 2679. https://doi.org/10.1016/j.eswa.2014.11.009[41] Wisam Elmasry, Akhan Akbulut, and Abdul Halim Zaim. 2020. Evolving deep learning architectures for networkintrusion detection using a double PSO metaheuristic.
Computer Networks
168 (2020), 107042. https://doi.org/10.1016/j.comnet.2019.107042[42] Osama Faker and Erdogan Dogdu. 2019. Intrusion Detection Using Big Data and Deep Learning Techniques. In
Proceedings of the 2019 ACM Southeast Conference (ACM SE ’19) . Association for Computing Machinery, New York, NY,USA, 86–93. https://doi.org/10.1145/3299815.3314439[43] W. Fan, Z. Du, D. Fernández, and V. A. Villagrá. 2018. Enabling an Anatomic View to Investigate Honeypot Systems: ASurvey.
IEEE Systems Journal
12, 4 (2018), 3906–3919.[44] Nabila Farnaaz and MA Jabbar. 2016. Random forest modeling for network intrusion detection system.
ProcediaComputer Science
89, 1 (2016), 213–217.[45] Wenying Feng, Qinglei Zhang, Gongzhu Hu, and Jimmy Xiangji Huang. 2014. Mining network data for intrusiondetection through combining SVMs with ant colony networks.
Future Generation Computer Systems
37 (2014), 127–140.[46] Feng Xie, Hongyu Yang, Yong Peng, and Haihui Gao. 2012. Data fusion detection model based on SVM and evidencetheory. In . 814–818.[47] Gilberto Fernandes, Joel J. P. C. Rodrigues, Luiz Fernando Carvalho, Jalal F. Al-Muhtadi, and Mario Lemes Proença.2019. A comprehensive survey on network anomaly detection.
Telecommunication Systems
70, 3 (2019), 447–489. https://doi.org/10.1007/s11235-018-0475-8
[48] Gabriel Macia Fernandez, Jose Camacho, Roberto Magan-Carri, Pedro Garcia-Teodoro, and Roberto Theron. 2016 (accessed June 20, 2020). UGR’16: A New Dataset for the Evaluation of Cyclostationarity-Based Network IDSs. https://nesg.ugr.es/nesg-ugr16/.
[49] D. Ficara, G. Antichi, A. Di Pietro, S. Giordano, G. Procissi, and F. Vitucci. 2010. Sampling Techniques to Accelerate Pattern Matching in Network Intrusion Detection Systems. In . 1–5.
[50] Romain Fontugne, Pierre Borgnat, Patrice Abry, and Kensuke Fukuda. 2010. MAWILab: Combining Diverse Anomaly Detectors for Automated Anomaly Labeling and Performance Benchmarking. In
ACM CoNEXT ’10 . 318–323.[55] Sannasi Ganapathy, Kanagasabai Kulothungan, Sannasy Muthurajkumar, Muthusamy Vijayalakshmi, PalanichamyYogesh, and Arputharaj Kannan. 2013. Intelligent feature selection and classification techniques for intrusion detectionin networks: a survey.
EURASIP Journal on Wireless Communications and Networking
IEEE Access
IEEE Access computers & security
28, 1-2 (2009), 18–28.[59] Y. Gu, K. Li, Z. Guo, and Y. Wang. 2019. Semi-Supervised K-Means DDoS Detection Method Using Hybrid FeatureSelection Algorithm.
IEEE Access . 1441–1446.[61] Govind P. Gupta and Manish Kulariya. 2016. A Framework for Fast and Efficient Cyber Security Network IntrusionDetection Using Apache Spark.
Procedia Computer Science
93 (2016), 824 – 831. https://doi.org/10.1016/j.procs.2016.07.238 Proceedings of the 6th International Conference on Advances in Computing and Communications.[62] W. Haider, J. Hu, J. Slay, B.P. Turnbull, and Y. Xie. 2017. Generating realistic intrusion detection system datasetbased on fuzzy qualitative modeling.
Journal of Network and Computer Applications
87 (2017), 185 – 192. https://doi.org/10.1016/j.jnca.2017.03.018[63] Bahram Hajimirzaei and Nima Jafari Navimipour. 2019. Intrusion detection for cloud computing using neural networksand artificial bee colony optimization algorithm.
ICT Express
5, 1 (2019), 56–59.[64] Eric Hamilton. 2019. What is Edge Computing: The Network Edge Explained.
Cloudwards. Retrieved
IEEE Access . IEEE, 1133–1142.[67] G. E. Hinton and R. R. Salakhutdinov. 2006. Reducing the Dimensionality of Data withNeural Networks.
Science
Expert Systemswith Applications
38, 1 (2011), 306 – 313. https://doi.org/10.1016/j.eswa.2010.06.066[69] Y. Hsu, Z. He, Y. Tarutani, and M. Matsuoka. 2019. Toward an Online Network Intrusion Detection System Based onEnsemble Learning. In . 174–178.[70] Weiming Hu, Wei Hu, and Steve Maybank. 2008. Adaboost-based algorithm for network intrusion detection.
IEEETransactions on Systems, Man, and Cybernetics, Part B (Cybernetics)
38, 2 (2008), 577–583.[71] Shin-Ying Huang, Fang Yu, Rua-Huan Tsaih, and Yennun Huang. 2015. Network-traffic anomaly detection withincremental majority learning. In . IEEE, 1–8.[72] Che-Lun Hung, Chun-Yuan Lin, and Hsiao-Hsi Wang. 2014. An efficient parallel-network packet pattern-matchingapproach using GPUs.
Journal of systems architecture
60, 5 (2014), 431–439.
[73] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. 2019. Automated Machine Learning: Methods, Systems, Challenges.
Automated Machine Learning (2019).[74] M. Ichino and J. Sklansky. 1984. Optimum feature selection by zero-one integer programming.
IEEE Transactions onSystems, Man, and Cybernetics
SMC-14, 5 (1984), 737–746.[75] Zakira Inayat, Abdullah Gani, Nor Badrul Anuar, Muhammad Khurram Khan, and Shahid Anwar. 2016. Intrusionresponse systems: Foundations, design, and challenges.
Journal of Network and Computer Applications
62 (2016), 53–74.[76] Ahmad Javaid, Quamar Niyaz, Weiqing Sun, and Mansoor Alam. 2016. A Deep Learning Approach for NetworkIntrusion Detection System. In
Proceedings of the 9th EAI International Conference on Bio-Inspired Information and Com-munications Technologies (Formerly BIONETICS) (BICT’15) . ICST (Institute for Computer Sciences, Social-Informaticsand Telecommunications Engineering), Brussels, BEL, 21–26. https://doi.org/10.4108/eai.3-12-2015.2262516[77] H. J. Jeong, W. Hyun, J. Lim, and I. You. 2012. Anomaly Teletraffic Intrusion Detection Systems on Hadoop-BasedPlatforms: A Survey of Some Problems and Solutions. In . 766–770.[78] H. Jiang, Z. He, G. Ye, and H. Zhang. 2020. Network Intrusion Detection Based on PSO-Xgboost Model.
IEEE Access
IEEE Access
IEEE Access
Proceedings of the International Conference on Advances in Computing and Artificial Intelligence (ACAI’11) . Association for Computing Machinery, New York, NY, USA, 34–38. https://doi.org/10.1145/2007052.2007060[82] Nathan Keegan, Soo-Yeon Ji, Aastha Chaudhary, Claude Concolato, Byunggu Yu, and Dong Hyun Jeong. 2016. Asurvey of cloud-based network intrusion detection analysis.
Human-centric Computing and Information Sciences
Computers & Security
70 (2017), 255 – 277. https://doi.org/10.1016/j.cose.2017.06.005[84] F. A. Khan, A. Gumaei, A. Derhab, and A. Hussain. 2019. A Novel Two-Stage Deep Learning Model for EfficientNetwork Intrusion Detection.
IEEE Access
Electronics
9, 6 (Jun 2020), 916. https://doi.org/10.3390/electronics9060916[86] Levent Koc, Thomas A. Mazzuchi, and Shahram Sarkani. 2012. A network intrusion detection system based ona Hidden Naive Bayes multiclass classifier.
Expert Systems with Applications
39, 18 (2012), 13492 – 13500. https://doi.org/10.1016/j.eswa.2012.07.009[87] Eduardo [De la Hoz], Emiro [De La Hoz], Andrés Ortiz, Julio Ortega, and Beatriz Prieto. 2015. PCA filtering andprobabilistic SOM for network intrusion detection.
Neurocomputing
Proceedings of the 2003 SIAM international conference ondata mining . SIAM, 25–36.[91] T. A. Le, T. H. Chu, Q. U. Nguyen, and X. H. Nguyen. 2014. Malware detection using genetic programming. In the 2014Seventh IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA) . 1–6.[92] John Zhong Lei and Ali A. Ghorbani. 2012. Improved competitive learning neural networks for network intrusionand fraud detection.
Neurocomputing
75, 1 (2012), 135 – 145. https://doi.org/10.1016/j.neucom.2011.02.021 BrazilianSymposium on Neural Networks (SBRN 2010) International Conference on Hybrid Artificial Intelligence Systems(HAIS 2010).[93] Hongda Li, Hongxin Hu, Guofei Gu, Gail-Joon Ahn, and Fuqiang Zhang. 2018. VNIDS: Towards Elastic Security withSafe and Efficient Virtualization of Network Intrusion Detection Systems. In
Proceedings of the 2018 ACM SIGSACConference on Computer and Communications Security (CCS ’18) . Association for Computing Machinery, New York, NY,USA, 17–34. https://doi.org/10.1145/3243734.3243862[94] Peipei Li, Xindong Wu, Xuegang Hu, and Hao Wang. 2015. Learning concept-drifting data streams with randomensemble decision trees.
Neurocomputing
166 (2015), 68–83.
[95] Y. Li, Z. Li, and R. Wang. 2011. Intrusion Detection Algorithm Based on Semi-supervised Learning. In , Vol. 2. 153–156.
[96] Yanmiao Li, Yingying Xu, Zhi Liu, Haixia Hou, Yushuo Zheng, Yang Xin, Yuefeng Zhao, and Lizhen Cui. 2020. Robust detection for network intrusion of industrial IoT based on multi-CNN fusion.
Measurement
154 (2020), 107450.https://doi.org/10.1016/j.measurement.2019.107450[97] Cheng-Hung Lin and Cheng-Hung Hsieh. 2018. A novel hierarchical parallelism for accelerating NIDS using GPUs. In . IEEE, 578–581.[98] Jinping Liu, Jiezhou He, Wuxia Zhang, Tianyu Ma, Zhaohui Tang, Jean Paul Niyoyita, and Weihua Gui. 2019. ANID-SEoKELM: Adaptive network intrusion detection based on selective ensemble of kernel ELMs with random features.
Knowledge-Based Systems
177 (2019), 104 – 116. https://doi.org/10.1016/j.knosys.2019.04.008[99] Jinping Liu, Wuxia Zhang, Zhaohui Tang, Yongfang Xie, Tianyu Ma, Jingjing Zhang, Guoyong Zhang, and Jean PaulNiyoyita. 2020. Adaptive intrusion detection via GA-GOGMM-based pattern learning with fuzzy rough set-basedattribute selection.
Expert Systems with Applications
139 (2020), 112845. https://doi.org/10.1016/j.eswa.2019.112845[100] Wei Liu, LinLin Ci, and LiPing Liu. 2020. A New Method of Fuzzy Support Vector Machine Algorithm for IntrusionDetection.
Applied Sciences
10, 3 (Feb 2020), 1065. https://doi.org/10.3390/app10031065[101] Jie Lu, Vahid Behbood, Peng Hao, Hua Zuo, Shan Xue, and Guangquan Zhang. 2015. Transfer learning usingcomputational intelligence: A survey.
Knowledge-Based Systems
80 (2015), 14 – 23. https://doi.org/10.1016/j.knosys.2015.01.010 25th anniversary of Knowledge-Based Systems.[102] Ma Yue, Lian Hong, and X. F. Zhang. 2010. Researches on the IPv6 Network safeguard linked system. In , Vol. 7. 387–390.[103] Gabriel Maciá-Fernández, José Camacho, Roberto Magán-Carrión, Pedro García-Teodoro, and Roberto Therón. 2018.UGR’16: A new dataset for the evaluation of cyclostationarity-based network IDSs.
Computers & Security
73 (2018),411 – 424. https://doi.org/10.1016/j.cose.2017.11.004[104] Roberto Magán-Carrión, Daniel Urda, Ignacio Díaz-Cano, and Bernabé Dorronsoro. 2020. Towards a ReliableComparison and Evaluation of Network Intrusion Detection Systems Based on Machine Learning Approaches.
AppliedSciences
10, 5 (Mar 2020), 1775. https://doi.org/10.3390/app10051775[105] M. A. Manzoor and Y. Morgan. 2016. Real-time Support Vector Machine based Network Intrusion Detection systemusing Apache Storm. In . 1–5.[106] Nathan Martindale, Muhammad Ismail, and Douglas A Talbert. 2020. Ensemble-Based Online Machine LearningAlgorithms for Network Intrusion Detection Systems Using Streaming Data.
Information
11, 6 (2020), 315.[107] Maja Mataric. 1991. A comparative analysis of reinforcement learning methods. (1991).[108] Johan Mazel, Romain Fontugne, and Kensuke Fukuda. 2014. A taxonomy of anomalies in backbone network traffic.In . IEEE, 30–36.[109] Joseph W. Mikhail, John M. Fossaceca, and Ronald Iammartino. 2019. A Semi-Boosted Nested Model With Sensitivity-Based Weighted Binarization for Multi-Domain Network Intrusion Detection.
ACM Trans. Intell. Syst. Technol.
10, 3,Article 28 (April 2019), 27 pages. https://doi.org/10.1145/3313778[110] Robert Mitchell and Ing-Ray Chen. 2014. A survey of intrusion detection in wireless network applications.
ComputerCommunications
42 (2014), 1 – 23. https://doi.org/10.1016/j.comcom.2014.01.012[111] Chirag Modi, Dhiren Patel, Bhavesh Borisaniya, Hiren Patel, Avi Patel, and Muttukrishnan Rajarajan. 2013. Asurvey of intrusion detection techniques in Cloud.
Journal of Network and Computer Applications
36, 1 (2013), 42 – 57.https://doi.org/10.1016/j.jnca.2012.05.003[112] Sara Mohammadi, Hamid Mirvaziri, Mostafa Ghazizadeh-Ahsaee, and Hadis Karimipour. 2019. Cyber intrusiondetection by combined feature selection algorithm.
Journal of Information Security and Applications
44 (2019), 80 – 88.https://doi.org/10.1016/j.jisa.2018.11.007[113] Valerio Morfino and Salvatore Rampone. 2020. Towards Near-Real-Time Intrusion Detection for IoT Devices usingSupervised Learning and Apache Spark.
Electronics
9, 3 (Mar 2020), 444. https://doi.org/10.3390/electronics9030444[114] Nour Moustafa, Jiankun Hu, and Jill Slay. 2019. A holistic review of Network Anomaly Detection Systems: Acomprehensive survey.
Journal of Network and Computer Applications
128 (2019), 33 – 55. https://doi.org/10.1016/j.jnca.2018.12.006[115] N. Moustafa and J. Slay. 2015. The Significant Features of the UNSW-NB15 and the KDD99 Data Sets for NetworkIntrusion Detection Systems. In . 25–31.[116] N. Moustafa and J. Slay. 2015. UNSW-NB15: a comprehensive data set for network intrusion detection systems(UNSW-NB15 network data set). In . 1–6.[117] Biswanath Mukherjee, L Todd Heberlein, and Karl N Levitt. 1994. Network intrusion detection.
IEEE network
8, 3 (1994), 26–41.
[118] Saurabh Mukherjee and Neelam Sharma. 2012. Intrusion Detection using Naive Bayes Classifier with Feature Reduction.
Procedia Technology
Proceedings of the Fourth International Conference on Engineering & MIS 2018 (ICEMIS ’18) . Association for ComputingMachinery, New York, NY, USA, Article 45, 9 pages. https://doi.org/10.1145/3234698.3234743[120] R. Newman. 2009.
Computer Security: Protecting Digital Resources . Jones & Bartlett Learning. https://books.google.com/books?id=_R5ndK-i3vkC[121] Fakhroddin Noorbehbahani, Ali Fanian, Rasoul Mousavi, and Homa Hasannejad. 2017. An incremental intrusiondetection system using a new semi-supervised stream classification method.
International Journal of CommunicationSystems
30, 4 (2017), e3002.[122] Stephen Northcutt and Judy Novak. 2002.
Network intrusion detection . IEEE, 374–377.[127] Mrutyunjaya Panda, Ajith Abraham, and Manas Ranjan Patra. 2012. A Hybrid Intelligent Approach for NetworkIntrusion Detection.
Procedia Engineering
30 (2012), 1 – 9. https://doi.org/10.1016/j.proeng.2012.01.827 InternationalConference on Communication Technology and System Design 2011.[128] Mrutyunjaya Panda and Manas Ranjan Patra. 2007. Network intrusion detection using naive bayes.
Internationaljournal of computer science and network security
7, 12 (2007), 258–263.[129] Darsh Patel, Kathiravan Srinivasan, Chuan-Yu Chang, Takshi Gupta, and Aman Kataria. 2020. Network AnomalyDetection inside Consumer Networks—A Hybrid Approach.
Electronics
9, 6 (2020), 923.[130] Y. Peng, J. Su, X. Shi, and B. Zhao. 2019. Evaluating Deep Learning Based Network Intrusion Detection Systemin Adversarial Environment. In . 61–66.[131] Robi Polikar, Lalita Udpa, Satish Udpa, and Vasant Honavar. 2004. An incremental learning algorithm with confidenceestimation for automated identification of NDE signals. ieee transactions on ultrasonics, ferroelectrics, and frequencycontrol
51, 8 (2004), 990–1001.[132] H. E. Poston. 2012. A brief taxonomy of intrusion detection strategies. In . 255–263.[133] Mahendra Prasad, Sachin Tripathi, and Keshav Dahal. 2020. An efficient feature selection based Bayesian and Rough setapproach for intrusion detection.
Applied Soft Computing
87 (2020), 105980. https://doi.org/10.1016/j.asoc.2019.105980[134] M R [Gauthama Raman], Kannan Kirthivasan, and V S [Shankar Sriram]. 2017. Development of Rough Set –Hypergraph Technique for Key Feature Identification in Intrusion Detection Systems.
Computers & Electrical Engineering
59 (2017), 189 – 200. https://doi.org/10.1016/j.compeleceng.2017.01.006[135] N. Ravi and S. M. Shalinie. 2020. Semi-Supervised Learning based Security to Detect and Mitigate Intrusions in IoTNetwork.
IEEE Internet of Things Journal (2020), 1–1.[136] Paulo Angelo Alves Resende and André Costa Drummond. 2018. A Survey of Random Forest Based Methods forIntrusion Detection Systems.
ACM Comput. Surv.
51, 3, Article 48 (May 2018), 36 pages. https://doi.org/10.1145/3178582[137] Markus Ring, Daniel Schlör, Dieter Landes, and Andreas Hotho. 2019. Flow-based network traffic generation usingGenerative Adversarial Networks.
Computers & Security
82 (2019), 156 – 172. https://doi.org/10.1016/j.cose.2018.12.012[138] Markus Ring, Sarah Wunderlich, Dominik Grüdl, Dieter Landes, and Andreas Hotho. 2017. Creation of Flow-BasedData Sets for Intrusion Detection.
Journal of Information Warfare
16 (2017), 40–53. Issue 4.[139] Markus Ring, Sarah Wunderlich, Dominik Grüdl, Dieter Landes, and Andreas Hotho. 2017. Flow-based benchmarkdata sets for intrusion detection. In
Proceedings of the 16th European Conference on Cyber Warfare and Security (ECCWS) .ACPI, 361–369.[140] Markus Ring, Sarah Wunderlich, Deniz Scheuring, Dieter Landes, and Andreas Hotho. 2019. A survey of network-basedintrusion detection data sets.
Computers & Security
86 (2019), 147 – 167. https://doi.org/10.1016/j.cose.2019.06.005
[141] A. Sahu, Z. Mao, K. Davis, and A. E. Goulart. 2020. Data Processing and Model Selection for Machine Learning-based Network Intrusion Detection. In . 1–6.
Proceedings of the 11th International Joint Conference on Knowledge Discovery, KnowledgeEngineering and Knowledge Management - Volume 1: KDIR, . INSTICC, SciTePress, 322–329. https://doi.org/10.5220/0008113603220329[143] Fadi Salo, Ali Bou Nassif, and Aleksander Essex. 2019. Dimensionality reduction with IG-PCA and ensemble classifierfor network intrusion detection.
Computer Networks
148 (2019), 164 – 175. https://doi.org/10.1016/j.comnet.2018.11.010[144] Claude Sammut and Geoffrey I. Webb (Eds.). 2017.
Encyclopedia of Machine Learning and Data Mining . Springer.https://doi.org/10.1007/978-1-4899-7687-1[145] Martin Sarnovsky and Jan Paralic. 2020. Hierarchical intrusion detection using machine learning and knowledgemodel.
Symmetry
12, 2 (2020), 203.[146] K Selvakumar, Marimuthu Karuppiah, L SaiRamesh, SK Hafizul Islam, Mohammad Mehedi Hassan, Giancarlo Fortino,and Kim-Kwang Raymond Choo. 2019. Intelligent temporal classification and fuzzy rough set-based feature selectionalgorithm for intrusion detection system in WSNs.
Information Sciences
497 (2019), 77 – 90. https://doi.org/10.1016/j.ins.2019.05.040[147] Kamalakanta Sethi, Rahul Kumar, Nishant Prajapati, and Padmalochan Bera. 2020. Deep Reinforcement Learningbased Intrusion Detection System for Cloud Infrastructure. In . IEEE, 1–6.[148] A. Shafee, M. Baza, D. A. Talbert, M. M. Fouda, M. Nabil, and M. Mahmoud. 2020. Mimic Learning to Generatea Shareable Network Intrusion Detection Model. In . 1–6.[149] Shahaboddin Shamshirband, Amineh Amini, Nor Badrul Anuar, Miss Laiha Mat Kiah, Ying Wah Teh, and StevenFurnell. 2014. D-FICCA: A density-based fuzzy imperialist competitive clustering algorithm for intrusion detection inwireless sensor networks.
Measurement
55 (2014), 212–226.[150] Z. Shi, J. Li, and C. Wu. 2019. DeepDDoS: Online DDoS Attack Detection. In . 1–6.[151] N. Shone, T. N. Ngoc, V. D. Phai, and Q. Shi. 2018. A Deep Learning Approach to Network Intrusion Detection.
IEEETransactions on Emerging Topics in Computational Intelligence
2, 1 (2018), 41–50.[152] W. Shuyue, Y. Jie, and F. Xiaoping. 2011. Research on Intrusion Detection Method Based on SVM Co-training. In , Vol. 2. 668–671.[153] Kamran Siddique, Zahid Akhtar, Farrukh Aslam Khan, and Yangwoo Kim. 2019. Kdd cup 99 data sets: A perspectiveon the role of data sets in network intrusion detection research.
Computer
52, 2 (2019), 41–51.[154] A. Singla, E. Bertino, and D. Verma. 2019. Overcoming the Lack of Labeled Data: Training Intrusion Detection ModelsUsing Transfer Learning. In . 69–74.[155] Robin Sommer and Vern Paxson. 2010. Outside the closed world: On using machine learning for network intrusiondetection. In . IEEE, 305–316.[156] Tongtong Su, Huazhi Sun, Jinqi Zhu, Sheng Wang, and Yabo Li. 2020. BAT: Deep Learning Methods on NetworkIntrusion Detection Using NSL-KDD Dataset.
IEEE Access
Proceedings of the International Conference on Advances in Computing,Communications and Informatics (ICACCI ’12) . Association for Computing Machinery, New York, NY, USA, 645–649.https://doi.org/10.1145/2345396.2345501[158] Z. Tan, A. Jamdagni, X. He, and P. Nanda. 2010. Network Intrusion Detection based on LDA for payload featureselection. In . 1545–1549.[159] Bo Tang and Haibo He. 2017. A local density-based approach for outlier detection.
Neurocomputing
241 (2017),171–180.[160] Shahroz Tariq, Sangyup Lee, and Simon S Woo. 2020. CANTransfer: transfer learning based intrusion detection on acontroller area network using convolutional LSTM network. In
Proceedings of the 35th Annual ACM Symposium on Applied Computing . 1048–1055.
[161] Yogita Thakran and Durga Toshniwal. 2012. Unsupervised outlier detection in streaming data using weighted clustering. In . IEEE, 947–952.
[162] I. S. Thaseen and C. A. Kumar. 2016. An integrated intrusion detection model using consistency based feature selection and LPBoost. In . 1–6.
[163] M. Thottan and Chuanyi Ji. 2003. Anomaly detection in IP networks.
IEEE Transactions on Signal Processing
51, 8(2003), 2191–2204.[164] M. Usama, M. Asim, S. Latif, J. Qadir, and Ala-Al-Fuqaha. 2019. Generative Adversarial Networks For Launchingand Thwarting Adversarial Attacks on Network Intrusion Detection Systems. In . 78–83.[165] K. [Keerthi Vasan] and B. Surendiran. 2016. Dimensionality reduction using Principal Component Analysis fornetwork intrusion detection.
Perspectives in Science.
[166] Proceedings of the 18th ACM Conference on Computer and Communications Security (CCS ’11). Association for Computing Machinery, New York, NY, USA, 297–308. https://doi.org/10.1145/2046707.2046741
[167] Cheng-Ru Wang, Rong-Fang Xu, Shie-Jue Lee, and Chie-Hong Lee. 2018. Network intrusion detection using equality constrained-optimization-based extreme learning machines. Knowledge-Based Systems 147 (2018), 68–80.
[168] P. Wang, K. Chao, H. Lin, W. Lin, and C. Lo. 2016. An Efficient Flow Control Approach for SDN-Based Network Threat Detection and Migration Using Support Vector Machine. In . 56–63.
[169] Quanmin Wang and Xuan Wei. 2020. The Detection of Network Intrusion Based on Improved Adaboost Algorithm. In Proceedings of the 2020 4th International Conference on Cryptography, Security and Privacy (ICCSP 2020). Association for Computing Machinery, New York, NY, USA, 84–88. https://doi.org/10.1145/3377644.3377660
[170] Wei Wang, Thomas Guyet, René Quiniou, Marie-Odile Cordier, Florent Masseglia, and Xiangliang Zhang. 2014. Autonomic intrusion detection: Adaptively detecting anomalies over unlabeled audit data streams in computer networks. Knowledge-Based Systems 70 (2014), 103–117.
[171] W. Wong, H. Chen, C. Hsu, and T. Chao. 2011. Reinforcement Learning of Robotic Motion with Genetic Programming, Simulated Annealing and Self-Organizing Map. In . 292–298.
[172] Binhan Xu, Shuyu Chen, Hancui Zhang, and Tianshu Wu. 2017. Incremental k-NN SVM method in intrusion detection. In . IEEE, 712–717.
[173] C. Xu, J. Shen, and X. Du. 2020. A Method of Few-Shot Network Intrusion Detection Based on Meta-Learning Framework.
IEEE Transactions on Information Forensics and Security 15 (2020), 3540–3552.
[174] C. Xu, J. Shen, X. Du, and F. Zhang. 2018. An Intrusion Detection System Using a Deep Neural Network With Gated Recurrent Units. IEEE Access.
[175] IEEE Transactions on Network Science and Engineering (2019), 1–1.
[176] H. Yang and F. Wang. 2019. Wireless Network Intrusion Detection Based on Improved Convolutional Neural Network. IEEE Access.
[177] IEEE Access.
[178] Expert Systems with Applications 38, 6 (2011), 7698–7707.
[179] Ying Wang, Yongjun Shen, and Guidong Zhang. 2016. Research on Intrusion Detection Model using ensemble learning methods. In . 422–425.
[180] S. Youm, Y. Kim, K. Shin, and E. Kim. 2020. An Authorized Access Attack Detection Method for Realtime Intrusion Detection System. In . 1–6.
[181] D. YuanTong. 2019. Research of Intrusion Detection Method Based on IL-FSVM. In . 1221–1225.
[182] F. Zhang and D. Wang. 2013. An Effective Feature Selection Approach for Network Intrusion Detection. In . 307–311.
[183] Hongpo Zhang, Lulu Huang, Chase Q. Wu, and Zhanbo Li. 2020. An effective convolutional neural network based on SMOTE and Gaussian mixture model for intrusion detection in imbalanced dataset.
Computer Networks 177 (2020), 107315. https://doi.org/10.1016/j.comnet.2020.107315
[184] Jiong Zhang, Mohammad Zulkernine, and Anwar Haque. 2008. Random-forests-based network intrusion detection systems. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38, 5 (2008), 649–659.
[185] Wenhao Zhang, Ramin Ramezani, and Arash Naeim. 2019. WOTBoost: Weighted Oversampling Technique in Boosting for imbalanced learning. In . IEEE, 2523–2531.
[186] Y. Zhang, X. Chen, D. Guo, M. Song, Y. Teng, and X. Wang. 2019. PCCN: Parallel Cross Convolutional Neural Network for Abnormal Network Traffic Flows Detection in Multi-Class Imbalanced Network Traffic Flows. IEEE Access.
[187] Y. Zhang, X. Chen, L. Jin, X. Wang, and D. Guo. 2019. Network Intrusion Detection: Based on Deep Hierarchical Network and Original Flow Data. IEEE Access.
[188] . 1–6.
[189] J. Zhao, S. Shetty, and J. W. Pan. 2017. Feature-based transfer learning for network security. In MILCOM 2017 - 2017 IEEE Military Communications Conference (MILCOM). 17–22.
[190] Juan Zhao, Sachin Shetty, Jan Wei Pan, Charles Kamhoua, and Kevin Kwiat. 2019. Transfer learning for detecting unknown network attacks. EURASIP Journal on Information Security.
[191] Computer Communications 62 (2015), 47–58.
[192] Ming Zheng, Tong Li, Rui Zhu, Yahui Tang, Mingjing Tang, Leilei Lin, and Zifei Ma. 2020. Conditional Wasserstein generative adversarial network-gradient penalty-based approach to alleviating imbalanced data classification. Information Sciences 512 (2020), 1009–1023. https://doi.org/10.1016/j.ins.2019.10.014
[193] Ying Zhong, Wenqi Chen, Zhiliang Wang, Yifan Chen, Kai Wang, Yahui Li, Xia Yin, Xingang Shi, Jiahai Yang, and Keqin Li. 2020. HELAD: A novel network anomaly detection model based on heterogeneous ensemble learning. Computer Networks 169 (2020), 107049. https://doi.org/10.1016/j.comnet.2019.107049
[194] Yingying Zhu, Junwei Liang, Jianyong Chen, and Zhong Ming. 2017. An improved NSGA-III algorithm for feature selection used in intrusion detection.