A new method for flow-based network intrusion detection using the inverse Potts model
Camila Pontes, Manuela Souza, João Gondim, Matt Bishop, Marcelo Marotta
Abstract—Network Intrusion Detection Systems (NIDS) play an important role as tools for identifying potential network threats. In the context of ever-increasing traffic volume on computer networks, flow-based NIDS arise as good solutions for real-time traffic classification. In recent years, different flow-based classifiers have been proposed using Machine Learning (ML) algorithms. Nevertheless, classical ML-based classifiers have some limitations. For instance, they require large amounts of labeled data, which might be difficult to obtain. Additionally, most ML-based classifiers are not capable of domain adaptation, i.e., after being trained on a specific data distribution, they are not general enough to be applied to other related data distributions. Finally, many of the models inferred by these algorithms are black boxes, hard to understand in detail. To overcome these limitations, we propose a new algorithm, called the Energy-based Flow Classifier (EFC). This anomaly-based classifier uses inverse statistics to infer a statistical model based on labeled benign examples. We show that EFC is capable of accurately performing binary flow classification and is more adaptable to new domains than classical ML-based classifiers. Given the positive results obtained on three different datasets (CIDDS-001, CICIDS17, and CICDDoS19), we consider EFC a promising algorithm for robust flow-based traffic classification.
Index Terms—Flow-based Network Intrusion Detection, Anomaly-based Network Intrusion Detection, Network Flow Classification, Network Intrusion Detection Systems, Energy-based Flow Classifier, Inverse Potts Model, Domain Adaptation.
I. INTRODUCTION

SYMANTEC's Internet Security Threat Report [1] points out a 56% increase in the number of web attacks in 2019. Network scans, denial of service, and brute force attacks are among the most common threats. Such malicious activities threaten not only individuals, but also collective organizations such as public health, financial, and government institutions. In this context, Network Intrusion Detection Systems (NIDSs) play an important role as tools for identifying potential threats [2].

There are two main approaches for NIDSs regarding the kind of data analyzed: packet-based and flow-based. In the former, deep packet inspection is performed, taking into account individual packet payloads as well as header information [3]. In the latter, flows, i.e., packet collections, are analyzed regarding their properties, e.g., duration, number of packets, number of bytes, and source/destination port [3]. To perform classification in real-time, a massive volume of data must
C. Pontes, M. Souza, J. Gondim and M. A. Marotta are with the University of Brasilia, Brazil, emails: [email protected], [email protected], [email protected]; M. Bishop is with the University of California at Davis, Davis, USA, email: [email protected].

be analyzed, which makes deep packet inspection too costly to be applied regarding processing and energy consumption. Since flow-based approaches can classify the whole traffic inspecting an equivalent of 0.1% of the total volume, NIDSs based on flow analysis arise as good solutions for real-time traffic classification [4].

In recent years, different flow-based classifiers have been proposed based on both shallow and deep learning [5]. According to the report in [5], the best flow-based classifiers achieve around 99% accuracy. Although quite accurate, classical Machine Learning (ML)-based classifiers require labeled malicious traffic samples to perform training. However, real traffic labeling might be difficult, especially in the case of malicious traffic. In addition, ML-based classifiers, after being trained on a specific data distribution, usually do not work well when applied to other data with a related distribution, i.e., they have low domain adaptation capability [6], [7]. Moreover, most ML algorithms are well-known black-box mechanisms, challenging to understand and readjust in detail [8]. In this regard, there is a clear need for a new flow-based classifier for NIDSs, which generates an understandable model (white box), is based solely on benign examples, and is adaptable to different domains.

In this work, we propose a novel classifier called the Energy-based Flow Classifier (EFC), which is a network flow classifier based on the inverse Potts model. EFC performs one-class, anomaly-based classification, i.e., as long as it can learn the properties of benign flows, it will be able to discriminate between benign and malicious flows.
Moreover, it is a white-box algorithm, producing a statistical model that can be analyzed in detail regarding individual parameter values. Here, we compare the performance of EFC against a variety of classifiers using three different datasets, i.e.,
CIDDS-001 [9], CICIDS17 [10], and CICDDoS19 [11]. Our results show that classifiers based on classical ML are more sensitive to changes in data distribution than EFC. Our main contributions are:
• The proposal and implementation of a flow classifier based on the inverse Potts model to be employed in NIDSs;
• A performance comparison of the proposed classifier with classical ML-based classifiers using three different datasets;
• An analysis of how different classifiers perform when trained within one domain and tested in another related domain.
The rest of this paper is structured as follows. In Section II, we briefly present the state-of-the-art in flow-based NIDSs. In Section III, we describe the structure of network flows with a preliminary analysis of the datasets considered here. In Section IV, we introduce the proposed statistical model and the classifier implementation. In Section V, we present the results obtained regarding the analysis of the statistical model and the classification experiments performed. Finally, in Section VI, we present our conclusions and future work.

II. RELATED WORK
In this section, we briefly review the state-of-the-art in flow-based network intrusion detection. We show some early work in the field, as well as recent advances. Finally, we review previous work on the CIDDS-001, CICIDS17, and CICDDoS19 datasets.

Several ML-based flow classifiers have been explored over the last 15 years for network intrusion detection. There are recent comprehensive surveys in which ML-based classifiers used in this context are reviewed [5], [12], [13], [14]. Among the algorithms evaluated in these surveys, Random Forest (RF) performs especially well and has been applied in most of the recently proposed NIDSs [15], [16], [17]. In this work, we deploy most of the ML classifiers covered in recent surveys to serve as baselines against which we compare our classifier.

Flow-based intrusion detection has also been explored in modern contexts, i.e.,
Internet of Things (IoT) networks [18], [19] and cloud environments [20], [21]. The proposed solutions for intrusion detection in IoT and cloud environments achieved satisfactory classification accuracy and feasible running times. However, their domain adaptation capability is still a matter of investigation. Most of the proposed solutions from the literature assume that training sets will be available in all contexts, which is not necessarily true. In this regard, we propose a flow-classifier solution that is adaptable to different domains without retraining.

Since malicious data frequently changes its characteristics when new attack types arise, domain adaptation becomes a major issue for intrusion detection. Bartos et al. [6] and Li et al. [7] proposed similar approaches to cope with domain change by applying data transformations to reduce differences in data features across domains. Here, we propose a classifier that is intrinsically adaptable to different domains, since its learning phase is based solely on benign data. Hence, there is no need to transform data to adapt data features, making our approach simpler and more straightforward.

To assess EFC's performance, one of the datasets we use is CIDDS-001. This dataset was used by Verma and Ranga (2018) [22] to assess the performance of the K-Nearest Neighbors (KNN) and k-means clustering algorithms when classifying traffic. Both algorithms achieved over 99% accuracy. Also, Ring et al. [23] explored slow port scan detection using CIDDS-001. The approach they proposed is capable of accurately recognizing the attacks with a low false alarm rate. Finally, Abdulhammed et al. [24] also performed flow-based classification on CIDDS-001 and proposed an approach that is robust to imbalanced network traffic. In summary, CIDDS-001 is an updated and relevant dataset for network flow-classification solutions, being one of our dataset choices for assessing the performance of EFC.
The other two datasets we use in this work are CICIDS17 and CICDDoS19, from the Canadian Institute for Cybersecurity. Recently, Yulianto, Sukarno, and Suwastika [25] used CICIDS17 to assess the performance of an Adaboost-based classifier. Aksu et al. [26] did the same in 2018 with different ML classifiers. CICIDS17 contains benign traffic as well as the most up-to-date common attacks, resembling true real-world data, making it a relevant dataset for flow-based traffic classification.

CICDDoS19, in turn, is a very recent dataset with a focus on DDoS attacks. A very recent work [27] proposes a real-time entropy-based NIDS for the detection of volumetric DDoS in IoT and performs tests over the CICDDoS19 dataset, among others. Another recent work [28] obtained over 99% accuracy on the CICDDoS19 dataset using a Convolutional Neural Network (CNN). Finally, Novaes et al. [29] proposed a system for intrusion detection based on fuzzy logic, which had its performance assessed on CICDDoS19. The rising popularity of this dataset serves as proof of its relevance for assessing the performance of different NIDSs. Hence, we use CICDDoS19 and two other up-to-date datasets to test our classifier and compare it to the performance of classical ML classifiers.

III. PRELIMINARIES
A network flow is a set of packets that traverses intermediary nodes between end-points within a given time interval. From the perspective of an intermediary node, i.e., an observation point, all packets belonging to a given flow share a set of common features called flow keys. This means that flow keys do not change for packets belonging to the same flow, while the remaining features might vary. FlowScan [30] is an example of a tool capable of collecting data from a set of packets and extracting flow features to be later exported in different formats, such as NetFlow and IPFIX. Since NetFlow is the most commonly used format, its main features are listed below:
• Source/Destination IP (flow keys) - determine the origin and destination of a given flow in the network;
• Source/Destination port (flow keys) - characterize different kinds of network services, e.g., the ssh service uses port 22;
• Protocol (flow key) - characterizes flows regarding the transport protocol used, e.g., TCP, UDP, ICMP;
• Number of packets (feature) - total number of packets captured in a flow;
• Number of bytes (feature) - total number of bytes in a flow;
• Duration (feature) - total duration of a flow in seconds;
• Initial timestamp (feature) - system time when the flow started to be captured.
Other features such as TCP Flags and Type of Service might also be exported in some cases. The combination of different flow keys and features characterizes a flow and determines its particular behavior.

Flow-based approaches are seen as suitable alternatives to precede packet inspection in real-time NIDSs. The idea is to deeply inspect only the packets belonging to flows considered suspicious by the flow-based classifier. A two-step approach would notably reduce the amount of data analyzed while maintaining high classification accuracy [4]. In this work, we are only concerned with the first step, the flow classification. We evaluate the performance of our algorithm, EFC, compared to other ML algorithms using three different datasets. We also evaluate the performance of the algorithms by training with data from one part of a dataset and testing with other parts of it. Although both parts of the data come from the same dataset, their distributions differ, which characterizes domain adaptation. In the following, we briefly describe the datasets used for testing and characterize what constitutes a domain adaptation in each of them.
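As a concrete illustration, a NetFlow-style record combining the flow keys and features listed above can be represented as a simple data structure. The field names below are illustrative, not a formal NetFlow schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    """One NetFlow-style record: flow keys plus aggregate features."""
    src_ip: str        # flow key
    dst_ip: str        # flow key
    src_port: int      # flow key
    dst_port: int      # flow key
    protocol: str      # flow key, e.g. "TCP", "UDP", "ICMP"
    packets: int       # feature: total packets in the flow
    num_bytes: int     # feature: total bytes in the flow
    duration: float    # feature: total duration in seconds
    first_seen: float  # feature: start timestamp

# An ssh connection attempt (destination port 22) seen at the observation point
f = Flow("10.0.0.5", "10.0.0.9", 51432, 22, "TCP", 12, 1840, 0.8, 1622000000.0)
```

All packets of this flow share the five key fields; the aggregate features (packets, bytes, duration) vary with the flow's behavior.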
A. CIDDS-001
CIDDS-001 [9] is a relatively recent dataset composed of a set of flow samples captured within a simulated OpenStack environment and another set of flow samples obtained from a real server. The former contains only simulated traffic, while the latter includes both real and simulated traffic. Each sample collected within these two environments has one of the labels described in Table I.
Table I
LABELS WITHIN CIDDS-001 DATASET

Environment      Labels
OpenStack        normal, DoS, portScan, pingScan, bruteForce
External server  normal, DoS, bruteForce, unknown, suspicious
Simulated benign flows are labeled as normal, while simulated malicious flows are labeled as dos, portScan, pingScan, or bruteForce, depending on the type of attack simulated. The labels suspicious and unknown, in turn, are used for real traffic. The external server is open to user access through ports 80 and 443. Hence, flows directed at these ports were labeled as unknown, since they could be either benign or malicious. All flows directed at other ports were labeled as suspicious. Traffic was sampled in both the simulated and the external environment for a period of four weeks. For the simulated environment, we consider only traffic captured in the second week to reduce the amount of data to be analyzed. Similarly, only external traffic captured within the third week was assessed. These weeks were selected because they have the fairest proportion between the different types of malicious flows. Within this dataset, a change from the simulated data distribution to the external server data distribution is a domain change, requiring the classifiers to adapt.

CIDDS-001 dataset flow features are shown in Table II. All features were taken into account for characterization and classification except for Src IP, Dest IP, and Date first seen. These exceptions are because the latter is intrinsically not informative for differentiating flows, and the former two are made up in the context of the simulated network and might be confounding.
Table II
FEATURES WITHIN CIDDS-001 DATASET

#   Feature          Description
1   Src IP           Source IP address
2   Src Port         Source port
3   Dest IP          Destination IP address
4   Dest Port        Destination port
5   Proto            Transport protocol (e.g., ICMP, TCP, or UDP)
6   Date first seen  Start time flow first seen
7   Duration         Duration of the flow
8   Bytes            Number of transmitted bytes
9   Packets          Number of transmitted packets
10  Flags            OR concatenation of all TCP Flags
B. CICIDS17
The CICIDS17 [10] dataset contains benign traffic and the most up-to-date common attacks, resembling real-world data. This dataset was built using the abstract behavior of 25 users based on the HTTP, HTTPS, FTP, SSH, and email protocols. The data was captured during one week in July 2017. The attacks implemented include Brute Force FTP, Brute Force SSH, DoS, Heartbleed, Web Attack, Infiltration, Botnet, and DDoS. They were executed both morning and afternoon on Tuesday, Wednesday, Thursday, and Friday (see Table III). Within this dataset, a change from one day's data distribution to that of another day is a domain change, requiring the classifiers to adapt.
Table III
ATTACKS WITHIN CICIDS17 DATASET

Week day   Attacks
Monday     -
Tuesday    FTP-Patator, SSH-Patator
Wednesday  DoS slowloris, DoS Slowhttptest, DoS Hulk, DoS GoldenEye, Heartbleed Port 444
Thursday   Brute Force, XSS, Sql Injection, Dropbox download, Cool disk
Friday     Botnet ARES, Port Scan, DDoS LOIT
Flow features in this dataset were extracted using CICFlowMeter [31]. There are in total 88 features, which are not listed here due to limited space. All features were considered except for Flow ID, Source IP, Destination IP, and Timestamp. These exceptions were made because these features are either intrinsically not informative or made up within a simulated environment.
C. CICDDoS19
CICDDoS19 [11] contains benign traffic and the most up-to-date common DDoS attacks (volumetric and application: low volume, slow rate), resembling real-world data. This dataset contains different modern reflective DDoS attacks such as PortMap, NetBIOS, LDAP, MSSQL, UDP, UDP-Lag, SYN, NTP, DNS, and SNMP. The traffic was captured in January (first day) and March (second day) of 2019. Attacks were executed during this period (see Table IV). Within this dataset, a change from one day's data distribution to that of another day is a domain change, requiring the classifiers to adapt.
Table IV
ATTACKS WITHIN CICDDoS19 DATASET

Day     Attacks
First   PortMap, NetBIOS, LDAP, MSSQL, UDP, UDP-Lag, SYN
Second  NTP, DNS, LDAP, MSSQL, NetBIOS, SNMP, SSDP, UDP, UDP-Lag, WebDDoS, SYN, TFTP
Flow features in this dataset were extracted using CICFlowMeter [31]. All features were considered except for Flow ID, Source IP, Destination IP, and Timestamp. These exceptions were made because these features are either intrinsically not informative or made up within a simulated environment.

IV. STATISTICAL MODEL
The main task of inverse statistics is to infer a statistical distribution based on a sample of it [32]. Methods using inverse statistics have been successfully applied to problems in other disciplines, e.g., the problem of predicting protein contacts in Biophysics [32], [33]. Here, the statistical inference is based on the Potts model [34]. This model provides a mathematical description of interacting spins on a crystalline lattice. Within the model framework, interacting spins are mapped into a graph G(η, ε) (see Figure 1A), where each node i ∈ η = {1, ..., N} has an associated spin a_i, which can assume one value from a set Ω that contains all possible individual quantum states. Each node i also has an associated local field h_i(a_i) that is a function of a_i's state. Meanwhile, each edge (i, j) ∈ ε, i, j ∈ η, has an associated coupling value e_ij(a_i, a_j) that is a function of the states of the spins a_i and a_j associated with nodes i and j. A specific system configuration has an associated total energy, determined by the Hamiltonian function H(a_1 ... a_N), which depends on all spin states.

Figure 1. A) Interacting spins on a crystalline lattice. B) Network flow mapped into a graph structure.
In this work, we adapt the Potts model to characterize network flows (see Figure 1B). An individual flow k is represented by a specific graph configuration G_k(η, ε). Instead of spins, each node represents a selected feature i ∈ η = {SrcPort, ..., Flags}. Within a given flow k, each feature i assumes one value a_ki from the set Ω_i that contains all possible values for this feature. As in the Potts model, each feature i has an associated local field h_i(a_ki). Meanwhile, ε = {(i, j) | i, j ∈ η; i ≠ j} is the set of edges determined by all possible pairs of features. Each edge has an associated coupling value determined by the function e_ij(a_ki, a_kj). Since the values of the local fields and couplings depend on the values assumed by the features within a given flow, each distinct flow will have a different combination of these quantities. As in the Potts model, local fields and couplings determine the total "energy" H(a_k1 ... a_kN) of each flow. For instance, in Figure 1B, the total "energy" of the flow is obtained by summing up all values associated with the edges and the nodes, resulting in a total of -3. Note that what we call energy is analogous to the notion of the Hamiltonian in Quantum Mechanics. It is important to note that the model described here is discrete; therefore, continuous features must be discretized. The classes for continuous feature discretization are shown in the Supplementary Information. In the following, we present the framework applied to perform the statistical model inference and the subsequent energy-based flow classification.

A. Model inference
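Before inference, continuous features (e.g., duration or byte counts) must be mapped to a discrete alphabet. The paper's actual discretization classes are given in its Supplementary Information; the quantile-based binning below is only our minimal sketch of the idea (function name and bin choice are ours):

```python
import numpy as np

def discretize(values, q):
    """Map a continuous feature to symbols {1, ..., q} using quantile bins.
    A sketch; the paper's actual bin boundaries are in its Supplementary
    Information."""
    values = np.asarray(values, dtype=float)
    # q - 1 interior quantile edges define q bins
    edges = np.quantile(values, np.linspace(0, 1, q + 1)[1:-1])
    return np.digitize(values, edges) + 1  # np.digitize yields 0..q-1

durations = [0.01, 0.02, 0.5, 1.2, 3.4, 10.0, 0.03, 7.7]
symbols = discretize(durations, 4)
```

Each feature is discretized independently, so afterward every feature takes values in the same alphabet Ω = {1, ..., Q}, as required by the model below.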
In this section, a statistical model is inferred in terms of coupling and local field values to perform energy-based flow classification. The main idea consists in extracting a statistical model from benign flow samples to infer coupling and local field values that characterize this type of traffic. When calculating the energies of unlabeled flows using the inferred values, it is expected that benign flows will have lower energies than malicious flows.

Let (A_1 ... A_N) be an N-tuple of features, which can be instantiated for flow k as (a_k1 ... a_kN), with a_k1 ∈ Ω_1, ..., a_kN ∈ Ω_N. Each feature value a_ki is encoded by an integer from the set Ω = {1, 2, ..., Q}, i.e., all feature alphabets are the same, Ω_i = Ω, of size Q. If a given feature can only assume M values and M < Q, it is considered that the values M+1, ..., Q are possible, but will never be observed empirically. For instance, suppose the only possible values for the feature protocol are {'TCP', 'UDP'} and Q = 4. In this case, we would have the mapping {'TCP': 1, 'UDP': 2, '': 3, '': 4} and feature values 3 and 4 would never occur.

Now, let K be the set of all possible flows, i.e., all possible combinations of feature values (K = Ω^N), and let S ⊂ K be a sample of flows. We can use inverse statistical physics to infer a statistical model associating a probability P(a_k1 ... a_kN) to each flow k ∈ K based on sample S. The global statistical model P is inferred following the Entropy Maximization Principle [35]:

max_P − Σ_{k ∈ K} P(a_k1 ... a_kN) log(P(a_k1 ... a_kN))   (1)

s.t. Σ_{k ∈ K | a_ki = a_i} P(a_k1 ... a_kN) = f_i(a_i),   (2)
∀i ∈ η; ∀a_i ∈ Ω;

Σ_{k ∈ K | a_ki = a_i, a_kj = a_j} P(a_k1 ... a_kN) = f_ij(a_i, a_j),   (3)
∀(i, j) ∈ η² | i ≠ j; ∀(a_i, a_j) ∈ Ω²;

where f_i(a_i) is the empirical frequency of value a_i of feature i and f_ij(a_i, a_j) is the empirical joint frequency of the pair of values (a_i, a_j) of features i and j. Note that constraints (2) and (3) force model P to reproduce the single and joint empirical frequency counts as marginals. This way, the model is guaranteed to be coherent with the empirical data.

The single and joint empirical frequencies f_i(a_i) and f_ij(a_i, a_j) are obtained from set S by counting occurrences of a given feature value a_i or feature value pair (a_i, a_j), respectively, and dividing by the total number of flows in S. Since the set S is finite and much smaller than K, inferences based on S are subject to undersampling effects.
Following the theoretical framework proposed in [33], we add pseudocounts to the empirical frequencies to limit undersampling effects by performing the following operations:

f_i(a_i) ← (1 − α) f_i(a_i) + α/Q   (4)

f_ij(a_i, a_j) ← (1 − α) f_ij(a_i, a_j) + α/Q²   (5)

where (a_i, a_j) ∈ Ω² and 0 ≤ α ≤ 1, i.e., S is extended with a fraction α of flows with uniformly sampled features.

The proposed maximization can be solved using a Lagrangian function as presented in [35], yielding the following Boltzmann-like distribution:

P*(a_k1 ... a_kN) = e^{−H(a_k1 ... a_kN)} / Z   (6)

where

H(a_k1 ... a_kN) = − Σ_{i,j | i<j} e_ij(a_ki, a_kj) − Σ_i h_i(a_ki)   (7)

is the Hamiltonian of flow k and Z (eq. (6)) is the partition function that normalizes the distribution. Since in this work we are not interested in obtaining individual flow probabilities, Z is not required and, as a consequence, its calculation is omitted. Our objective is to calculate individual flow energies, i.e., individual Hamiltonians as determined in eq. (7).

Note that the Hamiltonian, as presented above, is fully determined by the Lagrange multipliers e_ij(·) and h_i(·) associated with constraints (3) and (2), respectively. Within the Potts model framework, the Lagrange multipliers have a special meaning, with {e_ij(a_i, a_j) | (a_i, a_j) ∈ Ω²} being the set of all possible coupling values between features i and j, and {h_i(a_i) | a_i ∈ Ω} the set of possible local fields associated with feature i.

Inferring the local fields and pairwise couplings is difficult since the number of parameters exceeds the number of independent constraints. Due to the physical properties of interacting spins, it is possible to infer pairwise coupling values e_ij(a_i, a_j) using a Gaussian approximation.
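The regularized frequency counts of eqs. (4)-(5) can be computed directly from a sample of discretized benign flows. The sketch below uses our own array conventions (symbols encoded as 1..Q, flows stacked into an (S, N) array); the function names echo Algorithm 1 but are not the authors' released code:

```python
import numpy as np

def site_freq(flows, q, alpha):
    """Single-site frequencies f_i(a_i) with pseudocounts, eq. (4).
    flows: (S, N) integer array with symbols in {1, ..., q}."""
    S, N = flows.shape
    f = np.zeros((N, q))
    for i in range(N):
        f[i] = np.bincount(flows[:, i] - 1, minlength=q) / S
    return (1 - alpha) * f + alpha / q

def pair_freq(flows, q, alpha):
    """Joint frequencies f_ij(a_i, a_j) with pseudocounts, eq. (5)."""
    S, N = flows.shape
    f = np.zeros((N, N, q, q))
    for i in range(N):
        for j in range(N):
            for a, b in zip(flows[:, i] - 1, flows[:, j] - 1):
                f[i, j, a, b] += 1
    return (1 - alpha) * f / S + alpha / q**2
```

With the pseudocounts applied, each f_i still sums to 1 over the Q symbols and each f_ij sums to 1 over the Q² symbol pairs, so the regularized counts remain valid marginal distributions.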
Assuming that the same properties apply to flow features, we infer coupling values as follows:

e_ij(a_i, a_j) = −(C⁻¹)_ij(a_i, a_j),   (8)
∀(i, j) ∈ η² | i ≠ j, ∀(a_i, a_j) ∈ Ω², a_i, a_j ≠ Q

where

C_ij(a_i, a_j) = f_ij(a_i, a_j) − f_i(a_i) f_j(a_j)   (9)

is the covariance matrix obtained from the single and joint empirical frequencies. Taking the inverse of the covariance matrix is a well-known procedure in statistics to remove the effect of indirect correlations in the data [36]. Now, it is important to clarify that the number of independent constraints in eqs. (2) and (3) is actually N(N−1)/2 · (Q−1)² + N(Q−1), even though the model in eq. (6) has N(N−1)/2 · Q² + NQ parameters. So, without loss of generality, we set:

e_ij(a_i, Q) = e_ij(Q, a_j) = h_i(Q) = 0,   (10)

i.e., e_ij(a_i, a_j) = 0 in case a_i or a_j is equal to Q [33]. Afterwards, the local fields h_i(a_i) can be inferred using a mean-field approximation [37]:

f_i(a_i) / f_i(Q) = exp( h_i(a_i) + Σ_{j, a_j} e_ij(a_i, a_j) f_j(a_j) ),   (11)
∀i ∈ η, a_i ∈ Ω, a_i ≠ Q

where f_i(Q) is the frequency of the last element a_i = Q for any feature i, used for normalization. It is also worth mentioning that the element Q is arbitrarily selected and could be replaced by any other value in {1, ..., Q}, as long as the selected element is kept the same for the calculations of the local fields of every feature i ∈ η. Note that in eq. (11) the empirical single frequencies f_i(a_i) and the coupling values e_ij(a_i, a_j) are known, yielding:

h_i(a_i) = ln( f_i(a_i) / f_i(Q) ) − Σ_{j, a_j} e_ij(a_i, a_j) f_j(a_j)   (12)

In the mean-field approximation presented above, the interaction of a feature with its neighbors is replaced by an approximate interaction with an averaged feature, yielding an approximate value for the associated local field. For further details about these calculations, please refer to [32].
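Eqs. (8), (9), and (12) amount to building the covariance matrix restricted to the first Q−1 symbols, inverting it, and reading off the fields. The sketch below shows this mean-field inversion under our own array conventions (f_i of shape (N, Q), f_ij of shape (N, N, Q, Q), zero-gauge on symbol Q); it is an illustration of the calculation, not the authors' released code:

```python
import numpy as np

def couplings_and_fields(f_i, f_ij, q):
    """Infer e_ij (eq. 8) and h_i (eq. 12) in the gauge
    h_i(Q) = e_ij(., Q) = e_ij(Q, .) = 0."""
    N = f_i.shape[0]
    d = q - 1  # only symbols 1..q-1 carry free parameters
    # covariance matrix restricted to the first q-1 symbols, eq. (9)
    C = np.zeros((N * d, N * d))
    for i in range(N):
        for j in range(N):
            C[i*d:(i+1)*d, j*d:(j+1)*d] = (
                f_ij[i, j, :d, :d] - np.outer(f_i[i, :d], f_i[j, :d]))
    e = -np.linalg.inv(C).reshape(N, d, N, d)   # eq. (8)
    for i in range(N):
        e[i, :, i, :] = 0.0                     # discard self-couplings
    # local fields via the mean-field relation, eq. (12)
    h = np.log(f_i[:, :d] / f_i[:, d:d+1])
    for i in range(N):
        h[i] -= np.einsum('ajb,jb->a', e[i], f_i[:, :d])
    return e, h
```

The pseudocounts of eqs. (4)-(5) also serve a practical purpose here: they keep every f_i(Q) strictly positive and help make C invertible.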
Now that all model parameters are known, it is possible to calculate a given flow's energy according to eq. (7). In the following, we present the implementation of this theoretical framework to perform two-class, i.e., benign and malicious, flow classification.

B. Energy-based flow classification
The energy of a given flow can be calculated according to eq. (7) based on the values of its features and the parameters of the statistical model inferred in the last section. In simple terms, a given flow's energy is the negative sum of the couplings and local fields associated with its features, according to a given statistical model. This means that a flow resembling the ones used to infer the model is likely to have low energy. Since EFC is an anomaly-based classifier, the statistical model used for classification is inferred based only on benign flow samples. We would then expect the energies of benign samples to be lower than the energies of malicious samples. In this sense, it is possible to classify flow samples as benign or malicious based on a chosen energy threshold. The classification is performed by stating that samples with energy smaller than the threshold are benign, and samples with energy greater than or equal to the threshold are malicious. Note that the threshold for classification can be chosen in different ways, and it can be static or dynamic. In this work, we consider a static threshold.
Algorithm 1 Energy-based Flow Classifier

Input: benign_flows (K × N), Q, α, cutoff
1:  import all model inference functions
2:  f_i ← SiteFreq(benign_flows, Q, α)
3:  f_ij ← PairFreq(benign_flows, f_i, Q, α)
4:  e_ij ← Couplings(f_i, f_ij, Q)
5:  h_i ← LocalFields(e_ij, f_i, Q)
6:  while scanning the network do
7:      flow ← wait_for_incoming_flow()
8:      e ← 0
9:      for i ← 1 to N do
10:         a_i ← flow[i]
11:         for j ← i + 1 to N do
12:             a_j ← flow[j]
13:             if a_i ≠ Q and a_j ≠ Q then
14:                 e ← e − e_ij[i, a_i, j, a_j]
15:             end if
16:         end for
17:         if a_i ≠ Q then
18:             e ← e − h_i[i, a_i]
19:         end if
20:     end for
21:     if e ≥ cutoff then
22:         stop_flow()
23:         forward_to_DPI()
24:     else
25:         release_flow()
26:     end if
27: end while

Algorithm 1 shows the implementation of EFC. In lines 2-5, the statistical model for the sampled flows is inferred, as described by eqs. (4), (5), (8), and (12). Afterward, in lines 6-27, the classifier monitors the network waiting for a captured flow. When a flow is captured, its energy is calculated in lines 9-20, according to the Hamiltonian in eq. (7). The computed flow energy is compared to a known threshold (cutoff) value in line 21. In case the energy falls above the threshold, the flow is classified as malicious and should be forwarded to deep packet inspection (line 23) for assessment. Otherwise, the flow is released, and the classifier waits for another flow.

It is essential to highlight that the time complexity of the training step of EFC is O((M × Q)³ + N × M² × Q²), where N is the number of samples, M is the number of features, and Q is the size of the alphabet. Meanwhile, the complexity of the classification step for each sample is O(M²). This means that, in both steps, the complexity depends mostly on the number of features chosen, which can be kept small by using a feature selection mechanism, e.g., Principal Component Analysis (PCA).
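The per-flow energy computation of Algorithm 1 (lines 9-20) can be written compactly in Python. The array layouts here (e_ij indexed as [i, a_i−1, j, a_j−1] and h_i as [i, a_i−1], with symbols 1..Q and symbol Q contributing nothing under the zero-gauge) are our own convention, not the authors' code:

```python
def flow_energy(flow, e_ij, h_i, q):
    """Hamiltonian of eq. (7) for one discretized flow, mirroring
    lines 9-20 of Algorithm 1. flow: list of symbols in {1, ..., q};
    e_ij: (N, q-1, N, q-1); h_i: (N, q-1). Symbol q carries zero
    couplings and fields (gauge choice), so it is skipped."""
    N = len(flow)
    e = 0.0
    for i in range(N):
        a_i = flow[i]
        if a_i == q:
            continue  # gauge: h_i(Q) = e_ij(Q, .) = 0
        for j in range(i + 1, N):
            a_j = flow[j]
            if a_j != q:
                e -= e_ij[i][a_i - 1][j][a_j - 1]
        e -= h_i[i][a_i - 1]
    return e
```

A flow is then flagged as malicious when its energy meets or exceeds the cutoff, exactly as in line 21 of Algorithm 1.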
Therefore, EFC has a low computational cost when compared to ML-based classifiers such as Artificial Neural Networks (ANN), Support Vector Machines (SVM), and RF. Considering the implementation shown in this section, we next present the results obtained when EFC is used to perform flow classification.

V. RESULTS
In this section, we present the results obtained for EFC and classical ML-based classifiers in different binary classification experiments considering three different datasets, i.e., CIDDS-001, CICIDS17, and CICDDoS19. First, we show that EFC can separate benign from malicious flows based on their energies, a result that is consistent across all the considered datasets. Then, we present EFC's classification performance and compare it to that of classical ML-based classifiers in different test scenarios within each dataset.

It is important to highlight that the classification experiments performed in this work were designed not only to assess the performance of different classifiers, but also to investigate their capability of adapting to different domains, i.e., data distributions. Hence, we considered each day/context within a given dataset to be a different domain and performed two kinds of experiments: training/testing in the same domain, and training/testing in different domains. In the case of training/testing in the same domain, 10-fold cross-validation (CV) was performed. Afterward, the models inferred in each of the ten steps of the CV were used to classify samples coming from another domain. This classification was done to investigate each classifier's capability for domain adaptation. EFC's cutoff was defined to be at the 95th percentile of the energy distribution obtained in the training phase.
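The static cutoff used in the experiments is simply the 95th percentile of the benign training energies; a small sketch (function names and the toy energy values are ours):

```python
import numpy as np

def choose_cutoff(train_energies, percentile=95):
    """Static threshold: a percentile of the benign (training) energy
    distribution; the experiments use the 95th percentile."""
    return np.percentile(train_energies, percentile)

def is_malicious(energies, cutoff):
    """Flag samples whose energy meets or exceeds the cutoff."""
    return np.asarray(energies) >= cutoff

# toy benign training energies standing in for real model output
benign = np.random.default_rng(1).normal(100.0, 10.0, size=1000)
cutoff = choose_cutoff(benign)
```

By construction, roughly 5% of the benign training flows fall at or above their own 95th percentile, which fixes the false-positive rate on the training distribution.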
A. EFC characterization
To investigate whether EFC can correctly classify benign and malicious traffic flow samples, we inferred a model based on benign samples from simulated traffic within the CIDDS-001 dataset. This model was then used to calculate the energy of benign and malicious flow samples coming from simulated traffic, and also from the external server traffic. The first plot of Figure 2 shows the energy values of 5,000 randomly sampled flows labeled as normal and 5,000 randomly sampled flows labeled as malicious from the simulated traffic contained in the CIDDS-001 dataset. The statistical model used to calculate the energies was inferred based on 4,500 flows randomly sampled from the simulated traffic. The separation between the two flow classes is clear, i.e., the energy distribution of normal flows is clearly shifted to the left relative to that of malicious flows.

The energy values of 5,000 randomly sampled flows labeled as unknown and 5,000 randomly sampled flows labeled as suspicious from the external server traffic in CIDDS-001 are shown in the second plot of Figure 2. Traffic labeled as unknown is traffic coming from external users with destination port 80 or 443, i.e., expected traffic. In this sense, we consider this traffic to be analogous to benign traffic. Traffic labeled as suspicious, on the other hand, is traffic coming from external users aimed at ports other than 80 and 443, i.e., unexpected
traffic. Hence, this traffic is considered analogous to malicious traffic. Note that the separation between these two classes, i.e., unknown and suspicious, is also evident. In Figure 2, we can see that a portion of unknown traffic is mixed with suspicious traffic, which might indicate that this traffic, even with expected destination ports, is malicious. It is also important to note that the same energy threshold (around 140) can be applied to separate the two classes, i.e., benign and malicious, and the results are consistent for all the datasets considered.

Figure 2. Energy histogram of flow samples from simulated traffic (above; CIDDS-001 train/test simulated) and from the external server traffic (below; CIDDS-001 train simulated, test external). The energy classification threshold, defined as the 95th percentile of the training distribution, is shown in red.

In addition to that,
we observe that the classification threshold, although defined on a specific training distribution, can be applied to a different data distribution or domain. Such an observation supports the claim that EFC is adaptable to different domains and does not overfit the data. In the following subsection, classification results are shown for different classifiers and compared with the results obtained for EFC.

Figure 3. Energy histograms for classification tests performed on samples coming from the same day as training (first row) and samples coming from a different day (second row) for both the CICIDS17 (first column; Friday working hours) and CICDDoS19 (second column; DrDoS NTP) datasets. The energy classification threshold, defined as the 95th percentile of the training distribution, is shown in red.
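As a concrete illustration of the energy-based decision rule discussed above, below is a minimal sketch of a Potts-style energy over discretized flow features. The containers `fields` and `couplings` are hypothetical representations of the inferred local fields h_i(a) and pairwise couplings J_ij(a, b); the inference procedure itself is described in the paper's methods and is not shown here.

```python
# Illustrative energy function for a Potts-like model over
# discretized flow features. Missing (feature, value) pairs
# default to zero contribution.
def flow_energy(flow, fields, couplings):
    """Lower energy -> more similar to the benign training traffic."""
    e = 0.0
    n = len(flow)
    for i in range(n):
        # Local field term for feature i taking value flow[i].
        e -= fields[i].get(flow[i], 0.0)
        for j in range(i + 1, n):
            # Pairwise coupling term for the (i, j) feature pair.
            e -= couplings[i][j].get((flow[i], flow[j]), 0.0)
    return e

def classify(flow, fields, couplings, cutoff):
    # Flows with energy above the cutoff are flagged as malicious.
    return flow_energy(flow, fields, couplings) > cutoff
```

A flow whose feature values (and value pairs) were frequent in the benign training set accumulates large field and coupling contributions, and therefore a low energy; rare combinations receive little "support" and end up above the cutoff.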
B. Comparative analysis of EFC’s performance
We compared EFC to seven different ML classifiers found in [12], whose implementations are available online on GitHub (https://github.com/vinayakumarr/Network-Intrusion-Detection). The classifiers considered here are: K-Nearest Neighbors (KNN) [38], Decision Tree (DT) [39], [40], Adaboost [41], Random Forest (RF) [42], ANN [43], Naive Bayes (NB) [44], and SVM [45], all deployed with their default scikit-learn configurations. Flow features were discretized only for EFC (see Supplementary Information: Table XI), since discretization would probably impair the performance of most ML algorithms. The metrics used to compare the results were the F1 score and the area under the ROC curve (AUC). The first metric, the F1 score, is the harmonic mean of Precision and Recall, i.e.,

F1 = 2 · (Precision · Recall) / (Precision + Recall),    (13)

where Precision = TP / (TP + FP) and Recall = TP / (TP + FN); TP are the true positives, i.e., malicious traffic classified as malicious; FP are the false positives, i.e., benign traffic classified as malicious; and FN are the false negatives, i.e., malicious traffic classified as benign. The second metric, the AUC, is one of the most widespread evaluation metrics for binary classifiers [46], [47]. The ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. The AUC corresponds to the probability that a randomly chosen positive example will receive a higher score than a randomly chosen negative one. One of the main advantages of the AUC is that it is invariant to changes in class distribution, i.e., the ROC curve will not change if the class distribution changes in a test set but the underlying conditional distributions from which the data are drawn stay the same [48], [47]. Since we are interested in evaluating domain adaptation, this metric is particularly well suited to this work.
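Both metrics can be computed directly with scikit-learn, which the compared classifiers already rely on. The labels and scores below are toy values for illustration only, with 1 denoting the malicious (positive) class.

```python
# F1 from hard predictions, AUC from ranking scores.
from sklearn.metrics import f1_score, roc_auc_score

y_true   = [1, 1, 1, 0, 0, 0]
y_pred   = [1, 1, 0, 0, 0, 1]               # hard labels -> F1
y_scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]   # ranking scores -> AUC

# TP = 2, FP = 1, FN = 1, so Precision = Recall = 2/3 and F1 = 2/3.
f1 = f1_score(y_true, y_pred)

# 8 of the 9 (positive, negative) pairs are ranked correctly: AUC = 8/9.
auc = roc_auc_score(y_true, y_scores)
```

Note that the AUC is computed from continuous scores (for EFC, the flow energies could serve this role), whereas the F1 score requires a fixed classification threshold.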
1) CIDDS-001:
To evaluate the performance of EFC compared to different ML algorithms, we constructed two test sets using a subset of the CIDDS-001 dataset. Test set I is composed solely of simulated traffic flow samples, while test set II is composed of external traffic flow samples (with no samples in common between the two sets). Dataset undersampling was performed to obtain a more homogeneous distribution of the different malicious traffic subclasses. Details about how this undersampling was performed are given in Supplementary Information: Appendix A.
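The undersampling idea can be illustrated with a minimal per-class sketch. The exact procedure used by the authors is described in their supplementary material; the function below, including its name and parameters, is a hypothetical simplification of equalizing subclass counts.

```python
# Cap every subclass at `per_class` samples, drawn at random.
import random
from collections import defaultdict

def undersample(flows, labels, per_class, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for flow, label in zip(flows, labels):
        by_class[label].append(flow)
    sampled = []
    for label, items in sorted(by_class.items()):
        # Classes smaller than the cap are kept whole.
        k = min(per_class, len(items))
        sampled.extend((f, label) for f in rng.sample(items, k))
    return sampled
```

This prevents a single over-represented attack subclass (e.g., one DoS variant) from dominating the training and test distributions.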
Table V. Classification results: performance of different classifiers trained with CIDDS-001 - simulated traffic
Train/Test simulated | Train simulated, test external
Classifier | F1 score | AUC | F1 score | AUC

It is clear from Table V that most classifiers achieve high values of both F1 score and AUC over the test set containing simulated traffic. For instance, DT, Adaboost, and RF achieved an F1 score and AUC above 99%. EFC also achieved good results, with an F1 score of 0.957.
2) CICIDS17:
Two different experiments were performed to evaluate the performance of the classifiers over CICIDS17. In each experiment, two test sets were constructed: one comprising traffic of one specific day, and the other comprising traffic of all other days except the chosen one (see Supplementary Information: Appendix B). Days in which there were not enough samples to compose a test set were left out.
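The per-day split just described can be sketched as follows; `flows_by_day` is a hypothetical mapping from day name to a list of flow records, not a structure from the paper.

```python
# Build the two test sets for one experiment: one from the chosen
# day, the other from all remaining days combined.
def make_test_sets(flows_by_day, chosen_day, min_samples=1):
    same_day = flows_by_day[chosen_day]
    other_days = [f for day, flows in flows_by_day.items()
                  if day != chosen_day for f in flows]
    # Days without enough samples would be skipped entirely.
    if len(same_day) < min_samples or len(other_days) < min_samples:
        return None
    return same_day, other_days
```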
Table VI. Classification results: performance of different classifiers trained with CICIDS17 - Friday working hours
Train/Test same day | Train/Test different days
Classifier | F1 score | AUC | F1 score | AUC

Table VI shows the results obtained when the training was performed on Friday. When tested on data from the same day, DT, Adaboost, and RF obtained the best results, with both F1 score and AUC above 99%. EFC also performed well, achieving an F1 score of 0.952.

Table VII. Classification results: performance of different classifiers trained with CICIDS17 - Wednesday working hours
Train/Test same day | Train/Test different days
Classifier | F1 score | AUC | F1 score | AUC

Despite the favorable results obtained in the first experiment, Table VII shows that Adaboost and RF outperformed EFC when trained on Wednesday. Adaboost and RF achieved better results both when training and testing on the same day and when training and testing on different days. Such results might be due to the greater diversity of attack types present on Wednesday compared to Friday, which gives more information to the ML algorithms. However, even if it was not the best classifier, EFC performed well in both tests, achieving over 99% AUC in the first test and over 95% AUC in the second. It is also worth mentioning that, unlike the other ML classifiers, EFC does not require malicious samples to achieve these results. Therefore, these results are aligned with the others presented so far, indicating that EFC adapts easily to different domains with less training information, i.e., without malicious samples.
3) CICDDoS19:
Three separate experiments were performed to evaluate the performance of different classifiers over the CICDDoS19 dataset. In each experiment, two test sets were constructed: one comprising traffic of one specific day, and the other comprising traffic of all other days except the chosen one (see Supplementary Information: Appendix C). Days in which there were not enough samples to compose a test set were left out.
Table VIII. Classification results: performance of different classifiers trained with CICDDoS19 - DDoS NTP
Train/Test same day | Train/Test different days
Classifier | F1 score | AUC | F1 score | AUC

Table VIII shows the classification results obtained when training with only attacks of the type DDoS NTP. With training and testing in the same context, once more DT, Adaboost, and RF outperform the other classifiers, achieving over 99% on both the F1 score and AUC. EFC achieved an F1 score of 0.968.

Table IX. Classification results: performance of different classifiers trained with CICDDoS19 - Syn

Train/Test same day | Train/Test different days
Classifier | F1 score | AUC | F1 score | AUC

In contrast, when trained with only the Syn (Table IX) or TFTP (Table X) attack types, EFC does not outperform the other classifiers. This is expected in this dataset, since different types of volumetric DoS attacks share similar characteristics, making domain adaptation easier for classical ML algorithms. In the experiments presented, Adaboost and RF were the classifiers that obtained the best results, both when training and testing in the same context and when training in one context and testing in another. It is worth mentioning that, despite not being the best algorithm, EFC achieved very good results in all tests performed, using only half of the information in the training phase.
4) Average results:
Finally, Table XI shows the average performance of each classifier, considering all tests performed with all datasets. It is possible to observe that RF outperforms the other classifiers when trained and tested in the same domain. However, for the case where the classifiers are trained in one
Table X. Classification results: performance of different classifiers trained with CICDDoS19 - TFTP
Train/Test same day | Train/Test different days
Classifier | F1 score | AUC | F1 score | AUC

domain and tested in another, EFC outperforms the classical ML-based classifiers, achieving the best average F1 score and AUC.

Table XI. Classification results: average performance of different classifiers

Train/Test same domain | Train/Test different domains
Classifier | F1 score | AUC | F1 score | AUC

Taken as a whole, the results presented in this subsection show that, on average, EFC is better at adapting to other domains than classical ML-based classifiers (see Table XI). In addition, EFC achieves AUC values similar to those of the best ML algorithms, showing that it is capable of performing well even when trained with only half of the information that the other classifiers use. Not using malicious samples in the training phase is likely the reason why EFC adapts so well to other domains. On the other hand, this feature might also contribute to lower performance in specific scenarios, e.g., when malicious traffic shares many characteristics with benign traffic. Moreover, EFC's increased capability for domain adaptation when there is a significant change in data distribution is a highly desirable trait in network flow-based classifiers, since changes in traffic composition are expected to be very frequent and new kinds of attacks are generated continuously. In the next section, we present our conclusions and future work directions.

VI. CONCLUSION
In this work, we presented a new flow-based classifier for network intrusion detection called the Energy-based Flow Classifier (EFC). In EFC's training phase, a statistical model is inferred based solely on benign traffic samples. Afterward, this statistical model is used to classify network flows as benign or malicious based on "energy" values. Our results show that EFC is capable of correctly performing network flow binary classification on three different datasets. The F1 score (96% at best) and AUC (99% at best) values obtained using EFC are comparable to the values obtained with classical ML-based classifiers, such as Random Forest, K-Nearest Neighbors, and Artificial Neural Networks, even though EFC uses only half of the information in the training phase compared to the other algorithms.

In addition, we analyzed the different classifiers' capabilities for domain adaptation and observed that EFC is better suited to it than classical ML-based algorithms. In three out of the six experiments performed to evaluate this over different datasets, EFC outperformed the other classifiers. In the cases in which EFC was outperformed, the adaptation was not difficult for most of the algorithms, meaning that the data distributions were not that different across domains. We understand that EFC's capability for domain adaptation is probably linked to the fact that the model inference performed in the training phase is based only on benign samples, which prevents overfitting.

Considering the advantages presented, we believe EFC to be a promising algorithm to perform flow-based traffic classification. Nevertheless, despite the promising results achieved, there is still room for further testing and improvement. For instance, to obtain a more homogeneous distribution of different attack types, we performed dataset undersampling, which might have had some effect on the results.
Hence, in future work, we aim at performing a more comprehensive investigation of EFC's applicability to real-world data and different contexts, such as fraud analysis in bank data.

ACKNOWLEDGMENT

The authors would like to thank Luís Paulo Faina Garcia for helping with dataset analysis. Matt Bishop was supported by the National Science Foundation under Grant Number OAC-1739025. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. João Gondim gratefully acknowledges the support from the project "EAGER: USBRCCR: Collaborative: Securing Networks in the Programmable Data Plane Era", funded by NSF (National Science Foundation) and RNP (Brazilian National Research Network).

REFERENCES
[2] Securing the Internet of Things: Concepts, Methodologies, Tools, and Applications. IGI Global, 2020, pp. 481–497.
[3] M. Ring, S. Wunderlich, D. Scheuring, D. Landes, and A. Hotho, "A survey of network-based intrusion detection data sets," Computers & Security, 2019.
[4] A. Sperotto, G. Schaffrath, R. Sadre, C. Morariu, A. Pras, and B. Stiller, "An overview of IP flow-based intrusion detection," IEEE Communications Surveys and Tutorials, vol. 12, no. 3, pp. 343–356, 2010. [Online]. Available: http://ieeexplore.ieee.org/document/5455789/
[5] M. F. Umer, M. Sher, and Y. Bi, "Flow-based intrusion detection: Techniques and challenges," Computers and Security, vol. 70, pp. 238–254, Sep. 2017. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0167404817301165
[6] K. Bartos, M. Sofka, and V. Franc, "Optimized invariant representation of network traffic for detecting unseen malware variants," in USENIX Security Symposium (USENIX Security 16), 2016, pp. 807–822.
[7] H. Li, Z. Chen, R. Spolaor, Q. Yan, C. Zhao, and B. Yang, "Dart: Detecting unseen malware variants using adaptation regularization transfer learning," in ICC 2019 - 2019 IEEE International Conference on Communications (ICC). IEEE, 2019, pp. 1–6.
[8] C. Rudin, "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead," Nature Machine Intelligence, vol. 1, no. 5, pp. 206–215, 2019.
[9] M. Ring, S. Wunderlich, D. Grüdl, D. Landes, and A. Hotho, "Flow-based benchmark data sets for intrusion detection," in Proceedings of the 16th European Conference on Cyber Warfare and Security. ACPI, 2017, pp. 361–369.
[10] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, "Toward generating a new intrusion detection dataset and intrusion traffic characterization," in ICISSP, 2018, pp. 108–116.
[11] I. Sharafaldin, A. H. Lashkari, S. Hakak, and A. A. Ghorbani, "Developing realistic distributed denial of service (ddos) attack dataset and taxonomy," in . IEEE, 2019, pp. 1–8.
[12] R. Vinayakumar, K. Soman, and P. Poornachandran, "Evaluating effectiveness of shallow and deep networks to intrusion detection system," in . IEEE, 2017, pp. 1282–1289.
[13] S. Khan, E. Sivaraman, and P. B. Honnavalli, "Performance evaluation of advanced machine learning algorithms for network intrusion detection system," in Proceedings of International Conference on IoT Inclusive Life (ICIIL 2019), NITTTR Chandigarh, India. Springer, 2020, pp. 51–59.
[14] A. M. Mahfouz, D. Venugopal, and S. G. Shiva, "Comparative analysis of ml classifiers for network intrusion detection," in Fourth International Congress on Information and Communication Technology. Springer, 2020, pp. 193–207.
[15] X. Tan, S. Su, Z. Huang, X. Guo, Z. Zuo, X. Sun, and L. Li, "Wireless sensor networks intrusion detection based on smote and the random forest algorithm," Sensors, vol. 19, no. 1, p. 203, 2019.
[16] J. Kazemitabar, R. Taheri, and G. Kheradmandian, "A novel technique for improvement of intrusion detection via combining random forest and genetic algorithm," 2019.
[17] T. T. Bhavani, M. K. Rao, and A. M. Reddy, "Network intrusion detection system using random forest and decision tree machine learning techniques," in First International Conference on Sustainable Technologies for Computational Intelligence. Springer, 2020, pp. 637–643.
[18] N. Moustafa, B. Turnbull, and K.-K. R. Choo, "An ensemble intrusion detection technique based on proposed statistical flow features for protecting network traffic of internet of things," IEEE Internet of Things Journal, 2018.
[19] B. A. Tama and K.-H. Rhee, "Attack classification analysis of iot network via deep learning approach," Research Briefs on Information & Communication Technology Evolution (ReBICTE), vol. 3, pp. 1–9, 2017.
[20] M. Idhammad, K. Afdel, and M. Belouch, "Distributed intrusion detection system for cloud environments based on data mining techniques," Procedia Computer Science, vol. 127, pp. 35–41, 2018.
[21] ——, "Detection system of http ddos attacks in a cloud environment based on information theoretic entropy and random forest," Security and Communication Networks, vol. 2018, 2018.
[22] A. Verma and V. Ranga, "Statistical analysis of cidds-001 dataset for network intrusion detection systems using distance-based machine learning," Procedia Computer Science, vol. 125, pp. 709–716, 2018.
[23] M. Ring, D. Landes, and A. Hotho, "Detection of slow port scans in flow-based network traffic," PLoS ONE, vol. 13, no. 9, p. e0204507, 2018.
[24] R. Abdulhammed, M. Faezipour, A. Abuzneid, and A. AbuMallouh, "Deep and machine learning approaches for anomaly-based intrusion detection of imbalanced network traffic," IEEE Sensors Letters, vol. 3, no. 1, pp. 1–4, Jan. 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8526292/
[25] A. Yulianto, P. Sukarno, and N. A. Suwastika, "Improving adaboost-based intrusion detection system (ids) performance on cic ids 2017 dataset," in Journal of Physics: Conference Series, vol. 1192, no. 1. IOP Publishing, 2019, p. 012018.
[26] D. Aksu, S. Üstebay, M. A. Aydin, and T. Atmaca, "Intrusion detection with comparative analysis of supervised learning techniques and fisher score feature selection algorithm," in International Symposium on Computer and Information Sciences. Springer, 2018, pp. 141–149.
[27] J. Li, M. Liu, Z. Xue, X. Fan, and X. He, "Rtvd: A real-time volumetric detection scheme for ddos in the internet of things," IEEE Access, vol. 8, pp. 36191–36201, 2020.
[28] Y. Jia, F. Zhong, A. Alrawais, B. Gong, and X. Cheng, "Flowguard: An intelligent edge defense mechanism against iot ddos attacks," IEEE Internet of Things Journal, 2020.
[29] M. P. Novaes, L. F. Carvalho, J. Lloret, and M. L. Proença, "Long short-term memory and fuzzy logic for anomaly detection and mitigation in software-defined network environment," IEEE Access, vol. 8, pp. 83765–83781, 2020.
[30] D. Plonka, "Flowscan: A network traffic flow reporting and visualization tool," in LISA, 2000, pp. 305–317.
[31] A. H. Lashkari, G. Draper-Gil, M. S. I. Mamun, and A. A. Ghorbani, "Characterization of tor traffic using time based features," in ICISSP, 2017, pp. 253–262.
[32] S. Cocco, C. Feinauer, M. Figliuzzi, R. Monasson, and M. Weigt, "Inverse statistical physics of protein sequences: A key issues review," Reports on Progress in Physics, vol. 81, no. 3, p. 032601, Mar. 2018. [Online]. Available: http://stacks.iop.org/0034-4885/81/i=3/a=032601?key=crossref.353cf55f4345afafde1886d057be92bd
[33] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, and M. Weigt, "Direct-coupling analysis of residue coevolution captures native contacts across many protein families," Proceedings of the National Academy of Sciences of the United States of America
[34] Reviews of Modern Physics, vol. 54, no. 1, pp. 235–268, Jan. 1982. [Online]. Available: https://link.aps.org/doi/10.1103/RevModPhys.54.235
[35] E. T. Jaynes, "Information theory and statistical mechanics. II," Physical Review, vol. 108, no. 2, pp. 171–190, 1957. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRev.106.620
[36] B. Giraud, J. M. Heumann, and A. S. Lapedes, "Superadditive correlation," Physical Review E, vol. 59, no. 5, p. 4983, 1999.
[37] A. Georges and J. S. Yedidia, "How to expand around mean-field theory using high-temperature expansions," Journal of Physics A: Mathematical and General, vol. 24, no. 9, p. 2173, 1991.
[38] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[39] J. R. Quinlan, "Simplifying decision trees," International Journal of Man-Machine Studies, vol. 27, no. 3, pp. 221–234, 1987.
[40] P. H. Swain and H. Hauska, "The decision tree classifier: Design and potential," IEEE Transactions on Geoscience Electronics, vol. 15, no. 3, pp. 142–147, 1977.
[41] Y. Freund, R. E. Schapire et al., "Experiments with a new boosting algorithm," in ICML, vol. 96. Citeseer, 1996, pp. 148–156.
[42] T. K. Ho, "Random decision forests," in Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1. IEEE, 1995, pp. 278–282.
[43] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," The Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, 1943.
[44] D. D. Lewis, "Naive (Bayes) at forty: The independence assumption in information retrieval," in European Conference on Machine Learning. Springer, 1998, pp. 4–15.
[45] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[46] N. Japkowicz and M. Shah, Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, 2011.
[47] D. Brzezinski and J. Stefanowski, "Prequential AUC: properties of the area under the ROC curve for data streams with concept drift," Knowledge and Information Systems, vol. 52, no. 2, pp. 531–562, 2017.
[48] S. Wu, P. Flach, and C. Ferri, "An improved model selection heuristic for AUC," in European Conference on Machine Learning. Springer, 2007, pp. 478–489.
Camila F. T. Pontes is a student at the University of Brasilia (UnB), Brasilia, DF, Brazil. She received her M.Sc. degree in Molecular Biology in 2016 from UnB and is currently an undergraduate student at the Department of Computer Science (CIC/UnB). Her research interests are Computational and Theoretical Biology and Network Security.

Manuela M. C. de Souza is an undergraduate Computer Science student at the University of Brasilia (UnB), Brasilia, DF, Brazil. Her research interest is Network Security.

João J. C. Gondim was awarded an M.Sc. in Computing Science at Imperial College, University of London, in 1987 and a Ph.D. in Electrical Engineering at UnB (University of Brasilia, 2017). He is an adjunct professor at the Department of Computing Science (CIC) at UnB, where he is a tenured member of faculty. His research interests are network, information, and cyber security.

Matt Bishop received his Ph.D. in computer science from Purdue University, where he specialized in computer security, in 1984. His main research area is the analysis of vulnerabilities in computer systems. The second edition of his textbook, Computer Security: Art and Science, was published in 2018 by Addison-Wesley Professional. He is currently a co-director of the Computer Security Laboratory at the University of California, Davis.