A Public Network Trace of a Control and Automation System
Gorby Kabasele Ndonda
Department of Computer Science and Engineering, UCLouvain, Belgium [email protected]
Ramin Sadre
Department of Computer Science and Engineering, UCLouvain, Belgium [email protected]
Abstract: The increasing number of attacks against automation systems such as SCADA and their network infrastructure has demonstrated that there is a need to secure those systems. Unfortunately, directly applying existing ICT security mechanisms to automation systems is hard due to constraints of the latter, such as availability requirements or limitations of the hardware. Thus, the solution favored by researchers is the use of network-based intrusion detection systems (N-IDS). One of the issues that many researchers encounter is how to validate and evaluate their N-IDS. Having access to a real and large automation system for experimentation is almost impossible, as companies are not inclined to give access to their systems due to obvious concerns. The few public traffic datasets that could be used for off-line experiments are either synthetic or collected at small testbeds. In this paper, we describe and characterize a public traffic dataset collected at the HVAC management system of a university campus. Although the dataset contains only packet headers, we believe that it can help researchers, in particular designers of flow-based IDS, to validate their solutions under more realistic conditions. The traces can be found at https://github.com/gkabasele/HVAC_Traces.
Keywords
SCADA, BMS, network traces, flows, IDS
I. INTRODUCTION
Industrial control and automation systems, such as SCADA, have constraints different from traditional enterprise IT systems, especially with regard to availability. For example, some end hosts, such as field devices, have to operate continuously for several decades without being stopped. It is also not uncommon to see SCADA systems using outdated hardware and software, as updating and patching is difficult and expensive. For this reason, network-based Intrusion Detection Systems (N-IDS) have been the solution favored by researchers to secure SCADA systems [6], [4] since, in contrast to host-based IDS, they require neither modifications nor regular updates on the hosts. Among techniques for network-based intrusion detection, flow-based methods have been employed by researchers for several years to build IDS for the Internet [11] and also specifically for SCADA systems [8], [9], [22].
One of the problems that researchers are facing is how to evaluate and validate their IDS solution. Getting access to real control and automation systems is hard because their owners are, for obvious reasons, rarely open to the idea of running experiments in them. For this reason, most researchers rely either on experiments in testbeds or simulated environments, or use traffic traces collected in such environments for off-line experiments [1], [10], [17]. The advantage of these approaches is the reproducibility of the experiments, the flexibility to test different configurations, and the fact that no real system is put in danger. However, they require that the environments and traces used accurately reflect the behavior of a real system, which is challenging to ensure. Publications relying on data from real systems for validation are much rarer, and those that exist have not published their datasets [8], [22], which makes it impossible for other researchers to reproduce them.
To our knowledge, no public dataset from a real SCADA system or any other type of automation system exists so far.
In this paper, we present and describe a public Building Management System (BMS) dataset collected from part of the Heating, Ventilation, and Air Conditioning (HVAC) system of a university campus. A BMS is a type of automation system like SCADA. The dataset contains the headers of the network packets exchanged between field devices, control servers and human machine interfaces (HMI) of the BMS. The dataset does not contain the payload of the packets for privacy and security reasons, but it can be used for purposes where traffic traces on packet header level or flow level are needed, such as
• flow-level traffic characterization,
• parametrization of traffic models for synthetic traffic generation,
• experimentation with realistic background traffic load.
To the best of our knowledge, the dataset only contains benign traffic. This means that the traces cannot be directly used to validate flow-based intrusion detection systems due to the absence of attacks (unless the goal is to evaluate false positives). However, researchers can combine them with malicious traffic or use them to generate synthetic traffic, as indicated above.
The structure of this paper is as follows: Section II describes the system from which the trace has been collected. Section III explains the methodology used to collect the trace. Section IV presents important traffic characteristics and findings, followed by a discussion in Section V. Finally, the paper concludes in Section VI.
II. DESCRIPTION
In this section, we describe the BMS from which the trace has been collected. For privacy and security reasons, we do not publish certain details, such as the deployed control software, information related to non-BMS services, or the physical configuration of the communication infrastructure.
A. HVAC Management System
The trace contains the network communication of an HVAC management system. The system has been deployed by and is managed by Honeywell.
The system is fully automated, with a server communicating with peripheral devices deployed on the campus to control heating in offices, classrooms, and lecture halls. The part of the system that we have monitored controls around 15 to 20 buildings. It has been optimized for energy efficiency: when a teacher books a classroom or lecture hall via the booking application, a message is sent to the main server so that it can instruct the PLC or RTU responsible for that room to activate the heating several minutes before people arrive. Operators access the system through the Human Machine Interface (HMI), which consists of dedicated workstations with a graphical user interface, to see the current system state in real time or to set temperature goals manually.
B. System overview
In terms of topology, the BMS follows a typical design for automation systems [12]. A communication network interconnects the control server, the HMI stations, and the peripheral devices. The sensors and actuators are attached to the latter. The peripheral devices are therefore similar to PLCs and RTUs in SCADA systems: they collect and send sensor data to the server and their behavior can be (manually) influenced from the HMIs, but most of the time they execute simple control programs with setpoints defined by the server. The network is isolated from the rest of the campus network. The communication protocols are proprietary and use TCP/IP as transport layer.
It is important to note that the peripheral devices are not directly connected to the network. Instead, a group of devices is connected to a gateway which acts as an end point for the TCP connection from the control server (Figure 1). This also allows the use of devices that are not IP-capable. In the part of the network that we consider for our dataset, there are 8 such gateways, each one connected to 1 to 15 peripheral devices.
III. METHODOLOGY
In this section, we describe the methodology we used to collect the BMS traffic as well as the post-processing applied to obtain the dataset.
A. Packet collection
In order to collect all network communication in the BMS network, we capture packets at two routers via port mirroring. The usage of two vantage points is necessary because the paths between the server and the peripheral devices are asymmetric.
Figure 1: Network overview (HMI, control server, network, gateway, peripheral devices, sensors and actuators; the monitored part of the system is marked).
We use tcpdump to capture the packets. For each packet, tcpdump records a timestamp of when the packet was captured based on the clock of the capturing device. Since we use two devices, we synchronize their clocks with NTP so that packets later appear in the correct order when merging the individual packet traces based on the timestamps. Unfortunately, our first test runs showed that this clock synchronization is not sufficient: time shifts ranging from 3 ms to 10 ms can be observed, resulting in wrong packet orders.
To obtain a more accurate timing, we periodically compute the shift between the two clocks and correct the recorded timestamps accordingly when merging the traces. To this end, capture device A sends an ICMP echo request to device B every second, and device B answers with an ICMP echo reply. These ICMP packets appear in the traces of both devices. The current time shift $s$ between the two clocks is then estimated by
$$s = t_B - t_A - \frac{RTT}{2}$$
where $t_i$ is the timestamp of the ICMP echo request as seen by device $i$. The round-trip time $RTT$ between the two devices is estimated as
$$RTT = t'_A - t_A$$
where $t'_A$ is the timestamp of the ICMP echo reply as seen by device A. Computing the network delay between the two devices in this way assumes that the delay is identical in both directions.
Using this approach, the number of out-of-order packets in a trace of one hour duration can be reduced from 8500 to 85. By increasing the frequency of the ICMP packets, we could further decrease this number, but we do not want to send too many packets and add artificial load to the network.
B. Merging and anonymization process
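The clock-shift estimate from Section III-A, on which the merging step relies, can be sketched as follows. This is a minimal illustration and not the authors' tooling: the `IcmpProbe` record, the function names, and the sign convention of `correct_timestamp` (mapping device-A timestamps onto device B's clock by adding $s$) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class IcmpProbe:
    """Timestamps (in seconds) of one ICMP echo exchange; names are hypothetical."""
    t_a_request: float  # echo request as seen by device A (t_A)
    t_b_request: float  # the same echo request as seen by device B (t_B)
    t_a_reply: float    # echo reply as seen by device A (t'_A)


def estimate_shift(probe: IcmpProbe) -> float:
    """Estimate s = t_B - t_A - RTT/2, assuming a symmetric network delay."""
    rtt = probe.t_a_reply - probe.t_a_request
    return probe.t_b_request - probe.t_a_request - rtt / 2.0


def correct_timestamp(ts_a: float, shift: float) -> float:
    """Express a device-A timestamp on device B's clock (assumed convention)."""
    return ts_a + shift
```

With a request seen at 100.000 s on A and 100.007 s on B, and the reply seen at 100.004 s on A, the round-trip time is 4 ms and the estimated shift is about 5 ms.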
When merging the packet traces from the two capture devices, we calculate the time shift s for every ICMP echo request using the method described above. We then adjust the timestamps of all packets captured by device A in the second following the ICMP request, i.e. the period of time until the next ICMP request. At the end, we obtain merged and time-corrected traces containing the packets captured by the two devices. In a filtering step, we only keep packets that either originate from or are destined to a host that is part of the HVAC system, typically the control server, an HMI, or a gateway. We remove packets from protocols that are purely related to network management, namely ARP, STP, VRRP, and LLMNR. The original traces also contain SNMP traffic sent by the workstations to several devices inside and outside the network. Unfortunately, only the SNMP requests are visible to our capture devices, so we remove them. We publish the trace in the form of pcap files where each pcap file contains one hour of traffic.
The owner of the BMS has required that the dataset be anonymized before publication. To do so, several steps were taken. First, we use the CryptoPan tool [13] to anonymize the IP addresses in a prefix-preserving manner. This tool uses cryptographic techniques and has the advantage of being consistent across several files, provided that the same key is used for the anonymization. We then truncate the packet payload to only keep the headers and zero out the Organizationally Unique Identifier (OUI) of the MAC address.
C. Ethical considerations
Before publishing the traces, we consulted the parties that would be affected by having the traces publicly available, following the advice in [5]. These parties included the Chief Information Security Officer of the university where the dataset was collected and Honeywell, the company that deployed the HVAC system, manufactures the devices, and authored the protocols used by them. We obtained their agreement after explaining what we wanted to publish and why.
During the processing of the traces, we made sure that no personal data was collected and that the traces cannot be used to infer the behavior of individual people. The dataset only contains network traffic generated automatically, meaning that it is not the direct result of human action, except for some part of the communication between the HMI workstations and the control server.
IV. TRAFFIC CHARACTERISTICS
This section presents the main characteristics of the collected traces. We will discuss some of these characteristics in terms of unidirectional flows. We define a flow using the usual five-tuple (source IP address, destination IP address, source port, destination port, transport protocol). When we refer to flow size or packet size in the following, we always mean the size of the payload, excluding headers. For TCP flows, we only consider the ones where at least one byte of payload was exchanged, thus removing connections which failed to be established. Note that our flow definition ignores TCP flags (SYN, FIN etc.), i.e. a flow more or less corresponds to the definition of a flow record in NetFlow/IPFIX [7] with active and inactive timeout set to infinity.
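The flow definition above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Packet` record and `aggregate_flows` function are hypothetical names, the payload length is assumed to have been derived beforehand from the header length fields (the published traces carry no payload), and "at least one byte of payload was exchanged" is read here as "in either direction of the TCP connection".

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """One parsed packet header; a hypothetical record for this example."""
    ts: float
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str        # "tcp" or "udp"
    payload_len: int  # payload bytes, headers excluded


FlowKey = Tuple[str, str, int, int, str]


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, dict]:
    """Group packets into unidirectional five-tuple flows without any timeout."""
    flows: Dict[FlowKey, dict] = {}
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.proto)
        f = flows.setdefault(
            key, {"packets": 0, "bytes": 0, "first": p.ts, "last": p.ts}
        )
        f["packets"] += 1
        f["bytes"] += p.payload_len
        f["first"] = min(f["first"], p.ts)
        f["last"] = max(f["last"], p.ts)

    def carried_payload(key: FlowKey) -> bool:
        # Assumed reading: payload in this flow or in its reverse direction.
        reverse = (key[1], key[0], key[3], key[2], key[4])
        return flows[key]["bytes"] > 0 or (
            reverse in flows and flows[reverse]["bytes"] > 0
        )

    # Drop TCP flows belonging to connections that never carried payload.
    return {k: f for k, f in flows.items() if k[4] != "tcp" or carried_payload(k)}
```

Because no timeout is applied, a long-lived TCP connection yields exactly one flow per direction, which matches the "timeouts set to infinity" reading of the definition.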
A. Overview
We collected the traces over a period of one week (7 days) in order to capture the behavior of the BMS on weekdays as well as during the weekend. The capture started on a Friday afternoon at approximately 2PM. Table I shows a summary of the part of the HVAC system we have monitored. The right three columns give basic statistics of the bidirectional communication between HMI ↔ server and server ↔ gateways, respectively. The peripheral devices are not IP-capable, so we could not capture the traffic between them and the gateways.
Host type        Number  Communicates with  Payload exchanged  Packets exchanged
HMI stations     7       Control server     6.7 GB             24.7M
Control server   1       Gateways           387.7 MB           11M
Gateways         8       Periph. devices    –                  –
Periph. devices  58      –                  –                  –
Table I: HVAC summary
In total, we collected 11.2 GB of packet data (corresponding to 4 GB of packet headers) from 47 IP addresses and approximately 23,000 flows. Table II shows some characteristics of the entire dataset and the flows. As expected of an automation network, flows are small: even the largest one transported only 614.7 MBytes over one week.
Dataset                  Flows  Bytes     Packets
Size        11.2 GB      min    5 B       1
Duration    7 days       avg    342.2 kB  1725.08
Nbr flows   23K          max    614.7 MB  5.1M
Nbr IP      47
Table II: Trace summary
Figure 2 shows the empirical CDF of (a) the number of packets per flow, (b) the number of bytes per flow, and (c) the flow duration. The CDFs show that there is a significant number of flows with a very long duration and a large number of packets, although the distribution of the number of bytes per flow is rather conservative. Such a behavior is typical for automation systems where the server and the peripheral devices periodically exchange small control packets through more or less permanent TCP connections. In fact, all those connections in our trace lasted over the entire measurement period.
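For readers who want to reproduce curves like those in Figure 2 from their own flow table, an empirical CDF can be computed as sketched below. The function name `ecdf` is hypothetical and the input values in the usage note are made up for illustration; they are not taken from the dataset.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def ecdf(values: Sequence[float]) -> Callable[[float], float]:
    """Return F where F(x) is the fraction of samples that are <= x."""
    if not values:
        raise ValueError("ecdf needs at least one sample")
    xs = sorted(values)
    n = len(xs)

    def F(x: float) -> float:
        # bisect_right counts how many sorted samples are <= x.
        return bisect_right(xs, x) / n

    return F
```

For example, with made-up per-flow packet counts `[1, 1, 2, 10]`, `ecdf(...)(1)` evaluates to 0.5, since half of the flows have at most one packet.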
B. Application protocols
Most of the observed traffic is from communication between the HMI stations and the control server. The HMI stations and the control server communicate through several ports. First, they exchange Remote Procedure Calls for Distributed Computing Environment (DCE/RPC) packets [15]. The control server acts as the DCE/RPC server on port 135. Surprisingly, an inspection of the packets revealed that other ports were used as well to exchange DCE/RPC packets. Those ports were not fixed; they changed over the course of the capture. The protocol used for the communication between the server and the gateways is proprietary (port 2499).
In addition to the protocols used between the different components of the HVAC system, we observed other protocols. These include protocols from the NetBIOS family (NBDS, NBNS, NBSS), which are used by the HMI workstations and the control server to exchange data with each other and to discover available services in the network.
We also observed some S7 communication, a protocol used by Siemens PLCs. This traffic was related to another part of the BMS, but we kept it in the dataset as it demonstrates an example of heterogeneity in automation systems. Web traffic was also observed; we discuss it in Section IV-E.
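Since the published traces contain no payload, a first rough labeling of flows has to rely on port numbers. The sketch below illustrates this under explicit assumptions: only ports 135 and 2499 come from the text above, while the NetBIOS (137, 138, 139), S7 (102), HTTP/HTTPS and DNS ports are the standard assignments and were not confirmed for this dataset. The table and function names are hypothetical. Note that DCE/RPC traffic on the dynamically chosen ports mentioned above would remain "unknown" with this approach.

```python
from typing import Dict, Tuple

# (transport protocol, port) -> label. Apart from 135 and 2499, the entries
# are standard port assignments assumed for illustration.
PORT_LABELS: Dict[Tuple[str, int], str] = {
    ("tcp", 135): "DCE/RPC",
    ("tcp", 2499): "BMS proprietary (server <-> gateway)",
    ("udp", 137): "NBNS",
    ("udp", 138): "NBDS",
    ("tcp", 139): "NBSS",
    ("tcp", 102): "S7",
    ("tcp", 80): "HTTP",
    ("tcp", 443): "HTTPS",
    ("udp", 53): "DNS",
    ("tcp", 53): "DNS",
}


def label_flow(proto: str, src_port: int, dst_port: int) -> str:
    """Label a flow by its server-side port, checking the destination first."""
    for port in (dst_port, src_port):
        label = PORT_LABELS.get((proto, port))
        if label is not None:
            return label
    return "unknown"
```

Checking both ports lets the same table label both directions of a connection, which matters because the flows in this paper are unidirectional.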
C. Network activity time series
As mentioned in Section II-B, the peripheral devices are connected to the IP network through gateways. The various devices differ in age and capabilities and therefore exhibit different communication characteristics. We illustrate this in the following with two gateways X and Y which connect 11 and 13 peripheral devices, respectively. In the part of the BMS monitored by us, 6 gateways have the same profile as gateway X and 2 gateways are similar to gateway Y, but with different numbers of peripheral devices connected to them.
We plot the time series of the number of packets and number of bytes exchanged per hour between the main server and the gateways. Figures 3a and 3b (resp. 4a and 4b) show these plots for gateway X (resp. Y). We note some differences between the two gateways. First, despite having a comparable number of devices attached to them, the number of packets observed at gateway Y is considerably higher than at gateway X. A similar observation can be made for the number of exchanged bytes. We also notice how differently the packets are generated in both directions. For gateway X, the number of packets sent to the server is close to the number of packets sent in the other direction. In fact, most of the packets sent by the server to gateway X are ACK segments. The server behaves differently in the case of gateway Y: it actively sends segments with control commands to the gateway. Since the gateways are connected to devices of different models, this suggests that the devices connected to gateway Y are more data-intensive.
Finally, we observe diurnal and weekly patterns in the time series. The measurement was started on a Friday at around 2PM, therefore the first 55 hours of low activity in the time series correspond to the weekend. This behavior is understandable when considering the nature of the physical processes controlled by the BMS, but it should be noted that not all automation systems necessarily behave like this [2].
Interestingly, gateway Y shows lower activity during the night than during the weekend, although the buildings are not used in either situation. Our assumption is that a kind of sleep mode is enabled during the night, which decreases the frequency at which data are exchanged.
D. Packet size distribution
SCADA network protocols such as Modbus/TCP [19] or DNP3 require that the payload fits in a single IP packet to avoid packet fragmentation. These protocols aim at low latencies or even real-time usage, so packet fragmentation is undesired. We study the sizes of all packets sent from both gateways to the server. Gateway X sent 255,251 packets and gateway Y sent 5,183,021 packets. Figures 3c and 4c show the normalized histograms of the observed packet payload sizes in bytes for each gateway. Similar to SCADA systems, the packet sizes are rather small in both cases: they do not exceed 170 bytes. Moreover, only a few different sizes can be observed. As with the other characteristics, there is a difference between the two gateways. Most of the packets sent by gateway X contain actual data and only a small percentage (approx. 20%) are empty ACKs, while gateway Y sends more than 2 million empty packets.
E. Flow stability
When comparing traditional ICT networks and automation networks, it is commonly accepted that the latter are more stable in terms of traffic activities, such as the number of TCP connections established. This is also mostly true for the BMS network studied in this paper.
We show in Figure 5 the number of new flows discovered per hour. For this figure, we have counted flows directly related to the HVAC system, i.e. flows between the control server, the HMI workstations and the gateways, separately from other UDP and TCP flows. As can be seen in the figure, except for the first hour when we start the capture, the number of flows discovered for the HVAC system is mostly constant. In fact, flows from the gateways to the control server lasted for the whole duration of the capture and no new flows were created. The peaks visible in the HVAC curve result from the change of the port used by the control server for DCE/RPC. The number of TCP flows discovered usually does not exceed 500, except at a few points where high peaks are visible. The TCP and UDP flows, apart from those of the HVAC system, are mostly HTTP/HTTPS and DNS flows created when the HMI stations check whether there are new updates. The peaks are caused by actual software updates that took place during the measurement period. We have deliberately kept these "anomalies" in the dataset as they illustrate that even stable BMS systems can show irregular traffic patterns which might confuse intrusion detection systems.
V. DISCUSSION AND RELATED WORK
The characteristics of network traffic in industrial control and automation systems have been discussed in several publications. The SCADA network traffic of different utility companies is studied in [2], [3]. In [8] and [22], the authors analyze the network traffic of BACnet-based BMS. To our knowledge, none of the datasets used in those and similar publications is publicly available due to privacy and security concerns. Therefore, researchers without access to a real system have to rely on public datasets collected in testbeds and virtual environments [1], [17] or set up their own infrastructure for experiments, for example in [18].
Figure 2: Flow characteristics. (a) CDF of the number of packets per flow; (b) CDF of the number of bytes (in kBytes) per flow; (c) CDF of the flow duration (in hours).
Figure 3: Traffic observed at gateway X. (a) Packets per hour; (b) KBytes per hour; (c) Packet size distribution.
Figure 4: Traffic observed at gateway Y. (a) Packets per hour; (b) KBytes per hour; (c) Packet size distribution.
As explained, our dataset does not contain payload data nor malicious traffic. Nevertheless, it can be used in research on flow-based network intrusion detection systems that do not rely on deep packet analysis. With tools like tcpreplay [16], researchers can replay the traces to obtain (benign) traffic that can be mixed with synthetic attack traffic for their experiments. Moreover, the traces contain valuable packet-level information for anomaly-based intrusion detection [14], such as the timestamp and size of each packet. This data makes it possible to study the dynamics of the HVAC flows, which is useful for creating a model of the system.
The traces can also serve as a basis for trace generation. As mentioned earlier, researchers experiment on traces collected from simulated environments or testbeds since they do not have access to real systems. To be of relevance, the simulated environments must behave as closely as possible to a real system, which is hard to ensure. We believe that the traces can help to improve the accuracy of simulated environments. There is some work on how to generate synthetic traces from existing ones [20], [21]. It relies on statistical models and machine learning techniques to generate traces that have similar properties to the ones they were generated from. We will also work in that direction in the future.
Figure 5: Flows seen per hour.
VI. CONCLUSION
Research on the security of industrial control and automation systems is challenging because researchers usually do not have access to real systems for experimentation, and the existing public datasets for off-line experimentation are rare and stem from testbeds or simulated environments.
In this paper, we present a public BMS network trace collected from the HVAC management system of a university campus. We show that the trace exhibits the common characteristics of automation systems, such as long-lived connections, but also follows a diurnal pattern similar to those caused by human activities in traditional IT networks.
To our knowledge, this is the first public trace from a real BMS. The traces can be found at https://github.com/gkabasele/HVAC_Traces.
REFERENCES
[2] ... Passive and Active Measurement, pages 126–135, 2012.
[3] R. R. R. Barbosa, R. Sadre, and A. Pras. A first look into SCADA network traffic. In ..., pages 518–521, 2012.
[4] Edward J. M. Colbert and Steve Hutchinson. Intrusion Detection in Industrial Control Systems, pages 209–237. Springer International Publishing, Cham, 2016.
[5] D. Dittrich, M. Bailey, and E. Kenneally. Applying Ethical Principles to Information and Communication Technology Research: A Companion to the Menlo Report. Technical report, U.S. Department of Homeland Security, Oct 2013.
[6] Cheung S. et al. Using model-based intrusion detection for SCADA networks. In Proceedings of the SCADA Security Scientific Symposium ...
[8] ... Dependable Networks and Services, pages 62–73. Springer Berlin Heidelberg, 2012.
[9] Rafael R. R. Barbosa et al. Flow whitelisting in SCADA networks. International Journal of Critical Infrastructure Protection, 6(3):150–158, 2013.
[10] Ring M. et al. Flow-based benchmark data sets for intrusion detection. In European Conference on Cyber Warfare and Security, pages 361–369, 2017.
[11] Sperotto A. et al. An overview of IP flow-based intrusion detection. IEEE Communications Surveys Tutorials, 12(3):343–356, 2010.
[12] Stouffer Keith A. et al. SP 800-82. Guide to industrial control systems (ICS) security: Supervisory control and data acquisition (SCADA) systems, distributed control systems (DCS), and other control system configurations such as programmable logic controllers (PLC). Technical report, Gaithersburg, MD, United States, 2011.
[13] Jinliang Fan, Jun Xu, Mostafa H. Ammar, and Sue B. Moon. Prefix-preserving IP address anonymization: measurement-based security evaluation and a new cryptography-based scheme. Computer Networks, 46(2):253–272, 2004.
[14] P. García-Teodoro, J. Díaz-Verdejo, G. Maciá-Fernández, and E. Vázquez. Anomaly-based network intrusion detection: Techniques, systems and challenges. Computers & Security, 28(1):18–28, 2009.
[15] A. M. Khandker, P. Honeyman, and T. J. Teorey. Performance of DCE RPC. In Second International Workshop on Services in Distributed and Networked Environments, pages 2–10, June 1995.
[16] AppNeta Klassen, Fred. Tcpreplay - pcap editing and replaying utilities, 2018.
[17] A. Lemay and J.M. Fernandez. Providing SCADA network data sets for intrusion detection research. In ..., 2016.
[18] Hui Lin, Adam Slagell, Catello Di Martino, Zbigniew Kalbarczyk, and Ravishankar K. Iyer. Adapting Bro into SCADA: Building a specification-based intrusion detection system for the DNP3 protocol. In Proceedings of the Eighth Annual Cyber Security and Information Intelligence Research Workshop, CSIIRW '13, pages 5:1–5:4. ACM, 2013.
[19] Modbus. MODBUS MESSAGING ON TCP/IP IMPLEMENTATION GUIDE V1.0b, 2006.
[20] Markus Ring, Daniel Schlör, Dieter Landes, and Andreas Hotho. Flow-based network traffic generation using generative adversarial networks. CoRR, abs/1810.07795, 2018.
[21] A. Varet and N. Larrieu. How to generate realistic network traffic? In ..., pages 299–304, July 2014.
[22] Z. Zheng and A. L. N. Reddy. Safeguarding building automation networks: The-driven anomaly detector based on traffic analysis. In 2017 26th International Conference on Computer Communication and Networks (ICCCN)