CAN-D: A Modular Four-Step Pipeline for Comprehensively Decoding Controller Area Network Data
Miki E. Verma∗, Robert A. Bridges∗, Jordan J. Sosnowski†, Samuel C. Hollifield∗, Michael D. Iannacone∗
∗ Cyber & Applied Data Analytics Division, Oak Ridge National Laboratory, Oak Ridge, TN; {vermake, bridgesra, hollifieldsc, iannaconemd}@ornl.gov
† Department of Computer Science & Software Engineering, Auburn University; [email protected]
Abstract—Controller area networks (CANs) are a broadcast protocol for real-time communication of critical vehicle subsystems. Original equipment manufacturers (OEMs) of passenger vehicles hold secret their mappings of CAN data to vehicle signals, and these definitions vary per make, model, and year. Without these mappings, the wealth of real-time vehicle information hidden in the CAN packets is uninterpretable—severely impeding vehicle-related research including CAN cybersecurity and privacy studies, after-market tuning, efficiency and performance monitoring, and fault diagnosis, to name a few. Guided by the four-part CAN signal definition, we present CAN-D (CAN Decoder), a modular, four-step pipeline for identifying each signal's boundaries (start bit and length), endianness (byte ordering), and signedness (bit-to-integer encoding), and, by leveraging diagnostic standards, augmenting a subset of the extracted signals with meaningful, physical interpretation. En route to CAN-D, we provide a comprehensive review of the CAN signal reverse engineering research. All previous methods ignore endianness and signedness, rendering them simply incapable of decoding many standard CAN signal definitions. Incorporating endianness grows the search space from 128 to 4.72E21 signal tokenizations and introduces a web of changing dependencies. In response, we formulate, formally analyze, and provide an efficient solution to an optimization problem, allowing identification of the optimal set of signal boundaries and byte orderings. In addition, we provide two novel, state-of-the-art signal boundary classifiers (both superior to previous approaches in precision and recall in three different test scenarios) and the first signedness classification algorithm, which exhibits a >97% F-score. Overall, CAN-D is the only solution with the potential to extract any CAN signal and is the state of the art. In an evaluation on ten vehicles of different makes, CAN-D's average ℓ1 error is five times better (81% less) than that of all preceding methods, and it exhibits lower average error even when considering only signals that meet prior methods' assumptions. Finally, CAN-D is implemented in lightweight hardware, allowing an OBD-II plugin for real-time, in-vehicle CAN decoding.

Index Terms—Controller Area Network (CAN); Reverse Engineering; Machine Learning; Security; Privacy; Technology
I. INTRODUCTION & BACKGROUND
Modern automobiles rely on communication of several electronic control units (ECUs) (internal computers) over a few controller area networks (CANs) and adhere to a fixed
This manuscript has been co-authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
CAN protocol. Sensor readings, such as accelerator pedal angle, brakes, fuel injection timing, and wheel speeds, as well as less important readings, such as radio settings, are all communicated as signals encoded in the CAN messages. For passenger vehicles, the encodings of these signals into CAN messages are proprietary—one can monitor (and send) CAN messages but generally cannot understand their meaning. Further, these encodings vary per make, model, year, even trim, and in practice, reverse engineering of signals is currently a tedious, per-vehicle effort. As CAN data is sent at a rapid rate and carries a wide variety of real-time vehicle information, a vehicle-agnostic solution for decoding CAN signals promises a vast resource of streaming, up-to-date information for analytics and technology development on any vehicle.
Each CAN message has up to 64 bits of data containing (usually) multiple signals (Figs. 2 & 3). Automotive CAN signals are characterized by four defining properties (discussed in detail in Sec. I): (1) signal boundaries (start/end bit), (2) endianness (byte order), (3) signedness (bit-to-integer encoding), and (4) physical interpretation. The signal definitions for each message (a message definition) are defined in the vehicle's CAN database file (the industry standard is Vector's .dbc
or "DBC" file format). We use this industry-standard, four-part signal definition to frame our understanding of previous works and to guide our approach.

Fig. 1: CAN-Decoder (CAN-D) pipeline: a four-step modular pipeline that takes a CAN log (capture of CAN data) as input and outputs a DBC with signal definitions, thus providing vehicle-agnostic CAN signal reverse engineering. Italicized processes outlined in dotted red lines indicate modular pieces that can be any algorithm satisfying the input/output requirements; descriptions of our choices for these pieces are provided. Greek letters α–δ denote tuning parameters (possibly) needed for Steps 1–4, respectively.
Step 1: For each message ID in a CAN log, a binary Signal Boundary Classifier outputs the likelihood of a signal boundary at each bit gap. We use either of two classifiers: supervised learning or a novel unsupervised heuristic.
Step 2: A custom endianness optimization algorithm takes the boundary probabilities as input and determines an optimal tokenization (signals' positions and endiannesses).
Step 3: A binary Signedness Classifier determines each signal's signedness, allowing translation of bits to values. We use a novel unsupervised heuristic for our classifier.
Step 4: A supplemental Signal-to-Timeseries Matcher matches signals to externally collected labeled timeseries, providing signal interpretation. We regress signals onto concurrently collected diagnostics.

TABLE I: Automotive CAN signal reverse engineering algorithms' coverage of the four signal properties (● = yes, ◐ = partial, ○ = no). CAN-D is the only comprehensive algorithm, determining all four properties.

                                        Boundary  Endianness  Signedness  Interpretation
Jaynes et al. (2016) [1]                   ○          ○           ○            ◐
Markowitz & Wool (2017) [2]                ●          ○           ○            ○
Huybrechts et al. (2017) [3]               ◐          ○           ○            ◐
Nolan et al.'s TANG (2018) [4]             ●          ○           ○            ○
Marchetti & Stabili's READ (2018) [5]      ●          ○           ○            ○
Verma et al.'s ACTT (2018) [6]             ●          ○           ○            ●
Pesé et al.'s LibreCAN (2019) [7]          ●          ○           ○            ●
Young et al. (2020) [8]                    ○          ○           ○            ◐
CAN-D                                      ●          ●           ●            ●
The goal is a vehicle-agnostic CAN decoder—to discover these four defining properties for each signal from CAN data from any vehicle, i.e., to reverse engineer the signal definitions in the vehicle's DBC.
Recently, the research community has focused on reverse engineering signals from automotive CAN data. This research is summarized in Related Works (Sec. II), and Table I catalogs each work's efforts in identifying the four defining signal characteristics. Notably, all current approaches focus only on identifying signal boundaries (1) and/or matching signals to observable sensor data (4), and they ignore endianness (2) and signedness (3), meaning they are unable to decode many standard CAN signals.
All previous works have developed and tested algorithms on limited CAN data, often from a single make. Targeting a vehicle-agnostic solution, we compile a much more varied collection of labeled CAN data from ten different makes (see Sec. IV). Equipped with this robust, labeled dataset for development and testing, we pursue the first comprehensive and most accurate signal reverse-engineering pipeline (see Fig. 1). Before describing our contributions, we introduce necessary background information.
Fig. 2: CAN 2.0 frame depicted [9]: Arbitration ID indexes the frame; Data Field carries message content up to 64 bits.
A. CAN Fundamentals & Notation
CAN 2.0 defines the physical and data link layers (OSI layers one and two) of a broadcast protocol [10]. In particular, it specifies the standardized CAN frame (or packet) format represented in Fig. 2. For semantic understanding of a CAN frame, only two components of the frame are necessary:
• Arbitration ID - an 11-bit header used to identify the frame, and for arbitration (determining frame priority when multiple ECUs concurrently transmit);
• Data Field/Message - up to 64 bits of content.
Each ID's data field comprises signals of varying lengths and encoding schemes packed into the 64 bits (see Fig. 3, left). A .dbc file provides the definitions of signals in the data field for each ID, thus defining each CAN message. CAN frames with the same ID (message header/index) are usually sent with a fixed frequency to communicate updated signal values, although some are aperiodic (triggered by an event); for example, one ID may occur every 0.1 s and another every 0.25 s. We partition CAN logs into
ID traces, the time series of 64-bit messages for each ID. An ID trace is denoted [B0(t), . . . , B63(t)]_t, a time-varying binary vector of length 64. Note that, without loss of generality, we assume each message is 64 bits by padding with 0 bits if necessary.
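For concreteness, partitioning a log into ID traces (with 0-padding) can be sketched in a few lines. In this Python sketch, the (timestamp, arbitration ID, hex payload) log format and the function name are illustrative assumptions, not part of the CAN standard:

```python
from collections import defaultdict

def id_traces(can_log):
    """Partition a CAN log into per-ID traces of 64-bit messages.

    Assumes `can_log` is an iterable of (timestamp, arbitration_id, data_hex)
    tuples (an illustrative log format). Data fields shorter than 64 bits are
    right-padded with 0 bits, per the w.l.o.g. convention above.
    """
    traces = defaultdict(list)
    for ts, arb_id, data_hex in can_log:
        bits = "".join(f"{int(nibble, 16):04b}" for nibble in data_hex)
        bits = bits.ljust(64, "0")  # pad to a 64-bit vector B0..B63
        traces[arb_id].append((ts, tuple(int(b) for b in bits)))
    return {i: sorted(msgs) for i, msgs in traces.items()}  # time-ordered

# Hypothetical three-frame log with two IDs:
log = [(0.00, 0x102, "DEAD"), (0.10, 0x102, "DEAF"), (0.05, 0x244, "01")]
traces = id_traces(log)
```

Each trace is then processed independently by the pipeline's per-ID steps.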
1) Byte Order (Endianness) & Bit Order: The significance of a signal's bits within a byte (contiguous 8-bit subsequence) decreases from left to right, i.e., the first bit transmitted is the most significant bit (MSB), and the last (eighth) bit is the least significant bit (LSB). This is defined in the CAN specification [10, 11] but has been misrepresented [7] and misunderstood [4, 6] by previous signal reverse engineering works. The confusion results from the use of both big endian and little endian byte orderings in CAN messages. Big endian (B.E.) indicates that the significance of bytes decreases from left to right, whereas little endian (L.E.) reverses the order of the bytes (but maintains the order of the bits within each byte) [12]. We list the bit orderings for a 64-bit data field under both endiannesses with parentheses demarcating the bytes [11]:

B.E.: (B0, . . . , B7), (B8, . . . , B15), . . . , (B56, . . . , B63)
L.E.: (B56, . . . , B63), (B48, . . . , B55), . . . , (B0, . . . , B7)    (1)

See Examples 1 & 2 for how this affects signal definitions.
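The two orderings of Eq. 1, restricted to the bytes a signal occupies, can be generated mechanically. A minimal Python sketch (the function name is ours):

```python
def msb_to_lsb_bit_indices(byte_idxs, little_endian):
    """Return the MSB-to-LSB ordering of bit positions B0..B63 for a signal
    occupying the given bytes (0-7), per Eq. 1: little endian reverses the
    byte order but keeps the bit order within each byte.
    """
    byte_order = list(reversed(byte_idxs)) if little_endian else list(byte_idxs)
    return [8 * b + i for b in byte_order for i in range(8)]

# A two-byte signal occupying bytes 0 and 1:
big = msb_to_lsb_bit_indices([0, 1], little_endian=False)    # B0..B15
little = msb_to_lsb_bit_indices([0, 1], little_endian=True)  # B8..B15, B0..B7
```

Note that for a signal contained in a single byte, the two orderings coincide, which is why endianness only matters for signals crossing byte boundaries.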
2) CAN Signals:
The specifications for decoding each ID's message into a set of signal values are defined by the OEM and held secret, usually stored in a DBC. Signal definitions consist of several properties (see Fig. 3, right) that detail how to:
tokenize (demarcate the signal's sequence of bits):
• Start bit and length give the signal's position in the data field;
• Byte ordering: if the signal crosses a byte boundary, little endian signals reverse the order of the bytes while big endian signals retain byte order (see Eq. 1);
translate (convert a sequence of bits to integers):
• Signedness: unsigned, the usual base-2 encoding, vs. signed, two's complement encoding [15];
interpret (linearly scale raw translated signal values to physically meaningful and interpretable information):
• Label and unit, giving the physical meaning of the signal and its units (e.g., speed in MPH);
• Scale and offset, which provide the linear mapping of the signal's tokenized values to the appropriate units.
It is implicit in the DBC signal definition that (non-constant) signals are contiguous sequences of non-constant bits.

Fig. 3: DBCs visualized through DBC editor GUIs. Left: a signal layout plot visually represents a CAN message tokenization, depicting an ID's 64-bit data field as an 8 × 8 array containing CAN signal(s). Each signal's constituent bits are shown in a unique color, and unused bits are shown in white. (CANdb++ Database Editor) [13] Right: signal definition of the first 16-bit yellow signal, defined by properties: start bit, length, signedness, endianness, scaling factor, offset, unit. (NI-XNET Database Editor) [14]
Example 1.
Consider in Fig. 3 the first two-byte yellow signal. To tokenize the signal, i.e., to know its sequence (implying order) of bits, we must know endianness. If bytes 1 & 2 are big endian, we obtain MSB-to-LSB bit indices I = (0, . . . , 15), whereas if they are little endian, the bytes are swapped, obtaining MSB-to-LSB bit indices I = (8, . . . , 15, 0, . . . , 7), notably with B8 now the MSB. Next, the signal's signedness furnishes the translation of that bit sequence to an integer. The information needed for interpretation is the label and unit of the signal (in this case Engine RPM) and the linear transformation to convert the translated values (a two-byte signal can take 2^16 = 65,536 values) to the appropriate physical value (e.g., a realistic RPM range).
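The tokenize/translate/interpret steps of Example 1 can be condensed into a short Python sketch. The frame contents, the 0.25 scale factor, and the restriction to byte-aligned signals are illustrative assumptions for brevity:

```python
def decode_signal(frame_bits, byte_idxs, little_endian, signed, scale, offset):
    """Tokenize, translate, and interpret one byte-aligned signal.

    frame_bits holds B0..B63 in transmission order. DBC start-bit conventions
    for signals starting mid-byte add bookkeeping but no new ideas.
    """
    # Tokenize: order the signal's bytes per its endianness (cf. Eq. 1).
    byte_order = reversed(byte_idxs) if little_endian else byte_idxs
    bit_seq = [frame_bits[8 * b + i] for b in byte_order for i in range(8)]
    # Translate: base-2 value, with a two's complement correction if signed.
    raw = int("".join(map(str, bit_seq)), 2)
    if signed and bit_seq[0] == 1:  # MSB set => negative in two's complement
        raw -= 1 << len(bit_seq)
    # Interpret: linear map from raw value to physical units.
    return scale * raw + offset

# Hypothetical frame: all zeros except bit B8 (the MSB of byte 1).
frame = [0] * 64
frame[8] = 1
# Read bytes 0-1 as one little endian, unsigned signal scaled by 0.25:
value = decode_signal(frame, [0, 1], little_endian=True, signed=False,
                      scale=0.25, offset=0.0)  # raw = 2**15, value = 8192.0
```

Flipping `little_endian` or `signed` on the same frame yields very different values, which is exactly why Steps 2 and 3 of the pipeline are necessary.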
Fig. 6 illustrates time series of CAN data that have been decoded using both correct and incorrect signal definitions. Fig. 6(a) plots green and blue CAN signals tokenized with correct (middle) vs. incorrect (right) signedness, and Fig. 6(b) plots CAN signals tokenized with correct (top) vs. incorrect (bottom; in particular, the navy signal) endianness. The clear discontinuities in these mis-tokenized and mis-translated signals exhibit the importance of knowing the endianness and signedness for extracting meaningful time series.
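The discontinuities of the Fig. 6(a) variety are easy to reproduce; a minimal Python sketch (the example series is hypothetical):

```python
def translate(bits, signed):
    """Bits (MSB first) to integer: base 2, or two's complement if signed."""
    raw = int("".join(map(str, bits)), 2)
    if signed and bits[0] == 1:
        raw -= 1 << len(bits)
    return raw

def to_bits(value, n=16):
    """Lowest n bits of `value`, MSB first (two's complement for negatives)."""
    return [(value >> (n - 1 - i)) & 1 for i in range(n)]

series = [2, 1, 0, -1, -2]  # a smooth signed quantity crossing zero
signed_view = [translate(to_bits(v & 0xFFFF), signed=True) for v in series]
unsigned_view = [translate(to_bits(v & 0xFFFF), signed=False) for v in series]
# signed_view   -> [2, 1, 0, -1, -2]        (smooth, as in the correct panel)
# unsigned_view -> [2, 1, 0, 65535, 65534]  (a Fig. 6-style discontinuity)
```

The mistranslated series jumps by nearly 2^16 at every sign change, which is the visual signature exploited when diagnosing signedness errors.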
3) On-board Diagnostics:
In the U.S., all vehicles sold after 1996 include an on-board diagnostics (OBD-II) port, which generally allows open access to automotive CANs, and emissions-producing vehicles sold after 2007 also include a mandatory, standard interrogation schema for extracting diagnostic data using the J1979 standard [16]. This on-board diagnostic service (OBD) is an application layer protocol in which one can query diagnostic data from the vehicle by sending a CAN frame. A CAN response is broadcast with the requested vehicular state information. There is a standard set of queries possibly available via this call-response protocol (e.g., accelerator pedal position, intake air temperature, vehicle speed) along with unit conversions, each corresponding to a unique diagnostic OBD-II PID (DID) [17]. Specific examples of how to perform the call and response are available, e.g., in [7, 18]. Previous CAN decoding works have iteratively sent DID requests and parsed the responses from CAN traffic to capture valuable, real-time, labeled vehicle data without using external sensors [3, 6, 7]. We denote these time series of diagnostic responses, or
DID traces, D(t). Inherent limitations exist—the set of available DIDs varies per make, and electric vehicles need not conform to this standard [6, 7].

B. Problem, Assumptions, & Challenges

1) Problem:
The goal is to recreate the .dbc file's signal definitions (i.e., discover the four properties for each signal) for any vehicle from a sufficient capture of the vehicle's CAN data.
2) Assumptions:
We make five fundamental assumptions:
(A0): Observed constant bits are unused.
(A1): Both big and little endian byte orders are possible.
(A1.a): Both endiannesses can occur in a single ID. We have not observed this, but it is permitted by the protocol and DBC syntax. DBC editor GUIs allow per-signal endianness specification with a checkbox or pull-down (e.g., Fig. 3, right), indicating that both byte orderings can co-occur in a message.
(A1.b): A single byte cannot have bits used in a little endian signal while also containing bits used in a big endian signal; else, the byte orders indicated by the signals are contradictory.
(A2): Signed signals are possible and are encoded using a two's complement encoding.
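Assumption (A2) can be made concrete in a few lines. The Python sketch below checks, for all 4-bit strings, the elementary fact that the unsigned and two's complement readings agree modulo 2^n yet disagree exactly when the MSB is set:

```python
def translate(bits, signed):
    """Translate an MSB-first bit tuple: base 2, or two's complement (A2)."""
    raw = int("".join(map(str, bits)), 2)
    if signed and bits[0] == 1:
        raw -= 1 << len(bits)
    return raw

n = 4
for x in range(2 ** n):
    bits = tuple((x >> (n - 1 - i)) & 1 for i in range(n))
    u = translate(bits, signed=False)
    s = translate(bits, signed=True)
    # The two encodings agree modulo 2^n ...
    assert u % (2 ** n) == s % (2 ** n)
    # ... but differ (by exactly 2^n) on the half of strings with MSB set.
    assert (u == s) == (bits[0] == 0)
```

This is precisely why signedness cannot be ignored: half of all possible bit strings decode to different integers under the two encodings.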
3) Challenges:
In practice, it is difficult to exercise the MSBs of a signal, resulting in errors in determining signal boundaries (a Step 1 challenge). For example, consider the two-byte (16-bit) Engine RPM signal of Example 1. As 5,000 RPM is rarely reached, the MSB of this signal is likely to be observed as a constant 0 bit, causing the signal start bit to be mislabeled. Though this is easily surmountable for RPM (e.g., rev the engine in neutral during collection), it is far more difficult to solve for latent sensors, e.g., engine temperature.
Secondly, since continuous signals are sampled periodically, those with high resolution (e.g., a two-byte signal has 2^16 = 65,536 values) have LSBs flipping seemingly randomly (a Step 1 challenge). Our results indicate that the TANG algorithm [4] suffers from the overly strict assumption that flip frequencies are monotonically decreasing with bit significance.
Thirdly, considering both big and little endianness greatly enhances the complexity of the problem, as bits on the byte boundaries have unknown neighbors (albeit from a fixed set of possibilities); e.g., simply comparing the bit flip probabilities of neighboring bits now requires custom rules for incorporating all possible neighbors according to (A1) and (A1.a) while removing impossibilities imposed by (A1.b) (a Step 2 challenge). See details in Sec. III-B.
Fourthly, considering both signed and unsigned encodings adds another hurdle; in particular, while the ordering of bit representations mod 2^n is the same for both signed and unsigned encodings, half the bit strings represent different integers (a Step 3 challenge).
Finally, many CAN signals communicate sensor values that are hard to measure with external sensors; hence, identifying the physical meaning, unit, and linear mapping (scale and offset) can be difficult (a Step 4 challenge).

C. Contributions
We make six contributions to the area of automotive CAN signal reverse engineering:
C1. Comprehensive signal reverse engineering pipeline:
Our primary contribution is a modular, four-part pipeline, depicted in Fig. 1, for learning all four components of a CAN signal definition. The pipeline is modular in that Step 1 can accommodate any signal boundary classification method; Step 3 can accommodate any signedness classification algorithm; and Step 4 can accommodate any signal-to-timeseries matching algorithm for physical interpretation. Instantiating our pipeline with our signal-boundary classification heuristic and (separately) our trained machine learning classifier for Step 1 and the diagnostic sensor matching of Verma et al. [6] for Step 4, we present a quantitative comparative evaluation of our signal reverse engineering pipeline versus previous methods. We demonstrate that CAN-D exhibits less than a fifth of the average error of all previous methods (Sec. V-B & Table VI, bottom), and we qualitatively illustrate the pitfalls and limitations of previous methods (Sec. V-C & Fig. 6) that our four-step pipeline circumvents. Overall, CAN-D is the first CAN signal reverse engineering effort that can accommodate all signals as defined in automotive DBC files, and it is by far more accurate than any previous effort. Further, it provides a framework for future research developments to improve and plug in advancements to each step.
C2. Introduction of two state-of-the-art signal boundary classification algorithms and comparative study of previous algorithms:
We develop two signal boundary classifiers, a supervised machine learning model and an unsupervised heuristic (Sec. III-A). We implement the previous state-of-the-art classification methods and provide the first quantitative comparison of all methods (Sec. V-A & Table VI, top) on a more comprehensive and robust data set than any previous work. We demonstrate that our algorithms are significantly more accurate than previous methods, superior in both recall and precision in three testing scenarios.
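For intuition, the general flavor of bit-flip-based boundary scoring underlying this line of work can be sketched as follows. This toy score (Python; the synthetic counter trace and the log-ratio score are illustrative only, not the classifiers of Sec. III-A) rewards large drops in flip probability, the signature of an LSB-to-MSB transition:

```python
import math

def flip_probabilities(trace):
    """Observed flip probability of each of the 64 bits across consecutive
    messages of one ID trace (a time-ordered list of 64-bit tuples)."""
    n = len(trace) - 1
    flips = [0] * 64
    for prev, cur in zip(trace, trace[1:]):
        for i in range(64):
            flips[i] += prev[i] ^ cur[i]
    return [f / n for f in flips]

def boundary_scores(probs, eps=1e-9):
    """Score the gap after bit i: a large drop in flip probability from bit i
    to bit i+1 suggests a boundary (illustrative heuristic only)."""
    return [math.log10((probs[i] + eps) / (probs[i + 1] + eps))
            for i in range(63)]

# Synthetic trace: a 4-bit counter in bits B4..B7; all other bits constant.
trace = [tuple((k >> (7 - i)) & 1 if 4 <= i <= 7 else 0 for i in range(64))
         for k in range(33)]
scores = boundary_scores(flip_probabilities(trace))
# The largest score lands at the gap after B7, the counter's LSB.
```

On real data, the MSB-exercise and high-resolution-LSB challenges above make such raw scores noisy, which is what motivates learned and more robust classifiers.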
C3. Endianness optimization formulation and solution:
All previous works are based on an assumption of big endian byte ordering (to perform tokenization and/or signal-to-timeseries matching), and there is no simple remediation for adapting the previous algorithms to perform correctly in the presence of both big and little (reverse byte order) endian signals. The second step of our pipeline presents a novel procedure crafted to use the predictions from any signal-boundary classification algorithm of Step 1 as input and determine the optimal set of endiannesses and signal boundaries from all possible tokenizations (Sec. III-B). We formulate an objective function to be optimized and provide a formal mathematical proof for reducing the search space to a very tractable grid search algorithm for optimization. Overall, this insight allows all signal-boundary classification algorithms to be leveraged for extracting both little and big endian signals—which has thus far been ignored and/or insurmountable.
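For intuition only, this kind of objective can be brute-forced on a tiny two-byte field. The Python sketch below enumerates byte orders and cut sets under a hypothetical pairwise boundary-probability function p(i, j) and a simple log-likelihood score; it is purely illustrative of the search problem, since the point of Sec. III-B is precisely to avoid this exponential enumeration on full 64-bit fields:

```python
import math
from itertools import combinations

def tokenization_score(order, cuts, p):
    """Log-likelihood that `cuts` (gap indices in the reordered bit sequence)
    are boundaries and every other adjacent pair is signal-internal."""
    s = 0.0
    for g in range(len(order) - 1):
        q = p(order[g], order[g + 1])
        s += math.log(q) if g in cuts else math.log(1.0 - q)
    return s

def best_tokenization(p):
    """Exhaustively search byte order x cut set for a 2-byte field."""
    orders = {"BE": list(range(16)), "LE": list(range(8, 16)) + list(range(8))}
    best = None
    for name, order in orders.items():
        for r in range(16):
            for cuts in combinations(range(15), r):
                cand = (tokenization_score(order, set(cuts), p), name, cuts)
                if best is None or cand > best:
                    best = cand
    return best

# Hypothetical classifier output: the only likely boundary is between the
# transmitted neighbors B7, B8 -- the signature of one 16-bit L.E. signal
# whose overall LSB is B7.
def p(i, j):
    return 0.9 if (i, j) == (7, 8) else 0.05

score, endianness, cuts = best_tokenization(p)  # -> "LE" with no cuts
```

Note how the L.E. reordering "explains away" the apparent boundary at the byte gap, illustrating the changing neighbor dependencies that make this search nontrivial.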
C4. Signedness classification:
We provide the first algorithm for determining signal signedness (bit-to-integer encoding) (Sec. III-C), allowing translation of signals to time series. Testing shows this simple heuristic achieves a >97% F-score.
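One plausible unsupervised heuristic in this spirit (a Python sketch, not necessarily the exact rule of Sec. III-C) prefers the signedness whose translated time series is smoother:

```python
def translate(bits, signed):
    """Bits (MSB first) to integer: base 2, or two's complement if signed."""
    raw = int("".join(map(str, bits)), 2)
    if signed and bits[0] == 1:
        raw -= 1 << len(bits)
    return raw

def classify_signedness(tokenized):
    """Label a tokenized signal (a time-ordered list of MSB-first bit tuples)
    by preferring the translation yielding the smoother time series."""
    def roughness(signed):
        vals = [translate(b, signed) for b in tokenized]
        return sum(abs(a - b) for a, b in zip(vals, vals[1:]))
    return "signed" if roughness(True) < roughness(False) else "unsigned"

def to_bits(v, n=8):
    return tuple((v >> (n - 1 - i)) & 1 for i in range(n))

torque = [3, 1, -2, -4, -1, 2]  # hypothetical signed sensor readings
label = classify_signedness([to_bits(v & 0xFF) for v in torque])  # "signed"
```

The idea mirrors Fig. 6(a): the wrong signedness produces near-2^n jumps at sign changes, which inflate the roughness of the incorrect translation.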
C5. Prototype OBD-II plugin for in-situ or offline use:
The pipeline can be run offline for post-drive analysis or during driving, e.g., to feed online analytics such as a CAN IDS with translated CAN data. We discuss our design and implementation of a lightweight on-board diagnostic (OBD-II) port plugin device (Sec. VI & Fig. 7) for use in any vehicle where a CAN is accessible via the OBD-II port (most vehicles). In a signal learning phase, the device automatically logs CAN data while periodically querying supported DIDs, and it then runs the algorithmic pipeline to learn signal definitions and write a DBC. This allows real-time decoding of CAN signals on future drives, e.g., to feed a novel analytic technology leveraging the vehicle's signals online, or offline uses, e.g., analyzing CAN captures in post-collection analysis. This prototype bridges the gap between the algorithmic research in the literature and actual online use with any vehicle.
C6. Survey:
We provide the first comprehensive survey of works on reverse engineering CAN signals (Sec. II & Table I), presenting the progression of the field and documenting the benefits and limitations of each.
D. Impact
Unveiling CAN signals will provide real-time measurements of vehicle subsystems, a rich stream of data that promises to fuel many vehicle technologies and put development and analytics in the hands of consumers (in addition to OEMs).
Multiple research works have, through direct and even remote access to CANs, managed to manipulate a few manually reverse engineered signals, manifesting in life-threatening effects—most notably, the remote Jeep hack of Miller & Valasek [19–22]. These works demonstrate that CAN reverse engineering is possible on a per-vehicle basis with ample effort and expertise, and obscurity will not inhibit the determined adversary. The obscurity of CAN data does, however, hinder the vulnerability analysis research necessary for hardening vehicle systems, and automated CAN reverse engineering will greatly expedite vehicle vulnerability research.
In parallel, CAN defensive security research is growing quickly; we found 15 surveys of the area since 2017, e.g., [23, 24], with over 60 works on CAN intrusion detection between 2016 and 2019. Yet these works are impeded by obfuscated CAN data, forced to use side-channel methods that ignore message contents [25–27], use black-box methods ignorant of message meanings [28–30], arduously reverse engineer a few signals for a specific vehicle [31], or rely on an OEM for signal definitions [32], which keeps CAN security in the OEMs' hands and develops per-make (not vehicle-agnostic) capabilities. A vehicle-agnostic CAN signal reverse engineering tool promises to remove these limitations and provide rich, online, time-series data for advancements in detection and other security technologies.
Further, this CAN signal decoding will promote universally applicable technologies to address cars currently on the road and remove reliance on the vehicle OEMs for CAN security.
Another emerging subfield of research is driver fingerprinting [33, 34], developing methods to identify drivers based on their driving characteristics, such as braking, accelerating, and steering. Access to decoded CAN data will allow these works to be ported to plugin technologies for nearly any vehicle, impacting at a minimum driver privacy and insurance strategies, and potentially forensic (e.g., criminal) investigations and vehicle security, to name a few.
In addition, access to CAN signals will potentially assist development of after-market tuning tools for enhanced efficiency and performance, fuel efficiency monitoring and guidance, fleet management, vehicle fault diagnosis, forensics technologies, and after-market vehicle-to-vehicle capabilities. As a final example, we note that after-market technologies to provide autonomous driving capabilities to current vehicles are appearing; in particular, Open Pilot (https://comma.ai/) provides latitudinal and longitudinal control for many vehicles on the road using a few, presumably manually reverse-engineered, CAN signals. Automated, accurate, and universally applicable CAN de-obfuscation will promote and expedite such vehicle technologies, especially after-market solutions for many vehicles currently in use.

II. CAN SIGNAL REVERSE ENGINEERING SURVEY
This section provides the first comprehensive survey of methods for decoding automotive CAN data into constituent signals. We seek to show the progression of the literature, and we provide more detailed descriptions of the methods that we evaluate in Sec. V, with those authors/methods in bold. Table I gives a quick reference for the signal reverse engineering contributions of each work.
Early work of Jaynes et al. [1] (2016) explored supervised learning to identify CAN messages that control body-related events, but the approach was unaware that data fields are comprised of multiple disparate signals. Thus, this method simply labels entire messages with a general physical meaning.
Markowitz & Wool [2] (2017) focuses on CAN anomaly/intrusion detection but pursues signal extraction as a preprocessing step. They were the first to introduce the basic assumption that each arbitration ID's data field is "a concatenation of positional [signals]". Implicitly, Markowitz & Wool's algorithm assumes only big endian and unsigned signals; hence, their algorithm need only identify the start bit and length of a signal. The algorithm considers all 2080 possible signals (indexed by start bit and length) in an ID's 64-bit data field, and relies on the cardinality of each candidate signal's range, the count of observed distinct values. It then categorizes the signal as constant, categorical (taking on only a few values), or continuous (values of a discretely sampled continuous variable) based on the range and assigns a score. Finally, the method identifies a non-overlapping partition of the 64 bits based on category and an optimization of the signals' scores.
Huybrechts et al. [3] (2017) is the first work to leverage DIDs to annotate CAN data and identify signals.
Their algorithm converts bytes/byte-pairs in CAN messages to integers and identifies those that are similar to the concurrently collected DID responses, but it operates under the self-acknowledged false assumption that CAN signals are limited to only one- or two-byte signals. No linear transformation of extracted signals to the DID sensor values is given.
The next three works, Nolan et al.'s TANG algorithm [4], Verma et al.'s ACTT [6], and Marchetti & Stabili's READ [5], appear to have occurred independently and concurrently, and we present them chronologically by publication date.
Nolan et al. [4] (2018) focus solely on extracting continuous signals by considering the "transition aggregated n-grams" (TANG). Given an ID trace [B0(t), . . . , B63(t)]_t, Nolan et al. define the TANG vector as (T0, . . . , T63) with Ti = Σ_j Bi(tj) ⊕ Bi(tj+1), where ⊕ denotes XOR. Note that this is simply a computationally efficient way to obtain the bit flip count; hence, if an n-bit signal's subsequent values change by unit increments, the LSB will exhibit the maximal TANG value, and each next significant bit will have TANG values decreasing by a factor of 2. The algorithm for identifying continuous signal boundaries is, roughly speaking: compute the TANG vector from an ID trace, identify the bit with maximal TANG value as a signal's LSB, and walk left (resp. right for reverse bit order) absorbing bits into the signal until the TANG value increases. Nolan et al. consider both forward and reverse bit orderings to attempt to take little and big endian encodings into account. However, since endianness refers to byte (not bit) order, this method cannot accommodate true little endian signals, and it in fact violates the fixed bit order defined by the standard. Overall, this method assumes big endian, unsigned, and continuous signals. Marchetti & Stabili [5] (2018) propose the
READ (Reverse Engineering of Automotive Dataframes) algorithm to extract signals using heuristics based on a 64-length vector giving each bit's observed flip probability, [P(Bi(tj) ≠ Bi(tj+1))] for i = 0, . . . , 63. First, signal boundaries are identified using mi := ⌈log10(P(Bi(tj) ≠ Bi(tj+1)))⌉, the ceiling of the log probabilities. READ follows intuition similar to TANG's: for continuous signals, an LSB flips much more often than an adjacent signal's MSB. Hence, READ places signal boundaries between bits i and i+1 iff mi > mi+1, or equivalently, if the bit flip probabilities cross a power of 10 (e.g., from above 0.1 to below it). Unlike TANG, READ does not claim to assume only continuous signals, and it in fact builds on Markowitz & Wool's signal categorization efforts. It considers a trichotomy of signal categories—counters (increment by 1 with each message), checksums (hashes for checking if messages are properly transmitted), and a catch-all bin, "physical" signals—categorizing the extracted signals with further heuristics relating to bit flips. Ultimately, READ partitions an ID's 64-bit data frame into signals with categorical labels. The algorithm ignores little endian and signed encoding possibilities and cannot be easily amended to accommodate little endian signals. Marchetti & Stabili's evaluations with real and synthetic CAN data, comparing with Markowitz & Wool's method, reveal that READ is far more accurate at finding signal boundaries. Verma et al.'s ACTT [6] (2018) takes a fundamentally different approach from all previous works. Instead of partial tokenization and translation—specifically, learning to identify signal boundaries under limiting assumptions (e.g., assuming big endian and unsigned encodings) in an unsupervised fashion—ACTT simultaneously tokenizes, translates, and interprets
CAN signals. The method automatically identifies which DIDs (see Sec. I-A3) respond on the particular vehicle, and then collects ambient CAN data during driving while periodically querying DIDs. These diagnostic responses provide labeled time series, DID traces, alongside the CAN data, setting up a supervised decoding algorithm. For a given ID trace, the constant bits are labeled, and all possible signals (start bit, length) from the remaining non-constant bits are considered. For each possible signal and for each DID trace, linear regression is performed, and a score of linear fit is assigned. A scheduling algorithm using dynamic programming then identifies a non-overlapping set of signals that maximizes the fitness score. The output is two-fold: (1) a list of constant signals, and (2) a subset of signals equipped with linear mappings to a known physical unit that matches a DID (start bit, length, scale, offset, physical unit, sensor label). Like all previous works, this method assumes unsigned encodings, and, following Nolan et al.'s TANG, it mistakenly treats reverse bit order as little endian (rather than byte order). Because this method relies on DID matching to tokenize signals, only a small subset of signals can be extracted, but all extracted signals are interpretable.
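The regression-and-scoring core of this style of DID matching can be sketched as follows (Python; the sample values are hypothetical, and alignment of the two series to common timestamps is assumed):

```python
def fit_interpretation(signal_vals, did_vals):
    """Regress a candidate tokenized signal onto a concurrently sampled DID
    trace: returns (scale, offset, r_squared) from ordinary least squares.
    Assumes a non-constant signal and pre-aligned, equal-length series."""
    n = len(signal_vals)
    mx = sum(signal_vals) / n
    my = sum(did_vals) / n
    sxx = sum((x - mx) ** 2 for x in signal_vals)
    sxy = sum((x - mx) * (y - my) for x, y in zip(signal_vals, did_vals))
    syy = sum((y - my) ** 2 for y in did_vals)
    scale = sxy / sxx
    offset = my - scale * mx
    r2 = (sxy * sxy) / (sxx * syy)  # goodness of fit used for scheduling
    return scale, offset, r2

# A raw RPM-like signal stored at a hypothetical 0.25 units/bit:
raw = [3200, 3600, 4000, 4800]
rpm = [0.25 * x for x in raw]                     # matching DID responses
scale, offset, r2 = fit_interpretation(raw, rpm)  # ~ (0.25, 0.0, 1.0)
```

Candidate signals whose best DID fit scores poorly are left uninterpreted, which is why DID-based methods extract only a subset of signals.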
Pesé et al. [7] (2019) present
LibreCAN, a three-phase process. (Phase 0) LibreCAN makes tweaks to READ's algorithm for identifying signal boundaries and categorizing extracted signals. Specifically, while READ identifies signal boundaries by finding where adjacent bit-flip probabilities decrease across a multiple of 10, LibreCAN identifies if adjacent bit-flip probabilities drop by a factor of T_p, a tunable input parameter. (Phase 1) LibreCAN next leverages ideas similar to Verma et al. [6], using cross-correlation to match signals to sensor readings from both DIDs and external sensors, then using linear regression to learn the scale and offset. (Phase 2) LibreCAN incorporates a novel, semi-automated method for identifying body-related signals (e.g., door locks, windshield wipers) by filtering IDs based on changes in data fields before and after a user actuates the body-related feature. Pesé et al. note that little endian signals exist, but like all previous methods, their algorithm assumes big endian byte order and unsigned encodings, and does not have a natural extension to accommodate little endian signals.
The most recent CAN reverse engineering work, by Young et al. [8] (2020), uses an approach similar to LibreCAN (Phase 2) to match vehicular functions (based on a hand-labeled timeseries) to CAN IDs using a data-change identification algorithm. They use a clustering algorithm to group related IDs, labeling the remaining unknown IDs based on those labeled in the matching step. However, similar to Jaynes et al., this work attempts to assign physical meaning to an entire CAN ID rather than tokenize, translate, and then identify (assign meaning to) constituent signals; thus, we do not consider it (nor Jaynes et al.'s) to be a true signal reverse engineering algorithm.
There are significant limitations to all previous works. Most notably, all assume both big endian byte order and unsigned encodings. While some may theoretically identify signed signals' boundaries correctly, this has not been mentioned or tested.
Worse, there is no natural extension to little endian and/or signed signals. To identify signedness, an additional algorithm is needed: a fairly straightforward binary classification problem that is not difficult once well formed. Including endianness, on the other hand, poses a far harder problem for two reasons: (1) signal boundary algorithms depend on flip counts of "neighboring" bits, but bit orderings change with endianness, so neighboring bits cannot be determined; (2) without considering both endiannesses, signal boundary identification is computationally simple (the same binary classification is independently repeated 64 times per ID), but considering all byte orderings grows the search space combinatorially (boundary options × byte orders > 4.72E21 tokenizations per ID!) with a web of changing dependencies.

III. ALGORITHM
We present CAN-D (CAN-Decoder), a four-step modularpipeline (depicted in Fig. 1) providing the first comprehensiveand vehicle-agnostic CAN signal reverse engineering solution.We describe the needed inputs and outputs for the modularcomponents—a signal boundary classifier (Step 1, Sec. III-A),a signedness classifier (Step 3, Sec. III-C), and a signal-to-timeseries matcher (Step 4, Sec. III-D)—as well as ournovel endianness optimizer (Step 2, Sec. III-B), which weconsider to be the unique component providing the glue forthe interchangeable components.
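The modular contract described above can be sketched as a set of interchangeable callables; the names and signatures here are hypothetical, for illustration only, and the four-part signal definition is modeled as a small record type:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Signal:
    """Four-part CAN signal definition plus optional interpretation fields."""
    start_bit: int       # MSB position in the 64-bit payload
    length: int          # number of bits
    endianness: str      # "big" or "little" (byte order)
    signed: bool         # two's-complement vs. unsigned encoding
    label: str = ""      # physical meaning, if matched in Step 4
    scale: float = 1.0   # linear map: value = scale * raw + offset
    offset: float = 0.0

def can_d(id_trace,
          boundary_clf: Callable,   # Step 1: per-bit cut probabilities
          endian_opt: Callable,     # Step 2: optimal boundaries + byte orders
          signed_clf: Callable,     # Step 3: signed vs. unsigned per token
          matcher: Callable) -> List[Signal]:   # Step 4: interpretation
    cut_probs = boundary_clf(id_trace)
    tokens = endian_opt(cut_probs)
    signals = [signed_clf(t, id_trace) for t in tokens]
    return matcher(signals, id_trace)
```

The point of the sketch is the data flow: Steps 1, 3, and 4 are swappable components, while Step 2 consumes Step 1's probabilities and fixes each byte's order before signedness is decided.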
A. Step 1: Signal Boundary Classification
Given an ID trace as input, a signal boundary classifier makes 64 binary classification decisions—for each of the 64 bits, predict whether it is the LSB of a signal (or not), effectively deciding if a signal boundary or "cut" occurs between this bit and the next. Almost all previous works have focused on signal boundary classifiers that use hand-crafted heuristics leveraging only one feature, the probability of each bit flipping. In this section we pursue the same goal but use a wider set of features. In addition to a novel, unsupervised heuristic, we leverage supervised machine learning (ML) and deliver two superior signal boundary classifiers.
For the reverse engineering pipeline, outputs of the signal boundary classifier in Step 1 are inputs to the endianness optimizer in Step 2. While we frame signal boundary identification as a set of binary classifications, the input for Step 2 of the CAN-D pipeline is the estimated probability—in {0, 1} for binary heuristics or in [0, 1] for ML—of a signal boundary for each bit. Algorithms developed in previous works [2, 4–6] and [7] (Phase 0) could be used as the signal boundary classifier for this step, all of which produce binary label outputs. Sec. V presents results comparing our signal boundary classifiers against the previous state of the art.
1) Data & Notational Setup:
Both unsupervised and supervised predictions are based on statistics describing how a particular bit and its neighboring bits flip. We use a ground-truth DBC (see Sec. IV) to create a target vector, providing a 0/1 label for each bit indicating if it is a signal's LSB (boundary). To deal with the issue that neighboring bits at byte boundaries are conditioned on endianness, we split little endian signals on byte boundaries for training (the supervised models) and testing (all models). In use, the classifier (heuristic or ML) will be applied to ID traces under both byte orderings (see Eq. 1), creating two sets of predictions. Both sets of predictions are input to Step 2, which determines the endianness of each byte.
Here we introduce two views of the data used for training and then scoring/tuning the ML in this section (both are also used for testing all methods in Sec. V-A). For training, we remove the constant bits (obvious boundaries), forming a "condensed trace." The motivation for this is threefold: (1) Based on assumption (A0) (see Sec. I-B), observed constant bits necessarily delimit signals, so a simple rule suffices to identify these obvious signal boundaries. (2) Our features encode neighboring bits' values and flips, so when nearby bits are constant, features are either trivial or undefined. Removing the constant bits prior to feature building yields a better feature set. (3) Classes are highly biased towards the negative class—most bits are not an LSB (not on a signal boundary). By removing constant bits, we not only get better features, but we artificially increase the number of non-obvious signal boundaries and decrease class bias, particularly for the non-obvious examples for which a classifier is needed. Note this is the "c" set described in Sec. V-A.
Using this condensed trace, we build a feature array with shape m non-constant bits by n_f features (features described below for each method). Second, for tuning the ML classifiers in this section, we only consider their performance on the non-obvious boundaries in the original data—those boundaries not abutting constant bits in the non-condensed ID traces. Note this is the "f−" set described in Sec. V-A. We tune our supervised model on this set because we ultimately wish to apply the model to full 64-bit traces and want to optimize performance for this situation.
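The condensed-trace construction and the kind of conditional bit-flip statistics used throughout Step 1 can be sketched as follows (hypothetical helpers, not the authors' code; payload rows are lists of 0/1 values):

```python
def condense(rows):
    """Drop constant bit positions (obvious boundaries); return kept columns."""
    n = len(rows[0])
    keep = [i for i in range(n) if len({r[i] for r in rows}) > 1]
    return [[r[i] for i in keep] for r in rows], keep

def flip_features(rows, i):
    """Local features for bit i, in the spirit of Table II:
    P(F_i), P(F_i|F_{i+1}), P(F_{i+1}|F_i), P(~F_i|~F_{i+1}), P(~F_{i+1}|~F_i)."""
    fi  = [a[i] != b[i] for a, b in zip(rows, rows[1:])]        # flips of bit i
    fi1 = [a[i + 1] != b[i + 1] for a, b in zip(rows, rows[1:])]  # flips of bit i+1
    n = len(fi)
    p = lambda xs: sum(xs) / n
    joint  = p([a and b for a, b in zip(fi, fi1)])          # P(F_i and F_{i+1})
    njoint = p([not a and not b for a, b in zip(fi, fi1)])  # P(~F_i and ~F_{i+1})
    pfi, pfi1 = p(fi), p(fi1)
    cond = lambda num, den: num / den if den else 0.0       # guard zero denominators
    return (pfi,
            cond(joint, pfi1), cond(joint, pfi),
            cond(njoint, 1 - pfi1), cond(njoint, 1 - pfi))
```

For a 2-bit counter (values 0, 1, 2, 3), the LSB flips every message while the MSB flips every other message, so the conditional feature P(F_{MSB} | F_{LSB}) cleanly reflects the dependence between bits inside one signal.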
2) Supervised Classification:
To describe features conceptually, we use i ± 1 to denote bit i's neighbors, notationally neglecting the varying neighbors based on endianness (ref. Eq. 1) when it only presents unnecessary complications. For each bit i, we generate a set of 15 features. The first five features are "local" to bit i and its relationship to bit i + 1, which we denote v_i^{id} ∈ R^5. These features (listed in Table II) are estimated probabilities of a "bit flip" based on observations in data over time. We denote the flip of bit i—alternating value in subsequent messages, B_i(t_j) ≠ B_i(t_{j+1})—as F_i.

TABLE II: Local bit-flip features; F_i denotes a flip of bit i.
P(F_i)
P(F_i | F_{i+1})      P(F_{i+1} | F_i)
P(¬F_i | ¬F_{i+1})    P(¬F_{i+1} | ¬F_i)

The main intuition is that a signal's LSB generally alternates value much more often than an adjacent signal's MSB; hence, the bit-flip features should provide good indicators for boundaries. Specifically, the first feature should identify LSBs (large P(F_i)) and MSBs (P(F_i) near 0). This is essentially the feature on which previous works [4, 5, 7] base their heuristics. The next four conditional bit-flip features are expected to differ significantly for adjacent bits contained in the same signal versus those that are part of separate signals, as the former are likely dependent while the latter are likely independent. Next, we look to the neighboring bit on the right, bit i + 1, and add the five local features for this bit, v_{i+1}^{id}, to our feature set for bit i. Finally, we add five difference features δ(v_{i+1}^{id}, v_i^{id}), yielding a 15-length feature vector for bit i. Initially, we experimented with adding a wider variety of features based on bit values, two-bit distributions, and entropy, as well as more left/right neighboring features.
However, we found that these features did not improve classification performance and in fact resulted in overfitting.
We tested the performance of several binary classifiers: Naive Bayes, Logistic Regression, Support Vector Classifiers, Decision Trees, Random Forests, K-Nearest Neighbors, Multi-Layer Perceptrons, and AdaBoost. After experimenting with different weighting schemes to combat the class bias issue, as well as the fact that we only score the non-obvious boundaries, we settle on a sample weighting scheme of non-obvious-positive:negative:obvious-positive of 8:4:1. To test the accuracy of the classifiers, we used leave-one-out cross-validation (LOOCV), holding out one CAN log per fold and aggregating the results, and the f− set, only scoring non-obvious boundaries. The results, shown in Table III, illustrate that the Random Forest (RF) classifier performed the best. Finding the optimal parameters for this top-performing model using a grid search and LOOCV, the tuned model yields an overall 88% Precision and 95% Recall, for an F-score of 91%. We select this tuned RF model for our ML classifier. Finally, as an input to Step 2, we output the classifier's predicted probability of a bit i being a signal's LSB.

TABLE III: Aggregated classification metrics using LOOCV by CAN log, only scoring non-obvious boundary decisions (f− set). Top: classifiers with default Scikit-learn parameters. Bottom: the top-performing Random Forest model, with optimal parameters chosen using a grid search (max_features=√n_f, min_samples_leaf=3, n_estimators=200, max_depth=5).

Classifier              F-Score   Precision   Recall
Naive Bayes             71.6      57.6        94.7
Logistic Regression     86.9      82.1        92.3
SVC Linear              85.5      78.6        93.8
SVC Poly                88.7      85.3        92.3
SVC RBF                 89.0      84.8        93.8
SVC Sigmoid             46.4      42.3        51.4
KNN                     88.1      81.3        96.2
MLP                     88.4      82.5        95.2
AdaBoost                87.6      82.6        93.3
Decision Tree           78.5      67.8        93.3
Random Forest           –         –           –
Random Forest (Tuned)   91        88          95

Fig. 4: Visualization of the heuristic signal boundary classifier (Alg. 1) based on conditional bit-flip probabilities, for fixed α_1, α_2.

Algorithm 1:
Heuristic Signal Boundary Classifier
Inputs: P(F_{i+1} | F_i), P(F_{i+2} | F_{i+1}), α_1, α_2
if P(F_{i+1} | F_i) < α_1 or P(F_{i+2} | F_{i+1}) − P(F_{i+1} | F_i) > α_2 then
    return True
else
    return False
3) Unsupervised Heuristic:
As an alternative to ML, we explore the feature set to develop a simple heuristic relating to bit-flip probabilities. We find that the conditional bit-flip probability P(F_{i+1} | F_i) and the difference between successive conditional bit-flip probabilities, P(F_{i+2} | F_{i+1}) − P(F_{i+1} | F_i), are a better indicator of a signal ending at bit i than the difference of unconditional bit-flip probabilities, P(F_{i+1}) − P(F_i), used by most related works.
We develop a heuristic based on these findings, detailed in Alg. 1 and visualized in Fig. 4. Based on observations of data, we find settings of the parameters α_1, α_2 that split the feature space well and yield strong F-score, Precision, and Recall (also on the f− set). Note that our heuristic was developed and tuned based on a small preliminary dataset, but we found it generalized well to all of our data.
The heuristic's main advantage is that it requires no training while achieving accuracy similar to the ML, as shown in Sec. V-A. Though simple, intuitive, and computationally efficient, one drawback is that the outputs are binary labels, with no way of properly determining probabilities in (0, 1), thereby removing some of the flexibility offered by the following step.

B. Step 2: Endianness Optimization
Armed with the probability of a boundary or “cut” betweenadjacent bits of a message, we construct an optimization prob-lem to simultaneously determine the most likely packing ofsignals into the 64-bit data-field and most likely endiannessesof each of the eight bytes.
1) Valid Tokenizations:
Denote a candidate signal I as the list of its bit indices, ordered from MSB to LSB. Given a signal I, let LSB(I) (or simply LSB if no ambiguity is present) denote the least significant bit. We consider constant bits as 1-bit signals. Each ID has eight bytes indexed j = 0, …, 7, with byte j comprised of bits 8j, …, 8(j + 1) − 1. Let E(j) ∈ {B, L} denote that byte j is big, little endian, respectively.

Definition 1 (Valid Tokenizations). For a given ID trace, define a valid tokenization, T, as a tuple of candidate signals {I_k}_k and endiannesses of each byte {E(j)}_{j=0}^{7} such that:
(1) ∪_k I_k = {0, …, 63} (all 64 bits are used),
(2) I_k ∩ I_l = ∅ for all k ≠ l (signals do not overlap),
(3) Assumption (A1.b), one endianness per byte, is satisfied (implicit in the notation E(j)).

Example 2. For example, consider Fig. 5 (right), a signal plot layout depicting a valid tokenization with one color per signal (and constant bits in grey). The navy signal, a 10-bit little endian signal starting at bit 0, is denoted I = (14, 15, 0, 1, …, 7). Since B_15 → B_0, necessarily E(0) = E(1) = L.

Example 2 shows that if a signal I crosses a byte boundary, the endianness of both bytes is determined by the order of the indices according to Eq. 1. This leads to the following definition and proposition, which will play an important role in the computational tractability of our optimization problem.

Definition 2 (Byte Boundaries). For j = 0, …, 7, let v(j) ∈ {J_B, J_L, C} denote whether byte boundary j is
• a cut (C): bit 8(j + 1) − 1 ends a signal or is constant,
• a big endian join (J_B): B_{8(j+1)−1} → B_{8(j+1)}, or
• a little endian join (J_L): B_{8(j+1)−1} → B_{8(j−1)},
and let V := {v ∈ {J_B, J_L, C}^8 | v is a valid byte boundary set}.

For bits not on a byte boundary (i ∉ S := {8(j + 1) − 1}_{j=0}^{7}), there are only two options: cut or join B_i → B_{i+1}, and both are valid possibilities regardless of endianness.

Proposition 1.
A valid tokenization T has v satisfying:
1) v(j) = J_B ⟹ E(j) = E(j + 1) = B
2) v(j) = J_L ⟹ E(j − 1) = E(j) = L
3) v(0) ≠ J_L
4) v(7) ≠ J_B
5) v(j) = J_B ⟹ v(j + 1) ≠ J_L and v(j + 2) ≠ J_L

Proof. (1) and (2) follow directly from Eq. 1 (endianness definition) and Assumption A1.b (one endianness per byte). For (3), v(0) ≠ J_L, else the join would reference a bit index outside [0, 63]. Similarly for (4). For (5), if v(j) = J_B and either v(j + 1) = J_L or v(j + 2) = J_L, then (1) and (2) imply E(j + 1) is both big and little endian, a violation of Assumption A1.b.

Remark 1.
Prop. 1 can be summarized by V := {v ∈ {J_B, C} × {J_B, J_L, C}^6 × {J_L, C} with no consecutive subsequences of the form (J_B, J_L) or (J_B, ∗, J_L)}.

Definition 3 (T & T_v). Let T denote the set of valid tokenizations. For v ∈ V, let T_v ⊂ T be the tokenizations with byte boundaries defined by v.

Corollary 1.
There are |T| = |V| × |T_v| = 577 × 2^{63−7} ≈ 4.2E19 valid tokenizations.

Proof. |{J_B, C} × {J_B, J_L, C}^6 × {J_L, C}| = 2 × 3^6 × 2 = 2916, and removing subsequences of the form (J_B, J_L) or (J_B, ∗, J_L) leaves 577. |T_v| = 2^{63−7}, as the remaining 63 − 7 = 56 bit gaps have two valid options, cut or join.
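Corollary 1's count of |V| can be checked by brute force over the constraints of Prop. 1 / Remark 1; a minimal sketch (the single-letter labels 'B' = J_B, 'L' = J_L, 'C' = cut are our own shorthand):

```python
from itertools import product

def valid_byte_boundary_sets():
    """Yield v in {J_B, J_L, C}^8 satisfying Prop. 1's constraints:
    v(0) != J_L, v(7) != J_B, and no (J_B, J_L) or (J_B, *, J_L) patterns."""
    for v in product("BLC", repeat=8):
        if v[0] == "L" or v[7] == "B":
            continue
        # forbid J_L within two positions to the right of a J_B
        if any(v[j] == "B" and "L" in v[j + 1:j + 3] for j in range(8)):
            continue
        yield v

n_valid = sum(1 for _ in valid_byte_boundary_sets())
print(n_valid)                   # 577
print(n_valid * 2 ** (63 - 7))   # |T| ≈ 4.2e19
```

The 577 valid byte-boundary assignments, times 2^56 choices for the interior gaps, give the full space of valid tokenizations that Step 2 searches.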
2) Optimization Formulation:
Step 1 provisions f(i | E(j_i)) = P(cut to the right of bit i, for endianness E(j_i)), with j_i = ⌊i/8⌋ the corresponding byte index for bit i. We set f(i | e) = ∞ if bit i is to the left of a mandatory cut, e.g., if the next bit is a constant bit. For intuition in the formulation below, consider f(i | E(j_i)) not as the likelihood of a cut, but as the penalty for not cutting, and let β be a fixed cut penalty parameter.
The idea for our cost function is to let signals accrue a join penalty, the sum of the probabilities f(i | E(j_i)) for each bit that is not cut in order to form the signal. Since the candidate signal entails a cut to the right of its LSB, we swap f(LSB, E(j_i)) for β, the cut penalty. Thus, β controls how liberal to be with cuts.
The intuition is to find the optimal balance between partitioning the message into too many signals and joining multiple disparate signals, by balancing the cut penalty (β) with the likelihood of a cut (join penalty f). Setting β = 1 will lead to cutting only where f(i | ·) = ∞ (signals demarcated by constant bits), and β = 0 will lead to a cut at every gap, resulting in 64 1-bit signals.

Definition 4 (Costs). Define the signal cost as
φ(I, E) := Σ_{i ∈ I \ {LSB}} f(i | E(j_i))  [join penalty]  +  β  [cut penalty].
Extending to a tokenization cost, we have
Φ(T) := Σ_{I ∈ T} φ(I, E)
      = Σ_{χ_T(i)=0} f(i | E(j_i)) + Σ_{χ_T(i)=1} β
      = Σ_{i=0}^{63} (1 − χ_T(i)) f(i | E(j_i)) + χ_T(i) β,
with χ_T(i) = 1 if i is an LSB of a token in T, else 0.

The above definition sets up our optimization problem: identify the optimal tokenization
T* := arg min_{T ∈ T} Φ(T).    (2)

Example 3.
To give a concrete example of using the cost function, consider the first two diagrams in Fig. 5, depicting the big endian probabilities f(· | E = B) (left) and the little endian probabilities f(· | E = L) (middle). Consider two overlapping 11-bit candidate signals that both contain byte 4 (bits 32 to 39 as numbered in the right plot): a big endian signal I_1 = (29, …, 39), and a little endian signal I_2 = (32, …, 39, 24, …, 26). The penalties for these candidate signals are φ_{β,f}(I_1, B) = 1.73 − 0.76 + β = 0.97 + β, and φ_{β,f}(I_2, L) = 0 + β = β. Since clearly 0.97 + β > β, (I_2, L) has the lower penalty, in this case regardless of the choice of β. In fact, (I_2, L) turns out to be in the globally optimal T*, which is shown in Fig. 5 (right) in teal.

3) Finding an Optimum: Given a cut penalty β ∈ [0, 1] and pre-computed cut probabilities—f(i | E(j_i)) for all i ∈ {0, …, 63} and both endiannesses E(j_i) (see Step 1, Sec. III-A)—our goal is to identify an optimal tokenization (Eq. 2) from the ≈ 4.2E19 valid options.

Theorem 1.
Fixing v ∈ V, where v gives the cuts/joins at byte boundaries (bits in S = {8(j + 1) − 1}_{j=0}^{7}), the subproblem
arg min_{T ∈ T_v} Φ_{β,f}(T)
is realized by T*_v, the tokenization in which, for all i ∈ [0, 63] \ S, bit i is an LSB (cut to the right of bit i) iff β < f(i | E(j_i)).

Proof. Let T*_v be as above and T ∈ T_v. By definition, for i ∉ S, T*_v will accrue cost min(f(i | E(j_i)), β). Since T, T*_v ∈ T_v, both accrue the same cost for bits i ∈ S. It follows that Φ(T) − Φ(T*_v) = Σ_{i ∉ S} [(1 − χ_T(i)) f(i | E(j_i)) + χ_T(i) β − min(f(i | E(j_i)), β)] ≥ 0.

Fig. 5: Probabilities of boundaries according to big endian ordering (left) and little endian ordering (middle). The resulting optimal tokenization (right) is three little endian (navy, blue, teal), one big endian (snot), and a 4-bit (maroon) signal.

This gives an efficient, constant-time search algorithm (689 operations): (1) store the optimal cut/join choice for each bit i ∈ [0, 63] \ S under each endianness (56 × 2 = 112 operations), then (2) apply Thm. 1 to realize both T*_v and cost Φ(T*_v) for each of the 577 v ∈ V, maintaining the minimum. In the case that there are multiple optimal tokenizations, we break ties by choosing the one with the maximum number of cuts, followed by the minimum number of little endian signals, which necessarily furnishes a unique optimal solution.
After experimenting with adjusting the tuning parameter β, we find that a range of β values yields fairly consistent and correct tokenizations, and we fix a value in this range for our pipeline. Note that the heuristic classifiers in Step 1 provide probabilities in {0, 1}, meaning all choices of β yield identical results. Further, note that with binary inputs a tie-break scheme is often necessary, whereas with high-precision probability inputs, multiple optimal tokenizations with the same cost are virtually impossible.
The outputs of the endianness optimizer described in this step are tokenized signals.
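The per-v search of Thm. 1 can be sketched in brute-force form. This is not the authors' constant-time implementation: it recomputes interior costs for each of the 577 valid v, resolves bytes whose endianness is unforced by simply taking the cheaper order, and omits the tie-breaking scheme. The illustrative β value and helper names are our own:

```python
from itertools import product

BETA = 0.05   # illustrative cut penalty

def valid_v():
    # Remark 1's constraints: 'B' = J_B, 'L' = J_L, 'C' = cut
    for v in product("BLC", repeat=8):
        if v[0] == "L" or v[7] == "B":
            continue
        if any(v[j] == "B" and "L" in v[j + 1:j + 3] for j in range(8)):
            continue
        yield v

def forced_endianness(v):
    # J_B at boundary j forces bytes j, j+1 big; J_L forces bytes j-1, j little.
    e = [None] * 8
    for j, vj in enumerate(v):
        if vj == "B":
            e[j] = e[min(j + 1, 7)] = "B"
        elif vj == "L":
            e[max(j - 1, 0)] = e[j] = "L"
    return e

def optimal_tokenization_cost(f_big, f_lit, beta=BETA):
    """Min over v of: interior bits pay min(f, beta) (Thm. 1), boundary bits
    pay beta on a cut or the join penalty f under the dictated endianness."""
    best = float("inf")
    for v in valid_v():
        e = forced_endianness(v)
        cost = 0.0
        for j in range(8):
            interior = lambda f: sum(min(f[i], beta)
                                     for i in range(8 * j, 8 * j + 7))
            if e[j] == "B":
                cost += interior(f_big)
            elif e[j] == "L":
                cost += interior(f_lit)
            else:  # endianness unconstrained: take the cheaper byte order
                cost += min(interior(f_big), interior(f_lit))
            b = 8 * j + 7  # byte-boundary bit
            if v[j] == "C":
                cost += beta
            else:
                cost += (f_big if v[j] == "B" else f_lit)[b]
        best = min(best, cost)
    return best
```

With uniformly high cut probabilities under both orders, the minimum is achieved by cutting everywhere; when the little endian probabilities strongly favor joins, the search instead selects little endian joins at every admissible byte boundary.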
While in theory another endiannessoptimizer could be developed and exchanged for this compo-nent, we consider this custom optimization to be a fixed andnon-interchangeable component of the pipeline. C. Step 3: Signedness Classification
A signedness classifier takes a tokenized signal (start bit, length, endianness) and makes a binary decision on whether each signal of length greater than two is signed (using two's complement encoding) or unsigned. To develop our classifier, we followed a workflow similar to Step 1 (Sec. III-A), experimenting with supervised classifiers and unsupervised heuristics. Since each signal is tokenized, and thus the LSBs and MSBs are now known, this problem is significantly simpler, and features can be developed per signal rather than per bit. However, after experimenting with several features and supervised classification methods, we find that a simple heuristic based on the distribution of the two most significant bits of the signal yielded better results than the supervised methods. Using this heuristic, described in Alg. 2, we obtain almost perfect classification, so ultimately we chose to use this heuristic in the CAN-D pipeline rather than a learned model.
The heuristic is based on how the two most significant bits behave if the signal is signed or unsigned. Let B_{i_1}, B_{i_2} denote the MSB and next-most significant bit of the signal. First, consider the probabilities of the center values, P[(B_{i_1}, B_{i_2}) = (1, 0)] and P[(B_{i_1}, B_{i_2}) = (0, 1)]. If a signal

Algorithm 2: Heuristic Signedness Classifier
Inputs: {B_{i_1}(t), B_{i_2}(t)}_t, γ
if P[(B_{i_1}, B_{i_2}) = (1, 0)] = P[(B_{i_1}, B_{i_2}) = (0, 1)] = 0 then
    return True
if P[(B_{i_1}(t_j), B_{i_2}(t_j)) = (0, 0) ∧ (B_{i_1}(t_{j+1}), B_{i_2}(t_{j+1})) = (1, 1)] = 0 then
    return False
if P[(B_{i_1}, B_{i_2}) = (1, 0)] + P[(B_{i_1}, B_{i_2}) = (0, 1)] < γ then
    return True
return False
is signed, for values close to zero, (B_{i_1}, B_{i_2}) will be (0, 0) (small positives) or (1, 1) (small negatives), whereas values near the extremes will be (1, 0) (near min) or (0, 1) (near max). A signal with a small probability of these values is therefore likely signed. Second, consider the probability of a jump between extreme values, P[(B_{i_1}(t_j), B_{i_2}(t_j)) = (0, 0) ∧ (B_{i_1}(t_{j+1}), B_{i_2}(t_{j+1})) = (1, 1)]. If a signal is signed, when changing from small positive to small negative values, the two MSBs must flip from (0, 0) to (1, 1). However, if it is unsigned, this is unlikely to ever happen, since it would entail flipping from a very small value to a large one, resulting in a significant discontinuity. If this probability is 0, the signal is likely unsigned. We apply these two ideas as described in Alg. 2, where we set γ to a small threshold based on observations of data.
After Step 3, signedness classification, each ID's 64-bit message is partitioned into signals for which we know the start bits, lengths, endianness, and signedness; consequently, each signal can now be translated into a timeseries of integers, denoted s(t). No previous works have attempted signedness classification, so the signedness classifier presented in this section is currently the sole option for this modular component.

D. Step 4: Physical Interpretation
For our signal-to-timeseries matcher, we follow Verma et al.'s ACTT [6] to match a subset of the translated signals with diagnostic data. This augments matched signals with the information necessary to interpret them as actual measurements in the vehicle. We do this by comparing each signal time series, s(t), to each DID trace, D(t′), and determining if they are linearly related. Because the DID traces are sampled at a lower rate than normal CAN traffic, we interpolate the signal values over the diagnostic timepoints, obtaining s(t′). We then regress D(t′) onto s(t′) and find the best linear fit, furnishing the coefficients a, b so that s̄(t′) := a·s(t′) + b ≈ D(t′). To score the model's fit, we use the coefficient of determination, R², which measures the fraction of total variation in time series D(t′) that is explained by s̄(t′); thus, R² = 1 exhibits a perfect fit, while R² = 0 exhibits the fit of a horizontal line (assuming D(t′) is not a horizontal line). For each signal s, we find the diagnostic D that yields the highest R² value. If R² > δ, where δ ∈ [0, 1] is a tuning threshold, s is matched to D. Setting δ = 1 will return only perfectly correlated signals, while a small δ will allow less correlated signals to be matched. For our implementation, we choose a fixed, high value of δ.
For signals that match a diagnostic, we have interpretation, having procured the label and units, as well as the scale, a, and offset, b. In addition to ACTT [6], LibreCAN [7] (Phase 2) proposes a signal-to-timeseries matching algorithm that could be used interchangeably (or even combined) for this component. Finally, note that translated signals that are not augmented with labels through this physical interpretation step are still highly valuable, as there are many applications in which these unlabeled translated timeseries are far more useful than binary data.

IV. DATASET
As our goal is to build a vehicle-agnostic signal-extraction capability, we have collected CAN data from ten different vehicle makes, with years ranging from 2010 to 2017, for training and evaluation. The details of defined signals for each log are described in Table IV. This dataset is far larger and more varied than that of any previous work. Notably, in order to test generalizability of the methods, no duplicate makes were included, as different models of the same make (e.g., Toyota Camry and Corolla) have similar characteristics.
In order to obtain data for our signal reverse engineering process (bit position, endianness, and signedness), we used DBCs acquired from two sources.

TABLE IV: Statistics on ten CAN logs, each collected from a vehicle of a different make. For each log, we enumerate: non-constant IDs (IDs), non-constant IDs defined by CommaAI (Def. IDs), and each of the encodings of defined signals (big/little endian, signed/unsigned) resulting from the ground-truth labeling process (see Sec. IV). Three logs contain a high percentage of little endian signals, and all but one contain signed signals.

Log | IDs | Def. IDs | Unsigned B.E. | Signed | L.E. | Total
 –  | 54  | 17       | 61            | 3      | 25   | 89

Non-constant IDs: IDs with more than one non-constant bit.
One log's vehicle adheres to the J1939 standard protocol [35], and its signal definitions are derived from this open standard.

V. EVALUATION
Using the dataset described in Sec. IV, we compare our algorithms, with both the heuristic and machine learning (ML) versions of Step 1, against the following predecessors: TANG [4], READ [5], ACTT [6], and LibreCAN (Phase 0) [7]. See Sec. II for a description of each algorithm. Note that we do not test the algorithm proposed by Markowitz & Wool [2] because it was tested by READ and shown to produce far inferior results. We also test against a 'Baseline' method that simply uses constant bits as signal boundaries and assumes big endian, unsigned encodings. This represents the accuracy scores obtained by simply identifying the obvious boundaries.
We quantitatively compare the tokenization and translation (Steps 1-3) efforts of each of these methods in the following section, Sec. V-A. We note that READ and LibreCAN make efforts to categorize signals, which is an added benefit of these methods over ours, but we do not evaluate the efficacy of their categorization algorithms. We also do not quantitatively evaluate the interpretation (Step 4) efforts by ACTT, LibreCAN, or CAN-D because, as pointed out by Pesé et al. [7], ground-truth interpretations are highly subjective and difficult to evaluate quantitatively. Instead, we offer a qualitative comparison of the full decoding efforts in Sec. V-C and Fig. 6, which includes the supplemental interpretations given by CAN-D. Note that ACTT's interpretation is virtually identical to CAN-D's, and LibreCAN's requires an extra tool to obtain body-related labeled timeseries, so we did not attempt to perform their interpretation methods.

Footnotes:
- CAN-D Heuristic: parameters α_1, α_2, β (though irrelevant with binary Step 2 inputs), γ, and δ as chosen in Sec. III.
- CAN-D ML: Step 1 with the tuned RF model found in Sec. III-A; β, γ, δ as chosen in Sec. III.
- TANG and ACTT incorrectly considered reverse bit ordering. We only consider forward bit ordering for these two methods.
- For ACTT's R² threshold, we used a high value. For the 5/10 logs tested that contained no diagnostic packets, this method is equivalent to Baseline.
- The LibreCAN authors state that the optimal choice for the parameter T_p (percent decrease of bit flip rates) was between .01 and .02 depending on the vehicle. The authors likely meant a larger threshold, because a threshold of 1% or 2% would lead to (and we verified this) a very high false positive rate. For the results reported, a larger T_p was used, resulting in much higher F-scores.

A. Signal Boundary Classification Evaluation
We first quantitatively evaluate the signal boundary classification algorithms of each method using three test sets that differ in the number of positive labels (detailed in Table V). The condensed (c) set uses all positive labels (boundaries) in condensed traces (constant bits removed), thus increasing the number of non-obvious positive labels and decreasing class bias, resulting in the most robust evaluation set for testing and comparing the efficacy of signal boundary classification algorithms.

TABLE V: Positive labels in each test set; 5784 negative labels in all sets.
Set | n   | %
c   | 834 | 13
f−  | 208 | 3
f+  | –   | –

However, "full" non-condensed traces give a more accurate representation of the distribution of labels and the most realistic positive samples. In the full (f+) set, all non-constant samples are scored (including obvious examples of LSBs abutting constant bits/message ends). This f+ set is the most representative and will yield the most realistic metrics for the total signals that could be extracted using a given method. Finally, in the full non-obvious set (f−), only non-obvious examples (those not abutting constant bits) are scored. This test set has very few positive labels (3%), but unlike (c), all are boundaries that delimit two adjacent signals in actual data, and unlike (f+), it will not suffer score inflation from obvious boundaries not attributable to the algorithm being scored. The f− set gives a balance of realism in use without the inflation of metrics from the obvious boundaries.
The classification F1-score, Precision, and Recall under each scenario are reported in Table VI (Top). Recall that since little endian signals are split on the byte boundary into two big endian signals for labeling, we are testing solely the efficacy of the signal boundary classification methods without taking endianness into account, and thus not penalizing other algorithms for the limiting assumption of big endianness. Also note that since CAN-D is supervised, reported metrics are from aggregating results from LOOCV per log.

B. Signal Error Evaluation
Second, we compare the full tokenization and translation efforts of each method, computing the ℓ¹ error between the translated signals and their corresponding ground-truth signals. See results in Table VI. The motivation for this evaluation is that ultimately, the goal of all of these methods is to extract time series that can be used as actual real-time measurements from systems in the car. Therefore, the most important metric for measuring the efficacy of these methods is not how many bits overlap or the number of boundaries correctly classified (as described above), but the difference between the values of the extracted signal's time series and the true signal's time series. All previous methods assume big endian, unsigned signals; consequently, once signal boundaries are assigned, the translated signal values are completely determined, and this is what is used for this second evaluation. For CAN-D, Steps 2-3 (endianness optimization & signedness classification) provide the remaining tokenization and translation information.

We compute the score for each log as follows. Let S denote the set of normalized true signals and Ŝ the set of normalized predicted signals (all taking values in [0,1]) for a CAN log. Let η : S → Ŝ so that for each true signal s, η(s) is the predicted signal that contains the MSB of s. Any predicted signals that are left unmatched (ŝ ∈ Ŝ ∖ η(S)) are paired with the zero vector. Take the normalized ℓ¹ difference between each signal pair, resulting in a signal error between 0 and 1. The mean signal error for the log is defined as

( Σ_{s∈S} ‖s − η(s)‖ + Σ_{ŝ∈Ŝ∖η(S)} ‖ŝ‖ ) / ( |S| + |Ŝ∖η(S)| )    (3)

where ‖s‖ := Σ_{t=1}^{n_id} |s(t)| / n_id.

[Table VI: signal boundary classification metrics (F, P, R under test scenarios c, f−, f+) and mean ℓ¹ signal error per CAN log for Baseline, TANG, ACTT, READ, LibreCAN, CAN-D Heuristic, and CAN-D ML.]

TABLE VI: Top: Comparison of signal boundary classification results. F = F-Score, P = Precision, R = Recall. We test each method using the three test scenarios, denoted in the third column and described in Sec. V-A. "Baseline" identifies only obvious signal boundaries at constant bits, which trivially has perfect precision. CAN-D ML achieves the highest F-Score and Recall, while the Heuristic exhibits the best Precision for all sets. Both exhibit a ∼10% improvement in Recall over all previous methods in the two difficult test sets (c, f−). We do not evaluate ACTT under scenario (c) since it relies heavily on constant bits to shrink the search space. Bottom: Mean ℓ¹ error of translated signal values (Eq. 3) reported for each CAN log, a ∼50% decrease in error from other methods. Finally, note that while CAN-D ML has slightly higher average error than CAN-D Heuristic (due mostly to worse Precision in Step 1), it has lower error for all logs containing little endian signals.

C. Qualitative Results
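The per-log error metric of Eq. (3) in Sec. V-B above is straightforward to compute. A minimal numpy sketch follows; the dictionary-based signal storage, function names, and min-max normalization are our assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def normalize(x):
    """Min-max scale a signal's time series into [0, 1] (constant signals -> 0)."""
    lo, hi = float(x.min()), float(x.max())
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x, dtype=float)

def mean_signal_error(true_sigs, pred_sigs, eta):
    """Eq. (3): mean normalized l1 error between true and predicted signals.

    true_sigs : dict name -> np.ndarray, the normalized true signals S
    pred_sigs : dict name -> np.ndarray, the normalized predicted signals S-hat
    eta       : dict pairing each true signal with the predicted signal
                containing its MSB (the map eta in the text)
    """
    # ||s|| := sum_t |s(t)| / n_id  (normalized l1 norm over the log's samples)
    norm = lambda s: float(np.abs(s).sum()) / len(s)
    matched = set(eta.values())
    # matched pairs contribute ||s - eta(s)||
    err = sum(norm(true_sigs[s] - pred_sigs[eta[s]]) for s in true_sigs)
    # unmatched predicted signals are paired with the zero vector
    err += sum(norm(pred_sigs[p]) for p in pred_sigs if p not in matched)
    # mean over all |S| + |S-hat \ eta(S)| signal pairs
    return err / (len(true_sigs) + len(pred_sigs) - len(matched))
```

Because every per-pair term lies in [0, 1], the resulting mean signal error is also bounded in [0, 1], which makes errors comparable across logs with different numbers of signals.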
Fig. 6 depicts three examples of messages decoded by CAN-D (identical decodings for the ML and heuristic signal boundary classification) and by the most accurate competing methods (READ and LibreCAN, which both produce the same signal boundary predictions for these examples), with detailed descriptions and discussion. These examples illustrate a message with: signed and unsigned signals (top), little endian unsigned signals (bottom left), and little endian (signed and unsigned) signals (bottom right). CAN-D correctly tokenizes and translates all examples, overall furnishing interpretable time series. Where available, CAN-D's physical interpretation (Step 4, Sec. III-D) is provided in annotations above signals, showing the R value to gauge goodness-of-match. Overall, mis-tokenization and mis-translation by other methods result in rampant discontinuities and dramatic error in most time series, exhibiting the necessity of correctly identifying each signal's endianness and signedness.

VI. PROTOTYPE OBD-II PLUGIN
Fig. 7: Prototype CAN-D device using a Raspberry Pi and CANBerry Dual 2.1 boards
The CAN-D Prototype Device is a vehicle-agnostic, OBD-II (on-board diagnostics) plugin that collects CAN data from the vehicle and runs the entire CAN-D pipeline depicted in Fig. 1. The prototype (shown in Fig. 7) is built using Linux-based, single-board computers. Specifically, we use a Raspberry Pi 3B+ with Raspbian Buster in conjunction with an IndustrialBerry CANBerry Dual 2.1 [37]. The Raspberry Pi 3B+ offers 1 GB of RAM and a 1.4 GHz ARMv8 processor. The device is powered either from battery or using on-board power from a vehicle's 12-volt system.

One challenge of building a vehicle-agnostic prototype is that the bitrate of the CAN is unknown and varies per vehicle, and improper bitrate selection can cause adverse vehicle function. To solve this issue, the device iterates through common bitrates, identifying the bitrate that results in only expected packets. This allows our prototype to be compatible with most CANs regardless of bitrate.

Another complication is that automobiles typically have multiple CAN buses, and often more than one is available from the OBD-II interface. The prototype analyzes two unique networks by allocating a dedicated CAN controller for each using the CANBerry Dual 2.1. Once connected, it automatically …

Fig. 6: Tokenization & translation of three messages by CAN-D and top competing methods, READ & LibreCAN. When interpretation is provided by CAN-D, the label and units of the matched diagnostic are shown with the R value, and the values are scaled appropriately.

(a) Message containing signed and unsigned engine- and pedal-related signals. Left: Signal boundaries and endianness are correctly identified by all methods. Middle: All signals are correctly translated and have physical interpretations by CAN-D. Highly correlated matches are found for green, blue, and maroon signals. The navy signal at bit 4, matched to DID 'Accelerator pedal position D' with low correlation (R = . ), is likely an accelerator indicator. As this is not an available DID, CAN-D has unearthed information that could not be simply queried. Right: Other methods incorrectly translate the green and blue signals as unsigned, resulting in sharp discontinuities where the signals change sign.

(b) Message containing four wheel speeds encoded as little endian signals. Top: Correct tokenization & translation by CAN-D and match to "Vehicle Speed" DID with R = 1. Bottom: Mis-tokenized as five big endian signals by other methods, with MSBs (bits 13-15, 29-31, and 45-47) attributed to the wrong signals. Since all encode speed, the blue, green, and orange signals appear correct, save some minor discontinuities. However, these signals encode the wheel speeds and are often used by Electronic Stability Control to stimulate anti-lock braking and traction control pending discrepancies in wheel speeds; hence, mixing the MSBs of wheel speeds may go unnoticed in normal conditions but prove consequential in adverse driving conditions!

(c) Message containing four steering-related, little endian signals, three of which are signed. Top: Correct tokenization & translation by CAN-D (no interpretation). Bottom: Incorrect tokenization & translation by other methods. Assuming big endian signals, they are forced to cut on most byte boundaries, resulting in truncated, noisy teal, snot, orange, and maroon signals. The navy signal does not appear noisy, but is noticeably incorrect when comparing the scale and the values for t ∈ [0, …] to the correct CAN-D translation. The two MSBs are misattributed to the next signal, resulting in errors of at least … when the MSB(s) are nonzero.

VII. CONCLUSION
We consider the problem of developing a vehicle-agnostic method for extracting the hidden signals in automotive CAN data, and present a comprehensive survey of this area. We present CAN-D, a four-step, modular pipeline using a combination of machine learning, a novel optimization process, and heuristics to identify and correctly translate signals in CAN data to their numerical time series. In particular, CAN-D is designed to extract big and little endian signals as well as signed and unsigned signals. While this greatly increases the complexity of the problem, these are necessary accommodations as specified by standard signal definitions. As our results exhibit, when endianness and signedness are ignored, the resulting translations are incorrect and overly noisy. In evaluation on ten diverse vehicles' data, we compare CAN-D to the four state-of-the-art methods, providing a comparative study of previous methods on a more comprehensive dataset than ever previously used. We achieve less than 20% of the average error of other methods and establish that CAN-D is the lone method that can handle any standard CAN signal. Finally, we present a lightweight hardware implementation for using CAN-D in situ via an OBD-II connection to first learn a vehicle's signals and, in future drives, convert raw CAN data to multivariate time series in real time. As CAN signals provide a rich source of real-time data that is currently unrealized, we hope this contribution will facilitate many vehicle technology developments.

ACKNOWLEDGEMENTS
Special thanks to Bill Kay for helpful comments. Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy (DOE) and by the DOE, Office of Science, Office of Workforce Development for Teachers and Scientists (WDTS) under the Science Undergraduate Laboratory Internship (SULI) program.

REFERENCES

[1] Jaynes, M. et al. (2016) Automating ECU Identification for Vehicle Security. In ICMLA, IEEE.
[2] Markovitz, M. and Wool, A. (2017) Field classification, modeling and anomaly detection in unknown CAN bus networks. Vehicular Communications.
[3] Huybrechts, T. et al. (2017) Automatic reverse engineering of CAN bus data using machine learning techniques. In .
[4] Nolan, B. C. et al. (2018) Unsupervised time series extraction from controller area network payloads. In VTC Fall, IEEE.
[5] Marchetti, M. and Stabili, D. (2019) READ: Reverse Engineering of Automotive Data Frames. IEEE Transactions on Information Forensics and Security, (4).
[6] Verma, M. E. et al. (2018) ACTT: Automotive CAN Tokenization & Translation. In CSCI, IEEE.
[7] Pesé, M. D. et al. (2019) LibreCAN: Automated CAN Message Translator. In SIGSAC CCS, ACM.
[8] Young, C. et al. (2020) Towards Reverse Engineering Controller Area Network Messages Using Machine Learning. In IEEE WF-IoT, IEEE.
[9] Automotive buses. https://training.dewesoft.com/online/course/automotive-buses-can-measurement.
[10] Bosch GmbH, R. (1991) CAN Specification Version 2.0.
[11] Provencher, H. (2012) Controller Area Networks For Vehicles. In Seminar Course ENGR G, Vol. 5003.
[12] Endianness. https://en.wikipedia.org/wiki/Endianness (Nov, 2019) Wikipedia.
[13] Hackaday: CAN Hacking. https://hackaday.com/2013/10/22/can-hacking-the-in-vehicle-network/.
[14] Hooovahh's Blog: CAN Part 5 - Signal API. http://hooovahh.blogspot.com/2017/05/can-part-5-signal-api.html.
[15] Two's complement. https://en.wikipedia.org/wiki/Two%27s_complement (Nov, 2019) Wikipedia.
[16] Unified Diagnostic Services. https://en.wikipedia.org/wiki/Unified_Diagnostic_Services (Nov, 2019) Wikipedia.
[17] OBD-II PIDs. https://en.wikipedia.org/wiki/OBD-II_PIDs (Oct, 2018) Wikipedia.
[18] Smith, C. (2016) The Car Hacker's Handbook: A Guide for the Penetration Tester, No Starch Press.
[19] Checkoway, S. et al. (2011) Comprehensive experimental analyses of automotive attack surfaces. In USENIX Sec., Vol. 4.
[20] Koscher, K. et al. (2010) Experimental Security Analysis of a Modern Automobile. In IEEE.
[21] Miller, C. and Valasek, C. (2014) Adventures in Automotive Networks and Control Units.
[22] Miller, C. and Valasek, C. Remote exploitation of an unaltered passenger vehicle. Black Hat USA, 91.
[23] Lokman, S.-F. et al. Intrusion detection system for automotive Controller Area Network (CAN) bus system: a review. EURASIP Journal on Wireless Communications & Networking, (1).
[24] Wu, W. et al. (2019) A Survey of Intrusion Detection for In-Vehicle Networks. IEEE T-ITS.
[25] Moore, M. R. et al. (2017) Modeling inter-signal arrival times for accurate detection of CAN bus signal injection attacks. In CISRC, ACM.
[26] Lee, H. et al. (2017) OTIDS: A novel intrusion detection system for in-vehicle network by using remote frame. In PST, IEEE.
[27] Choi, W. et al. (2018) Identifying ECUs using inimitable characteristics of signals in controller area networks. IEEE Transactions on Vehicular Technology, (6).
[28] Tyree, Z. et al. (2018) Exploiting the Shape of CAN Data for In-Vehicle Intrusion Detection. In VTC Fall, IEEE.
[29] Pawelec, K. et al. (2019) Towards a CAN IDS Based on a Neural Network Data Field Predictor. In AutoSec, ACM.
[30] Taylor, A. et al. (2016) Anomaly detection in automobile control network data with long short-term memory networks. In Conf. on Data Science and Advanced Analytics, IEEE.
[31] Nair Narayanan, S. et al. (May, 2016) OBD SecureAlert: An Anomaly Detection System for Vehicles.
[32] Hanselmann, M. et al. (2020) CANet: An Unsupervised Intrusion Detection System for High Dimensional CAN Bus Data. IEEE Access.
[33] Enev, M. et al. (2016) Automobile driver fingerprinting. Proceedings on Privacy Enhancing Technologies, (1).
[34] Wakita, T. et al. (2006) Driver Identification Using Driving Behavior Signals. IEICE Trans. on Info & Systems.