CAN-D: A Modular Four-Step Pipeline for Comprehensively Decoding Controller Area Network Data
Miki E. Verma∗, Robert A. Bridges∗, Jordan J. Sosnowski†, Samuel C. Hollifield∗, Michael D. Iannacone∗
∗ Cyber & Applied Data Analytics Division, Oak Ridge National Laboratory, Oak Ridge, TN; {vermake, bridgesra, hollifieldsc, iannaconemd}@ornl.gov
† Department of Computer Science & Software Engineering, Auburn University; [email protected]
Abstract—Controller area networks (CANs) are a broadcast protocol for real-time communication of critical vehicle subsystems. Original equipment manufacturers (OEMs) of passenger vehicles hold secret their mappings of CAN data to vehicle signals, and these definitions vary per make, model, and year. Without these mappings, the wealth of real-time vehicle information hidden in the CAN packets is uninterpretable—severely impeding vehicle-related research including CAN cybersecurity and privacy studies, after-market tuning, efficiency and performance monitoring, and fault diagnosis, to name a few. Guided by the four-part CAN signal definition, we present CAN-D (CAN Decoder), a modular, four-step pipeline for identifying each signal's boundaries (start bit and length), endianness (byte ordering), and signedness (bit-to-integer encoding), and, by leveraging diagnostic standards, augmenting a subset of the extracted signals with meaningful, physical interpretation. En route to CAN-D, we provide a comprehensive review of the CAN signal reverse engineering research. All previous methods ignore endianness and signedness, rendering them simply incapable of decoding many standard CAN signal definitions. Incorporating endianness grows the search space from 128 to 4.72E21 signal tokenizations and introduces a web of changing dependencies. In response, we formulate, formally analyze, and provide an efficient solution to an optimization problem, allowing identification of the optimal set of signal boundaries and byte orderings. In addition, we provide two novel, state-of-the-art signal boundary classifiers (both superior to previous approaches in precision and recall in three different test scenarios) and the first signedness classification algorithm, which exhibits a >97% F-score. Overall, CAN-D is the only solution with the potential to extract any CAN signal and is the state of the art. In an evaluation on ten vehicles of different makes, CAN-D's average ℓ1 error is five times better (81% less) than that of all preceding methods, and it exhibits lower average error even when considering only signals that meet prior methods' assumptions. Finally, CAN-D is implemented in lightweight hardware, allowing an OBD-II plugin for real-time, in-vehicle CAN decoding.

Index Terms—Controller Area Network (CAN); Reverse Engineering; Machine Learning; Security; Privacy; Technology
I. INTRODUCTION & BACKGROUND
Modern automobiles rely on communication of several electronic control units (ECUs) (internal computers) over a few controller area networks (CANs) and adhere to a fixed
This manuscript has been co-authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
CAN protocol. Sensor readings, such as accelerator pedal angle, brakes, fuel injection timing, and wheel speeds, as well as less important readings, such as radio settings, are all communicated as signals encoded in the CAN messages. For passenger vehicles, the encodings of these signals into CAN messages are proprietary—one can monitor (and send) CAN messages but generally cannot understand their meaning. Further, these encodings vary per make, model, year, even trim, and in practice, reverse engineering of signals is currently a tedious, per-vehicle effort. As CAN data is sent at a rapid rate and carries a wide variety of real-time vehicle information, a vehicle-agnostic solution for decoding CAN signals promises a vast resource of streaming, up-to-date information for analytics and technology development on any vehicle.
Each CAN message has up to 64 bits of data containing (usually) multiple signals (Figs. 2 & 3). Automotive CAN signals are characterized by four defining properties (discussed in detail in Sec. I): (1) signal boundaries (start/end bit), (2) endianness (byte order), (3) signedness (bit-to-integer encoding), and (4) physical interpretation. The signal definitions for each message (a message definition) are defined in the vehicle's CAN database file (the industry standard is Vector's .dbc
or "DBC" file format). We use this industry-standard, four-part signal definition to frame our understanding of previous works and to guide our approach.

Fig. 1: CAN-Decoder (CAN-D) pipeline: a four-step modular pipeline that takes a CAN log (capture of CAN data) as input and outputs a DBC with signal definitions, thus providing vehicle-agnostic CAN signal reverse engineering. Italicized processes outlined in dotted red lines indicate modular pieces that can be any algorithm satisfying the input/output requirements; descriptions of our choices for these pieces are provided. Greek letters α–δ denote tuning parameters (possibly) needed for Steps 1–4, respectively.
Step 1: For each message ID in a CAN log, a binary Signal Boundary Classifier outputs the likelihood of a signal boundary at each bit gap. We use either of two classifiers: supervised learning or a novel unsupervised heuristic.
Step 2: A custom endianness optimization algorithm takes the boundary probabilities as input and determines an optimal tokenization (signals' positions and endiannesses).
Step 3: A binary Signedness Classifier determines each signal's signedness, allowing translation of bits to values. We use a novel unsupervised heuristic for our classifier.
Step 4: A supplemental Signal-to-Timeseries Matcher matches signals to externally collected labeled timeseries, providing signal interpretation. We regress signals onto concurrently collected diagnostics.

TABLE I: Automotive CAN signal reverse engineering algorithms' coverage of the four signal properties (● = yes, ◐ = partial, ○ = no). CAN-D is the only comprehensive algorithm, determining all four properties.

                                        Boundary  Endianness  Signedness  Interpretation
Jaynes et al. (2016) [1]                   ○          ○           ○            ◐
Markowitz & Wool (2017) [2]                ●          ○           ○            ○
Huybrechts et al. (2017) [3]               ◐          ○           ○            ◐
Nolan et al.'s TANG (2018) [4]             ●          ○           ○            ○
Marchetti & Stabili's READ (2018) [5]      ●          ○           ○            ○
Verma et al.'s ACTT (2018) [6]             ●          ○           ○            ●
Pesé et al.'s LibreCAN (2019) [7]          ●          ○           ○            ●
Young et al. (2020) [8]                    ○          ○           ○            ◐
CAN-D                                      ●          ●           ●            ●
The goal is a vehicle-agnostic CAN decoder—to discover these four defining properties for each signal from CAN data from any vehicle, i.e., to reverse engineer the signal definitions in the vehicle's DBC.
Recently, the research community has focused on reverse engineering signals from automotive CAN data. This research is summarized in Related Works (Sec. II), and Table I catalogs each work's efforts in identifying the four defining signal characteristics. Notably, all current approaches focus only on identifying signal boundaries (1) and/or matching signals to observable sensor data (4), and they ignore endianness (2) and signedness (3), meaning they are unable to decode many standard CAN signals.
All previous works have developed and tested algorithms on limited CAN data, often from a single make. Targeting a vehicle-agnostic solution, we compile a much more varied collection of labeled CAN data from ten different makes (see Sec. IV). Equipped with this robust, labeled dataset for development and testing, we pursue the first comprehensive and most accurate signal reverse-engineering pipeline (see Fig. 1). Before describing our contributions, we introduce necessary background information.
Fig. 2: CAN 2.0 frame depicted [9]: Arbitration ID indexes the frame; Data Field carries message content up to 64 bits.
A. CAN Fundamentals & Notation
CAN 2.0 defines the physical and data link layers (OSI layers one and two) of a broadcast protocol [10]. In particular, it specifies the standardized CAN frame (or packet) format represented in Fig. 2. For semantic understanding of a CAN frame, only two components of the frame are necessary:
• Arbitration ID - an 11-bit header used to identify the frame, and for arbitration (determining frame priority when multiple ECUs concurrently transmit);
• Data Field/Message - up to 64 bits of content.
Each ID's data field comprises signals of varying lengths and encoding schemes packed into the 64 bits (see Fig. 3, left). A .dbc file provides the definitions of signals in the data field for each ID, thus defining each CAN message. CAN frames with the same ID (message header/index) are usually sent with a fixed frequency to communicate updated signal values, although some are aperiodic (triggered by an event); for example, one ID may occur every 0.1 s and another every 0.25 s. We partition CAN logs into
ID traces, the time series of 64-bit messages for each ID. An ID trace is denoted [B0(t), . . . , B63(t)]_t, a time-varying binary vector of length 64. Note that, without loss of generality, we assume each message is 64 bits by padding with 0 bits if necessary.
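For concreteness, partitioning a log into ID traces (with 0-padding) can be sketched in a few lines. In this Python sketch, the (timestamp, arbitration ID, hex payload) log format and the function name are illustrative assumptions, not part of the CAN standard:

```python
from collections import defaultdict

def id_traces(can_log):
    """Partition a CAN log into per-ID traces of 64-bit messages.

    Assumes `can_log` is an iterable of (timestamp, arbitration_id, data_hex)
    tuples (an illustrative log format). Data fields shorter than 64 bits are
    right-padded with 0 bits, per the w.l.o.g. convention above.
    """
    traces = defaultdict(list)
    for ts, arb_id, data_hex in can_log:
        bits = "".join(f"{int(nibble, 16):04b}" for nibble in data_hex)
        bits = bits.ljust(64, "0")  # pad to a 64-bit vector B0..B63
        traces[arb_id].append((ts, tuple(int(b) for b in bits)))
    return {i: sorted(msgs) for i, msgs in traces.items()}  # time-ordered

# Hypothetical three-frame log with two IDs:
log = [(0.00, 0x102, "DEAD"), (0.10, 0x102, "DEAF"), (0.05, 0x244, "01")]
traces = id_traces(log)
```

Each trace is then processed independently by the pipeline's per-ID steps.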
1) Byte Order (Endianness) & Bit Order: The significance of a signal's bits within a byte (contiguous 8-bit subsequence) decreases from left to right, i.e., the first bit transmitted is the most significant bit (MSB), and the last (eighth) bit is the least significant bit (LSB). This is defined in the CAN specification [10, 11] but has been misrepresented [7] and misunderstood [4, 6] by previous signal reverse engineering works. The confusion results from the use of both big endian and little endian byte orderings in CAN messages. Big endian (B.E.) indicates that the significance of bytes decreases from left to right, whereas little endian (L.E.) reverses the order of the bytes (but maintains the order of the bits within each byte) [12]. We list the bit orderings for a 64-bit data field under both endiannesses with parentheses demarcating the bytes [11]:

B.E.: (B0, . . . , B7), (B8, . . . , B15), . . . , (B56, . . . , B63)
L.E.: (B56, . . . , B63), (B48, . . . , B55), . . . , (B0, . . . , B7)    (1)

See Examples 1 & 2 for how this affects signal definitions.
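The two orderings of Eq. 1, restricted to the bytes a signal occupies, can be generated mechanically. A minimal Python sketch (the function name is ours):

```python
def msb_to_lsb_bit_indices(byte_idxs, little_endian):
    """Return the MSB-to-LSB ordering of bit positions B0..B63 for a signal
    occupying the given bytes (0-7), per Eq. 1: little endian reverses the
    byte order but keeps the bit order within each byte.
    """
    byte_order = list(reversed(byte_idxs)) if little_endian else list(byte_idxs)
    return [8 * b + i for b in byte_order for i in range(8)]

# A two-byte signal occupying bytes 0 and 1:
big = msb_to_lsb_bit_indices([0, 1], little_endian=False)    # B0..B15
little = msb_to_lsb_bit_indices([0, 1], little_endian=True)  # B8..B15, B0..B7
```

Note that for a signal contained in a single byte, the two orderings coincide, which is why endianness only matters for signals crossing byte boundaries.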
2) CAN Signals:
The specifications for decoding each ID's message into a set of signal values are defined by the OEM and held secret, usually stored in a DBC. Signal definitions consist of several properties (see Fig. 3, right) that detail how to:
tokenize (demarcate the signal's sequence of bits):
• Start bit and length give the signal's position in the data field;
• Byte ordering: if the signal crosses a byte boundary, little endian signals reverse the order of the bytes while big endian signals retain byte order (see Eq. 1);
translate (convert a sequence of bits to integers):
• Signedness: unsigned, the usual base-2 encoding, vs. signed, two's complement encoding [15];
interpret (linearly scale raw translated signal values to physically meaningful and interpretable information):
• Label and unit, giving the physical meaning of the signal and its units (e.g., speed in MPH);
• Scale and offset, which provide the linear mapping of the signal's tokenized values to the appropriate units.
It is implicit in the DBC signal definition that (non-constant) signals are contiguous sequences of non-constant bits.

Fig. 3: DBCs visualized through DBC editor GUIs. Left: a signal layout plot visually represents a CAN message tokenization, depicting an ID's 64-bit data field as an 8 × 8 array containing CAN signal(s). Each signal's constituent bits are shown in a unique color, and unused bits are shown in white. (CANdb++ Database Editor) [13] Right: signal definition of the first 16-bit yellow signal, defined by properties: start bit, length, signedness, endianness, scaling factor, offset, unit. (NI-XNET Database Editor) [14]
Example 1.
Consider in Fig. 3 the first two-byte yellow signal. To tokenize the signal, i.e., to know its sequence (implying order) of bits, we must know endianness. If bytes 1 & 2 are big endian, we obtain MSB-to-LSB bit indices I = (0, . . . , 15), whereas if they are little endian, the bytes are swapped, obtaining MSB-to-LSB bit indices I = (8, . . . , 15, 0, . . . , 7), notably with B8 now the MSB. Next, the signal's signedness furnishes the translation of that bit sequence to an integer. The information needed for interpretation is the label and unit of the signal (in this case Engine RPM) and the linear transformation to convert the translated values (a two-byte signal can take 2^16 = 65,536 values) to the appropriate physical value (e.g., a realistic RPM range).
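The tokenize/translate/interpret steps of Example 1 can be condensed into a short Python sketch. The frame contents, the 0.25 scale factor, and the restriction to byte-aligned signals are illustrative assumptions for brevity:

```python
def decode_signal(frame_bits, byte_idxs, little_endian, signed, scale, offset):
    """Tokenize, translate, and interpret one byte-aligned signal.

    frame_bits holds B0..B63 in transmission order. DBC start-bit conventions
    for signals starting mid-byte add bookkeeping but no new ideas.
    """
    # Tokenize: order the signal's bytes per its endianness (cf. Eq. 1).
    byte_order = reversed(byte_idxs) if little_endian else byte_idxs
    bit_seq = [frame_bits[8 * b + i] for b in byte_order for i in range(8)]
    # Translate: base-2 value, with a two's complement correction if signed.
    raw = int("".join(map(str, bit_seq)), 2)
    if signed and bit_seq[0] == 1:  # MSB set => negative in two's complement
        raw -= 1 << len(bit_seq)
    # Interpret: linear map from raw value to physical units.
    return scale * raw + offset

# Hypothetical frame: all zeros except bit B8 (the MSB of byte 1).
frame = [0] * 64
frame[8] = 1
# Read bytes 0-1 as one little endian, unsigned signal scaled by 0.25:
value = decode_signal(frame, [0, 1], little_endian=True, signed=False,
                      scale=0.25, offset=0.0)  # raw = 2**15, value = 8192.0
```

Flipping `little_endian` or `signed` on the same frame yields very different values, which is exactly why Steps 2 and 3 of the pipeline are necessary.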
Fig. 6 illustrates time series of CAN data that have been decoded using both correct and incorrect signal definitions. Fig. 6(a) plots green and blue CAN signals tokenized with correct (middle) vs. incorrect (right) signedness, and Fig. 6(b) plots CAN signals tokenized with correct (top) vs. incorrect (bottom; in particular, the navy signal) endianness. The clear discontinuities in these mis-tokenized and mis-translated signals exhibit the importance of knowing the endianness and signedness for extracting meaningful time series.
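The discontinuities of the Fig. 6(a) variety are easy to reproduce; a minimal Python sketch (the example series is hypothetical):

```python
def translate(bits, signed):
    """Bits (MSB first) to integer: base 2, or two's complement if signed."""
    raw = int("".join(map(str, bits)), 2)
    if signed and bits[0] == 1:
        raw -= 1 << len(bits)
    return raw

def to_bits(value, n=16):
    """Lowest n bits of `value`, MSB first (two's complement for negatives)."""
    return [(value >> (n - 1 - i)) & 1 for i in range(n)]

series = [2, 1, 0, -1, -2]  # a smooth signed quantity crossing zero
signed_view = [translate(to_bits(v & 0xFFFF), signed=True) for v in series]
unsigned_view = [translate(to_bits(v & 0xFFFF), signed=False) for v in series]
# signed_view   -> [2, 1, 0, -1, -2]        (smooth, as in the correct panel)
# unsigned_view -> [2, 1, 0, 65535, 65534]  (a Fig. 6-style discontinuity)
```

The mistranslated series jumps by nearly 2^16 at every sign change, which is the visual signature exploited when diagnosing signedness errors.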
3) On-board Diagnostics:
In the U.S., all vehicles sold after 1996 include an on-board diagnostics (OBD-II) port, which generally allows open access to automotive CANs, and emissions-producing vehicles sold after 2007 also include a mandatory, standard interrogation schema for extracting diagnostic data using the J1979 standard [16]. This on-board diagnostic service (OBD) is an application layer protocol in which one can query diagnostic data from the vehicle by sending a CAN frame. A CAN response is broadcast with the requested vehicular state information. There is a standard set of queries possibly available via this call-response protocol (e.g., accelerator pedal position, intake air temperature, vehicle speed) along with unit conversions, each corresponding to a unique diagnostic OBD-II PID (DID) [17]. Specific examples of how to perform the call and response are available, e.g., in [7, 18]. Previous CAN decoding works have iteratively sent DID requests and parsed the responses from CAN traffic to capture valuable, real-time, labeled vehicle data without using external sensors [3, 6, 7]. We denote these time series of diagnostic responses, or
DID traces, D(t). Inherent limitations exist—the set of available DIDs varies per make, and electric vehicles need not conform to this standard [6, 7].

B. Problem, Assumptions, & Challenges

1) Problem:
The goal is to recreate the .dbc file's signal definitions (i.e., discover the four properties for each signal) for any vehicle from a sufficient capture of the vehicle's CAN data.
2) Assumptions:
We make five fundamental assumptions:
(A0): Observed constant bits are unused.
(A1): Both big and little endian byte orders are possible.
(A1.a): Both endiannesses can occur in a single ID. We have not observed this, but it is permitted by the protocol and DBC syntax. DBC editor GUIs allow per-signal endianness specification with a checkbox or pull-down (e.g., Fig. 3, right), indicating that both byte orderings can co-occur in a message.
(A1.b): A single byte cannot have bits used in a little endian signal while also containing bits used in a big endian signal; else, the byte orders indicated by the signals are contradictory.
(A2): Signed signals are possible and are encoded using a two's complement encoding.
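Assumption (A2) can be made concrete in a few lines. The Python sketch below checks, for all 4-bit strings, the elementary fact that the unsigned and two's complement readings agree modulo 2^n yet disagree exactly when the MSB is set:

```python
def translate(bits, signed):
    """Translate an MSB-first bit tuple: base 2, or two's complement (A2)."""
    raw = int("".join(map(str, bits)), 2)
    if signed and bits[0] == 1:
        raw -= 1 << len(bits)
    return raw

n = 4
for x in range(2 ** n):
    bits = tuple((x >> (n - 1 - i)) & 1 for i in range(n))
    u = translate(bits, signed=False)
    s = translate(bits, signed=True)
    # The two encodings agree modulo 2^n ...
    assert u % (2 ** n) == s % (2 ** n)
    # ... but differ (by exactly 2^n) on the half of strings with MSB set.
    assert (u == s) == (bits[0] == 0)
```

This is precisely why signedness cannot be ignored: half of all possible bit strings decode to different integers under the two encodings.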
3) Challenges:
In practice, it is difficult to exercise the MSBs of a signal, resulting in errors in determining signal boundaries (a Step 1 challenge). For example, consider the two-byte (16-bit) Engine RPM signal of Example 1. As 5,000 RPM is rarely reached, the MSB of this signal is likely to be observed as a constant 0 bit, causing the signal start bit to be mislabeled. Though this is easily surmountable for RPM (e.g., rev the engine in neutral during collection), it is far more difficult to solve for latent sensors, e.g., engine temperature.
Secondly, since continuous signals are sampled periodically, those with high resolution (e.g., a two-byte signal has 2^16 = 65,536 values) have LSBs flipping seemingly randomly (a Step 1 challenge). Our results indicate that the TANG algorithm [4] suffers from the overly strict assumption that flip frequencies are monotonically decreasing with bit significance.
Thirdly, considering both big and little endianness greatly enhances the complexity of the problem, as bits on the byte boundaries have unknown neighbors (albeit from a fixed set of possibilities); e.g., simply comparing the bit flip probabilities of neighboring bits now requires custom rules for incorporating all possible neighbors according to (A1) and (A1.a) while removing impossibilities imposed by (A1.b) (a Step 2 challenge). See details in Sec. III-B.
Fourthly, considering both signed and unsigned encodings adds another hurdle; in particular, while the ordering of bit representations mod 2^n is the same for both signed and unsigned encodings, half the bit strings represent different integers (a Step 3 challenge).
Finally, many CAN signals communicate sensor values that are hard to measure with external sensors; hence, identifying the physical meaning, unit, and linear mapping (scale and offset) can be difficult (a Step 4 challenge).

C. Contributions
We make six contributions to the area of automotive CAN signal reverse engineering:
C1. Comprehensive signal reverse engineering pipeline:
Our primary contribution is a modular, four-part pipeline, depicted in Fig. 1, for learning all four components of a CAN signal definition. The pipeline is modular in that Step 1 can accommodate any signal boundary classification method; Step 3 can accommodate any signedness classification algorithm; and Step 4 can accommodate any signal-to-timeseries matching algorithm for physical interpretation. Instantiating our pipeline with our signal-boundary classification heuristic and (separately) our trained machine learning classifier for Step 1 and the diagnostic sensor matching of Verma et al. [6] for Step 4, we present a quantitative comparative evaluation of our signal reverse engineering pipeline versus previous methods. We demonstrate that CAN-D exhibits less than a fifth of the average error of all previous methods (Sec. V-B & Table VI, bottom), and we qualitatively illustrate the pitfalls and limitations of previous methods (Sec. V-C & Fig. 6) that our four-step pipeline circumvents. Overall, CAN-D is the first CAN signal reverse engineering effort that can accommodate all signals as defined in automotive DBC files, and it is by far more accurate than any previous effort. Further, it provides a framework for future research developments to improve and plug in advancements to each step.
C2. Introduction of two state-of-the-art signal boundary classification algorithms and comparative study of previous algorithms:
We develop two signal boundary classifiers, a supervised machine learning model and an unsupervised heuristic (Sec. III-A). We implement the previous state-of-the-art classification methods and provide the first quantitative comparison of all methods (Sec. V-A & Table VI, top) on a more comprehensive and robust data set than any previous work. We demonstrate that our algorithms are significantly more accurate than previous methods, superior in both recall and precision in three testing scenarios.
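For intuition, the general flavor of bit-flip-based boundary scoring underlying this line of work can be sketched as follows. This toy score (Python; the synthetic counter trace and the log-ratio score are illustrative only, not the classifiers of Sec. III-A) rewards large drops in flip probability, the signature of an LSB-to-MSB transition:

```python
import math

def flip_probabilities(trace):
    """Observed flip probability of each of the 64 bits across consecutive
    messages of one ID trace (a time-ordered list of 64-bit tuples)."""
    n = len(trace) - 1
    flips = [0] * 64
    for prev, cur in zip(trace, trace[1:]):
        for i in range(64):
            flips[i] += prev[i] ^ cur[i]
    return [f / n for f in flips]

def boundary_scores(probs, eps=1e-9):
    """Score the gap after bit i: a large drop in flip probability from bit i
    to bit i+1 suggests a boundary (illustrative heuristic only)."""
    return [math.log10((probs[i] + eps) / (probs[i + 1] + eps))
            for i in range(63)]

# Synthetic trace: a 4-bit counter in bits B4..B7; all other bits constant.
trace = [tuple((k >> (7 - i)) & 1 if 4 <= i <= 7 else 0 for i in range(64))
         for k in range(33)]
scores = boundary_scores(flip_probabilities(trace))
# The largest score lands at the gap after B7, the counter's LSB.
```

On real data, the MSB-exercise and high-resolution-LSB challenges above make such raw scores noisy, which is what motivates learned and more robust classifiers.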
C3. Endianness optimization formulation and solution:
All previous works are based on an assumption of big endian byte ordering (to perform tokenization and/or signal-to-timeseries matching), and there is no simple remediation for adapting the previous algorithms to perform correctly in the presence of both big and little (reverse byte order) endian signals. The second step of our pipeline presents a novel procedure crafted to use the predictions from any signal-boundary classification algorithm of Step 1 as input and determine the optimal set of endiannesses and signal boundaries from all possible tokenizations (Sec. III-B). We formulate an objective function to be optimized and provide a formal mathematical proof for reducing the search space to a very tractable grid search algorithm for optimization. Overall, this insight allows all signal-boundary classification algorithms to be leveraged for extracting both little and big endian signals—which has thus far been ignored and/or insurmountable.
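For intuition only, this kind of objective can be brute-forced on a tiny two-byte field. The Python sketch below enumerates byte orders and cut sets under a hypothetical pairwise boundary-probability function p(i, j) and a simple log-likelihood score; it is purely illustrative of the search problem, since the point of Sec. III-B is precisely to avoid this exponential enumeration on full 64-bit fields:

```python
import math
from itertools import combinations

def tokenization_score(order, cuts, p):
    """Log-likelihood that `cuts` (gap indices in the reordered bit sequence)
    are boundaries and every other adjacent pair is signal-internal."""
    s = 0.0
    for g in range(len(order) - 1):
        q = p(order[g], order[g + 1])
        s += math.log(q) if g in cuts else math.log(1.0 - q)
    return s

def best_tokenization(p):
    """Exhaustively search byte order x cut set for a 2-byte field."""
    orders = {"BE": list(range(16)), "LE": list(range(8, 16)) + list(range(8))}
    best = None
    for name, order in orders.items():
        for r in range(16):
            for cuts in combinations(range(15), r):
                cand = (tokenization_score(order, set(cuts), p), name, cuts)
                if best is None or cand > best:
                    best = cand
    return best

# Hypothetical classifier output: the only likely boundary is between the
# transmitted neighbors B7, B8 -- the signature of one 16-bit L.E. signal
# whose overall LSB is B7.
def p(i, j):
    return 0.9 if (i, j) == (7, 8) else 0.05

score, endianness, cuts = best_tokenization(p)  # -> "LE" with no cuts
```

Note how the L.E. reordering "explains away" the apparent boundary at the byte gap, illustrating the changing neighbor dependencies that make this search nontrivial.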
C4. Signedness classification:
We provide the first algorithm for determining signal signedness (bit-to-integer encoding) (Sec. III-C), allowing translation of signals to time series. Testing shows this simple heuristic achieves a >97% F-score.
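One plausible unsupervised heuristic in this spirit (a Python sketch, not necessarily the exact rule of Sec. III-C) prefers the signedness whose translated time series is smoother:

```python
def translate(bits, signed):
    """Bits (MSB first) to integer: base 2, or two's complement if signed."""
    raw = int("".join(map(str, bits)), 2)
    if signed and bits[0] == 1:
        raw -= 1 << len(bits)
    return raw

def classify_signedness(tokenized):
    """Label a tokenized signal (a time-ordered list of MSB-first bit tuples)
    by preferring the translation yielding the smoother time series."""
    def roughness(signed):
        vals = [translate(b, signed) for b in tokenized]
        return sum(abs(a - b) for a, b in zip(vals, vals[1:]))
    return "signed" if roughness(True) < roughness(False) else "unsigned"

def to_bits(v, n=8):
    return tuple((v >> (n - 1 - i)) & 1 for i in range(n))

torque = [3, 1, -2, -4, -1, 2]  # hypothetical signed sensor readings
label = classify_signedness([to_bits(v & 0xFF) for v in torque])  # "signed"
```

The idea mirrors Fig. 6(a): the wrong signedness produces near-2^n jumps at sign changes, which inflate the roughness of the incorrect translation.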
C5. Prototype OBD-II plugin for in-situ or offline use:
The pipeline can be run offline for post-drive analysis or during driving, e.g., to feed online analytics such as a CAN IDS with translated CAN data. We discuss our design and implementation of a lightweight on-board diagnostic (OBD-II) port plugin device (Sec. VI & Fig. 7) for use in any vehicle where a CAN is accessible via the OBD-II port (most vehicles). In a signal learning phase, the device automatically logs CAN data while periodically querying supported DIDs, and it then runs the algorithmic pipeline to learn signal definitions and write a DBC. This allows real-time decoding of CAN signals on future drives, e.g., to feed a novel analytic technology leveraging the vehicle's signals online, or offline uses, e.g., analyzing CAN captures in post-collection analysis. This prototype bridges the gap between the algorithmic research in the literature and actual online use with any vehicle.
C6. Survey:
We provide the first comprehensive survey of works on reverse engineering CAN signals (Sec. II & Table I), presenting the progression of the field and documenting the benefits and limitations of each.
D. Impact
Unveiling CAN signals will provide real-time measurements of vehicle subsystems, a rich stream of data that promises to fuel many vehicle technologies and put development and analytics in the hands of consumers (in addition to OEMs).
Multiple research works have, through direct and even remote access to CANs, managed to manipulate a few manually reverse engineered signals, manifesting in life-threatening effects—most notably, the remote Jeep hack of Miller & Valasek [19–22]. These works demonstrate that CAN reverse engineering is possible on a per-vehicle basis with ample effort and expertise, and obscurity will not inhibit the determined adversary. The obscurity of CAN data does, however, hinder the vulnerability analysis research necessary for hardening vehicle systems, and automated CAN reverse engineering will greatly expedite vehicle vulnerability research.
In parallel, CAN defensive security research is growing quickly; we found 15 surveys of the area since 2017, e.g., [23, 24], with over 60 works on CAN intrusion detection between 2016 and 2019. Yet these works are impeded by obfuscated CAN data, forced to use side-channel methods that ignore message contents [25–27], use black-box methods ignorant of message meanings [28–30], arduously reverse engineer a few signals for a specific vehicle [31], or rely on an OEM for signal definitions [32], which keeps CAN security in the OEMs' hands and develops per-make (not vehicle-agnostic) capabilities. A vehicle-agnostic CAN signal reverse engineering tool promises to remove these limitations and provide rich, online, time-series data for advancements in detection and other security technologies.
Further, this CAN signal decoding will promote universally applicable technologies to address cars currently on the road and remove reliance on the vehicle OEMs for CAN security.
Another emerging subfield of research is driver fingerprinting [33, 34], developing methods to identify drivers based on their driving characteristics, such as braking, accelerating, and steering. Access to decoded CAN data will allow these works to be ported to plugin technologies for nearly any vehicle, impacting at a minimum driver privacy and insurance strategies, and potentially forensic (e.g., criminal) investigations and vehicle security, to name a few.
In addition, access to CAN signals will potentially assist development of after-market tuning tools for enhanced efficiency and performance, fuel efficiency monitoring and guidance, fleet management, vehicle fault diagnosis, forensics technologies, and after-market vehicle-to-vehicle capabilities. As a final example, we note that after-market technologies to provide autonomous driving capabilities to current vehicles are appearing; in particular, Open Pilot (https://comma.ai/) provides latitudinal and longitudinal control for many vehicles on the road using a few, presumably manually reverse-engineered, CAN signals. Automated, accurate, and universally applicable CAN de-obfuscation will promote and expedite such vehicle technologies, especially after-market solutions for many vehicles currently in use.

II. CAN SIGNAL REVERSE ENGINEERING SURVEY
This section provides the first comprehensive survey of methods for decoding automotive CAN data into constituent signals. We seek to show the progression of the literature, and we provide more detailed descriptions of the methods that we evaluate in Sec. V, with those authors/methods in bold. Table I gives a quick reference for the signal reverse engineering contributions of each work.
Early work of Jaynes et al. [1] (2016) explored supervised learning to identify CAN messages that control body-related events, but the approach was unaware that data fields are comprised of multiple disparate signals. Thus, this method simply labels entire messages with a general physical meaning.
Markowitz & Wool [2] (2017) focuses on CAN anomaly/intrusion detection but pursues signal extraction as a preprocessing step. They were the first to introduce the basic assumption that each arbitration ID's data field is "a concatenation of positional [signals]". Implicitly, Markowitz & Wool's algorithm assumes only big endian and unsigned signals; hence, their algorithm need only identify the start bit and length of a signal. The algorithm considers all 2080 possible signals (indexed by start bit and length) in an ID's 64-bit data field, and relies on the cardinality of each candidate signal's range, the count of observed distinct values. It then categorizes the signal as constant, categorical (taking on only a few values), or continuous (values of a discretely sampled continuous variable) based on the range and assigns a score. Finally, the method identifies a non-overlapping partition of the 64 bits based on category and an optimization of the signals' scores.
Huybrechts et al. [3] (2017) is the first work to leverage DIDs to annotate CAN data and identify signals.
Their algorithm converts bytes/byte-pairs in CAN messages to integers and identifies those that are similar to the concurrently collected DID responses, but it operates under the self-acknowledged false assumption that CAN signals are limited to only one- or two-byte signals. No linear transformation of extracted signals to the DID sensor values is given.
The next three works, Nolan et al.'s TANG algorithm [4], Verma et al.'s ACTT [6], and Marchetti & Stabili's READ [5], appear to have occurred independently and concurrently, and we present them chronologically by publication date.
Nolan et al. [4] (2018) focus solely on extracting continuous signals by considering the "transition aggregated n-grams" (TANG). Given an ID trace [B0(t), . . . , B63(t)]_t, Nolan et al. define the TANG vector as (T0, . . . , T63) with Ti = Σ_j Bi(tj) ⊕ Bi(tj+1), where ⊕ denotes XOR. Note that this is simply a computationally efficient way to obtain the bit flip count; hence, if an n-bit signal's subsequent values change by unit increments, the LSB will exhibit the maximal TANG value, and each next significant bit will have TANG values decreasing by a factor of 2. The algorithm for identifying continuous signal boundaries is, roughly speaking: compute the TANG vector from an ID trace, identify the bit with maximal TANG value as a signal's LSB, and walk left (resp. right for reverse bit order) absorbing bits into the signal until the TANG value increases. Nolan et al. consider both forward and reverse bit orderings to attempt to take little and big endian encodings into account. However, since endianness refers to byte (not bit) order, this method cannot accommodate true little endian signals, and it in fact violates the fixed bit order defined by the standard. Overall, this method assumes big endian, unsigned, and continuous signals. Marchetti & Stabili [5] (2018) propose the
READ (Reverse Engineering of Automotive Dataframes) algorithm to extract signals using heuristics based on a 64-length vector giving each bit's observed flip probability, [P(Bi(tj) ≠ Bi(tj+1))] for i = 0, . . . , 63. First, signal boundaries are identified using mi := ⌈log10(P(Bi(tj) ≠ Bi(tj+1)))⌉, the ceiling of the log probabilities. READ follows intuition similar to TANG's: for continuous signals, an LSB flips much more often than an adjacent signal's MSB. Hence, READ places signal boundaries between bits i and i+1 iff mi > mi+1, or equivalently, if the bit flip probabilities cross a power of 10 (e.g., from above 0.1 to below it). Unlike TANG, READ does not claim to assume only continuous signals, and it in fact builds on Markowitz & Wool's signal categorization efforts. It considers a trichotomy of signal categories—counters (increment by 1 with each message), checksums (hashes for checking if messages are properly transmitted), and a catch-all bin, "physical" signals—categorizing the extracted signals with further heuristics relating to bit flips. Ultimately, READ partitions an ID's 64-bit data frame into signals with categorical labels. The algorithm ignores little endian and signed encoding possibilities and cannot be easily amended to accommodate little endian signals. Marchetti & Stabili's evaluations with real and synthetic CAN data, comparing with Markowitz & Wool's method, reveal that READ is far more accurate at finding signal boundaries. Verma et al.'s ACTT [6] (2018) takes a fundamentally different approach from all previous works. Instead of partial tokenization and translation—specifically, learning to identify signal boundaries under limiting assumptions (e.g., assuming big endian and unsigned encodings) in an unsupervised fashion—ACTT simultaneously tokenizes, translates, and interprets
CAN signals. The method automatically identifies which DIDs (see Sec. I-A3) respond on the particular vehicle, and then collects ambient CAN data during driving while periodically querying DIDs. These diagnostic responses provide labeled time series, DID traces, alongside the CAN data, setting up a supervised decoding algorithm. For a given ID trace, the constant bits are labeled, and all possible signals (start bit, length) from the remaining non-constant bits are considered. For each possible signal and for each DID trace, linear regression is performed, and a score of linear fit is assigned. A scheduling algorithm using dynamic programming then identifies a non-overlapping set of signals that maximizes the fitness score. The output is two-fold: (1) a list of constant signals, and (2) a subset of signals equipped with linear mappings to a known physical unit that matches a DID (start bit, length, scale, offset, physical unit, sensor label). Like all previous works, this method assumes unsigned encodings, and, following Nolan et al.'s TANG, it mistakenly treats reverse bit order as little endian (rather than byte order). Because this method relies on DID matching to tokenize signals, only a small subset of signals can be extracted, but all extracted signals are interpretable.
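The regression-and-scoring core of this style of DID matching can be sketched as follows (Python; the sample values are hypothetical, and alignment of the two series to common timestamps is assumed):

```python
def fit_interpretation(signal_vals, did_vals):
    """Regress a candidate tokenized signal onto a concurrently sampled DID
    trace: returns (scale, offset, r_squared) from ordinary least squares.
    Assumes a non-constant signal and pre-aligned, equal-length series."""
    n = len(signal_vals)
    mx = sum(signal_vals) / n
    my = sum(did_vals) / n
    sxx = sum((x - mx) ** 2 for x in signal_vals)
    sxy = sum((x - mx) * (y - my) for x, y in zip(signal_vals, did_vals))
    syy = sum((y - my) ** 2 for y in did_vals)
    scale = sxy / sxx
    offset = my - scale * mx
    r2 = (sxy * sxy) / (sxx * syy)  # goodness of fit used for scheduling
    return scale, offset, r2

# A raw RPM-like signal stored at a hypothetical 0.25 units/bit:
raw = [3200, 3600, 4000, 4800]
rpm = [0.25 * x for x in raw]                     # matching DID responses
scale, offset, r2 = fit_interpretation(raw, rpm)  # ~ (0.25, 0.0, 1.0)
```

Candidate signals whose best DID fit scores poorly are left uninterpreted, which is why DID-based methods extract only a subset of signals.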
Pesé et al. [7] (2019) present
LibreCAN, a three-phase process. (Phase 0) LibreCAN makes tweaks to READ's algorithm for identifying signal boundaries and categorizing extracted signals. Specifically, while READ identifies signal boundaries by finding where adjacent bit-flip probabilities decrease across a multiple of 10, LibreCAN identifies if adjacent bit-flip probabilities drop by a factor of T_p, a tunable input parameter. (Phase 1) LibreCAN next leverages ideas similar to Verma et al. [6], using cross-correlation to match signals to sensor readings from both DIDs and external sensors, then using linear regression to learn the scale and offset. (Phase 2) LibreCAN incorporates a novel, semi-automated method for identifying body-related signals (e.g., door locks, windshield wipers) by filtering IDs based on changes in data fields before and after a user actuates the body-related feature. Pesé et al. note that little endian signals exist, but like all previous methods, their algorithm assumes big endian byte order and unsigned encodings, and does not have a natural extension to accommodate little endian signals.
The most recent CAN reverse engineering work, by Young et al. [8] (2020), uses an approach similar to LibreCAN (Phase 2) to match vehicular functions (based on a hand-labeled timeseries) to CAN IDs using a data-change identification algorithm. They use a clustering algorithm to group related IDs, labeling the remaining unknown IDs based on those labeled in the matching step. However, similar to Jaynes et al., this work attempts to assign physical meaning to an entire CAN ID rather than tokenize, translate, and then identify (assign meaning to) constituent signals; thus, we do not consider it (nor Jaynes et al.'s) to be a true signal reverse engineering algorithm.
There are significant limitations to all previous works. Most notably, all assume both big endian byte order and unsigned encodings. While some may theoretically identify signed signals' boundaries correctly, this has not been mentioned or tested.
Worse, there is no natural extension to little endian and/or signed signals. To identify signedness, an additional algorithm is needed: a fairly straightforward binary classification problem that is not difficult once well formed. Including endianness, on the other hand, poses a far harder problem for two reasons: (1) signal boundary algorithms depend on flip counts of "neighboring" bits, but bit orderings change with endianness, so neighboring bits cannot be determined; (2) without considering both endiannesses, signal boundary identification is computationally simple (the same binary classification is independently repeated 64 times per ID), but considering all byte orderings grows the search space combinatorially (boundary options × byte orders > 4.72E21 tokenizations per ID!) with a web of changing dependencies.

III. ALGORITHM
We present CAN-D (CAN-Decoder), a four-step modularpipeline (depicted in Fig. 1) providing the first comprehensiveand vehicle-agnostic CAN signal reverse engineering solution.We describe the needed inputs and outputs for the modularcomponents—a signal boundary classifier (Step 1, Sec. III-A),a signedness classifier (Step 3, Sec. III-C), and a signal-to-timeseries matcher (Step 4, Sec. III-D)—as well as ournovel endianness optimizer (Step 2, Sec. III-B), which weconsider to be the unique component providing the glue forthe interchangeable components.
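The modular contract described above can be sketched as a set of interchangeable callables; the names and signatures here are hypothetical, for illustration only, and the four-part signal definition is modeled as a small record type:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Signal:
    """Four-part CAN signal definition plus optional interpretation fields."""
    start_bit: int       # MSB position in the 64-bit payload
    length: int          # number of bits
    endianness: str      # "big" or "little" (byte order)
    signed: bool         # two's-complement vs. unsigned encoding
    label: str = ""      # physical meaning, if matched in Step 4
    scale: float = 1.0   # linear map: value = scale * raw + offset
    offset: float = 0.0

def can_d(id_trace,
          boundary_clf: Callable,   # Step 1: per-bit cut probabilities
          endian_opt: Callable,     # Step 2: optimal boundaries + byte orders
          signed_clf: Callable,     # Step 3: signed vs. unsigned per token
          matcher: Callable) -> List[Signal]:   # Step 4: interpretation
    cut_probs = boundary_clf(id_trace)
    tokens = endian_opt(cut_probs)
    signals = [signed_clf(t, id_trace) for t in tokens]
    return matcher(signals, id_trace)
```

The point of the sketch is the data flow: Steps 1, 3, and 4 are swappable components, while Step 2 consumes Step 1's probabilities and fixes each byte's order before signedness is decided.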
A. Step 1: Signal Boundary Classification
Given an ID trace as input, a signal boundary classifier makes 64 binary classification decisions—for each of the 64 bits, predict whether it is the LSB of a signal (or not), effectively deciding if a signal boundary or "cut" occurs between this bit and the next. Almost all previous works have focused on signal boundary classifiers that use hand-crafted heuristics leveraging only one feature, the probability of each bit flipping. In this section we pursue the same goal but use a wider set of features. In addition to a novel, unsupervised heuristic, we leverage supervised machine learning (ML) and deliver two superior signal boundary classifiers.
For the reverse engineering pipeline, outputs of the signal boundary classifier in Step 1 are inputs to the endianness optimizer in Step 2. While we frame signal boundary identification as a set of binary classifications, the input for Step 2 of the CAN-D pipeline is the estimated probability—in {0, 1} for binary heuristics or in [0, 1] for ML—of a signal boundary for each bit. Algorithms developed in previous works [2, 4–6] and [7] (Phase 0) could be used as the signal boundary classifier for this step, all of which produce binary label outputs. Sec. V presents results comparing our signal boundary classifiers against the previous state of the art.
1) Data & Notational Setup:
Both unsupervised and supervised predictions are based on statistics describing how a particular bit and its neighboring bits flip. We use a ground-truth DBC (see Sec. IV) to create a target vector, providing a 0/1 label for each bit indicating if it is a signal's LSB (boundary). To deal with the issue that neighboring bits at byte boundaries are conditioned on endianness, we split little endian signals on byte boundaries for training (the supervised models) and testing (all models). In use, the classifier (heuristic or ML) will be applied to ID traces under both byte orderings (see Eq. 1), creating two sets of predictions. Both sets of predictions are input to Step 2, which determines the endianness of each byte.
Here we introduce two views of the data used for training and then scoring/tuning the ML in this section (both are also used for testing all methods in Sec. V-A). For training, we remove the constant bits (obvious boundaries), forming a "condensed trace." The motivation for this is threefold: (1) Based on assumption (A0) (see Sec. I-B), observed constant bits necessarily delimit signals, so a simple rule suffices to identify these obvious signal boundaries. (2) Our features encode neighboring bits' values and flips, so when nearby bits are constant, features are either trivial or undefined. Removing the constant bits prior to feature building yields a better feature set. (3) Classes are highly biased towards the negative class—most bits are not an LSB (not on a signal boundary). By removing constant bits, we not only get better features, but we artificially increase the number of non-obvious signal boundaries and decrease class bias, particularly for the non-obvious examples for which a classifier is needed. Note this is the "c" set described in Sec. V-A.
Using this condensed trace, we build a feature array with shape m non-constant bits by n_f features (features described below for each method). Second, for tuning the ML classifiers in this section, we only consider their performance on the non-obvious boundaries in the original data—those boundaries not abutting constant bits in the non-condensed ID traces. Note this is the "f−" set described in Sec. V-A. We tune our supervised model on this set because we ultimately wish to apply the model to full 64-bit traces and want to optimize performance for this situation.
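The condensed-trace construction and the kind of conditional bit-flip statistics used throughout Step 1 can be sketched as follows (hypothetical helpers, not the authors' code; payload rows are lists of 0/1 values):

```python
def condense(rows):
    """Drop constant bit positions (obvious boundaries); return kept columns."""
    n = len(rows[0])
    keep = [i for i in range(n) if len({r[i] for r in rows}) > 1]
    return [[r[i] for i in keep] for r in rows], keep

def flip_features(rows, i):
    """Local features for bit i, in the spirit of Table II:
    P(F_i), P(F_i|F_{i+1}), P(F_{i+1}|F_i), P(~F_i|~F_{i+1}), P(~F_{i+1}|~F_i)."""
    fi  = [a[i] != b[i] for a, b in zip(rows, rows[1:])]        # flips of bit i
    fi1 = [a[i + 1] != b[i + 1] for a, b in zip(rows, rows[1:])]  # flips of bit i+1
    n = len(fi)
    p = lambda xs: sum(xs) / n
    joint  = p([a and b for a, b in zip(fi, fi1)])          # P(F_i and F_{i+1})
    njoint = p([not a and not b for a, b in zip(fi, fi1)])  # P(~F_i and ~F_{i+1})
    pfi, pfi1 = p(fi), p(fi1)
    cond = lambda num, den: num / den if den else 0.0       # guard zero denominators
    return (pfi,
            cond(joint, pfi1), cond(joint, pfi),
            cond(njoint, 1 - pfi1), cond(njoint, 1 - pfi))
```

For a 2-bit counter (values 0, 1, 2, 3), the LSB flips every message while the MSB flips every other message, so the conditional feature P(F_{MSB} | F_{LSB}) cleanly reflects the dependence between bits inside one signal.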
2) Supervised Classification:
To describe features conceptually, we use i ± 1 to denote bit i's neighbors, notationally neglecting the varying neighbors based on endianness (ref. Eq. 1) when it only presents unnecessary complications. For each bit i, we generate a set of 15 features. The first five features are "local" to bit i and its relationship to bit i + 1, which we denote v_i^{id} ∈ R^5. These features (listed in Table II) are estimated probabilities of a "bit flip" based on observations in data over time. We denote the flip of bit i—alternating value in subsequent messages, B_i(t_j) ≠ B_i(t_{j+1})—as F_i.

TABLE II: Local bit-flip features; F_i denotes a flip of bit i.
P(F_i)
P(F_i | F_{i+1})      P(F_{i+1} | F_i)
P(¬F_i | ¬F_{i+1})    P(¬F_{i+1} | ¬F_i)

The main intuition is that a signal's LSB generally alternates value much more often than an adjacent signal's MSB; hence, the bit-flip features should provide good indicators for boundaries. Specifically, the first feature should identify LSBs (large P(F_i)) and MSBs (P(F_i) near 0). This is essentially the feature on which previous works [4, 5, 7] base their heuristics. The next four conditional bit-flip features are expected to differ significantly for adjacent bits contained in the same signal versus those that are part of separate signals, as the former are likely dependent while the latter are likely independent. Next, we look to the neighboring bit on the right, bit i + 1, and add the five local features for this bit, v_{i+1}^{id}, to our feature set for bit i. Finally, we add five difference features δ(v_{i+1}^{id}, v_i^{id}), yielding a 15-length feature vector for bit i. Initially, we experimented with adding a wider variety of features based on bit values, two-bit distributions, and entropy, as well as more left/right neighboring features.
However, we found that these features did not improve classification performance and in fact resulted in overfitting.
We tested the performance of several binary classifiers: Naive Bayes, Logistic Regression, Support Vector Classifiers, Decision Trees, Random Forests, K-Nearest Neighbors, Multi-Layer Perceptrons, and AdaBoost. After experimenting with different weighting schemes to combat the class bias issue, as well as the fact that we only score the non-obvious boundaries, we settle on a sample weighting scheme of non-obvious-positive:negative:obvious-positive of 8:4:1. To test the accuracy of the classifiers, we used leave-one-out cross-validation (LOOCV), holding out one CAN log per fold and aggregating the results, and the f− set, only scoring non-obvious boundaries. The results, shown in Table III, illustrate that the Random Forest (RF) classifier performed the best. Finding the optimal parameters for this top-performing model using a grid search and LOOCV, the tuned model yields an overall 88% Precision and 95% Recall, for an F-score of 91%. We select this tuned RF model for our ML classifier. Finally, as an input to Step 2, we output the classifier's predicted probability of a bit i being a signal's LSB.

TABLE III: Aggregated classification metrics using LOOCV by CAN log, only scoring non-obvious boundary decisions (f− set). Top: classifiers with default Scikit-learn parameters. Bottom: the top-performing Random Forest model, with optimal parameters chosen using a grid search (max_features=√n_f, min_samples_leaf=3, n_estimators=200, max_depth=5).

Classifier              F-Score   Precision   Recall
Naive Bayes             71.6      57.6        94.7
Logistic Regression     86.9      82.1        92.3
SVC Linear              85.5      78.6        93.8
SVC Poly                88.7      85.3        92.3
SVC RBF                 89.0      84.8        93.8
SVC Sigmoid             46.4      42.3        51.4
KNN                     88.1      81.3        96.2
MLP                     88.4      82.5        95.2
AdaBoost                87.6      82.6        93.3
Decision Tree           78.5      67.8        93.3
Random Forest           –         –           –
Random Forest (Tuned)   91        88          95

Fig. 4: Visualization of the heuristic signal boundary classifier (Alg. 1) based on conditional bit-flip probabilities, for fixed α_1, α_2.

Algorithm 1:
Heuristic Signal Boundary Classifier
Inputs: P(F_{i+1} | F_i), P(F_{i+2} | F_{i+1}), α_1, α_2
if P(F_{i+1} | F_i) < α_1 or P(F_{i+2} | F_{i+1}) − P(F_{i+1} | F_i) > α_2 then
    return True
else
    return False
3) Unsupervised Heuristic:
As an alternative to ML, we explore the feature set to develop a simple heuristic relating to bit-flip probabilities. We find that the conditional bit-flip probability P(F_{i+1} | F_i) and the difference between successive conditional bit-flip probabilities, P(F_{i+2} | F_{i+1}) − P(F_{i+1} | F_i), are a better indicator of a signal ending at bit i than the difference of unconditional bit-flip probabilities, P(F_{i+1}) − P(F_i), used by most related works.
We develop a heuristic based on these findings, detailed in Alg. 1 and visualized in Fig. 4. Based on observations of data, we find settings of the parameters α_1, α_2 that split the feature space well and yield strong F-score, Precision, and Recall (also on the f− set). Note that our heuristic was developed and tuned based on a small preliminary dataset, but we found it generalized well to all of our data.
The heuristic's main advantage is that it requires no training while achieving accuracy similar to the ML, as shown in Sec. V-A. Though simple, intuitive, and computationally efficient, one drawback is that the outputs are binary labels, with no way of properly determining probabilities in (0, 1), thereby removing some of the flexibility offered by the following step.

B. Step 2: Endianness Optimization
Armed with the probability of a boundary or “cut” betweenadjacent bits of a message, we construct an optimization prob-lem to simultaneously determine the most likely packing ofsignals into the 64-bit data-field and most likely endiannessesof each of the eight bytes.
1) Valid Tokenizations:
Denote a candidate signal I as the list of its bit indices, ordered from MSB to LSB. Given a signal I, let LSB(I) (or simply LSB if no ambiguity is present) denote the least significant bit. We consider constant bits as 1-bit signals. Each ID has eight bytes indexed j = 0, …, 7, with byte j comprised of bits 8j, …, 8(j + 1) − 1. Let E(j) ∈ {B, L} denote that byte j is big, little endian, respectively.

Definition 1 (Valid Tokenizations). For a given ID trace, define a valid tokenization, T, as a tuple of candidate signals {I_k}_k and endiannesses of each byte {E(j)}_{j=0}^{7} such that:
(1) ∪_k I_k = {0, …, 63} (all 64 bits are used),
(2) I_k ∩ I_l = ∅ for all k ≠ l (signals do not overlap),
(3) Assumption (A1.b), one endianness per byte, is satisfied (implicit in the notation E(j)).

Example 2. For example, consider Fig. 5 (right), a signal plot layout depicting a valid tokenization with one color per signal (and constant bits in grey). The navy signal, a 10-bit little endian signal starting at bit 0, is denoted I = (14, 15, 0, 1, …, 7). Since B_15 → B_0, necessarily E(0) = E(1) = L.

Example 2 shows that if a signal I crosses a byte boundary, the endianness of both bytes is determined by the order of the indices according to Eq. 1. This leads to the following definition and proposition, which will play an important role in the computational tractability of our optimization problem.

Definition 2 (Byte Boundaries). For j = 0, …, 7, let v(j) ∈ {J_B, J_L, C} denote whether byte boundary j is
• a cut (C): bit 8(j + 1) − 1 ends a signal or is constant,
• a big endian join (J_B): B_{8(j+1)−1} → B_{8(j+1)}, or
• a little endian join (J_L): B_{8(j+1)−1} → B_{8(j−1)},
and let V := {v ∈ {J_B, J_L, C}^8 | v is a valid byte boundary set}.

For bits not on a byte boundary (i ∉ S := {8(j + 1) − 1}_{j=0}^{7}), there are only two options: cut or join B_i → B_{i+1}, and both are valid possibilities regardless of endianness.

Proposition 1.
A valid tokenization T has v satisfying:
1) v(j) = J_B ⟹ E(j) = E(j + 1) = B
2) v(j) = J_L ⟹ E(j − 1) = E(j) = L
3) v(0) ≠ J_L
4) v(7) ≠ J_B
5) v(j) = J_B ⟹ v(j + 1) ≠ J_L and v(j + 2) ≠ J_L

Proof. (1) and (2) follow directly from Eq. 1 (endianness definition) and Assumption A1.b (one endianness per byte). For (3), v(0) ≠ J_L, else the join would reference a bit index outside [0, 63]. Similarly for (4). For (5), if v(j) = J_B and either v(j + 1) = J_L or v(j + 2) = J_L, then (1) and (2) imply E(j + 1) is both big and little endian, a violation of Assumption A1.b.

Remark 1.
Prop. 1 can be summarized by V := {v ∈ {J_B, C} × {J_B, J_L, C}^6 × {J_L, C} with no consecutive subsequences of the form (J_B, J_L) or (J_B, ∗, J_L)}.

Definition 3 (T & T_v). Let T denote the set of valid tokenizations. For v ∈ V, let T_v ⊂ T be the tokenizations with byte boundaries defined by v.

Corollary 1.
There are |T| = |V| × |T_v| = 577 × 2^{63−7} ≈ 4.2E19 valid tokenizations.

Proof. |{J_B, C} × {J_B, J_L, C}^6 × {J_L, C}| = 2 × 3^6 × 2 = 2916, and removing subsequences of the form (J_B, J_L) or (J_B, ∗, J_L) leaves 577. |T_v| = 2^{63−7}, as the remaining 63 − 7 = 56 bit gaps have two valid options, cut or join.
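Corollary 1's count of |V| can be checked by brute force over the constraints of Prop. 1 / Remark 1; a minimal sketch (the single-letter labels 'B' = J_B, 'L' = J_L, 'C' = cut are our own shorthand):

```python
from itertools import product

def valid_byte_boundary_sets():
    """Yield v in {J_B, J_L, C}^8 satisfying Prop. 1's constraints:
    v(0) != J_L, v(7) != J_B, and no (J_B, J_L) or (J_B, *, J_L) patterns."""
    for v in product("BLC", repeat=8):
        if v[0] == "L" or v[7] == "B":
            continue
        # forbid J_L within two positions to the right of a J_B
        if any(v[j] == "B" and "L" in v[j + 1:j + 3] for j in range(8)):
            continue
        yield v

n_valid = sum(1 for _ in valid_byte_boundary_sets())
print(n_valid)                   # 577
print(n_valid * 2 ** (63 - 7))   # |T| ≈ 4.2e19
```

The 577 valid byte-boundary assignments, times 2^56 choices for the interior gaps, give the full space of valid tokenizations that Step 2 searches.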
2) Optimization Formulation:
Step 1 provisions f(i | E(j_i)) = P(cut to the right of bit i, for endianness E(j_i)), with j_i = ⌊i/8⌋ the corresponding byte index for bit i. We set f(i | e) = ∞ if bit i is to the left of a mandatory cut, e.g., if the next bit is a constant bit. For intuition in the formulation below, consider f(i | E(j_i)) not as the likelihood of a cut, but as the penalty for not cutting, and let β be a fixed cut penalty parameter.
The idea for our cost function is to let signals accrue a join penalty, the sum of the probabilities f(i | E(j_i)) for each bit that is not cut in order to form the signal. Since the candidate signal entails a cut to the right of its LSB, we swap f(LSB, E(j_i)) for β, the cut penalty. Thus, β controls how liberal to be with cuts.
The intuition is to find the optimal balance between partitioning the message into too many signals and joining multiple disparate signals, by balancing the cut penalty (β) with the likelihood of a cut (join penalty f). Setting β = 1 will lead to cutting only where f(i | ·) = ∞ (signals demarcated by constant bits), and β = 0 will lead to a cut at every gap, resulting in 64 1-bit signals.

Definition 4 (Costs). Define the signal cost as
φ(I, E) := Σ_{i ∈ I \ {LSB}} f(i | E(j_i))  [join penalty]  +  β  [cut penalty].
Extending to a tokenization cost, we have
Φ(T) := Σ_{I ∈ T} φ(I, E)
      = Σ_{χ_T(i)=0} f(i | E(j_i)) + Σ_{χ_T(i)=1} β
      = Σ_{i=0}^{63} (1 − χ_T(i)) f(i | E(j_i)) + χ_T(i) β,
with χ_T(i) = 1 if i is an LSB of a token in T, else 0.

The above definition sets up our optimization problem: identify the optimal tokenization
T* := arg min_{T ∈ T} Φ(T).    (2)

Example 3.
To give a concrete example of using the cost function, consider the first two diagrams in Fig. 5, depicting the big endian probabilities f(· | E = B) (left) and the little endian probabilities f(· | E = L) (middle). Consider two overlapping 11-bit candidate signals that both contain byte 4 (bits 32 to 39 as numbered in the right plot): a big endian signal I_1 = (29, …, 39), and a little endian signal I_2 = (32, …, 39, 24, …, 26). The penalties for these candidate signals are φ_{β,f}(I_1, B) = 1.73 − 0.76 + β = 0.97 + β, and φ_{β,f}(I_2, L) = 0 + β = β. Since clearly 0.97 + β > β, (I_2, L) has the lower penalty, in this case regardless of the choice of β. In fact, (I_2, L) turns out to be in the globally optimal T*, which is shown in Fig. 5 (right) in teal.

3) Finding an Optimum: Given a cut penalty β ∈ [0, 1] and pre-computed cut probabilities—f(i | E(j_i)) for all i ∈ {0, …, 63} and both endiannesses E(j_i) (see Step 1, Sec. III-A)—our goal is to identify an optimal tokenization (Eq. 2) from the ≈ 4.2E19 valid options.

Theorem 1.
Fixing v ∈ V, where v gives the cuts/joins at byte boundaries (bits in S = {8(j + 1) − 1}_{j=0}^{7}), the subproblem
arg min_{T ∈ T_v} Φ_{β,f}(T)
is realized by T*_v, the tokenization in which, for all i ∈ [0, 63] \ S, bit i is an LSB (cut to the right of bit i) iff β < f(i | E(j_i)).

Proof. Let T*_v be as above and T ∈ T_v. By definition, for i ∉ S, T*_v will accrue cost min(f(i | E(j_i)), β). Since T, T*_v ∈ T_v, both accrue the same cost for bits i ∈ S. It follows that Φ(T) − Φ(T*_v) = Σ_{i ∉ S} [(1 − χ_T(i)) f(i | E(j_i)) + χ_T(i) β − min(f(i | E(j_i)), β)] ≥ 0.

Fig. 5: Probabilities of boundaries according to big endian ordering (left) and little endian ordering (middle). The resulting optimal tokenization (right) is three little endian (navy, blue, teal), one big endian (snot), and a 4-bit (maroon) signal.

This gives an efficient, constant-time search algorithm (689 operations): (1) store the optimal cut/join choice for each bit i ∈ [0, 63] \ S under each endianness (56 × 2 = 112 operations), then (2) apply Thm. 1 to realize both T*_v and cost Φ(T*_v) for each of the 577 v ∈ V, maintaining the minimum. In the case that there are multiple optimal tokenizations, we break ties by choosing the one with the maximum number of cuts, followed by the minimum number of little endian signals, which necessarily furnishes a unique optimal solution.
After experimenting with adjusting the tuning parameter β, we find that a range of β values yields fairly consistent and correct tokenizations, and we fix a value in this range for our pipeline. Note that the heuristic classifiers in Step 1 provide probabilities in {0, 1}, meaning all choices of β yield identical results. Further, note that with binary inputs a tie-break scheme is often necessary, whereas with high-precision probability inputs, multiple optimal tokenizations with the same cost are virtually impossible.
The outputs of the endianness optimizer described in this step are tokenized signals.
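The per-v search of Thm. 1 can be sketched in brute-force form. This is not the authors' constant-time implementation: it recomputes interior costs for each of the 577 valid v, resolves bytes whose endianness is unforced by simply taking the cheaper order, and omits the tie-breaking scheme. The illustrative β value and helper names are our own:

```python
from itertools import product

BETA = 0.05   # illustrative cut penalty

def valid_v():
    # Remark 1's constraints: 'B' = J_B, 'L' = J_L, 'C' = cut
    for v in product("BLC", repeat=8):
        if v[0] == "L" or v[7] == "B":
            continue
        if any(v[j] == "B" and "L" in v[j + 1:j + 3] for j in range(8)):
            continue
        yield v

def forced_endianness(v):
    # J_B at boundary j forces bytes j, j+1 big; J_L forces bytes j-1, j little.
    e = [None] * 8
    for j, vj in enumerate(v):
        if vj == "B":
            e[j] = e[min(j + 1, 7)] = "B"
        elif vj == "L":
            e[max(j - 1, 0)] = e[j] = "L"
    return e

def optimal_tokenization_cost(f_big, f_lit, beta=BETA):
    """Min over v of: interior bits pay min(f, beta) (Thm. 1), boundary bits
    pay beta on a cut or the join penalty f under the dictated endianness."""
    best = float("inf")
    for v in valid_v():
        e = forced_endianness(v)
        cost = 0.0
        for j in range(8):
            interior = lambda f: sum(min(f[i], beta)
                                     for i in range(8 * j, 8 * j + 7))
            if e[j] == "B":
                cost += interior(f_big)
            elif e[j] == "L":
                cost += interior(f_lit)
            else:  # endianness unconstrained: take the cheaper byte order
                cost += min(interior(f_big), interior(f_lit))
            b = 8 * j + 7  # byte-boundary bit
            if v[j] == "C":
                cost += beta
            else:
                cost += (f_big if v[j] == "B" else f_lit)[b]
        best = min(best, cost)
    return best
```

With uniformly high cut probabilities under both orders, the minimum is achieved by cutting everywhere; when the little endian probabilities strongly favor joins, the search instead selects little endian joins at every admissible byte boundary.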
While in theory another endiannessoptimizer could be developed and exchanged for this compo-nent, we consider this custom optimization to be a fixed andnon-interchangeable component of the pipeline. C. Step 3: Signedness Classification
A signedness classifier takes a tokenized signal (start bit, length, endianness) and makes a binary decision on whether each signal of length greater than two is signed (using two's complement encoding) or unsigned. To develop our classifier, we followed a workflow similar to Step 1 (Sec. III-A), experimenting with supervised classifiers and unsupervised heuristics. Since each signal is tokenized, and thus the LSBs and MSBs are now known, this problem is significantly simpler, and features can be developed per signal rather than per bit. However, after experimenting with several features and supervised classification methods, we find that a simple heuristic based on the distribution of the two most significant bits of the signal yielded better results than the supervised methods. Using this heuristic, described in Alg. 2, we obtain almost perfect classification, so ultimately we chose to use this heuristic in the CAN-D pipeline rather than a learned model.
The heuristic is based on how the two most significant bits behave if the signal is signed or unsigned. Let B_{i_1}, B_{i_2} denote the MSB and next-most significant bit of the signal. First, consider the probabilities of the center values, P[(B_{i_1}, B_{i_2}) = (1, 0)] and P[(B_{i_1}, B_{i_2}) = (0, 1)]. If a signal

Algorithm 2: Heuristic Signedness Classifier
Inputs: {B_{i_1}(t), B_{i_2}(t)}_t, γ
if P[(B_{i_1}, B_{i_2}) = (1, 0)] = P[(B_{i_1}, B_{i_2}) = (0, 1)] = 0 then
    return True
if P[(B_{i_1}(t_j), B_{i_2}(t_j)) = (0, 0) ∧ (B_{i_1}(t_{j+1}), B_{i_2}(t_{j+1})) = (1, 1)] = 0 then
    return False
if P[(B_{i_1}, B_{i_2}) = (1, 0)] + P[(B_{i_1}, B_{i_2}) = (0, 1)] < γ then
    return True
return False
is signed, for values close to zero, (B_{i_1}, B_{i_2}) will be (0, 0) (small positives) or (1, 1) (small negatives), whereas values near the extremes will be (1, 0) (near min) or (0, 1) (near max). A signal with a small probability of these values is therefore likely signed. Second, consider the probability of a jump between extreme values, P[(B_{i_1}(t_j), B_{i_2}(t_j)) = (0, 0) ∧ (B_{i_1}(t_{j+1}), B_{i_2}(t_{j+1})) = (1, 1)]. If a signal is signed, when changing from small positive to small negative values, the two MSBs must flip from (0, 0) to (1, 1). However, if it is unsigned, this is unlikely to ever happen, since it would entail flipping from a very small value to a large one, resulting in a significant discontinuity. If this probability is 0, the signal is likely unsigned. We apply these two ideas as described in Alg. 2, where we set γ to a small threshold based on observations of data.
After Step 3, signedness classification, each ID's 64-bit message is partitioned into signals for which we know the start bits, lengths, endianness, and signedness; consequently, each signal can now be translated into a timeseries of integers, denoted s(t). No previous works have attempted signedness classification, so the signedness classifier presented in this section is currently the sole option for this modular component.

D. Step 4: Physical Interpretation
For our signal-to-timeseries matcher, we follow Verma et al.'s ACTT [6] to match a subset of the translated signals with diagnostic data. This augments matched signals with the information necessary to interpret them as actual measurements in the vehicle. We do this by comparing each signal time series, s(t), to each DID trace, D(t′), and determining if they are linearly related. Because the DID traces are sampled at a lower rate than normal CAN traffic, we interpolate the signal values over the diagnostic timepoints, obtaining s(t′). We then regress D(t′) onto s(t′) and find the best linear fit, furnishing the coefficients a, b so that s̄(t′) := a·s(t′) + b ≈ D(t′). To score the model's fit, we use the coefficient of determination, R², which measures the fraction of total variation in time series D(t′) that is explained by s̄(t′); thus, R² = 1 exhibits a perfect fit, while R² = 0 exhibits the fit of a horizontal line (assuming D(t′) is not a horizontal line). For each signal s, we find the diagnostic D that yields the highest R² value. If R² > δ, where δ ∈ [0, 1] is a tuning threshold, s is matched to D. Setting δ = 1 will return only perfectly correlated signals, while a small δ will allow less correlated signals to be matched. For our implementation, we choose a fixed, high value of δ.
For signals that match a diagnostic, we have interpretation, having procured the label and units, as well as the scale, a, and offset, b. In addition to ACTT [6], LibreCAN [7] (Phase 2) proposes a signal-to-timeseries matching algorithm that could be used interchangeably (or even combined) for this component. Finally, note that translated signals that are not augmented with labels through this physical interpretation step are still highly valuable, as there are many applications in which these unlabeled translated timeseries are far more useful than binary data.

IV. DATASET
As our goal is to build a vehicle-agnostic signal-extraction capability, we have collected CAN data from ten different vehicle makes, with years ranging from 2010 to 2017, for training and evaluation. The details of defined signals for each log are described in Table IV. This dataset is far larger and more varied than that of any previous work. Notably, in order to test generalizability of the methods, no duplicate makes were included, as different models of the same make (e.g., Toyota Camry and Corolla) have similar characteristics.
In order to obtain data for our signal reverse engineering process (bit position, endianness, and signedness), we used DBCs acquired from two sources.

TABLE IV: Statistics on ten CAN logs, each collected from a vehicle of a different make. For each log, we enumerate: non-constant IDs (IDs), non-constant IDs defined by CommaAI (Def. IDs), and each of the encodings of defined signals (big/little endian, signed/unsigned) resulting from the ground-truth labeling process (see Sec. IV). Three logs contain a high percentage of little endian signals, and all but one contain signed signals.

Log | IDs | Def. IDs | Unsigned B.E. | Signed | L.E. | Total
 –  | 54  | 17       | 61            | 3      | 25   | 89

Non-constant IDs: IDs with more than one non-constant bit.
One log's vehicle adheres to the J1939 standard protocol [35], and its signal definitions are derived from this open standard.

V. EVALUATION
Using the dataset described in Sec. IV, we compare our algorithms, with both the heuristic and machine learning (ML) versions of Step 1, against the following predecessors: TANG [4], READ [5], ACTT [6], and LibreCAN (Phase 0) [7]. See Sec. II for a description of each algorithm. Note that we do not test the algorithm proposed by Markowitz & Wool [2] because it was tested by READ and shown to produce far inferior results. We also test against a 'Baseline' method that simply uses constant bits as signal boundaries and assumes big endian, unsigned encodings. This represents the accuracy scores obtained by simply identifying the obvious boundaries.
We quantitatively compare the tokenization and translation (Steps 1-3) efforts of each of these methods in the following section, Sec. V-A. We note that READ and LibreCAN make efforts to categorize signals, which is an added benefit of these methods over ours, but we do not evaluate the efficacy of their categorization algorithms. We also do not quantitatively evaluate the interpretation (Step 4) efforts by ACTT, LibreCAN, or CAN-D because, as pointed out by Pesé et al. [7], ground-truth interpretations are highly subjective and difficult to evaluate quantitatively. Instead, we offer a qualitative comparison of the full decoding efforts in Sec. V-C and Fig. 6, which includes the supplemental interpretations given by CAN-D. Note that ACTT's interpretation is virtually identical to CAN-D's, and LibreCAN's requires an extra tool to obtain body-related labeled timeseries, so we did not attempt to perform their interpretation methods.

Footnotes:
- CAN-D Heuristic: parameters α_1, α_2, β (though irrelevant with binary Step 2 inputs), γ, and δ as chosen in Sec. III.
- CAN-D ML: Step 1 with the tuned RF model found in Sec. III-A; β, γ, δ as chosen in Sec. III.
- TANG and ACTT incorrectly considered reverse bit ordering. We only consider forward bit ordering for these two methods.
- For ACTT's R² threshold, we used a high value. For the 5/10 logs tested that contained no diagnostic packets, this method is equivalent to Baseline.
- The LibreCAN authors state that the optimal choice for the parameter T_p (percent decrease of bit flip rates) was between .01 and .02 depending on the vehicle. The authors likely meant a larger threshold, because a threshold of 1% or 2% would lead to (and we verified this) a very high false positive rate. For the results reported, a larger T_p was used, resulting in much higher F-scores.

A. Signal Boundary Classification Evaluation
We first quantitatively evaluate the signal boundary classification algorithms of each method using three test sets that differ in the number of positive labels (detailed in Table V). The condensed (c) set uses all positive labels (boundaries) in condensed traces (constant bits removed), thus increasing the number of non-obvious positive labels and decreasing class bias, resulting in the most robust evaluation set for testing and comparing the efficacy of signal boundary classification algorithms.

TABLE V: Positive labels in each test set; 5784 negative labels in all sets.
Set | n   | %
c   | 834 | 13
f−  | 208 | 3
f+  | –   | –

However, "full" non-condensed traces give a more accurate representation of the distribution of labels and the most realistic positive samples. In the full (f+) set, all non-constant samples are scored (including obvious examples of LSBs abutting constant bits/message ends). This f+ set is the most representative and will yield the most realistic metrics for the total signals that could be extracted using a given method. Finally, in the full non-obvious set (f−), only non-obvious examples (those not abutting constant bits) are scored. This test set has very few positive labels (3%), but unlike (c), all are boundaries that delimit two adjacent signals in actual data, and unlike (f+), it will not suffer score inflation from obvious boundaries not attributable to the algorithm being scored. The f− set gives a balance of realism in use without the inflation of metrics from the obvious boundaries.
The classification F1-score, Precision, and Recall under each scenario are reported in Table VI (Top). Recall that since little endian signals are split on the byte boundary into two big endian signals for labeling, we are testing solely the efficacy of the signal boundary classification methods without taking endianness into account, and thus not penalizing other algorithms for the limiting assumption of big endianness. Also note that since CAN-D is supervised, reported metrics are from aggregating results from LOOCV per log.

B. Signal Error Evaluation
Second, we compare the full tokenization and translation efforts of each method, computing the ℓ¹ error between the translated signals and their corresponding ground-truth signals. See results in Table VI. The motivation for this evaluation is that ultimately, the goal of all of these methods is to extract time series that can be used as actual real-time measurements from systems in the car. Therefore, the most important metric for measuring the efficacy of these methods is not how many bits overlap or the number of boundaries correctly classified (as described above), but the difference between the values of the extracted signal's time series and the true signal's time series. All previous methods assume big endian, unsigned signals; consequently, once signal boundaries are assigned, the translated signal values are completely determined, and this is what is used for this second evaluation. For CAN-D, Steps 2-3 (endianness optimization & signedness classification) provide the remaining tokenization and translation information.

We compute the score for each log as follows. Let S denote the set of normalized true signals and Ŝ the set of normalized predicted signals (all taking values in [0,1]) for a CAN log. Let η : S → Ŝ so that for each true signal s, η(s) is the predicted signal that contains the MSB of s. Any predicted signals that are left unmatched (ŝ ∈ Ŝ ∖ η(S)) are paired with the zero vector. Take the normalized ℓ¹ difference between each signal pair, resulting in a signal error between 0 and 1. The mean signal error for the log is defined as

( Σ_{s∈S} ‖s − η(s)‖ + Σ_{ŝ∈Ŝ∖η(S)} ‖ŝ‖ ) / ( |S| + |Ŝ∖η(S)| )    (3)

where ‖s‖ := Σ_{t=1}^{n_id} |s(t)| / n_id.

[Table VI: signal boundary classification metrics (F, P, R under test scenarios c, f−, f+) and mean ℓ¹ signal error per CAN log for Baseline, TANG, ACTT, READ, LibreCAN, CAN-D Heuristic, and CAN-D ML.]

TABLE VI: Top: Comparison of signal boundary classification results. F = F-Score, P = Precision, R = Recall. We test each method using the three test scenarios, denoted in the third column and described in Sec. V-A. "Baseline" identifies only obvious signal boundaries at constant bits, which trivially has perfect precision. CAN-D ML achieves the highest F-Score and Recall, while the Heuristic exhibits the best Precision for all sets. Both exhibit a ∼10% improvement in Recall over all previous methods in the two difficult test sets (c, f−). We do not evaluate ACTT under scenario (c) since it relies heavily on constant bits to shrink the search space. Bottom: Mean ℓ¹ error of translated signal values (Eq. 3) reported for each CAN log, a ∼50% decrease in error from other methods. Finally, note that while CAN-D ML has slightly higher average error than CAN-D Heuristic (due mostly to worse Precision in Step 1), it has lower error for all logs containing little endian signals.

C. Qualitative Results
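The per-log error metric of Eq. (3) in Sec. V-B above is straightforward to compute. A minimal numpy sketch follows; the dictionary-based signal storage, function names, and min-max normalization are our assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def normalize(x):
    """Min-max scale a signal's time series into [0, 1] (constant signals -> 0)."""
    lo, hi = float(x.min()), float(x.max())
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x, dtype=float)

def mean_signal_error(true_sigs, pred_sigs, eta):
    """Eq. (3): mean normalized l1 error between true and predicted signals.

    true_sigs : dict name -> np.ndarray, the normalized true signals S
    pred_sigs : dict name -> np.ndarray, the normalized predicted signals S-hat
    eta       : dict pairing each true signal with the predicted signal
                containing its MSB (the map eta in the text)
    """
    # ||s|| := sum_t |s(t)| / n_id  (normalized l1 norm over the log's samples)
    norm = lambda s: float(np.abs(s).sum()) / len(s)
    matched = set(eta.values())
    # matched pairs contribute ||s - eta(s)||
    err = sum(norm(true_sigs[s] - pred_sigs[eta[s]]) for s in true_sigs)
    # unmatched predicted signals are paired with the zero vector
    err += sum(norm(pred_sigs[p]) for p in pred_sigs if p not in matched)
    # mean over all |S| + |S-hat \ eta(S)| signal pairs
    return err / (len(true_sigs) + len(pred_sigs) - len(matched))
```

Because every per-pair term lies in [0, 1], the resulting mean signal error is also bounded in [0, 1], which makes errors comparable across logs with different numbers of signals.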
Fig. 6 depicts three examples of messages decoded by CAN-D (identical decodings for the ML and heuristic signal boundary classification) and by the most accurate competing methods (READ and LibreCAN, which both produce the same signal boundary predictions for these examples), with detailed descriptions and discussion. These examples illustrate a message with: signed and unsigned signals (top), little endian unsigned signals (bottom left), and little endian (signed and unsigned) signals (bottom right). CAN-D correctly tokenizes and translates all examples, overall furnishing interpretable time series. Where available, CAN-D's physical interpretation (Step 4, Sec. III-D) is provided in annotations above signals, showing the R value to gauge goodness-of-match. Overall, mis-tokenization and mis-translation by other methods result in rampant discontinuities and dramatic error in most time series, exhibiting the necessity of correctly identifying each signal's endianness and signedness.

VI. PROTOTYPE OBD-II PLUGIN
Fig. 7: Prototype CAN-D device using a Raspberry Pi and CANBerry Dual 2.1 boards
The CAN-D Prototype Device is a vehicle-agnostic, OBD-II (on-board diagnostics) plugin that collects CAN data from the vehicle and runs the entire CAN-D pipeline depicted in Fig. 1. The prototype (shown in Fig. 7) is built using Linux-based, single-board computers. Specifically, we use a Raspberry Pi 3B+ with Raspbian Buster in conjunction with an IndustrialBerry CANBerry Dual 2.1 [37]. The Raspberry Pi 3B+ offers 1 GB of RAM and a 1.4 GHz ARMv8 processor. The device is powered either from battery or using on-board power from a vehicle's 12-volt system.

One challenge of building a vehicle-agnostic prototype is that the bitrate of the CAN is unknown and varies per vehicle, and improper bitrate selection can cause adverse vehicle function. To solve this issue, the device iterates through common bitrates, identifying the bitrate that results in only expected packets. This allows our prototype to be compatible with most CANs regardless of bitrate.

Another complication is that automobiles typically have multiple CAN buses, and often more than one is available from the OBD-II interface. The prototype analyzes two unique networks by allocating a dedicated CAN controller for each using the CANBerry Dual 2.1. Once connected, it automatically …

Fig. 6: Tokenization & translation of three messages by CAN-D and top competing methods, READ & LibreCAN. When interpretation is provided by CAN-D, the label and units of the matched diagnostic are shown with the R value, and the values are scaled appropriately.

(a) Message containing signed and unsigned engine- and pedal-related signals. Left: Signal boundaries and endianness are correctly identified by all methods. Middle: All signals are correctly translated and have physical interpretations by CAN-D. Highly correlated matches are found for green, blue, and maroon signals. The navy signal at bit 4, matched to DID 'Accelerator pedal position D' with low correlation (R = . ), is likely an accelerator indicator. As this is not an available DID, CAN-D has unearthed information that could not be simply queried. Right: Other methods incorrectly translate the green and blue signals as unsigned, resulting in sharp discontinuities where the signals change sign.

(b) Message containing four wheel speeds encoded as little endian signals. Top: Correct tokenization & translation by CAN-D and match to "Vehicle Speed" DID with R = 1. Bottom: Mis-tokenized as five big endian signals by other methods, with MSBs (bits 13-15, 29-31, and 45-47) attributed to the wrong signals. Since all encode speed, the blue, green, and orange signals appear correct, save some minor discontinuities. However, these signals encode the wheel speeds and are often used by Electronic Stability Control to stimulate anti-lock braking and traction control pending discrepancies in wheel speeds; hence, mixing the MSBs of wheel speeds may go unnoticed in normal conditions but prove consequential in adverse driving conditions!

(c) Message containing four steering-related, little endian signals, three of which are signed. Top: Correct tokenization & translation by CAN-D (no interpretation). Bottom: Incorrect tokenization & translation by other methods. Assuming big endian signals, they are forced to cut on most byte boundaries, resulting in truncated, noisy teal, snot, orange, and maroon signals. The navy signal does not appear noisy, but is noticeably incorrect when comparing the scale and the values for t ∈ [0, …] to the correct CAN-D translation. The two MSBs are misattributed to the next signal, resulting in errors of at least … when the MSB(s) are nonzero.

VII. CONCLUSION
We consider the problem of developing a vehicle-agnostic method for extracting the hidden signals in automotive CAN data, and present a comprehensive survey of this area. We present CAN-D, a four-step, modular pipeline using a combination of machine learning, a novel optimization process, and heuristics to identify and correctly translate signals in CAN data to their numerical time series. In particular, CAN-D is designed to extract big and little endian signals as well as signed and unsigned signals. While this greatly increases the complexity of the problem, these are necessary accommodations as specified by standard signal definitions. As our results exhibit, when endianness and signedness are ignored, the resulting translations are incorrect and overly noisy. In evaluation on ten diverse vehicles' data, we compare CAN-D to the four state-of-the-art methods, providing a comparative study of previous methods on a more comprehensive dataset than ever previously used. We achieve less than 20% of the average error of other methods and establish that CAN-D is the lone method that can handle any standard CAN signal. Finally, we present a lightweight hardware implementation for using CAN-D in situ via an OBD-II connection to first learn a vehicle's signals and, in future drives, convert raw CAN data to multivariate time series in real time. As CAN signals provide a rich source of real-time data that is currently unrealized, we hope this contribution will facilitate many vehicle technology developments.

ACKNOWLEDGEMENTS
Special thanks to Bill Kay for helpful comments. Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy (DOE) and by the DOE, Office of Science, Office of Workforce Development for Teachers and Scientists (WDTS) under the Science Undergraduate Laboratory Internship (SULI) program.

REFERENCES

[1] Jaynes, M. et al. (2016) Automating ECU Identification for Vehicle Security. In ICMLA, IEEE.
[2] Markovitz, M. and Wool, A. (2017) Field classification, modeling and anomaly detection in unknown CAN bus networks. Vehicular Communications.
[3] Huybrechts, T. et al. (2017) Automatic reverse engineering of CAN bus data using machine learning techniques. In .
[4] Nolan, B. C. et al. (2018) Unsupervised time series extraction from controller area network payloads. In VTC Fall, IEEE.
[5] Marchetti, M. and Stabili, D. (2019) READ: Reverse Engineering of Automotive Data Frames. IEEE Transactions on Information Forensics and Security, (4).
[6] Verma, M. E. et al. (2018) ACTT: Automotive CAN Tokenization & Translation. In CSCI, IEEE.
[7] Pesé, M. D. et al. (2019) LibreCAN: Automated CAN Message Translator. In SIGSAC CCS, ACM.
[8] Young, C. et al. (2020) Towards Reverse Engineering Controller Area Network Messages Using Machine Learning. In IEEE WF-IoT, IEEE.
[9] Automotive buses. https://training.dewesoft.com/online/course/automotive-buses-can-measurement.
[10] Bosch GmbH, R. (1991) CAN Specification Version 2.0.
[11] Provencher, H. (2012) Controller Area Networks For Vehicles. In Seminar Course ENGR G, Vol. 5003.
[12] Endianness. https://en.wikipedia.org/wiki/Endianness (Nov, 2019) Wikipedia.
[13] Hackaday: CAN Hacking. https://hackaday.com/2013/10/22/can-hacking-the-in-vehicle-network/.
[14] Hooovahh's Blog: CAN Part 5 - Signal API. http://hooovahh.blogspot.com/2017/05/can-part-5-signal-api.html.
[15] Two's complement. https://en.wikipedia.org/wiki/Two%27s_complement (Nov, 2019) Wikipedia.
[16] Unified Diagnostic Services. https://en.wikipedia.org/wiki/Unified_Diagnostic_Services (Nov, 2019) Wikipedia.
[17] OBD-II PIDs. https://en.wikipedia.org/wiki/OBD-II_PIDs (Oct, 2018) Wikipedia.
[18] Smith, C. (2016) The Car Hacker's Handbook: A Guide for the Penetration Tester, No Starch Press.
[19] Checkoway, S. et al. (2011) Comprehensive experimental analyses of automotive attack surfaces. In USENIX Sec., Vol. 4.
[20] Koscher, K. et al. (2010) Experimental Security Analysis of a Modern Automobile. In IEEE.
[21] Miller, C. and Valasek, C. (2014) Adventures in Automotive Networks and Control Units.
[22] Miller, C. and Valasek, C. Remote exploitation of an unaltered passenger vehicle. Black Hat USA, 91.
[23] Lokman, S.-F. et al. Intrusion detection system for automotive Controller Area Network (CAN) bus system: a review. EURASIP Journal on Wireless Communications & Networking, (1).
[24] Wu, W. et al. (2019) A Survey of Intrusion Detection for In-Vehicle Networks. IEEE T-ITS.
[25] Moore, M. R. et al. (2017) Modeling inter-signal arrival times for accurate detection of CAN bus signal injection attacks. In CISRC, ACM.
[26] Lee, H. et al. (2017) OTIDS: A novel intrusion detection system for in-vehicle network by using remote frame. In PST, IEEE.
[27] Choi, W. et al. (2018) Identifying ECUs using inimitable characteristics of signals in controller area networks. IEEE Transactions on Vehicular Technology, (6).
[28] Tyree, Z. et al. (2018) Exploiting the Shape of CAN Data for In-Vehicle Intrusion Detection. In VTC Fall, IEEE.
[29] Pawelec, K. et al. (2019) Towards a CAN IDS Based on a Neural Network Data Field Predictor. In AutoSec, ACM.
[30] Taylor, A. et al. (2016) Anomaly detection in automobile control network data with long short-term memory networks. In Conf. on Data Science and Advanced Analytics, IEEE.
[31] Nair Narayanan, S. et al. (May, 2016) OBD SecureAlert: An Anomaly Detection System for Vehicles.
[32] Hanselmann, M. et al. (2020) CANet: An Unsupervised Intrusion Detection System for High Dimensional CAN Bus Data. IEEE Access.
[33] Enev, M. et al. (2016) Automobile driver fingerprinting. Proceedings on Privacy Enhancing Technologies, (1).
[34] Wakita, T. et al. (2006) Driver Identification Using Driving Behavior Signals. IEICE Trans. on Info & Systems.