Software-Based Monitoring and Analysis of a USB Host Controller Subject to Electrostatic Discharge
Natasha Jarus, Antonio Sabatini, Pratik Maheshwari, Sahra Sedigh Sarvestani
2020 CSI/CPSSI International Symposium on Real-Time and Embedded Systems and Technologies (RTEST)
Department of Electrical and Computer Engineering
Missouri University of Science and Technology, Rolla, MO 65409, USA
Email: {jarus, ajs5gd, prm8c7, sedighs}@mst.edu

Abstract: Observing, understanding, and mitigating the effects of failure in embedded systems is essential for building dependable control systems. We develop a software-based monitoring methodology to further this goal. This methodology can be applied to any embedded system peripheral and allows the system to operate normally while the monitoring software is running. We use software to instrument the operating system kernel and record indicators of system behavior. By comparing those indicators against baseline indicators of normal system operation, faults can be detected and appropriate action can be taken.

We implement this methodology to detect faults caused by electrostatic discharge in a USB host controller. As indicators, we select specific control registers that provide a manifestation of the internal execution of the host controller. Analysis of the recorded register values reveals differences in system execution when the system is subject to interference. This improved understanding of system behavior may lead to better hardware and software mitigation of electrostatic discharge and assist in root-cause analysis and repair of failures.
Keywords: Software instrumentation, Electrostatic discharge, Failure analysis, Universal Serial Bus, Computerized instrumentation, Embedded software, Software debugging
I. INTRODUCTION
As embedded systems become smaller and smaller, they become more vulnerable to physical events and thus more difficult to make reliable. Interference from Electrostatic Discharge (ESD) is a major cause of this unreliability, since a smaller electrical charge is required for smaller components to experience an ESD event. The effects of these events on the software running on the embedded system are not yet well understood. In order to understand these effects, we must observe how the hardware effects of ESD manifest in the software controlling that hardware.

Depending on severity, ESD events can cause permanent hardware damage or manifest as software glitches, such as screen flickers or program crashes, that appear random and unexpected to the system user. Associating these user-observed failures with specific software and hardware faults is an ongoing challenge. Additionally, component miniaturization increases the difficulty of monitoring all traces on a board for ESD with hardware probes. Even if all the traces could be monitored, it is a nontrivial task to analyze where the ESD coupled to the system and the resulting effect it had on various components. Finally, while invasive hardware testing might be feasible on a development board, testing on consumer hardware not equipped with test points and monitoring hardware is much more difficult. Executing in-field tests or analyzing faults that only occur on production hardware are daunting tasks.

We propose a low-level, lightweight software-based method for monitoring, detecting, and analyzing the effects of ESD events. Our method is applicable to other types of electromagnetic interference, but in the interest of clarity, the focus of this paper is on ESD events. Software instrumentation allows for monitoring of hardware that cannot be physically probed. Some existing software analysis techniques focus on high-level failures, e.g., screen glitches, but stop short of root-cause analysis of hardware faults. Other software approaches study low-level failures, such as data corruption in CPU caches, but require complete control of the system unmediated by an operating system and are thus inapplicable to systems under typical usage conditions. Our approach uses modified hardware drivers to allow a system in the field to be monitored for ESD events.

With software instrumentation, we are able to observe changes in system operation caused by ESD. We compare this behavior to normal system operation to determine whether a system is experiencing an ESD event. These results have several applications in failure analysis as well as hardware and software design. Collected data can be used for postmortem analysis, validating system designs, and runtime fault detection and recovery. Throughout this paper, we will discuss these observations and analyses within the context of a USB host controller on an embedded system, specifically, an ARM system running Linux. In our work, we consider small-scale ESD events that do not persist after a power cycle.

In this paper, we present:
• A method for instrumenting device drivers to monitor internal operation of system peripherals.
• A method for analyzing the observed states of peripheral operation.

The rest of the paper is organized as follows: Section II reviews related literature. Our software instrumentation approach is described in Section III. Section IV discusses the data analysis algorithm. Experimental setup is documented in Section V and results and observations are presented in Section VI. Finally, Section VII draws conclusions and discusses future extensions to this work.

978-1-7281-7551-5/20/$31.00
II. BACKGROUND AND RELATED LITERATURE
ESD-induced failures can be broadly categorized as either hard failures or soft failures [1]. In this context, a hard failure permanently damages the system so that components must be replaced. Soft failures, on the other hand, can be recovered from; these failures are further characterized into three levels based on the visibility of the failure and the action needed to recover from it:

Level 1) The system automatically recovers with no user-visible faults or loss or corruption of data. Often this recovery is possible due to ESD-robust hardware and fault-tolerant control protocols.
Level 2) The system experiences a system-level manifestation, such as momentary screen or data corruption, but recovers without intervention.
Level 3) The system crashes or requires the user to perform an action, such as resetting the system or unplugging and re-plugging a device, to recover from a fault condition.

These failures are studied using a variety of hardware- and software-based techniques.

Numerous studies have investigated the relationship between ESD interference and level 2 and 3 soft failures. Hardware ESD fault injection with direct injection and field injection probes is described in [2–4]. These studies characterize integrated circuit (IC) immunity to ESD. The sensitivity threshold for each IC was determined by injecting ESD at increasing voltages and observing when errors occurred. In these studies, only user-visible errors, such as screen glitches or hardware resets, were investigated.

Izadi et al. [5] extend this fault injection process by mapping the ESD sensitivity of the board. The injection probes are attached to a 2-D scanner that sweeps them across the board. At each point on a grid over the CPU, ESD is injected and the level at which the device becomes susceptible is recorded. The resulting map can be used to identify traces and components that are at risk for ESD damage. Mapping is carried out at various CPU loads and clock speeds; the authors determine that the system is most susceptible under heavy load and low clock speed.

Vora et al. [6] study user-visible soft failures in a microprocessor, a microcontroller, and an FPGA. In particular, they observed a relationship between CPU load and likelihood of display flicker on a microprocessor, indicating that ESD was coupling to the CPU chip rather than to the display itself. Furthermore, they observed that the likelihood of certain failures (process termination and display flicker) depends on the program executing at the time of the ESD event.

Investigating level 1 soft failures and understanding the root causes of higher-level soft failures requires the ability to observe a system's behavior at a high level of detail. Vora et al. [6, 7] and Feng et al. [8] use a custom microcontroller running code which monitors register values and system interrupts to study the effects of ESD on CPUs. While too invasive to use on a system performing additional tasks, this approach gives a very fine-grained view of observable soft failures. In particular, the authors observe numerous multiple-bit errors in IO registers and frequent spurious interrupt triggers.

The effect of ESD on USB devices in particular has also been investigated. Maghlakelidze et al. [9] develop an automated testing system for studying soft failures in a USB interface on a single-board computer. The system is characterized by injecting ESD pulses of varied voltage and pulse width into specific IC pins. Soft failures are observed based on data transmission rate and error messages in kernel logs. Under positive voltage injections, most failures did not require user interaction; however, negative voltage injections produced numerous severe soft failures. Koch et al. [10] further test USB-related soft failures and determine that likelihood of failure is also dependent on the state of the USB protocol, i.e., which packets are being transmitted at the time of the injection. Root cause analysis shows that many failures are caused by ESD coupling to the power domains in the USB controller rather than to data lines.

While some soft failures are not user-visible, they may still be observable by software monitoring of low-level system behavior. Yuan et al. [11] continuously poll the status of a phase-locked loop (PLL) embedded in the microcontroller; if the PLL unlocks, it can be assumed that the system has experienced an ESD shock. While this approach provides an excellent measure of ESD events on the microcontroller, it cannot measure peripheral ESD events because most peripherals do not contain a separate PLL that can be monitored by the microcontroller.

Another case study of low-level system monitoring is carried out in [12] on a wireless router. A debugging serial port on the router logs every context switch performed by the processors, giving an approximate record of the execution path taken by processes running on the router. This data is collected into system function graphs of both reference operating function and ESD-exposed function. Several graph metrics are applied to these graphs; differences in metric values indicate that soft failures can be observed by this monitoring technique.

While not directly related to ESD events, software-based as well as combined hardware and software system monitoring approaches have been studied extensively. Watterson and Heffernan [13] outline research related to monitoring for runtime verification. System state is monitored by some combination of hardware and software; this information is then used to verify that the system is operating within specification. A software-specific study of fault monitoring is carried out by [14]. The authors present a taxonomy of runtime monitoring approaches and discuss various system requirements for different monitoring techniques.

Choudhuri and Givargis [15] develop a mixed hardware and software approach for logging non-deterministic behavior in embedded systems. They modify a compiler to emit code that logs messages to an attached storage system, reducing processing overhead on the low-powered embedded hardware being monitored. Reinbacher et al. [16] create a tool that converts an embedded system software specification into both an executable and a configuration for a hardware monitor. The hardware monitor interfaces with the embedded CPU and its communication buses and verifies the operation of the system.

Delgado et al. [14] develop a software-based monitoring system for ATMs by instrumenting the drivers for each hardware component to measure state and performance. A runtime checker uses the resulting data to determine if the system is operating correctly. If not, recovery actions can be taken to restore system availability.

Fig. 1. USB subsystem block diagram

The goal of our work is to improve the resolution of ESD software detection (in effect, to make some level 1 soft failures visible to detection software) and to better understand the software and hardware root causes of all types of soft failures. Early detection of ESD effects and detailed logs of system behavior are essential to tracing ESD as it propagates through a system. We aim to achieve this with minimal impact to system behavior, as the visibility and behavior of ESD-induced failures can change based on the processes running on the system. Allowing a system to operate as it does in the field provides a better basis for testing ESD effects and reproducing unusual ESD-induced failures. Finally, such instrumentation enables real-time monitoring and recovery from faults, improving system immunity to level 2 and 3 soft failures.

III. PROPOSED MONITORING APPROACH
Inducing ESD events on an embedded system peripheral can flip bits on its data or control lines and cause power glitches that corrupt computations. These flipped bits can lead to changes in register values, loss of synchronization between peripherals and the CPU, or data corruption. All of these effects are visible to software running on the embedded system and thus should be detectable by monitoring software. Our objective is to use software to log as many of these events as possible for analysis. Monitoring only for corruption of transmitted data may be confounded by protocol-level checksums and retransmissions; furthermore, doing so only observes the effects of ESD on data lines and misses ESD events caused by discharges into the chip power supplies. The methodology we propose offers a lower-level view of peripheral operation that captures a wider array of ESD events.

This work can be applied to many computer peripherals, but we present it in the context of a USB Host Controller on an embedded system running Linux (see Section V for more detail). The Host Controller serves as an interface between the physical USB hardware and the software executing on the system's CPU as shown in Figure 1. Its responsibilities include enumerating devices as they are connected and disconnected, configuring power delivery, and communicating data and control signals between the system's memory and the USB peripherals. We select it for instrumentation as it connects directly to the USB bus and is thus subject to any ESD events happening on the bus.

Our work focuses on non-invasive monitoring of the effects of ESD events using software that allows normal system operation. We primarily study changes in register values, as those values control the operation of the peripheral device. Each system state is represented by the n-tuple consisting of the values of each peripheral register at a specific time. Some of these changes will be part of normal operation. When ESD is induced, however, we should observe new abnormal states or unexpected transitions between normal states. These abnormal states and transitions can indicate that the system is experiencing ESD. Our analysis avoids state-space explosion by only considering states that are observed during system operation; it does not exhaustively explore the state space.

The USB host controller is a complex piece of hardware whose operation is quite opaque to the system CPU. We cannot inspect any of its internal registers or microcode execution process. The extent of our visibility into its operation is the control registers it exposes to the system. We select registers to monitor that provide a manifestation of the host controller's internal state. Recording snapshots of register values as the system performs USB operations gives a trace of host controller execution. The goal of this research is to use these traces to observe anomalous operation potentially caused by ESD, as summarized in Figure 2.

Fig. 2. Research Methodology

While the host controller's registers are mapped in system memory, Linux's memory protection mechanisms prevent unprivileged programs from reading them. Thus, we must insert some software into the Linux kernel to allow us access to those memory addresses. Initially we attempted a naive approach which repeatedly sampled the registers in a loop. After this approach proved ineffective, we developed a more sophisticated approach which captures register values every time they are relevant to software executing on the CPU.
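The state representation described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three-register subset in `REGISTERS`, the helper names `to_state` and `find_anomalies`, and the example values are assumptions chosen for illustration. Each snapshot becomes an n-tuple, and an observed trace is compared against a baseline trace to flag states never seen before and unseen transitions between otherwise normal states.

```python
from typing import Dict, List, Set, Tuple

State = Tuple[int, ...]             # one value per monitored register, fixed order
Transition = Tuple[State, State]    # (state before, state after)

# Hypothetical subset of monitored registers; the paper monitors more of them.
REGISTERS = ("HcControl", "HcCommandStatus", "HcInterruptStatus")


def to_state(snapshot: Dict[str, int]) -> State:
    """Convert a {register name: value} snapshot into an n-tuple."""
    return tuple(snapshot[name] for name in REGISTERS)


def find_anomalies(
    baseline: List[State], observed: List[State]
) -> Tuple[List[State], List[Transition]]:
    """Return states and transitions that occur only in the observed trace.

    Only states that were actually recorded are considered, so the full
    state space is never enumerated.
    """
    known_states: Set[State] = set(baseline)
    known_edges: Set[Transition] = set(zip(baseline, baseline[1:]))

    # States never seen during normal operation.
    new_states = [s for s in observed if s not in known_states]

    # Unexpected transitions between states that are individually normal.
    new_edges = [
        (s, t)
        for s, t in zip(observed, observed[1:])
        if s in known_states and t in known_states and (s, t) not in known_edges
    ]
    return new_states, new_edges
```

A trace here is simply a list of states in the order they were recorded; whether a flagged state is actually caused by ESD still requires the kind of analysis the paper describes.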
A. Initial Design
Our first design focused on directly reading USB register values from their physical memory addresses. We adapted the Myregrw [17] software to better suit our needs as a softprobe for ESD. This software consists of a Linux system driver and a program that communicates with it. The driver reads the values of requested physical memory addresses. The user-level program reads a configuration file specifying which addresses to request, repeatedly requests the data at those addresses, and stores that data to a file.

We configured Myregrw to record the control and status registers for the USB host interface. We injected ESD into the host controller while Myregrw continuously sampled the registers. In theory, ESD-caused changes should appear in the recorded register values.

Footnote: These are mapped in the physical address range [...]–[...].

However, the sampling rate of this softprobe was not sufficient to observe ESD-induced errors. We empirically determined that the sampling rate of the software running [...] * [...] * 100 = 0.[...] of its values. As we have twenty-three registers to monitor, the effective sampling rate will be even lower. Considering this low probability and the lack of information recorded from our experiments, we devised a new measurement methodology with a higher sampling rate capable of recording additional register values.

A confounding issue with this approach is the competition for access to these values between the Myregrw driver and the USB host controller driver. By default, Linux drivers have execution priority over any user applications, meaning that it would be nearly impossible to read all of the register values after an error but before the USB host controller driver modifies the registers. Therefore, we developed a new methodology that, in addition to providing a faster sampling rate, ensures the register values are recorded before the USB host controller driver can modify them.

B. Improved Design
In the improved approach, we first enabled the debugging configuration already present in the USB host controller driver. We then modified the drivers for the USB host controller. The host controller driver consists of several functions that are called when certain events occur; for example, ohci_irq is called when an IRQ occurs for the host controller. We configured each function to first log its name and the values of the host controller registers to the system log. These modifications allow us to observe not only register state changes but also the order in which different driver functions are called. An example of such a log entry is shown in Figure 3.

This approach is minimally invasive as the driver modifications are minor and do not affect the logic of the driver itself. While this induces a constant overhead, in practice the overhead is small and can be reduced by using, e.g., a buffer to hold log entries and a separate program to write those entries to a file. The sampling rate is variable, but it exactly captures the driver-visible operation of the USB host controller.

IV. PROPOSED ANALYSIS APPROACH
The log files generated by the instrumented driver consist of lines each having a timestamp, register name, and associated register value. We parse these lines into n-tuples containing snapshots of register values at the time of each function call. The sequence of n-tuples from each log constitutes an execution trace.

Many of these execution traces revisit the same states repeatedly. By identifying these repeated states and coalescing them, we can develop an execution graph. This graph is a directed graph where each node is a unique system state and an edge from node s to node t indicates that the system went from state s to state t in the corresponding execution trace. An execution trace then becomes a path through the execution graph.

Footnote: The instrumentation overhead is estimated to be less than 10%.

    function: ohci_irq
    HcControl: 0x83
    HcCommandStatus: 0x4
    HcInterruptStatus: 0x24
    HcInterruptEnable: 0x8000005e
    HcInterruptDisable: 0x8000005e
    HcHCCA: 0x338b1000
    HcPeriodCurrentED: 0x0
    HcControlHeadED: 0x339b2000
    HcControlCurrentED: 0x0
    HcBulkHeadED: 0x339b2080
    HcBulkCurrentED: 0x0
    HcDoneHead: 0x0
    HcFmInterval: 0xa7782edf
    HcFmRemaining: 0x80002760
    HcFmNumber: 0x921d
    HcPeriodicStart: 0x2a2f
    HcLSThreshold: 0x628
    HcRhDescriptorA: 0x2001202
    HcRhDescriptorB: 0x0
    HcRhStatus: 0x8000
    HcRhDescriptorA: 0x2001202
    HcRhPortStatus[0]: 0x103
    HcRhPortStatus[1]: 0x100
    Done.

Fig. 3. Example host controller state

Once execution graphs for each log have been created, we repeat the deduplication process to produce the unified execution graph of all runs. This allows us to identify similarities and differences among system execution traces.

The operation of the analysis code can be summarized as follows:
1) Parse the log files to create states based on the registers' values.
2) Deduplicate these execution traces to derive a per-run execution graph.
3) Deduplicate the execution graphs of different runs to derive a unified execution graph.
4) Using this execution graph and each run's execution trace, perform statistical analysis on the data.

A. Constructing Execution Graphs
The first stage of analysis parses register values from the log file for each run. After creating tuples for each of the states in that file, we deduplicate the sequence of states to create the nodes of the execution graph. We then derive the execution trace path through the execution graph from the state sequence. We also record the number of times each transition is taken.
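As an illustration only, the parsing and deduplication steps described above might be sketched as follows. This is not the authors' analysis code: the function names `parse_states` and `build_execution_graph` are hypothetical, and the line format is assumed to match the entry in Fig. 3 (a `function:` line, `Register: 0xvalue` lines, then `Done.`), with any timestamp prefix ignored.

```python
import re
from collections import Counter

# Assumed line shapes, modeled on Fig. 3; a timestamp prefix is tolerated.
FUNC_RE = re.compile(r"function:\s*(\w+)")
REG_RE = re.compile(r"(Hc\w+(?:\[\d+\])?):\s*(0x[0-9a-fA-F]+)")


def parse_states(lines):
    """Return one state per logged function call.

    A state is a tuple of (register name, value) pairs in log order.
    """
    states, current = [], None
    for line in lines:
        if FUNC_RE.search(line):
            current = []  # a new register snapshot begins
            continue
        if current is None:
            continue  # ignore anything outside a snapshot
        reg = REG_RE.search(line)
        if reg:
            current.append((reg.group(1), int(reg.group(2), 16)))
        elif "Done." in line:
            states.append(tuple(current))
            current = None
    return states


def build_execution_graph(states):
    """Deduplicate states into nodes; return (nodes, trace, transitions)."""
    nodes = {}  # state -> node id, assigned in order of first appearance
    trace = [nodes.setdefault(s, len(nodes)) for s in states]
    # Count how often each (source, target) transition is taken.
    transitions = Counter(zip(trace, trace[1:]))
    return nodes, trace, transitions


# Tiny made-up log with three snapshots, two of them identical.
log = """\
function: ohci_irq
HcControl: 0x83
HcInterruptStatus: 0x24
Done.
function: ohci_irq
HcControl: 0x83
HcInterruptStatus: 0x4
Done.
function: ohci_irq
HcControl: 0x83
HcInterruptStatus: 0x24
Done.
""".splitlines()

nodes, trace, transitions = build_execution_graph(parse_states(log))
print(len(nodes), trace, dict(transitions))
```

In this sketch the execution trace is simply the list of node ids in visiting order, so it is by construction a path through the graph, and the `Counter` holds the per-transition counts mentioned above.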
B. Constructing the Unified Execution Graph
The next analysis step combines the data from each log into a unified execution graph. The process is similar to that used to develop the execution graph for each log. Certain registers for the host controller contain memory addresses that change every time the driver is reloaded. The values of these registers [...]

Footnote: These registers are HcPeriodCurrentED, HcBulkCurrentED, HcFmRemaining, HcHCCA, HcControlHeadED, HcControlCurrentED, HcBulkHeadED, HcFmNumber, and HcDoneHead.
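The following sketch shows one way per-run state sequences could be merged into a single graph with shared node identities. It rests on an assumption that the source does not confirm: that the registers listed in the footnote are excluded when comparing states across runs. The names `normalize`, `unify`, and `only_in` are hypothetical, and states are assumed to be tuples of (register, value) pairs.

```python
from collections import Counter

# Registers the footnote lists as changing between driver reloads.
VOLATILE = frozenset({
    "HcPeriodCurrentED", "HcBulkCurrentED", "HcFmRemaining", "HcHCCA",
    "HcControlHeadED", "HcControlCurrentED", "HcBulkHeadED", "HcFmNumber",
    "HcDoneHead",
})


def normalize(state):
    """Drop volatile registers (an assumed policy, not the paper's stated one)."""
    return tuple((reg, val) for reg, val in state if reg not in VOLATILE)


def unify(runs):
    """Merge runs into one graph.

    runs maps a run name to its list of states.
    Returns (node_ids, traces, edges) with node ids shared across all runs.
    """
    node_ids, traces, edges = {}, {}, Counter()
    for name, states in runs.items():
        trace = [node_ids.setdefault(normalize(s), len(node_ids)) for s in states]
        traces[name] = trace
        edges.update(zip(trace, trace[1:]))
    return node_ids, traces, edges


def only_in(traces, exposed, baseline):
    """Node ids visited by the exposed runs but by none of the baseline runs."""
    def seen(names):
        return {node for name in names for node in traces[name]}
    return seen(exposed) - seen(baseline)


# Made-up states: a and b differ only in HcHCCA, so they share a node.
a = (("HcControl", 0x83), ("HcHCCA", 0x338B1000))
b = (("HcControl", 0x83), ("HcHCCA", 0x33A00000))
c = (("HcControl", 0x97), ("HcHCCA", 0x33A00000))

node_ids, traces, edges = unify({"baseline": [a, a], "esd": [b, c, b]})
print(traces, only_in(traces, ["esd"], ["baseline"]))
```

Because every run is mapped onto the same node ids, comparisons between groups of runs reduce to ordinary set operations over traces, as `only_in` illustrates.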
C. Graph Analysis
We divide the data collected from test runs into two groups: baseline and ESD-exposed. Baseline logs are logs of the system operating normally; they provide us with the system’s expected state machine. ESD-exposed logs document how the system transitions into and out of unexpected behavior due to ESD exposure.

After we create the graph of globally unique states, we analyze the baseline and ESD-exposed logs individually to observe how system operation differs between them. We subtract the set of states reached in baseline logs from the set of states reached in ESD-exposed logs to get a list of states only reached during ESD injection. These state sets can be used to show where and when the system transitioned into a state that can potentially be attributed to ESD. Similarly, we can determine which transitions between states are present only during ESD exposure.

V. CASE STUDY
The system used for tests was the FriendlyArm Mini2440 embedded development platform with a Samsung S3C2440 ARM920T processor [18]. Its USB host interface conforms to the Open Host Controller Interface specifications [19]. The system ran a modified Linux kernel based on the version 2.6.29 kernel downloaded from the FriendlyArm website [18]. We set up the system with our logging software and connected it to a PC to control it during the tests. During testing, a standard USB 2.0 flash drive was connected to the system’s USB port. To ensure that the host controller was active during ESD injection, we copied a large file to or from the flash drive during tests.

To thoroughly characterize system behavior, ESD interference was injected using electric (E) field and magnetic (H) field probes powered by a transmission line pulse (TLP) generator. For each probe, multiple tests were run with varying pulse voltages. In addition, different sizes of probes were used to adjust the intensity of the fields injected. The E-field probe does not have an orientation; we positioned it across the USB port or over the host controller IC. E-field interference was injected using an EZ-3 probe at voltages between 500 and 5500 volts with a pulse width between 0.1 and 0.25 seconds. Because the magnetic fields generated by the H-field probe are directional, we conducted tests with the probe in parallel with and perpendicular to the data and control lines. We used two probes, the HX-5 and the HX-1T2, injecting ESD between 500 and 8000 volts with pulse widths between 0.1 and 0.6 seconds. The system was more resilient to H-field interference, allowing us to perform H-field tests with more intense ESD conditions than were possible with E-field tests.

Footnote: We have successfully replicated this study on a more recent Linux kernel version; the results are forthcoming.

TABLE I. PROBABILITY DISTRIBUTION OF REGISTER VALUES: HcInterruptEnable AND HcInterruptDisable
Columns: Value (Enable, Disable); Baseline Probability; ESD-exposed Probability; Difference.

Fig. 4. Probability Distribution of Register Values: HcInterruptStatus. Axes: Values versus Probability (logarithmic, 1E-5 to 1E+0) and Change in Probability; series: Baseline, ESD-exposed. † indicates frame counter overflow; * indicates status change.

VI. RESULTS
A. Registers of Interest
Certain registers on the host controller were observed to give indications of ESD. In particular, we consider the values of the registers for interrupt enabling and disabling (HcInterruptEnable and HcInterruptDisable), interrupt status (HcInterruptStatus), control (HcControl), and port status (HcRhPortStatus0). The host controller can generate hardware interrupts for multiple events and errors; the driver can enable and disable them depending on the current operation and check whether they have been triggered via the interrupt enable, disable, and status registers. The control register allows the driver to switch between various USB transfer modes and enable certain host controller features. The port status register reports whether a port is enabled, what device is connected to a port, device power configuration, etc.

Per the OHCI specification [19], HcInterruptEnable and HcInterruptDisable should be duplicates of each other when read. However, as shown in Table I, there are a few states in the ESD-exposed data where they are not duplicates. This may indicate ESD-induced bit flips or the system failing to properly update both registers when one is changed.
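A comparison of the kind summarized in Table I could be computed along the following lines. This is a sketch under stated assumptions rather than the authors' code: each state is assumed to be a dictionary keyed by register name, the function names `pair_distribution` and `compare_groups` are hypothetical, and the sample values are invented for illustration.

```python
from collections import Counter


def pair_distribution(states):
    """Probability of each (HcInterruptEnable, HcInterruptDisable) value pair."""
    counts = Counter(
        (s["HcInterruptEnable"], s["HcInterruptDisable"]) for s in states
    )
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def compare_groups(baseline, exposed):
    """Rows of (pair, baseline prob., exposed prob., difference, mismatch)."""
    base_p = pair_distribution(baseline)
    esd_p = pair_distribution(exposed)
    rows = []
    for pair in sorted(set(base_p) | set(esd_p)):
        b, e = base_p.get(pair, 0.0), esd_p.get(pair, 0.0)
        # mismatch is True when the two registers do not read back identically
        rows.append((pair, b, e, e - b, pair[0] != pair[1]))
    return rows


# Invented sample data: one exposed state has a differing Disable value.
ok = {"HcInterruptEnable": 0x8000005E, "HcInterruptDisable": 0x8000005E}
bad = {"HcInterruptEnable": 0x8000005E, "HcInterruptDisable": 0x8000001E}

for pair, b, e, diff, mismatch in compare_groups([ok] * 4, [ok] * 3 + [bad]):
    flag = "mismatch" if mismatch else ""
    print(f"{pair[0]:#x} {pair[1]:#x} {b:.2f} {e:.2f} {diff:+.2f} {flag}")
```

Rows flagged as mismatches correspond to the states where the two registers are not duplicates; whether such a row reflects a bit flip or a failed register update cannot be decided from the distribution alone.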
Register values observed for HcInterruptStatus are shown in Figure 4 along with the probability of those values appearing in baseline and ESD-exposed logs and the absolute change in that probability due to ESD exposure. The figure shows a dramatic increase in values where the frame number counter overflowed (marked †) in the ESD-exposed logs, indicating that the system transmits many more frames during ESD exposure. In addition, values indicating that the hub’s status has changed (marked *) are much more prevalent in ESD-exposed logs.

The HcControl register values provide a different perspective on the increase in the number of frames and hub status changes. Figure 5 shows a great increase in control frame processing and a corresponding decrease in bulk data frame processing. It is possible that ESD glitches are disrupting bus operation, requiring the host controller and device to send a greater number of status change frames. In addition, corruption in the bulk data frames would require retransmissions and therefore increase the number of new control and data frames.

Fig. 5. Probability Distribution of Register Values: HcControl. Axes: Values versus Probability and Change in Probability; series: Baseline, ESD-exposed.

The HcRhPortStatus0 register contains status information about the port the USB drive was plugged into during testing. Figure 6 shows a marked decrease in states where the port status remains unchanged (⋄) and an increase in states indicating the port has been enabled or disabled (†). In addition, port resets (*) were only observed in ESD-exposed logs. The prevalence of resets and of toggling whether the port is enabled hints that the host controller is experiencing unexpected errors and attempting to recover by resetting the port’s status. The presence of a port reset where the driver or host controller would not usually issue one is a particularly strong indicator of ESD exposure.

Fig. 6. Probability Distribution of Register Values: HcRhPortStatus0. Axes: Values versus Probability and Change in Probability; series: Baseline, ESD-exposed. ⋄ indicates device connected with no change in port status; † indicates port enabled/disabled; * indicates port reset.

B. Execution Graphs
Figure 7 shows an execution graph of sample baseline and ESD-exposed execution traces. The set of nodes and solid arcs on the left of the figure is the execution graph of the baseline log. The right of the figure consists of the additional states and transitions present in the sample ESD-exposed log. This execution graph demonstrates several potential effects of ESD on the system state: transitions to non-baseline states, transitions between non-baseline states, non-baseline transitions between baseline states, and transitions from non-baseline to baseline states.

Fig. 7. Execution graph of one baseline trace and one ESD-exposed execution trace. Legend: b_n, nth state present in baseline logs; e_n, nth state present only in ESD-exposed logs; arcs distinguish transitions present in baseline logs from transitions present only in ESD-exposed logs.

Consider how we should expect the system to behave under normal conditions and under ESD exposure. Normally, it should have a small number of common code paths and some edge case handling. Under ESD exposure, we should see a number of anomalous states caused by various register bits being flipped as well as control flow anomalies. Figure 8 shows the average number of occurrences of states per baseline and ESD-exposed trace. The baseline traces show a few states that are very common and a small tail of less common states. There are far more unique states in ESD-exposed logs, and each is far less likely to occur. (We have omitted half of the ESD-exposed state tail to make the interesting portion of the graph more legible.) This graph provides quick verification of our methodology; we can see that the data we have collected reflects expected system behavior.

Fig. 8. Average State Occurrences Per Log. Vertical axis: average occurrences per log; series: states from baseline logs, states from ESD-exposed logs.

Figure 9 compares the TLP pulse voltage with the percentage of transitions to or from states not in the baseline logs. The lack of a clear relationship between observed ESD coupling and pulse voltage indicates that there are confounding factors between ESD exposure and system behavior. These factors may include field type and orientation, injection location, pulse frequency, and the operation being performed by the host controller at the time of injection. In addition, the ESD injection may cause the system to crash almost instantaneously, in which case the resulting state log will have relatively few states caused by ESD. More work is needed to assess the effect each of these factors has on system operation.

Fig. 9. Relationship between pulse voltage and ESD-caused transitions

VII. CONCLUSION
We have presented a software-based methodology for detecting ESD events on embedded system peripherals. This methodology monitors the state of the peripheral by reading the registers it exposes to the CPU with an instrumented kernel driver for the peripheral. As with all software monitoring techniques, this approach is only able to monitor events that do not entirely disrupt CPU execution. We applied this methodology to a USB host controller on an embedded system running Linux. We demonstrated that we are able to observe states and transitions that the system experiences only when exposed to ESD.

Furthermore, the relationship between the recorded errors and ESD can be reversed. Doing so allows us to predict, based on the errors that the software experiences, when and where the system experiences ESD. We can apply this in several ways: components that have received ESD can be identified, either for replacement (if the goal of the experiment is to repair hardware that has experienced ESD) or for improvement (if the goal is to reduce the effects of ESD on a peripheral); in addition, software can be written to recover from error states in a more efficient and automatic fashion. Software may also be able to compensate for the effects of ESD, allowing operation to continue in hostile environments at the cost of reduced performance and more software overhead.

A topic for future research is correlating system states with ESD injection on a specific location on the board, which could give insight into which components have experienced ESD for performing repairs or assist circuit designers in shielding the board from particular error states. One could also study system states from a software perspective to determine how best to recover from certain ESD-induced errors. Finally, applying this methodology to other peripherals and embedded systems may lead to additional insights for software monitoring.
In particular, applying this methodology in tandem with PCB schematic and chip layout analysis would provide a bridge between software-observed and hardware-observed ESD effects.

REFERENCES
[1] White Paper 3, System Level ESD, Part II: Implementation of Effective ESD Robust Designs, Industry Council on ESD Target Levels, Mar. 2019.
[2] P. Maheshwari, T. Li, J. Lee, B. Seol, S. Sedigh, and D. Pommerenke, “Software-based analysis of the effects of electrostatic discharge on embedded systems,” in IEEE Computer Software and Applications Conference (COMPSAC), July 2011, pp. 436–441.
[3] K. H. Kim and Y. Kim, “Systematic analysis methodology for mobile phone’s electrostatic discharge soft failures,” IEEE Transactions on Electromagnetic Compatibility, vol. 53, no. 3, pp. 611–618, Aug 2011.
[4] T. Schwingshackl, B. Orr, J. Willemen, W. Simburger, H. Gossner, W. Bosch, and D. Pommerenke, “Powered system-level conductive TLP probing method for ESD/EMI hard fail and soft fail threshold evaluation,” in , Sept 2013, pp. 1–8.
[5] O. H. Izadi, A. Hosseinbeig, D. Pommerenke, H. Shumiya, J. Maeshima, and K. Araki, “Systematic analysis of ESD-induced soft-failures as a function of operating conditions,” in , May 2018, pp. 286–291.
[6] S. Vora, R. Jiang, S. Vasudevan, and E. Rosenbaum, “Application level investigation of system-level ESD-induced soft failures,” in , Sep. 2016, pp. 1–10.
[7] S. Vora, R. Jiang, P. M. Vijayaraj, K. Feng, Y. Xiu, S. Vasudevan, and E. Rosenbaum, “Hardware and software combined detection of system-level ESD-induced soft failures,” in , Sep. 2018, pp. 1–10.
[8] K. Feng, S. Vora, R. Jiang, E. Rosenbaum, and S. Vasudevan, “Guilty as charged: Computational reliability threats posed by electrostatic discharge-induced soft errors,” in , Mar. 2019, pp. 156–161.
[9] G. Maghlakelidze, P. Wei, W. Huang, H. Gossner, and D. Pommerenke, “Pin specific ESD soft failure characterization using a fully automated set-up,” in , Sep. 2018, pp. 1–9.
[10] S. Koch, B. J. Orr, H. Gossner, H. A. Gieser, and L. Maurer, “Identification of soft failure mechanisms triggered by ESD stress on a powered USB 3.0 interface,” IEEE Transactions on Electromagnetic Compatibility, vol. 61, no. 1, pp. 20–28, Feb. 2019.
[11] S.-Y. Yuan, Y.-L. Wu, R. Perdriau, and S.-S. Liao, “Detection of electromagnetic interference in microcontrollers using the instability of an embedded phase-lock loop,” IEEE Transactions on Electromagnetic Compatibility, vol. 55, no. 2, pp. 299–306, Apr. 2013.
[12] X. Liu, O. H. Izadi, G. Maghlakelidze, M. Pommerenke, and D. Pommerenke, “A preliminary study of ESD effects on the process calls tree of a wireless router,” in , Jul. 2018, pp. 408–413.
[13] C. Watterson and D. Heffernan, “Runtime verification and monitoring of embedded systems,” Software, IET, vol. 1, no. 5, pp. 172–179, 2007.
[14] N. Delgado, A. Gates, and S. Roach, “A taxonomy and catalog of runtime software-fault monitoring tools,” IEEE Transactions on Software Engineering, vol. 30, no. 12, pp. 859–872, 2004.
[15] S. Choudhuri and T. Givargis, “FlashBox: a system for logging non-deterministic events in deployed embedded systems,” in Proceedings of the 2009 ACM symposium on Applied Computing (SAC), S. Y. Shin and S. Ossowski, Eds. ACM, 2009, pp. 1676–1682.
[16] T. Reinbacher, M. Horauer, and A. Steininger, “A runtime verification unit for microcontrollers,” in System, Software, SoC and Silicon Debug Conference (S4D), September 2012, p. 16.
[17] “edwinrong/myregrw,” https://github.com/edwinrong/myregrw, accessed: 2020-01-20.
[18] “Downloads - FriendlyArm,” http://dl.friendlyarm.com/mini2440, accessed: 2020-01-20.
[19] OpenHCI: Open host controller interface specification for USB, Compaq, Microsoft, and National Semiconductor, Oct. 2000, release: 1.0a.