Quantifying the Latency and Possible Throughput of External Interrupts on Cyber-Physical Systems

Abstract

An important characteristic of cyber-physical systems is their capability to respond in time to events from their physical environment. However, to the best of our knowledge there exists no benchmark for assessing and comparing the interrupt handling performance of different software stacks. Hence, we present a flexible evaluation method for measuring the interrupt latency and throughput on ARMv8-A based platforms. We define and validate seven test-cases that stress individual parts of the overall process and combine them into three benchmark functions that provoke the minimal and maximal interrupt latency, and the maximal interrupt throughput.


Oliver Horst, Johannes Wiesböck
fortiss GmbH, Research Institute of the Free State of Bavaria
Munich, Germany
{horst,wiesboeck}@fortiss.org

Raphael Wild, Uwe Baumgarten
Technical University of Munich, Department of Informatics
Garching, Germany
{raphael.wild,baumgaru}@tum.de


DATA AVAILABILITY STATEMENT

A snapshot of the exact version of the prototyping platform toki [12] that was used to conduct the presented measurements is available on Zenodo [13]. The snapshot also contains the captured, raw STM trace data and scripts to produce the presented figures. The latest version of toki can be obtained from [10].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

CPS-IoTBench’20, Sept. 25, 2020, London, UK. © 2020 Copyright held by the owner/author(s).

Cyber-physical systems (CPSs) are characterized by the fact that a computer system works together with a physical environment, or rather controls it. A specific characteristic of such control systems is their need to provide short and predictable reaction times to events in the physical world, to guarantee a good control quality [14]. Both properties are likewise essential for modern systems, such as tele-operated driving [27], and classic systems, such as the control of internal combustion engines [11].

An important aspect of the achievable reaction time is the interrupt handling performance in both dimensions: the interrupt handling latency and the throughput capabilities of a system. In particular, the effect of the utilized software stack has not yet been comprehensively assessed. Such a systematic evaluation would, however, facilitate the development and selection of CPS software stacks for particularly latency-sensitive or throughput-hungry use-cases.

Previous studies in this field mainly conducted measurements with the help of external measurement devices [16, 19, 22, 26], which requires an in-depth understanding of the hardware to obtain precise measurements [22]. This expert knowledge, however, is reserved for the system on chip (SoC) and processor intellectual property (IP) vendors. Hence, we see the need for a measurement method that makes it possible to accurately measure and properly stress the interrupt handling process of today’s SoCs without expert knowledge. Accordingly, we present a flexible interrupt performance measurement method that can be applied to ARMv8-A IP-core based platforms that provide a physical trace-port. As we see an increasing market share of ARM based systems [20] and their wide adoption in automotive CPSs [9, 25, 28], we strongly believe that our method helps in analyzing a multitude of relevant systems.

We specify three benchmark functions based on the assessment of ten combinations out of seven distinctive test-cases, whereby each test-case was chosen to stress a dedicated part of the ARM interrupt handling process. The effectiveness of the test-cases and benchmark functions is demonstrated on a Xilinx ZCU102 evaluation board [30] with two different software stacks.

In summary, we contribute (i) a precise method to measure the interrupt performance of complex ARM based SoCs without expert knowledge, and (ii) a set of benchmark functions that provokes the best and worst interrupt latency and maximal throughput on a given ARMv8-A hardware and software combination.

The rest of this paper is structured as follows: Section 2 describes the interrupt handling process on ARMv8-A platforms with a Generic Interrupt Controller (GIC) version 2, Section 3 presents the measurement setup and procedure of the envisioned evaluation method, Section 4 discusses the proposed test-cases and benchmarks along with the measurement results, Section 5 gives an overview of related work, and Section 6 concludes the paper.

Müller and Paul [21] define an interrupt as an event that causes a change in the execution flow of a program sequence other than a branch instruction. Its handling process starts with the activation through a stimulus and ends with the completion of the interrupt service routine (ISR), which is called in consequence and processes the stimulus. Before the ISR is executed, several steps are carried out in hardware to cope, for example, with simultaneously arriving interrupt requests (IRQs) and the masking of certain requests. In the following, we explain this process for ARMv8-A platforms, as specified in the GIC architecture specification version 2 [3]. In Section 4 this information is used to design suitable test-cases.

[Figure 1: flowchart; recoverable labels: Set ID=0, Set HPPI=1023, Set ID=ID+1, Event Signal, CPU, Vector Table]

Figure 1: Selection and signaling process of the Generic Interrupt Controller (GIC) for handling triggered interrupt requests (IRQs), executed repeatedly per CPU interface.

The GIC architecture specification differentiates among four types of interrupts: peripheral, software-generated, virtual, and maintenance interrupts. In the course of this paper we focus solely on measuring interrupts triggered by external stimuli, the peripheral interrupts. They can be configured as edge-triggered or level-sensitive. This means that the corresponding interrupt is recognized either once on a rising edge in the input event signal, or continuously as long as the signal remains asserted at a certain level.

The GIC supervises the overall interrupt routing and management process up to the point that the ISR is called. Figure 1 shows the GIC architecture and the signaling path for the selected measurement hardware, the Xilinx UltraScale+ MPSoC (ZynqMP). The GIC manages all incoming event signals of the system and consists of a central Distributor and a CPU interface per processor core. The Distributor manages the trigger type of each interrupt, organizes their prioritization, and forwards requests to the responsible CPU interface. The CPU interfaces perform the priority masking and preemption handling for their associated core.

The timeline of the various steps in the interrupt handling process is illustrated in Fig. 2. The handling process of a certain IRQ begins with the arrival of an event signal at the Distributor (step 1). In case the signal matches the configured trigger type, the actual handling process is triggered through the recognition of a new IRQ (step 2), which eventually leads to the execution of the ISR (step 9).

After the IRQ has been recognized (step 2), the Distributor may select and forward it to the responsible CPU interface according to the process depicted in the upper half of Fig. 1. This selection process (step 3) is executed repeatedly and potentially in parallel for each CPU interface. When the next highest priority pending interrupt (HPPI) is identified, the Distributor forwards the request to the currently inspected CPU interface (step 4). The CPU interface filters incoming requests according to its configuration and the process shown in the lower half of Fig. 1. As a result, the CPU interface may signal a pending IRQ to its associated core (step 5).

[Figure 2: timeline over t with the steps Event Signal, IRQ Recognized, Select IRQ, Forward IRQ, Signal IRQ, Acknowledge IRQ, Interrupt Active, Jump to Vector Table, and ISR; the interval Δt_latency is marked]

Figure 2: Overall steps of the interrupt handling process on ARMv8 platforms with a Generic Interrupt Controller (GIC) version 2 and an ARMv8-A architecture profile core.
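The selection and signaling steps (steps 3 to 5) can be illustrated with a small Python model. This is a minimal sketch under stated assumptions, not the GIC's actual logic: the names `Irq`, `select_hppi`, and `cpu_interface_signals` are hypothetical, ties between equal priorities are resolved in favor of the lowest ID, and priority grouping via the binary point is ignored. What it takes from the GICv2 architecture is that a lower priority value means a higher priority and that ID 1023 stands for "no qualifying interrupt".

```python
from dataclasses import dataclass
from typing import Iterable, Optional

SPURIOUS_ID = 1023  # initial HPPI value in Fig. 1; reported when nothing qualifies


@dataclass(frozen=True)
class Irq:
    """Hypothetical per-interrupt state as seen by the Distributor."""
    irq_id: int
    priority: int        # lower value = higher priority
    enabled: bool = True
    pending: bool = True


def select_hppi(irqs: Iterable[Irq]) -> int:
    """Walk all IDs in ascending order and return the ID of the highest
    priority pending interrupt, or SPURIOUS_ID if none is pending."""
    hppi = SPURIOUS_ID
    best_priority: Optional[int] = None
    for irq in sorted(irqs, key=lambda entry: entry.irq_id):
        if not (irq.enabled and irq.pending):
            continue
        # Strict comparison: on equal priority the lower ID wins (assumption).
        if best_priority is None or irq.priority < best_priority:
            best_priority = irq.priority
            hppi = irq.irq_id
    return hppi


def cpu_interface_signals(priority: int, priority_mask: int,
                          running_priority: Optional[int] = None) -> bool:
    """Simplified CPU interface filter: signal the core only if the request
    passes the priority mask and would preempt the running interrupt."""
    if priority >= priority_mask:
        return False
    return running_priority is None or priority < running_priority
```

In this model, every additional enabled or pending interrupt adds one more comparison to the selection loop, which is the intuition behind stressing the GIC with many enabled interrupts.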

Subsequent to the signaling of a new IRQ, on an ARMv8-A architecture [6], the core acknowledges the signaled IRQ by reading its ID (step 6), which marks the IRQ as active within the GIC (step 7). Meanwhile, the processing core jumps to the vector table entry indicated by the ID of the current IRQ (step 8), which leads to the execution of the ISR (step 9). Individual software stacks might add additional steps before the ISR finally starts.

Besides the regular interrupt requests described above, the GIC architecture version 2 (GICv2) also supports the prioritized and more secure fast interrupt requests (FIQs); however, these lie outside the focus of this paper. Furthermore, it shall be noted that the regular interrupt processing path was not affected by the latest updates to the GIC architecture (i.e., versions 3 and 4). Hence, we strongly believe that the presented benchmark functions would still be valid for the updated architectures.

With respect to the interrupt performance of a hardware/software combination, two properties are most relevant: the IRQ latency and throughput. Considering the process shown in Fig. 2, we define the IRQ latency as the time between a signaled event (step 1) and the call to the corresponding ISR (step 9). As throughput we understand the number of completed ISRs per time unit.

Our measurements utilize a Xilinx ZCU102 [30], with the toki prototyping platform [10, 12], and an ARM DSTREAM [5] hardware trace. We have chosen the ZynqMP, as it features an in-built programmable logic (PL) and versatile tracing capabilities. Figure 3 illustrates the chosen hardware setup. The ZynqMP is divided into the PL and the processing system (PS). The PS provides two processor clusters, the application processing unit (APU) with four ARM Cortex-A53 cores and the real-time processing unit (RPU) with two ARM Cortex-R5 cores. Both clusters have their own interrupt handling infrastructure, but we focus only on that of the APU.

The ARM CoreSight [4] tracing capabilities of the ZynqMP allow events, such as taken branches, external interrupts, and software events, to be recorded on a common timeline defined by system-wide timestamps. The system trace macrocell (STM) and embedded trace macrocells (ETMs) record the various events in hardware, without altering the behavior of the executed code, neither in a temporal nor in a semantic sense. Only software driven events require a register write operation and thus marginally influence the timing of the executed code. CoreSight is part of all newer ARM processor IPs and can be utilized as soon as the used hardware features a physical JTAG- and trace-port. The latter is typically available on evaluation boards used for prototyping professional applications.

[Figure 3: block diagram; recoverable labels: ARM DSTREAM, Trace Port, Xilinx UltraScale+ MPSoC ZCU102, PL, PS, APU, PL2PS_INT, EMIO, GIC, Cortex-A53, Interrupt Generation (Reset, On-Count, Off-Count), STM Events, System Timestamp, ETM Events]

Figure 3: Chosen measurement setup, with four PL2PS interrupts generated by the programmable logic (PL) according to the configuration signaled by the processing system (PS) via its extended multiplexed I/O interface (EMIO). The generated interrupts, executed instructions, and global timestamps are recorded through the system trace macrocell (STM) and embedded trace macrocell (ETM). The captured trace is read via an ARM DSTREAM unit.

In addition to the in-built hardware features, we deploy a custom interrupt generation block into the PL. The block makes it possible to simultaneously stimulate APU interrupts, following a generation pattern defined by a logical-high and a logical-low phase, and to trace them in the STM. This could also be realized with an external field-programmable gate array (FPGA) or a signal generator, to support additional platforms. The pin multiplexing configuration of the target platform only has to ensure that triggered input lines are also connected to an STM hardware event channel.

Based on the measurement setup, we propose two measurement procedures, one per measurement type (i.e., latency and throughput), utilizing two different configurations of the interrupt generation block. Both procedures use APU core 0 to take measurements and the other cores as stimulation sources, where needed.

We conduct our measurements on two software stacks: (i) a bare-metal, and (ii) a FreeRTOS based system. Both software stacks are provided by the toki build- and test-platform [10, 12], and utilize a driver library provided by Xilinx [32]. This library already features an interrupt dispatching routine that multiplexes the processor exception associated with regular interrupts on the target.

The bare-metal stack is an unoptimized piece of code that features neither a scheduler nor a timer tick. It could clearly be optimized; however, it is unclear how much a fully optimized assembler version of the stack would impact the presented results.

Thanks to its Yocto [33] based build-system, toki can easily be extended to include Linux based software stacks and, with that, the presented test setup. Completely different software stacks and hardware platforms can be evaluated with the given setup when they provide (i) a libc function interface, and (ii) drivers for interacting with the caches, STM trace, and GIC on that platform.

Throughput.

In case of a throughput evaluation, we configure the interrupt generation block to continuously signal the PL2PS interrupts for 9.75 s and then wait for another 250 ms on a rotating basis. We call each repetition of this pattern a stimulation phase. Core 0 is configured to handle the signaled private peripheral interrupt (PPI) on a level-sensitive basis, and the corresponding ISR does nothing except emit an STM software event, i.e., execute an 8 bit store instruction. Hence, the time spent in each ISR, and with that the possible reduction of the maximum throughput, is negligible.

For each throughput measurement we capture a 120 s trace and evaluate the contained stimulation phases. The throughput $\mu(i)$ in each stimulation phase $i$ is obtained from the traced measurement samples by counting the ISR generated STM software events between the rising-edge STM hardware events of the $i$-th and $(i+1)$-th stimulation phase and dividing the count by the length of one logical-high phase. The set of throughput values considered for the evaluation in Section 4 is then given by $M = \{\mu(i)\}$, taken over all evaluated stimulation phases $i$.

Latency.

The latency evaluation is conducted with an alternating scheme of a 1 ms PL2PS interrupt trigger and a 4 ms pause. The interrupt generation block is configured accordingly. Again we refer to each repetition of this scheme as a stimulation phase. In contrast to the throughput measurement, however, core 0 is configured to handle the signaled PPI on a rising-edge basis. Thus, every stimulation phase provokes only one interrupt. The corresponding ISR is the same as for the throughput measurements.

The results for the latency measurements are obtained by evaluating 30 s trace captures. The interrupt latency $\Delta t_{\text{latency}}(i)$ induced by each stimulation phase $i$ is given by $\Delta t_{\text{latency}}(i) = B - A$, with $A$ representing the point in time where the interrupt was stimulated and $B$ the point where the corresponding ISR was started. Both points can be obtained from the captured trace. $A$ is given by the timestamp of the STM hardware event associated with the rising edge of the PL2PS interrupt signal. $B$, on the other hand, has to be determined and defined for every analyzed software stack individually. In the course of this paper we utilize the timestamp of an STM event generated within the interrupt handler of our benchmark application that runs on top of the evaluated software stacks. Similar to the throughput values, the set of latency values considered in Section 4 is given by $X = \{\Delta t_{\text{latency}}(i)\}$, taken over all evaluated stimulation phases $i$.

In our measurement setup, we configure the PL, trace port, and timestamp generation clock to oscillate at 250 MHz. Hence, two consecutive timestamp ticks lie 4 ns apart from each other. Since each sampled event in the ETM and STM is assigned a timestamp, our measurement precision corresponds exactly to the system timestamp resolution, i.e., 4 ns. This is an order of magnitude smaller than the interrupt latency measured in a previous study for the same hardware platform [29] and a quarter of the measured minimal interrupt latency of an ARM real-time core [22].

Even though state of the art oscilloscopes provide a sampling rate of up to 20 GSa/s [15], which corresponds to a measuring precision of 0.05 ns, the actual precision in case of interrupt latency measurements might be considerably lower. The reason for this is that the oscilloscope can only measure external signals of a processor. Thus, in-depth knowledge of the internal structure of the hardware and of the instructions executed during a measurement is required to utilize the full precision of the oscilloscope. This makes it less suited for the evaluation of different hardware platforms and software stacks. The CoreSight based measurement setup, on the other hand, supports a flexible placement of the measurement points within and outside of the processor and does not require any expert knowledge about the hardware or software.

Besides the measurement precision and flexibility, we also need to ensure that the presented measurement setup is sane and that triggered interrupts can actually be recognized by the processor. According to the ZynqMP technical reference manual [31, p. 312], a signal pulse that shall trigger a PL2PS interrupt needs to be at least 40 ns wide to guarantee that it is recognized as such. With logical-high phases of 1 ms and 9.75 s, the presented stimulation scenarios for the two measurement procedures ensure that all triggered interrupts can be recognized.

The disadvantage of the presented measurement approach, however, is that it is only applicable to ARM based platforms with a dedicated JTAG- and trace-port. Given ARM’s 40% share of the semiconductor market for IP designs [20] and the wide availability of suitable evaluation boards, we believe this is an acceptable drawback. An additional limitation is that valid measurements can only be obtained for the interrupt with the highest priority among the active ones, but this applies to any kind of measurement setup.
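Both evaluation procedures of this section reduce to simple arithmetic on trace timestamps. The following Python sketch shows one possible post-processing step. It is not taken from the scripts published with the paper: the function names, the representation of the trace as sorted lists of timestamp ticks, and the pairing rule (the first ISR event at or after each rising edge counts as B) are assumptions made for illustration.

```python
from bisect import bisect_left
from typing import List, Sequence

TIMESTAMP_CLOCK_HZ = 250_000_000  # timestamp clock stated in the text (250 MHz)


def tick_resolution_ns(clock_hz: int = TIMESTAMP_CLOCK_HZ) -> float:
    """Time between two consecutive timestamp ticks in nanoseconds."""
    return 1e9 / clock_hz


def latencies_ns(rising_edges: Sequence[int], isr_events: Sequence[int],
                 clock_hz: int = TIMESTAMP_CLOCK_HZ) -> List[float]:
    """Latency B - A per stimulation phase, in nanoseconds.

    A is the rising-edge hardware event, B the first ISR software event at
    or after A and before the next rising edge (assumed pairing rule).
    Both inputs are sorted timestamps in ticks; phases without an ISR
    event are skipped.
    """
    result = []
    for i, a in enumerate(rising_edges):
        end = rising_edges[i + 1] if i + 1 < len(rising_edges) else float("inf")
        j = bisect_left(isr_events, a)
        if j < len(isr_events) and isr_events[j] < end:
            result.append((isr_events[j] - a) * 1e9 / clock_hz)
    return result


def throughputs_hz(rising_edges: Sequence[int], isr_events: Sequence[int],
                   high_phase_s: float) -> List[float]:
    """ISR events between rising edge i and rising edge i+1, divided by the
    length of one logical-high phase in seconds."""
    result = []
    for start, end in zip(rising_edges, rising_edges[1:]):
        count = bisect_left(isr_events, end) - bisect_left(isr_events, start)
        result.append(count / high_phase_s)
    return result
```

At 250 MHz one tick corresponds to 4 ns, so an ISR event that is traced 58 ticks after its rising edge yields a latency of 232 ns. For the throughput procedure described above, `high_phase_s` would be 9.75.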

In order to create a benchmark for comparing the interrupt latency and throughput across platforms and software stacks, we have designed seven test-cases specifically tailored to stress the ARMv8-A interrupt handling process. To judge their suitability for an overall benchmark, we measure their performance with the two software stacks described in Section 3.2 on top of the ZynqMP discussed in Section 3.1. By comparing the impact of each test-case with respect to the baseline performance of the two systems, we compose three benchmarks out of the test-cases and show their suitability by applying them to the same system configurations.

Given the interrupt handling process in Section 2, we conclude that the time spent in the process can be influenced by the core, caches, memory, and GIC. We have designed seven test-cases that aim to reveal the influence of different configuration settings related to the aforementioned components on the temporal behavior of the interrupt handling process. However, we exclude the core from our considerations by only measuring interrupts with the highest priority and not computationally loading the measured core. The measurements for all test-cases follow the scheme presented in Section 3.2, unless indicated otherwise. Depending on the goal of each test-case, it is applied either only to latency measurements or to both latency and throughput measurements. The proposed test-cases and their targeted components are summarized in Table 1, and Figs. 4 and 5 present the results of our measurements. The presented results are based on 848–6000 measurement samples per latency measurement and 11–12 samples per throughput measurement. The remainder of this section elaborates on the intended influence of the listed test-cases on the interrupt handling process.

T1: Baseline.

T1 is intended to provide a reference point to compare the other test-cases to and rate their impact. Hence, T1 assesses the interrupt latency and throughput of a system in the most isolated way, with only one core and one interrupt enabled and caches disabled. To this end, T1 only enables the extended multiplexed I/O interface (EMIO) pin driven interrupt and routes it to core 0. As ISR, the default handler described in Section 3.2 is used. T1 is evaluated for its latency and throughput performance.

T2: Caches enabled.

T2 equals T1, with the exception that all operations are executed with enabled caches. This test is conducted for both latency and throughput measurements.

T3: Caches invalidated.

T3 is also based on T1, but the ISR additionally invalidates the data and instruction caches. Since this is not feasible in throughput measurements, as new interrupts would arrive independently of the cache invalidation process, we conduct only latency measurements with T3.

T4: Enabled interrupts.

T4 aims at stressing the GIC with the highest possible number of enabled interrupts, as the interrupt selection and signaling process suggests that the more interrupts are enabled or pending, the more checks have to be done. Hence, T4 enables the maximum number of interrupts supported by the ZynqMP, except those required for conducting the measurements. All interrupts are routed to and handled by core 0. The measured PL-to-PS interrupt is assigned the highest priority and all other interrupts the lowest priority. For all interrupts except the measured PL-to-PS interrupt, core 0 installs an empty ISR that returns immediately after clearing the IRQ in the GIC. The measured PL-to-PS interrupt uses the same ISR as T1.

As this test aims at stressing the GIC to reduce its performance, we only evaluate it with respect to the interrupt latency. To be able to identify trends, we evaluated this test-case with 1, 36, 72, 108, 144, and 180 stressing interrupts. However, due to the marginal differences between the results of the different T4 variants and space constraints, we only show the results of T4-180, i.e., T4 with 180 stressing interrupts, which provoked the highest latency.

T5: Order of priorities.

T5 utilizes the same setup as T4 and is also applied to latency measurements only. However, in contrast to T4, T5 only utilizes as many interrupts as there are priorities, i.e., 15. The measured interrupt remains at priority 0 and the priorities of the other 14 are assigned in an ascending order (i.e., 14 to 1). This design intends to provoke a maximal number of HPPI updates.

T6: Parallel interrupt handling.

To test the influence of interrupts handled in parallel on the interrupt handling process, T6 enables up to 4 cores and configures all of them to handle the EMIO pin 0 interrupt. The interrupt is configured as level-sensitive with the highest priority. The PL ensures that this interrupt is signaled continuously and simultaneously as soon as the test is enabled. The ISRs on all cores generate an STM event, and these events are evaluated for throughput measurements. In case of latency measurements, however, only those STM events produced by core 0 are considered.

We evaluated T6 with 2, 3, and 4 enabled cores. The results showed a clear trend: the more cores are enabled, the higher the observed latency and the lower the achieved throughput. Due to space constraints we thus only show the results for T6-4, with 4 enabled cores, in case of the latency considerations and T6-2 in case of the throughput measurements.

Table 1: Properties of the evaluated test-cases and benchmarks used to compare the interrupt latency (L) and throughput (T).

| Description | Targeted component | Measurements | Enabled interrupts | Cache config | Enabled cores | Used in benchmarks |
|---|---|---|---|---|---|---|
| T1: Baseline | none | L, T | 1 | Disabled | 1 | |
| T2: Caches enabled | Cache | L, T | 1 | Enabled | 1 | L_min, T_max |
| T3: Caches invalidated | Cache | L | 1 | Invalidated | 1 | |
| T4: Enabled interrupts | GIC | L | 2–181 | Disabled | 2 | L_max |
| T5: Order of priorities | GIC | L | 15 | Disabled | 2 | |
| T6: Parallel interrupt handling | GIC | L, T | 1 | Disabled | 2, 3, 4 | L_max, T_max |
| T7: Random memory accesses | Memory | L | 1 | Disabled | 4 | |

Figure 4: Latency measured with T1–T7 (a) and with B-L_min (b, in ns) and B-L_max (c, in µs), each for the bare-metal and FreeRTOS stacks. Panel a) uses a symlog scale, panel b) uses a symlog scale with a linear threshold of 240 ns, and panel c) uses a linear scale.

Figure 5: Throughput (in MHz) measured with T1, T2, T6-2, and B-T_max for the bare-metal and FreeRTOS stacks. Panel a) compares the median of all measurements on a linear scale and panel b) illustrates the measured throughput ranges on a symlog scale, normalized to a 500 kHz range around the highlighted median.

T7: Random memory accesses.

As pointed out earlier, the shared memory and interconnecting buses of multi-core processors represent a major source of unforeseen delays. Accordingly, T7 is designed to delay memory accesses by overloading the interconnecting bus and memory interface. For this purpose, all 4 cores run random, concurrent memory accesses in the form of constants that are written to random locations in a 96 MB large array. In parallel, core 0 executes the standard latency test. Throughput evaluations are not considered with this test-case, as it aims to delay the interrupt handling process.
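The kind of load that T7 places on the memory system can be illustrated with a short Python model of the per-core stress loop. This is only a sketch of the access pattern, assuming byte-wise writes and a seeded pseudo-random generator; the function name `stress_random_writes` and the constant `0xA5` are hypothetical, and a real implementation would run natively on each core rather than in an interpreter.

```python
import random

# Array size from the text, assuming 1 MB = 2**20 bytes.
ARRAY_SIZE_BYTES = 96 * 1024 * 1024


def stress_random_writes(buffer: bytearray, n_writes: int,
                         constant: int = 0xA5, seed: int = 0) -> None:
    """Write `constant` to `n_writes` randomly chosen byte offsets of `buffer`.

    Random offsets defeat spatial locality, so on real hardware most writes
    would travel over the interconnect to memory instead of hitting a cache.
    """
    rng = random.Random(seed)
    size = len(buffer)
    for _ in range(n_writes):
        buffer[rng.randrange(size)] = constant
```

In the setup described above, one such loop would run on every core over a shared `bytearray(ARRAY_SIZE_BYTES)` while core 0 additionally handles the measured interrupt.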

Analyzing the measured interrupt performance under the different test-cases, shown in Figs. 4 and 5, we conclude first of all that different setups and software stacks indeed considerably influence the interrupt handling performance. All three targeted components provoke a considerable effect on the interrupt latency and throughput. Particularly noticeable are the differences between the test-cases with enabled (T2, T3) and disabled caches (T1, T4–T7), for both the observed latency and throughput, as well as the effects of stressing the GIC on the measured latency (T4–T6).

Of special interest is that the FreeRTOS based stack achieved a smaller minimum latency and a narrower variation range of the latency and throughput, compared to the bare-metal stack. Examples are T2 and T6-4 for latency measurements, and T6-2 for throughput measurements. After measuring and reviewing the tests for each critical test-case multiple times without finding any anomalies, we assume that some low-level hardware effects, for instance in the pipeline or shared interconnects, might cause the observed behavior. Further insight into the situation could be gained by (i) implementing a fully optimized, assembly-only bare-metal stack, or (ii) analyzing the actual hardware effects with a cycle-accurate simulation in ARM’s Cycle Model Studio [7]. However, both approaches are out of the scope of this paper.

T2 produces by far the shortest interrupt latency, 232 ns on average, with only a few outliers. Hence, we propose to utilize T2 as benchmark for the minimal achievable latency (B-L_min).

To obtain a suitable benchmark for the maximal latency, we analyzed all combinations of the test-cases T4-36, T4-144, T6-3, and T7. Except for the combination of T6 and T7, all tested combinations showed a similar performance with only slight differences. A further exception is the interrupt latency of the combination of T4-144 and T6 on FreeRTOS, which is considerably more confined than all other observed ranges. The highest latency is achieved with the combination of T4-36 and T6; however, the combination of T4-36, T6, and T7 is close. Accordingly, we propose to use the combination of T4-36 and T6 to benchmark the achievable maximal interrupt latency (B-L_max).

For the maximal throughput benchmark (B-T_max) we evaluated all four variants of the T6 test-case with enabled caches (T2). Interestingly, the enabled caches seem to mitigate the effect of more enabled cores, as all combinations showed a similar throughput. However, the combination of T6-2 and T2 still performed best. Even though the maximal achieved throughput of the combined test-cases lags a little behind that of T2 alone in case of the bare-metal software stack, it outperforms T2 by far in case of the FreeRTOS based stack. Hence, we propose the combination of T6-2 and T2 to benchmark the maximal throughput of a system.

In principle, there exist two patented interrupt latency measurement approaches that are used in the literature: first, measurements based on an external measurement device, such as an oscilloscope [16], and second, measurements based on storing timestamps when an interrupt is asserted and when the interrupt handling routine is completed [17], as we do in our measurements.

Liu et al. [18] measured the interrupt latency of five Linux variations on an Intel PXA270 processor, which features an ARM instruction set. They used a counter based measurement method and focused on the effect of different computational loads. Since their stimulation is limited to a single periodic interrupt, we argue that their approach is not able to stress the interrupt distribution process and that they analyzed the responsiveness of the scheduler to aperiodic events rather than the deviation of the interrupt latency.

The vast majority of studies, however, focused on interrupt performance measurements with external measurement devices [19, 22, 26], or combined them with the timestamp approach [24]. Macauley [19] compared different 80x86 processors with each other, and NXP Semiconductors [22] determined an exact latency for their i.MX RT1050 processor. All other aforementioned studies focused on comparing different software stacks with respect to various computational loads. None of the mentioned studies analyzed the throughput or stressed the interrupt distribution process.

Aichouch et al. [2] claim to have measured the event latency of LITMUS^RT vs. a KVM/Qemu virtualized environment on an Intel based computer. However, it remains unclear how they performed the measurements and where they obtained the timing information from.

Previous studies of the achievable interrupt throughput focused on the analysis of the achievable network packet transmission/reception or storage input/output operations per second when considering different interrupt coalescing and balancing strategies [1, 8, 23], but do not analyze the interrupt throughput in isolation with respect to different software stacks.

We presented a flexible evaluation method based on the ARM CoreSight technology [4], which enables the assessment of various software stacks on top of commodity ARMv8-A platforms with respect to their interrupt handling performance. Utilizing the evaluation method, we crafted seven specifically tailored test-cases that were shown to stress the ARM interrupt handling process. Out of these test-cases we deduced three benchmark functions, tailored to provoke the minimal (B-L_min) and maximal (B-L_max) interrupt latency, and the maximal throughput (B-T_max), of a given software stack. We validated the test-cases and benchmark functions by comparing two software stacks (a simple bare-metal and a FreeRTOS based environment) and measuring them on top of a Xilinx ZCU102 [30].

Our measurements showed that different software stacks do have a considerable impact on the interrupt handling performance of a hardware platform. Hence, we hope to draw some attention to the importance of a good software design for CPSs with respect to interrupt processing, and to the need for a more profound analysis of how interrupt handling processes can be made more predictable with respect to the achievable latency and throughput.

ACKNOWLEDGMENTS

The presented results build on top of the Bachelor’s thesis by Wild [29] and were partially funded by the German Federal Ministry of Economics and Technology (BMWi) under grant n° 01MD16002C and the European Union (EU) under RIA grant n° 825050.

REFERENCES

[1] Irfan Ahmad, Ajay Gulati, and Ali Mashtizadeh. 2011. vIC: Interrupt Coalescing for Virtual Machine Storage Device IO. In USENIX Annual Technical Conference (ATC).
[2] Mehdi Aichouch, Jean-Christophe Prevotet, and Fabienne Nouvel. 2013. Evaluation of the overheads and latencies of a virtualized RTOS. In Proceedings of the 8th IEEE International Symposium on Industrial Embedded Systems (SIES). IEEE. https://doi.org/10.1109/SIES.2013.6601475
[3] ARM Ltd. 2013. ARM Generic Interrupt Controller – Architecture ver. 2.0. https://documentation-service.arm.com/static/5f1065e70daa596235e7dea6. (IHI0048B.b).
[4] ARM Ltd. 2013. CoreSight Technical Introduction. White paper ARM-EPM-039795.
[5] ARM Ltd. 2015. ARM DS-5 Version 5 ARM DSTREAM User Guide. https://documentation-service.arm.com/static/5ea3136d9931941038dec0c3. (DUI 0481K).
[6] ARM Ltd. 2020. ARM Architecture Reference Manual – ARMv8, for ARMv8-A architecture profile. https://documentation-service.arm.com/static/5f20515cbb903e39c84dc459. (DDI 0487F.c).
[7] ARM Ltd. 2020. Cycle Model Studio, ver. 11.2 – User Manual. https://static.docs.arm.com/101108/1102/Cycle_Model_Studio_Manual.pdf. (ID011620).

Quantifying the Latency and Possible Throughput of External Interrupts on Cyber-Physical Systems. CPS-IoTBench’20, Sept. 25, 2020, London, UK

[8] Luwei Cheng and Cho-Li Wang. 2012. vBalance: Using Interrupt Load Balance to Improve I/O Performance for SMP Virtual Machines. In Proceedings of the 3rd ACM Symposium on Cloud Computing (SoCC). ACM Press. https://doi.org/10.1145/2391229.2391231
[9] Daimler AG. 2018. At a glance: The key data on MBUX. https://media.daimler.com/marsMediaSite/ko/en/32705799. (2020-07-10).
[10] fortiss GmbH – Research Institute of the Free State of Bavaria. 2020. toki Development Site. https://git.fortiss.org/toki.
[11] Patrick Frey. 2010. Case study: engine control application. Technical Report. Open Access Repositorium der Universität Ulm. https://doi.org/10.18725/OPARU-1755
[12] Oliver Horst and Uwe Baumgarten. 2019. toki: A Build- and Test-Platform for Prototyping and Evaluating Operating System Concepts in Real-Time Environments. In Proceedings of the Open Demo Session of Real-Time Systems (RTSS@Work) held in conjunction with the 40th IEEE Real-Time Systems Symposium (RTSS).
[13] Oliver Horst, Johannes Wiesböck, Raphael Wild, and Uwe Baumgarten. 2020. Quantifying the Latency and Possible Throughput of External Interrupts on Cyber-Physical Systems: Measurement Software and Measured Data. Available on Zenodo. https://doi.org/10.5281/zenodo.3968487
[14] E. Douglas Jensen. 2008. Wrong Assumptions and Neglected Areas in Real-Time Systems. In Proceedings of the 11th IEEE International Symposium on Object and Component-Oriented Real-Time Distributed Computing (ISORC)

IEEE Trans. Comput. 60, 7 (July 2011), 978–991. https://doi.org/10.1109/TC.2010.119
[19] Martin W. S. Macauley. 1998. Interrupt latency in systems based on Intel 80×86 processors. Microprocessors and Microsystems

The Complexity of Simple Computer Architectures. Springer-Verlag, Chapter 8, 141–178. https://doi.org/10.1007/3-540-60580-0
[22] NXP Semiconductors. 2018. Measuring Interrupt Latency. Application Note AN12078.
[23] Ravi Prasad, Manish Jain, and Constantinos Dovrolis. 2004. Effects of Interrupt Coalescence on Network Measurements. In Lecture Notes in Computer Science. Springer-Verlag, 247–256. https://doi.org/10.1007/978-3-540-24668-8_25
[24] Paul Regnier, George Lima, and Luciano Barreto. 2008. Evaluation of interrupt handling timeliness in real-time Linux operating systems. ACM SIGOPS Operating Systems Review

Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2012, Ryszard S. Romaniuk (Ed.). SPIE. https://doi.org/10.1117/12.2000230
[27] Tito Tang, Frederic Chucholowski, and Markus Lienkamp. 2014. Teleoperated driving basics and system design. ATZ worldwide