On The Performance of ARM TrustZone

Abstract

The TrustZone technology, available in the vast majority of recent ARM processors, allows the execution of code inside a so-called secure world. It effectively provides hardware-isolated areas of the processor for sensitive data and code, i.e., a trusted execution environment (TEE). The OP-TEE framework provides a collection of toolchain, open-source libraries and secure kernel specifically geared to develop applications for TrustZone. This paper presents an in-depth performance- and energy-wise study of TrustZone using the OP-TEE framework, including secure storage and the cost of switching between secure and unsecure worlds, using emulated and hardware measurements.


On The Performance of ARM TrustZone* (Practical Experience Report)

Julien Amacher and Valerio Schiavoni

Université de Neuchâtel, Switzerland

Keywords: Trusted Execution Environment · ARM · TrustZone · benchmarks

Internet of Things (IoT) devices are expected to offer the pervasive computing that was promised at its advent [47]. The economic impact of the IoT ecosystem has created many new business opportunities and is expected to continue growing rapidly. As a result, the number of devices owned per user is anticipated to increase up to 26 by 2020 [44]. ARM expects 275bn active devices by 2025, an 11× increase over 2019 [6], while having already sold 100bn processors. For instance, Figure 1 reports the sales of ARM processors in the last 20 years.

Fig. 1: Sales and popularity of ARM processors in the last 20 years [5,4]. (Plot of units sold [bn] per year, annotated with the ARM architecture versions that feature TrustZone.)

These IoT devices gather, distribute and process information on their own, effectively pushing intelligence to edge devices. Due to their nature, these devices are mostly nomadic: easy to relocate, designed as wearables, embedded in vehicles or left in remote locations. As such, assets need to be protected from attackers, in particular those easily subject to physical tampering. Hence, ensuring that confidential data is processed in a secure manner, even in hostile environments, remains a challenging prerequisite for such devices. Indeed, an attacker with physical access can relatively easily inspect and modify the execution workflow of any program. Nowadays, even more disturbing attacks not requiring physical access are surfacing [51], reinforcing the need to exploit hardware-based security mechanisms when available. Hardware-based protections offer an additional security layer, by physically separating the processing of secure and non-secure data components. These can be dedicated processing chips (hardware security modules, HSM), or regular chips to which security extensions were added. Examples of the latter include Intel's Software Guard Extensions (i.e., SGX [21]) since the Skylake architecture (2015), or ARM's TrustZone [7] since ARMv6 (2008).

ARM devices are often battery-powered and must therefore make optimal use of their limited energy capacity. This is especially true nowadays, when battery capacity is becoming the limiting factor when deploying new functionalities. Despite the availability of such devices on the market, to the best of our knowledge there is no public study on the performance and energy consumption of these security extensions.

The contributions of this work are as follows. First, we provide the first public experimental analysis of the performance and energy requirements of the TrustZone security extensions based on hands-on metrics. Second, we report on the advantages and limitations of OP-TEE [26], an open-source framework that supports TrustZone. Third, we provide a methodology to extend the kernel of OP-TEE in order to offer new syscalls inside TrustZone. We leverage this methodology to implement two additional syscalls, e.g., to fetch thermal metrics and for secure time measurements in TrustZone. Finally, we report on our in-depth experimental analysis along several dimensions (including energy) of the current secure processing capabilities offered by some widely popular IoT devices (i.e., Raspberry Pi) shipping TrustZone processors. Our results are put into perspective by comparing them against an emulated environment aware of the TrustZone extensions.

The paper is organized as follows. §2 describes the TrustZone architecture and key concepts of world isolation. §3 explains how the kernel was extended to expose new syscalls within TrustZone, how all the data was gathered, as well as the hardware and software tools that were developed. §4 presents our in-depth evaluation using real hardware and under emulation, for several hardware components (e.g., CPU, memory, secure storage) and metrics (e.g., performance, energy and power consumption). We discuss some lessons learned in §5, before concluding in §6.

* This is a post-peer-review, pre-copyedit version of an article published in "Distributed Applications and Interoperable Systems" (DAIS), 2019. The final authenticated version is available online at https://doi.org/10.1007/978-3-030-22496-7_9

Fig. 2: TrustZone components and interaction workflow. (Diagram elements: on the REE (non-secure) side, the rich application, the GlobalPlatform TEE Client API, the tee-supplicant, the OP-TEE client, the rich kernel and the OP-TEE driver; on the TEE (secure) side, the trusted application (TA), the GlobalPlatform TEE Internal API, and the secure kernel (OP-TEE OS) with the secure monitor, the TEE core and internal TEE utility functions; storage and shared memory; user mode and kernel mode on both sides; interactions numbered ➊ to ➓.)

This section provides some background on TrustZone. First, we define a few terms used throughout this paper. §2.1 describes TrustZone's main mechanisms and limitations, while §2.2 introduces OP-TEE.

Rich Execution Environment.

The REE (or normal world) is the regular, non-secure operating system of a device. The memory, registers, and caches are not isolated or protected by any hardware mechanism. Typically, the REE is not focused on security and is difficult to review for security vulnerabilities, due to its large size and complexity.

Trusted Execution Environments.

Also called TEE or secure OS, it is the secure-world operating system that is part of the TrustZone specifications. It complies with GlobalPlatform's TEE System Architecture specifications [57], which define a set of operations offered to secure applications. These include interactions with persistent (secure) storage [57, Chapter 5], memory [57, Chapter 4.11], and cryptographic operations [57, Chapter 6]. As such, a secure application can easily be ported to another platform, due to the standardized nature of the available services. Similar to what a non-secure operating system offers to its running applications, the TEE offers access to special services only available to secure applications (such as the secure storage feature, which we evaluate). This environment has a small footprint, contrary to a full-fledged operating system, and only implements the minimal set of features required to operate. Its small size makes it simpler to review for security vulnerabilities, which matters because any vulnerability could potentially compromise all secure applications.

Trusted Application.

A trusted application (TA), also called secure application, is designed to run exclusively inside the secure world. It uses services provided by the TEE kernel to access resources, specifically: (1) disk, exclusively via the secure storage subsystem, (2) TCP/IP sockets, (3) memory allocation, (4) other custom services. Trusted applications provide services to either standard userland programs or other TAs. OP-TEE expects TAs to be written in C.

2.1 TrustZone in a nutshell

This section describes the main components of the TrustZone architecture, also depicted in Figure 2 alongside their interfaces.


Overview.

TrustZone is a hardware feature implemented in recent ARM processors. It enables physical separation of different execution environments, namely the TEE and the REE. Its working principle is very similar to a hypervisor, the main difference being that no emulation is performed and that all isolation is offered at the hardware level. Both secure (TEE) and normal (REE) worlds share the underlying physical processor. The secure world has unrestricted access to memory regions, hardware and devices. This is realized by using an additional addressing line, the NS (Non Secure) bit. Hardware checks performed by the TZASC (TrustZone Address Space Controller) [42,50] determine whether an access is authorized based on this NS-bit.

Memory.

Parts of the memory can be isolated for exclusive use by the secure world by means of special hardware support. The memory management unit (MMU) is secure-world aware, and secure and non-secure descriptors are stored alongside each other. The differentiation is done by the Non-secure TLB ID (NSTID) [12], an extra bit of the TLB. The secure applications (TAs) must fit in the on-chip memory. Due to the high cost of secure memory, it is usually limited in size, in the order of 3-5MB. Hence, TAs are expected to have small memory footprints and only contain the minimal subset of features required. This also reduces the attack surface exposed by TAs.

Interrupts.

The Fast Interrupt (FIQ) secure interrupt mode is used exclusively by devices residing in a memory region allocated to the secure world. As such, regular interrupts (IRQ), which are of lower priority, cannot be used to prevent the secure world from executing, in particular if a physical secure clock (i.e., RTC) is used. Secure clocks are crucial to ensure a TA is safely executed: an external clock is a common attack vector and can be easily tampered with [53]. The latest ARM processors include secure clocks.

World Switching.

Switching between worlds requires the state of the processor to be saved and then restored, respectively when entering and exiting a new world. Processor registers are saved by the monitor when entering, and restored when leaving the secure world. The NS-bit is changed accordingly. Normal world applications use TrustZone indirectly, by invoking functionalities implemented in a dedicated TA. When in the PL-1 [43,1] privilege level, a special hardware instruction, Secure Monitor Call (SMC), allows switching between worlds. Recent Cortex-A processors [48] support SMC calls by the kernel in the normal world. Entry to a different world (from secure to unsecure and vice versa) is done on a per-core basis, thus limiting the parallel execution of TAs to the number of available cores. To enter the secure world, a kernel thread executes the monitor, which in turn issues the SMC instruction to the CPU [8,29]. Calls to SMC by a processor not in kernel mode trigger an undefined exception trap. TAs can be called from userland programs residing in the REE or from other TAs. The latter is particularly useful to reduce code duplication and to keep the TA's attack surface minimal. Data is passed back and forth between worlds by memory pointers or direct copies.

Secure storage.

TrustZone supports persistent data storage for TAs using secure storage. Objects are stored encrypted on disk, and are signed as an anti-tampering countermeasure. TAs access the files in cleartext: the TEE layer runs the cryptographic stack transparently. These files have a unique numeric name based on a counter. An encrypted index of files is maintained alongside the files. Operations on the index are atomic, and a hash tree data structure guards the index to protect its integrity. To protect against storage replay attacks, an eMMC storage device (embedded MultiMediaCard, a type of non-volatile, non-removable solid-state storage device [22]) is required. This security feature is entirely implemented in the eMMC storage in the form of a Replay Protected Memory Block (RPMB) [55].

Table 1: Existing frameworks for TEE-based applications.

Framework                License               Technology
OP-TEE [26]              BSD                   TrustZone
Trustonic TEE [38]       Commercial            TrustZone
Open TEE [52]            Apache License 2.0    TrustZone
OpenEnclaves [23]        MIT                   SGX1 & TrustZone
TLK [54]                 BSD                   NVIDIA Tegra
Android Trusty TEE [2]   Apache License 2.0    TrustZone: emulated under Intel's VT

Key Management.

The key manager starts with a device-specific key, the Secure Storage Key (SSK). It is derived from two pieces of information unique to each device's processor: the chip identifier and the hardware key. The TA Storage Key (TSK) is a per-TA key, derived from the SSK and the TA's UUID identifier. The File Encryption Key (FEK) is a per-file key generated upon file creation. It is used to protect the file contents, including its metadata, and is encrypted using the TSK.
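The key hierarchy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the use of HMAC-SHA256 as the derivation function, the function names, and the key lengths are our choices for the sketch and are not taken from the text, which does not name the underlying primitives. Wrapping the FEK with the TSK is left out because it would require a block cipher outside the standard library.

```python
import hashlib
import hmac
import os
import uuid


def derive_ssk(hardware_key: bytes, chip_id: bytes) -> bytes:
    """Device-wide Secure Storage Key, derived from the two per-device values.

    HMAC-SHA256 is an assumption made for this sketch.
    """
    return hmac.new(hardware_key, chip_id, hashlib.sha256).digest()


def derive_tsk(ssk: bytes, ta_uuid: uuid.UUID) -> bytes:
    """Per-TA Storage Key, derived from the SSK and the TA's UUID."""
    return hmac.new(ssk, ta_uuid.bytes, hashlib.sha256).digest()


def new_fek() -> bytes:
    """Per-file key, generated randomly when a file is created."""
    return os.urandom(32)


# Hypothetical device values, for illustration only.
ssk = derive_ssk(hardware_key=b"\x01" * 16, chip_id=b"chip-0001")
tsk_a = derive_tsk(ssk, uuid.UUID("11111111-2222-3333-4444-555555555555"))
tsk_b = derive_tsk(ssk, uuid.UUID("66666666-7777-8888-9999-aaaaaaaaaaaa"))
fek = new_fek()
```

In this sketch, two TAs on the same device obtain different TSKs, so one TA cannot unwrap the FEKs of files that belong to another.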

Resilience to attacks.

It is of paramount importance to ensure that only trustworthy applications are deployed to the secure world. Vulnerabilities in any TA, the TEE or a compromised secure kernel compromise the security of the secure world. Protection against buffer overflow attacks in the secure world is currently only provided by basic stack canaries [31]. Future support for ASLR (Address Space Layout Randomization) will improve resilience against those attacks. Finally, there exist mitigations against the Meltdown and Spectre speculative execution attacks [15,13,14,16]. Covert data channels [45] can also be used when required.

2.2 OP-TEE Trusted OS

While there are a few options (Table 1) to develop applications for TEEs, we rely on OP-TEE, due to its fast development cycle and native support for TrustZone. OP-TEE is a security framework that includes several components: a minimal secure-world operating system (the OP-TEE OS [26]); the tee-supplicant [30], offering normal world services to the secure world; a complete build toolchain [24]; the testing tool [28] (OPTEE sanity testsuite); a secure privileged layer enabling world switching; a basic REE image; and several utility functions for developers to implement TAs. OP-TEE is flexible and can be deployed to any platform for which there exists a manifest that lists the dependencies required to build for that platform, as well as its hardware characteristics. Additionally, the Qemu open source emulator [33] makes it possible to deploy and evaluate OP-TEE in emulated mode on ubiquitous machines. The TEE interface implemented in OP-TEE is compliant with GlobalPlatform's specifications.

Details.

OP-TEE imposes a specific interface regarding TA interactions initiated from the REE. First, a request to load the desired TA is made by passing its UUID to TEEC_InitializeContext, which returns a context object. The UUID is defined at compile-time and must be unique amongst all TAs. Next, this context is passed to TEEC_OpenSession, which returns a session. This session is then used to invoke actual services in the TA using TEEC_InvokeCommand, which takes as parameters the service identifier as well as any optional parameters. A single session can be used to call TEEC_InvokeCommand any number of times. Sessions are finally closed using TEEC_CloseSession and, ultimately, the context is closed by calling TEEC_FinalizeContext. To support multiple sessions, the TA must be compiled with the TA_FLAG_MULTI_SESSION flag set. OP-TEE signs TAs with a private RSA key, but the toolchain does not allow a unique key per TA (all TAs are signed with the same device key). Upon TA loading, the OP-TEE core checks the integrity of the TA by verifying its signature based on its signed header. The framework includes a minimal OS that offers services to TAs, and leverages the tee-supplicant application to access resources residing in user land.

Fig. 3: Experimental setup and approach used to run our measurements. (Diagram elements: host computer running the official KM001 application and the monitoring program, with start/stop recording, CSV export and CSV parsing of markers and duration; KM001 unit measuring power consumption between the power supply and the Raspberry Pi; Raspberry Pi executing the benchmark applications; JTAG.)

This section describes the tools and techniques used to carry out our evaluation. We focus on four metrics: (1) execution time for various types of benchmarks (CPU-bound, volatile and non-volatile memory), (2) power consumption under different CPU governors, (3) energy consumption, and (4) thermal behaviour of the CPU.

Hardware Measurement Tools.

Energy and power measurements are carried out using a Power-Z KM001 unit [32], plugged in between the USB power supply and the Raspberry Pi device. The variant used in our testbed features two main USB ports (one to provide power and one from which power is drawn) of the current mainstream USB types (type A, micro and type C). In our configuration, type A is used for both input and output of power delivery. An additional (micro) USB port is used to fetch power consumption measurements. The KM001 unit supports different USB protocols, including USB PD (Power Delivery) 2.0 and Qualcomm QC (QuickCharge) from version 2.0 up to 4.0. This configuration allows the power used by the Raspberry Pi to be measured directly, as the losses of the power supply itself are not taken into account. We use this device to measure only power [W] and energy [Wh], for which it produces 1 record per second. Unfortunately, the software (Figure 3, left) provided by the unit manufacturer is a closed-source 32-bit Windows binary, and the protocol used to exchange messages over USB is undocumented. To overcome these limitations, we used the following approach. Specific markers (e.g., start recording and stop recording) are generated during the execution of benchmark applications, allowing for precise recording of areas of interest (Figure 4). These markers are monitored by a custom program (on a separate node) that pilots the Windows binary (Figure 5). The pilot sends automated messages to the binary instance using the Win32 API through P/Invoke (Platform Invocation Services) [11], issued by a monitoring program implemented in C#.

Fig. 4: Use of markers. (Diagram elements: the benchmark application emits a start marker, executes the operation of interest for a duration t, and emits a stop marker; the monitoring program starts and stops the KM001 recording, then exports and processes the CSV.)

Fig. 5: Microbenchmarking: workflow. (Diagram elements: benchmark application, REE kernel, TEE kernel, monitoring program and KM001; the syscall/RPC of interest is instrumented with t1 (start) and t2 (stop), the duration t = t2 - t1 is stored and later retrieved via getDuration; the monitoring program optionally starts and stops the recording, then exports and processes the CSV.)
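The marker-based approach can be illustrated with a short Python sketch that keeps only the power records falling between a start and a stop marker. The CSV layout (columns `time_s` and `power_w`), the sample values and the function name are hypothetical: the export format of the KM001 software is not described here, so this shows the idea rather than the actual tooling.

```python
import csv
import io


def records_between(csv_text, start_s, stop_s):
    """Return (time_s, power_w) records with start_s <= time_s <= stop_s.

    The column names are assumptions made for this sketch.
    """
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        t = float(row["time_s"])
        if start_s <= t <= stop_s:
            selected.append((t, float(row["power_w"])))
    return selected


# Hypothetical export with one record per second.
SAMPLE_CSV = "time_s,power_w\n0,0.78\n1,0.80\n2,3.05\n3,3.10\n4,0.79\n"

# Markers emitted by the benchmark: start at t=2s, stop at t=3s.
window = records_between(SAMPLE_CSV, start_s=2, stop_s=3)
```

With one record per second, short operations fall between two records, which is one reason to repeat an operation many times inside a single marked window.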

CPU Governors.

The Linux kernel supports several CPU governors [46], used to adjust the frequency of each core depending on its load and temperature. Several options exist: powersave and performance for minimum and maximum operating frequency; ondemand toggles between the previous two, and a more conservative mode that operates less aggressively; userspace, to manually set the CPU frequency; and schedutil, where the frequency is set by the scheduler. The core frequency is increased during the execution of stressful workloads and reduced afterwards, or when the maximum temperature is reached, in order to prevent overheating. This is different from hardware thermal throttling, which tries to prevent damage caused by excessive heat. The OP-TEE kernel uses the powersave governor by default. This reduces heat output by reducing the frequency of the core clocks, allowing passive cooling (even without a heatsink), but also negatively impacts performance. In a compute-intensive datacenter, one would typically use the performance governor. Instead, if energy constraints are important, the powersave mode is best suited. Our benchmarks consider both governors and compare them for REE and TEE executions.

Timing issues.

Initially, we planned on porting stress-ng [36] to run inside TrustZone. Unfortunately, this proved not to be straightforward, given its reliance on system calls not available inside the TEE kernel. As such, we decided to implement custom ad-hoc benchmark applications. Execution time is measured using either the gettimeofday(2) [18] or the clock_gettime(3) [10] syscall; the latter supports the following clocks:

1. CLOCK_REALTIME: the realtime clock of the system; it can be adjusted by NTP and thus can go forward and backwards.
2. CLOCK_MONOTONIC: a monotonic time since an unspecified starting point (usually system startup, as is the case with our setup).
3. CLOCK_PROCESS_CPUTIME_ID: per-process timer.
4. CLOCK_THREAD_CPUTIME_ID: thread-specific CPU-time clock.

For our experiments we exclusively use CLOCK_MONOTONIC. Our benchmarks include the instrumentation delay, i.e., the overhead introduced by the measurement itself. This is especially important from the TEE perspective (i.e., inside a TA), where one syscall can lead to a second one if the REE needs to be accessed (see the numbered interactions in Figure 2).

Kernel and OP-TEE modifications.

To access and store the monotonic time and temperature from within a TA using the secure kernel, and to retrieve it later on within the REE, we extended the kernel with four new system calls: TEE_GetCpuTemperature, sys_ktraceadd, sys_ktraceget and sys_ktracereset. To gather the temperature measurements, we used two methods: (1) software, via thermal APIs, and (2) an external hardware sensor. Originally, we planned on using a script to record the temperature at fixed intervals during the CPU stress tests executed by userland threads. However, since kernel threads executing the TAs have a higher priority, the userland threads were starved and thus did not produce enough data points. This is a typical scenario of normal world starvation occurring when TAs monopolize all cores. We overcome this problem by accessing the CPU temperature from inside the TA, and sending it periodically to the monitoring software for safekeeping. To use the temperature gathering syscall from within the TA, we additionally had to implement the corresponding TEE kernel syscall wrapper. An extensive walkthrough of this process is given in Appendix A.

This section presents our in-depth evaluation and performance analysis, the main contribution of this work. Energy results are always presented by systematically excluding idle energy consumption, i.e., we only show the energy cost of the given operation. Energy requirements are shown on a per-operation basis. To prevent thermal throttling, all tests run while the onboard chip is actively cooled.
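As a minimal sketch of how a per-operation energy figure that excludes idle consumption could be computed from power records sampled once per second, consider the following Python fragment. The function names and the sample numbers are illustrative assumptions; the sketch only encodes the arithmetic (subtract the idle power, integrate over time, convert joules to Wh, divide by the number of operations) and is not the processing code used for the measurements.

```python
def net_energy_wh(power_samples_w, idle_power_w, period_s=1.0):
    """Energy above the idle baseline, in Wh, for evenly spaced power samples."""
    joules = sum((p - idle_power_w) * period_s for p in power_samples_w)
    return joules / 3600.0  # 1 Wh = 3600 J


def energy_per_operation_wh(power_samples_w, idle_power_w, operations):
    """Net energy divided by the number of operations run in the window."""
    return net_energy_wh(power_samples_w, idle_power_w) / operations


# Hypothetical window: 10 s at 1.78 W with a 0.78 W idle baseline,
# during which 1000 operations were executed.
samples = [1.78] * 10
total_wh = net_energy_wh(samples, idle_power_w=0.78)
per_op_wh = energy_per_operation_wh(samples, idle_power_w=0.78, operations=1000)
```

Here the window contributes roughly 10 J above idle, i.e., about 2.8 mWh in total and about 2.8 µWh per operation.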

Evaluation Settings.

We use the Raspberry Pi 3B, a popular and representative single-board device, equipped with the Broadcom BCM2837 System-On-Chip (1GB of RAM, ARM Cortex A53 quad core running at 1.2GHz). For some of our measurements, we compared the hardware experiments against a modified version of the Qemu emulator provided by OP-TEE with support for TrustZone [34]. This mimics the scenario of an Infrastructure-as-a-Service provider offering access to ARM nodes (as virtual machines) to cloud tenants without having the corresponding hardware infrastructure, thus relying on TrustZone virtualization [49]. Qemu uses the Cortex A53 emulation profile on an Ubuntu host residing on a VMWare ESXi [40] machine equipped with an i7 6820HQ running at 2.7GHz. Note that the Raspberry Pi 3B lacks support for secure boot and hardware separation of memory and peripherals [27], hence these aspects of the TrustZone ecosystem could not be evaluated and are left for future work. Finally, we do not override the default secure storage key (SSK) provided by OP-TEE.

(Footnote, thermal API path: /sys/class/thermal/thermal_zone[0-9]+/temp)

Fig. 6: Idle (left) and burn (right) power consumption. (Box plots of power consumption [W] for rpi3b ondemand, rpi3b performance and rpi3b powersave.)

Table 2: Average power consumptions for idle and burn experiments (see Figure 6).

               Idle              Burn
Governor       W      BTU/h      W      BTU/h
ondemand       0.78   2.66       3.08   10.51
performance    0.86   2.93       3.32   11.33
powersave      0.78   2.66       1.65   5.63

Power consumption.

We start by measuring the idle and under-stress (burn) power consumption of our hardware unit. We evaluate how the three different CPU governors (ondemand, performance, and powersave) behave. The idle measurements use the standard REE kernel image provided by OP-TEE, without any user-intensive applications or TAs running. Burn measurements run the prime benchmark, a single-threaded TA which computes the first 20000 prime numbers before exiting. We run 8 instances in parallel, ensuring maximum heat output on the 4 cores. Measurements start 60 seconds after the benchmark instances are launched. Figure 6 shows our results, respectively for idle (left) and burn (right) experiments. Table 2 shows the average W and BTU/h. We use a box-and-whiskers plot: the first and third quartiles are shown as a colored box, the median as a horizontal black bar. Min/max values are also included. Idle results for the ondemand and powersave governors are on par (Table 2), in particular when the CPU frequency is set at 600MHz. As expected, we observe higher power consumption using the performance governor even in idle, as the cores are boosted up to 1.2GHz. Overall, the board's power consumption is very low, in particular below 1W in idle mode.

Load & unload TAs.

Next, we measure the time required to load and unload a TA inside TrustZone, respectively executing the TEEC_InitializeContext [56, Chapter 4.5.2] and TEEC_FinalizeContext [56, Chapter 4.5.3] functions. We compare results obtained with one TA smaller and another one larger than the 512kB L2 cache of the Broadcom BCM2837 processor, respectively 102kB and 517kB. Our experiments show no significant difference between TAs of different sizes.

For each configuration, Figure 7 shows the average and standard deviation over 10k executions. We include the time spent to execute an empty function inside the TA once it is loaded (1.31ms), to give a baseline for comparison. Surprisingly, our results do not show a significant difference on subsequent loadings compared to the first loading, even though the tee-supplicant is supposed to cache the TA code. We will investigate this aspect in future work.

Fig. 7: Basic TA operations: loading, unloading and successive calls to load/unload the same TA. (Panels: large TA (517kB) and small TA (102kB); execution time [ms] for empty function, TA load, first TA load, TA unload and first TA unload; series: Qemu on ESXi, rpi3b ondemand, rpi3b performance, rpi3b powersave.)

Fig. 8: World switching performance and energy requirements. (Panels: execution time [µs] for REE to TEE and TEE to REE, first call and following calls; instrumentation delay [µs], first call and following calls; energy [nWh] for REE and TEE; series: rpi3b ondemand, rpi3b performance, rpi3b powersave.)

Context (World) Switching.

Switching between worlds is a key operation when deploying applications that execute inside and outside TrustZone. To measure the switching time, we implemented an ad-hoc benchmark made of a host application and a TA. Both programs record the monotonic time when entering and exiting the world in which they reside. The host issues a call to an almost empty function, which only contains time-measuring code. Two calls are made to the TA per session, recording the time taken to switch between TEE and REE, and vice versa. Figure 8 (left) shows these results. To evaluate possible caching effects, we also include the results obtained for all the calls following the first one. As expected, it is more time-consuming to switch from the REE to the TEE (110µs with the performance-oriented governors) than the opposite (47µs). The instrumentation delay (Figure 8, center) is the difference between two consecutive calls to the time measurement function. An increased instrumentation delay is observed in the TEE compared to the REE, due to the additional world switch. Finally, we also evaluate the energy spent for calling an empty TA function from the REE (Figure 8, right). The timer starts and stops when leaving and re-entering the REE, respectively. The ondemand governor is the most energy-hungry (up to 12.1 nWh), while powersave is the most energy efficient.
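The notion of instrumentation delay, i.e., the difference between two consecutive calls to the time measurement function, can be sketched in Python on a regular Unix system. This sketch runs entirely in one world and therefore measures only the cost of reading CLOCK_MONOTONIC through the Python runtime; it does not reproduce a world switch, and the helper names are our own.

```python
import time


def now_ns():
    """Read the monotonic clock (Unix only), in nanoseconds."""
    return time.clock_gettime_ns(time.CLOCK_MONOTONIC)


def instrumentation_delay_ns(samples=1000):
    """Average gap between two back-to-back clock reads."""
    deltas = []
    for _ in range(samples):
        t1 = now_ns()
        t2 = now_ns()
        deltas.append(t2 - t1)
    return sum(deltas) / len(deltas)


def timed_ns(operation):
    """Duration of one call to `operation`, including the instrumentation delay."""
    start = now_ns()
    operation()
    return now_ns() - start


delay = instrumentation_delay_ns()
empty_call = timed_ns(lambda: None)
```

Because the reported durations include this delay, knowing its magnitude indicates how much of a very short measurement, such as an empty function call, is attributable to the measurement itself.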

Volatile Memory.

Next, we consider simple in-memory operations (e.g., read and write, sequential or at random), for two different sizes of volatile memory (1MB and 100KB) used by the REE and the TEE. We consider inter- (REE ← TEE) and intra-world (e.g., REE ↔ REE, TEE ↔ TEE) memory readings, as TrustZone restrictions prevent reading TEE memory from the REE. We compute the average and standard deviation over 100 runs, always using the high-resolution monotonic counter. Figure 9 shows our results, for the Raspberry Pi device with 3 CPU governors and using Qemu. Performance of accessing a single byte in TEE memory from the TEE is on par with accessing REE memory from the TEE, on average 0.01µs (around 2× that under emulation). Interestingly, using memory from within the TEE is also less energy-hungry (Figure 10), which is also confirmed by the cost of the single operations in the various configurations. We observe how the operations in the TEE ↔ TEE case are on average 2× faster on bare metal and 1 . × under emulation than in the other cases.

Fig. 9: Benchmark for memory ops. (Panels: REE from REE, REE from TEE and TEE from TEE; duration [ms] for random read, sequential read, random write and sequential write; series: Qemu on ESXi, rpi3b ondemand, rpi3b performance, rpi3b powersave.)

Fig. 10: Energy: memory accesses. (Panels: REE from REE, REE from TEE and TEE from TEE; energy [pWh] per memory access for random read, sequential read, random write and sequential write; series: rpi3b ondemand, rpi3b performance, rpi3b powersave.)

Secure Storage: performance.

We evaluate the performance of TrustZone's secure storage via the corresponding GlobalPlatform API implemented by OP-TEE. Specifically, we benchmark the cost of creating, writing, reading and closing objects inside the secure storage area, for two different object sizes (100KB and 1MB), although current memory allocator limitations prevented us from covering some cases [35,19,20,39]. Figure 11 (left) shows that closing and deleting objects are fast operations, and opening and writing are the slowest ones. Iterating over objects in the secure storage (e.g., the execution of a find operation) is slow, up to a few hours in the worst case (Figure 11, right). Adding more objects to the secure storage degrades the results even more (up to 2 . × object count ratio).

Secure storage: cost breakdown.

To understand how each low-level syscall af-fects the performance of a ﬁle-system inside the secure storage, we implemented asimple microbenchmark, inside ree fs create and ree fs write . Speciﬁcally, thesetests create and write data into a new object. Figure 14 shows a breakdown cost usingstacked bars for writing and creating ﬁles. These two functions are atomic and thus aresurrounded by a monitor (mutex) which adds a considerable delay (not shown) regard-ing the write operation. The impact is negligible on the create operation. We observethat opening the ﬁle and setting the ﬁlename accounts for the most time spent.

[Fig. 11: Secure storage: basic operations (left) and iteration (right). Left panel: duration [µs] of close, close & delete, create, open, read and write for the two object sizes. Right panel: duration [s] of iterating with 10, 100 and 1000 objects present, on Qemu on ESXi, rpi3b performance and rpi3b powersave; some configurations are marked n/a.]

[Fig. 12: Secure storage: energy measurements for basic operations. Energy [µWh] of close & delete, close, create, open and write on 1 kB and 10 kB files, for rpi3b ondemand, rpi3b performance and rpi3b powersave.]

Secure Storage: energy.

Since secure storage is a feature often used by nomad devices with low energy autonomy, we investigate its energy impact in depth. Figure 12 shows that creating objects is the most energy-demanding operation (up to 403µWh), irrespective of the size. The energy consumed when writing objects depends on their size. Interestingly, the ondemand governor achieves slightly worse results when creating a file, whereas it stands out for closing and deleting files. Figure 13 shows the energy required to iterate over a single stored object (top) [57, Chapter 5.8] during the enumeration of all objects in secure storage, or to rename a single object (bottom), when 10 or 100 additional objects (of the same size) are already in the secure storage. We execute this test for 2 different file sizes (1kB and 10kB). We observe that the energy required to iterate over a single object depends on the number of objects stored (in particular when using performance and ondemand), whereas the size of the object is irrelevant.
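To make the unit used in these figures concrete, the sketch below shows one way to turn a series of power readings into an energy value in µWh. It assumes that the power meter exposes power samples in watts at a fixed sampling period, which the text does not state; the function name energy_uwh is hypothetical, and trapezoidal integration is only one possible choice.

```c
#include <stddef.h>

/*
 * Hypothetical helper: integrates n power samples (in watts), taken
 * every dt_s seconds, with the trapezoidal rule and returns the energy
 * in microwatt-hours. 1 Wh = 3600 J, hence 1 J = 1e6 / 3600 uWh.
 */
double energy_uwh(const double *power_w, size_t n, double dt_s)
{
    double joules = 0.0;

    for (size_t i = 1; i < n; i++)
        joules += 0.5 * (power_w[i - 1] + power_w[i]) * dt_s;

    return joules / 3600.0 * 1e6;
}
```

For example, a constant draw of 3.6 W over one second corresponds to 3.6 J, i.e. about 1000 µWh.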

[Fig. 13: Secure storage, energy to iterate (top) and rename (bottom). Energy [µWh] versus object size (1kB, 10kB), with 10 and 100 objects present, for rpi3b ondemand, rpi3b performance and rpi3b powersave.]

[Fig. 14: Secure storage breakdown for two operations: create and write. Execution time [%] per governor (ondemand, performance, powersave). Stage legends: (1. open dir, 2. get temp file handle, 3. open, 4. write, 5. sync htree, 6. set name) and (1. open dir, 2. write, 3. sync htree, 4. update hash, 5. commit, 6. sync).]

CPU Benchmarks.

To benchmark the raw performance of the ARM processors of our units, we implemented and deployed a single-threaded TA that executes a CPU-bound task, e.g., computing the first 20000 prime numbers. We run multiple instances concurrently, and while they execute we also gather energy measurements (for all cases except the emulation mode). Figure 15 presents these results. As expected, the performance governor ensures the fastest computing time. Due to emulation costs, the Qemu results are the worst ones. As the number of instances exceeds the available hardware cores, we observe an increase in energy consumption. Overall, in this benchmark the ondemand governor is the most energy-hungry. This can be explained by the fact that adjusting the core frequencies (between 600MHz and 1.2GHz) seems to be a relatively costly operation [41].
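A minimal sketch of such a CPU-bound task is given below. The source of the TA is not listed in this report, so the trial-division approach and the names is_prime and nth_prime are assumptions; inside a TA, the same function would be called from the command handler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Trial division: deliberately simple so that the work stays CPU-bound. */
bool is_prime(uint32_t n)
{
    if (n < 2)
        return false;
    if (n % 2 == 0)
        return n == 2;
    for (uint32_t d = 3; (uint64_t)d * d <= n; d += 2) {
        if (n % d == 0)
            return false;
    }
    return true;
}

/* Returns the count-th prime (1-based), or 0 when count is 0. */
uint32_t nth_prime(uint32_t count)
{
    uint32_t found = 0;
    uint32_t candidate = 1;

    while (found < count) {
        candidate++;
        if (is_prime(candidate))
            found++;
    }
    return count ? candidate : 0;
}
```

Calling nth_prime(20000) walks through the first 20000 primes and returns the last one, which matches the workload size used in the benchmark.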

[Fig. 15: CPU benchmark: processing delay and energy requirements. Finding the first 20000 prime numbers; execution time [s] and energy [mWh].]

Thermal benchmarks.

We conclude our evaluation by looking at the thermal envelope of the SoC. To do so, we execute 8 concurrent instances of the prime benchmark inside TrustZone. Figure 16 presents the measurements fetched using the kernel's thermals API. Additionally, we monitor the surface temperature of the chip using a Texas Instruments LM35 precision linear sensor with the help of an external microcontroller. Thermal conductivity between the SoC and the LM35 is ensured by a thermal compound (Arctic MX-4 [3]). The ambient temperature is around 21.9°C. Results returned by the LM35 are calibrated and checked at rest against a Fluke thermocouple, and against a Flir E4 [17] thermal camera (see pictures in Figure 17). Marked points in Figure 16 refer to measurements done using the thermal camera. We observe a small margin of error of 3°C, and a discrepancy between the thermals API and the LM35 of over 15°C at times. This could be problematic because the measured surface temperature exceeds the rated continuous temperature of 85°C specified by the chip's manufacturer. In this situation, the thermals API returns an incorrect temperature that is well below the acceptable temperature. As a consequence, measures which should be taken to reduce the temperature, such as software thermal throttling, are not undertaken. A passively cooled Raspberry Pi should therefore only operate in powersave mode, or risk being hardware throttled or, worse, suffering damage. An actively cooled system, on the other hand, can operate in any mode and stay well within acceptable conditions, even without an additional heat sink. Once the maximal temperature is reached, recovery time is around 8 minutes when passively cooled and less than a minute with active cooling.

[Fig. 16: Evolution of CPU temperature with different cooling modes and governors. CPU temperature [°C] during the prime benchmark, one panel per governor (ondemand, powersave, performance). Series: external sensor, thermals API, external sensor (w/ fan), thermals API (w/ fan). Reference lines: Max, Ambient.]

[Fig. 17: Raspberry Pi thermal behaviour during processor stress benchmarks. Thermal camera pictures for ondemand, powersave and performance, with fan on and fan off; reported maxima include 66.7°C, 87°C, 64.3°C, 67.5°C and 90.3°C.]
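The comparison between the two temperature sources can be sketched as follows. The LM35 outputs 10 mV per degree Celsius, and the kernel reports temperatures in millidegrees Celsius; the function names and the way the microcontroller delivers a voltage in millivolts are assumptions made for this example, not a description of the measurement setup.

```c
/* LM35: the output voltage rises by 10 mV per degree Celsius. */
double lm35_celsius(double millivolts)
{
    return millivolts / 10.0;
}

/* Kernel thermal zones report temperatures in millidegrees Celsius. */
double thermal_api_celsius(long millidegrees)
{
    return (double)millidegrees / 1000.0;
}

/* Returns 1 when two readings differ by more than tol_c degrees, else 0. */
int readings_diverge(double a_c, double b_c, double tol_c)
{
    double diff = a_c - b_c;

    if (diff < 0.0)
        diff = -diff;
    return diff > tol_c;
}
```

With hypothetical readings of 900 mV on the LM35 (90°C) and 70000 from the thermals API (70°C), readings_diverge flags the pair for a tolerance of 15°C, which is the kind of discrepancy discussed above.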

This section reports on a few lessons learned during this experimental work.

Memory limitations.

By default, 32MB are dedicated to OP-TEE, of which 1MB is for TEE memory, 1MB for PUB (non-secure RAM) memory, and the remaining 30MB for TAs. Each TA has two compile-time options, TA_STACK_SIZE and TA_DATA_SIZE (in user_ta_header_defines.h), defining the stack size and heap size that can be used by a TA. These are set to very low values by default, 2kB and 32kB respectively [25]. For larger memory allocations, the TA's MMU L1 table must be set accordingly, as the default mapping is 1MB. We were unable to allocate more than 3MB for a single TA, even with shared memory enabled. Consequently, the OP-TEE benchmark framework [9] could not be used.

Compliance to standards.

The GlobalPlatform implementation in OP-TEE is not error-free, and some parts of the implementation do not fully comply with the specification. For instance, the TEE_BigIntAdd [57, p. 252] function, contrary to its definition, does not allow the same pointer to be used for both input and output [37]. Being relatively new, OP-TEE is improving rapidly. While this offers great advantages, such as mitigations against the latest attacks, it also introduces incompatibilities by deprecating older APIs. However, the GlobalPlatform consortium offers strong incentives for TEE vendors to comply with its API, which is unlikely to introduce breaking changes. Establishing this level of compliance ensures interoperability of TAs between existing TEE solutions, which is undeniably of great interest to secure application developers.

Developers toolchain.

The OP-TEE framework groups all required dependencies in a single project while also including several components of its own, such as the secure kernel. This greatly facilitates the development of secure applications by reducing setup and development efforts. The OP-TEE project includes a few TA examples and host applications, which are a good foundation to introduce the TEE paradigm.

TrustZone is a widely available technology that offers Trusted Execution Environment guarantees to low-energy devices. The goal of this practical experience report was to uncover the performance of these systems. To perform our experiments, we had to extend both secure and rich kernels so that secure timing measurements and thermal metrics could be fetched from within TrustZone, for which we provide detailed explanations in Appendix A. Our work highlights several advantages as well as limitations of the currently available software platforms, such as the OP-TEE framework chosen in our case, to implement and deploy TAs. We would like to point out two major limitations: (1) the lack of several basic features inside the REE kernel for security reasons, which materializes in the lack of basic syscalls (e.g., fopen, msgget); for this reason, it is paramount to reduce syscall dependencies when developing TAs; and (2) the current limitations regarding memory allocation and addressing, which could hinder the deployment of more complex TAs inside TrustZone. We hope this work will provide useful insights to TrustZone software developers.

Acknowledgments

The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation programme under the LEGaTO Project (legato-project.eu), grant agreement No 780681.

References

1. AArch64 Exception Handling - System calls to EL2/EL3. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch10s02s04.html.
2. Android Trusty TEE. https://source.android.com/security/trusty.
3. Arctic MX-4.
4. ARM Everywhere. https://hexus.net/static/arm-everywhere/.
5. ARM Financial Results.
6. ARM Inside The Numbers - 100bn. https://community.arm.com/processors/b/blog/posts/inside-the-numbers-100-billion-arm-based-chips-1345571105.
7. ARM TrustZone Developer. https://developer.arm.com/technologies/trustzone.
8. ARM1176JZF-S Technical Reference Manual - 2.12.13. Secure Monitor Call (SMC). http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0301h/ch02s12s13.html.
9. Benchmark framework. https://github.com/OP-TEE/optee_os/blob/master/documentation/benchmark.md.
10. clock_gettime(3) - Linux man page. https://linux.die.net/man/3/clock_gettime.
11. Consuming Unmanaged DLL Functions. https://docs.microsoft.com/en-us/dotnet/framework/interop/consuming-unmanaged-dll-functions.
12. Cortex-A9 Technical Reference Manual - 6.3. Memory Access Sequence. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388f/Ciheiecd.html. Accessed: 2018-12-09.
13. CVE-2017-5715. https://nvd.nist.gov/vuln/detail/CVE-2017-5715.
14. CVE-2017-5753. https://nvd.nist.gov/vuln/detail/CVE-2017-5753.
15. CVE-2017-5754. https://nvd.nist.gov/vuln/detail/CVE-2017-5754.
16. CVE-2018-3639. https://nvd.nist.gov/vuln/detail/CVE-2018-3639.
17. Flir E4.
18. gettimeofday(2) - Linux man page. https://linux.die.net/man/2/gettimeofday.
19. Hikey: trying to allocate more physical memory to secure world. https://github.com/OP-TEE/optee_os/issues/1396.
20. How to alloc 10M memory by TEE_Malloc(). https://github.com/OP-TEE/optee_os/issues/2090.
21. Intel SGX. https://software.intel.com/en-us/sgx.
22. Kingston Embedded Solutions.
23. Microsoft OpenEnclave Framework. https://github.com/Microsoft/openenclave.
24. OP-TEE Build on Github. https://github.com/OP-TEE/build. Accessed: 2018-12-04.
25. OP-TEE FAQ on Github. https://github.com/OP-TEE/OP-TEE_website/tree/master/faq. Accessed: 2018-12-04.
26. OP-TEE OS on Github. https://github.com/OP-TEE/optee_os. Accessed: 2018-12-04.
27. OP-TEE Raspberry 3B platform specific documentation.
28. OP-TEE sanity testsuite on Github. https://github.com/OP-TEE/optee_test. Accessed: 2018-12-04.
29. OP-TEE source. https://github.com/OP-TEE/optee_os/blob/master/core/arch/arm/kernel/generic_entry_a64.S. Accessed: 2018-12-09.
30. OP-TEE Supplicant on Github. https://github.com/OP-TEE/optee_client/tree/master/tee-supplicant. Accessed: 2018-12-04.
31. OPTEE-OS kernel thread.c init_canaries. https://github.com/OP-TEE/optee_os/blob/master/core/arch/arm/kernel/thread.c.
32. POWER-Z KM001C.
33. Qemu. Accessed: 2018-12-04.
34. QEMU with WIP TrustZone Support. https://git.linaro.org/virtualization/qemu-tz.git.
35. Shared memory size bigger than 1MB. https://github.com/OP-TEE/optee_os/issues/1523.
36. Stress-NG. https://kernel.ubuntu.com/~cking/stress-ng/. Accessed: 2019-20-01.
37. TEE_BigIntAdd fails when dest=op. OP-TEE OS Issue. https://github.com/OP-TEE/optee_os/issues/2577.
38. TRUSTSONIC.
39. Using more than 1Mb with TEE_Malloc. https://github.com/OP-TEE/optee_os/issues/2178.
40. VMware ESXi.
41. Workloads and governor effects.
42. ARM. ARM® CoreLink™ TZC-400 TrustZone® Address Space Controller. 2014.
43. ARM Limited. SMC CALLING CONVENTION System Software on ARM® Platforms. 2016.
44. M. Barbosa, S. B. Mokhtar, P. Felber, F. Maia, M. Matos, R. Oliveira, E. Riviere, V. Schiavoni, and S. Voulgaris. SAFETHINGS: Data Security by Design in the IoT. In Dependable Computing Conference (EDCC), 2017 13th European, pages 117–120. IEEE, 2017.
45. H. Cho, P. Zhang, D. Kim, J. Park, C.-H. Lee, Z. Zhao, A. Doupé, and G.-J. Ahn. Prime+Count: Novel Cross-world Covert Channels on ARM TrustZone. In Proceedings of the 34th Annual Computer Security Applications Conference, ACSAC '18, pages 441–452, New York, NY, USA, 2018. ACM.
46. Dominik Brodowski. CPU frequency and voltage scaling code in the Linux(tm) kernel.
47. Gartner. Leading the IoT: Gartner Insights on How to Lead in a Connected World. 2017.
48. P. Greenhalgh. big.LITTLE processing with arm cortex-a15 & cortex-a7. ARM White paper, 17, 2011.
49. Z. Hua, J. Gu, Y. Xia, H. Chen, B. Zang, and H. Guan. vTZ: Virtualizing ARM TrustZone. In Proc. of the 26th USENIX Security Symposium, 2017.
50. M. Lentz, R. Sen, P. Druschel, and B. Bhattacharjee. SeCloak: ARM Trustzone-based Mobile Peripheral Control. pages 1–13, 06 2018.
51. M. Lipp, M. T. Aga, M. Schwarz, D. Gruss, C. Maurice, L. Raab, and L. Lamster. Nethammer: Inducing Rowhammer Faults through Network Requests. arXiv preprint arXiv:1805.04956, 2018.
52. B. McGillion, T. Dettenborn, T. Nyman, and N. Asokan. Open-TEE - An Open Virtual Trusted Execution Environment. In Proceedings of the 2015 IEEE Trustcom/BigDataSE/ISPA - Volume 01, pages 400–407. IEEE Computer Society, 2015.
53. ncc group. Implementing practical electrical glitching attacks, 2015.
54. nVidia. TRUSTED LITTLE KERNEL (TLK) FOR TEGRA: FOSS EDITION. 2015.
55. A. K. Reddy, P. Paramasivam, and P. B. Vemula. Mobile secure data protection using eMMC RPMB partition. In Computing and Network Communications (CoCoNet), 2015 International Conference on, pages 946–950. IEEE, 2015.
56. G. Technology. GlobalPlatform TEE Client API Specification v1.0.
57. G. Technology. TEE Internal Core API Specification Version 1.1.2.50. 2018.

A Appendix: Extending the Kernel

First, a new file containing the syscall used to retrieve the processor temperature, getcputemp, is created.

    // populates temp with the CPU temperature in [m degC]
    SYSCALL_DEFINE1(getcputemp, unsigned long *, temp)
    {
        struct thermal_zone_device *tzd;
        int mdegc, ret;
        // The name "bcm2835_thermal" is obtained
        // from /sys/class/thermal/thermal_zone0/type
        tzd = thermal_zone_get_zone_by_name("bcm2835_thermal");
        if (IS_ERR(tzd))
            return PTR_ERR(tzd);
        // thermal_zone_get_temp() fills an int, not the caller's pointer
        ret = thermal_zone_get_temp(tzd, &mdegc);
        if (ret)
            return ret;
        *temp = mdegc; // the kernel-side caller (Listing 1.21) passes a kernel pointer
        return 0;      // the caller treats any non-zero value as an error
    }

Listing 1.1: linux/custom/custom.c

This file must be referenced in the main kernel Makefile:

    core-y += kernel/ [...] custom/

Listing 1.2: linux/custom/Makefile

The syscall must be included in syscalls.h:

    asmlinkage long sys_getcputemp(unsigned long *temp);

Listing 1.3: linux/include/linux/syscalls.h

The CALL macro is used in unistd.h:

    CALL(sys_getcputemp)

Listing 1.4: linux/arch/arm/kernel/calls.S

Use the next available syscall identifier:

    #define __NR_getcputemp (__NR_SYSCALL_BASE+394)

Listing 1.5: linux/arch/arm/include/uapi/asm/unistd.h

In the following file, and in addition to the modification listed above, note that __NR_syscalls must be incremented by one.

    #define __NR_getcputemp 288
    __SYSCALL(__NR_getcputemp, sys_getcputemp)

Listing 1.6: linux/include/uapi/asm-generic/unistd.h

At this point the new syscall is available to all user-mode applications running in TEE (see the corresponding steps in Figure 2). This syscall is then exposed in the REE kernel, tee-supplicant and the TEE kernel as if it were an official GlobalPlatform API function definition.

    unsigned long TEE_GetCpuTemperature(void);

Listing 1.7: optee_os/lib/libutee/include/tee_api.h

The __NR_syscalls value must be modified to account for the new syscall:

    #define __NR_syscalls [...]

Listing 1.8: linux/arch/arm/include/asm/unistd.h

The TEE function is a wrapper for the corresponding libutee implementation:

    unsigned long TEE_GetCpuTemperature(void)
    {
        unsigned long ret;
        TEE_Result res = utee_get_temperature(&ret);
        if (res != TEE_SUCCESS)
            TEE_Panic(res);
        return ret;
    }

Listing 1.9: optee_os/lib/libutee/tee_api.c

TEE_SCN_MAX must also be increased accordingly and the call is given the next unique identifier (71 in our case):

    #define TEE_SCN_GET_TEMPERATURE 71
    #define TEE_SCN_MAX [...]

Listing 1.10: optee_os/lib/libutee/include/tee_syscall_numbers.h

The utee syscall is declared in utee_syscalls.h and linked to its unique identifier:

    TEE_Result utee_get_temperature(unsigned long *temp);

Listing 1.11: optee_os/lib/libutee/include/utee_syscalls.h

    UTEE_SYSCALL utee_get_temperature, TEE_SCN_GET_TEMPERATURE, 1

Listing 1.12: optee_os/lib/libutee/arch/arm/utee_syscalls_asm.S

Add the syscall entry in arch_svc.c. The trailing comma is required.

    SYSCALL_ENTRY(syscall_get_temperature),

Listing 1.13: optee_os/core/arch/arm/tee/arch_svc.c

    TEE_Result syscall_get_temperature(unsigned long *temp);

Listing 1.14: optee_os/core/include/tee/tee_svc.h

This function serves as a wrapper to the REE kernel syscall used to retrieve the temperature:

    TEE_Result syscall_get_temperature(unsigned long *temp)
    {
        tee_ta_get_temperature(temp);
        return TEE_SUCCESS;
    }

Listing 1.15: optee_os/core/tee/tee_svc.c

A new file is created:

    #ifndef TEE_TEMPERATURE_H
    #define TEE_TEMPERATURE_H
    #include "tee_api_types.h"
    TEE_Result tee_ta_get_temperature(unsigned long *temp);
    #endif

Listing 1.16: optee_os/core/include/kernel/tee_temperature.h

This function is called in step 8 and triggers an REE world switch (step 7):

    TEE_Result tee_ta_get_temperature(unsigned long *temp)
    {
        TEE_Result res;
        struct optee_msg_param params;

        memset(&params, 0, sizeof(params));
        params.attr = OPTEE_MSG_ATTR_TYPE_VALUE_OUTPUT;
        res = thread_rpc_cmd(OPTEE_MSG_RPC_CMD_GET_TEMPERATURE, 1, &params);
        if (res == TEE_SUCCESS) {
            *temp = params.u.value.a;
        }
        return res;
    }

Listing 1.17: optee_os/core/arch/arm/kernel/tee_temperature.c

The following line is added in sub.mk:

    srcs-y += tee_temperature.c

Listing 1.18: optee_os/core/arch/arm/kernel/sub.mk

A new message used to retrieve the temperature via RPC is declared:

    // [out] temperature
    #define OPTEE_MSG_RPC_CMD_GET_TEMPERATURE 21

Listing 1.19: optee_os/core/include/optee_msg.h

The same is done in another file:

    // [out] temperature
    #define OPTEE_MSG_RPC_CMD_GET_TEMPERATURE 21

Listing 1.20: linux/drivers/tee/optee/optee_msg.h

This function is declared inside the REE kernel:

    static void handle_get_temperature(struct optee_msg_arg *arg)
    {
        unsigned long cputemperature;

        // Linux kernel syscall
        if (sys_getcputemp(&cputemperature)) {
            arg->ret = TEEC_ERROR_GENERIC;
            return;
        }
        arg->params[0].u.value.a = cputemperature;
        arg->ret = TEEC_SUCCESS;
        return;
    }

Listing 1.21: linux/drivers/tee/optee/rpc.c

In the same file, handle_rpc_func_cmd is modified by adding a case to handle the new RPC request:

    case OPTEE_MSG_RPC_CMD_GET_TEMPERATURE:
        handle_get_temperature(arg);
        break;

Listing 1.22: linux/drivers/tee/optee/rpc.c

After rebuilding the TEE client and kernel, the new syscall can be used as such from any TA (step 9):

    float tempC = TEE_GetCpuTemperature() / 1000.0f;

Listing 1.23: Usage from TA

This solution perfectly illustrates a workaround to the starvation of the REE world caused by the execution of the TEE.

In order to accomplish the secure storage microbenchmark, it was required to measure monotonic time, and to store and retrieve these measurements. The common denominator between the TEE kernel and the host application is the REE kernel. For this reason, it was decided to store measurements in the REE kernel, from which they could be gathered by the host application. Three syscalls were added in the REE kernel and made available in the TEE kernel.

These are first declared:

    #define __NR_ktraceadd (__NR_SYSCALL_BASE+396)
    #define __NR_ktraceget (__NR_SYSCALL_BASE+397)
    #define __NR_ktracereset (__NR_SYSCALL_BASE+398)

Listing 1.24: linux/arch/arm/include/uapi/asm/unistd.h

    CALL(sys_ktraceadd)
    CALL(sys_ktraceget)
    CALL(sys_ktracereset)

Listing 1.25: linux/arch/arm/kernel/calls.S

    // save the current time as the specified id
    asmlinkage long sys_ktraceadd(unsigned long id);
    // returns the id+sec+ns of the requested index
    asmlinkage long sys_ktraceget(unsigned long index, unsigned long *id,
                                  unsigned long *sec, unsigned long *ns);
    asmlinkage long sys_ktracereset(void);

Listing 1.26: linux/include/linux/syscalls.h

Implementation is stored in a separate file:

    // Return values are assumed: 0 on success, -1 on failure
    #define MAX_KTRACE_ENTRIES 30

    unsigned char ktrace_entries = 0;

    struct ktraceadd_e {
        unsigned long id;
        struct timespec64 ts;
    };

    struct ktraceadd_e *ktraceadd_d;

    SYSCALL_DEFINE1(ktraceadd, unsigned long, id)
    {
        struct timespec64 ts;

        ts = ns_to_timespec64(ktime_get_ns());
        if (!ktraceadd_d) {
            ktraceadd_d = kmalloc(sizeof(struct ktraceadd_e) * MAX_KTRACE_ENTRIES,
                                  GFP_KERNEL | GFP_NOWAIT);
        }
        // also skip the store when the allocation failed
        if (ktraceadd_d && ktrace_entries < MAX_KTRACE_ENTRIES) {
            memcpy((void *)&ktraceadd_d[ktrace_entries].id, (void *)&id,
                   sizeof(unsigned long));
            memcpy((void *)&ktraceadd_d[ktrace_entries].ts, (void *)&ts,
                   sizeof(struct timespec64));
            ktrace_entries++;
            return 0;
        }
        return -1;
    }

    SYSCALL_DEFINE4(ktraceget, unsigned long, index, unsigned long *, id,
                    unsigned long *, sec, unsigned long *, ns)
    {
        if (ktraceadd_d && index >= 0 && index < ktrace_entries) {
            *id = ktraceadd_d[index].id;
            *sec = ktraceadd_d[index].ts.tv_sec;
            *ns = ktraceadd_d[index].ts.tv_nsec;
            return 0;
        }
        return -1;
    }

    SYSCALL_DEFINE0(ktracereset)
    {
        ktrace_entries = 0;
        return 0;
    }

Listing 1.27: linux/custom/custom.c

Next, three available syscall identifiers are used. In the same file, __NR_syscalls must be incremented by three.

    #define __NR_ktraceadd 290
    __SYSCALL(__NR_ktraceadd, sys_ktraceadd)
    #define __NR_ktraceget 291
    __SYSCALL(__NR_ktraceget, sys_ktraceget)
    #define __NR_ktracereset 292
    __SYSCALL(__NR_ktracereset, sys_ktracereset)

Listing 1.28: linux/include/uapi/asm-generic/unistd.h

The __NR_syscalls value must be modified to account for the new syscalls:

    #define __NR_syscalls [...]

Listing 1.29: linux/arch/arm/include/asm/unistd.h

These functions can now be invoked from any REE user-mode application. Two instrumentation entries, test1 and test2 (identified here by the numeric ids 1 and 2, since sys_ktraceadd takes an unsigned long), are added directly from the host application (step 3), and then retrieved and displayed.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    // As defined in
    // optee/linux/include/uapi/asm-generic/unistd.h
    #define SYSCALL_KTRCEADD 290
    #define SYSCALL_KTRCEGET 291
    #define SYSCALL_KTRCERESET 292

    int main(void)
    {
        unsigned long kget_id;
        unsigned long kget_sec;
        unsigned long kget_ns;

        printf("Calling SYSCALL_KTRCERESET\n");
        syscall(SYSCALL_KTRCERESET);
        printf("Calling SYSCALL_KTRCEADD 0\n");
        syscall(SYSCALL_KTRCEADD, 1UL); // test1
        sleep(1);
        printf("Calling SYSCALL_KTRCEADD 1\n");
        syscall(SYSCALL_KTRCEADD, 2UL); // test2
        for (int i = 0; i < 2; ++i) {
            syscall(SYSCALL_KTRCEGET, (unsigned long)i, &kget_id, &kget_sec, &kget_ns);
            printf("SYSCALL_KTRCEGET index %d: id=%lu sec=%lu ns=%lu\n",
                   i, kget_id, kget_sec, kget_ns);
        }
        return 0;
    }

Listing 1.30: Host application usage example
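Once two entries have been retrieved, the duration of the instrumented section is the difference between their timestamps. The following sketch shows this arithmetic on the (sec, ns) pairs returned by the ktraceget syscall; the helper name elapsed_ns is hypothetical and not part of the modified kernel.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/*
 * Hypothetical helper: nanoseconds elapsed from timestamp a to
 * timestamp b, each given as a (seconds, nanoseconds) pair.
 * The result is negative when b precedes a.
 */
int64_t elapsed_ns(unsigned long a_sec, unsigned long a_ns,
                   unsigned long b_sec, unsigned long b_ns)
{
    int64_t sec = (int64_t)b_sec - (int64_t)a_sec;
    int64_t ns = (int64_t)b_ns - (int64_t)a_ns;

    return sec * NSEC_PER_SEC + ns;
}
```

In the host application above, passing the values of index 0 and index 1 to this helper would yield roughly one second, because of the sleep(1) between the two entries.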