TEEMon: A continuous performance monitoring framework for TEEs
CC-BY 4.0. This is the author's version of the work. The definitive version is published in the proceedings of the 21st ACM/IFIP International Middleware Conference (Middleware 2020).
Robert Krahn
TU Dresden
Donald Dragoti
TU Dresden
Franz Gregor
TU Dresden, Scontain UG
Do Le Quoc
TU Dresden, Scontain UG
Valerio Schiavoni
Université de Neuchâtel
Pascal Felber
Université de Neuchâtel
Clenimar Souza
Universidade Federal de Campina Grande
Andrey Brito
Universidade Federal de Campina Grande
Christof Fetzer
TU Dresden, Scontain UG
Abstract
Trusted Execution Environments (TEEs), such as Intel Software Guard eXtensions (SGX), are considered a promising approach to resolve security challenges in clouds. TEEs protect the confidentiality and integrity of application code and data even against privileged attackers with root and physical access by providing an isolated secure memory area, i.e., enclaves. The security guarantees are enforced by the CPU; thus, even if system software is compromised, the attacker can never access the enclave's content. While this approach ensures strong security guarantees for applications, it also introduces a considerable runtime overhead, caused in part by the limited availability of protected memory (the enclave page cache). Currently, only a limited number of performance measurement tools for TEE-based applications exist, and none offers performance monitoring and analysis during runtime.

This paper presents TEEMon, the first continuous performance monitoring and analysis tool for TEE-based applications. TEEMon not only provides fine-grained performance metrics during runtime, but also assists in identifying the causes of performance bottlenecks, e.g., excessive system calls. Our approach smoothly integrates with existing open-source tools (e.g., Prometheus or Grafana) towards a holistic monitoring solution, particularly optimized for systems deployed through Docker containers or Kubernetes, and offers several dedicated metrics and visualizations. Our evaluation shows that TEEMon's overhead ranges from 5% to 17%.

1 Introduction

Cloud computing is a popular way to deploy modern online services, since it lets users focus on their applications by delegating tasks like resource management to the cloud provider. In such multi-tenant environments, users care about protecting their data and applications, especially under the threat of powerful adversaries with root privileges or even physical access to machines. As such, users cannot rely on OS access control mechanisms (bypassed by the former) or on process/memory isolation (bypassed by the latter, for instance via cold-boot attacks [43]). In this context, cloud users have to ensure their data is always protected: at rest, in transit over the wire, and while being processed in main memory.

To overcome these issues, hardware-assisted Trusted Execution Environments (TEEs) such as Intel SGX [24, 54], ARM TrustZone [75], IBM SecureBlue++ [29, 74], and AMD SEV [44] offer a practical approach to protect services in an untrusted cloud [56]. TEE technologies provide strong integrity and confidentiality guarantees regardless of the trustworthiness of the underlying software (e.g., the operating system or the hypervisor). Recently, Intel SGX has become available in the cloud [22, 45], unleashing a plethora of services to be ported, including data processing systems such as MapReduce [67], coordination services [31], content-based routing [58], and databases [60, 66]. Legacy services can also be executed with Intel SGX without any modification by using frameworks [25, 70] that transparently shield existing applications.

While promising at first glance, the approach of leveraging TEEs suffers from several technical issues, especially regarding performance overhead. Indeed, legacy applications running inside secure isolated areas of the hardware, called enclaves, suffer from significant performance issues [25, 73]. From the cloud user's vantage point, they require strong security guarantees for the code and data of their applications, as well as low runtime overhead.
Gjerdrum et al. [35] proposed a set of guidelines for better enclave performance in a cloud environment, recommending small enclave page cache (EPC) sizes of 64 kB and minimizing ECALL arguments for faster transitions. Several studies [25, 31, 73] identified costly EPC paging and enclave transitions as major SGX performance bottlenecks. While solutions exist to mitigate this issue (e.g., Switchless [68], HotCalls [46]), performing a context switch from the inside to the outside of enclaves still introduces a significant overhead, since the hardware needs to prevent the context switch from revealing any sensitive data stored inside enclaves.

To profile the performance of SGX enclaves, Intel provides the VTune Amplifier [62]. This tool is designed for profiling applications at the instruction level. While helpful, it lacks insights regarding SGX-specific performance metrics. SGX-Perf [73] and TEE-Perf [26] overcome this limitation by allowing developers to trace enclave execution and by providing SGX performance metrics such as enclave transitions and paging. However, SGX-Perf supports only applications that use the Intel SGX Software Development Kit. In addition, neither SGX-Perf nor TEE-Perf supports runtime monitoring of the performance overhead of an application running inside an enclave. Thus, they cannot support SGX frameworks in tuning system parameters during runtime to improve the performance of applications, for example by increasing the number of threads on the inside or outside of an enclave.

To summarize, TEEs are widely adopted to provide strong security guarantees for applications running in an untrusted environment, e.g., a public cloud. However, there is a lack of tools to monitor the performance of these applications running inside TEE enclaves at runtime. We fill this gap by designing and implementing TEEMon, which allows users to monitor the performance of their applications running in TEEs and to identify performance issues and bottlenecks in real-time. More specifically, TEEMon has the following design features:
1. Lightweight.
TEEMon incurs low performance overhead while providing useful and accurate performance profiling.
2. Transparency.
TEEMon does not require modification of the monitored application.
3. Generality.
TEEMon is framework-agnostic and can be used with many SGX frameworks such as SCONE [25], Graphene-SGX [70], or SGX-LKL [59].

We implemented TEEMon to monitor the performance of applications running inside Intel SGX enclaves during runtime. In this work, we focus on Intel SGX since it is widely used in practice, although the technique can be applied to other TEEs as well. We evaluated TEEMon using real-world applications and state-of-the-art SGX frameworks. Our evaluation shows that TEEMon has a negligible overhead, from 5% to 17% depending on the running application, while providing useful performance metrics for users with an intuitive visualization during runtime. In addition, to the best of our knowledge, this work is the first to provide an intensive performance comparison between state-of-the-art Intel SGX frameworks. Lastly, TEEMon has been tested and integrated with Kubernetes to monitor over 6000 distributed SGX enclaves in production, where it was used to track various performance metrics and has allowed the developers of the SCONE framework to continuously monitor the impact of different software generations on a detailed level (e.g., consumption of the enclave page cache, page faults, etc.) [23].
2 Related Work

There exist only a few tools to profile and monitor applications executed in TEEs at runtime. We briefly report on them, as well as on more generic profilers and monitoring tools. Table 1 summarizes our survey.
Tool       Framework-Agnostic   Paging   Enclave Transitions   Orchestrated Applications   Real-Time Reports
LIKWID            ✓                ✗              ✗                       ✓                        ✗
perf              ✓                ✗              ✗                       ✗                        ✗
MemProf           ✓                ✗              ✗                       ✗                        ✗
TEE-Perf          ✓                ✗              ✗                       ✗                        ✗
gprof             ✓                ✗              ✗                       ✗                        ✗
VTune             ✓                ✗              ✗                       ✗                        ✗
SGX-Perf          ✗                ✓              ✓                       ✗                        ✗
SGXTOP            ✓                ✓              ✓                       ✗                        ✓
TEEMon            ✓                ✓              ✓                       ✓                        ✓

Table 1. Profiling and monitoring tools for SGX.
Over the years, numerous profiling tools and approaches have been proposed. Ball et al. [27] first proposed a set of optimal algorithms for program profiling and instruction tracing to reduce the overhead of profilers. Hardware-specific tools such as LIKWID [63, 69] for x86 environments and MemProf [48] for NUMA systems make use of hardware counters, allowing developers to explore optimization opportunities specific to the underlying system architecture (system-wide statistics are available in LMS [63]). Linux perf [34] provides low-level system metrics by attaching to tracepoints, performance counters, or probes, similarly to eBPF [36]. However, it does not offer support for TEEs and cannot provide profiling data for applications running inside SGX enclaves. VTune Amplifier [62], a commercial analysis tool by Intel, offers in-depth analysis for SGX applications. With the help of special hardware features, it gives detailed information regarding the time spent in each method or function call, but it provides no support for SGX-specific metrics like EPC paging. Further, it depends on the Intel hardware architecture and does not support other vendors. Another SGX-specific profiler is SGX-Perf [73], which provides statistics on enclave entries and exits as well as EPC paging events, based on kprobes [39]. SGX-Perf is limited to applications using the Intel SDK for SGX and does not support monitoring during runtime, since it implements a two-phased record-and-report approach. TEEMon instead focuses on continuous monitoring during runtime to provide insights into performance, helping to identify the root cause of bottlenecks and revealing opportunities for performance improvement.

The gprof [40] tool can provide developers with an execution profile by counting function invocations, but its overhead is substantially higher than that of Linux perf, as the injected code runs on each function call [26]. This, in turn, limits its use to development environments, as high overheads are unacceptable in production. All of the aforementioned profilers offer application-specific performance data, but some are platform-dependent, some language-dependent, and some proprietary. Besides, most of the profilers mentioned (i.e., TEE-Perf, gprof, SGX-Perf, and VTune Amplifier) are intended for use in development and testing stages and are not suitable for continuous monitoring due to their high performance overhead. While developed within a security context, approaches such as SGX-Step [71] could also be used to debug applications in TEEs, but they typically incur a substantial slowdown. Note that state-of-the-art performance profilers such as TEE-Perf [26] report an average slowdown factor of 1.9× compared to native SGX executions. Closely related to TEEMon, SGX-TOP [49] continuously displays SGX-related metrics, which are also collected by the TEE metrics exporter (Section 4). However, SGX-TOP focuses solely on displaying SGX-related metrics in a terminal, while lacking archival functionality as well as the combination of multiple metrics into one interface as provided by TEEMon. TEEMon is designed to provide a more holistic monitoring application. In addition, SGX-TOP can be used only with single-node applications, whereas TEEMon supports distributed applications deployed on multiple nodes (e.g., using Kubernetes).
In a distributed computing environment, applications require constant monitoring to recover from and fix deployment issues, fix application-specific bugs, or adjust the deployment depending on the current load on the system. Major cloud providers offer proprietary monitoring solutions to their clients, such as Amazon CloudWatch [6], Google Cloud's Operations [13], or IBM SysDig [19, 30]. These monitoring systems provide hundreds of metrics regarding the cloud infrastructure and applications. However, they are dependent on the cloud provider and are usually billed depending on the number of metrics requested. Besides, in multi-cloud infrastructures, one would need to support and maintain multiple monitoring systems depending on the cloud providers.

As a result, companies like New Relic [5], SolarWinds [1], Datadog [7], AppDynamics [4], SignalFx [12], and Dynatrace [2] offer managed solutions for monitoring and software analytics for dynamic micro-service architectures. These options solve the vendor lock-in issue and offer a comprehensive list of services, including application and infrastructure analytics. They also support ready-made visualizations and machine learning capabilities running on top of accumulated data for better insights, such as SysDig for SGX-capable bare-metal servers at IBM Cloud [11]. However, none of them was designed to monitor and analyze the performance of applications running inside TEE enclaves. Our proposed framework additionally supports containerized applications and orchestration management systems such as Kubernetes (Section 5.4) and can thus be integrated into TEE-enabled cloud environments [72].
3 Background

Intel SGX. SGX is a set of x86 ISA instructions to secure the code and data of applications, available since the Intel Skylake architecture [24, 54] (https://software.intel.com/en-us/articles/an-overview-of-the-6th-generation-intel-core-processor-code-named-skylake). SGX introduces the concept of a secure enclave, i.e., a hardware-protected memory region for code and data, protected by the CPU with confidentiality and integrity features. A dedicated region of physical memory called the Enclave Page Cache (EPC) is reserved for enclaves. The EPC is protected by an on-chip Memory Encryption Engine (MEE), which transparently encrypts and decrypts cache lines as they are written to and read from the EPC, respectively. SGX provides a call-gate mechanism to control entry into, and exit from, the TEE [33]. In most processors, the EPC is limited to ∼128 MB, of which only ∼94 MB can be used by applications [33]. However, the latest generations of Intel Xeon processors are equipped with a larger EPC. Intel SGX is currently offered in clouds [38, 45], enabling plenty of confidential cloud-native applications, e.g., analytics systems [50, 51], key management systems [41], and secure software updates [57].
SGX Frameworks. To run legacy applications with Intel SGX without any source code modification, SGX frameworks such as SCONE [25], SGX-LKL [59], and Graphene-SGX [70] can be utilized.
SCONE and SGX-LKL.
SCONE [25] is a shielded execution framework using Intel SGX. To run applications inside SGX enclaves, programs are linked against a modified standard C library, i.e., SCONE libc. In this model, the whole application is confined to the enclave memory, and the interaction with the untrusted system is executed via a system call interface. SCONE leverages an asynchronous system call mechanism: threads inside the enclave execute tasks of the application, pushing system calls to the outside of the enclave; threads outside the enclave asynchronously execute the system calls and push the results back. In addition, SCONE natively integrates with Docker [55] to seamlessly deploy micro-service based applications using container images. A similar approach is implemented by SGX-LKL [59], which also provides a framework that links applications against a modified standard C library (musl-libc).

Figure 1. TEEMon system overview: performance metrics exporters (PME) attached to the TEE driver and kernel on each node feed SGX and system performance metrics via REST/HTTP into the performance metrics aggregation (PMAG), analytics (PMAN), and visualization (PMV) components.

Graphene-SGX.
Graphene-SGX [70] is an open-source SGX implementation of the original Graphene library OS. Similar to Haven [28], it runs a complete library OS inside SGX enclaves and lets developers run their applications with Intel SGX without any code modifications. Graphene-SGX facilitates protection through a manifest file that contains user-defined security policies and a list of trusted libraries (with their cryptographic SHA-256 hashes) required by the application. Thus, the manifest file allows for a fine-granular definition of trusted resources.
eBPF. The scope of the original BSD Packet Filter (BPF) [53] was limited to network monitoring and packet filtering, leveraging RISC-like CPU registers. It efficiently filters network packets without copying them to user space before analyzing them. The extended BPF (eBPF), available since Linux kernel 3.15, takes advantage of modern x86 CPU architectures, 64-bit registers, and just-in-time compilation. Although the original BPF project started as a network monitoring tool, nowadays eBPF programs can be attached to a multitude of hooks in the Linux kernel. eBPF added methods to load custom programs into the in-kernel eBPF virtual machine. In addition, it allows user-space programs to interact with BPF_MAP data structures in the kernel [15]. BPF_MAPs are generic key/value stores mainly used to share data between kernel and user space, and between different eBPF programs. TEEMon uses eBPF to obtain and provide low-level system statistics such as executed system calls or cache misses, as we detail next.
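To make this mechanism concrete, the following sketch, a minimal example built on the BCC toolkit rather than TEEMon's actual exporter code, attaches a small eBPF program to the raw_syscalls:sys_enter tracepoint and reads per-syscall counters from user space through a BPF_MAP:

```python
# Minimal BCC sketch: count system calls per syscall number in the kernel
# and read the counters from user space through a BPF map. Illustrative
# only; TEEMon's exporters work in a similar spirit but are more complete.
from bcc import BPF
import time

prog = r"""
BPF_HASH(counts, u32, u64);             // BPF_MAP: syscall id -> count

TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
    u32 id = args->id;                  // syscall number
    u64 zero = 0, *val;
    val = counts.lookup_or_try_init(&id, &zero);
    if (val) { (*val)++; }              // increment in kernel space
    return 0;
}
"""

b = BPF(text=prog)                      # compile and load into the kernel
time.sleep(5)                           # let the counters accumulate
for k, v in sorted(b["counts"].items(), key=lambda kv: -kv[1].value)[:10]:
    print(f"syscall {k.value}: {v.value} calls")
```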
4 Design

The main goal of TEEMon is to provide a low-overhead monitoring framework that offers the performance profiling functionality of previous frameworks [26, 73] while allowing users to continuously keep track of their applications running inside SGX enclaves during runtime, a combination of features that, as shown earlier (Table 1), is currently not available. In this section, we discuss the architecture of our monitoring framework, explaining the main design decisions.

Figure 2. System Metrics Exporter architecture: small BPF programs attached to kernel tracepoints, kprobes, uprobes, and perf events feed the exporter, which standardizes the metrics format.

TEEMon is designed to be generic and applicable to a variety of SGX frameworks without code changes, while keeping in mind the state of the art and current best practices in the field of monitoring and observability. Figure 1 shows a high-level overview of our monitoring framework. TEEMon consists of four core components: (i) the performance metrics exporters (PME) collect metrics at different system levels, including the kernel and the TEE environment (e.g., the Intel SGX driver); (ii) the performance metrics aggregation (PMAG) component combines and stores the data in a time-series database, which is accessed and processed by (iii) the performance metrics analysis (PMAN) component; and (iv) the visualization (PMV) component presents the monitoring data to the user via a web service. We provide further details about the currently utilized software components in Section 5.

In a nutshell, to capture performance metrics with minimal overhead and without code modification of applications, we design an exporter module for the TEE driver that extracts the information necessary for keeping track of applications during runtime. The metrics exporter obtains performance metrics directly from the kernel. Thereafter, the metrics aggregator combines the monitoring metrics from all distributed exporters to provide a global view of the performance of the monitored applications. TEEMon uses multiple exporters per host machine with individual tasks, e.g., providing SGX-specific or machine-specific metrics. The analysis component examines the aggregated performance metrics to identify the bottlenecks of monitored applications. Lastly, the visualization component continuously presents the performance bottlenecks and critical performance metrics in an intuitive way to users via a web-based interface.

(i) Performance Metrics Exporter (PME).
This component combines two modules: the TEE Metrics Exporter, which collects TEE-related metrics, and the System Metrics Exporter, which collects system metrics. The PME exports these collected metrics to the aggregation component in a standardized format. We describe both next.
TEE Metrics Exporter (TME).
The TME supports two main classes of TEE-related metrics: enclave metrics (initialized enclaves, active enclaves, removed enclaves) and EPC metrics (total EPC pages, free EPC pages, pages marked as old, pages evicted to main memory, pages added to enclaves, pages reclaimed from main memory). Additional metrics could be added through code adaptation of the utilized TEE driver.

To capture these TEE metrics with low overhead, the TME connects to specific hooks (e.g., sgx_nr_free_pages, sgx_nr_enclaves, or sgx_nr_evicted) in the TEE driver and extracts the data regarding the trusted enclave regions and other information specific to the underlying TEE architecture. There is a single TME instance per machine, which requires privileged access to the underlying operating system.

While modifying an existing TEE driver would be feasible [16, 72], developing an additional component provides a modular architecture that supports current and future TEE technologies, and it avoids excessive and expensive modifications to the underlying TEE subsystem, which would require additional maintenance effort. In addition, it can also be used to expose additional metrics regarding the platforms running on top of a TEE. For instance, with Intel SGX, TEEMon can be used interchangeably with SCONE or Graphene-SGX without changing the code of the SGX frameworks.

Our design allows the PME to be easily customized and used on different TEE platforms as well as for kernel-integrated approaches, such as IBM PEF [18], AMD SEV [44], or Intel TDX [21]. For these virtual-machine-based security mechanisms, we envision an extension to the hypervisor, e.g., QEMU, that integrates the functionality of the TME. The extension would, similar to the TME for SGX, export metrics such as the amount of protected memory requested by each virtual machine. Additionally, other exporters, e.g., the system metrics exporter (SME) described in the next section, can be added to the virtual machine image to export metrics from the guest OS to TEEMon.

System Metrics Exporter (SME).
The task of the SME (Figure 2) is to collect and export performance metrics from the underlying system infrastructure. To obtain low-level system metrics, one needs to access CPU performance counters, kprobes, and tracepoints, similarly to Linux's perf [34]. This requires writing and running code in kernel space while ensuring that it neither compromises security nor reduces performance. To overcome these problems, we use eBPF, the in-kernel virtual machine, which allows kernel instrumentation programs to run in a secure and restricted environment. The eBPF programs connect to specified hooks in the host's kernel. By attaching small eBPF programs to each kernel hook, we can read and extract low-level system statistics and export them to user space via BPF_MAPs.
Type                Method               Field
Sys. call metrics   Kernel tracepoints   raw_syscalls:sys_enter, raw_syscalls:sys_exit
Cache metrics       Kprobes              add_to_page_cache_lru, mark_page_accessed, account_page_dirtied, mark_buffer_dirty
                    Perf. events         PERF_COUNT_HW_CACHE_MISSES, PERF_COUNT_HW_CACHE_REFERENCES
Context switches    Perf. events         PERF_COUNT_SW_CONTEXT_SWITCHES
                    Kernel tracepoints   sched:sched_switch
Page faults         Perf. events         PERF_COUNT_SW_PAGE_FAULTS
                    Kernel tracepoints   exceptions:page_fault_user, exceptions:page_fault_kernel

Table 2. System metrics collected by TEEMon.

The metrics are translated into a standard format understood by the metrics aggregation component (e.g., Prometheus [10]) and published at its metrics endpoint (e.g., a web server) to be scraped. The current TEEMon implementation instruments system calls, context switches, page faults, and last-level cache metrics. The SME's metrics and the corresponding system hooks are shown in Table 2. The listed metrics provide a guideline on the most important system metrics; the list can be extended depending on the application's and the user's requirements.
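To illustrate the perf-event rows of Table 2, the following sketch, modeled on BCC's llcstat example rather than TEEMon's actual exporter, samples the PERF_COUNT_HW_CACHE_MISSES hardware counter:

```python
# Minimal BCC sketch: approximate last-level-cache misses per CPU by
# sampling the hardware cache-miss counter (cf. Table 2). Illustrative
# only; TEEMon's exporter wires such counters to its metrics endpoint.
from bcc import BPF, PerfType, PerfHWConfig
import time

prog = r"""
BPF_HASH(miss_count, u32, u64);              // cpu -> sampled miss events

int on_cache_miss(struct bpf_perf_event_data *ctx) {
    u32 cpu = bpf_get_smp_processor_id();
    u64 zero = 0, *val = miss_count.lookup_or_try_init(&cpu, &zero);
    if (val) { (*val) += ctx->sample_period; } // misses covered by sample
    return 0;
}
"""

b = BPF(text=prog)
b.attach_perf_event(ev_type=PerfType.HARDWARE,
                    ev_config=PerfHWConfig.CACHE_MISSES,
                    fn_name="on_cache_miss",
                    sample_period=100)        # one sample per 100 misses
time.sleep(5)
for cpu, misses in b["miss_count"].items():
    print(f"cpu {cpu.value}: ~{misses.value} LLC misses")
```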
(ii) Performance Metrics Aggregation (PMAG).

The PMAG combines and aggregates the performance metrics collected from the exporter component (PME). It embeds a time-series database, a metrics retrieval component, and an HTTP server. It collects, processes, and aggregates a large number of metrics from a dynamically changing list of services, with low overhead on the applications. Using its time-series database, it stores all metric data samples locally and groups them into chunks for faster retrieval. Additionally, it allows for multi-dimensional data with the help of metric labels specified as a set of key-value pairs, e.g., system calls with a timestamp, system-call number, and the number of occurrences.

To support the analysis of performance metrics, PMAG supports data queries over specified time ranges and labeled dimensions. It provides detailed quantitative analysis by selecting and applying aggregation functions to query results. PMAG can connect to every service exposing a metrics endpoint, and users of TEEMon can easily add their application metrics to it. For example, a user can instrument an application to export monitoring metrics in the standard text-based format specified by the OpenMetrics project [17] and expose them at a REST endpoint. From this endpoint, PMAG can collect the measured metrics.
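As a sketch of such a query, the snippet below asks a Prometheus-style PMAG endpoint for the per-second rate of a metric over a time range, using Prometheus's standard HTTP query API; the metric name and endpoint address are illustrative assumptions:

```python
# Query the aggregation component (Prometheus) over a time range.
# The metric name below is hypothetical; TEEMon's exporters publish
# their own names. Uses Prometheus's standard /api/v1/query_range API.
import requests

PROM = "http://localhost:9090"            # assumed PMAG endpoint
resp = requests.get(f"{PROM}/api/v1/query_range", params={
    "query": "rate(sgx_enclave_pages_evicted_total[1m])",  # PromQL
    "start": "2020-12-01T10:00:00Z",
    "end":   "2020-12-01T10:30:00Z",
    "step":  "15s",
})
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["values"][:3], "...")
```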
Push vs. Pull in Monitoring.

There are two main mechanisms used by centralized monitoring systems to gather new metrics: (i) push, where services push their metrics to an endpoint as they occur, or (ii) pull, where the monitoring system periodically pulls the metrics from a metrics endpoint on each service.

The push-based approach allows services to continuously push metrics in real-time and is mainly used by event-based monitoring services like statsd [9]. However, event-based monitoring can lead to congestion and overloading of the entire monitoring system during sudden event bursts.

The pull-based approach requires the metrics exporter component to provide an endpoint from which the monitoring service can scrape the metrics at specified intervals. Consequently, the resulting data traffic can be managed more easily, and it lightens the load on the monitoring service itself, compared to having to ingest hundreds of metrics per second, as can happen with system metrics. Additionally, centralizing data ingestion eliminates the issue of misbehaving services pushing garbage data, which might turn into a DoS attack on the monitoring system. The downside of the pull approach is the need to know which exporters have to be scraped. This can be solved by using a service discovery system, i.e., a centralized catalog of the running applications and their respective REST endpoints. The monitoring service also acts as a health checker and can alert in case a monitoring target is unreachable. TEEMon relies on a pull approach (from the PMAG component) to collect the monitoring data from the TEE Metrics Exporter and System Metrics Exporter components.
(iii) Performance Metrics Analysis (PMAN).

Although the PMAG component offers data queries and aggregation functions, it lacks support for metrics visualization and analytics. Thus, we designed the PMAN component to analyze the aggregated data from the PMAG component in real-time, to identify bottlenecks or potential anomalies, and to report them to the visualization component for inspection. In addition, the PMAN component aids the identification of bottlenecks in applications running inside TEE enclaves. Technically, we make use of threshold-based approaches to detect anomalies in monitoring data. We identified these thresholds by benchmarking real-world SGX-based applications. PMAN analyzes the time-series monitoring data using sliding-window computations, e.g., it processes the last five minutes of monitoring data every minute. In each time window, PMAN not only compares the monitoring data with user-defined thresholds to detect anomalies but also provides a box plot for SGX metrics. PMAN supports handling anomalies in several ways, including alerting, dashboard updates, and logging. Administrators or developers of SGX-based applications can use this information to discover or identify the root cause of bottlenecks. Note that PMAN can be further extended to perform more advanced analytics, such as correlating SGX metrics with configuration parameters of applications, or performance prediction.
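A minimal sketch of such a sliding-window threshold check is shown below; the window length, threshold value, and metric semantics are illustrative assumptions rather than TEEMon's actual parameters:

```python
# Sliding-window threshold check over time-series samples, in the spirit
# of PMAN: periodically inspect the last five minutes of data. Window
# size, threshold, and metric name are illustrative assumptions.
from collections import deque
import statistics

WINDOW_SECONDS = 300                      # last five minutes
THRESHOLD = 1000.0                        # e.g., evicted EPC pages per second

samples = deque()                         # (timestamp, value) pairs

def ingest(ts: float, value: float) -> None:
    samples.append((ts, value))
    while samples and samples[0][0] < ts - WINDOW_SECONDS:
        samples.popleft()                 # drop samples outside the window

def check_anomaly() -> bool:
    values = [v for _, v in samples]
    if not values:
        return False
    # Flag the window if its mean exceeds the user-defined threshold.
    return statistics.mean(values) > THRESHOLD
```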
(iv) Performance Metrics Visualization (PMV).

Visualizations are crucial for monitoring services, especially when dealing with complex time-series metrics, to spot underlying issues or to quickly infer performance trends. Additionally, in cases of failures or incidents, it is useful to limit the view of the data to a specific time frame. However, choosing the right type of visual representation is not trivial, as it depends as much on individual preferences as on the metrics a user is trying to visualize. The PMV component currently supports several visualization options, e.g., graphs, histograms, gauges, gradient fills, tables, etc. It is also possible to group different metrics so that metrics from the same service or serving the same purpose are shown on the same dashboard. We provide a set of intuitive graphs and visual representations of the measured and analyzed metrics, while still allowing users the freedom to modify them or add new metrics according to their needs and preferences.

5 Implementation

The current TEEMon prototype supports the Intel SGX TEE, due to its wide adoption in practice, both in academia and in industry. Our design is, however, generic and can be extended to other TEEs. We implemented the SGX Exporter for the TEE Metrics Exporter component. We instrumented the Linux SGX driver and developed several eBPF programs to implement the Performance Metrics Exporter component. Figure 1 shows the architecture of TEEMon. We rely on Prometheus [10] (an open-source time-series database) to implement the Performance Metrics Aggregation component, as well as an ad-hoc Python program used by the Performance Metrics Analysis (PMAN) component to digest raw metrics data and perform threshold analysis. Lastly, we use Grafana [14], a widely used framework, to visualize the collected metrics. We also exploit Grafana for its metric queries, analysis, and various visualization options, which makes it easy to integrate with the Performance Metrics Analysis component.

The combination of multiple system-wide sources for various metrics allows TEEMon to provide broad insight into the monitored system and its executed applications. In comparison to fine-grained performance analysis tools, e.g., perf, TEEMon collects key metrics at a system-wide level with lower granularity, lower frequency, and only little instrumentation, whereas perf-like tools may record every event, function call, or object change at the application level for advanced analysis features such as call graphs. TEEMon does not completely trace an application during its runtime but only captures system-wide key events, i.e., important performance metrics such as an enclave creation or the execution of a system call, thus limiting its impact on overall performance. Additionally, monitored metrics are usually only counted, requiring very little processing at runtime. The data retrieval frequency can be adjusted; by default, the different exporters are queried for data only every 5 seconds. In the remainder of this section, we describe additional implementation details and lessons learned during the process.
5.1 Metrics Exporters

SGX Metrics Exporter. In the implementation of the framework, we decided to initially support the Intel SGX platform, as one of the most mature TEE technologies on the market. To collect the SGX metrics, we instrument the official Intel SGX driver source code at specific function calls. Some of the monitored metrics are listed in §4. All metrics in the SGX driver are instrumented in a similar fashion. The use of module parameters ([64, Chapter 2]) was justified, as there is currently no SGX-enabled instrumentation library supporting the C programming language. This way, for each metric there is a file with the same name in /sys/module/isgx/parameters, which exposes the metric value while the module is loaded and running. Our adaptations to the Intel SGX driver span only 42 lines of code.

While we instrument the SGX driver from Intel, TEEMon only indirectly supports the official Intel SDK for SGX. TEEMon does not monitor OCalls/ECalls specific to the Intel SDK for issuing system calls to the kernel. However, the system calls resulting from OCalls are indeed monitored by TEEMon at the kernel level.

In order to get the SGX metrics into the aggregation component, an interface component is needed. It is implemented in Python and takes advantage of the Flask micro-framework [8, 42] to expose the metrics via a REST/HTTP endpoint. The SGX Exporter reads the metrics and exposes them in the OpenMetrics [17] format at its metrics endpoint. From this endpoint, the aggregation component can pull the performance metrics during runtime.
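A minimal sketch of such an exporter is shown below; the exported parameter names mirror the driver hooks mentioned in Section 4, while the port and error handling are illustrative assumptions:

```python
# Minimal Flask exporter sketch: read SGX driver module parameters and
# expose them in the Prometheus/OpenMetrics text format. The metric
# list and port below are illustrative assumptions.
from pathlib import Path
from flask import Flask, Response

PARAM_DIR = Path("/sys/module/isgx/parameters")
METRICS = ["sgx_nr_free_pages", "sgx_nr_enclaves", "sgx_nr_evicted"]

app = Flask(__name__)

@app.route("/metrics")
def metrics() -> Response:
    lines = []
    for name in METRICS:
        try:
            value = (PARAM_DIR / name).read_text().strip()
        except OSError:
            continue                     # driver not loaded / metric absent
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return Response("\n".join(lines) + "\n", mimetype="text/plain")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9814)   # scraped by Prometheus (PMAG)
```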
System Metrics Exporter (SME).
The SGX driver metrics by themselves are insufficient to show all interactions of a trusted application inside enclaves with the untrusted part of the system. System and infrastructure metrics are additional important cornerstones for monitoring applications running inside Intel SGX enclaves. For instance, the SGX driver only contains the code executed during enclave initialization and the functions controlling the EPC pages, but it does not interact with other parts of an application's execution cycle, such as entering or exiting the enclave, system call handling, etc. These system metrics are provided by three smaller components, each exposing a subset of the overall metrics needed:

eBPF-Exporter.
Our implementation is based on the exporter for custom eBPF metrics by Cloudflare [32]. The exporter contains small eBPF programs written in C (using the BCC toolkit, https://github.com/iovisor/bcc/) to extract metrics during runtime. Using kernel probes and tracepoints, these eBPF programs are executed by the kernel whenever specific events are triggered, enabling us to, e.g., count and report occurrences of page faults during runtime. The metrics currently collected by the eBPF-Exporter consist of system calls, context switches, page faults, and cache statistics; custom eBPF programs can be added if necessary.
Node-Exporter. The node exporter (written in Go, https://github.com/prometheus/node_exporter) is part of the Prometheus project and exports machine metrics available through the /proc and /sys directories in Linux environments. We integrated the node exporter into TEEMon and reduced the reported metrics to CPU statistics, memory statistics, file system statistics, and network statistics.

cAdvisor. To provide utilization metrics for Docker containers, Google created the cAdvisor web service (https://github.com/google/cadvisor). We integrated cAdvisor into TEEMon to collect and store per-container metrics. These metrics are then continuously visualized in a front-end, e.g., Grafana. The provided metrics can be adjusted depending on the desired level of information and the use case [37].
5.2 Metrics Aggregation

Our monitoring system uses Prometheus [10] as the metrics aggregation service, which collects all metric data from multiple sources and generates insights by querying and aggregating the data. In particular, we chose Prometheus for its extensibility, which allows us to gather internal metrics via a REST/HTTP endpoint.
5.3 Metrics Visualization

We make use of Grafana, an open-source tool for metrics visualization and analysis [14], to implement the visualization component of TEEMon. It supports various data sources (e.g., Prometheus, InfluxDB, ElasticSearch, etc.) and a broad set of visualization widgets (e.g., graphs, single stats, tables, heat maps, etc.) integrated into dashboards.

Currently, our prototype consists of three dashboards: (i) an SGX dashboard showing EPC metrics and a selection of metrics provided by eBPF programs, (ii) a Docker dashboard showing performance data provided by cAdvisor for running Docker containers, and (iii) an infrastructure dashboard showing metrics from both the Node-Exporter and the eBPF-Exporter.

Figure 3. Partial screenshot of TEEMon showing SGX-related metrics.

Figure 3 shows a partial screenshot of the SGX dashboard presented by TEEMon. Its front-end allows the user to apply a process filter, e.g., redis-server, to the continuously updated metrics, or to select a desired time range from historical data. The figure presents recorded data for the Redis database during a benchmark, with its two phases (populating the database and executing queries) visible as two consecutive curves. From the graphs, the user can, for example, study the enclave page cache utilization (top row), the occurrences of page faults (bottom right), or the distribution of system calls (middle left) during runtime.
5.4 Kubernetes Integration

TEEMon's components are encapsulated in individual Docker containers for quick deployment on a single host, but they can also be deployed in virtual machines or by an orchestrator such as Kubernetes, an industry standard for container orchestration. Kubernetes features an application-centric design, a well-established API with a uniform set of resources, as well as a powerful ecosystem of third-party tools and extensions. Its controllers allow applications and infrastructures to be defined in a declarative manner, and it supports up to 5000 server nodes in a single cluster [20]. Helm (https://helm.sh), a package manager for Kubernetes applications, utilizes charts as application definitions for Kubernetes. These charts can easily be deployed, managed, and distributed. We created a chart to install TEEMon in large-scale infrastructures managed by Kubernetes. In Kubernetes, each of TEEMon's metrics exporters is deployed (using Helm) in a daemon-like fashion (as a DaemonSet resource).
DaemonSets are deployment configurations that enforce exactly one application instance (pod) running per node in the cluster, including dynamically added nodes. Additionally, Kubernetes offers service discovery and resource annotations, which TEEMon uses to let the performance metrics aggregation component (e.g., Prometheus) periodically scan for running metrics exporters. These two features allow TEEMon to adapt to arbitrary changes in the cluster topology. The TEEMon chart also allows for more advanced scheduling scenarios. For example, in a heterogeneous cluster, specific Kubernetes labels, i.e., taints, can be used to deploy applications based on the availability of hardware features. Thus, TEE-related metrics exporters can be deployed selectively on nodes that support TEEs, e.g., Intel SGX.

6 Evaluation

This section presents the experimental evaluation of TEEMon using real-world applications and various Intel SGX frameworks. After describing our evaluation settings (§6.1), we evaluate the overhead of TEEMon in monitoring applications running inside SGX enclaves, both the overhead of the monitoring components themselves (§6.2) and specific application overheads (§6.3). Additionally, we present findings from continuous code profiling in §6.4. Next, we show the usability of TEEMon in identifying the cause of performance bottlenecks of state-of-the-art Intel SGX frameworks, including SCONE, SGX-LKL, and Graphene-SGX (§6.5). Our evaluation demonstrates that the design of TEEMon is generic and that it can be transparently used across a variety of SGX frameworks without changing their source code.

6.1 Experimental Settings
We used two machines connected via a switched 1 GBit Ethernet network (one hop) to conduct the experiments measuring the performance of the different Intel SGX frameworks using TEEMon. The desktop machine is a Fujitsu ESPRIMO P957/E90+ with an Intel Core i7-6700 CPU and 32 GB of RAM, running Ubuntu 16.04.6 (kernel v4.15.0) on a SATA-based SSD. The server has an Intel Xeon CPU E3-1280 v6 processor and 64 GB of main memory, running Ubuntu 18.04 (kernel v4.15.0-70) with the microcode package 3.20191115 installed to mitigate risks from recent SGX-related attacks [47, 52].
Methodology.
For overhead measurements, we report the average results of 10 runs and examine several configurations for each of the evaluated systems:

• Native SGX (without deploying TEEMon), using the official Intel SGX driver. We use this native version as the evaluation baseline.
• TEEMon with only the Performance Metrics Exporter (PME) component activated (i.e., the eBPF-Exporter and the SGX Metrics Exporter).
• TEEMon with all components activated.
Figure 4. The CPU (a) and memory (b) consumption of TEEMon's components.
6.2 Monitoring Overhead

We begin by evaluating the overhead of TEEMon by measuring the CPU and memory footprint of its components, to better understand the cost of each component. We execute this experiment over 24 hours by deploying TEEMon on the desktop machine to collect the performance metrics of the system, measuring the CPU and memory usage of each module of each component.

Figure 4 (a) shows the CPU utilization of each component in TEEMon. Overall, we observe a modest CPU utilization, at most 3% on average over the 24 hours for the cAdvisor component. We envision future versions of TEEMon where this component can be deactivated to further reduce interference induced by the tool itself.

Next, Figure 4 (b) shows the average memory consumption of each monitoring component. The aggregation component of TEEMon (on top of Prometheus) is the most memory-hungry one. The overall memory footprint of TEEMon is ∼700 MB. While all other components use 100 MB on average, Prometheus allocates 4× as much. This is expected: by design, the aggregator keeps all currently used data in memory for faster retrieval. Prometheus' memory usage can be reduced by setting a cache limit (the default is 2 GB; see https://prometheus.io/docs/prometheus/1.8/storage/).

In summary, the overhead of our monitoring framework is negligible by today's standards, allowing users to deploy it in production environments and continuously analyze monitoring data and performance insights of applications running inside SGX enclaves. As our approach requires no changes to the monitored application and gathers SGX-related statistics at the driver level, no additional memory from the enclave page cache (EPC) is used by TEEMon.
6.3 Application Overheads

We evaluate the overhead induced by TEEMon while monitoring three real-world applications. Specifically, we use a document database (MongoDB v3.6.3 [3]), a web server (NGINX v1.14.0 [61]), and an in-memory key-value store (Redis v5.0.5 [65]). We selected SCONE for this experiment for its ease of use compared to other SGX frameworks. However, we designed and implemented TEEMon such that it can also be used with other SGX frameworks such as SGX-LKL and Graphene-SGX (see §6.5).

Figure 5. The overhead of TEEMon in monitoring various applications. Results are normalized to the native SGX execution.

Figure 6. Occurrences of selected system calls during execution of Redis with specific versions of SCONE: (a) commit 572bda5, (b) commit 09fea91.

We profile these applications in three different configurations: (i) native SGX (Monitoring OFF), (ii) TEEMon with only the PME component (Monitoring OFF + eBPF ON), and (iii) full TEEMon with all components (Monitoring ON). Figure 5 presents our results. We show the average over 10 executions, normalized against the throughput of the native SGX version without TEEMon. The applications' throughput varies from 87% of the baseline execution for NGINX to 95% for MongoDB. The eBPF programs running in the kernel contribute half of the performance drop, and the other TEEMon components contribute the other half. This is expected, since several eBPF programs are attached to frequently used performance counters (e.g., cache misses, cache references, system calls, page faults, and context switches). In some cases (i.e., the number of context switches) we instrumented both hardware and software counters at 99 Hz, and this accounts for some of the added overhead. This extra overhead can be reduced by disabling unnecessary performance counters, reducing the sampling frequency for software counters, or filtering metrics like system calls and context switches to a specified PID. To facilitate filtering, we provide a macro for some of the programs, which can be set in the eBPF configuration file.
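As an illustration of such PID filtering, the sketch below, modeled on common BCC practice rather than TEEMon's actual configuration mechanism, substitutes a concrete PID check for a placeholder macro before the eBPF program is compiled:

```python
# Illustrative PID filter for an eBPF program, in the style of common
# BCC tools: a placeholder macro in the C source is replaced with a
# concrete check before the program is compiled and loaded.
from bcc import BPF

prog_template = r"""
TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    FILTER_PID                            // placeholder, substituted below
    // ... count the event as before ...
    return 0;
}
"""

def load(filter_pid=None):
    check = f"if (pid != {filter_pid}) return 0;" if filter_pid else ""
    return BPF(text=prog_template.replace("FILTER_PID", check))

b = load(filter_pid=1234)                 # only monitor PID 1234
```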
Figure 7. Changes in throughput for Redis at different stages of code evolution, compared to native Redis.
6.4 Continuous Code Profiling

Besides using TEEMon to monitor deployed applications at large scale for overall system health and performance, it can easily be used to track the impact of code changes. Our approach to continuous integration includes compilation and benchmarks of several applications, including Redis, while tracking statistics about system calls, page faults, and EPC pages, provided by TEEMon.

As a showcase, we evaluate two minor releases of SCONE and run the redis-benchmark application (https://redis.io/topics/benchmarks) using TEEMon to monitor its execution. This allowed us to find that the futex and clock_gettime system calls dominated over the read/write system calls used to receive and send data. This indicated a performance bottleneck, as all system calls trigger an expensive enclave exit.

Figure 6 shows the occurrences of selected system calls for two consecutive code commits, as monitored by TEEMon during the execution of Redis compiled with SCONE. Commit 572bda5 precedes commit 09fea91. For commit 572bda5, clock_gettime system calls peaked at over 370000 per second, while read and write system calls were at a tenth of that.

As depicted in Figure 7, the specific code changes resulted in an almost doubled average throughput. The graph shows the performance measurements of Redis using redis-benchmark on a single host. With commit 09fea91, the handling of the clock_gettime system call was improved towards handling it inside the enclave without triggering a system call in the kernel. As a result, the number of clock_gettime system calls to the kernel, and thus of enclave exits, is dramatically reduced, allowing Redis to handle more requests per second. With this optimization, we measured at most 100 clock_gettime system calls per second, while the maximum number of read and write system calls increased from 23 to 32 per second. For commit 572bda5, Redis achieves a throughput of 267952.22 IOP/s; with commit 09fea91, the throughput of Redis increased to 621504 IOP/s.
6.5 Comparison of SGX Frameworks

We conclude our experimental evaluation by showing that TEEMon is designed in a generic way, so that it can be used with different Intel SGX frameworks without changing their source code. We focus on a head-to-head comparison using the Redis in-memory key-value store as the application, deployed and run inside Intel SGX enclaves using several SGX frameworks. In addition, we also demonstrate that, based on the performance metrics and statistics provided by TEEMon, we can identify the cause of bottlenecks in these SGX frameworks.

We benchmarked Redis (v5.0.5) running inside SGX enclaves using SGX-LKL, SCONE, and Graphene-SGX (commits ff8a1a3d, fab5a2b7c, and e98be31 of the respective master branches). These SGX frameworks can run legacy applications with Intel SGX without changing their code, simply by recompiling or relinking Redis using their provided compilation toolchains. While Redis was executed directly on the host, we adapted the configuration of Redis to allow for stable execution with all frameworks. Foremost, we disabled the periodic creation of persistent snapshots, as it requires the availability of the fork() system call within the SGX enclave, which is not available in SGX-LKL and Graphene-SGX. Furthermore, we configured Redis to use at most 1 GB of memory, i.e., the heap size of the enclave configured for all SGX frameworks.

We make use of the memtier_benchmark suite (https://github.com/RedisLabs/memtier_benchmark) to measure the performance of Redis and configure it to use 8 concurrent threads for optimal performance. Hence, the indicated number of connections is always a multiple of 8. First, we pre-populate the database with 720000 keys. During the measurements, the benchmark issues GET requests. memtier_benchmark is configured to use a pipeline of 8 requests and 8 connections per client thread, as these settings provided the best results in preliminary tests. We run experiments with different Redis database sizes (78 MB, 105 MB, and 127 MB) by setting the size of the values (in the key-value messages) to 32, 64, and 96 bytes, respectively. The reason we conducted the experiments with different database sizes is that current SGX hardware supports only ∼94 MB of EPC (see §3) for applications running inside enclaves. When more memory is required, the applications inside enclaves need to perform paging, which is usually very expensive performance-wise.

Next, we first present the performance comparison of Redis running with the different SGX frameworks. Then, we describe how to use the performance metrics captured by TEEMon to identify the bottlenecks of these SGX frameworks.
6.5.1 Performance Comparison

In the following, we discuss the performance measurements for Redis running with Intel SGX using SGX-LKL, SCONE, and Graphene-SGX. Note that the native version in this experiment is vanilla Redis running without Intel SGX.

Figure 8. Throughput comparison between native Redis and Redis with the different SGX frameworks ((a) native, (b) SCONE, (c) SGX-LKL, (d) Graphene-SGX). The total memory usage of Redis is set to sizes of 78 MB, 105 MB, and 127 MB.

Figure 9. Latency comparison between native Redis and Redis with the different SGX frameworks ((a) native, (b) SCONE, (c) SGX-LKL, (d) Graphene-SGX). The total memory usage of Redis is set to sizes of 78 MB, 105 MB, and 127 MB.

Figure 10. Throughput (a) and latency (b) comparison between native Redis and Redis with the different SGX frameworks. The total memory usage of Redis is set to 78 MB.
Throughput.
Figure 8 shows the throughput of Redis using the different SGX frameworks. Native Redis achieves a throughput of 1.01 M IOP/s and above. Redis with SCONE peaks at ∼23% of the native Redis throughput. The throughput of Redis with SCONE drops when the database size increases, due to the EPC limitation of the SGX hardware. Increasing the database size from 78 MB to 105 MB reduces the peak performance of Redis with SCONE by 32 KIOP/s (a decrease of 12%). Further increasing the database size to 127 MB decreases the peak performance by another 29 KIOP/s.

Figure 8 (c) shows the results for the throughput of Redis with the SGX-LKL framework. While Redis with SGX-LKL peaks at 320 connections with 121 KIOP/s (∼10% of the native Redis throughput), our results also show a steep drop in performance of Redis with SGX-LKL at 560 connections, with a steady increase afterward.

Figure 8 (d) shows that, differently from the other SGX frameworks, Graphene-SGX performs best for one client (8 connections) and exhibits reduced performance for more connections. The peak performance of Graphene-SGX was measured at 20 KIOP/s for 8 connections (∼1.6% of the native Redis throughput). Similar to SCONE, Figure 8 (d) shows a drop in the throughput of Graphene-SGX if the database size increases from 78 MB to 105 MB. For a single client, the throughput decreases from 20 KIOP/s to 12 KIOP/s.

Latency.

Figure 9 presents the Redis latency comparison between the different SGX frameworks. As expected, the latency of all evaluated systems increases with the number of connections. At 320 connections, native Redis and Redis with SCONE and SGX-LKL stay at up to ∼20 ms, while Redis with Graphene-SGX reaches ∼249 ms. All latency measurements show an overall similar correlation between the number of connections and the latency. However, Redis with Graphene-SGX imposes a significantly higher latency compared to the other frameworks.

Figure 10 (a) and (b) show a performance comparison of native Redis and Redis with the different SGX frameworks, with a database size of 78 MB and an increasing number of client connections. In general, these results only show the overall performance trends of the SGX frameworks. To understand the behavior of these SGX frameworks, we analyze the detailed performance metrics during runtime and identify the bottlenecks using TEEMon.
Figure 11. Detailed statistics of monitored performance metrics for native Redis and Redis running inside SGX enclaves using different SGX frameworks: (a) user-space page faults per node, (b) page faults per node, (c) last-level cache misses per node, (d) evicted EPC pages per node, (e) context switches monitored per PID, (f) context switches per node. The experiments are conducted with different configurations: 8 connections and 78 MB database size (8 C-S); 8 connections and 105 MB database size (8 C-L); 320 connections and 78 MB database size (320 C-S); 320 connections and 105 MB database size (320 C-L); 580 connections and 78 MB database size (580 C-S); and 580 connections and 105 MB database size (580 C-L).
Figure 11 presents the performance metric statistics of native Redis and Redis with the different SGX frameworks, collected during the benchmarks as described in §6.5.1. All presented statistics are displayed in the same form by the TEEMon front-end during a monitoring session.
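Since TEEMon exports its measurements through Prometheus, the same statistics can also be retrieved programmatically via Prometheus' standard HTTP query API rather than through the front-end. The sketch below is a minimal example of such a query; the metric names (teemon_epc_pages_evicted_total, redis_get_requests_total) and the server address are hypothetical placeholders, not TEEMon's actual identifiers:

```python
# Minimal sketch: query a Prometheus server for a rate-normalized
# metric, analogous to the per-100-requests plots in Figure 11.
# Metric names and the server address are hypothetical placeholders.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:9090"  # assumed Prometheus address

# Evicted EPC pages per 100 GET requests, averaged over one minute.
query = (
    "100 * rate(teemon_epc_pages_evicted_total[1m])"
    " / rate(redis_get_requests_total[1m])"
)

url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": query})
with urllib.request.urlopen(url) as resp:
    body = json.load(resp)

for series in body["data"]["result"]:
    labels, (_, value) = series["metric"], series["value"]
    print(labels.get("instance", "?"), value)
```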
Page Faults.
Figure 11 (a) and Figure 11 (b) show the page faults in user space for Redis and the total page faults per host during the benchmark, respectively. The user-space page faults include: no_page_found, write_prot_fault, write_fault, and instr_fetch_fault.

While Figure 11 (a) indicates an overall low rate of user-space page faults, it also shows that native Redis causes no page faults in user space. For the SGX frameworks, the rate of user-space page faults increases with database sizes exceeding the EPC size (∼94 MB). This happens when the SGX-enabled Redis reads data that was previously swapped out of the EPC and is unavailable for the current request. For 320 and 580 connections with the database size of 105 MB, Redis with SCONE reaches peaks of 0.069 and ∼0.03 page faults per 100 requests, respectively. While SCONE and SGX-LKL introduce negligible page faults (i.e., almost none) with the database size of 78 MB, which fits into the EPC, the measurements show that Graphene-SGX still incurs 0.02 page faults per 100 requests.

In contrast to the low rate of user-level page faults, Figure 11 (b) shows that considerably more page faults are registered at host-wide scope. Native Redis has 607 total page faults per 100 GET requests for 8 connections; however, this number decreases (<170 page faults) for larger numbers of connections. This closely follows the finding that few connections lead to context switches in native Redis. While SCONE and SGX-LKL exhibit page fault rates increasing from 500 to 2200 total page faults per 100 GET requests, Graphene-SGX has a significantly higher number of total page faults: for 580 connections and a database size of 105 MB, Graphene-SGX incurs 8996 total page faults per 100 requests on average.
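The per-process fault counters behind such plots are exposed by the Linux kernel through procfs, so they can be sampled without any special tooling. The following is a minimal sketch for a given PID; normalizing to 100 requests would use the request count reported by the benchmark client over the same interval:

```python
# Minimal sketch: sample the minor/major page-fault counters of a
# process from /proc/<pid>/stat, the same kernel counters that
# kernel-level monitoring probes build on.
import sys
import time

def page_faults(pid: int) -> tuple[int, int]:
    """Return (minor_faults, major_faults) for a PID."""
    with open(f"/proc/{pid}/stat") as f:
        # The comm field may contain spaces, so split after ')'.
        rest = f.read().rsplit(")", 1)[1].split()
    return int(rest[7]), int(rest[9])  # stat fields 10 (minflt), 12 (majflt)

pid = int(sys.argv[1])
minor0, major0 = page_faults(pid)
time.sleep(10)  # sampling interval
minor1, major1 = page_faults(pid)
print(f"minor: {minor1 - minor0}, major: {major1 - major0} faults in 10 s")
```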
Last Level Cache Misses.
Figure 11 (c) illustrates the last-level cache (LLC) misses during the benchmark. Compared to native Redis, all SGX frameworks induce an elevated rate of LLC misses. With native Redis we observe 1–23 LLC misses per 100 GET requests. SCONE and SGX-LKL achieve similar (yet higher) rates, i.e., 29 to 103 LLC misses per 100 GET requests. Graphene-SGX has the highest LLC miss rate: 91 for 8 connections and a 78 MB database size, and up to 161 LLC misses for 580 connections with a 105 MB database size (per 100 GET requests).
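LLC misses like these are hardware performance-counter events and can be sampled per process with the standard Linux perf tool. A minimal sketch follows; the exact event names vary across CPU models:

```python
# Minimal sketch: count last-level-cache misses of a running process
# for ten seconds using Linux perf (event availability is CPU-specific).
import subprocess
import sys

pid = sys.argv[1]
# "-x ," selects CSV output; perf stat reports its counts on stderr.
result = subprocess.run(
    ["perf", "stat", "-x", ",",
     "-e", "LLC-load-misses,LLC-store-misses",
     "-p", pid, "--", "sleep", "10"],
    capture_output=True, text=True,
)
for line in result.stderr.splitlines():
    if "LLC" in line:
        count, _, event = line.split(",")[:3]
        print(event, count)
```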
Evicted EPC Pages.
Figure 11 (d) shows the measured evicted pages from the enclave page cache (EPC). Graphene-SGX has at most 0.02 evicted pages per 100 GET requests for the database size of 78 MB, which fits in the EPC. For the database size of 105 MB, which exceeds the EPC, Graphene-SGX exhibits a correspondingly higher eviction rate.
Context Switches.
A common cause of SGX performance overhead is costly enclave transitions. The context switches were filtered by PID to make it easier to monitor specific applications in the system. Figure 11 (e) shows these results and indicates that, per 100 GET requests, Redis with SGX-LKL incurs the most context switches. Native Redis, in turn, exhibits 0.14 context switches per 100 requests for the evaluation with just 8 connections. Since Redis uses an event queue, and in combination with the findings shown in Figure 8 (a), we conclude that, for 8 connections, Redis often waits (sleeps) for new messages and thereby causes context switches. With the exception of Graphene-SGX, SCONE and SGX-LKL show a similar pattern for 8 connections.

Figure 11 (f) shows the number of total context switches on the host while the GET requests are issued. The figure suggests that the total (host-wide) number of context switches of Redis with Graphene-SGX increases dramatically (up to 12×) compared to Redis with the other SGX frameworks and native Redis. For 580 connections with a database size of 105 MB, Redis with Graphene-SGX has 304 context switches per 100 GET requests, while native Redis has only 37. SCONE and SGX-LKL show a similar pattern to native Redis, with at most 125 context switches per 100 GET requests. We believe that Graphene-SGX has the lower performance shown in Figure 10 because it incurs significantly more context switches than the other frameworks, as reported by TEEMon.

Note that Figure 11 (e) shows only the context switches of the Redis process itself, including its threads, while Figure 11 (f) shows the total (host-wide) number of context switches, which includes context switches between kernel processes as well as context switches to the ksgxswapd (Intel SGX swapping daemon) process; a minimal sketch of reading both counters follows at the end of this section.

In summary, this experiment shows that TEEMon provides detailed runtime performance data (e.g., cache misses, context switches, page faults, and evicted EPC pages) of applications (e.g., Redis) running inside Intel SGX, which helps us to understand the performance behavior of these applications. The performance metrics presented by TEEMon help developers using SGX frameworks to identify performance issues and provide guidance for improving these frameworks, especially with regard to scarce resources such as EPC memory and the expensive enclave exit and enter operations (due to system calls). This is achieved by presenting valuable graphs that show, e.g., a high rate of clock_gettime system calls dominating the desired read/write system calls for network I/O. While the different metrics could in principle be gathered individually with different tools, TEEMon provides a single front-end for the continuous and effortless monitoring of applications to analyze their behavior in a production-ready environment.
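To make the distinction between the per-PID counters of Figure 11 (e) and the host-wide counter of Figure 11 (f) concrete, the minimal sketch below reads both directly from procfs:

```python
# Minimal sketch: read the two kinds of context-switch counters
# discussed above, per task (cf. Figure 11 (e)) and host-wide
# (cf. Figure 11 (f)), directly from procfs.
import sys

def task_ctxt_switches(pid: int) -> int:
    """Voluntary + involuntary context switches of one task; for a
    multi-threaded process, sum over /proc/<pid>/task/*/status."""
    total = 0
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("voluntary_ctxt_switches",
                                "nonvoluntary_ctxt_switches")):
                total += int(line.split()[1])
    return total

def host_ctxt_switches() -> int:
    """Host-wide context switches since boot, including kernel threads
    such as ksgxswapd."""
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt "):
                return int(line.split()[1])
    raise RuntimeError("no ctxt line in /proc/stat")

pid = int(sys.argv[1])
print("task:", task_ctxt_switches(pid))
print("host:", host_ctxt_switches())
```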
Conclusion
This paper described TEEMon, a real-time performance monitoring framework for applications running inside TEE enclaves. TEEMon is independent of the specific secure execution platform and of the applications running on top of it, while offering a wide range of performance metrics. Furthermore, it natively supports micro-service architectures, since all components of TEEMon can be deployed as Docker containers.

We evaluated TEEMon using real-world applications and state-of-the-art SGX frameworks. Our evaluation shows that TEEMon incurs a low overhead, ranging from 5% to 17% depending on the running application, while providing valuable insights into the measured performance. TEEMon additionally offers a visualization dashboard to inspect, in real time, the behavior of the systems being monitored. The evaluation also shows that TEEMon can be used with many different Intel SGX frameworks without changing the frameworks' code. The performance metrics provided by TEEMon allow users to pinpoint bottlenecks and performance issues of applications running inside enclaves using different SGX frameworks, and to identify the causes of these bottlenecks in frameworks including SCONE, SGX-LKL, and Graphene-SGX. Finally, TEEMon has been integrated with Kubernetes to monitor TEE applications running in distributed clusters.

In the future, we will extend and improve TEEMon to offer additional information as well as more flexibility to the user. The probes that TEEMon uses to gather kernel data are currently fixed; in the future, on-demand loading should be possible. TEEMon will be made available to the research community.
Acknowledgements.
We thank our shepherd, Professor Tim Wood, and the anonymous reviewers for their work and helpful comments, as well as Rasha Faqeh, Anna Galanou, and Fábio Silva for their feedback and contributions. This work was funded by the German Research Foundation, Project-ID 174223256 (TRR 96), and the European Union's Horizon 2020 research and innovation programme under the LEGaTO project (legato-project.eu), grant agreement No. 780681.