Slicing the IO execution with ReLayTracer
Ganguk Lee
Yeaseul Park
Jeongseob Ahn
Ajou [email protected]
Youngjin Kwon
Abstract
Analyzing IO performance anomalies is a crucial task in various computing environments, ranging from large-scale cloud applications to desktop applications. However, the IO stack of modern operating systems is complicated, making it hard to understand performance anomalies with existing tools. Kernel IO executions are frequently interrupted by internal kernel activities, requiring a sophisticated IO profiling tool to deal with the noise. Furthermore, complicated interactions of concurrent IO requests cause different sources of tail latencies in the kernel IO stack. As a consequence, developers want a fine-grained latency profile across IO layers, which may differ for each IO request. To meet these requirements, this paper proposes ReLayTracer, a per-request, per-layer IO profiler. ReLayTracer enables detailed analysis to identify root causes of IO performance anomalies by providing per-layer latency distributions of each IO request, hardware performance behavior, and the time spent by kernel activities such as interrupt handling.
Introduction

Understanding IO performance problems is challenging. The performance of kernel IO stacks is affected by underlying hardware behaviors such as CPU cache locality [15, 17, 19]. These hardware behaviors add unexpected delays to kernel IO executions, causing high performance variation and tail latency. Also, kernel IO executions are often interrupted by internal kernel activities such as interrupts, exceptions, or scheduler preemptions, making it hard for developers to pinpoint the root cause of an unexpected performance anomaly.

What makes these cases harder is that developers require fine-grained profiling information from the complex kernel IO stack. The modern IO stack is built as a set of abstraction layers, each of which has different performance characteristics. Developers want to profile the latency breakdown of each layer to identify performance bottlenecks. Furthermore, they would like to know the latency distribution of each IO request to understand which request causes a tail latency and which layer causes the slowdown compared to other requests.
Figure 1: Read latency breakdown of each layer (io, req, blk, cpy, mm, fs, vfs). The x-axis is read requests; the y-axis is latency (ns).

In response to the requirements of kernel IO profiling, many research and practical tools make an effort to provide useful information. They provide a latency breakdown of the entire kernel IO stack or the block layer, but they do not give detailed latency distributions of individual requests [12, 14, 16, 18] and require massive kernel changes to collect trace data [11-13].

In this paper, we introduce a profiling tool for analyzing kernel IO performance in detail, called ReLayTracer (per-Request per-Layer tracer). ReLayTracer provides the latency distribution of each kernel request along with fine-grained information such as a per-abstraction-layer latency breakdown. To that end, ReLayTracer maps each kernel request (e.g., system call) to a request ID and tracks the request ID across the abstraction layers. By tagging each IO request with a request ID, which propagates across IO layers, ReLayTracer can report a latency breakdown of each layer of an individual IO request. Figure 1 shows the latency breakdown of IO layers profiled by ReLayTracer. Peaks show tail latency, and Figure 1 shows where the latency peaks happen among the seven layers. In Linux, a background kernel thread performs device IO asynchronously. ReLayTracer traces the off-CPU event made by an IO request using the request ID. To provide a precise latency breakdown, ReLayTracer analyzes unexpected delays made by internal kernel activities and accounts for them separately. Also, ReLayTracer monitors hardware performance behavior (e.g., IPC) along with a latency profile of each layer, supporting reasoning about tail latency.

Figure 2: ReLayTracer architecture. (The front end, an agent with a dynamic kernel code generator, runs in the kernel of the target system; the back end comprises the trace framework with a logger and a parser, an analyzer, and a visualizer, driven by a profiling script.)

ReLayTracer adopts a split architecture consisting of a front end and a back end. The front end of ReLayTracer runs with the target system to profile, collecting data with minimal runtime overhead. The back end of ReLayTracer, working as separate processes, processes the data collected by the front end and visualizes the processed data with graphs. The front end of ReLayTracer leverages the dynamic instrumentation frameworks [1, 2, 4, 6, 8] supported in most modern operating systems to minimize profiling overheads (low overhead) by tracing only the execution points of interest. With dynamic instrumentation, ReLayTracer can profile any kernel subsystem (versatility) and easily adapt to kernel code changes (portability). ReLayTracer provides fine-grained performance profiling with 3-10% runtime overhead for random reads and 0.1% for sequential reads.

This paper makes the following contributions:

• By tracing each IO request with a request ID, ReLayTracer can trace per-layer latency on the path of executing an individual IO request.

• By monitoring internal kernel activities, ReLayTracer can separately account for the interference caused by the kernel activities.

• By bundling software and hardware performance information, ReLayTracer can precisely analyze Linux IO performance.

The current scope of the work focuses on the read path of IO. Adding support for other IO system calls is future work.
Design

The goal of ReLayTracer is to provide precise and fine-grained latency information of kernel IO subsystems with minimal performance overhead. ReLayTracer starts tracing on each system call and generates trace data profiling the timing of kernel IO layers. With the trace data, ReLayTracer computes a fine-grained latency breakdown of each IO layer. To minimize the runtime overhead on a target system, ReLayTracer leverages an existing lightweight instrumentation framework [1, 2, 4-10]. The number of trace points directly affects the runtime overhead of a target system. Kernel instrumentation frameworks have the flexibility to control the number of trace points, enabling ReLayTracer to manage the runtime overhead. The remainder of this section describes the design of ReLayTracer by giving a case study of the read system call.

Figure 3: Profile script describing probe points and probe handlers. A probe handler generally consists of getting the rid, collecting trace data into a log, and writing the log to the ring buffer. The script takes the following form:

    functions: {
        entry_func: { name: "foo", pre_hdlr: "..", post_hdlr: "..",
                      perf_event: { type_id: "event_id", ... } },
        exit_func:  { name: "bar", ... },
        others:     { { name: "..", pre_hdlr: "..." }, { ... } }
    }
ReLayTracer comprises a front end and a back end. Figure 2 shows the overall architecture of ReLayTracer. The front end is planted into a target system and collects tracing data; the back end processes the data obtained from the front end and visualizes them.
Leveraging instrumentation.
Modern operating systems support kernel dynamic instrumentation [1, 2, 4, 6, 8]. Kernel dynamic instrumentation allows injecting user-defined code into in-memory kernel code at function boundaries (i.e., entering and exiting a function) or at arbitrary lines (called probe points). When a kernel execution hits an installed probe point, the instrumentation framework calls a procedure (called a probe handler) registered with the framework. The front end takes advantage of existing kernel dynamic instrumentation frameworks to install arbitrary probe points into a running kernel [4, 5, 7-10]. The front end takes a profile script to install probe points. In Figure 3, we show an example of the profile script. The profile script is written in a domain-specific language describing the probe points of each kernel layer as a list of kernel functions, entry and exit actions defined as probe handlers, and hardware events [7] to monitor. The tracing framework compiles the profile script to a binary and inserts it into kernel memory using a kernel dynamic instrumentation framework.
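The paper leaves the handler mechanics to the frameworks it cites. As a concrete illustration, the following is a minimal, self-contained kernel-module sketch using the stock Linux kretprobe API, measuring the entry-to-exit latency of vfs_read; the probed symbol and the pr_info reporting are our choices for illustration, not ReLayTracer's.

    #include <linux/module.h>
    #include <linux/kprobes.h>
    #include <linux/ktime.h>

    /* Entry handler: stash a timestamp in per-instance scratch space. */
    static int entry_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
    {
        u64 *ts = (u64 *)ri->data;
        *ts = ktime_get_ns();
        return 0;
    }

    /* Exit handler: report elapsed time for this invocation. */
    static int exit_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
    {
        u64 entry_ns = *(u64 *)ri->data;
        pr_info("vfs_read latency: %llu ns\n", ktime_get_ns() - entry_ns);
        return 0;
    }

    static struct kretprobe rp = {
        .kp.symbol_name = "vfs_read",
        .entry_handler  = entry_handler,
        .handler        = exit_handler,
        .data_size      = sizeof(u64),  /* room for the entry timestamp */
        .maxactive      = 64,           /* concurrent invocations to track */
    };

    static int __init probe_init(void)  { return register_kretprobe(&rp); }
    static void __exit probe_exit(void) { unregister_kretprobe(&rp); }

    module_init(probe_init);
    module_exit(probe_exit);
    MODULE_LICENSE("GPL");

Because such probes are installed and removed at runtime, the profiled kernel never needs to be rebuilt, which is what gives the front end its portability.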
The primary goal of the front end is to install probe points and probe handlers that collect timing information of each IO layer in each IO request. The front end installs probe points in the execution paths of system calls, off-CPU events, and interrupt service routines.
Tracing system calls.
Because the system call is the way of entering the kernel space, we first focus on how we trace system calls to get the per-request latency distribution. An IO system call extends across multiple layers of kernel components. For example, in Linux, a read system call travels through the VFS and page cache layer, the file system layer, the block IO layer, and the device driver layer. Each layer has different performance characteristics, such as memory copy bandwidth or slowdown caused by lock contention. In addition, concurrent system calls make the analysis more complicated because it is hard to distinguish which system call is executing without complex function parameter analysis and thread id matching, which cause non-negligible overhead in IO-intensive workloads.

Figure 4: Request id propagation through function calls. The rid is assigned in vfs_read(struct file *, ...) (VFS layer) via rid = random_get_entropy() and flows through generic_file_buffered_read(struct kiocb *, ...) (MM layer), ext4_mpage_readpages(..., struct page *, ...) (FS layer), submit_bio(struct bio *) (BLOCK layer), and blk_account_io_start(struct request *, ...) (REQUEST layer). After the request is queued at the request layer, the SCSI driver keeps checking the queue and handles the queued requests.

To provide lightweight and fine-grained tracing, ReLayTracer associates a request id (or rid) with each IO system call to trace. The rid is a monotonically increasing timestamp and serves as a unique identifier of an IO system call. To track individual IO system calls across the layers, ReLayTracer propagates the assigned rid for each read system call to the lower layers of the kernel, as depicted in Figure 4. The value of the rid is initially stored in struct file (VFS layer) and transferred to struct kiocb (memory management layer) and the subsequent layers. To implement the rid propagation, we slightly modify kernel data structures to hold the rid and add code to maintain the rid value when a layer is switched. We use a kernel instrumentation framework to change 29 lines of code in the Linux kernel for tracing read IO (7 lines for rid fields in structs and 22 lines for rid propagation). This minimizes performance and space overhead when tracing a large volume of IO requests. ReLayTracer can work with any instrumentation framework that supports on-the-fly modification of kernel code.
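As a concrete illustration of the scheme, here is a self-contained user-space model of rid propagation. The struct names mirror the kernel types in Figure 4, but the fields and helper functions are hypothetical stand-ins; the real change instead adds a rid field to the kernel's own structures.

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    struct file  { uint64_t rid; };  /* stands in for the VFS-layer struct file  */
    struct kiocb { uint64_t rid; };  /* stands in for the MM-layer struct kiocb  */
    struct bio   { uint64_t rid; };  /* stands in for the block-layer struct bio */

    /* A monotonically increasing timestamp serves as the unique request id. */
    static uint64_t new_rid(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    }

    static void submit_bio_model(struct bio *b)       /* block layer */
    {
        printf("block layer sees rid %llu\n", (unsigned long long)b->rid);
    }

    static void buffered_read_model(struct kiocb *iocb)  /* MM layer */
    {
        struct bio b = { .rid = iocb->rid };   /* rid crosses MM -> block */
        submit_bio_model(&b);
    }

    static void vfs_read_model(struct file *f)        /* VFS layer */
    {
        struct kiocb iocb = { .rid = f->rid }; /* rid crosses VFS -> MM */
        buffered_read_model(&iocb);
    }

    int main(void)
    {
        struct file f = { .rid = new_rid() };  /* tag the request at entry */
        vfs_read_model(&f);
        return 0;
    }

Because every layer stores the same rid, a probe at any depth can stamp its log record with the identity of the originating system call at the cost of a single field copy.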
Tracing off-CPU events.
In addition to system calls, ReLayTracer supports profiling off-CPU events. Some IO system calls execute part of their code paths on a different CPU, which makes it hard to distinguish the origin of an IO request. For example, read or write system calls hand a block IO request to a kernel IO thread and wait until the kernel IO thread completes the request. The off-CPU execution path must be added to the total latency of the system call. To associate a system call with its off-CPU execution, ReLayTracer transfers the rid to off-CPU kernel functions. Generally, there is a shared data structure available to the off-CPU handler (e.g., struct bio and struct request in Figure 4). ReLayTracer installs probe points on the off-CPU kernel events to profile and delivers the rid to an off-CPU probe point via a data structure passed as a parameter.
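The paper does not list the off-CPU probe points themselves. One plausible sketch, in bcc-style eBPF C, keys on the struct request pointer that carries the rid downward; the attachment points (e.g., blk_account_io_start and its completion counterpart) are assumptions, not ReLayTracer's actual choices.

    #include <uapi/linux/ptrace.h>
    #include <linux/blkdev.h>

    BPF_HASH(start_ns, struct request *, u64);

    /* Attached where the request enters the block layer: the request
     * begins its off-CPU phase here. */
    int trace_io_start(struct pt_regs *ctx, struct request *req)
    {
        u64 ts = bpf_ktime_get_ns();
        start_ns.update(&req, &ts);
        return 0;
    }

    /* Attached at IO completion: the elapsed time is off-CPU device time,
     * which the analyzer accounts separately from on-CPU layer latency. */
    int trace_io_done(struct pt_regs *ctx, struct request *req)
    {
        u64 *tsp = start_ns.lookup(&req);
        if (tsp) {
            u64 delta = bpf_ktime_get_ns() - *tsp;
            bpf_trace_printk("off-CPU block time: %llu ns\n", delta);
            start_ns.delete(&req);
        }
        return 0;
    }

Since the same struct request is visible to both probes, no thread-id matching is needed to pair the start and completion events, mirroring how the rid rides inside shared data structures.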
Tracing kernel activities.
Tracing only the execution paths of system calls produces an inaccurate profile because kernel execution can be interrupted by various external events. For example, interrupt handling during a read system call causes a delay in a layer. If probe points were installed only at the function boundaries of a system call execution path, that delay would not be accounted for. To sort out the delays made by kernel activities, ReLayTracer also installs probe points on interrupt handling routines, schedule-in and schedule-out handlers, and the wakeup points of background kernel threads, recording occurrences of each event and its timing information.

ReLayTracer installs a probe handler, described in Figure 3, on each probe point. A probe point can be either an entry or an exit of a function, or both, based on the profile description in Figure 3. A probe handler creates a trace log consisting of the rid, a timestamp, and hardware performance counter values, and writes the log to the ring buffer.
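Following the three steps in Figure 3 (get the rid, read the timestamp and hardware counter, write to the ring buffer), a generated probe handler might look like the bcc-style eBPF C sketch below. The trace_log layout, the layer encoding, and the rid-as-first-argument convention are illustrative assumptions rather than ReLayTracer's actual format.

    #include <uapi/linux/ptrace.h>

    struct trace_log {
        u64 rid;       /* request id propagated from the upper layer */
        u64 ts_ns;     /* timestamp taken inside the handler */
        u64 cycles;    /* hardware counter sample (used to derive CPI) */
        u32 layer;     /* which kernel layer this probe belongs to */
    };

    BPF_PERF_ARRAY(hw_cycles, 128);  /* per-CPU hardware cycle counter */
    BPF_PERF_OUTPUT(ring);           /* ring buffer drained by the logger */

    int probe_entry(struct pt_regs *ctx, u64 rid)
    {
        struct trace_log log = {};
        int cpu = bpf_get_smp_processor_id();

        log.rid    = rid;                        /* step 1: get rid */
        log.ts_ns  = bpf_ktime_get_ns();         /* step 2: read time_stamp */
        log.cycles = hw_cycles.perf_read(cpu);   /* step 2: read hw_perf_value */
        log.layer  = 1;                          /* e.g., the VFS layer */

        ring.perf_submit(ctx, &log, sizeof(log)); /* write log to ring_buffer */
        return 0;
    }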
The goal of the back end is to compute per-layer latency breakdowns from the raw log records traced by the front end. The back end aggregates the log records by request id and then invokes an analyzer for each request id. Along with a profile description, the analyzer takes a layer description, which specifies layer types and how to separate the layers given function names. With the layer description, the analyzer computes the per-layer latency of each system call request that shares the same rid in the log records. If a traced function f calls a function f' belonging to a lower layer, the analyzer subtracts the time taken by f' when computing the latency of f. The analyzer also checks whether an off-CPU event happens in the middle of the execution of a traced function. If such an event happens, the analyzer accounts separately for the latency caused by the off-CPU event when computing the latency of the traced function. When a hardware event (e.g., CPI) is recorded in a traced function, the analyzer correlates the hardware event with the traced function and records the hardware event data alongside the computed latency. Finally, the analyzer stores the processed data in an intermediate format, and the visualizer draws a figure that can be rendered in web browsers.
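The analyzer's interval arithmetic can be sketched in a few lines of self-contained C; the span layout and layer names below are illustrative, not the back end's actual format.

    #include <stdio.h>
    #include <stdint.h>

    struct span { const char *layer; uint64_t start, end; };

    int main(void)
    {
        /* Entry/exit timestamps for one rid, ordered by nesting:
         * vfs encloses mm, which encloses blk. */
        struct span spans[] = {
            { "vfs", 100, 900 },
            { "mm",  150, 850 },
            { "blk", 200, 700 },
        };
        int n = sizeof(spans) / sizeof(spans[0]);

        for (int i = 0; i < n; i++) {
            uint64_t self = spans[i].end - spans[i].start;
            if (i + 1 < n)  /* subtract the directly nested lower layer */
                self -= spans[i + 1].end - spans[i + 1].start;
            printf("%s: %llu ns\n", spans[i].layer, (unsigned long long)self);
        }
        return 0;
    }

Off-CPU and interrupt spans recorded by the kernel-activity probes would be deducted the same way before the per-layer numbers are reported.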
Case studies

In this section, we explore possible application scenarios of ReLayTracer to understand kernel behaviors at a fine-grained level. Our prototype leverages the eBPF framework supported by standard Linux to make use of the tracing facility [1]. The target system consists of an Intel Xeon E5-2695 v4 CPU, 128 GB of DDR4 memory, and a 256 GB SATA SSD. We use an Ubuntu 16.04 machine with a modified kernel version 4.15 and the ext4 filesystem. We run various workloads of synchronous read I/O operations generated by the FIO benchmark [3] on the target system to observe the effectiveness of ReLayTracer in terms of its per-request and per-layer profiling facility. First, we examine single-threaded sequential and random read access patterns; then we move on to a multi-threaded benchmark study.

Figure 5: Per-request latency (solid) and CPI (dotted) for single-threaded sequential read requests in chronological order. Only page cache hit cases are shown. Each latency is broken down into four kernel layers: vfs, mm, cpy, and irq.
In general, when the Linux operating system receives a read request from a user, it first looks it up in the in-memory cache, called the page cache, to check whether the requested data is already in memory. Based on this mechanism, we categorize read system calls into two classes: page cache hits and page cache misses. In the case of a page cache hit, the read request is served rapidly without traversing multiple kernel layers. On the other hand, read system calls that experience page cache misses are more time-consuming because they go down to the lower kernel layers to access the storage medium and bring the requested page into memory.
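Given per-rid trace records, this classification falls out of the data: a request whose trace never reached the block layer was served from the page cache. A minimal sketch, assuming each span carries a layer tag:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    struct span { const char *layer; unsigned long long start, end; };

    /* A request is a page cache hit if its trace never reached the block layer. */
    static bool is_page_cache_hit(const struct span *spans, int n)
    {
        for (int i = 0; i < n; i++)
            if (strcmp(spans[i].layer, "blk") == 0)
                return false;  /* reached the block layer: a miss */
        return true;           /* served entirely from memory */
    }

    int main(void)
    {
        struct span hit[] = { { "vfs", 0, 9 }, { "mm", 1, 8 }, { "cpy", 2, 7 } };
        printf("hit? %d\n", is_page_cache_hit(hit, 3));  /* prints: hit? 1 */
        return 0;
    }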
Tail latency within page cache hits.
We expect that the latency variation among read requests would not be significant if the requested data reside in the page cache. Interestingly, however, ReLayTracer reports that latency still varies widely within page cache hit cases. Figure 5 shows per-request read system call latency (for page cache hits) when a file is read sequentially by a single thread. The x-axis represents individual requests in chronological order. For most requests, the latency is measured at around 18,000 ns, but we observe several noticeable spikes that rise above 25,000 ns. To identify the cause of these tail latencies, we additionally analyze the CPI (cycles per instruction) provided by the ReLayTracer front end for each request, depicted by the dotted line in the same figure. As the latency fluctuates, the CPI follows the same trend. Using the per-layer facility of ReLayTracer, we dissect the latency into four layers: virtual file system (vfs), memory management (mm), page copy (cpy), and interrupt request handling (irq). The data visualization output by our tracer shows that the fluctuation does not occur in one particular layer; instead, different components are responsible for the slowdown in each request. The vfs layer turns out to be the causative factor of tails ❶ and ❹. Similarly, the cpy layer was the bottleneck in tails ❷ and ❻. The layer breakdown utility shows that tail ❸ additionally contains the irq layer. In fact, the high latency in this particular request was due to a hardware interrupt for IO completion that occurred in the mm layer. The irq not only adds an extra layer to the request but also introduces a large context switch overhead in the interrupted layer. ReLayTracer allows in-depth analysis to locate unusual latency spikes even within page cache hits and to further diagnose the root cause of each at the hardware and software levels.

Figure 6: Per-layer breakdown of latency for single-threaded random read requests in chronological order. (Layers shown: Virtual File System, Memory Management, File System, Block Request, Data I/O, and Page Copy.)

Tail latency within page cache misses.
We also examine a scenario where the read system calls do not benefit from the page cache due to a random access pattern. Figure 6 shows the per-layer breakdown of request latency for single-threaded random reads on a file. When read requests are made to random positions in a file, most requests experience page cache misses and suffer much higher latency because they penetrate through the deeper layers down to storage to bring data into memory. A large portion of the time is spent performing disk I/O, as depicted by the thick green layer (io) in Figure 1. Figure 6 corresponds to the layer decomposition of the requests in the spiking region (requests 3560-3600) of Figure 1. By breaking each request down into units of a kernel layer, it is possible to observe that each burst is caused by different layers, even within a small interval of 40 requests. The largest three spikes (Ⓐ) are due to delay in the disk I/O layer, but the smaller spikes that appear immediately before and after (Ⓑ) are caused by an equally proportional slowdown in all other six layers. Since the random read requests do not have access locality, we anticipate that there is no significant latency difference among them. However, ReLayTracer informs us that multiple tail latencies can occur in a batched manner within a small time interval and that different layers contribute to each tail.

Figure 7: 99th-percentile latency within each time interval for threads T1, T2, T3, and T4 simultaneously performing random read operations on a single file. A total of 90,000 requests is divided into 15 intervals, each containing 6,000 requests.

Table 1: Number of IRQ events handled by each thread and the total time spent handling IRQs during execution.
In this subsection, we examine the fairness of issuing random read operations from multiple threads. As each thread makes its read requests, the requests end up mixed across the underlying kernel layers. With its ability to measure latency on a per-request basis, however, ReLayTracer is able to retrieve the owner thread of each request. Figure 7 shows the 99th-percentile latency of each thread over time, grouping 6,000 requests into each time interval (window), with each bar representing an individual thread. From window 5 to 13, we observe that T1 and T3 experience much longer tail latency. In fact, we identify that the burden of IRQ handling was unfairly distributed across threads. Certain threads served many more IRQ requests than the others and had to spend up to 60x more time on IRQ handling, as shown in Table 1.
    Probe depth    L1    L2    L3    L4    L5    L6    L7    L8
    Overhead (%)   3     3.3   6     6.1   8.3   8.4   9.5   10.7

Table 2: Bandwidth degradation with increasing probe depth when tracing 8 threads simultaneously performing random reads over a total of 2 GB of data.
We end the section by evaluating the performance overhead of ReLayTracer. First, we measure the read throughput (bandwidth) for sequential read access patterns with 8 threads. In this case, the threads do not experience any throughput degradation when running with ReLayTracer. Since most of the read operations are served by the page cache, the depth of the IO path for most requests is small. As a result, the overhead introduced by the additional CPU cycles of the tracer is negligible.

Next, we measure the overhead of ReLayTracer for random read accesses by 8 threads. Table 2 shows the throughput overhead with increasing depth of probed layers. As the probe depth increases, the number of probe points also increases. For random accesses, the page cache does not help, so all possible tracing points across the 8 layers are reached frequently, resulting in non-negligible overhead. At the minimum depth (L1), the bandwidth degradation is only 3%, but it can increase up to 10.7% at the maximum depth (L8). This result depicts the tradeoff between profiling depth and overhead. In response, ReLayTracer provides a control knob for users to adjust the profiling granularity.

Conclusion

To identify the root causes of performance anomalies in kernel IO, ReLayTracer provides fine-grained performance profiling of kernel IO stacks. ReLayTracer tags each IO request with a request ID and traces the request ID through the IO layers, measuring the precise latency of each IO layer the request travels. ReLayTracer leverages existing kernel instrumentation frameworks to dynamically install trace points and collects latency information within 3-10% overhead.
References

[1] BPF: the universal in-kernel virtual machine. https://lwn.net/Articles/599755/.

[2] DTrace. http://dtrace.org/blogs/.

[3] fio: Flexible I/O tester rev. 3.13. https://fio.readthedocs.io/en/latest/fio_doc.html.

[4] ftrace: A look at ftrace. https://lwn.net/Articles/322666/.

[5] Kernel probes.

[6] LTTng: an open source tracing framework for Linux. https://lttng.org/.

[7] perf: Linux profiling with performance counters. https://perf.wiki.kernel.org/index.php/Main_Page.

[8] SystemTap. https://sourceware.org/systemtap/documentation.html.

[9] Uprobe-tracer: Uprobe-based event tracing.

[10] Using the Linux kernel tracepoints.

[11] F. Doray and M. Dagenais. Diagnosing performance variations by comparing multi-level execution traces. IEEE Transactions on Parallel and Distributed Systems, Feb 2017.

[12] Nikolai Joukov, Avishay Traeger, Rakesh Iyer, Charles P. Wright, and Erez Zadok. Operating system profiling via latency analysis. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI '06, 2006.

[13] Yaoping Ruan and Vivek S. Pai. Making the "box" transparent: System call performance as a first-class result. In Proceedings of the General Track: 2004 USENIX Annual Technical Conference, pages 1-14, 2004.

[14] Woong Shin, Qichen Chen, Myoungwon Oh, Hyeonsang Eom, and Heon Y. Yeom. OS I/O path optimizations for flash solid-state drives. In Proceedings of the 2014 USENIX Annual Technical Conference, USENIX ATC '14, pages 483-488, 2014.

[15] Livio Soares and Michael Stumm. FlexSC: Flexible system call scheduling with exception-less system calls. In OSDI, pages 33-46. USENIX Association, 2010.

[16] Avishay Traeger, Ivan Deras, and Erez Zadok. DARC: Dynamic analysis of root causes of latency distributions. In Proceedings of the 2008 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '08, 2008.

[17] Zev Weiss, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. DenseFS: A cache-compact filesystem. In Proceedings of the 10th USENIX Conference on Hot Topics in Storage and File Systems, HotStorage '18, 2018.

[18] Qiumin Xu, Huzefa Siyamwala, Mrinmoy Ghosh, Tameesh Suri, Manu Awasthi, Zvika Guz, Anahita Shayesteh, and Vijay Balakrishnan. Performance analysis of NVMe SSDs and their implication on real world databases. In Proceedings of the 8th ACM International Systems and Storage Conference, SYSTOR '15, 2015.

[19] Jisoo Yang, Dave B. Minturn, and Frank Hady. When poll is better than interrupt. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, FAST '12, 2012.