Interprocess Communication in FreeBSD 11: Performance Analysis
A. H. Bell-Thomas
Abstract
Interprocess communication (IPC) is one of the most fundamental functions of a modern operating system, playing an essential role in the fabric of contemporary applications. This report conducts an investigation in FreeBSD of the real-world performance considerations behind two of the most common IPC mechanisms: pipes and sockets. A simple benchmark provides a fair sense of effective bandwidth for each, and analysis using DTrace, hardware performance counters and the operating system's source code is presented. We note that pipes outperform sockets by 63% on average across all configurations, and further that the size of userspace transmission buffers has a profound effect on performance: larger buffers are beneficial up to a point (∼32 to 64 KiB), after which performance degrades.

Introduction
One of the most fundamental mechanisms of any modern operating system is interprocess communication (IPC). This is especially true in environments based on UNIX, where it is standard for a large number of disparate applications to be stitched together to fulfil a task. FreeBSD provides a wide range of IPC implementations, such as shared lock management and memory, but two mechanisms in particular dominate this space: pipes and sockets. This report shall investigate and compare both.

Both pipes and sockets are inherited from the POSIX (Portable Operating System Interface) standard, and although they are internally very different from one another, both are presented to users via generic file descriptors. Pipes facilitate the sending of unidirectional, unnamed byte-streams between entities; this is a very popular dataflow system when using command-line tools. They require a common parent process to set up the communication channel. Notably, before 4.2BSD pipes were implemented via the filesystem, and in later versions using sockets. FreeBSD no longer uses socket-backed pipes, instead opting for a separate optimised implementation directly on top of the virtual memory system. [4] Sockets, on the other hand, are designed for bidirectional communication and are highly adaptable, affording support for complex data structures such as packets when interfacing with a network. This report shows that pipes outperform sockets due to implementation optimisations in how they manage memory, and additionally that larger buffer sizes will improve performance up to a point, after which it degrades.

The focus of this report is to determine the performance characteristics of these two IPC mechanisms when used across both two threads and two processes. The analysis conducted considers a number of influential factors; particular focus is given to the internal structure and interactions of the kernel's components, as well as system behaviours at both architectural and micro-architectural levels. Section 2 details the experimental setup and methodology used, including a set of hypotheses that the analysis (Section 3) attempts to resolve. Section 4 presents the conclusion to this investigation.

The research questions this report aims to answer are as follows.

1. How does increasing IPC buffer size uniformly change performance across IPC models?
2. How does changing the IPC buffer size affect the architectural and micro-architectural aspects of cache and memory behaviour?
3. Can we reach causal conclusions about the scalability of the kernel's pipe and local socket implementations given evidence from processor performance counters?

These questions were used to derive the guiding set of hypotheses from which the experiments presented in this report were derived. At all points the probe effect is considered. The benchmark was tested unaltered, but only one operation mode, with a single static total IPC size of 16 MiB, was of interest.
Hypothesis 1: Increasing the buffer size available to each IPC mechanism will improve its performance up to a maximum, after which it will degrade.
Regardless of the IPC mechanism used, data has to be transferred block by block between sending and receiving entities; this is facilitated via buffers at both the receiver and the sender. In the case of sockets there is an additional in-kernel intermediary buffer, the size of which can be manipulated by applications (the -s flag in the benchmark). The buffer size chosen should be an essential factor of real-world performance; the best choice will most likely depend on the IPC mechanism and other system conditions. Logically, one would expect a larger buffer size to better amortise the effect of syscall overhead for large transmissions, but as shown in the first lab report this must be balanced with adverse effects on the processor's cache memories.

Hypothesis 2: Pipes will yield better performance than sockets for local communication due to their specific VM optimisations.

Sockets are notably far more versatile than pipes, theoretically putting them at a disadvantage in IPC performance testing. They need to interface not only across threads and processes but additionally across networks, a medium that brings a very different set of constraints and challenges. Pipes, on the other hand, are known to have an IPC implementation separate from that of sockets, giving room for context-aware optimisation directly on top of FreeBSD's VM system.
Hypothesis 3: The use of instrumentation tools such as DTrace and PMCs will adversely affect the benchmark's performance, but overall the results will demonstrate a consistent shape, especially considering their inflection points.
All experiments, and therefore results recorded, were conducted on a BeagleBone Black (revision C) board. [2, 3] It is equipped with a 32-bit ARM Cortex-A8 (AM335x, 1 GHz, 32 KB L1 cache, 256 KB L2 cache), [1] 512 MB of DDR3L RAM (800 MHz, 1.6 GiB/second of memory bandwidth), and 4 GB of eMMC flash memory. The operating system used is FreeBSD 11.0.0, which comes with native DTrace support. The persistent disk used by the system is an 8 GB microSDHC card. The root of the filesystem was mounted read-only to ensure that no unexpected/accidental activity could affect the OS installation; the writable /data partition was used for running trials and recording the results. The machine was accessed locally, using both a USB serial connection and SSH over Ethernet.

The heart of our investigation of the hypothesis set was an I/O benchmark program written by Robert N. M. Watson, modified to run on the BeagleBone Black (version d508cb8 (release/11.0.0)-dirty). The benchmark is able to measure IPC across both threads and processes via either pipes or sockets; the corresponding POSIX APIs are pthread_create, fork, pipe, and socketpair respectively. For sockets, the kernel's internal buffer size can be changed using setsockopt; the benchmark exposes this via the -s flag, setting it to be the same size as the userspace buffers. In addition to recording the effective IPC bandwidth seen using a particular configuration, it enables CPU performance counters (PMCs) to be queried to further enlighten our understanding of the system's behaviour. To guarantee that read and write performed full, not partial, operations, the system was configured to increase its kern.ipc.maxsockbuf value to 32 MiB, greater than the largest buffer size tested against.

This report's analysis of the benchmark's behaviour relies primarily on DTrace and PMCs; these were coordinated using both shell scripts and an IPython Jupyter Notebook for data collection. The probe effect of these investigation tools is detected and measured with a number of additional experiments; the purpose of this is to verify that the benchmark's behaviour was not significantly altered. This will be discussed in depth in 3.3.

Unless stated otherwise, all results presented use a sample size of at least 30 datapoints. The first datapoint for each configuration run was discarded to ensure the system was in a steady state before recording measurements. Graphs plot the median of these samples, as well as their IQR (25th to 75th percentile) as a notion of error. Statistical significance tests make use of the paired Wilcoxon signed-rank test: this was chosen over the traditional Student's t-test because it does not assume the underlying data is drawn from a Gaussian distribution, something that cannot be guaranteed in this context. Three distinct sets of experiments were performed to investigate the three experimental hypotheses: software-based instrumentation to analyse kernel performance, hardware-based instrumentation to assess the impact of the processor's microarchitecture, and a statistical analysis of both to investigate the impact of the probe effect.

[Figure 1: Effective bandwidths produced by the ipc-static benchmark for different IPC types and configurations as the buffer size used changes (no instrumentation). Bandwidth (KiB/second) is plotted against buffer size (bytes) for the Socket, Socket (-s) and Pipe configurations, with markers at page size, PIPE_SIZE and cache/TLB-related sizes.]
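To make the measurement procedure concrete, the following sketch illustrates the general shape of such a benchmark for the socketpair case. It is a minimal illustration, not Watson's ipc-static program; TOTAL_SIZE, the argument handling and the output format are assumptions of mine.

/*
 * Minimal sketch (not the ipc-static benchmark itself): move a fixed
 * 16 MiB total between a forked sender and receiver in user-buffer-sized
 * chunks, optionally mirroring that size into the kernel socket buffers,
 * which is what the benchmark's -s flag does.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>

#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define TOTAL_SIZE	(16 * 1024 * 1024)	/* 16 MiB, as in the report */

int
main(int argc, char *argv[])
{
	int fds[2], status;
	int buf_size = (argc > 1) ? atoi(argv[1]) : 8192;
	int set_sockbuf = (argc > 2);		/* any extra argument mimics -s */
	char *buf = malloc(buf_size);
	struct timespec start, end;

	if (buf == NULL || socketpair(PF_LOCAL, SOCK_STREAM, 0, fds) < 0)
		err(1, "setup");
	memset(buf, 0xab, buf_size);

	if (set_sockbuf &&
	    (setsockopt(fds[0], SOL_SOCKET, SO_SNDBUF, &buf_size, sizeof(buf_size)) < 0 ||
	     setsockopt(fds[1], SOL_SOCKET, SO_RCVBUF, &buf_size, sizeof(buf_size)) < 0))
		err(1, "setsockopt");

	clock_gettime(CLOCK_MONOTONIC, &start);
	pid_t pid = fork();
	if (pid < 0)
		err(1, "fork");
	if (pid == 0) {				/* child: sender */
		close(fds[1]);
		for (size_t sent = 0; sent < TOTAL_SIZE; sent += buf_size)
			if (write(fds[0], buf, buf_size) < 0)
				err(1, "write");
		_exit(0);			/* closes fds[0]: receiver sees EOF */
	}
	close(fds[0]);				/* parent: receiver */
	size_t received = 0;
	ssize_t n;
	while ((n = read(fds[1], buf, buf_size)) > 0)
		received += (size_t)n;
	clock_gettime(CLOCK_MONOTONIC, &end);
	waitpid(pid, &status, 0);

	double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("%zu bytes in %.3f s -> %.1f KiB/s\n", received, secs, received / 1024.0 / secs);
	free(buf);
	return (0);
}

The pipe case is analogous, swapping socketpair() for pipe() and dropping the setsockopt calls; the pthread_create variant shares the descriptor pair between two threads rather than across fork.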
In this section the results of the experiments performed will be presented, backing a discussion and evaluation of the difference between pipe and socketpair for IPC.
Figure 1 presents the results produced by the benchmark across a range of buffer sizes and using each of the three IPC mechanism configurations. The -s qualifier denotes that the size of the kernel buffer was modified to mirror the size of the userspace buffers. From first impressions there is a clear trend in the data: regardless of the IPC model, performance increases with buffer size up to a point (∼32 to 64 KiB), after which it begins to decrease substantially. The initial increase in performance can fairly easily be attributed to a decrease in the number of read() syscalls given the total I/O size is fixed. This behaviour is directly in line with the high-level conceptual model presented in Hypothesis 1. Figure 1, however, contains a number of inflection points hinting at more subtle effects at play. To decipher this we will now consider each IPC mechanism in greater depth using DTrace and direct source code analysis.
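The amortisation argument can be made concrete with a quick count of the blocking calls needed to move the fixed 16 MiB payload (my arithmetic, but consistent with the syscall counts reported by DTrace below):

    number of read()/write() calls = total size / buffer size,
    so 16 MiB / 8 KiB = 2048 calls, whereas 16 MiB / 64 KiB = 256 calls.

Each call carries a roughly fixed user/kernel crossing cost, so the eightfold reduction in calls removes a corresponding amount of overhead, provided the larger working set continues to behave well in the cache hierarchy.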
Pipes are a rather restricted transport mechanism, merely providing a way of sending ordered byte-streams between entities. Notably this representation is far less versatile than socketpair. To get a sense for how pipes behave in FreeBSD we first look at the source implementation and associated documentation (http://fxr.watson.org/fxr/source/kern/sys_pipe.c?v=FREEBSD-11-0 and http://fxr.watson.org/fxr/source/sys/pipe.h?v=FREEBSD-11-0). An essential observation to be gained from kern/sys_pipe.c is that pipe can operate in one of two modes: 'small' or 'large' write mode. In 'small' write mode, when the buffer size is smaller than PIPE_MINDIRECT (8 KiB), a 'normal' buffer is provided through the kernel. For write sizes between PIPE_MINDIRECT and PIPE_SIZE (16 KiB) and beyond, the 'large' write mode comes into play, with data transferred via the VM system rather than copied through an in-kernel buffer; this is a key factor in pipe demonstrating greater performance than socketpair, as it removes unnecessary copying via a kernel buffer.

The greatest observed throughput for pipe occurred with a buffer size of 64 KiB, as can be clearly seen in Figure 1. A contributing factor towards this is pipe's resizing mechanism (kern/sys_pipe.c, lines 1079-1088), which, if enabled (it is enabled on our system; sysctl kern.ipc.piperesizeallowed), will increase the default PIPE_SIZE from 16 KiB to a maximum of BIG_PIPE_SIZE (64 KiB) in increments (kern/sys_pipe.c, line 1039). This is visible via DTrace's syscall::read*:entry/return probes, which clearly showed all buffer sizes beyond 64 KiB resolving to 256 read() calls of size 64 KiB.

Using DTrace we further validated a number of details about the behaviour of pipe. Using the fbt::pipe_write:entry probe descriptor the following data was extracted from arg0 (struct file *fp) and arg1 (struct uio *uio).

(α) struct pipebuf pipe_buffer { ...; u_int size = 0x4000; ... }
(β) struct uio uio { ...; ssize_t uio_resid = 0x1000000; ... }

(α) provides the initial size of the default buffer given to the pipe on creation, PIPE_SIZE; this verifies our conclusions from the source file. The significance of (β) is slightly more subtle. This extract was taken from a run with the buffer set to 16 MiB (= 0x1000000), and this uio_resid value declares the number of bytes left to process when copying commences; this confirms that the entirety of the userspace buffer passed into write() is processed in one shot by pipe_write(), eliminating the need for an -s option as we have for sockets.
The implementation of socketpair is far more complicated than that of pipe, something that is rather unsurprising given how versatile the POSIX standard forces it to be. Notes in kern/uipc_socket.c and kern/uipc_usrreq.c enlighten a handful of the difficulties faced, especially in the face of 'ancillary data'; credentials, file descriptors, and even, one layer deeper, other sockets themselves may potentially be passed over a socket, requiring additional considerations such as a specialised garbage collector for dead sockets. Importantly, FreeBSD's implementation does not lend itself towards interoperability between local sockets and shared memory (or other VM optimisations); this is undoubtedly key to its poorer performance when compared to pipe in Figure 1.

An important observation from Figure 1 is that both socketpair configurations achieve their maximum throughput earlier than pipe, also behaving differently to one another after the 8 KiB mark. This was initially investigated using the DTrace syscall::read*/write*:entry/return probes. All buffer sizes larger than 8 KiB resolved to 2048 read() calls of 8 KiB when the -s flag was not used, explaining its early plateau in Figure 1. This limitation was not observed when the -s flag was used.

Deeper DTrace instrumentation was performed to further dissect this behaviour, targeting specifically the soreceive_generic() method (sockets' internal version of read()) in kern/uipc_socket.c. Using the fbt::soreceive_generic:entry probe the following data was extracted.

(γ) struct sockbuf so_rcv { ...; u_int sb_hiwat = 0x2000; ... }
(δ) struct sockbuf so_snd { ...; u_int sb_hiwat = 0x2000; ... }

(γ) and (δ) depict the high watermark, or maximum character count supported, of the sockbuf buffer structures used for both sending and receiving sockets (see http://fxr.watson.org/fxr/source/sys/sockbuf.h?v=FREEBSD-11-0). Given that 0x2000 = 8 KiB, this is a highly plausible explanation for the plateau in socketpair performance between 8 KiB and 64 KiB. The -s flag manually sets the high and low watermarks (int sb_lowat) to the size of the benchmark's buffer; this was verified using the same DTrace probe.
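To relate the observed sb_hiwat values back to the userspace API, the sketch below (my own, not part of the benchmark) reads the default send and receive buffer sizes of a freshly created local socket pair and then requests a larger size, which is what the -s flag does; on the system studied the defaults correspond to the 0x2000 (8 KiB) high watermarks shown above.

/*
 * Sketch: inspect and adjust the in-kernel socket buffer sizes backing a
 * local stream socket pair.  Requesting a larger SO_SNDBUF/SO_RCVBUF is
 * the mechanism behind the -s flag; requests are bounded above by
 * kern.ipc.maxsockbuf.
 */
#include <sys/types.h>
#include <sys/socket.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
	int fds[2], sndbuf, rcvbuf, wanted = 64 * 1024;	/* 64 KiB, for example */
	socklen_t len = sizeof(int);

	if (socketpair(PF_LOCAL, SOCK_STREAM, 0, fds) < 0)
		err(1, "socketpair");

	/* Defaults: these correspond to the sb_hiwat values observed via DTrace. */
	if (getsockopt(fds[0], SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) < 0 ||
	    getsockopt(fds[1], SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len) < 0)
		err(1, "getsockopt");
	printf("defaults: SO_SNDBUF=%d SO_RCVBUF=%d\n", sndbuf, rcvbuf);

	/* Mirror a userspace buffer size into the kernel, as -s does. */
	if (setsockopt(fds[0], SOL_SOCKET, SO_SNDBUF, &wanted, sizeof(wanted)) < 0 ||
	    setsockopt(fds[1], SOL_SOCKET, SO_RCVBUF, &wanted, sizeof(wanted)) < 0)
		err(1, "setsockopt");

	len = sizeof(int);
	if (getsockopt(fds[0], SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) < 0)
		err(1, "getsockopt");
	printf("after adjustment: SO_SNDBUF=%d\n", sndbuf);
	return (0);
}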
In the previous section we discussed various aspects of the kernel's behaviour that affect the benchmark's performance as seen in Figure 1. A handful of the data's inflection points have not yet been explained, thus PMCs are now used to extend our understanding further by reasoning at the microarchitectural level. Figure 2 presents four selected PMC attributes, demonstrating their behaviour as buffer size increases.

Figure 2a depicts the mem PMC counter's MEM_READ attribute; this can be used to give an approximate indicator of the number of architectural read operations. Frustratingly this cannot be directly translated into the number of bytes read. This is, we believe, due to specialised ARM instructions such as LDRD, which loads from two locations simultaneously in one operation; LDRD is used at various points in FreeBSD's ARM implementation, including memcpy (refer to contrib/cortex-strings/src/arm/memcpy.S). However, from this we are able to very clearly pick out the point at which socketpair with and without the -s flag deviates (8 KiB), reaffirming the behaviour reported by DTrace. As the number of architectural reads directly translates to I/O load on the system, lower values are better. The VM optimisations of the pipe implementation found in the FreeBSD source can be seen coming into play, with a far lower impact in hardware. Additionally, other inflection points can also be seen; socketpair with the -s flag plateaus at 32-64 KiB, and pipe at 64 KiB, mirroring the story told by the bandwidth readings reported in Figure 1. MEM_READ has shown itself to be a fairly reliable proxy for bandwidth performance at lower buffer sizes, especially as it reliably exposes software behaviours, encapsulating the bare-metal requests exiting the kernel.

The sudden drop in bandwidth performance after the 32 and 64 KiB marks has yet to be explained. With this in mind, the axi PMC tells an interesting story, bringing the root cause of the phenomenon to light. Figures 2b and 2c show the AXI_READ and INSTR_EXECUTED/CLOCK_CYCLE attributes respectively. The axi PMC measures activity on the chip's AXI bus, which is responsible for actually transporting data to physical I/O devices or DRAM, therefore capturing requests which are not satisfied by the CPU's cache hierarchy. Figure 2c has been included as it serves as a useful metric of the average relative expense of each AXI transaction.

A vital observation to make in explaining the sharp drop in bandwidth performance is that as performance decreases in the 32-256 KiB buffer size range, both the number and relative expense of AXI operations increase drastically. Taking the readings observed for pipe as an example, and given no other PMC attributes reveal anything of particular note, it is highly likely that these observed behaviours on the AXI bus are direct indicators of the core issue causing performance to crash.

There is no L3 cache in the BeagleBone Black's Cortex-A8 processor, making the L2 cache the last level of on-chip memory. Sadly the Cortex-A8 does not expose a PMC for measuring the number of L2 cache misses, but as a proxy Figure 2d plots the average clock cycles per L2 cache operation; this will implicitly expose the number of misses, as operations take longer whilst fetching data from DRAM. We can see that this proxy L2 miss metric aligns almost exactly with the increased strain seen on the AXI bus, giving credence to the assertion that the observed performance collapse is directly caused by L2 cache exhaustion. This further explains why both the increase in load and decrease in performance cease to change so drastically once the marked L2 limit is hit; this is the first truly degenerate case, as the working set no longer fits in the last-level cache. For this reason it is clear to see how this impacts all IPC mechanisms in roughly the same way.
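A rough, and deliberately approximate, working-set estimate of my own is consistent with this reading. With a userspace buffer of size B at both the sender and the receiver, plus in-kernel buffering of a similar order, each transfer cycle touches roughly

    W ≈ k × B for a small constant k (around 2 to 3),
    so W(64 KiB) ≈ 128-192 KiB, which still fits in the 256 KB L2 cache,
    while W(128 KiB) ≈ 256-384 KiB, which does not.

This places the crossover at exactly the buffer sizes where Figure 2 shows AXI traffic and L2 access latency beginning to climb.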
[Figure 2: Selected plots of PMC attributes as buffer size increases. (a) Architecturally originated memory reads (MEM_READ). (b) AXI bus: architectural reads (AXI_READ). (c) AXI bus: average clock cycles per instruction. (d) L2 cache: average clock cycles per access. Each panel plots events against buffer size (bytes) for Socket, Socket (-s) and Pipe, with markers at page size and PIPE_SIZE.]

[Figure 3: Plots examining the impact of probe effect from both DTrace and PMCs on the ipc-static benchmark. (a) Effective bandwidths achieved by the benchmark under various and no instrumentation (Raw, DTrace, PMCs (Indirect), PMCs (Direct)), plotted against buffer size. (b) Paired Wilcoxon signed-rank tests comparing results from instrumented runs to uninstrumented performance after normalisation; p-values against buffer size with the 5% threshold marked. The results shown are for pipe, although socketpair behaves nearly identically.]
In all experiments discussed thus far, the utmost care was taken to minimise any probe effect: no DTrace-instrumented runs were included in the final dataset, and PMCs were only activated when their output was required for the experiment. Hypothesis 3 stated that instrumentation was going to affect results but not the overall shape of the graph; Figure 3 plots the results of a set of experiments designed to test this. Only the results for pipe are plotted, but socketpair demonstrated very similar behaviours.
The Raw line depicts the performance of the benchmark with no instrumentation enabled. DTrace plots performance when using the syscall::read*/write*:entry/return probes. Two PMC lines are shown here: direct (l1d, mem, axi, tlb) and indirect (l1i, l2). This refers to whether a PMC can or cannot be directly measured on a Cortex-A8 (this information comes from the briefing document for the third L41 lab session). Personally, there are two surprises here.

1. PMCs, despite being accelerated in hardware, perform far worse than (software-based) DTrace. Obviously this is not a fair comparison, as the two measure different things, but the difference helps to illustrate the hit required to gather such low-level runtime data.

2. The PMCs that are not directly countable by the processor outperform those that are; the median difference is 7.8% (IQR: 6.3% → …) … mem and axi counters, as they are not natively recorded in the processor.

From Figure 3a it is clear that the first half of Hypothesis 3 is correct; instrumentation had a tremendous effect on the results produced by the benchmark. A slightly more interesting question is how similar these lines are: we hope that the basic shape of all is similar and, in an ideal world, that they are the same line offset by constant scale factors k_1, k_2 and k_3. To test this, each line was normalised against its mean value across all buffer sizes; this should conceptually bring each into line with the others, eliminating constant scale factors. Each of the three normalised instrumentation lines was then compared against the Raw plot using a paired Wilcoxon signed-rank test to determine whether their differences are statistically significant. These are plotted in Figure 3b with a 5% significance level. The null hypothesis (that the normalised lines are not significantly different) is rejected at only three points, when buffer sizes are small; larger buffer sizes fare well, demonstrating high p-values. Overall, this gives satisfactory credibility to Hypothesis 3 in this context.
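For completeness, the normalisation applied before the Wilcoxon tests can be written out explicitly (the notation is mine, not the report's). Each instrumented series is divided by its own mean over the set of buffer sizes tested:

    x_norm(i, b) = x(i, b) / mean_b x(i, b),

where i indexes the instrumentation configuration (Raw, DTrace, PMCs direct, PMCs indirect) and b the buffer size; the paired test then compares x_norm(i, b) against x_norm(Raw, b) across buffer sizes.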
This report investigated the effective bandwidth of both pipes and sockets in FreeBSD 11. Drawing on the source code of the OS, its documentation, DTrace profiling and hardware performance counters, the three initial experimental hypotheses have been shown to be generally correct. Specifically, we have shown:

1. Increasing buffer sizes improves performance for all IPC mechanisms up to a point (∼32 to 64 KiB), after which it degrades. Hypothesis 1 can therefore be soundly accepted.

2. Pipes yield better performance than sockets for local communication due to specific memory optimisations. Further, sockets are inherently constrained by their versatile design and independent in-kernel buffer, adding up to a far less scalable IPC solution. Virtual memory optimisations for pipe vastly decrease the expense of performing operations, as shown in Figure 2a, for example; this indicates a far more scalable solution for local IPC than socketpair. Hypothesis 2 is also accepted confidently, citing evidence from FreeBSD's source code and observed PMC attribute results to assert that VM optimisations were a major component of pipe's speed.

3. Instrumentation affects performance but, to a great extent, neither the shape nor the inflection points of results.
Hypothesis 3 is therefore tentatively accepted; there is no strong evidence to reject it, although for smaller buffer sizes there may be small, but significant, artefacts. (This is admittedly imperfect in the general case; here, however, the results were manually inspected before being deemed acceptable for this purpose.)

References

[1] AM335x ARM Cortex-A8 MPUs. Texas Instruments. (Accessed on 03/09/2020).
[2] BeagleBoard.org - black. Feb. 2020. URL: https://beagleboard.org/black. [Online; accessed 11 Feb. 2020].
[3] BeagleBone Black System Reference Manual. URL: https://cdn-shop.adafruit.com/datasheets/BBB_SRM.pdf. (Accessed on 03/09/2020).
[4] Marshall Kirk McKusick, George Neville-Neil, and Robert N. M. Watson. The Design and Implementation of the FreeBSD Operating System. 2nd ed. Addison-Wesley Professional, 2014.