Nature of System Calls in CPU-centric Computing Paradigm
Viacheslav Dubeyko, Om Rameshwar Gatla, and Mai Zheng
Western Digital Research, Iowa State University
Abstract
Modern operating systems are typically POSIX-compliant, with major system calls specified decades ago. The next generation of non-volatile memory (NVM) technologies raises concerns about the efficiency of traditional POSIX-based systems. As one step toward building high-performance NVM systems, we explore in this paper the potential dependencies between system call performance and major hardware components (e.g., CPU, memory, storage) under typical use cases (e.g., software compilation, installation, web browsing, office suite). We build histograms for the most frequent and time-consuming system calls with the goal of understanding the nature of their distributions on different platforms. We find that there is a strong dependency between system call performance and the CPU architecture. On the other hand, the type of persistent storage plays a less important role in affecting the performance.
Index terms:
Non-volatile memory (NVM), storage class memory (SCM), POSIX OS, system call, file system.
I. Introduction

The next generation of non-volatile memory (NVM) or storage class memory (SCM) technologies is expected to provide durability similar to flash memory with latencies comparable to DRAM. These unique characteristics raise new concerns about the efficiency of modern POSIX-compliant operating systems, since the major interface was specified decades ago.

In this paper, we explore the potential dependency between POSIX system call performance and major hardware components. We monitor the system calls under five representative use cases (i.e., software compilation, installation, diff viewer, Firefox, OpenOffice) through the strace utility [21], calculate their distributions, and identify the ones that are used most frequently and are most time-consuming. Moreover, we build histograms for the major system calls on six platforms with a wide range of CPUs (e.g., AMD E-450, i7-3630QM, Xeon E5-2620 v2), storage devices (e.g., 15,015 RPM SAS HDD, SATA 3.1 SSD, PCIe SSD, RAMdisk), OS kernels (e.g., 3.13.0-24, 3.13.0-95, 4.2.0-42), and file systems (e.g., ext4, XFS, tmpfs). Through cross-platform analysis, we find that there is a strong dependency between system call performance and the CPU architecture. On the other hand, the type of persistent storage plays a less important role in system call performance. We hope this study can help identify the performance bottlenecks of existing system call implementations, and thus facilitate building high-performance NVM systems.

The rest of this paper is organized as follows. Section II presents the background and motivation of the paper. Section III surveys related work. Section IV explains the methodology (experimental setup, use cases, research tool) used in this work. Section V analyzes the use cases on the basis of total frequency (number of calls) and total time consumption of different system calls. Section VI introduces the system call analysis on the basis of the prepared histograms. Section VII includes a final discussion of the system call analysis. Section VIII offers conclusions.
II. Background and Motivation

The next generation of NVM memory is able to provide a persistent, byte-addressable space for storing, accessing, and modifying data. The features of the new persistent memory are very attractive and promise to resolve many critical problems in the current data processing paradigm.
CPU vs. NVM memory. The existing hardware and software stack has a large number of issues that prevent the efficient use of NVM/SCM memory in current computing systems. First of all, modern CPUs are not ready to access the byte-addressable space of NVM memory. Another critical point is the inability of modern CPUs to manage the persistent nature of NVM memory: for modern CPUs, persistency is a problem rather than an advantage. The current hardware stack includes several levels of memory hierarchy (CPU registers, L1/L2/L3 caches, DRAM, persistent storage). This architecture is a compromise that emerged because of the block-based interface and the high latency of regular persistent storage devices (HDD, SSD). However, NVM memory offers low access latency, byte-addressability, and persistence. If NVM memory is used inside a persistent storage device, then its advantages (byte-addressability and low latency) are neutralized by the latency of the storage device's controller and by the overhead of the block-based interface of the OS software stack.

NVM memory as main memory. The idea of using NVM memory instead of DRAM as main memory reveals many drawbacks too. Many NVM memory technologies do not have very high endurance, which means that direct CPU access can wear out an NVM chip very quickly. Usually, main memory is volatile, and neither the CPU nor the OS is able to exploit the persistent nature of NVM memory as a substitute for DRAM. Moreover, the persistence of NVM memory can be a source of various security issues when it is used as main memory. There is a significant number of research efforts aimed at new hardware-based approaches or modifications of OS subsystems for the efficient use of NVM memory. However, all these efforts have one critical downside.
The efficiency of every approach is estimated by means of special benchmarks that are unable to simulate real life. This means that no benchmark can show the real efficiency of a computing system in a real environment (for example, the typical environment of a regular end user).
System calls analysis. Usually, an OS is split into user space and kernel space. User space is the world of applications, which can request services from hardware devices by means of system calls provided by kernel space. Generally speaking, all hardware resources are managed by the kernel, and a user-space application can request any service only through system calls. NVM memory, likewise, is a hardware resource that should be managed by kernel space and made available to user-space applications through system calls. Consequently, system call analysis is a fundamental way to understand how efficiently the whole computing system can operate. It is possible to build a testing space by varying hardware resources (CPUs, DRAM capacity, storage devices) and use cases. This testing space can reveal the key peculiarities and tendencies of the computing system as a whole. The analysis of system call frequency and time consumption can expose important peculiarities of modern OSes. These results can be the basis for understanding the key bottlenecks of modern OSes and for elaborating a vision of how the next generation of NVM memory can really be used efficiently.
III. Related Works

System support for NVM. Great efforts have been put towards providing software and/or hardware support for NVM [1, 2, 3, 5, 6, 7]. For example, Mnemosyne [5] and NVML [1] provide transaction libraries for durable transactions and atomic updates on NVM. PMFS [3], NOVA [6], and NOVA-Fortis [7] enable accessing NVM through the file system interface. Our work is complementary to these efforts in that we focus on identifying the fundamental bottleneck of existing system call interfaces and implementations, which is critical for re-designing systems for NVM.
Performance analysis of NVM-based systems. Many researchers have analyzed the performance of NVM-based systems and/or applications [1, 2, 3, 4, 5, 6, 7, 8]. For example, WHISPER [4] quantitatively evaluates 10 NVM-aware benchmarks using three access interfaces (i.e., native, library, and file system) and identifies important characteristics of NVM applications (e.g., software transactions are often implemented with 5 to 50 ordering points). Different from these existing efforts, which mostly use one type of hardware (e.g., DRAM or a hardware emulator) and a limited number of workloads, we cover a wide spectrum of hardware platforms, use cases, and system calls.
System calls optimization. The inefficiency of system calls has been discussed in many papers [9, 10, 11, 12, 13, 14, 15, 16]. Triplett et al. [11] proposed rethinking the division between user space and kernel space to eliminate the overhead of system calls. Rather than dedicating only a single core to a process, they suggested dedicating two: one to run the user-mode process, and one to perform system calls and the associated kernel-mode computation. Soares et al. [13] proposed exception-less system calls. They showed that synchronous system calls negatively affect performance in a significant way, primarily because of pipeline flushing and pollution of key processor structures (e.g., TLB, data and instruction caches). In their implementation, system calls are issued by writing kernel requests to a reserved syscall page, using normal memory store operations. The actual execution of system calls is performed asynchronously by special in-kernel syscall threads, which post the results of system calls to the syscall page after their completion. Rajagopalan et al. [15] suggested a profile-directed approach to optimizing a program's system call behavior (system call clustering). In this approach, profiles are used to identify groups of system calls that can be replaced by a single call, thereby reducing the number of kernel boundary crossings. They showed an average 25% improvement in frame rate, 20% reduction in execution time, and 15% reduction in the number of cycles for the mpeg_play video software decoder.
System analysis via system call tracing. Tracing system calls is a very useful technique, especially for debugging an application. Researchers have also used this technique for analyzing system efficiency as a whole [17, 18, 19, 20]. Park et al. [17] evaluated battery performance on the Android platform by tracing system calls. Kodirov et al. [19] used system call tracing to investigate the text editing program gedit and the libraries it relies upon. They identified recurring patterns in the application that could be simplified if better FS support were available.
IV. Methodology

As one step towards building high-performance NVM systems, we explore one important research question: Is the execution time of popular system calls mainly affected by the type of persistent storage? If so, how much? If not, what is the major factor affecting the execution time? To answer the question, we monitor the system calls under five representative use cases through the strace utility [21], calculate the distribution of system calls in terms of frequency, and identify the ones that are used most often. Moreover, we record the execution time of each system call invocation, and compare them across six platforms with different hardware components.
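The per-call aggregation described above can be sketched with a small parser for `strace -T` output. This is a hypothetical reproduction of the aggregation step, not the exact script used in the study; the sample trace lines are illustrative:

```python
import re
from collections import Counter, defaultdict

# Each line of `strace -T` output ends with the elapsed time of the
# call in seconds, e.g.:  open("/etc/ld.so.cache", O_RDONLY) = 3 <0.000011>
LINE_RE = re.compile(r'^(\w+)\(.*<(\d+\.\d+)>$')

def aggregate(trace_lines):
    """Return per-syscall frequency and total elapsed time (seconds)."""
    freq = Counter()
    total = defaultdict(float)
    for line in trace_lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip signals, unfinished/resumed lines, etc.
        name, elapsed = m.group(1), float(m.group(2))
        freq[name] += 1
        total[name] += elapsed
    return freq, total

sample = [
    'open("/etc/ld.so.cache", O_RDONLY) = 3 <0.000011>',
    'read(3, "\\x7fELF"..., 832) = 832 <0.000007>',
    'read(3, ""..., 4096) = 0 <0.000005>',
]
freq, total = aggregate(sample)
print(freq["read"], total["read"])
```

Sorting `total` in descending order then yields the most time-consuming system calls per use case.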
Experimental setup. We use six platforms with a wide range of hardware and software. As shown in Table 1, we select several CPUs with: (1) different architectures (AMD E-450, Intel Xeon E5-2620 v2, Intel i7-3630QM); (2) various core counts (2 - 24); (3) various L1/L2/L3 cache sizes. Similarly, we vary the DRAM size from 8 GB up to 24 GB. In terms of persistent storage, we use several HDDs (5,200 RPM SATA 3.0, 15,015 RPM SAS) and SSDs (SATA 3.1, PCIe). Since persistent memory is immature and not publicly available, we use a RAMdisk to approximate an ideal persistent memory environment. In terms of operating systems, we use the Ubuntu distribution with kernel versions 3.13.0-24 and 4.2.0-42. The Ext2, Ext4, and XFS file systems are selected because of their support for Direct Access (DAX). The tmpfs file system is used for the RAMdisk-based platform.
Use cases. As shown in Table 2, we select five representative use cases to drive the target systems. Compilation configures and compiles the F2FS utilities [22] through a set of common compiler tools including autoreconf, configure, and make. Installation installs multiple applications through the Ubuntu Software Center. The remaining three use cases (i.e., FireFox, OpenOffice Calc, and Meld) run popular applications for web browsing, spreadsheet calculation, and diff viewing, respectively, all of which involve user interactions (e.g., browsing across multiple websites and clicking links). Together, the five cases cover typical usage scenarios of end users in a desktop environment.

Moreover, the five use cases are expected to cover different I/O patterns. For example, Compilation involves a mix of read/write operations, Installation is dominated by write operations, and Meld involves many file system metadata operations for traversing directories and merging files. In addition, to minimize the impact of buffering in the target systems, we explicitly flush buffers and drop the dirty pages before running each use case (i.e., using sync and echo 3 > /proc/sys/vm/drop_caches).

V. Use-Case Analysis

To identify the most popular and the most time-consuming system calls, we calculate the system call distribution under different use cases based on the usage frequency. Figure 1 shows the distribution of system calls for the make utility in the Compilation use case. We can see that make distributes its execution activity among a variety of system calls, including: (1) OS-related system calls (mprotect, brk, rt_sigaction, rt_sigprocmask, rt_sigreturn, pipe, dup2, wait4); (2) metadata-related system calls (stat, fstat, lstat, open, close, access, lseek); (3) user-data-related system calls (mmap, munmap, read, write).

Figure 1: System call distribution for the make utility in the Compilation use case. OS-related system calls are marked in red, metadata-related system calls in blue, and user-data-related system calls in green.

Similarly, we measure the distributions of system calls for the other use cases.
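The buffer-flushing step from the methodology (sync plus a write to /proc/sys/vm/drop_caches) can be expressed programmatically as follows. This is a minimal sketch, not the script used in the study; the function name is our own, and the drop_caches write requires root privileges:

```python
import os

def flush_caches():
    """Flush dirty pages to disk and drop the page cache, dentries,
    and inodes, approximating a cold-cache start before a use case."""
    os.sync()  # same effect as the `sync` command
    try:
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")  # 3 = drop page cache + dentries + inodes
        return True
    except OSError:
        return False  # not root (or no procfs): synced but not dropped

dropped = flush_caches()
print("caches dropped:", dropped)
```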
OpenOffice Calc (Fig. 2) has some peculiarities in its execution activity distribution, but it uses: (1) OS-related system calls (mprotect, brk, madvise, recvmsg, futex, poll); (2) metadata-related system calls (stat, fstat, lstat, open, close, access); (3) user-data system calls (mmap, munmap, read, write).
Meld (Fig. 4) follows the same profile of operations: (1) OS-related system calls (recvmsg, futex, poll, wait4); (2) metadata-related system calls (stat, fstat, lstat, open, close); (3) user-data-related system calls (mmap, munmap, read, write).

              Platform 1    Platform 2    Platform 3    Platform 4    Platform 5    Platform 6
CPU type      AMD E-450     AMD E-450     Intel i7-     Intel Xeon    Intel Xeon    Intel Xeon
              1,650 MHz     1,650 MHz     3630QM        E5-2620 v2    E5-2620 v2    E5-2620 v2
                                          2.40 GHz      2.10 GHz      2.10 GHz      2.10 GHz
    cores     2             2             8             24            24            24
    cache     L1 32 KB      L1 32 KB      L1 32 KB      L1 32 KB      L1 32 KB      L1 32 KB
              L2 512 KB     L2 512 KB     L2 256 KB     L2 256 KB     L2 256 KB     L2 256 KB
                                          L3 6 MB       L3 15 MB      L3 15 MB      L3 15 MB
Memory type   SODIMM DDR3   SODIMM DDR3   SODIMM DDR3   DIMM DDR3     DIMM DDR3     DIMM DDR3
              1,333 MHz     1,333 MHz     1,600 MHz     1,866 MHz     1,866 MHz     1,866 MHz
    size      8 GB          8 GB          24 GB         16 GB         16 GB         16 GB
Storage type  HDD           SSD           SSD           HDD           SSD           DRAM
    RPM       5,200         –             –             15,015        –             –
    size      2 TB          500 GB        480 GB        73.4 GB       128 GB        16 GB
    protocol  SATA 3.0      SATA 3.1      SATA 3.1      SAS           PCIe          –
              6 Gb/s        6 Gb/s        6 Gb/s        6 Gb/s        6 Gb/s

OS (Linux kernel version):  3.13.0-24     3.13.0-24     3.13.0-95     3.13.0-24     3.13.0-24     4.2.0-42
File system:                Ext2          Ext4          XFS           Ext2          XFS           tmpfs

Table 1: Six platforms with different hardware (HW) and operating systems (OS).

Use Case          Description
Compilation       compile F2FS utilities via autoreconf, configure, & make
Installation      install applications via Ubuntu Software Center
FireFox           a web browser
OpenOffice Calc   a spreadsheet application
Meld              a visual diff and merge tool

Table 2: Five representative use cases.

Finally,
Installation (Fig. 5) distributes the execution activity among: (1) OS-related system calls (recvmsg, fcntl, poll); (2) metadata-related system calls (stat, open, close); (3) user-data-related system calls (mmap, munmap, read, write). We summarize the most frequent and the most time-consuming system calls in Table 3, and investigate them further on different platforms in the next section.
VI. System Calls Histograms Analysis

Memory operations. This is the group of system calls responsible for interaction with the OS memory subsystem. The frequency and time consumption analysis revealed the dominating importance of the mprotect(), brk(), mmap(), munmap(), and madvise() system calls. Since memory-mapped file I/O is currently a very hot topic, the mmap() and munmap() histograms were selected for detailed analysis. However, it is worth pointing out that, fundamentally, the histograms of the remaining system calls (mprotect(), brk(), and madvise()) look similar.

Figure 2: System calls frequency distribution for the OpenOffice Calc use case. OS-related system calls are marked in red, metadata-related system calls in blue, and user-data-related system calls in green.

Figure 3: System calls frequency distribution for the Firefox use case. OS-related system calls are marked in red, metadata-related system calls in blue, and user-data-related system calls in green.

Group                        System Calls
Memory Operations            mprotect(), brk(), mmap(), munmap(), madvise()
Signal Processing            rt_sigaction(), rt_sigreturn(), rt_sigprocmask()
Interprocess Communication   pipe(), recvmsg(), recvfrom()
File Operations              dup2(), stat(), fstat(), lstat(), open(), close(), access(), lseek(), fcntl()
User Data Operations         read(), write()
Locking Operations           futex(), poll(), wait4()

Table 3: Summary of the most frequent and time-consuming system calls.

Fig. 6 - Fig. 7 do not show any valuable difference between the histograms for the HDD and SSD persistent storage cases. All such histograms have the same set of peaks; however, the HDD-based histograms have a longer tail distribution. It is worth mentioning that the RAMdisk and OptaneSSD cases change the fine structure of the histogram by transforming it into sharper peak(s) with larger intensity, and this main peak moves into the lower-latency area. But the important point is that the tail distribution does not change significantly for the RAMdisk and OptaneSSD cases; sometimes, the tail distribution can even be longer for the OptaneSSD case than for the SATA SSD storage device. A really unexpected discovery is the presence of the main peak of OptaneSSD in a lower-latency area compared with the RAMdisk case. However, it must be taken into account that the histograms of OptaneSSD and RAMdisk were built on different platforms (i7 and Xeon). Generally speaking, the CPU architecture could be more important than persistent memory. It is also very interesting to compare the SATA SSD and OptaneSSD histograms for the i7-based platform: the fine structure of the OptaneSSD histogram tends to repeat the fine structure of the SATA SSD histogram. Sometimes there is a visible difference in the fine structure of SATA SSD and OptaneSSD, but very frequently it is possible to distinguish the same set of peaks in both cases. However, the OptaneSSD histogram is usually shifted into the lower-latency area. Fig. 7 shows a very interesting case where the main peak of the OptaneSSD histogram is surrounded by peaks of the SATA SSD histogram.

Figure 4: System calls frequency distribution for the Meld use case. OS-related system calls are marked in red, metadata-related system calls in blue, and user-data-related system calls in green.

Fig. 8 - Fig. 9 show that different use cases have a similar fine structure of histograms; only the intensity and the length of the tail distribution differ. Probably, the length of the tail distribution is defined by the system's background activity, meaning that background processes in the system affect the timing of the system calls of the investigated use cases. Fig. 8 - Fig. 9 also reveal that even RAMdisk and OptaneSSD have a significant tail distribution length. It is possible to conclude that faster persistent storage is unable to eliminate the tail distribution; most probably, the histogram's tail distribution is a fundamental factor of the von Neumann computing architecture. It is worth mentioning that, for example, brk() is a system call that works with the memory subsystem by allocating/deallocating memory in the system. On the one hand, it makes sense to expect that the histograms of mprotect(), brk(), mmap(), munmap(), and madvise() should have a simple fine structure in the form of one peak. However, the fine structure of these system calls' histograms contains a significant number of peaks. Moreover, changing the CPU architecture can change the histogram's fine structure significantly; conversely, changing the persistent storage cannot. Only the RAMdisk and OptaneSSD cases are able to reduce the fine structure to one intensive peak, although even the OptaneSSD case can still show a fine structure with a significant amount of detail. Probably, the task scheduler affects the histogram's fine structure for the mprotect(), brk(), mmap(), munmap(), and madvise() system calls.

Figure 5: System calls frequency distribution for the Software Installation use case. OS-related system calls are marked in red, metadata-related system calls in blue, and user-data-related system calls in green.
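The histogram construction described above can be sketched as follows. This is a simplified illustration, not the paper's tooling: we time repeated anonymous mmap()/munmap() pairs with a monotonic clock and bucket the latencies into power-of-two bins; the bucket scheme is our own choice:

```python
import mmap
import time
from collections import Counter

def mmap_latency_histogram(rounds=1000, length=4096):
    """Time mmap()/munmap() pairs and bucket the latencies into
    powers-of-two nanosecond bins (a log-scale histogram)."""
    hist = Counter()
    for _ in range(rounds):
        t0 = time.perf_counter_ns()
        m = mmap.mmap(-1, length)  # anonymous mapping -> mmap() syscall
        m.close()                  # -> munmap() syscall
        dt = time.perf_counter_ns() - t0
        hist[1 << max(0, dt.bit_length() - 1)] += 1  # 2^floor(log2 dt)
    return hist

hist = mmap_latency_histogram()
for bucket in sorted(hist):
    print(f"{bucket:>10} ns: {'#' * min(hist[bucket], 60)}")
```

A main peak with a long tail of slow outliers, as discussed above, shows up even in this tiny sketch: most iterations land in one or two buckets, while scheduler preemptions populate the high-latency bins.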
Signal processing. The sigaction() system call is used to change the action taken by a process on receipt of a specific signal. The rt_sigaction() system call looks better for the Xeon + RAMdisk and i7 + OptaneSSD cases (see Fig. 10 - Fig. 11). However, the histogram of the i7 + OptaneSSD platform contains its main peak in a lower-latency area compared with the Xeon + RAMdisk case. This fact can be considered a basis for the conclusion that the CPU architecture plays a more important role than persistent memory in the CPU-centric data processing paradigm. The type of persistent storage/memory can slightly improve the application's performance for the same CPU architecture, but Fig. 12 - Fig. 13 show that such improvement cannot be significant, because even OptaneSSD is unable to shrink the tail distribution noticeably. Signals can be thought of as software interrupts: when a signal is sent to a process or thread, a signal handler may be entered, much as the system enters an interrupt handler as the result of receiving an interrupt. Generally speaking, signals can be treated as a synchronization primitive. As a result, signal-based inter-process communication cannot be improved by a new type of persistent memory, because the CPU architecture plays the more crucial role in the CPU-centric data processing paradigm.

Figure 6: Histograms of the mmap() system call for the make use case.

Figure 7: Histograms of the munmap() system call for the autoreconf use case.
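As an illustration of this path, the sketch below registers a handler and delivers a signal to the current process. The Python-level signal.signal() call is a thin wrapper that ends up in rt_sigaction() on Linux; the example itself is ours and not part of the measured workloads:

```python
import os
import signal

# Install a handler for SIGUSR1; under the hood the runtime issues
# rt_sigaction() to register it with the kernel.
received = []

def handler(signum, frame):
    received.append(signum)

signal.signal(signal.SIGUSR1, handler)

# Deliver the signal to ourselves; the handler runs like a software
# interrupt before control returns to the main flow.
os.kill(os.getpid(), signal.SIGUSR1)
print(received == [signal.SIGUSR1])
```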
Interprocess communication. Interprocess communication (IPC) is a set of programming interfaces that allow a programmer to coordinate activities among different processes that can run concurrently in an operating system. The frequency and time consumption analysis revealed the importance of the pipe(), recvmsg(), and recvfrom() system calls. The pipe() system call creates a unidirectional data channel that can be used for interprocess communication: data written to the write end of the pipe is buffered by the kernel until it is read from the read end. Fig. 14 - Fig. 15 do not show any visible improvement of the recvmsg() system call's execution time for the RAMdisk and OptaneSSD cases. The i7 + SATA SSD platform easily competes with Xeon + RAMdisk and i7 + OptaneSSD. Even Xeon + HDD 15K and Xeon + PCIe SSD are able to compete with Xeon + RAMdisk for the recvmsg() system call. The OptaneSSD case is also unable to shrink the tail distribution significantly (see Fig. 14 - Fig. 15). The recvmsg() system call implements inter-process communication, and it represents a very important OS mechanism that affects application performance significantly; but the available results do not show any significant improvement for the recvmsg() system call. It is, however, possible to see the same sequence AMD -> Xeon -> i7, where the histograms for the i7 platform are located in the area with the lowest latency values. It looks like the CPU architecture plays the most important role in defining the efficiency of the recvmsg() system call. Inter-process communication is the cornerstone of improving application performance by means of parallel execution of an application's sub-tasks. But persistent memory is unable to improve the performance of inter-process communication in the CPU-centric data processing paradigm.

Figure 8: Histograms of the mmap() system call for AMD + HDD 5K + EXT2.

Figure 9: Histograms of the mmap() system call for i7 + OptaneSSD + EXT4.

File operations. The file operations are metadata-related system calls.
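The pipe() channel described in the interprocess-communication discussion above can be demonstrated in a few lines (a minimal sketch of the kernel-buffered semantics, not a benchmark from the study):

```python
import os

# pipe() returns a (read_fd, write_fd) pair; data written to write_fd
# is buffered by the kernel until it is read from read_fd.
r, w = os.pipe()
os.write(w, b"hello via kernel buffer")
os.close(w)                # EOF for the reader once the buffer drains
data = os.read(r, 4096)    # read() on the read end of the channel
os.close(r)
print(data)
```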
Any user-space application uses the metadata-related system calls with significant frequency; frequently, this type of system call dominates the application profile. The frequency and time consumption analysis revealed the dominating importance of the dup2(), stat(), fstat(), lstat(), open(), close(), access(), lseek(), and fcntl() system calls. Fig. 16 - Fig. 17 do not show any principal difference between the histograms for AMD + HDD 5K and AMD + SATA SSD. The same can be concluded for Xeon + HDD 15K and Xeon + PCIe SSD (see Fig. 16 - Fig. 17). Even though the stat() family of system calls depends on operations with persistent storage, it is not possible to see that the type of persistent storage improves the performance dramatically. The histograms show only one difference: the length of the tail distribution. It was even discovered that HDD 15K has a shorter tail distribution compared with PCIe SSD. Generally speaking, the type of persistent storage is not a steady basis for improving the application's performance as a whole. The Xeon + RAMdisk case looks better for the software installation use case (see Fig. 17), but other use cases (Meld and Firefox, for example) do not show any significant advantages for the RAMdisk case (see Fig. 16). The i7 + OptaneSSD case looks better than i7 + SATA SSD, but the SATA SSD is able to compete with OptaneSSD. Finally, generally speaking, persistent memory alone is unable to improve the total application performance. Fig. 16 - Fig. 17 clearly show that the CPU architecture is a more influential factor than the type of persistent memory.

Figure 10: Histograms of the rt_sigaction() system call for the autoreconf use case.

Figure 11: Histograms of the rt_sigaction() system call for the make use case.

Figure 12: Histograms of the rt_sigaction() system call for AMD + HDD 5K + EXT2.

Figure 13: Histograms of the rt_sigaction() system call for i7 + OptaneSSD + EXT4.
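The stat()-family timing discussed above can be sketched as follows. This is our own illustration: after the first call, the lookup is typically served from the kernel's dentry/inode caches rather than from persistent storage, which is one reason the storage type matters less than the CPU-side path:

```python
import os
import time

# Time repeated stat() calls on a path that exists on any POSIX system.
path = "/"
latencies_ns = []
for _ in range(100):
    t0 = time.perf_counter_ns()
    os.stat(path)              # stat() syscall
    latencies_ns.append(time.perf_counter_ns() - t0)

print("min:", min(latencies_ns), "ns  max:", max(latencies_ns), "ns")
```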
Locking operations. The locking primitives are used very frequently in multi-threaded applications because of the necessity to synchronize access to shared data. The frequency and time consumption analysis revealed the dominating importance of the futex(), poll(), and wait4() system calls. The futex() system call has very special histograms. First of all, all the histograms have a very long tail distribution, and different persistent storage types are unable to shrink or eliminate it. Moreover, the tail distribution is practically unchanged for any CPU architecture or persistent storage (see Fig. 20 - Fig. 21). Even RAMdisk and OptaneSSD have a really long tail distribution that affects the total execution time of the use cases. Generally speaking, the main peaks of all the histograms are grouped in a narrow area (see Fig. 18); the only difference is the redistribution of the peaks' intensity among different latency values. However, it is easy to see the same sequence AMD -> Xeon -> i7, where the histograms for the i7 platform are located in the area with the lowest latency values (see Fig. 18). The futex() system call is one that significantly affects the total execution time. Fig. 19 shows that none of the investigated use cases is able to improve the average execution time or shrink the tail distribution of the poll() system call. The poll() system call plays the role of a synchronization primitive associated with I/O operations, and it is easy to see that even RAMdisk and OptaneSSD are unable to change anything. Generally speaking, the synchronization primitives degrade the whole performance significantly, and the von Neumann paradigm does not provide any hope of resolving this drawback.

Figure 14: Histograms of the recvmsg() system call for the OpenOffice Calc use case.

Figure 15: Histograms of the recvmsg() system call for the Firefox use case.
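On Linux, CPython's threading.Lock is built on the futex() mechanism: an uncontended acquire stays in user space, while a contended acquire sleeps in the kernel via futex(). A minimal sketch of this difference (the 50 ms hold time and thread structure are our own illustration, not the paper's setup):

```python
import threading
import time

lock = threading.Lock()  # backed by futex() on Linux

# Uncontended acquire: usually stays in user space (fast path).
t0 = time.perf_counter_ns()
lock.acquire()
uncontended_ns = time.perf_counter_ns() - t0

def holder():
    time.sleep(0.05)   # keep the lock busy for ~50 ms
    lock.release()

threading.Thread(target=holder).start()

# Contended acquire: the caller must sleep in the kernel via futex()
# until the holder thread releases the lock.
t0 = time.perf_counter_ns()
lock.acquire()
contended_ns = time.perf_counter_ns() - t0
lock.release()

print(contended_ns > uncontended_ns)
```

The contended path is orders of magnitude slower than the fast path, which is consistent with futex() dominating total execution time in the measured use cases.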
There are some fundamental issues in the CPU-centric paradigm and modern OS architectures. Everybody believes that the next generation of NVM/SCM memory can improve the performance/latency of read/write operations significantly. This means that the average execution time of I/O operations should be reduced, and, as a result, an application's threads should spend much less time in synchronization primitives that wait for the end of I/O operations. However, neither the RAMdisk nor the OptaneSSD case is able to show any significant improvement in the synchronization primitives' average or total execution time. Most probably, the key drawbacks lie in the CPU architecture, the task scheduler implementation, and the whole interaction of OS subsystems during process/thread management. Also, the memory management subsystem can play a very important role that affects the performance of I/O operations and synchronization primitives. The available results provide the basis for the conclusion that changing either the persistent storage/memory or the CPU architecture is unable to decrease the average execution time of the futex(), poll(), and wait4() system calls. Most probably, the von Neumann paradigm has fundamental drawback(s) that cannot be resolved by a faster CPU or faster persistent memory.

User data operations. The read() and write() system calls are the cornerstones of operations with user data. The read() system call is a very critical function that directly defines the performance of operations with user data. Another important point is that read() is usually executed synchronously by the kernel. A naive point of view would expect that RAMdisk or OptaneSSD has to improve the performance of the read() system call dramatically.

Figure 16: Histograms of the fstat() system call for the Meld use case.

Figure 17: Histograms of the lstat() system call for the software installation use case.

Figure 18: Histograms of the futex() system call for the Firefox use case.

Figure 19: Histograms of the poll() system call for the Firefox use case.
However, Fig. 22 does not show any significant improvement for the read() system call from varying the persistent storage type. Finally, all the histograms provide the basis for the paradoxical conclusion that the CPU architecture is the more influential factor for the read() system call. The faster persistent memory is able to shrink the tail distribution slightly, but it is unable to change the nature of the histogram as a whole (see Fig. 23 - Fig. 24). Generally speaking, the read() system call also reveals the fundamental influence of the CPU architecture in the CPU-centric data processing paradigm, instead of the expected influence of the persistent storage type. Fig. 25 does not show any significant improvement between the HDD and SSD cases for the write() system call; only the tail distribution differs, and the SSD-based case is able to shrink the tail distribution slightly. The RAMdisk and OptaneSSD cases are able to simplify the fine structure of the histogram by increasing the intensity of the main peak. Also, the main peak is usually located in a lower-latency area compared with the HDD or SSD cases. However, even RAMdisk or OptaneSSD is unable to shrink the tail distribution significantly. The most important point is that, for the write() system call, the histograms of i7 + OptaneSSD are located in a lower-latency area compared with the Xeon + RAMdisk case.

Figure 20: Histograms of the futex() system call for AMD + HDD 5K + EXT2.

Figure 21: Histograms of the futex() system call for i7 + OptaneSSD + EXT4.

Figure 22: Histograms of the read() system call for the Firefox use case.

Figure 23: Histograms of the read() system call for AMD + HDD 5K + EXT2.
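The synchronous read() path discussed above can be sketched as follows. This is our own illustration: after the first read, the data is served from the page cache rather than from persistent storage, which is one reason varying the storage type barely changes the read() histogram:

```python
import os
import tempfile
import time

# Write a small file, then time synchronous read() calls on it.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 4096)
os.close(fd)

fd = os.open(path, os.O_RDONLY)
latencies_ns = []
for _ in range(10):
    os.lseek(fd, 0, os.SEEK_SET)
    t0 = time.perf_counter_ns()
    data = os.read(fd, 4096)   # synchronous read() syscall
    latencies_ns.append(time.perf_counter_ns() - t0)
os.close(fd)
os.unlink(path)

print(len(data), len(latencies_ns))
```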
User-data operations. The read() system call is a critical function that directly defines the performance of operations with user data. Another important point is that read() is usually executed synchronously by the kernel. A naive point of view suggests that a RAMdisk or an OptaneSSD should improve the performance of the read() system call dramatically. However, the histograms do not show any significant improvement for the read() system call from varying the persistent storage type. Finally, all histograms provide the basis for the paradoxical conclusion that the CPU architecture is the more influential factor for the read() system call.
Figure 24: Histograms of the read() system call for i7 + OptaneSSD + EXT4.
Figure 25: Histograms of the write() system call for the Firefox use-case.

Metadata operations. The type of persistent storage is unable to improve the performance of metadata-related system calls dramatically (for example, the stat() system call), even though these calls depend on operations with persistent storage. The histograms show only one difference: the length of the tail of the distribution. Surprisingly, the HDD 15K case can have a shorter tail distribution than the PCIe SSD. The tail distribution has very interesting peculiarities for some cases; such peculiarities have fundamental reasons that could be created by OS background activity or by the nature of the system call. The really important point is that the tail distribution cannot be shrunk significantly in the CPU-centric data-processing architecture. The background activity of a POSIX OS can completely eliminate the advantages of fast persistent memory. Finally, the RAMdisk and OptaneSSD cases reveal the influence of OS background activity. Most probably, CPU-centric data processing is responsible for the inability of the new type of NVM/SCM memory to improve overall system performance for any use-case.
Figure 26: Histograms of the write() system call for AMD + HDD 5K + EXT2.
Figure 27: Histograms of the write() system call for i7 + OptaneSSD + EXT4.

Synchronization primitives. The futex() system call has very special histograms. First of all, all of them have a very long tail distribution, and different persistent storage types are unable to shrink or eliminate it. Moreover, the tail distribution is practically unchanged for any CPU architecture or persistent storage; even the RAMdisk and OptaneSSD cases show really long tails that affect the total execution time of the use-cases. The futex() system call is one that significantly affects the total execution time. The poll() system call plays the role of a synchronization primitive associated with I/O operations. It is widely expected that the next generation of NVM/SCM memory will significantly improve the performance and latency of read/write operations. Consequently, the average execution time of I/O operations should decrease, and application threads should spend much less time in synchronization primitives waiting for I/O completion. However, neither the RAMdisk nor the OptaneSSD case shows any significant improvement in either the average or the total execution time of synchronization primitives. Most probably, the key drawbacks lie in the CPU architecture, the task scheduler implementation, and the interaction of OS subsystems during process/thread management. The memory management subsystem may also play an important role that affects the performance of I/O operations and synchronization primitives. Generally speaking, the combination of CPU-centric data processing with a POSIX-based OS architecture is affected by the CPU architecture more significantly than by persistent memory.
Inter-process communication. Inter-process communication is the cornerstone of improving application performance by executing applications' sub-tasks in parallel. However, persistent memory is unable to improve the performance of inter-process communication in the CPU-centric data-processing paradigm. It is possible to summarize that the CPU architecture defines the position of the histograms for the different platforms, because even the XEON + RAMdisk platform is unable to compete with the i7 + OptaneSSD and i7 + SATA SSD cases.
CPU-centric architecture. A really unexpected discovery is that the main peak of the OptaneSSD case is located in a lower-latency area than that of the RAMdisk case. The histograms also reveal that even the RAMdisk and OptaneSSD cases have tails of significant length; it is possible to conclude that faster persistent storage is unable to eliminate the tail distribution. Most probably, the histogram's tail distribution is a fundamental factor of the von Neumann computing architecture. Moreover, changing the CPU architecture can change the histogram's fine structure significantly, whereas changing the persistent storage cannot. Generally speaking, the type of persistent memory or storage is not a steady basis for improving application performance. Even the tail distribution can be shrunk more efficiently by changing the CPU architecture rather than the persistent memory type. However, for the same CPU architecture, the type of persistent storage/memory can slightly improve the application's performance.
The next generation of NVM/SCM memory may open new horizons for future computer technologies. Currently, however, NVM/SCM memory represents a big challenge rather than a ready answer to computer science's problems. We have built histograms for the most frequent and most time-consuming system calls with the goal of understanding the nature of their distributions on different platforms. We discovered an unexpected and stable dependence of the histograms on the CPU architecture, whereas the type of persistent storage does not play an important role in this dependence. Different use-cases show a similar fine structure of the histograms, differing only in the intensity and the length of the tail distribution. Probably, the length of the tail distribution is defined by the system's background activity; that is, background processes in the system affect the timing of the system calls of the investigated use-cases. The analysis of the histograms showed that faster persistent storage is unable to eliminate the tail distribution. Most probably, the histogram's tail distribution is a fundamental factor of the von Neumann computing architecture. We found that changing the CPU architecture can change the histogram's fine structure significantly, whereas changing the persistent storage cannot. Only the RAMdisk and OptaneSSD cases reduce the fine structure to one intensive peak; however, the OptaneSSD case can still show a fine structure with a significant amount of detail. Probably, the task scheduler affects the histogram's fine structure. These facts can be considered a basis for the conclusion that the CPU architecture plays a more important role than persistent memory in the CPU-centric data-processing paradigm. However, for the same CPU architecture, the type of persistent storage/memory can slightly improve the application's performance.
Inter-process communication is the cornerstone of improving application performance by executing applications' sub-tasks in parallel. However, signal-based inter-process communication cannot be improved by a new type of persistent memory, because the CPU architecture plays the more crucial role in the CPU-centric data-processing paradigm. Synchronization primitives degrade overall performance significantly, and the von Neumann paradigm offers no hope of resolving this drawback. Most probably, the key drawbacks lie in the CPU architecture, the task scheduler implementation, and the interaction of OS subsystems during process/thread management. The memory management subsystem may also play an important role that affects the performance of I/O operations and synchronization primitives.
References

[1] pmem.io: Persistent Memory Programming.
[2] COBURN, J., CAULFIELD, A. M., AKEL, A., GRUPP, L. M., GUPTA, R. K., JHALA, R., AND SWANSON, S. NV-Heaps: Making persistent objects fast and safe with next-generation, non-volatile memories. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2011), ASPLOS XVI, ACM, pp. 105-118.
[3] DULLOOR, S. R., KUMAR, S., KESHAVAMURTHY, A., LANTZ, P., REDDY, D., SANKARAN, R., AND JACKSON, J. System software for persistent memory. In Proceedings of the Ninth European Conference on Computer Systems (New York, NY, USA, 2014), EuroSys '14, ACM, pp. 15:1-15:15.
[4] NALLI, S., HARIA, S., HILL, M. D., SWIFT, M. M., VOLOS, H., AND KEETON, K. An analysis of persistent memory use with WHISPER. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2017), ASPLOS '17, ACM, pp. 135-148.
[5] VOLOS, H., TACK, A. J., AND SWIFT, M. M. Mnemosyne: Lightweight persistent memory. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2011), ASPLOS XVI, ACM, pp. 91-104.
[6] XU, J., AND SWANSON, S. NOVA: A log-structured file system for hybrid volatile/non-volatile main memories. In 14th USENIX Conference on File and Storage Technologies (FAST 16) (Santa Clara, CA, 2016), USENIX Association, pp. 323-338.
[7] XU, J., ZHANG, L., MEMARIPOUR, A., GANGADHARAIAH, A., BORASE, A., DA SILVA, T. B., SWANSON, S., AND RUDOFF, A. NOVA-Fortis: A fault-tolerant non-volatile main memory file system. In Proceedings of the 26th Symposium on Operating Systems Principles (New York, NY, USA, 2017), SOSP '17, ACM, pp. 478-496.
[8] ZHANG, Y., AND SWANSON, S. A study of application performance with non-volatile main memory. In 2015 31st Symposium on Mass Storage Systems and Technologies (MSST) (2015), IEEE, pp. 1-10.
[9] CHIA-CHE TSAI, BHUSHAN JAIN, NAFEES AHMED ABDUL, AND DONALD E. PORTER. A study of modern Linux API usage and compatibility: What to support when you're supporting. In Proceedings of the Eleventh European Conference on Computer Systems (EuroSys '16), ACM, New York, NY, USA, Article 16, 16 pages.
[10] MOJTABA BAGHERZADEH, NAFISEH KAHANI, COR-PAUL BEZEMER, AHMED E. HASSAN, JUERGEN DINGEL, AND JAMES R. CORDY. Analyzing a decade of Linux system calls. Empirical Software Engineering, June 2018, Volume 23, Issue 3, pp. 1519-1551.
[11] JOSH TRIPLETT, PHILIP W. HOWARD, ERIC WHEELER, AND JONATHAN WALPOLE. Avoiding system call overhead via dedicated user and kernel CPUs, 2010. [Online]. Available: http://web.cecs.pdx.edu/~walpole/papers/osdi2010paper.pdf , Accessed on: Nov. 27, 2018.
[12] E. VICENTE, R. MATIAS, L. BORGES, AND A. MACÊDO. Evaluation of compound system calls in the Linux kernel. 2011 Brazilian Symposium on Computing System Engineering, Florianopolis, 2011, pp. 164-169.
[13] LIVIO SOARES AND MICHAEL STUMM. FlexSC: Flexible system call scheduling with exception-less system calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI '10), USENIX Association, Berkeley, CA, USA, pp. 33-46.
[14] BRENDEN KOKOSZKA, PATRICK DONNELLY, AND DOUGLAS THAIN. Search should be a system call, 2013. [Online]. Available: http://ccl.cse.nd.edu/research/papers/search-tr.pdf , Accessed on: Nov. 27, 2018.
[15] MOHAN RAJAGOPALAN, SAUMYA K. DEBRAY, MATTI A. HILTUNEN, AND RICHARD D. SCHLICHTING. System call clustering: A profile-directed optimization technique. [Online]. Available: , Accessed on: Nov. 27, 2018.
[16] DONALD E. PORTER AND EMMETT WITCHEL. Transactional system calls on Linux. [Online]. Available: , Accessed on: Nov. 27, 2018.
[17] YEAYEUN PARK, MARK SERRANO, AND CRAIG PENTRACK. BatTrace: Android battery performance testing via system call tracing. [Online]. Available: , Accessed on: Nov. 27, 2018.
[18] PETER OKECH, NICHOLAS MC GUIRE, CHRISTOF FETZER, AND WILLIAM OKELO-ODONGO. Investigating execution path non-determinism in the Linux kernel, 2013. [Online]. Available: , Accessed on: Nov. 27, 2018.
[19] NODIR KODIROV AND RJ SUMI. Study on the file system calls of desktop applications. [Online]. Available: , Accessed on: Nov. 27, 2018.
[20] KARAN AGGARWAL, CHENLEI ZHANG, JOSHUA CHARLES CAMPBELL, ABRAM HINDLE, AND ELENI STROULIA. The power of system call traces: Predicting the software energy consumption impact of changes. In Proceedings of the 24th Annual International Conference on Computer Science and Software Engineering (CASCON '14), IBM Corp., Riverton, NJ, USA, 2014, pp. 219-233.
[21] strace. [Online]. Available: https://strace.io/ , Accessed on: Feb. 27, 2019.
[22] F2FS tools. [Online]. Available: https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git