Thread Evolution Kit for Optimizing Thread Operations on CE/IoT Devices
Geunsik Lim, Student Member, IEEE, Donghyun Kang, and Young Ik Eom
Abstract—Most modern operating systems have adopted the one-to-one thread model to support fast execution of threads in both multi-core and single-core systems. This thread model, which maps kernel-space and user-space threads in a one-to-one manner, supports quick thread creation and termination in high-performance server environments. However, the performance of time-critical threads is degraded when multiple threads are run on low-end CE devices with limited system resources. When a CE device runs many threads to support diverse application functionalities, low-level hardware specifications often lead to significant resource contention among the threads trying to obtain system resources. As a result, the operating system encounters challenges, such as excessive thread context switching overhead, execution delay of time-critical threads, and a lack of virtual memory for thread stacks. This paper proposes a state-of-the-art Thread Evolution Kit (TEK) that consists of three primary components: a CPU Mediator, Stack Tuner, and Enhanced Thread Identifier. From the experiments, we can see that the proposed scheme significantly improves user responsiveness (7x faster) under high CPU contention compared to the traditional thread model. Also, TEK solves the segmentation fault problem that frequently occurs when a CE application increases the number of threads during its execution.
Index Terms—Thread model, thread optimization, thread stack, thread scheduling, thread manager.
I. INTRODUCTION

As digital consumer electronics (CE) devices such as smart refrigerators [1] and smart televisions become common, it is important for traditional software layers to be optimized to mitigate the limitations of CE devices [2]–[4]. Nowadays, such devices are generally called IoT devices because they interoperate with each other through internet facilities and sensor modules. Meanwhile, traditional operating systems of computing systems have adopted a model of one-to-one mapping between kernel-space and user-space threads because it allows opportunities for improving the scalability and performance of the system [5]–[15]. Unfortunately, this model does not fit CE/IoT devices that have lower hardware specifications, because the model incurs some problems in that
This work was partly supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (IITP-2015-0-00284, (SW Starlab) Development of UX Platform Software for Supporting Concurrent Multi-users on Large Displays) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2017R1A2B3004660). (Corresponding author: Young Ik Eom.)
G. Lim and Y. I. Eom are with the Department of Electrical and Computer Engineering, Sungkyunkwan University, 2066, Seobu-ro, Jangan-gu, Suwon, South Korea (e-mail: {leemgs, yieom}@skku.edu). D. Kang is with Changwon National University, 20 Changwondaehak-ro, Uichang-gu, Changwon-si, Gyeongsangnam-do (51140), South Korea (e-mail: [email protected]).
the threads running on CE/IoT devices often unintentionally spend a significant amount of time taking the CPU resource, and the frequency of context switches rapidly increases due to the limited system resources, degrading the performance of the system significantly. In addition, since CE/IoT devices usually have limited memory space, they may suffer from the segmentation fault [16] problem incurred by memory shortages as the number of threads increases and they remain running for a long time.
Some engineers have attempted to address the challenges of IoT environments such as smart homes by using better hardware specifications for CE/IoT devices [3], [17]–[21]. Unfortunately, this approach is inefficient and expensive because the high-performance hardware requirement increases the manufacturing costs of CE/IoT devices. Other researchers and engineers have implemented dual-version applications: a generic version for normal computing systems and a light-weight version for CE/IoT systems [20]. However, this approach also increases the cost of maintaining the applications because both versions of the software code must be modified whenever a software update is made. Meanwhile, in traditional systems, there is no concept of thread priority, and thus, it is difficult to identify time-critical threads that require quick responsiveness.
As a result, when many active threads are running at the same time, all the threads, including those that are time-critical, indiscriminately compete for system resources. This leads to performance collapse along with dropping user responsiveness.
This paper proposes a Thread Evolution Kit (TEK) that adopts the modern one-to-one thread model for CE/IoT devices while leveraging its benefits in the user space. TEK is composed of three primary components: a CPU Mediator, Stack Tuner, and Enhanced Thread Identifier. First, we designed the CPU Mediator to help developers set the priority of time-critical threads. TEK is also implemented with a priority-based scheduler that isolates high-priority threads (i.e., time-critical threads) from normal ones. The goal of the Stack Tuner is to determine how much memory space should be allocated for thread stacks. It is also used to avoid the problem of coarse-grained stack management, in which existing operating systems give each thread more stack memory than is actually used by the thread. To implement the Stack Tuner, this paper revisited the existing stack management mechanism and designed a scheme that can automatically assign an appropriate stack size by obtaining the thread's actual stack usage. With the proposed scheme, software developers for CE/IoT devices can easily use TEK to prevent virtual memory shortages. Meanwhile, to employ TEK, software developers must specify information on the threads using the special APIs provided by TEK. In order to correctly handle each thread, the Enhanced Thread Identifier inspects this information whenever the program codes are compiled. For evaluation, TEK was implemented on a CE/IoT development board and compared with the conventional system.
Fig. 1. Three types of thread models. Popular operating systems [5], [22]–[24] adopt the one-to-one thread model as their major thread model.
Surprisingly, the results show that the response time of time-critical threads was accelerated by up to 7x, and the amount of memory space was reduced by up to 3.4x, compared with the conventional software platform.
The remainder of this paper is organized as follows. Section II describes the strong and weak points of each thread model. Section III discusses the observation results of conventional thread operations on CE/IoT devices. Section IV presents the design and implementation of the proposed schemes, and Section V shows the evaluation results. Related work is described in Section VI, and finally, Section VII concludes the paper.

II. BACKGROUND
This section compares the strong and weak points of the three major thread models: many-to-one, many-to-many, and one-to-one, as shown in Fig. 1. Then, it addresses how modern operating systems such as Linux have evolved their thread mapping model.
A. Comparison of Thread Models
The model for thread mapping between the user space and kernel space has a great influence on the behavior of the threads from their creation to termination. Fig. 1 shows the different operation flows of the common thread models and how each works. This section addresses the features, strengths, and weaknesses of each thread model.
Many-to-one thread model:
Many operating systems have commonly used this model as it allows for simple implementation and portability. In addition, this model provides much quicker context switching compared with other models because all threads are managed in the user space. Unfortunately, when this model is applied, a thread can block the flow of other threads for a long time. For example, if one thread triggers a system call that leads to resource contention, other threads of the same application must also wait until that thread gets the resource. Moreover, this model cannot fully take advantage of the benefits of a multi-processor architecture because it handles only one application with a single processor, even though the application is composed of multiple threads [12], [13].
Many-to-many thread model:
This model allows for parallel mapping between user-space and kernel-space threads. In this model, most context switching operations occur in the user space; thus, this model shares kernel objects among the user-space threads. However, this model requires more effort to implement than the many-to-one model due to its complicated structure. Also, since priority is assigned to the threads through the thread manager in the user space, the threads cannot directly access the kernel objects to obtain CPU resources [25], [26].
One-to-one thread model:
Most modern operating systems employ the one-to-one thread model. This model manages threads directly in the kernel space so that kernel-level scheduling is possible for each thread. In multi-processor systems, multiple threads can run on different CPUs simultaneously, and there may be no delay, even during blocking system services. When one thread calls a blocking system call, the other threads do not need to be blocked. However, since this model performs context switching in the kernel space, the context switching operation is relatively slow compared to the many-to-one model. Also, with each thread taking up kernel resources, if an application creates many threads, the performance of the threads depends heavily on the hardware specifications [5], [27], [28].
It is now common for an application to be developed by dozens of developers due to the increased size and complexity of applications, and so, the number of threads in modern CE/IoT devices has increased from tens to hundreds. Meanwhile, when the number of user-space threads in one application increases to hundreds, it is difficult for an application developer to know the roles of threads created by other developers because the existing thread models do not provide developers with a mechanism to monitor thread functionalities [27]–[29]. As a result, it becomes increasingly difficult to control and optimize their behavior.
B. Evolution of Linux Thread Model
In recent decades, developers implemented multi-threaded applications with the many-to-one thread model. The many-to-one thread model made thread implementation easy by providing application developers with intuitive user-space thread operations, as discussed in Section II-A. However, user-level threads based on the many-to-one thread model could not utilize multi-processors directly without kernel-level support from the operating system. As a result, with the advent of multi-processor systems, operating systems added kernel-level support for threads [5], [13].
The Linux kernel introduced the Fast User-level muTEX (FUTEX) [25] to support light and fast synchronization. It can be used when threads access shared resources in a multiprocessor environment. The system calls of the FUTEX facility provide application developers with a fast user-level synchronization mechanism [30]–[32]. The one-to-one thread model of Linux supports real-time FUTEX behavior for user-space threads. The Linux kernel then adopted LinuxThreads, which uses the clone system call to enable user-space real-time thread operations [5], [25]. Since the clone system call of the one-to-one thread model provides a mechanism that tracks threads to speed up thread creation, Linux dramatically improves the scalability and performance of thread execution. Also, now that the one-to-one thread model eliminates the user-level thread manager for fast thread operation, it both simplifies the operation flow of the threads and significantly accelerates the execution speed of thread termination. As a result, applications no longer depend on a thread manager that may cause context switching and performance degradation [5], [10], [33].
One-to-one threaded programs can run even faster since Linux introduced the O(1) scheduler, which consists of run queues and bitmap priority arrays. This improves scalability without adding a performance penalty in multi-processor environments [6], [34].
As a result, when the O(1) scheduler allowed threads to run quickly in high-performance multi-core server environments, modern operating systems adopted the one-to-one thread model as their standard thread model.
Finally, the one-to-one thread model safely solves existing POSIX compliance problems because it performs signal handling of the threads in kernel space. Moreover, because Linux maps each user-space thread to a Light-Weight Process (LWP) in kernel space, it completely links the system resource usage of the thread to that of the LWP of the Linux kernel. On Linux, LWPs refer to processes sharing the same memory address space and other resources in the kernel space. Therefore, the thread library can correctly monitor the thread behaviors of an application using the pseudo filesystems (e.g., procfs and sysfs) [35] in Linux.
Even though the Next-Generation POSIX Threads (NGPT) [25] proposed a many-to-many model in which many user threads are mapped onto many kernel threads, unlike the existing one-to-one thread model of the Linux kernel, it cannot solve all the problems of the user-space library threads. The main reason that the traditional LinuxThreads was used as the dominant thread library for so long is that the kernel-level threads of the operating system solve fault handling and thread performance problems.

III. OBSERVATION
This section discusses the observation that modern CE/IoT devices bring new challenges to the existing one-to-one thread model.
A. CPU Contention Among the Threads on Low-End CE/IoT Devices
The latest Linux kernel supports two types of CPU schedulers: the O(1) scheduler and the Completely Fair Scheduler (CFS) [21]. The O(1) scheduler provides CPU scheduling of real-time threads with a First-In-First-Out (FIFO) or Round-Robin (RR) scheduling policy, and controls each thread according to its fixed priority. On the other hand, the CFS scheduler supports fair scheduling of non-real-time threads with a NORMAL
Fig. 2. The response time of two time-critical threads in an urgent group during CPU contention. The gray rectangle represents a region where threads compete for available CPU resources. Time-critical thread 412 probes the current temperature from a thermal peripheral device, while time-critical thread 329 displays the temperature.
scheduling policy, controlling each thread according to its dynamic priority [36]. The fixed priority does not change while scheduling threads, strictly maintaining the scheduling sequence. On the contrary, the dynamic priority changes from time to time while scheduling threads because CFS dynamically recalculates the weight values of the threads in the system. Typically, user-space applications create new threads that are controlled by the NORMAL scheduling policy, and they are scheduled in a time-sharing manner. In detail, the NORMAL scheduling policy manages CPU resources using the virtual run-time [37] with a red-black tree, which is usually used for efficient self-balancing binary search [38], to ensure that all threads use CPU resources fairly [36]. The CFS scheduler adopts the notion of virtual run-time, which should be equal for all tasks, where the virtual run-time normalizes the value of the real run-time of a given thread with its nice value (user-level thread priority). The red-black tree, in combination with the virtual run-time, ensures that higher-priority tasks gain access to CPU resources more frequently without starving the lower-priority tasks.
However, the CFS scheduler does not consider low-end CE/IoT devices, especially when the CE/IoT device has to execute time-critical threads with more favor under high CPU contention. The nice value of -20 to 19 entered by the application developer is mapped onto the 40 weight values defined by CFS as an array.
For example, if a developer creates four threads with nice values of 1, 2, 3, and 4, the CPU usage becomes 26.5%, 25.5%, 24.5%, and 23.5%, respectively, because the operating system applies equitable weight values to the threads. In other words, the CFS scheduler focuses on fair resource distribution among the threads in the system. Therefore, if the threads frequently compete for available CPU resources, the existing scheduler increases the response time of the time-critical threads for which user responsiveness is important.
Fig. 2 shows the delay in processing time-critical threads that measure the current temperature during high CPU (embedded quad-core 1.2 GHz CPU) contention in a real CE/IoT system environment such as a refrigerator. In the experiment, the time-critical threads and the background service threads in the same thread group were intensely competing with one another for CPU resources. As a result, the system frequently reproduced unpredictable processing delays for the time-critical threads. Even though high-end hardware (e.g., an x86 CPU) can minimize the frequency of CPU contention compared to low-end hardware (e.g., an embedded CPU), most CE/IoT devices require an energy-efficient CPU for low power consumption and a smaller die size. Therefore, it is crucial to optimize the processing speed of time-critical threads under high CPU contention on low-end devices. Section IV describes the design and implementation of the technique proposed to solve this problem in detail.
Fig. 3. The actual stack usage of 258 user-space threads on a CE/IoT device.
B. Segmentation Fault of New Threads on CE/IoT Devices
In general, a segmentation fault occurs when a running thread accesses an illegal memory location or tries to write to a read-only memory location [16]. Surprisingly, it is observed that there is another case where segmentation faults are generated while running an application. When an application requests the creation of a new thread, the operating system builds a new stack in the virtual memory space and then clones the contexts of the parent process, such as code and data, to the new thread. However, whenever a thread is created, the existing operating system gives each thread more stack memory than the amount of space the thread actually uses (e.g., in Linux, a stack space of 8 MB or 12 MB is automatically assigned to each thread). Therefore, this coarse-grained stack management problem, which induces a lack of system stack space, may worsen over time. Unfortunately, on 32-bit architectures, it is not easy to solve this problem because an application can use a total of only 3 GB for the user-space area. For example, if an application simultaneously runs 300 threads with an 8 MB stack for each thread, the existing operating system may incur a segmentation fault when creating a new thread. Of course, the frequency of segmentation faults depends somewhat on how the software platform handles the virtual memory area.
Fig. 3 shows the actual stack usage of 258 user-space threads that run on a CE/IoT device (1 GB LPDDR2 900 MHz RAM, 4 GB virtual memory with a memory management unit) with coarse-grained stack management. In Linux, a new
TABLE I
THE POSIX APIS FOR CPU SCHEDULING OF PROCESSES AND THREADS

API name               | Arguments                            | Who     | Type
nice                   | (1) inc                              | Process | Syscall
setpriority            | (2) which, (3) who, (4) prio         | Process | Syscall
pthread_setschedparam  | (5) thread, (6) policy, (7) priority | Thread  | Libcall

(1) inc: a nice value for the calling process. (2) which: PRIO_PROCESS, PRIO_PGRP, or PRIO_USER. (3) who: a process group or real user ID of the calling process. (4) prio: a value in the range -20 to 19. The default priority is 0. (5) thread: a thread ID. (6) policy: a scheduling policy (e.g., SCHED_TEK for the CPU Mediator). (7) priority: a scheduling priority (e.g., a nice value for SCHED_TEK).

thread requires a minimum stack size of 16 KB to establish the data structure of the user-level thread. As shown in Fig. 3, most of the threads are allocated a much higher stack size than they use in reality. From the analysis, although all threads run tasks with the default stack size (8 MB) of the system, more stack space is allocated for the UI Manager (TID 446) and Media Controller (TID 492). On the contrary, less stack space is allocated for the threads with TID 428, 512, and 592. The reason for this is that the developers employ the pthread_attr_setstacksize
API [11] in order to directly manipulate the stack space of the threads.
These observations give further motivation to propose TEK because it is believed that it is possible to resolve the resource management problems incurred by excessive resource contention while retaining the one-to-one mapping model between kernel-space and user-space threads.

IV. TEK: DESIGN AND IMPLEMENTATION

Fig. 4. Overall architecture and operation flow of TEK.
This section introduces the Thread Evolution Kit (TEK), designed in the same spirit as the traditional one-to-one thread model to handle application threads in the user space. However, to support application development in CE/IoT environments without re-designing the existing one-to-one thread model, TEK optimizes the previous thread model with three key components:
• CPU Mediator (Section IV-A): This component supports fine-grained thread management in which each thread is handled based on the priority given by the application developers.
• Stack Tuner (Section IV-B): The goal of this component is to optimally allocate stack memory in the virtual address space whenever an application creates a new thread.
• Enhanced Thread Identifier and New APIs (Section IV-C): This component is responsible for handling the hundreds of threads running on a CE/IoT device and provides new APIs to designate a thread as time-critical or non-time-critical.
Fig. 5. Thread programming model with SCHED_TEK of the CPU Mediator for accelerating the response time of time-critical threads.
Fig. 4 shows how applications are managed with TEK. TEK provides application developers with POSIX-compatible thread APIs (e.g., pthread_setschedparam) that support optimization of resource management on both low-end CE/IoT devices and existing high-end server systems (i.e., in TEK, the improved pthread_setschedparam
API is used to run a unified application that is compatible with both low-end and high-end devices). This paper now discusses the key components of TEK in detail.
A. CPU Mediator
The existing software layer for CE/IoT devices was designed to handle threads with a group scheduling policy (i.e., coarse-grained thread management). Because modern applications create more and more threads, this technique can effectively control CPU resources by grouping the threads of each application. However, this coarse-grained thread management technique may be harmful in modern CE/IoT environments, in which threads may require fast responsiveness, because it is not easy to predict which thread will run next.
The CPU Mediator is designed to support fast and predictable thread execution. In particular, the CPU Mediator classifies all threads running on a CE/IoT device into two categories according to their priority: time-critical and non-time-critical. The scheduling priority ((7) in Table I) of each thread is set by calling the APIs supported by TEK, as described in Table I, and is not adjusted until the thread terminates. For time-critical threads, this paper further implements a new scheduler policy, called SCHED_TEK ((6) in Table I), that offers more chances to obtain CPU resources by delaying non-time-critical threads.
This paper considers an example scenario in which time-critical threads are processed. When an application runs time-critical threads to guarantee fast response time, the CPU Mediator changes the policy of the scheduler to SCHED_TEK by calling the pthread_setschedparam
API in user space. Fig. 5 shows the logical thread migration flow of the CPU Mediator along with the SCHED_TEK policy. The SCHED_TEK policy controls the CPU resources according to the following two steps. First, the CPU Mediator looks up the time-critical threads in the group where the user-space thread lies, based on its kernel-space thread ID, and then logically migrates them to the Fast Region, where the probability of obtaining CPU resources is relatively high, as shown in Fig. 5. Second, the CPU Mediator dynamically drops the priority of each non-time-critical thread running on the CE/IoT device to yield CPU resources to the time-critical threads. Then, it logically migrates all non-time-critical threads to the Lazy Region, where the execution of the threads is delayed until the Fast Region becomes empty. The purpose of the Fast Region is to accelerate the processing speed of time-critical threads, while that of the Lazy Region is to delay the other threads. These regions link or unlink the threads of the existing groups with a doubly-linked list. After the time-critical threads in the Fast Region terminate, the CPU Mediator unlinks the threads belonging to the Lazy Region and puts them back into their original groups, instantly restoring their scheduling policy.
B. Stack Tuner
As the number of threads increases, segmentation faults may occur frequently due to the lack of system stack space in virtual memory. Traditional operating systems always allocate stack space as requested by the thread, regardless of how much stack space is actually used by the thread at runtime; this situation is very similar to the internal fragmentation issue in physical memory. For example, if a thread running for an application only uses 1 MB of stack space after being allocated 100 MB of stack space, a significant amount of virtual memory (i.e., 99%) is wasted. Therefore, the coarse-grained stack management technique mentioned in Section III-B may accelerate the lack of system stack space over time.
The kernel-space thread ID is acquired by using the modified gettid() system call, where this paper replaces FUTEX with the Read-Copy-Update (RCU) mechanism because it is more suitable for read-intensive operations. The pthread_attr_setstacksize() API is used to explicitly allocate the stack memory space.
Fig. 6. The stack management structure of the Stack Tuner used to avoid a shortage of virtual memory.
To address the lack of system stack space, this paper designed the Stack Tuner to monitor the stack space during the lifetime of each thread. In order to measure the stack usage of each thread, the Stack Tuner periodically obtains information from the procFS [35] filesystem and records the peak stack usage of each thread in the
Thread Information Table, which will be discussed in the next section. Based on the recorded stack usage, the Stack Tuner automatically gives each thread suitable stack space to optimize the memory usage of the applications. For exact guidelines, this paper additionally configured three types of zones in the stack space: Low Zone, Normal Zone, and High Zone. Fig. 6 shows how the stack space of each thread is classified. If the peak stack usage of a thread belongs to the Low Zone or High Zone, the Stack Tuner informs the developers, allowing them to fix the stack space requirement at the next compilation. To deliver this information, this paper modifies the Glibc [39] library, which is well-established as the standard library for handling system calls in Linux. In the Low Zone, the Stack Tuner points out that the thread is wasting the virtual memory space of the application. On the other hand, if the peak stack usage of the thread reaches the High Zone range, the Stack Tuner warns that the thread may end up with a stack overflow in the near future. Finally, the Stack Tuner puts a Guard Page at the end of the thread's stack to detect stack overflow.
C. Enhanced Thread Identifier and New APIs
The main purpose of the Enhanced Thread Identifier is to easily identify the characteristics and attributes of each thread among the hundreds of threads running on a CE/IoT device. (The default size of each zone is configured by the configuration file at the boot time of the CE/IoT device.)
Fig. 7. The operation flow of the Enhanced Thread Identifier.
In reality, an application calls the pthread_create
API to create a new thread, and the new thread just executes the thread function specified by the third argument of the pthread_create API [11], [13], [15], [22]. As a result, it is not easy to determine the role of a thread from the thread ID only. To enhance the identification of a thread, the Enhanced Thread Identifier records the information on a new thread created by an application in an auxiliary table, called the
Thread Information Table.
Fig. 7 shows the structure and relationship of the components of the Enhanced Thread Identifier. The Enhanced Thread Identifier extracts the thread attributes from the parameters of the pthread_create or pthread_set_attributes API, and then records them in the Thread Information Table on-the-fly. The thread attributes include information on the role of the thread.
Application threads often call a function that connects to a sensor device to receive data from it (e.g., a humidity sensor, temperature controller, air pressure sensor, or gas detection sensor). For example, a developer can set "gas detection" as the role of a thread with the pthread_set_attributes
API in order to easily look up the gas detection thread running on a device. The Thread Monitor in Fig. 7 periodically collects thread information on the running threads (e.g., scheduling policy, scheduling priority, thread creation time, stack size, and virtual memory size) from the procFS [35] filesystem. The peak usage of the stack space of each thread is measured in this way.
Meanwhile, this paper designed novel APIs to set or get the attributes of a thread in the Thread Information Table. Application developers can mark a thread as time-critical or non-time-critical by calling the pthread_set_attributes API. When this function is called in the user space with a thread ID and its attributes, the Enhanced Thread Identifier searches for the thread ID in the Thread Information Table and saves the thread attributes, including its priority and scheduling policy, into the table. On the other hand, the pthread_get_attributes API is used to return the attributes of the thread.
The Thread Information Table requires additional memory space to store thread attributes. Considering that modern CE/IoT applications usually run more than 300 threads and the Enhanced Thread Identifier allocates 40 bytes to store the thread attributes of each thread, the Enhanced Thread Identifier requires just 12 KB (300 threads multiplied by 40 bytes) of additional memory space for 300 threads, and so, the additional memory cost is not significant. Also, considering that the read and write operations for managing the thread information are completed within 46 ns and 67 ns, respectively, when the Enhanced Thread Identifier saves the thread information to a typical memory device (1 GB LPDDR2 900 MHz), it has little effect on the thread performance of the devices.

TABLE II
SYSTEM CONFIGURATION FOR EXPERIMENTS

Content | Item             | Specifications
H/W     | CPU              | Embedded quad-core 1.2 GHz CPU
        | RAM              | 1 GB LPDDR2 (900 MHz)
        | Storage          | 32 GB MicroSD
        | Sensor Interface | GPIO 40-pin header
S/W     | OS               | Linux 4.4.15 32-bit (LTS)
        | Virtual memory   | 1 GB kernel space and 3 GB user space
        | Compiler         | GCC 9.1
        | C library        | Glibc 2.29
        | Thread Model     | NPTL (Native POSIX Thread Library) [5]

Fig. 8. The context switching time of the threads.

V. EVALUATION
This section introduces the experimental environment and then explores how the proposed scheme, TEK, improves not only the response time of the time-critical threads but also the utilization of memory space. In particular, the evaluations in this section answer the following questions: (1) what is the difference between TEK and the conventional system in terms of context switching? (Section V-B); (2) where does the improvement in response time come from when TEK is enabled on CE/IoT devices? (Section V-C); and (3) how does TEK contribute to stack management at the kernel level? (Section V-D).
A. Experimental Setup
A prototype of TEK was implemented on a commercial CE/IoT device with an embedded quad-core CPU and 1 GB of memory running Linux kernel 4.4. Table II summarizes the evaluation setup in detail. The benefits of TEK were compared with the conventional kernel, wherein the CPU scheduler provides coarse-grained control of threads and memory management is based on a fixed-size stack space. In this paper, all evaluations were conducted by categorizing threads as time-critical or non-time-critical so as to understand the performance difference of the events triggered by users. If a
thread frequently handles user-level events during a short time period, it is considered to be time-critical because it requires a short response time. Otherwise, it is considered to be non-time-critical. The evaluation results were measured during the creation of 2000 threads after finishing the boot procedure.

Fig. 9. The frequency of context switching operations on time-critical threads.
Fig. 10. The evaluation results of the user-space SCHED_TEK policy for improving user responsiveness of time-critical threads under CPU contention.
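The time-critical classification above can be expressed as a simple event-frequency heuristic. The sketch below is our own illustration of that idea; the window length, threshold, and names are assumptions, not TEK's actual parameters.

```c
#include <stdbool.h>
#include <stdint.h>

#define WINDOW_NS 100000000ULL /* 100 ms observation window (assumed) */
#define THRESHOLD 5            /* events per window (assumed)         */

struct event_counter {
    uint64_t window_start_ns; /* start of the current window        */
    uint32_t events_in_window;/* user-level events seen this window */
    bool     time_critical;   /* classification from last window    */
};

/* Record one user-level event; re-classify when the window expires. */
static void record_event(struct event_counter *c, uint64_t now_ns)
{
    if (now_ns - c->window_start_ns > WINDOW_NS) {
        c->time_critical = c->events_in_window > THRESHOLD;
        c->window_start_ns = now_ns;
        c->events_in_window = 0;
    }
    c->events_in_window++;
}
```

A thread that bursts many events inside one window is flagged time-critical at the next window boundary; an idle thread naturally decays back to non-time-critical.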
B. Context Switching of Threads
This paper first focuses on the performance of TEK in terms of context switching, since a performance drop may be caused by the use of fine-grained thread management along with the thread's priority. Fig. 8 shows the experimental results for the context switching time. In the figure, the x-axis represents the thread ID of the time-critical threads, and the y-axis represents their context switching time. The proposed scheme shows results similar to the conventional scheme even though it includes more behaviors for categorizing threads into time-critical and non-time-critical ones. This is possible because the proposed scheme offloads the operations for thread classification to the CFS scheduler by logically migrating the threads between the regions implemented with the doubly-linked list, as described in Section IV-A.

Meanwhile, Fig. 9 shows the number of context switching operations centered around the time-critical threads. This figure confirms that TEK significantly reduces the number of context switching operations compared with the conventional method. In the best case, TEK reduces the number of context switching operations by up to 41%. The reason behind this is that TEK assigns more CPU time to the time-critical threads by isolating them from the group scheduling policy. As a result, the time-critical threads can speed up their response time by up to 42% compared with that of the conventional system. In summary, TEK provides more opportunities to time-critical threads in terms of CPU scheduling with little overhead. In addition, TEK does not require any modifications to the conventional one-to-one thread model because it uses the API of the conventional thread model.

Fig. 11. Virtual memory consumption for each stack size of the threads.
Fig. 12. The experimental results for segmentation fault frequency while creating threads.
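Counts like those behind Fig. 9 can be approximated in user space. On Linux, one generic way (this is a measurement sketch, not TEK's actual instrumentation) is the thread-scoped getrusage() counters:

```c
#define _GNU_SOURCE /* for RUSAGE_THREAD (Linux-specific) */
#include <sys/resource.h>

/* Returns the total number of voluntary plus involuntary context
 * switches experienced so far by the calling thread, or -1 on error. */
long context_switches(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_THREAD, &ru) != 0)
        return -1;
    return ru.ru_nvcsw + ru.ru_nivcsw;
}
```

Sampling this counter before and after a workload phase gives the per-thread switch count for that phase, which is how a figure like Fig. 9 could be reproduced on the evaluation device.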
C. Response Time of User-Level Event
This section discusses the improvement obtained in the response time of the time-critical threads using the TEK user-space thread manager. To measure the response time of the time-critical threads, the experiment was conducted under a real-life scenario that included two steps. As described in Section IV-A, modern operating systems can limit the CPU usage of the running threads by classifying them into four different scheduling groups: urgent, normal, service, and background [40]. The first step in the scenario creates threads that are evenly deployed to each thread group so as to create a CPU-intensive situation. Then, the second step measures the response time of each thread when it handles a user-level event, such as a touch screen input on the CE/IoT device. In other words, the response time of each time-critical thread was measured during 100% CPU utilization.

Fig. 10 shows the evaluation results of TEK compared with the conventional system. As shown in the figure, the conventional system leads to latency fluctuations; the average response time was 1719 ms when all of the time-critical threads competed for the CPU. This is because the non-time-critical threads can frequently stagnate and even hang the time-critical threads. In contrast, TEK shows steady performance results, dramatically reducing the average response time of the time-critical threads to 235 ms. These results are meaningful because there was a lot of competition for CPU resources in the thread group. The reason behind such significant improvements is that the CPU Mediator efficiently isolates and handles time-critical threads in terms of the CPU resources. Unfortunately, the CPU Mediator causes a negative impact on the performance of non-time-critical threads because they can be preempted to yield resources to the time-critical threads. In the worst case, the response time of the non-time-critical threads was stalled by up to 1487 ms.
However, it is important to mention that the non-time-critical threads, such as software update threads, reserved task threads, and system management threads, do not need to react to user activities on-the-fly.
D. Stack Management of Threads
Generally, whenever one thread is created, the memory manager in the kernel allocates a fixed-size chunk of stack memory for the created thread. Unfortunately, if the thread uses a smaller region of memory than the allocated chunk, the unused memory space is wasted. This unused memory space may indirectly cause a segmentation fault because it leads to a shortage of free memory. On the other hand, as mentioned, TEK efficiently allocates a stack memory chunk that best fits the thread using the Stack Tuner. Fig. 11 shows the accumulated usage of stack memory allocated to the created threads. For accurate evaluation, the data in Fig. 11 were monitored during the creation of a total of 273 threads over 15 days. As expected, Fig. 11(a) clearly confirms that TEK uses a much smaller amount of memory space than the conventional system. Also, TEK allocated only 70 MB of memory, even though a total of 273 threads were running simultaneously, as shown in Fig. 11(b).

Fig. 12 plots the number of segmentation faults that actually occurred while running the threads on the experimental CE/IoT device. As mentioned before, conventional systems employ a fixed-size chunk of memory to support the creation of a new thread, and therefore, the evaluation was performed by varying the size of the fixed-size stack space from 2 MB to 8 MB. In the conventional system, the evaluation results clearly show that the number of segmentation faults increased significantly with an increase in the number of threads. In particular, after the number of accumulated threads reached 200, the conventional system became unstable and could not guarantee a stable response for the creation of a new thread because of the exception handling of the segmentation fault. On the other hand, TEK maintained good conditions over most of the experiment because the proposed system supports fine-grained stack allocation that allocates exactly the stack memory space actually used by the thread.
Of course, TEK also wastes a small portion of memory because of page alignment. To understand how much memory space is wasted, the amount of allocated memory was monitored on both TEK and the conventional system. As shown in Fig. 11, TEK used only 39% (70528 KB) of the stack memory space thanks to the Stack Tuner, while the conventional system used 4300% (2236416 KB), compared with the stack memory space actually used by the threads (50828 KB).

VI. RELATED WORK
This section summarizes prior work to clarify the differences between the proposed scheme and conventional systems in terms of the thread model, thread performance, and thread management.
Thread Model and Performance:
Many studies have been performed to enhance the thread model. NPTL [5] pointed out the scalability issues of the Linux scheduler, which the O(1) Linux scheduler addressed on both multi-core and single-core architectures. In particular, NPTL designed the FUTEX synchronization mechanism to support the one-to-one thread model without additional overhead. Wong [6] presented the CFS scheduler, which ensures a fair allocation of CPU resources to tasks without sacrificing interactive performance; as a result, this scheduler replaced the O(1) scheduler [6] in the Linux kernel. Meanwhile, PK [12] focused on the performance of threads and proposed a concurrency model based on POSIX threads (Pthreads) to improve thread performance, including real-time threads. Engelschall [13] described a portable multi-threading mechanism that supports the expeditious creation and execution of threads during the simultaneous execution of multiple threads. In addition, to achieve backward compatibility, Engelschall developed a mechanism based on ANSI C on the Unix system. In summary, all of the above studies focused on improving the performance of threads in high-performance computing environments equipped with large-scale hardware resources. Therefore, they differ from TEK, which targets small-scale hardware environments, like CE/IoT devices.
Thread Management:
Adya [33] focuses on cooperative task management to guide the concurrency conditions of the system for program architects. A prototype of the cooperative task management method was implemented based on the event-driven approach so as to meet the requirements of thread concurrency. On the other hand, Arachne [10] addresses low-latency and high-throughput applications by designing short-lived threads. In this scheme, threads running at the user level are handled with core-aware scheduling, which assigns cores to each thread according to the application requirements. In other words, the desired scheduling can be achieved with core-aware thread management. However, since its APIs are not POSIX compatible, it is difficult to immediately port them to modern CE/IoT devices.

VII. CONCLUSION
Contemporary CE devices, which have sensor and network modules, have unique characteristics compared with general desktop or server systems in that the threads of an application must be handled by limited resources, such as a low clock speed CPU and small capacity memory. This allows for low power requirements, miniaturization, and cost competitiveness. This paper targeted enhancing the existing one-to-one thread model to resolve the resource management problems of the user-space threads on low-end CE/IoT devices. To handle threads more efficiently in these systems, this paper proposed state-of-the-art resource management facilities for CE devices: a CPU Mediator, Stack Tuner, and Enhanced Thread Identifier. This paper shows that the proposed system dramatically improves the response time of time-critical threads by up to 7x and saves available virtual memory space by up to 3.4x. In addition, the proposed system supports a POSIX-compatible thread scheduling API that allows developers to run unified applications on both small-scale and large-scale hardware platforms. Also, the proposed system supports a light-weight system resource manager to improve naive stack management on the low-end CE devices.

REFERENCES

[1] W. Z. Khan, M. Y. Aalsalem, and M. K. Khan, "Communal Acts of IoT Consumers: A Potential Threat to Security and Privacy," IEEE Trans. Consum. Electron., vol. 65, no. 1, pp. 64-72, 2019, DOI: 10.1109/TCE.2018.2880338.
[2] S. K. Roy, S. Misra, and N. S. Raghuwanshi, "SensPnP: Seamless Integration of Heterogeneous Sensors With IoT Devices," IEEE Trans. Consum. Electron., vol. 65, no. 2, pp. 205-214, Mar. 2019, DOI: 10.1109/TCE.2019.2903351.
[3] D. Jo and G. J. Kim, "ARIoT: Scalable Augmented Reality Framework for Interacting with Internet of Things Appliances Everywhere," IEEE Trans. Consum. Electron., vol. 62, no. 3, pp. 334-340, Oct. 2016, DOI: 10.1109/TCE.2016.7613201.
[4] S. Raj, "An Efficient IoT-Based Platform for Remote Real-Time Cardiac Activity Monitoring," IEEE Trans. Consum. Electron., vol. 66, no. 2, pp. 106-114, Mar. 2020, DOI: 10.1109/TCE.2020.2981511.
[5] S. J. Hill, "Native POSIX Threads Library (NPTL) Support for uClibc," in Proc. OLS, Ottawa, ON, Canada, 2006, pp. 409-420.
[6] C. Wong, I. Tan, R. Kumari, J. Lam, and W. Fun, "Fairness and Interactive Performance of O(1) and CFS Linux Kernel Schedulers," in Proc. ITCC, Kuala Lumpur, Malaysia, 2008, pp. 1-8.
[7] F. Mueller, "A Library Implementation of POSIX Threads under UNIX," in Proc. USENIX Conference, San Diego, CA, USA, 1993, pp. 29-41.
[8] H.-J. Boehm, "Threads Cannot be Implemented as a Library," SIGPLAN Not., vol. 40, no. 6, pp. 261-268, Jun. 2005, DOI: 10.1145/1065010.1065042.
[9] J. Nakashima and K. Taura, "MassiveThreads: A Thread Library for High Productivity Languages," in Proc. LNCS, Berlin, Germany, 2014, pp. 222-238.
[10] H. Qin, Q. Li, J. Speiser, P. Kraft, and J. Ousterhout, "Arachne: Core-Aware Thread Management," in Proc. USENIX OSDI, Carlsbad, CA, USA, 2018, pp. 145-160.
[11] B. Barney, "POSIX Threads Programming," 2009, Accessed: Mar. 30, 2020. [Online]. Available: https://computing.llnl.gov/tutorials/pthreads
[12] F. W. Miller, "PK: A POSIX Threads Kernel," in Proc. USENIX ATC, Monterey, CA, USA, 1999, pp. 179-181.
[13] R. S. Engelschall, "Portable Multithreading: The Signal Stack Trick for User-Space Thread Creation," in Proc. USENIX ATC, San Diego, CA, USA, 2000, pp. 1-12.
[14] J. Howell, B. Parno, and J. R. Douceur, "How to Run POSIX Apps in a Minimal Picoprocess," in Proc. USENIX ATC, San Jose, CA, USA, 2013, pp. 321-332.
[15] M. Rieker, J. Ansel, and G. Cooperman, "Transparent User-Level Checkpointing for the Native POSIX Thread Library for Linux," in Proc. PDPTA, Las Vegas, NV, USA, 2006, pp. 492-498.
[16] K. Tran, T. E. Carlson, K. Koukos, M. Själander, V. Spiliopoulos, S. Kaxiras, and A. Jimborean, "Clairvoyance: Look-Ahead Compile-Time Scheduling," in Proc. CGO, Austin, TX, USA, 2017, pp. 171-184.
[17] J. Yun, I. Ahn, N. Sung, and J. Kim, "A Device Software Platform for Consumer Electronics Based on the Internet of Things," IEEE Trans. Consum. Electron., vol. 61, no. 4, pp. 564-571, Nov. 2015, DOI: 10.1109/TCE.2015.7389813.
[18] P. Sundaravadivel, K. Kesavan, L. Kesavan, S. P. Mohanty, and E. Kougianos, "Smart-Log: A Deep-Learning Based Automated Nutrition Monitoring System in the IoT," IEEE Trans. Consum. Electron., vol. 64, no. 3, pp. 390-398, Aug. 2018, DOI: 10.1109/TCE.2018.2867802.
[19] G. Lee and M. Rho, "IoT Connectivity Interface in Tizen: Smart TV Scenarios," in Proc. DUXU, Toronto, ON, Canada, 2016, pp. 357-364.
[20] M. Ham and G. Lim, "Making Configurable and Unified Platform, Ready for Broader Future Devices," in Proc. ICSE-SEIP, Montreal, QC, Canada, 2019, pp. 141-150.
[21] S. Dhotre, P. Patil, S. Patil, and R. Jamale, "Analysis of Scheduler Settings on the Performance of Multi-core Processors," in Proc. IEEE ICEI, Tirunelveli, India, 2017, pp. 687-691.
[22] L. Gong, Z. Li, T. Dong, and Y. Sun, "Rethink Scalable M:N Threading on Modern Operating Systems," J. Comput., vol. 11, no. 3, pp. 176-189, May 2016, DOI: 10.17706/jcp.11.3.176-188.
[23] B. D. Veerasamy and G. M. Nasira, "JNT-Java Native Thread for Win32 Platform," Int. J. Comput. Appl., vol. 70, no. 24, pp. 1-9, May 2013, DOI: 10.5120/12212-8249.
[24] G. Blake, R. G. Dreslinski, T. Mudge, and K. Flautner, "Evolution of Thread-Level Parallelism in Desktop Applications," in Proc. ISCA, San Jose, CA, USA, 2010, pp. 302-313.
[25] H. Franke, R. Russell, and M. Kirkwood, "Fuss, Futexes and Furwocks: Fast User-level Locking in Linux," in Proc. OLS, Ottawa, ON, Canada, 2002, pp. 479-495.
[26] N. Brown, "C++CSP2: A Many-to-Many Threading Model for Multicore Architectures," in Proc. CPA, Surrey, U.K., 2007, pp. 183-205.
[27] J. J. Harrow, "Runtime Checking of Multithreaded Applications with Visual Threads," in Proc. SPIN, Beijing, China, 2000, pp. 331-342.
[28] K. Pouget, M. Pérache, P. Carribault, and H. Jourdren, "User Level DB: A Debugging API for User-Level Thread Libraries," in Proc. IEEE IPDPS, Atlanta, GA, USA, 2010, pp. 1-7.
[29] M. Leske, A. Chis, and O. Nierstrasz, "Improving Live Debugging of Concurrent Threads Through Thread Histories," Sci. Comput. Program., vol. 161, pp. 122-148, Sep. 2018, DOI: 10.1016/j.scico.2017.10.005.
[30] A. Gidenstam and M. Papatriantafilou, "LFTHREADS: A Lock-Free Thread Library," SIGARCH Comput. Archit. News
ACM Trans. Comput. Syst., vol. 24, no. 2, pp. 140-174, May 2006, DOI: 10.1145/1132026.1132028.
[33] A. Adya, J. Howell, M. Theimer, W. J. Bolosky, and J. R. Douceur, "Cooperative Task Management without Manual Stack Management," in Proc. USENIX ATC, Monterey, CA, USA, 2002, pp. 289-302.
[34] L. Soares and M. Stumm, "FlexSC: Flexible System Call Scheduling with Exception-Less System Calls," in Proc. USENIX OSDI, Vancouver, BC, Canada, 2010, pp. 33-46.
[35] W. Bei, W. Bo, and X. Qingqing, "The Comparison of Communication Methods Between User and Kernel Space in Embedded Linux," in Proc. IEEE ICCP, LiJiang, China, 2010, pp. 234-237.
[36] C. S. Wong, I. Tan, R. D. Kumari, and F. Wey, "Towards Achieving Fairness in the Linux Scheduler," ACM SIGOPS Oper. Syst. Rev., vol. 42, no. 5, pp. 34-43, Jul. 2008, DOI: 10.1145/1400097.1400102.
[37] M. Kim, S. Noh, J. Hyeon, and S. Hong, "Fair-Share Scheduling in Single-ISA Asymmetric Multicore Architecture via Scaled Virtual Runtime and Load Redistribution," J. Parallel Distrib. Comput., vol. 111, pp. 174-186, Jan. 2018, DOI: 10.1016/j.jpdc.2017.08.012.
[38] C. Davis, J. Jackson, J. Oldfield, T. Johnson, and M. Hale, "A Time Comparison Between AVL Trees and Red Black Trees," in Proc. FCS
ACM Trans. Embed. Comput. Syst., vol. 14, no. 2, pp. 1-17, Mar. 2015, DOI: 10.1145/2658990.
Geunsik Lim received his B.S. degree in Computer Science and Engineering from Ajou University, South Korea, in 2003. He received his M.S. degree from the College of Information and Communication Engineering, Sungkyunkwan University, South Korea, in 2014. He is currently a Ph.D. student in the Department of Electrical and Computer Engineering, Sungkyunkwan University, and also a principal software engineer at Samsung Electronics in South Korea. His current research interests include system optimization, operating systems, software platforms, and on-device artificial intelligence.
Donghyun Kang is an assistant professor with the Department of Computer Engineering at Changwon National University in South Korea. Before joining Changwon National University, he was an assistant professor with Dongguk University (2019-2020) and a software engineer at Samsung Electronics in South Korea (2018-2019). He received his Ph.D. degree from the College of Information and Communication Engineering, Sungkyunkwan University, in 2018. His research interests include file and storage systems, operating systems, and emerging storage technologies.