A Fault Tolerant Mechanism for Partitioning and Offloading Framework in Pervasive Environments
Nevin Vunka Jungum, Nawaz Mohamudally and Nimal Nissanke
School of Innovative Technologies and Engineering, University of Technology, Mauritius, La Tour Koenig, Pointe-aux-Sables, Mauritius
School of Computing, Information Systems and Mathematics, London South Bank University, London, UK
Abstract
Application partitioning and code offloading have been researched extensively over the past few years, and several frameworks for code offloading have been proposed. However, few works attempt to address the issues that arise when such frameworks are deployed in pervasive environments, such as frequent network disconnections caused by the high mobility of users. In this paper, we propose a fault tolerant algorithm that consolidates the efficiency and robustness of application partitioning and offloading frameworks. To permit the use of different fault tolerant policies, such as replication and checkpointing, the devices are grouped into high and low reliability clusters. Experimental results show that the fault tolerant algorithm can easily adapt to different execution conditions while incurring minimum overhead.
Keywords: application partitioning and offloading, mobile code, fault tolerance, pervasive environment
1. Introduction
A mobile pervasive environment consists of users interacting with mobile devices that connect wirelessly to stationary devices, desktops, servers or other mobile devices. Because users are mobile, frequent network disconnections are a normal characteristic of such environments; as a consequence, any mobile distributed system running in them is prone to failures that negatively affect its reliability. Several fault tolerance mechanisms [1][2] have been proposed to solve the reliability problem in distributed computing systems. Almost all of the proposed techniques consider environments with wired, homogeneous computational devices, as in grid computing systems; they are therefore difficult to adapt to a wireless, heterogeneous mobile computing environment. Fault tolerance is the process of reinstating the normal, or an acceptable, behavior of a system. In pervasive computing environments, network disconnection in the middle of the execution of a task is frequent, mainly due to the mobility of users. For example, if user A's device participates in a cluster of devices collaborating to execute an application partitioned and offloaded from user B's device, and user A moves away from the cluster, a failure is generated. Hence, this paper proposes a fault tolerant algorithm, based on reactive fault tolerance methods, packaged as an independent component that can be added to any offloading framework. The fault tolerant component takes as input the task offloading schedules produced by existing offloading systems [3][4][5] and ensures complete application execution by applying different fault tolerant policies.
The rest of the paper is structured as follows: Section 2 discusses existing work in the area of fault tolerance in distributed computing systems; Section 3 presents the application, device and reliability models; Section 4 describes in detail the calculation of the reliability level of devices, presents an algorithm that clusters devices by reliability level, discusses the replication and checkpointing policies, describes a possible implementation, and presents the fault tolerant algorithm itself along with an analysis of its time complexity; Section 5 describes the experimental setup, the design and implementation of a simulator, and the evaluation and analysis of the results; finally, Section 6 summarizes the paper.
2. Related Works
Two reactive fault tolerance mechanisms often used are checkpointing and replication. With checkpointing, a snapshot of the application state is taken at a pre-defined time interval and saved to disk; system reliability is then determined by the time elapsed between two checkpoints. With replication, by contrast, no snapshot of the application state is saved: a replica of the application is executed in parallel on other computational devices to ensure complete processing of the task.

The paradigm of distributed computing encompasses grid computing, mobile grids and cluster computing, among many others. In such systems, loosely coupled computational resources are connected by a network, and fault tolerance must be managed to ensure system stability, robustness and reliability. The authors in [6] compared the effect of dynamic load balancing (DLB) and job replication (JR) on the robustness of distributed systems, providing a summary statistic Y and a corresponding threshold value Y* such that DLB consistently outperformed JR on one side of the threshold while the reverse held on the other.

In [9] the authors proposed the k-out-of-n reliability control technique, with the objective of achieving energy efficiency while maintaining the system reliability level for the processing and storage of data in a mobile cloud. Three factors are considered to estimate the failure probability: the amount of battery life left, the mobility of the device and the application's computation requirement. A fault tolerant algorithm was proposed in [10] for the management of resources in a mobile cloud. Multiple fault tolerance methods, such as checkpointing and replication, were applied to multiple clusters of devices depending upon their availability and mobility. However, the clustering algorithm takes into consideration only device mobility and rate of utilization, whereas more criteria could be considered.
In [11] the author categorized devices in mobile grids into groups having identical hardware properties. For tasks sent to low reliability groups, the fault tolerance strategy used is task replication, while tasks sent to high reliability groups are offloaded to another device upon failure.

3. Modeling of the System

The application is modeled as a directed acyclic graph (DAG) A = (S, E), as shown in Figure 1, where S represents the set of tasks s_i that constitute the application and E denotes the task dependencies. The execution time of each task is represented by the weight of its node; for example, tasks s_1 and s_4 in Figure 1 take 8 and 6 time units respectively.

Tasks in S   Task dependencies in E
1            none
2            none
3            1, 2
4            1, 3, 5
5            2
6            4, 5

Table 1 Application task dependency table

Figure 1 An application modeled as a DAG. The critical path is shown in bold.

It is assumed that a task cannot be divided into sub-tasks. Each task has a time variable T^tf_{s_i} referred to as its total float: the time by which the task can be delayed without delaying the execution of the overall application. It is computed as the difference between the latest finish time and the earliest finish time of the task. The critical path of a DAG is the path of tasks with zero total float, that is, the path with the longest total execution time, which determines the application's earliest finish time [12]. Thus, the tasks found along the critical path should be subject to the fault tolerant policies to make sure the application completes on time.

Suppose a mobile pervasive environment network D consists of m heterogeneous computational devices such as smartphones, laptops, desktops, servers, IoT devices and so on. d_i ∈ D (i = 1, 2, ..., m) designates the i-th device.
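The total float and critical path computation described above can be sketched as follows. This is a minimal illustration in Python; the task weights and dependencies below reuse the dependency structure of Table 1 but the durations (other than the 8 and 6 units mentioned for s_1 and s_4) are hypothetical, not the paper's Figure 1 values.

```python
def critical_path(durations, deps):
    """durations: {task: execution time}; deps: {task: set of prerequisites}.

    Returns the critical path (tasks with zero total float, in topological
    order) and the total float of every task. Assumes the graph is acyclic.
    """
    # Topological order: repeatedly pick tasks whose prerequisites are done.
    order, done = [], set()
    while len(order) < len(durations):
        for t in durations:
            if t not in done and deps.get(t, set()) <= done:
                order.append(t)
                done.add(t)
    # Earliest finish: a task starts once all its prerequisites have finished.
    ef = {}
    for t in order:
        est = max((ef[p] for p in deps.get(t, set())), default=0)
        ef[t] = est + durations[t]
    makespan = max(ef.values())
    # Latest finish: work backwards from the application finish time.
    succs = {t: {s for s in durations if t in deps.get(s, set())}
             for t in durations}
    lf = {}
    for t in reversed(order):
        lf[t] = min((lf[s] - durations[s] for s in succs[t]), default=makespan)
    # Total float = latest finish - earliest finish; zero float => critical.
    total_float = {t: lf[t] - ef[t] for t in durations}
    return [t for t in order if total_float[t] == 0], total_float

durations = {1: 8, 2: 3, 3: 4, 4: 6, 5: 5, 6: 2}          # illustrative weights
deps = {3: {1, 2}, 4: {1, 3, 5}, 5: {2}, 6: {4, 5}}       # Table 1 dependencies
cp, tf = critical_path(durations, deps)
```

With these illustrative weights, the critical path is 1 → 3 → 4 → 6, while tasks 2 and 5 carry slack.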
The hardware specifications of a device are modeled as follows:
- the processing speed of a device d_i is denoted L_i;
- the rate of utilization of the processor is represented by U_i;
- I_i^wifi is a binary indicator of whether the device is equipped with a WiFi communication interface. This can later be extended with I_i^cell and I_i^bt to represent cellular and Bluetooth communication interfaces respectively. For example, if I_i^wifi = 1 and I_i^ether = 0, device d_i has a WiFi interface but no Ethernet port;
- B_i^wifi and B_i^ether represent the bandwidth of the corresponding communication medium on device d_i. The speed of each communication medium is computed as the product of the binary indicator I_i and the bandwidth, plus the current network latency;
- the earliest available time of a device d_i is T_i^avail, which reflects the current load of the device;
- T_i^r represents the time duration between two failures of device d_i. The mobility of the devices in the environment permits the collection of T_i^r for each device.

The reliability of a device implies its availability. As a side note, this paper does not focus on the mobility and trajectory of the devices. To obtain a device d_i's available time, its mean time between failures (MTBF) is taken as T_i^r. The MTBF of a device can be computed using the Weibull distribution.
The Weibull distribution is versatile, since it can take on the characteristics of other distributions (such as the normal and exponential) by adjusting its shape parameter β. The 2-parameter Weibull probability density function is:

f(t) = (β/η) (t/η)^(β−1) e^(−(t/η)^β)    (1)

where β is the shape parameter and η is the scale parameter, estimated using historical data on the device's connection times to the network and their durations.

4. Proposed Fault Tolerant Algorithm

The fault tolerant mechanism, along with the other components that enable the offloading process, is depicted in Figure 2.

Figure 2 The fault tolerant mechanism along with other components of the offloading framework

The interaction diagram of the fault tolerant component with the code offloading engine, the source device and the participating devices is shown in Figure 3.

Figure 3 The interaction diagram of the proposed fault tolerant mechanism

The offloading decision-making component outputs a list of tasks to be offloaded and their corresponding host devices. We refer to this output as the task offloading scheduling plans (TOSP). The device clustering module clusters the participating devices, and the policy generator binds the relevant fault tolerance policy to each task after decoding the TOSP. Three criteria are used to determine the reliability of a device, which in turn decides whether the replication or the checkpointing fault tolerant policy is applied: the computing capability of the device, its availability and its data throughput in the network.
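As a brief aside on the availability criterion, the Weibull density of Eq. (1) and the MTBF it implies can be sketched as follows. This is a minimal illustration; in the proposed system β and η would be fitted from a device's observed connection history, whereas the values used here are purely illustrative.

```python
import math

def weibull_pdf(t, beta, eta):
    """2-parameter Weibull density, Eq. (1): shape beta, scale eta."""
    return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-((t / eta) ** beta))

def weibull_mtbf(beta, eta):
    """Mean of a Weibull(beta, eta) variable: eta * Gamma(1 + 1/beta).

    This mean serves as the device's MTBF estimate T_i^r.
    """
    return eta * math.gamma(1.0 + 1.0 / beta)

# With beta = 1 the Weibull reduces to an exponential distribution,
# so the MTBF is simply the scale parameter eta.
mtbf = weibull_mtbf(1.0, 100.0)   # 100.0
```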
The computing capability of a device d_i is computed as:

C_i = L_i × U_i    (2)

where C_i is the computing capability, L_i is the CPU speed and U_i is the CPU utilization rate.

The mobility and battery lifetime of the device are considered to determine its availability, which is the amount of time the device is connected to the network: the longer it is connected and the higher its battery life, the higher its availability. Hence it is obtained as follows:

A_i = a T_i^r + b B_i    (3)

where T_i^r represents the device's available time from the mobility model and B_i represents the percentage of battery remaining. a and b are weight factors that can be adjusted depending on the context and system requirements, with a + b = 1.

Both fault tolerant policies, replication and checkpointing, need fast data transfer with minimum energy consumed, so communication with other devices in the environment is taken into consideration. The communication capacity of the device is calculated as follows:

T_i = B_i − (r × v)    (4)

where B_i represents the total bandwidth of the device, r represents the data transfer rate and v represents the number of existing connections to other devices (we assume a fixed data transfer rate for all connections). Thus, T_i reflects the capacity remaining to serve future devices.

The three criteria defined above, that is, the computing capability, availability and communication capacity of the device, are combined to determine its reliability degree. Combining the three criteria into a single value helps reduce the computational complexity, especially when the number of devices is very large. The reliability R_i of a device d_i is calculated as follows:

R_i = α_cpu C_i + α_batt A_i + α_conn T_i    (5)

where α_cpu, α_batt and α_conn are weight factors and α_cpu + α_batt + α_conn = 1.

All devices that may potentially take part in the fault tolerance process are categorized into high reliability and low reliability clusters. Because of the heterogeneity of the devices in the environment, a single fault tolerant strategy might not be suitable for all conditions. For example, applying the replication policy to a cluster of highly reliable devices would incur more operating overhead than checkpointing; on the other hand, applying the checkpointing policy to a cluster of low-reliability devices may create unnecessary storage of snapshots.

Several clustering algorithms exist to group similar entities in one cluster and dissimilar ones in another. We use k-means clustering [13], a prototype-based clustering technique that tries to identify a number k of clusters specified by the user.

Procedure. The first step is the selection of k initial centroids, where k is the number of clusters desired. Each reliability value R_i of device d_i is assigned to the closest centroid, and each group of reliability values assigned to a centroid is called a cluster. The centroid of each cluster is then updated based on the reliability values assigned to it.

Algorithm 1 Reliability Clustering Algorithm
1. Choose the number of reliability clusters (k = 2) and get the R_i values
2. Place the centroids c_1, ..., c_k randomly
3. For each reliability value R_i:
   - find the nearest centroid (c_1, ..., c_k)
   - assign R_i to that cluster
4. For each cluster j = 1, ..., k:
   - new centroid = mean of all R_i assigned to that cluster
5. Repeat steps 3 and 4 until convergence, i.e. no further change in the assignment of the R_i data points
6. Return the low and high reliability clusters
7. End

A task replica is offloaded to the same cluster as the originally selected participating device, i.e. the device scheduled to process the offloaded task by the code offloading module, in order to maintain the offloading benefits in terms of execution time and energy consumption [3][18] and load balancing [5]. An unnecessarily large number of replicas would result in inefficient redundancy of resources along with a high energy overhead. To resolve this issue, a score is calculated for each device based on the execution time of the replica, the failure rate and the number of directly connected devices:

Score(d) = u T_{S_i} + w (S_i^fail / S_i^total) + y (d_conn / d_total)    (6)

where u, w and y are adjustable weight factors with u + w + y = 1; T_{S_i} denotes the task completion time; S_i^fail and S_i^total represent the number of tasks failed and the total number of tasks executed by the device; and d_conn and d_total represent the number of devices that d is connected to and the total number of devices in the cluster, respectively. The device with the lowest score is, in this configuration, the most reliable one in the cluster and is chosen to host the replica of the task.

Applying the checkpointing policy implies periodically saving a snapshot of the task being executed.
Determining the frequency of checkpoints is important: checkpointing creates additional traffic on the wireless network, and thus more energy usage, and it also incurs a time overhead. The frequency therefore needs to be calculated from the time taken by the checkpointing operation and the failure rate. The technique proposed in [16] is used to compute the checkpointing interval as follows:

T_c = sqrt(T_s × T_f)    (7)

where T_c is the time between two checkpoints, T_s is the checkpointing time and T_f is the time between failures.

The proposed fault tolerant algorithm is listed in Algorithm 2 below.

Algorithm 2 Proposed Fault Tolerant Algorithm
FaultTolerantProcess(D, S, S_offload)
  for all (d ∈ D) do
    calculate R_i
  for all (d ∈ D) do
    {D_HR, D_LR} ← (d, R_i)
  {S_cp} ← critical_path(S)
  for all (s ∈ S_offload) do
    if (s ∉ S_cp) then
      offload s to scheduled device
    else if (s ∈ S_cp) then
      if (d(s) ∈ D_LR) then
        calculate score of devices in D_LR
        ~d ← lowest score device
        ~d ← s_replica
      else if (d(s) ∈ D_HR) then
        T_c ← find checkpoint frequency
        s ← append checkpoint with T_c

D represents the list of devices in the environment, S denotes the list of all tasks and S_offload represents the list of tasks to be offloaded; these three variables are taken as input by the fault tolerant algorithm. First, the reliability value R_i of each device d is calculated. All devices d are then categorized into the low reliability D_LR and high reliability D_HR clusters. S_cp is the set of tasks found on the critical path. Each task to be offloaded is checked against the critical path S_cp: any task not on the critical path is simply offloaded to its scheduled device for execution. For each task on the critical path, its scheduled offloading device d is checked to see whether it belongs to the low reliability or the high reliability cluster. In the low reliability case, the score of every device in the cluster is calculated and the device ~d with the lowest score is assigned the task replica s_replica. In the high reliability case, the checkpoint interval T_c is calculated and appended to the task s.

Suppose there are m devices in the environment and n tasks processed by the fault tolerant algorithm. The first step consists of calculating the reliability R_i of each device in D, taking O(m). The reliability clustering algorithm takes O(m × k × I × a), where k is the number of clusters, I is the number of iterations until convergence and a is the number of attributes. Since each device has only one attribute, namely R_i, and the number of clusters is 2 (the low and high reliability clusters), the time complexity reduces to O(m × I) and hence to O(m). The scoring calculation for the replication policy costs O(m), while the calculation of the checkpointing frequency takes O(1). Thus, the maximum cost of applying a fault tolerant policy to n tasks is O(nm), and the algorithm's overall time complexity is O(m + nm).

5. Evaluation

A number of simulations were performed to evaluate the performance of the proposed fault tolerant algorithm. Three metrics are considered: the application completion time, the fault tolerant algorithm's overhead cost and the number of control messages exchanged for fault tolerance. The results are compared with conventional fault tolerant algorithms to better situate the performance of the proposed algorithm. A simulator was implemented based on the INET Framework [14] to simulate the devices in the mobile network. INET is an open-source model suite for wired, wireless and mobile networks running on top of OMNeT++ [15], a discrete event simulator. Different topologies with varying bandwidth and delay values for each communication link were generated. Tables 2 and 3 list the task parameters and device parameters used in the simulation, respectively.
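As a brief aside before the experimental setup, the dispatch logic of Algorithm 2, together with the checkpoint-interval rule of Eq. (7), can be sketched in code as follows. The data structures and helper names are our own illustrative simplifications, not part of the paper's implementation.

```python
import math

def checkpoint_interval(t_checkpoint, t_between_failures):
    """Eq. (7): T_c = sqrt(T_s * T_f)."""
    return math.sqrt(t_checkpoint * t_between_failures)

def fault_tolerant_process(tasks_offload, critical, scheduled_on,
                           low_cluster, scores, t_checkpoint, mtbf):
    """Sketch of Algorithm 2: decide a fault tolerant action per offloaded task.

    tasks_offload : tasks to be offloaded (S_offload)
    critical      : set of tasks on the critical path (S_cp)
    scheduled_on  : {task: device} from the offloading schedule (TOSP)
    low_cluster   : set of devices in the low reliability cluster (D_LR)
    scores        : {device: Eq. (6) score} for low-cluster devices
    """
    plan = {}
    for task in tasks_offload:
        if task not in critical:
            # Off the critical path: offload as scheduled, no FT policy.
            plan[task] = ("offload", scheduled_on[task])
        elif scheduled_on[task] in low_cluster:
            # Low reliability cluster: replicate on the lowest-score device.
            replica_host = min(low_cluster, key=lambda d: scores[d])
            plan[task] = ("replicate", replica_host)
        else:
            # High reliability cluster: checkpoint at interval T_c.
            plan[task] = ("checkpoint", checkpoint_interval(t_checkpoint, mtbf))
    return plan

plan = fault_tolerant_process(
    tasks_offload=[1, 2, 3], critical={1, 3},
    scheduled_on={1: "a", 2: "b", 3: "c"},
    low_cluster={"a", "d"}, scores={"a": 0.9, "d": 0.2},
    t_checkpoint=4.0, mtbf=100.0)
```

Here task 2 (off the critical path) is offloaded as scheduled, task 1 is replicated on device "d" (lowest score in the low cluster) and task 3 is checkpointed every sqrt(4 × 100) = 20 time units.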
Parameters                                   Value
Number of applications                       50
Length of computation (no. of instructions)  20000 - 100000
Size of data (MB)                            0.5 - 10

Table 2 Task parameter values for the simulation

Parameters                                   Value
CPU speed (MIPS)                             1000 - 100000
Total time available (s)                     1000 - 30000
Weibull shape and scale (failure point)      α, β
Bandwidth WiFi (MBps)                        0.9 - 1.2
Number of devices                            20 - 50

Table 3 Device parameter values for the simulation

We designed and implemented a π-calculator to estimate, to a configurable precision, the value of π. The workload used in the simulation consists of a set of applications represented by randomly generated DAGs, where each task of an application is represented by a vertex. Each task calculates the value of π, taking as input the number of iterations of the approximation to run and the number of decimal places desired. Each vertex also has two associated values, the amount of computation and the data size, generated from within the specified ranges.

The device parameters in Table 3 show that processing speed is measured in Millions of Instructions Per Second (MIPS). In the simplified model used here, the CPU fetches and executes one instruction per clock cycle. Clock speed is measured in cycles per second (hertz), so a CPU with a clock speed of 2 gigahertz (GHz) performs two billion cycles per second. To simulate device failures, a 2-parameter Weibull distribution is used to generate random times between failures, and the failure time points are computed by adding the time generated from the distribution to the device start time. For realistic mobility patterns, we used the CRAWDAD trace dataset [17] to obtain the shape and scale parameters of the Weibull distribution.

Figure 4 The structural design of the simulator

The simulator (see Figure 4) consists of several modules.
Simulation configurations are stored in the simulation settings database or file, which is accessed by the application generator and the device generator to obtain simulation data. The application generator produces random application DAGs that simulate the mobile tasks and their dependencies. Using the device information from the simulation settings module, the device generator outputs data for the mobile devices, and the device cluster generator uses this output, along with the network generator, to produce the high and low reliability clusters. The workload generator takes the tasks on the critical path and the lists of high and low reliability device clusters and assigns each task to a device. The offloading engine implements the existing mobile offloading algorithms and sends the schedule plans to the policy generator, which applies the relevant fault tolerant policies to the offloading schedule plans.

The weight factors u, w and y are fixed for all experiments. We evaluated the performance of the proposed fault tolerant algorithm with 500 randomly generated application DAGs. Each application graph is generated with a random number of edges and vertices, and each task is bound to a data size and an amount of computation. The output is compared with other basic fault tolerant strategies: a checkpointing-only policy (C-only), a replication-only policy (R-only) and a no-fault-tolerance policy (NoFT), in which no fault tolerance policy is applied to the scheduling plans. Two performance experiments are considered. Experiment 1 analyzes the effect of device availability, reflected in the mean time between failures (MTBF). Experiment 2 assesses the effect of task computation requirements on the fault tolerant algorithm's performance.

Experiment 1

In this experiment, the performance of the fault tolerant algorithms is evaluated while varying device availability, denoted by the MTBF. The Weibull distribution is used to generate the failure times for each device. Figure 5 shows the average application completion time; the proposed fault tolerant algorithm outperformed the other three strategies.

Figure 5 Application completion time based on different device availability

Notice that when the MTBF is small (for example, 10 or 20), the availability of devices is very low, and the NoFT strategy results in the worst completion time, around 4000 seconds, compared with the others, whereas R-only and the proposed fault tolerant algorithm (FT_Algo) have the lowest completion times. This is because, when failures occur more often, the other two strategies, C-only and NoFT, keep restarting task executions, which results in a higher overall application completion time than R-only and FT_Algo. Moreover, when the MTBF is low, FT_Algo applies the replication policy, so its results are similar to those of R-only. When the MTBF rises above 90, all strategies generate stable results; as the devices become more and more reliable, the R-only strategy generates more redundancy overhead for transmitting replicas than C-only, as depicted in Figure 6.

Figure 6 Overhead cost based on different device availability

Figure 7 Number of messages based on different device availability

As the MTBF increases, more time is available for tasks to complete before failures occur, hence a decrease in the overhead of all strategies. As illustrated in Figures 6 and 7, the overhead and the number of messages it generates tend to decrease and stabilize, in line with the overall application completion time.

Experiment 2

In this second experiment, the algorithms are evaluated with DAGs having varying task computation requirements, in terms of MIPS, to analyze the fault tolerant algorithm's performance. MTBF values from 90 s to 120 s are used for the devices.

Figure 8 Application completion time based on different task computation requirements

Figure 8 shows the application completion time for the four strategies. The completion times of all strategies are similar when the amount of computation is low, and even as the computation size grows, all four strategies' completion times grow similarly. The fault tolerant algorithm outperforms the other three strategies, and NoFT records the worst performance.

Figure 9 Overhead cost based on different task computation requirements

Figure 10 Number of messages based on different task computation requirements

Figures 9 and 10 show the overhead and the number of control messages incurred by the FT_Algo, R-only and C-only algorithms. Since the R-only strategy only generates replicas for all tasks, its overhead and number of messages do not fluctuate much. On the contrary, as task executions get longer, more checkpointing operations are performed, resulting in an increase in overhead and control messages for FT_Algo and the checkpointing strategy. The results of the different experiments demonstrate that the proposed fault tolerant algorithm can adapt its policies to different execution conditions, such as different device availabilities and different task computation requirements.

6. Summary

The mobility of users makes mobile partitioning and offloading systems susceptible to failures. Since most existing fault tolerant algorithms for distributed systems concentrate mainly on device crash failures, they adapt poorly to pervasive environments characterized by frequent wireless network failures. In this paper, we presented a fault tolerant algorithm that helps consolidate the efficiency and robustness of partitioning and offloading frameworks. To permit the use of different fault tolerant policies, such as replication and checkpointing, the devices are grouped into high and low reliability clusters. Experimental results show that the fault tolerant algorithm can easily adapt to different execution conditions while incurring minimum overhead.