MDLdroid: a ChainSGD-reduce Approach to Mobile Deep Learning for Personal Mobile Sensing
Yu Zhang
RMIT [email protected]
Tao Gu
RMIT [email protected]
Xi Zhang
RMIT [email protected]
ABSTRACT
Personal mobile sensing is fast permeating our daily lives to enable activity monitoring, healthcare and rehabilitation. Combined with deep learning, these applications have achieved significant success in recent years. Different from conventional cloud-based paradigms, running deep learning on devices offers several advantages including data privacy preservation and low-latency response for both model inference and update. Since data collection is costly in reality, Google's Federated Learning offers not only complete data privacy but also better model robustness based on multiple user data. However, personal mobile sensing applications are mostly user-specific and highly affected by environment. As a result, continuous local changes may seriously affect the performance of a global model generated by Federated Learning. In addition, deploying Federated Learning on a local server, e.g., an edge server, may quickly reach the bottleneck due to resource constraints and serious failure by attacks. Towards pushing deep learning on devices, we present MDLdroid, a novel decentralized mobile deep learning framework to enable resource-aware on-device collaborative learning for personal mobile sensing applications. To address resource limitation, we propose a ChainSGD-reduce approach which includes a novel chain-directed Synchronous Stochastic Gradient Descent algorithm to effectively reduce overhead among multiple devices. We also design an agent-based multi-goal reinforcement learning mechanism to balance resources in a fair and efficient manner. Our evaluations show that our model training on off-the-shelf mobile devices runs 2x to 3.5x faster than single-device training, and 1.5x faster than the master-slave approach.
CCS CONCEPTS
• Computing methodologies → Neural networks; Distributed computing methodologies; Reinforcement learning; Mobile agents; • Networks → Network resources allocation; Network protocol design.

KEYWORDS
Mobile deep learning, Neural networks, Distributed computing, Resource allocation, Reinforcement learning
With the rapid development of mobile and wearable devices, recent years have witnessed an explosion of mobile sensing applications. These applications gain an insight into people's lives based on rich personal sensing data, e.g., understanding biological contexts in daily living [31], recognizing activities in ambient assisted living areas [17], and monitoring personal health in a smart home or hospital [32]. Machine learning has been commonly used to process sensing data. However, traditional machine learning techniques require manual and complex feature engineering. Deep Learning (DL) has gained increasing popularity due to its higher model accuracy. Besides, its automated feature extraction capability and its ability to scale with data make it an ideal solution for processing multi-modality sensing data [40]. It is advocated that DL will be the next key enabler for advanced personal mobile sensing [24].
Personal Sensing Requirements
A successful DL application requires a huge amount of data to train a model, in which large computation resources are involved. The commercial solution is to transmit the data from local devices to the cloud, offloading the heavy training workloads. An edge server can also be used to efficiently offload the workloads [27]. However, real-world personal sensing applications pose several requirements that server-based approaches may not fit. Firstly, personal sensing data are significantly privacy-sensitive as the data contain a variety of human motion and biological contexts. Studies [39] [21] show that leaking motion data can cause a series of violations concerning user location, health condition, emotional state, and identification information. Even governments across the world are reinforcing laws to protect personal data privacy, for example the General Data Protection Regulation (GDPR) [44]. Server-based approaches therefore raise serious privacy concerns. Secondly, due to dynamic external nature effects and the internal characteristics of each individual, personal sensing data are strongly affected by specific personal activities, social abilities and surrounding environment conditions over time [25] [17], and features may be updated at any time. Simply deploying a pre-trained global model on device may not continually work well to adapt to local features that have changed. Continually training a local model with new data is a fundamental requirement, and applications usually require low-latency responses for both model inference and update. In practice, continually transmitting sensing data to the server and downloading model updates for training can incur fast battery drain and considerable latency for mobile devices, especially when the network connection is unstable or broken.
Different from server-based approaches, mobile DL may be an ideal solution to effectively preserve sensing data privacy without transmitting data over the public network, and to enable quick local model inference and update response for continuous on-device training [11] [40]. Thus, mobile DL is a promising direction for personal sensing applications.
Mobile Deep Learning Challenges
Existing work [40] reveals that mobile DL poses two challenges. Firstly, pushing DL with both model training and inference to a resource-constrained mobile device is in practice extremely constricted by its capability. Secondly, data collection under various conditions is usually costly in reality, and model performance and robustness can be seriously affected due to insufficient data. Collaborative deep learning has been proposed to ensure model training efficiency and robustness based on multiple user data [46]. Following this direction, Google's Federated Learning [19] has been proposed to provide a mobile collaborative DL framework. This framework not only preserves data privacy by transferring the gradient parameters of a DL model without exposing local data, but also improves global model performance and robustness through a large number of user models. However, Federated Learning relies on a central server for intensive global model aggregations, which may not work well for personal sensing scenarios due to the low-latency requirement for continuous local model updates. Scaling down the framework to a central edge server [41] or even a master device may efficiently reduce the latency. However, Federated Learning deploys a master-slave structure which is less attack-resistant [48]. A recent study [6] demonstrates that model poisoning attacks can cause a serious negative impact on the central training process. In addition, the central edge server or master device may quickly become the bottleneck due to substantial communication and intensive model aggregation, resulting in huge resource overhead [23]. Moreover, system fault-tolerance may be limited in practice if the central node goes down. In contrast, applying a decentralized structure to mobile DL can theoretically offer high attack-resistance and reliable fault-tolerance [48].
Existing approaches propose a theoretical decentralized structure based on a fixed directed graph network such as Ring-Allreduce [28]. These approaches have been applied in high performance cloud environments where resources are rich and stable. However, when applied to mobile devices where resources are scarce and dynamic, the training process can easily suspend or crash due to the low level of resources available. Furthermore, studies [15] [17] report that using opportunistic user context from multiple people can enrich the sensing data distribution to make a model robust to local variations. Therefore, moving towards a local resource-aware decentralized collaborative mobile DL approach without central server support for personal sensing applications is strongly motivated in order to mitigate resource overhead and reduce latency for training.
Decentralized Mobile Scheduling
Deploying a decentralized DL framework on multiple devices can be extremely difficult due to the design of resource-efficient task scheduling for resource-constrained devices. A generic optimization algorithm [42] can be applied to perform task scheduling. However, the recent Multi-agent Reinforcement Learning (MARL) approach [33] may work better in such a non-stationary, multi-device scenario. Agent-based approaches have a significant advantage in estimating the future scheduling order because agents inherently learn features from past experiences based on Reinforcement Learning (RL), while generic optimization approaches cannot estimate a future possible state and are limited in controlling a complex environment due to the lack of such a learning mechanism [5]. Thus, using agent-based approaches can theoretically make scheduling more reliable for complex environments.
Our Approach
Aiming to push DL to mobile devices, we present MDLdroid, a novel decentralized Mobile Deep Learning framework to enable resource-aware on-device collaborative learning for personal mobile sensing applications. MDLdroid targets to fully operate on multiple off-the-shelf Android smartphones connected in a mesh network, and to achieve high training accuracy and reliable execution of state-of-the-art DL models.

Our challenge is two-fold. To address the first challenge of resource constraints on device, we propose a ChainSGD-reduce approach which essentially uses a novel chain-directed Synchronous Stochastic Gradient Descent (S-SGD) algorithm to effectively reduce resource overhead among devices for training. The key idea is to decentralize the S-SGD algorithm [9] running on a single device to multiple devices with dynamic chain-directed model aggregation. Specifically, each device runs a descendant model for training, in which the model aggregation task is managed by any two devices at a time to achieve a minimal peak (i.e., minimum communication and memory) of resource overhead for each device. Different from the existing Ring-Allreduce approach [28], each pair of devices is dynamically scheduled to perform model aggregation based on their resource condition for latency reduction. In practice, we leverage a mesh graph to aggregate multiple descendant model parameters into global model parameters to complete each training iteration. Once the epoch reaches the given number of iterations, the entire training process is completed. The global model generated can then be deployed on each device for inference.

Secondly, designing a resource-efficient task scheduler in a decentralized DL framework is challenging because resources on device are strictly constrained and change dynamically (i.e., the non-stationarity problem). A device may be dynamically switched to a busy state due to other high priority tasks, leading to training being suspended. In addition, the model aggregation task needs to be distributed to multiple devices in a fair manner to balance battery consumption across the network. To address this challenge, we propose a single agent-based scheduler based on Reinforcement Learning, named Chain-scheduler, to self-organize scheduling tasks using dynamic information of on-device resources. Chain-scheduler includes an effective reward function design to map each perceived scheduling action to a reward, aiming to optimize the given constraints. Through our analysis in §3.3, Chain-scheduler can achieve latency-reduced scheduling close to the best of Tree-scheduler in the Tree-Allreduce approach [37], and it can also achieve energy balance close to the best of Ring-scheduler in the Ring-Allreduce approach [28]. To tackle the non-stationarity problem, we apply a continuous environment learning strategy to repeat learning if the scheduling environment changes. In this way, scheduling can be more adaptive to on-device resource changes. In addition, we further enhance the performance of the reward function by designing a Threshold-based Decaying Greedy-Exploration (TDGE) strategy to save on-device resources used for scheduling tasks.

We fully implement MDLdroid on off-the-shelf smartphones based on modified DL4J libraries. To evaluate MDLdroid, we use two standard Convolutional Neural Network (CNN) models (LeNet [26] and AlexNet [20]) since CNNs can effectively process multi-channel sensing data [43]. We evaluate the training performance using 6 public datasets in Table 1, containing diverse personal mobile sensing data. Results show that MDLdroid achieves high training accuracy comparable to the state-of-the-art accuracy reported in Table 3, and speeds up training by 2x to 3.5x compared to the single-device baseline and 1.5x compared to Federated Learning. In addition, MDLdroid reduces latency overhead due to busy conditions by 23% and 53% compared to Tree-scheduler and Ring-scheduler, respectively. Moreover, MDLdroid reduces the variance in battery consumption among devices by 40% compared to Tree-scheduler.

Summary of Contributions
Key contributions of this paper are summarized as follows:
• We present MDLdroid, a novel decentralized mobile DL framework based on a mesh network to enable resource-aware on-device collaborative learning for personal mobile sensing applications (§3).
• We propose a ChainSGD-reduce approach, in particular a chain-directed S-SGD algorithm, to minimize the resource overhead of model aggregation tasks in a decentralized framework (§3.2).
• We design an agent-based multi-goal reinforcement learning mechanism, Chain-scheduler, with an accelerated reward function to manage and balance resources in a fair and efficient manner (§3.3).
• We evaluate MDLdroid on off-the-shelf Android smartphones with two standard DL models using 6 public personal mobile sensing datasets (§4). Results indicate that MDLdroid accelerates training effectively, outperforming the state-of-the-art.
To the best of our knowledge, this is the first time that collaborative DL is fully implemented and functional on off-the-shelf mobile devices with both model training and inference without central server support. MDLdroid takes mobile DL to the next step, and opens up new possibilities for building secure and adaptive mobile sensing applications.
Implication
MDLdroid defines two implication models. The individual model is used for an individual who has sufficient personal data and multiple mobile devices to offer shared resources. The personal sensing data can only be safely distributed to the given mobile devices verified by the same identity (e.g., a Google account). MDLdroid can accelerate on-device learning and alleviate the resource burden on a single device.
The non-individual model is mainly applied for a group of people to explore specific local sensing features. The data will be strictly kept on device to preserve data privacy, and only the model gradient parameters of each individual will be exchanged to improve model robustness. With the natural benefits of a decentralized approach, model poisoning checks [6] can be easily deployed on each device to prevent attacks. Even if a few devices are down due to attacks, MDLdroid can isolate them and keep the others working safely to avoid catastrophic failure. Moreover, MDLdroid can potentially be used in many multi-user sensing scenarios [4], such as specific family behaviour recognition in a smart home, patient-specific health condition monitoring in a hospital, and healthcare tracking for older adults in different aged care facilities.
To investigate the limitations of deploying mobile DL on device in current solutions, we conduct preliminary studies to discover the bottleneck of on-device training.
Setup
We select three well-known public datasets with different scales, i.e., sEMG (25MB) [22], HAR (160MB) [1], and OPPO (500MB) [1] in Table 1. We use an off-the-shelf Android smartphone, i.e., a Google Pixel 2XL running Android 8.1. In each experiment, the smartphone runs complete training with a dataset using DL4J libraries [10].
Figure 1: Preliminary results. (a) Battery consumption on single-device training; (b) Memory crash on CNN-LSTM; (c) Time cost of Master vs. Sequential model aggregation; (d) Memory conflict when running multi-agents.

Bottleneck of Single-device Training
The encouraging result is that single-device training with LeNet for both sEMG and HAR achieves a fair accuracy with no failure. As we raise the memory threshold, the run-time memory usage of training increases moderately with no major issues observed. However, from Figure 1a, we observe that battery overhead rises dramatically with the growth of data size and model complexity. In particular, with a maximum battery capacity of 3500 mAh on the smartphone, the training of HAR with AlexNet and the training of OPPO with both LeNet and AlexNet fail due to massive battery drain. Thus, single-device training on an off-the-shelf smartphone is not currently feasible.
Memory Overhead of Master Model Aggregation
To have quantitative measurements, we implement Federated Learning [7] on device without central server support, and run the master model aggregation process (i.e., memory usage is O(N)) on a master device to evaluate its resource utilization. Each experiment runs on the same smartphone receiving N copies of model gradient parameters, where N denotes the number of other slave devices. We select the PAMAP2 dataset from Table 1 to train a CNN-LSTM hybrid model [47] (i.e., a 46.4MB model file and 11.61M model parameters). Figure 1b illustrates its memory usage, since memory is more critical than battery for model aggregation. We observe that memory usage increases rapidly with N. When N = 8, it ends up with OutOfMemory (i.e., maximum 512MB). Therefore, multi-device training in a centralized framework has severe memory limitations for model aggregation.
Cost of Sequential Model Aggregation
To solve the memory limitations, a naive sequential model aggregation approach (i.e., memory usage is O(2)) is implemented by sequentially loading model files into memory on the master device. We use the same CNN-LSTM hybrid model to evaluate the time cost. We simulate N up to 20 to increase the scalability of model aggregation. Figure 1c indicates that the time usage of sequential aggregation is far higher than that of master aggregation (e.g., 47307ms vs. 160ms, respectively), and presents a linear increase with N, which is also critical in practice. Moreover, since two copies of model files are aggregated in memory space for each loading, the NeighborSGD in Eq. (2) is used in this implementation. However, a severe accuracy degradation issue is revealed, as shown in Figures 3a and 3b (§3.2.2). Thus, although sequential model aggregation simply minimizes the memory usage, it can still cause considerable extra training time cost and accuracy degradation.

Figure 2: MDLdroid Architecture consists of a three-stage process: 1) user model configuration input; 2) network scan & build; 3) model training & task scheduling.

Resource Conflict in MARL
Our preliminary experiment reveals that a severe resource conflict exists in MARL. We implement a simplified MARL [5] on the same smartphone using DL4J libraries. Figure 1d indicates that running training and agent tasks concurrently on-device can dominate the main memory of the device, resulting in 0.6x higher usage than running any single task. We thus turn our attention to a single agent-based RL scheduling approach, which motivates our proposal.
In this section, we detail the system architecture of MDLdroid. We also present the proposed ChainSGD-reduce approach and Chain-scheduler.
We first give an overview of the MDLdroid architecture in Figure 2. Since MDLdroid is designed to operate full-scale DL on Android based on a mesh network, we employ both Bluetooth Low Energy (BLE) and Bluetooth Socket (BS) to build the mesh network due to their accessibility and low energy consumption. In principle, any on-device mesh-based protocol can be applied (§3.2). In the first stage, users input different model configurations based on their demands to train models. MDLdroid uses two combined network topologies in the second stage. The BS-based mesh topology is applied to perform the decentralized model aggregation between devices in the training stage, while the BLE-based tree topology is used for centralized resource condition monitoring in the task scheduling stage. In the third stage, each device is required to continually report its resource condition to a mobile agent (MA). Once all training tasks request model aggregation for each iteration, the Chain-scheduler on the MA manages the resource-efficient scheduling paths as a chain-directed graph. When the model aggregation of each iteration is completed, a copy of the aggregated model parameters will be broadcast in a resource-aware manner (§3.2.3) to all devices for the next iteration. Finally, once training is completed, a global model will be distributed to all devices via broadcasting for model inference. In particular, MDLdroid can also reload a pre-trained model to continue training with new local sensing data.
Collaborative learning relies on distributed stochastic gradient descent (SGD) algorithms [19]. To achieve reliable training accuracy, a central server typically gathers local gradient parameters $\Delta w_i^k$ from all machines and aggregates them into global gradient parameters $\Delta w^{(k)}(t)$ after each training iteration. The central server then updates the local parameters $w^{(k)}(t+1)$ for all machines via a one-to-many broadcast [13]:

\[ \Delta w^{(k)}(t) = \frac{1}{N}\sum_{i=1}^{N} \Delta w_i^k, \qquad w^{(k)}(t+1) = w^{(k)}(t) - \eta\,\Delta w^{(k)}(t) \tag{1} \]

where $w^{(k)}(t)$ denotes the $k$-th parameters at each training iteration $t$, $\Delta w_i^k$ denotes a local gradient parameter from the $i$-th machine, $N$ is the number of machines and $\eta$ is the learning rate.

Asynchronous SGD vs. Synchronous SGD. The distributed SGD algorithms are mainly classified into two categories: Asynchronous SGD (A-SGD) and S-SGD [9]. A-SGD can be more communication-efficient and runs with no strong dependency among machines. However, a study [45] points out that A-SGD suffers from an uncertain training accuracy degradation issue due to its delayed model updating mechanism. S-SGD, represented in Eq. (1), on the other hand, runs quite stably without this issue, but its drawback is that some machines may be slowed down at run-time due to dynamically low resources, and hence the overall training time depends on the slowest machine. Given the limited training conditions on mobile devices, S-SGD can effectively achieve higher accuracy and more reliable training performance than A-SGD, which motivates the use of S-SGD in our approach.
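As an illustration of Eq. (1), the following self-contained Python sketch performs one S-SGD step over hypothetical worker gradients; the function name and flat-list parameter representation are illustrative, not MDLdroid's API:

```python
def ssgd_step(w, local_grads, lr):
    """One synchronous SGD step (Eq. 1): average the N local gradients,
    then apply the identical update to the shared parameters."""
    n = len(local_grads)
    # Global gradient: element-wise mean over all workers' gradients.
    global_grad = [sum(g[k] for g in local_grads) / n
                   for k in range(len(w))]
    # Broadcast update: w(t+1) = w(t) - eta * dw(t), same on every worker.
    return [wk - lr * gk for wk, gk in zip(w, global_grad)]

w = [1.0, -2.0]
grads = [[0.2, 0.4], [0.6, 0.0], [0.4, 0.2]]   # three hypothetical workers
w_next = ssgd_step(w, grads, lr=0.5)
# mean gradient ≈ [0.4, 0.2], so w_next ≈ [0.8, -2.1]
```

Because every worker applies the same averaged gradient, all replicas stay bit-identical after each iteration, which is the stability property that motivates S-SGD here.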
Chain-directed Synchronous SGD. In a decentralized DL framework, the relationship between training nodes and the central node is decoupled. Therefore, the centralized model aggregation can be separated into multiple descendant aggregations by structural transformation. Existing work [18] presents decentralized SGD algorithms based on a fixed directed network. We define a decentralized topology with a fixed directed graph as $(V, E)$, where $V$ denotes a set of devices and $E$ represents a set of edges. When we have $N$ training devices, $V = \{1, 2, ..., N\}$ and $E \in \mathbb{R}^{V \times V}$. We define a directed edge as $(i, j) \in E$, which means device $i$ can send its gradient parameter to device $j$ for model aggregation. The number of device $j$'s neighbors is $m$, and the number of devices $j$ is $n$, where $m, n \in N$. With these settings, we can transform the centralized model aggregation (CentralSGD) in Eq. (1) into the decentralized neighbor aggregation (NeighborSGD) as follows:

\[ \Delta w^{(k)}(t) = \frac{1}{n}\sum_{j=1}^{n} \left( \frac{\sum_{i=1}^{m} \Delta w_i^k + m\,\Delta w_j^k}{2m} \right) \tag{2} \]

However, although the model aggregation is divided into $j$ descendant neighbor aggregations, the single device $j$ still requires to
concurrently perform $m$ model aggregations, which may cause significant memory overhead, as revealed by our preliminary study. On the other hand, considering that the dynamicity of on-device resources is uncertain during training, the neighbor model aggregation based on the fixed directed decentralized topology may cause distinct latency if device $j$ pauses the process due to a low-resource condition.
To reduce memory overhead and latency, we propose a ChainSGD-reduce approach with a mesh-based decentralized topology. In this approach, $m$ is constantly managed as one for every neighbor aggregation to achieve a minimal peak in memory and communication overhead for both devices $i$ and $j$. Our approach also includes an agent-based RL Chain-scheduler to schedule the neighbor aggregation task as a dynamic chain-directed graph in a resource-efficient way.
Compared with the centralized graph and the decentralized neighbor graph, Figure 4 demonstrates that the major differences of ChainSGD-reduce are twofold: 1) the model aggregation is managed with only one of the neighbors at a time; 2) the order of the aggregation tasks is dynamically scheduled depending on the real-time resource condition of devices.

Figure 3: Training accuracy degradation and restoration when the number of devices increases: (a) N = ; (b) N = .

When ChainSGD-reduce is applied to NeighborSGD in Eq. (2), we reveal an accuracy degradation issue. Figure 3a shows that the training accuracy of NeighborSGD is lower than that of CentralSGD by 3% (N = ). As N increases, the training accuracy gap between NeighborSGD and CentralSGD becomes larger, by roughly 5% (N = ), because the global gradient parameters $\Delta w^{(k)}(t)$ decompose through the model aggregation process. We thus redefine ChainSGD as the following pair aggregation function $W(j,i)$, aiming to restore training accuracy:

\[ W(j,i) = \frac{\theta_i\,\Delta w_i^k + \theta_j\,\Delta w_j^k}{\theta_i + \theta_j}, \qquad \theta_i' = \theta_i + \theta_j \tag{3} \]

where $W(j,i)$ represents the aggregated gradient parameters sent from a remote device $j$ to device $i$ as a pair aggregation.
The global gradient parameters of each iteration are denoted as $\Delta w^{(k)}(t) = \sum^{(N-1)} W(j,i)$, i.e., the result of the $N-1$ pair aggregations. Since ChainSGD offers a dynamic model aggregation process, multiple pair aggregations can be performed simultaneously within one round, based on the resource condition of devices, to reduce latency, as shown in Figure 4. In particular, $\theta$ is a reversal parameter which enables the gradient parameters of $W(j,i)$ to be restored close to CentralSGD, and $\theta$ starts from 1. Once a pair aggregation is done, the local $\theta_i$ is updated with the remote $\theta_j$ as $\theta_i'$ for the next round. Our approach requires that the BS message for model aggregation contains $\Delta w^k$ and $\theta$ as $(\Delta w^k, \theta)$.

Figure 4: Model aggregation structure comparison: CentralSGD vs. NeighborSGD vs. ChainSGD.

Both Figures 3a and 3b show that the performance of ChainSGD is consistent with that of CentralSGD, with no accuracy degradation. In summary, the ChainSGD-reduce algorithm is shown in Algorithm 1 below.
ALGORITHM 1: ChainSGD-reduce Algorithm

Initialize parameters w, number of iterations t, number of devices N, training dataset D_train;
for all devices i ∈ N do
    for t epochs do
        Train model on local d_train^t ∈ D_train;
        while true do
            if an incoming BS message (W(j,i), θ) is received then
                Pair aggregation of ∆w_i^k in Eq. (3); update θ;
            else
                Get a remote neighbor j from MA;
                Send (∆w_i^k, θ) to neighbor j; break;
            end
        end
        Update local w_i^(k)(t+1) = w_i^k(t) − η ∆w^k(t);
    end
end

Resource-aware Broadcasting. In ChainSGD-reduce, the MA device handles the global model parameter broadcasting after each training iteration. If the MA device broadcasts sequentially via BS messages, it may suffer from heavy communication overhead when the number of devices N is large. We employ a binomial-tree broadcast algorithm [14] with a message forwarding function to address this issue. Basically, our implementation sorts the broadcast list by the devices' resource condition, and the devices with more resources are assigned a list of forwarding tasks to mitigate the overhead on the MA. Thus, the number of broadcast rounds by the MA can be reduced from N to ⌈log N⌉.

Fault-tolerant. MDLdroid offers a fault-tolerant strategy to ensure the stability of on-device training: 1)
File cache: model parameters and necessary records can be saved as file backups at run-time for training recovery; 2)
Training device fault: once a training device is lost after the re-connection attempts, File cache will first be performed. Then the MA will skip the device for scheduling until it reconnects. If re-connection is successful, the MA will require the device to send over the last model parameters and iteration records. After that, the MA will send back the current iteration records for synchronization, and the device will receive the global aggregated model parameters at the next iteration.
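To make the pair aggregation of Eq. (3) concrete, the following Python sketch (names are illustrative, not MDLdroid's API) chains θ-weighted pair aggregations across four hypothetical devices, assuming every θ starts at 1, and shows that the result matches the CentralSGD average of Eq. (1):

```python
def pair_aggregate(grad_i, theta_i, grad_j, theta_j):
    """Eq. (3): theta-weighted average of two gradient vectors.
    The surviving device carries the combined weight theta_i + theta_j."""
    merged = [(theta_i * a + theta_j * b) / (theta_i + theta_j)
              for a, b in zip(grad_i, grad_j)]
    return merged, theta_i + theta_j

# Hypothetical local gradients from four devices; every theta starts at 1.
grads = [[4.0], [8.0], [12.0], [16.0]]
g01, t01 = pair_aggregate(grads[0], 1, grads[1], 1)  # round 1, in parallel
g23, t23 = pair_aggregate(grads[2], 1, grads[3], 1)  # round 1, in parallel
g, t = pair_aggregate(g01, t01, g23, t23)            # round 2
# g == [10.0], the CentralSGD mean of the four gradients; t == 4
```

Because each θ records how many raw gradients a partial result already represents, any dynamically scheduled pairing order yields the same global average, which is what allows Chain-scheduler to reorder the pair aggregations freely by resource condition.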
We propose Chain-scheduler, an agent-based RL model, to dynamically schedule model aggregation tasks. To mitigate the memory overhead analyzed in our preliminary experiment (§2), we design a single agent-based RL mechanism. The agent task is separated onto an individual device, the MA, which is responsible for resource-aware scheduling, while the other devices are mainly responsible for running training tasks. With a mesh network, the MA can globally monitor all training devices' resource condition in real-time at low energy cost. In MDLdroid, two optimization goals are defined for Chain-scheduler to achieve resource efficiency. One is to reduce the training latency caused by the resource condition of some devices dynamically turning busy in real-time. The other is to balance communication overhead and battery consumption across the network.
Mathematically, Chain-scheduler deals with the following constrained optimization function:

\[ \arg\min\; N(T(t)) + N(E(t)) \quad \text{s.t.}\; M \le M_{max},\; B \le B_{max} \tag{4} \]

where $T$ and $E$ denote training time and energy balance across the network, respectively. $N(x) = (x - x_{min})/(x_{max} - x_{min})$ is a standard normalization function to transform training time and energy balance to the same scale. $t$ represents the current training iteration. We denote $M_{max}$ and $B_{max}$ as the maximum memory and maximum battery offered by the target device, respectively.

Training Time T. The training time for each iteration $t$ is defined as:

\[ T(t) = T_{tr} + f(T_a, T_b) \tag{5} \]

where $T_{tr}$ denotes the training time of each individual device, $T_a$ denotes the overall model aggregation time for iteration $t$, and $T_b$ represents the overall latency time caused by the busy state of devices. $f(x)$ is the schedule function.
Especially, due to concurrentprocesses, f ( x ) can schedule T a and T b to reduce iteration latency.The worse case is T a + T b in which the scheduled sequence is linearand no concurrent overlap, while the ideal case is max ( T a , T b ) inwhich training latency is minimized as the concurrent overlap. Energy Balance E The energy balance for each iteration t canbe modeled as: E ( t ) = N N (cid:213) i = ( E ic − µ ) (6)where µ is the mean of { E , E ... E N } , and E ic denotes the energyconsumption of device i . We employ a variance function to repre-sent the balance performance of Chain-scheduler.The best estimated scheduling sequence represents as a reversetree structure of Tree-scheduler [37] to reduce latency, which re-quires ⌈ log N ⌉ communication rounds. While the worse case rep-resents as a ring structure of Ring-scheduler [28] with N commu-nication rounds. However, Ring-scheduler can perform the best Send model Module Aggregation Count for balance1
Figure 5: (a) The process of scheduling under busy conditions; (b) State diagram and re-training mechanism

energy balance, as each device only needs to aggregate the model once. Thus, Chain-scheduler aims to achieve the optimal trade-off between
Tree-scheduler and
Ring-scheduler.

Multi-goal Reward Function. We employ a DQN model [33] in the MA to learn the scheduling environment based on the optimization function in Eq. (4). In our protocol, all training devices are required to continually report their resource condition to the MA. We select five essential resource parameters, including free memory (i.e., remaining memory), battery (i.e., remaining battery), in-use (i.e., whether under intensive use), charge (i.e., whether charging), and cpu (i.e., current CPU load), to identify whether a device is busy (S_busy) or free (S_free). Next, we present the design of the Chain-scheduler structure:

• State: We design five states for Chain-scheduler to make crucial scheduling decisions based on our protocol. s = {s_t, t = 1, 2, ..., T} represents the set of devices' states, where t denotes the learning step. States = (S_free, S_busy, S_send, S_get, S_done), where S_send denotes that a device is sending model parameters and S_get denotes that a device is getting model parameters. A device can be in only one of the five states at any learning step. Figure 5b illustrates the relationship of these states.
• Action: a = {a_i, i = 1, 2, ..., N}, where N denotes the number of devices. An action represents which device is selected by the scheduler to act in the environment.
• Reward: r = {r_t, t = 1, 2, ..., T} is defined by the reward function r(s_t, a, s_{t+1}) in Eq. (7).

Reward Model.
To solve the optimization function in Eq. (4), we design a three-stage mechanism as follows to offer a fast learning process.
IPSN'20, April 21–24, 2020, Sydney, Australia

Firstly, to reduce training latency, we design a decaying penalty function ρ + tρ/(2(N − 1)), aiming to give S_busy the lowest priority. Since our protocol requires that a pair (i, j) be selected in each scheduling step, it takes two steps in the learning process; hence 2(N − 1) represents the best number of learning steps in one Epoch. As N increases, the penalty reward of S_busy decreases. Thus, the later S_busy is chosen, the smaller the penalty reward the model gets. From a concurrency perspective, the training latency can be reduced
as the process of model aggregation can be done with S_free in advance. Secondly, to balance energy consumption, if the count of S_free → S_get for each device is close to the average, the energy variance of all devices in Eq. (6) reaches its minimum. Hence, we design a penalty function α − βn_agg, where n_agg records the count of S_free → S_get. As n_agg increases, the penalty reward also increases. In addition, we design an incentive function α + βn_agg to speed up model learning. As n_agg increases, the model gets more reward when S_free → S_send is done. Thirdly, to achieve fast learning, we design an efficient termination function. If the selected action is a repeat, i.e., an invalid action, or the learning step exceeds the best number of learning steps, the model gets a severe penalty reward and the current learning episode is terminated. In summary, we present our reward function as follows:

r(t) = α − βn_agg           S_free → S_get
       α + βn_agg           S_free → S_send
       ρ + tρ/(2(N − 1))    S_busy
       −(severe penalty)    exceeds limits
       −(severe penalty)    select invalid action
       (terminal reward)    completed                            (7)

where α, β and ρ denote the initial reward, the initial busy penalty, and the battery cost per aggregation, respectively, and n_agg denotes the number of model aggregations. By default, α and ρ are set to small negative values and β to a small positive value.

Accelerated Reward Function. To further accelerate the RL learning process, we propose a threshold-based decaying greedy-exploration (TDGE) strategy that extends the existing decaying greedy-exploration (DGE) strategy [38]. DGE has been used as a default option with fair performance. However, Figure 6 shows that DGE-Fast (using a 2x exploration rate) completes earlier than DGE-Slow, stopping at 73 Epochs and 143 Epochs, respectively.

Figure 6: RL training speed comparison using different exploration rates

A key observation from our empirical study is that more sufficient exploration achieves a more efficient reduction in learning time. Therefore, the proposed TDGE strategy aims to ensure sufficient exploration at the beginning to accelerate the learning process. The idea is to let the model fully explore until the reward reaches a given threshold, and then switch the exploration to epsilon decay starting from a predefined value. Specifically, the threshold is defined approximately by achieving optimal latency and battery balance at the same time in Eq. (8), since the reward function is designed to optimize both features by Eq. (4). The estimation of optimal latency is defined as a reward value by Eq. (10), and the estimation of optimal battery balance is defined in Eq. (9).

We define the threshold T(N) by separating the calculation of the optimal reward into two parts in Eq. (8), mirroring the structure of the Chain-scheduler optimization function. The first part represents the maximal reward value of battery balance by Eq. (9), and the second part denotes the optimal reward value of latency by Eq. (10):

T(N) = f(N) + g(N) − Φ                                           (8)

where Φ denotes an offset value obtained from our experiments, and

f(N) = β(N mod 2) + f(⌊N/2⌋) if N > 1;  f(1) = 0                 (9)
g(N) = β log N                                                   (10)

In summary, the Chain-scheduler algorithm is shown in Algorithm 2.

ALGORITHM 2:
Chain-scheduler Algorithm

Initialize target network weights, epsilon ϵ = 1.0, threshold T_R in Eq. (8), and state s;
while Epoch < maxEpoch do
  for t ← 1 to maxStep do
    if random() < ϵ then
      randomly explore an action from validAction();
    else
      take action a with the ϵ-greedy policy based on the Q-value function Q(s_t, a_t);
    end
    receive s_{t+1} and r_t in Eq. (7);
    s_t ← s_{t+1};
    if s_{t+1} is terminated then
      if reward > T_R then ϵ ← ϵ_new end
      break;
    end
  end
  update ϵ by decaying;
end

Continuous Environment Learning. As analyzed in §2,
a non-stationary scenario has a negative impact on the performance of the RL model. However, since the MA can continuously monitor device resource conditions, this impact can be mitigated by a repeating environment-learning mechanism [2]. If the current resource condition differs from the previous record, the MA can restart the RL learning process to update the model. Figure 5b indicates three re-learning conditions. In the first two conditions, the RL learning process can be restarted within a small learning slot when the resource condition of the environment changes, and it finishes before the scheduling actions. For the third condition, if scheduling actions must be performed during the re-learning slot, Chain-scheduler can output a fair schedule based on the history experience of the RL model.
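Under stated assumptions (a stand-in environment API, base-2 recursion in Eq. (9), and illustrative constants), the threshold of Eqs. (8)–(10) and the TDGE loop of Algorithm 2 can be sketched as:

```python
# Illustrative sketch of Algorithm 2 (TDGE-driven Chain-scheduler training);
# the environment, Q-function, and constants are stand-ins, not MDLdroid code.
import math
import random

def f(n, beta):
    """Battery-balance part of the threshold, Eq. (9) (recursive halving)."""
    if n <= 1:
        return 0.0
    return beta * (n % 2) + f(n // 2, beta)

def g(n, beta):
    """Latency part of the threshold, Eq. (10); base-2 log is an assumption."""
    return beta * math.log2(n)

def tdge_threshold(n, beta, phi):
    """T(N) = f(N) + g(N) - Phi, Eq. (8)."""
    return f(n, beta) + g(n, beta) - phi

def train(env, q, max_epoch, max_step, t_r, eps=1.0, eps_new=0.5, decay=0.995):
    """Epsilon-greedy loop: explore fully until the episode reward exceeds
    the threshold t_r, then switch to decaying from a predefined eps_new."""
    for _ in range(max_epoch):
        s, episode_reward = env.reset(), 0.0
        for _ in range(max_step):
            if random.random() < eps:
                a = random.choice(env.valid_actions(s))               # explore
            else:
                a = max(env.valid_actions(s), key=lambda a: q(s, a))  # exploit
            s, r, done = env.step(a)              # reward per Eq. (7)
            episode_reward += r
            if done:
                if episode_reward > t_r:          # threshold reached:
                    eps = eps_new                 # start the decaying phase
                break
        eps *= decay                              # per-Epoch decay
    return eps
```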
Figure 7: Experiment screenshots. (a) User model input; (b) Training execution screenshot

Table 1: Dataset Specifications
Datasets         | Type | Task | Subjects | Classes | Samples | Rate (Hz) | Channels | Train (MB) | Test (MB)
sEMG [22]        | EMG  | GR   | 37       | 6       | 14695   | 50        | 8        | 25         | 6
MHEALTH [3]      | IMU  | HBM  | 10       | 9       | 3255    | 50        | 23       | 100        | 30
UniMiB [30]      | IMU  | FDR  | 30       | 8       | 8430    | 50        | 1        | 116        | 15
HAR [1]          | IMU  | ADLs | 30       | 6       | 10299   | 50        | 9        | 160        | 60
OPPORTUNITY [35] | IMU  | ADLs | 12       | 11      | 16837   | 50        | 77       | 500        | 50
PAMAP2 [34]      | IMU  | ADLs | 9        | 12      | 12397   | 100       | 9        | 900        | 90
We implement MDLdroid based on an open-source DL library (i.e., DL4J). In particular, we substantially modify DL4J to enable the proposed ChainSGD-reduce approach on device, and we tailor our implementation for execution on Android smartphones. With minor model configuration, MDLdroid is fully compatible with a range of DL models without scaling down the model. We employ 9 off-the-shelf Android smartphones; Table 2 gives their specifications. Figure 7a shows a screenshot in which a user customizes the model configuration, such as the parameters for certain datasets, customized hidden-layer structures, and the required number of training devices. Figure 7b shows a screenshot during a training execution on 9 smartphones using MDLdroid. The MA device scans all nearby devices and builds a BLE mesh network. The black dashed lines represent the BLE connections between the MA and the training devices for resource-condition monitoring. The yellow lines indicate a particular chain-directed model aggregation process via BS.
To evaluate MDLdroid, we select 6 public personal mobile sensing datasets with training scales ranging from 25MB to 900MB. These datasets are typically used for building a variety of personal mobile applications, e.g., gesture recognition (GR), recognition of activities of daily living (ADLs), fall detection recognition (FDR), and health behavior monitoring (HBM). The specification of each dataset is listed in Table 1.
Table 2: Smartphone specifications

Device         | HD    | RAM | CPU                 | Battery (mAh)
OnePlus 6      | 128GB | 8GB | Snapdragon 845      | 3300
Pixel 2 XL     | 64GB  | 4GB | Snapdragon 835      | 3520
Huawei Honor 8 | 32GB  | 4GB | HiSilicon Kirin 950 | 3000

For each dataset, Table 1 lists the number of samples, the sampling rate (Hz), the number of channels, the size of training data (MB), and the size of testing data (MB).
In our evaluation, all participating smartphones are placed in proximity, with a range of 1m to 5m between any two devices. To evaluate actual battery consumption, we set all smartphones to discharging, and before each experiment we reset the battery of each smartphone to full. For static dataset allocation, we divide each dataset equally among all training devices, since they have similar resource capacity, and pre-load a sub-dataset to each device in advance to simplify our evaluation. We first evaluate the performance of Chain-scheduler. We then compare MDLdroid to Federated Learning. Finally, we conduct a series of experiments to discover optimized parameters that trade off resource usage, training accuracy, and scalability.
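As a minimal sketch (the function name and the contiguous-shard policy are our assumptions), the equal static allocation described above can be written as:

```python
# Sketch of static dataset allocation: split a dataset into near-equal
# contiguous shards, one per training device, to be pre-loaded in advance.

def allocate(samples, num_devices):
    """Return num_devices contiguous shards whose sizes differ by at most 1."""
    base, extra = divmod(len(samples), num_devices)
    shards, start = [], 0
    for i in range(num_devices):
        size = base + (1 if i < extra else 0)   # spread the remainder evenly
        shards.append(samples[start:start + size])
        start += size
    return shards
```

For 9 devices of similar resource capacity, as in our setup, each device then receives roughly one ninth of the dataset.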
To evaluate Chain-scheduler, we select two resource-agnostic schedulers (i.e., Tree-scheduler and Ring-scheduler) as baselines. We also use both DGE and threshold-based greedy-exploration (TGE) (i.e., exploration relying only on the given threshold, without decaying) (§3.3.2) as baselines to compare with our TDGE approach and to evaluate the performance of the exploration strategy when training Chain-scheduler.
Experimental Setup. For a fair benchmark comparison, we simulate realistic resource-dynamicity scenarios for each device in MDLdroid. Specifically, we randomly allocate resources for each training device while ensuring that at most 50% of the devices are in the busy state. To evaluate our re-learning mechanism, we randomly switch the resource state of some devices between busy and free to emulate the conditions shown in Figure 5b. We run each experiment 50 times and report the performance and resource usage in the following sections.

Exploration Strategy of Chain-scheduler. In this experiment, we first compare TDGE with both DGE and TGE in terms of training time. Figure 8a shows the average training time for each exploration strategy with a network size ranging from 3 to 9 devices. The results show that TDGE outperforms TGE and DGE in training time by 1.2x and 1.3x, with a standard deviation 1.9x and 2.1x smaller, respectively. In addition, Figure 8b shows the convergence of the cumulative reward for each strategy with a network size of 8 devices. The triangle markers indicate convergence points. We observe that TDGE converges 1.2x and 1.4x faster than TGE and DGE, respectively. As a result, TDGE can accelerate the training of Chain-scheduler.
Figure 8: Chain-scheduler training comparison by different exploration strategies. (a) Training time; (b) Convergence of cumulative reward
Figure 9: Resource reduction in re-learning conditions. (a) Time; (b) Battery
Performance in Re-learning. In this experiment, we evaluate TDGE under a re-learning scenario in terms of training time and battery consumption. Figure 9a shows that TDGE outperforms DGE with less training time, especially in scenarios with more devices (e.g., 1.5x less when the number of devices is larger than 7). Figure 9b shows that, on average, TDGE consumes 1.3x less battery than DGE. Similarly, greater battery savings are observed for TDGE with more devices.
Scheduling Performance. In this experiment, we compare Chain-scheduler with Tree-scheduler and Ring-scheduler in terms of training time and energy balance using the HAR dataset in Table 1. Figure 10a shows that, on average, ChainSGD-scheduler outperforms Tree-scheduler and Ring-scheduler with 23% and 53% less training time, respectively. In addition, greater time savings are observed for ChainSGD-scheduler as the number of devices increases. Figure 10b shows that Ring-scheduler achieves the minimal energy variance among the three, and that ChainSGD-scheduler reduces energy variance by 40% on average compared to Tree-scheduler. This experiment demonstrates that ChainSGD-scheduler achieves the best trade-off between training time and energy variance compared to Tree-scheduler and Ring-scheduler.
Figure 10: Scheduling performance comparison. (a) Latency; (b) Energy balance

To give a comprehensive evaluation of the performance of MDLdroid, we compare MDLdroid with Federated Learning (FL) [7] from the perspectives of training accuracy and resource usage. Besides, we choose a server-based approach to further verify that the training accuracy is reliable. We then explore the optimized resource-accuracy trade-off options and the limitations of MDLdroid.
Experimental Setup. For performance benchmarking, we use FL as our baseline, implemented on a master-slave structure, i.e., the same system architecture, to keep training computation and communication costs identical for a fair comparison. In addition, we compare the training accuracy of MDLdroid vs. the server-based approach by running training with the same model configurations on a desktop computer using all 6 datasets. To measure battery consumption, we use Java reflection to access the system instance of BatteryStatsImpl after rooting the smartphones. To monitor real-time CPU and memory consumption on the smartphones, we use Android Debug Bridge (ADB) commands.
MDLdroid vs. FL. We compare the overall performance of MDLdroid vs. FL from three resource perspectives: peak-memory overhead, training time, and network energy balance.
Peak-Memory. Figure 11a plots the peak-memory value for each approach with a network size ranging from 1 to 9, using the PAMAP2 dataset based on LeNet. The result shows that the peak-memory overhead of the master device in FL increases linearly with the number of devices, while MDLdroid remains stable at low overhead because only a pair of devices is involved in each model aggregation task.
Training Time. Figure 11b shows that MDLdroid effectively saves training time, by 1.5x on average compared to FL. The training time in MDLdroid decreases as the network size increases, while the training time in FL presents a U-shaped curve as the communication time for model aggregations gradually increases. This is due to the efficient design of both model aggregation scheduling (§3.3) and resource-aware broadcasts (§3.2.3) in MDLdroid.
Network Energy Balance. Figure 11c shows that the energy consumption of each device for model aggregation in MDLdroid is much less than that of the master device in FL, especially in scenarios with more devices. The results show that a training device in MDLdroid achieves a 5.8x reduction in communication energy consumption on average compared to the master device in FL. In addition, another key observation is that MDLdroid achieves better energy balance among devices, as MDLdroid evenly distributes model aggregation
Figure 11: MDLdroid vs. FL performance comparison. (a) Memory; (b) Time; (c) Battery consumption; (d) One-batch training time breakdown comparison

tasks to each device during scheduling. Since MDLdroid requires a training device only to periodically report its resource condition via a tiny BLE message, this costs much less battery than sending model parameters via BS; e.g., fully training PAMAP2 for 20 Epochs on a single device costs roughly 15 mAh out of 3600 mAh.
Trade-off between Resource and Accuracy. Both the model aggregation frequency and the number of training Epochs can significantly impact the balance between resource usage and accuracy. In this experiment, we evaluate this trade-off in three aspects: training accuracy against a given threshold, battery consumption against the maximum battery, and training time. In addition, we design 5 trade-off parameters, i.e., 1-E20, 2-E20, 10-E20, 1-E30, and 1-E40, where each represents the number of model aggregation rounds per Epoch together with the number of training Epochs.

Scalability. We observe that the training accuracy decreases after applying trade-offs, as shown in Figure 12. As the network size increases, the training data size for each device decreases, so the learning convergence rate of a DL model becomes slow. However, Figure 12 indicates that the accuracy can be improved if we enlarge the number of Epochs from 1-E20 to 1-E40, although the training time and battery cost increase, as shown in Figures 14 and 13, respectively. Therefore, the scalability of the network may be limited by the trade-off between resource and accuracy.
Battery Saving. To explore the battery limitations of different trade-offs, we continually charge the smartphones during the experiment. Specifically, the battery consumption can be reduced by 2x to 8.5x compared to single-device training, as shown in Figure 13.
Table 3: Training Accuracy Comparison

Datasets    | State-of-the-art | Server | FL-B  | Chain-B | Chain-Off
HAR         | 96% [1]          | 93.8%  | 92.7%
MHEALTH     | 90% [3]          | 92.3%  | 91.0%
            | 85% [1]          | 96.1%  | 93.3%
sEMG        | 88% [22]         | 89.2%  | 86.3%
OPPORTUNITY | 85% [35]         | 88.6%  | 87.5%
As the number of devices increases, the battery consumption is effectively shared among multiple devices.
Training Time Reduction. Figure 14 shows that the training time of MDLdroid is reduced by 2x on average compared to single-device training. Besides, the training time increases significantly if the model aggregation frequency increases, since sending model gradient parameters via BS is time-consuming, depending on the model size for the dataset.
Training Speed Limitation. Figure 11d reveals that loading one batch of data from file to memory incurs a huge latency on device when using the DL4J libraries. The same result is found on the server. As DL4J requires converting data into an INDArray, the speed of this step depends on the performance of the hardware (i.e., CPUs). Since DL4J uses only CPUs on Android, the latency is much larger than on the server. Using accelerators available on mobile devices [8] may largely improve the training process.

In summary, MDLdroid accelerates training by 2x to 3.5x, as shown in Figure 14, and reduces battery consumption by 2x to 8.5x compared to a single device, as shown in Figure 13. As the number of training Epochs increases, the training accuracy increases by 2.5% on average, as shown in Figure 12, but much more battery and time are consumed. Increasing the model aggregation frequency can slightly improve training accuracy with a small impact on battery usage; however, the training time increases by up to 2x due to the heavy communication cost, as shown in Figure 14. Therefore, we choose 1-E20 as the optimized resource-accuracy option to achieve the best performance. Finally, Table 3 summarizes the comparison of training accuracy, and MDLdroid achieves reliable state-of-the-art results. In the table, FL-B denotes the best accuracy of FL, Chain-B denotes the best accuracy of MDLdroid, and Chain-Off denotes the accuracy of MDLdroid after the trade-off.
In this section, we discuss several limitations of our current prototype implementation.
Training Time. The long training time in Figure 14 may not be very practical for end users in reality. However, the training speed of the prototype implementation depends on several factors: 1) the speed of reading training data from a CSV file on a mobile device is low, as shown in Figure 11d; 2) the training performance on a single device is strongly limited by the DL4J libraries; 3) the standard model structures used are large for mobile training, and advanced lightweight models, such as MobileNet [16], should be applied in practice. We leave further improvement of the training time for future work.
Bluetooth Limitation
As an initial prototype system, we use Bluetooth Low Energy to build a mesh network, simplifying our implementation. However, when the model complexity increases,
sending a large number of model parameters via BS may suffer long latency due to the low bandwidth available in Bluetooth, and hence affect the overall training performance. Since Wi-Fi Direct [36] is widely available on smartphones and offers much higher bandwidth, a hybrid solution can be implemented that uses Wi-Fi Direct in the model aggregation process to reduce the communication cost between devices. Furthermore, proper model compression techniques [29] can be applied for efficient communication, which we leave for our future work.

Figure 12: Training accuracy by trade-off options. (a) HAR; (b) sEMG; (c) UniMiB; (d) PAMAP2; (e) MHEALTH; (f) OPPORTUNITY
Figure 13: Training battery consumption by trade-off options. (a) HAR; (b) sEMG; (c) UniMiB; (d) PAMAP2; (e) MHEALTH; (f) OPPORTUNITY
Figure 14: Training time by trade-off options. (a) HAR; (b) sEMG; (c) UniMiB; (d) PAMAP2; (e) MHEALTH; (f) OPPORTUNITY
Future User Interface and Applications. MDLdroid primarily targets personal sensing applications, which are privacy-sensitive and require low-latency responses for continuous model inference and update, but the framework can be applied to a wide range of DL-based low-latency applications with a moderate model size, such as real-time surveillance, image recognition, and natural language processing. The prototype user interface of MDLdroid is mainly used for experiments. Since end users may not want to manually set complex training parameters, we plan to embed MDLdroid into the mobile OS to offer automatic background training, and to develop a wider variety of applications to fully explore the capability of MDLdroid in our future work.
Decentralized Deep Learning. For decentralized frameworks, existing work [18] proposes a theoretical model based on a fixed directed graph and offers a decentralized SGD algorithm that exchanges model gradient parameters with one-hop neighbors. However, if the relationship between a device and its one-hop neighbors is one-to-many, the device still suffers a huge resource overhead, similar to the master-device case. Moreover, as the underlying topology is a fixed graph, it cannot properly operate under real-time resource dynamicity. By contrast, MDLdroid presents a dynamic chain-directed SGD algorithm based on a mesh network with a Chain-scheduler that enables a resource-aware model aggregation process to minimize training latency and reduce training resource overhead.
Resource-aware Mobile Deep Learning. Most existing work on resource-aware mobile DL focuses on inference tasks. NestDNN [12] proposes a multi-tenant framework that enables resource-aware on-device execution of inference tasks for mobile vision applications. Besides, MCDNN [12] presents a framework that executes multiple mobile vision applications based on a cloud-based inference solution. In MDLdroid, we fully implement and execute both DL training and inference tasks on devices.
Resource-aware Task Scheduling. Recent works [5] [33] propose MARL-based approaches to solve task scheduling in distributed networks, which achieve fair performance. However, due to on-device resource limitations, a MARL implementation cannot perform well alongside a training task on device. In contrast, MDLdroid applies a single-agent DQN approach for resource-aware task scheduling.
Towards pushing DL onto devices, in this paper we present MDLdroid, a novel decentralized mobile DL framework that enables resource-aware on-device collaborative learning for personal mobile sensing applications. MDLdroid achieves reliable state-of-the-art model training accuracy on multiple off-the-shelf mobile devices. The key advantages of MDLdroid include on-device mobile DL, high training accuracy, low resource overhead, low latency for model inference and update, and fair scalability.
ACKNOWLEDGMENTS
This work is supported by Australian Research Council (ARC) Discovery Project grants DP180103932 and DP190101888.
REFERENCES
[1] Davide A., Alessandro G., Luca O., Xavier P., and J. L. Reyes-Ortiz. 2013. A Public Domain Dataset for Human Activity Recognition using Smartphones. In ESANN'13.
[2] S. Abdallah and M. Kaisers. 2016. Addressing Environment Non-Stationarity by Repeating Q-learning Updates. Journal of Machine Learning Research (2016).
[3] Oresti B., Rafael G., Juan A. H., Miguel D., Hector P., Ignacio R., Alejandro S., and Claudia V. 2014. mHealthDroid: A Novel Framework for Agile Development of Mobile Health Applications. In Ambient Assisted Living and Daily Activities.
[4] Amin B. Abkenar, Seng Loke, Arkady Zaslavsky, and Wenny Rahayu. 2019. GARSAaaS: group activity recognition and situation analysis as a service. JISA'19 (2019).
[5] D. b. noureddine, Atef G., and Samir A. 2017. Multi-agent Deep Reinforcement Learning for Task Allocation in Dynamic Environment. In ICSOFT'17.
[6] Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, and Seraphin Calo. 2019. Analyzing Federated Learning through an Adversarial Lens. In ICML'19.
[7] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, Chloé Kiddon, Jakub Konecný, S. Mazzocchi, H. B. McMahan, T. V. Overveldt, D. Petrou, D. Ramage, and J. Roselander. 2019. Towards Federated Learning at Scale: System Design. CoRR (2019).
[8] Yitao Chen, Saman Biookaghazadeh, and Ming Zhao. 2018. Exploring the Capabilities of Mobile Devices Supporting Deep Learning. In HPDC'18.
[9] Jeffrey D., Greg S. C., Rajat M., Kai C., Matthieu D., Quoc V. L., Mark Z. M., Marc'Aurelio R., Andrew S., Paul T., Ke Y., and Andrew Y. N. 2012. Large Scale Distributed Deep Networks. In NIPS'12.
[10] deeplearning4j. 2019. NDArrays: How Are They Stored in Memory? Deeplearning4j: Open-source, distributed deep learning for the JVM (2019).
[11] Yunbin Deng. 2019. Deep Learning on Mobile Devices - A Review. CoRR (2019).
[12] Biyi F., Xiao Z., and Mi Z. 2018. NestDNN: Resource-Aware Multi-Tenant On-Device Deep Learning for Continuous Mobile Vision. In MobiCom'18.
[13] S. Gupta, W. Zhang, and F. Wang. 2016. Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study. In ICDM'16.
[14] T. Hoefler, C. Siebert, and W. Rehm. 2007. A practically constant-time MPI Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast. In IPDPS'07.
[15] Seyed Amir Hoseini-Tabatabaei, Alexander Gluhak, and Rahim Tafazolli. 2013. A Survey on Smartphone-Based Systems for Opportunistic User Context Recognition. CSUR'13 (2013).
[16] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. CoRR (2017).
[17] Jeya Vikranth Jeyakumar, Liangzhen Lai, Naveen Suda, and Mani Srivastava. 2019. SenseHAR: A Robust Virtual Activity Sensor for Smartphones and Wearables. In SenSys'19.
[18] Zhanhong Jiang, Aditya Balu, Chinmay Hegde, and Soumik Sarkar. 2017. Collaborative Deep Learning in Fixed Topology Networks. In NIPS'17.
[19] Jakub Konecný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. 2016. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. CoRR (2016).
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS'12.
[21] J. Leon Kröger, Philip R., and T. Rahman B. 2019. Privacy Implications of Accelerometer Data: A Review of Possible Inferences. In ICCSP'19.
[22] Sergey L., Nadia K., Innokentiy K., Victor K., and Valeri A. M. 2018. Latent Factors Limiting the Performance of sEMG-Interfaces. Sensors (2018).
[23] Xiangru L., Ce Z., Huan Z., Cho-Jui H., Wei Z., and Ji L. 2017. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. In NIPS'17.
[24] Nicholas D. Lane and Petko Georgiev. 2015. Can Deep Learning Revolutionize Mobile Sensing? In HotMobile'15.
[25] Francisco Laport-López, Emilio Serrano, Javier Bajo, and Andrew T. Campbell. 2019. A review of mobile sensing systems, applications, and opportunities. KAIS'19 (2019).
[26] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE (1998).
[27] En Li, Zhi Zhou, and Xu Chen. 2018. Edge Intelligence: On-Demand Deep Learning Model Co-Inference with Device-Edge Synergy. In MECOMM'18.
[28] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In OSDI'14.
[29] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. 2017. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. CoRR (2017).
[30] Daniela M., Marco M., and Paolo N. 2017. UniMiB SHAR: a new dataset for human activity recognition using acceleration data from smartphones. CoRR (2017).
[31] E. L. M., Saeed A., Mark M., Matthew K., Julie A. K., Tanzeem C., Geri G., and Dan C. 2016. Mobile Manifestations of Alertness: Connecting Biological Rhythms with Patterns of Smartphone App Use. In MobileHCI'16.
[32] Riccardo Miotto, Fei Wang, Shuang Wang, and Xiaoqian Jiang. 2017. Deep learning for healthcare: review, opportunities and challenges. Briefings in Bioinformatics (2017).
[33] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi. 2018. Deep Reinforcement Learning for Multi-Agent Systems: A Review of Challenges, Solutions and Applications. CoRR (2018).
[34] A. Reiss and D. Stricker. 2012. Introducing a New Benchmarked Dataset for Activity Monitoring. In ISWC'12.
[35] D. Roggen, A. Calatroni, M. Rossi, T. Holleczek, K. Förster, G. Tröster, P. Lukowicz, D. Bannach, G. Pirkl, A. Ferscha, J. Doppler, C. Holzmann, M. Kurz, G. Holl, R. Chavarriaga, H. Sagha, H. Bayati, M. Creatura, and J. d. R. Millàn. 2010. Collecting complex activity datasets in highly rich networked sensor environments. In INSS'10.
[36] A. A. Shahin and M. Younis. 2014. A framework for P2P networking of smart devices using Wi-Fi direct. In PIMRC'14.
[37] Shaohuai Shi, Qiang Wang, Kaiyong Zhao, Zhenheng Tang, Yuxin Wang, Xiang Huang, and Xiaowen Chu. 2019. A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks. CoRR (2019).
[38] Michel Tokic. 2010. Adaptive ϵ-Greedy Exploration in Reinforcement Learning Based on Value Differences. In KI 2010: Advances in Artificial Intelligence.
[39] He Wang, Ted Tsung-Te Lai, and Romit Roy Choudhury. 2015. MoLe: Motion Leaks Through Smartwatch Sensors. In MobiCom'15.
[40] J. Wang, B. Cao, P. Yu, L. Sun, W. Bao, and X. Zhu. 2018. Deep Learning towards Mobile Applications. In ICDCS'18.
[41] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan. 2019. Adaptive Federated Learning in Resource Constrained Edge Computing Systems. J-SAC'19 (2019).
[42] J. Yang, H. Xu, and P. Jia. 2009. Task Scheduling for Heterogeneous Computing Based on Bayesian Optimization Algorithm. In CIS'09.
[43] Jian Bo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. 2015. Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition. In IJCAI'15.
[44] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated Machine Learning: Concept and Applications. TIST'19 (2019).
[45] Kwangmin Yu, Thomas Flynn, Shinjae Yoo, and Nicholas D'Imperio. 2019. Layered SGD: A Decentralized and Synchronous SGD Algorithm for Scalable Deep Neural Network Training. CoRR (2019).
[46] D. Zhang, X. Chen, D. Wang, and J. Shi. 2018. A Survey on Collaborative Deep Learning and Privacy-Preserving. In DSC'18.
[47] Xiang Zhang, Lina Yao, Chaoran Huang, Sen Wang, Mingkui Tan, Guodong Long, and Can Wang. 2018. Multi-modality Sensor Data Classification with Selective Attention. In IJCAI-18.
[48] G. Zyskind, O. Nathan, and A. Pentland. 2015. Decentralizing Privacy: Using Blockchain to Protect Personal Data. In