FedMood: Federated Learning on Mobile Health Data for Mood Detection
Xiaohang Xu, Hao Peng, Lichao Sun, Md Zakirul Alam Bhuiyan, Lianzhong Liu, Lifang He
Federated Depression Detection from Multi-Source Mobile Health Data
Xiaohang Xu*, Hao Peng*†, Lichao Sun‡, Yan Niu§, Hongyuan Ma¶, Lianzhong Liu*, Lifang He‖
* School of Cyber Science and Technology, Beihang University, Beijing 100083, China.
† Beijing Advanced Innovation Center for Big Data and Brain Computing, Beijing 100083, China.
‡ Department of Computer Science, University of Illinois at Chicago, Chicago, USA.
§ China Academy of Industrial Internet, Beijing, China.
¶ National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China.
‖ Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, USA.
Abstract—Depression is one of the most common mental illnesses, and the symptoms shown by patients are not consistent, making it difficult to diagnose in the course of clinical practice and pathological research. Although researchers hope that artificial intelligence can contribute to the diagnosis and treatment of depression, traditional centralized machine learning needs to aggregate patient data, and the data privacy of patients with mental illness must be strictly protected, which hinders the clinical application of machine learning algorithms. To address the privacy of the medical histories of patients with depression, we apply federated learning to analyze and diagnose depression. First, we propose a general multi-view federated learning framework using multi-source data, which can extend any traditional machine learning model to support federated learning across different institutions or parties. Second, we adopt late-fusion methods to solve the problem of inconsistent time series across multi-view data. Finally, we compare the federated framework with other cooperative learning frameworks in terms of performance and discuss the results.
Index Terms—Federated learning; Depression; Data privacy; Mobile device.
I. INTRODUCTION
Depression is a very common disease. More than 300 million people worldwide suffer from it. At present, the diagnosis of depression depends almost entirely on the subjective judgment of the doctor, formed through communication with the patient and the questionnaires the patient fills out. The Hamilton Depression Rating Scale (HDRS) [1] and the Young Mania Rating Scale (YMRS) [2] are commonly used evaluation criteria when diagnosing depression. To better help doctors diagnose depression, researchers analyze patient data with machine learning technology [3]. But when using machine learning, there is a tension between the performance of the model and the protection of data privacy [4]. First, the quality of a machine learning model is closely related to the amount of data. Deep Neural Networks (DNNs) have achieved good results in a variety of medical applications, but they depend heavily on the amount and diversity of training data [5]. Although Wang et al. [6] proposed a scheme to avoid data privacy leakage under centralized learning, hospitals need to protect the privacy of patients' diagnosis data, so different medical institutions cannot gather and share data [7], which greatly affects the accuracy of the model [8]. For example, in electrocardiogram work, because a single medical institution cannot collect enough high-quality data, the model's predictive ability cannot reach the level of clinical assistance. Second, although many machine learning algorithms involve privacy protection, it is difficult for them to achieve good training results. There are many privacy-sensitive machine learning application scenarios, such as recommendation systems and face recognition.
However, privacy-preserving machine learning methods need to add noise according to the sensitivity of the algorithm's intermediate products, so under a limited privacy budget, the prediction performance of a private algorithm is often poor [9]. In the medical field, the predictive performance of the model directly affects the doctor's diagnosis of the disease. Third, because of the huge gaps between medical institutions, the patient data they hold vary greatly. To handle all of these situations, algorithms and software must generalize well, and it is difficult for a model to obtain sufficient accuracy and specificity without data exchange. To address the above limitations, in 2016 Google [10] proposed a method called federated learning to break the medical data silos caused by patient data privacy. Instead of having to centralize data to train machine learning models, each health care facility aggregates the trained models in one place and continually optimizes the models using federated averaging, making the data useful to all health care facilities. However, most of the existing research using the federated learning framework in the medical field is based on hospitals' existing records. It mainly covers diagnosing the characteristics of patients with specific diseases, reducing the cost of diagnosis and treatment, medical image processing, and other issues [8], [11]. As mobile devices become more and more popular, smart phones, bracelets, and other devices are recording users' information all the time. According to existing research [12], the mobile phone, as one of the most important tools for information transmission in patients' lives, can also be an important data source for disease prediction.
We believe that keyboard keystroke dynamics, such as the interval between two keystrokes, can be used as a form of biometric identification to predict depression by analyzing the keystroke habits of patients with depression. The typing speed of depression patients usually differs from that of healthy people, which may be caused by emotional instability during the onset of the disease [13]. Our work uses a virtual keyboard customized for mobile phones to collect metadata (including key letters, special characters, and phone accelerometer values).

II. PRELIMINARY
Since the dataset we use has the problem that the time series under the three views have different frequencies and cannot be aligned, in this section we introduce the late fusion strategy adopted by the model to make the time series of the data consistent [15], [16]. We set the output vector at the end of the $p$-th view sequence as $h^{(p)}$, and let $\{h^{(p)} \in \mathbb{R}^{d_h}\}_{p=1}^{m}$ be the multi-view data, where $m$ is the number of views.

A. Fully connected layer
We first consider the simplest way of connecting multiple views directly, i.e., $h = [h^{(1)}; h^{(2)}; \ldots; h^{(m)}] \in \mathbb{R}^{d}$, where $d$ is the total number of multi-view features, and typically $d = m d_h$ ($d = 2 m d_h$) for a one-directional (bidirectional) GRU. The concatenated hidden state $h$ is fed into a fully connected neural network with a nonlinear function $\sigma(\cdot)$. The feature interaction of the input unit is as follows:
$$q = \mathrm{relu}(W^{(1)}[h; 1]), \quad \hat{y} = W^{(2)} q, \qquad (1)$$
where $W^{(1)} \in \mathbb{R}^{k \times (d+1)}$, $W^{(2)} \in \mathbb{R}^{c \times k}$, $k$ is the number of hidden units, $c$ is the number of classes, and the constant signal "1" models the global bias. To simplify the illustration, we use only one hidden layer, as shown in Fig. 1(a).

B. Factorization Machine layer
As shown in Fig. 1(b), instead of transforming the input with a nonlinear function, we directly model the features of each input part as follows:
$$q_a = U_a h, \quad b_a = W_a^{T}[h; 1], \quad \hat{y}_a = \mathrm{sum}([q_a \odot q_a; b_a]), \qquad (2)$$
where $U_a \in \mathbb{R}^{k \times d}$, $W_a \in \mathbb{R}^{d+1}$, $k$ is the number of factor units, and $a$ denotes the $a$-th class.

C. Multi-view Machine layer
Considering only the second-order feature interactions of the input data may not be comprehensive enough. We nest interactions up to the $m$-th order between the $m$ views in the following way:
$$\hat{y}_a = \beta_0 + \sum_{p=1}^{m}\sum_{i_p=1}^{d_p} \beta^{(p)}_{i_p} h^{(p)}_{i_p} + \cdots + \sum_{i_1=1}^{d_1}\cdots\sum_{i_m=1}^{d_m} \beta_{i_1,\ldots,i_m}\Big(\prod_{p=1}^{m} h^{(p)}_{i_p}\Big), \qquad (3)$$
where $\beta_0$ is the global offset, the second term is the first-order fusion, and the last term is the $m$-th order fusion. Next, each output vector $h^{(p)}$ is combined with the constant 1 as an additional feature, and Eq. 3 can be rewritten as follows:
$$\hat{y}_a = \sum_{i_1=1}^{d_1+1}\cdots\sum_{i_m=1}^{d_m+1} \omega_{i_1,\ldots,i_m}\Big(\prod_{p=1}^{m}[h^{(p)}; 1]_{i_p}\Big), \qquad (4)$$
where $\omega_{d_1+1,\ldots,d_m+1} = \beta_0$ and $\omega_{i_1,\ldots,i_m} = \beta_{i_1,\ldots,i_m}, \forall i_p \leq d_p$. Next, we decompose the $m$-th order weight tensor $\omega$ into $k$ factors: $\mathcal{C} \times U^{(1)} \times \cdots \times U^{(m)}$, where $U^{(p)} \in \mathbb{R}^{k \times (d_h+1)}$ is the factor matrix of the $p$-th view and $\mathcal{C} \in \mathbb{R}^{k \times \cdots \times k}$ is the identity tensor. Finally, we transform Eq. 4 as follows:
$$\hat{y}_a = \sum_{i_1=1}^{d_h+1}\cdots\sum_{i_m=1}^{d_h+1}\Big(\sum_{f=1}^{k}\prod_{p=1}^{m} U^{(p)}_{f,i_p}[h^{(p)}; 1]_{i_p}\Big). \qquad (5)$$
As shown in Fig. 1(c), we can simplify Eq. 5 as follows:
$$q^{(p)}_a = U^{(p)}_a [h^{(p)}; 1], \quad \hat{y}_a = \mathrm{sum}(q^{(1)}_a \odot \cdots \odot q^{(m)}_a). \qquad (6)$$

Fig. 1. A comparison of different strategies for fusing multi-view data from the perspective of the computational graph [14]: (a) Fully connected layer; (b) Factorization Machine layer; (c) Multi-view Machine layer.

III. METHODOLOGY
In this section, we introduce how to use the data generated by mobile devices to train local and federated learning models. We first discuss the reasons for the task definition, and then introduce the federated learning framework proposed by Google [10].
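Before turning to the problem setup, the three late-fusion heads recalled in Section II (Eqs. 1, 2, and 6) can be sketched numerically. This is a minimal NumPy illustration with toy dimensions and random placeholder weights, not the actual DeepMood implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_h, k, c = 3, 8, 4, 2                 # views, hidden size, factors, classes
h = [rng.standard_normal(d_h) for _ in range(m)]   # one GRU output per view

# (a) Fully connected fusion: concatenate views, then a dense layer (Eq. 1)
hc = np.concatenate(h)                     # h = [h(1); ...; h(m)], shape (m*d_h,)
W1 = rng.standard_normal((k, hc.size + 1))
W2 = rng.standard_normal((c, k))
q = np.maximum(0.0, W1 @ np.append(hc, 1.0))   # relu(W1 [h; 1])
y_dense = W2 @ q

# (b) Factorization Machine fusion for one class a (Eq. 2)
U = rng.standard_normal((k, hc.size))
w = rng.standard_normal(hc.size + 1)
qa = U @ hc
y_fm = np.sum(qa * qa) + w @ np.append(hc, 1.0)    # sum([qa ⊙ qa; ba])

# (c) Multi-view Machine fusion for one class a (Eq. 6)
Up = [rng.standard_normal((k, d_h + 1)) for _ in range(m)]
qs = [Up[p] @ np.append(h[p], 1.0) for p in range(m)]
y_mvm = np.sum(np.prod(qs, axis=0))                # sum(q(1) ⊙ ... ⊙ q(m))
```

Note how (c) only ever touches the per-view vectors $[h^{(p)}; 1]$, which is what makes the factorized $m$-th order interaction tractable.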
A. Problem Description
In the absence of a federated learning framework, medical institutions can only use local datasets, without any interactive process, when applying machine learning algorithms to disease diagnosis, medical imaging research, and so on. We retain this local learning setting as a baseline to measure how much the federated learning algorithm improves the model trained on multi-view heterogeneous data. We conceive the following three scenarios. First, there are several hospitals in a city, $\{H_1, \ldots, H_m\}$. Assume that patients with bipolar I disorder, patients with bipolar II disorder, and normal people who suspect they are ill visit the different hospitals uniformly for depression test scores, and each hospital also records the patients' mobile terminal data. From a certain moment, the hospitals stop collecting data. At this time, each hospital has a fixed amount of data $D_x$. Each medical institution first trains and tests on its own local data; the results obtained this way are generally difficult to use as a reference for diagnosing depression. Each participant then cooperates with other medical institutions in federated training, and in this process new medical institutions keep joining. Without reducing the total number of communication rounds, we increase the degree of parallelism and test the changes in prediction performance. Second, at a certain moment the number of hospitals in the city is constant at $n$, $\{H_1, \ldots, H_n\}$, and no new hospitals will be established for some time. Patients visit the hospitals uniformly as described above, and all medical institutions predict depressed mood through local training and federated learning. Initially, each hospital has a small amount of data $D_a$. As patients keep coming to the hospital for treatment and review, the hospital's amount of data $D_x$ keeps growing.
When the data added by each hospital reaches a threshold $D_f$, the participants restart training in the hope of improving the prediction performance. Third, in a real medical environment, the data owned by each hospital must be non-IID. We assume that patients with bipolar I disorder, patients with bipolar II disorder, and normal people who suspect they are ill each go only to a specific hospital $H_x$ for treatment. Each medical institution has a different amount of patient data, and the severity of patients' conditions is inconsistent, resulting in an extreme data distribution. On this basis, we measure how much this extreme distribution degrades model prediction accuracy compared to IID data. The specific data division is introduced in Section IV-C.

Fig. 2. The architecture of federated learning. First, the participating parties hold data on normal users, bipolar I users, and bipolar II users, with no data interaction between different parties. At the beginning of each communication round, the server assigns the global model to the parties participating in that round. Next, the activated parties train the local model on their own mobile health data and upload it to the server. Finally, the server updates the global model according to the uploaded local models.
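The uniform-visit and one-hospital-per-diagnosis scenarios above can be illustrated with a toy partition of the 20 study participants. The helper names and the exact assignment rules below are our own illustrative assumptions, not the paper's code:

```python
import random

random.seed(0)

# Toy patient pool: (patient_id, diagnosis) with the three groups in the study
patients = [(i, d) for i, d in enumerate(
    ["normal"] * 8 + ["bipolar_I"] * 6 + ["bipolar_II"] * 6)]

def iid_split(pool, n_parties):
    """Scenarios 1 and 2: patients visit hospitals uniformly at random (IID)."""
    shuffled = random.sample(pool, len(pool))
    return [shuffled[p::n_parties] for p in range(n_parties)]

def non_iid_split(pool, n_parties):
    """Scenario 3: each diagnosis group goes only to specific hospitals, so
    every party sees a skewed class mix (here: contiguous blocks by label)."""
    by_label = sorted(pool, key=lambda x: x[1])
    size = len(by_label) // n_parties
    return [by_label[p * size:(p + 1) * size] for p in range(n_parties)]

hospitals_iid = iid_split(patients, 4)
hospitals_skew = non_iid_split(patients, 4)
```

Under `non_iid_split`, the first hospital ends up holding only bipolar I patients, which is exactly the kind of extreme distribution the third scenario probes.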
B. Federated learning
Because of the privacy issues of patient health data stored in hospitals, we cannot use these data for centralized learning. Google has proposed a framework called federated learning. During the training of a federated learning model, the data owned by each hospital participating in the collaboration stays local and is never uploaded. Each hospital downloads the model from the server, trains it on its own data, and uploads the trained model or gradient to the server for aggregation; the server then sends the aggregated model or gradient information back to each hospital. Considering the communication burden, connection reliability, and other issues, we adopt the model averaging method for training. Assume that $K$ hospitals participate in federated learning. When the global model parameters are updated in round $t$, the $k$-th participant computes the average gradient of the current model parameters on its local data according to Eq. 7, and the server aggregates these gradients and updates the model parameters according to Eq. 8:
$$g_k = \nabla F_k(\omega_t), \qquad (7)$$
$$\omega_{t+1} \leftarrow \omega_t - \eta \sum_{k=1}^{K} \frac{n_k}{n} g_k, \qquad (8)$$
where $g_k$ is the average gradient of the local data at the current model parameters $\omega_t$, $\eta$ is the learning rate, and $\sum_{k=1}^{K} \frac{n_k}{n} g_k = \nabla f(\omega_t)$. The literature [10] also proposed an equivalent federated training method: each hospital performs one (or more) steps of gradient descent on the current model parameters using its local data according to Eq. 9 and sends the locally updated model parameters to the server. The server then computes the weighted average of the model parameters according to Eq. 10 and sends the aggregated parameters back to each hospital. The literature [10] shows that, compared with purely distributed SGD, this improved scheme can reduce the amount of communication by 10-100 times, and optimizers other than SGD can be chosen for the update.
$$\forall k, \quad \omega^{(k)}_{t+1} \leftarrow \omega_t - \eta g_k, \qquad (9)$$
$$\omega_{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n} \omega^{(k)}_{t+1}, \qquad (10)$$
where $\omega_t$ is the current model parameter of the local client. In this work, we use the federated learning framework to study the influence of the number of participants and the data volume, under both IID data and an extreme non-IID distribution, on the results of mood prediction from multi-view heterogeneous data collected on mobile terminals. In the practical use of a federated learning framework there are further issues, such as the flexibility and stability of parties joining and quitting at any time, parties dynamically growing and changing as patients increase, and the statistical contribution of each party to the model together with the design of corresponding incentive mechanisms. Although these issues are beyond the scope of our present work, we still deal with non-IID data and the stability of party participation. We specify the number of parties participating in each experiment. The dataset owned by each client is determined before training, and we fix the parties participating in each training update in advance to stabilize the final training result of each experiment.

IV. EXPERIMENTS
In this section, we introduce how to use the data generated by personal mobile devices to train deep learning models. We assume that a hospital allocates a dedicated mobile device to each user to collect the alphanumeric characters, special characters, and accelerometer values produced in conversation, and that the hospital has weekly HDRS test scores for its patients. Because of the particularities of diagnosing depression, patients may visit multiple hospitals seeking treatment, so some hospitals may hold data on the same patient.
A. Dataset
Algorithm 1: Federated Averaging. $K$ is the total number of parties, $M$ is the local minibatch size, $E$ is the number of local epochs, and $\eta$ is the learning rate.

Server executes:
  initialize $\omega_0$
  for each round $t = 1, 2, \ldots$ do
    $m_t \leftarrow$ random choose$(1, K)$
    $C_t \leftarrow$ (random set of $m_t$ parties)
    for each party $k \in C_t$ in parallel do
      $\omega^{(k)}_{t+1} \leftarrow$ LocalTraining$(k, \omega_t)$
    $\omega_{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n} \omega^{(k)}_{t+1}$

LocalTraining$(k, \omega_t)$:  // parties train in parallel
  $\omega^{(k)}_{1,1} \leftarrow \omega_t$
  for each local epoch $i$ from $1$ to $E$ do
    split $D_k$ into batches of size $M$ randomly
    for each batch $b$ from $1$ to $B = n_k / M$ do
      $\omega^{(k)}_{b+1,i} \leftarrow \omega^{(k)}_{b,i} - \eta \nabla F_k(\omega^{(k)}_{b,i})$
  return $\omega^{(k)}_{t+1} = \omega^{(k)}_{B,E}$ to the server

The data used in the experiment come from a real observational study of a free mobile app by BiAffect. In the data collection stage, the researchers provide the users with a special Android smartphone. The phone uses a customized virtual keyboard in place of the default keyboard, so as to collect the metadata of the user's input in the background without affecting operation. The collected keyboard content includes data such as the user's key input time, the number of keystrokes, and the phone accelerometer values. The three types of metadata we use are as follows:
Alphanumeric characters.
To protect user privacy, we do not collect the specific alphanumeric characters. We only collect the duration and timing of each keypress, the time elapsed since the previous keypress, and the distance from the previous key along the horizontal and vertical axes.
Special characters.
We perform one-hot encoding for operations including space, backspace, and keyboard switching. Compared with alphanumeric characters, special-character operations are less frequent.
Accelerometer value.
The accelerometer is recorded every 60 ms during sessions. Because different users have different typing speeds, the accelerometer values are recorded more densely than the alphanumeric characters. We define a session as ending once more than five seconds pass after the last keypress; owing to users' typing habits, the duration of a session is generally less than one minute. At the same time, participants complete the Hamilton Depression Rating Scale (HDRS) [1] and the Young Mania Rating Scale (YMRS) [2] once a week; these are clinical questionnaires for diagnosing and evaluating bipolar depression and are very effective assessments for bipolar disorder. After the data were collected, the study participants were grouped into extreme subjects and normal users: 6 participants suffered from bipolar I disorder, with severe episodes ranging from mania to depression; 6 participants suffered from bipolar II disorder, whose clinical manifestations were mildly elevated mood between mild manic episodes and severe depressive episodes; and 8 participants were diagnosed as normal subjects. Since the evaluation process relies only on the communication between patient and doctor and the indicators given by the rating scale, the results of the diagnosis are not necessarily reliable. Therefore, we try to predict the occurrence of depression from an objective perspective by recording real-time patient data.
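The session rule above (a session ends once more than five seconds pass without a keypress) can be sketched as follows; `split_sessions` is a hypothetical helper, not part of the BiAffect pipeline:

```python
def split_sessions(keypress_times, gap=5.0):
    """Group a sorted stream of keypress timestamps (seconds) into sessions:
    a new session starts whenever more than `gap` seconds elapse since the
    previous keypress."""
    sessions, current = [], []
    for t in keypress_times:
        if current and t - current[-1] > gap:
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

# Two bursts of typing separated by a 9-second pause form two sessions:
split_sessions([0.0, 0.4, 1.1, 10.1, 10.6])  # → [[0.0, 0.4, 1.1], [10.1, 10.6]]
```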
B. Experimental Setup
Our model is implemented in Keras with TensorFlow as the backend. All experiments are conducted on a 64-core Intel Xeon E5-2680 CPU with 512GB RAM and one NVIDIA Tesla P100-PCIE GPU. We use RMSProp [17] as the training optimizer. We retain sessions with between 10 and 100 keypresses, which yields 14,960 samples. Each user contributes their first 80% of sessions for training and the rest for validation.
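The session filtering and per-user chronological split described above can be sketched like this (`user_splits` is a hypothetical helper; a session is represented simply as a list of keypress events):

```python
def user_splits(sessions_by_user, keep_min=10, keep_max=100, train_frac=0.8):
    """Keep sessions with 10-100 keypresses, then let each user contribute
    their first 80% of sessions (chronological order) to training and the
    remaining 20% to validation."""
    train, val = [], []
    for user, sessions in sessions_by_user.items():
        kept = [s for s in sessions if keep_min <= len(s) <= keep_max]
        cut = int(len(kept) * train_frac)
        train += [(user, s) for s in kept[:cut]]
        val += [(user, s) for s in kept[cut:]]
    return train, val

# e.g. one user with sessions of 12 and 20 keypresses → 1 train, 1 validation
train, val = user_splits({"u1": [list(range(12)), list(range(20))]})
```

Splitting chronologically per user, rather than randomly, avoids leaking a user's later typing behavior into training while validating on earlier behavior.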
TABLE I
PARAMETER CONFIGURATION.

  Parameter                   Value
  DNN communication rounds    400
  DNN local epochs            15
  DFM communication rounds    300
  DFM local epochs            20
  DMVM communication rounds   400
  DMVM local epochs           15
  Batch size                  256
  Learning rate               0.001
  Dropout fraction            0.1
  Maximum sequence length     100
  Minimum sequence length     10

TABLE II
THE ACCURACY OF THE COMPARED MODELS UNDER DIFFERENT LOCAL EPOCHS AND COMMUNICATION ROUNDS. WE SHOW THE BEST RESULTS WITH BOLDFACE.

  Communication Rounds        100     200     300     400     500
  Model   Local epochs
  DNN     5                   0.7919  0.8295  0.8435
  DNN     15                  0.8338
We set the parameters based on experience and some experimental comparisons, including the number of communication rounds, the number of local epochs, the batch size, the learning rate, and the dropout rate. We consider sessions with an HDRS score between 0 and 7 (inclusive) as negative samples (normal) and those with an HDRS score greater than or equal to 8 as positive samples (mild to severe depression). Our code is open-sourced at https://github.com/RingBDStack/Fed_mood. To study the influence of the local-epochs parameter, we evenly distribute the training dataset to 8 participants for testing. The results are shown in Table II, from which we find: (1) As the number of communication rounds increases, the accuracy first rises and then decreases slightly. (2) Our results are inconsistent with those of Zhao et al. [18]: a large number of local epochs can significantly improve the effect of federated learning. However, when epochs = 20, the accuracy of DMVM shows a downward trend in every communication round. These results show that increasing the local epochs can make training more stable and speed up convergence, but may not make the global model converge to a higher accuracy level; that is, over-optimizing on the local datasets may cause a performance loss. (3) In the first 300 rounds, the fusion efficiency of DFM is higher than that of DNN and DMVM, which shows the improvement brought by the fusion layer, and DFM reaches better local minima of the loss function in some runs. Compared with centralized learning, because the local data shrink sharply, the effect of DMVM's fusion of multi-view and multi-level features is affected to a certain extent. Because the three models get different results at 15 and 20 local epochs, we set their parameters separately, as shown in Table I.
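The HDRS thresholding above amounts to a one-line labeling rule (sketch):

```python
def hdrs_label(score):
    """HDRS 0-7 (inclusive) → 0 (negative, normal); HDRS >= 8 → 1 (positive,
    mild to severe depression), following the labeling used in the experiments."""
    return 1 if score >= 8 else 0

labels = [hdrs_label(s) for s in (0, 7, 8, 25)]  # → [0, 0, 1, 1]
```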
C. IID Experiments

1) Compared Methods:
We compare FedAVG with the following methods, each of which represents a different strategy for data interaction.
Local Training: Each party uses only its own data for training, without any interaction with other parties.
CDS [19]: Collaborative data sharing is a traditional centralized machine learning strategy, which requires each party to upload its patient data to a central server for training.
IIL [19], [20]: Institutional incremental learning is a serial training method. Each party transfers its model to the next participant after finishing training, until all parties have trained once.
CIIL [19], [20]: Cyclic institutional incremental learning repeats the IIL training process, looping repeatedly through the parties while keeping the number of passes consistent with the number of federated learning local training epochs. In each experiment, the models we compare are summarized as follows:
DMVM: The proposed DeepMood architecture with a Multi-view Machine layer for data fusion.
DFM: The proposed DeepMood architecture with a Factorization Machine layer for data fusion.
DNN: The proposed DeepMood architecture with a conventional fully connected layer for data fusion.
In this work, for the IID setting, we randomly assign each client a uniform distribution over the three data categories: normal users, bipolar I disorder patients, and bipolar II disorder patients. The specific methods are as follows:
1. The amount of data per participant remains unchanged, while the number of parallel participants increases. The amount of data owned by each party is fixed at 1500, and the number of hospitals participating in training gradually increases from 4. We test the training effect with up to 24 parallel participants.
2. The number of concurrent participants remains the same, while the amount of data owned by each participant increases. Consistent with the hyperparameter-setting experiment, we set the number of concurrent participants to 8. The amount of data owned by each party gradually increases from 100, and we use about 25% (3000) of the total data as the maximum value in the experiment.
To make the experimental results stable, we run each group of experiments five times and average the results.

Fig. 3. Visualization of the labeling with t-SNE for the three views: (a) Alphanumeric characters; (b) Special characters; (c) Accelerometer value.
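To make the compared interaction strategies concrete, here is a toy one-parameter simulation of CDS, CIIL, and FedAvg. The quadratic-loss `local_update` is purely illustrative (it pulls a scalar weight toward a party's data mean); it is not the DeepMood training step:

```python
import numpy as np

def local_update(w, data, lr=0.1):
    """Stand-in for local training at one party: a toy quadratic loss
    pulls the scalar weight toward the party's data mean."""
    return w - lr * (w - np.mean(data))

def cds(parties, rounds=50):
    """CDS: all raw data is pooled on a central server (no privacy)."""
    pooled = np.concatenate(parties)
    w = 0.0
    for _ in range(rounds):
        w = local_update(w, pooled)
    return w

def ciil(parties, cycles=50):
    """CIIL: a single model is handed from party to party, cyclically."""
    w = 0.0
    for _ in range(cycles):
        for d in parties:          # one IIL pass; cycling repeats it
            w = local_update(w, d)
    return w

def fedavg(parties, rounds=50):
    """FedAvg: each round, every party trains a copy of the global model
    locally, and the server takes the weighted average (Eqs. 9-10)."""
    n = sum(len(d) for d in parties)
    w = 0.0
    for _ in range(rounds):
        w = sum(len(d) / n * local_update(w, d) for d in parties)
    return w
```

With two equally sized parties whose data means are 1.0 and 3.0, `cds` and `fedavg` both settle near the pooled mean 2.0, while `ciil` converges slightly above it, pulled toward the last-trained party; this mirrors the later observation (Section IV-E) that CIIL's result largely depends on the last trained party.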
2) Evaluation criteria:
To evaluate the influence of federated learning and local training on the prediction results, we adopt the following measure. Accuracy is one of the most frequently used criteria; it is the ratio of the number of correctly predicted samples to the total number of predicted samples. In the federated learning experiments, the central server tests the final global model with its own test set. In the local training experiments, we regard the local data as a whole and compare the number of samples correctly predicted by each participant against the test set.
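As used here, accuracy is simply:

```python
def accuracy(y_true, y_pred):
    """Ratio of correctly predicted samples to the total number of samples."""
    assert len(y_true) == len(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

accuracy([0, 1, 1, 0], [0, 1, 0, 0])  # → 0.75
```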
3) Experiment Result:
Table III shows the mood prediction performance as the number of parallel parties increases. Since local training has no interactive learning process and the amount of data owned by each participant is constant, its final result always fluctuates between 73% and 75%. In most cases CDS achieves the best prediction performance, but the best result of the DMVM model using CIIL reaches 85.29%, which is about 18% higher than local training without model weight updates. Table IV shows the accuracy as the amount of data per participant increases. The accuracy of both local training without weight updates and FedAvg training increases at the same time. When the amount of data for each party is small (data <

D. Non-IID Experiments
In a real medical environment, the data owned by hospitals will be non-IID. In this section we introduce the methods and results of the non-IID experiments.
1) Compared Methods:
The models we compare are listed in Section IV-C1. In this work, for the non-IID setting, we have the personal data of 8 normal users, 6 bipolar I disorder patients, and 6 bipolar II disorder patients. There are 4 hospitals participating in the training experiment, and the data volumes of different users are inconsistent. Each hospital has the data of two normal users, one bipolar I disorder patient, and one bipolar II disorder patient.

TABLE III
ACCURACY PERFORMANCE OF THE IID EXPERIMENTS I. WE SHOW THE BEST RESULTS WITH BOLDFACE.

  Number of party  Metrics         DNN     DFM     DMVM
  4                Local Training  0.7431  0.7514  0.7343
                   CDS
                   FedAVG          0.8283  0.8266  0.8156
                   IIL             0.7899  0.7774  0.7751
                   CIIL            0.8266  0.8288  0.8184
  12               Local Training  0.7406  0.7444  0.7238
                   CDS             0.8352
  16               Local Training  0.7356  0.7379  0.7248
                   CDS
  24               Local Training  0.7388  0.7455  0.7240
                   CDS
2) Evaluation criteria:
Our evaluation criteria are consistent with Section IV-C2; accuracy is still the criterion for evaluating mood prediction.
3) Experiment Result:
As shown in Table V, the prediction performance of CDS under the non-IID setting is far ahead of the distributed cooperative learning methods, and the federated learning prediction accuracy of the three models decreases by 5.2% (DNN), 13.3% (DFM), and 5.1% (DMVM). The extreme distribution of the non-IID data is the reason for this decline. We also find that, under non-IID data, the prediction performance of the two models that do not use nonlinear functions for feature interaction differs significantly. Because of the large differences in the amount of patient data owned by each party and the completely inconsistent patient data types, the second-order feature interaction fails to integrate all features well, and the logs also show that its prediction accuracy fluctuates more than DMVM's.
TABLE IV
ACCURACY PERFORMANCE OF THE IID EXPERIMENTS II. WE SHOW THE BEST RESULTS WITH BOLDFACE.

  Number of data  Metrics         DNN     DFM     DMVM
  100             Local Training  0.6420  0.6230  0.6180
                  CDS
                  FedAVG          0.7176  0.7172  0.7026
                  IIL             0.6295  0.6722  0.6126
                  CIIL            0.6958  0.7304  0.7043
  500             Local Training  0.6878  0.6894  0.6828
                  CDS
                  FedAVG          0.7865  0.8044  0.7810
                  IIL             0.7389  0.7272  0.7244
                  CIIL            0.7612
  1000            IIL             0.7647  0.7707  0.7642
                  CIIL            0.8266  0.8183  0.8165
  1500            Local Training  0.7267  0.7337  0.7344
                  CDS
                  FedAVG          0.8283  0.8266  0.8156
                  IIL             0.7899  0.7774  0.7751
                  CIIL            0.8279  0.8288  0.8184
  2000            Local Training  0.7516  0.7558  0.7395
                  CDS             0.8331
                  FedAVG          0.8432  0.8430  0.8390
                  IIL             0.8145  0.8130  0.8018
                  CIIL

TABLE V
ACCURACY PERFORMANCE OF THE NON-IID EXPERIMENT AND IID. WE SHOW THE BEST RESULTS WITH BOLDFACE.

  Types of data  Metrics  DNN     DFM     DMVM
  Non-IID        CDS
                 FedAVG   0.7695  0.7159  0.7684
                 IIL      0.6881  0.6881  0.7032
                 CIIL     0.7316  0.7416  0.7651
  IID            CDS
E. Discussion
In this section, we discuss the experimental results. As shown in Table III, when the amount of data held by each party is constant, CDS maintains the best performance on the DNN and DFM models in most cases, but on the DMVM model CIIL gives the best result. From Table IV we see that when the amount of data is 1000, the data of the parties do not overlap, and the federated learning framework achieves the best results under all three models. When the amount of data is 1500, the models are affected by repeated data for the first time: the prediction performance of FedAVG declines slightly, while the prediction performance of CIIL still rises. Since the prediction accuracy of CIIL mostly depends on the model of the last trained party, we conjecture that repeated input data seriously affects the fusion interaction of the multi-view machine layer. Compared with CDS, the last party to train in CIIL has less repeated data, while the federated framework is affected by repeated data through the last round of global model weight updates. Therefore CIIL is less affected than these two methods, and its best accuracy eventually reaches 85.29%. For the non-IID experiments, CDS still maintains a prediction accuracy of about 83%, while both federated learning and CIIL suffer a clear accuracy drop. However, under the non-IID setting, the federated learning framework surpasses CIIL on the DMVM model, which demonstrates the superiority of the multi-view machine layer under the federated framework. At the same time, in the non-IID experiment we also find that in one hospital the validation accuracy is distributed between 50% and 75% during training, while the training logs of the other hospitals show validation accuracy almost always above 90% after each round of local training. However, we have not yet dealt with the weights of participants that perform poorly during training.
Our next step is to construct an appropriate incentive mechanism to weaken the influence of participants with poor contributions on the overall prediction performance, so as to reduce the influence of non-IID data on the model weights. Furthermore, to test the influence of the different views on prediction, we visualize the data of each view, as shown in Fig. 3. We find that the distribution of Spec. is too scattered: it is difficult to distinguish normal people from patients using special operations such as backspace, space, and keyboard switching alone. Alph. and Accel. give better separability as single views. This also illustrates, from another angle, that there are obvious differences in typing patterns between normal people and depressed patients, including the duration of keystrokes; and judging from the distribution of accelerometer values, the way depression patients handle mobile phones also differs. In summary, it is necessary to merge the data from the different views as input.

V. RELATED WORK
In this section, we introduce related research on federated learning and multi-view learning, and discuss recently proposed federated multi-view learning methods.
Multi-view learning.
Xu et al. [21] pointed out that multi-view learning models each view with one function and uses the other views to jointly optimize all functions. Cao et al. [16], [22] used tensor products to process multi-view data. Yao et al. [23] integrated CNN, LSTM, and graph embedding to tackle complex nonlinear spatial and temporal dependencies in a multi-view manner. In addition, some work integrates multiple views into deep learning [24] and transfer learning [25], so as to help expand the available samples.
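The tensor-product idea in [16], [22] can be illustrated as follows; this is a hedged sketch of the general technique with made-up per-view feature vectors, not the authors' exact formulation. The outer product of two views' features captures every cross-view feature pair, and flattening it yields a joint representation for a downstream model:

```python
import numpy as np

# Hypothetical per-sample feature vectors from two views
# (e.g., keystroke features and accelerometer features).
view_a = np.array([0.5, 1.0, -0.2])
view_b = np.array([2.0, 0.1])

# Outer (tensor) product: entry (i, j) is the interaction
# between feature i of view A and feature j of view B.
interaction = np.outer(view_a, view_b)   # shape (3, 2)
joint = interaction.ravel()              # shape (6,)
print(joint.shape)  # (6,)
```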
Federated learning.
Here we mainly cover medical applications of federated learning and some common federated multi-view deep learning frameworks. Lee et al. [26] proposed a privacy-preserving platform in the federated environment that can find similar patients across hospitals without sharing patient-level information.
Huang et al. [27] improved the performance of federated learning for predicting mortality and length of stay by using feature autoencoders and patient clustering. However, methods combining federated learning with multi-view data are still at an early stage. Flanagan et al. [28] proposed a federated multi-view matrix factorization method that addresses the cold-start problem. Huang et al. [29] proposed FL-MV-DSSM, the first general content-based federated multi-view framework, which successfully extends traditional federated learning to federated multi-view learning. Kang et al. [30] proposed the FedMVT algorithm for semi-supervised learning, which improves the performance of vertical federated learning with limited overlapping samples. These methods use public datasets and are mainly applied to recommender systems to solve the cold-start problem, whereas our framework uses data collected from mobile devices to address medical mood prediction.

VI. CONCLUSION
Federated learning plays a key role in solving the problem of data islands, where user privacy and data security are paramount. In this work, we use the records generated by users when typing on their mobile phones, together with their HDRS scores, to predict depression through the DeepMood architecture. For IID data, across different amounts of data, the accuracy of federated learning is about 10%-15% higher than that of local training without weight updates. For non-IID data, accuracy is reduced by at most 13%, which is entirely acceptable from the perspective of protecting user privacy.

REFERENCES

[1] M. Hamilton, "The Hamilton rating scale for depression," in Assessment of Depression. Springer, 1986, pp. 143-152.
[2] R. C. Young, J. T. Biggs, V. E. Ziegler, and D. A. Meyer, "A rating scale for mania: reliability, validity and sensitivity," The British Journal of Psychiatry, vol. 133, no. 5, pp. 429-435, 1978.
[3] A. Grünerbl, A. Muaremi, V. Osmani, G. Bahle, S. Oehler, G. Tröster, O. Mayora, C. Haring, and P. Lukowicz, "Smartphone-based recognition of states and state changes in bipolar disorder patients," JBHI, vol. 19, no. 1, pp. 140-148, 2014.
[4] A. M. Darcy, A. K. Louie, and L. W. Roberts, "Machine learning and the profession of medicine," JAMA, vol. 315, no. 6, pp. 551-552, 2016.
[5] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, "Revisiting unreasonable effectiveness of data in deep learning era," in Proceedings of the IEEE ICCV, 2017, pp. 843-852.
[6] T. Wang, Z. Cao, S. Wang, J. Wang, L. Qi, A. Liu, M. Xie, and X. Li, "Privacy-enhanced data collection based on deep learning for internet of vehicles," IEEE Transactions on Industrial Informatics, 2019.
[7] M. Hao, H. Li, X. Luo, G. Xu, H. Yang, and S. Liu, "Efficient and privacy-enhanced federated learning for industrial artificial intelligence," IEEE Transactions on Industrial Informatics, vol. 16, no. 10, pp. 6532-6542, 2019.
[8] W. Li, F. Milletarì, D. Xu, N. Rieke, J. Hancox, W. Zhu, M. Baust, Y. Cheng, S. Ourselin, M. J. Cardoso et al., "Privacy-preserving federated brain tumour segmentation," in International Workshop on Machine Learning in Medical Imaging. Springer, 2019, pp. 133-141.
[9] Q. Yao, X. Guo, J. T. Kwok, W.-W. Tu, Y. Chen, W. Dai, and Q. Yang, "Privacy-preserving stacking with application to cross-organizational diabetes prediction," in Proceedings of IJCAI, 2019, pp. 4114-4120.
[10] H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas, "Federated learning of deep networks using model averaging," arXiv:1602.05629, 2016.
[11] Y. Kim, J. Sun, H. Yu, and X. Jiang, "Federated tensor factorization for computational phenotyping," in Proceedings of ACM KDD, 2017, pp. 887-895.
[12] E. Agu, P. Pedersen, D. Strong, B. Tulu, Q. He, L. Wang, and Y. Li, "The smartphone as a medical device: Assessing enablers, benefits and challenges," in IEEE International Workshop of Internet-of-Things Networking and Control, 2013, pp. 48-52.
[13] F. Hussain, J. P. Stange, S. A. Langenecker, M. G. McInnis, J. Zulueta, A. Piscitello, B. Cao, H. Huang, S. Y. Philip, P. Nelson et al., "Passive sensing of affective and cognitive functioning in mood disorders by analyzing keystroke kinematics and speech dynamics," in Digital Phenotyping and Mobile Sensing. Springer, 2019, pp. 161-183.
[14] B. Cao, L. Zheng, C. Zhang, P. S. Yu, A. Piscitello, J. Zulueta, O. Ajilore, K. Ryan, and A. D. Leow, "DeepMood: modeling mobile phone typing dynamics for mood detection," in Proceedings of the ACM KDD, 2017, pp. 747-755.
[15] S. Rendle, "Factorization machines with libFM," ACM TIST, vol. 3, no. 3, pp. 1-22, 2012.
[16] B. Cao, H. Zhou, G. Li, and P. S. Yu, "Multi-view machines," in Proceedings of ACM WSDM, 2016, pp. 427-436.
[17] T. Tieleman and G. Hinton, "Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26-31, 2012.
[18] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated learning with non-IID data," arXiv:1806.00582, 2018.
[19] M. J. Sheller, B. Edwards, G. A. Reina, J. Martin, S. Pati, A. Kotrotsou, M. Milchenko, W. Xu, D. Marcus, R. R. Colen et al., "Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data," Scientific Reports, vol. 10, no. 1, pp. 1-12, 2020.
[20] K. Chang, N. Balachandar, C. Lam, D. Yi, J. Brown, A. Beers, B. Rosen, D. L. Rubin, and J. Kalpathy-Cramer, "Distributed deep learning networks among institutions for medical imaging," Journal of the American Medical Informatics Association, vol. 25, no. 8, pp. 945-954, 2018.
[21] C. Xu, D. Tao, and C. Xu, "A survey on multi-view learning," arXiv:1304.5634, 2013.
[22] B. Cao, L. He, X. Kong, S. Y. Philip, Z. Hao, and A. B. Ragin, "Tensor-based multi-view feature selection with applications to brain diseases," in Proceedings of the IEEE ICDM. IEEE, 2014, pp. 40-49.
[23] H. Yao, F. Wu, J. Ke, X. Tang, Y. Jia, S. Lu, P. Gong, J. Ye, and Z. Li, "Deep multi-view spatial-temporal network for taxi demand prediction," arXiv:1802.08714, 2018.
[24] J. Zhang, B. Cao, S. Xie, C.-T. Lu, P. S. Yu, and A. B. Ragin, "Identifying connectivity patterns for brain diseases via multi-side-view guided deep architectures," in Proceedings of the SDM. SIAM, 2016, pp. 36-44.
[25] Q. Wu, H. Wu, X. Zhou, M. Tan, Y. Xu, Y. Yan, and T. Hao, "Online transfer learning with multiple homogeneous or heterogeneous sources," IEEE TKDE, vol. 29, no. 7, pp. 1494-1507, 2017.
[26] J. Lee, J. Sun, F. Wang, S. Wang, C.-H. Jun, and X. Jiang, "Privacy-preserving patient similarity learning in a federated environment: development and analysis," JMIR, vol. 6, no. 2, p. e20, 2018.
[27] L. Huang, A. L. Shea, H. Qian, A. Masurkar, H. Deng, and D. Liu, "Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records," JBI, vol. 99, p. 103291, 2019.
[28] A. Flanagan, W. Oyomno, A. Grigorievskiy, K. E. Tan, S. A. Khan, and M. Ammad-Ud-Din, "Federated multi-view matrix factorization for personalized recommendations," arXiv:2004.04256, 2020.
[29] M. Huang, H. Li, B. Bai, C. Wang, K. Bai, and F. Wang, "A federated multi-view deep learning framework for privacy-preserving recommendations," arXiv:2008.10808, 2020.
[30] Y. Kang, Y. Liu, and T. Chen, "FedMVT: Semi-supervised vertical federated learning with multiview training," arXiv:2008.10838, 2020.