Privacy-preserving Artificial Intelligence Techniques in Biomedicine
Reihaneh Torkzadehmahani, Reza Nasirigerdeh, David B. Blumenthal, Tim Kacprowski, Markus List, Julian Matschinske, Julian Späth, Nina Kerstin Wenke, Béla Bihari, Tobias Frisch, Anne Hartebrodt, Anne-Christin Hausschild, Dominik Heider, Andreas Holzinger, Walter Hötzendorfer, Markus Kastelitz, Rudolf Mayer, Cristian Nogales, Anastasia Pustozerova, Richard Röttger, Harald H.H.W. Schmidt, Ameli Schwalber, Christof Tschohl, Andrea Wohner, Jan Baumbach
Affiliations: Chair of Experimental Bioinformatics, Technical University of Munich, Freising, Germany; Gnome Design SRL, Sfântu Gheorghe, Romania; Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark; Department of Mathematics and Computer Science, Philipps-University of Marburg, Marburg, Germany; Institute for Medical Informatics/Statistics, Medical University Graz, Graz, Austria; Research Institute AG & Co KG, Vienna, Austria; SBA Research Gemeinnützige GmbH, Vienna, Austria; Department of Pharmacology and Personalised Medicine, MeHNS, FHML, Maastricht University, Maastricht, the Netherlands; Concentris Research Management GmbH, Fürstenfeldbruck, Germany
Abstract
Artificial intelligence (AI) has been successfully applied in numerous scientific domains including biomedicine and healthcare. Here, it has led to several breakthroughs ranging from clinical decision support systems and image analysis to whole genome sequencing. However, training an AI model on sensitive data also raises concerns about the privacy of individual participants. An adversarial AI, for example, can abuse even the summary statistics of a study to determine the presence or absence of an individual in a given dataset. This has resulted in increasing restrictions on access to biomedical data, which in turn is detrimental to collaborative research and impedes scientific progress. Hence, there has been explosive growth in efforts to harness the power of AI for learning from sensitive data while protecting patients' privacy. This paper provides a structured overview of recent advances in privacy-preserving AI techniques in biomedicine. It places the most important state-of-the-art approaches within a unified taxonomy and discusses their strengths, limitations, and open problems.
Introduction
AI strives to emulate human intelligence and to develop intelligent algorithms that undertake complicated tasks. For many complex tasks, AI already surpasses humans in terms of accuracy, speed and cost. Recently, the rapid adoption of AI and its subfields, specifically machine learning and deep learning, has led to substantial progress in applications such as autonomous driving [1], text translation [2] and voice assistance [3]. At the same time, AI is becoming essential in biomedicine, where it has increasingly captured the attention of researchers. In particular, the rise of big data in healthcare makes it necessary to develop techniques that help scientists to gain understanding from it [4].

Success stories such as acquiring the compressed representation of drug-like molecules [5], modeling the hierarchical structure and function of a cell [6] and translating magnetic resonance images to computed tomography [7] using deep learning models illustrate the remarkable performance of these AI approaches. AI has not only achieved remarkable success in analyzing biomedical data [8–18], but has also surpassed humans in applications such as sepsis prediction [19], malignancy detection on mammography [20] and mitosis detection in breast cancer [21].

Despite these AI-fueled advancements, important privacy concerns have been raised regarding the individuals who contribute to the training datasets. While taking care of the confidentiality and privacy of sensitive biological data is crucial, several studies showed that AI techniques often do not maintain data privacy [22–24]. In general, attacks known as membership inference can be used to infer an individual's membership by querying over the dataset [25] or the trained model [22], or by having access to certain statistics about the dataset [26–28]. Homer et al. [26] showed that, under some assumptions, adversaries can use the genomic statistics published as the results of genome-wide association studies (GWAS) to find out whether an individual was a part of the study. Another example of this kind of attack was demonstrated by attacks on Genomics Beacons [25, 29], in which an adversary (an attacker who attempts to invade data privacy) could identify the presence of an individual in the dataset by simply querying the presence of a particular allele. Moreover, the attacker could identify the relatives of those individuals and obtain sensitive disease information [28]. Besides targeting the training dataset, an adversary may attack a fully-trained AI model to extract individual-level membership by training an adversarial inference model that learns the behaviour of the target model [22].

As a result of the aforementioned studies, health research centers such as the National Institutes of Health (NIH) as well as hospitals have restricted access to pseudonymized data [30–32]. Furthermore, data privacy laws such as those enforced by the Health Insurance Portability and Accountability Act (HIPAA) and the Family Educational Rights and Privacy Act (FERPA) in the US, as well as the EU General Data Protection Regulation (GDPR), restrict the use of sensitive data [33, 34].
Consequently, everyone who needs access to these datasets has to go through a difficult approval process, which significantly impedes collaborative research. Therefore, both industry and academia urgently need to apply privacy-preserving techniques to respect individual privacy and comply with these laws.

This paper provides a systematic overview of various recently proposed privacy-preserving AI techniques which facilitate the collaboration between health research institutes while ensuring data privacy. Several efforts exist to tackle the privacy concerns in the biomedical domain, some of which have been examined in a couple of surveys [35–37]. Aziz et al. [35] investigated previous studies which employed differential privacy and cryptographic techniques for human genomic data. Kaissis et al. [37] briefly reviewed federated learning, differential privacy and cryptographic techniques applied in medical imaging. Xu et al. [36] surveyed the general solutions to challenges in federated learning, including communication efficiency, optimization, and privacy, and discussed possible applications of federated learning, including a few examples in healthcare. Our review differs from previous works in several aspects. Compared to [35] and [37], this paper covers a broader set of privacy-preserving techniques, including federated learning and hybrid approaches, but also a wider range of problems such as privacy-preserving medical image segmentation and electronic health record classification. In contrast to [36], which only surveyed federated learning and hybrid approaches, this paper discusses cryptographic techniques and differential privacy approaches and their applications in healthcare too. Moreover, it covers a wider range of studies which employed four different privacy-preserving techniques for healthcare applications, and compares the approaches using different criteria such as privacy, accuracy and efficiency.

The approaches presented in this review are divided into four categories, namely cryptographic techniques, differential privacy, federated learning, and hybrid approaches. First, we describe how cryptographic techniques — in particular, homomorphic encryption (HE) and secure multiparty computation (SMPC) — ensure secrecy of sensitive data by carrying out computations on encrypted biological data. Next, we illustrate the differential privacy approach and its capability in quantifying individuals' privacy in published summary statistics of, for instance, GWAS data and deep learning models trained on clinical data. Then, we elaborate on federated learning, which allows health institutes to train AI models locally and to share only selected parameters, without sensitive data, with a coordinator, who aggregates them and builds a global model. Following that, we discuss the hybrid approaches, which enhance data privacy by combining multiple privacy-preserving techniques. We elaborate on the strengths and drawbacks of each approach as well as its applications in biomedicine and healthcare. Next, we provide a comparison among the approaches from different perspectives such as computational and communication efficiency, accuracy, and privacy. Afterwards, we discuss the most realistic approaches from a practical viewpoint and provide a list of open problems and challenges to the adoption of these techniques in real-world healthcare applications.

Cryptographic Techniques
In the healthcare domain, and GWAS in particular, cryptographic techniques have been used to collaboratively compute result statistics while preserving data privacy [38–48]. These cryptographic approaches are based on HE [49] or SMPC [50]. HE enables the computation of addition and multiplication over encrypted data.

Figure 1: Homomorphic encryption: The participants encrypt their private data and share it with a computing party, which computes the aggregated (and encrypted) results over the encrypted data from the participants.
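To make the homomorphic property concrete, below is a toy Python sketch of the Paillier cryptosystem, a classic additively homomorphic scheme; it is chosen purely for illustration and is not necessarily the scheme used by the works cited in this section. The hard-coded 14-bit primes are far too small for real use; production systems rely on vetted libraries with keys of 2048 bits or more (Python 3.9+ is assumed for math.lcm and pow(x, -1, n)).

```python
import math
import secrets

# Toy Paillier keypair (illustrative tiny primes; real keys are >= 2048 bits).
p, q = 10_007, 10_009
n, g = p * q, p * q + 1
n2 = n * n
lam = math.lcm(p - 1, q - 1)                  # private key
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)   # precomputed decryption helper

def encrypt(m: int) -> int:
    r = secrets.randbelow(n - 2) + 1          # fresh randomness per ciphertext
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# Additive homomorphism: multiplying ciphertexts adds the plaintexts, so a
# computing party can aggregate encrypted counts without ever decrypting them.
c_sum = (encrypt(120) * encrypt(5)) % n2
assert decrypt(c_sum) == 125
```

Note that Paillier supports only addition of plaintexts (and multiplication by known constants); fully homomorphic schemes, which also support multiplication of encrypted values, are considerably more expensive.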
HE-based approaches share three steps (Figure 1):

1. Participants (e.g. hospitals or medical centers) encrypt their private data and send the encrypted data to a computing party.
2. The computing party calculates the statistics over the encrypted data and shares the statistics (which are encrypted) with the participants.
3. The participants access the results by decrypting them.

In SMPC, there are multiple participants as well as a couple of computing parties which perform computations on secret shares from the participants. Given M participants and N computing parties, SMPC-based approaches follow three steps (Figure 2):

1. Each participant sends a separate and different secret to each of the N computing parties.
2. Each computing party computes the intermediate results on the M secret shares from the participants and shares the intermediate results with the other N − 1 computing parties.
3. The computing parties aggregate all intermediate results to obtain the final results, which can be shared with the participants.

Figure 2: Secure multi-party computation: Each participant shares a separate, different secret with each computing party. The computing parties calculate the intermediate results, secretly share them with each other, and aggregate all intermediate results to obtain the final results.

To clarify the concepts of secret sharing [51] and multiparty computation, consider a scenario [52] with two participants P1 and P2 and two computing parties C1 and C2. P1 and P2 possess the private data X and Y, respectively. The aim is to compute X + Y, where neither P1 nor P2 reveals its data to the computing parties. To this end, P1 and P2 generate random numbers RX and RY, respectively; P1 reveals RX to C1 and (X − RX) to C2; likewise, P2 shares RY with C1 and (Y − RY) with C2; RX, RY, (X − RX) and (Y − RY) are secret shares. C1 computes (RX + RY) and sends it to C2, and C2 calculates (X − RX) + (Y − RY) and reveals it to C1. Both C1 and C2 sum the result they computed and the result each obtained from the other computing party. The sum is in fact (X + Y), which can be shared with P1 and P2 (see the sketch below).

It is worth mentioning that, to preserve data privacy, the computing parties C1 and C2 must be non-colluding. That is, C1 must not send RX and RY to C2, and C2 must not share (X − RX) and (Y − RY) with C1. Otherwise, the computing parties can compute X and Y, revealing the participants' data. In general, in SMPC with N computing parties, data privacy is protected as long as at most N − 1 computing parties collude with each other. The larger the value of N, the stronger the privacy, but the higher the communication overhead and processing time.
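The following minimal Python sketch mirrors this two-party example with additive secret sharing over a finite field. The function names are illustrative; real SMPC frameworks add authenticated channels and support many parties and richer operations.

```python
import secrets

MODULUS = 2**61 - 1  # arithmetic over a finite field hides the magnitude of values

def share(value):
    """Split a private value into two additive secret shares."""
    r = secrets.randbelow(MODULUS)           # random mask, e.g. R_X
    return r, (value - r) % MODULUS          # shares for C1 and C2

# Participants P1 and P2 hold private inputs X and Y.
X, Y = 42, 58
rx, x_minus_rx = share(X)   # P1 sends rx to C1 and (X - R_X) to C2
ry, y_minus_ry = share(Y)   # P2 sends ry to C1 and (Y - R_Y) to C2

# Each computing party sums the shares it received; neither learns X or Y.
c1_partial = (rx + ry) % MODULUS                    # computed by C1
c2_partial = (x_minus_rx + y_minus_ry) % MODULUS    # computed by C2

# Exchanging the partial results reveals only X + Y, not the individual inputs.
assert (c1_partial + c2_partial) % MODULUS == (X + Y) % MODULUS
print("X + Y =", (c1_partial + c2_partial) % MODULUS)
```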
Several studies use HE to develop secure, privacy-aware algorithms for healthcare data. Kim et al. [41] and Lu et al. [43] implemented a secure χ² test for GWAS data using HE. Lauter et al. [42] developed privacy-preserving versions of common statistical tests in GWAS, such as the Pearson goodness-of-fit test, tests for linkage disequilibrium, and the Cochran-Armitage trend test. Kim et al. [53] and Morshed et al. [54] presented a secure logistic regression for GWAS and a linear regression algorithm for healthcare data, respectively, based on HE.

Other studies mainly capitalized on SMPC to implement different privacy-preserving algorithms applicable to healthcare data. Zhang et al. [47], Constable et al. [46], and Kamm et al. [45] developed a secure χ² test based on SMPC for GWAS data. Shi et al. [55] developed a secure logistic regression algorithm using SMPC. Bloom [56] implemented a secure linear regression test based on SMPC for GWAS data. Cho et al. [38] introduced an SMPC-based framework to facilitate quality control and population stratification correction for large-scale GWAS and showed that their framework is scalable to one million individuals and half a million single nucleotide polymorphisms (SNPs).

Despite the promises of privacy-preserving algorithms leveraging cryptographic techniques (Table 1), the road to the wide adoption of these algorithms in the biomedicine and healthcare community is long [57]. The major limitations of HE are its few supported operations and its computational overhead [58]. HE supports only addition and multiplication operations, and as a result, developing complex AI models with non-linear operations, such as deep neural networks (DNNs), using HE is very challenging. Moreover, HE incurs remarkable computational overhead since it performs operations on encrypted data. The main constraints of SMPC are computational overhead and network bottlenecks [59]. Similar to HE, SMPC suffers from high overhead, which comes from operating on secret shares from a large number of participants or a large amount of data. Additionally, SMPC consumes high network bandwidth because participants need to send a large number of secret shares to the computing parties, which in turn send the intermediate results to the other parties. Unlike HE, SMPC is flexible in terms of operations. On the other hand, HE is more communication-efficient compared to SMPC. Neither HE- nor SMPC-based algorithms are scalable, due to their computational overhead, which hinders their adoption for large-scale biomedical and healthcare data [57].
Table 1: Literature for cryptographic techniques in biomedicine. HE: homomorphic encryption, SMPC: secure multiparty computation

Authors | Year | Privacy Technique | Model | Application
Kim et al. [41] | 2015 | HE | χ² statistics, minor allele frequency, Hamming distance, edit distance | genetic associations, DNA comparison
Lu et al. [43] | 2015 | HE | χ² statistics, D′ measure | genetic associations
Lauter et al. [42] | 2014 | HE | D′ and r² measures, Pearson goodness-of-fit, expectation maximization, Cochran-Armitage | genetic associations
Kim et al. [53] | 2018 | HE | logistic regression | medical decision making
Morshed et al. [54] | 2018 | HE | linear regression | medical decision making
Kamm et al. [45] | 2013 | SMPC | χ² statistics | genetic associations
Constable et al. [46] | 2015 | SMPC | χ² statistics, minor allele frequency | genetic associations
Zhang et al. [47] | 2015 | SMPC | χ² statistics, minor allele frequency | genetic associations
Shi et al. [55] | 2016 | SMPC | logistic regression | genetic associations
Bloom [56] | 2019 | SMPC | linear regression | genetic associations
Cho et al. [38] | 2018 | SMPC | quality control, population stratification | genetic associations

Differential Privacy
One of the state-of-the-art concepts for eliminating and quantifying the chance of information leakage that has gained considerable attention in recent years is differential privacy [60–66]. Differential privacy [67–69] is a mathematical model that encapsulates the idea of injecting enough randomness or noise into sensitive data so that even a strong adversary with arbitrary auxiliary information about the data will still be uncertain in identifying any of the individuals in the dataset. Its primary goal is to camouflage the contribution of every single individual by inserting uncertainty into the learning process. It has become a standard in data protection and has been effectively deployed by Google [70] and Apple [71] as well as agencies such as the United States Census Bureau. Furthermore, it has drawn the attention of researchers in privacy-sensitive fields such as biomedicine and healthcare [66, 72–86].

Differential privacy ensures that the model we train does not overfit the sensitive data of a particular user. In particular, the model trained on a dataset containing information of a specific individual should be statistically indistinguishable from a model trained without the individual (Figure 3). As an example, assume that a patient would like to give consent to his/her doctor to include his/her personal health record in a medical dataset to study the correlation between age and cardiovascular disease. Differential privacy provides a mathematical guarantee which captures the privacy risk associated with the patient's participation in the study and explains to what extent the analyst or the potential adversary can learn about a particular individual in the dataset.

More formally, a randomized algorithm (an algorithm that has randomness in its logic and whose output can vary even on a fixed input) A: Dⁿ → Y is (ε, δ)-differentially private if, for all subsets y ⊆ Y and for all adjacent datasets D, D′ ∈ Dⁿ that differ in at most one record, the following inequality holds:

Pr[A(D) ∈ y] ≤ e^ε · Pr[A(D′) ∈ y] + δ

Here, ε and δ are privacy loss parameters, where lower values imply stronger privacy guarantees. δ is an exceedingly small value indicating the probability of an uncontrolled breach, where the algorithm produces a specific output only in the presence of a specific individual and not otherwise. ε represents the worst-case privacy breach in the absence of any such rare breach.
If you assume δ = 0, you have a pure ε-differentially private algorithm, while if you consider δ > 0, you have an approximate (ε, δ)-differentially private algorithm.

Two important properties of differential privacy are composability [87] and resilience to post-processing. Composability means that combining multiple differentially private algorithms yields another differentially private algorithm. More precisely, if you combine k (ε, δ)-differentially private algorithms, the composed algorithm is at least (kε, kδ)-differentially private. Differential privacy also assures resistance to post-processing: the post-processing theorem states that passing the output of an (ε, δ)-differentially private algorithm to any arbitrary randomized algorithm will still uphold the (ε, δ)-differential privacy guarantee.

Figure 3: Differential privacy: The model trained on a dataset including a specific individual and the one trained on the same dataset excluding that individual look statistically indistinguishable to the adversary.

The community efforts to ensure the privacy of sensitive biomedical data using differential privacy can be grouped into four categories according to the problem they address (Table 2):

1. Approaches to query genomics databases [66, 85, 86].
2. Statistical and AI modeling techniques in biomedicine [78–83].
3. Data release, i.e., releasing summary statistics such as p-values and χ² contingency tables [73–75, 84].
4. Training privacy-preserving generative models [63, 88, 89].

Studies in the first category proposed solutions to reduce the privacy risks of genomics databases such as GWAS databases and the genomics beacon service [90]. The Beacon Network [29] is an online web service developed by the Global Alliance for Genomics and Health (GA4GH) through which users can query the data provided by owners or research institutes, ask about the presence of a genetic variant in the database, and get a YES/NO response. Studies have shown that an attacker can detect membership in the Beacon or GWAS by querying these databases multiple times and asking different questions [25, 91, 92]. In a recent work, Aziz et al. [86] proposed two lightweight algorithms to make the Beacon's response inaccurate by controlling a bias variable. These algorithms decide when to answer a query correctly or incorrectly according to specific conditions in the bias variable, so that it gets harder for the attacker to succeed. In another work, Johnson et al. [66] developed a differentially private query answering framework. With this framework, analysts can explore GWAS data without any prior knowledge of the number and location of SNPs in the DNA sequence. The analysts can retrieve statistical properties such as the correlation between SNPs and get an almost accurate answer while the GWAS dataset is protected against privacy risks.

Some of the efforts in the second category addressed the privacy concerns in GWAS data analysis by introducing differentially private logistic regression to identify associations between SNPs and diseases [81] or associations among multiple SNPs [79]. Honkela et al. [80] improved drug sensitivity prediction by effectively employing differential privacy for Bayesian linear regression. Moreover, Simmons et al. [83] presented a differentially private EIGENSTRAT (PrivSTRAT) [93] and linear mixed model (PrivLMM) [94] while correcting for population stratification. In another work, Simmons et al. [82] tackled the problem of finding significant SNPs by modeling it as an optimization problem.
Solving this problem provides a differentially private estimate of the neighbor distance for all SNPs, such that high-scoring SNPs can be found.

The third category focused on releasing summary statistics such as p-values, χ² contingency tables, and minor allele frequencies in a differentially private fashion. The common approach in these works is to add Laplacian noise to the true value of the statistics, so that sharing the perturbed statistics preserves the privacy of the individuals. They vary in the sensitivity of the algorithms (that is, the maximum change in the output of an algorithm in the presence or absence of a specific data point) and hence require different amounts of injected noise [73, 74, 84].
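To make this noise-injection recipe concrete, below is a minimal sketch of the Laplace mechanism for a counting query, with hypothetical numbers; a count has sensitivity 1 because adding or removing one individual changes it by at most one. The function name is illustrative, not from the cited works.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release a statistic with (epsilon, 0)-differential privacy
    by adding Laplace(sensitivity / epsilon) noise."""
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Hypothetical example: 132 case individuals carry the minor allele.
true_count = 132
# A counting query has sensitivity 1: one individual changes it by at most 1.
private_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
print(f"released count: {private_count:.1f}")
```

Smaller ε forces a larger noise scale, directly trading accuracy of the released statistic for stronger privacy.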
Figure 4: Differentially private deep generative models: The sensitive data holders (e.g. health institutes) train a differentially private generative model locally and share just the trained data generator with the outside world (e.g. researchers). The shared data generator can then be used to produce artificial data with the same characteristics as the sensitive data.

The fourth category proposed novel privacy-protecting methods to generate synthetic healthcare data leveraging differentially private generative models (Figure 4). Deep generative models, such as generative adversarial networks (GANs), can be trained on sensitive biomedical data to capture its properties and generate artificial data with similar characteristics as the original data.

Abay et al. [88] presented a differentially private deep generative model, DP-SYN, a generative autoencoder that splits the input data into multiple partitions, then learns and simulates the representation of each partition while maintaining the privacy of the input data.
They assessed the performance of DP-SYN on sensitive datasets of breast cancer and diabetes. Beaulieu et al. [63] trained an auxiliary classifier GAN (AC-GAN) in a differentially private manner to simulate the participants of the SPRINT trial (Systolic Blood Pressure Intervention Trial), so that the clinical data can be shared while respecting participants' privacy. In another approach, Jordon et al. [89] introduced a differentially private GAN, PATE-GAN, and evaluated the quality of the synthetic data on the Meta-Analysis Global Group in Chronic Heart Failure (MAGGIC) and the United Network for Organ Transplantation (UNOS) datasets.

Despite the aforementioned achievements in adopting differential privacy in the field, several challenges remain to be addressed. Although differential privacy involves less network communication, memory usage and time complexity compared to cryptographic techniques, it still struggles with giving highly accurate results within a reasonable privacy budget, namely, the intended ε and δ, on large-scale datasets such as genomics datasets [35, 95]. In more detail, since genomics datasets are huge, the sensitivity of the algorithms applied to these datasets is large. Hence, the amount of distortion required for anonymization increases significantly, sometimes to the extent that the results are not meaningful anymore [96]. Therefore, to make differential privacy more practical in the field, balancing the trade-off between privacy and utility demands more attention than it has received [76–78, 84].

Table 2: Literature for differentially private (DP) techniques in biomedicine

Authors | Year | Model | Application
Aziz et al. [86] | 2017 | eliminating random positions, biased random response | querying genomics databases
Johnson et al. [66] | 2013 | distance-score mechanism, p-value and χ² statistics | querying genomics databases
Han et al. [81] | 2019 | logistic regression | genetic associations
Yu et al. [79] | 2014 | logistic regression | genetic associations
Honkela et al. [80] | 2018 | Bayesian linear regression | drug sensitivity prediction
Simmons et al. [83] | 2016 | EIGENSTRAT, linear mixed model | genetic associations
Simmons et al. [82] | 2016 | nearest neighbor optimization | genetic associations
Fienberg et al. [73] | 2011 | statistics such as p-value, χ² and contingency tables | genetic associations
Uhler et al. [74] | 2013 | statistics such as p-value, χ² and contingency tables | genetic associations
Yu et al. [75] | 2014 | statistics such as p-value, χ² and contingency tables | genetic associations
Wang et al. [84] | 2014 | statistics such as p-value, χ² and contingency tables | genetic associations
Abay et al. [88] | 2018 | deep autoencoder | generating artificial medical data
Beaulieu et al. [63] | 2019 | GAN | simulating SPRINT trial
Jordon et al. [89] | 2018 | GAN | generating artificial medical data

Federated Learning
Federated learning [97] is a type of distributed learning where multiple clients (e.g. hospitals) collaboratively learn a model under the coordination of a central server while preserving the privacy of their data [98, 99]. Instead of sharing its private data with the server or the other clients, each client extracts knowledge (that is, model parameters) from its data and transfers it to the server for aggregation (Figure 5).
Figure 5: Federated learning: Each participant downloads the global model from the server, computes a local model given its private data and the global model, and finally sends its local model to the server for aggregation and for updating the global model.

Federated learning is an iterative process in which each iteration consists of the following steps [99]:

1. The server chooses a set of clients to participate in the current iteration of the model.
2. The selected clients obtain the current model from the server.
3. Each selected client computes the local parameters using the current model and its private data (e.g., runs the gradient descent algorithm initialized by the current model on its local data to obtain the local gradient updates).
4. The server collects the local parameters from the selected clients and aggregates them to update the current model.

The data of the clients can be considered as a table, where rows represent samples (e.g., individuals) and columns represent features or labels (e.g., age, blood pressure, case vs. control). We refer to the set of samples, features, and labels of the data as sample space, feature space, and label space, respectively. Federated learning can be categorized into three types based on the distribution characteristics of the clients' data:

• Horizontal (sample-based) federated learning [100]: Data from different clients shares a similar feature space but is very different in sample space. As an example, consider two hospitals in two different cities which collected similar information such as age or sex. In this case, the feature spaces are similar; but because the people who participated in the hospitals' data collections are from different cities, their intersection is most probably very small, and the sample spaces are hence very different.

• Vertical (feature-based) federated learning [100]: Clients' data is similar in sample space but very different in feature space. For example, two hospitals with different expertise in the same city might collect different information (different feature space) from almost the same people (similar sample space).

• Hybrid federated learning: Both feature space and sample space are different in the data from the clients. For example, consider a medical center with expertise in brain image analysis located in New York and a research center with expertise in protein research based in Berlin. Their data is completely different (image vs. protein data) and disjoint groups of individuals participated in the data collection of each center.

To illustrate the concept of federated learning, consider a scenario with two hospitals A and B. A and B possess lists X and Y, containing the ages of their cancer patients, respectively. A simple federated mean algorithm to compute the average age of cancer patients in both hospitals without revealing the real values of X and Y works as follows. For the sake of brevity, we assume that both hospitals are selected in the first step and that the current global model parameters in the second step are zero (see federated learning steps).

• Hospital A computes the average age (MX) and the number of its cancer patients (NX). Hospital B does the same, resulting in MY and NY. Here, X and Y are private data while MX, NX, MY and NY are the parameters extracted from the private data.
• The server obtains the values of the local model parameters from the hospitals and computes the global mean as follows (a minimal code sketch is given below):

M_G = (M_X × N_X + M_Y × N_Y) / (N_X + N_Y)
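The following is a minimal sketch of this federated mean computation, with hypothetical ages; client selection and iteration are omitted for brevity, and the function names are illustrative.

```python
# Each client shares only its local mean and sample count, never the raw ages.
def local_update(ages):
    """Client-side step: summarize private data as (mean, count)."""
    return sum(ages) / len(ages), len(ages)

def aggregate(updates):
    """Server-side step: weighted average of the local means."""
    total = sum(count for _, count in updates)
    return sum(mean * count for mean, count in updates) / total

# Hypothetical private data of hospitals A and B (never sent to the server).
ages_a = [54, 61, 58, 70]
ages_b = [49, 66, 72]

global_mean = aggregate([local_update(ages_a), local_update(ages_b)])
print(f"federated mean age: {global_mean:.2f}")  # equals the pooled mean
```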
Two well-known concepts in machine learning are also related to federated learning: transfer learning [101] and multi-task learning [102, 103]. In transfer learning, there are source and destination tasks. The aim is to transfer the knowledge from the source to the destination task. As an example of federated transfer learning, suppose that hospital A has a DNN model trained on its rich dataset of medical images (source task). On the other hand, hospital B wants to train a DNN model on a dataset containing brain images (a special kind of medical images) of its cancer patients (destination task), but the dataset does not have enough samples. Hospital B can take advantage of hospital A's DNN model by incorporating some parts of the source model into its own DNN model (knowledge transfer) instead of training the model from scratch on its dataset [104].

In multi-task learning, there are multiple tasks and the goal is to exchange the knowledge among the tasks to improve the performance (accuracy) of all tasks. As an example of federated multi-task learning, assume hospitals A and B again, where hospital A has the task of training a DNN model on its cancer image dataset and hospital B's task is to train a logistic regression model on a dataset including the age, sex, and genetic variants of its cancer patients. Here, both the DNN and logistic regression models are trained concurrently (and iteratively) and the knowledge (weights) from both models is exchanged in each iteration to improve (tune the weights of) both models.

A crucial consideration in both transfer and multi-task learning is task relatedness. Employing unrelated tasks can lead to transferring negative knowledge and deteriorating the performance of the model(s). To learn more about transfer/multi-task learning, interested readers are referred to [101–103, 116]. Moreover, federated transfer/multi-task learning can be a horizontal or hybrid federated learning approach. In the example provided for federated transfer learning, if the shapes of the images in the source and destination tasks (feature space) are the same, it is considered a horizontal approach. Otherwise, it is a hybrid federated learning approach, similar to the example given for federated multi-task learning.

The emerging demand for federated learning gave rise to a wealth of both simulation [117, 118] and production-oriented [119, 120] open source frameworks. Additionally, there are AI platforms whose goal is to apply federated learning in real-world healthcare settings [121, 122]. In the following, we survey works on federated AI techniques in biomedicine and healthcare (Table 3). The recent studies in this regard mainly focused on horizontal federated learning, and there are only a few vertical federated learning and federated transfer/multi-task learning algorithms applicable to healthcare and biomedical data.

Table 3: Summary of federated learning (FL) approaches in healthcare and biomedicine

Authors | Year | Model | Application
Sheller et al. [105] | 2018 | DNN | medical image segmentation
Chang et al. [106] | 2018 | single weight transfer, cyclical weight transfer | medical image classification
Balachandar et al. [107] | 2020 | single weight transfer, cyclical weight transfer | medical image classification
Nasirigerdeh et al. [108] | 2020 | linear regression, chi-square, logistic regression | GWAS
Wu et al. [109] | 2012 | logistic regression | GWAS
Wang et al. [110] | 2013 | logistic regression | GWAS
Li et al. [111] | 2016 | logistic regression | GWAS
Brisimi et al. [112] | 2018 | support vector machine | classifying electronic health records
Huang et al. [113] | 2018 | adaptive boosting ensemble | classifying medical data
Liu et al. [114] | 2018 | autonomous deep learning | classifying medical data
Chen et al. [115] | 2019 | transfer learning | training wearable healthcare devices

A number of the studies provided solutions for the lack of sufficient data due to the privacy challenges in the medical imaging domain [105–107, 123–125]. For instance, Sheller et al. developed a supervised DNN in a federated way for semantic segmentation of brain gliomas from magnetic resonance imaging scans [105]. Chang et al. [106] simulated a distributed DNN in which multiple participants collaboratively update the model weights using training heuristics such as single weight transfer and cyclical weight transfer (CWT). They evaluated this distributed model using image classification tasks on medical image datasets such as mammography and retinal fundus image collections, which were evenly distributed among the participants. Balachandar et al. [107] optimized CWT for cases where the datasets are unevenly distributed across participants. They assessed their optimization methods on simulated diabetic retinopathy detection and chest radiograph classification.

Federated linear/logistic regression and the chi-square test have been developed for sensitive biological data that is vertically or horizontally distributed [108–111]. The grid binary logistic regression (GLORE) [109] and the expectation propagation logistic regression (EXPLORER) [110] are horizontal federated learning approaches designed for clinical data. Unlike GLORE, EXPLORER supports asynchronous communication and online learning functionality, so that the system can continue collaborating in case a participant is absent or if communication is interrupted. Li et al. presented VERTIGO [111], a vertical grid logistic regression algorithm designed for vertically distributed biological datasets such as breast cancer genome and myocardial infarction data. Nasirigerdeh et al. [108] developed a horizontally federated tool set for GWAS, called sPLINK, which supports the chi-square test, linear regression, and logistic regression.
Notably, federated results from sPLINK on distributed datasets are the same as those from aggregated analysis conducted with PLINK [126]. Moreover, they showed that sPLINK is robust against heterogeneous (imbalanced) data distributions across clients and does not lose its accuracy in such scenarios.

Moreover, there are studies in the literature that combine federated learning with other traditional AI modeling techniques such as ensemble learning, support vector machines (SVMs) and principal component analysis (PCA) [112–115, 127]. Brisimi et al. [112] presented a federated soft-margin support vector machine (sSVM) for distributed electronic health records. Huang et al. [113] introduced LoAdaBoost, a federated adaptive boosting method for learning medical data such as intensive care unit data from distinct hospitals [128], while Liu et al. [114] trained a federated autonomous deep learner to this end. There have also been a couple of attempts at incorporating federated learning into multi-task learning and transfer learning in general [129–131]. However, to the best of our knowledge, FedHealth [115] is the only federated transfer learning framework specifically designed for healthcare applications. It enables users to train personalized models for their wearable healthcare devices by aggregating the data from different organizations without compromising privacy.

One of the major challenges for adopting federated learning in large-scale healthcare applications is the significant network communication overhead, especially for complex AI models such as DNNs, which contain millions of model parameters and require thousands of iterations to converge. A rich body of literature exists to tackle this challenge, known as communication-efficient federated learning. These approaches fall into three categories: gradient quantification [132], gradient sparsification [133], and performing more local updates in the clients than global model updates [134].

The main idea behind gradient quantification is to use fewer bytes for each model parameter (gradient), e.g., 2 bytes instead of 8. In gradient sparsification, instead of sending all parameters, only a fraction, e.g. 10%, of the parameters is exchanged between the server and the clients, saving 90% of the network bandwidth (see the sketch below). In the last category of communication-efficient approaches, the clients update their local parameters multiple times before sending them to the server to reduce the total number of iterations and, as a result, decrease the total network bandwidth usage.
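The following is a minimal sketch of the gradient sparsification idea (top-k selection by magnitude). The function names are illustrative; real systems typically combine this with error feedback on the clients and compression of the transmitted indices.

```python
import numpy as np

def top_k_sparsify(gradient, fraction=0.1):
    """Keep only the largest-magnitude fraction of gradient entries;
    send (indices, values) instead of the dense vector."""
    k = max(1, int(fraction * gradient.size))
    idx = np.argpartition(np.abs(gradient), -k)[-k:]  # indices of top-k entries
    return idx, gradient[idx]

def densify(indices, values, size):
    """Server-side reconstruction of the sparse update."""
    dense = np.zeros(size)
    dense[indices] = values
    return dense

rng = np.random.default_rng(0)
grad = rng.normal(size=1000)              # hypothetical local gradient
idx, vals = top_k_sparsify(grad, 0.1)     # 10% of entries -> ~90% bandwidth saved
update = densify(idx, vals, grad.size)
```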
There is a trade-off between communication efficiency and model convergence (accuracy). Employing communication-efficient approaches reduces the network overhead but might jeopardize the model convergence. Consequently, one should keep in mind that communication-efficient approaches should be leveraged only as long as they keep the accuracy of the model acceptable. Interested readers are referred to the relevant publications [134, 140] for detailed descriptions.

Another challenge in federated learning is the possible accuracy loss from the aggregation process if the data distribution across the clients is not independent and identically distributed (IID). More specifically, federated learning can deal with non-IID data while preserving the model accuracy if the learning model is simple, such as ordinary least squares (OLS) linear regression (sPLINK [108]). However, when it comes to learning complex models such as DNNs, the global model might not converge on non-IID data across the clients. Zhao et al. [141] showed that simple averaging of the model parameters in the server significantly diminishes the accuracy of a convolutional neural network model in highly skewed non-IID settings. To solve this problem, they train a warm-up model on an IID dataset and share the model as well as a portion of the dataset with all clients. Each client uses its local data and the shared dataset to train the local model, and simple averaging is employed in the server to aggregate the model parameters. Developing aggregation strategies which are robust against non-IID scenarios is still an open and interesting problem in federated learning.

Finally, federated learning is based on the assumption that the centralized server is honest and not compromised, which is not necessarily the case in real applications. To relax this assumption, differential privacy or cryptographic techniques can be leveraged in federated learning, which is covered in the next section. For further reading on future directions of federated learning in general, we refer the reader to comprehensive surveys [99, 142, 143].
Hybrid Privacy-preserving Techniques
The hybrid techniques combine federated learning with the other paradigms (cryptographic techniques and differential privacy) to enhance privacy or provide privacy guarantees (Table 4). Federated learning preserves privacy to some extent because it does not require the health institutes to share the patients' data with the central server. However, the model parameters that participants share with the server might be abused to reveal the underlying private data if the coordinator is compromised [144]. To handle this issue, the participants can leverage differential privacy and add noise to the model parameters before sending them to the server (FL+DP) [135, 136, 145, 146], or they can employ HE (FL+HE) or SMPC (FL+SMPC) to securely share the parameters with the server [46, 138].

Table 4: Summary of the hybrid privacy-preserving approaches in healthcare and biomedicine

Authors | Year | Privacy Technique | Model | Application
Li et al. [135] | 2019 | FL+DP | DNN | medical image segmentation
Li et al. [136] | 2020 | FL+DP | domain adaptation | medical image pattern recognition
Choudhury et al. [137] | 2019 | FL+DP | perceptron neural network, support vector machine, logistic regression | classifying electronic health records
Constable et al. [46] | 2015 | FL+SMPC | statistical analysis (e.g. χ² statistics) | genetic associations
Lee et al. [138] | 2018 | FL+HE | context-specific hashing | learning patient similarity
Kim et al. [139] | 2019 | FL+DP+HE | logistic regression | classifying medical data

In the biomedical field, several hybrid approaches have been presented recently. Li et al. [135] presented a federated deep learning framework for magnetic resonance brain image segmentation in which the client side provides differential privacy guarantees on selecting and sharing the local gradient weights with the server for imbalanced data. A recent study [136] extracted neural patterns from brain functional magnetic resonance images by developing a privacy-preserving pipeline that analyzes image data of patients having different psychiatric disorders using federated domain adaptation methods. Choudhury et al. [137] developed a federated differential privacy mechanism for gradient-based classification on electronic health records. There are also some studies that incorporate federated learning with cryptographic techniques. For instance, Constable et al. [46] implemented a privacy-protecting structure for federated statistical analysis, such as χ² statistics on GWAS, while maintaining privacy using SMPC. In a slightly different approach, Lee et al. [138] presented a privacy-preserving platform for learning patient similarity across multiple hospitals using a context-specific hashing approach which employs homomorphic encryption to limit the privacy leakage. Moreover, Kim et al. [139] presented a privacy-preserving federated logistic regression algorithm for horizontally distributed diabetes and intensive care unit datasets. In this approach, the logistic regression ensures privacy by making the aggregated weights differentially private and encrypting the local weights using homomorphic encryption.

Incorporating HE, SMPC, and differential privacy into federated learning brings about enhanced privacy, but it combines the limitations of the approaches, too.
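As a concrete illustration of the FL+DP pattern described above, the sketch below clips each client's model update to bound its sensitivity and adds Gaussian noise before the update is shared. The clipping norm and noise scale are illustrative placeholders; in practice the noise is calibrated to a target (ε, δ) budget, e.g., via the Gaussian mechanism with privacy accounting.

```python
import numpy as np

def dp_local_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Client-side FL+DP step: clip the model update to bound its
    sensitivity, then add Gaussian noise before sharing it."""
    if rng is None:
        rng = np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))  # bound influence
    return clipped + rng.normal(0.0, noise_std, update.shape)

# Hypothetical local updates from three clients; the server only ever
# sees the clipped, noised versions.
rng = np.random.default_rng(42)
updates = [rng.normal(size=100) for _ in range(3)]
noised = [dp_local_update(u, rng=rng) for u in updates]
global_update = np.mean(noised, axis=0)  # plain averaging on the server
```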
Table 5: Comparison among the privacy-preserving techniques, including homomorphic encryption (HE), secure multiparty computation (SMPC), federated learning (FL), differential privacy (DP) and the hybrid approaches (FL+DP, FL+HE and FL+SMPC). The generic ranking (lowest = 1 to highest = 6) is used for comparison purposes, such that a higher score for a criterion represents performing better on that metric.

Criterion | HE | SMPC | DP | FL | FL+DP | FL+HE | FL+SMPC
Accuracy | 2 | 6 | 1 | 5 | 3 | 4 | 5
Computational efficiency | 1 | 2 | 6 | 6 | 5 | 3 | 4
Network communication efficiency | 5 | 4 | 6 | 3 | 3 | 2 | 1
Privacy of exchanged traffic | 4 | 3 | NA | 1 | 2 | 4 | 3
Exchanging low sensitive traffic | ✗ | ✗ | NA | ✓ | ✓ | ✓ | ✓
Privacy guarantee | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗

FL+HE puts much more computational overhead on the server, since it requires the server to perform aggregation on the encrypted model parameters from the clients. The network communication overhead is exacerbated in FL+SMPC, because clients need to securely share the model parameters with multiple computing parties instead of one. FL+DP might result in inaccurate models because of the noise added to the model parameters in the clients.
Comparison
We compare the privacy-preserving techniques (HE, SMPC, differential privacy, federated learning, and the hybrid approaches) using various performance and privacy criteria such as computational/communication efficiency, accuracy, privacy guarantee, exchanging sensitive traffic through the network and privacy of the exchanged traffic (Table 5 and Figure 6). We employ a generic ranking (lowest = 1 to highest = 6) [35] for all comparison criteria except for privacy guarantee and exchanging sensitive traffic through the network, which are binary criteria. This comparison is made under the assumption of applying a complex model (e.g. a DNN with a huge number of model parameters) on large sensitive biomedical datasets distributed across dozens of clients in an IID configuration. Additionally, we assume there are a few computing parties in SMPC (a practical configuration).

Computational efficiency is an indicator of the extra computational overhead an approach incurs to preserve privacy. According to Table 5 and Figure 6, differential privacy and federated learning are the best from this perspective. This is because the noise injection procedure in differential privacy is not computationally expensive, and federated learning follows the paradigm of bringing computation to data, distributing the computational overhead among the clients. HE and SMPC are based on the paradigm of moving data to computation. In HE, encryption of the whole private data in the clients and carrying out computation on encrypted data by the computing party cause a huge amount of overhead. In SMPC, a couple of computing parties process the secret shares from dozens of clients, incurring considerable computational overhead. Among the hybrid approaches, FL+DP has the best computational efficiency, given the low overhead of the two approaches, whereas FL+HE has the highest overhead because the aggregation process on encrypted parameters is computationally expensive.

Network communication efficiency indicates how efficiently an approach utilizes the network bandwidth. The less data traffic is exchanged in the network, the more communication-efficient the approach is. Federated learning is the least efficient approach from the communication aspect, since exchanging a large number of model parameter values between the clients and the server generates a huge amount of network traffic. Notice that the network bandwidth usage of federated learning is independent of the clients' data, because federated learning does not move data to computation, but depends on the model complexity (i.e. the number of model parameters). The next approach in this regard is SMPC, where not only does each participant send a large amount of traffic (almost as big as its data) to each computing party, but each computing party also exchanges intermediate results (which might be large) with the other computing parties through the network. The network overhead of homomorphic encryption comes from sharing the encrypted data of the clients (as big as the data itself) with the computing party, which is small compared to the network traffic generated by federated learning and SMPC. The best approach is differential privacy, with no network overhead. Accordingly, FL+DP and FL+SMPC are the best and worst among the hybrid approaches from the communication efficiency viewpoint, respectively.

Accuracy of the model in a privacy-preserving approach is a crucial factor in whether to adopt the approach. SMPC and federated learning are the most accurate approaches, incurring no or only slight accuracy loss in the final model.
The next is homomorphic encryption, whose accuracy loss is due to approximating the non-linear operations using addition and multiplication (e.g. least squares approximation [53]). The worst approach is differential privacy, where the added noise can considerably affect the model accuracy. Among the hybrid approaches, FL+SMPC is the best and FL+DP is the worst, considering the accuracy of the SMPC and differential privacy approaches.

Figure 6: Comparison radar plots for all (a) and each (b-h) of the privacy-preserving approaches: (a) all, (b) HE, (c) SMPC, (d) DP, (e) FL, (f) FL+DP, (g) FL+HE, (h) FL+SMPC. Axes: accuracy, computational efficiency, network communication efficiency, privacy of exchanged traffic, exchanging low sensitive traffic, and privacy guarantee.

The rest of the comparison measures are privacy-related.
The traffic transferred from the clients (participants) to the server (computing parties) is highly sensitive if it carries the private data of the clients. The less sensitive the exchanged traffic is, the more robust the approach is from the privacy perspective. HE and SMPC send the encrypted and anonymized forms of the clients' private data to the server, respectively. Federated learning and the hybrid approaches share only the model parameters with the server. In HE, if the server has the key to decrypt the traffic from the clients, the whole private data of the clients will be revealed. The same holds if the computing parties in SMPC collude with each other. This might or might not be the case for the other approaches (e.g. federated learning), depending on the exchanged model parameters and whether they can be abused to infer the underlying private data.

Privacy of the exchanged traffic indicates how much the traffic is kept private from the server. In HE/SMPC, the data is encrypted/anonymized first and then shared with the server, which is reasonable since it is the clients' private data. In federated learning, the traffic (model parameters) is directly shared with the server, assuming that it does not reveal any details regarding individual samples in the data. The aim of the hybrid approaches is to hide the real values of the model parameters from the server to minimize the possibility of inference attacks using the model parameters. FL+HE is the best among the hybrid approaches from this viewpoint.

Privacy guarantee is a metric which quantifies the degree to which the privacy of the clients' data can be preserved. Differential privacy and the corresponding hybrid approach (FL+DP) are the only approaches providing a privacy guarantee, whereas all the other approaches can only protect privacy under certain assumptions. HE assumes that the server does not have the decryption key; the underlying assumption in SMPC is that the computing parties do not collude with each other; federated learning supposes that the model parameters do not give any detail about any sample in the clients' data.
Discussion and open problems
From a practical point of view, homomorphic encryption and SMPC, which follow the paradigm of "moving data to computation", do not scale as the number of clients or the data size in the clients becomes large. This is because they put the computational burden on a single or a few computing parties. Federated learning, on the other hand, distributes the computation across the clients (aggregation on the server is not computationally heavy), but the communication overhead between the server and the clients is the major challenge to the scalability of federated learning. The hybrid approaches inherit this issue, and it is exacerbated in FL+SMPC. Combining homomorphic encryption with federated learning (FL+HE) adds another obstacle (computational overhead) to the scalability of federated learning. There is a growing body of literature on communication-efficient approaches to federated learning, which we already discussed. These approaches can dramatically improve the scalability of federated learning and make it suitable for large-scale applications, including those in biomedicine.

Given that federated learning is the most realistic approach from a scalability viewpoint, it can be used as a standalone approach as long as inferring the clients' data from the model parameters is practically impossible. Otherwise, it should be combined with differential privacy to avoid possible inference attacks and exposure of the clients' private data and to provide a privacy guarantee. The accuracy of the model will be satisfactory in federated learning, but it might deteriorate in FL+DP. A realistic trade-off needs to be considered depending on the application of interest.

Moreover, differential privacy can have many practical applications in biomedicine as a standalone approach. It works very well for low-sensitivity queries such as counting queries (e.g. the number of patients with a specific disease) on biomedical databases and their generalizations (e.g. histograms), since the presence or absence of an individual changes the query's response by at most one. Moreover, it can be employed to release summary statistics such as χ² statistics and p-values in a differentially private manner while keeping the accuracy acceptable. A novel promising research direction is to incorporate differential privacy into deep generative models to generate synthetic biomedical data.

Future studies can investigate how to reach a compromise between scalability, privacy, and accuracy in real-world settings. The communication overhead of federated learning is still an open and interesting problem since, although state-of-the-art approaches considerably reduce the network overhead, they adversely affect the accuracy of the model. Hence, novel approaches are required that preserve the accuracy, which is of great importance in biomedical applications, while making federated learning communication-efficient.

Adopting federated learning in non-IID settings, where biomedical datasets across different hospitals/medical centers are heterogeneous, is another important challenge to address. This is because typical aggregation procedures such as simple averaging do not work well in these settings, yielding inaccurate models. Hence, new aggregation procedures are required to tackle non-IID scenarios. Moreover, current communication-efficient approaches, which were developed for an IID setting, might not be applicable to heterogeneous scenarios.
Consequently, new techniques are needed to reduce the network overhead in these settings while keeping the model accuracy satisfactory.

Combining differential privacy with federated learning to enhance privacy and to provide a privacy guarantee is still a challenging issue in the field. It becomes even more challenging for healthcare applications, where the accuracy of the model is of crucial importance. Moreover, the concept of privacy guarantee in differential privacy has been defined for local settings. In distributed scenarios, a dataset might be employed multiple times to train different models with various privacy budgets. Therefore, a new formulation of privacy guarantee should be proposed for distributed settings.

Conclusion
The advent of AI in biomedicine has brought about indispensable progress in the field and is expected to result in even more impressive advances in the future [147]. For AI techniques to succeed, big biomedical or healthcare data needs to be available and accessible. However, the more AI models are trained on sensitive biological data, the more pressing privacy concerns become, which, in turn, necessitates strategies for shielding the data [148]. Hence, privacy-enhancing techniques are crucial to allow AI to benefit from sensitive biological data.

Cryptographic techniques, differential privacy and federated learning can be considered the prime strategies for protecting personal data privacy. Broadly, these emerging techniques are based on either securing sensitive data, perturbing it, or not moving it off site. In particular, cryptographic techniques securely share the data with a single (HE) or multiple computing parties (SMPC), differential privacy adds noise to sensitive data and quantifies the privacy loss accordingly, while federated learning enables collaborative learning under the orchestration of a centralized server without moving the private data outside local environments.

All of these techniques have their own strengths and limitations. HE and SMPC are more communication-efficient compared to federated learning, but they are computationally expensive since they move data to computation and put the computational burden on a server or a few computing parties. Federated learning, on the other hand, distributes computation across the clients but suffers from high network communication overhead. Differential privacy is an efficient approach from a computational and a communication perspective, but it introduces accuracy loss by adding noise to the data or model parameters. Hybrid approaches are studied to combine the advantages or to overcome the disadvantages and limitations of the individual techniques. We argued that federated learning, as a standalone approach or in combination with differential privacy, is the most realistic approach to be adopted in healthcare applications. We discussed the open problems and challenges in this regard, including the balance of communication efficiency and model accuracy in non-IID settings, and the need for a new notion of privacy guarantee for distributed biomedical datasets.

Incorporating privacy into the analysis of biomedical and healthcare data is still an open challenge, yet preliminary accomplishments are promising to bring practical privacy even closer to real-world healthcare settings. Future research should investigate how to make a trade-off between scalability, privacy, and accuracy in real healthcare settings.
Acknowledgement
This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement nr. 826078. The work of JB, HS and TK was also supported by the H2020 project REPO-TRIAL (nr. 777111). ML, TK, and JB have further been supported by the BMBF projects SYS Care (01ZX1908A) and SYMBOD (01ZX1910D). JB's contribution was also supported by his VILLUM Young Investigator grant (nr. 13154). This paper reflects only the authors' view, and the Commission is not responsible for any use that may be made of the information it contains.
References
1. Schwarting, W., Alonso-Mora, J. & Rus, D. Planning and decision-making for autonomous vehicles. Annual Review of Control, Robotics, and Autonomous Systems 1, 187–210 (2018).
2. Gehring, J., Auli, M., Grangier, D., Yarats, D. & Dauphin, Y. N. Convolutional sequence to sequence learning in Proceedings of the 34th International Conference on Machine Learning (2017), 1243–1252.
3. Xiong, W. et al. The Microsoft 2017 conversational speech recognition system in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018), 5934–5938.
4. Holzinger, A., Kieseberg, P., Weippl, E. & Tjoa, A. M. in Springer Lecture Notes in Computer Science LNCS 11015 (2018).
5. Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science 4, 268–276 (2018).
6. Ma, J. et al. Using deep learning to model the hierarchical structure and function of a cell. Nature Methods 15, 290–298 (2018).
7. Nie, D. et al. Medical image synthesis with deep convolutional adversarial networks. IEEE Transactions on Biomedical Engineering 65, 2720–2730 (2018).
8. Nature Reviews Cancer.
9. JAMA.
10. Nature Biomedical Engineering.
11. Yu, M. K. et al. Visible machine learning for biomedicine. Cell 173, 1562–1565 (2018).
12. Drug Discovery Today.
13. Nature Biotechnology.
14. Briefings in Bioinformatics.
15. Litjens, G. et al. A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88 (2017).
16. Annual Review of Biomedical Engineering.
17. Jiang, F. et al. Artificial intelligence in healthcare: past, present and future. Stroke and Vascular Neurology 2, 230–243 (2017).
18. Nature Reviews Genetics.
19. Nemati, S. et al. An interpretable machine learning model for accurate prediction of sepsis in the ICU. Critical Care Medicine 46, 547–553 (2018).
20. Journal of Digital Imaging.
21. Veta, M. et al. Assessment of algorithms for mitosis detection in breast cancer histopathology images. Medical Image Analysis 20, 237–248 (2015).
22. Shokri, R., Stronati, M., Song, C. & Shmatikov, V. Membership inference attacks against machine learning models in IEEE Symposium on Security and Privacy (2017), 3–18.
23. Papernot, N., McDaniel, P., Sinha, A. & Wellman, M. P. SoK: Security and privacy in machine learning in IEEE European Symposium on Security and Privacy (2018), 399–414.
24. Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. Understanding deep learning requires rethinking generalization in Proceedings of the International Conference on Learning Representations (ICLR) (2017).
25. Shringarpure, S. S. & Bustamante, C. D. Privacy risks from genomic data-sharing beacons. The American Journal of Human Genetics 97, 631–646 (2015).
26. Homer, N. et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics 4, e1000167 (2008).
27. Harmanci, A. & Gerstein, M. Analysis of sensitive information leakage in functional genomics signal profiles through genomic deletions. Nature Communications 9 (2018).
28. Wang, R., Li, Y. F., Wang, X., Tang, H. & Zhou, X. Learning your identity and disease from research papers: information leaks in genome wide association study in Proceedings of the 16th ACM Conference on Computer and Communications Security (2009), 534–544.
29. Global Alliance for Genomics and Health. A federated ecosystem for sharing genomic, clinical data. Science 352, 1278–1280 (2016).
30. Science.
31. Nature Reviews Genetics.
32. Naveed, M. et al. Privacy in the genomic era. ACM Computing Surveys (CSUR) 48 (2015).
33. General Data Protection Regulation (GDPR) https://gdpr-info.eu/. 2020.
34. Cohen, A. & Nissim, K. Towards formalizing the GDPR's notion of singling out. Proceedings of the National Academy of Sciences (2020).
35. Aziz, M. M. A. et al. Privacy-preserving techniques of genomic data—a survey. Briefings in Bioinformatics 20, 887–895 (2019).
36. arXiv preprint arXiv:1911.06270 (2019).
37. Kaissis, G. A., Makowski, M. R., Rückert, D. & Braren, R. F. Secure, privacy-preserving and federated machine learning in medical imaging. Nature Machine Intelligence (June 2020).
38. Cho, H., Wu, D. J. & Berger, B. Secure genome-wide association analysis using multiparty computation. Nature Biotechnology 36, 547–551 (2018).
39. Bonte, C. et al. Towards practical privacy-preserving genome-wide association study. BMC Bioinformatics 19 (2018).
40. bioRxiv.
41. Kim, M. & Lauter, K. Private genome analysis through homomorphic encryption in BMC Medical Informatics and Decision Making 15 (2015), S3.
42. Lauter, K., López-Alt, A. & Naehrig, M. Private computation on encrypted genomic data in International Conference on Cryptology and Information Security in Latin America (2014), 3–27.
43. Lu, W.-J., Yamada, Y. & Sakuma, J. Privacy-preserving genome-wide association studies on cloud environment using fully homomorphic encryption in BMC Medical Informatics and Decision Making 15 (2015), S1.
44. Zhang, Y., Dai, W., Jiang, X., Xiong, H. & Wang, S. FORESEE: Fully outsourced secure genome study based on homomorphic encryption in BMC Medical Informatics and Decision Making 15 (2015), S5.
45. Kamm, L., Bogdanov, D., Laur, S. & Vilo, J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29, 886–893 (2013).
46. Constable, S. D., Tang, Y., Wang, S., Jiang, X. & Chapin, S. Privacy-preserving GWAS analysis on federated genomic datasets in BMC Medical Informatics and Decision Making 15 (2015), S2.
47. Zhang, Y., Blanton, M. & Almashaqbeh, G. Secure distributed genome analysis for GWAS and sequence comparison computation in BMC Medical Informatics and Decision Making 15 (2015), S4.
48. Mohassel, P. & Zhang, Y. SecureML: A system for scalable privacy-preserving machine learning in IEEE Symposium on Security and Privacy (2017), 19–38.
49. Gentry, C. Fully homomorphic encryption using ideal lattices in Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing (2009), 169–178.
50. Cramer, R., Damgård, I. B. & Nielsen, J. B. Secure multiparty computation (Cambridge University Press, 2015).
51. Shamir, A. How to share a secret. Communications of the ACM 22, 612–613 (1979).
52. bioRxiv.
53. JMIR Medical Informatics, e19 (2018).
54. Morshed, T., Alhadidi, D. & Mohammed, N. Parallel linear regression on encrypted data in (2018), 1–5.
55. Shi, H. et al. Secure Multi-pArty Computation grid LOgistic REgression (SMAC-GLORE). BMC Medical Informatics and Decision Making 16, 89 (2016).
56. Bloom, J. M. Secure multi-party linear regression at plaintext speed. arXiv preprint arXiv:1901.09531 (2019).
57. Berger, B. & Cho, H. Emerging technologies towards enhancing privacy in genomic data sharing. Genome Biology 20 (2019).
58. arXiv preprint arXiv:1810.12380 (2018).
59. Alexandru, A. B. & Pappas, G. J. Secure multi-party computation for cloud-based control in Privacy in Dynamical Systems (Springer, 2020).
60. Su, D., Cao, J., Li, N., Bertino, E. & Jin, H. Differentially private k-means clustering in Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy (2016), 26–37.
61. Abadi, M. et al. Deep learning with differential privacy in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016), 308–318.
62. Phan, N., Wang, Y., Wu, X. & Dou, D. Differential privacy preservation for deep auto-encoders: an application of human behavior prediction in Thirtieth AAAI Conference on Artificial Intelligence (2016).
63. Beaulieu-Jones, B. K. et al. Privacy-preserving generative deep neural networks support clinical data sharing. Circulation: Cardiovascular Quality and Outcomes 12, e005122 (2019).
64. Ren, X. et al. LoPub: High-dimensional crowdsourced data publication with local differential privacy. IEEE Transactions on Information Forensics and Security 13, 2151–2166 (2018).
65. Cormode, G., Kulkarni, T. & Srivastava, D. Marginal release under local differential privacy in Proceedings of the 2018 International Conference on Management of Data (2018), 131–146.
66. Johnson, A. & Shmatikov, V. Privacy-preserving data exploration in genome-wide association studies in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2013), 1079–1087.
67. Dwork, C., McSherry, F., Nissim, K. & Smith, A. Calibrating noise to sensitivity in private data analysis. Journal of Privacy and Confidentiality 7, 17–51 (2016).
68. Dwork, C. et al. Our data, ourselves: Privacy via distributed noise generation in Annual International Conference on the Theory and Applications of Cryptographic Techniques (2006), 486–503.
69. Nissim, K. et al. Differential privacy: A primer for a non-technical audience in Privacy Law Scholars Conference (2017).
70. Erlingsson, Ú., Pihur, V. & Korolova, A. RAPPOR: Randomized aggregatable privacy-preserving ordinal response in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (2014), 1054–1067.
71. Thakurta, A. G. et al. Learning new words. US Patent 9,594,741 (2017).
72. Beaulieu-Jones, B. K., Yuan, W., Finlayson, S. G. & Wu, Z. S. Privacy-preserving distributed deep learning for clinical data. arXiv preprint arXiv:1812.01484 (2018).
73. Fienberg, S. E., Slavkovic, A. & Uhler, C. Privacy preserving GWAS data sharing in (2011), 628–635.
74. Uhlerop, C., Slavković, A. & Fienberg, S. E. Privacy-preserving data sharing for genome-wide association studies. The Journal of Privacy and Confidentiality 5, 137 (2013).
75. Yu, F. & Ji, Z. Scalable privacy-preserving data sharing methodology for genome-wide association studies: an application to iDASH healthcare privacy protection challenge. BMC Medical Informatics and Decision Making 14, S3 (2014).
76. Han, Z., Liu, H. & Wu, Z. A differential privacy preserving framework with Nash equilibrium in genome-wide association studies in (2018), 91–96.
77. Tramèr, F., Huang, Z., Hubaux, J.-P. & Ayday, E. Differential privacy with bounded priors: reconciling utility and privacy in genome-wide association studies in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (2015), 1286–1297.
78. Vu, D. & Slavkovic, A. Differential privacy for clinical trial data: Preliminary evaluations in (2009), 138–143.
79. Yu, F., Rybar, M., Uhler, C. & Fienberg, S. E. Differentially-private logistic regression for detecting multiple-SNP association in GWAS databases in International Conference on Privacy in Statistical Databases (2014), 170–184.
80. Honkela, A., Das, M., Nieminen, A., Dikmen, O. & Kaski, S. Efficient differentially private learning improves drug sensitivity prediction. Biology Direct 13 (2018).
81. A differential privacy preserving approach for logistic regression in genome-wide association studies in (2019), 181–185.
82. Simmons, S. & Berger, B. Realizing privacy preserving genome-wide association studies. Bioinformatics 32, 1293–1300 (2016).
83. Cell Systems.
84. BMC Medical Informatics and Decision Making 14, S2 (2014).
85. Wan, Z., Vorobeychik, Y., Kantarcioglu, M. & Malin, B. Controlling the signal: Practical privacy protection of genomic data sharing through Beacon services. BMC Medical Genomics 10, 39 (2017).
86. Al Aziz, M. M., Ghasemi, R., Waliullah, M. & Mohammed, N. Aftermath of Bustamante attack on genomic beacon service. BMC Medical Genomics 10 (2017).
87. IEEE Transactions on Information Theory.
88. Abay, N. C. et al. Privacy preserving synthetic data release using deep learning in Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2018), 510–526.
89. Jordon, J., Yoon, J. & van der Schaar, M. PATE-GAN: Generating synthetic data with differential privacy guarantees in (2018).
90. Fiume, M. et al. Federated discovery and sharing of genomic data using Beacons. Nature Biotechnology 37, 220–224 (2019).
91. Raisaro, J. L. et al. Protecting privacy and security of genomic data in I2B2 with homomorphic encryption and differential privacy. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2018).
92. Hardt, M., Ligett, K. & McSherry, F. A simple and practical algorithm for differentially private data release in Advances in Neural Information Processing Systems (2012), 2339–2347.
93. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38, 904–909 (2006).
94. Nature Genetics, 100 (2014).
95. Wang, S. et al. Genome privacy: challenges, technical approaches to mitigate risk, and ethical considerations in the United States. Annals of the New York Academy of Sciences 1387, 73 (2017).
96. Kieseberg, P., Hobel, H., Schrittwieser, S., Weippl, E. & Holzinger, A. in Interactive Knowledge Discovery and Data Mining in Biomedical Informatics, Lecture Notes in Computer Science, LNCS 8401 (eds Holzinger, A. & Jurisica, I.) 301–316 (Springer, Berlin Heidelberg, 2014).
97. McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629 (2016).
98. Malle, B., Giuliani, N., Kieseberg, P. & Holzinger, A. in Machine Learning and Knowledge Extraction, IFIP CD-MAKE, Lecture Notes in Computer Science LNCS 10410 (Springer, 2017).
99. Kairouz, P. et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977 (2019).
100. Yang, Q., Liu, Y., Chen, T. & Tong, Y. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10, 1–19 (2019).
101. IEEE Transactions on Knowledge and Data Engineering.
102. Machine Learning.
103. arXiv preprint arXiv:1707.08114 (2017).
104. Holzinger, A., Haibe-Kains, B. & Jurisica, I. Why imaging data alone is not enough: AI-based integration of imaging, omics, and clinical data. European Journal of Nuclear Medicine and Molecular Imaging 46, 2722–2730 (2019).
105. Sheller, M. J. et al. Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation in International MICCAI Brainlesion Workshop (2018), 92–104.
106. Chang, K. et al. Distributed deep learning networks among institutions for medical imaging. Journal of the American Medical Informatics Association 25, 945–954 (2018).
107. Journal of the American Medical Informatics Association (2020).
108. Nasirigerdeh, R. et al. sPLINK: A federated, privacy-preserving tool as a robust alternative to meta-analysis in genome-wide association studies. < > (2020).
109. Wu, Y., Jiang, X., Kim, J. & Ohno-Machado, L. Grid Binary LOgistic REgression (GLORE): building shared models without sharing data. Journal of the American Medical Informatics Association 19, 758–764 (2012).
110. Wang, S. et al. EXpectation Propagation LOgistic REgRession (EXPLORER): distributed privacy-preserving online model learning. Journal of Biomedical Informatics 46, 480–496 (2013).
111. Journal of the American Medical Informatics Association.
112. Brisimi, T. S. et al. Federated learning of predictive models from federated electronic health records. International Journal of Medical Informatics 112, 59–67 (2018).
113. Huang, L. et al. LoAdaBoost: Loss-based AdaBoost federated machine learning on medical data. arXiv preprint arXiv:1811.12629 (2018).
114. Liu, D., Miller, T., Sayeed, R. & Mandl, K. D. FADL: Federated-autonomous deep learning for distributed electronic health record. arXiv preprint arXiv:1811.11400 (2018).
115. Chen, Y., Wang, J., Yu, C., Gao, W. & Qin, X. FedHealth: A federated transfer learning framework for wearable healthcare. arXiv preprint arXiv:1907.09173 (2019).
116. Smith, V., Chiang, C.-K., Sanjabi, M. & Talwalkar, A. S. Federated multi-task learning in Advances in Neural Information Processing Systems (2017), 4424–4434.
117. The TFF Authors. TensorFlow Federated < >.
118. Ryffel, T. et al. A generic framework for privacy preserving deep learning. arXiv preprint arXiv:1811.04017 (2018).
119. The FATE Authors. Federated AI Technology Enabler < >.
120. The PaddleFL Authors. PaddleFL <https://github.com/PaddlePaddle/PaddleFL>.
121. The FeatureCloud Authors. FeatureCloud <https://featurecloud.eu/>.
122. The Clara Training Framework Authors. NVIDIA Clara <https://developer.nvidia.com/clara>.
123. Vepakomma, P., Gupta, O., Swedish, T. & Raskar, R. Split learning for health: Distributed deep learning without sharing raw patient data. arXiv preprint arXiv:1812.00564 (2018).
124. Vepakomma, P., Gupta, O., Dubey, A. & Raskar, R. Reducing leakage in distributed deep learning for sensitive health data. arXiv preprint arXiv:1812.00564 (2019).
125. Poirot, M. G. et al. Split learning for collaborative deep learning in healthcare. arXiv preprint arXiv:1912.12115 (2019).
126. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81, 559–575 (2007).
127. Silva, S. et al. Federated learning in distributed medical databases: Meta-analysis of large-scale subcortical brain data in (2019), 270–274.
128. Pollard, T. J. et al. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific Data 5 (2018).
129. Smith, V., Chiang, C.-K., Sanjabi, M. & Talwalkar, A. S. Federated multi-task learning in Advances in Neural Information Processing Systems (2017), 4424–4434.
130. Corinzia, L. & Buhmann, J. M. Variational federated multi-task learning. arXiv preprint arXiv:1906.06268 (2019).
131. Liu, Y., Chen, T. & Yang, Q. Secure federated transfer learning. arXiv preprint arXiv:1812.03337 (2018).
132. Gupta, S., Agrawal, A., Gopalakrishnan, K. & Narayanan, P. Deep learning with limited numerical precision in International Conference on Machine Learning (2015), 1737–1746.
133. Aji, A. F. & Heafield, K. Sparse communication for distributed gradient descent in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (2017), 440–445.
134. McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629 (2016).
135. Li, W. et al. Privacy-preserving federated brain tumour segmentation in International Workshop on Machine Learning in Medical Imaging (2019), 133–141.
136. Li, X. et al. Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results. arXiv preprint arXiv:2001.05647 (2020).
137. Choudhury, O. et al. Differential privacy-enabled federated learning for sensitive health data. arXiv preprint arXiv:1910.02578 (2019).
138. Lee, J. et al. Privacy-preserving patient similarity learning in a federated environment: development and analysis. JMIR Medical Informatics 6, e20 (2018).
139. Kim, M., Lee, J., Ohno-Machado, L. & Jiang, X. Secure and differentially private logistic regression for horizontally distributed data. IEEE Transactions on Information Forensics and Security (2020).
140. arXiv preprint arXiv:2003.06307 (2020).
141. Zhao, Y. et al. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582 (2018).
142. Li, Q., Wen, Z. & He, B. Federated learning systems: Vision, hype and reality for data privacy and protection. arXiv preprint arXiv:1907.09693 (2019).
143. Rieke, N. et al. The future of digital health with federated learning. arXiv preprint arXiv:2003.08119 (2020).
144. Melis, L., Song, C., De Cristofaro, E. & Shmatikov, V. Exploiting unintended feature leakage in collaborative learning in IEEE Symposium on Security and Privacy (2019), 691–706.
145. Geyer, R. C., Klein, T. & Nabi, M. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557 (2017).
146. Truex, S. et al. A hybrid approach to privacy-preserving federated learning in Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security (2019), 1–11.
147. Sanders, S. F., Terwiesch, M., Gordon, W. J. & Stern, A. D. How artificial intelligence is changing health care delivery. NEJM Catalyst (2019).