ML-Doctor: Holistic Risk Assessment of Inference Attacks Against Machine Learning Models

Yugeng Liu, Rui Wen*, Xinlei He, Ahmed Salem, Zhikun Zhang, Michael Backes, Emiliano De Cristofaro, Mario Fritz, Yang Zhang

CISPA Helmholtz Center for Information Security; UCL & Alan Turing Institute

* The first two authors made equal contributions.
Abstract
Inference attacks against Machine Learning (ML) models allow adversaries to learn information about training data, model parameters, etc. While researchers have studied these attacks thoroughly, they have done so in isolation. We lack a comprehensive picture of the risks caused by the attacks, such as the different scenarios they can be applied to, the common factors that influence their performance, the relationship among them, or the effectiveness of defense techniques. In this paper, we fill this gap by presenting a first-of-its-kind holistic risk assessment of different inference attacks against machine learning models. We concentrate on four attacks – namely, membership inference, model inversion, attribute inference, and model stealing – and establish a threat model taxonomy. Our extensive experimental evaluation, conducted over five model architectures and four datasets, shows that the complexity of the training dataset plays an important role with respect to the attacks' performance, while the effectiveness of model stealing and membership inference attacks is negatively correlated. We also show that defenses like DP-SGD and Knowledge Distillation can only hope to mitigate some of the inference attacks. Our analysis relies on a modular, re-usable software, ML-Doctor, which enables ML model owners to assess the risks of deploying their models, and equally serves as a benchmark tool for researchers and practitioners.
Introduction

Over the last decade, research in Machine Learning (ML), and in particular Deep Learning, has made tremendous progress. Deployment and overall success of this technology are jeopardized by attacks against ML models, which prompt serious security and privacy risks. In particular, inference attacks [12, 18, 28, 34, 38, 43, 44, 46, 50, 51, 56] allow adversaries to infer information from a target ML model, e.g., about the training data, the model's parameters, etc.

In this paper, we focus on four types of attacks: membership inference [46], model inversion [12], attribute inference [28], and model stealing [51]. The first three target a model's training dataset, aiming to, respectively, determine whether or not an exact data sample belongs to it, recover (part of) it, or predict properties that are not related to the model's original task. Model stealing involves reconstructing the target model's (non-public) parameters. Inference attacks can lead to severe consequences, violating individuals' privacy, as ML models are often trained on sensitive data, or compromising the model owner's intellectual property [10].

Overall, existing inference attacks have been studied under different threat models and experimental settings, albeit in isolation. This prompts the need for a holistic understanding of the risks caused by these attacks, such as the scenarios different inference attacks can be applied to, the common factors that influence these attacks' performance, and the relations among the attacks, as well as the overall effectiveness of defense mechanisms.
In this paper, we perform a first-of-its-kind holistic security and privacy risk assessment of ML models, vis-à-vis four representative inference attacks.
Threat Model Taxonomy.
Our work starts with a systematic categorization of the knowledge that an adversary might have in order to launch inference attacks, along two dimensions, i.e., access to: 1) the target model (white-box or black-box), and 2) an auxiliary dataset (partial training dataset, shadow dataset, or no dataset). We consider four types of state-of-the-art inference attacks and describe the threat models under which each of them can be applied. This provides us with a full spectrum of the inference attack surface of ML models.
Experimental Evaluation.
We perform a comprehensive measurement study of the attacks, jointly, over five different ML model architectures (AlexNet [22], ResNet18 [17], VGG19 [47], Xception [8], and SimpleCNN) and four datasets (CelebA [27], Fashion-MNIST (FMNIST) [53], STL10 [9], and UTKFace [57]).
Main Findings.
The complexity of the target model's training dataset (by which we refer to both sample complexity and class diversity; see Section 6.2) plays a major role in the accuracy of membership inference, model inversion, and model stealing. In particular, the former is much more effective on complex datasets, while the other two exhibit the opposite trend. For instance, the accuracy of membership inference (with black-box access to the target model and a shadow dataset) against ResNet18 is 0.88 for a more complex dataset such as STL10, while it is 0.54 for a less complex one like FMNIST. On the other hand, model stealing achieves 0.52 agreement (the standard metric for this attack) on ResNet18 trained on STL10, but a much higher 0.93 on FMNIST. This stems from ML models being more prone to overfitting on complex datasets, which leads to better membership inference. Whereas, when an ML model is trained on a complex dataset, it is harder for an adversary to obtain a dataset with similar complexity (by querying the target model) to train their stolen model. We also find that the performance of membership inference and model stealing is negatively correlated.
Defenses.
We evaluate two defense mechanisms, i.e., DP-SGD [1] and Knowledge Distillation (KD) [20], against all the inference attacks. Empirical results show that DP-SGD can mitigate membership inference attacks in general without damaging the target models' utility significantly. Meanwhile, KD also reduces membership inference risks, but, generally, to a lesser extent compared to DP-SGD. However, neither of them is effective against the other inference attacks. This highlights the lack of a general, effective defense mechanism, and our work sheds light on to what extent, and why, this is the case.
ML-Doctor. To support the comprehensive evaluation of a wide range of inference attacks and defenses (current and future), we introduce a framework called ML-Doctor. It can be used by multiple entities and for multiple purposes. For instance, model owners can use it to meaningfully assess potential security and privacy risks before deploying their model. Also, as we make the source code publicly available, researchers will be able to re-use ML-Doctor to benchmark new inference attacks and defense mechanisms. Indeed, ML-Doctor follows a modular design, which easily supports the integration of additional inference attacks and defenses, as well as plugging in a variety of datasets, models, etc.

The rest of the paper is organized as follows. In Section 2, we present a taxonomy of threat models; then, we introduce the attacks considered in this paper in Section 3. Next, Section 4 describes the design of ML-Doctor. In Section 5, we introduce our experimental setup. Evaluations of attacks and defense mechanisms are presented in Section 6 and Section 7, respectively. We review related work in Section 8 and conclude the paper in Section 9.
Auxiliary Dataset      | Black-Box (M_B)     | White-Box (M_W)
Partial (D^P_aux)      | MemInf, ModSteal    | MemInf, AttrInf
Shadow (D^S_aux)       | MemInf, ModSteal    | MemInf, AttrInf, ModInv
No (D^N_aux)           | -                   | ModInv

Table 1: Different attacks under different threat models.
Threat Models

In this work, we focus on machine learning classification, one of the most popular ML applications. In general, the goal of an ML classifier is to map a data sample to a label/class. The input to an ML model is a data sample, and the output is a vector of probabilities, or posteriors, with each element representing the likelihood of the sample belonging to a class. We categorize the threat models for all inference attacks considered in this paper along two dimensions, i.e., 1) access to the target model and 2) the auxiliary dataset. In total, we consider five different scenarios.
Access to the Target Model.
We consider two access settings: white-box and black-box. The former, denoted by M_W, means that the adversary has full information about the target model, including its parameters and architecture. In black-box attacks, denoted by M_B, the adversary can only access the target model in an API-like manner, e.g., they can query the target model and obtain the model's output. However, most of the existing black-box literature [14, 46, 54] also assumes that the adversary knows the target model's architecture, which they use to build shadow models (see Section 3). Overall, the white-box model captures scenarios where the target model's parameters are leaked, e.g., following a data breach, or obtained through reverse engineering, e.g., from pre-trained models deployed to mobile devices [20]. Whereas, the black-box model encapsulates API access akin to the features provided by Machine Learning as a Service (MLaaS) platforms.

Auxiliary Dataset.
The adversary needs an auxiliary dataset in order to train their attack model. We consider three scenarios, in decreasing order of adversarial "strength": 1) partial training dataset (D^P_aux), 2) shadow dataset (D^S_aux), and 3) no dataset (D^N_aux). In the first scenario, the adversary obtains part of the actual training data of the target model (e.g., because it is public knowledge), while in the last one, they have no information at all. In between is the D^S_aux setting, where the adversary has a "shadow" dataset drawn from the same distribution as the target model's training data (see Section V-C in [46] for a discussion on how to generate such data, e.g., through model-based or statistics-based synthesis, or noisy real data).
Considered Settings.
Overall, the two types of model access and the three types of auxiliary dataset availability lead to six threat models. In the rest of the paper, we consider five of them: ⟨M_B, D^P_aux⟩, ⟨M_B, D^S_aux⟩, ⟨M_W, D^P_aux⟩, ⟨M_W, D^S_aux⟩, and ⟨M_W, D^N_aux⟩. We do not experiment with black-box access and no auxiliary dataset, as this setting is unlikely to yield successful attacks.
Inference Attacks
In this section, we present the four inference attacks measured in this paper. Specifically, we consider: membership inference (MemInf), model inversion (ModInv), attribute inference (AttrInf), and model stealing (ModSteal). The first three are designed to infer information about a target ML model's training data, while the last one aims to steal the target model's parameters. Different attacks can be applied under different threat models, as summarized in Table 1. For each attack and each threat model, we concentrate on one representative state-of-the-art method.
Membership inference (MemInf) [46] against ML models involves an adversary aiming to determine whether or not a target data sample was used to train a target ML model. More formally, given a target data sample x_target, (access to) a target model M, and an auxiliary dataset D_aux, a membership inference attack can be defined as:

MemInf: x_target, M, D_aux → {member, non-member}

where M ∈ {M_B, M_W} and D_aux ∈ {D^P_aux, D^S_aux}.

Membership inference has been extensively studied in the literature [6, 7, 21, 23, 25, 30, 42, 44, 46]. Inferring membership of a target sample prompts severe privacy threats; for instance, if an ML model for drug dose prediction is trained using data from patients with a certain disease, then inclusion in the training set inherently leaks the individuals' health status. Overall, MemInf is also often a signal that a target model is "leaky" and can be a gateway to additional attacks [10]. In the following, we illustrate how to implement membership inference (MemInf) under different threat models.
Black-Box/Shadow ⟨MemInf, M_B, D^S_aux⟩ [44]. We start with the most common and most difficult setting for the attack [44, 46], whereby the adversary has black-box access (M_B) to the target model and a shadow auxiliary dataset (D^S_aux). The adversary first splits the shadow dataset into two parts and uses one to train a shadow model on the same task. Next, the adversary uses the entire shadow dataset to query the shadow model. For each query sample, the shadow model returns its posteriors and the predicted label: if the sample is part of the shadow model's training set, the adversary labels it as a member, and as a non-member otherwise. With this labeled dataset, the adversary trains an attack model, which is a binary membership classifier. Finally, to determine whether a data sample is a member of the target model's training dataset, the sample is fed to the target model, and the posteriors and the predicted label (transformed into a binary indicator of whether the prediction is correct) are fed to the attack model.
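To make this pipeline concrete, below is a minimal, hedged sketch of the black-box/shadow attack (illustrative only, not ML-Doctor's implementation); the shadow model, the member/non-member DataLoaders, and the attack-model architecture are assumptions.

```python
# Illustrative sketch of the black-box/shadow membership inference pipeline.
# Assumed inputs: a trained shadow model and DataLoaders over its training
# (member) and held-out (non-member) halves of the shadow dataset.
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_attack_dataset(shadow_model, member_loader, nonmember_loader):
    """Query the shadow model and label its outputs as member / non-member."""
    shadow_model.eval()
    features, labels = [], []
    for loader, is_member in [(member_loader, 1), (nonmember_loader, 0)]:
        for x, y in loader:
            with torch.no_grad():
                posteriors = F.softmax(shadow_model(x), dim=1)
            # Rank (sort) the posteriors so the attack model is class-order
            # invariant, and add a binary indicator of prediction correctness.
            ranked, _ = torch.sort(posteriors, dim=1, descending=True)
            correct = (posteriors.argmax(dim=1) == y).float().unsqueeze(1)
            features.append(torch.cat([ranked, correct], dim=1))
            labels.append(torch.full((x.size(0),), is_member, dtype=torch.long))
    return torch.cat(features), torch.cat(labels)

def train_attack_model(features, labels, epochs=50, lr=1e-5):
    """Binary membership classifier (a simple MLP stand-in for the attack model)."""
    attack = nn.Sequential(nn.Linear(features.size(1), 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(attack.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(attack(features), labels).backward()
        opt.step()
    return attack  # at inference time, feed it the target model's outputs instead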
Black-Box/Partial ⟨MemInf, M_B, D^P_aux⟩ [44]. If the adversary has black-box access to the target model and a partial training dataset, the attack method is very similar to that for ⟨MemInf, M_B, D^S_aux⟩. However, the adversary does not need to train a shadow model; rather, they use the partial training dataset as the ground truth for membership and directly train their attack model.
White-Box/Shadow ⟨MemInf, M_W, D^S_aux⟩ [31]. Nasr et al. [31] introduce an attack in the white-box setting with either a shadow or a partial training dataset as the auxiliary dataset. In the former, similar to ⟨MemInf, M_B, D^S_aux⟩, the adversary uses D^S_aux to train a shadow model to mimic the behavior of the target model and to generate data to train their attack model. As the adversary has white-box access to the target model, they can also exploit the target sample's gradients with respect to the model parameters, embeddings from different intermediate layers, classification loss, and prediction posteriors (and label).
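As a hedged illustration, the snippet below extracts a subset of these white-box features for a single sample (posteriors, loss, last-layer gradients, one-hot label); treating `model.fc` as the final layer and ignoring intermediate-layer embeddings are simplifying assumptions.

```python
# Sketch of white-box feature extraction for one (sample, label) pair.
import torch
import torch.nn.functional as F

def whitebox_features(model, x, y):
    model.zero_grad()
    logits = model(x.unsqueeze(0))                 # x: single input tensor, y: int label
    posteriors = F.softmax(logits, dim=1)
    loss = F.cross_entropy(logits, torch.tensor([y]))
    loss.backward()                                # white-box access: gradients are available
    last_layer_grad = model.fc.weight.grad.flatten()   # assumes the last layer is `model.fc`
    one_hot = F.one_hot(torch.tensor(y), num_classes=logits.size(1)).float()
    return (posteriors.detach().squeeze(0), loss.detach(),
            last_layer_grad.detach(), one_hot)     # each feeds its own branch of the attack model
```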
White-Box/Partial ⟨MemInf, M_W, D^P_aux⟩ [31]. The attack methodology here is almost identical to its black-box counterpart. The only difference is that the adversary can use the same set of features as the attack model for ⟨MemInf, M_W, D^S_aux⟩.

Model inversion attacks (ModInv) [12] aim to reconstruct data samples from a target ML model, i.e., they allow an adversary to directly learn information about the training dataset. For instance, in a facial recognition system, a ModInv adversary tries to learn the facial data of a victim whose data is used to train the model. Model inversion requires the adversary to have white-box access to the target model; this is due to the fact that the attack needs to perform back-propagation over the target model's parameters (detailed below). Formally, we define model inversion as:

ModInv: M_W, D_aux → {training samples}

where D_aux ∈ {D^N_aux, D^S_aux}.

We consider two types of model inversion attacks: the one proposed by Fredrikson et al. [12], which aims to reconstruct a representative sample for each class of the target model, and that by Zhang et al. [56], which aims to synthesize the training dataset. These two attacks follow different threat models, which we discuss next.
White-Box/No Auxiliary ⟨ModInv, M_W, D^N_aux⟩ [12]. The method by Fredrikson et al. [12] assumes the adversary has white-box access to the target model and does not need any auxiliary dataset. For each class of the target model, the adversary first creates a noise sample, feeds this sample to the model, and gets the posteriors. The adversary then uses back-propagation over the target model's parameters to optimize the input sample so that the corresponding posterior of the class exceeds a pre-set threshold. Once the threshold is reached, the optimized sample is the representative sample of that class, i.e., the attack output.
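A hedged sketch of this optimization loop is shown below (hyperparameters follow Section 5.2: threshold 0.999, learning rate 1e-2, at most 3,000 iterations); minimizing the cross-entropy to the target class is used here as one simple way to push that class's posterior up, not necessarily the paper's exact objective.

```python
# Gradient-based model inversion: optimize an input until the target class's
# posterior exceeds a threshold; the result is the class's representative sample.
import torch
import torch.nn.functional as F

def invert_class(model, target_class, input_shape=(3, 32, 32),
                 threshold=0.999, lr=1e-2, max_iter=3000):
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)                    # only the input is optimized
    x = torch.zeros((1,) + input_shape, requires_grad=True)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(max_iter):
        opt.zero_grad()
        logits = model(x)
        if F.softmax(logits, dim=1)[0, target_class].item() >= threshold:
            break
        # Back-propagate through the frozen target model to update the input.
        loss = F.cross_entropy(logits, torch.tensor([target_class]))
        loss.backward()
        opt.step()
    return x.detach()
```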
White-Box/Shadow ⟨ModInv, M_W, D^S_aux⟩ [56]. The attack by Zhang et al. [56] uses a shadow dataset to enhance the inversion: the adversary first trains a GAN on the shadow dataset and then leverages it, together with white-box access to the target model, to synthesize samples that resemble the training data (see Section 5.2 for the setup we use).

(The attack by Nasr et al. [31] was originally designed for the partial training dataset setting, but it can be adapted to the shadow dataset setting. Fredrikson et al. [12] also introduce a model inversion attack where the adversary only has black-box access to the target model; however, its performance is not as good and therefore we do not consider it.)
An ML model may learn extra information about the training data that is not related to its original task; e.g., a model predicting age from profile photos can also learn to predict race [28, 50]. Attribute inference (AttrInf) aims to exploit such unintended information leakage. State-of-the-art attacks usually rely on the embeddings of a target sample (x_target) obtained from the target model to predict the sample's target attributes. Thus, the adversary is assumed to have white-box access to the target model. Formally, attribute inference is defined as:

AttrInf: x_target, M_W, D_aux → {target attributes}

where D_aux ∈ {D^P_aux, D^S_aux} can either be a partial training dataset or a shadow dataset.

White-Box/Shadow and Partial [28, 50].
Both attacks, i.e., ⟨AttrInf, M_W, D^S_aux⟩ [28] and ⟨AttrInf, M_W, D^P_aux⟩ [50], follow a similar attack methodology. The only difference lies in the dataset used to train the attack model. In both cases, the adversary is assumed to know the target attributes of the auxiliary dataset. Then, they use the embeddings and the target attributes to train a classifier to mount the attack.
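A minimal sketch of this classifier is given below; `get_embedding` (a hook returning the target model's second-to-last-layer embedding), the hidden size, and the tensor formats of the auxiliary data are assumptions, while the optimizer and epoch count follow Section 5.2.

```python
# Attribute inference: train a small MLP on target-model embeddings to predict
# the sensitive attribute of the auxiliary samples.
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_attr_attack(get_embedding, aux_samples, aux_attributes,
                      emb_dim, num_attr_classes, epochs=50, lr=1e-3):
    # aux_attributes: LongTensor of attribute labels for the auxiliary samples.
    attack = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(),
                           nn.Linear(64, num_attr_classes))
    opt = torch.optim.Adam(attack.parameters(), lr=lr)
    with torch.no_grad():
        emb = torch.stack([get_embedding(x) for x in aux_samples])
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(attack(emb), aux_attributes).backward()
        opt.step()
    return attack
```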
The goal of model stealing attacks (ModSteal) [34, 51], aka model extraction, is to extract the parameters of a target model. Ideally, the adversary obtains a model (the "stolen" model) with very similar performance to the target model. More formally:

ModSteal: M_B, D_aux → M_C

where M_C is the stolen model and D_aux ∈ {D^P_aux, D^S_aux}.

Model stealing prompts severe security risks. For instance, as it is often difficult to train an advanced ML model (e.g., due to the lack of data or computing resources), stealing a trained model inherently constitutes intellectual property theft. Also, as many other attacks, such as adversarial examples [39], require white-box access to the target ML model, model stealing can be a stepping stone to performing these attacks.

Black-Box/Partial and Shadow [51].
In this paper, we concentrate on the attacks proposed by Tramèr et al. [51], for ⟨ModSteal, M_B, D^P_aux⟩ and ⟨ModSteal, M_B, D^S_aux⟩. The adversary is assumed to have knowledge of the target model's architecture, and both attacks follow a similar methodology. Specifically, the adversary uses data samples from their auxiliary dataset (D^P_aux or D^S_aux) to query the target model and gets the corresponding posteriors. Then, they use these samples to train the stolen model, with the posteriors as the ground truth.
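The query-and-fit procedure just described can be sketched as follows (illustrative, with hyperparameters taken from Section 5.2: MSE loss, SGD with momentum 0.9 and learning rate 1e-2, 50 epochs); `stolen_model` is assumed to share the target's architecture.

```python
# Model stealing: label the auxiliary data with the target's posteriors and fit
# the stolen model to them.
import torch
import torch.nn.functional as F

def steal_model(target_model, stolen_model, aux_loader, epochs=50, lr=1e-2):
    target_model.eval()
    opt = torch.optim.SGD(stolen_model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, _ in aux_loader:                    # true labels are never needed
            with torch.no_grad():
                posteriors = F.softmax(target_model(x), dim=1)   # black-box query
            opt.zero_grad()
            loss = F.mse_loss(F.softmax(stolen_model(x), dim=1), posteriors)
            loss.backward()
            opt.step()
    return stolen_model
```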
ML-Doctor

In this section, we introduce ML-Doctor, a modular framework geared to evaluate the four inference attacks, as well as the two defenses (see Section 7), considered in this paper. Prior work has proposed software tools to evaluate attacks against ML models, such as adversarial examples [26, 37], backdoor attacks [35], and membership inference [29]. To the best of our knowledge, ML-Doctor is the first framework that jointly considers different types of inference attacks.
Modules.
In Figure 1, we report the four different modules of ML-Doctor:

1. Data Processing. This module processes the datasets to mount the different attacks. It also involves data pre-processing, e.g., re-sizing and normalization, data partitioning, etc.
2. Attack. This module performs the actual inference attacks. At the moment, it supports ten different attacks belonging to four different attack types (see Section 3).
3. Defense. We currently support two representative mitigation techniques for inference attacks against ML models, as discussed later in Section 7.
4. Evaluation. This module is used to evaluate the performance of attacks and defenses.

The modular design of ML-Doctor allows one to easily integrate additional attacks and defense mechanisms, as well as to plug in any dataset or model.

[Figure 1: Overview of ML-Doctor's modules: Data Processing Module, Attack Module (MemInf, ModInv, ModSteal, AttrInf), Defense Module (DP-SGD, Distillation), and Evaluation Module (attack evaluation and defense evaluation).]
Using ML-Doctor. A user needs to submit their target model and its training dataset to use ML-Doctor. Here, the reason for submitting the training dataset is to achieve a full-fledged privacy risk assessment. We envision ML-Doctor to be used for the following purposes:

• As it supports a systematic taxonomy of different threat models for inference attacks, ML-Doctor enables model owners to obtain an overview of the threats their model may face when deployed in the real world.
• ML-Doctor provides a holistic assessment of different attacks, as well as of the effectiveness of possible defenses. To the best of our knowledge, this is the first tool to provide such a comprehensive analysis of inference attacks.
• Researchers can re-use ML-Doctor as a benchmark tool to experiment with new inference attacks and defenses in the future.
NB: ML-Doctor is currently implemented in Python 3.8 and PyTorch 1.7.1. The complete source code of all modules can be obtained anonymously via the PC Chairs and will be made publicly available with the final version of the paper.
Experimental Setup

In this section, we present our experimental setup.
For the sake of this paper, we experiment with four datasets:

• CelebA [27] contains 202,599 face images, each associated with 40 binary attributes. We select and combine 3 attributes out of the 40, namely HeavyMakeup, MouthSlightlyOpen, and Smiling, to form our target models' classes/labels. As each attribute is binary, this leads to an 8-class classification task.
• FMNIST (Fashion-MNIST) [53] is also an image dataset, with 70,000 grayscale images equally distributed among 10 different classes: T-shirt, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot.
• STL10 [9] is a 10-class image dataset with each class containing 1,300 images. Classes include airplane, bird, car, cat, deer, dog, horse, monkey, ship, and truck.
• UTKFace [57] has 23,000 face images labeled with age, gender, and race. We only consider images from the largest four races in the dataset (White, Black, Asian, and Indian), which leaves us with 22,012 images.

Note that all the samples in the datasets are re-sized to 32 × 32 pixels. We randomly split each dataset into four equal parts:
1. Target Training Dataset is used to train all the target models, and to evaluate the performance of all membership inference attacks and both model inversion attacks. For attacks that require a partial training dataset, we sample 70% of the samples from the target training dataset.
2. Target Testing Dataset is used to evaluate the performance of the target model. It is also used to evaluate the performance of all membership inference, attribute inference, and model stealing attacks.
3. Shadow Training Dataset is used to train all the attacks that require a shadow dataset as the auxiliary dataset.
4. Shadow Testing Dataset is used to train the two membership inference attacks that require a shadow dataset as the auxiliary dataset.
Membership Inference.
Recall that there are four different scenarios for MemInf; we establish two types of attack models: one for the black-box and one for the white-box setting.

• Black-box. Our attack model has two inputs: the ranked target sample's posteriors and a binary indicator of whether the target sample is predicted correctly. Each input is first fed into a different 2-layer MLP (Multilayer Perceptron); then the two obtained embeddings are concatenated together and fed into a 4-layer MLP (sketched after this list).
• White-box. We have four inputs for this attack model, like the one used by Nasr et al. [31]: the target sample's posteriors, classification loss, gradients of the parameters of the target model's last layer, and the one-hot encoding of its true label. Each input is fed into a different neural network, and the resulting embeddings are concatenated together as one input to a 4-layer MLP.

We use ReLU as the activation function for the attack model. The mini-batch size is set to 64, and cross-entropy is the loss function. We use Adam as the optimizer, with learning rate 1e-5. The attack model is trained for 50 epochs. As the evaluation metric, we adopt accuracy, as done in previous work [44, 46].
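A hedged reconstruction of the black-box attack model described in the first bullet is sketched below; only the two-branch structure (a 2-layer MLP per input and a 4-layer MLP head) follows the text, while the hidden width is an assumption.

```python
# Two-branch black-box membership inference attack model.
import torch
import torch.nn as nn

class BlackBoxAttackModel(nn.Module):
    def __init__(self, num_classes, hidden=64):
        super().__init__()
        # One 2-layer MLP per input (ranked posteriors, correctness indicator).
        self.post_branch = nn.Sequential(nn.Linear(num_classes, hidden), nn.ReLU(),
                                         nn.Linear(hidden, hidden), nn.ReLU())
        self.label_branch = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                          nn.Linear(hidden, hidden), nn.ReLU())
        # 4-layer MLP head over the concatenated embeddings.
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 2))          # member / non-member

    def forward(self, ranked_posteriors, correct_indicator):
        e1 = self.post_branch(ranked_posteriors)
        e2 = self.label_branch(correct_indicator)
        return self.head(torch.cat([e1, e2], dim=1))
```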
Model Inversion.
For the attack in [12], we set the threshold to 0.999, the learning rate to 1e-2, the maximum number of iterations to 3,000, and the early-stopping criterion to 100 for all the target models. This attack generates one representative sample for each class of the target model. Therefore, for all samples in one class, we first calculate their average sample, then compute the mean squared error (MSE) between this averaged sample and the sample reconstructed by model inversion. Finally, we average the MSE values over all classes and use this as the evaluation metric.

For ⟨ModInv, M_W, D^S_aux⟩ [56], we first use the shadow training dataset to train a DC-GAN [41] with the generator's noise dimension set to 100. For the attack, we set the learning rate to 1e-3, momentum to 0.9, lambda to 100, the number of iterations to 1,500, and the clip range to 1. To evaluate the effectiveness of this attack, we use the same approach as Zhang et al. [56], i.e., we train an evaluation classifier on the identical task of the target model and use it to check whether the reconstructed samples can be recognized correctly. We use the accuracy of this classifier on reconstructed samples as the performance metric.
Attribute Inference.
We only use two datasets, namely CelebA and UTKFace, to evaluate this attack, as both of them have extra attributes per sample that can be used as the target attributes. For the former, we select two attributes: Male and Young. As mentioned before, each attribute is binary; thus, we have a total of 4 target attributes. For the latter, we choose Male and Female as the target attributes.

[Table 2: Performance of target models, namely training/testing accuracy, for each architecture (AlexNet, ResNet18, VGG19, Xception, SimpleCNN) and dataset (CelebA, FMNIST, STL10, UTKFace).]
Our attack model is a 2-layer MLP whose input is the target sample's embeddings from the second-to-last layer of the target model. We use cross-entropy as the loss function and Adam as the optimizer with learning rate 1e-3. The attack model is trained for 50 epochs. As the main evaluation metric, we use accuracy [50].
Model Stealing.
We evaluate the model stealing attack over all 20 target models. For the stolen model, we use the same architecture as the target model [51]. Each stolen model is trained using the MSE loss and SGD as the optimizer (momentum 0.9 and learning rate 1e-2) for 50 epochs. Agreement is used to assess the success of the attack: it is the proportion of samples in the target testing dataset on which the target and the stolen models make the same prediction.
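For clarity, a small sketch of how this agreement metric can be computed:

```python
# Agreement: fraction of test samples on which target and stolen models agree.
import torch

@torch.no_grad()
def agreement(target_model, stolen_model, test_loader):
    same, total = 0, 0
    for x, _ in test_loader:
        same += (target_model(x).argmax(dim=1) == stolen_model(x).argmax(dim=1)).sum().item()
        total += x.size(0)
    return same / total
```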
Target Models. We focus on five model architectures: AlexNet [22], ResNet18 [17], VGG19 [47], Xception [8], and SimpleCNN (containing 2 convolutional layers and 2 fully connected layers), for all four datasets introduced above. In total, we train 20 different target models. For training, we set the mini-batch size to 64 and use cross-entropy as the loss function. We use stochastic gradient descent (SGD) as the optimizer, with weight decay set to 5e-4 and momentum to 0.9. Each target model is trained for 300 epochs. The learning rate is 1e-2 for the first 50 epochs, 1e-3 from epoch 50 to 100, and 1e-4 until the end. All target models' training and testing accuracies are shown in Table 2.
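A hedged sketch of this training configuration follows; MultiStepLR is used here as one straightforward way to implement the step-wise learning-rate schedule (1e-2, then 1e-3 after epoch 50, then 1e-4 after epoch 100).

```python
# Target model training loop with the hyperparameters listed above.
import torch
import torch.nn.functional as F

def train_target(model, train_loader, epochs=300):
    opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[50, 100], gamma=0.1)
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
        sched.step()                               # drop the learning rate at epochs 50 and 100
    return model
```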
In this section, we leverage ML-Doctor to provide a holistic assessment of inference attacks against ML models. We start by assessing the performance of the four attacks. We then analyze the impact of the dataset and of overfitting on attack performance, as well as the relation among different attacks. Experiments are performed on an NVIDIA DGX-A100 server running Ubuntu 18.04. We report average results over five runs.
Membership Inference.
In Figure 2, we report the performance of MemInf. The attack achieves high accuracy on CelebA, STL10, and UTKFace. For instance, the attack accuracy of ⟨MemInf, M_W, D^S_aux⟩ on ResNet18 trained on the STL10 dataset is 0.93. However, the performance on FMNIST is not as strong, as models trained on FMNIST generalize well on non-member data samples; in other words, there is less overfitting [46] (see Section 6.3).

An adversary with white-box access to the target model generally achieves better performance than one with black-box access. For instance, the accuracy of ⟨MemInf, M_W, D^S_aux⟩ is higher than that of ⟨MemInf, M_B, D^S_aux⟩, except for the FMNIST dataset on ResNet18, VGG19, and Xception. A similar observation can be drawn from ⟨MemInf, M_W, D^P_aux⟩ and ⟨MemInf, M_B, D^P_aux⟩; this is expected, as the adversary can exploit more information in the white-box setting. In particular, we find that the classification loss provides the strongest signal for the attack [31]. Meanwhile, a partial training dataset also leads to better membership inference than a shadow dataset; however, the effect is less pronounced.
Model Inversion.
Next, we measure the performance of the model inversion attacks; see Figure 3. As discussed in Section 5.2, we use different metrics to evaluate these two attacks, i.e., MSE for ⟨ModInv, M_W, D^N_aux⟩ and accuracy for ⟨ModInv, M_W, D^S_aux⟩, due to their different designs. Thus, we cannot directly compare them. Rather, we evaluate their attack performance qualitatively (see Figure 4 for two examples), and discover that the images generated by ⟨ModInv, M_W, D^S_aux⟩ are more realistic than those by ⟨ModInv, M_W, D^N_aux⟩. This is due to the capability of GANs to generate high-quality samples. (Similar results are shown by Zhang et al. [56].)
Attribute Inference.
The performance of the attribute inference attacks is shown in Figure 5. In general, the attacks work quite well. For instance, both ⟨AttrInf, M_W, D^S_aux⟩ and ⟨AttrInf, M_W, D^P_aux⟩ reach around 0.80 accuracy for ResNet18 trained on UTKFace. We also notice that using a partial training dataset does not provide the adversary with much of an advantage compared to using a shadow dataset. In some cases, the partial training dataset even yields worse performance, as in the case of VGG19 trained on CelebA.
Model Stealing.
Finally, we report the agreement metric to evaluate model stealing attacks; see Figure 6. Overall, ModSteal has strong performance. For instance, ⟨ModSteal, M_B, D^S_aux⟩ for ResNet18 trained on FMNIST achieves an agreement of 0.93. Similar to attribute inference, we observe that using a partial training dataset as the auxiliary dataset yields lower performance than using a shadow dataset. One reason might be that querying the target model with the partial training dataset results in more confident posteriors (lower entropy), which contain less information for the adversary to exploit.
[Figure 2: Performance of membership inference attacks (MemInf) under different threat models (⟨M_B, D^S_aux⟩, ⟨M_B, D^P_aux⟩, ⟨M_W, D^S_aux⟩, ⟨M_W, D^P_aux⟩), datasets, and target model architectures (AlexNet, ResNet18, VGG19, Xception, SimpleCNN).]

[Figure 3: Performance of model inversion attacks (ModInv) under different threat models, datasets, and target model architectures; MSE is reported for ⟨M_W, D^N_aux⟩ and accuracy for ⟨M_W, D^S_aux⟩.]

[Figure 4: Visualization of model inversion (AlexNet model trained on UTKFace). The left column depicts two samples reconstructed using [12], the middle one using [56], while the right column reports two samples from the target model's training dataset.]

We now explore the effect of dataset complexity on the attack performance; see Figure 7. (The x-axis represents the datasets and the y-axis shows the attack performance; each node corresponds to one attack against one target model.) Due to space limitations, we only show one plot for one threat model for each attack.
Dataset Complexity.
As mentioned before, all the samples in the four datasets are re-sized to 32 × 32 pixels. FMNIST is the simplest dataset, as it only contains grayscale images, followed by UTKFace, which consists of (full-color) human faces, and CelebA, which has 10 times more images than UTKFace. The most complex dataset is STL10, as it contains images of 10 diverse classes, ranging from cat to ship.
Results.
Overall, the complexity of the dataset does have a significant effect on MemInf and ModSteal. More precisely, more complex datasets lead to better membership inference but worse model stealing performance. Ostensibly, this is due to the fact that a complex dataset is harder for a model to generalize on, making the model more prone to overfitting, which results in a better membership inference attack [46]. Whereas, when a model is trained on a complex dataset, it is harder for an adversary to obtain a dataset with similar complexity (by querying the target model) to train their stolen model. We also observe that model inversion is less effective on STL10 than on UTKFace and CelebA. Meanwhile, there is no strong influence of dataset complexity on attribute inference; this might be due to the different target classes of our attacks on these two datasets (see Section 5.2). Note that we also investigate the effect of the complexity of the target model structure on the attack performance, but do not observe any clear relation.
Next, we analyze the effect of target models' overfitting on inference attacks' performance. We adopt two metrics to quantify overfitting in each target model: 1) the difference between the training accuracy and the testing accuracy of the target model, referred to as the overfitting level, and 2) the number of epochs used to train the target model [44].

[Figure 5: Performance of attribute inference attacks (AttrInf) under different threat models, datasets, and target model architectures.]

[Figure 6: Performance of model stealing attacks (ModSteal) under different threat models, datasets, and target model architectures.]

[Figure 7: The relation between dataset complexity and attack performance.]

[Figure 8: The relation between overfitting level (on target models) and attack performance.]

[Figure 9: The relation between the number of epochs and attack performance.]
Overfitting Level.
The relation between overfitting level and attack performance is shown in Figure 8. First, we observe that different datasets have different overfitting levels, and this correlates well with dataset complexity (see Section 6.2). Specifically, the largest (smallest) overfitting level occurs on the most (least) complex dataset in our experiments, i.e., STL10 (FMNIST).

Overall, overfitting does have a significant impact on MemInf (⟨MemInf, M_W, D^S_aux⟩). That is, a higher overfitting level leads to better membership inference. This is in line with previous analyses [44, 46], and is expected, as an overfitted model provides more confident predictions on its member samples (reflected in the posteriors) than on non-member samples, which can be exploited by the attack model to effectively differentiate them. Meanwhile, model stealing displays a completely opposite trend, i.e., it is more difficult to steal a highly overfitted model. This can be explained by the fact that an overfitted model memorizes its training dataset to a large extent, and the adversary usually does not have access to the exact training dataset; thus, the stolen model is likely to be dissimilar to the target model. Also, model inversion tends to perform better on less overfitted models, except for FMNIST. We believe this is due to the quality of the GAN employed in the attack. For attribute inference, we do not observe a clear relationship between attack performance and overfitting level.
Number of Epochs.
The relation between the number of epochs and attack performance (on UTKFace and CelebA) is shown in Figure 9. First, we find that all attacks' performance becomes steady after 100 epochs; this is reasonable, since 100 epochs are usually enough to train good target models, and further training does not have an obvious effect on overfitting. Second, the performance of membership inference increases from 10 epochs until 100 epochs, while model stealing shows the opposite trend. This observation echoes our previous argument that a highly overfitted model is easier to attack with membership inference but harder to steal. For model inversion and attribute inference, the attack performance only fluctuates slightly with different numbers of epochs.
Next, we analyze the relationship between different attacks under the same threat model. In total, we consider all six pairs of attacks (as depicted in Table 1): MemInf and ModSteal under ⟨M_B, D^S_aux⟩, MemInf and ModSteal under ⟨M_B, D^P_aux⟩, MemInf and AttrInf under ⟨M_W, D^S_aux⟩, MemInf and AttrInf under ⟨M_W, D^P_aux⟩, MemInf and ModInv under ⟨M_W, D^S_aux⟩, and AttrInf and ModInv under ⟨M_W, D^S_aux⟩. Figure 10 shows that there is a strong negative correlation between the performance of membership inference and that of model stealing under ⟨M_B, D^S_aux⟩.

Defenses

We now evaluate two representative defense mechanisms, namely differential privacy (DP) and Knowledge Distillation (KD), and investigate whether or not, and how effectively, they can be used to mitigate these attacks.
First, we introduce the concepts of DP and KD.
Differential privacy (DP) [11, 24] guarantees that any single data sample in a dataset has a limited impact on the output.

Definition 1 ((ε, δ)-DP). A randomized algorithm $\mathcal{A}$ satisfies (ε, δ)-differential privacy, with ε > 0 and 0 ≤ δ < 1, if and only if for any two neighboring datasets D and D′ that differ in one record, we have:

$$\forall T \subseteq \mathrm{Range}(\mathcal{A}): \Pr[\mathcal{A}(D) \in T] \le e^{\varepsilon} \Pr[\mathcal{A}(D') \in T] + \delta,$$

where $\mathrm{Range}(\mathcal{A})$ denotes the set of all possible outputs of the algorithm $\mathcal{A}$, and δ can be interpreted as the probability that the mechanism fails to satisfy ε-DP.

[Figure 10: The relation between different attacks under the same threat model, shown as pairwise scatter plots of attack performance (e.g., MemInf accuracy vs. ModSteal agreement) with the correlation coefficient r reported per pair, over CelebA, FMNIST, STL10, and UTKFace.]
[Table 3: Attack performance under different threat models and datasets, on SimpleCNN, using DP-SGD (original model vs. two privacy budgets per dataset). For MemInf, ModInv, and AttrInf, we report accuracy; for ModSteal, agreement.]

Gaussian Mechanism. There are several approaches to design mechanisms satisfying (ε, δ)-differential privacy. The Gaussian mechanism is arguably the most widely used one in the ML context. Essentially, it computes a function f on a dataset D by adding random (Gaussian) noise to f(D). The magnitude of the noise depends on $\Delta_f$, i.e., the global sensitivity of f (also referred to as the $\ell_2$ sensitivity). More formally, we define the mechanism $\mathcal{A}$ as

$$\mathcal{A}(D) = f(D) + \mathcal{N}\big(0, \Delta_f^2 \sigma^2 I\big), \quad \text{where } \Delta_f = \max_{(D, D'): D \simeq D'} \|f(D) - f(D')\|_2.$$

Here, $\mathcal{N}(0, \Delta_f^2 \sigma^2 I)$ denotes a multi-dimensional random variable sampled from the normal distribution with mean 0 and standard deviation $\Delta_f \sigma$, and $\sigma = \sqrt{2\ln(1.25/\delta)}/\varepsilon$.

DP-SGD.
We experiment with Differentially Private Stochastic Gradient Descent (DP-SGD) [1], the most representative DP mechanism for protecting machine learning models. In general, DP-SGD adds Gaussian noise to the gradient g during the target ML model's training process, i.e., $\tilde{g} = g + \mathcal{N}(0, \Delta_f^2 \sigma^2 I)$. Due to the unbounded nature of the gradients of ML models, one needs to first clip the gradient before adding noise. Concretely, we can bound the $\ell_2$ norm of the gradient to C by computing $g \leftarrow g / \max\{1, \|g\|_2 / C\}$.

Composition.
Note that we need to calculate the gradient multiple times when training an ML model. Each calculation requires access to the training data and thus consumes a portion of the privacy budget. We use the notion of zCDP [3] to calculate the total privacy budget consumption. The general idea of zCDP is to connect (ε, δ)-DP to Rényi divergence, and to use the properties of Rényi divergence to achieve a tighter composition property.

Another defense mechanism we consider is Knowledge Distillation (KD) [20]. Generally, KD aims to transfer the knowledge from a larger model (the original model) to a smaller model (the distilled model). Compared to the original model, the distilled model has a lower capacity, which may make it remember less information about its training data. Papernot et al. [40] show that KD can reduce the risks of machine learning models with respect to adversarial examples. Here, we take a broader view and investigate whether or not KD is effective in defending against inference attacks.
Both DP-SGD and KD are applied during the training process of the target models. Due to space limitations, we only apply DP-SGD to SimpleCNN and KD to VGG19.
DP-SGD Target Model.
We use the Opacus library (https://github.com/pytorch/opacus) to implement DP-SGD on SimpleCNN. The library allows a user to configure the clipping bound C, the standard deviation of the Gaussian noise σ, and the failure probability δ; it then automatically calculates the total privacy budget ε using zCDP. A larger number of epochs implies a higher ε. Our target model is trained for 300 epochs; thus, we fix δ = 1e-5 and choose two sets of C and σ such that ε is smaller than 10. We list these settings in the second row of Table 3. All the other hyperparameters are the same as presented in Section 5.3.
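A hedged usage sketch of Opacus is shown below; the API corresponds to Opacus 1.x (the version used in the paper, and hence the exact calls, may differ), and the concrete noise multiplier and clipping bound are placeholders rather than the paper's settings.

```python
# Wrap model, optimizer, and data loader with Opacus so that per-sample gradients
# are clipped to C and Gaussian noise is added during training.
import torch
from opacus import PrivacyEngine

def make_dpsgd_trainer(model, train_loader, sigma=1.0, clip_C=1.0, delta=1e-5):
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    engine = PrivacyEngine()
    model, optimizer, train_loader = engine.make_private(
        module=model, optimizer=optimizer, data_loader=train_loader,
        noise_multiplier=sigma, max_grad_norm=clip_C)
    # The consumed privacy budget for a fixed delta can be queried after training
    # via the engine's accountant (the accounting method depends on the version).
    report_epsilon = lambda: engine.get_epsilon(delta=delta)
    return model, optimizer, train_loader, report_epsilon
```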
Distillation Target Model. We distill the model knowledge of VGG19 (16 convolutional layers and 3 fully connected layers) into a smaller model, i.e., VGG11 [47] (8 convolutional layers and 3 fully connected layers). We use the Kullback-Leibler divergence as the soft target loss, namely the distillation loss, with the temperature set to 20. For the student loss, i.e., the hard loss, we use cross-entropy. We set α to 0.7 as the weight of the soft target loss. Other settings are the same as in the target model's training phase in Section 5.3.
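A hedged sketch of this distillation objective (temperature T = 20 and soft-loss weight alpha = 0.7, as above); the T^2 rescaling of the soft loss is a common convention rather than something stated in the text.

```python
# Knowledge distillation loss: temperature-softened KL term plus hard cross-entropy.
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=20.0, alpha=0.7):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)   # rescale so gradients stay comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```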
DP-SGD. Table 3 reports the performance of inference attacks against target models protected by DP-SGD. For MemInf, DP-SGD is effective in almost all cases. For instance, for ⟨MemInf, M_W, D^S_aux⟩ on the CelebA dataset, the accuracy of membership inference drops from 0.72 to 0.50, which is random guessing. This is expected, as DP, by definition, can mitigate membership inference. For ModInv and AttrInf, DP-SGD can only reduce the attack accuracy to a small extent. Moreover, the MSE loss for ⟨ModInv, M_W, D^N_aux⟩ remains stable.

DP-SGD does reduce the risks of model stealing on STL10 and UTKFace under different threat models. For instance, for ⟨ModSteal, M_B, D^S_aux⟩ on STL10, the agreement for the two privacy budgets is 0.53 and 0.51, respectively, while the agreement for the original model is 0.66. Meanwhile, DP-SGD does not influence model stealing on FMNIST. Interestingly, it actually enhances the performance of model stealing on CelebA. Overall, DP-SGD can effectively defend against membership inference attacks, but not against the others.
KD.
In Table 4, we report the effectiveness of KD as a general defense mechanism. We do not observe any significant decrease in attack performance for model inversion, attribute inference, and model stealing on the original vs. the distilled models. Specifically, the attack performance difference, in most cases, is less than 5%. In certain cases, KD is effective against membership inference, but to a lesser extent compared to DP-SGD. For instance, the performance of ⟨MemInf, M_B, D^S_aux⟩ on the STL10 dataset drops from 0.79 to 0.65.
Effect on Utility.
Finally, we measure how much the defenses reduce the performance of the target models on their original tasks; see Table 5. We observe that DP-SGD reduces the target model's performance, but only to a small extent. For instance, the testing accuracy of SimpleCNN with DP-SGD on CelebA under ε₁ and ε₂ is 0.65 and 0.67, respectively, while the original testing accuracy is 0.71. KD only has a very minor influence on model utility; however, as discussed above, it is not an overall effective defense mechanism against inference attacks.

Related Work

We now review relevant related work on inference attacks and defenses, as well as software dedicated to evaluating them.
Membership Inference Attacks.
Shokri et al. [46] propose the first membership inference attack against black-box ML models: they train multiple shadow models to simulate the target model and use multiple attack models to conduct the inference. Salem et al. [44] later relax several key assumptions from [46], namely, using multiple shadow models, knowledge of the target model structure, and having a dataset from the same distribution as the target model's. Yeom et al. [55] assume that the adversary knows the target model's training dataset's distribution and size, and that they collude with the training algorithm. Both [44] and [55] are close in performance to Shokri et al.'s attacks [46]. In this paper, we implement the attack proposed by Salem et al. [44], i.e., one shadow model, one attack model, and a shadow dataset. More recently, researchers have generalized membership inference to other settings, including natural language processing models [5, 49], generative models [6, 16, 19], and federated learning [28, 31]. Note that previous work [44, 46] also shows that overfitting is the major factor causing membership inference risks. To the best of our knowledge, however, none of these works has investigated the other factors studied in our paper, such as the influence of dataset complexity or the relationship among different inference attacks.

[Table 4: Attack performance under different threat models and datasets, on VGG19, using Knowledge Distillation (KD): original vs. distilled model for each dataset (CelebA, FMNIST, STL10, UTKFace). For MemInf, ModInv, and AttrInf, we report accuracy; for ModSteal, agreement.]
Experiment     | Model     | CelebA | FMNIST | STL10 | UTKFace
DP-SGD (ε₁)    | SimpleCNN | 0.65   | 0.83   | 0.35  | 0.70
DP-SGD (ε₂)    | SimpleCNN | 0.67   | 0.84   | 0.31  | 0.68
KD             | VGG19     | 0.71   | 0.92   | 0.59  | 0.82
No Defense     | SimpleCNN | 0.71   | 0.90   | 0.52  | 0.82
No Defense     | VGG19     | 0.73   | 0.91   | 0.59  | 0.83

Table 5: Performance of target models protected by DP-SGD and KD.
Attribute Inference.
Prior research [2, 14] has studied macro-level attribute inference attacks against ML models, whereby the adversary aims to infer some general properties of the training dataset. Melis et al. [28] propose the first sample-level attribute inference attack against federated machine learning systems. Song and Shmatikov [50] reveal that the risks of attribute inference are caused by the intrinsic overlearning characteristics of machine learning models.
Model Inversion.
Model inversion was first proposed by Fredrikson et al. [13] in the setting of drug dose classification. Later, they extended model inversion to general ML settings, relying on back-propagation over a target ML model's parameters [12]. More recently, Zhang et al. [56] develop a more advanced attack aiming to synthesize the training dataset relying on GANs. Finally, Carlini et al. [4] show that model inversion can be effectively performed against natural language processing models as well.
Model Stealing.
Tramèr et al. [51] propose the first model stealing attack against black-box machine learning APIs. Orekondy et al. [34] develop a reinforcement learning based framework to optimize both query time and effectiveness. Also, Wang and Gong [52] and Oh et al. [33] show that the hyperparameters of a target model can be inferred as well.
Defense Mechanisms.
A few defense mechanisms have been proposed to mitigate membership inference attacks [21, 30, 44]. However, these defenses are specifically designed for membership inference and cannot mitigate other inference attacks. For instance, Salem et al. [44] propose to reduce the overfitting of the target model as a defense; however, as we show in our analysis (see Section 6.3), reducing overfitting improves the performance of model stealing.

Differential privacy (DP) [11, 24] guarantees that any single data sample in a dataset has a limited impact on the output of an algorithm. As such, it is an effective defense mechanism against inference attacks. Abadi et al. [1] introduce DP-SGD, which adds Gaussian noise to the gradients of the target model during the training process. Another DP method for protecting the privacy of ML models is PATE [36]: a set of teacher models is trained on a private dataset and used to label a public dataset in a differentially private manner; the resulting public dataset is then used to train a student model. Recently, Nasr et al. [32] instantiate a number of attacks against ML to evaluate the effectiveness of DP defenses and, in particular, how tight the theoretical DP bounds are.

Another defense mechanism is Knowledge Distillation (KD) [20]. Papernot et al. [40] propose a defensive distillation mechanism to effectively reduce the risks for target models with respect to adversarial examples. Shejwalkar and Houmansadr [45] reveal that distillation can reduce the gap between the posteriors of members and non-members, thus protecting membership privacy. In our experiments, we show that distillation is indeed effective against certain target models supported by ML-Doctor; however, it cannot defend against other types of inference attacks.
Risk Assessment Tools.
Finally, researchers have recently developed a number of software tools to measure the potential security/privacy risks of ML models. Ling et al. [26] propose DEEPSEC, a security analysis system to evaluate different adversarial example attacks and defenses. Another system for adversarial examples is CleverHans [37]. Pang et al. [35] introduce TROJANZOO, which focuses on backdoor attacks. Closer to our work is ML Privacy Meter [29], which considers membership inference attacks in both black-box and white-box settings. Unlike ML Privacy Meter, which focuses on membership inference only, ML-Doctor considers four types of inference attacks simultaneously. In addition, we rely on ML-Doctor to perform a comprehensive analysis of all these inference attacks.
Conclusion
In this paper, we performed the first holistic analysis of privacy risks caused by inference attacks against machine learning models. We established a taxonomy of threat models for four types of inference attacks: membership inference, model inversion, attribute inference, and model stealing. We conducted an extensive measurement study, over five model architectures and four datasets, of both attacks and defenses. Among other things, we found that the complexity of the training dataset plays an important role in the attacks' performance, while the effectiveness of model stealing and membership inference attacks is negatively correlated. We also showed that defenses such as DP-SGD and Knowledge Distillation can only hope to mitigate some of the inference attacks.

We integrated all the attacks and defenses into a re-usable, modular software tool called ML-Doctor. We envision that ML-Doctor can be used in various scenarios. An ML model owner can use ML-Doctor to evaluate their model's inference risks before deploying it in the real world. We are also confident that, as we will make all source code publicly available, ML-Doctor will serve as a benchmark tool to facilitate future research on inference attacks and defenses.

Currently, ML-Doctor concentrates on classification models, as classification is the most popular ML application and most of the current inference attacks target ML classifiers. Recently, researchers have demonstrated that inference attacks can also be successfully launched against other types of ML models, such as language models [48, 49], generative models [6, 16], and graph-based models [18], as well as other training paradigms, such as federated learning [28]. We plan to extend ML-Doctor to support a broader range of ML application scenarios.

Finally, while ML-Doctor is designed for inference attacks, we plan to integrate tools [26, 35, 37] geared to evaluate attacks aimed at jeopardizing models' functionality, e.g., adversarial examples, data poisoning, etc., thus providing a one-stop shop toward enabling secure and trustworthy AI.
Acknowledgments
The authors would like to thank Luca Melis and Jamie Hayes for valuable discussions and feedback.
References
[1] Martin Abadi, Andy Chu, Ian Goodfellow, Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep Learning with Differential Privacy. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 308–318. ACM, 2016.
[2] Giuseppe Ateniese, Luigi V. Mancini, Angelo Spognardi, Antonio Villani, Domenico Vitali, and Giovanni Felici. Hacking smart machines with smarter ones: How to extract meaningful data from machine learning classifiers. Int. J. Secur. Networks, 2015.
[3] Mark Bun and Thomas Steinke. Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds. In Theory of Cryptography Conference (TCC), pages 635–658. Springer, 2016.
[4] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks. In USENIX Security Symposium (USENIX Security), pages 267–284. USENIX, 2019.
[5] Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting Training Data from Large Language Models. CoRR abs/2012.07805, 2020.
[6] Dingfan Chen, Ning Yu, Yang Zhang, and Mario Fritz. GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 343–362. ACM, 2020.
[7] Min Chen, Zhikun Zhang, Tianhao Wang, Michael Backes, Mathias Humbert, and Yang Zhang. When Machine Unlearning Jeopardizes Privacy. CoRR abs/2005.02205, 2020.
[8] François Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1800–1807. IEEE, 2017.
[9] Adam Coates, Andrew Y. Ng, and Honglak Lee. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 215–223. JMLR, 2011.
[10] Emiliano De Cristofaro. An Overview of Privacy in Machine Learning. CoRR abs/2005.08679, 2020.
[11] Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy. Now Publishers Inc., 2014.
[12] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 1322–1333. ACM, 2015.
[13] Matt Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart. Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing. In USENIX Security Symposium (USENIX Security), pages 17–32. USENIX, 2014.
[14] Karan Ganju, Qi Wang, Wei Yang, Carl A. Gunter, and Nikita Borisov. Property Inference Attacks on Fully Connected Neural Networks using Permutation Invariant Representations. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 619–633. ACM, 2018.
[15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Annual Conference on Neural Information Processing Systems (NIPS), pages 2672–2680. NIPS, 2014.
[16] Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. LOGAN: Evaluating Privacy Leakage of Generative Models Using Generative Adversarial Networks. Symposium on Privacy Enhancing Technologies Symposium, 2019.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE, 2016.
[18] Xinlei He, Jinyuan Jia, Michael Backes, Neil Zhenqiang Gong, and Yang Zhang. Stealing Links from Graph Neural Networks. In USENIX Security Symposium (USENIX Security). USENIX, 2021.
[19] Benjamin Hilprecht, Martin Härterich, and Daniel Bernau. Monte Carlo and Reconstruction Membership Inference Attacks against Generative Models. Symposium on Privacy Enhancing Technologies Symposium, 2019.
[20] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the Knowledge in a Neural Network. CoRR abs/1503.02531, 2015.
[21] Jinyuan Jia, Ahmed Salem, Michael Backes, Yang Zhang, and Neil Zhenqiang Gong. MemGuard: Defending against Black-Box Membership Inference Attacks via Adversarial Examples. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 259–274. ACM, 2019.
[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Annual Conference on Neural Information Processing Systems (NIPS), pages 1106–1114. NIPS, 2012.
[23] Klas Leino and Matt Fredrikson. Stolen Memories: Leveraging Model Memorization for Calibrated White-Box Membership Inference. In USENIX Security Symposium (USENIX Security), pages 1605–1622. USENIX, 2020.
[24] Ninghui Li, Min Lyu, Dong Su, and Weining Yang. Differential Privacy: From Theory to Practice. Morgan & Claypool Publishers, 2016.
[25] Zheng Li and Yang Zhang. Label-Leaks: Membership Inference Attack with Label. CoRR abs/2007.15528, 2020.
[26] Xiang Ling, Shouling Ji, Jiaxu Zou, Jiannan Wang, Chunming Wu, Bo Li, and Ting Wang. DEEPSEC: A Uniform Platform for Security Analysis of Deep Learning Model. In IEEE Symposium on Security and Privacy (S&P), pages 673–690. IEEE, 2019.
[27] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In IEEE International Conference on Computer Vision (ICCV), pages 3730–3738. IEEE, 2015.
[28] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. Exploiting Unintended Feature Leakage in Collaborative Learning. In IEEE Symposium on Security and Privacy (S&P), pages 497–512. IEEE, 2019.
[29] Sasi Kumar Murakonda and Reza Shokri. ML Privacy Meter: Aiding Regulatory Compliance by Quantifying the Privacy Risks of Machine Learning. CoRR abs/2007.09339, 2020.
[30] Milad Nasr, Reza Shokri, and Amir Houmansadr. Machine Learning with Membership Privacy using Adversarial Regularization. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 634–646. ACM, 2018.
[31] Milad Nasr, Reza Shokri, and Amir Houmansadr. Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning. In IEEE Symposium on Security and Privacy (S&P), pages 1021–1035. IEEE, 2019.
[32] Milad Nasr, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, and Nicholas Carlini. Adversary Instantiation: Lower Bounds for Differentially Private Machine Learning. CoRR abs/2101.04535, 2021.
[33] Seong Joon Oh, Max Augustin, Bernt Schiele, and Mario Fritz. Towards Reverse-Engineering Black-Box Neural Networks. In International Conference on Learning Representations (ICLR), 2018.
[34] Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz. Knockoff Nets: Stealing Functionality of Black-Box Models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4954–4963. IEEE, 2019.
[35] Ren Pang, Zheng Zhang, Xiangshan Gao, Zhaohan Xi, Shouling Ji, Peng Cheng, and Ting Wang. TROJANZOO: Everything You Ever Wanted to Know about Neural Backdoors (But Were Afraid to Ask). CoRR abs/2012.09302, 2020.
[36] Nicolas Papernot, Martin Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data. In International Conference on Learning Representations (ICLR), 2017.
[37] Nicolas Papernot, Fartash Faghri, Nicholas Carlini, Ian Goodfellow, Reuben Feinman, Alexey Kurakin, Cihang Xie, Yash Sharma, Tom Brown, Aurko Roy, Alexander Matyasko, Vahid Behzadan, Karen Hambardzumyan, Zhishuai Zhang, Yi-Lin Juang, Zhi Li, Ryan Sheatsley, Abhibhav Garg, Jonathan Uesato, Willi Gierke, Yinpeng Dong, David Berthelot, Paul Hendricks, Jonas Rauber, Rujun Long, and Patrick McDaniel. Technical Report on the CleverHans v2.1.0 Adversarial Examples Library. CoRR abs/1610.00768, 2018.
[38] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael Wellman. SoK: Towards the Science of Security and Privacy in Machine Learning. In IEEE European Symposium on Security and Privacy (Euro S&P), pages 399–414. IEEE, 2018.
[39] Nicolas Papernot, Patrick D. McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical Black-Box Attacks Against Machine Learning. In ACM Asia Conference on Computer and Communications Security (ASIACCS), pages 506–519. ACM, 2017.
[40] Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks. In IEEE Symposium on Security and Privacy (S&P), pages 582–597. IEEE, 2016.
[41] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. CoRR abs/1511.06434, 2015.
[42] Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Yann Ollivier, and Hervé Jégou. White-box vs Black-box: Bayes Optimal Strategies for Membership Inference. In International Conference on Machine Learning (ICML), pages 5558–5567. PMLR, 2019.
[43] Ahmed Salem, Apratim Bhattacharya, Michael Backes, Mario Fritz, and Yang Zhang. Updates-Leak: Data Set Inference and Reconstruction Attacks in Online Learning. In USENIX Security Symposium (USENIX Security), pages 1291–1308. USENIX, 2020.
[44] Ahmed Salem, Yang Zhang, Mathias Humbert, Pascal Berrang, Mario Fritz, and Michael Backes. ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models. In Network and Distributed System Security Symposium (NDSS). Internet Society, 2019.
[45] Virat Shejwalkar and Amir Houmansadr. Reconciling Utility and Membership Privacy via Knowledge Distillation. CoRR abs/1906.06589, 2019.
[46] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership Inference Attacks Against Machine Learning Models. In IEEE Symposium on Security and Privacy (S&P), pages 3–18. IEEE, 2017.
[47] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations (ICLR), 2015.
[48] Congzheng Song and Ananth Raghunathan. Information Leakage in Embedding Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 377–390. ACM, 2020.
[49] Congzheng Song and Vitaly Shmatikov. Auditing Data Provenance in Text-Generation Models. In ACM Conference on Knowledge Discovery and Data Mining (KDD), pages 196–206. ACM, 2019.
[50] Congzheng Song and Vitaly Shmatikov. Overlearning Reveals Sensitive Attributes. In International Conference on Learning Representations (ICLR), 2020.
[51] Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. Stealing Machine Learning Models via Prediction APIs. In USENIX Security Symposium (USENIX Security), pages 601–618. USENIX, 2016.
[52] Binghui Wang and Neil Zhenqiang Gong. Stealing Hyperparameters in Machine Learning. In IEEE Symposium on Security and Privacy (S&P), pages 36–52. IEEE, 2018.
[53] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. CoRR abs/1708.07747, 2017.
[54] Xiaojun Xu, Qi Wang, Huichen Li, Nikita Borisov, Carl A. Gunter, and Bo Li. Detecting AI Trojans Using Meta Neural Analysis. In IEEE Symposium on Security and Privacy (S&P). IEEE, 2021.
[55] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting. In IEEE Computer Security Foundations Symposium (CSF), pages 268–282. IEEE, 2018.
[56] Yuheng Zhang, Ruoxi Jia, Hengzhi Pei, Wenxiao Wang, Bo Li, and Dawn Song. The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 250–258. IEEE, 2020.
[57] Zhifei Zhang, Yang Song, and Hairong Qi. Age Progression/Regression by Conditional Adversarial Autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.