Quantifying and Mitigating Privacy Risks of Contrastive Learning
Xinlei He, Yang Zhang
CISPA Helmholtz Center for Information Security
Abstract
Data is the key factor to drive the development of machine learning (ML) during the past decade. However, high-quality data, in particular labeled data, is often hard and expensive to collect. To leverage large-scale unlabeled data, self-supervised learning, represented by contrastive learning, is introduced. The objective of contrastive learning is to map different views derived from a training sample (e.g., through data augmentation) closer in their representation space, while different views derived from different samples more distant. In this way, a contrastive model learns to generate informative representations for data samples, which are then used to perform downstream ML tasks. Recent research has shown that machine learning models are vulnerable to various privacy attacks. However, most of the current efforts concentrate on models trained with supervised learning. Meanwhile, data samples' informative representations learned with contrastive learning may cause severe privacy risks as well.

In this paper, we perform the first privacy analysis of contrastive learning through the lens of membership inference and attribute inference. Our experimental results show that contrastive models are less vulnerable to membership inference attacks but more vulnerable to attribute inference attacks compared to supervised models. The former is due to the fact that contrastive models are less prone to overfitting, while the latter is caused by contrastive models' capability of representing data samples expressively. To remedy this situation, we propose the first privacy-preserving contrastive learning mechanism, namely Talos, relying on adversarial training. Empirical results show that Talos can successfully mitigate attribute inference risks for contrastive models while maintaining their membership privacy and model utility.
1 Introduction

Machine learning (ML) has progressed tremendously, and data is the key factor to drive such development. However, high-quality data, in particular labeled data, is often hard and expensive to collect as this relies on large-scale human annotation. Meanwhile, unlabeled data is being generated at every moment. To leverage unlabeled data for machine learning tasks, self-supervised learning has been introduced [29]. The goal of self-supervised learning is to derive labels from an unlabeled dataset and train an unsupervised task in a supervised manner. A trained self-supervised model serves as an encoder transforming data samples into their representations, which are then used to perform supervised downstream ML tasks. One of the most prominent self-supervised learning paradigms is contrastive learning [8, 15, 17, 20, 24, 51, 57], with SimCLR [8] as its most representative framework [29].

Different from supervised learning, which directly optimizes an ML model on a labeled training dataset (referred to as a supervised model), contrastive learning aims to train a contrastive model, which is able to generate expressive representations for data samples, and relies on such representations to perform downstream supervised ML tasks. The optimization objective for contrastive learning is to map different views derived from one training sample (e.g., through data augmentation) closer in the representation space while mapping different views derived from different training samples more distant. By doing this, a contrastive model is capable of representing each sample in an informative way.

Recently, machine learning models have been demonstrated to be vulnerable to various privacy attacks against their training dataset [4, 6, 16, 19, 31, 41, 43, 46, 47]. The two most representative attacks in this domain are the membership inference attack [41, 43] and the attribute/property inference attack [31, 47]. The former aims to infer whether a data sample is part of a target ML model's training dataset. The latter leverages the overlearning property of a machine learning model to infer a sensitive attribute of a data sample. So far, most of the research on the privacy of machine learning concentrates on supervised models. Meanwhile, informative representations for data samples learned by contrastive models may cause severe privacy risks as well. To the best of our knowledge, this has been left largely unexplored.
In this paper, we perform the first privacy quantification of contrastive learning, the most representative self-supervised learning paradigm. More specifically, we study the privacy risks of data samples in the contrastive learning setting, with a focus on SimCLR, through the lens of membership inference and attribute inference. We adapt the existing attack methodologies for membership inference and attribute inference against supervised models to contrastive models. Our empirical results show that contrastive models are less vulnerable to membership inference attacks than supervised models. For instance, we achieve 0.552 membership inference accuracy on a contrastive model trained on STL10 [9] with ResNet-50 [18], while the result is 0.844 on the corresponding supervised model. The reason behind this is that contrastive models are less prone to overfitting.

On the other hand, we observe that contrastive models are more vulnerable to attribute inference attacks than supervised models. For instance, on the UTKFace [58] dataset with ResNet-18, we can achieve 0.628 attribute inference attack accuracy on the contrastive model while only 0.496 on the supervised model. This is due to the fact that the representations generated by a contrastive model contain rich and expressive information about their original data samples, which can be exploited for effective attribute inference.

To mitigate the attribute inference risks stemming from contrastive models, we propose the first privacy-preserving contrastive learning mechanism, namely Talos, relying on adversarial training. Concretely, Talos introduces an adversarial classifier into the original contrastive learning framework to censor the sensitive attributes learned by a contrastive model. Our evaluation reveals that Talos can successfully mitigate attribute inference risks for contrastive models while maintaining their membership privacy and model utility. Our code and models will be made publicly available.

In summary, we make the following contributions:

• We take the first step towards quantifying the privacy risks of contrastive learning.

• Empirical evaluation shows that contrastive models are less vulnerable to membership inference attacks but more prone to attribute inference attacks compared to supervised models.

• We propose the first privacy-preserving contrastive learning mechanism, which is able to protect the trained contrastive models from attribute inference attacks without jeopardizing their membership privacy and model utility.
The rest of the paper is organized as follows. In Section 2, we introduce the necessary background for contrastive learning. Section 3 and Section 4 present our privacy analysis of contrastive models with membership inference and attribute inference, respectively. Section 5 presents our defense mechanism Talos and its evaluation. We summarize related work in Section 6 and conclude the paper in Section 7.
2 Background

In this section, we provide the background knowledge of supervised learning and contrastive learning.
2.1 Supervised Learning

Supervised learning, represented by classification, is one of the most common and important ML applications. We first denote a set of data samples by X and a set of labels by Y. The objective of a supervised ML model M is to learn a mapping function from each data sample x ∈ X to its label/class y ∈ Y. Formally, we have

$M: x \mapsto y$  (1)

Given a sample x, its output from M, denoted by p = M(x), is a vector that represents the probability distribution of the sample belonging to a certain class. In this paper, we refer to p as the prediction posteriors.

To train an ML model, we need to define a loss function L(y, M(x)) which measures the distance between a sample's prediction posteriors and its label. The training process is then performed by minimizing the expectation of the loss function over a training dataset D_train, i.e., the empirical loss. We formulate this as follows:

$\arg\min_{M} \frac{1}{|D_{train}|} \sum_{(x, y) \in D_{train}} \mathcal{L}(y, M(x))$  (2)

Cross-entropy loss is one of the most common loss functions used for classification tasks; it is defined as the following:

$\mathcal{L}_{CE}(y, p) = -\sum_{i=1}^{k} y_i \log p_i$  (3)

Here, k is the total number of classes, y_i equals 1 if the sample belongs to class i (otherwise 0), and p_i is the i-th element of the posteriors p. In this paper, we use cross-entropy for training all the supervised models.

2.2 Contrastive Learning

Supervised learning is powerful, but its success heavily depends on the labeled training dataset. In the real world, a high-quality labeled dataset is hard and expensive to obtain as it often relies on human annotation. For instance, the ILSVRC2011 dataset [39] contains more than 12 million labeled images that are all annotated by Amazon Mechanical Turk workers. Meanwhile, unlabeled data is being generated at every moment. To leverage large-scale unlabeled data, self-supervised learning is introduced.

The goal of self-supervised learning is to get labels from an unlabeled dataset for free so that one can train an unsupervised task on this unlabeled dataset in a supervised manner. Contrastive learning [8, 15, 17, 20, 24, 51, 57] is one of the most successful and representative self-supervised learning paradigms in recent years and has received a lot of attention from both academia and industry. In general, contrastive learning aims to map a sample closer to its correlated views and more distant from other samples' correlated views. In this way, contrastive learning is able to learn an informative representation for each sample, which can then be leveraged to perform different downstream tasks. Contrastive learning relies on Noise Contrastive Estimation (NCE) [15] as its objective function, which can be formulated as:

$\mathcal{L} = -\log \frac{\exp(sim(f(x), f(x^{+})))}{\exp(sim(f(x), f(x^{+}))) + \exp(sim(f(x), f(x^{-})))}$  (4)

where f is an encoder that maps a sample into its representation, x+ is similar to x (referred to as a positive pair), x− is dissimilar to x (referred to as a negative pair), and sim is a similarity function. The structure of the encoder and the similarity function can vary across different tasks. From Equation 4, we can see that NCE only considers one negative pair for each positive pair. To involve more negative pairs, InfoNCE [51] has been proposed.
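To make Equation 4 concrete, the following is a minimal sketch in PyTorch (the framework this paper's implementation uses) for a single (anchor, positive, negative) triple. The encoder f is a placeholder, and the choice of cosine similarity for sim is just one common option, not something the equation prescribes.

```python
import torch
import torch.nn.functional as F

def nce_loss(f, x, x_pos, x_neg):
    """Noise Contrastive Estimation (Equation 4) for one anchor, one positive, one negative."""
    sim = lambda a, b: F.cosine_similarity(a, b, dim=-1)   # one common choice of sim(., .)
    anchor = f(x)
    pos = torch.exp(sim(anchor, f(x_pos)))                  # exp(sim(f(x), f(x+)))
    neg = torch.exp(sim(anchor, f(x_neg)))                  # exp(sim(f(x), f(x-)))
    return -torch.log(pos / (pos + neg))
```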
In this paper, we focus on one of the most popular contrastive learning frameworks [29], namely SimCLR [8]. This framework is assembled with the following components.

• Data Augmentation. SimCLR first uses a data augmentation module to transform a given data sample x into two augmented views, denoted by x̃_i and x̃_j, which can be considered a positive pair for x. In our work, we follow the same data augmentation process used by SimCLR [8], i.e., first random cropping and flipping with resizing, second random color distortions, and third random Gaussian blur.

• Base Encoder f. The base encoder f is used to extract representations from the augmented data samples. The base encoder can follow various neural network architectures. In this paper, we apply the widely used ResNet [18] (ResNet-18 and ResNet-50) following Chen et al. [8] to obtain the representation h_i = f(x̃_i) for x̃_i.

• Projection Head g. The projection head g is a simple neural network that maps the representations from the base encoder to another latent space in which the contrastive loss is applied. The goal of the projection head is to enhance the encoder's performance. We implement it with a 2-layer MLP (multilayer perceptron) to obtain the output z_i = g(h_i) for h_i.

• Contrastive Loss Function. The contrastive loss function is defined to guide the model to learn a general representation from the data itself. Given a set of augmented samples {x̃_k} including a positive pair x̃_i and x̃_j, the contrastive loss maximizes the similarity between x̃_i and x̃_j and minimizes the similarity between x̃_i (x̃_j) and other samples. For each mini-batch of N samples, we have 2N augmented samples. The loss function for a positive pair x̃_i and x̃_j can be formulated as:

$\ell(i, j) = -\log \frac{\exp(sim(z_i, z_j)/\tau)}{\sum_{k=1, k \neq i}^{2N} \exp(sim(z_i, z_k)/\tau)}$  (5)

where sim(z_i, z_j) = z_i^T z_j / (‖z_i‖ ‖z_j‖) represents the cosine similarity between z_i and z_j, and τ is a temperature parameter. The final loss is calculated over all positive pairs in a mini-batch, which can be defined as the following:

$\mathcal{L}_{Contrastive} = \frac{1}{2N} \sum_{k=1}^{N} [\ell(2k-1, 2k) + \ell(2k, 2k-1)]$  (6)

Here, 2k−1 and 2k are the indices of each positive pair.

Training classification models with SimCLR can be partitioned into two phases.

• In the first phase, we train a base encoder as well as a projection head with the contrastive loss using an unlabeled dataset. After training, we discard the projection head and keep the base encoder only.

• In the second phase, to perform classification tasks, we freeze the parameters of the encoder, add a trainable linear layer at the end of the encoder, and fine-tune the linear layer with the cross-entropy loss (see Equation 3) on a labeled dataset. The linear layer serves as a classifier, with its input being the representations generated by the encoder. We refer to this linear layer as the classification layer.

In the rest of the paper, we call a model trained with supervised learning a supervised model and a model trained with contrastive learning a contrastive model.
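As a companion to Equations 5 and 6, here is a minimal, hedged sketch of the mini-batch contrastive loss in PyTorch. It assumes the 2N projection-head outputs are arranged so that rows 2k and 2k+1 come from the same original sample, and the default temperature is an assumption; it is not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    """Contrastive loss of Equations 5 and 6 over a mini-batch.

    z: tensor of shape (2N, d); rows 2k and 2k+1 are the projection-head
    outputs of the two augmented views of the k-th original sample.
    """
    z = F.normalize(z, dim=1)                    # cosine similarity = dot product of unit vectors
    sim = torch.matmul(z, z.t()) / temperature   # pairwise sim(z_i, z_j) / tau
    n = z.size(0)
    sim.fill_diagonal_(float("-inf"))            # exclude the k = i term from the denominator
    pos = torch.arange(n, device=z.device) ^ 1   # each row's positive partner: 0<->1, 2<->3, ...
    # Cross-entropy over each row recovers -log(exp(sim_pos) / sum_k exp(sim_ik)),
    # averaged over all 2N rows, i.e., Equation 6.
    return F.cross_entropy(sim, pos)
```

A single call on a (2N, d) tensor of projections therefore yields the per-batch loss used to optimize the base encoder and projection head in the first training phase.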
Compared to supervised learning, which only optimizes the model on a labeled dataset, contrastive learning can learn more informative representations for data samples. Previous work shows that supervised models are vulnerable to various privacy attacks [4, 6, 31, 41, 43, 47, 56]. However, to the best of our knowledge, privacy risks stemming from contrastive models have been left largely unexplored. In this work, we aim to fill this gap.

3 Membership Inference

In this section, we quantify the privacy risks of contrastive models through the lens of the membership inference attack. In Section 3.1, we define the attack and its threat model; we then present the attack methodology in the contrastive learning setting in Section 3.2. Experimental settings and results are shown in Section 3.3 and Section 3.4. Note that our goal here is not to propose a novel membership inference attack; instead, we aim to quantify the membership privacy of contrastive models. Therefore, we follow existing attacks and their threat models [41, 43].

3.1 Attack Definition and Threat Model

Membership inference attack is one of the most popular privacy attacks against ML models [6, 7, 16, 23, 26, 28, 41, 43]. The goal of membership inference is to determine whether a data sample x is part of the training dataset of a target model T. We formally define a membership inference attack model A_MemInf as a binary classifier:

$\mathcal{A}_{MemInf}: x, T \mapsto \{\text{member}, \text{non-member}\}$  (7)

Here, the target model is the contrastive model introduced in Section 2. A successful membership inference attack can cause severe privacy risks. For instance, if a model is trained on data samples collected from people with certain sensitive information, then successfully inferring that a person's sample is a member of the model's training dataset can directly reveal the person's sensitive information.

Following previous work [41, 43], we assume that an adversary only has black-box access to the target model T, i.e., they can only query T with their data samples and obtain the outputs. In addition, the adversary also has a shadow dataset D_shadow, which comes from the same distribution as the target model's training dataset. The shadow dataset D_shadow is used to train a shadow model S, the goal of which is to derive a training dataset with ground truth labels to train an attack model. We further assume that the shadow model shares the same architecture as the target model [43]. This is realistic, as the adversary can use the same machine learning service (e.g., Amazon Prediction API (https://aws.amazon.com/machine-learning), Google Cloud (https://cloud.google.com/ai-platform), and Microsoft Azure (https://azure.microsoft.com)) as the target model owner to train their shadow model. Alternatively, the adversary can also learn the target model's architecture first by applying model extraction attacks [34, 35, 50, 53].

3.2 Attack Methodology

We apply a similar attack methodology from previous membership inference attacks against supervised models to contrastive models [41, 43]. The attack process can be divided into three stages, i.e., shadow model training, attack model training, and membership inference.
Shadow Model Training.
Given a shadow dataset D_shadow, the adversary first splits it into two disjoint sets, namely the shadow training dataset D_shadow^train and the shadow testing dataset D_shadow^test. Then, D_shadow^train is used to train the shadow model S, which mimics the behavior of the target model. This means the shadow model is trained to perform the same task as the target model.

Attack Model Training. Next, the adversary uses D_shadow (including both D_shadow^train and D_shadow^test) to query the shadow model S and obtains the corresponding posteriors and prediction labels. For each data sample in D_shadow, the adversary ranks its posteriors in descending order and takes the largest two posteriors (classification tasks considered in this paper have at least two classes) as part of the input to the attack model. The other part is an indicator representing whether the prediction is correct or not. Thus, the dimension of the input to A_MemInf is 3. If a sample belongs to D_shadow^train, the adversary labels its corresponding input to the attack model as a member, otherwise as a non-member. This obtained dataset is then used to train the attack model, which is a binary machine learning classifier.

Membership Inference. To determine whether a target data sample x is used to train the target model, the adversary first queries the target model T with x and obtains the input to the attack model for this sample. Then, the adversary feeds this input to the attack model and gets its membership prediction.
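A hedged sketch of how the 3-dimensional attack inputs described above can be derived by querying a (shadow or target) model; the function and variable names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def attack_features(model, loader, is_member, device="cpu"):
    """Build the 3-dim attack inputs: two largest posteriors + a correct/incorrect indicator."""
    model.to(device).eval()
    feats, labels = [], []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        posteriors = F.softmax(model(x), dim=1)
        top2 = posteriors.topk(2, dim=1).values                       # two largest posteriors
        correct = (posteriors.argmax(dim=1) == y).float().unsqueeze(1)
        feats.append(torch.cat([top2, correct], dim=1).cpu())         # shape (batch, 3)
        labels.append(torch.full((x.size(0),), int(is_member), dtype=torch.long))
    return torch.cat(feats), torch.cat(labels)
```

The adversary would call this on D_shadow^train with is_member=1 and on D_shadow^test with is_member=0 to build the attack training set, and later on the target model to obtain the inputs whose membership is to be predicted.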
3.3 Experimental Settings

Datasets. We utilize 5 different datasets to conduct our experiments for membership inference.

• CIFAR10 [1]. This dataset contains 60,000 images in 10 classes. Each class represents one object and has 6,000 images. The size of each image is 32 × 32.

• CIFAR100 [1]. This dataset is similar to CIFAR10, except it has 100 classes, with each class containing 600 images. The size of each image is also 32 × 32.

• STL10 [9]. This dataset is composed of 10 classes of images. Each class has 1,300 samples. The size of each image is 96 × 96. Besides the labeled images, STL10 also contains 100,000 unlabeled images, which we use for pretraining the encoder for the contrastive model (detailed later). These images are extracted from a broader distribution compared to those with labeled classes.

• UTKFace [58]. This dataset consists of over 23,000 facial images labeled with gender, age, and race. We set its target model's classification task as gender classification.

• CelebA [30]. This dataset is composed of more than 200,000 celebrities' facial images. Each image is associated with 40 binary attributes. Note that in CelebA, we randomly select 60,000 images for our experiments. We set its target model's classification task as gender classification as well.

All the datasets are used to evaluate membership inference attacks, while UTKFace and CelebA are also used to evaluate attribute inference attacks since they have extra labels that can be used as sensitive attributes (see Section 4.3). For all the datasets, we rescale their images to the size of 96 × 96.
Datasets Configuration. For each dataset, we first split it into four equal parts, i.e., D_target^train, D_target^test, D_shadow^train, and D_shadow^test. D_target^train is used to train the target model T; its samples are thus considered members of the target model. We treat D_target^test as non-members of the target model T. D_shadow^train is used to train the shadow model S, and D_shadow^train and D_shadow^test are used to train the attack model A_MemInf.
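A minimal sketch of this four-way split (names hypothetical); it assumes PyTorch datasets and simply puts any remainder samples into the last part.

```python
import torch
from torch.utils.data import random_split

def split_four_ways(dataset, seed=0):
    """Split a dataset into D_target^train, D_target^test, D_shadow^train, D_shadow^test."""
    quarter = len(dataset) // 4
    sizes = [quarter, quarter, quarter, len(dataset) - 3 * quarter]
    generator = torch.Generator().manual_seed(seed)   # fixed seed for reproducibility
    return random_split(dataset, sizes, generator=generator)
```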
Metric. Since the attack model's training and testing datasets are both balanced with respect to membership distribution, we adopt accuracy as our evaluation metric, following previous work [41, 43].
Attack Model.
The attack model is a 3-layer MLP, and the number of neurons for each hidden layer is set to 32. We use cross-entropy as the loss function and Adam as the optimizer with a learning rate of 0.05. The attack model is trained for 100 epochs.
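Written out, an attack model consistent with this description is simply the following sketch (the exact layer layout beyond what is stated above is an assumption):

```python
import torch.nn as nn

# Membership inference attack model A_MemInf: a 3-layer MLP over the
# 3-dimensional input (two largest posteriors + correctness indicator),
# 32 neurons per hidden layer, two outputs (member / non-member).
attack_model = nn.Sequential(
    nn.Linear(3, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 2),
)
```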
Target Model (Contrastive Model).
We adopt two types of neural networks as the contrastive model's base encoder f in our experiments, ResNet-18 and ResNet-50, following Chen et al. [8]. Specifically, we discard the last layer of ResNet-18 and ResNet-50 and use the remaining parts as f. Then, a 2-layer MLP is added after f as the projection head g. For ResNet-18, the dimensions for the output of f, the first layer of g, and the second layer of g are set to 512, …. After the contrastive training phase, we discard the projection head g and add a new linear layer to the base encoder f as its classification layer.

• For CIFAR10, CIFAR100, and STL10, we first use the unlabeled dataset of STL10 to pretrain the base encoder f for 100 epochs. Then, we freeze the parameters of f and use the corresponding dataset to only fine-tune the classification layer for 100 epochs to establish the contrastive model.

• For UTKFace and CelebA, we also adopt the encoder pretrained with STL10's unlabeled dataset, then fine-tune the base encoder f with the corresponding dataset for 50 epochs. In the end, we fine-tune the classification layer for 100 epochs, with f being frozen.

In all cases, Adam is utilized as the optimizer.
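For concreteness, a minimal sketch of how such a target contrastive model can be assembled with torchvision's ResNet-18. The representation dimension of 512 is stated in the paper (2,048 for ResNet-50); the projection dimension proj_dim is an assumption, since the exact layer sizes are not recoverable from this copy.

```python
import torch.nn as nn
from torchvision.models import resnet18

class SimCLRModel(nn.Module):
    """Base encoder f (ResNet-18 without its final layer) plus a 2-layer MLP projection head g."""
    def __init__(self, feat_dim=512, proj_dim=128):     # proj_dim is an assumption
        super().__init__()
        backbone = resnet18()                            # final fc layer is discarded below
        self.f = nn.Sequential(*list(backbone.children())[:-1], nn.Flatten())
        self.g = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                               nn.Linear(feat_dim, proj_dim))

    def forward(self, x):
        h = self.f(x)   # representation used by downstream tasks (and by the attacks)
        z = self.g(h)   # projection fed to the contrastive loss
        return h, z

def build_classifier(model, num_classes, feat_dim=512):
    """Second phase: freeze f, discard g, and fine-tune a single linear classification layer."""
    for p in model.f.parameters():
        p.requires_grad = False
    return nn.Sequential(model.f, nn.Linear(feat_dim, num_classes))
```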
Figure 1: The performance of the original classification tasks for both supervised models and contrastive models with ResNet-18 and ResNet-50 on 5 different datasets ((a) training accuracy, (b) testing accuracy). The x-axis represents different datasets. The y-axis represents original classification tasks' accuracy.

Figure 2: The performance of membership inference attacks against both supervised models and contrastive models with ResNet-18 and ResNet-50 on 5 different datasets ((a) training accuracy, (b) testing accuracy). The x-axis represents different datasets. The y-axis represents membership inference attacks' accuracy.

Figure 3: The distribution of loss with respect to the original classification tasks for member and non-member samples for both the supervised model (a) and the contrastive model (b) with ResNet-18 on CIFAR10. The x-axis represents each sample's classification loss. The y-axis represents the number of member and non-member samples.

Figure 4: Randomly selected images from STL10 and their augmented views used during the process of contrastive learning. The first and fourth columns show the original images (bounded by orange boxes), and the rest of the columns show their augmented views.

Figure 5: The performance of membership inference attacks against both supervised models and contrastive models with ResNet-18 (a) and ResNet-50 (b) on 5 different datasets under different overfitting levels. The x-axis represents different overfitting levels. The y-axis represents membership inference attacks' accuracy.

Baseline (Supervised Model). To fully understand the privacy leakage of contrastive models, we further use supervised models as the baseline. Concretely, we train two models, ResNet-18 and ResNet-50, from scratch for all the datasets. The models are trained for 100 epochs. Cross-entropy (Equation 3) is adopted as the loss function, and we again use Adam as the optimizer.
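As a point of reference, a minimal sketch of such a baseline training loop (Equations 2 and 3); the learning rate is an assumption, as the paper only states Adam, 100 epochs, and cross-entropy.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def train_baseline(train_loader, num_classes, epochs=100, lr=1e-3, device="cpu"):
    """Train a supervised ResNet-18 baseline from scratch with cross-entropy."""
    model = resnet18(num_classes=num_classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # lr is an assumption
    model.train()
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
    return model
```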
Remarks.
All our experiments are performed 5 times, and we report the average results. Our code is currently implemented in Python 3.6 and PyTorch (https://pytorch.org/).

3.4 Results

We first show the performance of supervised models and contrastive models on their original classification tasks in Figure 1. We observe that contrastive models perform better than supervised models on CIFAR10, CIFAR100, and STL10. For instance, on STL10 with ResNet-18 as the base encoder, the contrastive model achieves 0.738 accuracy while the supervised model achieves 0.558 accuracy. On the other hand, contrastive models perform slightly worse than supervised models on the UTKFace and CelebA datasets [8].

Regarding membership inference, we see that all the supervised models have higher attack accuracy than the contrastive models (see Figure 2). For instance, when the supervised model is ResNet-18 trained on CIFAR100, the accuracy of membership inference is 0.941, while the accuracy for the corresponding contrastive model is only 0.640. These results indicate that contrastive models are less vulnerable to membership inference attacks.

To investigate the reason behind this, we analyze the loss distribution between members and non-members in both supervised models and contrastive models. Due to space limitations, we only show the results of ResNet-18 trained on the CIFAR10 dataset in Figure 3. A clear trend is that, compared to the contrastive model, the supervised model has a larger divergence between the classification loss (cross-entropy) for members and non-members. Recall that contrastive learning uses two augmented views of each sample in each epoch to train its base encoder and the original sample to train its classification layer. This indicates that each sample is generalized to multiple views during the contrastive model training process. In this way, the contrastive model reduces its memorization of the original sample itself.

Interestingly, Song et al. [48] observe that defense mechanisms for mitigating adversarial example attacks [3, 5, 38, 49] increase membership inference performance. This means such defenses and contrastive learning have different effects on membership privacy. On the one hand, these defense mechanisms for adversarial examples use original samples and their visually imperceptible adversarial examples to train a model; in this way, the model learns to remember each original sample more accurately. On the other hand, the augmented samples in contrastive learning are very different from their original samples (see Figure 4 for some examples). Therefore, membership inference is less effective against contrastive models.

We notice that the attack performance varies across different models and different datasets. We relate this to the different overfitting levels of machine learning models. Similar to previous work [41, 43], we measure the overfitting level of a target model by calculating the difference between its training accuracy and testing accuracy. In Figure 5, we see that the overfitting level is highly correlated with the attack performance: if a model is more overfitted, it is more vulnerable to membership inference attacks. For instance, in Figure 5a, the contrastive model has an overfitting level of 0.279 on CIFAR100, and the attack accuracy is 0.640, while the supervised model has a larger overfitting level (0.597) and higher attack accuracy (0.941). Another observation is that, compared to the supervised models, the overfitting levels of the contrastive models reside in a smaller range. We further measure the influence of the number of epochs used for training each contrastive model's classification layer on the attack performance. Figure 6 shows that the attack accuracy is rather stable; this means contrastive models consistently reduce the membership threat.

Figure 6: The performance of membership inference attacks against contrastive models with ResNet-50 on 5 different datasets under different numbers of epochs for classification layer training. The x-axis represents different numbers of epochs. The y-axis represents membership inference attacks' accuracy. Each line corresponds to a different dataset.

In conclusion, contrastive models are less vulnerable to membership inference attacks compared to supervised models.
4 Attribute Inference

In this section, we take a different angle to measure the privacy risks of contrastive learning using the attribute inference attack [31, 47]. We first formally define the attack and its threat model in Section 4.1, then describe the attack methodology in Section 4.2. In the end, we discuss the experimental settings and analyze the results in Section 4.3 and Section 4.4. Similar to membership inference attacks, we use existing attribute inference attacks [31, 47] to measure the contrastive model's privacy risks instead of inventing new methods.
4.1 Attack Definition and Threat Model

In attribute inference, the adversary's goal is to infer a specific sensitive attribute of a data sample from its representation generated by a target model. This sensitive attribute is not related to the target ML model's original classification task. For instance, a target model is designed to classify an individual's age from their social network posts, while attribute inference aims to infer their educational background.

Attribute inference attacks have been successfully performed on supervised models [31, 47]; the reason behind this is the intrinsic overlearning property of ML models. Overlearning means that an ML model trained for a certain task may also learn to represent other characteristics of data samples. Such representation capability, in some cases, can be exploited by an adversary to infer data samples' sensitive attributes.

Once a contrastive model is trained, it can generate a representation for each sample with its base encoder f. For attribute inference against contrastive learning, we leverage this representation. Given a data sample x and its representation from a target contrastive model, denoted by h = f(x), to conduct the attribute inference attack, the adversary trains an attack model A_AttInf formally defined as follows:

$\mathcal{A}_{AttInf}: h \mapsto s$  (8)

where s represents the sensitive attribute.

Regarding the threat model, we follow previous work [31, 47] by assuming the adversary has a set of samples and their sensitive attributes; this dataset is termed an auxiliary dataset D_aux. The adversary also needs to obtain representations for samples in D_aux from the target contrastive model. Therefore, we assume the adversary has white-box access to the target contrastive model. As argued by previous work, attribute inference attacks can be applied in both federated learning [31] and model partitioning [47] settings.

Figure 7: The performance of attribute inference attacks against both supervised models and contrastive models with ResNet-18 and ResNet-50 on 2 different datasets. The x-axis represents different datasets. The y-axis represents attribute inference attacks' accuracy.

Figure 8: The representations for 200 randomly selected samples generated by both the supervised model ((a) original classification task, (b) sensitive attribute) and the contrastive model ((c) original classification task, (d) sensitive attribute) with ResNet-18 on UTKFace, projected into a 2-dimension space using t-SNE. Each point represents a sample.

4.2 Attack Methodology

We generalize the methodology of attribute inference attacks against supervised models [31, 47] to contrastive models. The attack process can be partitioned into two stages, i.e., attack model training and attribute inference.
Attack Model Training.
To train the attack model A_AttInf, for each x ∈ D_aux, the adversary first obtains its representation, which is a vector from the target model. All the representations serve as the inputs, and their corresponding sensitive attributes serve as the labels for training the attack model.

Attribute Inference. To determine the sensitive attribute of a given data sample x, the adversary first queries the base encoder f and gets its representation h. Then, the adversary queries the attack model A_AttInf with h and obtains the sensitive attribute prediction.
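Putting the two stages together, a hedged sketch (PyTorch, illustrative names) of the attribute inference attack. The 2-layer MLP with 128 hidden neurons matches the attack model described in the experimental settings below; the full-batch training loop is a simplification.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def extract_representations(encoder, loader, device="cpu"):
    """Query the frozen base encoder f for representations h = f(x) and collect sensitive attributes s."""
    encoder.to(device).eval()
    reps, attrs = [], []
    for x, s in loader:                      # s is the sensitive attribute label
        reps.append(encoder(x.to(device)).cpu())
        attrs.append(s)
    return torch.cat(reps), torch.cat(attrs)

def train_attribute_attack(reps, attrs, num_attr_classes, epochs=100, lr=0.05):
    """Fit the attack model A_AttInf : h -> s on the auxiliary dataset (full-batch, for brevity)."""
    attack = nn.Sequential(nn.Linear(reps.size(1), 128), nn.ReLU(),
                           nn.Linear(128, num_attr_classes))
    optimizer = torch.optim.Adam(attack.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        criterion(attack(reps), attrs).backward()
        optimizer.step()
    return attack
```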
4.3 Experimental Settings

Datasets. As mentioned in Section 3.3, we utilize UTKFace and CelebA to evaluate attribute inference attacks as they contain extra attributes that can be considered sensitive attributes for our experiments. In UTKFace, the target model's classification task is gender classification, and the sensitive attribute is race (Black, White, Asian, Indian, and Other). In CelebA, the target classification task is gender classification, and the sensitive attribute is smiling (Smiling or No-smiling). We take D_target^train as the auxiliary dataset to train the attack model and take D_target^test to test the attack performance.
Metric. We adopt accuracy as the metric to evaluate attribute inference attacks, following previous work [31, 47].
Models.
All the target models' architectures are the same as those for membership inference attacks. Regarding the attack model, we leverage a 2-layer MLP with the number of neurons in the hidden layer set to 128. Hyperparameters, including loss function, optimizer, epochs, and learning rate, are the same as for the attack model used for membership inference. The dimension of each sample's representation from the last layer of the base encoder, i.e., the attack model's input, is 512 for ResNet-18 and 2,048 for ResNet-50.
4.4 Results

The performance of attribute inference attacks is depicted in Figure 7. First, we observe that, in general, attribute inference achieves effective performance except for ResNet-50 on the CelebA dataset (close to the prior sensitive attribute distribution in the attack training dataset).

Second, compared to the supervised models, the contrastive models are more vulnerable to attribute inference attacks. For instance, on the UTKFace dataset with ResNet-18, we can achieve an attack accuracy of 0.628 on the contrastive model while only 0.496 on the supervised model. To better understand this, we extract samples' representations (512-dimension) from ResNet-18 on UTKFace for both the supervised model and the contrastive model and project them into a 2-dimension space using t-Distributed Stochastic Neighbor Embedding (t-SNE) [52]: Figure 8a shows the results for the supervised model on the original classification task, i.e., gender classification; Figure 8b shows the results for the supervised model on attribute inference, i.e., race. We see that in Figure 8a, male samples (blue) and female samples (orange) reside in completely different regions, which can be separated perfectly (the gender classification accuracy is 0.873 in Figure 1). However, for the sensitive attribute (Figure 8b), samples of different classes are clustered tightly, which increases the difficulty for attribute inference. Figure 8c and Figure 8d show the corresponding results for the contrastive model. We observe that different samples' representations on the contrastive model are less separable with respect to the original classification task compared to the supervised model (see Figure 8c and Figure 8a), but we can still successfully separate most of them correctly (the gender classification accuracy is 0.788 in Figure 1), since most of the male samples (blue) lie in the upper-right area while the female samples (orange) are in the lower-left area. On the other hand, for the sensitive attribute, compared to the supervised model (Figure 8b), representations generated by the contrastive model (Figure 8d) are more distinguishable. Our finding reveals that the representations generated by the contrastive model are more informative, which can be exploited not only for the original classification tasks but also for attribute inference.

To study the effect of training dataset size on the attack model A_AttInf, we randomly select from 10% to 90% of the training dataset to train the attack model and evaluate the performance using the whole testing dataset; the results are summarized in Figure 9. We can observe that even when using 10% of the training dataset, the contrastive models are still more vulnerable (higher attribute inference accuracy) than the supervised models whose attack model is trained with the full training dataset. This further shows the privacy risks of contrastive learning.

We also observe that attribute inference attacks are more effective against less complex models (see Figure 7). For instance, both supervised models and contrastive models using ResNet-18 leak more information than those using ResNet-50. We conjecture that a complex model learns to represent each sample in a more complex space, which is harder for the attack model to decode.

In conclusion, contrastive models are more vulnerable to attribute inference attacks compared to supervised models.
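The 2-D projections discussed above (Figure 8) can be reproduced with a short scikit-learn sketch; taking the first 200 representations instead of a random subset is a simplification.

```python
import numpy as np
from sklearn.manifold import TSNE

def project_2d(representations, seed=0):
    """Project representations (e.g., 512-dim for ResNet-18) into 2-D for plots like Figure 8."""
    reps = np.asarray(representations)[:200]   # the paper uses 200 randomly selected samples
    return TSNE(n_components=2, random_state=seed).fit_transform(reps)
```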
5 Talos

So far, we have demonstrated that, compared to supervised models, contrastive models are more vulnerable to attribute inference attacks (Section 4) but less vulnerable to membership inference attacks (Section 3). In this section, we propose the first privacy-preserving contrastive learning mechanism, namely Talos, which aims to reduce the risks of attribute inference for contrastive models while maintaining their membership privacy and model utility. We first introduce Talos in Section 5.1. Experiment settings and results are then presented in Section 5.2 and Section 5.3, respectively.

Figure 9: The performance of attribute inference attacks against contrastive models with ResNet-18 on 2 different datasets under different percentages of the attack training dataset. The x-axis represents different percentages of the attack training dataset. The y-axis represents attribute inference attacks' accuracy. Each line corresponds to a different dataset. Each dashed line corresponds to the attribute inference attack's accuracy on the corresponding supervised model trained with the full training dataset.
Intuition.
As shown in Section 4, the reason a contrastive model is vulnerable to attribute inference attacks is that the model's base encoder f learns informative representations for data samples, which can be exploited by an adversary. To mitigate such a threat, we aim for a new training paradigm for contrastive learning which can eliminate data samples' sensitive attributes from their representations. Meanwhile, the base encoder of the contrastive model still needs to represent data samples expressively to preserve model utility. These two objectives are in conflict, and our defense mechanism should consider both simultaneously.

Methodology.
Our defense mechanism, namely Talos, can be modeled as a mini-max game, and we rely on adversarial training [10–12, 14, 55] to realize it. Similar to the original contrastive model, Talos also leverages a base encoder and a projection head to learn informative representations for data samples. Besides, Talos introduces an adversarial classifier C, which is used to censor sensitive attributes from data samples' representations.

The adversarial classifier of Talos is essentially designed for attribute inference. Similar to the original contrastive learning process (see Section 2), Talos is trained with mini-batches. Given a mini-batch of 2N augmented data samples (generated from N original samples), we define the loss of the adversarial classifier C as follows:

$\mathcal{L}_{C} = \frac{1}{2N} \sum_{k=1}^{N} [\mathcal{L}_{CE}(s_k, C(f(\tilde{x}_{2k-1}))) + \mathcal{L}_{CE}(s_k, C(f(\tilde{x}_{2k})))]$  (9)

where x̃_{2k−1} and x̃_{2k} are the two augmented samples of an original sample x_k, s_k represents x_k's sensitive attribute, f is the base encoder, and L_CE is the cross-entropy loss (Equation 3). Note that we consider x̃_{2k−1} and x̃_{2k} to share the same sensitive attribute as x_k.

Talos also adopts the original contrastive loss L_Contrastive (Equation 6). By jointly considering the adversarial classifier loss and the contrastive loss, Talos's loss function is defined as follows:

$\mathcal{L}_{Talos} = \mathcal{L}_{Contrastive} - \lambda \mathcal{L}_{C}$  (10)

where λ is the adversarial factor to balance the two losses. We refer to a model trained with Talos as a Talos model.

Algorithm 1: The training process of Talos.
Input: Target training dataset D_target^train with sensitive attribute s, base encoder f, projection head g, adversarial classifier C, and adversarial factor λ.
1:  Initialize f, g, and C's parameters.
2:  for each epoch do
3:    for each mini-batch do
4:      Sample a mini-batch with N training data samples and their corresponding sensitive attributes {(x_1, s_1), (x_2, s_2), ..., (x_N, s_N)} from D_target^train
5:      Generate augmented data samples {(x̃_1, s_1), (x̃_2, s_1), ..., (x̃_2N, s_N)}, where x̃_{2k−1} and x̃_{2k} are the two augmented views of x_k
6:      Feed the augmented data samples into the base encoder f and the projection head g to calculate the contrastive loss L_Contrastive (Equation 6)
7:      Feed the representations generated by the base encoder f into the adversarial classifier C to calculate the adversarial classifier loss L_C (Equation 9)
8:      Optimize the adversarial classifier C's parameters with the adversarial classifier loss L_C
9:      Optimize the projection head g's parameters with the contrastive loss L_Contrastive
10:     Optimize the base encoder f's parameters with the adversarial training loss L_Talos = L_Contrastive − λ L_C
11:    end for
12:  end for
Return: Base encoder f

Algorithm 1 presents the training process of Talos. In each mini-batch, given N training samples, we first generate 2N augmented views (Line 5) and feed them into the base encoder. The generated representations are then fed into the projection head (Line 6) and the adversarial classifier (Line 7) simultaneously. We then optimize the adversarial classifier with the cross-entropy loss (Line 8) and the projection head with the contrastive loss (Line 9). The base encoder, on the other hand, is optimized with the loss function of Talos, i.e., Equation 10 (Line 10).

To implement this in practice, we utilize the gradient reversal layer (GRL) proposed by Ganin et al. [13]. GRL is a layer that can be added between the base encoder f and the adversarial classifier C. In the forward propagation, GRL acts as an identity transform that simply copies the input to the output. During backpropagation, GRL takes the gradients passed through it from the adversarial classifier C, multiplies the gradients by −λ, and passes them to the base encoder f. Such an operation lets the base encoder receive the opposite direction of gradients from the adversarial classifier. In this way, the base encoder f is able to learn informative representations for samples while censoring their sensitive attributes.

Note that our adversarial training is performed only during the process of training the base encoder f. The training for the classification layer of the contrastive model remains unchanged. As we show in Section 3, the classification layer generalizes well on the contrastive models, i.e., it is less prone to overfitting. Therefore, models trained by Talos should be robust against membership inference attacks as well.
5.2 Experimental Settings

Adaptive Membership Inference Attack.
An adversary needs to establish a shadow model to mount membership inference attacks (see Section 3.2). To evaluate the membership privacy risks of Talos, we consider an adaptive (and stronger) adversary [23]. Concretely, we assume that the adversary knows the training details of Talos and trains their shadow model in the same way. Note that attribute inference attacks do not require a shadow model (see Section 4.2); therefore, we do not have an adaptive adversary in this case.
We follow the same experimental settings, including datasets, metrics, target models, and attack models (both attribute inference and membership inference), as those in Section 3.3 and Section 4.3. As mentioned before, membership inference attacks are performed in an adaptive way. Regarding the adversarial classifier of Talos, we leverage a 2-layer MLP with 64 neurons in the hidden layer, which is smaller than the attribute inference attack model. We further set the adversarial factor λ to 10 and evaluate its influence on the attack performance in our experiments.

5.3 Results

The performance of attribute inference attacks and membership inference attacks for the original contrastive models and the
Talos models are depicted in Figure 10. First of all,
Talos indeed reduces the attribute inference accuracy compared to the original contrastive learning. For instance, the attribute inference accuracy is 0.695 on the original contrastive model with ResNet-18 on the CelebA dataset, while only 0.517 on the Talos model. Meanwhile, the testing accuracy of the original classification task for the
Talos model only drops by 0.014 compared to the original contrastive model (see Table 1). This demonstrates that Talos can effectively eliminate sensitive attributes from data samples' representations while maintaining model utility.

Second, we observe that both contrastive models and Talos models are robust against membership inference attacks (see Figure 10). This shows that Talos indeed does not affect the generalizability of the classification layer of contrastive learning, as discussed before (see Section 5.1).

Figure 10: The performance of membership inference attacks and attribute inference attacks against original contrastive models and Talos models (λ = 10) with ResNet-18 and ResNet-50 on 2 different datasets ((a) UTKFace, (b) CelebA). The x-axis represents different attacks. The y-axis represents attack accuracy.

Figure 11: The performance of (a) original classification tasks, (b) membership inference attacks, and (c) attribute inference attacks for the Talos models with ResNet-18 and ResNet-50 on 2 different datasets under different adversarial factors λ. The x-axis represents different λ. The y-axis represents the corresponding performance.

We also investigate the effect of the adversarial factor λ on the performance of original classification tasks, membership inference attacks, and attribute inference attacks. The results are summarized in Figure 11. In general, a larger λ leads to weaker attribute inference (see Figure 11c). This is expected, as a larger λ increases the contribution of the adversarial classifier loss for Talos. In particular, when λ is set to 10, we observe the lowest attribute inference accuracy in all scenarios. Another finding is that, in general, the performance of original classification tasks and membership inference attacks is stable with respect to different adversarial factors. This shows our choice of setting λ to 10 is suitable. Note that in other settings, the adversarial factor λ can be treated as an important hyperparameter for Talos, which can be further fine-tuned to satisfy the corresponding requirements.

Table 1: The performance of original classification tasks (testing accuracy) for both original contrastive models and Talos models (λ = 10).

Dataset, Model       | Original | Talos | Utility Loss
UTKFace, ResNet-18   | 0.788    | 0.747 | 0.041
UTKFace, ResNet-50   | 0.781    | 0.647 | 0.134
CelebA, ResNet-18    | 0.856    | 0.842 | 0.014
CelebA, ResNet-50    | 0.848    | 0.829 | 0.019

In conclusion, Talos can successfully defend against attribute inference attacks for contrastive models without jeopardizing their membership privacy and model utility.
6 Related Work
Contrastive Learning.
Contrastive learning is one of the most popular self-supervised learning paradigms [8, 15, 17, 24, 51, 57]. Oord et al. [51] propose contrastive predictive coding, which leverages autoregressive models to predict future observations for data samples. Wu et al. [54] utilize a memory bank to save instance representations and k-nearest neighbors to conduct prediction. He et al. [17] introduce MoCo, which relies on momentum to update the key encoder with the query encoder to maintain consistency. Chen et al. [8] propose SimCLR, which leverages data augmentation and the projection head to enhance the performance of contrastive models. SimCLR is the most prominent contrastive learning paradigm at the moment [29]; thus, we concentrate on it in this paper.
Membership Inference Attack.
In membership inference, the adversary's goal is to infer whether a given data sample is used to train a target model. Right now, membership inference is one of the major means of measuring the privacy risks of machine learning models [16, 26, 33, 41, 43, 48, 56]. Shokri et al. [43] propose the first membership inference attack in the black-box setting. Specifically, they rely on training multiple shadow models to mimic the behavior of a target model to derive the data for training their attack models. Salem et al. [41] further relax the assumptions made by Shokri et al. [43] and propose three novel attacks. Later, Nasr et al. [33] conduct a comprehensive analysis of membership privacy under both black-box and white-box settings for centralized as well as federated learning scenarios. Song et al. [48] study the synergy between adversarial examples and membership inference and show that membership privacy risks increase when a model owner applies measures to defend against adversarial example attacks. To mitigate membership inference, many defense mechanisms have been proposed [23, 32, 41]. Nasr et al. [32] introduce an adversarial regularization term into a target model's loss function. Salem et al. [41] propose to use dropout and model stacking to reduce model overfitting, the main reason behind the success of membership inference. Jia et al. [23] rely on adversarial examples to craft noise to add to a target sample's posteriors.
Attribute Inference Attack.
Another major type of privacy attack against ML models is attribute inference. Here, an adversary aims to infer a specific sensitive attribute of a data sample from its representation generated by a target model [31, 47]. Melis et al. [31] propose the first attribute inference attack against machine learning, in particular federated learning. Song and Shmatikov [47] later show that attribute inference attacks are also effective against another training paradigm, namely model partitioning. They further demonstrate that the success of attribute inference is due to the overlearning behavior of ML models. More recently, Song and Raghunathan [44] demonstrate that language models are also vulnerable to attribute inference.
Other Attacks Against Machine Learning Models.
Besides membership inference and attribute inference, there exists a plethora of other attacks against ML models [2, 4, 19, 22, 27, 36, 37, 40, 42, 45]. One major attack is the adversarial example [3, 5, 38, 49], where an adversary aims to add imperceptible noise to data samples to evade a target ML model. Another representative attack in this domain is model extraction, the goal of which is to learn a target model's parameters [21, 25, 35, 50] or hyperparameters [34, 53].
7 Conclusion

In this paper, we perform the first privacy quantification of the most representative self-supervised learning paradigm, i.e., contrastive learning. Concretely, we investigate the privacy risks of contrastive models through the lens of membership inference and attribute inference. Empirical evaluation shows that contrastive models are less vulnerable to membership inference attacks compared to supervised models. This is due to the fact that contrastive models are normally less overfitted. Meanwhile, contrastive models are more prone to attribute inference attacks. We posit this is because contrastive models can generate more informative representations for data samples, which can be exploited by an adversary to achieve effective attribute inference.

To reduce the risks of attribute inference stemming from contrastive models, we propose the first privacy-preserving contrastive learning mechanism, namely Talos. Specifically, Talos introduces an adversarial classifier to censor the sensitive attributes learned by the contrastive models under the adversarial training framework. Our evaluation shows that Talos can effectively mitigate the attribute inference risks for contrastive models while maintaining their membership privacy and model utility.
References

[1] .

[2] Santiago Zanella Béguelin, Lukas Wutschitz, Shruti Tople, Victor Rühle, Andrew Paverd, Olga Ohrimenko, Boris Köpf, and Marc Brockschmidt. Analyzing Information Leakage of Updates to Natural Language Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 363–375. ACM, 2020.

[3] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion Attacks against Machine Learning at Test Time. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pages 387–402. Springer, 2013.

[4] Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting Training Data from Large Language Models. CoRR abs/2012.07805, 2020.

[5] Nicholas Carlini and David Wagner. Towards Evaluating the Robustness of Neural Networks. In IEEE Symposium on Security and Privacy (S&P), pages 39–57. IEEE, 2017.

[6] Dingfan Chen, Ning Yu, Yang Zhang, and Mario Fritz. GAN-Leaks: A Taxonomy of Membership Inference Attacks against Generative Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 343–362. ACM, 2020.

[7] Min Chen, Zhikun Zhang, Tianhao Wang, Michael Backes, Mathias Humbert, and Yang Zhang. When Machine Unlearning Jeopardizes Privacy. CoRR abs/2005.02205, 2020.

[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A Simple Framework for Contrastive Learning of Visual Representations. In International Conference on Machine Learning (ICML), pages 1597–1607. PMLR, 2020.

[9] Adam Coates, Andrew Y. Ng, and Honglak Lee. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 215–223. JMLR, 2011.

[10] Maximin Coavoux, Shashi Narayan, and Shay B. Cohen. Privacy-preserving Neural Representations of Text. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1–10. ACL, 2018.

[11] Harrison Edwards and Amos J. Storkey. Censoring Representations with an Adversary. In International Conference on Learning Representations (ICLR), 2016.

[12] Yanai Elazar and Yoav Goldberg. Adversarial Removal of Demographic Attributes from Text Data. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 11–21. ACL, 2018.

[13] Yaroslav Ganin and Victor S. Lempitsky. Unsupervised Domain Adaptation by Backpropagation. In International Conference on Machine Learning (ICML), pages 1180–1189. JMLR, 2015.

[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Annual Conference on Neural Information Processing Systems (NIPS), pages 2672–2680. NIPS, 2014.

[15] Michael Gutmann and Aapo Hyvärinen. Noise-Contrastive Estimation: A New Estimation Principle for Unnormalized Statistical Models. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 297–304. JMLR, 2010.

[16] Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. LOGAN: Evaluating Privacy Leakage of Generative Models Using Generative Adversarial Networks. Symposium on Privacy Enhancing Technologies Symposium, 2019.

[17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum Contrast for Unsupervised Visual Representation Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9726–9735. IEEE, 2020.

[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE, 2016.

[19] Xinlei He, Jinyuan Jia, Michael Backes, Neil Zhenqiang Gong, and Yang Zhang. Stealing Links from Graph Neural Networks. In USENIX Security Symposium (USENIX Security). USENIX, 2021.

[20] R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning Deep Representations by Mutual Information Estimation and Maximization. In International Conference on Learning Representations (ICLR), 2019.

[21] Matthew Jagielski, Nicholas Carlini, David Berthelot, Alex Kurakin, and Nicolas Papernot. High Accuracy and High Fidelity Extraction of Neural Networks. In USENIX Security Symposium (USENIX Security), pages 1345–1362. USENIX, 2020.

[22] Matthew Jagielski, Alina Oprea, Battista Biggio, Chang Liu, Cristina Nita-Rotaru, and Bo Li. Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning. In IEEE Symposium on Security and Privacy (S&P), pages 19–35. IEEE, 2018.

[23] Jinyuan Jia, Ahmed Salem, Michael Backes, Yang Zhang, and Neil Zhenqiang Gong. MemGuard: Defending against Black-Box Membership Inference Attacks via Adversarial Examples. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 259–274. ACM, 2019.

[24] Yizhu Jiao, Yun Xiong, Jiawei Zhang, Yao Zhang, Tianqi Zhang, and Yangyong Zhu. Sub-graph Contrast for Scalable Self-Supervised Graph Representation Learning. CoRR abs/2009.10273, 2020.

[25] Kalpesh Krishna, Gaurav Singh Tomar, Ankur P. Parikh, Nicolas Papernot, and Mohit Iyyer. Thieves of Sesame Street: Model Extraction on BERT-based APIs. In International Conference on Learning Representations (ICLR), 2020.

[26] Klas Leino and Matt Fredrikson. Stolen Memories: Leveraging Model Memorization for Calibrated White-Box Membership Inference. In USENIX Security Symposium (USENIX Security), pages 1605–1622. USENIX, 2020.

[27] Shaofeng Li, Shiqing Ma, Minhui Xue, and Benjamin Zi Hao Zhao. Deep Learning Backdoors. CoRR abs/2007.08273, 2020.

[28] Zheng Li and Yang Zhang. Label-Leaks: Membership Inference Attack with Label. CoRR abs/2007.15528, 2020.

[29] Xiao Liu, Fanjin Zhang, Zhenyu Hou, Zhaoyu Wang, Li Mian, Jing Zhang, and Jie Tang. Self-supervised Learning: Generative or Contrastive.
CoRRabs/2006.08218 , 2020. 1, 3, 12[30] Ziwei Liu, Ping Luo, Xiaogang Wang, and XiaoouTang. Deep Learning Face Attributes in the Wild.In
IEEE International Conference on Computer Vision(ICCV) , pages 3730–3738. IEEE, 2015. 4[31] Luca Melis, Congzheng Song, Emiliano De Cristofaro,and Vitaly Shmatikov. Exploiting Unintended FeatureLeakage in Collaborative Learning. In
IEEE Sympo-sium on Security and Privacy (S&P) , pages 497–512.IEEE, 2019. 1, 3, 7, 8, 12[32] Milad Nasr, Reza Shokri, and Amir Houmansadr. Ma-chine Learning with Membership Privacy using Adver-sarial Regularization. In
ACM SIGSAC Conference onComputer and Communications Security (CCS) , pages634–646. ACM, 2018. 12[33] Milad Nasr, Reza Shokri, and Amir Houmansadr. Com-prehensive Privacy Analysis of Deep Learning: Pas-sive and Active White-box Inference Attacks againstCentralized and Federated Learning. In
IEEE Sympo-sium on Security and Privacy (S&P) , pages 1021–1035.IEEE, 2019. 12[34] Seong Joon Oh, Max Augustin, Bernt Schiele, andMario Fritz. Towards Reverse-Engineering Black-BoxNeural Networks. In
International Conference onLearning Representations (ICLR) , 2018. 4, 12[35] Tribhuvanesh Orekondy, Bernt Schiele, and MarioFritz. Knockoff Nets: Stealing Functionality of Black-Box Models. In
IEEE Conference on Computer Visionand Pattern Recognition (CVPR) , pages 4954–4963.IEEE, 2019. 4, 12[36] Xudong Pan, Mi Zhang, Shouling Ji, and Min Yang.Privacy Risks of General-Purpose Language Models.In
IEEE Symposium on Security and Privacy (S&P) ,pages 1471–1488. IEEE, 2020. 12[37] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha,and Michael Wellman. SoK: Towards the Science ofSecurity and Privacy in Machine Learning. In
IEEEEuropean Symposium on Security and Privacy (EuroS&P) , pages 399–414. IEEE, 2018. 12[38] Nicolas Papernot, Patrick D. McDaniel, Somesh Jha,Matt Fredrikson, Z. Berkay Celik, and AnanthramSwami. The Limitations of Deep Learning in Adversar-ial Settings. In
IEEE European Symposium on Securityand Privacy (Euro S&P) , pages 372–387. IEEE, 2016.6, 12 [39] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause,Sanjeev Satheesh, Sean Ma, Zhiheng Huang, AndrejKarpathy, Aditya Khosla, Michael Bernstein, Alexan-der C. Berg, and Li Fei-Fei. ImageNet Large ScaleVisual Recognition Challenge.
CoRR abs/1409.0575 ,2015. 2[40] Ahmed Salem, Apratim Bhattacharya, Michael Backes,Mario Fritz, and Yang Zhang. Updates-Leak: Data SetInference and Reconstruction Attacks in Online Learn-ing. In
USENIX Security Symposium (USENIX Secu-rity) , pages 1291–1308. USENIX, 2020. 12[41] Ahmed Salem, Yang Zhang, Mathias Humbert, PascalBerrang, Mario Fritz, and Michael Backes. ML-Leaks:Model and Data Independent Membership InferenceAttacks and Defenses on Machine Learning Models. In
Network and Distributed System Security Symposium(NDSS) . Internet Society, 2019. 1, 3, 4, 7, 12[42] Roei Schuster, Congzheng Song, Eran Tromer, andVitaly Shmatikov. You Autocomplete Me: Poison-ing Vulnerabilities in Neural Code Completion.
CoRRabs/2007.02220 , 2020. 12[43] Reza Shokri, Marco Stronati, Congzheng Song, and Vi-taly Shmatikov. Membership Inference Attacks AgainstMachine Learning Models. In
IEEE Symposium on Se-curity and Privacy (S&P) , pages 3–18. IEEE, 2017. 1,3, 4, 7, 12[44] Congzheng Song and Ananth Raghunathan. Informa-tion Leakage in Embedding Models. In
ACM SIGSACConference on Computer and Communications Secu-rity (CCS) , pages 377–390. ACM, 2020. 12[45] Congzheng Song, Thomas Ristenpart, and VitalyShmatikov. Machine Learning Models that RememberToo Much. In
ACM SIGSAC Conference on Computerand Communications Security (CCS) , pages 587–601.ACM, 2017. 12[46] Congzheng Song and Vitaly Shmatikov. Auditing DataProvenance in Text-Generation Models. In
ACM Con-ference on Knowledge Discovery and Data Mining(KDD) , pages 196–206. ACM, 2019. 1[47] Congzheng Song and Vitaly Shmatikov. OverlearningReveals Sensitive Attributes. In
International Confer-ence on Learning Representations (ICLR) , 2020. 1, 3,7, 8, 12[48] Liwei Song, Reza Shokri, and Prateek Mittal. PrivacyRisks of Securing Machine Learning Models againstAdversarial Examples. In
ACM SIGSAC Conference onComputer and Communications Security (CCS) , pages241–257. ACM, 2019. 6, 12[49] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, IanGoodfellow, Dan Boneh, and Patrick McDaniel. En-semble Adversarial Training: Attacks and Defenses. In
International Conference on Learning Representations(ICLR) , 2017. 6, 121450] Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Re-iter, and Thomas Ristenpart. Stealing Machine Learn-ing Models via Prediction APIs. In
USENIX Secu-rity Symposium (USENIX Security) , pages 601–618.USENIX, 2016. 4, 12[51] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Rep-resentation Learning with Contrastive Predictive Cod-ing.
CoRR abs/1807.03748 , 2018. 1, 2, 3, 4, 12[52] Laurens van der Maaten and Geoffrey Hinton. Visual-izing Data using t-SNE.
Journal of Machine LearningResearch , 2008. 9[53] Binghui Wang and Neil Zhenqiang Gong. Stealing Hy-perparameters in Machine Learning. In
IEEE Sym-posium on Security and Privacy (S&P) , pages 36–52.IEEE, 2018. 4, 12[54] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, andDahua Lin. Unsupervised Feature Learning via Non-Parametric Instance Discrimination. In
IEEE Con-ference on Computer Vision and Pattern Recognition(CVPR) , pages 3733–3742. IEEE, 2018. 12[55] Qizhe Xie, Zihang Dai, Yulun Du, Eduard H. Hovy,and Graham Neubig. Controllable Invariance throughAdversarial Feature Learning. In
Annual Conference onNeural Information Processing Systems (NIPS) , pages585–596. NIPS, 2017. 9[56] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, andSomesh Jha. Privacy Risk in Machine Learning: An-alyzing the Connection to Overfitting. In
IEEE Com-puter Security Foundations Symposium (CSF) , pages268–282. IEEE, 2018. 3, 12[57] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen,Zhangyang Wang, and Yang Shen. Graph ContrastiveLearning with Augmentations. In
Annual Conferenceon Neural Information Processing Systems (NeurIPS) .NeurIPS, 2020. 1, 2, 4, 12[58] Zhifei Zhang, Yang Song, and Hairong Qi. Age Pro-gression/Regression by Conditional Adversarial Au-toencoder. In