Lethean Attack: An Online Data Poisoning Technique
Eyal Perry
MIT Media Lab, Massachusetts Institute of Technology
[email protected]
MIT 9.520/6.860: Statistical Learning Theory and Applications (Fall 2019)
Abstract
Data poisoning is an adversarial scenario where an attacker feeds a specially crafted sequence of samples to an online model in order to subvert learning. We introduce Lethean Attack, a novel data poisoning technique which induces catastrophic forgetting on an online model. We apply the attack in the context of Test-Time Training, a modern online learning framework aimed at generalization under distribution shifts. We present the theoretical rationale and empirically compare against other sample sequences which naturally induce forgetting. Our results demonstrate that using lethean attacks, an adversary could revert a test-time training model back to coin-flip accuracy using a short sample sequence.
Introduction

Modern machine learning models, especially deep neural networks, have seemingly achieved human performance in problems such as image classification [Krizhevsky et al., 2012], speech recognition [Hinton et al., 2012], natural language understanding [Devlin et al., 2018] and others. One of the major underlying assumptions behind neural networks is that the training data and test instances are drawn from the same distribution. However, under minor differences in the distribution, state-of-the-art models often crumble [Sun et al., 2020]. Methods such as domain adaptation [Tzeng et al., 2017, Ganin et al., 2015] and adversarial robustness [Madry et al., 2017] have tackled shifts in the distribution by either assuming access to data from the new distribution at training time, or assuming a structure of the perturbations.

Online learning [Shalev-Shwartz et al., 2012] is a domain of machine learning where a learner is given a sequence of instances from a distribution X. For each instance x_t, the learner makes a prediction and updates the model. In the classical online learning setting, there exists an oracle that gives the ground truth y_t for each time step t. One of the advantages of an online learning paradigm is the ability to adapt a model under distribution shifts. A well-researched pitfall of online neural networks is catastrophic forgetting [McCloskey and Cohen, 1989, Ratcliff, 1990], the tendency of a model to forget previously learned information upon the arrival of new instances.

Sun et al. [2020] recently proposed a novel online approach named Test-Time Training to achieve better generalization under distribution shifts, without prior knowledge. In summary, test-time training adapts model parameters at testing time, using a self-supervised task. Let's consider the supervised setting, e.g. image classification. Alongside the main task, test-time training requires another self-supervised auxiliary task, e.g. predicting image rotation by an angle (0°, 90°, 180°, 270°). The model predicts two outputs (e.g. image class and rotation) and, importantly, a significant percentage of the model parameters are shared across the two tasks. Hendrycks et al. [2019] showed that joint training with a self-supervised task increases robustness of the model. Yet, Sun et al. [2020] take one step further than joint training. At testing time, the new instance x is first handled by the self-supervised task. Then, the model parameters are updated according to the sub-task loss. Finally, the adapted model is run on x with the main task to get a prediction.

Sun et al. [2020] presented promising results. By applying the test-time training paradigm to image classification problems, their neural network successfully adapted to various types and levels of corruptions in both CIFAR-10 [Krizhevsky et al., 2009] and ImageNet [Russakovsky et al., 2015] and surpassed state-of-the-art models by a large margin. The most significant improvement occurs in an online setting, where at each time step t the model is adapted and saved for future test instances. It is important to note the difference from classical online learning. While both frameworks perform model updates at each time step, test-time training does not assume or need any oracle. It adapts to new instances as they arrive, regardless of labels, simply according to their distribution.
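To make the test-time update concrete, the following is a minimal sketch of one online step in PyTorch. The split into an encoder and two heads, the module names, and the optimizer are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def ttt_online_step(encoder, main_head, aux_head, optimizer, x):
    """One test-time training step on a single unlabeled instance x.

    x: tensor of shape (1, C, H, W). encoder, main_head, aux_head and an
    optimizer over the shared encoder + aux head are assumed to exist.
    """
    # Self-supervised task: predict which rotation was applied to x.
    rotations = torch.cat([torch.rot90(x, k, dims=(2, 3)) for k in range(4)])
    rot_labels = torch.arange(4, device=x.device)

    optimizer.zero_grad()
    aux_loss = F.cross_entropy(aux_head(encoder(rotations)), rot_labels)
    aux_loss.backward()
    optimizer.step()  # shared parameters adapt to x before prediction

    # Main task: classify x with the freshly adapted parameters.
    with torch.no_grad():
        return main_head(encoder(x)).argmax(dim=1)
```

In the online setting, the adapted parameters are then kept for the next instance rather than reset.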
By merging the training and testing phases, test-time training has become exposed to an adversarial scenario named data poisoning [Zhang et al., 2019, Wang and Chaudhuri, 2018]. In online data poisoning, the attacker feeds the model a sequence of specially crafted samples in order to manipulate learning. Moreover, Goodfellow et al. [2013] demonstrated that multi-task models are especially prone to catastrophic forgetting. In this work, we propose a practical methodology for test-time training data poisoning, the lethean attack (after Lethe, a river in Hades whose waters cause drinkers to forget their past), which induces catastrophic forgetting on a model. First, we describe the attack method and the rationale behind it. Then, we present an experiment to compare forgetfulness of test-time training under various adversarial scenarios. Last, we discuss defense tactics and directions for future research.

Lethean Attack

Let f denote a test-time training classifier and let D = (x_1, ..., x_n) be the data f has already seen. The objective of a lethean attack is to find a sequence S = (x*_1, ..., x*_t) such that after f adapts to S, its accuracy on samples from D is not better than chance.

The theoretical basis for test-time training lies in the following theorem.

Theorem 1 (Sun et al. [2020]). Let l_m(x, y; θ) denote the main task loss on test instance x, y with parameters θ, and l_s(x; θ) the self-supervised task loss that only depends on x. Assume that for all x, y, l_m(x, y; θ) is differentiable, convex and β-smooth in θ, and both ‖∇l_m(x, y; θ)‖, ‖∇l_s(x; θ)‖ ≤ G for all θ. With a fixed learning rate η = ε/(βG²), for every x, y such that

⟨∇l_m(x, y; θ), ∇l_s(x; θ)⟩ > ε    (1)

we have

l_m(x, y; θ) > l_m(x, y; θ(x))    (2)

where θ(x) = θ − η∇l_s(x; θ), i.e. test-time training with one step of gradient descent.

Meaning, as long as the gradients of the main and auxiliary losses are positively correlated, test-time training is expected to perform well.
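For intuition, (2) follows from the standard descent lemma for β-smooth functions; the following one-step bound is our reconstruction of the argument, not the authors' proof verbatim. Since l_m is β-smooth in θ and θ(x) = θ − η∇l_s(x; θ),

l_m(x, y; θ(x)) ≤ l_m(x, y; θ) − η⟨∇l_m(x, y; θ), ∇l_s(x; θ)⟩ + (βη²/2)‖∇l_s(x; θ)‖²
< l_m(x, y; θ) − ηε + (βη²/2)G² = l_m(x, y; θ) − ε²/(2βG²),

where the strict inequality uses (1) and the gradient bound G, and the last equality substitutes η = ε/(βG²). One gradient step on the self-supervised loss therefore strictly decreases the main task loss.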
An adversary could take advantage of the above assumption by crafting samples for which the main and auxiliary loss gradients have a negative correlation. Formally, find x* for which:

⟨∇l_m(x*, y; θ), ∇l_s(x*; θ)⟩ < 0    (3)

Notice the difference between standard adversarial examples [Szegedy et al., 2013, Goodfellow et al., 2014] and lethean attacks. We do not require x* to be indistinguishable or even similar to a real sample point. The adversary's objective is to poison the model for future instances, rather than alter the classification of a present instance.

For certain loss functions and data distributions we could compute x* for (3) directly. Instead, we apply a trick. Remember that f has already seen, and therefore trained on, samples from D. Thus, we can expect that the main and auxiliary loss gradients are positively correlated:

E_{x∈D} ⟨∇l_m(x, y; θ), ∇l_s(x; θ)⟩ > 0    (4)

Consequently, we can search for x* for which:

E_{x∈D} ⟨∇l_m(x, y; θ), ∇l_m(x*, y*; θ)⟩ > 0    (5)

E_{x∈D} ⟨∇l_s(x; θ), ∇l_s(x*; θ)⟩ < 0    (6)

If the correlations in (4), (5) and (6) are strong enough, then (3) is implied. In simple words, we try to find x* which, on the one hand, correlates positively with the historical main task gradients and, on the other hand, correlates negatively with the historical auxiliary task gradients. Note that due to symmetry, samples that correlate negatively with the historical main task and positively with the historical sub-task would also work, although we argue that in reality these would be harder to conjure. In the next section, we present a practical example.
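Before turning to that example, here is a minimal sketch of how an attacker could screen candidate samples by these gradient correlations. The function and its arguments are illustrative assumptions, not part of the released test-time training code.

```python
import torch

def grad_inner_product(shared_params, loss_a, loss_b):
    """Compute the inner product of the gradients of two scalar losses
    over the parameters shared by both heads.

    shared_params: list of parameter tensors shared across tasks;
    loss_a, loss_b: scalar losses already computed on the relevant
    inputs (e.g. historical samples vs. a candidate x*).
    """
    # retain_graph protects the case where both losses come from the
    # same forward pass through the shared encoder.
    g_a = torch.autograd.grad(loss_a, shared_params, retain_graph=True)
    g_b = torch.autograd.grad(loss_b, shared_params)
    flatten = lambda gs: torch.cat([g.flatten() for g in gs])
    return torch.dot(flatten(g_a), flatten(g_b))

# Screening a candidate x*: keep it when the main-task gradients agree
# with history, as in (5), while the self-supervised gradients oppose
# history, as in (6):
#   grad_inner_product(shared, main_loss_hist, main_loss_candidate) > 0
#   grad_inner_product(shared, aux_loss_hist, aux_loss_candidate) < 0
```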
Experiments

Based on the code released by Sun et al. [2020], we trained from scratch a test-time training network based on the ResNet18 architecture [He et al., 2016]. The model has two heads which correspond to the classification task and the self-supervised task. The auxiliary task is prediction of rotation by a fixed angle (0°, 90°, 180°, 270°). ResNet18 has four "groups", and the split point of the shared parameters between the two tasks is right after the first three. Similarly to the original implementation, we used Group Normalization [Wu and He, 2018] to prevent inaccuracies for small batches. The model was trained on the CIFAR-10 dataset [Krizhevsky et al., 2009], including data augmentations. Optimization was done using stochastic gradient descent (SGD) with momentum and weight decay; the learning rate starts at 0.1 and is dropped by 10% every 50 epochs. Training was done on an Nvidia GTX 1080 Ti with batch size 128, for 137 epochs. The final test accuracy of the trained model is 90.2%.

At test time, new instances are evaluated sequentially. Each test instance x is rotated in all four angles, for which the auxiliary loss is computed. The model parameters are updated according to the auxiliary loss with a single step of learning rate 0.001.

To perform a lethean attack, we need to craft a sample that is (1) positively correlated with the classification loss gradients for CIFAR-10 training data and (2) negatively correlated with the rotation loss gradients for the same data. To achieve (1), we pick a sample from the CIFAR-10 training data. For (2), we rotate the image by 90°, 180° or 270°. This simple change is expected to cause a negative correlation between the rotation loss gradient of our adversarial sample and that of the original, unrotated image.

The experiment procedure is as follows: at each time step, pick a random sample from the training set, rotate it and feed it to the online network. Save the adapted network for future time steps. Every 50 time steps, we evaluate the performance of the network (without adaptation) on the CIFAR-10 test set. Repeat until we reach coin-flip accuracy (~10% for CIFAR-10).

It is well known that online learning algorithms are naturally prone to forgetfulness, even without data poisoning. Moreover, it could be that existing adversarial methods, such as the Fast Gradient Sign Method (FGSM) [Goodfellow et al., 2014], would disrupt the model just as badly. To prove the effectiveness of the lethean attack, we run the exact same procedure using three other methods: (1) random pixel images, (2) distribution shifts and (3) FGSM attacks. For (1), we generate images where each pixel value is drawn from a normal distribution with the same mean and variance as the pixels in the training set. For (2), we follow the evaluation framework used by Sun et al. [2020] by picking a random sample from CIFAR-10-C [Hendrycks and Dietterich, 2019], a dataset of noisy and corrupted images based on CIFAR-10. The dataset contains 15 types of noise and perturbations, each with five levels of intensity. We evaluated three types of noise (Gaussian, Shot, Impulse) at the highest intensity level (5). Since all noise types gave extremely similar results, we present only the effect of Gaussian noise. For (3), we pick a random sample from the training set and run the network once (with no adaptation). We take the sign of the gradient for each pixel in the image and use it to perturb (ε = 0.2) the image, which is then fed to the online model.

Results

The results are presented in Figure 1. The only method to induce complete forgetting is the lethean attack. Only a few dozen examples are needed to heavily disrupt test performance, and after 1000 examples the model is back to coin-flip accuracy. Random pixels do not affect the network.

The code used to conduct the experiments is available at https://github.com/eyalperry88/lethean.
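For reference, the following is a condensed sketch of the attack loop just described. The one-step adaptation wrapper and the evaluation callback are assumptions for illustration; the actual experiment code lives in the repository above.

```python
import random
import torch

def lethean_attack(ttt_model, train_set, evaluate, steps=1000, eval_every=50):
    """Poison an online test-time training model with rotated training images.

    ttt_model.adapt(x) is an assumed wrapper around one self-supervised
    update; evaluate(ttt_model) is an assumed callback returning frozen
    test accuracy; train_set yields (image_tensor, label) pairs.
    """
    history = []
    for t in range(1, steps + 1):
        x, _ = random.choice(train_set)          # in-distribution sample, targets (5)
        k = random.choice([1, 2, 3])             # rotate by 90/180/270 degrees, targets (6)
        x_star = torch.rot90(x.unsqueeze(0), k, dims=(2, 3))
        ttt_model.adapt(x_star)                  # poisoned online update is kept
        if t % eval_every == 0:                  # periodic frozen evaluation
            history.append((t, evaluate(ttt_model)))
    return history
```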
Discussion

One defense tactic for test-time training against lethean attacks could be a different auxiliary loss function which is not as susceptible to malicious examples. A more robust approach could be regularization that controls the correlation between new gradient updates and historical ones, i.e., limiting each learning step to be always somewhat correlated with previous learning steps. We cannot apply this method at the beginning of training, but once a model reaches high performance, activating correlation regularization could prevent "untraining" the model back to coin-flip accuracy. A drawback of this scheme is adaptation to abrupt distribution shifts; it is therefore more fitting for real-life scenarios where distribution shifts are smooth and have a small Lipschitz constant. It is worth noting that in Sun et al. [2020]'s implementation of test-time training, there exists a hyper-parameter, threshold confidence, which controls which samples are trained on, so that samples the model is highly confident about do not alter the model. From our preliminary experimentation, this parameter slows down but does not prevent a lethean attack. A sketch of the correlation-regularization idea is given below.
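This is a minimal sketch of a correlation-gated update, assuming access to the shared parameters and a running history of past flattened gradients; the gating rule, decay rate and threshold are our assumptions, not a tested defense.

```python
import torch
import torch.nn.functional as F

def guarded_update(params, grads, grad_history, lr=1e-3, decay=0.99, min_corr=0.0):
    """Apply a self-supervised gradient step only if it agrees with history.

    params/grads: matching lists of shared-parameter tensors and their
    gradients; grad_history: running average of past flattened gradients
    (updated in place). Returns the cosine similarity used for the gate.
    """
    flat = torch.cat([g.flatten() for g in grads])
    corr = F.cosine_similarity(flat, grad_history, dim=0)
    if corr >= min_corr:                 # reject anti-correlated, possibly poisoned, steps
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= lr * g
        # only accepted steps shape the history, so smooth shifts still adapt
        grad_history.mul_(decay).add_(flat, alpha=1 - decay)
    return corr
```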
Since catastrophic forgetting is an inherent feature of current online neural networks, test-time training could benefit from the field of continual learning [Li and Hoiem, 2017, Lopez-Paz and Ranzato, 2017, Kirkpatrick et al., 2017], which aims to prevent forgetting in general, not necessarily under an adversarial setting. Understanding forgetting is a major milestone in neural networks research, as it is one of the most significant dissimilarities between artificial and biological neural networks. Last, the concept of lethean attacks, i.e. memory erasure in human minds, has been the subject of countless Hollywood movies (a few personal recommendations: Eternal Sunshine of the Spotless Mind, Inception, Men in Black) and in reality could potentially restore the lives of millions of PTSD patients. A review of the controversial field of memory reconsolidation [Besnard et al., 2012] and its applications is beyond the scope of this paper, interesting as they are.

Conclusion

Breaking the boundary between training and test time enables generalization across distribution shifts, and possibly other interesting applications such as hyper-personalization. In this work we examined the tradeoff which emerges from model modification in an online setting. A malicious agent with limited knowledge could render a test-time trained model completely useless. We do not consider this work a warning against adaptive online learning paradigms; on the contrary, we would like to encourage further research in this direction.
References
A. Besnard, J. Caboche, and S. Laroche. Reconsolidation of memory: a decade of debate. Progress in Neurobiology, 99(1):61–80, 2012.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2018.

Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks, 2015.

I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks, 2013.

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.

D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song. Using self-supervised learning can improve model robustness and uncertainty, 2019.

G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29, 2012.

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

Z. Li and D. Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.

D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks, 2017.

M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.

R. Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285, 1990.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

S. Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt. Test-time training with self-supervision for generalization under distribution shifts, 2020.

C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks, 2013.

E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. CoRR, abs/1702.05464, 2017. URL http://arxiv.org/abs/1702.05464.

Y. Wang and K. Chaudhuri. Data poisoning attacks against online learning. arXiv preprint arXiv:1808.08994, 2018.

Y. Wu and K. He. Group normalization. In European Conference on Computer Vision, 2018.

X. Zhang, X. Zhu, and L. Lessard. Online data poisoning attack. arXiv preprint arXiv:1903.01666, 2019.