A Too-Good-to-be-True Prior to Reduce Shortcut Reliance
Nikolay Dagaev, Brett D. Roads, Xiaoliang Luo, Daniel N. Barry, Kaustubh R. Patil, Bradley C. Love
Abstract
Despite their impressive performance in object recognition and other tasks under standard testing conditions, deep convolutional neural networks (DCNNs) often fail to generalize to out-of-distribution (o.o.d.) samples. One cause for this shortcoming is that modern architectures tend to rely on "shortcuts", superficial features that correlate with categories without capturing deeper invariants that hold across contexts. Real-world concepts often possess a complex structure that can vary superficially across contexts, which can make the most intuitive and promising solutions in one context fail to generalize to others. One potential way to improve o.o.d. generalization is to assume that simple solutions are unlikely to be valid across contexts and downweight them, which we refer to as the too-good-to-be-true prior. We implement this inductive bias in a two-stage approach that uses predictions from a low-capacity network (LCN) to inform the training of a high-capacity network (HCN). Since the shallow architecture of the LCN can only learn surface relationships, which include shortcuts, we downweight training items for the HCN that the LCN can master, thereby encouraging the HCN to rely on deeper invariant features that should generalize broadly. Using a modified version of the CIFAR-10 dataset in which we introduced shortcuts, we found that the two-stage LCN-HCN approach reduced reliance on shortcuts and facilitated o.o.d. generalization.

School of Psychology, HSE University, Moscow, Russia. Department of Experimental Psychology, University College London, London, United Kingdom. Institute of Neuroscience and Medicine, Brain and Behaviour (INM-7), Research Center Jülich, Jülich, Germany. Institute of Systems Neuroscience, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany. The Alan Turing Institute, London, United Kingdom. Correspondence to: Nikolay Dagaev <[email protected]>. Copyright 2021 by the author(s).
1. Introduction

"If you would only recognize that life is hard, things would be so much easier for you." —Louis D. Brandeis
Deep convolutional neural networks (DCNNs) have achieved notable success in image recognition, sometimes achieving human-level performance or even surpassing it (He et al., 2015). However, DCNNs often suffer when out-of-distribution (o.o.d.) generalization is needed, that is, when training and test data are drawn from different distributions (Beery et al., 2018; Geirhos et al., 2019; 2020). This limitation has multiple consequences, such as susceptibility to adversarial interventions (Szegedy et al., 2013; Goodfellow et al., 2014; Hendrycks et al., 2019) or to previously unseen types of noise (Geirhos et al., 2019; Hendrycks & Dietterich, 2019).

Failure to generalize o.o.d. may reflect the tendency of modern network architectures to discover simple features, so-called "shortcut" features (Geirhos et al., 2020). While the perils of overly-complex solutions are well appreciated, overly-simplistic solutions should be viewed with equal skepticism. In this work, we assume that features that are easy to learn are likely too good to be true. For instance, a green background may be highly correlated with the "horse" category, but green grass is not a central feature. A horse detector relying on such simplistic features, i.e. shortcuts, may perform well when applied in Spain, where the training set originates, but will fail when deployed in snow-covered Siberia. In effect, shortcuts are easily discovered by a network but may be inappropriate for classifying items in an independent set where superficial features are distributed differently than in the training set. Thus, the sensitivity to shortcuts may have far-reaching and dangerous consequences in applications, as when the pneumonia predictions of a system were based on a metal token placed in radiographs (Zech et al., 2018).

In general, one cannot know a priori whether shortcuts will be helpful or misleading, nor can shortcut learning be reduced to overfitting the training data.
While overfitting can be estimated using an available test set from the same distribution, assessing shortcuts depends on all possible unseen data. A model relying on shortcuts can show remarkable human-level results on test sets where shortcut features are distributed identically to the training set (i.i.d.), but fail dramatically on o.o.d. test sets where shortcuts are missing or misleading (Recht et al., 2019).

Shortcuts can adversely affect generalization even when they are not perfectly predictive. Because shortcuts are easily learned by DCNNs, they can be misleading even in the presence of more reliable but complex features (Hermann & Lampinen, 2020). To illustrate, shape may be perfectly predictive of class membership, but networks may rely on color or other easily accessed features like texture (Geirhos et al., 2018) when tested on novel cases (see Figure 1A).
Figure 1.
The standard and too-good-to-be-true prior approaches to learning. (A) In the standard approach, a single high-capacity network (HCN) is trained and is susceptible to shortcuts, in this case relying on color as opposed to shape. Such a network will generalize well to i.i.d. test items but fail on o.o.d. test items (the last item for each class; shown in red). (B) In contrast, implementing the too-good-to-be-true prior by pairing a low-capacity network (LCN) with an HCN leads to successful i.i.d. and o.o.d. generalization. Items that the LCN can master, which may contain shortcuts, are downweighted when the HCN is trained, which should reduce shortcut reliance and promote use of more complex and invariant features by the HCN.
Although what is and is not a shortcut cannot be known with perfect confidence, all shortcuts are simple. We find it unlikely that difficult learning problems will have trivial solutions when they have not been fully solved by brains with billions of neurons shaped by millions of years of natural selection, nor by engineers working diligently for decades. Based on this observation, we are skeptical of very simple solutions to complex problems and believe they will have poor o.o.d. generalization. This inductive bias, which we refer to as the "too-good-to-be-true prior", can be incorporated into the training of DCNNs to reduce shortcut reliance and promote o.o.d. generalization. At its heart, the too-good-to-be-true prior is a belief about the relationship between the world, models, and machine learning problems, which places limits on Occam's razor.

How does one identify such too-good-to-be-true solutions? One option is to make use of a learning system that is wittingly simplistic for the problem at hand and capable of only trivial solutions. Here we develop this approach, resulting in one particular implementation of the proposed inductive bias: a simple and general method aimed at discarding training examples that are suspected of containing shortcuts. We hypothesize that, in order to prevent shortcut learning by a high-capacity network (HCN), the predictions of a much simpler, low-capacity network (LCN) can be used to guide the training of the target network. An architecture of sufficiently limited capacity should be capable of learning only shallow features, that is, primarily shortcuts. Consequently, a trained LCN would provide high-probability predictions precisely for the training items containing these shortcuts. Such probabilistic predictions can be transformed into importance weights (IWs) for training items, and these IWs can be used in a loss function for training an HCN by downweighting the shortcut items (Figure 1B).
We demonstrate our method's efficiency by applying it to multiple CIFAR-10-based binary classification problems with synthetic shortcuts, permitting well-controlled experiments. According to our results, training an HCN with IWs indeed leads to ignoring the shortcuts almost entirely. When presented with o.o.d. test examples, where the shortcuts are misleading, the network does not misclassify these examples but makes a decision based on more complex and reliable features.
2. Related Work
Shortcut learning and robust generalization. Multiple approaches have been suggested for preventing shortcut reliance and increasing generalization robustness in deep neural networks (Geirhos et al., 2020). However, to our knowledge, none of them explicitly stems from an inductive bias concerning the nature of the relationship between machine learning models and the problems they attempt to solve, as our too-good-to-be-true prior does. Nonetheless, there are approaches that rely on specific assumptions concerning shortcuts. For example, Geirhos et al. (2018) suggested that DCNNs are biased towards easier-to-learn texture features at the expense of shape features and demonstrated this
by using texture-shape conflict stimuli; they further improved o.o.d. generalization by reducing this bias. Minderer et al. (2020) suggested that shortcut features are the first to be affected by adversarial attacks and demonstrated the possibility of using an adversarial-based technique to identify and remove shortcuts. In fact, adversarial vulnerability is known to be closely related to o.o.d. generalization: it was shown that increasing adversarial robustness can result in better representations and improved generalization in the presence of distribution shift (Engstrom et al., 2019).

Huang et al. (2020) suggested a heuristic, Representation Self-Challenging (RSC), which at first glance might seem similar in spirit to our two-stage LCN-HCN procedure, even though it discards features, not items. This method impedes predicting class from the features most correlated with it and thus encourages a DCNN to rely on more complex combinations of features. However, this approach does not originate from the too-good-to-be-true prior and may result in either selecting shortcuts less correlated with the class or suppressing non-shortcut features highly correlated with the class.

In addition to the inductive bias lying at the heart of our method, it is also less computationally demanding than those mentioned here (Geirhos et al., 2018; Huang et al., 2020; Minderer et al., 2020).
Sample weighting. Although in the current implementation of the too-good-to-be-true prior we assign weights to the training items, our approach is fundamentally different from existing re-weighting schemes (Li et al., 2017; Zhang & Sabuncu, 2018; Shu et al., 2019). The two-stage LCN-HCN procedure exploits not the predictions of the network being trained itself but those of an independent, simpler network (LCN). In other words, we are not interested in the difficulty of an item per se, but in whether the item can be mastered through simple means.

Re-weighting of data samples is a well-known approach to guiding the training of DCNNs and machine learning models in general, and the corresponding methods differ in terms of which examples must be downweighted or emphasized. Some authors suggested mitigating the impact of easy examples and focusing on hard ones (Malisiewicz et al., 2011; Shrivastava et al., 2016). For instance, to address class imbalance and achieve state-of-the-art accuracy with a one-stage object detector, Lin et al. (2017) modified the standard cross-entropy loss such that the terms corresponding to well-classified examples were downweighted.

On the other hand, it was shown that emphasizing easy samples can substantially improve results as well. In curriculum learning, faster convergence and better generalization can be achieved by using predefined difficulty ranks, so that training starts with easier samples and then gradually takes more difficult samples into consideration (Bengio et al., 2009). A related approach is self-paced learning, where easy examples are stressed early in training as well, but the curriculum is dynamically generated as training progresses (Kumar et al., 2010; Meng et al., 2015). It was also shown that combining the two approaches is possible (Jiang et al., 2015).
Success of curriculum-based methods depends upon the availability of a suitable scoring function that assigns a difficulty score to training samples. Such a function is usually not readily available and often requires tuning of additional hyper-parameters such as learning rate schedules and batch sizes (Hacohen & Weinshall, 2019), and the usefulness of curricula has recently been questioned (Wu et al., 2020).

Differently from emphasizing hard or easy examples, Chang et al. (2017) proposed methods to improve the classification accuracy and robustness of DCNNs by accounting for uncertain samples of high prediction variance.
3. Implementing the Too-Good-to-be-True Prior by Utilizing Predictions of a Low-Capacity Network
Our approach starts with training an LCN, an architecture of relatively low capacity that learns primitive features. Reliable features necessary for robust generalization are relatively high-level, whereas shortcuts are usually low-level characteristics of an image. Given this assumption, the LCN will primarily produce accurate predictions for images containing shortcuts.

Given a training dataset D = {x_i, y_i}, the corresponding IW w_i for a training image x_i is its probability of misclassification as given by the LCN:

w_i = 1 − p(y_i | x_i). (1)

IWs are then employed while training an HCN: for every training image, the corresponding loss term is multiplied by the IW of that image. We normalize IWs with respect to a mini-batch: the IWs of samples from a mini-batch are divided by the sum of all IWs in that mini-batch. The mini-batch training loss L_B is thus the following:

L_B = Σ_{k∈B} w̃_k L_k, (2)

where L_k indicates the loss of the k-th sample in the mini-batch. The mini-batch normalized IW is

w̃_j = w_j / Σ_{k∈B} w_k. (3)
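As a concrete sketch of Eqs. (1)-(3), the IW computation and the batch-normalized weighted loss can be written in a few lines. This is an illustrative implementation under assumptions not fixed by the text (PyTorch, cross-entropy as the per-sample loss L_k); the function names are ours.

```python
import torch
import torch.nn.functional as F

def importance_weights(lcn, images, labels):
    """Eq. (1): w_i = 1 - p(y_i | x_i) under the trained LCN."""
    with torch.no_grad():
        probs = F.softmax(lcn(images), dim=1)                    # (B, num_classes)
        p_true = probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # p(y_i | x_i)
    return 1.0 - p_true                                           # (B,)

def weighted_batch_loss(hcn_logits, labels, w):
    """Eqs. (2)-(3): per-sample losses scaled by IWs normalized over the mini-batch."""
    per_sample = F.cross_entropy(hcn_logits, labels, reduction="none")  # L_k
    w_tilde = w / w.sum()                                         # Eq. (3)
    return (w_tilde * per_sample).sum()                           # Eq. (2)
```

In an HCN training loop, `importance_weights` would be computed once from the frozen, pre-trained LCN, and `weighted_batch_loss` would replace the usual mean cross-entropy.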
4. Experiments
Generally, whether a dataset contains shortcuts is not known beforehand. In order to overcome this issue and test the
too-good-to-be-true prior, we introduced synthetic shortcuts into a well-known dataset (cf. Malhotra et al., 2020). We then applied our approach and investigated whether it was able to avoid reliance on these shortcuts while learning the deeper structure. This testing strategy allowed us to run well-controlled experiments and quantify the effects of our method.

We ran a set of experiments on all possible pairs of classes from the CIFAR-10 dataset (Krizhevsky et al., 2009). In every classification problem, a synthetic shortcut was introduced in each of the two classes. In order to have a better understanding of our method's generalizability, we investigated two opposite types of shortcuts as well as two HCN architectures, ResNet (He et al., 2016) and VGG-11 (Simonyan & Zisserman, 2015). Note that our too-good-to-be-true prior is readily applicable to multi-class problems.

For both shortcut types and both HCN architectures, we expected the two-stage LCN-HCN procedure to discard the majority of shortcut images. Therefore, compared to the ordinary training procedure, better performance should be observed when shortcuts in a test set are misleading (i.e., the o.o.d. test set). We also expected that the two-stage LCN-HCN procedure may suppress some non-shortcut images. Thus, slightly worse performance was expected for a test set without shortcuts as well as for a test set with helpful shortcuts (i.e., the i.i.d. test set).

The main objective of these experiments was to compare the ordinary and weighted training procedures in terms of the susceptibility of the resulting models to the shortcuts. However, crucially for our idea of the too-good-to-be-true prior, it was also important to validate our reasoning concerning the key role of a network's low capacity in the derivation of useful IWs. For this purpose, we introduced another training condition where IWs were obtained from the probabilistic predictions of the same HCN architecture as the target network.
We refer to the IWs obtained from an HCN as HCN-IWs and to the IWs obtained from an LCN as LCN-IWs. We expected HCN-IWs either to fail to suppress shortcut images, resulting in poor performance on a test set with misleading shortcuts (o.o.d. test set), or to equally suppress both shortcut and non-shortcut images, resulting in poor performance on any test data. Using HCN-IWs mirrors approaches that place greater emphasis on challenging items.
For the sake of generality, we introduced two shortcut types: the "local" shortcut was salient and localized, and the "global" shortcut was subtle and diffuse. The local shortcut was intended to capture real-world cases such as a marker in the corner of a radiograph (Zech et al., 2018), and the global shortcut was intended to capture situations such as subtle distortions in the lens of a camera.

The local shortcut was a horizontal line of three pixels, red for one class and blue for the other (Figure 2, left). The location of the line was the same for all images: the upper left corner. The shortcut was present in a randomly chosen 30% of training as well as validation images in each class.

The global shortcut was a mask of Gaussian noise, one per class (Figure 2, right). The mask was sampled from a multivariate normal distribution with zero mean and an isotropic covariance matrix with a small fixed variance, and then added to a randomly chosen 30% of the training and validation images of the corresponding class.

Figure 2.
Examples of the two shortcut types used in our experiments: local and global. Since the global shortcut is subtle to humans, the corresponding original images and additive masks are also depicted. For the subset of images containing a shortcut, a network could learn to rely on these superficial features at the expense of more invariant properties, which has consequences for generalization.
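The two shortcut injections described above can be sketched as image transforms. This is a minimal NumPy illustration, not the authors' code: the exact line color values and the noise variance are assumptions (the paper's variance value did not survive extraction, so `sigma` here is purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

def add_local_shortcut(img, class_idx):
    """Paint a three-pixel horizontal line in the upper-left corner:
    red for one class, blue for the other (exact RGB values assumed)."""
    out = img.copy()
    color = (255, 0, 0) if class_idx == 0 else (0, 0, 255)
    out[0, 0:3, :] = color                      # row 0, columns 0-2, all channels
    return out

def make_global_masks(shape=(32, 32, 3), sigma=8.0, num_classes=2):
    """One fixed zero-mean isotropic Gaussian-noise mask per class
    (sigma is an illustrative stand-in for the paper's variance)."""
    return [rng.normal(0.0, sigma, size=shape) for _ in range(num_classes)]

def add_global_shortcut(img, class_idx, masks):
    """Add the class-specific noise mask and clip back to valid pixel range."""
    return np.clip(img.astype(float) + masks[class_idx], 0, 255).astype(np.uint8)
```

In the experiments, either transform would be applied to a random 30% of the training and validation images of each class.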
Based on CIFAR-10 test images of the selected classes, we prepared three test sets for each shortcut type (examples shown in Figure 3). Congruent (i.i.d.): all images contained shortcuts, each associated with the same class as in the training set. Incongruent (o.o.d.): all images contained shortcuts, but each shortcut was placed in the images of the opposite class compared to the training set. Neutral: original CIFAR-10 images without shortcuts.
The LCN consisted of a single convolutional layer followed by a fully-connected softmax classification layer. The convolutional layer included 4 channels with 3-by-3 kernels, a linear activation function, and no downsampling.

In two separate sets of simulations, we tested two different HCN architectures: VGG-11 (Simonyan & Zisserman, 2015) and the 56-layer ResNet for CIFAR-10 (He et al., 2016). The first two fully-connected layers of VGG-11 had 1024 units each, and no dropout was used.
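The LCN architecture just described is small enough to write out in full. The sketch below follows the stated design (one linear 3x3 convolution with 4 channels, no downsampling, then a fully-connected classifier); the padding choice is our assumption, since the text only specifies that no downsampling occurs.

```python
import torch
import torch.nn as nn

class LCN(nn.Module):
    """Single linear conv layer + fully-connected classifier, per the text.
    Padding of 1 (preserving the 32x32 CIFAR-10 resolution) is an assumption."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3, padding=1)  # linear: no nonlinearity
        self.fc = nn.Linear(4 * 32 * 32, num_classes)

    def forward(self, x):                 # x: (B, 3, 32, 32)
        h = self.conv(x)                  # stride 1, no activation, no pooling
        return self.fc(h.flatten(1))      # logits; softmax is applied in the loss
```

Because both layers are linear, the whole network is an affine map of the raw pixels, which is what restricts it to shallow, shortcut-like features.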
Figure 3.
Illustration of the predicted results of LCN-IWs training on classifying different test cases by the target HCN. Correct network decisions are in green, incorrect in red. Training examples containing local shortcuts are highlighted by a yellow border. Whereas ordinary training should lead to poor o.o.d. test performance on incongruent test items, where the shortcut is now misleading, LCN-IWs should selectively downweight training items with the shortcut, allowing the HCN to generalize well across the spectrum.
Network weights were initialized according to Glorot and Bengio (2010). We used stochastic gradient descent to train both the LCN and the HCN. The initial learning rate was set to 0.01 for the LCN. The HCN's initial learning rate was set to 0.01 for VGG (Simonyan & Zisserman, 2015) and to 0.1 for ResNet (He et al., 2016). The HCNs were trained with a momentum of 0.9 and a weight decay of 5×10⁻⁴ for 150 epochs. To avoid overfitting, the HCN's performance on validation data (see below) was tested at each epoch, and the best-performing parameters were chosen as the result of training. The LCN was trained for 40 epochs. For both the LCN and the HCN, the learning rate was decreased by a factor of 10 at the epochs corresponding to 50% and 75% of the total duration of the network's training. The mini-batch size for both networks was set to 256. No data augmentation was used.

For each class, the original 5,000 images from the CIFAR-10 training set were divided into 4,500 training images and 500 validation images. Thus, the training set of every class pair included 9,000 images and the validation set included 1,000 images.

IWs were introduced to the training process as described in Section 3, and for every mini-batch a weighted-average loss was calculated. During ordinary training without IWs, a simple average loss was calculated. All the results reported below are averages from 10 independent runs over all class pairs, shortcut types, and HCN architectures.

The overall pattern of results was in accord with our predictions: downweighting training items that could be mastered by a low-capacity network reduced shortcut reliance in a high-capacity network, which improved o.o.d. generalization at a small cost in i.i.d. generalization. Typical distributions of LCN-IWs and HCN-IWs are shown in Figure 4. The LCN-IWs distribution has a large peak close to zero, almost entirely consisting of shortcut examples. Examples without shortcuts are present along the entire IW range.
In contrast, the HCN-IWs distribution groups both shortcut and non-shortcut examples equally close to zero. Moreover, as illustrated in Figure 4, we noticed that the non-shortcut examples closest to zero are usually those which are, intuitively, highly typical of the corresponding class. Together, these results imply that LCN-IWs would suppress almost exclusively shortcut images, while HCN-IWs would suppress both shortcut and non-shortcut images, with highly typical class examples in the latter group.

Effects of the training condition (ordinary, HCN-IWs, and LCN-IWs) on how well the HCN performs on each test set (incongruent, neutral, and congruent) are presented in Figure 5. Both ResNet and VGG-11 are prone to rely on our shortcuts, as evidenced by low incongruent accuracies and very high congruent accuracies after ordinary training. Incongruent accuracies are improved after LCN-IWs training compared to those after ordinary training. Importantly, after LCN-IWs training, the incongruent, neutral, and congruent accuracies are all similarly high. Together, these results suggest that LCN-IWs are successful in reducing shortcut reliance in the target network.

Although exceeding performance on the incongruent test set after the ordinary training condition, incongruent accuracies after HCN-IWs training are substantially lower than after LCN-IWs training. The neutral and congruent accuracies are lower than in both the ordinary and LCN-IWs training conditions. At the same time, incongruent accuracies are still noticeably lower than in the neutral and congruent conditions.
These results indicate that HCN-IWs are not effective in resisting shortcut learning due, at least to some extent, to suppressing typical class examples containing useful and well-generalizable features (Figure 4).

The main results shown in Figure 5 indicate that LCN-IWs reduce shortcut reliance with little cost to performance on other items, whereas HCN-IWs are less effective because they remove non-shortcut items as well (see Figure 4). Key to the LCN-IW results is properly matching network capacity to the learning problem. Out of the 45 classification pairs considered, there should be natural variation in problem difficulty that affects target network performance. In particular, we predict that the overall benefit will be lower when the LCN performs better on a class pair, indicating that its capacity is sufficient to learn non-shortcut information.

We define Overall Benefit (OB) as a combination of Gain
Figure 4.
Typical observed distributions of HCN-IWs (A) and LCN-IWs (B). Four illustrative examples along with corresponding IWs are depicted for three parts of each distribution: the extremes of the observed IW range and its center. The lowest HCN-IWs correspond to shortcut images and to non-shortcut images depicting examples of high typicality; the lowest LCN-IWs correspond almost exclusively to shortcut images. As predicted, the LCN has the capacity to master images containing shortcuts but few other images, providing IWs for an HCN that reduce shortcut reliance, thereby implementing the too-good-to-be-true prior.

(G) and Loss (L):

OB = G + L, (4)

where

G = logit(p(correct | incongruent, IW)) − logit(p(correct | incongruent, ordinary)) (5)

and

L = logit(p(correct | neutral, IW)) − logit(p(correct | neutral, ordinary)). (6)

We compute the average OB for each class pair and contrast those against the corresponding neutral test accuracies after ordinary training. The latter are introduced to reflect the default classification difficulty of each class pair. These comparisons are shown in Figure 6. Two evident trends are important. First, recapitulating the previous results, LCN-IWs result in greater OB than HCN-IWs. The OB corresponding to LCN-IWs is almost always positive, while the OB corresponding to HCN-IWs is often negative. Second, OB is negatively correlated with the neutral test accuracy after ordinary training; that is, as the difficulty of a classification problem increases, the benefits of using IWs generally increase as well. One possibility is that for easy-to-discriminate pairs, such as frog and ship, the LCN was able to learn non-shortcut information, which reduced the overall benefit of the LCN-IWs.
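Equations (4)-(6) reduce to simple arithmetic on test accuracies. The following sketch (our function names) computes OB from the four accuracies involved; the epsilon clipping is our addition to keep the logit finite at accuracies of exactly 0 or 1.

```python
import math

def logit(p, eps=1e-6):
    """log-odds of a probability, clipped away from 0 and 1 for stability."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def overall_benefit(acc_iw_incong, acc_ord_incong, acc_iw_neut, acc_ord_neut):
    """OB = G + L (Eqs. 4-6): the gain on the incongruent (o.o.d.) set plus the
    (typically negative) change on the neutral set, both on the logit scale."""
    gain = logit(acc_iw_incong) - logit(acc_ord_incong)   # Eq. (5)
    loss = logit(acc_iw_neut) - logit(acc_ord_neut)       # Eq. (6)
    return gain + loss                                    # Eq. (4)
```

Working on the logit scale makes a change from, say, 0.95 to 0.99 count for more than one from 0.55 to 0.59, which is appropriate for accuracies near ceiling.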
5. Discussion
In general, using Occam's razor to favor simple solutions is a sensible policy. We certainly do not advocate adding unnecessary complexity. However, for difficult problems that have evaded a solution, it is unlikely that a trivial solution exists. The problems of interest in machine learning have taken millions of years for nature to solve and have puzzled engineers for decades. It seems implausible that trivial solutions to such problems would exist, and we should be skeptical when they appear.
Figure 5.
Accuracies on incongruent, neutral, and congruent test sets after ordinary and HCN-/LCN-weighted training. Across shortcut types and HCN architectures, LCN-IWs result in almost identically high accuracy on all three test sets and thus are successful in avoiding shortcut reliance. HCN-IWs consistently result in accuracies inferior to LCN-IWs; moreover, on the neutral and congruent test sets, accuracies after HCN-weighted training are lower than after ordinary training. HCN-IWs, thus, are not as effective as LCN-IWs in avoiding shortcut reliance and also result in suppressing useful features. Together, these results indicate that the LCN-HCN two-stage approach is a valid representative of the too-good-to-be-true prior.
Figure 6.
Effects of the LCN/HCN-IWs training procedure for individual class pairs depending on their respective difficulty. The effects of training are represented by the Overall Benefit measure (gain + loss; see text); the difficulty of a pair is represented by the neutral test accuracy after ordinary training. Recapitulating previous results, LCN-IWs are more effective than HCN-IWs. Furthermore, the easier the learning problem, the smaller the Overall Benefit from IWs, because the relatively higher capacity of the IW network leads to downweighting non-shortcut items.
For such difficult problems, we suggest adopting a too-good-to-be-true prior that shies away from simple solutions. Simple solutions to complex problems are likely to rely on superficial features that are reliable within the particular training context, but are unlikely to capture the more subtle invariants central to a concept. To use a historic example, people had great hopes that the Perceptron (Rosenblatt, 1958), a one-layer neural network, would master computer vision, only to have their hopes dashed (Minsky & Papert, 1969). When such simple systems appear successful, including on held-out test data, they are most likely relying on shortcuts that will not generalize out of sample on somewhat different test distributions, such as when a system is deployed.

We proposed and evaluated a simple implementation of the too-good-to-be-true inductive bias. We used a low-capacity network (LCN) to establish importance weights (IWs) to help train a high-capacity network (HCN). The idea was that the LCN would not have the capacity to learn subtle invariants but would instead be reduced to relying on superficial shortcuts. By downweighting the items that the LCN could master, we found that the HCN was less susceptible to shortcuts and showed better o.o.d. generalization, at little cost when misleading shortcuts were not present.

Although we evaluated the too-good-to-be-true approach on CIFAR-10 images, the basic method of using an LCN to establish IWs for an HCN is broadly applicable. We considered two network architectures for the HCN, ResNet and VGG-11, which both showed the same overall pattern of performance. Interestingly, ResNet appeared more susceptible to shortcuts, perhaps because its architecture contains skip connections that are themselves a type of shortcut, allowing lower-level information in the network to propagate upwards absent intermediate processing stages.

One key challenge in our approach is matching the complexity of the LCN to the learning problem. When the LCN has too much capacity, it may learn more than shortcuts and downweight information useful for o.o.d. generalization (see Figure 6). It is for this reason that LCN-IWs are much more effective than HCN-IWs (see Figure 5). Unfortunately, there is no simple procedure that guarantees selecting the appropriate LCN.
The choice depends on one's beliefs about the structure of the world, the susceptibility of models to misleading shortcuts, and the nature of the learning problem. Nevertheless, reasonable decisions can be made. For example, we would be skeptical of a Perceptron that successfully classified medical imagery, so it could serve as an LCN.

Since the too-good-to-be-true prior is a general inductive bias, our two-stage LCN-HCN approach is just one specific implementation of it, and other techniques may be developed. The effectiveness of our two-stage approach should be evaluated in other tasks and domains outside computer vision. Further research should consider how to choose the architecture of an LCN and how the effectiveness of this architecture depends on different types of shortcuts. Finally, a promising direction is to use IWs to selectively suppress aspects of training items rather than downweighting entire examples.
Acknowledgements
This article is an output of a research project implemented as part of the Basic Research Program at the National Research University Higher School of Economics (HSE University). This work was supported by NIH Grant 1P01HD080679, Wellcome Trust Investigator Award WT106931MA, and Royal Society Wolfson Fellowship 183029 to B.C.L.
References
Beery, S., Van Horn, G., and Perona, P. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 456–473, 2018.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the International Conference on Machine Learning, pp. 41–48, 2009.

Chang, H.-S., Learned-Miller, E., and McCallum, A. Active bias: Training more accurate neural networks by emphasizing high variance samples. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 1003–1013, 2017.

Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Tran, B., and Madry, A. Adversarial robustness as a prior for learned representations. arXiv preprint arXiv:1906.00945, 2019.

Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.

Geirhos, R., Medina Temme, C., Rauber, J., Schütt, H., Bethge, M., and Wichmann, F. Generalisation in humans and deep neural networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 7549–7561. Curran, 2019.

Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., and Wichmann, F. A. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Hacohen, G. and Weinshall, D. On the power of curriculum learning in training deep networks. In Proceedings of the International Conference on Machine Learning, pp. 2535–2544. PMLR, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.

Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., and Song, D. Natural adversarial examples. arXiv preprint arXiv:1907.07174, 2019.

Hermann, K. and Lampinen, A. What shapes feature representations? Exploring datasets, architectures, and training. Advances in Neural Information Processing Systems, 33, 2020.

Huang, Z., Wang, H., Xing, E. P., and Huang, D. Self-challenging improves cross-domain generalization. In ECCV, 2020.

Jiang, L., Meng, D., Zhao, Q., Shan, S., and Hauptmann, A. Self-paced curriculum learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Kumar, M. P., Packer, B., and Koller, D. Self-paced learning for latent variable models. In
Proceedings of the Inter-national Conference on Neural Information ProcessingSystems , volume 1, pp. 1189–1197, 2010.Li, Y., Yang, J., Song, Y., Cao, L., Luo, J., and Li, L.-J.Learning from noisy labels with distillation. In
Proceed-ings of the IEEE International Conference on ComputerVision , pp. 1910–1918, 2017.Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Doll´ar, P.Focal loss for dense object detection. In
Proceedings ofthe IEEE international conference on computer vision ,pp. 2980–2988, 2017.Malhotra, G., Evans, B. D., and Bowers, J. S. Hiding aplane with a pixel: examining shape-bias in CNNs andthe benefit of building in biological constraints.
VisionResearch , 174:57–68, 2020. ISSN 0042-6989. doi: https://doi.org/10.1016/j.visres.2020.04.013. Malisiewicz, T., Gupta, A., and Efros, A. A. Ensembleof exemplar-SVMs for object detection and beyond. In
Proceedings of the International conference on computervision , pp. 89–96. IEEE, 2011.Meng, D., Zhao, Q., and Jiang, L. What objective doesself-paced learning indeed optimize? arXiv preprintarXiv:1511.06049 , 2015.Minderer, M., Bachem, O., Houlsby, N., and Tschannen, M.Automatic shortcut removal for self-supervised represen-tation learning. In
Proceedings of the International Con-ference on Machine Learning , pp. 6927–6937. PMLR,2020.Minsky, M. L. and Papert, S.
Perceptrons: An introductionto computational geometry . MIT Press, Cambridge, MA,1969.Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do Ima-geNet classifiers generalize to ImageNet? In
Proceedingsof the International Conference on Machine Learning , pp.5389–5400. PMLR, 2019.Rosenblatt, F. The Perceptron: A probabilistic model forinformation storage in the brain.
Psychological review ,65:386–408, 1958.Shrivastava, A., Gupta, A., and Girshick, R. Training region-based object detectors with online hard example mining.In
Proceedings of the IEEE conference on computer vi-sion and pattern recognition , pp. 761–769, 2016.Shu, J., Xie, Q., Yi, L., Zhao, Q., Zhou, S., Xu, Z., andMeng, D. Meta-weight-net: Learning an explicit mappingfor sample weighting. arXiv preprint arXiv:1902.07379 ,2019.Simonyan, K. and Zisserman, A. Very deep convolutionalnetworks for large-scale image recognition. In
Proceed-ings of the International Conference on Learning Repre-sentations , 2015.Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan,D., Goodfellow, I., and Fergus, R. Intriguing properties ofneural networks. arXiv preprint arXiv:1312.6199 , 2013.Wu, X., Dyer, E., and Neyshabur, B. When do curriculawork? arXiv preprint arXiv:2012.03107 , 2020.Zech, J. R., Badgeley, M. A., Liu, M., Costa, A. B.,Titano, J. J., and Oermann, E. K. Variable general-ization performance of a deep learning model to de-tect pneumonia in chest radiographs: A cross-sectionalstudy.
PLoS medicine , 15(11):e1002683, 2018. doi:https://doi.org/10.1371/journal.pmed.1002683.
Too-Good-to-be-True Prior
Zhang, Z. and Sabuncu, M. R. Generalized cross entropyloss for training deep neural networks with noisy labels.In