More Is More -- Narrowing the Generalization Gap by Adding Classification Heads
Roee Cates, The Hebrew University of Jerusalem, [email protected]
Daphna Weinshall, The Hebrew University of Jerusalem, [email protected]
Abstract
Overfit is a fundamental problem in machine learning in general, and in deep learning in particular. In order to reduce overfit and improve generalization in the classification of images, some methods employ invariance to a group of transformations, such as rotations and reflections. However, since not all objects necessarily exhibit the same invariance, it seems desirable to allow the network to learn the useful level of invariance from the data. To this end, motivated by self-supervision, we introduce an architecture enhancement for existing neural network models based on input transformations, termed 'TransNet', together with a training algorithm suitable for it. Our model can be employed during training time only and then pruned for prediction, resulting in an architecture equivalent to the base model. Thus pruned, we show that our model improves performance on various data-sets while exhibiting improved generalization, which is achieved in turn by enforcing soft invariance on the convolutional kernels of the last layer in the base model. Theoretical analysis is provided to support the proposed method.
1. Introduction
Deep neural network models currently define the state of the art in many computer vision tasks, as well as in speech recognition and other areas. These expressive models are able to capture complicated input-output relations. At the same time, models of such large capacity are often prone to overfit, i.e., to perform significantly better on the training set than on the test set. This phenomenon is also called the generalization gap.

We propose a method to narrow this generalization gap. Our model, which is called TransNet, is defined by a set of input transformations. It augments an existing Convolutional Neural Network (CNN) architecture by allocating a specific head - a fully-connected layer which receives as input the penultimate layer of the base CNN - for each input transformation (see Fig. 1).

Figure 1: Illustration of the TransNet architecture, which consists of 2 heads associated with 2 transformations, the identity and rotation by 90°. Each head classifies images transformed by the associated transformation, while both share the same backbone.

The transformations associated with the model's heads are not restricted a priori. The idea behind the proposed architecture is that each head can specialize in a different yet related classification task. We note that any CNN model can be viewed as a special case of the TransNet model, consisting of a single head associated with the identity transformation.
The overall task is typically harder when training TransNet than when training the base CNN architecture. Yet by training multiple heads, which share the convolutional backbone, we hope to reduce the model's overfit by providing a form of regularization.

In Section 3 we define the basic model and the training algorithm designed to train it (see Alg. 1). We then discuss the type of transformations that can be useful when learning to classify images. We also discuss the model's variations: (i) a pruned version, which employs multiple heads during training and then keeps only the head associated with the identity transformation for prediction; (ii) the full version, where all heads are used in both training and prediction.

Theoretical investigation of this model is provided in Section 4, using the dihedral group of transformations ($D_4$) that includes rotations by multiples of 90° and reflections. We first prove that under certain mild assumptions, instead of applying each dihedral transformation to the input, one can compile it into the CNN model's weights by applying the inverse transformation to the convolutional kernels. In order to obtain intuition about the inductive bias of the model's training algorithm in complex realistic frameworks, we analyze the model's inductive bias using a simplified framework.

In Section 5 we describe our empirical results. We first introduce a novel invariance score (IS), designed to measure the model's kernel invariance under a given group of transformations. IS effectively measures the inductive bias imposed on the model's weights by the training algorithm. To achieve a fair comparison, we compare a regular CNN model, traditionally trained, to the same model trained like a TransNet model as follows: heads are added to the base model, it is trained as a TransNet model, and then the extra heads are pruned. We then show that training as TransNet improves test accuracy as compared to the base model. This improvement was achieved while keeping the optimized hyper-parameters of the base CNN model, suggesting that further improvement by fine tuning may be possible. We also demonstrate the increased invariance of the model's kernels when trained with TransNet.

Our Contribution

• Introduce TransNet - a model inspired by self-supervision for supervised learning, which imposes partial invariance to a group of transformations.
• Introduce an invariance score (IS) for CNN convolutional kernels.
• Theoretical investigation of the inductive bias implied by the TransNet training algorithm.
• Demonstrate empirically how both the full and pruned versions of TransNet improve accuracy.
2. Related Work
Overfit.
A fundamental and long-standing issue in machine learning, overfit occurs when a learning algorithm minimizes the train loss but generalizes poorly to the unseen test set. Many methods were developed to mitigate this problem, including early stopping - where training is halted as soon as the loss over a validation set starts to increase - and regularization - where a penalty term is added to the optimization loss. Other related ideas, which achieve similar goals, include dropout [27], batch normalization [14], transfer learning [25, 29], and data augmentation [3, 33].
Self-Supervised Learning.
A family of learning algorithms that train a model using self-generated labels (e.g., the orientation of an image), in order to exploit unlabeled data as well as extract more information from labeled data. Self-training algorithms are used for representation learning, by training a deep network to solve pretext tasks whose labels can be produced directly from the data. Such tasks include colorization [32, 16], placing image patches in the right place [22, 7], inpainting [23] and orientation prediction [10]. Typically, self-supervision is used in unsupervised learning [8], to impose some structure on the data, or in semi-supervised learning [31, 12]. Our work is motivated by RotNet, an orientation prediction method suggested by [10]. It differs from [31, 12], as we allocate a specific classification head for each input transformation rather than predicting the self-supervised label with a separate head.
Equivariant CNNs.
Many computer vision algorithms are designed to exhibit some form of invariance to a transformation of the input, including geometric transformations [20], transformations of time [28], or changes in pose and illumination [24]. Equivariance is a more relaxed property, exploited for example by CNN models where translation is concerned. Work on CNN models that enforce strict equivariance includes [26, 9, 1, 21, 2, 5]. Like these methods, our method seeks to achieve invariance by employing weight sharing of the convolution layers between multiple heads. But unlike these methods, the invariance constraint is soft. Soft equivariance is also seen in works like [6], which employs a convolutional layer that simultaneously feeds rotated and flipped versions of the original image to a CNN model, or [30], which appends rotated and reflected versions of each convolutional kernel.
3. TransNet
Notations and definitions
Let $X = \{(x_i, y_i)\}_{i=1}^n$ denote the training data, where $x_i \in \mathbb{R}^d$ denotes the i-th data point and $y_i \in [K]$ its corresponding label. Let $\mathcal{D}$ denote the data distribution from which the samples are drawn. Let $\mathcal{H}$ denote the set of hypotheses, where $h_\theta \in \mathcal{H}$ is defined by its parameters $\theta$ (often we use $h = h_\theta$ to simplify notations). Let $\ell(h, x, y)$ denote the loss of hypothesis $h$ when given sample $(x, y)$. The overall loss is:

$$L(h, X) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h, x, y)] \quad (1)$$

Our objective is to find the optimal hypothesis:

$$h^* := \arg\min_{h\in\mathcal{H}} L(h, X) \quad (2)$$

For simplicity, whenever the underlying distribution of a random variable isn't explicitly defined we use the uniform distribution, e.g., $\mathbb{E}_{a\in A}[a] = \frac{1}{|A|}\sum_{i=1}^{|A|} a_i$.
The TransNet architecture is defined by a set of input transformations $T = \{t_j\}_{j=1}^m$, where each transformation $t \in T$ operates on the inputs ($t: \mathbb{R}^d \to \mathbb{R}^d$) and is associated with a corresponding model's head. Thus each transformation operates on datapoint $x$ as $t(x)$, and the transformed data-set is defined as:

$$t(X) := \{(t(x_i), y_i)\}_{i=1}^n \quad (3)$$

Given an existing NN model $h$, henceforth called the base model, we can split it into two components: all the layers except for the last one, denoted $f$, and the last layer $g$, assumed to be a fully-connected layer. Thus $h = g \circ f$. Next, we enhance model $h$ by replacing $g$ with $|T| = m$ heads, where each head is an independent fully-connected layer $g_t$ associated with a specific transformation $t \in T$. Formally, each head is defined by $h_t = g_t \circ f$, and it operates on the correspondingly transformed input as $h_t(t(x))$.

The full model, with its $m$ heads, is denoted $h_T := \{h_t\}_{t\in T}$, and operates on the input as $h_T(x) := \mathbb{E}_{t\in T}[h_t(t(x))]$. The corresponding loss of the full model is defined as:

$$L_T(h_T, X) := \mathbb{E}_{t\in T}[L(h_t, t(X))] \quad (4)$$

Note that the resulting model (see Fig. 1) essentially represents $m$ models, which share via $f$ all the weights up to the last fully-connected layer. Each of these models can be used separately, as we do later on.

Our method uses SGD with a few modifications to minimize the transformation loss (4), as detailed in Alg. 1. Relying on the fact that each batch is sampled i.i.d. from $\mathcal{D}$, we can prove (see Lemma 1) the desirable property that the sampled loss $L_T(h_T, B)$ is an unbiased estimator of the transformation loss $L_T(h_T, X)$. This justifies the use of Alg. 1 to optimize the transformation loss.
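To make the construction concrete, the following PyTorch sketch shows one possible way to wrap an existing backbone with per-transformation heads. It is a minimal illustration under our own naming assumptions (the class TransNet and the arguments feat_dim and transforms are not code from the paper):

```python
import torch
import torch.nn as nn

class TransNet(nn.Module):
    """Base model h = g o f, extended to heads h_t = g_t o f, one per t in T."""
    def __init__(self, backbone, feat_dim, num_classes, transforms):
        super().__init__()
        self.backbone = backbone          # f: all layers up to the penultimate one
        self.transforms = transforms      # T = {t_j}: callables acting on image batches
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in transforms])  # one g_t per t

    def forward(self, x):
        # full model: h_T(x) = E_{t in T}[h_t(t(x))], the average over all heads
        outs = [g(self.backbone(t(x))) for t, g in zip(self.transforms, self.heads)]
        return torch.stack(outs).mean(dim=0)
```

Pruning to a single head then amounts to keeping self.backbone and one entry of self.heads, which recovers the base architecture exactly.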
Lemma 1. Given a batch $B$, the sampled transformation loss $L_T(h_T, B)$ is an unbiased estimator of the transformation loss $L_T(h_T, X)$.

Proof.

$$\mathbb{E}_{B\sim\mathcal{D}^b}[L_T(h_T, B)] = \mathbb{E}_{B\sim\mathcal{D}^b}\big[\mathbb{E}_{t\in T}[L(h_t, t(B))]\big] = \mathbb{E}_{t\in T}\big[\mathbb{E}_{B\sim\mathcal{D}^b}[L(h_t, t(B))]\big] = \mathbb{E}_{t\in T}[L(h_t, t(X))] = L_T(h_T, X) \quad (5)$$

where the third equality uses $B \overset{iid}{\sim} \mathcal{D}^b$.
Algorithm 1: Training the TransNet model

input: TransNet model $h_T$, batch size $b$, maximum iterations number MAX_ITER
output: trained TransNet model

1: for $i = 1 \ldots$ MAX_ITER do
2:   sample a batch $B = \{(x_k, y_k)\}_{k=1}^b \overset{iid}{\sim} \mathcal{D}^b$
3:   forward:
4:   for $t \in T$ do
5:     $L(h_t, B) = \frac{1}{b}\sum_{k=1}^b \ell(h_t, t(x_k), y_k)$
6:   end
7:   $L_T(h_T, B) = \frac{1}{m}\sum_{t\in T} L(h_t, B)$
8:   backward (SGD): update the model's weights by differentiating the sampled loss $L_T(h_T, B)$
9: end
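A minimal sketch of one iteration of Alg. 1 in PyTorch follows. The function name transnet_step and its argument list are illustrative assumptions; the loss weighting mirrors the uniform average $\frac{1}{m}\sum_{t\in T}$ above:

```python
import torch

def transnet_step(backbone, heads, transforms, x, y, loss_fn, optimizer):
    """One SGD step on the sampled transformation loss L_T(h_T, B) (Alg. 1)."""
    losses = []
    for t, head in zip(transforms, heads):
        feats = backbone(t(x))                  # shared backbone f on the transformed batch
        losses.append(loss_fn(head(feats), y))  # per-head loss L(h_t, t(B))
    loss = torch.stack(losses).mean()           # L_T(h_T, B) = (1/m) sum over heads
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that all heads see the same labels y; only the inputs are transformed, so each head solves a different but related classification task.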
Which transformations should we use? Given a specific data-set, we distinguish between transformations that occur naturally in the data-set and transformations that do not. For example, a horizontal flip can naturally occur in the CIFAR-10 data-set, but not in the MNIST data-set. TransNet can only benefit from transformations that do not occur naturally in the target data-set, so that each head learns a well-defined and non-overlapping classification task. Transformations that occur naturally in the data-set are often used for data augmentation, as by definition they do not change the data domain.
Dihedral group $D_4$. As mentioned earlier, the TransNet model is defined by a set of input transformations $T$. We constrain $T$ to be a subset of the dihedral group $D_4$, which includes reflections and rotations by multiples of 90°. We denote a horizontal reflection by $m$ and a counterclockwise 90° rotation by $r$. Using these two elements we can express all the $D_4$ group elements as $\{r^i, m \circ r^i \mid i \in \{0, 1, 2, 3\}\}$. These transformations were chosen because, as mentioned in [10], their application is relatively efficient and does not leave artifacts in the image (unlike scaling or a change of aspect ratio).

Note that these transformations can be applied to any 3D tensor while operating on the height and width dimensions, including an input image as well as the model's kernels. When applying a transformation $t$ to the model's weights $\theta$, denoted $t(\theta)$, the notation implies that $t$ operates on the model's kernels separately, not affecting other layers such as the fully-connected ones (see Fig. 2).

Figure 2: The transformed input convolved with a kernel (upper path) equals the transformation applied to the output of the input convolved with the inversely transformed kernel (lower path).
Once trained, the full TransNet model can be viewed as an ensemble of $m$ classifiers with shared weights. Its time complexity is linear in the number of heads, almost equivalent to an ensemble of base CNN models, since the time needed to apply each of the $D_4$ transformations to the input is negligible compared to the time needed for the model to process the input. In contrast, its space complexity is almost equivalent to the space complexity of a single base CNN model.

We note that one can prune any of the model's heads, thus leaving a smaller ensemble of up to $m$ classifiers. A useful reduction prunes all the model's heads except one, typically the one corresponding to the identity transformation, which yields a regular CNN that is equivalent in terms of time and space complexity to the base architecture used to build the TransNet model. Having done so, we can evaluate the effect of the inductive bias of the TransNet architecture and its training algorithm solely on the training procedure, by comparing the pruned TransNet to the base CNN model (see Section 5).
4. Theoretical Analysis
In this section we analyze the TransNet model theoretically. We consider the following basic CNN architecture:

$$h_\theta = g \circ l_{inv} \circ \prod_{i=1}^k c_i \quad (6)$$

where $g$ denotes a fully-connected layer, $l_{inv}$ denotes a layer that is invariant under the $D_4$ transformations group (e.g., a global average pooling layer - GAP), and $\{c_i\}_{i\in[k]}$ denote convolutional layers. (While each convolutional layer may be followed by ReLU and Batch Normalization [14] layers, this doesn't change the analysis, so we omit the extra notation.) The TransNet model extends the basic model by appending additional heads:

$$h_{T,\theta} = \{g_t \circ l_{inv} \circ \prod_{i=1}^k c_i\}_{t\in T} \quad (7)$$

Each additional head adds 102K parameters. We denote the parameters of a fully-connected or a convolutional layer by subscripts of $w$ (weight) and $b$ (bias), e.g., $g(x) = g_w \cdot x + g_b$.

Transformations in the dihedral group $D_4$ satisfy another important property, expressed by the following proposition:
Proposition 1. Let $h_\theta$ denote a CNN model in which the last convolutional layer is followed by a layer that is invariant under the $D_4$ group. Then any transformation $t \in D_4$ applied to the input image can be compiled into the model's weights $\theta$ as follows:

$$\forall t \in D_4, \; \forall x \in X: \quad h_\theta(t(x)) = h_{t^{-1}(\theta)}(x) \quad (8)$$
Proof. By induction on $k$ we can show that:

$$\prod_{i=1}^k c_i \circ t(x) = t \circ \prod_{i=1}^k t^{-1}(c_i)(x) \quad (9)$$

(see Fig. 2). Plugging (9) into (6), we get:

$$h_\theta(t(x)) = g \circ l_{inv} \circ \prod_{i=1}^k c_i \circ t(x) = g \circ l_{inv} \circ t \circ \prod_{i=1}^k t^{-1}(c_i)(x) = g \circ l_{inv} \circ \prod_{i=1}^k t^{-1}(c_i)(x) = h_{t^{-1}(\theta)}(x)$$

where the third equality uses $l_{inv} \circ t = l_{inv}$.
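The base case of equation (9) followed by a GAP layer can be checked numerically. The sketch below is our own verification under assumed settings (stride-1 convolution with symmetric zero padding and a square input); it is not code from the paper:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)                         # square input image
w = torch.randn(4, 1, 3, 3)                         # kernels of one conv layer
t    = lambda z: torch.rot90(z, 1, dims=[-2, -1])   # t: 90° rotation
tinv = lambda z: torch.rot90(z, -1, dims=[-2, -1])  # t^{-1}

gap = lambda z: z.mean(dim=[-2, -1])                # GAP is invariant under D4

# h_theta(t(x)) == h_{t^{-1}(theta)}(x): transform the input, or
# inversely transform the kernels -- the pooled outputs agree.
lhs = gap(F.conv2d(t(x), w, padding=1))
rhs = gap(F.conv2d(x, tinv(w), padding=1))
assert torch.allclose(lhs, rhs, atol=1e-5)
```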
Implication. The ResNet model [11] used in our experiments satisfies the pre-condition of the proposition stated above, since it contains a GAP layer [19] after the last convolutional layer, and GAP is invariant under $D_4$.

In order to acquire intuition regarding the inductive bias implied by training algorithm Alg. 1, we consider two cases, a single and a double headed model, trained with the same training algorithm. A single headed model is a special case of the full multi-headed model, where all the heads share weights, $h_t(t(x)) = h(t(x)) \; \forall t$, and the loss in line 5 of Alg. 1 becomes $L(h, B) = \frac{1}{b}\sum_{k=1}^b \ell(h, t(x_k), y_k)$.

As it is hard to analyze non-convex deep neural networks, we focus on a simplified framework and consider a convex optimization problem, where the loss function is convex w.r.t. the model's parameters $\theta$. We also assume that the model's transformations $T$ form a group. ($T$ being a group is a technical constraint needed for the analysis; it is not required by the algorithm.)

Single headed model analysis.
In this simplified case, we can prove the following strict proposition:

Proposition 2. Let $h_\theta$ denote a CNN model satisfying the pre-condition of Prop. 1, and let $T \subset D_4$ be a transformations group. Then the optimal transformation loss $L_T$ (see Eq. 4) is obtained by model weights that are invariant under the transformations $T$. Formally:

$$\exists \theta: \; (\forall t \in T: \theta = t(\theta)) \; \wedge \; (\theta \in \arg\min_\theta L_T(\theta, X))$$

Proof. To simplify the notations, henceforth we let $\theta$ denote the model $h_\theta$.

$$\begin{aligned}
L_T(\theta, X) &= \mathbb{E}_{t\in T}[L(\theta, t(X))] \\
&= \mathbb{E}_{t\in T}\big[\mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(\theta, t(x), y)]\big] \\
&= \mathbb{E}_{t\in T}\big[\mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(t^{-1}(\theta), x, y)]\big] && \text{(by Prop. 1)} \\
&= \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathbb{E}_{t\in T}[\ell(t^{-1}(\theta), x, y)]\big] \\
&\geq \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(\mathbb{E}_{t\in T}[t^{-1}(\theta)], x, y)\big] && \text{(Jensen's inequality, convexity of } \ell\text{)} \\
&= \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(\bar\theta, x, y)] && (\bar\theta := \mathbb{E}_{t\in T}[t(\theta)], \; T = T^{-1}) \\
&= L(\bar\theta, X) \\
&= \mathbb{E}_{t\in T}[L(t^{-1}(\bar\theta), X)] && (\bar\theta \text{ is invariant under } T) \\
&= \mathbb{E}_{t\in T}[L(\bar\theta, t(X))] && \text{(by Prop. 1)} \\
&= L_T(\bar\theta, X)
\end{aligned}$$

Above we use the fact that $\bar\theta$ is invariant under $T$: since $T$ is a group, $t \circ T = T$, hence

$$t(\bar\theta) = t\big(\mathbb{E}_{t'\in T}[t'(\theta)]\big) = \mathbb{E}_{t'\in T}[t \circ t'(\theta)] = \mathbb{E}_{t''\in T}[t''(\theta)] = \bar\theta$$
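The projection $\bar\theta = \mathbb{E}_{t\in T}[t(\theta)]$ used in the proof is easy to check numerically; this small sketch (our own illustration) verifies that averaging a tensor over the rotation group yields a rotation-invariant tensor:

```python
import torch

# the rotation group {r^i}, i = 0..3; the default k=k avoids Python's late binding
rotations = [lambda x, k=k: torch.rot90(x, k, dims=[-2, -1]) for k in range(4)]

w = torch.randn(8, 8)
w_bar = torch.stack([t(w) for t in rotations]).mean(dim=0)   # theta_bar = E_t[t(theta)]
for t in rotations:
    assert torch.allclose(t(w_bar), w_bar, atol=1e-6)        # theta_bar invariant under T
```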
Double headed model. In light of Prop. 2, we now present a counter-example, which shows that Prop. 2 isn't true for the general TransNet model.

Example 1. Let $T = \{t_1 = r^0, t_2 = m \circ r^0\} \subset D_4$ denote the transformations group consisting of the identity and the reflection $m$. Let $h_{T,\theta} = \{h_i = g_i \circ GAP \circ c\}_{i=1}^2$ denote a double headed TransNet model, which comprises a single convolutional layer $c$ (1 channel in and 2 channels out), followed by a GAP layer and then 2 fully-connected layers $\{g_i\}_{i=1}^2$, one for each head. Each $g_i$ outputs a vector of size 2. The data-set $X = \{(x_1, y_1), (x_2, y_2)\}$ consists of 2 examples, where $x_2 = t_2(x_1)$, $y_1 = 1$ and $y_2 = 2$. (This example may seem rather artificial, but in fact this isn't such a rare case; e.g., the airplane and the ship classes, both found in the CIFAR-10 data-set, share a similar blue background.)

Now, assume the model's convolutional layer $c$ is composed of 2 kernels that are invariant under $T$, and denote it by $c_{inv}$. Let $i \in \{1, 2\}$; then:

$$h_i(x_2) = h_i(t_2(x_1)) = g_i \circ GAP \circ c_{inv} \circ t_2(x_1) = g_i \circ GAP \circ c_{inv}(x_1) = h_i(x_1) \quad (10)$$

In this case both heads predict the same output for two inputs with different labels, thus:

$$L(h_i, t_i(X)) > 0 \implies L_T(h_{T,\theta}, X) > 0$$

In contrast, by setting $c_w = (x_1, x_2)$ and $c_b = (0, 0)$, which isn't invariant under $T$, together with suitably chosen fully-connected parameters $g_{1,w}, g_{1,b}, g_{2,w}, g_{2,b}$, we obtain:

$$L(h_i, t_i(X)) = 0 \implies L_T(h_{T,\theta}, X) = 0$$

We may conclude that the optimal model's kernels aren't invariant under $T$, in contradiction to the claim of Prop. 2.

Discussion.
The intuition we derive from the analysis above is that the training algorithm (Alg. 1) imposes an invariance-promoting inductive bias on the model's kernels, as proved for the single headed model, while not strictly enforcing invariance, as shown by the counter-example of the double headed model.
5. Experimental Results

Data-sets.
For evaluation we used the 5 image classification data-sets detailed in Table 1. These diverse data-sets allow us to evaluate our method across different image resolutions and numbers of predicted classes.
Implementation Details.
We employed the ResNet18 [11] architecture for all the data-sets except for ImageNet-200, which was evaluated using the ResNet50 architecture (see Appendix A for more implementation details).
Name             Classes   Train/Test Samples   dim
CIFAR-10 [15]    10        50K/10K              32
CIFAR-100 [15]   100       50K/10K              32
ImageNette [13]  10        10K/4K               224
ImageWoof [13]   10        10K/4K               224
ImageNet-200     200       260K/10K             224

Table 1: The data-sets used in our experiments. The dimension of each example, a color image, is dim × dim × 3 pixels. ImageNette represents 10 easy-to-classify classes from ImageNet [4], while ImageWoof represents 10 hard-to-classify classes of dog breeds from ImageNet. ImageNet-200 represents 200 classes from ImageNet (same classes as in [17]) of full size images.

Notations.

• "base CNN" - a regular convolutional neural network, identical to the TransNet model with only the head corresponding to the identity transformation.
• "PT m-CNN" - a pruned TransNet model trained with m heads, where a single head is left and used for prediction. (In our experiments we chose the head associated with the identity ($r^0$) transformation when evaluating a pruned TransNet. Note, however, that we could have chosen the best head in terms of accuracy, as it follows from Prop. 1 that its transformation can be compiled into the model's weights.) It has the same space and time complexity as the base CNN.
• "T m-CNN" - a full TransNet model trained with m heads, where all are used for prediction. It has roughly the same space complexity as, and m times the time complexity of, the base CNN.

To denote an ensemble of the models above, we add a suffix of a number in parentheses, e.g., T2-CNN (3) is an ensemble of 3 T2-CNN models. We now compare the accuracy of the "base-CNN", "PT m-CNN" and "T m-CNN" models, where m = 2, 3, 4 denotes the number of heads of the TransNet model, and their ensembles, across all the data-sets listed in Table 1.
Models with the same space and time complexity.
First, we evaluate the pruned TransNet model by comparing the "PT m-CNN" models with the "base-CNN" model, see Table 2. Essentially, we evaluate the effect of using the TransNet model only for training, as the final "PT m-CNN" models are identical to the "base-CNN" model regardless of m. We can clearly see the inductive bias implied by the training procedure. We also see that TransNet training improves the accuracy of the final "base-CNN" classifier across all the evaluated data-sets.
Models with similar space complexity, different time complexity. Next, we evaluate the full TransNet model by comparing the "T m-CNN" models with the "base-CNN" model, see Table 3. Despite the fact that the full TransNet model processes the (transformed) input m times, as compared to once for the "base-CNN" model, its architecture is not significantly larger than the base-CNN's: the full TransNet adds to the "base-CNN" only a negligible number of parameters, in the form of its multiple heads. Clearly the full TransNet model improves the accuracy as compared to the "base-CNN" model, and also as compared to the pruned TransNet model. Thus, if the additional runtime complexity during test is not an issue, it is beneficial to employ the full TransNet model during test time. In fact, one can process the input image once, and then choose whether to continue processing it with the other heads to improve the prediction, all this while keeping roughly the same space complexity.
Ensembles: models with similar time complexity, different space complexity. Here we evaluate ensembles of pruned TransNet models, and compare them to a single full TransNet model that can be seen as a space-efficient ensemble: a full TransNet generates m predictions with only 1/m as many parameters, where m is the number of TransNet heads. Results are shown in Fig. 3. Clearly an ensemble of pruned TransNet models is superior to an ensemble of base CNN models, suggesting that the accuracy gain achieved by the pruned TransNet model doesn't overlap with the accuracy gain achieved by using an ensemble of classifiers. Furthermore, we observe that the full TransNet model exhibits competitive accuracy, with 2 and 3 heads, as compared to an ensemble of 2 or 3 base CNN models respectively, while utilizing 1/2 and 1/3 as many parameters respectively.

Figure 3: Model accuracy as a function of the number of instances (X-axis) processed during prediction. Each instance requires a complete run from input to output. An ensemble includes: m independent base CNN classifiers for "CNN"; m pruned TransNet models trained with 2 heads for "PT2-CNN"; and one TransNet model with m heads, where m is the ensemble size, for "T m-CNN".

MODEL      CIFAR-10   CIFAR-100   ImageNette   ImageWoof   ImageNet-200
base-CNN   95.57 ±    ...         ...          ...         ...
...

Table 2: Accuracy of models with the same space and time complexity, comparing the base CNN with pruned TransNet models "PT m-CNN", where m = 2, 3, 4 denotes the number of heads in training. Mean and standard error for 3 repetitions are shown.

MODEL      CIFAR-10   CIFAR-100   ImageNette   ImageWoof   ImageNet-200
base-CNN   95.57 ±    ...         ...          ...         ...
...

Table 3: Accuracy of models with similar space complexity and different time complexity, comparing the base CNN with full TransNet models. With m denoting the number of heads, chosen to be 2, 3 or 4, the prediction time complexity of the respective TransNet model "T m-CNN" is m times larger than that of the base CNN. Mean and standard error for 3 repetitions are shown.

Accuracy vs. generalization.
In Fig. 3 we can see that 2 heads improve the model's performance across all data-sets, 3 heads improve it on most of the data-sets, and 4 heads actually reduce performance on most data-sets. We hypothesize that too many heads impose too strict an inductive bias on the model's kernels. Thus, although generalization is improved, test accuracy is reduced due to insufficient variance. Further analysis is presented in the next section.
We've seen in Section 5.1 that the TransNet model, whether full or pruned, achieves better test accuracy as compared to the base CNN model. This occurs despite the fact that the transformation loss $L_T(h_T, X)$ minimized by the TransNet model is more demanding than the loss $L(h, X)$ minimized by the base CNN, and appears harder to optimize. This conjecture is justified by the following lemma:

Lemma 2. Let $h_T$ denote a TransNet model that obtains a transformation loss of $a := L_T(h_T, X)$. Then there exists a reduction from $h_T$ to the base CNN model $h$ that obtains a loss of at most $a$, i.e., $L(h, X) \leq a$.

Proof. $a = L_T(h_T, X) = \mathbb{E}_{t\in T}[L(h_{\theta_t}, t(X))]$, so there must be a transformation $t \in T$ s.t. $L(h_{\theta_t}, t(X)) \leq a$. Now, one can compile the transformation $t$ into $h_{\theta_t}$ (see Prop. 1) and get a base CNN $\tilde h = h_{t^{-1}(\theta_t)}$, which obtains $L(\tilde h, X) = L(h_{t^{-1}(\theta_t)}, X) = L(h_{\theta_t}, t(X)) \leq a$.

Why is it, then, that the TransNet model achieves overall better accuracy than the base CNN? The answer lies in its ability to achieve better generalization. In order to measure the generalization capability of a model w.r.t. a data-set, we use the ratio between the test-set and train-set loss, where a lower ratio indicates better generalization. As illustrated in Fig. 4, the pruned TransNet models clearly exhibit better generalization when compared to the base CNN model. Furthermore, the generalization improvement increases with the number of TransNet model heads, which are only used for training and then pruned. The observed narrowing of the generalization gap occurs because, although the TransNet model slightly increases the training loss, it more significantly decreases the test loss as compared to the base CNN.
Figure 4: CIFAR-100 results. Left panel: learning curve of the base CNN model ("base-CNN") and a pruned TransNet model ("PT2-CNN"). Right panel: generalization score - the test/train loss ratio - measured for the base-CNN model and various pruned TransNet models with a different number of heads.
We note that better generalization does not necessarily imply a better model. The "PT4-CNN" model generalizes better than any other model (see right panel of Fig. 4), but its test accuracy is lower, as seen in Table 2.
What characterizes the beneficial inductive bias implied by the TransNet model and its training algorithm Alg. 1? To answer this question, we investigate the emerging invariance of kernels in the convolutional layers of the learned network, w.r.t. the TransNet transformations set $T$.

We start by introducing the "Invariance Score" (IS), which measures how invariant a 3D tensor is w.r.t. a transformations group. Specifically, given a convolutional kernel denoted by $w$ (a 3D tensor) and a transformations group $T$, the IS score is defined as follows:

$$IS(w, T) := \min_{u \in INV_T} \|w - u\| \quad (11)$$

where $INV_T$ is the set of invariant kernels (of the same shape as $w$) under $T$, i.e., $INV_T := \{u : u = t(u) \; \forall t \in T\}$.

Lemma 3. $\arg\min_{u \in INV_T} \|w - u\| = \mathbb{E}_{t\in T}[t(w)]$

Proof. Let $u$ be an invariant tensor under $T$. Define $f(u) := \|w - u\|^2$, and note that $\arg\min_{u \in INV_T} \|w - u\| = \arg\min_{u \in INV_T} f(u)$.

$$\begin{aligned}
f(u) = \|w - u\|^2 &= \mathbb{E}_{t\in T}\big[\|w - t(u)\|^2\big] && (u \text{ is invariant under } T) \\
&= \mathbb{E}_{t\in T}\big[\|t^{-1}(w) - u\|^2\big] \\
&= \mathbb{E}_{t\in T}\big[\|t(w) - u\|^2\big] && (T = T^{-1}) \\
&= \mathbb{E}_{t\in T}\Big[\sum_{i=1}^{size(w)} (t(w)_i - u_i)^2\Big]
\end{aligned}$$

where the index $i$ runs over all the tensor's elements. Finally, we differentiate $f$ to obtain its minimum:

$$\frac{\partial f}{\partial u_i} = \mathbb{E}_{t\in T}[-2(t(w)_i - u_i)] = 0 \implies u_i = \mathbb{E}_{t\in T}[t(w)_i] \implies u = \mathbb{E}_{t\in T}[t(w)]$$

Lemma 3 gives a closed-form expression for the IS gauge:

$$IS(w, T) = \|w - \mathbb{E}_{t\in T}[t(w)]\| \quad (12)$$

Equipped with this gauge, we can inspect the invariance level of the model's kernels w.r.t. a transformations group. Note that this measure allows us to compare the full TransNet model with the base CNN model, as both share the same convolutional layers. Since the transformations of the TransNet model don't necessarily form a group, we use the minimal group containing these transformations - the group of all rotations $\{r^i\}_{i=0}^3$.

In Fig. 5 we can see that the full TransNet model "T2-CNN" and the base CNN model demonstrate a similar invariance level in all the convolutional layers but the last one. In Fig. 6, where the distribution of the IS score over the last layer of 4 different models is fully shown, we can see more clearly that the last convolutional layer of full TransNet models exhibits a much higher invariance level as compared to the base CNN. This phenomenon is robust to the metric used in the IS definition, with similar results when using "Pearson Correlation" or "Cosine Similarity". The increased invariance in the last convolutional layer increases monotonically with the number of heads in the TransNet model, which is consistent with the generalization capability of these models (see Fig. 4).
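Using the closed form (12), the IS score is a one-liner. The sketch below is our own illustration of computing it for a layer's kernels w.r.t. the rotation group:

```python
import torch

# the rotation group {r^i}, i = 0..3; the default k=k avoids Python's late binding
rotations = [lambda x, k=k: torch.rot90(x, k, dims=[-2, -1]) for k in range(4)]

def invariance_score(w, transforms):
    """IS(w, T) = ||w - E_{t in T}[t(w)]|| (Eq. 12); lower means more invariant."""
    mean_tw = torch.stack([t(w) for t in transforms]).mean(dim=0)
    return torch.norm(w - mean_tw).item()

w = torch.randn(64, 3, 3, 3)           # kernels of one convolutional layer
print(invariance_score(w, rotations))  # approximately 0 for rotation-invariant kernels
```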
Figure 5: CIFAR-100 results, plotting the distribution of the IS scores (mean and std) for the kernels in each layer of the different models. Invariance is measured w.r.t. the group of 90° rotations.

Figure 6: CIFAR-100 results, plotting the full distribution of the IS scores for the kernels in the last (17-th) layer of the different models. Invariance is measured w.r.t. the group of 90° rotations.
The generalization improvement achieved by the TransNet model, as reported in Section 5.2, may be explained by this increased level of invariance, as highly invariant kernels have fewer degrees of freedom, and should therefore be less prone to overfit.

Our method consists of 2 main components - the TransNet architecture and the training algorithm Alg. 1. To evaluate the accuracy gain of each component we consider two variations:
• Architecture only: in this method we train the multi-headed architecture (2 heads in this case) by feeding each head the same untransformed batch (equivalent to a TransNet model with the multi-set of {id, id} transformations). Prediction is retrieved from a single head (similar to PT2-CNN).
• Algorithm only: in this method we train the base (one headed) model with the same algorithm Alg. 1. (This model was also considered in the theoretical analysis of Section 4.2, termed the single headed model.)

MODEL      CIFAR-10   CIFAR-100   ImageNette   ImageWoof   ImageNet-200
base-CNN   95.57 ±    ...         ...          ...         ...
...

Table 4: Accuracy of the ablation study models with the same space and time complexity. These 4 models enable us to evaluate the effect of the TransNet architecture and the TransNet algorithm separately. Mean and standard error for 3 repetitions are shown.

We compare the two methods above to the regular "base-CNN" model and the complete model "PT2-CNN", see Table 4. We can see that using only one of the components doesn't yield any significant accuracy gain. This suggests that the complete model benefits from both components working together: the training algorithm increases the invariance of the model's kernels on the one hand, while the multi-head architecture encourages the model to capture meaningful orientation information on the other hand.
6. Summary
We introduced a model inspired by self-supervision, which consists of a base CNN model attached to multiple heads, each corresponding to a different transformation from a fixed set of transformations. The self-supervised aspect of the model is crucial, as the chosen transformations must not occur naturally in the data. When the model is pruned back to match the base CNN, it achieves better test accuracy and improved generalization, which is attributed to the increased invariance of the model's kernels in the last layer. We observed that excess invariance, while improving generalization, eventually curtails the test accuracy.

We evaluated our model on various image data-sets, observing that each data-set achieves its own optimal level of kernel invariance, i.e., there is no optimal number of heads for all data-sets. Finally, we introduced an invariance score gauge (IS), which measures the level of invariance achieved by the model's kernels. IS may be leveraged to determine the optimal invariance level, as well as potentially function as an independent regularization term.

Acknowledgements
This work was supported in part by a grant from the Israel Science Foundation (ISF) and by the Gatsby Charitable Foundation.
References

[1] Christopher Clark and Amos Storkey. Training deep convolutional neural networks to play go. In International Conference on Machine Learning, pages 1766–1774, 2015.
[2] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pages 2990–2999, 2016.
[3] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[5] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660, 2016.
[6] Sander Dieleman, Kyle W. Willett, and Joni Dambre. Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly Notices of the Royal Astronomical Society, 450(2):1441–1459, 2015.
[7] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
[8] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1734–1747, 2015.
[9] Robert Gens and Pedro M. Domingos. Deep symmetry networks. In Advances in Neural Information Processing Systems, pages 2537–2545, 2014.
[10] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[12] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervised learning can improve model robustness and uncertainty. In Advances in Neural Information Processing Systems, pages 15663–15674, 2019.
[13] Jeremy Howard. Imagewang.
[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[15] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[16] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In European Conference on Computer Vision, pages 577–593. Springer, 2016.
[17] Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge. CS 231N, 7, 2015.
[18] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
[19] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[20] Joseph L. Mundy, Andrew Zisserman, et al. Geometric Invariance in Computer Vision, volume 92. MIT Press, Cambridge, MA, 1992.
[21] Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W. Koh, Quoc V. Le, and Andrew Y. Ng. Tiled convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1279–1287, 2010.
[22] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. arXiv preprint arXiv:1603.09246, 2016.
[23] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting, 2016.
[24] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3D face model for pose and illumination invariant face recognition. Pages 296–301. IEEE, 2009.
[25] Ling Shao, Fan Zhu, and Xuelong Li. Transfer learning for visual categorization: A survey. IEEE Transactions on Neural Networks and Learning Systems, 26(5):1019–1034, 2014.
[26] Laurent Sifre and Stéphane Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1233–1240, 2013.
[27] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[28] Pavan Turaga and Rama Chellappa. Locally time-invariant models of human activities using trajectories on the Grassmannian. Pages 2435–2441. IEEE, 2009.
[29] Karl Weiss, Taghi M. Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.
[30] Fa Wu, Peijun Hu, and Dexing Kong. Flip-rotate-pooling convolution and split dropout on convolutional neural networks for image classification. arXiv preprint arXiv:1507.08754, 2015.
[31] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-supervised semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1476–1485, 2019.
[32] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
[33] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In AAAI, pages 13001–13008, 2020.
Appendix

A. Implementation details
We employed the ResNet [11] architecture, specifically the ResNet18 architecture for all the data-sets except for ImageNet-200, which was evaluated using the ResNet50 architecture. It is important to note that we did not change the hyper-parameters used by the regular CNN architecture on which TransNet is based. This may strengthen the results, as one may fine tune these hyper-parameters to best suit the TransNet model.

We used a weight decay of 0.0001 and momentum of 0.9. The model was trained with a batch size of 64 for all the data-sets except for ImageNet-200, where we increased the batch size to 128. We trained the model for 300 epochs, starting with a learning rate of 0.1, divided by 10 at the 150th and 225th epochs, except for the ImageNet-200 model, which was trained for 120 epochs, starting with a learning rate of 0.1, divided by 10 at the 40th and 80th epochs. We normalized the images as usual by subtracting the image's mean and dividing by the image's standard deviation (color-wise).

We employed a mild data augmentation scheme - horizontal flip with probability 0.5. For the CIFAR data-sets we padded each dimension by 4 pixels and cropped randomly (uniformly) a 32 × 32 patch from the enlarged image [18], while for the ImageNet family data-sets we cropped randomly (uniformly) a 224 × 224 patch from the original image.

At test time, we took the original image for the CIFAR data-sets and a center crop for the ImageNet family data-sets. The prediction of each model is the mean of the model's output on the original image and a horizontally flipped version of it. Note that a horizontal flip occurs naturally in every data-set we use for evaluation, and therefore isn't associated with any of the model's heads.