More Is More -- Narrowing the Generalization Gap by Adding Classification Heads
Roee Cates, The Hebrew University of Jerusalem, [email protected]
Daphna Weinshall, The Hebrew University of Jerusalem, [email protected]
Abstract
Overfit is a fundamental problem in machine learning in general, and in deep learning in particular. In order to reduce overfit and improve generalization in the classification of images, some methods employ invariance to a group of transformations, such as rotations and reflections. However, since not all objects necessarily exhibit the same invariance, it seems desirable to allow the network to learn the useful level of invariance from the data. To this end, motivated by self-supervision, we introduce an architecture enhancement for existing neural network models based on input transformations, termed 'TransNet', together with a training algorithm suitable for it. Our model can be employed during training time only and then pruned for prediction, resulting in an architecture equivalent to the base model. Thus pruned, we show that our model improves performance on various data-sets while exhibiting improved generalization, which is achieved in turn by enforcing soft invariance on the convolutional kernels of the last layer in the base model. Theoretical analysis is provided to support the proposed method.
1. Introduction
Deep neural network models currently define the state of the art in many computer vision tasks, as well as in speech recognition and other areas. These expressive models are able to capture complicated input-output relations. At the same time, models of such large capacity are often prone to overfit, i.e., to perform significantly better on the training set than on the test set. This phenomenon is also called the generalization gap.

We propose a method to narrow this generalization gap. Our model, which is called TransNet, is defined by a set of input transformations. It augments an existing Convolutional Neural Network (CNN) architecture by allocating a specific head - a fully-connected layer which receives as input the penultimate layer of the base CNN - for each input transformation (see Fig. 1).

Figure 1: Illustration of the TransNet architecture, which consists of 2 heads associated with 2 transformations, the identity and rotation by 90°. Each head classifies images transformed by the associated transformation, while both share the same backbone.

The transformations associated with the model's heads are not restricted a priori. The idea behind the proposed architecture is that each head can specialize in a different yet related classification task. We note that any CNN model can be viewed as a special case of the TransNet model, consisting of a single head associated with the identity transformation.
The overall task is typically harder when training TransNet than when training the base CNN architecture. Yet by training multiple heads, which share the convolutional backbone, we hope to reduce the model's overfit by providing a form of regularization.

In Section 3 we define the basic model and the training algorithm designed to train it (see Alg. 1). We then discuss the type of transformations that can be useful when learning to classify images. We also discuss the model's variations: (i) a pruned version, which employs multiple heads during training and then keeps only the head associated with the identity transformation for prediction; (ii) the full version, where all heads are used in both training and prediction.

Theoretical investigation of this model is provided in Section 4, using the dihedral group of transformations ($D_4$) that includes rotations by multiples of 90° and reflections. We first prove that under certain mild assumptions, instead of applying each dihedral transformation to the input, one can compile it into the CNN model's weights by applying the inverse transformation to the convolutional kernels. In order to obtain intuition about the inductive bias of the model's training algorithm in complex realistic frameworks, we analyze the model's inductive bias using a simplified framework.

In Section 5 we describe our empirical results. We first introduce a novel invariance score (IS), designed to measure the model's kernel invariance under a given group of transformations. IS effectively measures the inductive bias imposed on the model's weights by the training algorithm. To achieve a fair comparison, we compare a regular CNN model, traditionally trained, to the same model trained like a TransNet model as follows: heads are added to the base model, it is trained as a TransNet model, and then the extra heads are pruned. We then show that training as TransNet improves test accuracy as compared to the base model. This improvement was achieved while keeping the optimized hyper-parameters of the base CNN model, suggesting that further improvement by fine tuning may be possible. We also demonstrate the increased invariance of the model's kernels when trained with TransNet.

Our Contribution

• Introduce TransNet - a model inspired by self-supervision for supervised learning, which imposes partial invariance to a group of transformations.
• Introduce an invariance score (IS) for CNN convolutional kernels.
• Theoretical investigation of the inductive bias implied by the TransNet training algorithm.
• Demonstrate empirically how both the full and pruned versions of TransNet improve accuracy.
2. Related Work
Overfit.
A fundamental and long-standing issue in machine learning, overfit occurs when a learning algorithm minimizes the train loss but generalizes poorly to the unseen test set. Many methods were developed to mitigate this problem, including early stopping - where training is halted as soon as the loss over a validation set starts to increase - and regularization - where a penalty term is added to the optimization loss. Other related ideas, which achieve similar goals, include dropout [27], batch normalization [14], transfer learning [25, 29], and data augmentation [3, 33].
Self-Supervised Learning.
A family of learning algorithms that train a model using self-generated labels (e.g., the orientation of an image), in order to exploit unlabeled data as well as extract more information from labeled data. Self-training algorithms are used for representation learning, by training a deep network to solve pretext tasks whose labels can be produced directly from the data. Such tasks include colorization [32, 16], placing image patches in the right place [22, 7], inpainting [23] and orientation prediction [10]. Typically, self-supervision is used in unsupervised learning [8], to impose some structure on the data, or in semi-supervised learning [31, 12]. Our work is motivated by RotNet, an orientation prediction method suggested by [10]. It differs from [31, 12], as we allocate a specific classification head for each input transformation rather than predicting the self-supervised label with a separate head.
Equivariant CNNs.
Many computer vision algorithms are designed to exhibit some form of invariance to a transformation of the input, including geometric transformations [20], transformations of time [28], or changes in pose and illumination [24]. Equivariance is a more relaxed property, exploited for example by CNN models where translation is concerned. Work on CNN models that enforce strict equivariance includes [26, 9, 1, 21, 2, 5]. Like these methods, our method seeks to achieve invariance by employing weight sharing of the convolution layers between multiple heads. But unlike these methods, the invariance constraint is soft. Soft equivariance is also seen in works like [6], which employs a convolutional layer that simultaneously feeds rotated and flipped versions of the original image to a CNN model, or [30], which appends rotated and reflected versions of each convolutional kernel.
3. TransNet
Notations and definitions
Let $X = \{(x_i, y_i)\}_{i=1}^n$ denote the training data, where $x_i \in \mathbb{R}^d$ denotes the i-th data point and $y_i \in [K]$ its corresponding label. Let $\mathcal{D}$ denote the data distribution from which the samples are drawn. Let $\mathcal{H}$ denote the set of hypotheses, where $h_\theta \in \mathcal{H}$ is defined by its parameters $\theta$ (often we use $h = h_\theta$ to simplify notations). Let $\ell(h, x, y)$ denote the loss of hypothesis $h$ when given sample $(x, y)$. The overall loss is:

$$L(h, X) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h, x, y)] \quad (1)$$

Our objective is to find the optimal hypothesis:

$$h^* := \arg\min_{h\in\mathcal{H}} L(h, X) \quad (2)$$

For simplicity, whenever the underlying distribution of a random variable isn't explicitly defined we use the uniform distribution, e.g., $\mathbb{E}_{a\in A}[a] = \frac{1}{|A|}\sum_{i=1}^{|A|} a_i$.
The TransNet architecture is defined by a set of input transformations $T = \{t_j\}_{j=1}^m$, where each transformation $t \in T$ operates on the inputs ($t: \mathbb{R}^d \to \mathbb{R}^d$) and is associated with a corresponding model's head. Thus each transformation operates on datapoint $x$ as $t(x)$, and the transformed data-set is defined as:

$$t(X) := \{(t(x_i), y_i)\}_{i=1}^n \quad (3)$$

Given an existing NN model $h$, henceforth called the base model, we can split it into two components: all the layers except for the last one, denoted $f$, and the last layer $g$, assumed to be a fully-connected layer. Thus $h = g \circ f$. Next, we enhance model $h$ by replacing $g$ with $|T| = m$ heads, where each head is an independent fully-connected layer $g_t$ associated with a specific transformation $t \in T$. Formally, each head is defined by $h_t = g_t \circ f$, and it operates on the correspondingly transformed input as $h_t(t(x))$.

The full model, with its $m$ heads, is denoted $h_T := \{h_t\}_{t\in T}$, and operates on the input as $h_T(x) := \mathbb{E}_{t\in T}[h_t(t(x))]$. The corresponding loss of the full model is defined as:

$$L_T(h_T, X) := \mathbb{E}_{t\in T}[L(h_t, t(X))] \quad (4)$$

Note that the resulting model (see Fig. 1) essentially represents $m$ models, which share via $f$ all the weights up to the last fully-connected layer. Each of these models can be used separately, as we do later on.

Our method uses SGD with a few modifications to minimize the transformation loss (4), as detailed in Alg. 1. Relying on the fact that each batch is sampled i.i.d. from $\mathcal{D}$, we can prove (see Lemma 1) the desirable property that the sampled loss $L_T(h_T, B)$ is an unbiased estimator of the transformation loss $L_T(h_T, X)$. This justifies the use of Alg. 1 to optimize the transformation loss.
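To make the construction concrete, the following PyTorch sketch shows one possible way to wrap an existing backbone with per-transformation heads. It is a minimal illustration under our own naming assumptions (the class TransNet and the arguments feat_dim and transforms are not code from the paper):

```python
import torch
import torch.nn as nn

class TransNet(nn.Module):
    """Base model h = g o f, extended to heads h_t = g_t o f, one per t in T."""
    def __init__(self, backbone, feat_dim, num_classes, transforms):
        super().__init__()
        self.backbone = backbone          # f: all layers up to the penultimate one
        self.transforms = transforms      # T = {t_j}: callables acting on image batches
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in transforms])  # one g_t per t

    def forward(self, x):
        # full model: h_T(x) = E_{t in T}[h_t(t(x))], the average over all heads
        outs = [g(self.backbone(t(x))) for t, g in zip(self.transforms, self.heads)]
        return torch.stack(outs).mean(dim=0)
```

Pruning to a single head then amounts to keeping self.backbone and one entry of self.heads, which recovers the base architecture exactly.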
Lemma 1. Given a batch $B$, the sampled transformation loss $L_T(h_T, B)$ is an unbiased estimator of the transformation loss $L_T(h_T, X)$.

Proof.

$$\mathbb{E}_{B\sim\mathcal{D}^b}[L_T(h_T, B)] = \mathbb{E}_{B\sim\mathcal{D}^b}\big[\mathbb{E}_{t\in T}[L(h_t, t(B))]\big] = \mathbb{E}_{t\in T}\big[\mathbb{E}_{B\sim\mathcal{D}^b}[L(h_t, t(B))]\big] = \mathbb{E}_{t\in T}[L(h_t, t(X))] = L_T(h_T, X) \quad (5)$$

where the third equality uses $B \overset{iid}{\sim} \mathcal{D}^b$.
Algorithm 1: Training the TransNet model

input: TransNet model $h_T$, batch size $b$, maximum iterations number MAX_ITER
output: trained TransNet model

1: for $i = 1 \ldots$ MAX_ITER do
2:   sample a batch $B = \{(x_k, y_k)\}_{k=1}^b \overset{iid}{\sim} \mathcal{D}^b$
3:   forward:
4:   for $t \in T$ do
5:     $L(h_t, B) = \frac{1}{b}\sum_{k=1}^b \ell(h_t, t(x_k), y_k)$
6:   end
7:   $L_T(h_T, B) = \frac{1}{m}\sum_{t\in T} L(h_t, B)$
8:   backward (SGD): update the model's weights by differentiating the sampled loss $L_T(h_T, B)$
9: end
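A minimal sketch of one iteration of Alg. 1 in PyTorch follows. The function name transnet_step and its argument list are illustrative assumptions; the loss weighting mirrors the uniform average $\frac{1}{m}\sum_{t\in T}$ above:

```python
import torch

def transnet_step(backbone, heads, transforms, x, y, loss_fn, optimizer):
    """One SGD step on the sampled transformation loss L_T(h_T, B) (Alg. 1)."""
    losses = []
    for t, head in zip(transforms, heads):
        feats = backbone(t(x))                  # shared backbone f on the transformed batch
        losses.append(loss_fn(head(feats), y))  # per-head loss L(h_t, t(B))
    loss = torch.stack(losses).mean()           # L_T(h_T, B) = (1/m) sum over heads
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that all heads see the same labels y; only the inputs are transformed, so each head solves a different but related classification task.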
Which transformations should we use? Given a specific data-set, we distinguish between transformations that occur naturally in the data-set and transformations that do not. For example, a horizontal flip can naturally occur in the CIFAR-10 data-set, but not in the MNIST data-set. TransNet can only benefit from transformations that do not occur naturally in the target data-set, so that each head learns a well-defined and non-overlapping classification task. Transformations that occur naturally in the data-set are often used for data augmentation, as by definition they do not change the data domain.
Dihedral group $D_4$. As mentioned earlier, the TransNet model is defined by a set of input transformations $T$. We constrain $T$ to be a subset of the dihedral group $D_4$, which includes reflections and rotations by multiples of 90°. We denote a horizontal reflection by $m$ and a counterclockwise 90° rotation by $r$. Using these two elements we can express all the $D_4$ group elements as $\{r^i, m \circ r^i \mid i \in \{0, 1, 2, 3\}\}$. These transformations were chosen because, as mentioned in [10], their application is relatively efficient and does not leave artifacts in the image (unlike scaling or a change of aspect ratio).

Note that these transformations can be applied to any 3D tensor while operating on the height and width dimensions, including an input image as well as the model's kernels. When applying a transformation $t$ to the model's weights $\theta$, denoted $t(\theta)$, the notation implies that $t$ operates on the model's kernels separately, not affecting other layers such as the fully-connected ones (see Fig. 2).

Figure 2: The transformed input convolved with a kernel (upper path) equals the transformation applied to the output of the input convolved with the inversely transformed kernel (lower path).
Once trained, the full TransNet model can be viewed as an ensemble of $m$ classifiers with shared weights. Its time complexity is linear in the number of heads, almost equivalent to an ensemble of base CNN models, since the time needed to apply each of the $D_4$ transformations to the input is negligible compared to the time needed for the model to process the input. In contrast, its space complexity is almost equivalent to the space complexity of a single base CNN model.

We note that one can prune any of the model's heads, thus leaving a smaller ensemble of up to $m$ classifiers. A useful reduction prunes all the model's heads except one, typically the one corresponding to the identity transformation, which yields a regular CNN that is equivalent in terms of time and space complexity to the base architecture used to build the TransNet model. Having done so, we can evaluate the effect of the inductive bias of the TransNet architecture and its training algorithm solely on the training procedure, by comparing the pruned TransNet to the base CNN model (see Section 5).
4. Theoretical Analysis
In this section we analyze the TransNet model theoretically. We consider the following basic CNN architecture:

$$h_\theta = g \circ l_{inv} \circ \prod_{i=1}^k c_i \quad (6)$$

where $g$ denotes a fully-connected layer, $l_{inv}$ denotes a layer that is invariant under the $D_4$ transformations group (e.g., a global average pooling layer - GAP), and $\{c_i\}_{i\in[k]}$ denote convolutional layers. (While each convolutional layer may be followed by ReLU and Batch Normalization [14] layers, this doesn't change the analysis, so we omit the extra notation.) The TransNet model extends the basic model by appending additional heads:

$$h_{T,\theta} = \{g_t \circ l_{inv} \circ \prod_{i=1}^k c_i\}_{t\in T} \quad (7)$$

Each additional head adds 102K parameters. We denote the parameters of a fully-connected or a convolutional layer by subscripts of $w$ (weight) and $b$ (bias), e.g., $g(x) = g_w \cdot x + g_b$.

Transformations in the dihedral group $D_4$ satisfy another important property, expressed by the following proposition:
Proposition 1. Let $h_\theta$ denote a CNN model in which the last convolutional layer is followed by a layer that is invariant under the $D_4$ group. Then any transformation $t \in D_4$ applied to the input image can be compiled into the model's weights $\theta$ as follows:

$$\forall t \in D_4, \; \forall x \in X: \quad h_\theta(t(x)) = h_{t^{-1}(\theta)}(x) \quad (8)$$
Proof. By induction on $k$ we can show that:

$$\prod_{i=1}^k c_i \circ t(x) = t \circ \prod_{i=1}^k t^{-1}(c_i)(x) \quad (9)$$

(see Fig. 2). Plugging (9) into (6), we get:

$$h_\theta(t(x)) = g \circ l_{inv} \circ \prod_{i=1}^k c_i \circ t(x) = g \circ l_{inv} \circ t \circ \prod_{i=1}^k t^{-1}(c_i)(x) = g \circ l_{inv} \circ \prod_{i=1}^k t^{-1}(c_i)(x) = h_{t^{-1}(\theta)}(x)$$

where the third equality uses $l_{inv} \circ t = l_{inv}$.
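The base case of equation (9) followed by a GAP layer can be checked numerically. The sketch below is our own verification under assumed settings (stride-1 convolution with symmetric zero padding and a square input); it is not code from the paper:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)                         # square input image
w = torch.randn(4, 1, 3, 3)                         # kernels of one conv layer
t    = lambda z: torch.rot90(z, 1, dims=[-2, -1])   # t: 90° rotation
tinv = lambda z: torch.rot90(z, -1, dims=[-2, -1])  # t^{-1}

gap = lambda z: z.mean(dim=[-2, -1])                # GAP is invariant under D4

# h_theta(t(x)) == h_{t^{-1}(theta)}(x): transform the input, or
# inversely transform the kernels -- the pooled outputs agree.
lhs = gap(F.conv2d(t(x), w, padding=1))
rhs = gap(F.conv2d(x, tinv(w), padding=1))
assert torch.allclose(lhs, rhs, atol=1e-5)
```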
Implication. The ResNet model [11] used in our experiments satisfies the pre-condition of the proposition stated above, since it contains a GAP layer [19] after the last convolutional layer, and GAP is invariant under $D_4$.

In order to acquire intuition regarding the inductive bias implied by training algorithm Alg. 1, we consider two cases, a single and a double headed model, trained with the same training algorithm. A single headed model is a special case of the full multi-headed model, where all the heads share weights, $h_t(t(x)) = h(t(x)) \; \forall t$, and the loss in line 5 of Alg. 1 becomes $L(h, B) = \frac{1}{b}\sum_{k=1}^b \ell(h, t(x_k), y_k)$.

As it is hard to analyze non-convex deep neural networks, we focus on a simplified framework and consider a convex optimization problem, where the loss function is convex w.r.t. the model's parameters $\theta$. We also assume that the model's transformations $T$ form a group. ($T$ being a group is a technical constraint needed for the analysis; it is not required by the algorithm.)

Single headed model analysis.
In this simplified case, we can prove the following strict proposition:

Proposition 2. Let $h_\theta$ denote a CNN model satisfying the pre-condition of Prop. 1, and let $T \subset D_4$ be a transformations group. Then the optimal transformation loss $L_T$ (see Eq. 4) is obtained by model weights that are invariant under the transformations $T$. Formally:

$$\exists \theta: \; (\forall t \in T: \theta = t(\theta)) \; \wedge \; (\theta \in \arg\min_\theta L_T(\theta, X))$$

Proof. To simplify the notations, henceforth we let $\theta$ denote the model $h_\theta$.

$$\begin{aligned}
L_T(\theta, X) &= \mathbb{E}_{t\in T}[L(\theta, t(X))] \\
&= \mathbb{E}_{t\in T}\big[\mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(\theta, t(x), y)]\big] \\
&= \mathbb{E}_{t\in T}\big[\mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(t^{-1}(\theta), x, y)]\big] && \text{(by Prop. 1)} \\
&= \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathbb{E}_{t\in T}[\ell(t^{-1}(\theta), x, y)]\big] \\
&\geq \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(\mathbb{E}_{t\in T}[t^{-1}(\theta)], x, y)\big] && \text{(Jensen's inequality, convexity of } \ell\text{)} \\
&= \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(\bar\theta, x, y)] && (\bar\theta := \mathbb{E}_{t\in T}[t(\theta)], \; T = T^{-1}) \\
&= L(\bar\theta, X) \\
&= \mathbb{E}_{t\in T}[L(t^{-1}(\bar\theta), X)] && (\bar\theta \text{ is invariant under } T) \\
&= \mathbb{E}_{t\in T}[L(\bar\theta, t(X))] && \text{(by Prop. 1)} \\
&= L_T(\bar\theta, X)
\end{aligned}$$

Above we use the fact that $\bar\theta$ is invariant under $T$: since $T$ is a group, $t \circ T = T$, hence

$$t(\bar\theta) = t\big(\mathbb{E}_{t'\in T}[t'(\theta)]\big) = \mathbb{E}_{t'\in T}[t \circ t'(\theta)] = \mathbb{E}_{t''\in T}[t''(\theta)] = \bar\theta$$
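The projection $\bar\theta = \mathbb{E}_{t\in T}[t(\theta)]$ used in the proof is easy to check numerically; this small sketch (our own illustration) verifies that averaging a tensor over the rotation group yields a rotation-invariant tensor:

```python
import torch

# the rotation group {r^i}, i = 0..3; the default k=k avoids Python's late binding
rotations = [lambda x, k=k: torch.rot90(x, k, dims=[-2, -1]) for k in range(4)]

w = torch.randn(8, 8)
w_bar = torch.stack([t(w) for t in rotations]).mean(dim=0)   # theta_bar = E_t[t(theta)]
for t in rotations:
    assert torch.allclose(t(w_bar), w_bar, atol=1e-6)        # theta_bar invariant under T
```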
Double headed model. In light of Prop. 2, we now present a counter-example, which shows that Prop. 2 isn't true for the general TransNet model.

Example 1. Let $T = \{t_1 = r^0, t_2 = m \circ r^0\} \subset D_4$ denote the transformations group consisting of the identity and the reflection $m$. Let $h_{T,\theta} = \{h_i = g_i \circ GAP \circ c\}_{i=1}^2$ denote a double headed TransNet model, which comprises a single convolutional layer $c$ (1 channel in and 2 channels out), followed by a GAP layer and then 2 fully-connected layers $\{g_i\}_{i=1}^2$, one for each head. Each $g_i$ outputs a vector of size 2. The data-set $X = \{(x_1, y_1), (x_2, y_2)\}$ consists of 2 examples, where $x_2 = t_2(x_1)$, $y_1 = 1$ and $y_2 = 2$. (This example may seem rather artificial, but in fact this isn't such a rare case; e.g., the airplane and the ship classes, both found in the CIFAR-10 data-set, share a similar blue background.)

Now, assume the model's convolutional layer $c$ is composed of 2 kernels that are invariant under $T$, and denote it by $c_{inv}$. Let $i \in \{1, 2\}$; then:

$$h_i(x_2) = h_i(t_2(x_1)) = g_i \circ GAP \circ c_{inv} \circ t_2(x_1) = g_i \circ GAP \circ c_{inv}(x_1) = h_i(x_1) \quad (10)$$

In this case both heads predict the same output for two inputs with different labels, thus:

$$L(h_i, t_i(X)) > 0 \implies L_T(h_{T,\theta}, X) > 0$$

In contrast, by setting $c_w = (x_1, x_2)$ and $c_b = (0, 0)$, which isn't invariant under $T$, together with suitably chosen fully-connected parameters $g_{1,w}, g_{1,b}, g_{2,w}, g_{2,b}$, we obtain:

$$L(h_i, t_i(X)) = 0 \implies L_T(h_{T,\theta}, X) = 0$$

We may conclude that the optimal model's kernels aren't invariant under $T$, in contradiction to the claim of Prop. 2.

Discussion.
The intuition we derive from the analysis above is that the training algorithm (Alg. 1) imposes an invariance-promoting inductive bias on the model's kernels, as proved for the single headed model, while not strictly enforcing invariance, as shown by the counter-example of the double headed model.
5. Experimental Results

Data-sets.
For evaluation we used the 5 image classification data-sets detailed in Table 1. These diverse data-sets allow us to evaluate our method across different image resolutions and numbers of predicted classes.
Implementation Details.
We employed the ResNet18 [11] architecture for all the data-sets except for ImageNet-200, which was evaluated using the ResNet50 architecture (see Appendix A for more implementation details).
Name             Classes   Train/Test Samples   dim
CIFAR-10 [15]    10        50K/10K              32
CIFAR-100 [15]   100       50K/10K              32
ImageNette [13]  10        10K/4K               224
ImageWoof [13]   10        10K/4K               224
ImageNet-200     200       260K/10K             224

Table 1: The data-sets used in our experiments. The dimension of each example, a color image, is dim × dim × 3 pixels. ImageNette represents 10 easy-to-classify classes from ImageNet [4], while ImageWoof represents 10 hard-to-classify classes of dog breeds from ImageNet. ImageNet-200 represents 200 classes from ImageNet (same classes as in [17]) of full size images.

Notations.

• "base CNN" - a regular convolutional neural network, identical to the TransNet model with only the head corresponding to the identity transformation.
• "PT m-CNN" - a pruned TransNet model trained with m heads, where a single head is left and used for prediction. (In our experiments we chose the head associated with the identity ($r^0$) transformation when evaluating a pruned TransNet. Note, however, that we could have chosen the best head in terms of accuracy, as it follows from Prop. 1 that its transformation can be compiled into the model's weights.) It has the same space and time complexity as the base CNN.
• "T m-CNN" - a full TransNet model trained with m heads, where all are used for prediction. It has roughly the same space complexity as, and m times the time complexity of, the base CNN.

To denote an ensemble of the models above, we add a suffix of a number in parentheses, e.g., T2-CNN (3) is an ensemble of 3 T2-CNN models. We now compare the accuracy of the "base-CNN", "PT m-CNN" and "T m-CNN" models, where m = 2, 3, 4 denotes the number of heads of the TransNet model, and their ensembles, across all the data-sets listed in Table 1.
Models with the same space and time complexity.
First, we evaluate the pruned TransNet model by comparing the "PT m-CNN" models with the "base-CNN" model, see Table 2. Essentially, we evaluate the effect of using the TransNet model only for training, as the final "PT m-CNN" models are identical to the "base-CNN" model regardless of m. We can clearly see the inductive bias implied by the training procedure. We also see that TransNet training improves the accuracy of the final "base-CNN" classifier across all the evaluated data-sets.
Models with similar space complexity, different time complexity. Next, we evaluate the full TransNet model by comparing the "T m-CNN" models with the "base-CNN" model, see Table 3. Despite the fact that the full TransNet model processes the (transformed) input m times, as compared to once for the "base-CNN" model, its architecture is not significantly larger than the base-CNN's: the full TransNet adds to the "base-CNN" only a negligible number of parameters, in the form of its multiple heads. Clearly the full TransNet model improves the accuracy as compared to the "base-CNN" model, and also as compared to the pruned TransNet model. Thus, if the additional runtime complexity during test is not an issue, it is beneficial to employ the full TransNet model during test time. In fact, one can process the input image once, and then choose whether to continue processing it with the other heads to improve the prediction, all this while keeping roughly the same space complexity.
Ensembles: models with similar time complexity, different space complexity. Here we evaluate ensembles of pruned TransNet models, and compare them to a single full TransNet model that can be seen as a space-efficient ensemble: a full TransNet generates m predictions with only 1/m as many parameters, where m is the number of TransNet heads. Results are shown in Fig. 3. Clearly an ensemble of pruned TransNet models is superior to an ensemble of base CNN models, suggesting that the accuracy gain achieved by the pruned TransNet model doesn't overlap with the accuracy gain achieved by using an ensemble of classifiers. Furthermore, we observe that the full TransNet model exhibits competitive accuracy, with 2 and 3 heads, as compared to an ensemble of 2 or 3 base CNN models respectively, while utilizing 1/2 and 1/3 as many parameters respectively.

Figure 3: Model accuracy as a function of the number of instances (X-axis) processed during prediction. Each instance requires a complete run from input to output. An ensemble includes: m independent base CNN classifiers for "CNN"; m pruned TransNet models trained with 2 heads for "PT2-CNN"; and one TransNet model with m heads, where m is the ensemble size, for "T m-CNN".

MODEL      CIFAR-10   CIFAR-100   ImageNette   ImageWoof   ImageNet-200
base-CNN   95.57 ±    ...         ...          ...         ...
...

Table 2: Accuracy of models with the same space and time complexity, comparing the base CNN with pruned TransNet models "PT m-CNN", where m = 2, 3, 4 denotes the number of heads in training. Mean and standard error for 3 repetitions are shown.

MODEL      CIFAR-10   CIFAR-100   ImageNette   ImageWoof   ImageNet-200
base-CNN   95.57 ±    ...         ...          ...         ...
...

Table 3: Accuracy of models with similar space complexity and different time complexity, comparing the base CNN with full TransNet models. With m denoting the number of heads, chosen to be 2, 3 or 4, the prediction time complexity of the respective TransNet model "T m-CNN" is m times larger than that of the base CNN. Mean and standard error for 3 repetitions are shown.

Accuracy vs. generalization.
In Fig. 3 we can see that 2 heads improve the model's performance across all data-sets, 3 heads improve it on most of the data-sets, and 4 heads actually reduce performance on most data-sets. We hypothesize that too many heads impose too strict an inductive bias on the model's kernels. Thus, although generalization is improved, test accuracy is reduced due to insufficient variance. Further analysis is presented in the next section.
We've seen in Section 5.1 that the TransNet model, whether full or pruned, achieves better test accuracy as compared to the base CNN model. This occurs despite the fact that the transformation loss $L_T(h_T, X)$ minimized by the TransNet model is more demanding than the loss $L(h, X)$ minimized by the base CNN, and appears harder to optimize. This conjecture is justified by the following lemma:

Lemma 2. Let $h_T$ denote a TransNet model that obtains a transformation loss of $a := L_T(h_T, X)$. Then there exists a reduction from $h_T$ to the base CNN model $h$ that obtains a loss of at most $a$, i.e., $L(h, X) \leq a$.

Proof. $a = L_T(h_T, X) = \mathbb{E}_{t\in T}[L(h_{\theta_t}, t(X))]$, so there must be a transformation $t \in T$ s.t. $L(h_{\theta_t}, t(X)) \leq a$. Now, one can compile the transformation $t$ into $h_{\theta_t}$ (see Prop. 1) and get a base CNN $\tilde h = h_{t^{-1}(\theta_t)}$, which obtains $L(\tilde h, X) = L(h_{t^{-1}(\theta_t)}, X) = L(h_{\theta_t}, t(X)) \leq a$.

Why is it, then, that the TransNet model achieves overall better accuracy than the base CNN? The answer lies in its ability to achieve better generalization. In order to measure the generalization capability of a model w.r.t. a data-set, we use the ratio between the test-set and train-set loss, where a lower ratio indicates better generalization. As illustrated in Fig. 4, the pruned TransNet models clearly exhibit better generalization when compared to the base CNN model. Furthermore, the generalization improvement increases with the number of TransNet model heads, which are only used for training and then pruned. The observed narrowing of the generalization gap occurs because, although the TransNet model slightly increases the training loss, it more significantly decreases the test loss as compared to the base CNN.
Figure 4: CIFAR-100 results. Left panel: learning curve of the base CNN model ("base-CNN") and a pruned TransNet model ("PT2-CNN"). Right panel: generalization score - the test/train loss ratio - measured for the base-CNN model and various pruned TransNet models with a different number of heads.
We note that better generalization does not necessarily imply a better model. The "PT4-CNN" model generalizes better than any other model (see right panel of Fig. 4), but its test accuracy is lower, as seen in Table 2.
What characterizes the beneficial inductive bias implied by the TransNet model and its training algorithm Alg. 1? To answer this question, we investigate the emerging invariance of kernels in the convolutional layers of the learned network, w.r.t. the TransNet transformations set $T$.

We start by introducing the "Invariance Score" (IS), which measures how invariant a 3D tensor is w.r.t. a transformations group. Specifically, given a convolutional kernel denoted by $w$ (a 3D tensor) and a transformations group $T$, the IS score is defined as follows:

$$IS(w, T) := \min_{u \in INV_T} \|w - u\| \quad (11)$$

where $INV_T$ is the set of invariant kernels (of the same shape as $w$) under $T$, i.e., $INV_T := \{u : u = t(u) \; \forall t \in T\}$.

Lemma 3. $\arg\min_{u \in INV_T} \|w - u\| = \mathbb{E}_{t\in T}[t(w)]$

Proof. Let $u$ be an invariant tensor under $T$. Define $f(u) := \|w - u\|^2$, and note that $\arg\min_{u \in INV_T} \|w - u\| = \arg\min_{u \in INV_T} f(u)$.

$$\begin{aligned}
f(u) = \|w - u\|^2 &= \mathbb{E}_{t\in T}\big[\|w - t(u)\|^2\big] && (u \text{ is invariant under } T) \\
&= \mathbb{E}_{t\in T}\big[\|t^{-1}(w) - u\|^2\big] \\
&= \mathbb{E}_{t\in T}\big[\|t(w) - u\|^2\big] && (T = T^{-1}) \\
&= \mathbb{E}_{t\in T}\Big[\sum_{i=1}^{size(w)} (t(w)_i - u_i)^2\Big]
\end{aligned}$$

where the index $i$ runs over all the tensor's elements. Finally, we differentiate $f$ to obtain its minimum:

$$\frac{\partial f}{\partial u_i} = \mathbb{E}_{t\in T}[-2(t(w)_i - u_i)] = 0 \implies u_i = \mathbb{E}_{t\in T}[t(w)_i] \implies u = \mathbb{E}_{t\in T}[t(w)]$$

Lemma 3 gives a closed-form expression for the IS gauge:

$$IS(w, T) = \|w - \mathbb{E}_{t\in T}[t(w)]\| \quad (12)$$

Equipped with this gauge, we can inspect the invariance level of the model's kernels w.r.t. a transformations group. Note that this measure allows us to compare the full TransNet model with the base CNN model, as both share the same convolutional layers. Since the transformations of the TransNet model don't necessarily form a group, we use the minimal group containing these transformations - the group of all rotations $\{r^i\}_{i=0}^3$.

In Fig. 5 we can see that the full TransNet model "T2-CNN" and the base CNN model demonstrate a similar invariance level in all the convolutional layers but the last one. In Fig. 6, where the distribution of the IS score over the last layer of 4 different models is fully shown, we can see more clearly that the last convolutional layer of full TransNet models exhibits a much higher invariance level as compared to the base CNN. This phenomenon is robust to the metric used in the IS definition, with similar results when using "Pearson Correlation" or "Cosine Similarity". The increased invariance in the last convolutional layer increases monotonically with the number of heads in the TransNet model, which is consistent with the generalization capability of these models (see Fig. 4).
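Using the closed form (12), the IS score is a one-liner. The sketch below is our own illustration of computing it for a layer's kernels w.r.t. the rotation group:

```python
import torch

# the rotation group {r^i}, i = 0..3; the default k=k avoids Python's late binding
rotations = [lambda x, k=k: torch.rot90(x, k, dims=[-2, -1]) for k in range(4)]

def invariance_score(w, transforms):
    """IS(w, T) = ||w - E_{t in T}[t(w)]|| (Eq. 12); lower means more invariant."""
    mean_tw = torch.stack([t(w) for t in transforms]).mean(dim=0)
    return torch.norm(w - mean_tw).item()

w = torch.randn(64, 3, 3, 3)           # kernels of one convolutional layer
print(invariance_score(w, rotations))  # approximately 0 for rotation-invariant kernels
```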
Figure 5: CIFAR-100 results, plotting the distribution of the IS scores (mean and std) for the kernels in each layer of the different models. Invariance is measured w.r.t. the group of 90° rotations.

Figure 6: CIFAR-100 results, plotting the full distribution of the IS scores for the kernels in the last (17-th) layer of the different models. Invariance is measured w.r.t. the group of 90° rotations.
The generalization improvement achieved by the TransNet model, as reported in Section 5.2, may be explained by this increased level of invariance, as highly invariant kernels have fewer degrees of freedom, and should therefore be less prone to overfit.

Our method consists of 2 main components - the TransNet architecture and the training algorithm Alg. 1. To evaluate the accuracy gain of each component we consider two variations:
• Architecture only: in this method we train the multi-headed architecture (2 heads in this case) by feeding each head the same untransformed batch (equivalent to a TransNet model with the multi-set of {id, id} transformations). Prediction is retrieved from a single head (similar to PT2-CNN).
• Algorithm only: in this method we train the base (one headed) model with the same algorithm Alg. 1. (This model was also considered in the theoretical analysis of Section 4.2, termed the single headed model.)

MODEL      CIFAR-10   CIFAR-100   ImageNette   ImageWoof   ImageNet-200
base-CNN   95.57 ±    ...         ...          ...         ...
...

Table 4: Accuracy of the ablation study models with the same space and time complexity. These 4 models enable us to evaluate the effect of the TransNet architecture and the TransNet algorithm separately. Mean and standard error for 3 repetitions are shown.

We compare the two methods above to the regular "base-CNN" model and the complete model "PT2-CNN", see Table 4. We can see that using only one of the components doesn't yield any significant accuracy gain. This suggests that the complete model benefits from both components working together: the training algorithm increases the invariance of the model's kernels on the one hand, while the multi-head architecture encourages the model to capture meaningful orientation information on the other hand.
6. Summary
We introduced a model inspired by self-supervision, which consists of a base CNN model attached to multiple heads, each corresponding to a different transformation from a fixed set of transformations. The self-supervised aspect of the model is crucial, as the chosen transformations must not occur naturally in the data. When the model is pruned back to match the base CNN, it achieves better test accuracy and improved generalization, which is attributed to the increased invariance of the model's kernels in the last layer. We observed that excess invariance, while improving generalization, eventually curtails the test accuracy.

We evaluated our model on various image data-sets, observing that each data-set achieves its own optimal level of kernel invariance, i.e., there is no optimal number of heads for all data-sets. Finally, we introduced an invariance score gauge (IS), which measures the level of invariance achieved by the model's kernels. IS may be leveraged to determine the optimal invariance level, as well as potentially function as an independent regularization term.

Acknowledgements
This work was supported in part by a grant from the Israel Science Foundation (ISF) and by the Gatsby Charitable Foundation.
References

[1] Christopher Clark and Amos Storkey. Training deep convolutional neural networks to play go. In International Conference on Machine Learning, pages 1766–1774, 2015.
[2] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pages 2990–2999, 2016.
[3] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[5] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. arXiv preprint arXiv:1602.02660, 2016.
[6] Sander Dieleman, Kyle W. Willett, and Joni Dambre. Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly Notices of the Royal Astronomical Society, 450(2):1441–1459, 2015.
[7] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
[8] Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1734–1747, 2015.
[9] Robert Gens and Pedro M. Domingos. Deep symmetry networks. In Advances in Neural Information Processing Systems, pages 2537–2545, 2014.
[10] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[12] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervised learning can improve model robustness and uncertainty. In Advances in Neural Information Processing Systems, pages 15663–15674, 2019.
[13] Jeremy Howard. Imagewang.
[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[15] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[16] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In European Conference on Computer Vision, pages 577–593. Springer, 2016.
[17] Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge. CS 231N, 7, 2015.
[18] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
[19] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[20] Joseph L. Mundy, Andrew Zisserman, et al. Geometric Invariance in Computer Vision, volume 92. MIT Press, Cambridge, MA, 1992.
[21] Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W. Koh, Quoc V. Le, and Andrew Y. Ng. Tiled convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1279–1287, 2010.
[22] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. arXiv preprint arXiv:1603.09246, 2016.
[23] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting, 2016.
[24] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3D face model for pose and illumination invariant face recognition. Pages 296–301. IEEE, 2009.
[25] Ling Shao, Fan Zhu, and Xuelong Li. Transfer learning for visual categorization: A survey. IEEE Transactions on Neural Networks and Learning Systems, 26(5):1019–1034, 2014.
[26] Laurent Sifre and Stéphane Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1233–1240, 2013.
[27] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[28] Pavan Turaga and Rama Chellappa. Locally time-invariant models of human activities using trajectories on the Grassmannian. Pages 2435–2441. IEEE, 2009.
[29] Karl Weiss, Taghi M. Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.
[30] Fa Wu, Peijun Hu, and Dexing Kong. Flip-rotate-pooling convolution and split dropout on convolutional neural networks for image classification. arXiv preprint arXiv:1507.08754, 2015.
[31] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-supervised semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 1476–1485, 2019.
[32] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
[33] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In AAAI, pages 13001–13008, 2020.
Appendix

A. Implementation details
We employed the ResNet [11] architecture, specifically the ResNet18 architecture for all the data-sets except for ImageNet-200, which was evaluated using the ResNet50 architecture. It is important to note that we did not change the hyper-parameters used by the regular CNN architecture on which TransNet is based. This may strengthen the results, as one may fine tune these hyper-parameters to best suit the TransNet model.

We used a weight decay of 0.0001 and momentum of 0.9. The model was trained with a batch size of 64 for all the data-sets except for ImageNet-200, where we increased the batch size to 128. We trained the model for 300 epochs, starting with a learning rate of 0.1, divided by 10 at the 150th and 225th epochs, except for the ImageNet-200 model, which was trained for 120 epochs, starting with a learning rate of 0.1, divided by 10 at the 40th and 80th epochs. We normalized the images as usual by subtracting the image's mean and dividing by the image's standard deviation (color-wise).

We employed a mild data augmentation scheme - horizontal flip with probability 0.5. For the CIFAR data-sets we padded each dimension by 4 pixels and cropped randomly (uniformly) a 32 × 32 patch from the enlarged image [18], while for the ImageNet family data-sets we cropped randomly (uniformly) a 224 × 224 patch from the original image.

At test time, we took the original image for the CIFAR data-sets and a center crop for the ImageNet family data-sets. The prediction of each model is the mean of the model's output on the original image and a horizontally flipped version of it. Note that a horizontal flip occurs naturally in every data-set we use for evaluation, and therefore isn't associated with any of the model's heads.