Kernelized Classification in Deep Networks
Sadeep Jayasumana, Srikumar Ramalingam, Sanjiv Kumar
Google Research, New York
{sadeep, rsrikumar, sanjivk}@google.com

Abstract
In this paper, we propose a kernelized classification layer for deep networks. Although conventional deep networks introduce an abundance of nonlinearity for representation (feature) learning, they almost universally use a linear classifier on the learned feature vectors. We introduce a nonlinear classification layer by using the kernel trick on the softmax cross-entropy loss function during training and the scorer function during testing. Furthermore, we study the choice of kernel functions one could use with this framework and show that the optimal kernel function for a given problem can be learned automatically within the deep network itself using the usual backpropagation and gradient descent methods. To this end, we exploit a classic mathematical result on positive definite kernels on the unit n-sphere embedded in the (n + 1)-dimensional Euclidean space. We show the usefulness of the proposed nonlinear classification layer on several vision datasets and tasks.
1. Introduction
Deep learning, which is now a ubiquitous technique in machine learning, is built upon the premise that useful representations of the inputs can be automatically learned from data [4]. For example, in the image classification setting, a rich representation learning network consisting of building blocks such as convolution and max-pooling is first used to obtain a vector representation of the input image. This representation is commonly referred to as the feature vector of the input image. The image is then classified into the correct class within the last layer of the network using a fully-connected layer operating on the feature vector [33, 17]. As will be demonstrated later, this last classification layer represents a linear classifier in the space of the learned feature vectors. Therefore, to perform well on the classification task, the classes have to be linearly separable in the space of feature vectors. While this is a standard assumption in many tasks, we wonder if using a nonlinear classifier on the learned feature vectors would give any additional benefits, especially when the backbone feature extractor fails to learn fully linearly separable features.

Kernel methods are a different branch of machine learning that was immensely successful in the pre-deep-learning era, particularly with the popularity of the Support Vector Machine (SVM) algorithm [9, 31]. Conventionally, kernel methods have been used for learning with hand-crafted feature vectors, such as the scale invariant feature transform (SIFT) [26], histogram-of-oriented-gradients (HOG) [12], and bag-of-visual-words [37] for image classification. The key idea in kernel methods is the following: instead of running a linear classifier on the feature vectors, they are first mapped to a higher-dimensional Reproducing Kernel Hilbert Space (RKHS) using a positive definite kernel function. For certain kernel functions, the RKHS can even be infinite dimensional. A linear classifier is then run on this high-dimensional RKHS. Since the dimensionality of the feature vectors is dramatically increased via this mapping, a linear classifier in the RKHS corresponds to a powerful nonlinear classifier in the original feature vector space. Such a classifier is capable of learning more complex patterns than a linear classifier directly operating on the feature vectors. Thanks to the kernel trick, we never have to explicitly calculate the high-dimensional vectors in the RKHS, which would be computationally expensive (or even impossible in the case of an infinite-dimensional space) to compute and store.

Although kernel methods yield excellent results in shallow machine learning in general, the choice of the kernel function is often problematic. There is a collection of well-known kernels such as the linear kernel, polynomial kernels, and the Gaussian RBF kernel [32, 31]. The Gaussian RBF kernel is a common choice; however, tuning its bandwidth parameter is non-trivial and different values for this parameter can give vastly different results.

In this work, we join the concepts of automatic representation learning and nonlinear, kernel-based classification. To this end, we formalize a nonlinear, kernelized classification layer for deep networks. To address the problem of choosing an appropriate kernel for the task, we take some results from the classic mathematics literature on positive definite kernel functions and devise a method that automatically learns the optimal kernel from data within the deep learning framework itself.
As a result, the full network, which consists of a conventional representation learner and our kernelized classification layer, can be trained end-to-end using the usual backpropagation and gradient descent methods. We show the benefits of this framework in image classification, transfer learning, and distillation settings on a number of datasets.
2. Related Work
One of the earliest attempts to connect neural network based learning and kernel methods was the sigmoid kernel [34], which became popular in SVMs due to the early success of neural networks. The design of this kernel function was inspired by the sigmoid activation function used in the early generations of neural networks. More recently, the authors of [7] proposed a family of kernel functions that mimic computations in large, multi-layer neural networks.

There have been a few methods that focus on extending the linear convolution in Convolutional Neural Networks to a nonlinear operation. Wang et al. [38] proposed a kernelized version of the convolution operation and demonstrated that it can learn more complicated features than the usual convolution operation. However, some kernels used in that work, such as the $l_p$-norm kernels, are not positive definite and hence do not represent a valid mapping to an RKHS. Convolutional kernel networks provide a kernel approximation scheme to interpret convolutions [27]. Other approaches include the use of Volterra series approximations to extend convolutions to a nonlinear operation by introducing quadratic terms [42].

In a related work [6], an RBF kernel layer is introduced to produce a feature space from point cloud input. In contrast to our work, the RBF kernel layer is used on the input data and not in the classification layer.

Prior to the dominance of deep learning methods, picking the right kernel for a given problem was studied extensively in works such as [19, 2, 20, 15]. In particular, Multiple Kernel Learning (MKL) approaches [15, 35] were popular in conjunction with SVMs. Unfortunately, these methods scale poorly with the size of the training dataset. In this work, we automatically learn the kernel using the data within a deep network. This not only allows automatic representation learning, but also scales well for large training sets.

Kernels have also been considered for deep learning to reduce the memory footprint of CNNs. This was accomplished by achieving an end-to-end training of a Fastfood kernel layer [40], which uses approximations of kernel functions via Fastfood transforms [23]. Other related methods involving both kernels and deep learning include scalable kernel methods [11], kernel pooling [10], deep SimNets [8], and deep kernel learning [39].
3. Nonlinear Softmax Loss
In this section, we formalize the kernelized classification in deep networks. Let us consider a multi-class classification problem with a training set $\{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathcal{X}$ for each $i$, $y_i \in [L] = \{1, 2, \ldots, L\}$ for each $i$, $\mathcal{X}$ is a nonempty set, $L$ is the number of labels, and $N$ is the number of training examples. For example, each training data point $(x_i, y_i)$ can be an image with its class label.

A deep neural network that solves this task has two components: a representation learner (also called the feature learner) and a classifier. In the case of image classification, the representation learner consists of modules such as convolution layers, max-pooling layers, and fully-connected layers. The classifier is the last fully-connected layer operating on the feature vectors, which is endowed with a loss function during training.

Let $r^{(\Theta)} : \mathcal{X} \to \mathbb{R}^d$ denote the representation learner, where $d$ is the dimensionality of the learned feature vectors and $\Theta$ represents all the (learnable) parameters in this part of the network. The classifier is characterized by a function $g^{(\Omega)} : \mathbb{R}^d \to [L]$, where $\Omega$ denotes all the parameters in the last layer of the network. Usually, $\Omega$ consists of weight vectors $w_1, w_2, \ldots, w_L$ with each $w_j \in \mathbb{R}^d$, and bias terms $b_1, b_2, \ldots, b_L$ with each $b_j \in \mathbb{R}$. The function $g^{(\Omega)}$ then takes the form:

$$g^{(\Omega)}(f) = \operatorname{argmax}_j \; w_j^T f, \qquad (1)$$

where $f = r^{(\Theta)}(x) \in \mathbb{R}^d$ is the feature vector for input $x$. Note that we have dropped the additive bias term $b_j$ to keep the notation uncluttered. There is no loss of generality here since the bias term can be absorbed into $w_j$ by appending a constant element to $f$. During inference, the deep network's class prediction $\hat{y}^*$ for an input $x^*$ is the composite of these two functions:

$$\hat{y}^* = \left(g^{(\Omega)} \circ r^{(\Theta)}\right)(x^*). \qquad (2)$$

Although conceptually there are two components of the deep network, their parameters $\Theta$ and $\Omega$ are learned jointly during the training of the network. The de facto standard way of training a classification network is minimizing the softmax loss applied to the classification layer. The softmax loss is the combination of the softmax function and the cross-entropy loss. More specifically, for a single training example $(x, y)$ with the feature vector $f = r^{(\Theta)}(x)$, the softmax loss is calculated as

$$l(y, f) = -\log\left(\frac{\exp(w_y^T f)}{\sum_{j=1}^{L} \exp(w_j^T f)}\right). \qquad (3)$$

Note that the classifier $g^{(\Omega)}$ trained in this manner is completely linear in $\mathbb{R}^d$, the space of the feature vectors $f$, as is evident from Eq. (2).

From the classic knowledge in kernel methods on hand-crafted features, we are aware that more powerful nonlinear classifiers on $\mathbb{R}^d$ can be obtained using the kernel trick. The key idea here is to first embed the feature vectors $f$ into a high-dimensional Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ and perform classification in $\mathcal{H}$. Although the classification is linear in the high-dimensional Hilbert space $\mathcal{H}$, it is nonlinear in the original feature vector space $\mathbb{R}^d$. Let $\phi : \mathbb{R}^d \to \mathcal{H}$ represent this RKHS embedding. Performing classification in $\mathcal{H}$ is then equivalent to training the neural network with the following modified version of the softmax loss:

$$l'(y, f) = -\log\left(\frac{\exp\left(\langle \phi(w_y), \phi(f) \rangle_{\mathcal{H}}\right)}{\sum_{j=1}^{L} \exp\left(\langle \phi(w_j), \phi(f) \rangle_{\mathcal{H}}\right)}\right), \qquad (4)$$

where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denotes the inner product in the Hilbert space $\mathcal{H}$. The key difference between Eq. (3) and Eq. (4) is that the dot products between $w_j$ and $f$ have been replaced with the inner products between $\phi(w_j)$ and $\phi(f)$. The more general notion of inner product is used instead of the dot product because the Hilbert space $\mathcal{H}$ can be infinite dimensional. For a network trained with this nonlinear version of the softmax function, the label prediction can be obtained using the following modified version of the predictor:

$$g'^{(\Omega)}(f) = \operatorname{argmax}_j \; \langle \phi(w_j), \phi(f) \rangle_{\mathcal{H}}. \qquad (5)$$

Note that the Hilbert space embeddings $\phi(\cdot)$ can be very high, even infinite, dimensional. Therefore, computing and storing them can be problematic. We can use the kernel trick from the classic machine learning literature [31, 32] to overcome this problem: explicit computation of $\phi(\cdot)$ is avoided by directly evaluating the inner product between them using a kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. This idea is summarized by the equation:

$$\langle \phi(w), \phi(f) \rangle_{\mathcal{H}} = k(w, f). \qquad (6)$$

For a kernel function to represent a valid RKHS, it must be positive definite [3, 5]. We discuss this notion next.
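Before moving on, Eqs. (3)-(6) can be summarized in a short PyTorch-style sketch (our illustration, not the authors' released code). The dot products $w_j^T f$ of the ordinary softmax loss are replaced by kernel evaluations $k(w_j, f)$; the polynomial kernel below is only an illustrative placeholder for `kernel_fn`.

```python
import torch
import torch.nn.functional as F

def polynomial_kernel(weights, features, degree=3):
    # Illustrative positive definite kernel on R^d: k(w, f) = <w, f>^degree.
    return torch.matmul(features, weights.t()) ** degree      # shape (batch, L)

def kernelized_softmax_loss(features, weights, labels, kernel_fn=polynomial_kernel):
    """Nonlinear softmax loss of Eq. (4), evaluated via the kernel trick of Eq. (6)."""
    logits = kernel_fn(weights, features)       # <phi(w_j), phi(f)>_H = k(w_j, f)
    return F.cross_entropy(logits, labels)      # -log softmax probability of the true class

def kernelized_predict(features, weights, kernel_fn=polynomial_kernel):
    # Nonlinear predictor of Eq. (5): argmax_j k(w_j, f).
    return kernel_fn(weights, features).argmax(dim=1)
```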
4. Kernels on the Unit Sphere
It was shown in the previous section that, given a kernel function on the feature vector space, we can obtain a nonlinear classifier in the last layer of a deep network by modifying the softmax loss function during training and the predictor during inference. However, only positive definite kernels allow this trick. There are various choices for kernel functions in the classic machine learning literature. Some popular choices include the polynomial kernel, the Gaussian RBF kernel (squared exponential kernel), and the Laplacian kernel. However, in the classic kernel methods literature, there is no principled method for selecting the optimal kernel for a given problem. Furthermore, many of the kernels have hyperparameters that need to be manually tuned. The generally accepted solution to this problem in classic kernel methods is the MKL framework [15], where the optimal kernel is learned as a linear combination of some pre-defined kernels. Unfortunately, like SVMs, MKL methods do not scale well with the training set size.

In this section, we present some theoretical results that will pave the way to define a neural network layer that can automatically learn the optimal kernel from data. By formulating kernel learning as a neural network layer, we inherit the desirable properties of deep learning, including scalability and automatic feature learning.

We start the discussion with the following definition of positive definite kernels [5].
Definition 4.1.
Let $U$ be a nonempty set. A function $k : (U \times U) \to \mathbb{R}$ is called a positive definite kernel if $k(u, v) = k(v, u)$ for all $u, v \in U$ and

$$\sum_{j=1}^{N} \sum_{i=1}^{N} c_i c_j \, k(u_i, u_j) \geq 0,$$

for all $N \in \mathbb{N}$, $\{u_1, \ldots, u_N\} \subseteq U$, and $\{c_1, \ldots, c_N\} \subseteq \mathbb{R}$.
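Definition 4.1 can be checked empirically for a candidate kernel on a finite sample: the Gram matrix $K_{ij} = k(u_i, u_j)$ must be symmetric and positive semi-definite, i.e., all its eigenvalues must be non-negative. The sketch below (our own illustrative check, with a Gaussian RBF kernel as an example) does exactly that.

```python
import numpy as np

def gaussian_rbf(u, v, bandwidth=1.0):
    # Gaussian RBF kernel k(u, v) = exp(-||u - v||^2 / (2 * bandwidth^2)).
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * bandwidth ** 2))

def is_positive_definite(kernel, points, tol=1e-9):
    """Check Definition 4.1 on a finite sample: the Gram matrix must be PSD."""
    n = len(points)
    gram = np.array([[kernel(points[i], points[j]) for j in range(n)] for i in range(n)])
    assert np.allclose(gram, gram.T), "kernel must be symmetric"
    eigenvalues = np.linalg.eigvalsh(gram)     # real eigenvalues of the symmetric Gram matrix
    return bool(np.all(eigenvalues >= -tol))

points = np.random.randn(50, 8)                # 50 random points in R^8
print(is_positive_definite(gaussian_rbf, points))   # expected: True
```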
Proposition 4.2.
The family of all positive definite kernels on a given nonempty set forms a convex cone that is closed under pointwise multiplication and pointwise convergence.

Proof.
To intuitively understand this result, it is helpful to recall that the geometry of the family of all positive definite kernels on a given nonempty set is closely related to the geometry of the space of $d \times d$ symmetric positive definite matrices, which forms a convex cone. The formal proof of this proposition can be found in Remark 1.11 and Theorem 1.12 of Chapter 3 of [5].

To simplify the problem setting, we assume that both the feature vectors $f$ and the weight vectors $w_j$ are unit norm. Not only does this simplify the mathematics, it is also a practice in use for stabilizing the training of neural networks [41]. Due to this assumption, we are interested in positive definite kernels on the unit sphere in $\mathbb{R}^d$. From now on, we use $S^n$, where $n = d - 1$, to denote this space.

We also restrict our discussion to radial kernels on $S^n$. Radial kernels, i.e., kernels that only depend on the distance between the two input points, have the desirable property of translation invariance. Furthermore, all the commonly used kernels on $S^n$, such as the linear kernel, the polynomial kernel, the Gaussian RBF kernel, and the Laplacian kernel, are radial kernels. The following theorem, the origins of which can be traced back to [30], fully characterizes radial kernels on $S^n$.
Theorem 4.3. A radial kernel $k : S^n \times S^n \to \mathbb{R}$ is positive definite for any $n$ if and only if it admits a unique series representation of the form

$$k(u, v) = \sum_{m=0}^{\infty} \alpha_m \langle u, v \rangle^m + \alpha_{-1}\left(\llbracket \langle u, v \rangle = 1 \rrbracket - \llbracket \langle u, v \rangle = -1 \rrbracket\right) + \alpha_{-2}\,\llbracket \langle u, v \rangle \in \{-1, 1\} \rrbracket, \qquad (7)$$

where each $\alpha_m \geq 0$, $\sum_{m=-2}^{\infty} \alpha_m < \infty$, and $\llbracket \cdot \rrbracket$ denotes the Iverson bracket.

Proof. The kernel $k_1 : S^n \times S^n \to [-1, 1] : k_1(u, v) = \langle u, v \rangle$ is positive definite on $S^n$ for any $n$ since $\sum_j \sum_i c_i c_j \langle u_i, u_j \rangle = \left\|\sum_i c_i u_i\right\|^2 \geq 0$. Therefore, from the closure properties in Proposition 4.2, the kernel $k_m : (u, v) \mapsto \langle u, v \rangle^m$ is also positive definite on $S^n$ for any $m \in \mathbb{N}$. Furthermore, $k_m$ is positive definite for $m = 0$ since $\sum_j \sum_i c_i c_j \langle u_i, u_j \rangle^0 = \left(\sum_i c_i\right)^2 \geq 0$.

Let us now consider the following two sequences of kernels: $s_{\mathrm{odd}} = k_1, k_3, \ldots, k_{2m+1}, \ldots$ and $s_{\mathrm{even}} = k_2, k_4, \ldots, k_{2m}, \ldots$ Since $-1 \leq \langle u, v \rangle \leq 1$, it is clear that $s_{\mathrm{odd}}$ and $s_{\mathrm{even}}$ converge pointwise to the following kernels, respectively:

$$k_{\mathrm{odd}}(u, v) = \llbracket \langle u, v \rangle = 1 \rrbracket - \llbracket \langle u, v \rangle = -1 \rrbracket, \qquad k_{\mathrm{even}}(u, v) = \llbracket \langle u, v \rangle \in \{-1, 1\} \rrbracket.$$

From the last closure property of Proposition 4.2, both $k_{\mathrm{odd}}$ and $k_{\mathrm{even}}$ are positive definite on $S^n$. Invoking Proposition 4.2 again, we conclude that any finite conic combination of the kernels $k_{\mathrm{even}}, k_{\mathrm{odd}}, k_0, k_1, \ldots$ is positive definite on $S^n$ for any $n$. This completes the forward direction of the proof. For the proof of the converse, we refer the reader to Theorem 3.6 in Chapter 5 of [5].

Equipped with a complete characterization of the positive definite radial kernels on $S^n$, we now discuss how we can combine this result with the nonlinear softmax formulation derived in Section 3 to automatically learn the best kernel classifier within a deep network.
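Before moving on, a quick numerical illustration (our own sketch, not from the paper) of the pointwise limits used in the proof of Theorem 4.3: evaluating $\langle u, v\rangle^m$ at a few values of the inner product shows that large odd powers approach $k_{\mathrm{odd}}$ and large even powers approach $k_{\mathrm{even}}$.

```python
def k_odd(t):
    # Pointwise limit of <u, v>^m over odd m: +1 at t = 1, -1 at t = -1, 0 otherwise.
    return float(t == 1.0) - float(t == -1.0)

def k_even(t):
    # Pointwise limit of <u, v>^m over even m: 1 at |t| = 1, 0 otherwise.
    return float(abs(t) == 1.0)

for t in [1.0, -1.0, 0.9, -0.5, 0.0]:           # possible values of <u, v> on the sphere
    odd_power, even_power = t ** 101, t ** 100  # large odd / even powers
    print(f"t={t:+.1f}  t^101={odd_power:+.3e} -> k_odd={k_odd(t):+.0f}"
          f"   t^100={even_power:+.3e} -> k_even={k_even(t):+.0f}")
```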
5. The Kernelized Classification Layer
We are now ready to describe our full framework for nonlinear classification in the feature space. We introduce a kernelized classification layer that acts as a drop-in replacement for the usual softmax classification layer in a deep network. This new layer classifies feature vectors in a high-dimensional RKHS while automatically choosing the optimal positive definite kernel that enables the mapping into the RKHS. As a result, we do not have to hand-pick a kernel or its hyperparameters.
The new classification layer is parameterized by the usual weight vectors $w_1, w_2, \ldots, w_L$, and some additional learnable coefficients $\alpha_{-2}, \alpha_{-1}, \ldots, \alpha_M$, where $M \in \mathbb{N}$ and each $\alpha_m \geq 0$. During training, this classifier maps feature vectors $f$ to a high-dimensional RKHS $\mathcal{H}_{\mathrm{opt}}$, which optimally separates feature vectors belonging to different classes, and learns a linear classifier in $\mathcal{H}_{\mathrm{opt}}$. During inference, the classifier maps feature vectors of previously unseen inputs to the RKHS it learned during training and performs classification in that space. This is achieved by using the nonlinear softmax loss defined in Eq. (4) during training and the nonlinear predictor defined in Eq. (5) during testing, with the inner product in $\mathcal{H}$ given by:

$$\langle \phi(w), \phi(f) \rangle_{\mathcal{H}} = \langle \phi(w), \phi(f) \rangle_{\mathcal{H}_{\mathrm{opt}}} = k_{\mathrm{opt}}(w, f), \qquad (8)$$

where $k_{\mathrm{opt}}(\cdot, \cdot)$ is the reproducing kernel of $\mathcal{H}_{\mathrm{opt}}$. The optimal RKHS $\mathcal{H}_{\mathrm{opt}}$ for a given classification problem is learned by finding the optimal kernel $k_{\mathrm{opt}}$ during training, as discussed in the following.

Theorem 4.3 states that any positive definite radial kernel on $S^n$ admits the series representation shown in Eq. (7). Therefore, the optimal kernel $k_{\mathrm{opt}}$ must also have such a series representation. We approximate this series with a finite summation by cutting off the terms beyond the order $M$. More specifically, we use:

$$k_{\mathrm{opt}}(w, f) \approx \sum_{m=0}^{M} \alpha_m k_m(w, f) + \alpha_{-1} k_{\mathrm{odd}}(w, f) + \alpha_{-2} k_{\mathrm{even}}(w, f), \qquad (9)$$

where $k_{\mathrm{even}}, k_{\mathrm{odd}}, k_0, k_1, \ldots, k_M$ have the meanings defined in Section 4 and $\alpha_{-2}, \alpha_{-1}, \ldots, \alpha_M \geq 0$. Using Proposition 4.2 and the discussion in the proof of Theorem 4.3, one can easily verify that this approximation does not violate the positive definiteness of $k_{\mathrm{opt}}$.

With this, $k_{\mathrm{opt}}$ is learned automatically from data by making the coefficients $\alpha_{-2}, \alpha_{-1}, \ldots, \alpha_M$ learnable parameters of the classification layer. Let $\alpha = [\alpha_{-2}, \alpha_{-1}, \ldots, \alpha_M]^T$. The gradient of the loss function with respect to $\alpha$ can be easily calculated via the backpropagation algorithm using Eqs. (4), (8), and (9). Therefore, it can be optimized along with $w_1, w_2, \ldots, w_L$ during the gradient descent based optimization of the network. This procedure is equivalent to automatically finding the RKHS that optimally separates the feature vectors belonging to different classes.

The constraint $\alpha_{-2}, \alpha_{-1}, \ldots, \alpha_M \geq 0$ in Eq. (9) can be imposed with the commonly used ReLU operation. We discuss this in more detail in Section 5.2. As for the number of kernels $M$ in the approximation, as long as it is sufficiently large, the exact value of $M$ is not critical. This is because, as discussed in the proof of Theorem 4.3, the higher order terms that are truncated approach either $k_{\mathrm{odd}}$ or $k_{\mathrm{even}}$, both of which are already included in the finite summation. On the flip side, if the terms beyond some order $M' < M$ are not important, the network can automatically learn to make the corresponding $\alpha$ coefficients vanish. In practice, we observed that 10 kernels work well enough and stick to this number in all our experiments.

Importantly, the kernelized classification layer described above can pass on the gradients of the loss to its inputs, the feature vectors $f$. Therefore, our kernelized classification layer is fully compatible with end-to-end training and can act as a drop-in replacement for an existing softmax classification layer.
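The following is a minimal PyTorch sketch of such a layer (our illustration, not the authors' code). For brevity it keeps only the polynomial terms $k_0, \ldots, k_M$ of Eq. (9), omitting the $k_{\mathrm{odd}}$ and $k_{\mathrm{even}}$ terms; it $l_2$-normalizes both features and weights so the kernel is evaluated on $S^n$, and constrains the coefficients with a ReLU as discussed next.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelizedClassifier(nn.Module):
    """Drop-in replacement for a linear softmax classifier (sketch of Eqs. (4), (5), (9)).

    Only the polynomial terms k_m(w, f) = <w, f>^m, m = 0..M, are kept here,
    with learnable non-negative coefficients alpha; k_odd / k_even are omitted.
    """

    def __init__(self, feature_dim, num_classes, num_kernels=10):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feature_dim) * 0.01)
        # Raw coefficients alpha'; alpha = ReLU(alpha') >= 0. Initialized to all ones.
        self.alpha_raw = nn.Parameter(torch.ones(num_kernels + 1))
        self.num_kernels = num_kernels

    def forward(self, features):
        f = F.normalize(features, dim=1)               # project features onto S^n
        w = F.normalize(self.weight, dim=1)            # project weights onto S^n
        cosine = torch.matmul(f, w.t())                # <w_j, f>, shape (batch, L)
        alpha = F.relu(self.alpha_raw)                 # enforce alpha_m >= 0
        # k_opt(w_j, f) ~= sum_m alpha_m * <w_j, f>^m  (polynomial part of Eq. (9))
        powers = torch.stack([cosine ** m for m in range(self.num_kernels + 1)], dim=-1)
        return (powers * alpha).sum(dim=-1)            # kernelized logits

# Usage: the logits feed the usual cross-entropy (Eq. (4)); argmax gives Eq. (5).
layer = KernelizedClassifier(feature_dim=64, num_classes=100)
feats = torch.randn(8, 64)
labels = torch.randint(0, 100, (8,))
loss = F.cross_entropy(layer(feats), labels)
pred = layer(feats).argmax(dim=1)
```

Since the predictor remains a simple argmax over the kernelized logits, the layer plugs into an existing training loop unchanged.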
The constraint $\alpha_{-2}, \alpha_{-1}, \ldots, \alpha_M \geq 0$ is important in order to preserve the positive definiteness of $k_{\mathrm{opt}}$. As noted above, this can be straightforwardly imposed by using $\alpha = \mathrm{ReLU}(\alpha')$, where $\alpha'$ is the learnable parameter vector. However, ReLU has no upper bound, and allowing the scale of $\alpha$ to grow unboundedly causes the following issue in optimization. Assume we have an instantiation $\alpha_0$ of the vector $\alpha$. By replacing $\alpha_0$ with $\lambda \alpha_0$, where $\lambda > 1$, we scale all the inner product terms in Eq. (4) and Eq. (5) by the same $\lambda$. As a result, we improve the loss of the already correctly classified training examples, but without making any effective change to the predictor. This is analogous to decreasing the temperature [18] of the softmax loss. Therefore, under this setting, once the majority of the training examples are correctly classified, the neural network can easily improve the loss just by increasing the norm of $\alpha$, which is not useful. We therefore advocate an $l_2$-regularization (weight decay) term on $\alpha$ when the ReLU activation is used.

Alternatively, one could also use $\alpha = \mathrm{sigmoid}(\alpha')$ or $\alpha = \mathrm{softmax}(\alpha')$, both of which not only guarantee $\alpha_{-2}, \alpha_{-1}, \ldots, \alpha_M \geq 0$, but also produce a bounded $\alpha$. Therefore, no regularization on $\alpha$ is needed for these options. The softmax activation here should not be confused with the softmax loss discussed in Section 3. The usage of the softmax activation in this context is similar to that in the self-attention literature [36], where it is used to normalize the coefficients of a linear combination. Note also that both ReLU and sigmoid are elementwise operations on $\alpha'$, while softmax is a vector operation.

We expect the kernelized classification layer to be particularly useful in settings where the capacity of the feature learning or backbone network is limited. This is because a capacity-limited network might not be able to learn fully linearly separable features, and therefore a nonlinear classifier can be useful to augment its capabilities.

Another common method used to improve classification with capacity-limited networks is knowledge distillation [18], where the logit or probability outputs of a larger teacher network are used to train a smaller student network. While training the student network, the loss function (partially) consists of the cross-entropy loss with the teacher network's output. More specifically, using the same notation as in Section 3, assume that for a training example $(x, y)$, the teacher produces logits $h = [h_1, h_2, \ldots, h_L]^T$. Then, for the student network with the feature vector $f = r^{(\Theta)}(x)$ and a usual classification layer parameterized by the weight vectors $w_1, w_2, \ldots, w_L$, the cross-entropy loss is given by:

$$l_{\mathrm{st}}(h, f) = -\sum_{j'=1}^{L} \tilde{h}_{j'} \log\left(\frac{\exp(w_{j'}^T f / T)}{\sum_{j=1}^{L} \exp(w_j^T f / T)}\right), \qquad (10)$$

where $\tilde{h}_{j'} = \exp(h_{j'}/T) \,/\, \sum_{j=1}^{L} \exp(h_j/T)$ and $T$ is the temperature hyperparameter.

In distillation, the student network tries to imitate a teacher network, which is capable of producing more powerful feature vectors than the student. Intuitively, therefore, the student could benefit from using a powerful nonlinear classifier on the weak feature vectors it produces. With this in mind, we explore the use of the kernelized classification layer in the student network.
The cross-entropy loss with the teacher scores in this case is:

$$l'_{\mathrm{st}}(h, f) = -\sum_{j'=1}^{L} \tilde{h}_{j'} \log\left(\frac{\exp\left(\langle \phi(w_{j'}), \phi(f) \rangle_{\mathcal{H}} / T\right)}{\sum_{j=1}^{L} \exp\left(\langle \phi(w_j), \phi(f) \rangle_{\mathcal{H}} / T\right)}\right), \qquad (11)$$

where all the terms have the meanings defined earlier.
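A minimal sketch of Eq. (11) (our own illustration, reusing the hypothetical `KernelizedClassifier` from the earlier snippet to produce the student's kernelized scores) computes the soft cross-entropy between temperature-scaled teacher probabilities and the student's kernelized logits:

```python
import torch
import torch.nn.functional as F

def kernelized_distillation_loss(student_logits, teacher_logits, temperature=20.0):
    """Soft cross-entropy of Eq. (11).

    student_logits: (batch, L) kernelized scores <phi(w_j), phi(f)>_H from the student.
    teacher_logits: (batch, L) logits h produced by the teacher network.
    """
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)   # h_tilde
    log_probs = F.log_softmax(student_logits / temperature, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()

# Usage with the sketch layer from Section 5:
#   loss = kernelized_distillation_loss(student_layer(features), teacher_logits)
```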
6. Experiments
In this section, we present experimental evaluations of our method. Note that our focus is to validate the theory presented in the previous sections and demonstrate its efficacy, not to claim state-of-the-art results on the already well explored image classification datasets.

Figure 1. Classification of a binary synthetic dataset on $S^2$. (a) Training data. (b) Softmax classifier's regions. (c) Kernelized classifier's regions. In the second and third images, the binary regions identified by the respective classifiers are shown in blue and orange colors. Training data has been overlaid in all three images. Note that the softmax classifier (middle) can only separate cap-like regions on the sphere, whereas our kernelized classifier (right) can do more complex nonlinear classification thanks to the higher dimensional RKHS embedding of the sphere. Best viewed in color.
For each experiment, our baseline method used the standard softmax-based classifier. To evaluate our method, we replaced this classification layer with our proposed kernelized classification layer. In all experiments we use 10 kernels, meaning that the kernelized classification layer uses only 10 additional learnable parameters. As discussed in Section 5.2, we also used the ReLU activation and a weight decay of 0.0001 on this 10-dimensional parameter vector. This is the same amount of weight decay used in the other parts of the network. This vector was initialized to all ones in all our experiments.

To keep the number of learnable parameters comparable, we keep the additive bias term in the baseline classifier and omit it in the kernelized classifier. This bias term introduces a number of learnable parameters equal to the number of classes. Therefore, in most cases, the baseline model actually uses more learnable parameters than the kernelized classifier.

We also removed the ReLU activation on the feature vectors to utilize the full surface of $S^n$. We did the same to the baseline model as well to enable a fair comparison. For the baseline model, removal of the ReLU activation or the normalization of the feature vectors did not make a significant difference in the accuracy (see Section 6.5 for more details).

Throughout the experiments, we use SGD with momentum 0.9, linear learning rate warmup [14], and cosine learning rate decay [25], and decide the base learning rate by cross validation. When a better learning rate schedule is available for the baseline (e.g., the CIFAR schedule in [17]), we experimented with both that and our schedule and report the best accuracy of the two. The maximum number of epochs was 450 in all cases. The mini-batch size was 128 for the synthetic and CIFAR datasets and 64 for the other datasets with larger images. We used the CIFAR data augmentation method in [17] for the CIFAR-10 and CIFAR-100 datasets, and the Imagenet data augmentation in the same paper for the other image datasets.

Method                           Accuracy
Softmax classifier (baseline)    85.51
Kernelized classifier (ours)
Bayes optimal classifier         95.06

Table 1. Results on the synthetic dataset. Note that the accuracy of the kernelized classifier is close to that of the ideal Bayes optimal classifier (theoretical maximum).
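Returning to the training setup above, the optimizer and learning rate schedule can be put together as in the sketch below; the base learning rate and warmup length shown here are assumptions (the text selects the base rate by cross validation and does not state the warmup length).

```python
import math
import torch

def make_optimizer_and_schedule(model, base_lr=0.1, warmup_epochs=5,
                                total_epochs=450, weight_decay=1e-4):
    # SGD with momentum 0.9 and weight decay, as described in the setup.
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=weight_decay)

    def lr_factor(epoch):
        # Linear warmup [14] followed by cosine decay [25].
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
    return optimizer, scheduler
```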
We use our first experiment to evaluate the proposed kernelized classification layer as an isolated unit by demonstrating its capability to learn nonlinear patterns on $S^n$. To this end, inspired by the blue-orange dataset used in Chapter 2 of [16], we generated a two-class dataset on $S^2$ with a mixture of Gaussian clusters for each class. More specifically, we first generated 10 cluster centers for each class by sampling from an isotropic Gaussian distribution with mean $[1, 0, 0]^T$ for the first class and $[0, 1, 0]^T$ for the second class. We then generated 5,000 observations for each class using the following method: for each observation, we uniform-randomly picked a cluster center and then generated a sample from an isotropic Gaussian distribution centered at that cluster center. All the observations were projected onto $S^2$ by $l_2$ normalizing them.

We then considered these observations as feature vectors and trained a usual softmax-based classification layer (baseline) and our kernelized classification layer on them for comparison. Results on a test set generated in the same manner are shown in Table 1. We also report the theoretical maximum accuracy, the accuracy of the Bayes optimal classifier [16], which we can calculate for this synthetic dataset since we know the data generating model. It is noteworthy that our kernelized classification layer significantly outperforms the baseline and gets close to the Bayes optimal performance. This can be attributed to the layer's capability of learning nonlinear patterns on the sphere by embedding the data into a much higher dimensional RKHS.

We visualize our training data, the baseline classifier's output, and the kernelized classifier's output in Figure 1. Note that the usual softmax classifier can only separate cap-like regions on $S^2$; this is a result of its being a linear classifier with respect to the feature vectors. Our kernelized classifier, on the other hand, can do a more complex nonlinear separation of the data.

We now report results on the CIFAR-10 and CIFAR-100 real world image benchmarks [22]. For each dataset, we experimented with several CIFAR ResNet architectures [17]. Results are summarized in Table 2. In all cases our method significantly outperforms the baseline.

Representation    CIFAR-10               CIFAR-100
Learner           Baseline    Ours       Baseline    Ours
ResNet-8          83.73
ResNet-14         89.87
ResNet-20         91.14
ResNet-32         92.22
ResNet-44         92.10
ResNet-56         93.01

Table 2. Results on CIFAR-10 and CIFAR-100 datasets. Accuracy of the proposed kernelized classifier is higher than that of the baseline in all cases.
In the next set of experiments we evaluate our method in a transfer learning setting. To this end, we take a ResNet-50 network pre-trained on the Imagenet ILSVRC 2012 classification dataset [13] and fine-tune it on the Oxford-IIIT Pets [28] and Stanford Cars [21] datasets. For each dataset, we use the train/test splits provided by the standard TensorFlow Datasets implementation [1]. Results are summarized in Table 3. On both datasets, the kernelized classification layer results in significant gains. This is intuitive to understand since the feature vectors learned from the source task might not linearly separate the new classes in the target task. We can therefore benefit from a nonlinear classifier in this setting.

Method              Accuracy
                    Baseline    Ours
Oxford-IIIT Pets    92.06
Stanford Cars       90.78

Table 3. Results on the transfer learning datasets. Accuracy of the proposed kernelized classifier is higher than that of the baseline on both datasets.
We now demonstrate the capabilities of our method in the distillation setting discussed in Section 5.3. We used the CIFAR-10 and CIFAR-100 datasets, the baseline CIFAR ResNet-56 models from Table 2 as the teacher models, and the LeNet-5 network [24] as the student model. We use only the cross-entropy loss with the teacher scores, with the temperature parameter $T$ set to 20 in all cases. Results are shown in Table 4. Once again, significant gains are observed with the kernelized classification layer. This can be attributed to its capability of approximating complex teacher probabilities even with weak features.

Method       Accuracy
             Baseline    Ours
CIFAR-10     76.06
CIFAR-100    44.38

Table 4. Results in the distillation setting. Accuracy of the proposed nonlinear classifier is higher than that of the baseline in all cases.
We conclude our experimental evaluation with a number of ablation studies and other analyses. For all the studies below, we used the CIFAR-100 dataset and the ResNet-56 backbone network.
As discussed earlier, we use $l_2$ normalization on both the features and the weight vectors in the classification layer. One can wonder what the effect of a similar normalization would be on the baseline model. To answer this question, we experimented with a setup where we used normalized embeddings and normalized weight vectors in the usual softmax loss (Eq. (3)). Out of the box, this had significantly lower performance than the softmax loss with unnormalized features/weights. This is because when both features and weights are normalized, we have the restriction $-1 \leq w_j^T f \leq 1$ in Eq. (3), which cripples the usual softmax loss function. This can, however, be circumvented by using an appropriate temperature parameter [18]. In this setting, the performance of the softmax classifier stays largely stable as long as the temperature parameter is small enough. This is consistent with what the authors of [41] have observed. We summarize our results in Table 5, where we used a temperature of 0.05. Note that making the temperature a learnable parameter has the same optimization issue outlined in Section 5.2.

In all the experiments in the previous sections, we use normalized features for the baseline (row 3 in Table 5), which we believe provides a level playing field without limiting the capabilities of the softmax classifier.

Method                                          Accuracy
Softmax classifier with:
  no normalization                              71.16
  features normalized                           71.23
  features & weights normalized                 60.34
  features & weights normalized with temp.      70.43
Kernelized classifier (ours)

Table 5. Effect of the normalization of features and weights. Normalizing both features and weights cripples the softmax classifier, but this can be circumvented by using an appropriate temperature (0.05 in this case).
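For reference, the "features & weights normalized with temp." baseline in Table 5 amounts to a temperature-scaled cosine softmax; a minimal sketch (our own illustration, with the temperature value 0.05 taken from above) is:

```python
import torch
import torch.nn.functional as F

def cosine_softmax_loss(features, weights, labels, temperature=0.05):
    # Normalized softmax loss: logits are cosine similarities scaled by 1 / temperature.
    f = F.normalize(features, dim=1)
    w = F.normalize(weights, dim=1)
    logits = torch.matmul(f, w.t()) / temperature     # -1/T <= logit <= 1/T
    return F.cross_entropy(logits, labels)
```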
Another thing we do differently from the usual image classification networks such as ResNet [17] is removing the ReLU activation from the feature vectors. We do this in order to utilize the full surface of $S^n$ without restricting ourselves to only the nonnegative orthant. As is evident from Table 6, removing the ReLU has only a marginal effect on the baseline. It is, however, an important factor for our method. In all our experiments in the previous sections, we used feature vectors without the ReLU activation.

Method                            Accuracy
Softmax classifier with:
  rectified features              70.96
  unrectified features            71.23
Kernelized classifier with:
  rectified features              71.61
  unrectified features

Table 6. Effect of rectification of the feature vectors. Using unrectified feature vectors is important for the kernelized classifier.
As discussed in Section 5.2, we have several choices to impose the constraint that the kernel coefficients $\alpha_m$ are non-negative. Using the notation from that section, we experimented with ReLU, sigmoid, and softmax activations on $\alpha'$ and summarize the results in Table 7. Due to the reasons discussed in Section 5.2, we used a weight decay of 0.0001 on the coefficient vector whenever the ReLU activation is used. Although the sigmoid and softmax activations eliminate the need for weight decay, they put a hard constraint on $|k_{\mathrm{opt}}(\cdot, \cdot)|$ and therefore on $|\langle \phi(w), \phi(f) \rangle_{\mathcal{H}}|$. To overcome this limitation, it is helpful to use a temperature hyperparameter in Eq. (4), where each inner product is multiplied by $1/T$ before taking the exponential. We used a temperature of 0.1 and 0.005 with sigmoid and softmax, respectively. Although sigmoid gives the best performance in Table 7, we occasionally observed optimization issues with it, which might be due to the vanishing gradient issue associated with this activation function. We therefore stick to ReLU in all other experiments. We however note that, in most cases, competitive results can be obtained with softmax as well, when used with a temperature of 0.005.

It is also interesting to note that using no activation function on $\alpha'$ causes frequent divergence in training. This is consistent with the theory: the summation in Eq. (9) is not guaranteed to be positive definite when the $\alpha_m$ are allowed to be negative (see Proposition 4.2). Therefore, the theory of kernelized classification is not valid in this case.

Activation function    Accuracy
sigmoid
softmax
ReLU

Table 7. Different activation functions on the coefficient vector. Note that the kernelized classifier is unstable when no activation function is used; this agrees with the theoretical analysis.
To investigate what benefits automatic learning of the kernel gives us, we performed an experiment where we kept the coefficient vector $\alpha$ fixed at its initial value, all ones, throughout the training. Results are shown in Table 8.

Method                  Accuracy
Coefficients fixed      73.16
Coefficients trained

Table 8. Effect of training the kernel coefficient vector. End-to-end training of the coefficient vector with the rest of the network gives the best results.
7. Conclusion
We presented a kernelized classification layer for deep neural networks. This classification layer classifies feature vectors in a high dimensional RKHS while automatically learning the optimal kernel that enables this high-dimensional embedding. We showed substantial accuracy improvements with this layer in the image classification, transfer learning, and distillation settings. These accuracy improvements are due to the kernelized classification layer's capability of finding nonlinear patterns in the feature vectors.

In the future, we plan to extend this idea to other loss functions such as regression losses and other tasks such as object detection and semantic image segmentation.
References

[1] TensorFlow Datasets: a collection of ready-to-use datasets. [Online; accessed Nov-10-2020].
[2] Shawkat Ali and Kate A. Smith-Miles. A meta-learning approach to automatic kernel selection for support vector machines. Neurocomputing, 2006.
[3] Nachman Aronszajn. Theory of Reproducing Kernels. Transactions of the American Mathematical Society, 1950.
[4] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.
[5] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, 1984.
[6] Weikai Chen, Xiaoguang Han, Guanbin Li, Chao Chen, Jun Xing, Yajie Zhao, and Hao Li. Deep RBFNet: Point cloud feature learning using radial basis functions, 2019.
[7] Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 342-350. Curran Associates, Inc., 2009.
[8] Nadav Cohen, Or Sharir, and Amnon Shashua. Deep SimNets. CoRR, abs/1506.03059, 2015.
[9] C. Cortes and V. Vapnik. Support Vector Networks. Machine Learning, 1995.
[10] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie. Kernel pooling for convolutional neural networks. In CVPR, 2017.
[11] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients, 2015.
[12] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection. In CVPR, 2005.
[13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
[14] Priya Goyal, Piotr Dollár, Ross B. Girshick, P. Noordhuis, L. Wesolowski, Aapo Kyrola, Andrew Tulloch, Y. Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. ArXiv, abs/1706.02677, 2017.
[15] Mehmet Gönen, Ethem Alpaydın, and Francis Bach. Multiple kernel learning algorithms. JMLR, 2011.
[16] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[18] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the Knowledge in a Neural Network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
[19] Tom Howley and Michael G. Madden. The genetic kernel support vector machine: Description and evaluation. Artif. Intell. Rev., 2005.
[20] Sadeep Jayasumana, Richard Hartley, Mathieu Salzmann, Hongdong Li, and Mehrtash Harandi. Optimizing Over Radial Kernels on Compact Manifolds. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[21] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshops, Sydney, Australia, 2013.
[22] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[23] Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood: Approximating kernel expansions in loglinear time. In ICML, 2013.
[24] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278-2324, 1998.
[25] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR) 2017 Conference Track, Apr. 2017.
[26] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[27] Julien Mairal, Piotr Koniusz, Zaïd Harchaoui, and Cordelia Schmid. Convolutional kernel networks. CoRR, abs/1406.3332, 2014.
[28] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In CVPR, 2012.
[29] I. J. Schoenberg. Metric Spaces and Positive Definite Functions. Transactions of the American Mathematical Society, 1938.
[30] I. J. Schoenberg. Positive Definite Functions on Spheres. Duke Mathematical Journal, 1942.
[31] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[32] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[33] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
[34] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, Berlin, Heidelberg, 1995.
[35] Manik Varma and D. Ray. Learning the discriminative power-invariance trade-off. In ICCV, 2007.
[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, 2017.
[37] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV, pages 606-613, 2009.
[38] Chen Wang, Jianfei Yang, Lihua Xie, and Junsong Yuan. Kervolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 31-40, 2019.
[39] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. CoRR, abs/1511.02222, 2015.
[40] Z. Yang, M. Moczulski, M. Denil, N. De Freitas, L. Song, and Z. Wang. Deep fried convnets. In ICCV, 2015.
[41] Xinyang Yi, Ji Yang, Lichan Hong, Derek Zhiyuan Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed Chi. Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems, New York, NY, USA, 2019. Association for Computing Machinery.
[42] Georgios Zoumpourlis, Alexandros Doumanoglou, Nicholas Vretos, and Petros Daras. Non-linear convolution filters for CNN-based learning. CoRR, abs/1708.07038, 2017.