Deep Gaussian Processes with Convolutional Kernels
Vinayak Kumar∗, Vaibhav Singh∗, P. K. Srijith, and Andreas Damianou†
Department of Computer Science and Engineering, Indian Institute of Technology Hyderabad, India
Amazon Research, Cambridge, United Kingdom
{vinayakk,cs16mtech11017,srijith}@iith.ac.in, [email protected]

∗ Equal contribution. † A. Damianou contributed to this work prior to joining Amazon.

Abstract.
Deep Gaussian processes (DGPs) provide a Bayesian non-parametric alternative to standard parametric deep learning models. A DGP is formed by stacking multiple GPs, resulting in a well-regularized composition of functions. The Bayesian framework that equips the model with attractive properties, such as implicit capacity control and predictive uncertainty, makes it at the same time challenging to combine with a convolutional structure. This has hindered the application of DGPs in computer vision tasks, an area where deep parametric models (i.e. CNNs) have made breakthroughs. Standard kernels used in DGPs, such as radial basis functions (RBFs), are insufficient for handling pixel variability in raw images. In this paper, we build on the recent convolutional GP to develop Convolutional DGP (CDGP) models which effectively capture image-level features through the use of convolution kernels, therefore opening up the way for applying DGPs to computer vision tasks. Our model learns local spatial influence and outperforms strong GP based baselines on multi-class image classification. We also consider various constructions of the convolution kernel over image patches, analyze the computational trade-offs, and provide an efficient framework for convolutional DGP models. The experimental results on image data such as MNIST, rectangles-image, CIFAR10 and Caltech101 demonstrate the effectiveness of the proposed approaches.
Keywords:
Gaussian Processes · Bayesian Deep Learning · Convolutional Neural Network · Variational Inference
1 Introduction

Deep learning models have made tremendous progress in computer vision problems through their ability to learn complex functions and representations [1].
They learn complex functions mapping some input x to output y through a composition of linear and non-linear functions. However, popular deep learning models based on convolutional and recurrent neural networks have significant limitations. The parametric form of the functions leads them to have millions of parameters to estimate, which is less suitable for problems where data are scarce. Deep learning models, though probabilistic in nature, do not provide any uncertainty estimates on their predictions. Knowledge of uncertainty helps in better decision making and is crucial in high-risk applications such as disease diagnosis and autonomous driving [2]. Another major limitation of existing deep learning networks is model selection: developing an appropriate deep learning model to solve a problem is time consuming and computationally expensive. Deep Gaussian processes (DGPs) [3] constitute a deep Bayesian non-parametric approach based on Gaussian processes (GPs) and have the potential to overcome the aforementioned limitations.

The original DGP model was introduced by [3,4], inspired by the hierarchical GP-LVM structure [5], and variations have emerged in recent years, mainly differing in the employed inference procedure. While [3] employs a mean-field variational posterior over the latent layers, [6] extends this formulation with amortized inference, [7] considers a nested variational inference approach, and [8] uses an approximate expectation propagation procedure. Further, [9] achieves scalability through random Fourier features, while the approach of [10] considers the variational posterior to be conditioned on the previous layer, preserving correlations across the layers, and uses a doubly stochastic variational inference approach.

All these DGP models use kernels such as the radial basis function (RBF), which is inadequate for problems in computer vision, such as object detection. They fail to capture the wide variability of objects in images due to pose, illumination and complex backgrounds. The RBF kernel captures similarity between images on a global scale and is not invariant to unwanted variations in the image. On the other hand, convolutional neural networks (CNNs) [1] learn image representations from raw pixel data which are invariant to such perturbations. They learn features important for the object detection task by successively convolving the representations with filters, applying non-linearities and performing feature pooling.

We propose to use convolutional kernels [11] in DGPs to learn salient features from images which are invariant to transformations. This is different from recent works which combine CNNs and GPs in a hybrid mode, such as [12,13,14]. In particular, [13] replaces the fully connected layers of a CNN with GPs, aiming at obtaining well-calibrated probabilities, while in deep kernel learning [12] the kernel in GPs is computed using deep neural networks. In contrast, our approach brings the convolutional structure inside the deep GP model, through kernels, and remains fully non-parametric.

Convolutional kernels can effectively learn rich representations of the data. The similarity between structured objects such as images is computed by considering the similarity of the sub-structures in the object, which makes it invariant to transformations in the image. They have been used to compute similarities between structured objects such as graphs and trees [15,16]. Recently, they were used as a covariance function in GPs and were found to be very effective for object recognition tasks [17]. Here, the kernel computations between images are done by summing a base kernel acting over different patches of the images.

We introduce convolutional kernels in the DGP framework in order to extract discriminative features from images for object classification. Our work builds on the convolutional GP [17] and extends it to the deep learning case, allowing the resulting model to additionally perform hierarchical feature learning. We consider various DGP architectures obtained by stacking together convolutional and RBF kernels in various combinations. Further, we consider variants of the convolutional kernel, such as weighted convolutional kernels which provide more discriminative features, and combinations of RBF kernels as the base kernel. Convolutional kernels are computationally expensive as they require performing a summation over all patches of the image. We propose an approach to improve the computational efficiency by random sub-sampling of the patches. We demonstrate the effectiveness of the proposed approaches for image classification on benchmark data sets such as MNIST, Rectangles-image, CIFAR10 and Caltech101. The experiments show that DGP models typically achieve better generalization performance by using convolutional kernels compared to state-of-the-art shallow GP models.
2 Background

We consider the image classification problem with $C$ classes and $N$ training data points $X = \{\mathbf{x}_i\}_{i=1}^N$ with corresponding labels $\mathbf{y} = \{y_i\}_{i=1}^N$, where $\mathbf{x}_i \in \mathbb{R}^{W \times H}$ and $y_i \in \mathcal{Y} = \{1, 2, \ldots, C\}$. Assume there exists a latent function $f : \mathbb{R}^{W \times H} \rightarrow \mathcal{Y}$ mapping the training data to the outputs. In a Bayesian setting, we strive to learn a posterior distribution over this function, so that we can use it to compute the predictive distribution over the test labels. This allows one to make sound predictions about the test data labels, taking into account the uncertainty about them. Gaussian processes provide a Bayesian non-parametric approach to perform classification. In this section, we summarize Gaussian process classification and deep Gaussian processes (DGPs), which lay the groundwork for our model.

2.1 Gaussian Process Classification

A GP is defined as a collection of random variables, any finite subset of which is Gaussian distributed [18]. It allows one to specify a prior distribution over real-valued functions $f$, represented as $f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$, where $m(\mathbf{x})$ is the mean function and $k(\mathbf{x}, \mathbf{x}')$ provides the covariance between the function values at two data points $\mathbf{x}$ and $\mathbf{x}'$. The kernel function determines various properties of the function, such as stationarity and smoothness. A popular kernel function is the radial basis function (RBF, or squared exponential) kernel, as it can model any smooth function.
It is given by $k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp(-\kappa \|\mathbf{x} - \mathbf{x}'\|^2)$, where the length-scale parameter $\kappa$ determines the variation in function values across the inputs.

For multi-class classification problems, we associate a separate function $f_c$ with each class $c$. An independent GP prior is placed over each of these functions, $f_c(\mathbf{x}) \sim \mathcal{GP}(m_c(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$. Let $\mathbf{f}_c = [f_c(\mathbf{x}_1), f_c(\mathbf{x}_2), \cdots, f_c(\mathbf{x}_N)]^\top$ be a column vector of function values at the input data points for class $c$. Further, let $F$ be the matrix formed by stacking all column vectors $\{\mathbf{f}_c\}_{c=1}^C$, with $F_{n,c}$ representing the latent function value of the $n$th sample for class $c$ and $F_n$ representing the vector of latent function values over all classes for the $n$th sample. The GP prior over $F$ takes the form $p(F) = \prod_{c=1}^C \mathcal{N}(\mathbf{f}_c; m_c(X), K_{XX})$, where $K_{XX}$ is the $N \times N$ covariance matrix formed by evaluating the kernel over all pairs of training data points. For a data point $n$, the likelihood of it belonging to class $c$, $p(y_n = c \mid F_n)$, is obtained by considering a soft-max link function. The posterior distribution over $F$ is obtained by combining the prior and the likelihood using Bayes theorem:

$$p(F \mid \mathbf{y}) = \frac{\prod_{n=1}^N p(y_n \mid F_n)\, p(F)}{p(\mathbf{y})}.$$

In GP multi-class classification, the posterior distribution cannot be computed in closed form due to the non-conjugacy between the likelihood and the prior. Learning in GPs involves learning the kernel hyper-parameters by maximizing the evidence $p(\mathbf{y}) = \int \prod_{n=1}^N p(y_n \mid F_n)\, p(F)\, dF$, which also cannot be computed in closed form. The posterior distribution can be approximated as a Gaussian using approximate inference techniques such as the Laplace approximation [19] and variational inference [20,21,22]. The Gaussian approximate posterior is then used to make predictions on the test data points. Variational inference has received a lot of interest recently, as it does not suffer from convergence problems, unlike Markov chain Monte Carlo techniques, and provides a posterior approximation quickly by solving an optimization problem. It is scalable to large data sets and amenable to distributed processing. It also provides a lower bound on the marginal likelihood, which can be used to perform model selection. The variational inference approach learns an approximate posterior distribution $q(F)$ by minimizing the KL divergence between $q(F)$ and $p(F \mid \mathbf{y})$. Choosing a mean-field family of variational distributions, $q(F)$ factorizes across classes (columns), i.e. $q(F) = \prod_c q(\mathbf{f}_c)$. Each variational factor $q(\mathbf{f}_c)$ is assumed to be a Gaussian with variational parameters, mean vector $\boldsymbol{\mu}_c$ and covariance $\Sigma_c$. In the variational inference framework, minimizing the KL divergence with respect to the variational parameters is equivalent to maximizing the variational Evidence Lower BOund (ELBO), given by

$$\mathcal{L}(\{\boldsymbol{\mu}_c, \Sigma_c\}_{c=1}^C) = \mathbb{E}_{q(F)}\Big[\log \prod_{n=1}^N p(y_n \mid F_n)\Big] - \sum_{c=1}^C \mathrm{KL}(q(\mathbf{f}_c)\,\|\,p(\mathbf{f}_c)). \quad (1)$$

The variational parameters $\{\boldsymbol{\mu}_c, \Sigma_c\}_{c=1}^C$ and the kernel hyper-parameters $\{\sigma_f, \kappa\}$ are learnt by jointly maximizing the variational lower bound in eq. (1) using any gradient-based approach. The KL divergence term in eq. (1) involves inversion of the covariance matrix $K_{XX}$, which scales as $O(N^3)$ computationally.
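Before moving to the sparse approximation below, the quantities defined so far are easy to make concrete. The following is a minimal NumPy sketch (our own illustration, not the paper's code) of the RBF kernel above and of drawing the per-class functions $\mathbf{f}_c \sim \mathcal{N}(\mathbf{0}, K_{XX})$ from the prior $p(F)$; note the $N \times N$ matrix $K_{XX}$, whose factorization drives the $O(N^3)$ cost.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma_f=1.0, kappa=0.5):
    """k(x, x') = sigma_f^2 * exp(-kappa * ||x - x'||^2)."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :] - 2.0 * X1 @ X2.T)
    return sigma_f**2 * np.exp(-kappa * sq_dists)

def sample_gp_prior(X, n_classes, jitter=1e-6):
    """Draw one sample f_c ~ N(0, K_XX) per class, i.e. one draw from p(F)."""
    K = rbf_kernel(X, X) + jitter * np.eye(len(X))   # N x N covariance
    L = np.linalg.cholesky(K)                        # O(N^3) factorization
    return L @ np.random.randn(len(X), n_classes)    # columns are the f_c

X = np.random.rand(5, 4)              # 5 inputs with 4 flattened pixels each
F = sample_gp_prior(X, n_classes=3)   # (5, 3) matrix of latent values
```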
Therefore, we opt for the variational sparse Gaussian process approximation [23,24], which reduces the computational complexity to $O(NM^2)$, where $M \ll N$ is the number of inducing points. Specifically, the variational sparse approximation expands the latent function space with $M$ inducing variables $\mathbf{u} \in \mathbb{R}^M$, which are latent function values at the inducing points $Z = \{\mathbf{z}_i\}_{i=1}^M$. Within the context of GP multi-class classification, we additionally have the inducing variable outputs $\mathbf{u}_c$ for each class $c$, which are stacked together to form the matrix $U \in \mathbb{R}^{M \times C}$. The joint GP prior over $\{\mathbf{f}_c, \mathbf{u}_c\}$ is then

$$\begin{bmatrix} \mathbf{f}_c \\ \mathbf{u}_c \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mathbf{f}_c \\ \mathbf{u}_c \end{bmatrix}; \begin{bmatrix} m_c(X) \\ m_c(Z) \end{bmatrix}, \begin{bmatrix} K_{XX} & K_{XZ} \\ K_{XZ}^\top & K_{ZZ} \end{bmatrix}\right), \quad (2)$$

where $K_{XZ}$ is the $N \times M$ covariance matrix between the training inputs $X$ and the inducing inputs $Z$, and $K_{ZZ}$ is the $M \times M$ covariance matrix over the inducing points $Z$. The conditional distribution of $\mathbf{f}_c$ given $\mathbf{u}_c$ is

$$p(\mathbf{f}_c \mid \mathbf{u}_c, X, Z) = \mathcal{N}\big(\mathbf{f}_c;\, m_c(X) + K_{XZ} K_{ZZ}^{-1}(\mathbf{u}_c - m_c(Z)),\; K_{XX} - K_{XZ} K_{ZZ}^{-1} K_{XZ}^\top\big),$$

and the marginal distribution over $\mathbf{u}_c$ is $p(\mathbf{u}_c) = \mathcal{N}(\mathbf{u}_c; m_c(Z), K_{ZZ})$. The variational sparse approximation of [24] considers a joint variational posterior over $\{\mathbf{f}_c, \mathbf{u}_c\}$ in the factorized form $q(\mathbf{f}_c, \mathbf{u}_c) = p(\mathbf{f}_c \mid \mathbf{u}_c, X)\, q(\mathbf{u}_c)$. Assuming Gaussian variational factors for the inducing points, $q(\mathbf{u}_c) = \mathcal{N}(\mathbf{u}_c; \mathbf{m}_c, S_c)$, the variational lower bound (ELBO) can be derived as

$$\mathcal{L}(\{\mathbf{m}_c, S_c\}_{c=1}^C) = \mathbb{E}_{q(F)}\Big[\log \prod_{n=1}^N p(y_n \mid F_n)\Big] - \sum_{c=1}^C \mathrm{KL}(q(\mathbf{u}_c)\,\|\,p(\mathbf{u}_c)). \quad (3)$$

Following [21], the variational posterior is $q(F) = \prod_{c=1}^C q(\mathbf{f}_c)$, where $q(\mathbf{f}_c)$ is obtained by integrating out $\mathbf{u}_c$ from $p(\mathbf{f}_c \mid \mathbf{u}_c)\, q(\mathbf{u}_c)$ and is given by $\mathcal{N}(\mathbf{f}_c; \tilde{\mathbf{m}}_c, \tilde{V}_c)$, where $\tilde{\mathbf{m}}_c = m_c(X) + K_{XZ} K_{ZZ}^{-1}(\mathbf{m}_c - m_c(Z))$ and $\tilde{V}_c = K_{XX} - K_{XZ} K_{ZZ}^{-1}(K_{ZZ} - S_c) K_{ZZ}^{-1} K_{ZX}$. The expected log-likelihood term above is intractable due to the non-conjugate likelihood (softmax in this case). One can apply a quadrature-based [21] or reparameterization-based [25] Monte Carlo sampling scheme to approximate it.
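The marginal $q(\mathbf{f}_c) = \mathcal{N}(\tilde{\mathbf{m}}_c, \tilde{V}_c)$ follows directly from the kernel matrices. A sketch under the assumption of zero mean functions, reusing `rbf_kernel` from the previous snippet (the helper name `svgp_marginal` is ours):

```python
import numpy as np

def svgp_marginal(Kxx, Kxz, Kzz, m_c, S_c, jitter=1e-6):
    """q(f_c) = N(m_tilde, V_tilde) given q(u_c) = N(m_c, S_c), zero means."""
    Kzz = Kzz + jitter * np.eye(len(Kzz))
    A = np.linalg.solve(Kzz, Kxz.T)        # K_zz^{-1} K_zx, shape (M, N)
    m_tilde = A.T @ m_c                    # K_xz K_zz^{-1} m_c
    V_tilde = Kxx - A.T @ (Kzz - S_c) @ A  # K_xx - K_xz Kzz^{-1}(Kzz - S_c)Kzz^{-1} K_zx
    return m_tilde, V_tilde

Z = np.random.rand(3, 4)                   # M = 3 inducing inputs
m_t, V_t = svgp_marginal(rbf_kernel(X, X), rbf_kernel(X, Z),
                         rbf_kernel(Z, Z), np.zeros(3), np.eye(3))
```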
2.2 Deep Gaussian Processes

Deep Gaussian processes (DGPs) [3,4,10] learn complex functions by stacking GPs one over the other, resulting in a deep architecture of GPs. The function mapping one hidden layer to the next in a DGP is more expressive and data dependent than the pre-fixed sigmoid non-linearity used in standard parametric deep learning approaches. In addition, it avoids a large number of parameters, requiring only a few kernel hyper-parameters and a few variational parameters (few due to the sparse GP approach). Deep GPs typically do not overfit on small data due to Bayesian model averaging, and the stochasticity inherent in GPs naturally allows them to handle uncertainty in the data. Furthermore, by using a kernel which enables automatic relevance determination, one can automatically learn the dimensionality of the hidden layers (number of neurons) [4]. This overcomes the model selection problem in deep learning to a great extent.

DGPs consider the function mapping input to output to be represented as a composition of functions, $f(\mathbf{x}) = f^L \circ (f^{L-1} \circ \ldots \circ (f^1(\mathbf{x})))$, assuming there are $L$ layers. The $l$th layer consists of $D_l$ functions $f^l = \{f^l_j\}_{j=1}^{D_l}$, mapping the representation in layer $l-1$ to the $D_l$-dimensional representation of layer $l$. Independent GP priors are placed over the functions, $f^l_j(\cdot) \sim \mathcal{GP}(m^l_j(\cdot), k^l(\cdot, \cdot))$, with $f^l_j$ producing the $j$th component of the representation in layer $l$. The $j$th function in the first layer, $f^1_j$, acts on the input data point $\mathbf{x}_i$ to produce the representation $F^1_{i,j} = f^1_j(\mathbf{x}_i)$. In general, the $j$th function in layer $l$, $f^l_j(\cdot)$, acts on the representation $F^{l-1}_i$ of the data point $\mathbf{x}_i$ at layer $l-1$ to produce the representation $F^l_{i,j} = f^l_j(F^{l-1}_i)$. Let $\mathbf{f}^l_j$ denote the $j$th component of the representation at layer $l$ computed over all inputs. The final layer $L$ has $C$ functions corresponding to the classes, and these function values are squashed through a soft-max function to produce the class probabilities.

We follow the DGP variant presented in [10], where the noise between layers is absorbed into the kernel. The kernel function associated with a GP in layer $l$ is defined as $k^l(F^l_i, F^l_j) = (\sigma^l_f)^2 \exp(-\kappa^l \|F^l_i - F^l_j\|^2) + (\sigma^l_n)^2 \delta_{ij}$. Following the variational sparse Gaussian process approximation explained in Section 2.1, each layer $l$ is associated with inducing variables $U^l$, which are function values over $M$ inducing points $Z^l = \{\mathbf{z}^l_i\}_{i=1}^M$ associated with that layer. Let $\mathbf{u}^l_j$ represent the inducing variables associated with the $j$th representation at layer $l$. The number of inducing points is kept fixed at $M$ for all layers (only for convenience), and a joint GP prior is considered over the latent function values and inducing points. The joint distribution $p(\mathbf{y}, F, U)$ is given by

$$\underbrace{\prod_{n=1}^N p(y_n \mid F^L_n)}_{\text{Likelihood}}\;\underbrace{\prod_{l=1}^L \prod_{j=1}^{D_l} p(\mathbf{f}^l_j \mid \mathbf{u}^l_j, F^{l-1}, Z^l)\, p(\mathbf{u}^l_j \mid Z^l)}_{\text{Deep GP prior}}, \quad (4)$$

where the deep GP prior is placed recursively over the entire latent space with $F^0 = X$, and a soft-max likelihood is used for classification. The conditional above is

$$p(\mathbf{f}^l_j \mid \mathbf{u}^l_j, F^{l-1}, Z^l) = \mathcal{N}(\mathbf{f}^l_j;\, \mathrm{mean}(\mathbf{f}^l_j),\, \mathrm{cov}(\mathbf{f}^l_j)), \quad (5)$$

where

$$\mathrm{mean}(\mathbf{f}^l_j) = m^l_j(F^{l-1}) + K^l_{F^{l-1} Z^l}(K^l_{Z^l Z^l})^{-1}(\mathbf{u}^l_j - m^l_j(Z^l))$$
$$\mathrm{cov}(\mathbf{f}^l_j) = K^l_{F^{l-1} F^{l-1}} - K^l_{F^{l-1} Z^l}(K^l_{Z^l Z^l})^{-1}(K^l_{F^{l-1} Z^l})^\top.$$

The posterior distribution $p(F, U \mid \mathbf{y})$ and the marginal likelihood $p(\mathbf{y})$ cannot be computed in closed form due to the intractability of the marginal prior over $\{F^l\}_{l=2}^L$: it involves integrating out the previous layer, which appears in a non-linear manner inside the covariance matrices $K^l_{F^{l-1} F^{l-1}}$ in (5). Along with the non-conjugate likelihood, this brings additional difficulty to the DGP model.
Multiple approaches have been suggested in the literature for achieving tractability in DGPs, such as variational inference [3,7,10], amortized inference [6], expectation propagation [8] and random Fourier features [9]. Here we follow the variational inference approach, and assume the variational posterior to have the form $q(F, U) = \prod_{l=1}^L \prod_{j=1}^{D_l} p(\mathbf{f}^l_j \mid \mathbf{u}^l_j, F^{l-1}, Z^l)\, q(\mathbf{u}^l_j)$, where $q(\mathbf{u}^l_j) = \mathcal{N}(\mathbf{u}^l_j; \mathbf{m}^l_j, S^l_j)$ [24,3,23]. Let $\mathbf{m}^l$ be the vector formed by concatenating the vectors $\mathbf{m}^l_j$, and $S^l$ the block-diagonal covariance matrix formed from the $S^l_j$. We can formulate the ELBO by extending the methodology described in Section 2.1 to multiple layers [3,10] as follows:

$$\mathcal{L}(\{\mathbf{m}^l, S^l\}_{l=1}^L) = \sum_{n=1}^N \mathbb{E}_{q(F^L_n)}[\log p(y_n \mid F^L_n)] - \sum_{l=1}^L \mathrm{KL}(q(U^l)\,\|\,p(U^l)), \quad (6)$$

where the marginal distribution of the function values at layer $L$ over all the data points is obtained as

$$q(F^L \mid \{Z^l, \mathbf{m}^l, S^l\}_{l=1}^L) = \int_{F^1, F^2, \cdots, F^{L-1}} \prod_{l=1}^L q(F^l \mid F^{l-1}, Z^l, \mathbf{m}^l, S^l)\; dF^1 \ldots dF^{L-1}, \quad (7)$$

and the conditional distribution in (7) is computed as

$$q(F^l \mid F^{l-1}, Z^l, \mathbf{m}^l, S^l) = \prod_{j=1}^{D_l} \int p(\mathbf{f}^l_j \mid \mathbf{u}^l_j, F^{l-1}, Z^l)\, q(\mathbf{u}^l_j)\, d\mathbf{u}^l_j = \prod_{j=1}^{D_l} \mathcal{N}(\mathbf{f}^l_j; \tilde{\mathbf{m}}^l_j, \tilde{V}^l_j), \quad (8)$$

where

$$\tilde{\mathbf{m}}^l_j = m^l_j(F^{l-1}) + K^l_{F^{l-1} Z^l}(K^l_{Z^l Z^l})^{-1}(\mathbf{m}^l_j - m^l_j(Z^l)) \quad (9)$$
$$\tilde{V}^l_j = K^l_{F^{l-1} F^{l-1}} - K^l_{F^{l-1} Z^l}(K^l_{Z^l Z^l})^{-1}(K^l_{Z^l Z^l} - S^l_j)(K^l_{Z^l Z^l})^{-1}(K^l_{F^{l-1} Z^l})^\top. \quad (10)$$

The marginal distribution in (7) is intractable due to the presence of the stochastic terms $\{F^{l-1}\}_{l=2}^L$ inside the conditional distributions $\{q(F^l \mid F^{l-1}, Z^l, \mathbf{m}^l, S^l)\}_{l=2}^L$ in a non-linear manner. This intractability renders the expected log-likelihood in (6) intractable even for a Gaussian likelihood. We approximate it via Monte Carlo sampling, as done in [10]. As shown in [10], the marginal variational posterior over the function values in the final layer for the $n$th data point, i.e. $q(F^L_n)$, depends only on the $n$th marginals of all the previous layers. Each $F^l_n$ is sampled from $q(F^l_n \mid F^{l-1}_n, Z^l, \mathbf{m}^l, S^l) = \mathcal{N}(F^l_n; \tilde{\mathbf{m}}^l[n], \tilde{V}^l[n])$, where $\tilde{\mathbf{m}}^l[n]$ (a $D_l$-dimensional vector) and $\tilde{V}^l[n]$ (a $D_l \times D_l$ diagonal matrix) are respectively the mean and covariance of the $n$th data point over the representations in layer $l$, and depend on $F^{l-1}_n$. Applying the "reparameterization trick", the sampling can be written as

$$F^l_n = \tilde{\mathbf{m}}^l[n] + \boldsymbol{\epsilon}^l \odot (\tilde{V}^l[n])^{1/2}; \qquad \boldsymbol{\epsilon}^l \sim \mathcal{N}(\boldsymbol{\epsilon}^l; \mathbf{0}, I_{D_l}).$$
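In code, this per-data-point propagation is a short loop. The sketch below abstracts each layer as a callable returning the mean and diagonal variance of $q(F^l_n \mid F^{l-1}_n)$ (in a real model these come from eqs. (9)-(10)); the toy marginals in the usage example are placeholders of ours, not the paper's implementation:

```python
import numpy as np

def sample_through_layers(x_n, layers, rng=np.random):
    """Propagate one input through the DGP by reparameterized sampling."""
    F = x_n
    for layer_marginal in layers:        # layer_marginal: F -> (mean, var)
        mean, var = layer_marginal(F)
        eps = rng.randn(*mean.shape)     # eps ~ N(0, I_{D_l})
        F = mean + eps * np.sqrt(var)    # F_n^l = m + eps * V^{1/2}
    return F                             # sample of F_n^L

# Toy two-layer example with fixed (placeholder) marginals.
layers = [lambda F: (np.tanh(F[:3]), 0.1 * np.ones(3)),
          lambda F: (F.sum() * np.ones(2), 0.05 * np.ones(2))]
F_L = sample_through_layers(np.random.rand(4), layers)
```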
The lower bound can be written as a sum over the data points, and the parameters can be updated based on gradients computed on a mini-batch of data. This enables one to use stochastic gradient techniques for maximizing the variational lower bound. This stochasticity in the gradient computation, combined with the stochasticity introduced by the Monte Carlo sampling in the variational lower bound computation, results in the doubly stochastic variational inference method for deep GPs.
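A hedged sketch of the resulting estimator: the expected log-likelihood is approximated by Monte Carlo samples from `sample_through_layers` above, and the data-fit term of a mini-batch $B$ is rescaled by $N/|B|$ to keep the estimate unbiased; `log_lik` and `kl_total` are assumed to be supplied by the model:

```python
import numpy as np

def elbo_estimate(batch, N, layers, log_lik, kl_total, n_samples=5):
    """Doubly stochastic ELBO estimate on a mini-batch of (x_n, y_n) pairs."""
    data_fit = 0.0
    for x_n, y_n in batch:
        samples = [sample_through_layers(x_n, layers) for _ in range(n_samples)]
        data_fit += np.mean([log_lik(y_n, F_L) for F_L in samples])
    # Rescale the likelihood term to the full data set; the KL terms are exact.
    return (N / len(batch)) * data_fit - kl_total
```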
3 Convolutional Deep Gaussian Processes

We combine convolutional GP kernels [17] with deep Gaussian processes in order to obtain the convolutional deep Gaussian process (CDGP). A CDGP can capture salient features which are invariant to variations in the image through its convolutional structure, while simultaneously performing strong function learning throughout its depth, all within a Bayesian framework. This results in a powerful, well-calibrated model for tasks like image classification.
Our starting point is the recently introduced convolutional Gaussian process (CGP) [17], where the function evaluation on an image is considered as a sum of functions over the patches of the input image. Assuming there are $P$ patches in $\mathbf{x}$, with each patch $\mathbf{x}^{[p]}$ being $w \times h$ dimensional, the CGP considers $f(\mathbf{x}) = \sum_{p=1}^P g(\mathbf{x}^{[p]})$. Placing a zero-mean GP prior over the patch response function, $g(\mathbf{x}^{[p]}) \sim \mathcal{GP}(0, k_g(\mathbf{x}^{[p]}_i, \mathbf{x}^{[p]}_j))$, induces a zero-mean GP prior over the function $f(\mathbf{x})$ with a convolutional kernel (Conv kernel) $k_f$:

$$f(\mathbf{x}) \sim \mathcal{GP}(0, k_f(\mathbf{x}_i, \mathbf{x}_j)), \qquad k_f(\mathbf{x}_i, \mathbf{x}_j) = \sum_{p=1}^P \sum_{p'=1}^P k_g(\mathbf{x}^{[p]}_i, \mathbf{x}^{[p']}_j). \quad (11)$$

We refer to $k_g$ as the base kernel. Considering a convolutional kernel for computing the similarities between images is useful in capturing non-local similarities. The convolutional kernel compares one region in the image $\mathbf{x}_i$ with another region in the image $\mathbf{x}_j$, and can provide a high similarity even under transformations in the image. The kernel computation over patches $(\mathbf{x}^{[p]}_i, \mathbf{x}^{[p']}_j)$ considers similarity in a spatial neighborhood, whereas other kernels (such as the RBF kernel) compute only a global similarity across images and fail to capture similarity between images under transformations.

Convolutional neural networks (CNNs) convolve the image with multiple kernels (filters), apply a non-linear operation and then feature pooling (average, max), multiple times, to learn discriminative features useful for the object detection task. Similarly to a CNN, the function $f(\mathbf{x})$ can be seen to perform average pooling of the non-linear feature maps produced by the patch response functions $g(\mathbf{x}^{[p]})$. This pooling operation results in a convolution operation in kernel space. The convolutional kernel computation between two images $\mathbf{x}_i$ and $\mathbf{x}_j$ can be expanded as

$$k_f(\mathbf{x}_i, \mathbf{x}_j) = \sum_{p'=1}^P k_g(\mathbf{x}^{[1]}_i, \mathbf{x}^{[p']}_j) + \ldots + \sum_{p'=1}^P k_g(\mathbf{x}^{[p]}_i, \mathbf{x}^{[p']}_j) + \ldots + \sum_{p'=1}^P k_g(\mathbf{x}^{[P]}_i, \mathbf{x}^{[p']}_j).$$

The convolution operation between the $p$th patch of image $\mathbf{x}_i$ (which now acts as a filter) and the image $\mathbf{x}_j$ results in a convolution signal, where the signal value at any point $p'$ is obtained by computing the dot product between the filter $\mathbf{x}^{[p]}_i$ and the patch $\mathbf{x}^{[p']}_j$. This dot product is performed by the base kernel, which transforms these patches into feature vectors in a high-dimensional space and computes the dot product between them in that space. The $p$th summand is then the sum of the convolution signal values obtained at all points.

The convolutional DGP considers multiple functions drawn from a GP prior with convolutional kernels to form the representation of the image in the first layer. The function corresponding to the $o$th representation in layer 1 is obtained as

$$f_o(\mathbf{x}) = \sum_{p=1}^P g_o(\mathbf{x}^{[p]}); \qquad g_o(\mathbf{x}^{[p]}) \sim \mathcal{GP}(m_o(\mathbf{x}^{[p]}), k_g(\mathbf{x}^{[p]}_i, \mathbf{x}^{[p]}_j)) \quad (12)$$

$$f_o(\mathbf{x}) \sim \mathcal{GP}(m_o(\mathbf{x}), k_f(\mathbf{x}_i, \mathbf{x}_j)); \qquad k_f(\mathbf{x}_i, \mathbf{x}_j) = \sum_{p=1}^P \sum_{p'=1}^P k_g(\mathbf{x}^{[p]}_i, \mathbf{x}^{[p']}_j). \quad (13)$$

Each output in layer 1 captures different features of the image. The feature representations obtained in the first layer are then mapped using GPs with convolutional or RBF kernels to obtain further representations; a sketch of the kernel computation is given below.
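The double sum in eq. (11) translates directly into code. A brute-force NumPy sketch of ours (reusing `rbf_kernel` from Section 2 as the base kernel), which also makes visible the $O(P^2)$ cost discussed next:

```python
import numpy as np

def extract_patches(img, w=3, h=3, stride=1):
    """All w x h patches of a 2D image, flattened to rows of a (P, w*h) array."""
    H, W = img.shape
    patches = [img[r:r + h, c:c + w].ravel()
               for r in range(0, H - h + 1, stride)
               for c in range(0, W - w + 1, stride)]
    return np.stack(patches)

def conv_kernel(img_i, img_j, base_kernel):
    """k_f(x_i, x_j) = sum_p sum_p' k_g(x_i^[p], x_j^[p']) as in eq. (11)."""
    Pi = extract_patches(img_i)            # patches x_i^[p]
    Pj = extract_patches(img_j)            # patches x_j^[p']
    return base_kernel(Pi, Pj).sum()       # P x P base-kernel matrix, summed

k = conv_kernel(np.random.rand(28, 28), np.random.rand(28, 28), rbf_kernel)
```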
In general, the function corresponding to the $o$th representation in layer $l$ is

$$f^l_o(F^{l-1}) \sim \mathcal{GP}(m^l_o(F^{l-1}), k^l_f(F^{l-1}_i, F^{l-1}_j)), \qquad k^l_f(F^{l-1}_i, F^{l-1}_j) = \sum_{p=1}^P \sum_{p'=1}^P k^l_g(F^{l-1\,[p]}_i, F^{l-1\,[p']}_j). \quad (14)$$

The kernel matrices involved in the computation of the conditional distribution in eq. (8), such as $K^l_{F^{l-1} F^{l-1}}$, $K^l_{F^{l-1} Z^l}$ and $K^l_{Z^l Z^l}$, use the convolutional kernel defined in (14). As before, $Z^l$ represents the inducing points associated with layer $l$ and has the same dimensionality as $F^{l-1}$. The variational lower bound expression and the "reparameterization trick" remain the same as derived for deep GPs in Section 2.2.

We also consider variants of the convolutional kernel, such as the weighted convolutional kernel (Wconv kernel) [17]. It associates a weight with each patch, which allows the kernel to give differential weightage to the patches, useful for object detection. The function $f(\mathbf{x})$, in general for any layer, is then considered as

$$f(\mathbf{x}) = \sum_{p=1}^P w_p\, g(\mathbf{x}^{[p]}); \qquad k_f(\mathbf{x}_i, \mathbf{x}_j) = \sum_{p=1}^P \sum_{p'=1}^P w_p w_{p'}\, k_g(\mathbf{x}^{[p]}_i, \mathbf{x}^{[p']}_j). \quad (15)$$

Convolutional kernels provide an effective way to capture the similarity across images, but they are computationally expensive. Computing the similarity between two images involves an $O(P^2)$ computational cost, where $P$ is the number of patches in the input image or feature representation. For an input image of size $W \times H$, this is of the order of $O(W^2 H^2)$ when the stride length and patch size are small. This is costly even for image data sets such as MNIST and rectangles, which contain images of size $28 \times 28$. Following [17], we place the inducing points in the space of patches, $Z^l_j \in \mathbb{R}^{w \times h}$, rather than in the input space $\mathbb{R}^{W \times H}$. In this case, computation of the entries of the matrix $K^l_{F^{l-1} Z^l}$ can be performed in $O(P)$ time, and that of $K^l_{Z^l Z^l}$ in constant time:

$$K^l_{F^{l-1} Z^l}[i, j] = k^l_f(F^{l-1}_i, Z^l_j) = \sum_{p=1}^P k^l_g(F^{l-1\,[p]}_i, Z^l_j) \quad (16)$$

$$K^l_{Z^l Z^l}[i, j] = k^l_g(Z^l_i, Z^l_j). \quad (17)$$

However, computation of the entries of the matrix $K^l_{F^{l-1} F^{l-1}}$, which appears in the conditional distribution in (8), still requires $O(P^2)$ computations for the first layer, making it a costly operation. This makes the approach practically inapplicable to high-dimensional data sets such as Caltech101, even with a reduced image size. Moreover, in these images a lot of information is shared by overlapping patches and is redundant for the computation of the similarity across images. We propose to use random sub-sampling of the patches when computing the convolutional kernel for the entries of the matrices $K^l_{F^{l-1} F^{l-1}}$ and $K^l_{F^{l-1} Z^l}$. Let $S, S' \subset \{1, 2, \ldots, P\}$ represent the random subsets. For the $o$th representation of layer 1 ($F^0 = X$), we consider the covariance functions to be as follows:

$$f_o(\mathbf{x}) = \sum_{p \in S} g_o(\mathbf{x}^{[p]}) \quad (18)$$

$$k_f(\mathbf{x}_i, \mathbf{x}_j) = \sum_{p \in S} \sum_{p' \in S'} k_g(\mathbf{x}^{[p]}_i, \mathbf{x}^{[p']}_j) \quad (19)$$

$$k_f(\mathbf{x}_i, Z_j) = \mathbb{E}_g[f_o(\mathbf{x}_i)\, g_o(Z_j)] = \mathbb{E}_g\Big[\sum_{p \in S} g_o(\mathbf{x}^{[p]}_i)\, g_o(Z_j)\Big] \quad (20)$$

$$= \sum_{p \in S} k_g(\mathbf{x}^{[p]}_i, Z_j). \quad (21)$$

This reduces the cost of computing the matrix $K^l_{F^{l-1} F^{l-1}}$ for layer 1 to $O(|S||S'|)$, where the subset sizes satisfy $|S|, |S'| \ll P$. The computational speedup achieved through random sub-sampling of patches is demonstrated in our experiments on Caltech101.
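The following sketch combines the weighted convolutional kernel of eq. (15) with the proposed random patch sub-sampling of eqs. (18)-(21), reusing `extract_patches` and `rbf_kernel` from the earlier snippets; the double sum then runs only over the sampled index sets $S$ and $S'$, at $O(|S||S'|)$ cost. For the cross-covariance with patch-space inducing points (eq. (21)), only a single sum over the sampled patches would be needed.

```python
import numpy as np

def wconv_kernel_subsampled(img_i, img_j, weights, base_kernel,
                            n_sub, rng=np.random):
    """Weighted conv kernel with random subsets S, S' of the patch indices."""
    Pi, Pj = extract_patches(img_i), extract_patches(img_j)
    S  = rng.choice(len(Pi), size=n_sub, replace=False)   # subset S
    Sp = rng.choice(len(Pj), size=n_sub, replace=False)   # subset S'
    Kg = base_kernel(Pi[S], Pj[Sp])                       # |S| x |S'| base kernel
    return weights[S] @ Kg @ weights[Sp]                  # sum w_p w_p' k_g(.,.)

P = len(extract_patches(np.zeros((28, 28))))              # number of patches
k = wconv_kernel_subsampled(np.random.rand(28, 28), np.random.rand(28, 28),
                            np.ones(P), rbf_kernel, n_sub=60)
```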
4 Experiments

We evaluate the generalization performance of the proposed model, convolutional deep Gaussian processes (CDGP), on various image classification data sets, namely MNIST, Rectangles-Image, CIFAR10 and Caltech101. We consider different kernel architectures of the proposed CDGP model and compare them with sparse GPs (SGP), deep GP (DGP) models with RBF kernels, and convolutional GPs (CGP) with different convolutional kernels. The convolutional deep GP uses the same inference procedure as the deep GP (the "reparameterization trick") and uses a priori fixed inducing input points, taken as the centroids of clustered images [10]. The inducing points and the linear mean function for each of the inner layers are obtained using the singular value decomposition approach mentioned in [10]. The number of inducing points is taken to be 100. We follow the same approach for convolutional GPs to maintain a fair comparison. The kernel parameters are kept the same across the outputs within a layer but differ across layers. The number of outputs in the latent layers is taken to be 30 for MNIST and 50 for the other datasets (except for the final layer, where it equals the number of classes). For the models with convolutional kernels, the patch size is taken to be 3 × 3, and we use the RBF kernel as the base kernel $k_g$ in all our experiments. The approaches are compared in terms of their accuracy in making predictions on the test data and the negative log predictive probability (NLPP) on the test data, which accounts for uncertainty in the predictions. The code has been developed on top of the GPflow [26] framework, with the ADAM [27] optimizer used to learn the kernel and variational parameters by maximizing the variational lower bound. The variational mean parameters are initialized to 0, the variance parameters to 1e−, and the length-scales to 2 for MNIST and 10 for the other datasets. All experiments were run on an Nvidia GTX 1080 Ti GPU.

4.1 MNIST

We performed experiments on the MNIST dataset with 10 classes corresponding to the digits 0–9.
We consider the standard train/test split with 60K training and 10K test images.
Table 1. Comparison of SGP, DGP, CGP and CDGP approaches with different architectures on the MNIST data set, along with the kernels used by the GP in each layer. SGP results as reported in [10].

Model   Layer 1  Layer 2  Layer 3  Layer 4  Accuracy%  NLPP
SGP     RBF      –        –        –        97.48      –
DGP1    RBF      RBF      –        –        97.94      0.073
DGP2    RBF      RBF      RBF      –        97.99      0.070
CGP1    Conv     –        –        –        95.59      0.170
CGP2    Wconv    –        –        –        97.54      0.103
CDGP1   Wconv    RBF      –        –        98.66      –
CDGP2   Conv     RBF      –        –        98.53      0.536
CDGP3   Conv     RBF      RBF      –        98.40      0.055
CDGP4   Conv     RBF      RBF      RBF      98.41      0.051
CDGP5   Wconv    Wconv    RBF      –        98.44      0.048
CDGP6   Wconv    Wconv    RBF      RBF      98.60      0.046
Table 2. Comparison of SGP, DGP, CGP and CDGP approaches with different architectures on the Rectangles-Image data set, along with the kernels used by the GP in each layer.

Model   Layer 1  Layer 2  Layer 3  Accuracy%  NLPP
SGP     RBF      –        –        76.1       0.493
DGP1    RBF      RBF      –        76.93      0.478
DGP2    RBF      RBF      RBF      76.98      0.476
CGP     Wconv    –        –        71.06      0.602
CDGP1   Wconv    RBF      –        –          –
CDGP2   Wconv    RBF      RBF      77.95      0.449
We considered CDGP and DGP models with various architectures, as described in Table 1. Parameters of the model are learned by running the ADAM optimizer for 400 epochs with a 0.01 step size and a mini-batch size of 1000. The experimental comparison indicates that the proposed CDGP model with 2 layers, the first layer using a weighted convolutional kernel and the second an RBF kernel, gave the best performance, with an accuracy of 98.66% and a low NLPP score. (Learning took around 11 hours on an Nvidia GTX 1080 Ti GPU, while the best results reported in [17] were obtained after running the optimization for 40 hours.) We also considered a combination of two RBF kernels, with length-scales initialized to 0.01 and 10, as the base kernel in a convolutional kernel. This approach gave an accuracy of 98.46%. We found that the learned length-scales are quite far apart, which shows that one RBF kernel tries to capture long-distance correlations while the other captures short-distance ones. This did not result in better results, as MNIST is a fairly simple dataset for which capturing such information might not be necessary.
Table 3. Comparison of SGP, DGP, CGP and CDGP approaches with different architectures on the CIFAR10 data set, along with the kernels used by the GP in each layer.

Model   Layer 1  Layer 2  Layer 3  Accuracy%  NLPP
DGP1    RBF      RBF      –        42.20      3.2579
DGP2    RBF      RBF      RBF      40.13      3.5785
CGP     Wconv    –        –        –          –
CDGP1   Wconv    RBF      –        51.74      2.4893
CDGP2   Wconv    RBF      RBF      51.59      2.4607

4.2 Rectangles-Image

We consider the rectangles-image data set used in [10], where a rectangle of varying height and width is placed inside each image. The patches on the border and inside of the rectangle, as well as the background patches, are sampled to make the rectangle hard to detect. (This data set is different from the simpler rectangles data used in [17], where a random-size rectangle is placed on a black background, with the pixels corresponding to the border of the rectangle in white and those inside in black.) The task is to classify whether the rectangle in an image has a larger height or width. The data set consists of 12K training images and 50K test images, and is known to require deep architectures for correct classification. We consider two different architectures of CDGP and compare them against sparse GPs, deep GPs with 2 and 3 layers, and convolutional GPs. Parameters of the models are learnt by running the ADAM optimizer for 200 epochs. As Table 2 shows, the proposed CDGP2 model obtains the best accuracy and NLPP among the compared models.

4.3 CIFAR-10

The CIFAR-10 dataset [28] consists of 60K images in total, out of which 50K are used for training while the remaining 10K images are used for testing. The dataset contains colored images of objects like airplanes, automobiles, etc. There are 10 classes in total, with 6K images per class. The dimensionality of each image is 32 × 32 × 3.
Table 4. Comparison of the training time required for different CDGP architectures with different numbers of patches on the Caltech-101 dataset.

Model                   Layer 1  Layer 2  Layer 3  Training time   Accuracy%  NLPP
CDGP1 (all patches)     Wconv    RBF      –        11 hrs 18 min   20.39      6.5811
CDGP2 (all patches)     Wconv    RBF      RBF      12 hrs 2 min    19.51      6.787
CDGP1 (random patches)  Wconv    RBF      –        1 hr 15 min     20.00      6.7009
CDGP2 (random patches)  Wconv    RBF      RBF      1 hr 19 min     18.82      7.0473
We compare the CDGP, DGP and CGP models in Table 3. Parameters of the models are learned by running the ADAM optimizer for 200 epochs with a mini-batch size of 40. We observe that the DGP models give a relatively low performance on the CIFAR10 dataset. Equipping the DGP models with convolutional kernels boosts the performance by about 10%, showing the effectiveness of convolutional kernels for image classification. However, the CDGP models were not able to obtain a performance close to the CGP. This could be an indication that, for this particular dataset, the properties of a single-layer CDGP, i.e. a CGP, are enough to learn a good classifier. In fact, the previous experiments have shown that 2-layer CDGPs typically result in the best accuracy (in comparison with deeper models), implying that a CGP already has a very large capacity for classification and therefore the addition of one layer is usually enough to improve on the results.

4.4 Caltech-101

Computation of convolutional kernels becomes prohibitive on data sets with very high dimensionality, such as Caltech101 [29]. It consists of 101 classes, with 20 images per class for training and 10 images per class for testing. The size varies slightly for each image in the actual dataset but is roughly 300 × 200 × 3.
Instead of taking all the patches of the image for computing the convolutional kernel, we randomly picked one-tenth of the total number of image patches. This resulted in a very significant speed-up in learning time without much loss in accuracy, as can be seen from Table 4. The test accuracy obtained with CDGP1 is 20.39%, and the time taken for training is 11 hrs 18 min. On the other hand, for the same model with a random subset of image patches, the training time drops to only 1 hr 15 min, with an accuracy drop of only 0.39%, making it around 10 times faster. A similar phenomenon is observed for CDGP2 with a random subset of patches, where the training time improves from 12 hrs 2 min to 1 hr 19 min, with an accuracy drop of just 0.69%. Note that the images were resized to 50 × 50, resulting in loss of information. As future work, we will conduct experiments keeping the original image size, and study the effectiveness of random sub-sampling of patches and the generalization performance of the proposed approach.
5 Conclusion

Deep GP models provide many advantages in terms of capacity control and predictive uncertainty, but they have been less effective in computer vision tasks. The commonly used RBF kernels in DGP models fail to capture the variations in image data and are not invariant to translations. In this paper we proposed a DGP model which captures the convolutional structure in image data using convolutional kernels. Our model extends convolutional GPs with the ability to learn hierarchical latent representations, making it a useful model for image classification. We incorporated different types of convolutional kernels [17] in DGP models and demonstrated their usefulness for image classification on benchmark data sets such as MNIST, Rectangles-Image and CIFAR10. In the future, we plan to develop methods to further reduce the cost of convolutional kernel computation and the memory requirements of the CDGP model for high-dimensional datasets. This will allow us to consider a larger mini-batch size, leading to reduced stochastic gradient variance and faster convergence of the optimization routine. We found that increasing the number of layers in the CDGP did not bring much improvement in performance, contrary to what we expected. We hope that our future research on faster and more effective variational inference techniques will address these limitations of convolutional DGPs.
References
1. I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
2. Y. Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
3. Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 207–215, 2013.
4. Andreas Damianou. Deep Gaussian Processes and Variational Propagation of Uncertainty. PhD thesis, University of Sheffield, 2015.
5. Neil D. Lawrence and Andrew J. Moore. Hierarchical Gaussian process latent variable models. In Proceedings of the 24th International Conference on Machine Learning, pages 481–488. ACM, 2007.
6. Zhenwen Dai, Andreas Damianou, Javier González, and Neil Lawrence. Variational auto-encoded deep Gaussian processes. In International Conference on Learning Representations (ICLR), 2016.
7. James Hensman and Neil D. Lawrence. Nested variational compression in deep Gaussian processes. arXiv preprint arXiv:1412.1370, 2014.
8. Thang Bui, Daniel Hernández-Lobato, Jose Hernandez-Lobato, Yingzhen Li, and Richard Turner. Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, pages 1472–1481, 2016.
9. Kurt Cutajar, Edwin V. Bonilla, Pietro Michiardi, and Maurizio Filippone. Random feature expansions for deep Gaussian processes. In International Conference on Machine Learning, volume 70, pages 884–893, 2017.
10. Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems 30, pages 4588–4599, 2017.
11. David Haussler. Convolution kernels on discrete structures. Technical report, University of California in Santa Cruz, 1999.
12. Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In International Conference on Artificial Intelligence and Statistics, volume 51, pages 370–378, 2016.
13. Gia-Lac Tran, Edwin V. Bonilla, John P. Cunningham, Pietro Michiardi, and Maurizio Filippone. Calibrating deep convolutional Gaussian processes. arXiv preprint, 2018.
14. John Bradshaw, Alexander G. de G. Matthews, and Zoubin Ghahramani. Adversarial examples, uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks. arXiv preprint arXiv:1707.02476, 2017.
15. S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt. Graph kernels. Journal of Machine Learning Research, 11:1201–1242, 2010.
16. M. Collins and N. Duffy. Convolution kernels for natural language. In Advances in Neural Information Processing Systems, pages 625–632, 2001.
17. Mark van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional Gaussian processes. In Advances in Neural Information Processing Systems 30, pages 2849–2858, 2017.
18. Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.
19. C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
20. K. M. A. Chai. Variational multinomial logit Gaussian process. Journal of Machine Learning Research, 2012.
21. James Hensman, Alexander G. de G. Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In International Conference on Artificial Intelligence and Statistics, 2015.
22. P. K. Srijith, P. Balamurugan, and Shirish Shevade. Gaussian process pseudo-likelihood models for sequence labeling. In European Conference on Machine Learning and Knowledge Discovery in Databases, pages 215–231, 2016.
23. James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, 2013.
24. Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.
25. Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.
26. Alexander G. de G. Matthews, Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, 2017.
27. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
28. A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
29. Li Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, 2004.