Learning Filter Scale and Orientation in CNNs
Ilker Cam [email protected]
F. Boray Tek [email protected]
Computer Engineering, Isik University, Istanbul, TR
Abstract
Convolutional neural networks have many hyperparameters, such as the filter size, the number of filters, and the pooling size, which require manual tuning. Though deep stacked structures are able to create multi-scale and hierarchical representations, manually fixed filter sizes limit the scale of representations that can be learned in a single convolutional layer. This paper introduces a new adaptive filter model that allows variable scale and orientation. The scale and orientation parameters of the filters can be learned using backpropagation. Therefore, a single convolution layer can contain filters of different scales and orientations that adapt to small or large features and objects. The proposed model uses a relatively large base size (grid) for the filters. Within the grid, a differentiable function acts as an envelope for the filters. The envelope function guides the effective filter scale and shape/orientation by masking the filter weights before the convolution, so that only the weights inside the envelope are updated during training. In this work, we employed a multivariate (2D) Gaussian as the envelope function and showed that it can grow, shrink, or rotate by updating its covariance matrix during backpropagation training. We tested the new filter model on MNIST, MNIST-cluttered, and CIFAR-10 and compared the results with networks that used conventional convolution layers. The results demonstrate that the new model can effectively learn and produce filters of different scales and orientations in a single layer. Moreover, the experiments show that the adaptive convolution layers perform equally well or better, especially when the data include objects of varying scale and noisy backgrounds.
Introduction

Naming or describing real-life objects is only meaningful with respect to a relevant scale [8]. For example, a view can be described as a leaf, a branch, or a tree depending on the distance of the observer. Natural and casual scenes are generally composed of many different entities/objects at different scales. During image acquisition, the true physical scale is usually ignored. However, the relative scale of the objects is implicitly captured and stored in the image grid and pixels.

An automated method to identify or describe objects in images can be analyzed in two parts: representation + classification. Basic classification algorithms without add-ons cannot successfully handle the variation and complexity of raw pixel-level representations of objects; instead, they rely on functions that map image pixels into different constructs, named features, which are sought to represent the image content more briefly and invariantly to various geometric and intensity changes.

Traditionally, computer vision researchers relied on manually designed feature extractors for representation. Recently, we have witnessed the success of algorithms that self-learn appropriate feature extractors. In either case, the size of an operator or probe usually determines and fixes the scale of the entities that can be represented; even in the self-learning case, the size of the probes or operators is often manually selected. On the other hand, the last two decades have seen many automated object detection/recognition algorithms that were superior to their counterparts because they comprised multi-scale processing of images [3], [6]. Multi-scale feature extractors gather and present the inherent scale information of image pixels to a subsequent classifier.
In SIFT [9] and wavelets [10], this is done by creating a multi-scale pyramid from the input image and then applying a fixed-size probe kernel to each scale. In an application of Gabor filters to object recognition, Serre et al. [12] used a hierarchy of stacked Gabor filtering layers, where the filters have predetermined scales and orientations. However, Chan et al. [1] showed that the adaptation of handcrafted filters to low-level representations is difficult. On the other hand, convolutional neural networks (CNNs) rely on stacked and hierarchical convolutions of the image to extract multi-scale information. The convention in CNNs for filter size selection is to use small, fixed-size weight kernels in the lower levels. Thanks to the stacked operation of convolutional layers, sandwiched by pooling layers which down-sample intermediate outputs, the deeper levels of a network are able to learn representations of larger scales. Though the optimality of fixed-size kernels has not been proven, the convention is to use filters as small as 3x3 in the first layer, which can be larger (5x5 or 7x7) in the later stages [16]. During backpropagation training, filters evolve to imitate the lower-level receptive fields in biological vision, which are sensitive to certain shapes and orientations. Another justification for avoiding large filter sizes is that, besides increasing computation time, they may also increase over-fitting.

Though the number of filters and their sizes in convolution layers are usually selected intuitively, researchers are seeking alternatives to improve the representation capacity of the network in deeper architectures. For example, Szegedy et al. [14] handcrafted their "inception" architecture to include a mix of parallel and wide convolution layers which use different-sized filter kernels. In a deep architecture, this approach allows multi-scale, parallel, and sparse representations. In summary, existing CNN-based methods use fixed-size convolution kernels and then rely on the fact that the shape and orientation of the filters can be inferred from the training data. Additionally, CNNs employ stacked convolution layers to successfully create multi-scale representations.

On the other hand, Hubel and Wiesel [4] discovered three types of cells in the visual cortex: simple, complex, and hyper-complex (i.e. end-stopped) cells. The simple cells are sensitive to the orientation of the excitatory input, whereas the hyper-complex cells are activated by a certain orientation, motion, and length of the stimuli. Therefore, it is biologically plausible to assume that filters of different scales, next to different orientations and directions, may also work better in CNNs.

In this study, we create a new and adaptive model of convolution layers where the filter size (actually scale) and orientation are learned during training. Therefore, a single convolution layer can have distinct filter scales and orientations. Broadly speaking, this corresponds to extracting multi-scale information using a single convolution layer. However, our aim is not to fully replace the stacked architectures and deep networks for multi-scale information.

Figure 1: Illustration of the proposed weight envelope. (a) An arbitrary differentiable envelope function controls the weight spread and shape on a regular, relatively large base grid; (b) an example initial Gaussian kernel with centered µ and initial Σ; (c) initial weights of the filter, randomly generated; (d) weights masked with the envelope (b) by simple element-wise multiplication.

Instead, our approach improves the information that can be extracted from an input (e.g. an image) in a single layer.
Additionally, the model removes the necessity of fixing convolution kernel sizes, so that the filter size can be removed from the list of hyperparameters of deep learning networks.

Our experiments use the MNIST, MNIST-cluttered, and CIFAR-10 datasets to show that the proposed model can learn and produce differently scaled filters in a single convolution layer whilst improving classification performance compared to conventional convolution layers. On the experimental side, our work concentrates on developing and proving an effective methodology for learnable, adaptive filter scales and orientations, rather than improving highly optimized state-of-the-art results on these datasets.

The Adaptive Filter Model

In deep network architectures, stacked convolution layers perform convolutions with fixed-size kernels, where sandwiched pooling layers perform downsampling operations to achieve a multi-scale and hierarchical representation. Fixed-size convolution kernels put a limit on the scale of features which can be extracted from a single layer. Though it is possible to mix several kernels of different sizes in a single convolution layer, the convention is to use a fixed size for all the kernels in a layer. Here, we introduce a new filter model which can adapt its scale and orientation, and therefore allows the development of multi-scale and differently oriented filters in a single convolution layer. To realize this, we need a smooth function that can grow, shrink, or rotate during training, acting as an envelope to guide filter scale and orientation. The following subsections explain the role of the envelope, the selection of an appropriate envelope function, and its partial derivatives, which are used in backpropagation.
The role of the envelope function is to guide the development of the filter scale and orientation. As illustrated in Figure 1, a base grid acts as the envelope and filter domain. Since it is the most common case, we will assume a two-dimensional domain; generalization to higher dimensions is straightforward. In this domain, the envelope function must be differentiable and smooth. Let us assume a base grid for an n × n, odd-sized, square filter (1), and let U be a (continuously) differentiable function defined on the grid g (coordinate space) with a parameter vector Θ ∈ R^i that defines its shape (2):

A = {1, 2, .., n},   g ∈ A × A = {(a, b) | a ∈ A and b ∈ A}   (1)

u = U(g, Θ),   Θ = {θ_1, θ_2, .., θ_i}   (2)

By updating the parameters in Θ, the envelope function must be able to grow or shrink its effective area and change its orientation. The feed-forward model of a single neuron with an input x and transfer function f can be written as:

o = f( Σ_{g ∈ A×A} x_g w_g U(g, Θ) )   (3)

Or, for a whole convolution layer with input matrix (image) X, weight matrix W, and envelope matrix U, an element-wise multiplication of U with the weight matrix W masks and scales the weights before the convolution:

O = f( X ∗ (W ◦ U) )   (4)

Since the weights cannot grow out of the envelope U, the filter size and orientation are bounded and determined by U. Assuming that the partial derivative of U with respect to each continuous parameter θ_i is defined, the update can be performed via the chain rule using the standard backpropagation algorithm with learning rate η. Note, however, that the weight update also receives u as a scaler:

w'_g := w_g − η (∂E/∂o)(∂o/∂w_g),   θ'_i := θ_i − η (∂E/∂o)(∂o/∂U)(∂U/∂θ_i)   (5)

It is well known that the continuous Gaussian kernel has unique properties which are important for generating a scale space. Simply put, the Gaussian kernel does not create new local extrema, nor enhance existing extrema, whilst smoothing the image with a variable continuous parameter [8].
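As a concrete illustration of the envelope-masked convolution in Eqs. (3)-(4), the following NumPy sketch builds a Gaussian envelope on an 11x11 grid (anticipating Eq. (6), with A = 1 and µ fixed at the grid centre), masks a random weight matrix with it, and applies a "valid" correlation to a toy input. The function names and sizes are illustrative, not the authors' Theano implementation.

```python
import numpy as np

def gaussian_envelope(n, sigma_xx, sigma_yy, sigma_xy):
    """Evaluate U(g; mu, Sigma) = exp(-(g - mu)^T Sigma^{-1} (g - mu)) on an
    n x n grid, with mu fixed at the grid centre (as in the paper)."""
    cov = np.array([[sigma_xx, sigma_xy], [sigma_xy, sigma_yy]])
    inv = np.linalg.inv(cov)
    ax = np.arange(n) - (n - 1) / 2.0                 # coordinates relative to the centre
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    d = np.stack([xx, yy], axis=-1)                   # (g - mu) at each grid point
    q = np.einsum("...i,ij,...j->...", d, inv, d)     # quadratic form in the exponent
    return np.exp(-q)

def masked_conv2d(x, w, u):
    """'Valid' correlation of x with the envelope-masked kernel W o U (Eq. 4)."""
    k = w * u                                         # weights outside the envelope vanish
    n = k.shape[0]
    h, wd = x.shape[0] - n + 1, x.shape[1] - n + 1
    out = np.empty((h, wd))
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(x[i:i + n, j:j + n] * k)
    return out

rng = np.random.default_rng(0)
u = gaussian_envelope(11, sigma_xx=1.0, sigma_yy=1.0, sigma_xy=0.0)  # initial variances 1.0, as in the paper
w = rng.standard_normal((11, 11))                     # randomly initialised weights
x = rng.standard_normal((28, 28))                     # toy input "image"
o = masked_conv2d(x, w, u)
print(u[5, 5], o.shape)                               # 1.0 (18, 18)
```

Weights near the grid border are multiplied by a near-zero envelope value, so they contribute nothing to the output until the covariance grows.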
Some of these properties are proven to exist in discrete space if the sample size is sufficiently large. Therefore, the Gaussian is an ideal candidate for the envelope function U:

U(g, µ, Σ) = A e^{ −(g − µ)^T Σ^{−1} (g − µ) }   (6)

Here, the parameter A is an optional normalization parameter; µ controls the center of the envelope, whereas the covariance Σ controls the scale and orientation of the kernel:

µ = { µ_x, µ_y },   Σ = [ σ_xx  σ_xy ; σ_xy  σ_yy ]   (7)

During feed-forward execution, the envelope function is calculated on the grid coordinates g with the current covariance Σ, and then element-wise multiplied with the weight matrix W prior to the convolution. This is illustrated in Figure 1(b)-1(d). Note that this operation not only bounds the weights and adjusts the effective area, it also scales the weights. To implement the convolution operation appropriately, we set µ as a vector of constants initialized with the center-point coordinates of the grid g; therefore, it is not updated during training. However, the covariance Σ must be updated to learn the filter scale and orientation. In order to keep Σ symmetric, we calculate the gradient for each σ and apply the update rule:

∀σ ∈ Σ,   σ := σ − η (∂E/∂U)(∂U/∂σ)   (8)

The covariance Σ must be kept symmetric and positive definite. A symmetric matrix is positive definite if x^T Σ x > 0 for all non-zero vectors x, which imposes the following conditions: σ_xx > 0, σ_yy > 0, and σ_xx σ_yy > σ_xy^2; or equivalently λ_i > 0, where λ_i denotes the eigenvalues of the covariance matrix, which can be checked to ensure positive definiteness. During training, the diagonal sigma terms are kept positive (and nonzero) by setting bounds, e.g. σ_xx = max(ε, σ_xx), whereas σ_xy is constrained by |σ_xy| < √(σ_xx σ_yy). Experiments show that the covariance behaves well during training when it is initialized properly and updated with a small learning rate, which removes the necessity for these constraints.

We implemented the adaptive convolution filters using Lasagne and Theano [15] and tested them on an Nvidia Tesla K40 GPU board. In terms of computational complexity, as can be expected, calculating the Gaussian envelope function adds extra overhead in training. However, during feed-forward execution, the trained and enveloped final weights can be stored and used immediately without any overhead. Compared to conventional filters, we use relatively large (e.g. 11x11) base filter sizes to observe adaptive growth and rotations. Note that the grid can be selected as large as the input image. ACNN ran one epoch (500 examples) in roughly 6 seconds on the MNIST dataset without an optimized implementation. (The code will be available in the camera-ready version; it is not disclosed for anonymity.)
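The positive-definiteness conditions on Σ (σ_xx > 0, σ_yy > 0, σ_xx σ_yy > σ_xy²) can be enforced with a simple projection after each gradient step. The following NumPy sketch is a hedged illustration: the floor ε and the clipping scheme for σ_xy are our own illustrative choices, not values stated in the paper.

```python
import numpy as np

EPS = 1e-2  # hypothetical lower bound on the diagonal terms

def constrain_covariance(sxx, syy, sxy):
    """Project an updated (sxx, syy, sxy) back to a symmetric positive-definite
    Sigma: sxx > 0, syy > 0 and sxx*syy > sxy^2 (both eigenvalues positive)."""
    sxx = max(EPS, sxx)
    syy = max(EPS, syy)
    bound = (sxx * syy) ** 0.5 * (1.0 - 1e-6)   # keep the determinant strictly positive
    sxy = float(np.clip(sxy, -bound, bound))
    return sxx, syy, sxy

def is_positive_definite(sxx, syy, sxy):
    """Check positive definiteness via the eigenvalues of the 2x2 matrix."""
    eig = np.linalg.eigvalsh(np.array([[sxx, sxy], [sxy, syy]]))
    return bool(np.all(eig > 0.0))

# A gradient step that would break positive definiteness is projected back:
sxx, syy, sxy = constrain_covariance(-0.5, 2.0, 3.0)
print(is_positive_definite(sxx, syy, sxy))  # True
```

As the paper notes, with a proper initialization and a small learning rate the projection is rarely triggered in practice.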
One can argue that the filter guide is unnecessary because a relatively large CNN layer can learn any filter. To examine this idea, we set up an encoder test in which the network is expected to learn a simple, known filtering operation from input-output image pairs: can a CNN learn the same filter without a filter guide? The target operations are a Gaussian blur filter, an edge filter, and a blurred edge filter.
Table 1: The network topology used to test our method. All three networks comprised 8 layers. In the conv-1 and conv-2 layers, the proposed adaptive model (ACNN) used an 11x11 base grid for its filters, whereas 'cnn-5' and 'cnn-11' used 5x5 and 11x11 filter sizes, respectively. *CIFAR-10 experiments used 16 filters instead of 8.

Layer                 Units  Filters  Filter Size  Pool Size  Activation
1- input              -      -        -            -          -
2- conv-1             -      8/16*    5x5/11x11    -          ReLU
3- maxpool-1          -      -        -            2x2        -
4- conv-2             -      8/16*    5x5/11x11    -          ReLU
5- maxpool-2          -      -        -            2x2        -
6- dropout (50%)      -      -        -            -          -
7- fully connected-1  256    -        -            -          ReLU
8- fully connected-2  10     -        -            -          Sigmoid
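As a sanity check on Table 1, the spatial dimensions of the 'cnn-5' column on a 28x28 MNIST input can be traced with a few lines of arithmetic. This assumes unpadded "valid" convolutions and non-overlapping pooling, which the paper does not state explicitly:

```python
def conv_valid(size, k):
    """Output size of an unpadded ('valid') convolution (assumption, not stated in the paper)."""
    return size - k + 1

def pool(size, p):
    """Output size of non-overlapping p x p max pooling."""
    return size // p

# Tracing a 28x28 MNIST input through the 'cnn-5' column of Table 1:
s = 28
s = conv_valid(s, 5)   # conv-1, 5x5   -> 24
s = pool(s, 2)         # maxpool-1     -> 12
s = conv_valid(s, 5)   # conv-2, 5x5   -> 8
s = pool(s, 2)         # maxpool-2     -> 4
flat = 8 * s * s       # 8 feature maps flattened before fully connected-1
print(s, flat)         # 4 128
```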
Experiments

The experiments were aimed at observing whether the adaptive filters can change their scale and orientation during training, and whether this adaptation yields improved classification performance. We tested the adaptive filter model on three different datasets, and also compared the results against two conventional CNN configurations that used different fixed filter sizes. All three networks had the same structure, comprising two convolution layers, two pooling layers, a dropout layer, and two fully connected layers (Table 1). The only difference between the adaptive CNN (ACNN) and the conventional CNNs (cnn-5, cnn-11) was the replacement of the convolution layers. We used cross-entropy as the error function. The hyperparameters were as follows: learning rate 0.01, momentum 0.95, batch size 500. In each test, we examined the training loss, the error (%), and the scale and orientation changes in the filters. Though the learning rate for Σ could be adjusted separately, this was not necessary.

MNIST

MNIST [7] is a database of handwritten digits, widely used in machine learning research to test models. It has 50,000 training and 10,000 testing images from 10 different categories. To observe the change in Σ, we calculated its eigenvalues and eigenvectors. The maximum eigenvalue represents the scale, whereas the angle of the corresponding eigenvector shows the orientation, as illustrated in Figure 2.

In Figure 3, we can observe the effects of the learned envelope functions' scale and orientation on the filters. The smoothing effect of the envelope function over the input is also observed in some outputs (Figure 3(c)). Figure 4 shows the training loss and classification error plots. The adaptive filters had no performance gain against the conventional cnn-11 and cnn-5.

Figure 2: Plots of the covariance matrix Σ change on the MNIST dataset, depicted by (a) the angle of the largest eigenvector and (b) the largest eigenvalue.

Figure 3: The first-layer (conv-1) filters at the end of training on MNIST: (a) Gaussian envelopes, (b) scaled filters, (c) outputs of a sample convolved with each filter.
Figure 4: Training loss (log-scale y axis) and classification error for MNIST.
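The eigen-analysis used for Figure 2, reading the scale off the largest eigenvalue of Σ and the orientation off the angle of its eigenvector, can be sketched as follows (the example covariance is hypothetical):

```python
import numpy as np

def scale_and_orientation(sigma):
    """Read filter scale (largest eigenvalue) and orientation (angle of the
    corresponding eigenvector, modulo 180 degrees) from a 2x2 covariance matrix."""
    vals, vecs = np.linalg.eigh(sigma)        # eigh: ascending eigenvalues, orthonormal columns
    scale = vals[-1]
    v = vecs[:, -1]                           # eigenvector of the largest eigenvalue
    angle = np.degrees(np.arctan2(v[1], v[0])) % 180.0   # orientation is sign-ambiguous
    return scale, angle

sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])                # hypothetical learned covariance
scale, angle = scale_and_orientation(sigma)
print(round(scale, 3), round(angle, 1))       # 3.0 45.0 (stretched along the diagonal)
```

The `% 180.0` accounts for the sign ambiguity of eigenvectors, since an orientation is only defined up to a half-turn.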
Figure 5: Eight randomly selected samples from the cluttered MNIST dataset.
Figure 6: Training loss (log-scale y axis) and classification error on the cluttered MNIST dataset.
Cluttered MNIST

The cluttered MNIST dataset [2] consists of 60,000 samples in 10 classes. We split this dataset into 50,000 samples for training and 10,000 for testing. Eight randomly selected samples are illustrated in Figure 5. The dataset contains 60x60 images generated from the original MNIST database with numerous distractors. Projecting the original MNIST 28x28 pixel space onto 60x60 also caused changes in scale. Thus, the dataset has scale and rotational variance, in addition to cluttered background noise, which makes it a suitable test case to demonstrate the use of adaptive filters.

Figure 6 shows the training loss and classification error, where we observe better performance compared with the conventional CNNs.
CIFAR-10

This dataset [5] is a relatively small-image (32x32x3) set with 60,000 samples from 10 different classes. We divided it into 50,000 training and 10,000 test samples. Unlike the MNIST datasets, color channels are present, and the objects are in much greater need of multi-scale features.

The classification results shown in Figure 8(b) demonstrate that ACNN again performed better in classification error. For further investigation, we also include the learned envelope functions and scaled filters in Figure 7. Compared to MNIST, the envelope functions are observably different, and include large, small, and rotated filters. The change in scale and orientation is shown in Figure 9. Compared with the change of Σ in the MNIST test, the scales and orientations have more variation: some of the filters tended to shrink, whereas some enlarged their scales.

Figure 7: The first-layer filters at the end of training on CIFAR-10: (a) Gaussian envelope functions, (b) scaled filters.

Figure 8: Training loss (log-scale y axis) and classification error on CIFAR-10.

Figure 9: Plots of the covariance matrix Σ change on the CIFAR-10 dataset, depicted by (a) the angle of the largest eigenvector and (b) the largest eigenvalue.

Conclusions

In this paper, we propose an adaptive convolution filter model based on a Gaussian kernel that acts as an envelope function on shared filter weights. The plots of scale and orientation changes during the training epochs show that the adaptive model is capable of generating differently scaled and oriented filters in a single convolution layer. However, besides bounding and scaling the convolution weights, the Gaussian kernel tends to perform smoothing on the input: if all weights were set to 1 and not trainable, the kernels would perform only a Gaussian smoothing operation on the input. The initial setting of the variance terms to 1.0 yields an initial filter of effectively 5x5 size. During training, the effective size of the filters gradually increased. This is because enlarging a filter includes more weights in the convolution, which allows further reduction of the network error. Therefore, the adaptive filter may be more prone to overfitting than a conventional fixed-size filter of the same initial size. However, because the envelope rescales the weights (to a maximum of 1.0), it has a regulative effect on their magnitudes, which should be an advantage. Overall, training the adaptive filter model did not require very fine tuning of the parameters. However, we observed that the use of a dropout layer encouraged the development of filters of different scales and orientations. This can be explained by the parallel and sparse network configurations induced by the dropout mechanism, which force the filters to prevent co-adaptation and become independent. We will investigate other ways of inducing independent filters, perhaps with an additional cost term for the network which punishes co-adaptation.

A clear benefit of our model is that it removes the filter size from the list of hyperparameters of deep learning networks. However, our main purpose is to add an adaptive multi-scale representation capacity to convolution layers. The results show that the advantage of using the new model depends on the complexity and variations in the training and test data. Among the three datasets, MNIST is the simplest: its digits are size-normalized and centered [7]. There, the adaptive filters have little or no need for scale adaptation in pixel space, which resulted in no improvement in classification error compared to conventional CNNs. However, cluttered MNIST and CIFAR-10 include examples of arbitrary scale, orientation, and centering [5], [2], which allowed the filters to adapt their scale and orientation to improve training while not overfitting. Therefore, we can conclude that the adaptive filters' expressive power is revealed on datasets with variations in scale and orientation. It is worthwhile to investigate its applications to other domains.

The new and adaptive model of convolution layers allows the filters' scale and orientation to be learned during training. Therefore, a single convolution layer can have filters at various scales and orientations, and can adapt to extract multi-scale information from its input. State-of-the-art deep networks have many layers and more complex designs compared to the networks tested in this study. An interesting question which we will investigate further is whether the adaptive filter layers can shorten the depth of state-of-the-art architectures such as inception [14], highway [13], or thin [11] networks. Though our aim is not to fully replace stacked and deep architectures, the new model may help reduce redundancy and improve accuracy. Another question is whether placing the adaptive layer in deeper levels of a network can produce additional gains by focusing on the higher-level representations.

Acknowledgements

This research was supported by a grant (undisclosed for anonymity) and the NVIDIA Hardware Grant scheme.
References

[1] Tsung-Han Chan, Kui Jia, Shenghua Gao, Jiwen Lu, Zinan Zeng, and Yi Ma. PCANet: A simple deep learning baseline for image classification? CoRR, abs/1404.3606, 2014. URL http://arxiv.org/abs/1404.3606.

[2] christopher5106. Cluttered MNIST dataset. https://github.com/christopher5106/mnist-cluttered, 2015.

[3] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 886–893. IEEE, 2005.

[4] David H. Hubel and Torsten N. Wiesel. Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1):215–243, 1968.

[5] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. doi: 10.1109/5.726791.

[7] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

[8] Tony Lindeberg. Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, Norwell, MA, USA, 1994. ISBN 0792394186.

[9] David G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1150–1157. IEEE, 1999.

[10] Stephane Mallat and Sifen Zhong. Characterization of signals from multiscale edges. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(7):710–732, 1992.

[11] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014. URL http://arxiv.org/abs/1412.6550.

[12] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):411–426, March 2007. doi: 10.1109/TPAMI.2007.56.

[13] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015. URL http://arxiv.org/abs/1505.00387.

[14] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. URL http://arxiv.org/abs/1409.4842.

[15] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.

[16] Matthew D. Zeiler and Rob Fergus. Visualizing and Understanding Convolutional Networks, pages 818–833. Springer International Publishing, Cham, 2014. doi: 10.1007/978-3-319-10590-1_53.