Learning Filter Scale and Orientation in CNNs
Ilker Cam [email protected]
F. Boray Tek [email protected]
Computer Engineering, Isik University, Istanbul, TR
Abstract
Convolutional neural networks have many hyperparameters, such as the filter size, the number of filters, and the pooling size, which require manual tuning. Though deep stacked structures are able to create multi-scale and hierarchical representations, manually fixed filter sizes limit the scale of representations that can be learned in a single convolutional layer. This paper introduces a new adaptive filter model that allows variable scale and orientation. The scale and orientation parameters of the filters can be learned using backpropagation. Therefore, a single convolution layer can contain filters of different scales and orientations that adapt to small or large features and objects. The proposed model uses a relatively large base size (grid) for the filters. Within the grid, a differentiable function acts as an envelope for the filters. The envelope function guides the effective filter scale and shape/orientation by masking the filter weights before the convolution, so that only the weights inside the envelope are updated during training. In this work, we employed a multivariate (2D) Gaussian as the envelope function and showed that it can grow, shrink, or rotate by updating its covariance matrix during backpropagation training. We tested the new filter model on MNIST, MNIST-cluttered, and CIFAR-10 and compared the results with networks that used conventional convolution layers. The results demonstrate that the new model can effectively learn and produce filters of different scales and orientations in a single layer. Moreover, the experiments show that the adaptive convolution layers perform equally well or better, especially when the data include objects of varying scale and noisy backgrounds.
Introduction

Naming or describing real-life objects is only meaningful with respect to a relevant scale [8]. For example, a view can be described as a leaf, a branch, or a tree depending on the distance of the observer. Natural and casual scenes are generally composed of many different entities/objects at different scales. During image acquisition, the true physical scale is usually ignored. However, the relative scale of the objects is implicitly captured and stored in the image grid and pixels.

An automated method to identify or describe objects in images can be analyzed in two parts: representation + classification. Basic classification algorithms without add-ons cannot successfully handle the variation and complexity of raw pixel-level representations of objects; instead, they rely on functions that map image pixels into different constructs, named features, which are sought to represent the image content more briefly and invariantly to various geometric and intensity changes.

Traditionally, computer vision researchers relied on manually designed feature extractors for representation. Recently, we have witnessed the success of algorithms that self-learn appropriate feature extractors. In either case, the size of an operator or probe usually determines and fixes the scale of the entities that can be represented; even in the self-learning case, the size of the probes or operators is often manually selected. On the other hand, the last two decades have seen many automated object detection/recognition algorithms that were superior to their counterparts because they comprised multi-scale processing of images [3], [6]. Multi-scale feature extractors gather and present the inherent scale information of image pixels to a subsequent classifier.
In SIFT [9] and wavelets [10], this is done by creating a multi-scale pyramid from the input image and then applying a fixed-size probe kernel to each scale. In an application of Gabor filters to object recognition, Serre et al. [12] used a hierarchy of stacked Gabor filtering layers, where the filters have predetermined scales and orientations. However, Chan et al. [1] showed that the adaptation of handcrafted filters to low-level representations is difficult. On the other hand, convolutional neural networks (CNNs) rely on stacked and hierarchical convolutions of the image to extract multi-scale information. The convention in CNNs for filter size selection is to use small, fixed-size weight kernels in the lower levels. Thanks to the stacked operation of convolutional layers, sandwiched by pooling layers which down-sample intermediate outputs, the deeper levels of a network are able to learn representations of larger scales. Though the optimality of fixed-size kernels has not been proven, the convention is to use filters as small as 3x3 in the first layer, which can be larger (5x5 or 7x7) in the later stages [16]. During backpropagation training, filters evolve to imitate the lower-level receptive fields in biological vision, which are sensitive to certain shapes and orientations. Another justification for avoiding large filter sizes is that, besides increasing computation time, they may also increase over-fitting.

Though the number of filters and their sizes in convolution layers are usually selected intuitively, researchers are seeking alternatives to improve the representation capacity of the network in deeper architectures. For example, Szegedy et al. [14] handcrafted their "inception" architecture to include a mix of parallel and wide convolution layers which use different-sized filter kernels. In a deep architecture, this approach allows multi-scale, parallel, and sparse representations. In summary, existing CNN-based methods use fixed-size convolution kernels and then rely on the fact that the shape and orientation of the filters can be inferred from the training data. Additionally, CNNs employ stacked convolution layers to successfully create multi-scale representations.

On the other hand, Hubel and Wiesel [4] discovered three types of cells in the visual cortex: simple, complex, and hyper-complex (i.e. end-stopped) cells. The simple cells are sensitive to the orientation of the excitatory input, whereas the hyper-complex cells are activated by a certain orientation, motion, and length of the stimuli. Therefore, it is biologically plausible to assume that filters of different scales, next to different orientations and directions, may also work better in CNNs.

In this study, we create a new and adaptive model of convolution layers where the filter size (actually scale) and orientation are learned during training. Therefore, a single convolution layer can have distinct filter scales and orientations. Broadly speaking, this corresponds to extracting multi-scale information using a single convolution layer. However, our aim is not to fully replace the stacked architectures and deep networks for multi-scale information.

Figure 1: Illustration of the proposed weight envelope. (a) An arbitrary differentiable envelope function controls the weight spread and shape on a regular, relatively large base grid; (b) an example initial Gaussian kernel with centered µ and initial Σ; (c) initial weights of the filter, randomly generated; (d) weights masked with the envelope (b) by simple element-wise multiplication.

Instead, our approach improves the information that can be extracted from an input (e.g. an image) in a single layer.
Additionally, the model removes the necessity of fixing convolution kernel sizes, so that the filter size can be removed from the list of hyperparameters of deep learning networks.

Our experiments use the MNIST, MNIST-cluttered, and CIFAR-10 datasets to show that the proposed model can learn and produce differently scaled filters in a single convolution layer whilst improving classification performance compared to conventional convolution layers. On the experimental side, our work concentrates on developing and proving an effective methodology for learnable, adaptive filter scales and orientations, rather than improving highly optimized state-of-the-art results on these datasets.

The Adaptive Filter Model

In deep network architectures, stacked convolution layers perform convolutions with fixed-size kernels, where sandwiched pooling layers perform downsampling operations to achieve a multi-scale and hierarchical representation. Fixed-size convolution kernels put a limit on the scale of features which can be extracted from a single layer. Though it is possible to mix several kernels of different sizes in a single convolution layer, the convention is to use a fixed size for all the kernels in a layer. Here, we introduce a new filter model which can adapt its scale and orientation, and therefore allows the development of multi-scale and differently oriented filters in a single convolution layer. To realize this, we need a smooth function that can grow, shrink, or rotate during training, acting as an envelope to guide filter scale and orientation. The following subsections explain the role of the envelope, the selection of an appropriate envelope function, and its partial derivatives, which are used in backpropagation.
The role of the envelope function is to guide the development of the filter scale and orientation. As illustrated in Figure 1, a base grid acts as the envelope and filter domain. Since it is the most common case, we will assume a two-dimensional domain; generalization to higher dimensions is straightforward. In this domain, the envelope function must be differentiable and smooth. Let us assume a base grid for an n × n, odd-sized, square filter (1), and let U be a (continuously) differentiable function defined on the grid g (coordinate space) with a parameter vector Θ ∈ R^i that defines its shape (2):

A = {1, 2, .., n},   g ∈ A × A = {(a, b) | a ∈ A and b ∈ A}   (1)

u = U(g, Θ),   Θ = {θ_1, θ_2, .., θ_i}   (2)

By updating the parameters in Θ, the envelope function must be able to grow or shrink its effective area and change its orientation. The feed-forward model of a single neuron with an input x and transfer function f can be written as:

o = f( Σ_{g ∈ A×A} x_g w_g U(g, Θ) )   (3)

Or, for a whole convolution layer with input matrix (image) X, weight matrix W, and envelope matrix U, an element-wise multiplication of U with the weight matrix W masks and scales the weights before the convolution:

O = f( X ∗ (W ◦ U) )   (4)

Since the weights cannot grow out of the envelope U, the filter size and orientation are bounded and determined by U. Assuming that the partial derivative of U with respect to each continuous parameter θ_i is defined, the update can be performed via the chain rule using the standard backpropagation algorithm with learning rate η. Note, however, that the weight update also receives u as a scaler:

w'_g := w_g − η (∂E/∂o)(∂o/∂w_g),   θ'_i := θ_i − η (∂E/∂o)(∂o/∂U)(∂U/∂θ_i)   (5)

It is well known that the continuous Gaussian kernel has unique properties which are important for generating a scale space. Simply put, the Gaussian kernel does not create new local extrema, nor enhance existing extrema, whilst smoothing the image with a variable continuous parameter [8].
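As a concrete illustration of the envelope-masked convolution in Eqs. (3)-(4), the following NumPy sketch builds a Gaussian envelope on an 11x11 grid (anticipating Eq. (6), with A = 1 and µ fixed at the grid centre), masks a random weight matrix with it, and applies a "valid" correlation to a toy input. The function names and sizes are illustrative, not the authors' Theano implementation.

```python
import numpy as np

def gaussian_envelope(n, sigma_xx, sigma_yy, sigma_xy):
    """Evaluate U(g; mu, Sigma) = exp(-(g - mu)^T Sigma^{-1} (g - mu)) on an
    n x n grid, with mu fixed at the grid centre (as in the paper)."""
    cov = np.array([[sigma_xx, sigma_xy], [sigma_xy, sigma_yy]])
    inv = np.linalg.inv(cov)
    ax = np.arange(n) - (n - 1) / 2.0                 # coordinates relative to the centre
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    d = np.stack([xx, yy], axis=-1)                   # (g - mu) at each grid point
    q = np.einsum("...i,ij,...j->...", d, inv, d)     # quadratic form in the exponent
    return np.exp(-q)

def masked_conv2d(x, w, u):
    """'Valid' correlation of x with the envelope-masked kernel W o U (Eq. 4)."""
    k = w * u                                         # weights outside the envelope vanish
    n = k.shape[0]
    h, wd = x.shape[0] - n + 1, x.shape[1] - n + 1
    out = np.empty((h, wd))
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(x[i:i + n, j:j + n] * k)
    return out

rng = np.random.default_rng(0)
u = gaussian_envelope(11, sigma_xx=1.0, sigma_yy=1.0, sigma_xy=0.0)  # initial variances 1.0, as in the paper
w = rng.standard_normal((11, 11))                     # randomly initialised weights
x = rng.standard_normal((28, 28))                     # toy input "image"
o = masked_conv2d(x, w, u)
print(u[5, 5], o.shape)                               # 1.0 (18, 18)
```

Weights near the grid border are multiplied by a near-zero envelope value, so they contribute nothing to the output until the covariance grows.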
Some of these properties are proven to exist in discrete space if the sample size is sufficiently large. Therefore, the Gaussian is an ideal candidate for the envelope function U:

U(g, µ, Σ) = A e^{ −(g − µ)^T Σ^{−1} (g − µ) }   (6)

Here, the parameter A is an optional normalization parameter; µ controls the center of the envelope, whereas the covariance Σ controls the scale and orientation of the kernel:

µ = { µ_x, µ_y },   Σ = [ σ_xx  σ_xy ; σ_xy  σ_yy ]   (7)

During feed-forward execution, the envelope function is calculated on the grid coordinates g with the current covariance Σ, and then element-wise multiplied with the weight matrix W prior to the convolution. This is illustrated in Figure 1(b)-1(d). Note that this operation not only bounds the weights and adjusts the effective area, it also scales the weights. To implement the convolution operation appropriately, we set µ as a vector of constants initialized with the center-point coordinates of the grid g; therefore, it is not updated during training. However, the covariance Σ must be updated to learn the filter scale and orientation. In order to keep Σ symmetric, we calculate the gradient for each σ and apply the update rule:

∀σ ∈ Σ,   σ := σ − η (∂E/∂U)(∂U/∂σ)   (8)

The covariance Σ must be kept symmetric and positive definite. A symmetric matrix is positive definite if x^T Σ x > 0 for all non-zero vectors x, which imposes the following conditions: σ_xx > 0, σ_yy > 0, and σ_xx σ_yy > σ_xy^2; or equivalently λ_i > 0, where λ_i denotes the eigenvalues of the covariance matrix, which can be checked to ensure positive definiteness. During training, the diagonal sigma terms are kept positive (and nonzero) by setting bounds, e.g. σ_xx = max(ε, σ_xx), whereas σ_xy is constrained by |σ_xy| < √(σ_xx σ_yy). Experiments show that the covariance behaves well during training when it is initialized properly and updated with a small learning rate, which removes the necessity for these constraints.

We implemented the adaptive convolution filters using Lasagne and Theano [15] and tested them on an Nvidia Tesla K40 GPU board. In terms of computational complexity, as can be expected, calculating the Gaussian envelope function adds extra overhead in training. However, during feed-forward execution, the trained and enveloped final weights can be stored and used immediately without any overhead. Compared to conventional filters, we use relatively large (e.g. 11x11) base filter sizes to observe adaptive growth and rotations. Note that the grid can be selected as large as the input image. ACNN ran one epoch (500 examples) in roughly 6 seconds on the MNIST dataset without an optimized implementation. (The code will be available in the camera-ready version; it is not disclosed for anonymity.)
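The positive-definiteness conditions on Σ (σ_xx > 0, σ_yy > 0, σ_xx σ_yy > σ_xy²) can be enforced with a simple projection after each gradient step. The following NumPy sketch is a hedged illustration: the floor ε and the clipping scheme for σ_xy are our own illustrative choices, not values stated in the paper.

```python
import numpy as np

EPS = 1e-2  # hypothetical lower bound on the diagonal terms

def constrain_covariance(sxx, syy, sxy):
    """Project an updated (sxx, syy, sxy) back to a symmetric positive-definite
    Sigma: sxx > 0, syy > 0 and sxx*syy > sxy^2 (both eigenvalues positive)."""
    sxx = max(EPS, sxx)
    syy = max(EPS, syy)
    bound = (sxx * syy) ** 0.5 * (1.0 - 1e-6)   # keep the determinant strictly positive
    sxy = float(np.clip(sxy, -bound, bound))
    return sxx, syy, sxy

def is_positive_definite(sxx, syy, sxy):
    """Check positive definiteness via the eigenvalues of the 2x2 matrix."""
    eig = np.linalg.eigvalsh(np.array([[sxx, sxy], [sxy, syy]]))
    return bool(np.all(eig > 0.0))

# A gradient step that would break positive definiteness is projected back:
sxx, syy, sxy = constrain_covariance(-0.5, 2.0, 3.0)
print(is_positive_definite(sxx, syy, sxy))  # True
```

As the paper notes, with a proper initialization and a small learning rate the projection is rarely triggered in practice.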
One can argue that the filter guide is unnecessary because a relatively large CNN layer can learn any filter. To examine this idea, we set up an encoder test in which the network is expected to learn a simple, known filtering operation from input-output image pairs: can a CNN learn the same filter without a filter guide? The target operations are a Gaussian blur filter, an edge filter, and a blurred edge filter.
Table 1: The network topology used to test our method. All three networks comprised 8 layers. In the conv-1 and conv-2 layers, the proposed adaptive model (ACNN) used an 11x11 base grid for its filters, whereas 'cnn-5' and 'cnn-11' used 5x5 and 11x11 filter sizes, respectively. *CIFAR-10 experiments used 16 filters instead of 8.

Layer                 Units  Filters  Filter Size  Pool Size  Activation
1- input              -      -        -            -          -
2- conv-1             -      8/16*    5x5/11x11    -          ReLU
3- maxpool-1          -      -        -            2x2        -
4- conv-2             -      8/16*    5x5/11x11    -          ReLU
5- maxpool-2          -      -        -            2x2        -
6- dropout (50%)      -      -        -            -          -
7- fully connected-1  256    -        -            -          ReLU
8- fully connected-2  10     -        -            -          Sigmoid
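As a sanity check on Table 1, the spatial dimensions of the 'cnn-5' column on a 28x28 MNIST input can be traced with a few lines of arithmetic. This assumes unpadded "valid" convolutions and non-overlapping pooling, which the paper does not state explicitly:

```python
def conv_valid(size, k):
    """Output size of an unpadded ('valid') convolution (assumption, not stated in the paper)."""
    return size - k + 1

def pool(size, p):
    """Output size of non-overlapping p x p max pooling."""
    return size // p

# Tracing a 28x28 MNIST input through the 'cnn-5' column of Table 1:
s = 28
s = conv_valid(s, 5)   # conv-1, 5x5   -> 24
s = pool(s, 2)         # maxpool-1     -> 12
s = conv_valid(s, 5)   # conv-2, 5x5   -> 8
s = pool(s, 2)         # maxpool-2     -> 4
flat = 8 * s * s       # 8 feature maps flattened before fully connected-1
print(s, flat)         # 4 128
```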
Experiments

The experiments were aimed at observing whether the adaptive filters can change their scale and orientation during training, and whether this adaptation yields improved classification performance. We tested the adaptive filter model on three different datasets, and also compared the results against two conventional CNN configurations that used different fixed filter sizes. All three networks had the same structure, comprising two convolution layers, two pooling layers, a dropout layer, and two fully connected layers (Table 1). The only difference between the adaptive CNN (ACNN) and the conventional CNNs (cnn-5, cnn-11) was the replacement of the convolution layers. We used cross-entropy as the error function. The hyperparameters were as follows: learning rate 0.01, momentum 0.95, batch size 500. In each test, we examined the training loss, the error (%), and the scale and orientation changes in the filters. Though the learning rate for Σ could be adjusted separately, this was not necessary.

MNIST

MNIST [7] is a database of handwritten digits, widely used in machine learning research to test models. It has 50,000 training and 10,000 testing images from 10 different categories. To observe the change in Σ, we calculated its eigenvalues and eigenvectors. The maximum eigenvalue represents the scale, whereas the angle of the corresponding eigenvector shows the orientation, as illustrated in Figure 2.

In Figure 3, we can observe the effects of the learned envelope functions' scale and orientation on the filters. The smoothing effect of the envelope function over the input is also observed in some outputs (Figure 3(c)). Figure 4 shows the training loss and classification error plots. The adaptive filters had no performance gain against the conventional cnn-11 and cnn-5.

Figure 2: Plots of the covariance matrix Σ change on the MNIST dataset, depicted by (a) the angle of the largest eigenvector and (b) the largest eigenvalue.

Figure 3: The first-layer (conv-1) filters at the end of training on MNIST: (a) Gaussian envelopes, (b) scaled filters, (c) outputs of a sample convolved with each filter.
Figure 4: Training loss (log-scale y axis) and classification error for MNIST.
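The eigen-analysis used for Figure 2, reading the scale off the largest eigenvalue of Σ and the orientation off the angle of its eigenvector, can be sketched as follows (the example covariance is hypothetical):

```python
import numpy as np

def scale_and_orientation(sigma):
    """Read filter scale (largest eigenvalue) and orientation (angle of the
    corresponding eigenvector, modulo 180 degrees) from a 2x2 covariance matrix."""
    vals, vecs = np.linalg.eigh(sigma)        # eigh: ascending eigenvalues, orthonormal columns
    scale = vals[-1]
    v = vecs[:, -1]                           # eigenvector of the largest eigenvalue
    angle = np.degrees(np.arctan2(v[1], v[0])) % 180.0   # orientation is sign-ambiguous
    return scale, angle

sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])                # hypothetical learned covariance
scale, angle = scale_and_orientation(sigma)
print(round(scale, 3), round(angle, 1))       # 3.0 45.0 (stretched along the diagonal)
```

The `% 180.0` accounts for the sign ambiguity of eigenvectors, since an orientation is only defined up to a half-turn.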
Figure 5: Eight randomly selected samples from the cluttered MNIST dataset.
Figure 6: Training loss (log-scale y axis) and classification error on the cluttered MNIST dataset.
Cluttered MNIST

The cluttered MNIST dataset [2] consists of 60,000 samples in 10 classes. We split this dataset into 50,000 samples for training and 10,000 for testing. Eight randomly selected samples are illustrated in Figure 5. The dataset contains 60x60 images generated from the original MNIST database with numerous distractors. Projecting the original MNIST 28x28 pixel space onto 60x60 also caused changes in scale. Thus, the dataset has scale and rotational variance, in addition to cluttered background noise, which makes it a suitable test case to demonstrate the use of adaptive filters.

Figure 6 shows the training loss and classification error, where we observe better performance compared with the conventional CNNs.
CIFAR-10

This dataset [5] is a relatively small-image (32x32x3) set with 60,000 samples from 10 different classes. We divided it into 50,000 training and 10,000 test samples. Unlike the MNIST datasets, color channels are present, and the objects are in much greater need of multi-scale features.

The classification results shown in Figure 8(b) demonstrate that ACNN again performed better in classification error. For further investigation, we also include the learned envelope functions and scaled filters in Figure 7. Compared to MNIST, the envelope functions are observably different, and include large, small, and rotated filters. The change in scale and orientation is shown in Figure 9. Compared with the change of Σ in the MNIST test, the scales and orientations have more variation: some of the filters tended to shrink, whereas some enlarged their scales.

Figure 7: The first-layer filters at the end of training on CIFAR-10: (a) Gaussian envelope functions, (b) scaled filters.

Figure 8: Training loss (log-scale y axis) and classification error on CIFAR-10.

Figure 9: Plots of the covariance matrix Σ change on the CIFAR-10 dataset, depicted by (a) the angle of the largest eigenvector and (b) the largest eigenvalue.

Conclusions

In this paper, we propose an adaptive convolution filter model based on a Gaussian kernel that acts as an envelope function on shared filter weights. The plots of scale and orientation changes during the training epochs show that the adaptive model is capable of generating differently scaled and oriented filters in a single convolution layer. However, besides bounding and scaling the convolution weights, the Gaussian kernel tends to perform smoothing on the input: if all weights were set to 1 and not trainable, the kernels would perform only a Gaussian smoothing operation on the input. The initial setting of the variance terms to 1.0 yields an initial filter of effectively 5x5 size. During training, the effective size of the filters gradually increased. This is because enlarging a filter includes more weights in the convolution, which allows further reduction of the network error. Therefore, the adaptive filter may be more prone to overfitting than a conventional fixed-size filter of the same initial size. However, because the envelope rescales the weights (to a maximum of 1.0), it has a regulative effect on their magnitudes, which should be an advantage. Overall, training the adaptive filter model did not require very fine tuning of the parameters. However, we observed that the use of a dropout layer encouraged the development of filters of different scales and orientations. This can be explained by the parallel and sparse network configurations induced by the dropout mechanism, which force the filters to prevent co-adaptation and become independent. We will investigate other ways of inducing independent filters, perhaps with an additional cost term for the network which punishes co-adaptation.

A clear benefit of our model is that it removes the filter size from the list of hyperparameters of deep learning networks. However, our main purpose is to add an adaptive multi-scale representation capacity to convolution layers. The results show that the advantage of using the new model depends on the complexity and variations in the training and test data. Among the three datasets, MNIST is the simplest: its digits are size-normalized and centered [7]. There, the adaptive filters have little or no need for scale adaptation in pixel space, which resulted in no improvement in classification error compared to conventional CNNs. However, cluttered MNIST and CIFAR-10 include examples of arbitrary scale, orientation, and centering [5], [2], which allowed the filters to adapt their scale and orientation to improve training while not overfitting. Therefore, we can conclude that the adaptive filters' expressive power is revealed on datasets with variations in scale and orientation. It is worthwhile to investigate its applications to other domains.

The new and adaptive model of convolution layers allows the filters' scale and orientation to be learned during training. Therefore, a single convolution layer can have filters at various scales and orientations, and can adapt to extract multi-scale information from its input. State-of-the-art deep networks have many layers and more complex designs compared to the networks tested in this study. An interesting question which we will investigate further is whether the adaptive filter layers can shorten the depth of state-of-the-art architectures such as inception [14], highway [13], or thin [11] networks. Though our aim is not to fully replace stacked and deep architectures, the new model may help reduce redundancy and improve accuracy. Another question is whether placing the adaptive layer in deeper levels of a network can produce additional gains by focusing on the higher-level representations.

Acknowledgements

This research was supported by a grant (undisclosed for anonymity) and the NVIDIA Hardware Grant scheme.
References

[1] Tsung-Han Chan, Kui Jia, Shenghua Gao, Jiwen Lu, Zinan Zeng, and Yi Ma. PCANet: A simple deep learning baseline for image classification? CoRR, abs/1404.3606, 2014. URL http://arxiv.org/abs/1404.3606.

[2] christopher5106. Cluttered MNIST dataset. https://github.com/christopher5106/mnist-cluttered, 2015.

[3] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 886–893. IEEE, 2005.

[4] David H. Hubel and Torsten N. Wiesel. Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195(1):215–243, 1968.

[5] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. doi: 10.1109/5.726791.

[7] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

[8] Tony Lindeberg. Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, Norwell, MA, USA, 1994. ISBN 0792394186.

[9] David G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1150–1157. IEEE, 1999.

[10] Stephane Mallat and Sifen Zhong. Characterization of signals from multiscale edges. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(7):710–732, 1992.

[11] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014. URL http://arxiv.org/abs/1412.6550.

[12] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):411–426, March 2007. doi: 10.1109/TPAMI.2007.56.

[13] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015. URL http://arxiv.org/abs/1505.00387.

[14] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. URL http://arxiv.org/abs/1409.4842.

[15] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.

[16] Matthew D. Zeiler and Rob Fergus. Visualizing and Understanding Convolutional Networks, pages 818–833. Springer International Publishing, Cham, 2014. doi: 10.1007/978-3-319-10590-1_53.