X-CNN: Cross-modal Convolutional Neural Networks for Sparse Datasets
Petar Veličković ∗‡, Duo Wang ∗, Nicholas D. Lane †‡ and Pietro Liò ∗
∗ Computer Laboratory, University of Cambridge, Cambridge CB3 0FD, UK
† Department of Computer Science, University College London, London WC1E 6BT, UK
‡ Nokia Bell Labs, Cambridge CB3 0FA, UK
Email: {pv273, wd263, pl219}@cam.ac.uk, [email protected]

Abstract—In this paper we propose cross-modal convolutional neural networks (X-CNNs), a novel biologically inspired type of CNN architectures, treating gradient descent-specialised CNNs as individual units of processing in a larger-scale network topology, while allowing for unconstrained information flow and/or weight sharing between analogous hidden layers of the network—thus generalising the already well-established concept of neural network ensembles (where information typically may flow only between the output layers of the individual networks). The constituent networks are individually designed to learn the output function on their own subset of the input data, after which cross-connections between them are introduced after each pooling operation to periodically allow for information exchange between them. This injection of knowledge into a model (by prior partition of the input data through domain knowledge or unsupervised methods) is expected to yield greatest returns in sparse data environments, which are typically less suitable for training CNNs. For evaluation purposes, we have compared a standard four-layer CNN as well as a sophisticated FitNet4 architecture against their cross-modal variants on the CIFAR-10 and CIFAR-100 datasets with differing percentages of the training data being removed, and find that at lower levels of data availability, the X-CNNs significantly outperform their baselines (typically providing a 2–6% benefit, depending on the dataset size and whether data augmentation is used), while still maintaining an edge on all of the full dataset tests.
I. INTRODUCTION
In recent years, the number of success stories of machine learning has seen an all-time rise across a wide range of fields and tasks, examples including: computer vision [1], speech recognition [2], reinforcement learning [3] and guiding Monte Carlo tree search [4]. The unifying idea behind all of the above is deep learning, the utilisation of neural networks with many hidden layers, for the purposes of learning complex feature representations from raw data, rather than relying on hand-crafted feature extraction.

As the networks become deeper, however, they become more and more reliant on the amount of training examples provided for maximising their performance. While we are now able to extract large quantities of labelled information for many problems of interest, there remains a significant proportion of tasks for which "big data" simply isn't available at this time, which makes it extremely difficult to fully exploit a deep CNN architecture and properly learn generalisable features of the data.

Here we will present an architectural methodology that attempts to extract additional predictive power from a convolutional neural network (CNN) in such circumstances by instead focussing on the width of the data, i.e. the heterogeneity of information present within each training example. The key idea constitutes appropriate partitioning of this information and training smaller CNNs on these partitions (allowing them to train faster and more effectively under sparse data environments), while allowing for information exchange between them at various stages (Fig. 1).

A classic example where such an approach is bound to be useful are clinical studies, where there typically may not be that many patients, but for each patient there is potentially a heterogeneous wealth of information, such as various test results, patient history, ethnic background, body scans (CT, MRI...) and so on, depending on the type of study.
II. CROSS-MODAL CNNS
Our methodology is inspired by multilayer networks [5], mathematical structures encompassing several layers of graphs over the same set of nodes, allowing for unrestricted intra-layer as well as inter-layer connections. They have been a demonstrably valuable tool for modelling a variety of natural and social systems ([6], [7], [8]), and their applicability to machine learning (within the context of hidden Markov models) was already demonstrated by some of the authors [9], managing to achieve high performance on a sparse breast cancer classification dataset involving gene expression and methylation data.

The network design process is initiated by appropriately partitioning the input data—this may be done either manually (by exploiting existing domain knowledge) or through an unsupervised pre-training step, which will determine which (not necessarily disjoint) fragments of the input data are more likely to constructively influence one another. Afterwards, a cross-modal CNN is constructed such that a separate CNN superlayer is dedicated to each partition of the input data, attempting to learn the target function from its partition only. The purpose of the partitioning is to help the constituent CNNs become powerful predictors while requiring a smaller dimensionality of the input data, by allowing them access to those parts of the input which are most significantly related to each other in the context of the predictions that need to be made.
Fig. 1. Diagram of a simple cross-modal CNN for image classification, generated from a baseline CNN of the form [Conv → Pool] ×2 → FC → Softmax. Each of the three channels (RGB/YUV) of the input image receives its own CNN superlayer, with cross-connections inserted after the pooling operation, and full weight sharing in the fully connected layers. A more in-depth view of a potential cross-connection layout is provided by Figure 2.

Finally, the superlayers may be interconnected by any sort of (feedforward) cross-connection as is best seen fit, and they may be combined in arbitrary ways at the output stage to produce the final output. Similarly, at any stage the weights of the superlayers may be shared—the simplest case, which we will explore in our analysis, constitutes complete weight sharing of the fully connected layers at the tail of the networks. This construction is biologically inspired by cross-modal systems [10] within the visual and auditory systems of the human brain (which in turn inspired the development of CNNs)—wherein several cross-connections between various sensory networks have been discovered [11], [12].

To quantify the gains of this approach, our evaluation focusses on an already well-understood problem of coloured image classification, on the established CIFAR-10/100 [13] benchmarks for which an abundance of data is available, so it is easier to investigate the effects of restricting the size of the training set on various CNN models. The partitioning of the input that we consider is per-channel—each of the three image channels will be an input to an individual superlayer, and these superlayers will have identical high-level architecture (differing only in the number of feature maps per hidden layer)—as illustrated by Fig. 1. This also allows for a simple approach to cross-connections; namely, after every downsampling (pooling) operation we allow for the feature maps to be exchanged between superlayers, after being passed through another convolutional layer (Fig. 2).

While this model in itself constitutes a committee of CNNs, it differs from most traditional ensemble applications in two key ways:
• An ensemble's constituent models typically exchange information only in the output stage, while the cross-modal framework allows for arbitrary (feedforward) information flow at any stage of the processing pipeline;
• Constituent models of an ensemble usually receive a full copy of the input each, while superlayers within a cross-modal neural network receive only a fraction of the input, allowing for a decrease in degrees-of-freedom of the model compared to an unrestricted network.

In fact, this can be taken a step further: one may consider ensembles of cross-modal CNNs, which may compound on benefits already given by X-CNNs themselves, on examples where the networks are potentially struggling to choose a proper class with sufficient confidence. As the X-CNN model can be observed as an ordinary CNN from a high level, any ensemble strategies that are found useful for CNNs should be useful for X-CNNs as well.

Lastly, it should be noted that our approach is not restricted to CNNs, but it is then easiest to scrutinise, as the trained parameters are bound to obey a certain spatial structure. In line with this, an entire section of this manuscript will be dedicated to analysing the learned convolutional kernels within an X-CNN, as well as visualising the inputs that would maximise the model's cross-connection activations.
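To make the construction concrete, the following is a minimal sketch of a two-stage X-CNN, written against the modern Keras API (the models in this paper were built on a 2016-era Keras/Theano stack); the feature-map counts and layer choices here are illustrative assumptions, not the architectures of Section III.

```python
# Minimal X-CNN sketch: one superlayer per input channel, 1x1 convolutional
# cross-connections after each pooling stage, merge and a shared FC tail.
from tensorflow.keras import layers, models

def superlayer_block(x, n_maps):
    """One Conv -> Pool stage of a single-channel superlayer."""
    x = layers.Conv2D(n_maps, (3, 3), padding='same', activation='relu')(x)
    return layers.MaxPooling2D((2, 2))(x)

# One 32x32 single-channel input per colour channel (e.g. Y, U, V).
inputs = [layers.Input(shape=(32, 32, 1)) for _ in range(3)]
streams = [superlayer_block(x, 32) for x in inputs]

# Cross-connections: every other stream's feature maps are passed through
# an additional (1x1) convolution and concatenated into this stream.
crossed = []
for i, s in enumerate(streams):
    incoming = [layers.Conv2D(16, (1, 1), activation='relu')(t)
                for j, t in enumerate(streams) if j != i]
    crossed.append(layers.Concatenate()([s] + incoming))

# Second stage, merge, and the shared fully connected tail.
streams = [superlayer_block(x, 64) for x in crossed]
merged = layers.Concatenate()([layers.Flatten()(s) for s in streams])
hidden = layers.Dense(512, activation='relu')(merged)
outputs = layers.Dense(10, activation='softmax')(hidden)
model = models.Model(inputs, outputs)
```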
III. MODEL ARCHITECTURES
For the purposes of evaluating our proposed architecture's performance, we have implemented two baseline CNN models—along with their cross-modal variants—in Keras [14] (with Theano [15] back-end). For purposes of reproducibility, in this section we will expose their architectures and hyperparameters as used for the evaluation. The cross-modal variants' feature map counts have been altered in such a way as to make the overall number of parameters as close as possible to the baseline, making for fair evaluation with respect to degrees-of-freedom.
Fig. 2. Illustration of a single cross-connection segment within an X-CNN with two superlayers. After each pooling operation, we exchange the feature maps between the superlayers, after first passing them through an additional convolutional layer. We may also perform an additional intra-superlayer convolution before merging the feature maps in each superlayer via concatenation.

For both of the models used, we represent images in the YUV colour space. As a linear transformation from RGB, it should not have an impact on the performance of the baselines, while it has the benefit of decoupling luminance from chrominance, allowing for a simpler analysis of cross-connections (and relating its learned kernels to human vision processes). We inject further domain knowledge into the model by favouring the CNN superlayer corresponding to the Y channel in terms of feature map counts (typically doubled compared to the U/V superlayers within the same hidden layer). This corresponds to the assumption that the majority of relevant information about an object is contained within its brightness channel, while colour usually represents auxiliary information.
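For concreteness, a per-channel decomposition of this kind can be obtained with the standard BT.601 RGB-to-YUV transform; the snippet below is a sketch only, as the exact YUV variant used for preprocessing is not specified in the text.

```python
import numpy as np

# BT.601 RGB -> YUV transform. It is linear, so baseline CNN performance
# should be unaffected, while luminance (Y) is decoupled from chrominance
# (U, V).
RGB2YUV = np.array([[ 0.299,  0.587,  0.114],
                    [-0.147, -0.289,  0.436],
                    [ 0.615, -0.515, -0.100]])

def rgb_to_yuv(images):
    """images: float array of shape (N, H, W, 3), RGB in [0, 1]."""
    return images @ RGB2YUV.T

# Each output plane then feeds its own superlayer, e.g.:
# y, u, v = np.split(rgb_to_yuv(batch), 3, axis=-1)
```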
A. KerasNet
Our initial model of choice represents a simple CNN with four convolutional ReLU [16] layers, followed by two fully connected layers, one of which is also ReLU. We will be referring to it as KerasNet throughout this manuscript, as it is based on the Keras CIFAR-10 CNN example [17]. It represents a likely style of "starting" model that one would attempt to apply to an image classification problem (without particular prior knowledge about it), perhaps especially bearing in mind that the training data may be sparse.

The architecture of the model, as well as its cross-modal variant (X-KerasNet), is outlined in Table I.
TABLE I
ARCHITECTURES FOR KERASNET AND X-KERASNET

Output size | KerasNet (∼2.2M param.) | X-KerasNet (∼…M param.)
32×32 | [3×3 conv, 32] ×2 | Y: [3×3 conv, …] ×2; U/V: [3×3 conv, …] ×2
16×16 | 2×2 Max-Pool, stride 2 | 2×2 Max-Pool, stride 2
      |                        | Y→Y: identity; U→U: identity; V→V: identity; Y⇒U/V: [1×1 conv, …]; U/V⇒Y: [1×1 conv, …]
      | [3×3 conv, 64] ×2 | Y: [3×3 conv, …] ×2; U/V: [3×3 conv, …] ×2
8×8   | 2×2 Max-Pool, stride 2 | 2×2 Max-Pool, stride 2
      | Fully connected, 512-D | Fully connected, 512-D
      | 10/100-way softmax | 10/100-way softmax

Both models are trained for 200 epochs using the Adam SGD optimiser, with hyperparameters as described in [18], and a batch size of 32. Dropout [19] has been applied after both of the pooling operations (with p = 0.25) as well as after the first fully connected layer (with p = 0.5).
B. FitNet4

We decided to implement FitNet4 by Romero et al. [20] as our second baseline, representing a sophisticated CNN close to the state-of-the-art on CIFAR-10/100. We opted for this model as it is prominently featured in a variety of recent neural network research ([21], [22]), and due to its design goal of being a "thin & deep" network, managing to keep its parameter count relatively low compared to many other successful models; it therefore could still be a feasible first choice for handling a sparse dataset.

The FitNet4 consists of 17 convolutional 2-way maxout [23] layers, followed by two fully connected layers, the first of which is a 5-way maxout layer. The full architecture of this model—as well as its cross-modal variant (X-FitNet4)—is presented in Table II.
TABLE II
ARCHITECTURES FOR FITNET4 AND X-FITNET4

Output size | FitNet4 (∼…M param.) | X-FitNet4 (∼…M param.)
32×32 | [3×3 conv, …] ×… | Y: [3×3 conv, …] ×…; U/V: [3×3 conv, …] ×…
16×16 | 2×2 Max-Pool, stride 2 | 2×2 Max-Pool, stride 2
      |                        | Y→Y: [1×1 conv, …]; U→U: [1×1 conv, …]; V→V: [1×1 conv, …]; Y⇒U/V: [1×1 conv, …]; U/V⇒Y: [1×1 conv, …]
      | [3×3 conv, …] ×… | Y: [3×3 conv, …] ×…; U/V: [3×3 conv, …] ×…
8×8   | 2×2 Max-Pool, stride 2 | 2×2 Max-Pool, stride 2
      |                        | Y→Y: [1×1 conv, …]; U→U: [1×1 conv, …]; V→V: [1×1 conv, …]; Y⇒U/V: [1×1 conv, …]; U/V⇒Y: [1×1 conv, …]
      | [3×3 conv, …] ×… | Y: [3×3 conv, …] ×…; U/V: [3×3 conv, …] ×…
1×1   | 8×8 (global) Max-Pool | 8×8 (global) Max-Pool
      | Fully connected, 500-D | Fully connected, 500-D
      | 10/100-way softmax | 10/100-way softmax

Both models are initialised using Xavier initialisation [24], and are then trained for 230 epochs using the Adam SGD optimiser with a batch size of 128. We have applied batch normalisation [25] to the output of each hidden layer to significantly accelerate the training procedure. L2 regularisation has been applied to all weights in the model. Finally, dropout was applied on the input, after every pooling operation, and after the fully connected maxout layer.
IV. EVALUATION

To verify our insights, we have utilised two well-known image classification benchmark datasets, CIFAR-10 and CIFAR-100 [13], for which an abundance of data is available (50000 training and 10000 testing examples). This makes it easier to study the behaviour of the considered CNNs as different fractions of the training data are discarded. We hypothesise that, at lower levels of data availability (up to a threshold), our methodology will yield significant gains over an equivalent unrestricted CNN—and also that it will remain competitive at all higher training set sizes.

The validity of our claim is investigated by performing comparative evaluation, with KerasNet and FitNet4 as baselines against X-KerasNet and X-FitNet4, respectively. In each individual test we evaluate the accuracy of these four models on the entire test set of 10000 samples, when the training routine is presented with only p% of the entire training dataset (chosen deterministically). The schedule for the tests is as follows:
• Initially, test in increments of 5%, until reaching 20% (at which time the training and testing sets have equal sizes);
• Afterwards, test in increments of 10% until either reaching 100% or the accuracies of the two models get within 0.5% of each other (corresponding to a gain of ≤ 50 images properly classified), whichever is later;
• Specially, we always test on 1% (corresponding to a highly sparse environment with only 500 training images) and 100% of the training dataset.

The images are preprocessed by applying a single batch normalisation operation on them; we have found this to yield slightly better results compared to doing global contrast normalisation and ZCA whitening (the more common approach). Finally, given that it is, depending on the task, sometimes possible to significantly enhance results in a sparse environment by way of data augmentation, we have run all of the above tests twice—with and without random translations and horizontal reflections applied to the training images—providing insight as to whether data augmentation compounds the effects of a cross-modal architecture, and to what extent.
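A minimal sketch of this protocol follows, assuming the standard Keras CIFAR-10 loader and the ImageDataGenerator augmentation API; the shift ranges are our own assumption for "random translations".

```python
import numpy as np
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Test schedule: 1% always, increments of 5% up to 20%, then 10% up to 100%.
schedule = [1] + list(range(5, 21, 5)) + list(range(30, 101, 10))

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

def subset(x, y, p):
    """Deterministically keep the first p% of the training set."""
    n = int(len(x) * p / 100)
    return x[:n], y[:n]

# Augmentation used in the second run of every test: random translations
# and horizontal reflections (shift ranges are illustrative assumptions).
augmenter = ImageDataGenerator(width_shift_range=0.1,
                               height_shift_range=0.1,
                               horizontal_flip=True)
```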
V. RESULTS AND DISCUSSION

The full evaluation results on the aforementioned tests are presented in Tables III–VI.

The results on tests without data augmentation are completely in line with the claim of Section IV; at sufficiently low training data sizes, both X-KerasNet and X-FitNet4 significantly outperform their respective baselines on the testing set, for both of the CIFAR-10/100 datasets.

For CIFAR-10, the threshold at which the baselines "catch up" (in terms of being able to manually learn the domain knowledge directly injected into their cross-modal variants) is at around p = 40%, corresponding to 20000 training examples being available. Furthermore, on CIFAR-100, such a threshold is never reached, most likely due to the extreme sparsity of per-class examples making this problem particularly suitable for the X-CNN models; the only exception is the 1% scenario for FitNet4, where the data sparsity is probably too extreme (five examples per class) for such a deep model to reach its potential.

Regardless of when the threshold is surpassed, we report that the cross-modal CNNs will generally continue to have a slight edge over the baselines—outperforming them on all of the full training dataset experiments, sometimes significantly. This naturally invites the conclusion that converting a CNN into an X-CNN (if allowed by the task) is always a reasonable step; it can yield significant benefits (the significance depending on the relation between the sparsity of the training dataset and the complexity of the baseline model), while rarely making performance significantly worse.

TABLE III
COMPARATIVE EVALUATION RESULTS ON CIFAR-10 WITHOUT DATA AUGMENTATION

Model \ p  | 1%     | 5%     | 10%    | 15%    | 20%    | 30%    | 40%    | 50%    | 60% | 70% | 80% | 90% | 100%
KerasNet   | 37.94% | 53.82% | 62.95% | 67.39% | 70.26% | 74.39% | 76.62% | 78.55% | —   | —   | —   | —   | 82.50%
X-KerasNet |        |        |        |        |        |        |        |        | —   | —   | —   | —   |
FitNet4    | 38.97% | 56.78% | 70.37% | 75.07% | 78.50% | 81.95% | 83.95% | 85.22% | —   | —   | —   | —   | 89.56%
X-FitNet4  |        |        |        |        |        |        |        |        | —   | —   | —   | —   |
TABLE IV
COMPARATIVE EVALUATION RESULTS ON CIFAR-10 WITH DATA AUGMENTATION

Model \ p  | 1%     | 5%     | 10%    | 15%    | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100%
KerasNet   | 45.45% | 67.01% | 70.89% | 78.83% |     |     |     |     | —   | —   | —   | —   |
X-KerasNet |        |        |        |        |     |     |     |     | —   | —   | —   | —   |
FitNet4    | 40.91% |        |        |        |     |     |     |     | —   | —   | —   | —   |
X-FitNet4  |        |        |        |        |     |     |     |     | —   | —   | —   | —   |
TABLE V
COMPARATIVE EVALUATION RESULTS ON CIFAR-100 WITHOUT DATA AUGMENTATION

Model \ p  | 1%    | 5%     | 10%    | 15%    | 20%    | 30%    | 40%    | 50%    | 60%    | 70%    | 80%    | 90%    | 100%
KerasNet   | 7.55% | 15.10% | 20.24% | 24.76% | 28.18% | 32.43% | 36.29% | 38.61% | 41.63% | 44.10% | 45.56% | 46.26% | 48.26%
X-KerasNet |       |        |        |        |        |        |        |        |        |        |        |        |
FitNet4    | 6.48% | 16.84% | 22.12% | 28.30% | 35.52% | 39.28% | 43.59% | 49.69% | 50.42% | 55.83% | 56.62% | 58.00% | 59.78%
X-FitNet4  |       |        |        |        |        |        |        |        |        |        |        |        |
TABLE VI
COMPARATIVE EVALUATION RESULTS ON CIFAR-100 WITH DATA AUGMENTATION

Model \ p  | 1%    | 5%     | 10%    | 15%    | 20%    | 30%    | 40%    | 50%    | 60%    | 70%    | 80%    | 90%    | 100%
KerasNet   | 9.09% | 24.68% | 32.63% | 38.64% | 42.62% | 47.64% | 49.91% | 52.46% | 53.77% | 54.26% | 55.12% | 55.42% | 55.45%
X-KerasNet |       |        |        |        |        |        |        |        |        |        |        |        |
FitNet4    | 7.25% | 17.94% | 23.55% | 29.24% | 38.76% | 48.07% | 50.06% | 56.01% | 58.55% | 59.80% | 62.38% | 63.60% | 65.59%
X-FitNet4  |       |        |        |        |        |        |        |        |        |        |        |        |

To further verify this claim, we have performed experiments on the full datasets (with and without augmentation) where we monitored how the testing accuracy evolves as a function of the training epoch. The resulting plots are summarised in Figure 3; it is clear that the X-CNNs are at least as powerful as their baselines, even when the full training sets are available. Furthermore, it is possible to detect a narrow edge for the cross-modal models in the CIFAR-10 experiments, and a significant edge in the CIFAR-100 experiments. The concluding remark is that even when the dataset under investigation is not very sparse, attempting to utilise a cross-modal variant of the considered models (if applicable) is a reasonable action, as it might yield noticeable returns in predictive power.

The analysis of the interplay between data augmentation and cross-modal networks on CIFAR-100 remains straightforward—the X-CNN models remain consistently and significantly ahead of their baselines throughout the entire spectrum of training set sizes. On CIFAR-10, however, this is slightly more complicated; while the catch-up threshold gets expectedly decreased (to around p = 20%), the behaviour of X-CNNs for smaller training set sizes does not always significantly compound the benefits of data augmentation. Specifically, at 5% of the training set the FitNet4 model manages to outperform X-FitNet4 (the roles do get reversed starting from 10%, however). As a possible cause of this phenomenon, we note that, at this data availability level, both of the FitNet4 models are significantly inferior in performance to the KerasNet models, for which there is a significant benefit to the usage of X-CNNs. The takeaway lesson here is that, while the cross-modal architecture need not always compound nicely with data augmentation, an occurrence of such an event could signify that the baseline was not particularly suitable for properly accommodating data augmentation at this training set size in the first place. If this happens, one should attempt to use a more suitable/shallower CNN—the X-CNN variant should then produce the desired benefits.

Finally, we have taken advantage of some of the smaller training set sizes to perform statistical significance tests, typically scarce in the deep learning literature. For training set sizes up to 15%, we trained the models five times (from different initial conditions) and then performed t-tests, choosing p < 0.05 as our significance threshold. Our findings show that, under these assumptions, the best-performing X-CNN model's performance advantages are statistically significant in all scenarios, aside from the data-augmented CIFAR-10.
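The significance testing can be reproduced along the following lines; the accuracy values in the sketch are illustrative stand-ins, not numbers from our tables.

```python
from scipy import stats

# Five test accuracies per model, from five trainings with different
# initial conditions (illustrative values only).
baseline_acc = [0.672, 0.668, 0.675, 0.670, 0.669]
xcnn_acc     = [0.691, 0.688, 0.694, 0.690, 0.687]

# Two-sample t-test; the X-CNN advantage is deemed significant at p < 0.05.
t_stat, p_value = stats.ttest_ind(xcnn_acc, baseline_acc)
print('significant:', p_value < 0.05)
```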
Fig. 3. Plots of the test accuracy of the four CNN models under consideration as a function of the number of training epochs, with 100% of the training set available. The experiments have been carried out on both CIFAR-10 and CIFAR-100, with and without data augmentation. The cross-modal CNNs are consistently competitive with their respective baselines across all four experiments, with a significant edge present for CIFAR-100.

VI. CROSS-CONNECTION ANALYSIS
A key element of the X-CNN architecture are the cross-connection layers, as they enable information flow between individual channels. It will therefore be of interest to understand and visualise what is the mode of operation for these layers. All of the visualisations in this section correspond to the learned weights after fully training on 100% of CIFAR-10 with data augmentation.

We will first demonstrate that the cross-connections inserted in the considered models, though being 1×1 convolutions, learn more complex functions than simple feature map passing. First, we note that the weights of a 1×1 convolutional layer may be represented as a 2D table that maps input channels to output channels (akin to an adjacency matrix, where columns are the input channels and rows are the output channels).

Fig. 4. Weight visualisation of the first-level cross-connection layer for the X-FitNet4 CNN. The columns correspond to input channels, while rows correspond to output channels. Green colour indicates a positive-weight connection between an input channel and an output channel, while blue colour indicates a negative-weight connection. The colour intensities are proportional to the absolute weight values. Top: Y ⇒ U/V (36 input channels, 12 output channels). Bottom, left-to-right: U ⇒ Y and V ⇒ Y (18 input channels, 12 output channels).
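Extracting such a table from a trained model is straightforward, since a 1×1 convolution kernel is simply a channel-mixing matrix once its singleton spatial axes are dropped. In the sketch below, `model` and the layer name `cross_y_to_uv` are hypothetical placeholders, and the colour map is ours rather than the green/blue scheme of Figure 4.

```python
import numpy as np
import matplotlib.pyplot as plt

# A Keras Conv2D kernel has shape (k_h, k_w, c_in, c_out); for a 1x1
# convolution this squeezes to a (c_in, c_out) channel-mixing matrix.
# `model` and the layer name are hypothetical placeholders.
kernel = model.get_layer('cross_y_to_uv').get_weights()[0]
table = np.squeeze(kernel).T        # rows: output channels, cols: input

# Diverging colour map, symmetric around zero, intensity ~ |weight|.
bound = np.abs(table).max()
plt.imshow(table, cmap='coolwarm', vmin=-bound, vmax=bound)
plt.xlabel('input channel')
plt.ylabel('output channel')
plt.show()
```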
Rather than displaying the raw table values, we decided to visualise the weights in a heatmap style; Figure 4 showcases this visualisation for the first cross-connection layer of X-FitNet4. Green colours indicate that an input channel has a positive connection weight to the respective output channel, while blue colours indicate negative weights. The colour intensities are proportional to the absolute weight values.

It can be seen that each output channel of the cross-connection layer is obtained through a nontrivial weighted combination of input channels. We hypothesise that the cross-connection layers selectively filter and combine input features that are more utilisable in another processing stream.

To delve deeper into what kinds of features the cross-connection layers are filtering, combining and passing, we applied the layer-wise feature-map activation technique proposed by Simonyan et al. [26]. This technique performs gradient ascent on a white-noise input image to maximise the activations of a specific channel of feature maps at any of the layers within a pre-trained model. The objective function for the gradient ascent is defined as

I′ = argmax_I Σ(I) − λ‖I‖₂²    (1)

where I is the input image, Σ(I) is the activation of the considered neuron when provided with I as input, and λ is a regularisation factor. After iterating for a number of gradient ascent steps, the original white-noise image will be modified into patterns that approximate the detection function of a specific neuron.
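Equation (1) translates into a short gradient-ascent loop; the sketch below assumes a single-input Keras `model` and a hypothetical layer name, and the step size, iteration count and λ are chosen arbitrarily.

```python
import tensorflow as tf

# Build a sub-model exposing the feature maps of the layer of interest.
# `model` and the layer name are hypothetical placeholders.
feature_model = tf.keras.Model(model.inputs,
                               model.get_layer('cross_y_to_uv').output)

image = tf.Variable(tf.random.normal((1, 32, 32, 1)))  # white-noise start
lam, step, channel = 0.01, 1.0, 3

for _ in range(200):
    with tf.GradientTape() as tape:
        activation = tf.reduce_mean(feature_model(image)[..., channel])
        objective = activation - lam * tf.reduce_sum(image ** 2)  # Eq. (1)
    image.assign_add(step * tape.gradient(objective, image))      # ascent
```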
Lower-level convolutional layers are well known to learn filters approximating Gabor wavelet filters that act as edge detectors, corner detectors, etc.; we can confirm that in our experiments this has indeed been the case. For the first cross-connection layer of X-FitNet4, we have visualised a selection of channel activations in Figure 5. This visualisation indicates that the cross-connection layer is indeed passing combined lower-level features, such as the addition of horizontal and vertical stripes in the upper right image in the figure. We observe further that the pattern frequency for the Y channel's cross-connection layer is higher than the one for the U and V layers. This observation reflects the fact that the human vision system is able to detect higher-frequency variations in intensity than in chrominance. This is a solid indicator that the X-CNN architecture, when faced with an image classification task in the YUV colour scheme, is actually attempting to mimic human vision.

Fig. 5. Artificially generated images (from white noise) that cause strong activations of specific channels in the first cross-connection layer of the X-FitNet4 model. Top: three channels from the Y ⇒ U/V cross-connections. Middle: three channels from the U ⇒ Y cross-connections. Bottom: three channels from the V ⇒ Y cross-connections.

Our final analysis focusses on the X-KerasNet model, where we transformed feature maps of arbitrary depths into RGB images by a colour-mapping scheme. Figure 6 shows the feature maps of the inputs and outputs of the Y ⇒ U/V cross-connections for representative images of the truck and airplane classes. These were easier to comparatively analyse on the X-KerasNet, as its cross-connection layers do not alter the number of feature maps, and therefore the same colour-mapping scheme remained meaningful for both. We observe that the cross-connection output maps have the background and some features emphasised, while other features are de-emphasised—which further indicates that the cross-connection layers are performing more complex inter-superlayer feature integration than simply passing feature maps between superlayers.
VII. CONCLUSION
We have introduced cross-modal convolutional neural networks (X-CNNs), a novel architecture that decouples convolutional processing of (typically image-based) input partitions, while allowing for periodical information flow between the processing pipelines, in order to achieve performance improvements in sparse data environments. We have applied this methodology on the popular CIFAR-10/100 image classification datasets for two baseline models, managing to significantly outperform them in low-data environments, while remaining competitive in high-data environments—outperforming them on all of the full-dataset experiments. Aside from reinforcing the claim that the X-CNN architecture can only be beneficial to a baseline model (depending on the levels of training data sparsity, potentially highly significantly), we have further verified that the introduced cross-connection layers perform rather complex functions (thus they are not limited to simple feature map passing) and are capable of mimicking human vision processes—confirming that the biological inspiration behind such a model is justified.

Fig. 6. Visualisation of input and output feature maps of the Y ⇒ U/V cross-connection layer of X-KerasNet. Left: input images (truck/airplane). Middle: input feature maps to the cross-connection layer. Right: output feature maps of the cross-connection layer.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[2] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[5] M. Kivelä, A. Arenas, M. Barthelemy, J. P. Gleeson, Y. Moreno, and M. A. Porter, "Multilayer networks," Journal of Complex Networks, vol. 2, no. 3, pp. 203–271, 2014.
[6] M. De Domenico, C. Granell, M. A. Porter, and A. Arenas, "The physics of multilayer networks," arXiv preprint arXiv:1604.02021, 2016.
[7] E. Estrada and J. Gómez-Gardeñes, "Communicability reveals a transition to coordinated behavior in multiplex networks," Physical Review E, vol. 89, no. 4, p. 042819, 2014.
[8] C. Granell, S. Gómez, and A. Arenas, "Competing spreading processes on multiplex networks: awareness and epidemics," Physical Review E, vol. 90, no. 1, p. 012808, 2014.
[9] P. Veličković and P. Liò, "Molecular multiplex network inference using Gaussian mixture hidden Markov models," Journal of Complex Networks, 2015. [Online]. Available: http://comnet.oxfordjournals.org/content/early/2015/12/25/comnet.cnv029.abstract
[10] M. A. Eckert, N. V. Kamdar, C. E. Chang, C. F. Beckmann, M. D. Greicius, and V. Menon, "A cross-modal system linking primary auditory and visual cortices: Evidence from intrinsic fMRI connectivity analysis," Human Brain Mapping, vol. 29, no. 7, pp. 848–857, 2008.
[11] A. L. Beer, T. Plank, and M. W. Greenlee, "Diffusion tensor imaging shows white matter tracts between human auditory and visual cortex," Experimental Brain Research, vol. 213, no. 2, pp. 299–308, 2011. [Online]. Available: http://dx.doi.org/10.1007/s00221-011-2715-y
[12] W. Yang, J. Yang, Y. Gao, X. Tang, Y. Ren, S. Takahashi, and J. Wu, "Effects of sound frequency on audiovisual integration: An event-related potential study," PLoS ONE, vol. 10, no. 9, pp. 1–15, 09 2015. [Online]. Available: http://dx.doi.org/10.1371%2Fjournal.pone.0138296
[13] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," 2009.
[14] F. Chollet, "Keras," https://github.com/fchollet/keras, 2015.
[15] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016. [Online]. Available: http://arxiv.org/abs/1605.02688
[16] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[17] F. Chollet, "Keras CNN example for CIFAR-10," github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py, 2015.
[18] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[19] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[20] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "FitNets: Hints for thin deep nets," arXiv preprint arXiv:1412.6550, 2014.
[21] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training very deep networks," in Advances in Neural Information Processing Systems, 2015, pp. 2377–2385.
[22] D. Mishkin and J. Matas, "All you need is a good init," arXiv preprint arXiv:1511.06422, 2015.
[23] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," in Proceedings of The 30th International Conference on Machine Learning, 2013, pp. 1319–1327.
[24] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[25] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[26] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," arXiv preprint arXiv:1312.6034, 2013.