X-CNN: Cross-modal Convolutional Neural Networks for Sparse Datasets
Petar Veličković ∗‡, Duo Wang ∗, Nicholas D. Lane †‡ and Pietro Liò ∗
∗ Computer Laboratory, University of Cambridge, Cambridge CB3 0FD, UK
† Department of Computer Science, University College London, London WC1E 6BT, UK
‡ Nokia Bell Labs, Cambridge CB3 0FA, UK
Email: {pv273, wd263, pl219}@cam.ac.uk, [email protected]

Abstract—In this paper we propose cross-modal convolutional neural networks (X-CNNs), a novel biologically inspired type of CNN architectures, treating gradient descent-specialised CNNs as individual units of processing in a larger-scale network topology, while allowing for unconstrained information flow and/or weight sharing between analogous hidden layers of the network—thus generalising the already well-established concept of neural network ensembles (where information typically may flow only between the output layers of the individual networks). The constituent networks are individually designed to learn the output function on their own subset of the input data, after which cross-connections between them are introduced after each pooling operation to periodically allow for information exchange between them. This injection of knowledge into a model (by prior partition of the input data through domain knowledge or unsupervised methods) is expected to yield greatest returns in sparse data environments, which are typically less suitable for training CNNs. For evaluation purposes, we have compared a standard four-layer CNN as well as a sophisticated FitNet4 architecture against their cross-modal variants on the CIFAR-10 and CIFAR-100 datasets with differing percentages of the training data being removed, and find that at lower levels of data availability, the X-CNNs significantly outperform their baselines (typically providing a 2–6% benefit, depending on the dataset size and whether data augmentation is used), while still maintaining an edge on all of the full dataset tests.
I. INTRODUCTION
In recent years, the number of success stories of machine learning has seen an all-time rise across a wide range of fields and tasks, examples including: computer vision [1], speech recognition [2], reinforcement learning [3] and guiding Monte Carlo tree search [4]. The unifying idea behind all of the above is deep learning, the utilisation of neural networks with many hidden layers, for the purposes of learning complex feature representations from raw data, rather than relying on hand-crafted feature extraction.

As the networks become deeper, however, they become more and more reliant on the amount of training examples provided for maximising their performance. While we are now able to extract large quantities of labelled information for many problems of interest, there remains a significant proportion of tasks for which "big data" simply isn't available at this time, which makes it extremely difficult to fully exploit a deep CNN architecture and properly learn generalisable features of the data.

Here we will present an architectural methodology that attempts to extract additional predictive power from a convolutional neural network (CNN) in such circumstances by instead focussing on the width of the data, i.e. the heterogeneity of information present within each training example. The key idea constitutes appropriate partitioning of this information and training smaller CNNs on these partitions (allowing them to train faster and more effectively under sparse data environments), while allowing for information exchange between them at various stages (Fig. 1).

A classic example where such an approach is bound to be useful are clinical studies, where there typically may not be that many patients, but for each patient there is potentially a heterogeneous wealth of information, such as various test results, patient history, ethnic background, body scans (CT, MRI...) and so on, depending on the type of study.
II. CROSS-MODAL CNNS
Our methodology is inspired by multilayer networks [5], mathematical structures encompassing several layers of graphs over the same set of nodes, allowing for unrestricted intra-layer as well as inter-layer connections. They have been a demonstrably valuable tool for modelling a variety of natural and social systems ([6], [7], [8]), and their applicability to machine learning (within the context of hidden Markov models) was already demonstrated by some of the authors [9], managing to achieve high performance on a sparse breast cancer classification dataset involving gene expression and methylation data.

The network design process is initiated by appropriately partitioning the input data—this may be done either manually (by exploiting existing domain knowledge) or through an unsupervised pre-training step, which will determine which (not necessarily disjoint) fragments of the input data are more likely to constructively influence one another. Afterwards, a cross-modal CNN is constructed such that a separate CNN superlayer is dedicated to each partition of the input data, attempting to learn the target function from its partition only. The purpose of the partitioning is to help the constituent CNNs become powerful predictors while requiring a smaller dimensionality of the input data, by allowing them access to those parts of the input which are most significantly related to each other in the context of the predictions that need to be made.
Fig. 1. Diagram of a simple cross-modal CNN for image classification, generated from a baseline CNN of the form [Conv → Pool] ×2 → FC → Softmax. Each of the three channels (RGB/YUV) of the input image receives its own CNN superlayer, with cross-connections inserted after the pooling operation, and full weight sharing in the fully connected layers. A more in-depth view of a potential cross-connection layout is provided by Figure 2.

Finally, the superlayers may be interconnected by any sort of (feedforward) cross-connection as is best seen fit, and they may be combined in arbitrary ways at the output stage to produce the final output. Similarly, at any stage the weights of the superlayers may be shared—the simplest case, which we will explore in our analysis, constitutes complete weight sharing of the fully connected layers at the tail of the networks. This construction is biologically inspired by cross-modal systems [10] within the visual and auditory systems of the human brain (which in turn inspired the development of CNNs)—wherein several cross-connections between various sensory networks have been discovered [11], [12].

To quantify the gains of this approach, our evaluation focusses on an already well-understood problem of coloured image classification, on the established CIFAR-10/100 [13] benchmarks for which an abundance of data is available, so it is easier to investigate the effects of restricting the size of the training set on various CNN models. The partitioning of the input that we consider is per-channel—each of the three image channels will be an input to an individual superlayer, and these superlayers will have identical high-level architecture (differing only in the number of feature maps per hidden layer)—as illustrated by Fig. 1. This also allows for a simple approach to cross-connections; namely, after every downsampling (pooling) operation we allow for the feature maps to be exchanged between superlayers, after being passed through another convolutional layer (Fig. 2).

While this model in itself constitutes a committee of CNNs, it differs from most traditional ensemble applications in two key ways:
• An ensemble's constituent models typically exchange information only in the output stage, while the cross-modal framework allows for arbitrary (feedforward) information flow at any stage of the processing pipeline;
• Constituent models of an ensemble usually receive a full copy of the input each, while superlayers within a cross-modal neural network receive only a fraction of the input, allowing for a decrease in degrees-of-freedom of the model compared to an unrestricted network.

In fact, this can be taken a step further: one may consider ensembles of cross-modal CNNs, which may compound on benefits already given by X-CNNs themselves, on examples where the networks are potentially struggling to choose a proper class with sufficient confidence. As the X-CNN model can be observed as an ordinary CNN from a high level, any ensemble strategies that are found useful for CNNs should be useful for X-CNNs as well.

Lastly, it should be noted that our approach is not restricted to CNNs, but it is then easiest to scrutinise, as the trained parameters are bound to obey a certain spatial structure. In line with this, an entire section of this manuscript will be dedicated to analysing the learned convolutional kernels within an X-CNN, as well as visualising the inputs that would maximise the model's cross-connection activations.
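To make the construction concrete, the following is a minimal sketch of a two-stage X-CNN, written against the modern Keras API (the models in this paper were built on a 2016-era Keras/Theano stack); the feature-map counts and layer choices here are illustrative assumptions, not the architectures of Section III.

```python
# Minimal X-CNN sketch: one superlayer per input channel, 1x1 convolutional
# cross-connections after each pooling stage, merge and a shared FC tail.
from tensorflow.keras import layers, models

def superlayer_block(x, n_maps):
    """One Conv -> Pool stage of a single-channel superlayer."""
    x = layers.Conv2D(n_maps, (3, 3), padding='same', activation='relu')(x)
    return layers.MaxPooling2D((2, 2))(x)

# One 32x32 single-channel input per colour channel (e.g. Y, U, V).
inputs = [layers.Input(shape=(32, 32, 1)) for _ in range(3)]
streams = [superlayer_block(x, 32) for x in inputs]

# Cross-connections: every other stream's feature maps are passed through
# an additional (1x1) convolution and concatenated into this stream.
crossed = []
for i, s in enumerate(streams):
    incoming = [layers.Conv2D(16, (1, 1), activation='relu')(t)
                for j, t in enumerate(streams) if j != i]
    crossed.append(layers.Concatenate()([s] + incoming))

# Second stage, merge, and the shared fully connected tail.
streams = [superlayer_block(x, 64) for x in crossed]
merged = layers.Concatenate()([layers.Flatten()(s) for s in streams])
hidden = layers.Dense(512, activation='relu')(merged)
outputs = layers.Dense(10, activation='softmax')(hidden)
model = models.Model(inputs, outputs)
```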
III. MODEL ARCHITECTURES
For the purposes of evaluating our proposed architecture's performance, we have implemented two baseline CNN models—along with their cross-modal variants—in Keras [14] (with Theano [15] back-end). For purposes of reproducibility, in this section we will expose their architectures and hyperparameters as used for the evaluation. The cross-modal variants' feature map counts have been altered in such a way as to make the overall number of parameters as close as possible to the baseline, making for fair evaluation with respect to degrees-of-freedom.
Fig. 2. Illustration of a single cross-connection segment within an X-CNN with two superlayers. After each pooling operation, we exchange the feature maps between the superlayers, after first passing them through an additional convolutional layer. We may also perform an additional intra-superlayer convolution before merging the feature maps in each superlayer via concatenation.

For both of the models used, we represent images in the YUV colour space. As a linear transformation from RGB, it should not have an impact on the performance of the baselines, while it has the benefit of decoupling luminance from chrominance, allowing for a simpler analysis of cross-connections (and relating its learned kernels to human vision processes). We inject further domain knowledge into the model by favouring the CNN superlayer corresponding to the Y channel in terms of feature map counts (typically doubled compared to the U/V superlayers within the same hidden layer). This corresponds to the assumption that the majority of relevant information about an object is contained within its brightness channel, while colour usually represents auxiliary information.
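For concreteness, a per-channel decomposition of this kind can be obtained with the standard BT.601 RGB-to-YUV transform; the snippet below is a sketch only, as the exact YUV variant used for preprocessing is not specified in the text.

```python
import numpy as np

# BT.601 RGB -> YUV transform. It is linear, so baseline CNN performance
# should be unaffected, while luminance (Y) is decoupled from chrominance
# (U, V).
RGB2YUV = np.array([[ 0.299,  0.587,  0.114],
                    [-0.147, -0.289,  0.436],
                    [ 0.615, -0.515, -0.100]])

def rgb_to_yuv(images):
    """images: float array of shape (N, H, W, 3), RGB in [0, 1]."""
    return images @ RGB2YUV.T

# Each output plane then feeds its own superlayer, e.g.:
# y, u, v = np.split(rgb_to_yuv(batch), 3, axis=-1)
```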
A. KerasNet
Our initial model of choice represents a simple CNN with four convolutional ReLU [16] layers, followed by two fully connected layers, one of which is also ReLU. We will be referring to it as KerasNet throughout this manuscript, as it is based on the Keras CIFAR-10 CNN example [17]. It represents a likely style of "starting" model that one would attempt to apply to an image classification problem (without particular prior knowledge about it), perhaps especially bearing in mind that the training data may be sparse.

The architecture of the model, as well as its cross-modal variant (X-KerasNet), is outlined in Table I.
TABLE I
ARCHITECTURES FOR KERASNET AND X-KERASNET

Output size | KerasNet (∼2.2M param.) | X-KerasNet (∼…M param.)
32×32 | [3×3 conv, 32] ×2 | Y: [3×3 conv, …] ×2; U/V: [3×3 conv, …] ×2
16×16 | 2×2 Max-Pool, stride 2 | 2×2 Max-Pool, stride 2
      |                        | Y→Y: identity; U→U: identity; V→V: identity; Y⇒U/V: [1×1 conv, …]; U/V⇒Y: [1×1 conv, …]
      | [3×3 conv, 64] ×2 | Y: [3×3 conv, …] ×2; U/V: [3×3 conv, …] ×2
8×8   | 2×2 Max-Pool, stride 2 | 2×2 Max-Pool, stride 2
      | Fully connected, 512-D | Fully connected, 512-D
      | 10/100-way softmax | 10/100-way softmax

Both models are trained for 200 epochs using the Adam SGD optimiser, with hyperparameters as described in [18], and a batch size of 32. Dropout [19] has been applied after both of the pooling operations (with p = 0.25) as well as after the first fully connected layer (with p = 0.5).
B. FitNet4

We decided to implement FitNet4 by Romero et al. [20] as our second baseline, representing a sophisticated CNN close to the state-of-the-art on CIFAR-10/100. We opted for this model as it is prominently featured in a variety of recent neural network research ([21], [22]), and due to its design goal of being a "thin & deep" network, managing to keep its parameter count relatively low compared to many other successful models; it therefore could still be a feasible first choice for handling a sparse dataset.

The FitNet4 consists of 17 convolutional 2-way maxout [23] layers, followed by two fully connected layers, the first of which is a 5-way maxout layer. The full architecture of this model—as well as its cross-modal variant (X-FitNet4)—is presented in Table II.
TABLE II
ARCHITECTURES FOR FITNET4 AND X-FITNET4

Output size | FitNet4 (∼…M param.) | X-FitNet4 (∼…M param.)
32×32 | [3×3 conv, …] ×… | Y: [3×3 conv, …] ×…; U/V: [3×3 conv, …] ×…
16×16 | 2×2 Max-Pool, stride 2 | 2×2 Max-Pool, stride 2
      |                        | Y→Y: [1×1 conv, …]; U→U: [1×1 conv, …]; V→V: [1×1 conv, …]; Y⇒U/V: [1×1 conv, …]; U/V⇒Y: [1×1 conv, …]
      | [3×3 conv, …] ×… | Y: [3×3 conv, …] ×…; U/V: [3×3 conv, …] ×…
8×8   | 2×2 Max-Pool, stride 2 | 2×2 Max-Pool, stride 2
      |                        | Y→Y: [1×1 conv, …]; U→U: [1×1 conv, …]; V→V: [1×1 conv, …]; Y⇒U/V: [1×1 conv, …]; U/V⇒Y: [1×1 conv, …]
      | [3×3 conv, …] ×… | Y: [3×3 conv, …] ×…; U/V: [3×3 conv, …] ×…
1×1   | 8×8 (global) Max-Pool | 8×8 (global) Max-Pool
      | Fully connected, 500-D | Fully connected, 500-D
      | 10/100-way softmax | 10/100-way softmax

Both models are initialised using Xavier initialisation [24], and are then trained for 230 epochs using the Adam SGD optimiser with a batch size of 128. We have applied batch normalisation [25] to the output of each hidden layer to significantly accelerate the training procedure. L2 regularisation has been applied to all weights in the model. Finally, dropout was applied on the input, after every pooling operation, and after the fully connected maxout layer.
IV. EVALUATION

To verify our insights, we have utilised two well-known image classification benchmark datasets, CIFAR-10 and CIFAR-100 [13], for which an abundance of data is available (50000 training and 10000 testing examples). This makes it easier to study the behaviour of the considered CNNs as different fractions of the training data are discarded. We hypothesise that, at lower levels of data availability (up to a threshold), our methodology will yield significant gains over an equivalent unrestricted CNN—and also that it will remain competitive at all higher training set sizes.

The validity of our claim is investigated by performing comparative evaluation, with KerasNet and FitNet4 as baselines against X-KerasNet and X-FitNet4, respectively. In each individual test we evaluate the accuracy of these four models on the entire test set of 10000 samples, when the training routine is presented with only p% of the entire training dataset (chosen deterministically). The schedule for the tests is as follows:
• Initially, test in increments of 5%, until reaching 20% (at which time the training and testing sets have equal sizes);
• Afterwards, test in increments of 10% until either reaching 100% or the accuracies of the two models get within 0.5% of each other (corresponding to a gain of ≤ 50 images properly classified), whichever is later;
• Specially, we always test on 1% (corresponding to a highly sparse environment with only 500 training images) and 100% of the training dataset.

The images are preprocessed by applying a single batch normalisation operation on them; we have found this to yield slightly better results compared to doing global contrast normalisation and ZCA whitening (the more common approach). Finally, given that it is, depending on the task, sometimes possible to significantly enhance results in a sparse environment by way of data augmentation, we have run all of the above tests twice—with and without random translations and horizontal reflections applied to the training images—providing insight as to whether data augmentation compounds the effects of a cross-modal architecture, and to what extent.
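A minimal sketch of this protocol follows, assuming the standard Keras CIFAR-10 loader and the ImageDataGenerator augmentation API; the shift ranges are our own assumption for "random translations".

```python
import numpy as np
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Test schedule: 1% always, increments of 5% up to 20%, then 10% up to 100%.
schedule = [1] + list(range(5, 21, 5)) + list(range(30, 101, 10))

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

def subset(x, y, p):
    """Deterministically keep the first p% of the training set."""
    n = int(len(x) * p / 100)
    return x[:n], y[:n]

# Augmentation used in the second run of every test: random translations
# and horizontal reflections (shift ranges are illustrative assumptions).
augmenter = ImageDataGenerator(width_shift_range=0.1,
                               height_shift_range=0.1,
                               horizontal_flip=True)
```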
V. RESULTS AND DISCUSSION

The full evaluation results on the aforementioned tests are presented in Tables III–VI.

The results on tests without data augmentation are completely in line with the claim of Section IV; at sufficiently low training data sizes, both X-KerasNet and X-FitNet4 significantly outperform their respective baselines on the testing set, for both of the CIFAR-10/100 datasets.

For CIFAR-10, the threshold at which the baselines "catch up" (in terms of being able to manually learn the domain knowledge directly injected into their cross-modal variants) is at around p = 40%, corresponding to 20000 training examples being available. Furthermore, on CIFAR-100, such a threshold is never reached, most likely due to the extreme sparsity of per-class examples making this problem particularly suitable for the X-CNN models; the only exception is the 1% scenario for FitNet4, where the data sparsity is probably too extreme (five examples per class) for such a deep model to reach its potential.

Regardless of when the threshold is surpassed, we report that the cross-modal CNNs will generally continue to have a slight edge over the baselines—outperforming them on all of the full training dataset experiments, sometimes significantly. This naturally invites the conclusion that converting a CNN into an X-CNN (if allowed by the task) is always a reasonable step; it can yield significant benefits (the significance depending on the relation between the sparsity of the training dataset and the complexity of the baseline model), while rarely making performance significantly worse.

TABLE III
COMPARATIVE EVALUATION RESULTS ON CIFAR-10 WITHOUT DATA AUGMENTATION

Model \ p  | 1%     | 5%     | 10%    | 15%    | 20%    | 30%    | 40%    | 50%    | 60% | 70% | 80% | 90% | 100%
KerasNet   | 37.94% | 53.82% | 62.95% | 67.39% | 70.26% | 74.39% | 76.62% | 78.55% | —   | —   | —   | —   | 82.50%
X-KerasNet |        |        |        |        |        |        |        |        | —   | —   | —   | —   |
FitNet4    | 38.97% | 56.78% | 70.37% | 75.07% | 78.50% | 81.95% | 83.95% | 85.22% | —   | —   | —   | —   | 89.56%
X-FitNet4  |        |        |        |        |        |        |        |        | —   | —   | —   | —   |
TABLE IV
COMPARATIVE EVALUATION RESULTS ON CIFAR-10 WITH DATA AUGMENTATION

Model \ p  | 1%     | 5%     | 10%    | 15%    | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100%
KerasNet   | 45.45% | 67.01% | 70.89% | 78.83% |     |     |     |     | —   | —   | —   | —   |
X-KerasNet |        |        |        |        |     |     |     |     | —   | —   | —   | —   |
FitNet4    | 40.91% |        |        |        |     |     |     |     | —   | —   | —   | —   |
X-FitNet4  |        |        |        |        |     |     |     |     | —   | —   | —   | —   |
TABLE V
COMPARATIVE EVALUATION RESULTS ON CIFAR-100 WITHOUT DATA AUGMENTATION

Model \ p  | 1%    | 5%     | 10%    | 15%    | 20%    | 30%    | 40%    | 50%    | 60%    | 70%    | 80%    | 90%    | 100%
KerasNet   | 7.55% | 15.10% | 20.24% | 24.76% | 28.18% | 32.43% | 36.29% | 38.61% | 41.63% | 44.10% | 45.56% | 46.26% | 48.26%
X-KerasNet |       |        |        |        |        |        |        |        |        |        |        |        |
FitNet4    | 6.48% | 16.84% | 22.12% | 28.30% | 35.52% | 39.28% | 43.59% | 49.69% | 50.42% | 55.83% | 56.62% | 58.00% | 59.78%
X-FitNet4  |       |        |        |        |        |        |        |        |        |        |        |        |
TABLE VI
COMPARATIVE EVALUATION RESULTS ON CIFAR-100 WITH DATA AUGMENTATION

Model \ p  | 1%    | 5%     | 10%    | 15%    | 20%    | 30%    | 40%    | 50%    | 60%    | 70%    | 80%    | 90%    | 100%
KerasNet   | 9.09% | 24.68% | 32.63% | 38.64% | 42.62% | 47.64% | 49.91% | 52.46% | 53.77% | 54.26% | 55.12% | 55.42% | 55.45%
X-KerasNet |       |        |        |        |        |        |        |        |        |        |        |        |
FitNet4    | 7.25% | 17.94% | 23.55% | 29.24% | 38.76% | 48.07% | 50.06% | 56.01% | 58.55% | 59.80% | 62.38% | 63.60% | 65.59%
X-FitNet4  |       |        |        |        |        |        |        |        |        |        |        |        |

To further verify this claim, we have performed experiments on the full datasets (with and without augmentation) where we monitored how the testing accuracy evolves as a function of the training epoch. The resulting plots are summarised in Figure 3; it is clear that the X-CNNs are at least as powerful as their baselines, even when the full training sets are available. Furthermore, it is possible to detect a narrow edge for the cross-modal models in the CIFAR-10 experiments, and a significant edge in the CIFAR-100 experiments. The concluding remark is that even when the dataset under investigation is not very sparse, attempting to utilise a cross-modal variant of the considered models (if applicable) is a reasonable action, as it might yield noticeable returns in predictive power.

The analysis of the interplay between data augmentation and cross-modal networks on CIFAR-100 remains straightforward—the X-CNN models remain consistently and significantly ahead of their baselines throughout the entire spectrum of training set sizes. On CIFAR-10, however, this is slightly more complicated; while the catch-up threshold gets expectedly decreased (to around p = 20%), the behaviour of X-CNNs for smaller training set sizes does not always significantly compound the benefits of data augmentation. Specifically, at 5% of the training set the FitNet4 model manages to outperform X-FitNet4 (the roles do get reversed starting from 10%, however). As a possible cause of this phenomenon, we note that, at this data availability level, both of the FitNet4 models are significantly inferior in performance to the KerasNet models, for which there is a significant benefit to the usage of X-CNNs. The takeaway lesson here is that, while the cross-modal architecture need not always compound nicely with data augmentation, an occurrence of such an event could signify that the baseline was not particularly suitable for properly accommodating data augmentation at this training set size in the first place. If this happens, one should attempt to use a more suitable/shallower CNN—the X-CNN variant should then produce the desired benefits.

Finally, we have taken advantage of some of the smaller training set sizes to perform statistical significance tests, typically scarce in the deep learning literature. For training set sizes up to 15%, we trained the models five times (from different initial conditions) and then performed t-tests, choosing p < 0.05 as our significance threshold. Our findings show that, under these assumptions, the best-performing X-CNN model's performance advantages are statistically significant in all scenarios, aside from the data-augmented CIFAR-10.
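The significance testing can be reproduced along the following lines; the accuracy values in the sketch are illustrative stand-ins, not numbers from our tables.

```python
from scipy import stats

# Five test accuracies per model, from five trainings with different
# initial conditions (illustrative values only).
baseline_acc = [0.672, 0.668, 0.675, 0.670, 0.669]
xcnn_acc     = [0.691, 0.688, 0.694, 0.690, 0.687]

# Two-sample t-test; the X-CNN advantage is deemed significant at p < 0.05.
t_stat, p_value = stats.ttest_ind(xcnn_acc, baseline_acc)
print('significant:', p_value < 0.05)
```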
Fig. 3. Plots of the test accuracy of the four CNN models under consideration as a function of the number of training epochs, with 100% of the training set available. The experiments have been carried out on both CIFAR-10 and CIFAR-100, with and without data augmentation. The cross-modal CNNs are consistently competitive with their respective baselines across all four experiments, with a significant edge present for CIFAR-100.

VI. CROSS-CONNECTION ANALYSIS
A key element of the X-CNN architecture are the cross-connection layers, as they enable information flow between individual channels. It will therefore be of interest to understand and visualise what is the mode of operation for these layers. All of the visualisations in this section correspond to the learned weights after fully training on 100% of CIFAR-10 with data augmentation.

We will first demonstrate that the cross-connections inserted in the considered models, though being 1×1 convolutions, learn more complex functions than simple feature map passing. First, we note that the weights of a 1×1 convolutional layer may be represented as a 2D table that maps input channels to output channels (akin to an adjacency matrix, where columns are the input channels and rows are the output channels).

Fig. 4. Weight visualisation of the first-level cross-connection layer for the X-FitNet4 CNN. The columns correspond to input channels, while rows correspond to output channels. Green colour indicates a positive-weight connection between an input channel and an output channel, while blue colour indicates a negative-weight connection. The colour intensities are proportional to the absolute weight values. Top: Y ⇒ U/V (36 input channels, 12 output channels). Bottom, left-to-right: U ⇒ Y and V ⇒ Y (18 input channels, 12 output channels).
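Extracting such a table from a trained model is straightforward, since a 1×1 convolution kernel is simply a channel-mixing matrix once its singleton spatial axes are dropped. In the sketch below, `model` and the layer name `cross_y_to_uv` are hypothetical placeholders, and the colour map is ours rather than the green/blue scheme of Figure 4.

```python
import numpy as np
import matplotlib.pyplot as plt

# A Keras Conv2D kernel has shape (k_h, k_w, c_in, c_out); for a 1x1
# convolution this squeezes to a (c_in, c_out) channel-mixing matrix.
# `model` and the layer name are hypothetical placeholders.
kernel = model.get_layer('cross_y_to_uv').get_weights()[0]
table = np.squeeze(kernel).T        # rows: output channels, cols: input

# Diverging colour map, symmetric around zero, intensity ~ |weight|.
bound = np.abs(table).max()
plt.imshow(table, cmap='coolwarm', vmin=-bound, vmax=bound)
plt.xlabel('input channel')
plt.ylabel('output channel')
plt.show()
```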
Rather than displaying the raw table values, we decided to visualise the weights in a heatmap style; Figure 4 showcases this visualisation for the first cross-connection layer of X-FitNet4. Green colours indicate that an input channel has a positive connection weight to the respective output channel, while blue colours indicate negative weights. The colour intensities are proportional to the absolute weight values.

It can be seen that each output channel of the cross-connection layer is obtained through a nontrivial weighted combination of input channels. We hypothesise that the cross-connection layers selectively filter and combine input features that are more utilisable in another processing stream.

To delve deeper into what kinds of features the cross-connection layers are filtering, combining and passing, we applied the layer-wise feature-map activation technique proposed by Simonyan et al. [26]. This technique performs gradient ascent on a white-noise input image to maximise the activations of a specific channel of feature maps at any of the layers within a pre-trained model. The objective function for the gradient ascent is defined as

I′ = argmax_I Σ(I) − λ‖I‖₂²    (1)

where I is the input image, Σ(I) is the activation of the considered neuron when provided with I as input, and λ is a regularisation factor. After iterating for a number of gradient ascent steps, the original white-noise image will be modified into patterns that approximate the detection function of a specific neuron.
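Equation (1) translates into a short gradient-ascent loop; the sketch below assumes a single-input Keras `model` and a hypothetical layer name, and the step size, iteration count and λ are chosen arbitrarily.

```python
import tensorflow as tf

# Build a sub-model exposing the feature maps of the layer of interest.
# `model` and the layer name are hypothetical placeholders.
feature_model = tf.keras.Model(model.inputs,
                               model.get_layer('cross_y_to_uv').output)

image = tf.Variable(tf.random.normal((1, 32, 32, 1)))  # white-noise start
lam, step, channel = 0.01, 1.0, 3

for _ in range(200):
    with tf.GradientTape() as tape:
        activation = tf.reduce_mean(feature_model(image)[..., channel])
        objective = activation - lam * tf.reduce_sum(image ** 2)  # Eq. (1)
    image.assign_add(step * tape.gradient(objective, image))      # ascent
```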
Lower-level convolutional layers are well known to learn filters approximating Gabor wavelet filters that act as edge detectors, corner detectors, etc.; we can confirm that in our experiments this has indeed been the case. For the first cross-connection layer of X-FitNet4, we have visualised a selection of channel activations in Figure 5. This visualisation indicates that the cross-connection layer is indeed passing combined lower-level features, such as the addition of horizontal and vertical stripes in the upper right image in the figure. We observe further that the pattern frequency for the Y channel's cross-connection layer is higher than the one for the U and V layers. This observation reflects the fact that the human vision system is able to detect higher-frequency variations in intensity than in chrominance. This is a solid indicator that the X-CNN architecture, when faced with an image classification task in the YUV colour scheme, is actually attempting to mimic human vision.

Fig. 5. Artificially generated images (from white noise) that cause strong activations of specific channels in the first cross-connection layer of the X-FitNet4 model. Top: three channels from the Y ⇒ U/V cross-connections. Middle: three channels from the U ⇒ Y cross-connections. Bottom: three channels from the V ⇒ Y cross-connections.

Our final analysis focusses on the X-KerasNet model, where we transformed feature maps of arbitrary depths into RGB images by a colour-mapping scheme. Figure 6 shows the feature maps of the inputs and outputs of the Y ⇒ U/V cross-connections for representative images of the truck and airplane classes. These were easier to comparatively analyse on the X-KerasNet, as its cross-connection layers do not alter the number of feature maps, and therefore the same colour-mapping scheme remained meaningful for both. We observe that the cross-connection output maps have the background and some features emphasised, while other features are de-emphasised—which further indicates that the cross-connection layers are performing more complex inter-superlayer feature integration than simply passing feature maps between superlayers.
VII. CONCLUSION
We have introduced cross-modal convolutional neural networks (X-CNNs), a novel architecture that decouples convolutional processing of (typically image-based) input partitions, while allowing for periodical information flow between the processing pipelines, in order to achieve performance improvements in sparse data environments. We have applied this methodology on the popular CIFAR-10/100 image classification datasets for two baseline models, managing to significantly outperform them in low-data environments, while remaining competitive in high-data environments—outperforming them on all of the full-dataset experiments. Aside from reinforcing the claim that the X-CNN architecture can only be beneficial to a baseline model (depending on the levels of training data sparsity, potentially highly significantly), we have further verified that the introduced cross-connection layers perform rather complex functions (thus they are not limited to simple feature map passing) and are capable of mimicking human vision processes—confirming that the biological inspiration behind such a model is justified.

Fig. 6. Visualisation of input and output feature maps of the Y ⇒ U/V cross-connection layer of X-KerasNet. Left: input images (truck/airplane). Middle: input feature maps to the cross-connection layer. Right: output feature maps of the cross-connection layer.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[2] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[5] M. Kivelä, A. Arenas, M. Barthelemy, J. P. Gleeson, Y. Moreno, and M. A. Porter, "Multilayer networks," Journal of Complex Networks, vol. 2, no. 3, pp. 203–271, 2014.
[6] M. De Domenico, C. Granell, M. A. Porter, and A. Arenas, "The physics of multilayer networks," arXiv preprint arXiv:1604.02021, 2016.
[7] E. Estrada and J. Gómez-Gardeñes, "Communicability reveals a transition to coordinated behavior in multiplex networks," Physical Review E, vol. 89, no. 4, p. 042819, 2014.
[8] C. Granell, S. Gómez, and A. Arenas, "Competing spreading processes on multiplex networks: awareness and epidemics," Physical Review E, vol. 90, no. 1, p. 012808, 2014.
[9] P. Veličković and P. Liò, "Molecular multiplex network inference using Gaussian mixture hidden Markov models," Journal of Complex Networks, 2015. [Online]. Available: http://comnet.oxfordjournals.org/content/early/2015/12/25/comnet.cnv029.abstract
[10] M. A. Eckert, N. V. Kamdar, C. E. Chang, C. F. Beckmann, M. D. Greicius, and V. Menon, "A cross-modal system linking primary auditory and visual cortices: Evidence from intrinsic fMRI connectivity analysis," Human Brain Mapping, vol. 29, no. 7, pp. 848–857, 2008.
[11] A. L. Beer, T. Plank, and M. W. Greenlee, "Diffusion tensor imaging shows white matter tracts between human auditory and visual cortex," Experimental Brain Research, vol. 213, no. 2, pp. 299–308, 2011. [Online]. Available: http://dx.doi.org/10.1007/s00221-011-2715-y
[12] W. Yang, J. Yang, Y. Gao, X. Tang, Y. Ren, S. Takahashi, and J. Wu, "Effects of sound frequency on audiovisual integration: An event-related potential study," PLoS ONE, vol. 10, no. 9, pp. 1–15, 09 2015. [Online]. Available: http://dx.doi.org/10.1371%2Fjournal.pone.0138296
[13] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," 2009.
[14] F. Chollet, "Keras," https://github.com/fchollet/keras, 2015.
[15] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016. [Online]. Available: http://arxiv.org/abs/1605.02688
[16] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[17] F. Chollet, "Keras CNN example for CIFAR-10," github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py, 2015.
[18] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[19] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[20] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "FitNets: Hints for thin deep nets," arXiv preprint arXiv:1412.6550, 2014.
[21] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training very deep networks," in Advances in Neural Information Processing Systems, 2015, pp. 2377–2385.
[22] D. Mishkin and J. Matas, "All you need is a good init," arXiv preprint arXiv:1511.06422, 2015.
[23] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," in Proceedings of The 30th International Conference on Machine Learning, 2013, pp. 1319–1327.
[24] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[25] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[26] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," arXiv preprint arXiv:1312.6034, 2013.