Deep Convolutional Networks on Graph-Structured Data
Mikael Henaff
Courant Institute of Mathematical Sciences, New York University — [email protected]
Joan Bruna
University of California, Berkeley [email protected]
Yann LeCun
Courant Institute of Mathematical Sciences, New York University — [email protected]
Abstract
Deep Learning's recent successes have mostly relied on Convolutional Networks, which exploit fundamental statistical properties of images, sounds and video data: local stationarity and multi-scale compositional structure, which allow expressing long range interactions in terms of shorter, localized interactions. However, there exist other important examples, such as text documents or bioinformatic data, that may lack some or all of these strong statistical regularities.

In this paper we consider the general question of how to construct deep architectures with small learning complexity on general non-Euclidean domains, which are typically unknown and need to be estimated from the data. In particular, we develop an extension of Spectral Networks which incorporates a Graph Estimation procedure, that we test on large-scale classification problems, matching or improving over Dropout Networks with far fewer parameters to estimate.
1 Introduction

In recent times, Deep Learning models have proven extremely successful on a wide variety of tasks, from computer vision and acoustic modeling to natural language processing [9]. At the core of their success lies an important assumption on the statistical properties of the data, namely stationarity and compositionality through local statistics, which are present in natural images, video, and speech. These properties are exploited efficiently by ConvNets [8, 7], which are designed to extract local features that are shared across the signal domain. Thanks to this, they are able to greatly reduce the number of parameters in the network with respect to generic deep architectures, without sacrificing the capacity to extract informative statistics from the data. Similarly, Recurrent Neural Nets (RNNs) trained on temporal data implicitly assume a stationary distribution.

One can think of such data examples as signals defined on a low-dimensional grid. In this case stationarity is well defined via the natural translation operator on the grid, locality is defined via the metric of the grid, and compositionality is obtained from downsampling, or equivalently thanks to the multi-resolution property of the grid. However, there exist many examples of data that lack this underlying low-dimensional grid structure. For example, text documents represented as bags of words can be thought of as signals defined on a graph whose nodes are vocabulary terms and whose weights represent some similarity measure between terms, such as co-occurrence statistics. In medicine, a patient's gene expression data can be viewed as a signal defined on the graph imposed by the regulatory network. In fact, computer vision and audio, which are the main focus of research efforts in deep learning, only represent a special case of data defined on an extremely simple low-dimensional graph.
Complex graphs arising in other domains might be of higher dimension, and the statistical properties of data defined on such graphs might not satisfy the stationarity, locality and compositionality assumptions previously described. For such data of dimension N, deep learning strategies are reduced to learning with fully-connected layers, which have O(N²) parameters, and regularization is carried out via weight decay and dropout [17].

When the graph structure of the input is known, [2] introduced a model that generalizes ConvNets with low learning complexity, similar to that of a ConvNet, and demonstrated it on simple low-dimensional graphs. In this work, we are interested in generalizing ConvNets to high-dimensional, general datasets, and, most importantly, to the setting where the graph structure is not known a priori. In this context, learning the graph structure amounts to estimating the similarity matrix, which has complexity O(N²). One may therefore wonder whether graph estimation followed by graph convolutions offers advantages with respect to learning directly from the data with fully connected layers. We attempt to answer this question experimentally and to establish baselines for future work.

We explore these approaches in two areas of application for which it has not been possible to apply convolutional networks before: text categorization and bioinformatics. Our results show that our method is capable of matching or outperforming large, fully-connected networks trained with dropout using fewer parameters. Our main contributions can be summarized as follows:

• We extend the ideas from [2] to large-scale classification problems, specifically Imagenet Object Recognition, text categorization and bioinformatics.
• We consider the most general setting where no prior information on the graph structure is available, and propose unsupervised and new supervised graph estimation strategies in combination with the supervised graph convolutions.

The rest of the paper is structured as follows. Section 2 reviews similar works in the literature. Section 3 discusses generalizations of convolutions on graphs, and Section 4 addresses the question of graph estimation. Finally, Section 5 shows numerical experiments on large scale object recognition, text categorization and bioinformatics.
2 Related Work

There have been several works which have explored architectures using the so-called local receptive fields [6, 4, 14], mostly with applications to image recognition. In particular, [4] proposes a scheme to learn how to group together features based upon a measure of similarity that is obtained in an unsupervised fashion. However, it does not attempt to exploit any weight-sharing strategy.

Recently, [2] proposed a generalization of convolutions to graphs via the Graph Laplacian. By identifying a linear, translation-invariant operator in the grid (the Laplacian operator) with its counterpart in a general graph (the Graph Laplacian), one can view convolutions as the family of linear transforms commuting with the Laplacian. By combining this commutation property with a rule to find localized filters, the model requires only O(1) parameters per "feature map". However, this construction requires prior knowledge of the graph structure, and was shown only on simple, low-dimensional graphs. More recently, [12] introduced Shapenet, another generalization of convolutions on non-Euclidean domains based on geodesic polar coordinates, which was successfully applied to shape analysis and allows comparison across different manifolds. However, it also requires prior knowledge of the manifolds.

The graph or similarity estimation aspects have also been extensively studied in the past. For instance, [15] studies the estimation of the graph from a statistical point of view, through the identification of a certain graphical model using ℓ1-penalized logistic regression. Also, [3] considers the problem of learning a deep architecture through a series of Haar contractions, which are learnt using an unsupervised pairing criterion over the features.

3 Generalizing Convolutions to Graphs

3.1 Spectral Networks

Our work builds upon [2], which introduced spectral networks. We recall the definition here and its main properties.
A spectral network generalizes a convolutional network through the Graph Fourier Transform, which is in turn defined via a generalization of the Laplacian operator on the grid to the graph Laplacian. An input vector x ∈ R^N is seen as a signal defined on a graph G with N nodes.

Definition 1.
Let W be an N × N similarity matrix representing an undirected graph G, and let L = I − D^{−1/2} W D^{−1/2} be its graph Laplacian, where D = diag(W · 1) is the degree matrix, with eigenvectors U = (u_1, …, u_N). Then a graph convolution of input signals x with filters g on G is defined by

x ∗_G g = U^T (U x ⊙ U g),

where ⊙ represents a point-wise product.

Here, the unitary matrix U plays the role of the Fourier Transform in R^d. There are several ways of computing the graph Laplacian L [1]. In this paper, we choose the normalized version L = I − D^{−1/2} W D^{−1/2}, where D is a diagonal matrix with entries D_ii = Σ_j W_ij. Note that in the case where W represents the lattice, from the definition of L we recover the discrete Laplacian operator Δ. Also note that the Laplacian commutes with the translation operator, which is diagonalized in the Fourier basis. It follows that the eigenvectors of Δ are given by the Discrete Fourier Transform (DFT) matrix. We then recover a classical convolution operator by noting that convolutions are by definition linear operators that diagonalize in the Fourier domain (also known as the Convolution Theorem [11]).

Learning filters on a graph thus amounts to learning spectral multipliers w_g = (w_1, …, w_N), with

x ∗_G g := U^T (diag(w_g) U x).

Extending the convolution to inputs x with multiple input channels is straightforward. If x is a signal with M input channels and N locations, we apply the transformation U on each channel, and then use multipliers w_g = (w_{i,j}; i ≤ N, j ≤ M).

However, this formulation requires N multipliers per feature map. Convolutional kernels are typically restricted to have small spatial support, independent of the number of input pixels N, which enables the model to learn a number of parameters independent of N. In order to recover a similar learning complexity in the spectral domain, it is thus necessary to restrict the class of spectral multipliers to those corresponding to localized filters.

For that purpose, we seek to express spatial localization of filters in terms of their spectral multipliers.
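The spectral filtering defined above (analyze the signal in the Laplacian eigenbasis, scale it by the multipliers, synthesize back) can be sketched in a few lines of numpy. This is a toy sketch under our own naming, assuming a dense eigendecomposition is affordable; the paper's actual implementation used Torch with a custom CUDA backend.

```python
import numpy as np

def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2}, with D the diagonal degree matrix D_ii = sum_j W_ij."""
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def graph_conv(x, w_g, U):
    """x *_G g = U^T diag(w_g) U x: graph Fourier transform, pointwise
    multiplication by the spectral multipliers w_g, inverse transform."""
    return U.T @ (w_g * (U @ x))

# Toy graph: a ring of 6 nodes, whose Laplacian eigenbasis is Fourier-like.
N = 6
W = np.zeros((N, N))
for i in range(N):
    W[i, (i + 1) % N] = W[(i + 1) % N, i] = 1.0

L = normalized_laplacian(W)
lam, V = np.linalg.eigh(L)   # columns of V are eigenvectors of L
U = V.T                      # rows are eigenvectors: U @ x is the graph Fourier transform

np.random.seed(0)
x = np.random.randn(N)
y_identity = graph_conv(x, np.ones(N), U)   # all-pass multipliers recover x
```

With all multipliers equal to one the filter is the identity, since U is orthogonal; learning a graph filter amounts to learning the vector w_g.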
In the grid, smoothness in the frequency domain corresponds to spatial decay, since

|∂^k x̂(ξ) / ∂ξ^k| ≤ C ∫ |u|^k |x(u)| du,

where x̂(ξ) is the Fourier transform of x. In [2] it was suggested to use the same principle in a general graph, by considering a smoothing kernel K ∈ R^{N×N₀} (with N₀ ≤ N subsampled frequencies), such as splines, and searching for spectral multipliers of the form w_g = K w̃_g. The algorithm which implements the graph convolution is described in Algorithm 1.
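To illustrate how w_g = K w̃_g reduces the parameter count, the following sketch builds an interpolation kernel K mapping a small number of free control weights to a full, smooth set of spectral multipliers. The paper uses spline kernels; for brevity we substitute plain linear interpolation, which plays the same role (the function name and the construction are ours):

```python
import numpy as np

def interpolation_kernel(N, k):
    """K in R^{N x k}: linearly interpolates k control weights to N smooth
    spectral multipliers (a stand-in for the spline kernel of the paper)."""
    K = np.zeros((N, k))
    pos = np.linspace(0, N - 1, k)          # positions of the k control points
    for i in range(N):
        j = np.searchsorted(pos, i, side='right') - 1
        j = min(j, k - 2)                   # clamp to the last interval
        t = (i - pos[j]) / (pos[j + 1] - pos[j])
        K[i, j] = 1.0 - t
        K[i, j + 1] = t
    return K

N, k = 2000, 60                 # e.g. Reuters: 2000 features, 60 subsampled weights
K = interpolation_kernel(N, k)
np.random.seed(0)
w_tilde = np.random.randn(k)    # the only free parameters of the filter
w_g = K @ w_tilde               # smooth multipliers, hence a localized filter
```

Only the k entries of w̃_g are learned; K is fixed, so the parameter cost per filter no longer grows with N.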
Algorithm 1: Train Graph Convolution Layer

Given: GFT matrix U, interpolation kernel K, weights w̃.

Forward pass:
1. Fetch input batch x and gradients w.r.t. outputs ∇y.
2. Compute interpolated weights: w_{f′f} = K w̃_{f′f}.
3. Compute output: y_{s,f′} = U^T ( Σ_f U x_{s,f} ⊙ w_{f′f} ).

Backward pass:
1. Compute gradient w.r.t. input: ∇x_{s,f} = U^T ( Σ_{f′} ∇y_{s,f′} ⊙ w_{f′f} ).
2. Compute gradient w.r.t. interpolated weights: ∇w_{f′f} = U^T ( Σ_s ∇y_{s,f′} ⊙ x_{s,f} ).
3. Compute gradient w.r.t. weights: ∇w̃_{f′f} = K^T ∇w_{f′f}.

3.2 Pooling

In image and speech applications, and in order to reduce the complexity of the model, it is often useful to trade off spatial resolution for feature resolution as the representation becomes deeper. For that purpose, pooling layers compute statistics in local neighborhoods, such as the average amplitude, energy or maximum activation.

The same layers can be defined on a graph by providing the equivalent notion of neighborhood. In this work, we construct such neighborhoods at different scales using multi-resolution spectral clustering [20], and consider both average and max-pooling, as in standard convolutional network architectures.
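Given a cluster assignment of the nodes, graph pooling reduces to computing one statistic per cluster. The sketch below assumes the clusters are precomputed (in the paper they come from multi-resolution spectral clustering [20]); the toy partition is hard-coded for illustration:

```python
import numpy as np

def graph_max_pool(x, clusters):
    """Max-pool a graph signal x (length N) over precomputed node clusters:
    clusters[i] is the cluster index of node i. Returns one value per cluster."""
    n_clusters = clusters.max() + 1
    return np.array([x[clusters == c].max() for c in range(n_clusters)])

def graph_avg_pool(x, clusters):
    """Average-pool a graph signal over the same precomputed clusters."""
    n_clusters = clusters.max() + 1
    return np.array([x[clusters == c].mean() for c in range(n_clusters)])

x = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0])
clusters = np.array([0, 0, 1, 1, 2, 2])     # hypothetical 3-cluster partition
pooled_max = graph_max_pool(x, clusters)    # array([3., 8., 5.])
pooled_avg = graph_avg_pool(x, clusters)    # array([2. , 5. , 4.5])
```

Stacking such layers coarsens the graph at each scale, mirroring strided pooling on a grid.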
4 Graph Construction
Whereas some recognition tasks in non-Euclidean domains, such as those considered in [2] or [12], might come with prior knowledge of the graph structure of the input data, many other real-world applications do not. It is thus necessary to estimate a similarity matrix W from the data before constructing the spectral network. In this paper we consider two possible graph constructions: one unsupervised, by measuring joint feature statistics, and another supervised, using an initial network as a proxy for the estimation.

4.1 Unsupervised Graph Estimation

Given data X ∈ R^{L×N}, where L is the number of samples and N the number of features, the simplest approach to estimating a graph structure from the data is to consider a distance between features i and j given by

d(i, j) = ‖X_i − X_j‖²,

where X_i is the i-th column of X. While correlations are typically sufficient to reveal the intrinsic geometrical structure of images [16], the effects of higher-order statistics might be non-negligible in other contexts, especially in the presence of sparsity. Indeed, in many situations the pairwise Euclidean distances might suffer from unnormalized measurements. Several strategies and variants exist to gain some robustness, for instance replacing the Euclidean distance by the Z-score (thus renormalizing each feature by its standard deviation), the "square-correlation" (computing the correlation of squares of previously whitened features), or the mutual information.

This distance is then used to build a Gaussian diffusion kernel [1]

ω(i, j) = exp(−d(i, j) / σ²).   (1)

In our experiments, we also consider the variant of a self-tuning diffusion kernel [21],

ω(i, j) = exp(−d(i, j) / (σ_i σ_j)),

where σ_i is computed as the distance d(i, i_k) corresponding to the k-th nearest neighbor i_k of feature i.
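Both kernel variants can be computed directly from the data matrix; the sketch below is a minimal numpy version under our own naming (the dense N × N distance matrix is assumed to fit in memory):

```python
import numpy as np

def feature_distances(X):
    """Pairwise squared Euclidean distances d(i, j) = ||X_i - X_j||^2 between
    the feature columns X_i of the data matrix X (L samples x N features)."""
    G = X.T @ X
    sq = np.diag(G)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * G, 0.0)

def gaussian_kernel(X, sigma):
    """Global-scale diffusion kernel of Eq. (1): omega(i, j) = exp(-d(i, j) / sigma^2)."""
    return np.exp(-feature_distances(X) / sigma ** 2)

def self_tuning_kernel(X, k=4):
    """Self-tuning variant: omega(i, j) = exp(-d(i, j) / (sigma_i sigma_j)),
    where sigma_i is the distance from feature i to its k-th nearest feature."""
    d = feature_distances(X)
    # column 0 of the sorted distances is d(i, i) = 0, so index k is the k-th neighbor
    sigma = np.maximum(np.sort(d, axis=1)[:, k], 1e-12)
    return np.exp(-d / np.outer(sigma, sigma))

np.random.seed(0)
X = np.random.randn(50, 12)          # 50 samples, 12 features
W_global = gaussian_kernel(X, sigma=10.0)
W_local = self_tuning_kernel(X, k=4)
```

Either W can then be fed to the Laplacian and eigendecomposition of Section 3 to build the spectral network.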
This defines a kernel whose variance is locally adapted around each feature point, as opposed to (1), where the variance is shared.

The main advantage of (1) is that it does not require labeled data. Therefore, it is possible to estimate the similarity using several datasets that share the same features, for example in text classification.

4.2 Supervised Graph Estimation

As discussed in the previous section, the notion of feature similarity is not well defined, as it depends on our choice of kernel and criteria. Therefore, in the context of supervised learning, the relevant statistics of the input signals might not correspond to our imposed similarity criteria. It may thus be interesting to ask for the feature similarity that best suits a particular classification task.

A particularly simple approach is to use a fully-connected network to determine the feature similarity. Given a training set with normalized features X ∈ R^{L×N} and labels y ∈ {1, …, C}^L, we initially train a fully connected network φ with K layers of weights W_1, …, W_K, using standard ReLU activations and dropout. (In our experiments we simply normalized each feature by its standard deviation, but one could also whiten the data completely.) We then extract the first-layer weights W_1 ∈ R^{N×M}, where M is the number of first-layer hidden features, and consider the distance

d_sup(i, j) = ‖W_{1,i} − W_{1,j}‖²,   (2)

which is then fed into the Gaussian kernel as in (1). The interpretation is that the supervised criterion will extract through W_1 a collection of linear measurements that best serve the classification task. Thus two features are similar if the network decides to use them similarly within these linear measurements.

This construction can be seen as "distilling" the information learnt by a first network into a kernel. In the general case where no assumptions are made on the dimension of the graph, it amounts to extracting N²/2 parameters from the first learning stage (which typically involves a much larger
number of parameters). If, moreover, we assume a low-dimensional graph structure of dimension m, then mN parameters are extracted by projecting the resulting kernel onto its leading m directions.

Finally, observe that one could simply replace the eigenbasis U obtained by diagonalizing the graph Laplacian with an arbitrary unitary matrix, which is then optimized by back-propagation together with the rest of the parameters of the model. We do not report results on this strategy, although we point out that it has the same learning complexity as the fully connected network (requiring O(KN²) parameters, where K is the number of layers and N is the input dimension).

5 Experiments

In order to measure the performance of spectral networks on real-world data and to explore the effect of the graph estimation procedure, we conducted experiments on three datasets from text categorization, computational biology and computer vision. All experiments were done using the Torch machine learning environment with a custom CUDA backend.

We based the spectral network architecture on that of a classical convolutional network, namely by interleaving graph convolution, ReLU and graph pooling layers, and ending with one or more fully connected layers. As noted above, training a spectral network requires an O(N²) matrix multiplication for each input and output feature map to perform the Graph Fourier Transform, compared to the efficient O(N log N) Fast Fourier Transform used in classical ConvNets. We found training spectral networks with large numbers of feature maps to be very time-consuming, and therefore chose to experiment mostly with architectures with fewer feature maps and smaller pool sizes.
We found that performing pooling at the beginning of the network was especially important to reduce the dimensionality in the graph domain and mitigate the cost of the expensive Graph Fourier Transform operation.

In this section we adopt the following notation to describe network architectures: GCk denotes a graph convolution layer with k feature maps, Pk denotes a graph pooling layer with stride k and pool size k, and FCk denotes a fully connected layer with k hidden units. In our results we also denote the number of free parameters in the network by P_net and the number of free parameters used when estimating the graph by P_graph.

5.1 Reuters

We used the Reuters dataset described in [18], which consists of training and test sets each containing 201,369 documents from 50 mutually exclusive classes. Each document is represented as a log-normalized bag of words for 2000 common non-stop words. As a baseline we used the fully-connected network of [18] with two hidden layers consisting of 2000 and 1000 hidden units, regularized with dropout.

We chose hyperparameters by performing initial experiments on a validation set consisting of one-tenth of the training data. Specifically, we set the number of subsampled weights to k = 60, the learning rate to 0.01, and used max pooling rather than average pooling. We also found that using AdaGrad [5] made training faster. All architectures were then trained using the same hyperparameters. Since the experiments were computationally expensive, we did not train all models until full convergence. This enabled us to explore more model architectures and obtain a clearer understanding of the effects of graph construction.
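To make the GCk/Pk/FCk notation concrete, a toy forward pass through a GC2-P2-FC3 stack can be sketched as follows. This is purely illustrative: weights are random, the orthogonal basis U stands in for a graph Fourier basis, and pooling pairs nodes by index rather than by spectral clustering; all names and shapes are ours, not the paper's Torch implementation.

```python
import numpy as np

np.random.seed(0)
N, F = 8, 2                          # 8 graph nodes; GC2 has 2 feature maps

# A fixed orthogonal "graph Fourier" basis U from a random symmetric matrix.
A = np.random.randn(N, N)
A = A + A.T
_, V = np.linalg.eigh(A)
U = V.T                              # rows act as graph Fourier modes

w = np.random.randn(F, N)            # spectral multipliers, one set per feature map
x = np.random.randn(N)               # input signal on the graph

# GC2: graph convolution, y_f = U^T (w_f * (U x))
gc = np.stack([U.T @ (w[f] * (U @ x)) for f in range(F)])   # shape (F, N)

# P2: graph max-pooling with pool size 2 (nodes paired by index, for illustration)
p = gc.reshape(F, N // 2, 2).max(axis=2)                    # shape (F, N/2)

# FC3: fully connected layer with 3 hidden units and ReLU
W_fc = np.random.randn(3, F * (N // 2))
h = np.maximum(W_fc @ p.reshape(-1), 0.0)                   # shape (3,)
```

A classifier head on top of h would complete the architecture string, e.g. GC2-P2-FC3-FC50 for 50 classes.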
Figure 1: Similarity graphs for the Reuters (top) and Merck DPP4 (bottom) datasets. Left plots correspond to global σ, right plots to local σ.

Table 1: Results for the Reuters dataset. Accuracy is shown at epochs 200 and 1500. [Columns: Graph, Architecture, P_net, P_graph, Acc. (200), Acc. (1500); the baseline architecture is FC2000-FC1000; the numeric entries were not recoverable.] Note: for the baseline, the reported value is the maximum before the fully connected network starts overfitting.

5.2 Merck Molecular Activity Challenge

The Merck Molecular Activity Challenge is a computational biology benchmark where the task is to predict activity levels for various molecules based on the distances in bonds between different atoms. For our experiments we used the DPP4 dataset, which has 8193 samples and 2796 features. We chose this dataset because it was one of the more challenging, and was of relatively low dimensionality, which made the spectral networks tractable. As a baseline architecture, we used the network of [10], which has 4 hidden layers and is regularized using dropout and weight decay. We used the same hyperparameter settings and data normalization recommended in the paper.

As before, we used one-tenth of the training set to tune hyperparameters of the network. For this task we found that k = 40 subsampled weights worked best, and that average pooling performed better than max pooling. Since the task is to predict a continuous variable, all networks were trained by minimizing the Root Mean-Squared Error loss. Following [10], we measured performance by computing the squared correlation between predictions and targets.
Figure 2: Evolution of test accuracy. Left: Reuters dataset (FC2000-FC1000 baseline vs. GC4-P4-FC1000 with supervised graph). Right: Merck dataset (fully connected baseline vs. Spectral16 supervised, Spectral64 supervised, Spectral64 with local kernel, Spectral64 with global kernel).

Table 2: Results for the Merck DPP4 dataset. [Columns: Graph, Architecture, P_net, P_graph, R²; the baseline architecture is FC4000-FC2000-FC1000-FC1000; the numeric entries were not recoverable.]

5.3 ImageNet

In the experiments above, our graph construction relied on estimation from the data. To measure the influence of the graph construction compared to the filter learning in the graph frequency domain, we performed the same experiments on the ImageNet dataset, for which the graph is already known: it is the 2-D grid. The spectral network was thus a convolutional network whose weights were defined in the frequency domain using frequency smoothing, rather than imposing compactly supported filters. Training was performed exactly as in Algorithm 1, except that the linear transformation was a Fast Fourier Transform.

Our network consisted of 4 convolution/ReLU/max pooling layers with 48, 128, 256 and 256 feature maps, followed by 3 fully-connected layers each with 4096 hidden units, regularized with dropout. We trained two versions of the network: one as a classical convolutional network, and one as a spectral network whose weights were defined in the frequency domain only and interpolated using a spline kernel. Both networks were trained for 40 epochs over the ImageNet dataset, with input images scaled down in resolution to accelerate training.

Table 3: ImageNet results

Graph     Architecture             Test Accuracy (Top 5)   Test Accuracy (Top 1)
2-D Grid  Convolutional Network    71.854                  46.24
2-D Grid  Spectral Network         71.998                  46.71
Figure 3: ConvNet vs. SpectralNet on ImageNet (top-1 and top-5 accuracy over training for both models).

We see that both models yield nearly identical performance. Interestingly, the spectral network learns faster than the ConvNet during the first part of training, although both networks converge at around the same time. This requires further investigation.
6 Discussion

ConvNet architectures base their appeal and success on their ability to produce highly informative local statistics using low learning complexity and avoiding expensive matrix multiplications. This motivated us to consider generalizations to high-dimensional, unstructured data.

When the statistical properties of the input satisfy both stationarity and compositionality, spectral networks have a learning complexity of the same order as ConvNets. In the general setting where no prior knowledge of the input graph structure is available, our model requires estimating the similarities, an O(N²) operation, but making the model deeper does not increase learning complexity as much as with general fully connected architectures. Moreover, in contexts where feature similarities can be estimated using unlabeled data (such as word representations), our model has fewer parameters to learn from labeled data.

However, as our results demonstrate, this extension poses significant challenges:

• Although the learning complexity requires O(1) parameters per feature map, the evaluation, both forward and backward, requires a multiplication by the Graph Fourier Transform, which costs O(N²) operations. This is a major difference with respect to traditional ConvNets, which require only O(N). Fourier implementations of ConvNets [13, 19] bring the complexity to O(N log N), thanks again to the specific symmetries of the grid. An open question is whether one can find approximate eigenbases of general Graph Laplacians using Givens decompositions similar to those of the FFT.

• Our experiments show that when the input graph structure is not known a priori, graph estimation is the statistical bottleneck of the model, requiring O(N²) parameters for general graphs and O(MN) for M-dimensional graphs. Supervised graph estimation performs significantly better than unsupervised graph estimation based on low-order moments.
Furthermore, we have verified that the architecture is quite sensitive to graph estimation errors. In the supervised setting, this step can be viewed in terms of a bootstrapping mechanism, where an initially unconstrained network is self-adjusted to become more localized, with weight-sharing.

• Finally, the statistical assumptions of stationarity and compositionality are not always verified. In those situations, the constraints imposed by the model risk reducing its capacity for no reason. One possibility for addressing this issue is to insert fully connected layers between the input and the spectral layers, such that data can be transformed into the appropriate statistical model. Another strategy, left for future work, is to relax the notion of weight sharing by introducing instead a commutation error ‖W_i L − L W_i‖ with the graph Laplacian, which puts a soft penalty on transformations that do not commute with the Laplacian, instead of imposing exact commutation as the spectral net does.

References

[1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In
NIPS, volume 14, pages 585–591, 2001.
[2] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and deep locally connected networks on graphs. In Proceedings of the 2nd International Conference on Learning Representations, 2013.
[3] Xu Chen, Xiuyuan Cheng, and Stéphane Mallat. Unsupervised deep haar scattering on graphs. In Advances in Neural Information Processing Systems, pages 1709–1717, 2014.
[4] Adam Coates and Andrew Y. Ng. Selecting receptive fields in deep networks. In Advances in Neural Information Processing Systems, pages 2528–2536, 2011.
[5] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
[6] Karol Gregor and Yann LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv preprint arXiv:1006.0448, 2010.
[7] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition. Signal Processing Magazine, 2012.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[9] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[10] Junshui Ma, Robert P. Sheridan, Andy Liaw, George E. Dahl, and Vladimir Svetnik. Deep neural networks as a method for quantitative structure-activity relationships. Journal of Chemical Information and Modeling, 2015.
[11] Stéphane Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1999.
[12] Jonathan Masci, Davide Boscaini, Michael M. Bronstein, and Pierre Vandergheynst. Shapenet: Convolutional neural networks on non-euclidean manifolds. CoRR, abs/1501.06297, 2015.
[13] Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851, 2013.
[14] Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W. Koh, Quoc V. Le, and Andrew Y. Ng. Tiled convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1279–1287, 2010.
[15] Pradeep Ravikumar, Martin J. Wainwright, John D. Lafferty, et al. High-dimensional ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
[16] Nicolas L. Roux, Yoshua Bengio, Pascal Lamblin, Marc Joliveau, and Balázs Kégl. Learning the 2-d topology of images. In Advances in Neural Information Processing Systems, pages 841–848, 2008.
[17] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[18] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[19] Nicolas Vasilache, Jeff Johnson, Michaël Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. CoRR, abs/1412.7580, 2014.
[20] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[21] Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering. In