Tensor Contraction Layers for Parsimonious Deep Nets
Jean Kossaifi, Aran Khanna, Zachary C. Lipton, Tommaso Furlanello, Anima Anandkumar
Jean Kossaifi (Amazon AI; Imperial College London)
Aran Khanna (Amazon AI)
Zachary C. Lipton (Amazon AI; University of California, San Diego)
Tommaso Furlanello (Amazon AI; University of Southern California)
Anima Anandkumar (Amazon AI; California Institute of Technology)
Abstract
Tensors offer a natural representation for many kinds of data frequently encountered in machine learning. Images, for example, are naturally represented as third-order tensors, where the modes correspond to height, width, and channels. Tensor methods are noted for their ability to discover multi-dimensional dependencies, and tensor decompositions in particular have been used to produce compact low-rank approximations of data. In this paper, we explore the use of tensor contractions as neural network layers and investigate several ways to apply them to activation tensors. Specifically, we propose the Tensor Contraction Layer (TCL), the first attempt to incorporate tensor contractions as end-to-end trainable neural network layers. Applied to existing networks, TCLs reduce the dimensionality of the activation tensors and thus the number of model parameters. We evaluate the TCL on the task of image recognition, augmenting two popular networks (AlexNet, VGG); the resulting models remain trainable end-to-end. Using the CIFAR100 and ImageNet datasets, we evaluate the effect of parameter reduction via tensor contraction on performance. We demonstrate significant model compression without significant impact on accuracy and, in some cases, improved performance.
1. Introduction
Following their successful application to computer vision, speech recognition, and natural language processing, deep neural networks have become ubiquitous in the machine learning community. And yet many questions remain unanswered: Why do deep neural networks work? How many parameters are really necessary to achieve state-of-the-art performance?

Recently, tensor methods have been used in attempts to better understand the success of deep neural networks [4, 6]. One class of broadly useful techniques within tensor methods are tensor decompositions. While the properties of tensors have long been studied, in the past decade they have come to prominence in machine learning in such varied applications as learning latent variable models [1] and developing recommender systems [10]. Several recent papers apply tensor learning and tensor decomposition to deep neural networks for the purpose of devising neural network learning algorithms with theoretical guarantees of convergence [17, 9].

Other lines of research have investigated practical applications of tensor decomposition to deep neural networks, with aims including multi-task learning [20], sharing residual units [3], and speeding up convolutional neural networks [15]. Several recent papers apply decompositions for either initialization [20] or post-training [16]. These techniques then often require additional fine-tuning to compensate for the loss of information [11]. However, to our knowledge, no attempt has been made to apply tensor contractions as a generic layer directly on the activations or weights of a deep neural network and to train the resulting network end-to-end.

In deep convolutional neural networks, the output of each layer is a tensor. We posit that tensor algebraic techniques can exploit multidimensional dependencies in the activation tensors. We propose to leverage that structure by incorporating Tensor Contraction Layers (TCLs) into neural networks. Specifically, in our experiments, we apply TCLs directly to the third-order activation tensors produced by the final convolutional layer of an image recognition network. Traditional networks flatten this activation tensor, passing it to subsequent fully-connected layers. However, the flattening process loses information about the multidimensional structure of the tensor. Our experiments show that incorporating TCLs into several popular deep convolutional networks can improve their performance, despite reducing the number of parameters. Moreover, inference on TCL-equipped networks, which contain fewer parameters, requires considerably fewer floating point operations.

We organize the rest of this paper as follows: Section 1.1 introduces prerequisite concepts needed to understand the TCL; Section 2 explains the TCL in detail; Section 3 experimentally evaluates the TCL.
Notation: We define tensors as multidimensional arrays, denoting first-order tensors $v$ as vectors, second-order tensors $M$ as matrices, and denoting by $\tilde{X}$ tensors of order 3 or greater. $M^\top$ denotes the transpose of $M$.
Tensor unfolding: Given a tensor $\tilde{X} \in \mathbb{R}^{D_1 \times D_2 \times \cdots \times D_N}$, the mode-$n$ unfolding of $\tilde{X}$ is a matrix $X_{[n]} \in \mathbb{R}^{D_n \times D_{(-n)}}$, with $D_{(-n)} = \prod_{k=1, k \neq n}^{N} D_k$, defined by the mapping from element $(d_1, d_2, \cdots, d_N)$ to $(d_n, e)$, with $e = \sum_{k=1, k \neq n}^{N} d_k \times \prod_{m=k+1}^{N} D_m$.

n-mode product: For a tensor $\tilde{X} \in \mathbb{R}^{D_1 \times D_2 \times \cdots \times D_N}$ and a matrix $M \in \mathbb{R}^{R \times D_n}$, the n-mode product of $\tilde{X}$ by $M$ is a tensor of size $(D_1 \times \cdots \times D_{n-1} \times R \times D_{n+1} \times \cdots \times D_N)$, which can be expressed using the unfolding of $\tilde{X}$ and classical matrix multiplication as:

$$\tilde{X} \times_n M = M X_{[n]} \in \mathbb{R}^{D_1 \times \cdots \times D_{n-1} \times R \times D_{n+1} \times \cdots \times D_N} \qquad (1)$$
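To make these two operations concrete, here is a minimal NumPy sketch (the function names are ours, and the column ordering of the unfolding follows the usual row-major convention, which may differ from the exact index mapping above):

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: arrange the mode-`mode` fibers of `tensor`
    as columns of a matrix of shape (D_mode, product of remaining dims)."""
    return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1))

def mode_n_product(tensor, matrix, mode):
    """n-mode product of `tensor` with `matrix` (shape (R, D_mode)) along `mode`."""
    # Contract the matrix's columns with the chosen mode, then move the new
    # axis of size R back into position `mode`.
    out = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(out, 0, mode)

# Sanity check of Equation (1): unfolding the n-mode product along `mode`
# gives the same matrix as multiplying the unfolded tensor by the factor.
X = np.random.randn(4, 5, 6)
M = np.random.randn(3, 5)   # contracts the second mode (D_2 = 5 -> R = 3)
lhs = unfold(mode_n_product(X, M, mode=1), mode=1)
rhs = M @ unfold(X, mode=1)
assert np.allclose(lhs, rhs)
```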
Tensor contraction: Given a tensor $\tilde{X} \in \mathbb{R}^{D_1 \times D_2 \times \cdots \times D_N}$, we can decompose it into a low-dimensional core tensor $\tilde{G} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ through projection along each of its modes by projection factors $\left(U^{(1)}, \cdots, U^{(N)}\right)$, with $U^{(k)} \in \mathbb{R}^{R_k \times D_k}$, $k \in (1, \cdots, N)$. In other words, we can write:

$$\tilde{G} = \tilde{X} \times_1 U^{(1)} \times_2 U^{(2)} \times \cdots \times_N U^{(N)} \qquad (2)$$

or, in short:

$$\tilde{G} = [\![ \tilde{X};\, U^{(1)}, \cdots, U^{(N)} ]\!] \qquad (3)$$

In the case of tensor decomposition, the factors of the contraction are obtained by solving a least squares problem. In particular, a closed-form solution can be obtained for each factor by considering the $n$-mode unfolding of $\tilde{X}$, which can be expressed as:

$$G_{[n]} = U^{(n)} X_{[n]} \left( U^{(1)} \otimes \cdots \otimes U^{(n-1)} \otimes U^{(n+1)} \otimes \cdots \otimes U^{(N)} \right)^\top \qquad (4)$$

Figure 1. A representation of the Tensor Contraction Layer (TCL) applied to a tensor of order 3. The input tensor $\tilde{X}$ is contracted into a low-dimensionality core $\tilde{G}$.

We refer the interested reader to the seminal work of Kolda and Bader [12].
Many popular convolutional neural networks for computer vision, e.g. AlexNet, ResNet, and Inception, require hundreds of millions of parameters to achieve the reported results. This can be problematic when running these networks for inference on resource-constrained devices, where it may not be easy to execute hundreds of millions of calculations just to classify a single image.

While these widely used architectures exhibit considerable variety, they also exhibit some commonalities. Often, they consist of blocks containing convolution, activation and pooling layers followed by fully-connected layers before the final classification layer. Both the popular networks AlexNet [14] and VGG [19] follow this meta-architecture, with both containing two fully-connected layers of 4096 hidden units each. In both networks, these fully-connected layers hold over 80 percent of the parameters. In VGG, the hidden units contain 119,545,856 of the 138,357,544 total parameters, and in AlexNet the hidden units contain 54,534,144 out of the 62,378,344 total parameters.

Given the enormous computational costs for both training and running inference in these networks, we desire techniques that preserve high accuracy while reducing the number of parameters in the network. Notable work in this direction includes approaches to induce and exploit sparsity in the parameters during training [7].
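As a quick sanity check, the share of parameters held by the fully-connected layers follows directly from the counts quoted above:

```python
# Fraction of parameters held by the two fully-connected layers,
# using the parameter counts quoted above for VGG and AlexNet.
vgg_fc, vgg_total = 119_545_856, 138_357_544
alexnet_fc, alexnet_total = 54_534_144, 62_378_344
print(f"VGG:     {vgg_fc / vgg_total:.1%}")        # ~86.4%
print(f"AlexNet: {alexnet_fc / alexnet_total:.1%}")  # ~87.4%
```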
2. Tensor Contraction Layer
In this paper, we propose to incorporate the tensor contraction into convolutional neural networks as an end-to-end trainable layer, applying it to the third-order activation tensor output by the final convolutional layer.

In particular, given an activation tensor $\tilde{X}$ of size $(D_1, \cdots, D_N)$, we seek a low-dimensional core $\tilde{G}$ of smaller size $(R_1, \cdots, R_N)$ such that:

$$\tilde{G} = \tilde{X} \times_1 V^{(1)} \times_2 V^{(2)} \times \cdots \times_N V^{(N)} \qquad (5)$$

with $V^{(k)} \in \mathbb{R}^{R_k \times D_k}$, $k \in (1, \cdots, N)$.

We leverage this formulation and define a new layer that takes the activation tensor $\tilde{X}$ obtained from a previous layer and applies such a projection to it (Figure 1). We optimize the projection factors $\left(V^{(k)}\right)_{k \in [1, \cdots, N]}$ to obtain a low-dimensional projection of the activation tensor as the output of the layer. We learn the projection factors by backpropagation jointly with the rest of the network's parameters. We call this new layer the tensor contraction layer, and denote by size-$(R_1, \cdots, R_N)$ TCL, or TCL–$(R_1, \cdots, R_N)$, a TCL producing a contracted output of size $(R_1, \cdots, R_N)$.

The gradients with respect to each of the factors can be derived easily from (4). Specifically, for each $k \in 1, \cdots, N$, we use the following equivalences:

$$\frac{\partial \tilde{G}}{\partial V^{(k)}} = \frac{\partial \left( \tilde{X} \times_1 V^{(1)} \times_2 V^{(2)} \times \cdots \times_N V^{(N)} \right)}{\partial V^{(k)}}$$

$$\frac{\partial \tilde{G}_{[k]}}{\partial V^{(k)}} = \frac{\partial \left( V^{(k)} X_{[k]} \left( V^{(1)} \otimes \cdots \otimes V^{(k-1)} \otimes V^{(k+1)} \otimes \cdots \otimes V^{(N)} \right)^\top \right)}{\partial V^{(k)}}$$

In practice, with minibatch training, we might think of the first mode of an activation tensor as corresponding to the batch size. Technically, it is possible to apply a transformation along this dimension too, but we leave this consideration for future work. It is trivial to address this case by either starting the n-mode products at the second mode or by setting the first factor to be the identity and not optimizing over it. Therefore, in the remainder of the paper, we consider the activation tensor for a single sample for clarity, without loss of generality.

Figure 2 presents the symbolic graph of the tensor contraction layer. Note that when taking the n-mode product over different modes, the order in which the n-mode products are computed does not matter.

Figure 2. A representation of the symbolic graph of the Tensor Contraction Layer.
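For illustration, a minimal NumPy sketch of this forward pass (Equation 5) on a single sample; in the actual layer the factors are learned by backpropagation rather than fixed, and the shapes below are only illustrative:

```python
import numpy as np

def tcl_forward(activation, factors):
    """Forward pass of a Tensor Contraction Layer (Equation 5).

    `activation` has shape (D_1, ..., D_N) (a single sample, batch mode
    omitted); `factors` is a list of matrices V^(k), each of shape (R_k, D_k).
    Returns the contracted core of shape (R_1, ..., R_N).
    """
    core = activation
    for mode, v in enumerate(factors):
        # n-mode product along `mode`: contract dimension D_mode with V^(mode).
        core = np.moveaxis(np.tensordot(v, core, axes=(1, mode)), 0, mode)
    return core

# Example: contract a (256, 3, 3) activation to a (128, 3, 3) core.
x = np.random.randn(256, 3, 3)
factors = [np.random.randn(128, 256), np.random.randn(3, 3), np.random.randn(3, 3)]
print(tcl_forward(x, factors).shape)  # (128, 3, 3)
```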
In this section, we detail the number of parameters and the complexity of the tensor contraction layer.

Number of parameters
Let $\tilde{X}$ be an activation tensor of size $(D_1, \cdots, D_N)$ which we pass through a size-$(R_1, \cdots, R_N)$ tensor contraction layer. This TCL has a total of $\sum_{k=1}^{N} D_k \times R_k$ parameters (corresponding to the factors of the $N$ n-mode products) and produces as output a tensor of size $(R_1, \cdots, R_N)$.

By comparison, a fully-connected layer producing an output of the same size, i.e. with $H = \prod_{k=1}^{N} R_k$ hidden units, and taking the same (flattened) tensor as input, would have a total of $\prod_{k=1}^{N} D_k \times \prod_{k=1}^{N} R_k$ parameters.

Complexity
As previously noted, one way to look at the TCL is as a series of matrix multiplications between the factors of the contraction and the unfolded activation tensor. Consider the setting previously detailed, with an activation tensor $\tilde{X}$ of size $(D_1, \cdots, D_N)$ and a TCL–$(R_1, \cdots, R_N)$ of complexity $\mathcal{O}(C_{\mathrm{TCL}})$. We can write $C_{\mathrm{TCL}} = \sum_{k=1}^{N} C_k$, where $C_k$ is the complexity of the $k$-th n-mode product. Note that the order in which the products are taken does not matter due to the commutativity of the n-mode product over disjoint modes (e.g. it is commutative for $\tilde{X} \times_i U^{(i)} \times_j U^{(j)}$ as long as $i \neq j$). However, for illustrative purposes, we consider them to be done in order, from the first mode to the $N$-th. We then have:

$$C_k = R_k \times D_k \prod_{i=1}^{k-1} R_i \prod_{j=k+1}^{N} D_j \qquad (6)$$

It follows that the overall complexity of the TCL is:

$$C_{\mathrm{TCL}} = \sum_{k=1}^{N} \prod_{i=1}^{k} R_i \prod_{j=k}^{N} D_j \qquad (7)$$
Comparison with a fully-connected layer

A fully-connected layer with $H$ hidden units has complexity $\mathcal{O}(C_{\mathrm{FC}})$, with:

$$C_{\mathrm{FC}} = H \prod_{i=1}^{N} D_i \qquad (8)$$

| Method | Added TCL | 1st fully-connected | 2nd fully-connected | Accuracy (%) | Space savings (%) |
|---|---|---|---|---|---|
| Baseline | - | 4096 hidden units | 4096 hidden units | 65.41 | 0 |
| Added TCL | TCL–(256, 3, 3) | 4096 hidden units | 4096 hidden units | 65.53 | -0.25 |
| Added TCL | TCL–(192, 3, 3) | 3072 hidden units | 3072 hidden units | 65.92 | 43.28 |
| Added TCL | TCL–(128, 3, 3) | 2048 hidden units | 2048 hidden units | | |
Table 1. Results with AlexNet on CIFAR100. The first column presents the method, the second specifies whether a tensor contraction was added and, when this is the case, the size of the TCL. Columns 3 and 4 specify the number of hidden units in the fully-connected layers, or the size of the TCL used instead when relevant. Column 5 presents the top-1 accuracy on the test set. Finally, the last column presents the reduction in the number of parameters in the fully-connected layers (which represent more than 80% of the total number of parameters of the networks), where the reference is the original network without any modification (Baseline).
Consider a TCL that maintains the size of its input, i.e., for any $k$ in $[1 .. N]$, $R_k = D_k$. In other words, $C_k = D_k \prod_{i=1}^{N} D_i$. Therefore,

$$C_{\mathrm{TCL}} = \sum_{k=1}^{N} D_k \prod_{i=1}^{N} D_i \qquad (9)$$

By comparison, a fully-connected layer that also maintains the size of its input, i.e. $H = \prod_{k=1}^{N} D_k$, would have a complexity of:

$$C_{\mathrm{FC}} = \left( \prod_{i=1}^{N} D_i \right)^2 \qquad (10)$$

Notice the product in the fully-connected case versus a sum for the TCL case.
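To make the difference tangible, the following sketch evaluates Equations (6)-(8) for a size-preserving layer on an illustrative (256, 7, 7) activation (the shapes are our own choice for the example):

```python
import numpy as np

def tcl_flops(in_shape, out_shape):
    """Complexity of a TCL per Equations (6) and (7), contracting modes in order."""
    total = 0
    for k in range(len(in_shape)):
        done = int(np.prod(out_shape[:k]))     # modes already contracted (sizes R_i)
        todo = int(np.prod(in_shape[k + 1:]))  # modes not yet contracted (sizes D_j)
        total += out_shape[k] * in_shape[k] * done * todo  # C_k, Equation (6)
    return total

def fc_flops(in_shape, hidden_units):
    """Complexity of a fully-connected layer per Equation (8)."""
    return hidden_units * int(np.prod(in_shape))

shape = (256, 7, 7)
print(tcl_flops(shape, shape))               # 3,386,880   (the sum of Equation 9)
print(fc_flops(shape, int(np.prod(shape))))  # 157,351,936 (the square of Equation 10)
```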
We see several straightforward ways to incorporate the TCL into existing neural network architectures.

TCL as an Additional Layer
First, we can insert a tensor contraction layer following the last pooling layer, reducing the dimensionality of the activation tensor before feeding it to the subsequent two fully-connected layers and softmax output of the network. In general, flattening induces a loss of information. By applying tensor contraction we reduce dimensionality efficiently by leveraging the multidimensional dependencies in the activation tensor.
TCL as Replacement of a Fully-Connected Layer
We can also incorporate the TCL into existing architectures by completely replacing fully-connected layers. This has the advantage of significantly reducing the number of parameters in our model. Concretely, consider an activation tensor of size $(256, 7, 7)$ that is fed either to a fully-connected layer (after having been flattened) or to a TCL. A fully-connected layer with $H = 4096$ hidden units has $256 \times 7 \times 7 \times 4096 = 51{,}380{,}224$ parameters. A TCL that preserves the size of its input, on the other hand, only has $256 \times 256 + 7 \times 7 + 7 \times 7 = 65{,}634$ parameters: roughly 780 times fewer parameters than the fully-connected layer. Similarly, a TCL–$(128, 7, 7)$ (approximately half size) will have only $128 \times 256 + 7 \times 7 + 7 \times 7 = 32{,}866$ parameters, or roughly 1,560 times fewer parameters than the fully-connected layer.
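The arithmetic of this example is easy to reproduce (a quick check of the counts above, ignoring biases):

```python
import numpy as np

def fc_params(in_shape, hidden_units):
    """Parameters of a fully-connected layer applied to the flattened input."""
    return int(np.prod(in_shape)) * hidden_units

def tcl_params(in_shape, out_shape):
    """Parameters of a TCL: one (R_k x D_k) factor matrix per mode."""
    return sum(d * r for d, r in zip(in_shape, out_shape))

activation = (256, 7, 7)
print(fc_params(activation, 4096))           # 51,380,224
print(tcl_params(activation, (256, 7, 7)))   # 65,634  (size-preserving)
print(tcl_params(activation, (128, 7, 7)))   # 32,866  (half-size first mode)
```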
3. Experiments
Our experiments investigate the representational power of the TCL, demonstrating results on the CIFAR100 dataset [13]. Subsequently, we offer some preliminary results on the ImageNet 1k dataset [5]. We hypothesize that a TCL can efficiently represent an activation tensor for processing by subsequent layers of the network, allowing for a large reduction in parameters without a reduction in accuracy.

We conduct our investigation on CIFAR100 using the AlexNet [14] and VGG [19] architectures, each modified to take 32 × 32 images as inputs. We also present results with a traditional AlexNet on ImageNet. In all cases we report the accuracy (top-1) as well as the space saved, which we quantify as:

$$\text{space savings} = 1 - \frac{n_{\mathrm{TCL}}}{n_{\mathrm{original}}}$$

where $n_{\mathrm{original}}$ is the number of parameters in the fully-connected layers of the standard network and $n_{\mathrm{TCL}}$ is the number of parameters in the fully-connected layers of the network modified to include the TCL.

To avoid vanishing or exploding gradients, and to make the TCL more robust to changes in the initialization of the factors, we added a batch normalization layer [8] before and after the TCL.

| Method | Added TCL | 1st fully-connected | 2nd fully-connected | Accuracy (%) | Space savings (%) |
|---|---|---|---|---|---|
| Baseline | - | 4096 hidden units | 4096 hidden units | | 0 |
| Added TCL | TCL–(512, 3, 3) | 4096 hidden units | 4096 hidden units | | -0.73 |
| Added TCL | TCL–(384, 3, 3) | 3072 hidden units | 3072 hidden units | 68.56 | 42.99 |
| Added TCL | TCL–(256, 3, 3) | 2048 hidden units | 2048 hidden units | 67.57 | 74.35 |
| 1 TCL substitution | - | TCL–(512, 3, 3) | 4096 hidden units | 69.71 | 45.8 |
| 1 TCL substitution | - | TCL–(384, 3, 3) | 3072 hidden units | 68.83 | 69.16 |
| 1 TCL substitution | - | TCL–(256, 3, 3) | 2048 hidden units | 68.51 | 85.98 |
| 2 TCL substitutions | - | TCL–(512, 3, 3) | TCL–(512, 3, 3) | 67.20 | 97.27 |
| 2 TCL substitutions | - | TCL–(384, 3, 3) | TCL–(288, 3, 3) | 67.38 | |
Table 2. Results obtained on CIFAR100 using a VGG-19 network architecture with different variations of the Tensor Contraction Layer. In all cases we report top-1 accuracy and space savings with respect to the baseline. As observed with AlexNet, the TCL allows for large space savings with minimal impact on performance, and even improvement in some cases.
The CIFAR100 dataset is composed of 100 classes containing 600 32 × 32 images each, with 500 training images and 100 testing images per class. In all cases, we report performance on the testing set in terms of top-1 accuracy. We implemented all models using the MXNet library [2] and ran all experiments with data parallelism across multiple GPUs on Amazon Web Services, with two NVIDIA K80 GPUs.

Because both the original AlexNet and VGG architectures were defined for the ImageNet dataset, which has a larger input image size, we adapted them for CIFAR100 by adjusting the stride size of the input convolution layer of both networks so that they take 32 × 32 input images. We investigate two sets of experiments, described below.

Added TCL
In the first set of experiments, we add a TCL as an additional layer after the last pooling layer and perform the contraction along the two spatial modes of the image, leaving the modes corresponding to the channel and the batch size untouched. We gradually reduce the number of hidden units in the last two fully-connected layers, with and without the TCL included, and retrain the networks until convergence to demonstrate how the TCL can learn more compact representations without compromising accuracy.
TCL substitution
In this case, we completely replace one or both of the fully-connected layers by a tensor contraction layer. We reduce the number of hidden units in the subsequent layers proportionally to the reduction in the size of the activation tensor.
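In both setups, the TCL itself is the same building block: a set of per-mode factors trained jointly with the rest of the network, wrapped in batch normalization as described above. Our experiments were implemented in MXNet; purely as an illustration, here is a PyTorch-style sketch of such a block (the class name, initialization scale, and einsum-based contraction are our own choices, not the original implementation):

```python
import torch
import torch.nn as nn

class TCL(nn.Module):
    """Tensor Contraction Layer for (batch, C, H, W) activations,
    with batch normalization before and after, as in the experiments."""

    def __init__(self, in_shape, out_shape):
        super().__init__()
        (c, h, w), (rc, rh, rw) = in_shape, out_shape
        # One learnable factor per non-batch mode, trained jointly by backprop.
        self.v_c = nn.Parameter(0.02 * torch.randn(rc, c))
        self.v_h = nn.Parameter(0.02 * torch.randn(rh, h))
        self.v_w = nn.Parameter(0.02 * torch.randn(rw, w))
        self.bn_in = nn.BatchNorm2d(c)
        self.bn_out = nn.BatchNorm2d(rc)

    def forward(self, x):
        x = self.bn_in(x)
        # Three n-mode products (Equation 5), with the batch mode left untouched.
        g = torch.einsum('bchw,ic,jh,kw->bijk', x, self.v_c, self.v_h, self.v_w)
        return self.bn_out(g)

# Example: contract AlexNet's (256, 3, 3) activation to (128, 3, 3).
layer = TCL((256, 3, 3), (128, 3, 3))
print(layer(torch.randn(8, 256, 3, 3)).shape)  # torch.Size([8, 128, 3, 3])
```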
Network architectures
We experimented with an AlexNet with an adjusted stride and filter size in the final convolutional layer. From the last convolutional layer, we get an activation tensor of size (batch size, 256, 3, 3). Similarly, in the case of the VGG network, we obtain activation tensors of size (batch size, 512, 3, 3). We experiment with several variations of the tensor contraction layer. First, we consider the case where we project the activations to a tensor of identical shape. Additionally, we evaluate the effect of reducing the dimensionality of the activation tensor by 25% and by 50%. For AlexNet, because the spatial modes are already compact, we preserve the spatial dimensions and reduce dimensionality along the channel mode.

Table 1 summarizes our results on CIFAR100 using AlexNet, while results with VGG are presented in Table 2. The first column presents the method, the second specifies whether a tensor contraction was added and, when this is the case, the size of the contracted core. Columns 3 and 4 specify the number of hidden units in the fully-connected layers, or the size of the TCL used instead when relevant. Column 5 presents the top-1 accuracy on the validation set. Finally, the last column presents the reduction in the number of parameters in the fully-connected layers (which represent, as previously mentioned, more than 80% of the total number of parameters of the networks), where the reference is the original network without any modification (Baseline).
A first observation is that adding a tensor contraction layer (Added TCL in Tables 1 and 2) consistently increases performance while having minimal impact on the overall number of parameters. Replacing the first fully-connected layer (1 TCL substitution in the tables) allows us to substantially reduce the number of parameters in the fully-connected layers (up to 85.98% space savings in Table 2) while observing the same performance as the original network. By replacing both fully-connected layers (2 TCL substitutions in the tables), we can obtain a space savings of more than 97% with only a small decrease in performance.

We now present preliminary experiments using the larger ILSVRC 2012 (ImageNet) dataset [5], using the AlexNet architecture. ImageNet is composed of 1.2 million images for training and 50,000 for validation, and comprises 1,000 labeled classes. For these experiments, we trained each network simultaneously on 4 NVIDIA K80 GPUs using data parallelism and report preliminary results. We report top-1 accuracy on the validation set, across all 1,000 classes. All experiments were run using the same settings.

| Method | Additional TCL | 1st fully-connected | 2nd fully-connected | Accuracy (%) | Space savings (%) |
|---|---|---|---|---|---|
| Baseline | - | 4096 hidden units | 4096 hidden units | 56.29 | 0 |
| Added TCL | TCL–(256, 5, 5) | 4096 hidden units | 4096 hidden units | | -0.11 |
| Added TCL | TCL–(200, 5, 5) | 3276 hidden units | 3276 hidden units | 56.11 | 35.36 |
| TCL substitution | - | TCL–(256, 5, 5) | 4096 hidden units | 56.57 | |

Table 3. Results obtained with AlexNet on ImageNet, for a standard AlexNet (Baseline), with an added Tensor Contraction Layer (Added TCL) and by replacing the first fully-connected layer with a TCL (TCL substitution). Simply adding the TCL results in higher performance while having a minimal impact on the number of parameters in the fully-connected layers. By reducing the size of the TCL or using a TCL to replace a fully-connected layer, we can obtain a space savings of more than 35% with virtually no deterioration in performance.

Network architecture
We use a standard AlexNet [14]. From the last convolutional layer, we get an activation tensor of size (batch size, 256, 5, 5). As in the CIFAR100 case, we experiment with several variations of the tensor contraction layer. We first insert a TCL before the fully-connected layers, either a size-preserving TCL (i.e. projecting to a tensor of the same size) or a smaller-size TCL with a proportionally smaller number of hidden units in the subsequent fully-connected layers. We then experiment with completely replacing the first fully-connected layer with a TCL.

In Table 3 we summarize the results for a standard AlexNet (Baseline, first row), with an added tensor contraction layer (Added TCL) that preserves the dimensionality of its input (row 2) or reduces it (row 3). We also report results for substituting the first fully-connected layer with a TCL (TCL substitution, last row). Simply adding the TCL improves performance while the increase in the number of parameters in the fully-connected layers is negligible. We can obtain similar performance by first adding a TCL to reduce the dimensionality of the activation tensor and reducing the number of hidden units in the fully-connected layers, leading to a large space saving with virtually no decrease in performance. Replacing the first fully-connected layer with a size-preserving TCL results in a similar space savings while maintaining the same performance as the standard network.
4. Discussion
We introduced a new neural network layer that performs a tensor contraction on an activation tensor to yield a low-dimensional representation of it. By exploiting the natural multi-linear structure of the data in the activation tensor, where each mode corresponds to a distinct modality (i.e. the dimensions of the image and the channels), we are able to decrease the size of the data representation passed to subsequent layers in the network without compromising accuracy on image recognition tasks.

The biggest practical contribution of the TCL is the drastic reduction in the number of parameters with little to no performance penalty. This also allows neural networks to perform faster inference with fewer parameters by increasing their representational power. We demonstrated this via the performance of TCLs on the widely used CIFAR100 dataset with two established architectures, namely AlexNet and VGG. We also showed results with AlexNet on the ImageNet dataset. Our proposed tensor contraction layer seems to be able to capture the underlying structure in the activation tensor and improve performance when added to an existing network. When we replace fully-connected layers with TCLs, we significantly reduce the number of parameters and nevertheless maintain (or in some cases even improve) performance.

Going forward, we plan to extend our work to more network architectures, especially in settings where raw data or learned representations exhibit natural multi-modal structure that we might capture via high-order tensors. We also endeavor to advance our experimental study of TCLs for large-scale, high-resolution vision datasets. Given the time required to train a large network on such datasets, we are investigating ways to reduce the dimension of the tensor contractions of an already trained model and simply fine-tune. In addition, recent work [18] has shown that new extended BLAS primitives can avoid the transpositions needed to compute tensor contractions; this will further speed up the computations, and we plan to implement it in the future. Furthermore, we will look into methods to induce and exploit sparsity in the TCL, to understand the parameter reductions this method can yield over existing state-of-the-art pruning methods. Finally, we are working on an extension to the TCL: a tensor regression layer to replace both the fully-connected and final output layers, potentially yielding increased accuracy with even greater parameter reductions.
References

[1] A. Anandkumar, R. Ge, D. J. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.
[2] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
[3] Y. Chen, X. Jin, B. Kang, J. Feng, and S. Yan. Sharing residual units through collective tensor factorization in deep neural networks. 2017.
[4] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. CoRR, abs/1509.05009, 2015.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[6] B. D. Haeffele and R. Vidal. Global optimality in tensor factorization, deep learning, and beyond. CoRR, abs/1506.07540, 2015.
[7] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representations (ICLR), 2016.
[8] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
[9] M. Janzamin, H. Sedghi, and A. Anandkumar. Generalization bounds for neural networks through tensor factorization. CoRR, abs/1506.08473, 2015.
[10] A. Karatzoglou, X. Amatriain, L. Baltrunas, and N. Oliver. Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 79–86. ACM, 2010.
[11] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. CoRR, abs/1511.06530, 2015.
[12] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[13] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[15] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. CoRR, abs/1412.6553, 2014.
[16] A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov. Tensorizing neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS'15, pages 442–450, 2015.
[17] H. Sedghi and A. Anandkumar. Training input-output recurrent neural networks through spectral methods. CoRR, abs/1603.00954, 2016.
[18] Y. Shi, U. N. Niranjan, A. Anandkumar, and C. Cecka. Tensor contractions with extended BLAS kernels on CPU and GPU. Pages 193–202, Dec. 2016.
[19] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[20] Y. Yang and T. M. Hospedales. Deep multi-task representation learning: A tensor factorisation approach.