Functional Hashing for Compressing Neural Networks
Lei Shi, Shikun Feng, Zhifan Zhu
Baidu, Inc. {shilei06, fengshikun, zhuzhifan}@baidu.com
Abstract
As the complexity of deep neural networks (DNNs) grows to absorb increasingly large datasets, memory and energy consumption have been receiving more and more attention in industrial applications, especially on mobile devices. This paper presents a novel structure based on functional hashing to compress DNNs, namely FunHashNN. For each entry in a deep net, FunHashNN uses multiple low-cost hash functions to fetch values in the compression space, and then employs a small reconstruction network to recover that entry. The reconstruction network is plugged into the whole network and trained jointly. FunHashNN includes the recently proposed HashedNets [7] as a degenerated case, and benefits from larger value capacity and less reconstruction loss. We further discuss extensions with dual space hashing and multi-hops. On several benchmark datasets, FunHashNN demonstrates high compression ratios with little loss on prediction accuracy.
Deep neural networks (DNNs) have been achieving ubiquitous success in a wide range of applications, from computer vision [22] to speech recognition [17], natural language processing [8], and domain adaptation [13]. As the sizes of data mount up, people usually have to increase the number of parameters in DNNs so as to absorb the vast volume of supervision. High performance computing techniques have been investigated to speed up DNN training, concerning optimization algorithms, parallel synchronization on clusters with or without GPUs, and stochastic binarization/ternarization, etc. [10, 12, 27].

On the other hand, the memory and energy consumption is usually, if not always, constrained in industrial applications [21, 35]. For instance, for commercial search engines (e.g., Google and Baidu) and recommendation systems (e.g., NetFlix and YouTube), the ratio between the increased model size and the improved performance should be considered given limited online resources. Compressing the model size becomes even more important for applications on mobile and embedded devices [15, 21]. Having DNNs run inside mobile apps offers many attractive features such as better privacy, less network bandwidth and real-time processing. However, the energy consumption of battery-constrained mobile devices is usually dominated by memory access, which would be greatly reduced if a DNN model could fit in on-chip storage rather than DRAM storage (cf. [15, 16] for details).

A recent line of studies is thus motivated to compress the size of DNNs while mostly keeping their predictive performance [15, 21, 35]. With different intuitions, there are mainly two types of DNN compression methods, which could be used in conjunction for better parameter savings. The first type revises the training target into more informative supervision using dark knowledge. Specifically, Hinton et al. [18] suggested training a large network ahead, and distilling a much smaller model on a combination of the original labels and the soft output of the large net. The second type observes the redundancy existing in network weights [7, 11], and exploits techniques to constrain or reduce the number of free parameters in DNNs during learning. This paper focuses on the latter type.

To constrain the network redundancy, previous efforts [11, 25, 35] formulated an original weight matrix into either low-rank or Fastfood decompositions. Moreover, [15, 16] proposed a simple yet effective pruning-retraining iteration during training, followed by quantization and fine-tuning. Chen et al. [7] proposed HashedNets to efficiently implement parameter sharing prior to learning, and showed notable compression with much less loss of accuracy than low-rank decomposition. More precisely, prior to training, a hash function is used to randomly group (virtual) weights into a small number of buckets, so that all weights mapped into one hash bucket directly share a same value. HashedNets was further deliberated in the frequency domain for compressing convolutional neural networks in [6].

In applications, we observe that HashedNets compresses model sizes greatly at marginal loss of accuracy in some situations, whereas it also significantly loses accuracy in others. After revisiting its mechanism, we conjecture this instability comes from at least three factors. First, hashing and training are disjoint in a two-phase manner, i.e., once inappropriate collisions exist, there may be little room left for optimization during training.
Second, a single hash function is used to fetch a single value in the compression space, whose collision risk is larger than that of multiple hashes [4]. Third, parameter sharing within a bucket implicitly uses an identity mapping from the hashed value to the virtual entry.

This paper proposes an approach to relieve this instability, still in a two-phase style to preserve efficiency. Specifically, we use multiple hash functions [4] to map each virtual entry into multiple values in the compression space. Then an additional network plays the role of a mapping function from these hashed values to the virtual entry before hashing, which can also be regarded as "reconstructing" the virtual entry from its multiple hashed values. Plugged into and jointly trained within the original network, the reconstruction network is of a comparably negligible size, i.e., at low memory cost.

This functional hashing structure includes HashedNets as a degenerated special case, and facilitates fewer value collisions and better value reconstruction. Shortly denoted as FunHashNN, our approach can be further extended with dual space hashing and multi-hops. Since it imposes no restriction on other network design choices (e.g., dropout and weight sparsification), FunHashNN can be considered as a standard tool for DNN compression. Experiments on several datasets demonstrate promisingly larger reduction of model sizes and/or less loss on prediction accuracy, compared with HashedNets.

Notations.
Throughout this paper we express scalars in regular font ($A$ or $b$), vectors in bold ($\mathbf{x}$), and matrices in capital bold ($\mathbf{X}$). Furthermore, we use $x_i$ to represent the $i$-th dimension of vector $\mathbf{x}$, and $X_{ij}$ to represent the $(i,j)$-th entry of matrix $\mathbf{X}$. Occasionally, $[\mathbf{x}]_i$ is also used to represent the $i$-th dimension of vector $\mathbf{x}$ for specification clarity. Notation $\mathbb{E}[\cdot]$ stands for the expectation operator.

Feed Forward Neural Networks.
We define the forward propagation of the $\ell$-th layer as
$$ a^{\ell+1}_i = f(z^{\ell+1}_i), \quad \text{with } z^{\ell+1}_i = b^{\ell+1}_i + \sum_{j=1}^{d^\ell} V^{\ell}_{ij}\, a^{\ell}_j, \quad \forall i \in [1, d^{\ell+1}]. \qquad (1) $$
For each $\ell$-th layer, $d^\ell$ is the output dimensionality, $b^\ell$ is the bias vector, and $V^\ell$ is the (virtual) weight matrix of the $\ell$-th layer. Vectors $z^\ell, a^\ell \in \mathbb{R}^{d^\ell}$ denote the units before and after the activation function $f(\cdot)$. Typical choices of $f(\cdot)$ include the rectified linear unit (ReLU) [29], sigmoid and tanh [2].

Feature Hashing has been studied as a dimension reduction method that reduces model storage size without maintaining mapping matrices as in random projection [32, 34]. Briefly, it maps an input vector $x \in \mathbb{R}^n$ to a much smaller feature space via $\phi: \mathbb{R}^n \to \mathbb{R}^K$ with $K \ll n$. Following the definition in [34], the mapping $\phi$ is a composite of two approximately uniform hash functions $h: \mathbb{N} \to \{1, \ldots, K\}$ and $\xi: \mathbb{N} \to \{-1, +1\}$. The $j$-th element of $\phi(x)$ is defined as
$$ [\phi(x)]_j = \sum_{i:\, h(i)=j} \xi(i)\, x_i. \qquad (2) $$
As shown in [34], a key property is its inner product preservation, which we quote and restate below.

Lemma [Inner Product Preservation of Original Feature Hashing].
With the hash defined by Eq. (2), the hash kernel is unbiased, i.e., $\mathbb{E}_\phi[\phi(x)^\top \phi(y)] = x^\top y$. Moreover, the variance is $\mathrm{var}_{x,y} = \frac{1}{K}\sum_{i \neq j} \big(x_i^2 y_j^2 + x_i y_i x_j y_j\big)$, and thus $\mathrm{var}_{x,y} = O(1/K)$ if $\|x\|_2 = \|y\|_2 = \text{const}$.
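To make Eq. (2) and the lemma concrete, below is a minimal NumPy sketch (not the authors' code) of the hashing trick. The helper names and the use of Python's built-in `hash` as a stand-in for an approximately uniform hash are assumptions for illustration; the paper adopts xxHash.

```python
import numpy as np

K = 64  # size of the hashed feature space (assumed for illustration)

def h(i, seed=0):
    """Bucket hash h : N -> {0, ..., K-1} (0-indexed stand-in for Eq. (2))."""
    return hash((seed, 'h', i)) % K  # a real implementation would use a stable hash such as xxHash

def xi(i, seed=0):
    """Sign hash xi : N -> {-1, +1}, which keeps the hash kernel unbiased."""
    return 1 if hash((seed, 'xi', i)) % 2 == 0 else -1

def phi(x, seed=0):
    """Hashing trick of Eq. (2): [phi(x)]_j = sum_{i: h(i)=j} xi(i) * x_i."""
    out = np.zeros(K)
    for i, x_i in enumerate(x):
        out[h(i, seed)] += xi(i, seed) * x_i
    return out

# Empirical check of the lemma: E_phi[phi(x)^T phi(y)] ~= x^T y
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + 0.1 * rng.normal(size=1000)
estimates = [phi(x, s) @ phi(y, s) for s in range(200)]
print(np.mean(estimates), x @ y)  # the two numbers should be close
```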
HashedNets in [7].
As illustrated in Figure 1(a), HashedNets randomly maps network weights into a smaller number of groups prior to learning, and the weights within a group share a same value thereafter. A naive implementation could be trivially achieved by maintaining a secondary matrix that records the group assignments, however at the expense of additional memory cost. HashedNets instead adopts a hash function that requires no storage cost with the model. Assume there is a finite memory budget $K^\ell$ per layer to represent $V^\ell$, with $K^\ell \ll (d^\ell + 1)\, d^{\ell+1}$. We only need to store a weight vector $w^\ell \in \mathbb{R}^{K^\ell}$, and assign $V^\ell_{ij}$ an element in $w^\ell$ indexed by a hash function $h^\ell(i,j)$, namely
$$ V^{\ell}_{ij} = \xi^{\ell}(i,j) \cdot w^{\ell}_{h^{\ell}(i,j)}, \qquad (3) $$
where the hash function $h^\ell(i,j)$ outputs an integer within $[1, K^\ell]$. Another independent hash function $\xi^\ell(i,j): (d^{\ell+1} \times d^\ell) \to \pm 1$ outputs a sign factor, aiming to reduce the bias due to hash collisions [34]. The resulting matrix $V^\ell$ is virtual, since $d^\ell$ could be increased without increasing the actual number of parameters in $w^\ell$ once the compression space size $K^\ell$ is determined and fixed.

Substituting Eq. (3) into Eq. (1), we have $z^{\ell+1}_i = b^{\ell+1}_i + \sum_{j=1}^{d^\ell} \xi^\ell(i,j)\, w^\ell_{h^\ell(i,j)}\, a^\ell_j$. During training, $w^\ell$ is updated by back-propagating the gradient via $z^{\ell+1}$ (and the virtual $V^\ell$). Besides, the activation function $f(\cdot)$ in Eq. (1) was kept as ReLU in [7] to further relieve the hash collision effect through a sparse feature space. In both [7] and this paper, the open-source xxHash (http://cyan4973.github.io/xxHash/) is adopted as an approximately uniform hash implementation with low cost.

Figure 1: Illustrations of hashing approaches for neural network compression. (a) HashedNets [7]. (b) Our FunHashNN. (c) Our FunHashNN with dual space hashing. (Best viewed in color.)
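The sketch below is a minimal illustration of the weight sharing in Eq. (3), assuming a simple integer hash as a stand-in for xxHash; the names and sizes are illustrative and not the authors' implementation.

```python
import numpy as np

K = 16            # memory budget per layer (assumed)
d_in, d_out = 8, 8
w = np.random.randn(K)   # the only stored weights of this layer

def h(i, j):
    """Stand-in bucket hash h(i, j) -> {0, ..., K-1} (the paper uses xxHash)."""
    return hash(('h', i, j)) % K

def xi(i, j):
    """Sign hash xi(i, j) -> {-1, +1}, reducing collision bias."""
    return 1 if hash(('xi', i, j)) % 2 == 0 else -1

def virtual_weight(i, j):
    """Eq. (3): V_ij = xi(i, j) * w[h(i, j)]; V is never materialized in full."""
    return xi(i, j) * w[h(i, j)]

# Forward pass of one HashedNets layer: z_i = b_i + sum_j V_ij * a_j
a = np.random.randn(d_in)
b = np.zeros(d_out)
z = np.array([b[i] + sum(virtual_weight(i, j) * a[j] for j in range(d_in))
              for i in range(d_out)])
print(z.shape)  # (8,)
```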
Functional Hashing for Neural Network Compression

For clarity, we focus on a single layer throughout and drop the superscript $\ell$. Still, vector $w \in \mathbb{R}^K$ denotes the parameters in the compression space. The key difference between FunHashNN and HashedNets [7] lies in (i) how to employ hash functions, and (ii) how to map from $w$ to $V$:

• Instead of adopting one pair of hash functions $(h, \xi)$ as in Eq. (3), we use a set of multiple pairs of independent random hash functions. Say there are $U$ pairs of mappings $\{h_u, \xi_u\}_{u=1}^U$; each $h_u(i,j)$ outputs an integer within $[1, K]$, and each $\xi_u(i,j)$ selects a sign factor.

• Eq. (3) of HashedNets employs an identity mapping between one element of $V$ and one hashed value, i.e., $V_{ij} = \xi(i,j)\, w_{h(i,j)}$. In contrast, we use a multivariate function $g(\cdot)$ to describe the mapping from the multiple hashed values $\{\xi_u(i,j)\, w_{h_u(i,j)}\}_{u=1}^U$ to $V_{ij}$. Specifically,
$$ V_{ij} = g\big(\big[\xi_1(i,j)\, w_{h_1(i,j)}, \ldots, \xi_U(i,j)\, w_{h_U(i,j)}\big];\; \alpha\big). \qquad (4) $$
Therein, $\alpha$ refers to the parameters of $g(\cdot)$. Note that the input $\xi_u(i,j)\, w_{h_u(i,j)}$ is order sensitive from $u = 1$ to $U$. We choose $g(\cdot)$ to be a multi-layer feed-forward neural network, and other multivariate functions may be considered as alternatives.

As a whole, Figure 1(b) illustrates our FunHashNN structure, which can be easily plugged into any matrices of DNNs. Note that $\alpha$ in the reconstruction network $g(\cdot)$ is of a much smaller size compared to $w$. For instance, a setting with $U = 4$ and a 1-layer $g(\cdot;\alpha)$ with only a handful of parameters in $\alpha$ already performs well enough in experiments. In other words, Eq. (4) uses only a negligible amount of additional memory to describe a functional $w$-to-$V$ mapping, whose properties will be further explained in the sequel.

The parameters in need of updating include $w$ in the compression space and $\alpha$ in $g(\cdot)$. Training FunHashNN is equivalent to training a standard neural network, except that we need to forward/backward-propagate values related to $w$ through $g(\cdot)$ and the virtual matrix $V$.
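As a concrete illustration of Eq. (4), the sketch below reconstructs one virtual entry $V_{ij}$ from $U$ hashed values with a tiny one-hidden-layer $g(\cdot;\alpha)$. The layer sizes and the stand-in hash are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

K, U, H = 16, 4, 8   # compression size, number of hash pairs, hidden width of g (assumed)
rng = np.random.default_rng(0)

w = rng.normal(size=K)                      # shared compression-space weights
alpha = {'W1': rng.normal(size=(H, U)),     # parameters of the reconstruction net g(.; alpha)
         'b1': np.zeros(H),
         'W2': rng.normal(size=(1, H)),
         'b2': np.zeros(1)}

def h_u(u, i, j):
    """u-th bucket hash, a stand-in for U independently seeded xxHash calls."""
    return hash((u, 'h', i, j)) % K

def xi_u(u, i, j):
    """u-th sign hash."""
    return 1 if hash((u, 'xi', i, j)) % 2 == 0 else -1

def reconstruct_V(i, j):
    """Eq. (4): V_ij = g([xi_1 w_{h_1}, ..., xi_U w_{h_U}]; alpha)."""
    x = np.array([xi_u(u, i, j) * w[h_u(u, i, j)] for u in range(U)])  # ordered inputs
    hidden = np.tanh(alpha['W1'] @ x + alpha['b1'])                    # small MLP g
    return float(alpha['W2'] @ hidden + alpha['b2'])

print(reconstruct_V(3, 5))
```

Setting all entries of `W1` that touch inputs other than the first one to zero reduces this to a function of a single hashed value, which is the HashedNets-like degenerate case discussed later.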
Forward Propagation.
Substituting Eq. (4) into Eq. (1), we again omit the superscript $\ell$ and get
$$ z_i = b_i + \sum_{j=1}^{d} a_j\, V_{ij} = b_i + \sum_{j=1}^{d} a_j \cdot g\big(\big[\xi_1(i,j)\, w_{h_1(i,j)}, \ldots, \xi_U(i,j)\, w_{h_U(i,j)}\big];\; \alpha\big). \qquad (5) $$
Backward Propagation.
Denote $\mathcal{L}$ as the final loss function, e.g., cross entropy or squared loss, and suppose $\delta_i = \partial \mathcal{L} / \partial z_i$ is already available by back-propagation from the layers above. The derivatives of $\mathcal{L}$ with respect to $w$ and $\alpha$ are computed by
$$ \frac{\partial \mathcal{L}}{\partial w} = \sum_i \sum_j a_j\, \delta_i\, \frac{\partial V_{ij}}{\partial w}, \qquad \frac{\partial \mathcal{L}}{\partial \alpha} = \sum_i \sum_j a_j\, \delta_i\, \frac{\partial V_{ij}}{\partial \alpha}, \qquad (6) $$
where, since we choose $g(\cdot)$ to be a multilayer neural network, the derivatives $\partial V_{ij}/\partial w$ and $\partial V_{ij}/\partial \alpha$ can be calculated through the small network $g(\cdot)$ in a standard back-propagation manner.
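In practice, Eq. (6) need not be derived by hand: once $V$ is expressed as a differentiable function of $w$ and $\alpha$, any autograd framework propagates gradients through $g(\cdot)$ automatically. The PyTorch-style layer below is a minimal sketch under assumed sizes, not the authors' released code; here the hash outputs are precomputed into index and sign tables, one of the two options discussed under Complexity below.

```python
import torch
import torch.nn as nn

class FunHashLinear(nn.Module):
    """A linear layer whose virtual weight V (d_out x d_in) is reconstructed
    from a small vector w via U hash lookups and a tiny network g (Eq. (4))."""

    def __init__(self, d_in, d_out, K=64, U=4, hidden=8):
        super().__init__()
        self.w = nn.Parameter(torch.randn(K) * 0.1)        # compression space
        self.g = nn.Sequential(nn.Linear(U, hidden),       # reconstruction net g(.; alpha)
                               nn.Tanh(),
                               nn.Linear(hidden, 1))
        self.bias = nn.Parameter(torch.zeros(d_out))
        # Hashes are fixed before training; precompute index and sign tables.
        gen = torch.Generator().manual_seed(0)
        self.register_buffer('idx', torch.randint(K, (d_out, d_in, U), generator=gen))
        self.register_buffer('sign', torch.randint(0, 2, (d_out, d_in, U),
                                                   generator=gen).float() * 2 - 1)

    def forward(self, a):
        hashed = self.sign * self.w[self.idx]   # (d_out, d_in, U) ordered hashed inputs
        V = self.g(hashed).squeeze(-1)          # virtual weight matrix, Eq. (4)
        return a @ V.t() + self.bias            # Eq. (5)

# Gradients of Eq. (6) w.r.t. w and alpha are produced by autograd:
layer = FunHashLinear(d_in=20, d_out=10)
loss = layer(torch.randn(5, 20)).pow(2).mean()
loss.backward()
print(layer.w.grad.shape)  # torch.Size([64])
```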
Complexity.
Concerning time and memory cost, FunHashNN has roughly the same complexity as HashedNets, since the small network $g(\cdot)$ is quite lightweight. One key variable factor is the way the multiple hash functions are implemented. On one hand, if they are calculated online, FunHashNN requires little additional time provided they are handled in parallel. On the other hand, if they are pre-computed and stored in dictionaries to avoid hashing time cost, the multiple hash functions of FunHashNN demand more storage space. In applications, we suggest pre-computing hashes during offline training for speedup, and computing hashes in parallel during online prediction to save memory under a limited budget.

Property Analysis

In this part, we depict the properties of our FunHashNN from several aspects to help understand it, especially in comparison with HashedNets [7].
Value Collision.
It should be noted that both HashedNets and FunHashNN conduct hashing prior to training, i.e., in a two-phase manner. Consequently, results would be unsatisfactory if hashing collisions happen among important values. For instance, in natural language processing tasks, one may observe weird results if there are many hashing collisions among the embeddings (which form a matrix) of frequent words, especially when these words are not related at all. In the literature, multiple hash functions are known to perform better than one single function [1, 4, 5]. Intuitively, when we have multiple hash functions, the items colliding in one function are hashed differently by the other hash functions.
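A quick back-of-the-envelope simulation (not from the paper) illustrates this intuition: with $U$ independent hashes, two entries rarely collide under all of them at once, so the reconstruction network $g(\cdot)$ still receives distinguishing inputs. The bucket count and trial count below are arbitrary.

```python
import numpy as np

K, U, trials = 64, 4, 100_000
rng = np.random.default_rng(0)

# For two distinct entries, the chance of landing in the same bucket is ~1/K per
# hash; the chance of colliding under *all* U independent hashes is ~(1/K)**U.
single = rng.integers(K, size=(trials, 2))
multi = rng.integers(K, size=(trials, U, 2))

p_single = np.mean(single[:, 0] == single[:, 1])
p_all_U = np.mean(np.all(multi[:, :, 0] == multi[:, :, 1], axis=1))
print(f"collision with 1 hash:      {p_single:.4f} (~1/K = {1/K:.4f})")
print(f"collision with all {U} hashes: {p_all_U:.6f} (~(1/K)^U = {(1/K)**U:.2e})")
```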
Value Reconstruction.
In both HashedNets and FunHashNN, the hashing trick can be viewed as a reconstruction of the original parameter $V$ from $w \in \mathbb{R}^K$. In this sense, the approach with a lower reconstruction error is preferred. We then have at least the following two observations:

• The maximum number of possible distinct values output by hashing intuitively reflects the modelling capability [32]. For HashedNets, taking the sign hashing function $\xi(\cdot)$ into account, Eq. (3) can represent at most $2K$ distinct values for the elements of $V$. In contrast, since there are multiple ordered hashed inputs, FunHashNN has at most $(2K)^U$ possible distinct values for Eq. (4). Note that the memory size $K$ is the same for both.

• The reconstruction error may be difficult to analyze directly, since the hashing mechanism is trained jointly within the whole network. However, we observe that $g\big(\big[\xi_1(i,j)\, w_{h_1(i,j)}, \ldots, \xi_U(i,j)\, w_{h_U(i,j)}\big]; \alpha\big)$ degenerates to $g(\xi_1(i,j)\, w_{h_1(i,j)})$ if we assign zeros to all entries in $\alpha$ unrelated to the first input dimension. Since $g(\xi_1(i,j)\, w_{h_1(i,j)})$ depends only on a single pair of hash functions, it is conceptually equivalent to HashedNets. Consequently, since it includes HashedNets as a special case, FunHashNN with freely adjustable $\alpha$ is able to reach a lower reconstruction error and thus fit the final accuracy better.
Feature Hashing.
In line with previous work [32, 34], we compare HashedNets and FunHashNN in terms of feature hashing. For clarity of exposition, we drop the sign hashing functions $\xi(\cdot)$ below for both methods; the analysis including them is straightforward by replacing $K$ hereafter with $2K$.

• For HashedNets, one first defines a hash mapping function $\phi^{(1)}_i(a)$, whose $k$-th element is
$$ \big[\phi^{(1)}_i(a)\big]_k \triangleq \sum_{j:\, h(i,j)=k} a_j, \quad \text{for } k = 1, \ldots, K. \qquad (7) $$
Thus $z_i$ by HashedNets can be computed as the inner product (details cf. Section 4.3 in [7])
$$ z_i = w^\top \phi^{(1)}_i(a). \qquad (8) $$

• For FunHashNN, we first define a hash mapping function $\phi^{(2)}_i(a)$. Different from the $K$-dimensional output in Eq. (7), it is of a much larger size $K^U$, with the $\big(\sum_{u=1}^U k_u K^{(u-1)}\big)$-th element defined as
$$ \big[\phi^{(2)}_i(a)\big]_{\sum_{u=1}^U k_u K^{(u-1)}} \triangleq \sum_{j:\, h_1(i,j)=k_1,\ \ldots,\ h_U(i,j)=k_U} a_j, \quad \text{for } \forall u,\ k_u = 1, \ldots, K. \qquad (9) $$
Correspondingly, we define a vector $g_\alpha(w)$, also of length $K^U$, whose $\big(\sum_{u=1}^U k_u K^{(u-1)}\big)$-th entry is
$$ \big[g_\alpha(w)\big]_{\sum_{u=1}^U k_u K^{(u-1)}} \triangleq g(w_{k_1}, w_{k_2}, \ldots, w_{k_U};\; \alpha), \quad \text{for } \forall u,\ k_u = 1, \ldots, K. \qquad (10) $$
Thus $z_i$ by FunHashNN can be computed as the following inner product:
$$ z_i = g_\alpha(w)^\top \phi^{(2)}_i(a). \qquad (11) $$
The difference between Eq. (8) and Eq. (11) further explains the above discussion about "the maximum number of possible distinct values". (One might argue that there exists redundancy in $V$; here we could imagine $V$ is already structured and filled by values with the least redundancy.)

Dual Space Hashing.
Considering a linear model $f(x; \theta) = \theta^\top x$, one can not only carry out analysis such as Bayesian treatment or hashing on the input feature space of $x$, but also do so similarly on the dual space of $\theta$ [3]. We now revisit the "reconstruction" network $g(x_{ij}; \alpha)$ in Eq. (4), where the vector $x_{ij}$ concatenates the hashed values $\xi_u(i,j)\, w_{h_u(i,j)}$ for $u = 1, \ldots, U$. What Eq. (4) does is in fact hashing $(i,j)$ through $w$ to obtain the input features of $g(\cdot)$. In analogy, we can also hash $(i,j)$ to fetch the parameters of $g(\cdot)$, namely we have a new "reconstruction" network of the following form:
$$ V_{ij} = g(x_{ij};\; \alpha_{ij}), \quad \text{with } [x_{ij}]_u = \xi_u(i,j)\, w_{h_u(i,j)} \ \text{ and } \ [\alpha_{ij}]_r = \xi'_r(i,j)\, w'_{h'_r(i,j)}, \qquad (12) $$
where $\{\xi'_r(\cdot), h'_r(\cdot)\}$ are additional multiple pairs of hash functions applied to $\alpha$, and $w'$ is an additional vector in the compression space of $\alpha$. The size of $\alpha_{ij}$ remains the same as before. Using this trick, the maximum number of possible distinct values of $V$ further increases exponentially, so that FunHashNN has more potential ability to fit the prediction well. We denote FunHashNN with dual space hashing as FunHashNN-D, and illustrate its structure in Figure 1(c).
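The sketch below extends the earlier reconstruction example to dual space hashing (Eq. (12)): the per-entry parameters $\alpha_{ij}$ are themselves fetched by hashing into a second compressed vector $w'$. The sizes, the stand-in hash, and the choice of a 1-layer linear $g$ are assumptions for illustration only.

```python
import numpy as np

K, K2, U = 16, 16, 4   # sizes of w and w', number of hashes for x_ij (assumed)
R = U + 1              # alpha_ij holds the U weights and the bias of a 1-layer g
rng = np.random.default_rng(0)
w = rng.normal(size=K)    # compression space for the inputs x_ij
w2 = rng.normal(size=K2)  # additional compression space for the parameters alpha_ij

def bucket(tag, i, j, size):
    """Stand-in for an approximately uniform hash (the paper uses xxHash)."""
    return hash((tag, i, j)) % size

def sign(tag, i, j):
    return 1 if hash((tag, i, j, 's')) % 2 == 0 else -1

def reconstruct_V_dual(i, j):
    """Eq. (12): both the inputs x_ij and the parameters alpha_ij are hashed."""
    x = np.array([sign(('x', u), i, j) * w[bucket(('x', u), i, j, K)] for u in range(U)])
    alpha = np.array([sign(('a', r), i, j) * w2[bucket(('a', r), i, j, K2)] for r in range(R)])
    weights, bias = alpha[:U], alpha[U]     # a 1-layer linear g for illustration
    return float(weights @ x + bias)

print(reconstruct_V_dual(2, 7))
```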
Multi-hops.
We conjecture that FunHashNN could also be used in a multi-hop structure, by imagining that $w$ in the compression space plays a virtual role similar to $V$. Specifically, we can build another level of hash functions $\{\xi^{(1)}_u(\cdot), h^{(1)}_u(\cdot)\}$ and a compression space $w^{(1)}$. Thereafter, each entry in $w$ is hashed into multiple values in $w^{(1)}$ via $\{\xi^{(1)}_u(\cdot), h^{(1)}_u(\cdot)\}$. Then another reconstruction network $g^{(1)}(\cdot)$ is used to learn the mapping from the hashed values in $w^{(1)}$ to the corresponding entry in $w$.

This procedure can be applied recursively. If there are in total $M$ hops, all we need to store is a (possibly much smaller) vector $w^{(M)}$ at the final hop, a series of $M$ small reconstruction networks $\{g^{(m)}(\cdot)\}_{m=1}^M$, and a series of hashing functions. In contrast, the multi-hop version of HashedNets is equivalent to merely adjusting the compression ratio, or equivalently the size $K$.

Recent studies have confirmed the existence of redundancy in the parameters of deep neural networks. Denil et al. [11] decomposed a matrix in a fully-connected layer as the product of two low-rank matrices, so that the number of parameters decreases linearly as the latent dimensionality decreases. More structured decompositions, Fastfood [25] and Deep Fried [35], were proposed not only to reduce the number of parameters but also to speed up matrix multiplications. More recently, Han et al. [15, 16] proposed to iterate pruning and retraining during training DNNs, and used quantization and fine-tuning as a post-processing step. Huffman coding and hardware implementation were also considered. In order to mostly keep the accuracy, the authors suggested multiple rounds of pruning-retraining; that is, for little accuracy loss, one has to prune slowly enough and thus suffer increased training time. Again, the work most related to ours is HashedNets [7], which was later extended in [6] to random hashing in the frequency domain for compressing convolutional neural networks. Either HashedNets or FunHashNN could be combined with other techniques for better compression.

Extensive studies have been made on constructing and analyzing multiple hash functions, which have shown better performance than a single hash function [4]. One multi-hashing algorithm, the d-random scheme [1], uses only one hash table but d hash functions, quite similar to our setting. An alternative to d-random is the d-left algorithm proposed in [5], used for improving IP lookups. Hashing algorithms for natural language processing are also studied in [14]. Papers [32, 34] investigated feature hashing (a.k.a. the hashing trick), providing useful bounds and feasible results.

We conduct extensive experiments to evaluate FunHashNN on DNN compression. Code for full reproducibility will be open-sourced soon after necessary polishing.
Three benchmark datasets [24] are considered here, including (1) the original MNIST hand-written digit dataset, (2) the dataset BG-IMG as a variant of MNIST, and (3) the binary image classification dataset CONVEX. For all datasets, we use the prespecified training and testing splits. In particular, the original MNIST dataset comes with 60,000 training and 10,000 testing images.
Methods and Settings.
In [7], the authors compared HashedNets against several DNN compression approaches, including the low-rank decomposition [11], and showed that HashedNets performs consistently the best. Under the same settings, we compare FunHashNN with HashedNets and with a standard neural network without compression. All activation functions are chosen as ReLU.

The settings of FunHashNN are tested in two scenarios. First, we fix the FunHashNN structure of Figure 1(b) without extensions, and compare the compression effects of FunHashNN and HashedNets. Second, we compare different configurations of FunHashNN itself, including the number U of hash functions, the number of layers of the reconstruction network g(·), and the extension with dual space hashing. Hidden layers within g(·) use tanh as the activation function. Results of the multi-hops extension of FunHashNN will be included in another ongoing paper for systematic comparison.

To test robustness, we vary the compression ratio with (1) a fixed virtual network size (i.e., the size of V^ℓ in each layer), and then with (2) a fixed memory size (i.e., the size of w^ℓ in each layer). Three-layer (1 hidden layer) and five-layer (3 hidden layers) networks are investigated. In the experiments, we vary the compression ratio geometrically within {1, 1/2, 1/4, ..., 1/64}. For FunHashNN, this comparison sticks to 4 hash functions, a 3-layer g(·), and no dual space hashing.

With Virtual Network Size Fixed.
The hidden layer of the 3-layer nets starts at 1000 units, and the 5-layer nets start at 100 units per hidden layer. As the compression ratio ranges from 1 to 1/64 with a fixed virtual network size, the memory decreases and it becomes increasingly difficult to preserve the classification accuracy. The testing errors are shown in Figure 2, where standard neural networks with equivalent parameter sizes are included for comparison. FunHashNN shows robustly effective compression across the compression ratios, and consistently produces better prediction accuracy than HashedNets. It should be noted that even when the compression ratio equals one, FunHashNN with the reconstruction network structure is still not equivalent to HashedNets, and performs better.
With Memory Storage Fixed.
We then vary the compression ratio from 1 to 1/64 with a fixed memory storage size, i.e., the size of the virtual network increases while the stored memory stays the same. (HashedNets project page: ∼wenlinchen/project/HashedNets/index.html)

On 3-layer nets with a fixed compression ratio, we vary the configuration dimensions of FunHashNN, including the number of hash functions (U), the number of layers of the reconstruction network g(·), and whether dual space hashing is turned on. Since it is impossible to enumerate all probable choices, U is restricted to vary in {2, 4, 8, 16}. The structure of g(·) is chosen from several depths, with layerwise widths set as multiples of U. We denote Ux-Gy as x hash functions and y layers of g(·), and a suffix -D indicates dual space hashing.

Table 1 shows the performance of FunHashNN with different configurations on MNIST. The observations are summarized below. First, the series from index (0) to (1.x) fixes a 3-layer g(·) and varies the number of hash functions. As listed, more hash functions do not ensure better accuracy; instead, U4-G3 performs the best, perhaps because too many hash functions potentially bring too many partial collisions. Second, the series from (0) to (2.x) fixes the number of hash functions and varies the number of layers in g(·), where three layers perform the best, mainly due to the strongest representability. Third, indices (3.x) show further improved accuracy using dual space hashing.

Table 1: Performance on MNIST by various configurations of FunHashNN.

  Index   Config     Test Error (%)
  (0)     U4-G3      1.32
  (1.1)   U2-G3      1.42
  (1.2)   U8-G3      1.39
  (1.3)   U16-G3     1.40
  (2.1)   U4-G2      1.34
  (2.2)   U4-G3      1.28
  (3.1)   U2-G3-D    1.36
  (3.2)   U4-G3-D    1.24
  (3.3)   U8-G3-D    1.27

Figure 4: Performance for pairwise semantic ranking. Testing correct-to-wrong pairwise ranking ratios (the larger the better) are plotted versus the number of training epochs.
Finally, we evaluate the performance of FunHashNN on semantic learning-to-rank DNNs. The data is collected from the logs of a commercial search engine, with each clicked query-url pair as a positive sample and each non-clicked pair as a negative sample. There are in total around 45B samples. We adopt a deep convolutional structured semantic model similar to [19, 31], which has a siamese structure describing the semantic similarity between a query and a url title. The network is trained to optimize the cross entropy for each pair of positive and negative samples per query.

The performance is evaluated by the correct-to-wrong pairwise ranking ratio on the testing set. In Figure 4, we plot the performance of a baseline network as training proceeds, compared to FunHashNN and HashedNets, both with a 1/4 compression ratio. With U = 4 hash functions, FunHashNN performs better than HashedNets throughout the training epochs, and is even comparable to the full network baseline, which requires 4 times the memory storage. The deterioration of HashedNets probably comes from many inappropriate collisions on word embeddings, especially for words of high frequency.

This paper presents FunHashNN, a novel approach for neural network compression. Briefly, after adopting multiple low-cost hash functions to fetch values in the compression space, FunHashNN employs a small reconstruction network to recover each entry of a matrix of the original network. The reconstruction network is plugged into the whole network and learned jointly. The recently proposed HashedNets [7] is shown to be a degenerated special case of FunHashNN. Extensions of FunHashNN with dual space hashing and multi-hops are also discussed. On several datasets, FunHashNN demonstrates promisingly high compression ratios with little loss on prediction accuracy.

As future work, we plan to further systematically analyze the properties and bounds of FunHashNN and its extensions. More industrial applications are also expected, especially on mobile devices. This paper focuses on the fully-connected layers in DNNs, and the compression performance on other structures (such as convolutional layers) is also planned to be studied. As a simple and effective approach, FunHashNN is expected to become a standard tool for DNN compression.

References

[1] Y. Azar, A. Broder, A. Karlin, and E. Upfal. Balanced allocations. In STOC, 1994.
[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Inc., 1995.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[4] A. Broder and A. Karlin. Multilevel adaptive hashing. In SODA, 1990.
[5] A. Broder and M. Mitzenmacher. Using multiple hash functions to improve IP lookups. In INFOCOM, 2001.
[6] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing convolutional neural networks. In NIPS, 2015.
[7] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
[8] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 12:2493–2537, 2011.
[9] G. Cormode and S. Muthukrishnan. An improved data stream summary: The Count-Min sketch and its applications. J. Algorithms, 55:29–38, 2005.
[10] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.
[11] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas. Predicting parameters in deep learning. In NIPS, 2013.
[12] T. Dettmers. 8-bit approximations for parallelism in deep learning. In ICLR, 2016.
[13] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large scale sentiment classification: A deep learning approach. In ICML, 2011.
[14] A. Goyal, H. Daumé III, and G. Cormode. Sketch algorithms for estimating point queries in NLP. In EMNLP/CoNLL, 2012.
[15] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016.
[16] S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural networks. In NIPS, 2015.
[17] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
[18] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.
[19] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
[20] Y. Kang, S. Kim, and S. Choi. Deep learning to hash with multiple representations. In IEEE ICDM, 2012.
[21] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. In ICLR, 2016.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[23] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, 2009.
[24] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, 2007.
[25] Q. V. Le, T. Sarlos, and A. J. Smola. Fastfood – approximating kernel expansions in loglinear time. In ICML, 2013.
[26] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
[27] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. In ICLR, 2016.
[28] Z. Mariet and S. Sra. Diversity networks. In ICLR, 2016.
[29] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[30] M. Ranzato, Y.-L. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In NIPS, 2007.
[31] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
[32] Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S. Vishwanathan. Hash kernels for structured data. JMLR, 10:2615–2637, 2009.
[33] J. Wang, W. Liu, S. Kumar, and S.-F. Chang. Learning to hash for indexing big data – a survey. Proceedings of the IEEE, 104(1):34–57, 2016.
[34] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In ICML, 2009.
[35] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang. Deep fried convnets. In ICCV, 2015.