Linear Computation Coding
Ralf R. Müller∗, Bernhard Gäde∗, Ali Bereyhi∗∗
Institute for Digital Communications, Friedrich-Alexander Universität Erlangen-Nürnberg
[email protected], [email protected], [email protected]
Abstract—We introduce the new concept of computation coding. Similar to how rate-distortion theory is concerned with the lossy compression of data, computation coding deals with the lossy computation of functions. Particularizing to linear functions, we present an algorithm to reduce the computational cost of multiplying an arbitrary given matrix with an unknown column vector. The algorithm decomposes the given matrix into the product of codebook and wiring matrices whose entries are either zero or signed integer powers of two. For a typical implementation of deep neural networks, the proposed algorithm reduces the number of required addition units several times. To achieve the accuracy of 16-bit signed integer arithmetic for 4k-vectors, no multipliers and only 1.5 adders per matrix entry are needed.
Index Terms—approximate computing, computational complexity, estimation error, fixed-point arithmetic, linear systems, rate-distortion theory, quantization.
I. MOTIVATION
Neural networks are becoming an integral part of modern day's reality. This technology consists of two stages: a training phase and an inference phase. The training phase is computationally expensive and typically outsourced to cluster or cloud computing. It takes place only now and then, possibly only once. The inference phase is implemented on the device running the application. It is repeated whenever the neural network is used. This work solely targets the inference phase after the neural network has been successfully trained.

The inference phase consists of scalar nonlinearities and matrix-vector multiplications. The computational burden is overwhelmingly at the latter. The target of this work is to reduce the computational cost of the following task: Multiply an arbitrary unknown vector with a known, but arbitrary matrix. At the first layer of the neural network, the unknown vector is the input to the neural network. At a subsequent layer, it is the output of the respective previous layer. The known matrices are the outcome of the training phase in the data fusion center and fixed for all inference cycles of the neural network.
This paper was/will be presented in part at the Information Theory & Applications Workshop, San Diego, CA, Feb. 2020 and the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, Jun. 2021. This work was partially supported by Deutsche Forschungsgemeinschaft (DFG) under grant MU-3735/6-1.
The computing unit running the inference phase need not be a general-purpose processor. With neural networks being more and more frequently deployed in low-energy devices, it is attractive to employ dedicated hardware. For some of them, e.g., field programmable gate arrays or application-specific integrated circuits with memory, the data center has the option to update the known matrices whenever it wants to reconfigure the neural network. Still, the matrices stay fixed most of the time. In this work, we will not address those updates, but focus on the most computationally costly effort: the frequent matrix-vector multiplications within the dedicated hardware.

Besides the matrix-vector multiplications, memory access is currently also considered a major bottleneck in the inference phase of neural networks. However, technological solutions to the memory access problem, e.g., stacked dynamic random access memory utilizing through-silicon vias [1] or emerging non-volatile memories [2], are being developed and expected to be available soon. Thus, we will not address memory-access issues in this work.

Various works have addressed the problem of simplifying matrix-matrix multiplications utilizing certain recursions that result in sub-cubic time-complexity of matrix-matrix multiplication (and matrix inversion) [3], [4]. However, these algorithms and their more recent improvements, to the best of our knowledge, do not help for matrix-vector products. This work is not related to that group of ideas.

Various other works have addressed the problem of simplifying matrix-vector multiplications in neural networks utilizing structures of the matrices, e.g., sparsity [5], [6]. However, this approach comes with severe drawbacks: 1) It does not allow to design training phase and inference phase independently from each other. This restricts interoperability, hinders efficient training, and compromises performance [7]. 2) Sparsity alone does not necessarily reduce computational cost, as it may require higher accuracy, i.e., larger word-length for the nonzero matrix elements. In this work, we will neither utilize structures of the trained matrices nor structures of the input data. The vector and matrix to be multiplied may be totally arbitrary. They may, but need not, contain independent identically distributed (IID) random variables, for instance.

It is not obvious that, without any specific structure in the matrix, significant computational savings are possible over state-of-the-art methods implementing matrix-vector multiplications. In the sequel, we will explain why such savings are possible and how they can be achieved. We also show that these savings are very significant for typical matrix sizes in present day's deep networks: by means of linear computation coding, the computational cost, if measured in number of additions and bit shifts, is reduced several times.

The paper is organized as follows: In Section II, the general concept of computation coding is introduced. In Section III, we review the state of the art and define a benchmark for comparison. In Section IV, we propose our new algorithm. Sections V and VI study its performance by analytic and simulative means, respectively. Section VII summarizes conclusions and gives an outlook to open problems and applications of linear computation coding beyond the area of neural networks.

Matrices are denoted by boldface upper case letters, vectors are not explicitly distinguished from scalar variables.
The sets Z and R denote the integers and reals, respectively. The identity matrix, the all zero matrix, the all one matrix, the expectation operator, the sign function, Landau's big O-operator, and the Frobenius norm are denoted by I, 0, 1, E[·], sign(·), O(·), and ‖·‖_F, respectively. Indices to constant matrices express their dimensions. The notation ‖·‖_0 counts the number of non-zero entries of the vector- or matrix-valued argument.

II. COMPUTATION CODING FOR GENERAL FUNCTIONS
The approximation by a deep network is the current state of the art to compute a multi-dimensional function efficiently. There may be other ones, yet undiscovered, as well. Thus, we define computation coding for general multi-dimensional functions. Subsequently, we discuss the practically important case of linear functions, i.e., matrix-vector products, in greater detail.

The best starting point to understand computation coding is rate-distortion theory in data compression. In fact, computation coding can be interpreted as a lossy encoding of functions with a side constraint on the computational cost of the decoding algorithm. It shares a common principle with lossy source coding: random codebooks, if suitably constructed, usually perform well. In contrast to rate-distortion theory, however, random codebooks turn out useful even for practical applications of computation coding.
Computation coding consists of computation encoding and computation decoding. Roughly speaking, computation encoding is to find some approximate representation m(x) for a given and known function f(x) such that m(x) can be calculated for any arbitrary x ∈ X with low computational cost and high accuracy. Computation decoding is the calculation of m(x). Formal definitions are as follows:
Definition 1: Given a probability space (X, P_X) and a metric d: F × F → R, a computation encoding with distortion D for a given function f: X → F is a mapping m: X → M ⊆ F such that E_{x∈X}[d(f(x), m(x))] ≤ D.
Definition 2: A computation decoding with computational cost C for a given cost operator C[·] is an implementation of the mapping m: X → M such that C[m(x)] ≤ C for all x ∈ X.

The computational cost operator C[m(·)] measures the cost to implement the function m(·). It reflects the properties of the hardware that executes the computation.

Computation coding can be regarded as a generalization of lossy source coding. If we consider the identity function f(x) = x and the limit C → ∞, computation coding specializes to lossy source coding with m(x) being the lossy codeword for x. Rate-distortion theory analyzes the trade-off between distortion D and the code rate. In computation coding, we are interested in the trade-off between distortion D and computational cost C. The code rate is of no or at most subordinate concern.

The expectation operator in the distortion constraint of Definition 1 is natural to readers familiar with rate-distortion theory. From a computer science perspective, it follows the philosophy of approximate computing [8]. Nevertheless, hard constraints on the accuracy of computation can be addressed via distortion metrics based on the infinity norm, which enforces a maximum tolerable distortion.

The computational cost operator may also include an expectation. Whether this is appropriate or not depends on the goal of the hardware design. If the purpose is minimum chip area, one usually must be able to deal with the worst case and an expectation can be inappropriate. Power consumption, on the other hand, overwhelmingly correlates with average computational cost.

The above definitions shall not be confused with related, but different definitions in the literature of approximation theory [9]. There, the purpose is rather to allow for proving theoretical achievability bounds than evaluating algorithms. The approach to distortion is similar. Complexity, however, is measured as the growth rate of the number of bits required to achieve a given upper bound on distortion. This is quite different from the computational cost in Definition 2.

III. STATE OF THE ART

A. Scalar Linear Functions
The principles and benefits of computation coding are most easily explained for linear functions. Let us start with scalar linear functions, before we get to the multi-dimensional case.
Example 1:
Let x ∈ R have unit variance and f(x) = tx, t ∈ R. Let the computation encoding take the form

m(x) = \mathrm{sign}(t) \sum_{b \in B} 2^b x    (1)

where the set B ⊂ Z simply contains the positions of all the 1-bits in the appropriately rounded binary representation of t. Define the computational cost operator as C[m(x)] = max B − min B + 1, independent of x. This is the standard way a linear function is implemented on a modern computer by means of additions and bit shifts. Considering t as a random variable with uniform distribution on [−1, +1], the trade-off between average mean-squared distortion and computational cost is well-known to be [10]

E_{x \in R}\,[f(x) - m(x)]^2 = \frac{4^{-C}}{3},    (2)

i.e., every additional bit of resolution reduces the quantization error by 6 dB.

On average, half of the additions are actually additions of zero. For 16-bit signed integer arithmetic, we only need to add (16−1)/2 = 7.5 terms, on average, thus, compute 6.5 additions. The multiplication of a matrix with 4096 columns of 16-bit signed fixed-point numbers by an unknown column vector thus requires 4096 · 6.5 + 4095 additions per row, i.e., about 7.5 additions per matrix entry, on average. It achieves an average distortion of 4^{−15}/3, which is equivalent to −95 dB. This is used as benchmark in the sequel.

For very high precision, i.e., very small distortion, more efficient algorithms are known from the literature that are all based on some version of the Chinese Remainder Theorem [11]–[16]. For precisions relevant in neural networks, however, these algorithms are not useful.
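To make Example 1 concrete, the following minimal sketch (ours, not part of the original paper) evaluates m(x) by one bit shift and one addition per element of B and checks the distortion formula (2) by Monte Carlo. The quantization step 2^{1−C} for t ∈ [−1, +1] is an assumption chosen to match (2); helper names are ours.

```python
import numpy as np

def one_bit_positions(t: float, C: int):
    """Positions B of the 1-bits of |t| rounded to a step of 2**(1-C), plus sign(t)."""
    step = 2.0 ** (1 - C)
    q = int(round(abs(t) / step))            # integer magnitude of the rounded coefficient
    B = [b + (1 - C) for b in range(q.bit_length()) if (q >> b) & 1]
    return np.sign(t), B

def m(t: float, x: float, C: int) -> float:
    """Example 1: approximate t*x by one bit shift and one addition per element of B."""
    s, B = one_bit_positions(t, C)
    acc = 0.0
    for b in B:
        acc += x * 2.0 ** b                  # 2**b * x is a bit shift in fixed-point hardware
    return s * acc

rng = np.random.default_rng(1)
C = 12
t = rng.uniform(-1, 1, 5000)
x = rng.standard_normal(5000)                # unit variance, as assumed in Example 1
mse = np.mean([(ti * xi - m(ti, xi, C)) ** 2 for ti, xi in zip(t, x)])
print(mse, 4.0 ** (-C) / 3)                  # the two numbers should be close, cf. (2)
```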
Example 2: Consider the setting of Example 1, except for the computation encoding to take the canonical signed digit form [17]

m(x) = \sum_{(s,b) \in P} s\, 2^b x    (3)

with P ⊂ {±1} × Z. Like in Example 1, the number t is approximated by a weighted sum of powers of 2. However, the weighting coefficients are chosen from the set {−1, 0, +1} instead of {0, 1}. The computational cost operator is defined as C[m(x)] = |P|. In contrast to Example 1, it does not impose any cost for additions of zeros, but only counts the number of additions and subtractions. For the optimum choice of the set P, the trade-off between average mean-squared distortion and computational cost is shown in the appendix to be

E_{x \in R}\,[f(x) - m(x)]^2 = \frac{28^{-C}}{3}.    (4)

Thus, every unit increment of the computational cost reduces the quantization error by 14.5 dB. In this case, the average distortion of −95 dB is achieved for an average of C = 30 \log_{28} 2 ≈ 6.2 additions per matrix entry. That is a 17% reduction over the benchmark in Example 1.

There are various further improvements to computation coding of scalar linear functions in the literature, e.g., redundant logarithmic arithmetic [18], [19]. One can also search for repeating bit patterns in the representation of t and reuse the initial computation for later occurrences [20], [21]. One can represent t by multiple bases at the same time [22]. However, the conversion into the multiple base number system does not come for free. Area and power savings were studied in [23] for 32-bit and 64-bit integer arithmetic. Only for 64-bit arithmetic improvements were reported.
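As an illustration of the signed-digit idea in Example 2, here is a small greedy sketch (ours). It picks, in each step, the signed power of two nearest in absolute value to the running residual; this is not necessarily the exact canonical recoding or the optimum set P, but it reflects the quantizer analyzed in the appendix.

```python
import math

def nearest_signed_power_of_two(r: float):
    """Signed power of two closest in absolute value to r != 0."""
    s = 1 if r > 0 else -1
    lo = math.floor(math.log2(abs(r)))               # 2**lo <= |r| < 2**(lo+1)
    b = lo if abs(r) - 2.0 ** lo <= 2.0 ** (lo + 1) - abs(r) else lo + 1
    return s, b

def signed_digit_terms(t: float, num_terms: int):
    """Greedy approximation of t by num_terms terms s*2**b, in the spirit of (3)."""
    terms, r = [], t
    for _ in range(num_terms):
        if r == 0.0:
            break
        s, b = nearest_signed_power_of_two(r)
        terms.append((s, b))
        r -= s * 2.0 ** b
    return terms, r                                  # r is the remaining approximation error

# example: 0.7 ≈ 2**-1 + 2**-2 - 2**-4 = 0.6875
print(signed_digit_terms(0.7, 3)[0])
```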
B. Multidimensional Linear Functions

Linear computation coding for products of structured matrices with unknown real or complex vectors is widespread in the literature. The most known example is probably the fast Fourier transform. By contrast, computation coding for linear functions with unstructured matrices has rarely been studied at all, with the notable exception of [24], [25]. Reference [24] finds up to 40% improvements for structured matrices, but the performance for random matrices is reported as only "slightly better". In [25], the processing on the matrix entries is optimized utilizing various algorithms including linear programming and pattern search. Computational cost for random matrices is reported to be reduced by at most 28% over the scalar canonically signed digit form.

Accelerations to products of unstructured matrices with unknown vectors are well-studied for Boolean semi-rings [26], [27]. For Boolean semi-rings, [27] shows that lossless linear computation coding in K dimensions requires at most O(K^2 (\log K)^{-2}) operations. The mailman algorithm [28] is inspired by these ideas. It allows to compute the product of a matrix composed of entries from a finite field with an arbitrary (even real or complex) vector by at most O(K^2 (\log K)^{-1}) operations. For this purpose, the target matrix

T = BW    (5)

is decomposed into a codebook matrix B and a wiring matrix W in order to simplify the computation of the linear function f(x) = Tx by f(x) = B(Wx) with appropriate choices of the codebook and the wiring matrix.

Let T ∈ 𝕋^{N×K} for the finite set 𝕋 = {0, ..., T−1} with cardinality T. Let K = T^N and fix the codebook matrix B ∈ 𝕋^{N×K} in such a way that entry B_{nk} is the n-th digit of the T-ary representation of k−1. Thus, the k-th column of the codebook matrix is the T-ary representation of k−1. Since B contains all the T^N possible columns, any column of T is also a column of B. The purpose of the wiring matrix is to pick the columns of the codebook matrix in the right order, similar to a mailman who orders the letters suitably prior to delivery. Since the wiring matrix only reorders the columns of the codebook matrix, it contains a single one per column while all other elements are zero. Thus, the product h = Wx does not require any arithmetic. On a circuit board, it just defines a wiring. For the product Bh, [28] gives a recursive algorithm that only requires O(K) operations. Decomposing a K × K matrix into K / \log_T K submatrices of size \log_T K × K, the overall complexity scales as O(K^2 / \log K).

The drawback of the mailman algorithm is that it can be used only for very low accuracy of computation. To reach the accuracy of q-bit unsigned integer arithmetic, we need matrices whose size is at least q × 2^q. Arbitrary accuracy can only be achieved for infinite matrix size. Furthermore, the matrix size must grow exponentially with the desired signal-to-quantization-noise ratio.
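The decomposition (5) for the mailman construction can be sketched in a few lines of toy NumPy code (ours; function and variable names are our own, and the O(K) recursive evaluation of Bh is deferred to the appendix). It builds the T-ary codebook whose columns enumerate all possible columns and a wiring matrix with a single one per column that merely reorders them:

```python
import numpy as np

def mailman_codebook(T: int, N: int) -> np.ndarray:
    """N x T**N codebook whose column k is the T-ary representation of k (most significant digit first)."""
    K = T ** N
    B = np.zeros((N, K), dtype=int)
    for k in range(K):
        digits, v = [], k
        for _ in range(N):
            digits.append(v % T)
            v //= T
        B[:, k] = digits[::-1]
    return B

def mailman_wiring(target: np.ndarray, T: int) -> np.ndarray:
    """Wiring matrix with a single one per column that picks each target column out of B."""
    N, K = target.shape
    W = np.zeros((T ** N, K), dtype=int)
    for k in range(K):
        idx = int(np.ravel_multi_index(tuple(target[:, k]), (T,) * N))  # codebook column equal to target column
        W[idx, k] = 1
    return W

# toy check: T = BW holds exactly for a small ternary matrix
T_alphabet, N = 3, 4
rng = np.random.default_rng(2)
target = rng.integers(0, T_alphabet, size=(N, 10))
B = mailman_codebook(T_alphabet, N)
W = mailman_wiring(target, T_alphabet)
assert np.array_equal(B @ W, target)
```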
IV. PROPOSED ALGORITHM

We start with the decomposition of the target matrix into codebook matrix and wiring matrix as in (5). However, we design at least the wiring matrix in a manner different from [28]. We also drop the purity of wiring and allow some computational effort to calculate the product Wx.

A. Single Wiring Matrix
For a given target matrix T = [t_1, ..., t_K] ∈ R^{N×K} and codebook matrix B ∈ {0, ±2^Z}^{N×K}, we find the wiring matrix W = [w_1, ..., w_K] such that it is a good solution to the sparse recovery problem

W = \operatorname{argmin}_{\Omega \in \{0, \pm 2^{Z}\}^{K \times K}:\ \|\Omega\|_0 = K+S} \|T - B\Omega\|_F    (6)

for some parameter S controlling the computational cost. This wiring matrix requires S additions and K + S shift operations.

Since the optimum solution to the optimization problem (6) is hard to find in polynomial time (NP-hard), we propose a greedy algorithm that is demonstrated in Sections V and VI to perform well. First, we break the matrix optimization problem into K column-wise optimization problems:

w_k = \operatorname{argmin}_{\omega \in \{0, \pm 2^{Z}\}^{K}:\ \|\omega\|_0 = 1+s} \|t_k - B\omega\| \quad \forall k    (7)

with s = S/K. Since the column-wise optimization (7) is still NP-hard, we take the following greedy approach for each column:
1) Start with s = 0 and ω = 0.
2) Update ω such that it changes in at most a single component.
3) Increment s.
4) If s ≤ S/K, go to step 2.

The codebook matrix need not be a binary mailman matrix. Any matrix that can be multiplied with little effort to the product h = Wx is welcome. Details are discussed in Section IV-C.
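A minimal NumPy sketch of this greedy column-wise procedure follows (our illustration of (6)-(7), not the authors' implementation; helper names are ours). In each pass it adds one codebook column, scaled by a signed power of two, to the current approximation: the least-squares scale is computed first and then rounded to the nearest signed power of two.

```python
import numpy as np

def nearest_signed_power_of_two(v: float) -> float:
    """Round a nonzero scale factor to the nearest signed power of two."""
    s = 1.0 if v > 0 else -1.0
    e = np.floor(np.log2(abs(v)))
    lower, upper = 2.0 ** e, 2.0 ** (e + 1)
    return s * (lower if abs(v) - lower <= upper - abs(v) else upper)

def greedy_wiring_column(t_col: np.ndarray, B: np.ndarray, s_per_col: int) -> np.ndarray:
    """Greedy approximation of one column in (7): t_col ≈ B @ w with at most
    1 + s_per_col nonzero entries in w, each a signed power of two."""
    K = B.shape[1]
    w = np.zeros(K)
    residual = t_col.astype(float).copy()
    col_norms2 = np.sum(B.astype(float) ** 2, axis=0) + 1e-30
    for _ in range(1 + s_per_col):
        scales = (B.T @ residual) / col_norms2          # least-squares scale per column
        best_k, best_err, best_scale = -1, np.linalg.norm(residual), 0.0
        for k in range(K):
            if w[k] != 0.0 or scales[k] == 0.0:         # keep each entry a single power of two
                continue
            q = nearest_signed_power_of_two(scales[k])
            err = np.linalg.norm(residual - q * B[:, k])
            if err < best_err:
                best_k, best_err, best_scale = k, err, q
        if best_k < 0:                                  # no improving update found
            break
        w[best_k] = best_scale                          # step 2: change a single component
        residual -= best_scale * B[:, best_k]
    return w

def greedy_wiring(T_target: np.ndarray, B: np.ndarray, s_per_col: int) -> np.ndarray:
    """Column-wise greedy solution of (6) with S = s_per_col * K."""
    return np.column_stack([greedy_wiring_column(T_target[:, k], B, s_per_col)
                            for k in range(T_target.shape[1])])

# toy example: approximate a random 8 x 16 target with an IID Gaussian codebook
rng = np.random.default_rng(3)
T_target = rng.standard_normal((8, 16))
B = rng.standard_normal((8, 16))
W = greedy_wiring(T_target, B, s_per_col=3)
print(np.linalg.norm(T_target - B @ W) / np.linalg.norm(T_target))
```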
B. Multiple Wiring Matrices

The wiring matrix can be further decomposed into L sub-wiring matrices

W = W_1 W_2 \cdots W_L.    (8)

This means that B serves as codebook for W_1 and B W_1 \cdots W_\ell serves as codebook for W_{\ell+1}. Repeating the greedy algorithm of Section IV-A for \ell increasing from 1 to L, the wiring matrices W_\ell are recursively found.

Multiple wiring matrices are useful if the codebook matrix B is computationally cheap, but poor from a distortion point of view. The product of a computationally cheap codebook matrix B with a computationally cheap wiring matrix W_1 can serve as a codebook B W_1 for subsequent wiring matrices that performs well with respect to both distortion and computational cost.

Multiple wiring matrices can also be useful if the hardware in the inference phase favors some serial over fully parallel processing. In this case, circuitry for multiplying with W_\ell can be reused for subsequent multiplication with W_{\ell-1}. Note that in the design phase, wiring matrices are preferably calculated in increasing order of the index \ell, while in the inference phase, they are used in decreasing order of \ell.
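A short sketch of the recursion (8), reusing the hypothetical greedy_wiring helper from the sketch in Section IV-A above (our illustration; the choices of L and of the additions per stage are assumptions):

```python
import numpy as np

def design_wiring_cascade(T_target, B, L, s_per_col):
    """Design W_1, ..., W_L as in (8): B W_1 ... W_l serves as codebook for W_{l+1}."""
    wirings, codebook = [], B
    for _ in range(L):
        W_l = greedy_wiring(T_target, codebook, s_per_col)   # helper from the previous sketch
        wirings.append(W_l)
        codebook = codebook @ W_l
    return wirings

def apply_cascade(B, wirings, x):
    """Inference phase: apply the wiring matrices in decreasing order of their index."""
    h = x
    for W_l in reversed(wirings):        # W_L first, ..., W_1 last
        h = W_l @ h
    return B @ h                         # T x ≈ B W_1 ... W_L x
```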
C. Codebook Matrices

For codebook matrices, the computational cost depends on the way they are designed. Besides being easy to multiply onto a given vector, a codebook matrix should be designed such that pairs of columns are not collinear. A column that is collinear to another one is obsolete: it does not help to reduce the distortion, while it still requires additions to compute. In an early conference version of this work [29], we proposed to find the codebook matrix by sparse quantization of the target matrix. While this results in significant savings of computational cost over the state of the art, there are even better designs for codebook matrices. Three of them are detailed in the sequel.
1) Binary Mailman Codebook: In the binary mailman codebook, only the all zero column is obsolete. It is shown in the appendix that the multiplication of the binary mailman matrix with an arbitrary vector requires less than 2K additions. While performing well, it lacks flexibility, as it requires the matrix dimensions to fulfill K = 2^N.
2) Two-Sparse Codebook:
We choose the alphabet S ⊂ {0, ±1, ±2, ±4, ...} as a subset of the signed powers of two augmented by zero and find K vectors of length N such that no pair of vectors is collinear and each vector has zero norm equal to either 1 or 2. For a sufficiently large size of the subset, those vectors always exist. These vectors are the columns of the codebook matrix. The ordering is irrelevant. It turns out useful to restrict the magnitude of the elements of S to the minimum that is required to avoid collinear pairs.

Fig. 1: Decomposition of the approximation error into a distance error and an angle error (target vector vs. scaled best codeword; angles α and β).
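A possible construction of such a two-sparse codebook is sketched below (our code, not the authors'; the bound max_exp on the exponent magnitude plays the role of the restriction mentioned in the text):

```python
import numpy as np
from itertools import combinations, product

def two_sparse_codebook(N: int, K: int, max_exp: int = 2) -> np.ndarray:
    """K pairwise non-collinear columns of length N with one or two nonzero entries,
    all nonzero entries being signed powers of two (a sketch of Section IV-C2)."""
    values = [s * 2.0 ** e for s in (+1, -1) for e in range(-max_exp, max_exp + 1)]
    columns, seen_directions = [], set()

    def try_add(col):
        v = np.array(col, dtype=float)
        direction = tuple(np.round(v / np.linalg.norm(v), 12))     # collinearity key
        if direction in seen_directions or tuple(-np.array(direction)) in seen_directions:
            return
        seen_directions.add(direction)
        columns.append(v)

    for i in range(N):                      # one-sparse columns first
        col = [0.0] * N
        col[i] = 1.0
        try_add(col)
    for (i, j), (a, b) in product(combinations(range(N), 2), product(values, repeat=2)):
        if len(columns) >= K:
            break
        col = [0.0] * N
        col[i], col[j] = a, b
        try_add(col)
    return np.column_stack(columns[:K])

print(two_sparse_codebook(N=6, K=20).shape)   # (6, 20)
```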
3) Self-Designing Codebook:
We set B = B_1 B_2 B_3 with B_1 = [I_N \; 0_{N×(K−N)}] and find the K × K matrices B_2 and B_3 recursively via (6) for S = K, interpreting them as wiring matrices for some given auxiliary target matrix T̃. The auxiliary target matrix may, but need not, be identical to T. The codebook designs itself taking the auxiliary target matrix as model.
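A compact sketch of one possible reading of this construction, again reusing the hypothetical greedy_wiring helper from the sketch in Section IV-A (ours; S = K corresponds to one extra addition per column, i.e., s_per_col = 1):

```python
import numpy as np

def self_designing_codebook(T_aux: np.ndarray) -> np.ndarray:
    """Section IV-C3 sketch: B = B1 B2 B3 with B1 = [I_N  0_{N x (K-N)}]."""
    N, K = T_aux.shape
    B1 = np.hstack([np.eye(N), np.zeros((N, K - N))])
    B2 = greedy_wiring(T_aux, B1, s_per_col=1)         # K x K wiring against B1
    B3 = greedy_wiring(T_aux, B1 @ B2, s_per_col=1)    # refined against the improved codebook B1 B2
    return B1 @ B2 @ B3                                # N x K codebook, cheap to apply
```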
V. PERFORMANCE ANALYSIS

In order to analyze the expected distortion, we resort to IID Gaussian random codebooks and target vectors, as well as mean-square distortion averaged over the codebook ensemble. The IID Gaussian codebook is solely chosen as it simplifies the performance analysis. In practice, the IID Gaussian random matrix B must be replaced by a codebook matrix with low computational cost, but similar performance. Simulation results in Section VI will show that practical codebooks perform very similarly to IID Gaussian ones.

A. Logarithmic Aspect Ratio
A key point to good performance of the multiplicative matrix decomposition (5) is the logarithmic aspect ratio: the number of rows N of the codebook matrix scales logarithmically with the number of columns K. For a linear computation code, we define the code rate as

R = \frac{1}{N} \log_2 K.    (9)

The code rate is a design parameter that, as we will see later on, has some minor impact on the trade-off between distortion and computational cost.

The logarithmic scaling of the aspect ratio is fundamental. This is a consequence of extreme-value statistics of large-dimensional random vectors: Consider the correlation coefficients (inner products normalized by their norms) of N-dimensional real random vectors with IID entries in the limit N → ∞. For any set of those vectors whose size is polynomial in N, the squared maximum of all correlation coefficients converges to zero as N → ∞ [30]. Thus, the angle α in Fig. 1 becomes a right angle and the norm of the angle error is lower bounded by the norm of the target vector. However, for an exponentially large set of size 2^{RN} with rate R > 0, the limit for N → ∞ is strictly positive and given by rate-distortion theory as 1 − 2^{−2R} [31]. The asymptotic squared relative error of approximating a target vector by an optimal scaling of the best codeword, thus, is 2^{−2R}. The error itself can be approximated by another vector of the exponentially large set to get the total squared error down to 2^{−4R}. Repeating that procedure s times, the (squared) error decays exponentially in s. In practice, the scale factor cannot be a real number, but must be quantized. This additional error is illustrated in Fig. 1 and labelled distance error, as opposed to the previously discussed angle error. As a result, the total squared error does not decay as 2^{−2(s+1)R}, but with a slightly reduced rate.

The logarithmic aspect ratio has the following impact on the trade-off between distortion and computational cost: For K + S choices from the codebook, the wiring matrix has K + S nonzero entries. This implies S additions in the product h = Wx. In return, we get approximated an N × K target matrix T with N = \frac{1}{R}\log_2 K = O(\log K) rows. For any desired distortion D, the computational cost of the product Wx is by a factor O(\log K) smaller than the number of entries in the target matrix. This is the same scaling as in the mailman algorithm. Given such a scaling, the mailman algorithm allows for a fixed distortion D which depends on the size of the target matrix. The proposed algorithm, however, can achieve arbitrarily low distortion by setting S appropriately large, irrespective of the matrix size. The connection between the computational cost C = 2K + S and distortion D is addressed in the sequel.
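For a quick numeric illustration of these scalings (the specific numbers are ours, chosen for illustration): with K = 4096 columns and unit rate R = 1,

\[
N = \tfrac{1}{R}\log_2 K = 12, \qquad \text{squared-error factor per step} \approx 4^{-R} = \tfrac{1}{4} \;(\approx 6\,\mathrm{dB}),
\]

so the target matrix has only 12 rows per 4096 columns, and one additional addition per column buys roughly one bit of accuracy at this rate.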
B. Angle Error

Consider a unit norm target vector t ∈ R^N that shall be approximated by a scaled version of one out of K codewords b_k ∈ R^N which are random and jointly independent. Denoting the angle between the target vector t and the codeword b_k as α_k, we get the (norm of the) angle error as

a_k = |\sin \alpha_k|.    (10)

The correlation coefficient between target vector t and codeword b_k is given as

\rho_k = \frac{\langle t; b_k \rangle}{\|t\|\,\|b_k\|} = \cos \alpha_k    (11)

with ⟨·;··⟩ denoting the inner product. The minimum angle error a and the correlation coefficients are related by

a^2 = 1 - \max_k \rho_k^2.    (12)

Next, we will study the behavior of the correlation coefficients in order to learn about the minimum angle error.

Let P_{\rho^2|t}(r, t) denote the cumulative distribution function (CDF) of the squared correlation coefficient given target vector t. As the columns of the codebook matrix are jointly independent, we conclude that for

\varrho^2 = \max_k \rho_k^2,    (13)

we have

P_{\varrho^2|t}(r, t) = \left[ P_{\rho^2|t}(r, t) \right]^K.    (14)

The target vector t follows a unitarily invariant distribution. Thus, the conditional CDF does not depend on it. In the sequel, we choose t to be the first unit vector of the coordinate system, without loss of generality.

The squared correlation coefficient ρ_k^2 is known to be distributed according to the beta distribution with shape parameters 1/2 and (N−1)/2 [32, Sec. III.A] and given by

P_{\rho^2}(r) = B\!\left(\tfrac{1}{2}, \tfrac{N-1}{2}, r\right)    (15)

with

B(a, b, x) = \frac{\int_0^x \xi^{a-1} (1-\xi)^{b-1}\, \mathrm{d}\xi}{\int_0^1 \xi^{a-1} (1-\xi)^{b-1}\, \mathrm{d}\xi}    (16)

denoting the regularized incomplete beta function [33]. The distribution of the squared angle error is, thus, given by

P_{a^2}(r) = 1 - \left[ B\!\left(\tfrac{1}{2}, \tfrac{N-1}{2}, 1-r\right) \right]^K.    (17)

It is shown in the appendix to converge to

\lim_{K \to \infty,\; N = \frac{1}{R}\log_2 K} P_{a^2}(r) = \begin{cases} 0 & r < 4^{-R} \\ 1 & r > 4^{-R} \end{cases}    (18)

for logarithmic aspect ratios, confirming the considerations in Section V-A. The CDF is depicted in Fig. 2. For finite dimensions, we find the mean squared angle error as

\overline{a^2} = \int_0^1 a\, \mathrm{d}P_{a^2}(a) = 1 - \int_0^1 P_{a^2}(a)\, \mathrm{d}a    (19)
              = \int_0^1 B\!\left(\tfrac{1}{2}, \tfrac{N-1}{2}, r\right)^K \mathrm{d}r    (20)

using integration by parts.

Fig. 2: CDF of the squared angle error for rate R = 1 and various numbers of columns K.

Fig. 3: R-th root of the average total squared error vs. the number of columns K in logarithmic scale for various rates R.
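The finite-dimensional expression (20) is easy to evaluate numerically. The following sketch (ours, assuming SciPy is available) computes the mean squared angle error for a few unit-rate matrix sizes:

```python
import numpy as np
from scipy.special import betainc
from scipy.integrate import quad

def mean_squared_angle_error(N: int, K: int) -> float:
    """Numerical evaluation of (20): integral of [B(1/2, (N-1)/2, r)]**K over r in [0, 1]."""
    integrand = lambda r: betainc(0.5, (N - 1) / 2.0, r) ** K
    value, _ = quad(integrand, 0.0, 1.0)
    return value

# unit rate: K = 2**N; the value approaches 4**(-1) = 0.25 from above as N grows, cf. (18)
for N in (6, 8, 10, 12):
    print(N, mean_squared_angle_error(N, 2 ** N))
```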
C. Distance Error

For a target vector of unit norm, the distance error d is bounded from above by |cos α|/3. To see this, note that the norm of the optimally scaled codeword is |cos α|. Furthermore, the maximum error occurs if the magnitude of the optimum scale factor v is exactly in the middle of two adjacent powers of two, say 2^p and 2^{p+1}. In this case, we have

d = |v| - 2^p = 2^{p+1} - |v| \;\Rightarrow\; |v| - 2^p = \frac{|v|}{3}.

Assuming the error to be uniformly distributed, it is easily concluded that the average squared distance error is given by

\overline{d^2} = \frac{\overline{\varrho^2}}{27} = \frac{1 - \overline{a^2}}{27}.    (21)

Note that the factor 1/27 slightly differs from the factor 1/28 in Example 2. Like in Example 2, the number to be quantized is uniformly distributed within some interval. Here, however, the interval boundaries are not signed powers of two. This leads to a minor increase in the power of the quantization noise.

D. Total Error
Since distance error and angle error are orthogonal to each other, the average total squared error is simply given as

\overline{\epsilon^2} = \overline{a^2} + \overline{d^2} = \frac{1 + 26\,\overline{a^2}}{27}.    (22)

The average total squared error for a single approximation step is depicted in Fig. 3 for various rates R. Note that the computational cost per matrix entry scales linearly with R for fixed K. In order to have a fair comparison, the average total squared error is exponentiated with 1/R. Despite the finite size of the matrices and the additional distance error due to quantization of the scale factor, it does not deviate very much from the asymptotic value of 4^{−R} found in (18).

Every reduction of error compounds on top of the previous one. If the errors in s + 1 subsequent steps of approximation were statistically independent, the remaining average total squared error would be

D_{\mathrm{LB}} = \left( \overline{\epsilon^2} \right)^{s+1}    (23)

for s + 1 steps. As discussed in the next paragraph, they are not independent and (23) is just a lower bound.

The angle between the target vector and the best codeword (denoted by α) and the angle between target vector and angle error (denoted by β) add to 90°, see Fig. 1. The larger the angle error, the smaller the angle β. That means a large angle error implies that the next target vector is close to the previous one. In other words, a target vector that is difficult to approximate implies that the next target vector is likely to be difficult to approximate, as well. On the other hand, a target vector that is easy to approximate results in a new target vector that is close to orthogonal to the previous target vector. In this case, it is pretty much a fair coin flip to decide whether the next approximation is easy or difficult. Thus, it is more likely to get a difficult target vector than an easy one, on average. This effect is all the more pronounced the smaller the size of the codebook K is. This is because smaller values of K increase the probability of having a large angle error, see Fig. 2.
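For illustration (the value 0.3 is our own example, not taken from the paper): if for some finite matrix size the mean squared angle error were \overline{a^2} = 0.3, then (22) and (23) give

\[
\overline{\epsilon^2} = \frac{1 + 26 \cdot 0.3}{27} \approx 0.326, \qquad D_{\mathrm{LB}} = 0.326^{\,s+1},
\]

i.e., each additional addition per column reduces the lower bound by about 4.9 dB.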
E. Computational Cost

For the sake of simplicity, we use the number of additions to measure computational cost. Sign changes and shifts are cheaper than additions, and their numbers are also very close to the number of additions.

For wiring matrix W_\ell with 1 + s_\ell nonzero entries per column, we have to sum K times 1 + s_\ell terms. So, we need s_\ell K additions in total for wiring matrix W_\ell. All three codebook matrices discussed in Section IV-C require at most 2K additions.

Adding the computational costs of wiring and codebook matrices, we get

C = 2K + \sum_{\ell=1}^{L} s_\ell K = (s + 2) K    (24)

with s = \sum_{\ell=1}^{L} s_\ell. Normalizing to the number of elements of the N × K target matrix T, we have

\tilde{C} = \frac{C}{KN} = \frac{s+2}{N}.    (25)

The computational cost per matrix entry vanishes with increasing matrix size. This behavior fundamentally differs from the state-of-the-art methods discussed in Examples 1 and 2, where the matrix size has no impact on the computational cost per matrix entry.
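Combining (25) with the operating point quoted in the abstract gives a simple consistency check (the value s = 16 is our illustrative choice): for K = 4096 and unit rate, N = 12, and spending a total of s = 16 additions per column on the wiring matrices yields

\[
\tilde{C} = \frac{s+2}{N} = \frac{16+2}{12} = 1.5
\]

additions per matrix entry, which matches the figure of 1.5 adders per entry for 16-bit accuracy and 4k-vectors stated in the abstract.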
Fig. 4: Average relative total squared error vs. total number of additions per column required for the wiring matrices. The solid lines refer to the lower bound (23), the markers to simulation results. All results are for IID Gaussian codebook matrices with unit rate. Results are averaged over random matrix entries, except for the circular markers, which are averaged over a larger number of random matrix entries.

VI. SIMULATION RESULTS
A. Accuracy of Lower Bound
Consider L wiring matrices, each with two nonzero entries per column, i.e., s_\ell = 1 for all \ell. The total average distortion for IID Gaussian codebook matrices is shown in Fig. 4 vs. s, i.e., the total number of additions per column required for the wiring matrices. Results are averaged over random matrix entries. For unit rate, the lower bound (23) is very tight for large N and otherwise poor over a wide range of s. Remarkably, the jittery behavior for N = 6 seems not to result from insufficient averaging, as going for ten times more matrix samples does not reduce it.

B. Codebook Comparison
For an IID Gaussian target matrix with K = 256 columns, we compare various codebook matrices in Fig. 5. The wiring matrices are designed as described in Section VI-A. For all numbers of rows N that are shown in the figure, self-designing codebooks perform best while two-sparse codebooks perform worst. The mailman codebook, which only exists for N = 8, performs slightly worse than the IID Gaussian codebook. The choice of the codebook does not affect the slope of the curve, but is responsible for an offset shift in the average distortion.

C. Number of Additions per Matrix Entry
Due to their superior performance, we restrict the subsequent considerations to self-designing codebooks. First, we address IID Gaussian target matrices. Here, the target matrix can be used to self-design the codebook matrix, as IID Gaussian codebooks work well. As in the previous subsections, we use multiple wiring matrices with s_\ell = 1 for all of them. Table I shows the average number of additions per matrix entry to achieve at least the accuracy of standard signed integer arithmetic evaluated by (2). For 8-bit accuracy, the code rate has little, for K = 1024 even no, impact. For larger and lower accuracies, higher and lower code rates are favored, respectively. This trend is slightly distorted by quantization effects, but clear enough to be spotted. In comparison to the benchmark defined in Section III, the computational cost is reduced by 80%.

This behavior is explained as follows: There is the fixed computational cost of 2K additions for the codebook matrix. This cost is shared by the fewer rows, the larger is the code rate. This increases the computational burden per row (and also per matrix entry) for larger code rates. For high accuracy, the wiring matrices clearly dominate the computational cost, so the computation of the codebook is secondary. Thus, higher rates are favored, as they generally result in lower distortions, see Fig. 3. For low accuracy, only few wiring matrices are needed and the relative cost of the codebook is more important. This shifts the optimum rate towards lower values.

Consider now a random matrix whose entries are uniform IID within [0, 1]. If the same algorithm is used as before, the performance degrades significantly, as such a matrix is not a good codebook. Since all entries are positive, all codewords are constrained to a single orthant of the full space. In order to circumvent this problem, we choose an auxiliary target matrix T̃ with Gaussian IID entries to self-design the codebook. In order to keep that codebook for all subsequent wiring, we use only a single wiring matrix and adapt the parameter s in such a way as to reach the desired accuracy. The results are shown in Table II. A comparison to Table I shows that this procedure only causes a minor degradation, if any at all.

Fig. 5: Average relative total squared error vs. total number of additions per column required for the wiring matrices and Gaussian IID target matrices of size N × 256. Results are averaged over random matrix entries.

TABLE I: Number of additions per matrix entry that are required to reach the accuracy of 2-, 4-, 8-, 16-, and 24-bit signed integer arithmetic or better. Best values for given matrix width shown in boldface.

TABLE II: Number of additions per matrix entry that are required to reach the accuracy of 2-, 4-, 8-, 16-, and 24-bit signed integer arithmetic or better.

VII. CONCLUSIONS & OUTLOOK
Linear computation coding by means of codebook and wiring matrices is a powerful tool to reduce the computational cost of matrix-vector multiplications in deep neural networks. The number of additions per matrix entry to achieve the accuracy of q-bit signed integer arithmetic is roughly given by

\frac{q}{\log_2 K}.    (26)

This means no multiplication and only a single addition for 8-bit integer arithmetic and vectors with at least K = 1000 dimensions.

The idea of computation coding is not restricted to linear functions. Its direct application to multidimensional nonlinear functions promises even greater reductions in the computational cost of neural networks. We conjecture that neural networks need neither have precise weights nor be densely connected, if they are sufficiently deep. As in computation coding of linear functions, any quantization errors at intermediate layers and any level of sparsity may be compensated for with more layers and larger dimensions.

Fast matrix-vector products are also important for applications in other areas of signal processing, e.g., beamforming for wireless multiple-input multiple-output systems [34], compressive sensing, numerical solvers for partial differential equations, etc. This opens up many future research directions based on linear computation coding.

ACKNOWLEDGEMENT
The authors would like to thank Veniamin Morgenshtern, Marc Reichenbach, and Hermann Schulz-Baldes for helpful discussions.

APPENDIX
Average Distortion of Canonically Signed Digit Form
For the canonical signed digit form, the set of reconstruction values is R = {±2^k : k ∈ Z}. Let the target variable t be uniformly distributed within [0, 1]. Thus, only reconstruction values for k ≤ 0 are used.

Consider the interval [2^{k−1}, 2^k] with k ≤ 0. The probability p_k that the variable t falls within that interval is equal to its width w_k = 2^{k−1}. The average mean-squared quantization error within that interval is well known to be w_k^2/12. Averaging over all intervals, the average mean-squared distortion for quantization with a single digit is given by

\sum_{k=-\infty}^{0} p_k \frac{w_k^2}{12} = \frac{1}{12} \sum_{k=-\infty}^{0} w_k^3 = \frac{1}{12} \sum_{k=-\infty}^{0} 2^{3k-3}    (27)
 = \frac{1}{96} \sum_{k=0}^{\infty} 2^{-3k} = \frac{1}{96} \cdot \frac{8}{7} = \frac{1}{3 \cdot 28}.    (28)

By symmetry, the same considerations hold true if t is uniformly distributed within [−1, 0]. Due to the signed digit representation, these considerations even extend to t uniformly distributed within [−1, +1].

For representation by zero digits, t ∈ [−1, +1] is quantized to 0 with average mean-squared distortion equal to 1/3. This is 28 times larger than for quantization with one digit. The canonical signed digit representation is invariant to any scaling by a power of two. Thus, any further digit of representation also reduces the average mean-squared distortion by a factor of 28.
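This value is easy to confirm numerically; a small Monte Carlo sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(4)
t = rng.uniform(0.0, 1.0, 1_000_000)

# quantize each t to the nearest power of two 2**k with k <= 0 (one signed digit)
k = np.floor(np.log2(t))
lower, upper = 2.0 ** k, 2.0 ** (k + 1)
q = np.where(t - lower <= upper - t, lower, upper)

print(np.mean((t - q) ** 2), 1 / (3 * 28))   # both are about 0.0119
```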
Additions Required for Binary Mailman Codebook

Let B_{N×K} denote the binary mailman matrix with N rows and K columns. It can be decomposed recursively as

B_{N \times K} = \begin{bmatrix} B_{(N-1) \times K/2} & B_{(N-1) \times K/2} \\ 0_{1 \times K/2} & 1_{1 \times K/2} \end{bmatrix}.    (29)

Following [28], we decompose h into its first half h_1 and its second half h_2 such that we get

B_{N \times K}\, h = \begin{bmatrix} B_{(N-1) \times K/2} (h_1 + h_2) \\ 1_{1 \times K/2}\, h_2 \end{bmatrix}.    (30)

Let c(N) denote the number of additions to compute the product B_{N×K} h. The recursion (30) implies

c(N) = \frac{K}{2} + c(N-1) + \frac{K}{2} - 1    (31)
      < c(N-1) + 2^N.    (32)

These are K/2 additions for h_1 + h_2, c(N−1) additions for the matrix-vector product, and K/2 − 1 additions to sum the components of h_2. Note that B_{1×2} h = h_2. Thus, c(1) = 0 and c(N) < 2^{N+1} = 2K.
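The recursion (29)-(30) translates directly into code. The following sketch (ours) checks it against the explicit matrix for a small case:

```python
import numpy as np

def mailman_codebook(N: int) -> np.ndarray:
    """Binary mailman matrix: column k holds the binary digits of k, least significant bit in the first row."""
    K = 2 ** N
    return np.array([[(k >> n) & 1 for k in range(K)] for n in range(N)])

def mailman_product(N: int, h: np.ndarray) -> np.ndarray:
    """Recursive evaluation of B_{N x 2^N} h with fewer than 2**(N+1) additions, cf. (29)-(30)."""
    K = 2 ** N
    if N == 1:
        return np.array([h[1]])                 # B_{1x2} h = h_2, no additions
    h1, h2 = h[: K // 2], h[K // 2 :]
    top = mailman_product(N - 1, h1 + h2)       # K/2 additions plus the recursive cost
    bottom = np.array([h2.sum()])               # K/2 - 1 additions
    return np.concatenate([top, bottom])

# check against the explicit matrix for a small case
N = 5
h = np.random.default_rng(5).standard_normal(2 ** N)
assert np.allclose(mailman_codebook(N) @ h, mailman_product(N, h))
```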
Asymptotic Cumulative Distribution Function

In order to show the convergence of the CDF of the squared angle error to the unit step function, recall the following limit holding for any positive x and u:

\lim_{K \to \infty} \left( 1 - \frac{x}{K^u} \right)^K = \begin{cases} 0 & u < 1 \\ e^{-x} & u = 1 \\ 1 & u > 1. \end{cases}    (33)

The limiting behavior of P_{a^2}(r) is, thus, decided by the scaling of 1 − P_{\rho^2}(1 − r) with respect to K. The critical scaling is 1/K. Such a scaling implies a slope of −1 in doubly logarithmic scale. Thus,

\lim_{N \to \infty} \frac{\partial}{\partial (NR)} \log_2 \left[ 1 - P_{\rho^2}(1 - r) \right] = -1.    (34)

Explicit calculation of the derivative yields

\lim_{N \to \infty} \frac{\int_{1-r}^{1} (1-\xi)^{\frac{N-3}{2}} \xi^{-\frac{1}{2}} \log_2(1-\xi)\, \mathrm{d}\xi}{2R \int_{1-r}^{1} (1-\xi)^{\frac{N-3}{2}} \xi^{-\frac{1}{2}}\, \mathrm{d}\xi} = -1    (35)

and saddle point integration gives [35, Chapter 4]

\frac{1}{2R} \log_2 \left[ 1 - (1 - r) \right] = -1.    (36)

This immediately leads to r = 4^{-R}.

REFERENCES

[1] W.-W. Shen, Y.-M. Lin, S.-C. Chen, H.-H. Chang, C.-C. Lin, Y.-F. Chou, D.-M. Kwai, and K.-N. Chen, "3-D stacked technology of DRAM-logic controller using through-silicon via (TSV)," IEEE Journal of the Electron Devices Society, vol. 6, pp. 396–402, Mar. 2018.
[2] S. Hong, O. Auciello, and D. Wouters (Eds.), Emerging Non-Volatile Memories. Springer-Verlag, 2014.
[3] V. Strassen, "Gaussian elimination is not optimal," Numerische Mathematik, vol. 13, pp. 354–356, 1969.
[4] D. Coppersmith and S. Winograd, "Matrix multiplication via arithmetic progressions," Journal of Symbolic Computation, vol. 9, no. 3, pp. 251–280, Mar. 1990.
[5] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Proceedings of the Advances in Neural Information Processing Systems, Montreal, Canada, Dec. 2015.
[6] C. Louizos, K. Ullrich, and M. Welling, "Bayesian compression for deep learning," in Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, Dec. 2017.
[7] U. Evci, F. Pedregosa, A. Gomez, and E. Elsen, "The difficulty of training sparse neural networks," 2019, arXiv:1906.10732v2.
[8] K. Palem and A. Lingamneni, "Ten years of building broken chips: The physics and engineering of inexact computing," ACM Transactions on Embedded Computing Systems, vol. 12, no. 2s, pp. 87:1–87:23, May 2013.
[9] R. A. DeVore, "Nonlinear approximation," Acta Numerica, vol. 7, pp. 51–150, Jan. 1998.
[10] R. M. Gray and D. L. Neuhoff, "Quantization," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2325–2383, Oct. 1998.
[11] A. Karatsuba and Y. Ofman, "Multiplication of multidigit numbers on automata (in Russian)," Doklady Akademii Nauk SSSR, vol. 145, no. 2, pp. 293–294, 1962.
[12] A. L. Toom, "The complexity of a scheme of functional elements simulating the multiplication of integers (in Russian)," Doklady Akademii Nauk SSSR, vol. 150, no. 3, pp. 496–498, 1963.
[13] S. A. Cook and S. O. Aanderaa, "On the minimum computation time of functions," Transactions of the American Mathematical Society, vol. 142, pp. 291–314, Aug. 1969.
[14] A. Schönhage and V. Strassen, "Schnelle Multiplikation großer Zahlen," Computing, vol. 7, pp. 281–292, Sep. 1971.
[15] M. Fürer, "Faster integer multiplication," SIAM Journal on Computing, vol. 39, no. 3, pp. 979–1005, 2009.
[16] D. Harvey, J. van der Hoeven, and G. Lecerf, "Even faster integer multiplication," Journal of Complexity, vol. 36, pp. 1–30, Oct. 2016.
[17] A. D. Booth, "A signed binary multiplication technique," The Quarterly Journal of Mechanics and Applied Mathematics, vol. 4, no. 2, pp. 236–240, Jan. 1951.
[18] M. G. Arnold, T. A. Bailey, J. R. Cowles, and J. J. Cupal, "Redundant logarithmic arithmetic," IEEE Transactions on Computers, vol. 39, no. 8, pp. 1077–1086, Aug. 1990.
[19] H. Huang, M. Itoh, and T. Yatagai, "Modified signed-digit arithmetic based on redundant bit representation," Applied Optics, vol. 33, no. 26, pp. 6146–6156, Sep. 1994.
[20] R. I. Hartley, "Subexpression sharing in filters using canonic signed digit multipliers," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 43, no. 10, pp. 677–688, Oct. 1996.
[21] V. Lefèvre, Moyens arithmétiques pour un calcul fiable. Ph.D. thesis, École Normale Supérieure de Lyon, Jan. 2000; see also INRIA Research Report no. 4192, May 2001.
[22] V. S. Dimitrov, G. A. Jullien, and W. C. Miller, "Theory and applications of the double-base number system," IEEE Transactions on Computers, vol. 48, no. 10, pp. 1098–1106, Oct. 1999.
[23] V. S. Dimitrov, K. U. Järvinen, and J. Adikari, "Area-efficient multipliers based on multiple-radix representations," IEEE Transactions on Computers, vol. 60, no. 2, pp. 189–201, Feb. 2011.
[24] N. Boullis and A. Tisserand, "Some optimizations of hardware multiplication by constant matrices," IEEE Transactions on Computers, vol. 54, no. 10, pp. 1271–1282, Oct. 2005.
[25] L. Aksoy, P. Flores, and J. Monteiro, "A novel method for the approximation of multiplierless constant matrix vector multiplication," EURASIP Journal on Embedded Systems, May 2016.
[26] M. A. Kronrod, V. L. Arlazarov, E. A. Dinic, and I. A. Faradzev, "On economic construction of the transitive closure of a directed graph (in Russian)," Doklady Akademii Nauk SSSR, vol. 194, no. 3, pp. 487–488, 1970.
[27] R. Williams, "Matrix-vector multiplication in sub-quadratic time (some preprocessing required)," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2007, pp. 995–1001.
[28] E. Liberty and S. W. Zucker, "The mailman algorithm: A note on matrix-vector multiplication," Information Processing Letters, vol. 109, pp. 179–182, Jan. 2009.
[29] R. Müller, B. Gäde, and A. Bereyhi, "Efficient matrix multiplication: The sparse power-of-2 factorization," in Proc. of Information Theory & Applications Workshop, San Diego, CA, Feb. 2020, https://arxiv.org/abs/2002.04002v2.
[30] T. Jiang, "The asymptotic distribution of the largest entries of sample correlation matrices," The Annals of Applied Probability, vol. 14, no. 2, pp. 865–880, 2004.
[31] T. Berger, Rate Distortion Theory. Englewood Cliffs, NJ: Prentice-Hall, 1971.
[32] R. Müller, "On approximation, bounding & exact calculation of block error probability for random code ensembles," IEEE Transactions on Communications, 2021, early access.
[33] J. Spanier and K. B. Oldham, An Atlas of Functions. Berlin, Germany: Springer-Verlag, 1987.
[34] O. Castañeda, S. Jacobsson, G. Durisi, T. Goldstein, and C. Studer, "Finite-alphabet MMSE equalization for all-digital massive MU-MIMO mmWave communication," IEEE Journal on Selected Areas in Communications, vol. 38, no. 9, pp. 2128–2141, Sep. 2020.
[35] N. Merhav, "Statistical physics and information theory," Foundations and Trends in Communications and Information Theory, vol. 6, no. 1–2, pp. 1–212, 2009.