The Two-Pass Softmax Algorithm
Marat Dukhan* and Artsiom Ablavatski
Google Research, Georgia Institute of Technology

*Corresponding author: [email protected]
Abstract
The softmax (also called softargmax) function is widely used in machine learning models to normalize real-valued scores into a probability distribution. To avoid floating-point overflow, the softmax function is conventionally implemented in three passes: the first pass to compute the normalization constant, and two other passes to compute outputs from normalized inputs. We analyze two variants of the Three-Pass algorithm and demonstrate that in a well-optimized implementation on HPC-class processors the performance of all three passes is limited by memory bandwidth.

We then present a novel algorithm for softmax computation in just two passes. The proposed Two-Pass algorithm avoids both numerical overflow and the extra normalization pass by employing an exotic representation for intermediate values, where each value is represented as a pair of floating-point numbers: one representing the “mantissa” and another representing the “exponent”.

Performance evaluation demonstrates that on out-of-cache inputs on an Intel Skylake-X processor the new Two-Pass algorithm outperforms the traditional Three-Pass algorithm by up to 28% in the AVX512 implementation, and by up to 18% in the AVX2 implementation. The proposed Two-Pass algorithm also outperforms the traditional Three-Pass algorithm on Intel Broadwell and AMD Zen 2 processors.

To foster reproducibility, we released an open-source implementation of the new Two-Pass softmax algorithm and the other experiments in this paper as a part of the XNNPACK library at GitHub.com/google/XNNPACK.
1 Introduction

The softmax (also called softargmax) function is a smooth approximation to the argmax function. The softmax function σ(x) operates on a vector of real-valued scores x_i and normalizes them into a probability distribution

    p_i = σ(x)_i = e^{x_i} / Σ_k e^{x_k}

where p_i ≥ 0 and Σ_i p_i = 1.
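For concreteness, this definition translates directly into scalar C. The following is a minimal sketch of the naïve formula; as discussed in Sec. 3, this direct form overflows for large scores and serves only as a baseline:

    #include <math.h>
    #include <stddef.h>

    /* Naive softmax, straight from the definition: overflow-prone for x[i] > ~89. */
    void softmax_naive(size_t N, const float *x, float *y) {
        float sum = 0.0f;
        for (size_t i = 0; i < N; i++)
            sum += expf(x[i]);          /* may overflow to +infinity */
        for (size_t i = 0; i < N; i++)
            y[i] = expf(x[i]) / sum;    /* infinity / infinity produces NaN */
    }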
The softmax function is commonly used in machine learning to give a probabilistic interpretation to the outputs of classification models [1]. Such machine learning models output a probability for every possible object class, and the number of classes in modern datasets can reach millions. For example, natural language processing models may predict a probability distribution over each possible word in a vocabulary, and recommendation systems may model the probability distribution over users, products, web pages, or their interactions. Table 1 summarizes the number of classes in several public classification datasets.

Table 1: Characteristics of several public machine learning datasets.

    Dataset                  Class description    Class count
    ImageNet [2]             Image category       21,841
    One Billion Word [3]     Unique words         793,471
    Wikilinks [4]            Wikipedia pages      2,933,659
    DepCC [5]                Web documents        364.80 million

Hierarchical Softmax [6] (HSM) and its modifications [7] are common techniques to scale classification models to a large number of classes. HSM models jointly consider the softmax function and the matrix-matrix multiplication that produced its input, and replace them with a low-rank approximation. Thus, HSM methods improve performance by reducing the matrix-matrix multiplication cost, and do so by approximating the original machine learning model.

Unlike previous research, we focus on the softmax function in the context of inference using a pre-trained model. This situation commonly arises in machine learning frameworks, such as TensorFlow [8] or PyTorch [9], when the training dataset or the metaparameters needed to devise an accurate approximation to the model are not available. In this case, the model must be computed exactly according to its specification, and unsafe approximations are not possible.

Our contributions in this paper are as follows:

• We demonstrate that a well-optimized implementation of the softmax function can be memory-bound even in single-threaded execution. This result emphasizes the importance of eliminating memory operations for further improvements in the performance of the softmax function.

• We introduce a novel algorithm for computing the softmax function. The new algorithm employs an exotic representation for intermediate values, where each value is represented as a pair of floating-point numbers: one representing the “mantissa” and another representing the “exponent”. Thanks to this special representation of intermediate results, the new algorithm needs only two passes over an input vector versus three passes for traditional algorithms.

• We present and evaluate high-performance implementations of the new Two-Pass softmax algorithm for x86-64 processors with the AVX2 and AVX512F SIMD extensions. The experimental study confirms speedups of up to 28% on an Intel Skylake-X system.

The proposed improvements to the softmax implementation are orthogonal to matrix-matrix multiplication optimizations, and can be combined with sparsification [10, 11], low-rank decomposition [12], low-precision arithmetic [13–15], or hardware acceleration [16, 17] for the matrix-matrix multiplication that produces the softmax input.
2 Related Work

The softmax function is widely used in modern machine learning models and algorithms. In particular, it gained wide popularity in Natural Language Processing as a building block for language models, which predict a word or n-gram out of a large vocabulary [18]. As the softmax function plays an important role in machine learning models, several approximations have been proposed [7, 19–24] to make the calculations efficient and fast. We refer the reader to Jozefowicz et al. [18] for a good overview of recent approaches to speeding up softmax calculations.

Although a lot of work has been done to address the computational complexity of the softmax function on large inputs, the proposed solutions either 1) require special hardware (e.g., FPGAs) or 2) require using an approximation to the softmax function during model training. In the latter case, the softmax approximation cannot be directly applied to a pre-trained machine learning model, which dramatically limits its use in existing machine learning frameworks, e.g., TensorFlow [8] and PyTorch [9]. Therefore, we developed a novel algorithm that improves the performance of the original softmax function on widely available hardware, can be used with pre-trained machine learning models, and can be implemented in any machine learning framework.
3 The Three-Pass Softmax Algorithms

Direct calculation of the softmax function according to the formula σ(x)_i = e^{x_i} / Σ_k e^{x_k} is fraught with numerical issues. The single-precision e^x function overflows for x > 89, producing floating-point infinity (the result is too large to be represented as a finite single-precision number), and underflows for x < −104, producing 0 (the result is too small to be distinguishable from zero in the single-precision format); the division of such saturated values can further yield NaN (Not a Number, a special floating-point value defined by the IEEE 754 standard indicating an invalid result) in naïve implementations of the softmax function. For example, a naïve single-precision implementation maps the input (1000, 1000) to ∞/∞ = NaN, whereas the mathematically correct result is (0.5, 0.5). In practice, the parts of machine learning models that produce the input to the softmax function are rarely bounded, and thus an implementation can't assume that the input falls into such a narrow range.

To overcome these numerical instability issues, machine learning frameworks adopt a workaround utilizing the equivalence [25]:

    σ(x)_i = e^{x_i} / Σ_k e^{x_k} = e^{x_i − c} / Σ_k e^{x_k − c}

which holds for any value of c. In particular, if c = max_i x_i, then:

• no input to the e^x function exceeds zero, and

• there is at least one zero input to the e^x function, and thus the denominator of the fraction is non-zero.

Together, these properties result in good numerical behavior of the computation and lead to Algorithm 1.
Algorithm 1: The Three-Pass softmax algorithm with recomputation of the exponential function.
    function SoftmaxThreePassRecompute(X, Y)
        N ← Length(X)
        μ ← max_{1 ≤ i ≤ N} X_i                    ▷ Pass 1: read X
        σ ← Σ_{1 ≤ i ≤ N} Exp(X_i − μ)             ▷ Pass 2: read X
        λ ← 1 / σ
        for all 1 ≤ i ≤ N do
            Y_i ← λ · Exp(X_i − μ)                 ▷ Pass 3: read X, write Y
        end for
    end function
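A minimal scalar C rendering of Algorithm 1 may look as follows (an illustrative sketch only; the implementations evaluated in this paper are vectorized for AVX2/AVX512 and auto-tuned, as described in Sec. 6.3):

    #include <math.h>
    #include <stddef.h>

    void softmax_three_pass_recompute(size_t N, const float *x, float *y) {
        float mu = -INFINITY;
        for (size_t i = 0; i < N; i++)       /* Pass 1: read x */
            mu = fmaxf(mu, x[i]);
        float sigma = 0.0f;
        for (size_t i = 0; i < N; i++)       /* Pass 2: read x */
            sigma += expf(x[i] - mu);
        const float lambda = 1.0f / sigma;
        for (size_t i = 0; i < N; i++)       /* Pass 3: read x, write y */
            y[i] = lambda * expf(x[i] - mu);
    }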
Both Pass 2 and Pass 3 of Algorithm 1 compute e^{x_i − μ} with the same x_i and μ values. This observation hints at a potential optimization: if computing the e^x function is expensive, we could save the computed e^{x_i − μ} values in Pass 2 to avoid recomputing them in Pass 3. Such a modification is presented in Algorithm 2.

Algorithm 2: The Three-Pass softmax algorithm with reloading of computed exponentials.
    function SoftmaxThreePassReload(X, Y)
        N ← Length(X)
        μ ← max_{1 ≤ i ≤ N} X_i                    ▷ Pass 1: read X
        σ ← 0
        for all 1 ≤ i ≤ N do
            Y_i ← Exp(X_i − μ)                     ▷ Pass 2: read X, write Y
            σ ← σ + Y_i
        end for
        λ ← 1 / σ
        for all 1 ≤ i ≤ N do
            Y_i ← λ · Y_i                          ▷ Pass 3: read Y, write Y
        end for
    end function
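The corresponding scalar C sketch differs from the previous one only in the last two passes, which now store and then reload the computed exponentials:

    #include <math.h>
    #include <stddef.h>

    void softmax_three_pass_reload(size_t N, const float *x, float *y) {
        float mu = -INFINITY;
        for (size_t i = 0; i < N; i++)       /* Pass 1: read x */
            mu = fmaxf(mu, x[i]);
        float sigma = 0.0f;
        for (size_t i = 0; i < N; i++) {     /* Pass 2: read x, write y */
            y[i] = expf(x[i] - mu);
            sigma += y[i];
        }
        const float lambda = 1.0f / sigma;
        for (size_t i = 0; i < N; i++)       /* Pass 3: read y, write y */
            y[i] *= lambda;
    }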
Algorithm 2 computes the e^{x_i − μ} values only once, but this reduction in the amount of computation comes at a cost: the second pass of Algorithm 2 does both a read and a write for each element, unlike Algorithm 1, where the second pass does only reads.

4 The Two-Pass Softmax Algorithm

The Three-Pass Algorithms 1 and 2 avoid numerical issues by normalizing inputs relative to their maximum value, but then require an additional memory pass to find that maximum value. In this section we suggest that it is possible to get numerical stability without the extra memory pass, and present a practical Two-Pass algorithm for softmax computation.

The immediate reason for the numerical instability of a naïve softmax implementation is the saturation of the e^x function for inputs outside of a narrow range (approximately [−104, 89] in single precision), so we look inside the implementation of the e^x function for a solution.

The e^x function can be implemented in an infinite number of ways, but practical implementations [26–29] in IEEE floating-point arithmetic follow the traditional structure of elementary function implementation [30], and include three steps:

1. Range reduction, where the problem of approximating e^x on the infinite input domain x ∈ (−∞, +∞) is reduced to a problem of approximating e^x on a small finite range. For e^x, a natural range reduction derives from the equivalence

       e^x = e^t · 2^n,  where n = ⌊x · log₂e⌉ ∈ ℤ (rounding to the nearest integer) and t = x − n · ln 2 ∈ [−ln2/2, ln2/2],

   which decomposes approximating e^x on x ∈ (−∞, +∞) into approximating e^t on t ∈ [−ln2/2, ln2/2] and multiplying the result by 2^n, where n is an integer. For example, e^10 = e^{10 − 14·ln2} · 2^14, with n = ⌊10 · log₂e⌉ = 14 and reduced argument t = 10 − 14·ln2 ≈ 0.296.

2. Approximation of the function on the reduced range, i.e. on [−ln2/2, ln2/2] for e^t. This step is achieved through a polynomial or rational approximation, often in combination with table look-ups.

3. Reconstruction of the final e^x value from the approximation on the reduced range. For e^x, the reconstruction step consists of the multiplication of m := e^t by 2^n, and can be achieved at low cost by manipulating the exponent field of the binary floating-point number m. It is in this step that the underflow and overflow situations arise: e^t ∈ [√2/2, √2] and thus always fits into a single-precision floating-point number, but n = ⌊x · log₂e⌉ can exceed the range of the exponent field, causing underflow or overflow.

The key idea that enables the Two-Pass softmax algorithm is to remove the reconstruction step, and instead keep the result of e^x as a pair of floating-point values (m, n), where m = e^t. Mathematically, e^x = m · 2^n, but in general this expression cannot be evaluated in floating-point arithmetic without overflowing or underflowing the result. The representation of n as a floating-point number is important: although n is by design always an integer, it can have a very large magnitude, and fall outside of the range of standard integer formats. Therefore, the result of e^x must be represented as two, rather than one, floating-point numbers. Using multiple floating-point numbers to represent a real-valued result has a long history in double-double, triple-double, and quad-double [31] representations, and in Error-Free Transformations [32]. However, these representations use multiple floating-point numbers to improve the precision of floating-point arithmetic, whereas we suggest using two floating-point numbers to extend its dynamic range.
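To illustrate the idea ahead of the formal Algorithm 3 below, here is a scalar C sketch of softmax built on such pairs. This is a simplified rendering under assumed helper names (extexp, scale2); it relies on libm's expf on the reduced range instead of the paper's polynomial, and flushes contributions below 2^−126 to zero:

    #include <math.h>
    #include <stddef.h>

    /* e^x as a pair (m, n) with e^x == m * 2^n and m = e^t in [sqrt(2)/2, sqrt(2)];
     * the reconstruction step (multiplication by 2^n) is deliberately omitted. */
    static void extexp(float x, float *m, float *n) {
        const float ni = nearbyintf(x * 1.4426950f);  /* n = round(x * log2(e)) */
        *m = expf(x - ni * 0.69314718f);   /* reduced argument: no over/underflow */
        *n = ni;
    }

    /* m * 2^d for d <= 0; d is kept as a float because its magnitude can exceed
     * any integer exponent, in which case the contribution is flushed to zero. */
    static float scale2(float m, float d) {
        return d < -126.0f ? 0.0f : ldexpf(m, (int)d);
    }

    void softmax_two_pass(size_t N, const float *x, float *y) {
        float m_sum = 0.0f, n_sum = -INFINITY;
        for (size_t i = 0; i < N; i++) {      /* Pass 1: read x */
            float m_i, n_i;
            extexp(x[i], &m_i, &n_i);
            const float n_max = fmaxf(n_i, n_sum);
            m_sum = scale2(m_i, n_i - n_max) + scale2(m_sum, n_sum - n_max);
            n_sum = n_max;
        }
        const float lambda = 1.0f / m_sum;
        for (size_t i = 0; i < N; i++) {      /* Pass 2: read x, write y */
            float m_i, n_i;
            extexp(x[i], &m_i, &n_i);
            y[i] = scale2(m_i * lambda, n_i - n_sum);
        }
    }

Note that the scale factors 2^{n_i − n_max} and 2^{n_sum − n_max} never exceed one, so the accumulation cannot overflow regardless of the input range.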
Algorithm 3: The Two-Pass softmax algorithm. ExtExp denotes an exponential function that produces a pair (m, n) of floating-point values such that e^x = m · 2^n.
    function SoftmaxTwoPass(X, Y)
        N ← Length(X)
        m_sum ← 0
        n_sum ← −∞
        for all 1 ≤ i ≤ N do                       ▷ Pass 1: read X
            m_i, n_i ← ExtExp(X_i)
            n_max ← Max(n_i, n_sum)
            m_sum ← m_i · 2^{n_i − n_max} + m_sum · 2^{n_sum − n_max}
            n_sum ← n_max
        end for
        λ_sum ← 1 / m_sum
        for all 1 ≤ i ≤ N do                       ▷ Pass 2: read X, write Y
            m_i, n_i ← ExtExp(X_i)
            Y_i ← m_i · λ_sum · 2^{n_i − n_sum}
        end for
    end function

Algorithm 3 presents the softmax computation in just two passes by implementing addition for the (m, n) representation. The reduction pass keeps track of the running maximum n value among all elements, and accumulates the scaled m values into the running sum. It avoids floating-point overflow by scaling the m values by 2 raised to the difference between the corresponding n values and the maximum of the n values. As this difference is never positive, the m values are never scaled up, which ensures the absence of floating-point overflow.

5 Memory Bandwidth Analysis

While the number of memory passes in the presented softmax algorithms is evident from their names, the number of actual memory operations is more nuanced. Every pass of the Three-Pass algorithm with recomputing reads the input array, while the last pass also writes the output array. The Three-Pass algorithm with reloading just reads the input array in the first pass, reads the input array and writes the output array in the second pass, and reads-modifies-writes the output array in the last pass. The Two-Pass algorithm reads the input array in both passes, and also writes the output array in the second pass. Thus, the memory bandwidth requirements of the Two-Pass algorithm are similar to just the last two passes of the Three-Pass algorithm with recomputing.

Table 2 summarizes the number of memory reads, memory writes, and the memory bandwidth cost for the three algorithms on arrays of N elements. Per Table 2, the Two-Pass algorithm has a memory bandwidth advantage of 33% over the Three-Pass algorithm with recomputing and 67% over the Three-Pass algorithm with reloading. In practice, we should treat these numbers as upper bounds, because the higher computational complexity of the Two-Pass algorithm cuts into the gains from bandwidth reduction.

6 Experimental Evaluation

6.1 Experimental Platform

We evaluate the performance of the three softmax algorithms on the Intel Xeon W-2135 processor, based on the Skylake-X microarchitecture and with the characteristics listed in Table 3. To improve performance stability, we disabled dynamic frequency scaling in the processor for the duration of our experiments.

Table 3: Characteristics of the Intel Xeon W-2135 processor used for experimental evaluation of the softmax algorithms.

    Characteristic                          Value
    Microarchitecture                       Skylake-X
    Number of cores                         6
    Number of hyperthreads                  12
    Base frequency                          3.7 GHz
    L1 cache size (per core)                32 KB
    L2 cache size (per core)                1 MB
    L3 cache size (shared by all cores)     8.25 MB
    AVX2 / AVX512 FMA throughput            2 / cycle
    AVX2 / AVX512 FMA latency               4 cycles

Additionally, in Sec. 6.8 we replicate a subset of the experiments on a Broadwell-based Intel Xeon E5-2696 v4 processor and on an AMD Ryzen 9 3900X processor with the Zen 2 microarchitecture.
6.2 Benchmarking Protocol

We use the Google Benchmark framework to estimate the sustained performance of the softmax implementations. We set the minimum run time for each measurement to 5 seconds, repeat each measurement 25 times, and record the median of the 25 runs. In each benchmark run we simulate the cache state during neural network inference: the output vector is evicted from the cache before each iteration, but the input tensor stays in the cache as long as it fits.
6.3 Softmax Implementations

We developed highly optimized implementations of the Three-Pass Algorithms 1 and 2 and the Two-Pass Algorithm 3 in C. For all three algorithms, we did two templated implementations, targeting the AVX2 and AVX512 instruction sets. We expressed high-level optimization parameters, such as the unroll factor for the loops and the number of accumulator variables in reduction functions, as meta-parameters of the templated implementations, and employed auto-tuning to discover their optimal values.

An efficient implementation of the vectorized e^x function is a key component of all softmax variants. For our implementation, we adapted the throughput-optimized methods of Dukhan and Vuduc [29] to single-precision floating-point evaluation. Algorithm 4 presents the resulting table-free, branch-free, and division-free algorithm.

Table 2: Theoretical analysis of the memory complexity and bandwidth costs of the three softmax algorithms.

    Algorithm                  Memory reads    Memory writes    Bandwidth cost
    Three-Pass (Recompute)     3N              1N               4N
    Three-Pass (Reload)        3N              2N               5N
    Two-Pass                   2N              1N               3N
Algorithm 4: Calculation of e^x in the Three-Pass softmax algorithms.

    function Exp(x)
        n ← ⌊x · log₂e⌉
        t ← x − n · ln 2                                   ▷ Cody-Waite range reduction
        p ← c₀ + t(c₁ + t(c₂ + t(c₃ + t(c₄ + t · c₅))))    ▷ Polynomial approximation
        y ← p · 2^n                                        ▷ Reconstruction
        return y
    end function

Algorithm 4 follows the traditional structure of elementary function implementation [30], described in Sec. 4. It starts with a range reduction to reduce approximation on the infinite input domain to approximation on a small and finite range. The calculation of the reduced argument t in Algorithm 4 uses Cody-Waite range reduction [33]: ln 2 is represented as a sum of two single-precision constants, ln2_hi + ln2_lo, to improve the accuracy of this step. Range reduction results in a reduced argument t in the [−ln2/2, ln2/2] range and a reduced integer argument n. Next, e^t is approximated on [−ln2/2, ln2/2] with a degree-5 polynomial. The polynomial coefficients are produced by the algorithm of Brisebarre and Chevillard [34] as implemented in the Sollya software [35]. Following [29], we evaluate the approximation polynomial with the Horner scheme using Fused Multiply-Add instructions to minimize the number of floating-point instructions and maximize throughput. In the last stage, Algorithm 4 reconstructs the final output value of the function by multiplying the polynomial approximation result p by 2^n. In the AVX2 implementation, we do this multiplication by directly manipulating the floating-point exponent to construct a scale number s:

    s := 2^n  if n ≥ −126
    s := 0    if n < −126

and compute the final step as y ← p · s. This reconstruction trick has two built-in assumptions: the argument x to e^x is negative (always the case for the e^x evaluation in the Three-Pass softmax algorithms), and subnormal floating-point numbers can be flushed to zero without significant accuracy impact. The reconstruction step in the AVX512 implementation leverages the new VSCALEFPS instruction [36], which computes p · 2^n as a single hardware operation.

The resulting e^x implementation has a maximum error under 2 ULP, validated through exhaustive evaluation on all valid inputs. This accuracy is comparable to other vectorized elementary function implementations; e.g., SIMD functions in the GNU LibM library guarantee a maximum error under 4 ULP.

The implementation of ExtExp in the Two-Pass softmax algorithm is similar to Algorithm 4 with the reconstruction step removed. Thus, the implementations of both the Three-Pass and the Two-Pass algorithms use exactly the same range reduction and approximating polynomials to compute the exponential function.
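For illustration, a scalar C sketch with the same structure is shown below. The Cody-Waite constants are the widely used single-precision split of ln 2, while the polynomial uses Taylor coefficients as stand-ins for the minimax coefficients that the paper generates with Sollya, so this sketch does not attain the 2-ULP bound:

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    /* 2^n for integer n; flushes would-be subnormal scales to zero,
     * matching the AVX2 reconstruction trick described above. */
    static float scale_exp2i(int n) {
        if (n < -126) return 0.0f;
        const uint32_t bits = (uint32_t)(n + 127) << 23;  /* set the exponent field */
        float s;
        memcpy(&s, &bits, sizeof s);
        return s;
    }

    /* Scalar sketch of Algorithm 4; assumes x <= 0, as in the softmax passes. */
    float exp_sketch(float x) {
        const float log2e  = 1.4426950f;
        const float ln2_hi = 0.693145751953125f;   /* Cody-Waite split: */
        const float ln2_lo = 1.4286067653e-6f;     /* ln2_hi + ln2_lo ~ ln(2) */
        const float n = nearbyintf(x * log2e);
        const float t = (x - n * ln2_hi) - n * ln2_lo;  /* reduced argument */
        /* Degree-5 polynomial for e^t on [-ln2/2, ln2/2], Horner scheme. */
        const float p = 1.0f + t * (1.0f + t * (0.5f + t * (0.16666667f
                            + t * (0.041666668f + t * 0.008333334f))));
        const int ni = n < -150.0f ? -150 : (int)n;  /* clamp before int conversion */
        return p * scale_exp2i(ni);                  /* reconstruction: p * 2^n */
    }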
6.4 Three-Pass Algorithms

Figure 1: Performance comparison of the softmax Algorithms 1 and 2 in the AVX512 implementations on the Skylake-X system. Gray dotted lines denote boundaries of the level-1, level-2, and level-3 caches.

Fig. 1 presents the performance of the Three-Pass softmax Algorithm 1 with recomputation of exponentials and the Three-Pass softmax Algorithm 2 with reloading of computed exponentials, both in the AVX512 implementations. Reloading of exponential computations delivers a 30–55% speedup when the data is small enough to fit into the private L1 and L2 caches, but turns into a 15% slowdown when operating on the L3 cache, and eventually levels off at a 4–6% advantage when the working set exceeds the last-level cache.
Figure 2: Performance comparison of the softmax Algorithms 1 and 2 in the AVX2 implementations on the Skylake-X system. Gray dotted lines denote boundaries of the level-1, level-2, and level-3 caches.

The AVX2 implementations of the same Three-Pass softmax Algorithms 1 and 2, illustrated in Fig. 2, demonstrate similar trends. As the working set increases, the 13–16% speedup from reloading of exponential computations goes down, and eventually levels off at 3–6% for large arrays.

The small difference between recomputing and reloading of exponential computations in Fig. 1 and Fig. 2 suggests that, despite the expensive exponential function, softmax computation might be memory-bound for large arrays. To directly test this hypothesis, we decompose Algorithms 1 and 2 into individual memory passes, and compare the measured bandwidth to the STREAM benchmark [37].

Figure 3: Measured memory bandwidth on the Skylake-X system in the three passes of the softmax Algorithms 1 and 2, and in the STREAM benchmark. Both the softmax implementations and the STREAM benchmark use AVX512 instructions.

Figure 4: Measured memory bandwidth on the Skylake-X system in the three passes of the softmax Algorithms 1 and 2, and in the STREAM benchmark. Both the softmax implementations and the STREAM benchmark use AVX2 instructions.

Fig. 3 and Fig. 4 illustrate the memory bandwidth in each pass of the Three-Pass softmax Algorithms 1 and 2, as well as in the Copy and Scale STREAM benchmarks [37]. As recommended in the STREAM documentation, we set the array size to four times the size of the last-level cache (8,650,752 single-precision elements for softmax, and 4,325,376 double-precision elements for STREAM). The first softmax pass (max-reduction) is the same in both versions of the Three-Pass algorithm, and is thus presented only once. This pass reads one input array and doesn't have a direct equivalent in STREAM, but achieves bandwidth similar to the STREAM Copy and Scale benchmarks, which both read one array and write one array. The second pass of Algorithm 1 reads one array, computes exponentials on the inputs, and accumulates them. It achieves 91% of STREAM Copy bandwidth in the AVX512 version and 71% in the AVX2 version. The second pass of Algorithm 2 is similar, but additionally stores the computed exponentials into the output array. Although it achieves higher bandwidth than the second pass of Algorithm 1, it takes substantially longer to complete; the higher bandwidth is due to doubling the number of transferred bytes with a less-than-proportional increase in run time. The third pass of Algorithm 1 reads one array, computes exponentials on the inputs, scales them, and writes the results to another array. This pass does the same number of memory operations as the STREAM Scale benchmark, but substantially more computational operations. Yet, our auto-tuned implementations exceed the performance of the STREAM Scale benchmark in both the AVX512 and the AVX2 versions. The third pass of Algorithm 2 is an in-place variant of the STREAM Scale benchmark. The processor clearly favors the in-place operation: it is 86% faster than STREAM Scale with AVX512, and 84% faster with AVX2.

To summarize, passes 1 and 3 of the Algorithm 1 with recomputing demonstrate memory performance similar to the STREAM benchmark, and pass 2 in the AVX512 implementation is not far behind. Passes 1 and 2 of the Algorithm 2 with reloading similarly perform close to STREAM bandwidth, and pass 3 is significantly faster than the STREAM Scale benchmark. These results confirm that the performance of the Three-Pass softmax algorithms is limited by achievable memory bandwidth, and suggest that softmax computation can be further accelerated only through reducing the number of memory operations.
6.5 Two-Pass Algorithm

Figure 5: Performance comparison of Algorithms 1, 2, and 3 in the AVX512 implementations. Gray dotted lines denote boundaries of the level-1, level-2, and level-3 caches.

In Fig. 5 we present the performance of the Two-Pass softmax algorithm in comparison with the two versions of the Three-Pass algorithm in the AVX512 implementations. On out-of-cache working sets, the proposed Two-Pass softmax algorithm outperforms the Three-Pass algorithms by 18–28%.
Figure 6: Performance comparison of Algorithms 1, 2, and 3 in the AVX2 implementations. Gray dotted lines denote boundaries of the level-1, level-2, and level-3 caches.

Fig. 6 similarly compares the performance of the Two-Pass and the Three-Pass algorithms in the AVX2 implementations. Here, the Two-Pass algorithm outperforms the Three-Pass algorithm with reloading of exponential computations by 16–18% on out-of-cache workloads. The smaller speedups, compared to the AVX512 implementation, are explained by the relatively higher cost of recomputing exponentials in AVX2. Compared to the Three-Pass Algorithm 1, which similarly recomputes exponentials, the Two-Pass algorithm wins by at least 19%, which points to the low overhead of accumulation in the (m, n) representation compared to accumulating plain scalar floating-point values.

6.6 Multi-Core Scalability

The benchmarks in Sec. 6.4 and Sec. 6.5 presented the performance of single-threaded softmax computation, and demonstrated that on HPC-class systems softmax saturates memory bandwidth even when running on a single core. Utilizing multiple cores increases available computational resources faster than achievable memory bandwidth, and therefore increases the advantage of the bandwidth-saving Two-Pass softmax algorithm. To quantify this advantage, we fix the size of the array at four times the last-level cache size [37], and vary the number of threads from 1 to 6 (the number of cores in the system) and then to 12 (the number of logical processors, including hyperthreads, in the system).
Figure 7: Absolute runtime of the passes in Algorithms 1, 2, and 3 in both the AVX2 and the AVX512 implementations. The algorithms were evaluated on arrays of 8,650,752 single-precision elements on the Skylake-X system.

Figure 8: Weak scaling of the softmax algorithms in the AVX512 implementations on the Skylake-X system.

Fig. 8 illustrates weak multi-core scaling of the AVX512 implementations. As the number of threads grows, the advantage of the Two-Pass over the Three-Pass algorithms remains unchanged at 25% or more.
Figure 9: Weak scaling of the softmax algorithms in the AVX2 implementations on the Skylake-X system.

Fig. 9 similarly illustrates weak multi-core scaling of the AVX2 implementations. The advantage of the Two-Pass over the Three-Pass algorithms grows from 9% on a single core to 19% when utilizing all 6 cores, and to 22% when also using hyperthreads.
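The clean scaling follows from the associativity of the (m, n) accumulation: each thread can reduce its own slice of the input independently, and the per-thread partial sums are merged pairwise afterwards. Below is a sketch of such a merge step (our own illustration with assumed names; the paper does not spell out its threading code):

    #include <math.h>

    static float scale2_by(float m, float d) {     /* m * 2^d for d <= 0 */
        return d < -126.0f ? 0.0f : ldexpf(m, (int)d);
    }

    /* Merge partial sum (m_b, n_b) into (m_a, n_a): the same scaled addition
     * as in Pass 1 of Algorithm 3, applied to two accumulators instead of to
     * an element. */
    static void extsum_merge(float *m_a, float *n_a, float m_b, float n_b) {
        const float n_max = fmaxf(*n_a, n_b);
        *m_a = scale2_by(*m_a, *n_a - n_max) + scale2_by(m_b, n_b - n_max);
        *n_a = n_max;
    }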
6.7 Comparison with Intel DNNL

The results in Sec. 6.5-6.6 demonstrate that on out-of-cache inputs the Two-Pass softmax algorithm outperforms the Three-Pass softmax algorithms in our implementation, but leave open the question of whether our implementations are competitive with the state of the art. To demonstrate the absolute effectiveness of the Two-Pass algorithm, we compared our implementations of the three softmax algorithms to the softmax primitive in the Intel Deep Neural Network Library (DNNL) version 1.1.1.

Intel DNNL implements the Three-Pass softmax Algorithm 2 with reloading of computed exponentials. It includes implementations for the SSE4.1, AVX, and AVX512 instruction sets, and automatically dispatches to the AVX512 implementation on the Skylake-X processor. Unlike our implementations, Intel DNNL generates its implementation at runtime using Just-in-Time (JIT) technology. JIT code generation potentially allows adapting the implementation to the parameters of a particular softmax operator (e.g., the number of channels).

Fig. 10 presents the comparison between the two implementations (ours and DNNL's) of the Three-Pass algorithm with reloading of exponentials, and the Two-Pass softmax algorithm in our implementation. For the Three-Pass algorithm with reloading, our implementation outperforms Intel DNNL for the majority of problem sizes. Its advantage is particularly high (over 2X) when data fits into the L1 cache, diminishes to 72–94% within the L2 cache, and levels off at 7–8% for out-of-cache problem sizes. As the implementation in Intel DNNL is less efficient than ours, our Two-Pass softmax algorithm outperforms the DNNL softmax primitive on all problem sizes: it is 28–41% faster on out-of-cache problem sizes, and up to 87% faster when the input fits into the L1 cache.
Figure 10: Performance comparison of our implementations of Algorithms 1, 2, and 3 with the softmax implementation in the Intel DNNL library. Gray dotted lines denote boundaries of the level-1, level-2, and level-3 caches.

6.8 Other Processors

The results in Sec. 6.4-6.7 were all collected on the Xeon W-2135 processor with the Intel Skylake-X microarchitecture, which prompts the question of whether the advantage of the Two-Pass softmax algorithm is restricted to a specific type of processor. To demonstrate that the Two-Pass algorithm generalizes to other types of processors, we replicated the results of Sec. 6.5 on a Xeon E5-2696 v4 processor with the Intel Broadwell microarchitecture and on a Ryzen 9 3900X processor with the AMD Zen 2 microarchitecture. Both of these processors support the AVX2, but not the AVX512, instruction set, and have different cache hierarchy parameters than the Intel Skylake-X system.
Figure 11: Performance comparison of Algorithms 1, 2, and 3 on an Intel Broadwell-based system. Gray dotted lines denote boundaries of the level-1, level-2, and level-3 caches.

Fig. 11 presents the performance of the three softmax algorithms on the Intel Broadwell system. Although the Two-Pass softmax algorithm demonstrates inferior performance on problems that fit into the L2 cache, it gets competitive with the Three-Pass softmax algorithms on L3-sized problems, and outperforms them by 21–23% on out-of-cache problems.
Figure 12: Performance comparison of Algorithms 1, 2, and 3 on a Ryzen 9 3900X system. Gray dotted lines denote boundaries of the level-1, level-2, and level-3 caches.

Fig. 12 shows a similar picture on the AMD Zen 2 microarchitecture. Here, the Three-Pass algorithms deliver superior performance as long as the data fits into the L3 cache, but lose 14–16% to the Two-Pass algorithm when the data exceeds the cache size.
7 Conclusions

We presented a novel Two-Pass algorithm for softmax computation and demonstrated that the new Two-Pass algorithm is up to 28% faster than the traditional Three-Pass algorithm on large input vectors. The algorithm, however, offers no advantage over the reloading variant of the Three-Pass algorithm when the data fits into the processor's cache.

This study focused on CPU performance, but the algorithm has great potential for GPUs and hardware AI accelerators. These platforms further shift the balance between compute and memory performance towards expensive memory and cheap floating-point operations, and would favor the reduced memory intensity of the presented Two-Pass softmax algorithm.

To foster reproducibility, we released an open-source implementation of the new Two-Pass softmax algorithm and the other experiments in this paper as a part of the XNNPACK library at GitHub.com/google/XNNPACK.
Acknowledgements
We thank Matthias Grundmann, Sergey Ioffe, Juhyun Lee, Karthik Raveendran, and Richard Vuduc for helpful feedback and discussion, and Vijay Thakkar for sharing access to the Ryzen 9 3900X system.
References

[1] J. S. Bridle, “Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition,” in Neurocomputing. Springer, 1990, pp. 227–236.

[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.

[3] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson, “One billion word benchmark for measuring progress in statistical language modeling,” arXiv preprint arXiv:1312.3005, 2013.

[4] S. Singh, A. Subramanya, F. Pereira, and A. McCallum, “Wikilinks: A large-scale cross-document coreference corpus labeled via links to Wikipedia,” University of Massachusetts, Amherst, Tech. Rep. UM-CS-2012-015, 2012.

[5] A. Panchenko, E. Ruppert, S. Faralli, S. P. Ponzetto, and C. Biemann, “Building a web-scale dependency-parsed corpus from CommonCrawl,” arXiv preprint arXiv:1710.01779, 2017.

[6] J. Goodman, “Classes for fast maximum entropy training,” arXiv preprint cs/0108006, 2001.

[7] E. Grave, A. Joulin, M. Cissé, H. Jégou et al., “Efficient softmax approximation for GPUs,” in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 1302–1310.

[8] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.

[9] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 8024–8035.

[10] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 806–814.

[11] J. Park, S. R. Li, W. Wen, H. Li, Y. Chen, and P. Dubey, “Holistic SparseCNN: Forging the trident of accuracy, speed, and size,” arXiv preprint arXiv:1608.01409, 2016.

[12] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems, 2014, pp. 1269–1277.

[13] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.

[14] J. Park, M. Naumov, P. Basu, S. Deng, A. Kalaiah, D. Khudia, J. Law, P. Malani, A. Malevich, S. Nadathur et al., “Deep learning inference in Facebook data centers: Characterization, performance optimizations and hardware implications,” arXiv preprint arXiv:1811.09886, 2018.

[15] A. Tulloch and Y. Jia, “High performance ultra-low-precision convolutions on mobile devices,” arXiv preprint arXiv:1712.02427, 2017.

[16] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 243–254.

[17] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 1–12.

[18] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu, “Exploring the limits of language modeling,” arXiv preprint arXiv:1602.02410, 2016.

[19] I. Kouretas and V. Paliouras, “Simplified hardware implementation of the softmax activation function.” IEEE, 2019, pp. 1–4.

[20] M. Wang, S. Lu, D. Zhu, J. Lin, and Z. Wang, “A high-speed and low-complexity architecture for softmax function in deep learning.” IEEE, 2018, pp. 223–226.

[21] Q. Sun, Z. Di, Z. Lv, F. Song, Q. Xiang, Q. Feng, Y. Fan, X. Yu, and W. Wang, “A high speed softmax VLSI architecture based on basic-split.” IEEE, 2018, pp. 1–3.

[22] K. Shim, M. Lee, I. Choi, Y. Boo, and W. Sung, “SVD-softmax: Fast softmax approximation on large vocabulary neural networks,” in Advances in Neural Information Processing Systems, 2017, pp. 5463–5473.

[23] S. Kanai, Y. Fujiwara, Y. Yamanaka, and S. Adachi, “Sigsoftmax: Reanalysis of the softmax bottleneck,” in Advances in Neural Information Processing Systems, 2018, pp. 286–296.

[24] B. Yuan, “Efficient hardware architecture of softmax layer in deep neural network.” IEEE, 2016, pp. 323–326.

[25] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[26] S. Gal, “An accurate elementary mathematical library for the IEEE floating point standard,” ACM Transactions on Mathematical Software (TOMS), vol. 17, no. 1, pp. 26–45, 1991.

[27] C. Iordache and P. T. P. Tang, “An overview of floating-point support and math library on the Intel XScale architecture,” in Proceedings 2003 16th IEEE Symposium on Computer Arithmetic. IEEE, 2003, pp. 122–128.

[28] P. Markstein, IA-64 and Elementary Functions: Speed and Precision. Prentice Hall, 2000, vol. 3.

[29] M. Dukhan and R. Vuduc, “Methods for high-throughput computation of elementary functions,” in International Conference on Parallel Processing and Applied Mathematics. Springer, 2013, pp. 86–95.

[30] J.-M. Muller, N. Brisebarre, F. De Dinechin, C.-P. Jeannerod, V. Lefèvre, G. Melquiond, N. Revol, D. Stehlé, S. Torres et al., “Handbook of floating-point arithmetic,” 2010.

[31] Y. Hida, X. S. Li, and D. H. Bailey, “Algorithms for quad-double precision floating point arithmetic,” in Proceedings 15th IEEE Symposium on Computer Arithmetic (ARITH-15 2001). IEEE, 2001, pp. 155–162.

[32] S. Graillat, V. Ménissier-Morain et al., “Error-free transformations in real and complex floating point arithmetic,” in Proceedings of the International Symposium on Nonlinear Theory and its Applications. Citeseer, 2007, pp. 341–344.

[33] W. J. Cody, Software Manual for the Elementary Functions (Prentice-Hall Series in Computational Mathematics). Prentice-Hall, Inc., 1980.

[34] N. Brisebarre and S. Chevillard, “Efficient polynomial L∞-approximations.” IEEE, 2007, pp. 169–176.

[35] S. Chevillard, M. Joldeş, and C. Lauter, “Sollya: An environment for the development of numerical codes,” in International Congress on Mathematical Software. Springer, 2010, pp. 28–31.

[36] M. Cornea, “Intel AVX-512 instructions and their use in the implementation of math functions,” Intel Corporation, 2015.

[37] J. D. McCalpin, “Memory bandwidth and machine balance in current high performance computers,” IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19–25, 1995.