On Controllable Sparse Alternatives to Softmax
Anirban Laha†∗, Saneem A. Chemmengath∗, Priyanka Agrawal, Mitesh M. Khapra, Karthik Sankaranarayanan, Harish G. Ramaswamy
IBM Research; Robert Bosch Center for DS and AI, and Dept. of CSE, IIT Madras
Abstract
Converting an n-dimensional vector to a probability distribution over n objects is a commonly used component in many machine learning tasks like multiclass classification, multilabel classification, attention mechanisms, etc. For this, several probability mapping functions have been proposed and employed in the literature, such as softmax, sum-normalization, spherical softmax, and sparsemax, but there is very little understanding of how they relate to each other. Further, none of the above formulations offer an explicit control over the degree of sparsity. To address this, we develop a unified framework that encompasses all these formulations as special cases. This framework ensures simple closed-form solutions and existence of sub-gradients suitable for learning via backpropagation. Within this framework, we propose two novel sparse formulations, sparsegen-lin and sparsehourglass, that seek to provide a control over the degree of desired sparsity. We further develop novel convex loss functions that help induce the behavior of the aforementioned formulations in the multilabel classification setting, showing improved performance. We also demonstrate empirically that the proposed formulations, when used to compute attention weights, achieve better or comparable performance on standard seq2seq tasks like neural machine translation and abstractive summarization.
Various widely used probability mapping functions such as sum-normalization, softmax, and spherical softmax enable mapping of vectors from the Euclidean space to probability distributions. The need for such functions arises in multiple problem settings like multiclass classification [1, 2], reinforcement learning [3, 4] and, more recently, attention mechanisms [5, 6, 7, 8, 9] in deep neural networks, amongst others. Even though softmax is the most prevalent approach amongst them, it has a shortcoming in that its outputs are composed of only non-zeroes, and it is therefore ill-suited for producing sparse probability distributions as output. The need for sparsity is motivated by parsimonious representations [10] investigated in the context of variable or feature selection. Sparsity in the input space offers benefits of model interpretability as well as computational benefits, whereas on the output side it helps in filtering large output spaces, for example in large-scale multilabel classification settings [11]. While there have been several such mapping functions proposed in the literature, such as softmax [4], spherical softmax [12, 13] and sparsemax [14, 15], very little is understood in terms of how they relate to each other and their theoretical underpinnings. Further, for sparse formulations, often there is a need to trade off interpretability for accuracy, yet none of these formulations offer an explicit control over the desired degree of sparsity.

Motivated by these shortcomings, in this paper we introduce a general formulation encompassing all such probability mapping functions, which serves as a unifying framework to understand individual formulations such as hardmax, softmax, sum-normalization, spherical softmax and sparsemax as special cases, while at the same time providing explicit control over the degree of sparsity. With the aim of controlling sparsity, we propose two new formulations: sparsegen-lin and sparsehourglass. Our framework also ensures simple closed-form solutions and existence of sub-gradients similar to softmax. This enables them to be employed as activation functions in neural networks which require gradients for backpropagation, and makes them suitable for tasks that require a sparse attention mechanism [14]. We also show that the sparsehourglass formulation can extend from translation invariance to scale invariance with an explicit control, thus helping to achieve an adaptive trade-off between these invariance properties as may be required in a problem domain.

We further propose new convex loss functions which can help induce the behaviour of the above proposed formulations in a multilabel classification setting. These loss functions are derived from a violation of constraints required to be satisfied by the corresponding mapping functions. This way of defining losses leads to an alternative loss definition for even the sparsemax function [14]. Through experiments we are able to achieve improved results in terms of sparsity and prediction accuracies for multilabel classification.

Lastly, the existence of sub-gradients for our proposed formulations enables us to employ them to compute attention weights [5, 7] in natural language generation tasks. The explicit controls provided by sparsegen-lin and sparsehourglass help to achieve higher interpretability while providing better or comparable accuracy scores. A recent work [16] had also proposed a framework for attention; however, they had not explored the effect of explicit sparsity controls.

∗ Equal contribution by the first two authors. Corresponding authors: {anirlaha,saneem.cg}@in.ibm.com.
† This author was also briefly associated with IIT Madras during the course of this work.

32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.
To summarize, our contributions are the following:
• A general framework of formulations producing probability distributions, with connections to hardmax, softmax, sparsemax, spherical softmax and sum-normalization (Sec. ).
• New formulations like sparsegen-lin and sparsehourglass as special cases of the general framework, which enable explicit control over the desired degree of sparsity (Sec. ).
• A formulation, sparsehourglass, which enables us to adaptively trade off between the translation and scale invariance properties through explicit control (Sec. ).
• Convex multilabel loss functions corresponding to all the above formulations proposed by us. This enables us to achieve improvements in the multilabel classification problem (Sec. ).
• Experiments for sparse attention on natural language generation tasks, showing comparable or better accuracy scores while achieving higher interpretability (Sec. ).
Notations: For K ∈ Z₊, we denote [K] := {1, . . . , K}. Let z ∈ R^K be a real vector denoted as z = {z_1, . . . , z_K}. 1 and 0 denote the vectors of ones and zeros respectively. Let Δ^{K−1} := {p ∈ R^K | 1ᵀp = 1, p ≥ 0} be the (K−1)-dimensional simplex, and let p ∈ Δ^{K−1} be denoted as p = {p_1, . . . , p_K}. We use [t]₊ := max{0, t}. Let A(z) := {k ∈ [K] | z_k = max_j z_j} be the set of maximal elements of z.

Definition 1: A probability mapping function is a map ρ : R^K → Δ^{K−1} which transforms a score vector z to a categorical distribution (denoted as ρ(z) = {ρ_1(z), . . . , ρ_K(z)}). The support of ρ(z) is S(z) := {j ∈ [K] | ρ_j(z) > 0}. Such mapping functions can be used as activation functions for machine learning models. Some known probability mapping functions are listed below:

• Softmax is defined as: ρ_i(z) = exp(z_i) / Σ_{j∈[K]} exp(z_j), ∀i ∈ [K]. Softmax is easy to evaluate and differentiate, and its logarithm is the negative log-likelihood loss [14].
• Spherical softmax: another function which is simple to compute and derivative-friendly: ρ_i(z) = z_i² / Σ_{j∈[K]} z_j², ∀i ∈ [K]. Spherical softmax is not defined for Σ_{j∈[K]} z_j² = 0.
• Sum-normalization: ρ_i(z) = z_i / Σ_{j∈[K]} z_j, ∀i ∈ [K]. It is not used much in practice, as the mapping is not defined if z_i < 0 for any i ∈ [K] or for Σ_{j∈[K]} z_j = 0.

The above mapping functions are limited to producing distributions with full support. Consider the case where a single value z_i is significantly higher than the rest: its desired probability should be exactly 1, while the rest should be grounded to zero (the hardmax mapping). Unfortunately, that does not happen unless the rest of the values tend to −∞ (in the case of softmax) or are equal to 0 (in the case of spherical softmax and sum-normalization).
(a) Softmax  (b) Sparsemax  (c) Sum-normalization
Figure 1:
Visualization of probability mapping functions in two dimensions. The contour plots show values of ρ_1(z). The green line segment connecting (1,0) and (0,1) is the 1-dimensional probability simplex. Each contour (here a line) contains points in the R² plane which have the same ρ_1(z), the exact value marked on the contour line.

• Sparsemax, recently introduced by [14], circumvents this issue by projecting the score vector z onto the simplex [15]: ρ(z) = argmin_{p ∈ Δ^{K−1}} ‖p − z‖². This offers an intermediate solution between softmax (no zeroes) and hardmax (zeroes except for the highest value).

The contour plots for softmax, sparsemax and sum-normalization in two dimensions (z ∈ R²) are shown in Fig. 1. The contours of sparsemax are concentrated over a narrow region, while the remaining region corresponds to sparse solutions. For softmax, the contour plots are spread over the whole real plane, confirming the absence of sparse solutions. Sum-normalization is not defined outside the first quadrant, and yet the contours cover the whole quadrant, denying sparse solutions.
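The simplex projection underlying sparsemax has a simple closed form. Below is a minimal NumPy sketch (an illustrative implementation, not the authors' code): the O(K log K) sort-based variant rather than the expected-O(K) median-finding one of [15].

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex.

    Sort-based O(K log K) variant; the expected-O(K) randomized
    median-finding algorithm of [15] computes the same threshold tau.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                  # z_(1) >= z_(2) >= ...
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    # support size: largest k with 1 + k * z_(k) > cumulative sum up to k
    k_max = k[1 + k * z_sorted > cumsum][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max      # threshold
    return np.maximum(z - tau, 0.0)

print(sparsemax([1.5, 0.5]))    # large gap -> hardmax-like [1. 0.]
print(sparsemax([0.6, 0.4]))    # already on the simplex -> unchanged
```

Unlike softmax, any score falling more than the threshold τ below the rest receives exactly zero probability.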
Definition 2: We propose a generic probability mapping function inspired from the sparsemax formulation (in Sec. 2), which we call sparsegen:

    ρ(z) = sparsegen(z; g, λ) = argmin_{p ∈ Δ^{K−1}} ‖p − g(z)‖² − λ‖p‖²    (1)

where g : R^K → R^K is a component-wise transformation function applied on z. Here g_i(z) denotes the i-th component of g(z). The coefficient λ < 1 controls the regularization strength. For λ > 0, the second term becomes the negative L2 norm of p. In addition to minimizing the error on projection of g(z), Eq. 1 thus encourages sparser solutions through the regularizer. It can be solved in expected O(K) time using the modified randomized median-finding algorithm, as followed in [15] for solving the projection onto the simplex problem.

The choices of both λ and g can help control the cardinality of the support set S(z), thus influencing the sparsity of ρ(z). λ can help produce distributions with support ranging from full (uniform distribution when λ → −∞) to minimum (hardmax when λ → 1). Let S(z, λ) denote the support of sparsegen for a particular coefficient λ. It is easy to show: if |S(z, λ)| > |A(z)|, then there exists λ_x > λ for an x < |S(z, λ)| such that |S(z, λ_x)| = x. In other words, if a sparser solution exists, it can be obtained by changing λ. The following result gives an alternate interpretation for λ:

Result 1: The sparsegen formulation (Eq. 1) is equivalent to the following, when γ = 1/(1 − λ) (where γ > 0): ρ(z) = argmin_{p ∈ Δ^{K−1}} ‖p − γ g(z)‖².

The above result says that scaling g(z) by γ = 1/(1 − λ) is equivalent to applying the negative L2 norm with coefficient λ when considering the projection of g(z) onto the simplex. Thus, we can write:

    sparsegen(z; g, λ) = sparsemax( g(z) / (1 − λ) ).    (2)

This equivalence helps us borrow results from sparsemax to establish various properties for sparsegen.

Jacobian of sparsegen: To train a model with sparsegen as an activation function, it is essential to compute its Jacobian matrix, denoted by J_ρ(z) := [∂ρ_i(z)/∂z_j]_{i,j}, for using gradient-based optimization techniques. Using Eq. 2 and the chain rule, we obtain:

    J_sparsegen(z) = J_sparsemax( g(z) / (1 − λ) ) × J_g(z) / (1 − λ)    (3)

where J_g(z) is the Jacobian of g(z) and J_sparsemax(z) = Diag(s) − s sᵀ / |S(z)|. Here s is an indicator vector whose i-th entry is 1 if i ∈ S(z), and Diag(s) is a matrix created using s as its diagonal entries.

Apart from λ, one can control the sparsity of sparsegen through g(z) as well. Moreover, certain choices of λ and g(z) help us establish connections with existing activation functions (see App. A.2):
Example 1: g(z) = exp(z) (sparsegen-exp): exp(z) denotes element-wise exponentiation of z, that is, g_i(z) = exp(z_i). Sparsegen-exp reduces to softmax when λ = 1 − Σ_{j∈[K]} e^{z_j}, as this results in S(z) = [K] as per Eq. 13 in App. A.2.

Example 2: g(z) = z² (sparsegen-sq): z² denotes the element-wise square of z. As observed for sparsegen-exp, when λ = 1 − Σ_{j∈[K]} z_j², sparsegen-sq reduces to spherical softmax.

Example 3: g(z) = z, λ = 0: This case is equivalent to the projection onto the simplex objective adopted by sparsemax. Setting λ ≠ 0 leads to the regularized extension of sparsemax seen next.

The negative L2-norm regularizer in Eq. 1 with g(z) = z gives this regularized extension of sparsemax, which we call sparsegen-lin. For λ → 1, the whole real plane maps to the sparse region, whereas for λ → −∞, the whole real plane renders non-sparse solutions.

    ρ(z) = sparsegen-lin(z) = argmin_{p ∈ Δ^{K−1}} ‖p − z‖² − λ‖p‖²    (4)
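Eqs. 2-4 translate directly into code. The sketch below (illustrative, not the authors' implementation) computes sparsegen-lin via the sparsemax equivalence, computes its Jacobian-vector product via Eq. 3 for the identity g (which is all backpropagation needs), and shows λ dialing the support size:

```python
import numpy as np

def sparsemax(z):
    # projection onto the simplex (sort-based variant)
    z = np.asarray(z, dtype=float)
    zs = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cs = np.cumsum(zs)
    k_max = k[1 + k * zs > cs][-1]
    tau = (cs[k_max - 1] - 1.0) / k_max
    return np.maximum(z - tau, 0.0)

def sparsegen_lin(z, lam=0.0):
    """Eq. 4 via Eq. 2: sparsegen-lin(z) = sparsemax(z / (1 - lam)), lam < 1."""
    return sparsemax(np.asarray(z, dtype=float) / (1.0 - lam))

def sparsegen_lin_jvp(z, v, lam=0.0):
    """J_sparsegen(z) @ v via Eq. 3 with g = identity (so J_g = I)."""
    p = sparsegen_lin(z, lam)
    s = (p > 0).astype(float)                 # support indicator
    w = np.asarray(v, dtype=float) / (1.0 - lam)
    # J_sparsemax(u) @ w = Diag(s) w - s (s^T w) / |S|
    return s * w - s * (s @ w) / s.sum()

z = [0.0, 0.5, 1.0, 1.5]
for lam in (-9.0, 0.0, 0.9):                  # lam -> -inf: denser; lam -> 1: sparser
    print(lam, np.count_nonzero(sparsegen_lin(z, lam)))
```

On this example the support shrinks from full (4 non-zeros) through 2 at λ = 0 (plain sparsemax) down to a single non-zero as λ approaches 1.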
Figure 2: Sparsegen-lin: Region plot for z = {z_1, z_2} ∈ R² for a positive λ. ρ(z) is sparse in the red region, whereas it is non-sparse in the blue region. The dashed red lines depict the boundaries between sparse and non-sparse regions. For this λ, points like z or z′ are mapped onto the sparse points A or B, whereas for sparsemax (λ = 0) they fall in the blue region (the boundaries of sparsemax are shown by lighter dashed red lines passing through A and B). A point lying in the blue region produces a non-sparse solution. Interestingly, more points like z″ and z‴, which currently lie in the red region, can fall in the blue region for some λ < 0. For larger λ > 0, the blue region becomes smaller, as more points map to sparse solutions.

Let us enumerate below some properties a probability mapping function ρ should possess:

1. Monotonicity: If z_i ≥ z_j, then ρ_i(z) ≥ ρ_j(z). This does not always hold true for sum-normalization and spherical softmax when one or both of z_i, z_j is less than zero. For sparsegen, both g_i(z) and g_i(z)/(1 − λ) should be monotonically increasing, which implies λ needs to be less than 1.

2. Full domain: The domain of ρ should include negatives as well as positives, i.e. Dom(ρ) = R^K. Sum-normalization does not satisfy this, as it is not defined if some dimensions of z are negative.

3. Existence of Jacobian: This enables usage in any training algorithm where gradient-based optimization is used. For sparsegen, the Jacobian of g(z) should be easily computable (Eq. 3).
4. Lipschitz continuity: The derivative of the function should be upper bounded. This is important for the stability of the optimization technique used in training. Softmax and sparsemax are 1-Lipschitz, whereas spherical softmax and sum-normalization are not Lipschitz continuous. From Eq. 2, sparsegen is Lipschitz continuous with a constant of 1/(1 − λ) times the Lipschitz constant for g(z).

5. Translation invariance: Adding a constant c to every element in z should not change the output distribution: ρ(z + c1) = ρ(z). Sparsemax and softmax are translation invariant, whereas sum-normalization and spherical softmax are not. Sparsegen is translation invariant iff for all c ∈ R there exists a c̃ ∈ R such that g(z + c1) = g(z) + c̃1. This follows from Eq. 2.
6. Scale invariance: Multiplying every element in z by a constant c should not change the output distribution: ρ(cz) = ρ(z). Sum-normalization and spherical softmax satisfy this property, whereas sparsemax and softmax are not scale invariant. Sparsegen is scale invariant iff for all c ∈ R there exists a ĉ ∈ R such that g(cz) = g(z) + ĉ1. This also follows from Eq. 2.

7. Permutation invariance: If P is a permutation matrix, then ρ(Pz) = Pρ(z). For sparsegen, the precondition is that g(z) should be a permutation-invariant function.

8. Idempotence: ρ(z) = z, ∀z ∈ Δ^{K−1}. This is true for sparsemax and sum-normalization. For sparsegen, it is true if and only if g(z) = z, ∀z ∈ Δ^{K−1}, and λ = 0.

In the next section, we discuss in detail the scale invariance and translation invariance properties and propose a new formulation achieving a trade-off between these properties. As mentioned in
Sec. 1, scale invariance is a desirable property for probability mapping functions. Consider applying sparsemax on a vector z ∈ R² and a scaled-down version z̄ of it, both having a coordinate gap of at least 1: both would result in {1, 0} as the output. However, ideally z̄, whose coordinates are much closer together, should have mapped to a distribution nearer to uniform instead. Scale invariant functions will not have such a problem. Among the existing functions, only sum-normalization and spherical softmax satisfy scale invariance. While sum-normalization is only defined for positive values of z, spherical softmax is neither monotonic nor Lipschitz continuous. In addition, both of these methods are also not defined for z = 0, thus making them unusable for practical purposes. It can be shown that any probability mapping function with the scale invariance property will not be Lipschitz continuous and will be undefined for z = 0.

A recent work [13] had pointed out the lack of clarity over whether scale invariance is more desired than the translation invariance property of softmax and sparsemax. We take this into account to achieve a trade-off between the two invariances. In the usual scale invariance property, scaling a vector z essentially results in another vector along the line connecting z and the origin; that resultant vector also has the same output probability distribution as the original vector. We can instead scale z along the line connecting it with a point (we call it the anchor point henceforth) other than the origin, yet achieve the same output. Interestingly, the choice of this anchor point can act as a control to help achieve a trade-off between scale invariance and translation invariance.

Let a vector z be projected onto the simplex along the line connecting it with an anchor point q = (−q, . . . , −q) ∈ R^K, for q > 0 (see Fig. 3a for K = 2). We choose g(z) as the point where this line intersects the affine hyperplane 1ᵀẑ = 1 containing the probability simplex. Thus, g(z) is set equal to αz + (1 − α)q, where α = (1 + Kq) / (Σ_i z_i + Kq) (we denote it as α(z), as α is a function of z). From the translation invariance property of sparsemax, the resultant mapping function can be shown equivalent to considering g(z) = α(z)z in Eq. 1. We refer to this variant of sparsegen assuming g(z) = α(z)z and λ = 0 as sparsecone.

Interestingly, when the parameter q = 0, sparsecone reduces to sum-normalization (scale invariant), and when q → ∞, it is equivalent to sparsemax (translation invariant). Thus the parameter q acts as a control taking sparsecone from scale invariance to translation invariance. At intermediate values (that is, for 0 < q < ∞), sparsecone is approximately scale invariant with respect to the anchor point q. However, it is undefined for z where Σ_i z_i < −Kq (beyond the black dashed line shown in Fig. 3a), since there the denominator of α(z) (that is, Σ_i z_i + Kq) becomes negative, destroying the monotonicity of α(z)z. Also note that sparsecone is not Lipschitz continuous.

(a) Sparsecone
(b) Sparsehourglass
Figure 3: (a) Sparsecone: The vector z maps to a point p on the simplex along the line connecting it to the point q = (−q, −q). Here we consider q = 1. The red region corresponds to the sparse region, whereas blue covers the non-sparse region. (b) Sparsehourglass: For the vector z′ in the positive half-space, the mapping to the solution p′ can be obtained similarly as in sparsecone. For the vector z in the negative half-space, a mirror point z̃ needs to be found, which leads to the solution p.

To alleviate the issue of monotonicity when Σ_i z_i < −Kq, we choose to restrict applying sparsecone only to the positive half-space H⁺_K := {z ∈ R^K | Σ_i z_i ≥ 0}. For the remaining negative half-space H⁻_K := {z ∈ R^K | Σ_i z_i < 0}, we define a mirror point function to transform z to a point in H⁺_K, on which sparsecone can be applied. Thus the solution for a point in the negative half-space is given by the solution of its corresponding mirror point in the positive half-space. This mirror point function has some necessary properties (see App. A.4 for details), which can be satisfied by defining m: m_i(z) = z_i − 2(Σ_j z_j)/K, ∀i ∈ [K]. Interestingly, this can alternatively be achieved by choosing g(z) = α̂(z)z, where α̂ is a slight modification of α given by α̂(z) = (1 + Kq) / (|Σ_i z_i| + Kq). This leads to the definition of a new probability mapping function (which we call sparsehourglass):

    ρ(z) = sparsehourglass(z) = argmin_{p ∈ Δ^{K−1}} ‖ p − (1 + Kq) z / (|Σ_{i∈[K]} z_i| + Kq) ‖²    (5)

Like sparsecone, sparsehourglass also reduces to sparsemax when q → ∞. Similarly, q = 0 for sparsehourglass leads to a corrected version of sum-normalization (we call it sum-normalization++), which works for the negative domain as well, unlike the original version defined in Sec.
2. Another advantage of sparsehourglass is that it is Lipschitz continuous, with Lipschitz constant equal to (1 + 1/(Kq)) (proof details in App. A.5).
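A sketch of sparsehourglass as one rescaling followed by a simplex projection. The rescaling factor α̂(z) = (1 + Kq)/(|Σ_i z_i| + Kq) used below is an assumption chosen to be consistent with the stated q → 0 (sum-normalization++) and q → ∞ (sparsemax) limits; this is illustrative code, not the authors' implementation.

```python
import numpy as np

def sparsemax(z):
    # projection onto the simplex (sort-based variant)
    z = np.asarray(z, dtype=float)
    zs = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cs = np.cumsum(zs)
    k_max = k[1 + k * zs > cs][-1]
    tau = (cs[k_max - 1] - 1.0) / k_max
    return np.maximum(z - tau, 0.0)

def sparsehourglass(z, q=1.0):
    """Eq. 5: project alpha_hat(z) * z onto the simplex, where
    alpha_hat(z) = (1 + K q) / (|sum_i z_i| + K q)  (assumed form)."""
    z = np.asarray(z, dtype=float)
    K = z.size
    alpha_hat = (1.0 + K * q) / (abs(z.sum()) + K * q)
    return sparsemax(alpha_hat * z)

z = [2.0, 1.0, 1.0]
print(sparsehourglass(z, q=1e-9))   # q -> 0: sum-normalization++
print(sparsehourglass(z, q=1e9))    # q -> inf: approaches sparsemax
```

The absolute value in the denominator is what makes the q = 0 limit well defined on the negative half-space, unlike plain sum-normalization.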
Table 1 summarizes the properties satisfied by the various probability mapping functions; sparsehourglass is the only probability mapping function which satisfies all the properties. Even though it does not satisfy both scale invariance and translation invariance simultaneously, it is possible to achieve these separately through different values of the q parameter, which can be decided independently of z.

An important usage of such sparse probability mapping functions is in the output mapping of multilabel classification models. Typical multilabel problems have hundreds of possible labels or tags, but any single instance has only a few tags [17]. Thus, a function which takes in a vector in R^K and outputs a sparse version of the vector is of great value.

Given training instances (x_i, y_i) ∈ X × {0, 1}^K, we need to find a model function f : X → R^K that produces a score vector over the label space, which on application of ρ : R^K → Δ^{K−1} (the sparse probability mapping function in question) leads to correct prediction of the label vector y_i.

Table 1: Summary of the properties satisfied by probability mapping functions. Here ✓ denotes 'satisfied in general', ✗ signifies 'not satisfied' and ✓* says 'satisfied for some constant parameter independent of z'. Note that Permutation Inv and existence of Jacobian are satisfied by all.
Function            | Idempotence | Monotonic | Translation Inv | Scale Inv | Full Domain | Lipschitz
Sum Normalization   | ✓           | ✗         | ✗               | ✓         | ✗           | ∞
Spherical Softmax   | ✗           | ✗         | ✗               | ✓         | ✗           | ∞
Softmax             | ✗           | ✓         | ✓               | ✗         | ✓           | 1
Sparsemax           | ✓           | ✓         | ✓               | ✗         | ✓           | 1
Sparsegen-lin       | ✓*          | ✓         | ✓               | ✗         | ✓           | 1/(1−λ)
Sparsegen-exp       | ✗           | ✓         | ✗               | ✗         | ✓           | ∞
Sparsegen-sq        | ✗           | ✗         | ✗               | ✗         | ✓           | ∞
Sparsecone          | ✓           | ✗         | ✓*              | ✓*        | ✗           | ∞
Sparsehourglass     | ✓           | ✓         | ✓*              | ✓*        | ✓           | 1 + 1/(Kq)
Sum Normalization++ | ✓           | ✓         | ✗               | ✓         | ✓           | ∞
For sparsehourglass, theclosed-form solution is given by ρ i ( z ) = [ˆ α ( z ) z i − τ ( z )] + (see App.
A.1). This enables us to listdown the following constraints for zero loss: (1) ˆ α ( z )( z i − z j ) = 0 , ∀ i, j | η i = η j (cid:54) = 0 , and (2) ˆ α ( z )( z i − z j ) ≥ η i , ∀ i, j | η i (cid:54) = 0 , η j = 0 . The value of the loss when any such constraints is violatedis simply determined by piece-wise linear functions, which lead to the following loss function forsparsehourglass: L sparsehg,hinge ( z , η ) = (cid:88) i,jη i (cid:54) =0 ,η j (cid:54) =0 (cid:12)(cid:12) z i − z j (cid:12)(cid:12) + (cid:88) i,jη i (cid:54) =0 ,η j =0 max (cid:110) η i ˆ α ( z ) − ( z i − z j ) , (cid:111) . (6)It can be easily proved that the above loss function is convex in z using the properties that both sumof convex functions and maximum of convex functions result in convex functions. The above strategycan also be applied to derive a multilabel loss function for sparsegen-lin: L sparsegen-lin,hinge ( z , η ) = 11 − λ (cid:88) i,jη i (cid:54) =0 ,η j (cid:54) =0 | z i − z j | + (cid:88) i,jη i (cid:54) =0 ,η j =0 max (cid:110) η i − z i − z j − λ , (cid:111) . (7)The above loss function for sparsegen-lin can be used to derive a multilabel loss for sparsemax bysetting λ = 0 (which we use in our experiments for “sparsemax+hinge”) . The piecewise-linearlosses proposed in this section based on violation of constraints are similar to the well-knownhinge loss, whereas the sparsemax loss proposed by [14] (which we use in our experiments for“sparsemax+huber”) has connections with Huber loss. We have shown through our experiments innext section, that hinge loss variants for multilabel classification work better than Huber loss variants. Here we present two sets of evaluations for the proposed probability mapping functions and lossfunctions. First, we apply them on the multilabel classification task studying the effect of varyinglabel density in synthetic dataset, followed by evaluation on real multilabel datasets. 
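Before moving to the experiments, note that the hinge loss of Eq. 7 is only a few lines of code. An illustrative sketch (λ = 0 gives the "sparsemax+hinge" variant; the double sums are read here over ordered pairs i ≠ j, one possible reading of the notation):

```python
import numpy as np

def sparsegen_lin_hinge(z, eta, lam=0.0):
    """Multilabel hinge loss of Eq. 7; eta is the target distribution."""
    z, eta = np.asarray(z, dtype=float), np.asarray(eta, dtype=float)
    on, off = np.flatnonzero(eta), np.flatnonzero(eta == 0)
    loss = 0.0
    for i in on:                      # pairs of "on" labels: scores must tie
        for j in on:
            if i != j:
                loss += abs(z[i] - z[j]) / (1.0 - lam)
    for i in on:                      # "on" vs "off": margin of at least eta_i
        for j in off:
            loss += max(eta[i] - (z[i] - z[j]) / (1.0 - lam), 0.0)
    return loss

eta = np.array([0.5, 0.5, 0.0])
print(sparsegen_lin_hinge([1.0, 1.0, 0.0], eta))   # zero: sparsemax(z) = eta
print(sparsegen_lin_hinge([1.0, 0.0, 0.8], eta))   # constraints violated -> positive
```

The first score vector ties the two "on" labels and separates them from the "off" label by the required margin, so the loss vanishes exactly where the mapping recovers η.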
Next, we report results of sparse attention on NLP tasks of machine translation and abstractive summarization.

Multilabel Classification
We compare the proposed activations and loss functions for multilabel classification with both synthetic and real datasets. We use a linear prediction model followed by a loss function during training. During test time, the corresponding activation is directly applied to the output of the linear model. We consider the following activation-loss pairs: (1) softmax+log: KL-divergence loss applied on top of softmax outputs, (2) sparsemax+huber: the multilabel classification method from [14], (3) sparsemax+hinge: the hinge loss as in Eq. 7 with λ = 0 used during training, compared to the Huber loss in (2), and (4) sparsehg+hinge: for sparsehourglass (in short, sparsehg), the loss in Eq. 6. For softmax+log, we tune a probability threshold p, above which a label is predicted to be "on". For the others, a label is predicted "on" if its predicted probability is non-zero. We tune the hyperparameters q for sparsehg+hinge and p for softmax+log using a validation set.

We use scikit-learn for generating synthetic datasets (details in App. A.6). We conducted experiments in three settings: (1) varying the mean number of labels per instance, (2) varying the range of the number of labels, and (3) varying the document length. In the first setting, we study the ability to model varying label sparsity: we draw the number of labels N uniformly at random from the set {μ − 1, μ, μ + 1}, where μ is the mean number of labels. In the second setting, we study how these models perform when label density varies across instances: we draw N uniformly at random from a set {μ − r, . . . , μ + r} for a fixed mean μ, where the parameter r controls the variation of label density among instances. In the third setting, we experiment with different document lengths: we draw N from a Poisson distribution with fixed mean and vary the document length L from 200 to 2000. In the first two settings the document length was fixed. We report F-score and Jensen-Shannon divergence (JSD) on the test set in our results.

Figure 4: Sparsity comparison.
Fig. 4 compares the sparsity of the predictions from the different methods (the lower the curve, the sparser the predictions); this analysis is done corresponding to the setting in Fig. 5 (more details in App. A.7.1).
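The JSD metric reported above is standard; a minimal sketch (assuming the natural logarithm, since the paper does not state the base):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two distributions (natural log)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                       # midpoint distribution

    def kl(a, b):
        a = np.clip(a, eps, 1.0)            # clip to keep log finite on zeros
        b = np.clip(b, eps, 1.0)
        return float(np.sum(a * np.log(a / b)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jsd([1.0, 0.0], [1.0, 0.0]))   # identical -> ~0
print(jsd([1.0, 0.0], [0.0, 1.0]))   # disjoint supports -> ~log 2
```

Unlike KL divergence, JSD stays finite on the sparse (zero-containing) outputs these mappings produce, which is what makes it a sensible evaluation metric here.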
We further experiment with three real datasets for multilabel classification: Birds, Scene and Emotions. The experimental setup and baselines are the same as those for the synthetic datasets described above (results in App. A.7.2). All methods give comparable results on these benchmark datasets.
Here we demonstrate the effectiveness of our formulations experimentally on two natural language generation tasks: neural machine translation and abstractive sentence summarization. The purpose of these experiments is twofold: firstly, to show the effectiveness of our proposed formulations sparsegen-lin and sparsehourglass in the attention framework on these tasks, and secondly, to show that control over sparsity leads to enhanced interpretability. We borrow the encoder-decoder architecture with attention (see App. A.8). We replace the softmax function in attention by our proposed functions as well as

(Footnote: Micro-averaged F score. Datasets available at http://mulan.sourceforge.net/datasets-mlc.html)

(a) Varying mean  (b) Varying range
(c) Varying document length
Figure 5: F-score on multilabel classification synthetic dataset.

Table 2: Sparse Attention Results. Here R-1, R-2 and R-L denote the ROUGE scores.

Attention | FR-EN BLEU | EN-FR BLEU | Gigaword R-1 | R-2   | R-L   | DUC 2003 R-1 | R-2  | R-L   | DUC 2004 R-1 | R-2   | R-L
softmax   | 36.38      | 36.00      | 34.80        | 16.64 | 32.15 | 27.95        |      |       |              |       |
sparsehg  | 36.63      | 35.69      | 35.14        | 16.91 | 32.66 | 27.39        | 9.11 | 24.53 | 30.64        | 12.05 | 28.18

sparsemax as a baseline. In addition, we use another baseline where we tune the temperature in the softmax function. More details are provided in App. A.8.
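The drop-in nature of these mappings in attention can be sketched as follows (a toy example with made-up alignment scores and identity "encoder states", not the paper's OpenNMT setup): the decoder's scores go through sparsegen-lin instead of softmax, and positions outside the support get exactly zero weight, dropping out of the context vector.

```python
import numpy as np

def sparsemax(z):
    # projection onto the simplex (sort-based variant)
    z = np.asarray(z, dtype=float)
    zs = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cs = np.cumsum(zs)
    k_max = k[1 + k * zs > cs][-1]
    tau = (cs[k_max - 1] - 1.0) / k_max
    return np.maximum(z - tau, 0.0)

def attention(scores, values, lam=0.0):
    """Context vector with sparsegen-lin attention weights (lam = 0: sparsemax)."""
    weights = sparsemax(np.asarray(scores, dtype=float) / (1.0 - lam))
    return weights, weights @ values

scores = [2.1, 1.9, 0.3, -0.5]            # toy alignment scores
values = np.eye(4)                        # toy encoder states
for lam in (0.0, 0.9):
    w, ctx = attention(scores, values, lam)
    print(lam, w.round(3))                # larger lam -> fewer attended positions
```

At λ = 0 two source positions receive weight; at λ = 0.9 the attention collapses onto a single position, which is the "crisper heatmap" effect discussed below.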
Experimental Setup: In our experiments we adopt the same experimental setup followed by [16] on top of the OpenNMT framework [18], varying only the control parameters required by our formulations. The models for the different control parameters were trained for 13 epochs, and the epoch with the best validation accuracy was chosen as the best model for that setting. The best control parameter for a formulation is again selected based on validation accuracy. For all our formulations, we report the test scores corresponding to the best control parameter in Table 2.
Neural Machine Translation: We consider the FR-EN language pair from the NMT-Benchmark project and perform experiments both ways. We see (refer Table 2) that sparsegen-lin surpasses the BLEU scores of softmax and sparsemax for FR-EN translation, whereas the sparsehg formulations yield comparable performance. Quantitatively, these metrics show that adding explicit controls does not come at the cost of accuracy. In addition, it is encouraging to see (refer Fig. 8 in App. A.8) that increasing λ for sparsegen-lin leads to crisper and hence more interpretable attention heatmaps (the fewer the activated columns per row, the better). We have also analyzed the average sparsity of heatmaps over the whole test dataset and have indeed observed that larger λ leads to sparser attention.

Abstractive Summarization: We next perform our experiments on abstractive summarization datasets, namely Gigaword, DUC2003 and DUC2004, and report ROUGE metrics. The results are shown in Table 2. The λ control leads to more interpretable attention heatmaps, as shown in Fig. 9 in App. A.8, and we have also observed the same with the average sparsity of heatmaps over the test set.
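The sparse attention computation used above (detailed in App. A.8) can be sketched as follows. This is a minimal illustration, not the authors' released code: the score values are hypothetical, and the closed form follows Proposition 0.1 with g(z) = z (sparsegen-lin), showing how larger λ yields sparser attention weights.

```python
import numpy as np

def sparsegen_lin(z, lam=0.0):
    """Sparsegen-lin: simplex projection with sparsity control lam < 1;
    lam = 0 recovers sparsemax."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted)
    ks = np.arange(1, z.size + 1)
    # support size k(z) = max{k : (1 - lam) + k * z_(k) > sum_{j<=k} z_(j)}
    k = ks[(1.0 - lam) + ks * z_sorted > cssv][-1]
    tau = (cssv[k - 1] - (1.0 - lam)) / k
    return np.maximum((z - tau) / (1.0 - lam), 0.0)

# Hypothetical attention scores for one decoder step over 6 source positions.
scores = np.array([1.2, 0.9, 0.3, 0.1, -0.5, -1.0])
for lam in (0.0, 0.5, 0.9):
    p = sparsegen_lin(scores, lam)
    print(lam, np.round(p, 3), "nonzeros:", int((p > 0).sum()))
```

With these scores, the number of nonzero attention weights shrinks as λ grows, which is exactly the effect seen in the heatmaps of Figs. 8 and 9.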
In this paper, we investigated a family of sparse probability mapping functions, unifying them under a general framework. This framework helped us understand connections to existing formulations in the literature like softmax, spherical softmax and sparsemax. Our proposed probability mapping functions enabled us to provide explicit control over sparsity to achieve higher interpretability. These functions have closed-form solutions, and sub-gradients can be computed easily. We have also proposed convex loss functions, which helped us achieve better accuracies in the multilabel classification setting. Applying these formulations to compute sparse attention weights for NLP tasks also yielded improvements, in addition to providing control to produce enhanced interpretability. As future work, we intend to apply these sparse attention formulations for efficient read and write operations in memory networks [19]. In addition, we would like to investigate the application of these sparse formulations in knowledge distillation and reinforcement learning settings.

Acknowledgements

We thank our colleagues in IBM, Abhijit Mishra, Disha Shrivastava, and Parag Jain for the numerous discussions and suggestions which helped in shaping this paper.
References

[1] M. Aly. Survey on multiclass classification methods. Neural Networks, pages 1-9, 2005.
[2] John S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Françoise Fogelman Soulié and Jeanny Hérault, editors, Neurocomputing, pages 227-236, Berlin, Heidelberg, 1990. Springer Berlin Heidelberg.
[3] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[4] B. Gao and L. Pavel. On the properties of the softmax function with application in game theory and reinforcement learning. ArXiv e-prints, 2017.
[5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, 2015.
[6] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Francis R. Bach and David M. Blei, editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 2048-2057. JMLR.org, 2015.
[7] K. Cho, A. Courville, and Y. Bengio. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11):1875-1886, Nov 2015.
[8] Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379-389, Lisbon, Portugal, September 2015. Association for Computational Linguistics.
[9] Preksha Nema, Mitesh Khapra, Anirban Laha, and Balaraman Ravindran. Diversity driven attention model for query-based abstractive summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, August 2017. Association for Computational Linguistics.
[10] Francis Bach, Rodolphe Jenatton, and Julien Mairal. Optimization with Sparsity-Inducing Penalties (Foundations and Trends in Machine Learning). Now Publishers Inc., Hanover, MA, USA, 2011.
[11] Mohammad S. Sorower. A literature survey on algorithms for multi-label learning. 2010.
[12] Pascal Vincent, Alexandre de Brébisson, and Xavier Bouthillier. Efficient exact gradient update for training deep networks with very large sparse targets. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1108-1116, Cambridge, MA, USA, 2015. MIT Press.
[13] Alexandre de Brébisson and Pascal Vincent. An exploration of softmax alternatives belonging to the spherical loss family. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
[14] André F. T. Martins and Ramón F. Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pages 1614-1623. JMLR.org, 2016.
[15] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 272-279, New York, NY, USA, 2008. ACM.
[16] Vlad Niculae and Mathieu Blondel. A regularized framework for sparse and structured neural attention. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3340-3350. Curran Associates, Inc., 2017.
[17] Ashish Kapoor, Raajay Viswanathan, and Prateek Jain. Multilabel classification using bayesian compressed sensing. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2645-2653. Curran Associates, Inc., 2012.
[18] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. OpenNMT: Open-source toolkit for neural machine translation. CoRR, abs/1701.02810, 2017.
[19] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440-2448. Curran Associates, Inc., 2015.
Supplementary Material
A.1 Closed-Form Solution of Sparsegen
The formulation is given by:

\[ p^* = \mathrm{sparsegen}(z) = \operatorname*{argmin}_{p \in \Delta^{K-1}} \|p - g(z)\|_2^2 - \lambda \|p\|_2^2 \quad (8) \]

The Lagrangian of the above formulation is:

\[ \mathcal{L}(z, \mu, \tau) = \tfrac{1}{2}\|p - g(z)\|_2^2 - \tfrac{\lambda}{2}\|p\|_2^2 - \mu^T p + \tau(\mathbf{1}^T p - 1) \quad (9) \]

Defining the Karush-Kuhn-Tucker conditions with respect to the optimal solutions (p*, µ*, τ*):

\[ p^*_i - g_i(z) - \lambda p^*_i - \mu^*_i + \tau^* = 0, \quad \forall i \in [K], \quad (10) \]
\[ \mathbf{1}^T p^* = 1, \quad p^* \geq 0, \quad \mu^* \geq 0, \quad (11) \]
\[ \mu^*_i p^*_i = 0, \quad \forall i \in [K]. \quad (\textit{complementary slackness}) \quad (12) \]

From Eq. 12, if we want p*_i > 0 for certain i ∈ [K], then we must have µ*_i = 0. This implies p*_i − g_i(z) − λp*_i + τ* = 0 (according to Eq. 10), hence p*_i = (g_i(z) − τ*)/(1 − λ). For i ∉ S(z), where p_i = 0, we have g_i(z) − τ* ≤ 0 (as µ*_i ≥ 0). From Eq. 11 we obtain Σ_{j∈S(z)} (g_j(z) − τ*) = 1 − λ, which leads to τ* = (Σ_{j∈S(z)} g_j(z) − (1 − λ))/|S(z)|.

Proposition 0.1 The closed-form solution of sparsegen is as follows (∀i ∈ [K]):

\[ \rho_i(z) = \mathrm{sparsegen}_i(z; g, \lambda) = \left[ \frac{g_i(z) - \tau(z)}{1 - \lambda} \right]_+, \quad (13) \]

where τ(z) is the threshold which makes Σ_{j∈[K]} ρ_j(z) = 1. Let g_(1)(z) ≥ g_(2)(z) ≥ ··· ≥ g_(K)(z) be the sorted coordinates of g(z). The cardinality of the support set S(z) is given by k(z) := max{k ∈ [K] | 1 − λ + k g_(k)(z) > Σ_{j≤k} g_(j)(z)}. Then τ(z) can be obtained as

\[ \tau(z) = \frac{\sum_{j \leq k(z)} g_{(j)}(z) - (1 - \lambda)}{k(z)} = \frac{\sum_{j \in S(z)} g_j(z) - (1 - \lambda)}{|S(z)|}. \]

A.2 Special cases of Sparsegen

Example 1: g(z) = z (sparsegen-lin): When g(z) = z, sparsegen reduces to a regularized extension of sparsemax. This is translation invariant, has monotonicity and can work for the full domain. It is also Lipschitz continuous with the Lipschitz constant being 1/(1 − λ). The visualization of this variant is shown in Fig.

Example 2: g(z) = exp(z) (sparsegen-exp): exp(z) here means element-wise exponentiation of z, that is, g_i(z) = exp(z_i). For z where all z_i > 0, sparsegen-exp leads to sparser solutions than sparsemax. Checking the properties discussed earlier: it satisfies monotonicity, as exp(z_i) is a monotonically increasing function. However, as exp(z) is not Lipschitz continuous, sparsegen-exp is not Lipschitz continuous. It is interesting to note that sparsegen-exp reduces to softmax when λ is dependent on z and equals 1 − Σ_{j∈[K]} e^{z_j}, as this results in τ(z) = 0 and S(z) = [K] according to Eq. 13.

Example 3: g(z) = z² (sparsegen-sq): Unlike sparsegen-lin and sparsegen-exp, sparsegen-sq does not satisfy the monotonicity property, as g_i(z) = z_i² is not monotonic. Also, sparsegen-sq has neither translation invariance nor scale invariance. Moreover, as z² is not Lipschitz continuous, sparsegen-sq is not Lipschitz continuous. Additionally, when λ depends on z and λ = 1 − Σ_{j∈[K]} z_j², sparsegen-sq reduces to spherical softmax.

Example 4: g(z) = log(z) (sparsegen-log): log(z) here means element-wise natural logarithm of z, that is, g_i(z) = log(z_i). This is scale invariant but not translation invariant. However, it is not defined for negative or zero values of z_i.

A.3 Derivation of Sparsecone

Let us project a point z = (z_1, ..., z_K) ∈ R^K onto the simplex along the line connecting z with q = (−q, ..., −q) ∈ R^K. The equation of this line is αz + (1 − α)q. If the point of intersection of the line with the probability simplex (denoted by 1ᵀz = 1) is represented as ẑ = α*z + (1 − α*)q, then it should have the property Σᵢ ẑᵢ = α* Σᵢ zᵢ − (1 − α*)Kq = 1, which leads to α* = (1 + Kq)/(Σᵢ zᵢ + Kq). Also, the condition ẑᵢ ≥ 0 must be satisfied, as the point should lie on the probability simplex. Thus, the required point can be obtained by solving the following:

\[ p^* = \operatorname*{argmin}_{p \in \Delta^{K-1}} \|p - \hat{z}\|_2^2 = \operatorname*{argmin}_{p \in \Delta^{K-1}} \|p - \alpha^* z - (1 - \alpha^*) q\|_2^2, \quad (14) \]
\[ \phantom{p^*} = \operatorname*{argmin}_{p \in \Delta^{K-1}} \|p - \alpha^* z\|_2^2 = \mathrm{sparsemax}(\alpha^* z), \quad (15) \]
\[ \alpha^* = \frac{1 + Kq}{\sum_{i \in [K]} z_i + Kq}. \quad (16) \]

As α* is a function of z, it is denoted by α(z). The formulation in Eq. 14 is equivalent to Eq. 15, as (1 − α*)q is a constant (uniform) term for a particular z and sparsemax is translation invariant. Thus, we can say sparsecone can be obtained as sparsemax(α(z)z), where α(z) is given by Eq. 16.
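The closed forms above, together with the sparsehourglass mapping of the next subsection (Eq. 19), can be sketched in NumPy as follows. This is an illustrative implementation of Proposition 0.1 and Eqs. 14-19, not the authors' released code:

```python
import numpy as np

def sparsegen(z, g=lambda v: v, lam=0.0):
    """Closed form of Proposition 0.1 (requires lam < 1):
    sparsegen_i(z; g, lam) = [(g_i(z) - tau(z)) / (1 - lam)]_+ ."""
    gz = np.asarray(g(np.asarray(z, dtype=float)), dtype=float)
    g_sorted = np.sort(gz)[::-1]
    cssv = np.cumsum(g_sorted)
    ks = np.arange(1, gz.size + 1)
    # support size k(z) = max{k : (1 - lam) + k * g_(k) > sum_{j<=k} g_(j)}
    k = ks[(1.0 - lam) + ks * g_sorted > cssv][-1]
    tau = (cssv[k - 1] - (1.0 - lam)) / k
    return np.maximum((gz - tau) / (1.0 - lam), 0.0)

def sparsemax(z):
    """Euclidean projection onto the simplex [14]: sparsegen with g = id, lam = 0."""
    return sparsegen(z, lam=0.0)

def sparsecone(z, q=1.0):
    """Eqs. 14-16: sparsemax(alpha(z) * z); valid when sum(z) > -K*q."""
    z = np.asarray(z, dtype=float)
    alpha = (1.0 + z.size * q) / (z.sum() + z.size * q)
    return sparsemax(alpha * z)

def sparsehourglass(z, q=1.0):
    """Eqs. 17-19: the mirror-point fix replaces sum(z) by |sum(z)|."""
    z = np.asarray(z, dtype=float)
    alpha = (1.0 + z.size * q) / (abs(z.sum()) + z.size * q)
    return sparsemax(alpha * z)
```

As a sanity check, sparsegen with g(z) = exp(z) and λ = 1 − Σⱼ e^{zⱼ} reproduces softmax (τ(z) = 0 and full support), and sparsehourglass((−1, −2)) with q = 1 assigns more probability mass to −1 than to −2, unlike sparsecone on the same input.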
A.4 Derivation of Sparsehourglass
The above formulation fails for z such that Σᵢ zᵢ < −Kq. For example, with K = 2 and q = 1, it fails for z = (−1, −2), where q = (−1, −1): the solution gives greater probability mass to −2 compared to −1. To make it work for such a z, we can find a mirror point z̃ satisfying the following properties: (1) Σᵢ z̃ᵢ = −Σᵢ zᵢ, (2) z̃ᵢ − z̃ⱼ = zᵢ − zⱼ ∀(i, j), and (3) sparsehourglass(z) = sparsecone(z̃). Let sparsecone(z̃) be given by sparsemax(α(z̃)z̃) (as seen earlier) and sparsehourglass(z) be defined by sparsemax(α*z). Can we find α* which satisfies the mirror point properties above? Property (3) is true iff α(z̃)(z̃ᵢ − z̃ⱼ) = α*(zᵢ − zⱼ) ∀(i, j). Using property (2) we get α(z̃) = α*. From the definition of sparsecone, α(z̃) = (1 + Kq)/(Σᵢ z̃ᵢ + Kq). Using property (1), we get α* = α(z̃) = (1 + Kq)/(−Σᵢ zᵢ + Kq). As α* is a function of z, let it be represented as α̂(z). Thus, sparsehourglass is defined as follows:

\[ p^* = \mathrm{sparsehourglass}(z) = \operatorname*{argmin}_{p \in \Delta^{K-1}} \|p - \hat{\alpha}(z) z\|_2^2 \quad (17) \]
\[ \phantom{p^*} = \mathrm{sparsemax}(\hat{\alpha}(z) z), \quad (18) \]
\[ \hat{\alpha}(z) = \frac{1 + Kq}{\left| \sum_{i \in [K]} z_i \right| + Kq}. \quad (19) \]

A.5 Proof for Lipschitz constant of Sparsehourglass
The Lipschitz constant of a composition of functions is bounded above by the product of the Lipschitz constants of the functions in the composition. Applying this principle to Eq. 18, we find an upper bound for the Lipschitz constant of sparsehg (denoted L_sparsehg) as the product of the Lipschitz constant of sparsemax (denoted L_sparsemax) and the Lipschitz constant of g(z) = α̂(z)z (denoted L_g):

\[ L_{\mathrm{sparsehg}} \leq L_{\mathrm{sparsemax}} \times L_g \quad (20) \]

We use the property that the Lipschitz constant of a function is the largest matrix norm of the Jacobian of that function. For sparsemax that is

\[ L_{\mathrm{sparsemax}} = \max_{x, z \in \mathbb{R}^K} \frac{\|J_{\mathrm{sparsemax}}(z)\, x\|}{\|x\|}, \]

where J_sparsemax(z) is the Jacobian of sparsemax at a given value z, given by

\[ J_{\mathrm{sparsemax}}(z) = \mathrm{Diag}(s) - \frac{s s^T}{|S(z)|}, \]

where s is the indicator vector of the support S(z). We use the property that, for any symmetric matrix, the matrix norm is the largest absolute eigenvalue; the eigenvalues of J_sparsemax(z) are 0 and 1. Hence, L_sparsemax = 1.

For g(z) = α̂(z)z, the Jacobian is given by

\[ J_g(z) = \hat{\alpha}(z) I - \frac{\hat{\alpha}(z)\, \mathrm{sgn}\!\left(\sum_i z_i\right)}{\left|\sum_i z_i\right| + Kq} \; z \mathbf{1}^T, \quad (21) \]

where I denotes the identity matrix and z1ᵀ is the matrix whose columns are all z. The eigenvalues of J_g(z) are α̂(z) and α̂(z)(1 − |Σᵢzᵢ|/(|Σᵢzᵢ| + Kq)), and the largest among them is clearly α̂(z). As α̂(z) > 0, all the eigenvalues of J_g(z) are greater than 0, making J_g(z) a positive definite matrix. We now use the property that for any positive definite matrix, the matrix norm is the largest eigenvalue. From Eq. 19, we see that the largest eigenvalue of J_g(z), namely α̂(z), assumes its largest value when Σᵢ zᵢ = 0. The highest value of α̂(z) is (1 + Kq)/Kq = 1 + 1/(Kq); therefore, L_g = 1 + 1/(Kq). Thus,

\[ L_{\mathrm{sparsehg}} \leq L_{\mathrm{sparsemax}} \times L_g = 1 \times \left(1 + \frac{1}{Kq}\right) = 1 + \frac{1}{Kq}. \]

It turns out this is also a tight bound; hence, L_sparsehg = 1 + 1/(Kq).

A.6 Synthetic dataset creation
We use scikit-learn for generating synthetic datasets with 5000 instances split across train (0.5), validation (0.2) and test (0.3). Each instance is a multilabel document consisting of a sequence of words. The vocabulary size and the number of labels K are held fixed. Each instance is generated as follows: we draw the number of true labels N ∈ {1, ..., K} from a discrete uniform distribution. Then we sample N labels from {1, ..., K} without replacement and sample L words (the document length) from the mixture of the sampled label-specific distributions over words.

A.7 More results on multilabel classification experiments

A.7.1 Synthetic Dataset
Fig. 6 shows the corresponding results in terms of JSD.

Figure 6: JSD on multilabel classification synthetic dataset; panels: (a) varying mean, (b) varying range, (c) varying document length; curves: softmax+log, sparsemax+huber, sparsemax+hinge, sparsehg+hinge.
A.7.2 Real Datasets
Table 3: F-score on three benchmark multilabel datasets
Activation + Loss    SCENE   EMOTIONS   BIRDS
softmax+log          -       -          -
sparsemax+huber      0.70    0.63       0.42
sparsemax+hinge      0.68    -          -

A.8 Controlled Sparse Attention

Here we discuss the attention mechanism with respect to the encoder-decoder architecture illustrated in Fig. 7. Let us say we have a sequence of inputs (possibly a sequence of words) whose corresponding sequence of word vectors is X = (x_1, ..., x_K). The task is to produce a sequence of words Y = (y_1, ..., y_L) as output. Here an encoder RNN encodes the input sequence into a sequence of hidden state vectors H = (h_1, ..., h_K). A decoder RNN generates its sequence of hidden states S = (s_1, ..., s_L) by considering different h_i while generating different s_j. One way is to discretely select a single h_i at each decoding step; this is known as hard attention. This approach is known to be non-differentiable as it is based on a sampling-based approach [6]. As an alternative, a softer version (called soft attention) is more popularly used. It involves generating a score vector z_t = (z_{t1}, ..., z_{tK}) based on the relevance of the encoder hidden states H with respect to the current decoder state vector s_t using Eq. 22. The score vector is then transformed to a probability distribution using Eq. 23. Considering the attention model parameters [W_s, W_h, v], we define the following:

\[ z_{ti} = v^T \tanh(W_s s_t + W_h h_i), \quad (22) \]
\[ p_t = \mathrm{softmax}(z_t). \quad (23) \]

Following the approach of [14], we replace Eq. 23 with our formulations, namely sparsegen-lin and sparsehourglass. This enables us to apply explicit control over the attention weights (with respect to sparsity and the tradeoff between scale and translation invariance) produced in the sequence-to-sequence architecture. This falls in the sparse attention paradigm proposed in earlier work. Unlike hard attention, these formulations lead to easy computation of sub-gradients, thus enabling backpropagation in this encoder-decoder architecture composed of RNNs. This approach can also be extended to other forms of non-sequential input (like images) as long as we can compute the score vector z_t from them, possibly using a different encoder such as a CNN instead of an RNN.

Figure 7: Encoder-Decoder with Sparse Attention

A.8.1 Neural Machine Translation
The first task we consider for our experiments is neural machine translation. Our aim is to see how our techniques compare with softmax and sparsemax with respect to the BLEU metric and interpretability. We consider the French-English language pair from the NMT-Benchmark project and perform experiments both ways using controlled sparse attention. The dataset has around 1M parallel training instances along with validation and test sets of 1K parallel instances each.

We find from our experiments (refer Table 2) that sparsegen-lin surpasses the BLEU scores of softmax and sparsemax for FR-EN translation, whereas the other formulations yield comparable performance. On the other hand, for EN-FR translation, softmax is still better than the others. Quantitatively, these metrics show that adding explicit controls does not come at the cost of accuracy. In addition, it is encouraging to see (refer Fig. 8) that increasing λ for sparsegen-lin leads to crisper and hence more interpretable attention heatmaps (the fewer the activated columns per row, the better).

http://scorer.nmt-benchmark.net/

Figure 8: The attention heatmaps for softmax, sparsemax and sparsegen-lin (λ = 0.) in the translation (FR-EN) task.

Figure 9: The attention heatmaps for softmax, sparsemax and sparsegen-lin (λ = 0.) in the summarization (Gigaword) task.

A.8.2 Abstractive Summarization
We next perform our experiments on the abstractive summarization datasets Gigaword, DUC2003 and DUC2004. The dataset consists of pairs of sentences where the task is to generate the second sentence (a news headline) as a summary of the longer first sentence. We trained our models on nearly 4M training pairs of Gigaword and validated on 190K pairs. We then report the test scores according to ROUGE metrics on 2K test instances of Gigaword, 624 instances of DUC2003 and 500 instances of DUC2004.

The results are reported in Table 2. The λ control leads to more interpretable attention heatmaps, as shown in Fig. 9.

https://github.com/harvardnlp/sent-summary