Soft-DTW: a Differentiable Loss Function for Time-Series
Marco Cuturi    Mathieu Blondel

Abstract
We propose in this paper a differentiable learning loss between time series, building upon the celebrated dynamic time warping (DTW) discrepancy. Unlike the Euclidean distance, DTW can compare time series of variable size and is robust to shifts or dilatations across the time dimension. To compute DTW, one typically solves a minimal-cost alignment problem between two time series using dynamic programming. Our work takes advantage of a smoothed formulation of DTW, called soft-DTW, that computes the soft-minimum of all alignment costs. We show in this paper that soft-DTW is a differentiable loss function, and that both its value and gradient can be computed with quadratic time/space complexity (DTW has quadratic time but linear space complexity). We show that this regularization is particularly well suited to average and cluster time series under the DTW geometry, a task for which our proposal significantly outperforms existing baselines (Petitjean et al., 2011). Next, we propose to tune the parameters of a machine that outputs time series by minimizing its fit with ground-truth labels in a soft-DTW sense.
1. Introduction
The goal of supervised learning is to learn a mapping that links an input to an output object, using examples of such pairs. This task is noticeably more difficult when the output objects have a structure, i.e. when they are not vectors (Bakir et al., 2007). We study here the case where each output object is a time series, namely a family of observations indexed by time. While it is tempting to treat time as yet another feature, and handle time series of vectors as the concatenation of all these vectors, several practical issues arise when taking this simplistic approach: time-indexed phenomena can often be stretched in some areas along the time axis (a word uttered in a slightly slower pace than usual) with no impact on their characteristics; varying sampling conditions may mean they have different lengths; time series may not be synchronized.

CREST, ENSAE, Université Paris-Saclay, France. NTT Communication Science Laboratories, Seika-cho, Kyoto, Japan. Correspondence to: Marco Cuturi <[email protected]>, Mathieu Blondel <[email protected]>. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).
Figure 1.
Given the first part of a time series, we trained two multi-layer perceptrons (MLPs) to predict the entire second part. Using the ShapesAll dataset, we used a Euclidean loss for the first MLP and the soft-DTW loss proposed in this paper for the second one. We display above the prediction obtained for a given test instance with either of these two MLPs in addition to the ground truth. Oftentimes, we observe that the soft-DTW loss enables us to better predict sharp changes. More time series predictions are given in Appendix F.
The DTW paradigm.
Generative models for time series are usually built having the invariances above in mind: such properties are typically handled through latent variables and/or Markovian assumptions (Lütkepohl, 2005, Part I). A simpler approach lies in the direct definition of a discrepancy between time series that encodes these invariances, such as the dynamic time warping (DTW) score (Sakoe & Chiba, 1971; 1978). DTW compares two time series of lengths n and m by computing first the n × m pairwise distance matrix between these points to solve then a dynamic program (DP) using Bellman's recursion with a quadratic O(nm) cost.

The DTW geometry.
Because it encodes efficiently a useful class of invariances, DTW has often been used in a discriminative framework (with a k-NN or SVM classifier) to predict a real or a class label output, and engineered to run faster in that context (Yi et al., 1998). Recent works by Petitjean et al. (2011); Petitjean & Gançarski (2012) have, however, shown that DTW can be used for more innovative tasks, such as time series averaging using the DTW discrepancy (see Schultz & Jain 2017 for a gentle introduction to these ideas). More generally, the idea of synthesizing time series centroids can be regarded as a first attempt to output entire time series using DTW as a fitting loss. From a computational perspective, these approaches are, however, hampered by the fact that DTW is not differentiable and unstable when used in an optimization pipeline.

Soft-DTW.
In parallel to these developments, several authors have considered smoothed modifications of Bellman's recursion to define smoothed DP distances (Bahl & Jelinek, 1975; Ristad & Yianilos, 1998) or kernels (Saigo et al., 2004; Cuturi et al., 2007). When applied to the DTW discrepancy, that regularization results in a soft-DTW score, which considers the soft-minimum of the distribution of all costs spanned by all possible alignments between two time series. Despite considering all alignments and not just the optimal one, soft-DTW can be computed with a minor modification of Bellman's recursion, in which all (min, +) operations are replaced with (+, ×). As a result, both DTW and soft-DTW have quadratic in time & linear in space complexity with respect to the sequences' lengths. Because soft-DTW can be used with kernel machines, one typically observes an increase in performance when using soft-DTW over DTW (Cuturi, 2011) for classification.

Our contributions.
We explore in this paper another important benefit of smoothing DTW: unlike the original DTW discrepancy, soft-DTW is differentiable in all of its arguments. We show that the gradients of soft-DTW w.r.t. all of its variables can be computed as a by-product of the computation of the discrepancy itself, with an added quadratic storage cost. We use this fact to propose an alternative approach to the DBA (DTW Barycenter Averaging) clustering algorithm of (Petitjean et al., 2011), and observe that our smoothed approach significantly outperforms known baselines for that task. More generally, we propose to use soft-DTW as a fitting term to compare the output of a machine synthesizing a time series segment with a ground truth observation, in the same way that, for instance, a regularized Wasserstein distance was used to compute barycenters (Cuturi & Doucet, 2014), and later to fit discriminators that output histograms (Zhang et al., 2015; Rolet et al., 2016). When paired with a flexible learning architecture such as a neural network, soft-DTW allows for a differentiable end-to-end approach to design predictive and generative models for time series, as illustrated in Figure 1. Source code is available at https://github.com/mblondel/soft-dtw.

Structure.
After providing background material, we show in §2 how to compute and differentiate soft-DTW, describe in §3 how the resulting loss can be used to average, cluster and predict time series, and present experimental results on these tasks in §4.

Notations.
We consider in what follows multivariate discrete time series of varying length taking values in Ω ⊂ R^p. A time series can thus be represented as a matrix of p lines and a varying number of columns. We consider a differentiable substitution-cost function δ : R^p × R^p → R_+ which will be, in most cases, the quadratic Euclidean distance between two vectors. For an integer n we write ⟦n⟧ for the set {1, . . . , n} of integers. Given two series' lengths n and m, we write A_{n,m} ⊂ {0, 1}^{n×m} for the set of (binary) alignment matrices, that is paths on an n × m matrix that connect the upper-left (1, 1) matrix entry to the lower-right (n, m) one using only ↓, →, ↘ moves. The cardinal of A_{n,m} is known as the delannoy(n−1, m−1) number; that number grows exponentially with m and n.
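The Delannoy number just mentioned can be evaluated with its standard three-term recurrence, which gives a quick sense of how fast the set A_{n,m} grows; the following short Python sketch is purely illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def delannoy(a, b):
    # Number of paths from (0, 0) to (a, b) using only down, right and diagonal moves.
    if a == 0 or b == 0:
        return 1
    return delannoy(a - 1, b) + delannoy(a, b - 1) + delannoy(a - 1, b - 1)

# Cardinal of A_{n,m} for two series of length n = m = 10: delannoy(9, 9) = 1462563.
print(delannoy(9, 9))
```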
2. The DTW and soft-DTW loss functions
We propose in this section a unified formulation for the original DTW discrepancy (Sakoe & Chiba, 1978) and the Global Alignment kernel (GAK) (Cuturi et al., 2007), which can both be used to compare two time series x = (x_1, . . . , x_n) ∈ R^{p×n} and y = (y_1, . . . , y_m) ∈ R^{p×m}. Given the cost matrix Δ(x, y) := [δ(x_i, y_j)]_{ij} ∈ R^{n×m}, the inner product ⟨A, Δ(x, y)⟩ of that matrix with an alignment matrix A in A_{n,m} gives the score of A, as illustrated in Figure 2. Both DTW and GAK consider the costs of all possible alignment matrices, yet do so differently:

    DTW(x, y) := min_{A ∈ A_{n,m}} ⟨A, Δ(x, y)⟩,
    k_GA^γ(x, y) := Σ_{A ∈ A_{n,m}} e^{−⟨A, Δ(x, y)⟩/γ}.      (1)

DP Recursion.
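Both quantities in Eq. (1) can be checked by brute force on toy examples, by enumerating every alignment path explicitly; the Python sketch below is illustrative only (the helper name alignment_costs is ours) and is intractable beyond very short series, since the number of alignments is delannoy(n−1, m−1).

```python
import numpy as np

def alignment_costs(D):
    """Enumerate the score <A, Delta(x, y)> of every alignment matrix A in A_{n,m}."""
    n, m = D.shape
    costs = []

    def walk(i, j, acc):
        acc = acc + D[i, j]
        if i == n - 1 and j == m - 1:
            costs.append(acc)
            return
        if i + 1 < n:
            walk(i + 1, j, acc)                  # down move
        if j + 1 < m:
            walk(i, j + 1, acc)                  # right move
        if i + 1 < n and j + 1 < m:
            walk(i + 1, j + 1, acc)              # diagonal move

    walk(0, 0, 0.0)
    return np.array(costs)

# Toy example with p = 1, n = 4, m = 3 and squared Euclidean costs.
x = np.array([[0.0, 1.0, 2.0, 1.0]])
y = np.array([[0.0, 2.0, 1.0]])
D = (x.T - y) ** 2
gamma = 1.0
costs = alignment_costs(D)
dtw_value = costs.min()                          # DTW(x, y)
gak_value = np.exp(-costs / gamma).sum()         # k_GA^gamma(x, y)
```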
Sakoe & Chiba (1978) showed that the Bellman equation (1952) can be used to compute DTW. That recursion, which appears in line 5 of Algorithm 1 (disregarding for now the exponent γ), only involves (min, +) operations. When considering kernel k_GA^γ and, instead, its integration over all alignments (see e.g. Lasserre 2009), Cuturi et al. (2007, Theorem 2) and the highly related formulation of Saigo et al. (2004, p. 1685) use an old algorithmic approach (Bahl & Jelinek, 1975) which consists in (i) replacing all costs by their neg-exponential; (ii) replacing (min, +) operations with (+, ×) operations. These two recursions can in fact be unified with the use of a soft-minimum operator, which we present below.

Figure 2.
Three alignment matrices (orange, green, purple, in addition to the top-left and bottom-right entries) between two time series of length 4 and 6. The cost of an alignment is equal to the sum of entries visited along the path. DTW only considers the optimal alignment (here depicted in purple pentagons), whereas soft-DTW considers all delannoy(n−1, m−1) possible alignment matrices.
Unified algorithm. Both formulas in Eq. (1) can be computed with a single algorithm. That formulation is new to our knowledge. Consider the following generalized min operator, with a smoothing parameter γ ≥ 0:

    min^γ{a_1, . . . , a_n} := min_{i ≤ n} a_i                         if γ = 0,
                               −γ log Σ_{i=1}^n e^{−a_i/γ}             if γ > 0.

With that operator, we can define γ-soft-DTW:

    dtw_γ(x, y) := min^γ{⟨A, Δ(x, y)⟩, A ∈ A_{n,m}}.

The original DTW score is recovered by setting γ to 0. When γ > 0, we recover dtw_γ = −γ log k_GA^γ. Most importantly, and in either case, dtw_γ can be computed using Algorithm 1, which requires O(nm) operations and an O(nm) storage cost as well. That cost can be reduced to a linear storage cost with a more careful implementation if one only seeks to compute dtw_γ(x, y), but the backward pass we consider next requires the entire matrix R of intermediary alignment costs. Note that, to ensure numerical stability, the operator min^γ must be computed using the usual log-sum-exp stabilization trick, namely that log Σ_i e^{z_i} = (max_j z_j) + log Σ_i e^{z_i − max_j z_j}.

A small variation in the input x causes a small change in dtw_0(x, y) or dtw_γ(x, y). When considering dtw_0, that change can be efficiently monitored only when the optimal alignment matrix A* that arises when computing dtw_0(x, y) in Eq. (1) is unique. As the minimum over a finite set of linear functions of Δ, dtw_0 is therefore locally differentiable w.r.t. the cost matrix Δ, with gradient A*, a fact that has been exploited in all algorithms designed to average time series under the DTW metric (Petitjean et al., 2011; Schultz & Jain, 2017).

Algorithm 1
Forward recursion to compute dtw_γ(x, y) and intermediate alignment costs
1: Inputs: x, y, smoothing γ ≥ 0, distance function δ
2: r_{0,0} = 0; r_{i,0} = r_{0,j} = ∞, i ∈ ⟦n⟧, j ∈ ⟦m⟧
3: for j = 1, . . . , m do
4:   for i = 1, . . . , n do
5:     r_{i,j} = δ(x_i, y_j) + min^γ{r_{i−1,j−1}, r_{i−1,j}, r_{i,j−1}}
6:   end for
7: end for
8: Output: (r_{n,m}, R)

To recover the gradient of dtw_0(x, y) w.r.t. x, we only need to apply the chain rule, thanks to the differentiability of the cost function:

    ∇_x dtw_0(x, y) = (∂Δ(x, y)/∂x)^T A*,      (2)

where ∂Δ(x, y)/∂x is the Jacobian of Δ w.r.t. x, a linear map from R^{p×n} to R^{n×m}. When δ is the squared Euclidean distance, the transpose of that Jacobian applied to a matrix B ∈ R^{n×m} is (∘ being the elementwise product):

    (∂Δ(x, y)/∂x)^T B = 2 ((1_p 1_m^T B^T) ∘ x − y B^T).

With continuous data, A* is almost always likely to be unique, and therefore the gradient in Eq. (2) will be defined almost everywhere. However, that gradient, when it exists, will be discontinuous around those values x where a small change in x causes a change in A*, which is likely to hamper the performance of gradient descent methods.

The case γ > 0. An immediate advantage of soft-DTW is that it can be explicitly differentiated, a fact that was also noticed by Saigo et al. (2006) in the related case of edit distances. When γ > 0, the gradient of Eq. (1) is obtained via the chain rule,

    ∇_x dtw_γ(x, y) = (∂Δ(x, y)/∂x)^T E_γ[A],      (3)

where

    E_γ[A] := (1 / k_GA^γ(x, y)) Σ_{A ∈ A_{n,m}} e^{−⟨A, Δ(x, y)⟩/γ} A

is the average alignment matrix A under the Gibbs distribution p_γ ∝ e^{−⟨A, Δ(x, y)⟩/γ} defined on all alignments in A_{n,m}. The kernel k_GA^γ(x, y) can thus be interpreted as the normalization constant of p_γ. Of course, since A_{n,m} has exponential size in n and m, a naive summation is not tractable. Although a Bellman recursion to compute that average alignment matrix E_γ[A] exists (see Appendix A), that computation has quartic O(n²m²) complexity.
A straightforward application ofthe chain rule then gives ∂r n,m ∂r i,j (cid:124) (cid:123)(cid:122) (cid:125) e i,j = ∂r n,m ∂r i +1 ,j (cid:124) (cid:123)(cid:122) (cid:125) e i +1 ,j ∂r i +1 ,j ∂r i,j + ∂r n,m ∂r i,j +1 (cid:124) (cid:123)(cid:122) (cid:125) e i,j +1 ∂r i,j +1 ∂r i,j + ∂r n,m ∂r i +1 ,j +1 (cid:124) (cid:123)(cid:122) (cid:125) e i +1 ,j +1 ∂r i +1 ,j +1 ∂r i,j , in which we have defined the notation of the main objectof interest of the backward recursion: e i,j := ∂r n,m ∂r i,j . TheBellman recursion evaluated at ( i + 1 , j ) as shown in line 5of Algorithm 1 (here δ i +1 ,j is δ ( x i +1 , y j ) ) yields : r i +1 ,j = δ i +1 ,j + min γ { r i,j − , r i,j , r i +1 ,j − } , which, when differentiated w.r.t r i,j yields the ratio: ∂r i +1 ,j ∂r i,j = e − r i,j /γ / (cid:16) e − r i,j − /γ + e − r i,j /γ + e − r i +1 ,j − /γ (cid:17) . The logarithm of that derivative can be conveniently castusing evaluations of min γ computed in the forward loop: γ log ∂r i +1 ,j ∂r i,j = min γ { r i,j − , r i,j , r i +1 ,j − } − r i,j = r i +1 ,j − δ i +1 ,j − r i,j . Similarly, the following relationships can also be obtained: γ log ∂r i,j +1 ∂r i,j = r i,j +1 − r i,j − δ i,j +1 ,γ log ∂r i +1 ,j +1 ∂r i,j = r i +1 ,j +1 − r i,j − δ i +1 ,j +1 . We have therefore obtained a backward recursion to com-pute the entire matrix E = [ e i,j ] , starting from e n,m = ∂r n,m ∂r n,m = 1 down to e , . To obtain ∇ x dtw γ ( x , y ) , noticethat the derivatives w.r.t. the entries of the cost matrix ∆ can be computed by ∂r n,m ∂δ i,j = ∂r n,m ∂r i,j ∂r i,j ∂δ i,j = e i,j · e i,j , and therefore we have that ∇ x dtw γ ( x , y ) = (cid:18) ∂ ∆( x , y ) ∂ x (cid:19) T E, where E is exactly the average alignment E γ [ A ] inEq. (3). These computations are summarized in Algo-rithm 2, which, once ∆ has been computed, has complexity nm in time and space. Because min γ has a /γ -Lipschitzcontinuous gradient, the gradient of dtw γ is /γ -Lipschitzcontinuous when δ is the squared Euclidean distance. Algorithm 2
Backward recursion to compute ∇_x dtw_γ(x, y)
1: Inputs: x, y, smoothing γ ≥ 0, distance function δ
2: (·, R) = dtw_γ(x, y), Δ = [δ(x_i, y_j)]_{i,j}
3: δ_{i,m+1} = δ_{n+1,j} = 0, i ∈ ⟦n⟧, j ∈ ⟦m⟧
4: e_{i,m+1} = e_{n+1,j} = 0, i ∈ ⟦n⟧, j ∈ ⟦m⟧
5: r_{i,m+1} = r_{n+1,j} = −∞, i ∈ ⟦n⟧, j ∈ ⟦m⟧
6: δ_{n+1,m+1} = 0, e_{n+1,m+1} = 1, r_{n+1,m+1} = r_{n,m}
7: for j = m, . . . , 1 do
8:   for i = n, . . . , 1 do
9:     a = exp((r_{i+1,j} − r_{i,j} − δ_{i+1,j}) / γ)
10:    b = exp((r_{i,j+1} − r_{i,j} − δ_{i,j+1}) / γ)
11:    c = exp((r_{i+1,j+1} − r_{i,j} − δ_{i+1,j+1}) / γ)
12:    e_{i,j} = e_{i+1,j} · a + e_{i,j+1} · b + e_{i+1,j+1} · c
13:  end for
14: end for
15: Output: ∇_x dtw_γ(x, y) = (∂Δ(x, y)/∂x)^T E
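The two recursions translate almost line for line into code. The following Python/NumPy sketch of Algorithms 1 and 2 is illustrative rather than the reference implementation released at https://github.com/mblondel/soft-dtw; it assumes the squared Euclidean cost and γ > 0, and the function name sdtw_value_and_grad is our own.

```python
import numpy as np

def softmin3(a, b, c, gamma):
    """min^gamma of three numbers, with the log-sum-exp stabilization trick (gamma > 0)."""
    if gamma == 0.0:
        return min(a, b, c)
    z = -np.array([a, b, c]) / gamma
    zmax = z.max()
    return -gamma * (zmax + np.log(np.exp(z - zmax).sum()))

def sdtw_value_and_grad(x, y, gamma=1.0):
    """Illustrative sketch of Algorithms 1 and 2 (quadratic time and space).

    x: (p, n) array, y: (p, m) array, delta = squared Euclidean distance.
    Returns dtw_gamma(x, y) and its gradient with respect to x, shape (p, n).
    The backward pass below assumes gamma > 0.
    """
    p, n = x.shape
    m = y.shape[1]
    D = ((x[:, :, None] - y[:, None, :]) ** 2).sum(axis=0)    # Delta(x, y), shape (n, m)

    # Algorithm 1: forward recursion, R[i, j] stores r_{i,j} (1-based indices).
    R = np.full((n + 2, m + 2), np.inf)
    R[0, 0] = 0.0
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            R[i, j] = D[i - 1, j - 1] + softmin3(R[i - 1, j - 1],
                                                 R[i - 1, j], R[i, j - 1], gamma)
    value = R[n, m]

    # Algorithm 2: backward recursion computing E[i, j] = d r_{n,m} / d r_{i,j}.
    Dpad = np.zeros((n + 2, m + 2))
    Dpad[1:n + 1, 1:m + 1] = D
    E = np.zeros((n + 2, m + 2))
    E[n + 1, m + 1] = 1.0
    R[1:n + 1, m + 1] = -np.inf
    R[n + 1, 1:m + 1] = -np.inf
    R[n + 1, m + 1] = R[n, m]
    for j in range(m, 0, -1):
        for i in range(n, 0, -1):
            a = np.exp((R[i + 1, j] - R[i, j] - Dpad[i + 1, j]) / gamma)
            b = np.exp((R[i, j + 1] - R[i, j] - Dpad[i, j + 1]) / gamma)
            c = np.exp((R[i + 1, j + 1] - R[i, j] - Dpad[i + 1, j + 1]) / gamma)
            E[i, j] = a * E[i + 1, j] + b * E[i, j + 1] + c * E[i + 1, j + 1]
    Egrid = E[1:n + 1, 1:m + 1]                                # average alignment matrix

    # Chain rule through Delta for the squared Euclidean cost:
    # (d Delta / d x)^T E = 2 ((1_p 1_m^T E^T) o x - y E^T).
    grad = 2.0 * (x * Egrid.sum(axis=1)[None, :] - y @ Egrid.T)
    return value, grad
```

A simple sanity check is to compare the returned gradient with finite differences of the value on small random inputs.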
3. Learning with the soft-DTW loss
We study in this section a direct application of Algorithm 2 to the problem of computing Fréchet means (1948) of time series with respect to the dtw_γ discrepancy. Given a family of N time series y_1, . . . , y_N, namely N matrices of p lines and varying numbers of columns m_1, . . . , m_N, we are interested in defining a single barycenter time series x for that family under a set of normalized weights λ_1, . . . , λ_N ∈ R_+ such that Σ_{i=1}^N λ_i = 1. Our goal is thus to solve approximately the following problem, in which we have assumed that x has fixed length n:

    min_{x ∈ R^{p×n}} Σ_{i=1}^N (λ_i / m_i) dtw_γ(x, y_i).      (4)

Note that each dtw_γ(x, y_i) term is divided by m_i, the length of y_i. Indeed, since dtw_γ is an increasing (roughly linearly) function of each of the input lengths n and m_i, we follow the convention of normalizing in practice each discrepancy by n × m_i. Since the length n of x is here fixed across all evaluations, we do not need to divide the objective of Eq. (4) by n. Averaging under the soft-DTW geometry results in substantially different results than those that can be obtained with the Euclidean geometry (which can only be used in the case where all lengths n = m_1 = · · · = m_N are equal), as can be seen in the intuitive interpolations we obtain between two time series shown in Figure 4.

Figure 3.
Sketch of the computational graph for soft-DTW, in the forward pass used to compute dtw_γ (left) and the backward pass used to compute its gradient ∇_x dtw_γ (right). In both diagrams, purple shaded cells stand for data values available before the recursion starts, namely cost values (left) and multipliers computed using forward pass results (right). In the left diagram, the forward computation of r_{i,j} as a function of its predecessors and δ_{i,j} is summarized with arrows. Dotted lines indicate a min^γ operation, solid lines an addition. From the perspective of the final term r_{n,m}, which stores dtw_γ(x, y) at the lower right corner (not shown) of the computational graph, a change in r_{i,j} only impacts r_{n,m} through changes that r_{i,j} causes to r_{i+1,j}, r_{i,j+1} and r_{i+1,j+1}. These changes can be tracked using the relations above and appear in lines 9-11 of Algorithm 2 as variables a, b, c, as well as in the purple shaded boxes in the backward pass (right) which represent the recursion of line 12 in Algorithm 2.
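To illustrate how Eq. (4) can be tackled in practice, the sketch below plugs the value and gradient of dtw_γ into SciPy's L-BFGS solver, the solver used in the experiments of §4; the callable value_and_grad stands for the sdtw_value_and_grad sketch given after Algorithm 2, and the random initialization is one of the two initializations compared later.

```python
import numpy as np
from scipy.optimize import minimize

def sdtw_barycenter(Y, value_and_grad, gamma=1.0, n=None, weights=None,
                    max_iter=100, seed=0):
    """Approximate solution of Eq. (4) by L-BFGS (illustrative sketch).

    Y: list of time series, arrays of shape (p, m_i).
    value_and_grad(x, y, gamma) -> (dtw_gamma(x, y), gradient w.r.t. x).
    """
    p = Y[0].shape[0]
    n = n or Y[0].shape[1]                           # fixed barycenter length
    weights = weights if weights is not None else np.ones(len(Y)) / len(Y)
    x0 = np.random.RandomState(seed).randn(p, n)     # random initialization

    def objective(x_flat):
        x = x_flat.reshape(p, n)
        val, grad = 0.0, np.zeros_like(x)
        for lam_i, y in zip(weights, Y):
            v, g = value_and_grad(x, y, gamma)
            val += lam_i * v / y.shape[1]             # each term divided by m_i, as in Eq. (4)
            grad += lam_i * g / y.shape[1]
        return val, grad.ravel()

    res = minimize(objective, x0.ravel(), jac=True, method="L-BFGS-B",
                   options={"maxiter": max_iter})
    return res.x.reshape(p, n)
```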
Interpolation between two time series (red and blue) onthe Gun Point dataset. We computed the barycenter by solving Eq.(4) with ( λ , λ ) set to (0.25, 0.75), (0.5, 0.5) and (0.75, 0.25).The soft-DTW geometry leads to visibly different interpolations. The (approximate) computation of dtw γ barycenters canbe seen as a first step towards the task of clustering timeseries under the dtw γ discrepancy. Indeed, one can nat-urally formulate that problem as that of finding centroids x , . . . , x k that minimize the following energy: min x ,..., x k ∈ R p × n N (cid:88) i =1 m i min j ∈ [[ k ]] dtw γ ( x j , y i ) . (5)To solve that problem one can resort to a direct generaliza-tion of Lloyd’s algorithm (1982) in which each centeringstep and each clustering allocation step is done accordingto the dtw γ discrepancy. One of the de-facto baselines for learning to classify timeseries is the k nearest neighbors ( k -NN) algorithm, com-bined with DTW as discrepancy measure between time se-ries. However, k -NN has two main drawbacks. First, thetime series used for training must be stored, leading topotentially high storage cost. Second, in order to com- oft-DTW: a Differentiable Loss Function for Time-Series pute predictions on new time series, the DTW discrep-ancy must be computed with all training time series, lead-ing to high computational cost. Both of these drawbackscan be addressed by the nearest centroid classifier (Hastieet al., 2001, p.670), (Tibshirani et al., 2002). This methodchooses the class whose barycenter (centroid) is closestto the time series to classify. Although very simple, thismethod was shown to be competitive with k -NN, while re-quiring much lower computational cost at prediction time(Petitjean et al., 2014). Soft-DTW can naturally be usedin a nearest centroid classifier, in order to compute thebarycenter of each class at train time, and to compute thediscrepancy between barycenters and time series, at predic-tion time. Soft-DTW is ideally suited as a loss function for any taskthat requires time series outputs. As an example of such atask, we consider the problem of, given the first , . . . , t observations of a time series, predicting the remaining ( t + 1) , . . . , n observations. Let x t,t (cid:48) ∈ R p × ( t (cid:48) − t +1) bethe submatrix of x ∈ R p × n of all columns with indices be-tween t and t (cid:48) , where ≤ t < t (cid:48) < n . Learning to predictthe segment of a time series can be cast as the problem min θ ∈ Θ N (cid:88) i =1 dtw γ (cid:16) f θ ( x ,ti ) , x t +1 ,ni (cid:17) , where { f θ } is a set of parameterized function that takeas input a time series and outputs a time series. Naturalchoices would be multi-layer perceptrons or recurrent neu-ral networks (RNN), which have been historically trainedwith a Euclidean loss (Parlos et al., 2000, Eq.5).
4. Experimental results
Throughout this section, we use the UCR (University of California, Riverside) time series classification archive (Chen et al., 2015). We use a subset containing 79 datasets encompassing a wide variety of fields (astronomy, geology, medical imaging) and lengths. Datasets include class information (up to 60 classes) for each time series and are split into train and test sets. Due to the large number of datasets in the UCR archive, we choose to report only a summary of our results in the main manuscript. Detailed results are included in the appendices for interested readers.
In this section, we compare the soft-DTW barycenter approach presented in §3 with two baselines, DBA (Petitjean et al., 2011) and a subgradient method (Schultz & Jain, 2017).

Experimental setup.
For each dataset, we choose a class at random, pick 10 time series in that class and compute their barycenter.
Table 1.
Percentage of the datasets on which the proposed soft-DTW barycenter achieves lower DTW loss (Equation (4) with γ = 0) than competing methods, for γ ∈ {1, 0.1, 0.01, 0.001}, under random and Euclidean mean initialization, in comparison with DBA and with the subgradient method.

For quantitative results below, we repeat this procedure 10 times and report the averaged results. For each method, we set the maximum number of iterations to 100. To minimize the proposed soft-DTW barycenter objective, Eq. (4), we use L-BFGS.

Qualitative results.
We first visualize the barycenters obtained by soft-DTW when γ = 1 and γ = 0.01, by DBA and by the subgradient method. Figure 5 shows barycenters obtained using random initialization on the ECG200 dataset. More results with both random and Euclidean mean initialization are given in Appendix B and C. We observe that both DBA and soft-DTW with a low smoothing parameter γ yield barycenters that are spurious. On the other hand, a descent on the soft-DTW loss with sufficiently high γ converges to a reasonable solution. For example, as indicated in Figure 5 with DTW or soft-DTW (γ = 0.01), the small kink around x = 15 is not representative of any of the time series in the dataset. However, with soft-DTW (γ = 1), the barycenter closely matches the time series. This suggests that DTW or soft-DTW with too low γ can get stuck in bad local minima. When using Euclidean mean initialization (only possible if time series have the same length), DTW or soft-DTW with low γ often yield barycenters that better match the shape of the time series. However, they tend to overfit: they absorb the idiosyncrasies of the data. In contrast, soft-DTW is able to learn barycenters that are much smoother.

Quantitative results.
Table 1 summarizes the percentage of datasets on which the proposed soft-DTW barycenter achieves lower DTW loss when varying the smoothing parameter γ. The actual loss values achieved by different methods are indicated in Appendix G and Appendix H. As γ decreases, soft-DTW achieves a lower DTW loss than other methods on almost all datasets. This confirms our claim that the smoothness of soft-DTW leads to an objective that is better behaved and more amenable to optimization by gradient-descent methods.

Figure 5.
Comparison between our proposed soft barycenter and the barycenter obtained by DBA and the subgradient method, on the ECG200 dataset. When DTW is insufficiently smoothed, barycenters often get stuck in a bad local minimum that does not correctly match the time series.

k-means clustering experiments. We consider in this section the same computational tools used in the barycenter experiments above, this time within Lloyd's algorithm to cluster time series under the dtw_γ geometry, Eq. (5).

Experimental setup.
For all datasets, the number of clusters k is equal to the number of classes available in the dataset. Lloyd's algorithm alternates between a centering step (barycenter computation) and an assignment step. We cap the number of outer iterations and set the maximum number of inner (barycenter) iterations to 100, as before. Again, for soft-DTW, we use L-BFGS.

Qualitative results.
Figure 6 shows the clusters obtained when running Lloyd's algorithm on the CBF dataset with soft-DTW (γ = 1) and DBA, in the case of random initialization. More results are included in Appendix E. Clearly, DTW absorbs the tiny details in the data, while soft-DTW is able to learn much smoother barycenters.

Quantitative results.
Table 2 summarizes the percentage of datasets on which the soft-DTW barycenter achieves lower k-means loss under DTW, i.e. Eq. (5) with γ = 0. The actual loss values achieved by all methods are indicated in Appendix I and Appendix J. The results confirm the same trend as for the barycenter experiments. Namely, as γ decreases, soft-DTW is able to achieve lower loss than other methods on a large proportion of the datasets. Note that we have not run experiments with smaller values of γ than 0.001, since dtw_{0.001} is very close to dtw_0 in practice.
Clusters obtained on the CBF dataset when plugging ourproposed soft barycenter and that of DBA in Lloyd’s algorithm.DBA absorbs the idiosyncrasies of the data, while soft-DTW canlearn much smoother barycenters.
In this section, we investigate whether the smoothing in soft-DTW can act as a useful regularization and improve classification accuracy in the nearest centroid classifier.
Experimental setup.
We use 50% of the data for training, 25% for validation and 25% for testing. We choose γ by validation over 15 log-spaced values.

Quantitative results.
Each point in Figure 7 above the diagonal line represents a dataset for which using soft-DTW for barycenter computation rather than DBA improves the accuracy of the nearest centroid classifier. To summarize, we found that soft-DTW works better or at least as well as DBA in 75% of the datasets.
In this section, we present preliminary experiments for the task of multistep-ahead prediction, described in §3.

Experimental setup.
We use the training and test sets predefined in the UCR archive. In both the training and test sets, we use the first 60% of the time series as input and the remaining 40% as output, ignoring class information. We then use the training set to learn a model that predicts the outputs from inputs and the test set to evaluate results with both Euclidean and DTW losses. In this experiment, we focus on a simple multi-layer perceptron (MLP) with one hidden layer and sigmoid activation. We also experimented with linear models and recurrent neural networks (RNNs) but they did not improve over a simple MLP.
Table 2.
Percentage of the datasets on which the proposed soft-DTW based k-means achieves lower DTW loss (Equation (5) with γ = 0) than competing methods, for γ ∈ {1, 0.1, 0.01, 0.001}, under random and Euclidean mean initialization, in comparison with DBA and with the subgradient method.

Figure 7.
Each point above the diagonal represents a dataset where using our soft-DTW barycenter rather than that of DBA improves the accuracy of the nearest centroid classifier. This is the case for 75% of the datasets in the UCR archive.
Implementation details.
Deep learning frameworks such as Theano, TensorFlow and Chainer allow the user to specify a custom backward pass for their function. Implementing such a backward pass, rather than resorting to automatic differentiation (autodiff), is particularly important in the case of soft-DTW: First, the autodiff in these frameworks is designed for vectorized operations, whereas the dynamic program used by the forward pass of Algorithm 1 is inherently element-wise; Second, as we explained in §2, the backward recursion of Algorithm 2 reuses the intermediary alignment costs R stored during the forward pass, a structure that generic autodiff would not exploit as efficiently.
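To illustrate this pattern, the sketch below wraps the two recursions as a custom operator in PyTorch syntax (a framework not used in the experiments reported here; Theano, TensorFlow and Chainer expose analogous mechanisms). The operator takes a precomputed cost matrix D, so the chain rule through Δ(x, y) is left to the framework's autodiff; names and interfaces are illustrative.

```python
import torch

class SoftDTW(torch.autograd.Function):
    """Custom op computing dtw_gamma from a cost matrix D, with the backward
    recursion of Algorithm 2 as its hand-written backward pass (sketch)."""

    @staticmethod
    def forward(ctx, D, gamma):
        n, m = D.shape
        R = torch.full((n + 2, m + 2), float("inf"), dtype=D.dtype, device=D.device)
        R[0, 0] = 0.0
        for j in range(1, m + 1):
            for i in range(1, n + 1):
                # min^gamma of the three predecessors, via logsumexp.
                rmin = -gamma * torch.logsumexp(
                    -torch.stack([R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]]) / gamma, dim=0)
                R[i, j] = D[i - 1, j - 1] + rmin
        ctx.save_for_backward(D, R)
        ctx.gamma = gamma
        return R[n, m].clone()

    @staticmethod
    def backward(ctx, grad_output):
        D, R = ctx.saved_tensors
        gamma = ctx.gamma
        n, m = D.shape
        Dp = torch.zeros(n + 2, m + 2, dtype=D.dtype, device=D.device)
        Dp[1:n + 1, 1:m + 1] = D
        E = torch.zeros(n + 2, m + 2, dtype=D.dtype, device=D.device)
        E[n + 1, m + 1] = 1.0
        R = R.clone()
        R[1:n + 1, m + 1] = -float("inf")
        R[n + 1, 1:m + 1] = -float("inf")
        R[n + 1, m + 1] = R[n, m]
        for j in range(m, 0, -1):
            for i in range(n, 0, -1):
                a = torch.exp((R[i + 1, j] - R[i, j] - Dp[i + 1, j]) / gamma)
                b = torch.exp((R[i, j + 1] - R[i, j] - Dp[i, j + 1]) / gamma)
                c = torch.exp((R[i + 1, j + 1] - R[i, j] - Dp[i + 1, j + 1]) / gamma)
                E[i, j] = a * E[i + 1, j] + b * E[i, j + 1] + c * E[i + 1, j + 1]
        # Gradient w.r.t. D is the average alignment matrix E; None for gamma.
        return grad_output * E[1:n + 1, 1:m + 1], None
```

A loss can then be formed as SoftDTW.apply(torch.cdist(xhat.t(), y.t()) ** 2, gamma) for a network output xhat, with gradients flowing back to xhat through the cost matrix.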
Qualitative results. Visualizations of the predictions obtained under Euclidean and soft-DTW losses are given in Figure 1, as well as in Appendix F. We find that for simple one-dimensional time series, an MLP works very well, showing its ability to capture patterns in the training set. Although the predictions under Euclidean and soft-DTW losses often agree with each other, they can sometimes be visibly different. Predictions under the soft-DTW loss can confidently predict abrupt and sharp changes since those have a low DTW cost as long as such a sharp change is present, under a small time shift, in the ground truth.
Table 3.
Averaged rank obtained by a multi-layer perceptron (MLP) under Euclidean and soft-DTW losses. Euclidean initialization means that we initialize the MLP trained with soft-DTW loss by the solution of the MLP trained with Euclidean loss.

Training loss                  Random initialization   Euclidean initialization
When evaluating with DTW loss
  Euclidean                    3.46                    4.21
  soft-DTW (γ = 1)             3.55                    3.96
  soft-DTW (γ = 0.1)           3.33                    3.42
  soft-DTW (γ = 0.01)          2.79                    2.12
  soft-DTW (γ = 0.001)
When evaluating with Euclidean loss
  Euclidean
  soft-DTW (γ = 1)             2.41                    2.99
  soft-DTW (γ = 0.1)           3.42                    3.38
  soft-DTW (γ = 0.01)          4.13                    3.64
  soft-DTW (γ = 0.001)         3.99                    3.29
Quantitative results. A comparison summary of our MLP under Euclidean and soft-DTW losses over the UCR archive is given in Table 3. Detailed results are given in the appendix. Unsurprisingly, we achieve lower DTW loss when training with the soft-DTW loss, and lower Euclidean loss when training with the Euclidean loss. Because DTW is robust to several useful invariances, a small error in the soft-DTW sense could be a more judicious choice than an error in a Euclidean sense for many applications.
5. Conclusion
We propose in this paper to turn the popular DTW discrepancy between time series into a full-fledged loss function between ground truth time series and outputs from a learning machine. We have shown experimentally that, on the existing problem of computing barycenters and clusters for time series data, our computational approach is superior to existing baselines. We have shown promising results on the problem of multistep-ahead time series prediction, which could prove extremely useful in settings where a user's actual loss function for time series is closer to the robust perspective given by DTW than to the local parsing of the Euclidean distance.
Acknowledgements.
MC gratefully acknowledges the support of a chaire de l'IDEX Paris Saclay.

References
Bahl, L and Jelinek, Frederick. Decoding for channels with insertions, deletions, and substitutions with applications to speech recognition. IEEE Transactions on Information Theory, 21(4):404-411, 1975.
Bakir, GH, Hofmann, T, Schölkopf, B, Smola, AJ, Taskar, B, and Vishwanathan, SVN. Predicting Structured Data. Advances in neural information processing systems. MIT Press, Cambridge, MA, USA, 2007.
Bellman, Richard. On the theory of dynamic programming. Proceedings of the National Academy of Sciences, 38(8):716-719, 1952.
Blondel, Mathieu, Fujino, Akinori, Ueda, Naonori, and Ishihata, Masakazu. Higher-order factorization machines. In Advances in Neural Information Processing Systems 29, pp. 3351-3359, 2016.
Boyd, Stephen and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press, 2004.
Chen, Yanping, Keogh, Eamonn, Hu, Bing, Begum, Nurjahan, Bagnall, Anthony, Mueen, Abdullah, and Batista, Gustavo. The UCR time series classification archive, July 2015.
Cuturi, Marco. Fast global alignment kernels. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 929-936, 2011.
Cuturi, Marco and Doucet, Arnaud. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 685-693, 2014.
Cuturi, Marco, Vert, Jean-Philippe, Birkenes, Oystein, and Matsui, Tomoko. A kernel for time series based on global alignments. In Proceedings of ICASSP, volume 2, pp. II-413, 2007.
Fréchet, Maurice. Les éléments aléatoires de nature quelconque dans un espace distancié. In Annales de l'institut Henri Poincaré, volume 10, pp. 215-310. Presses universitaires de France, 1948.
Garreau, Damien, Lajugie, Rémi, Arlot, Sylvain, and Bach, Francis. Metric learning for temporal sequence alignment. In Advances in Neural Information Processing Systems, pp. 1817-1825, 2014.
Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. The Elements of Statistical Learning. Springer New York Inc., 2001.
Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Lasserre, Jean B. Linear and integer programming vs linear integration and counting: a duality viewpoint. Springer Science & Business Media, 2009.
Lloyd, Stuart. Least squares quantization in PCM. IEEE Trans. on Information Theory, 28(2):129-137, 1982.
Lütkepohl, Helmut. New introduction to multiple time series analysis. Springer Science & Business Media, 2005.
Parlos, Alexander G, Rais, Omar T, and Atiya, Amir F. Multi-step-ahead prediction using dynamic recurrent neural networks. Neural Networks, 13(7):765-786, 2000.
Petitjean, François and Gançarski, Pierre. Summarizing a set of time series by averaging: From Steiner sequence to compact multiple alignment. Theoretical Computer Science, 414(1):76-91, 2012.
Petitjean, François, Ketterlin, Alain, and Gançarski, Pierre. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, 44(3):678-693, 2011.
Petitjean, François, Forestier, Germain, Webb, Geoffrey I, Nicholson, Ann E, Chen, Yanping, and Keogh, Eamonn. Dynamic time warping averaging of time series allows faster and more accurate classification. In ICDM, pp. 470-479. IEEE, 2014.
Ristad, Eric Sven and Yianilos, Peter N. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522-532, 1998.
Rolet, A., Cuturi, M., and Peyré, G. Fast dictionary learning with a smoothed Wasserstein loss. Proceedings of AISTATS'16, 2016.
Saigo, Hiroto, Vert, Jean-Philippe, Ueda, Nobuhisa, and Akutsu, Tatsuya. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682-1689, 2004.
Saigo, Hiroto, Vert, Jean-Philippe, and Akutsu, Tatsuya. Optimizing amino acid substitution matrices with a local alignment kernel. BMC Bioinformatics, 7(1):246, 2006.
Sakoe, Hiroaki and Chiba, Seibi. A dynamic programming approach to continuous speech recognition. In Proceedings of the Seventh International Congress on Acoustics, Budapest, volume 3, pp. 65-69, 1971.
Sakoe, Hiroaki and Chiba, Seibi. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. on Acoustics, Speech, and Sig. Proc., 26:43-49, 1978.
Schultz, David and Jain, Brijnesh. Nonsmooth analysis and subgradient methods for averaging in dynamic time warping spaces. arXiv preprint arXiv:1701.06393, 2017.
Tibshirani, Robert, Hastie, Trevor, Narasimhan, Balasubramanian, and Chu, Gilbert. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences, 99(10):6567-6572, 2002.
Yi, Byoung-Kee, Jagadish, HV, and Faloutsos, Christos. Efficient retrieval of similar time sequences under time warping. In Data Engineering, 1998. Proceedings., 14th International Conference on, pp. 201-208. IEEE, 1998.
Zhang, C., Frogner, C., Mobahi, H., Araya-Polo, M., and Poggio, T. Learning with a Wasserstein loss. Advances in Neural Information Processing Systems 29, 2015.
Appendix material
A. Recursive forward computation of the average path matrix
The average alignment under the Gibbs distribution p_γ can be computed with the following forward recurrence, which mimics closely Bellman's original recursion. For each i ∈ ⟦n⟧, j ∈ ⟦m⟧, define

    E_{i+1,j+1} = \begin{bmatrix} e^{-\delta_{i+1,j+1}/\gamma} E_{i,j} & 0_i \\ 0_j^\top & e^{-r_{i+1,j+1}/\gamma} \end{bmatrix} + \begin{bmatrix} e^{-\delta_{i+1,j+1}/\gamma} E_{i,j+1} \\ 0_j^\top \; e^{-r_{i+1,j+1}/\gamma} \end{bmatrix} + \begin{bmatrix} e^{-\delta_{i+1,j+1}/\gamma} E_{i+1,j} & \begin{matrix} 0_i \\ e^{-r_{i,j}/\gamma} \end{matrix} \end{bmatrix}

Here the terms r_{i,j} are computed following the forward recursion of Algorithm 1. Border matrices are initialized to 0, except for E_{1,1}, which is initialized to [1]. Upon completion, the average alignment matrix is stored in E_{n,m}.

The operation above consists in summing three matrices of size (i+1, j+1). There are exactly nm such updates. A careful implementation of this algorithm, that would only store two arrays of matrices, as Algorithm 1 only stores two arrays of values, can be carried out in O(nm min(n, m)) space but it would still require O(n²m²) operations.

B. Barycenters obtained with random initialization
Each panel shows the barycenters obtained by Soft-DTW (γ = 1), Soft-DTW (γ = 0.01) and DBA on the following datasets: (a) CBF, (b) Herring, (c) Medical Images, (d) Synthetic Control, (e) uWaveGestureLibrary Y.
C. Barycenters obtained with Euclidean mean initialization
Each panel shows the barycenters obtained by Soft-DTW (γ = 1), Soft-DTW (γ = 0.01) and DBA on the following datasets: (a) CBF, (b) Herring, (c) Medical Images, (d) Synthetic Control, (e) uWaveGestureLibrary Y.
D. More interpolation results
Left: results obtained under the Euclidean loss. Right: results obtained under the soft-DTW (γ = 1) loss. Datasets: (a) ArrowHead, (b) ECG200, (c) ItalyPowerDemand, (d) TwoLeadECG.

E. Clusters obtained by k-means under DTW or soft-DTW geometry

CBF dataset. (a) Soft-DTW (γ = 1, random initialization): clusters of 8, 9 and 13 points. (b) Soft-DTW (γ = 1, Euclidean mean initialization): clusters of 8, 8 and 14 points. (c) DBA (random initialization): clusters of 8, 10 and 12 points. (d) DBA (Euclidean mean initialization): clusters of 4, 14 and 12 points.
ECG200 dataset. (a) Soft-DTW (γ = 1, random initialization): clusters of 59 and 41 points. (b) Soft-DTW (γ = 1, Euclidean mean initialization): clusters of 81 and 19 points. (c) DBA (random initialization): clusters of 83 and 17 points. (d) DBA (Euclidean mean initialization): clusters of 76 and 24 points.
F. More visualizations of time-series prediction
Each panel compares the prediction obtained under the Euclidean loss, the prediction obtained under the soft-DTW loss, and the ground truth, on the following datasets: (a) CBF, (b) ECG200, (c) ECG5000, (d) ShapesAll, (e) uWaveGestureLibrary Y.
G. Barycenters: DTW loss (Eq. (4) with γ = 0) achieved with random initialization

Dataset   Soft-DTW (γ = 1)   Soft-DTW (γ = 0.1)   Soft-DTW (γ = 0.01)   Soft-DTW (γ = 0.001)   Subgradient method   DBA   Euclidean mean

50words 5.000 2.785
ArrowHead 2.390 1.598
Computers 231.421 182.380 ∞
DistalPhalanxOutlineAgeGroup 1.380 1.074 1.407 1.509 11.539 1.761
DistalPhalanxOutlineCorrect 2.501
ECG200 7.374 ∞ ∞ ∞ N/A N/A N/A N/A
ItalyPowerDemand 2.442
MiddlePhalanxOutlineCorrect 0.832 0.714 0.985 1.030 11.643 1.678
MiddlePhalanxTW 0.755 0.581 0.963 1.206 10.684 1.274
MoteStrain 24.177 21.639 21.616
PhalangesOutlinesCorrect 1.383
ProximalPhalanxOutlineCorrect 0.749 0.654 0.833 0.882 10.767 1.111
ProximalPhalanxTW 0.653 0.536 0.672 0.778 10.377 1.133
RefrigerationDevices 159.745 146.601 ∞ ∞ ∞ ∞ ∞ ∞ ∞
WordsSynonyms 9.305 4.917 ∞ ∞ ∞ ∞

H. Barycenters: DTW loss (Eq. (4) with γ = 0) achieved with Euclidean mean initialization

Dataset   Soft-DTW (γ = 1)   Soft-DTW (γ = 0.1)   Soft-DTW (γ = 0.01)   Soft-DTW (γ = 0.001)   Subgradient method   DBA   Euclidean mean

50words 5.400 2.895 ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞

I. k-means clustering: DTW loss achieved (Eq. (5) with γ = 0, log-scaled) when using random initialization

Dataset   Soft-DTW (γ = 1)   Soft-DTW (γ = 0.1)   Soft-DTW (γ = 0.01)   Soft-DTW (γ = 0.001)   Subgradient method   DBA   Euclidean mean

50words 16.294 16.193
ArrowHead 9.020 8.757 8.699
DistalPhalanxOutlineCorrect 12.467 12.373
ECG200 11.395 11.317 11.323
FISH 10.841 10.740 10.594
MiddlePhalanxTW
MoteStrain 9.560 9.484 9.460 9.451
N/A N/A N/A N/A N/A N/A
ProximalPhalanxTW 11.055 10.993 10.978
N/A N/A N/A N/A
Trace 14.570 14.570 14.556
Two Patterns 17.416 17.379 17.317 17.325 17.524
WordsSynonyms 15.209 15.093
uWaveGestureLibrary X 18.789
N/A N/A N/A N/A N/A

J. k-means clustering: DTW loss achieved (Eq. (5) with γ = 0, log-scaled) when using Euclidean mean initialization

Dataset   Soft-DTW (γ = 1)   Soft-DTW (γ = 0.1)   Soft-DTW (γ = 0.01)   Soft-DTW (γ = 0.001)   Subgradient method   DBA   Euclidean mean

50words 16.233 16.145 16.046
DistalPhalanxOutlineCorrect 12.544 12.533 12.494 12.475
N/A N/A N/A N/A N/A N/A
MedicalImages 15.082 14.963
N/A N/A N/A N/A N/A N/A
ProximalPhalanxTW 10.978
N/A N/A N/A N/A
Trace 14.553 14.553 14.553
N/A N/A N/A N/A N/A
K. Time-series prediction: DTW loss achieved when using random init
Dataset   Soft-DTW loss (γ = 1)   (γ = 0.1)   (γ = 0.01)   (γ = 0.001)   Euclidean loss

50words 6.473
CinC ECG torso 45.675 26.337
Computers 92.584 84.723 78.953
DistalPhalanxOutlineCorrect 0.494 0.476 0.564 0.591
DistalPhalanxTW 0.441 0.330 0.305 1.214
ECG200 1.874
MedicalImages 1.023 0.853
MiddlePhalanxOutlineCorrect 0.278 0.204 0.227 0.202
MiddlePhalanxTW 0.251 0.153 0.445 0.314
MoteStrain 10.188
PhalangesOutlinesCorrect 0.352 0.216 0.352 0.338
Phoneme 160.536 150.017 148.175
ProximalPhalanxOutlineCorrect 0.129 0.047 0.089 0.128
ProximalPhalanxTW 0.154 0.077 0.102 0.150
RefrigerationDevices 108.421 93.519
SwedishLeaf 1.486 1.277 1.316
WordsSynonyms 12.466 10.437
L. Time-series prediction: DTW loss achieved when using Euclidean init
Dataset   Soft-DTW loss (γ = 1)   (γ = 0.1)   (γ = 0.01)   (γ = 0.001)   Euclidean loss

50words 6.330 5.628 4.885
PhalangesOutlinesCorrect 0.328 0.231 0.161
WordsSynonyms 12.654 10.089 9.887 1.807