Improved Time Warp Edit Distance -- A Parallel Dynamic Program in Linear Memory
GARRETT WRIGHT
This paper is dedicated to the PhD Octopus.
Abstract.
Edit Distance is a classic family of dynamic programming problems, among which Time Warp Edit Distance refines the problem with the notion of a metric and temporal elasticity. A novel Improved Time Warp Edit Distance algorithm that is both massively parallelizable and requires only linear storage is presented. This method uses the procession of a three diagonal band to cover the original dynamic program space. Every element of the diagonal update can be computed in parallel. The core method is a feature of the TWED data dependence, shared with Longest Common Subsequence, and is applicable to dynamic programs that share a similar band subproblem structure. The algorithm has been implemented as a CUDA C library with Python bindings. Speedups for challenging problems are phenomenal.
Background
Time Warping is a collection of techniques to programmatically solve the general problem of aligning time series towards superposition through a sequence of edits. Time Warp Edit Distance (TWED) is such a method, distinguished by computing a proper metric in the process, as described by Marteau [4]. Yielding a metric is advantageous in the contexts of machine learning and data science, particularly where work is often exploratory or experimental inquiry across large datasets. Having a metric provides for more meaningful comparisons following from the triangle inequality. Unfortunately, most time warping methods, including Time Warp Edit Distance, are implemented as dynamic programs which require O(n²) time and space for input time series on the order of length n. In practice, this can be a limiting factor in the usefulness of such methods.

Some attempts have been made at solving Time Warp problems using approximations to reduce the computational and memory complexity. The most popular of these appears to still require O(n²) storage on disk, but is linear in RAM during run time. This helps to compute problems that would not fit outside of high-memory systems, or at all, but is otherwise unsatisfying. Recently Gold and Sharir lowered the previously quadratic bound for the related problem of deterministic Dynamic Time Warping to O(n² logloglog n / loglog n) [3]. While this is a great theoretical result, the following chart suggests a speedup of 10x for the quadratic method would greatly surpass the more sophisticated algorithm in practice, while a speedup of 100x would absolutely demolish it. We achieve those speedups in implementation for relevant problem sizes.
Figure 1. Complexity Scaling (ignoring constants). (a) Up to 10K Elements. (b) Up to 1M Elements.

Additionally, the memory access patterns of such problems, which store the complete dynamic program matrix, could be considered degenerate with respect to performance during matrix updates. For instance, TWED alternates between column and row strides inside a doubly nested loop. The more sophisticated methods can have even more involved access patterns. A key result of using the Improved Time Warp Edit Distance implementation is that the core dynamic program solver uses only three vectors of O(n) length accessed in unit stride.

TWED Implementations.
I would be remiss not to mention that there are several quadratic (serial) implementations of the TWED available on the web. Particularly, TWED has its own wiki page which mentions implementations are available in C, R, Matlab, and Python. Marteau was kind enough to freely provide his C code, which is quite fast compared to the others. This is the reference code basis.
A Floyd Warshall Digression.
When attempting to parallelize the TWED problem, I began by considering how other dynamic programs were parallelized (and curiously, when it is possible to do so correctly). An example is the Parallel Floyd Warshall algorithm, which uses block based subproblem decomposition. Basically, the dynamic program matrix is subdivided into blocks, and it can be proven that these blocks can be correctly computed in parallel once all other related blocks are ready (related in the sense that they are in the same column or row of blocks). Then an information exchange occurs vertically and horizontally across blocks in the matrix. However, the entries in the Floyd Warshall dynamic program matrix have a graph adjacency structure; rows and columns are defined by graph vertices. This implicitly defines the subproblem data dependence relation as totally horizontal and vertical.

Assuming a large problem, PFW and similarly structured algorithms can be parallelized across many cores, but with increased communication and synchronization expense. This also requires nontrivial programming to orchestrate the communication patterns, typically MPI. In TWED we have a totally different data dependence, which can naturally resolve to unit size subproblems without communication or synchronization penalties. It could be exploited with naive methods more typical of embarrassingly parallel problems, though I will use CUDA. Exploiting this data dependence is the core result of Improved Time Warp Edit Distance. There are boundary conditions, but those are simply 0 or ∞; their implementation and complexity contribution can be disregarded.
Figure 2. PFW Graphic shamelessly borrowed from Wikipedia
Algorithms
Here we review the TWED algorithm. We will then dive into the dynamic program update step with enough detail to highlight the data dependence. The Improved algorithm will then be detailed.
Time Warp Edit Distance.
We begin TWED with input arrays A and B, which are time series values in R^N. We are also given respective real timestamps TA and TB corresponding to the time series values. We will assume that we have nA, nB samples in A and B respectively, where nA and nB are on the order of n for the sake of time complexity arguments. (Variable names, while slightly untraditional, were chosen to match the code.) I will ignore the parameters nu and lambda here, as they do not change the mechanics of the dynamic program, and are implemented the same as in Marteau's work; you may think of the sketch as nu = 1 and lambda = 0. A degree for the internal norm calls is also required and typically set to 2; we will simply refer to the operation as norm and ignore the degree for this discussion.

Given the option to make edits along the path of the time series, we desire the minimal aggregate edit distance cost that can be found to align the pair of time series. Every potential time warp edit has a computed cost that takes into account local timestamps and algorithm tuning parameters. This is described in detail by [4]. Some implementations also return the actual resultant path, but this can readily be found via backtracking or simply stored after basic observation in the forward TWED problem; it is ignored here.

Sequentially:

(1) Compute O(n) norm distances in A, where each distance is defined as:
DA[i] = norm(A[i] − A[i−1]) ∀ i ∈ [0, n]
(2) Compute O(n) norm distances in B:
DB[i] = norm(B[i] − B[i−1]) ∀ i ∈ [0, n]
Note, when i = 0 we take A[−1] = B[−1] = 0.
(3) Initialize the dynamic program cost matrix boundary. Assign the first row (DP[0][:]) and first column (DP[:][0]) to ∞. Assign the first element, DP[0][0] = 0.
(4) Initialize the interior of the dynamic program matrix:
DP[i][j] = norm(DA[i−1] − DB[j−1]) ∀ i, j ∈ [1, n]
DP[i][j] += norm(DA[i−2] − DB[j−2]) ∀ i, j ∈ [2, n]
(5) Execute the dynamic program updates. This is a doubly nested loop, beginning at DP[row = 1][col = 1]:
(a) We compute the cost of the following update cases:
delete_a = DA[row] + DP[row−1][col] + |TA[row] − TA[row−1]|
delete_b = DB[col] + DP[row][col−1] + |TB[col] − TB[col−1]|
match = DP[row−1][col−1] + |TA[row] − TB[col]| + |TA[row−1] − TB[col−1]|
(b) Assign DP[row][col] = min(delete_a, delete_b, match).
(c) This continues for the remaining columns in the row, and then again for each row in order until the end of the matrix.
(6) When you reach the end of the matrix, the result of the dynamic program is stored in DP[nA][nB].
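To make the procession above concrete, here is a minimal serial sketch following Marteau's TWED recurrence [4] with the simplifications of this sketch (nu = 1, lambda = 0, 1-D samples, absolute value standing in for the degree-2 norm). The interior local cost from step (4) is folded directly into the match case, and names, indexing, and layout are illustrative, not the reference twed.c.

    #include <math.h>
    #include <stdlib.h>

    /* Distance between 1-D samples; stands in for the degree-2 norm. */
    static double dist(double x, double y) { return fabs(x - y); }

    /* Serial TWED sketch for 1-D series A (length nA) and B (length nB),
     * with timestamps TA and TB.  Assumes nu = 1 and lambda = 0.
     * Samples and timestamps "before" the series are taken as 0. */
    double twed_serial(const double *A, const double *TA, int nA,
                       const double *B, const double *TB, int nB)
    {
        int rows = nA + 1, cols = nB + 1;
        double *DP = (double *)malloc((size_t)rows * cols * sizeof(double));

        /* Boundary: first row and first column are infinity, origin is 0. */
        for (int i = 0; i < rows; i++) DP[i * cols + 0] = INFINITY;
        for (int j = 0; j < cols; j++) DP[0 * cols + j] = INFINITY;
        DP[0] = 0.0;

        for (int i = 1; i < rows; i++) {
            for (int j = 1; j < cols; j++) {
                double a  = A[i - 1],  ap  = (i > 1) ? A[i - 2]  : 0.0;
                double b  = B[j - 1],  bp  = (j > 1) ? B[j - 2]  : 0.0;
                double ta = TA[i - 1], tap = (i > 1) ? TA[i - 2] : 0.0;
                double tb = TB[j - 1], tbp = (j > 1) ? TB[j - 2] : 0.0;

                /* The three edit cases; the local cost is folded into match. */
                double del_a = DP[(i - 1) * cols + j] + dist(a, ap) + fabs(ta - tap);
                double del_b = DP[i * cols + (j - 1)] + dist(b, bp) + fabs(tb - tbp);
                double match = DP[(i - 1) * cols + (j - 1)] + dist(a, b) + dist(ap, bp)
                               + fabs(ta - tb) + fabs(tap - tbp);

                double m = del_a < del_b ? del_a : del_b;
                DP[i * cols + j] = (m < match) ? m : match;
            }
        }
        double result = DP[nA * cols + nB];
        free(DP);
        return result;
    }

Note the full (nA + 1) x (nB + 1) matrix allocation; removing that quadratic storage is exactly what the rest of the paper is about.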
Playing In The Band.

Some folks store in columns, and others store in rows. I don't store in either, but I know it comes out right. In the brief description of the TWED dynamic program, the update step (5) has the doubly nested for loop. If we inspect what is happening for an element somewhere in the middle of the program, we observe the following:
• To compute delete_a we must look back in A (one row).
• To compute delete_b we must look back in B (one col).
• To compute a match we must look back in A and B, one row and one column.
• Typical to dynamic programs, once a subproblem is optimally computed, we move on, never to update that entry again.

For the following diagrams the trivial boundary will be very lightly shaded, with dark entries representing updates and two lighter shades of blocks representing dependencies. We'll start with the loosest bound, tighten it up, and build out an alternative storage approach.

The boundary conditions are assigned before we begin. The problem works through the matrix in C order. The first interior row is trivially satisfied above by the boundary. While we are computing updates moving along columns in the first interior row, data dependencies are always satisfied, because the prior column value is always the most up to date subproblem along that row. When we reach the end of the first interior row, we begin the next knowing we have the completely solved prior row, and so the entries above are the most up to date subproblems along that column. This is a key difference from Floyd Warshall, because moving along TWED rows and columns has an implied order, time. Compare this to FW's graph vertices, which have no natural ordering; they require knowledge of entire columns and rows to complete a subproblem. (Again, indexes outside the boundary are taken to be 0 vectors.)
Figure 3. Crude notion of data dependence.

TWED can't naively compute a whole row in parallel, because the columns sequentially depend on each other. We cannot make effective sub blocks because they would also be sequentially dependent, with only the first sub block having valid initial conditions. Similarly, we cannot compute a parallel column because the row dependence is again sequential. Compare this to the Parallel Floyd Warshall, where we can actually compute shortest paths that are locally true inside the domain of a sub block, then exchange with other blocks globally.

With localization in mind, let's refine the observation further, and tighten up to the minimum of what we really need:
• To compute delete_a we must look back exactly one row in the same column.
• To compute delete_b we must look back exactly one col in the same row.
• To compute a match we must look back exactly one row and one column.
• Typical to dynamic programs, once a subproblem update is computed, we move on, never to return.

So for an individual update, we actually need very few entries in the DP matrix. The matrix boundary is initialized at (0,0), (1,0), and (0,1); those entries are what we have, and they determine what we can compute. It appears we have enough information to compute two elements: the next in this row, DP[1][2], as the original TWED would naturally compute, and also the update DP[2][1]. See Figures 4a and 4b. While these share a data dependence, they are otherwise independent at this step, and they can be safely computed and assigned in parallel.

Let's assume we have computed both DP[1][2] and DP[2][1], forming a left-handed diagonal in DP. Note that these diagonal entries are orthogonal to the entries mathematicians canonically refer to as diagonals. I would like to call them ortho-diagonals, or left-handed diagonals, but for the purpose of this paper I will simply use diagonal for brevity. Now looking forward, we consider that we can compute the items one to the right and one below any we've already computed.
Figure 4. Possible to compute both updates. (a) Update DP[1][2]. (b) Update DP[2][1].

Thus we can compute an entire diagonal, containing DP[1][3], DP[2][2], DP[3][1]. Again, after examining their data dependencies, these entries can safely be computed and assigned in parallel.
Figure 5. Diagonal Four

In fact, it is at this stage we can begin to see that given a band of the current and two most recent sequential diagonals, we can always compute the next diagonal, and that this computation can be performed in parallel. To formalize this, let's account for the diagonals with a zero based index, starting from entry [0,0]. Once the fourth diagonal has been computed, no entry prior to DP[1][1] will ever be referenced. This information is now packed into the current updates. We can forget it if we want (and we should). The fifth diagonal will go on to require the entire fourth diagonal, along with most of the third; no further prior diagonal is required now. All that information is packed into updates already, never to be referenced directly again.

Given the accounting scheme in Table 1, let's construct two utility maps between the traditional matrix entries (row, col) and my diagonal storage. We want to compute which diagonal we are on, and also an index into that diagonal, which I choose to start at the lower left corner. (We can ignore the boundary since it is precomputed; boundary conditions are handled in the implementation and are ignored here to keep the definitions simpler. I encourage you to play with the indices a little. Rectangular matrices are cute...)
Table 1. Mapping traditional indices to an Ortho-Diagonal. Note Row+Col sum to the corresponding diagonal.
Figure 6. Diagonal Five

I use the following map, though other schemes could be defined. I was torn between this chosen scheme and another which required a smaller diagonal array but was buried in logic. I chose the simpler one.
(0.1) OrthoDiagonal: (row, col) ↦ (orthodiag = row + col, idx = col)
(0.2) RowCol: (orthodiag, idx) ↦ (row = orthodiag − idx, col = idx)

Computing the length of the widest diagonal gives the same result as for classical rectangular matrix diagonals, min(rows, cols). However, to use the basic map above we will instead let our diagonal storage hold rows + cols elements. Since storage space is no longer a practical concern for this algorithm, I much prefer this simpler index scheme. The reader is welcome to derive the more challenging mapping functions to use a min(rows, cols) sized diagonal if they want to implement their own code. As a practical matter, I think the extra logic would hurt performance.
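As a sketch of how (0.1) and (0.2) can look in code (names are illustrative, not the cuTWED API):

    /* Maps (0.1) and (0.2); usable from host or device code. */
    __host__ __device__ static inline int ortho_diag(int row, int col) { return row + col; }
    __host__ __device__ static inline int ortho_idx(int row, int col)  { return col; }
    __host__ __device__ static inline int rc_row(int orthodiag, int idx) { return orthodiag - idx; }
    __host__ __device__ static inline int rc_col(int orthodiag, int idx) { return idx; }

The round trip rc_row(ortho_diag(row, col), ortho_idx(row, col)) == row holds for every entry, which is all the update step needs.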
Figure 7. Covering

We can also observe that covering the matrix uses rows + columns − 1 diagonals, the widest being the main diagonal. This is rather simple, and I bring it up mainly because it is actually the update procession in the ITWED implementation.

Finally, we should observe that if we stack our diagonals with the correct initial offset, all data dependencies are stored with unit stride. This is helpful on all architectures, but particularly critical when it comes to a CUDA implementation. The following graphic is a representation of the three sequences of data lookups that will occur. The dark blue is the current active update, and the lighter blue are the data being referenced for the calculation. This is another way to think of what will happen when computing the fifth ortho-diagonal updates in the figure above. With the exception of out of bounds checks, we achieve perfect unit stride.
Figure 8. Coalesced alignment of data reads
Improved Time Warp Edit Distance.
To recap results from our adventure Playing In The Band, we will consider a full DP matrix that has rows = nA + 1 and columns = nB + 1:
• For the d-th diagonal update:
  – We require only the current (z_d), once lagged (z_{d−1}), and twice lagged (z_{d−2}) diagonals.
  – All entries in the z_d diagonal can safely be computed in parallel.
• The largest diagonal is min(nA + 1, nB + 1), but we choose to store (nA + 1) + (nB + 1) per diagonal because it simplifies indexing.
• For this DP matrix there are exactly nA + nB + 1 diagonals.

We can now describe the Improved Time Warp Edit Distance algorithm. Let ∥∀ mean parallelizable for-all.

(1) Concurrently:
• Compute O(n) norm distances in A, where each distance is defined as:
DA[i] = norm(A[i] − A[i−1]) ∀ i ∈ [0, n]
• Compute O(n) norm distances in B:
DB[i] = norm(B[i] − B[i−1]) ∀ i ∈ [0, n]
Note, when i = 0 we take A[−1] = B[−1] = 0.
(2) For d ∈ [1, nA + nB]:
(a) Allocate z_d with (nA + 1) + (nB + 1) elements.
(b) ∥∀ idx ∈ [0, min(d, (nA + 1) + (nB + 1) − 1)]:
(i) Map to the respective row, col index in a full DP matrix using 0.2.
(ii) Compute the initial cost on the fly:
• If row == 0 and col == 0, this point is the initial (minimal) point, and d = 0.
• Else if row == 0 or col == 0, this point is a boundary, and d = ∞.
• Else, we are in the interior of the dynamic program matrix:
d = norm(DA[i−1] − DB[j−1]) for i, j ∈ [1, n]
d += norm(DA[i−2] − DB[j−2]) for i, j ∈ [2, n]
(iii) Before we compute updates we require some index arithmetic to map between the locations in a full DP matrix and our diagonal storage. Using 0.1 yields:
idrm = OrthDiag(row − 1, col).idx
idcm = OrthDiag(row, col − 1).idx
idrm_cm = OrthDiag(row − 1, col − 1).idx
(iv) Execute the dynamic program updates by computing the following update cases. Note we use three diagonal arrays, the current, once lagged, and twice lagged, called z_d, z_{d−1}, and z_{d−2} respectively.
delete_a = DA[row] + z_{d−1}[idrm] + |TA[row] − TA[row−1]|
delete_b = DB[col] + z_{d−1}[idcm] + |TB[col] − TB[col−1]|
match = z_{d−2}[idrm_cm] + |TA[row] − TB[col]| + |TA[row−1] − TB[col−1]|
(v) Assign the result z_d[idx] = min(delete_a, delete_b, match).
(c) If d > 2: free z_{d−2}.
(3) When you complete all diagonals, the result of the dynamic program is stored in the final z_d at OrthDiag(nA, nB).idx.

At any given point we have at most three diagonals in memory, each using (nA + 1) + (nB + 1) memory. We additionally must store both input time series, their norm distances, and the timestamps. All of these are trivially O(n) space. Computationally we still require a quadratic number of steps: to cover the full DP matrix, for each of the nA + nB + 1 diagonals we compute at most min(nA + 1, nB + 1) updates. On the other hand, because we can now parallelize over all the diagonal entries, and our memory is stored with unit stride, we can effectively leverage thousands of cores, which would yield O(n²/p) for p cores. The CUDA programming model naturally exposes several thousand GPU cores, and this brings us to discuss cuTWED. (Yet again, indexes outside the boundary are taken to be 0 vectors.)

cuTWED Implementation Remarks
Largely the CUDA code cuTWED is taken from the ITWED algorithm directly, with the addition of managing corner cases and boundary conditions. Because we have taken care to understand the data dependence and have stored our diagonal data in a way that is accessed with unit stride, we can simply map the parallel for in ITWED (2b) to a 1D grid of 1D blocks following the standard CUDA programming model.

To avoid extraneous malloc (3a) and free (3c) calls, cuTWED manages its own memory by manipulating pointers in a cycle:

    tmp_ptr = DP_diag_lag_2;
    DP_diag_lag_2 = DP_diag_lag;
    DP_diag_lag = DP_diag;
    DP_diag = tmp_ptr;
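For concreteness, a schematic of what one such diagonal launch can look like is sketched below, assuming 1-D series, nu = 1, lambda = 0, and the over-allocated diagonals described above. The kernel name, signature, and the inline local cost are illustrative; this is not the cuTWED source.

    #include <math.h>

    /* Schematic kernel for one ortho-diagonal d of a 1-D problem.  z, z_lag,
     * and z_lag2 are the current, once lagged, and twice lagged diagonals,
     * each over-allocated to (nA + 1) + (nB + 1) entries. */
    __global__ void diag_update(double *z, const double *z_lag, const double *z_lag2,
                                const double *A, const double *TA, int nA,
                                const double *B, const double *TB, int nB, int d)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  /* index along the diagonal */
        int col = idx;                                    /* map 0.2: col = idx       */
        int row = d - idx;                                /* map 0.2: row = d - idx   */
        if (row < 0 || row > nA || col > nB) return;      /* out of bounds checks     */

        if (row == 0 && col == 0) { z[idx] = 0.0; return; }       /* origin   */
        if (row == 0 || col == 0) { z[idx] = INFINITY; return; }  /* boundary */

        /* Samples and timestamps; entries "before" the series are taken as 0. */
        double a  = A[row - 1],  ap  = (row > 1) ? A[row - 2]  : 0.0;
        double b  = B[col - 1],  bp  = (col > 1) ? B[col - 2]  : 0.0;
        double ta = TA[row - 1], tap = (row > 1) ? TA[row - 2] : 0.0;
        double tb = TB[col - 1], tbp = (col > 1) ? TB[col - 2] : 0.0;

        /* (row-1, col) and (row, col-1) live on diagonal d-1; (row-1, col-1) on d-2.
         * In every case the stored index is simply the column, so reads are unit stride. */
        double del_a = z_lag[col]      + fabs(a - ap) + fabs(ta - tap);
        double del_b = z_lag[col - 1]  + fabs(b - bp) + fabs(tb - tbp);
        double match = z_lag2[col - 1] + fabs(a - b) + fabs(ap - bp)
                       + fabs(ta - tb) + fabs(tap - tbp);

        z[idx] = fmin(fmin(del_a, del_b), match);
    }

The host loops d over the diagonals in order, launching enough threads to cover the band and cycling the three diagonal pointers between launches exactly as in the snippet above; here the local cost is computed inline rather than staged separately as in step (2b)(ii).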
Using CUDA streams, cuTWED is able to squeeze out some additional concurrency from the device by computing the algorithm steps in (1) concurrently.

Further, two distance kernels are provided, and automatically selected at run time based on the dimension of the time series inputs. Recall the lp-norm formula for x = (x_1, ..., x_n):

‖x‖_p := ( Σ_{i=1}^{n} |x_i|^p )^{1/p}

Given a time series of high dimensional vectors, it can be advantageous to use extra parallelism available on the device to compute all the pow calls simultaneously, storing into fast shared memory. These can then be accumulated quickly from shared memory, and the p-th root taken to complete the norm. Power calls are notoriously expensive, essentially being several instructions, and while aggregate register pressure should be considered, generally it is effective to perform the pow operations in parallel on the GPU [1]. For example, given an input time series in R^N, we can raise all vector elements simultaneously with little overhead (a sketch follows at the end of this section). I am not sure the speedup merits the code complexity, and this optimization may be removed in the future.

The current release of cuTWED implements a batch mode for large systems of time series. Given two lists of time series (potentially the same) as large arrays, we can use two dimensions in CUDA to process multiple time series and multiple diagonal entries as a 2D grid of 2D blocks. The output of batch mode is a distance matrix corresponding to all pairs of entries in the two lists of time series. A slightly optimized unreleased version makes more use of streaming concurrency overlaps.

In a future release of cuTWED it is planned to implement some light auto-tuning logic, and the capability to harness multiple GPUs. Time permitting, it may be optimized further.
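As an illustration of that shared memory approach, the following is a minimal sketch (not the cuTWED distance kernel) in which one block reduces the degree-p norm of a single high dimensional sample. It assumes the block size is a power of two, and the names are illustrative.

    #include <math.h>

    /* One block computes ||x||_p for a single R^N sample: each thread raises a
     * strided subset of components to p, partial sums are reduced in shared
     * memory, and thread 0 takes the p-th root. */
    __global__ void lp_norm_kernel(const double *x, int n, double p, double *out)
    {
        extern __shared__ double partial[];
        int tid = threadIdx.x;

        double s = 0.0;
        for (int i = tid; i < n; i += blockDim.x)
            s += pow(fabs(x[i]), p);          /* the expensive pow calls run in parallel */
        partial[tid] = s;
        __syncthreads();

        /* Tree reduction in shared memory; blockDim.x assumed to be a power of two. */
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) partial[tid] += partial[tid + stride];
            __syncthreads();
        }
        if (tid == 0) *out = pow(partial[0], 1.0 / p);
    }

A launch would pass blockDim.x * sizeof(double) bytes of dynamic shared memory; the accumulated sum never leaves the chip until the final root is taken.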
Performance
In cases where the problem was computable by twed.c, cuTWED is demonstrably two orders of magnitude faster on current GPU hardware. Even on older hardware, more typical of personal computers than a research facility, great speedups, above a single order of magnitude, are attained.

Basic Methodology and Results.
The reference twed.c code is compiled with optimizations and compared against the cuTWED.twed call from a host. We repeat the batched experiments for cuTWED.twed_dev, which is intended for problems where the data resides on GPU already. From this we can also extrapolate a practical measure of host to device transfer cost overhead, which is trivial (generally less than machine noise for this problem). This should make sense because the input data is linear, we don't preprocess it, and our algorithm is quadratic. Walltime is plotted in figure 9a.

All of the experiments were performed in double precision. While single precision will obviously reduce storage by half, it had marginal performance benefit otherwise. Any gain would probably be lost to required casting overhead in practice, should you not already be in singles. The author recommends using the precision your data was provided in. Times are listed in a traditional table at the end of the paper for reference, see Table 3.

Perhaps a more telling perspective is relative speedup on the two test systems plotted in figure 9b. The two test systems consist of one late model desktop with a GPU circa 2014 (Intel i7-2600K CPU with a TitanZ Kepler GPU), comparable to desktops and laptops many people might have personally. The second system (Intel Xeon E5-2680v4 with an Nvidia P100) is a common configuration for production or research facilities with current hardware. Implementations using the other higher level languages, while convenient, are orders slower still and are not considered here.

Figure 9. Timings. (a) Walltime. (b) Relative Speedup.

Figure 10. Large Scale Timings. (a) Walltime. (b) Overhead.
We achieve triple digit speedups on current hardware. Over 150x speedup and still climbing when the host code is no longer able to run. Because we can, I've run this out to time series of one million elements. No problem. With series this large we finally have measurable, but still trivial, transfer costs. Extremal sized cases can be found in figure 10a. Because no original implementation can fit such problems, even with high memory nodes, there is no comparison data.
Higher Dimensions.
The only original TWED implementation with reasonable speed (twed.c) lacks support for R^N inputs. To compare such problems requires modifying the reference code to admit R^N. This is straightforward, and was done using essentially the same code as in cuTWED. Similar experiments were repeated with increasing dimension N. The use of R^N didn't really make any practical time difference in either implementation. Basically, since the R^N component is only used once for A and B in a linear computation, until N is on the order of n this is overshadowed by the dynamic program matrix computation. And we get to skip a chart on that.

Validation using Real Datasets.
To validate and performance test using real world data, three datasets were considered. From the UCI Machine Learning Repository [2] (http://archive.ics.uci.edu/ml) two time series datasets were chosen. The Synthetic Control Chart Time Series Data Set consists of 600 1D time series including Cyclic, Trending, and Shifting series. The Pseudo Periodic Synthetic Time Series Data Set consists of 10 time series each with 100k points. The pseudo periodic data does not repeat and is targeted for testing database indexing schemes. Most importantly, the results are checked to match between cuTWED and TWED. In most cases they match to the bit, but occasionally they are about 1E-14 different in RMSE, which is consistent with the aggregated use of optimized pow/sqrt calls on different chips. We compute all pairs of TWED using both tools and document timings. For quick reference, figures 11, 12, and 13 of the testing time series are found at the end of the paper.
For another larger problem we'll consider the MNIST database of handwritten digits (http://yann.lecun.com/exdb/mnist/). We chose to use MNIST to validate computations in R^N by unraveling one dimension of the two-dimensional image as a 28 sample time axis, and treating the other as a vector in R^28. As a timing example, a random digit from the test set is selected and distances computed against the training set of 60k images. Again we check for parity, and this query is also documented in the timings.
Table 2. Timings From Validation Exercises (seconds)
Dataset             TWED   cuTWED   Size                 Speedup
Synthetic Control   22     1        600x600 x 60, R^1    —
Pseudo Periodic     —      147      10x10 x 100k, R^1    —
MNIST               —      —        R^28                 —

Note the distance matrix is symmetric, and currently the cuTWED batch call computes the complete distance matrix. The nested loop I use for the CPU reference code is only computing the upper triangle (roughly half). When this optimization is added to the CUDA code I would expect the above batch time speedups to roughly double. The Pseudo Periodic CPU runs required use of my 192GB high memory server; one allocation of the dynamic program matrix is 80GB in doubles. The same experiment can easily be computed with cuTWED on a common late model GPU capable desktop or laptop in a matter of minutes.
Generalized Dynamic Programming Extensions
Direct Extension: LCS.
Problems with similar banded optimal subproblems may benefit from this approach. For example, the length of the classic Longest Common Subsequence can be solved with my diagonal band trick in linear memory. Simply note the fact that your data dependence is totally local, just as in cuTWED. For an LCS subproblem you need priors up and to the left, and the initial sequences. We've covered how to map from any diagonal band back to rows and columns, i.e. the original sequences. Given this, you can compute the update value. This can be performed, in parallel no less, and repeated. When you reach the end of the dynamic programming matrix, you have computed the LCS length (a sketch follows below). If you would like the actual subsequence(s), this will require two lagged diagonals, same as we've discussed. The band of diagonals will keep enough information to determine if there is a new optimum and what the right-handed diagonal parent entry is. By bookkeeping the path of parent entries we are forward constructing the same path as in the backtracking LCS method. There is the case of multiple LCS, which is simply additional bookkeeping of a path for each sequence.

I believe there is also a way to use the band technique with the Matrix Chain Multiplication ordering optimization. However, that particular problem is commonly optimized with a hash table in place of the complete dynamic program matrix, and I suspect because of this the practical application would not be improved with my technique.
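To make the LCS variant concrete, here is a minimal serial sketch of the band in linear memory; the function name and details are illustrative. Every entry of the inner loop is independent, so it could be mapped to a kernel exactly as the TWED diagonal update was.

    #include <stdlib.h>

    /* LCS length via the ortho-diagonal band in linear memory (sketch).
     * Three diagonals of width (na + 1) + (nb + 1) are cycled, as in ITWED. */
    int lcs_length_banded(const char *a, int na, const char *b, int nb)
    {
        int width = (na + 1) + (nb + 1);
        int *z     = (int *)calloc(width, sizeof(int));   /* current diagonal */
        int *zlag  = (int *)calloc(width, sizeof(int));   /* once lagged      */
        int *zlag2 = (int *)calloc(width, sizeof(int));   /* twice lagged     */

        for (int d = 0; d <= na + nb; d++) {
            for (int idx = 0; idx <= d && idx <= nb; idx++) {  /* idx = col */
                int col = idx, row = d - idx;
                if (row > na) continue;
                if (row == 0 || col == 0) { z[idx] = 0; continue; }
                if (a[row - 1] == b[col - 1])
                    z[idx] = zlag2[idx - 1] + 1;               /* (row-1, col-1) on d-2 */
                else {
                    int up = zlag[idx], left = zlag[idx - 1];  /* (row-1, col), (row, col-1) on d-1 */
                    z[idx] = up > left ? up : left;
                }
            }
            int *tmp = zlag2; zlag2 = zlag; zlag = z; z = tmp; /* cycle the band */
        }
        /* After the final cycle the last computed diagonal sits in zlag;
         * the answer lives at column nb of diagonal na + nb. */
        int best = zlag[nb];
        free(z); free(zlag); free(zlag2);
        return best;
    }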
Indirect Extension.
Other problems may not immediately present as banded, but could be reinterpreted with some consideration. One might desire to do this to either reduce memory footprint or better exploit the GPU. Because the banded storage is stride ideal in linear memory, such diagonal banded algorithms may prove more effective in time to solution over others with the same or better theoretical computational time. Also, if the banded storage is correct for an application, some parallelized problems might benefit from a total reduction in communication overhead naturally provided by this technique.
Summary
A novel technique requiring only linear memory to solve a family of dynamic programs has been described. Additionally, cuTWED, an open-source high-performance CUDA and Python library using the technique, has been published under GPL. Timings presenting one to two orders of magnitude speedups, while already outstanding, are known to have at least another factor of two left in optimizations. The technique and software were similarly validated against the original TWED algorithm using three classic machine learning datasets.
The author would like to thank Dr. Igor Rivin for his encouragement to write this down for others, instead of tossing it in the piles of my other unfinished work once I figured it out. I hope others can make some use of it.
References
1. NVIDIA, CUDA C best practices guide, 2013.
2. Dheeru Dua and Casey Graff, UCI machine learning repository, 2017.
3. Omer Gold and Micha Sharir, Dynamic time warping and geometric edit distance: Breaking the quadratic barrier, ACM Trans. Algorithms (2018), no. 4.
4. P. Marteau, Time warp edit distance with stiffness adjustment for time series matching, IEEE Transactions on Pattern Analysis and Machine Intelligence (2009), no. 2, 306–318.

Gestalt Group LLC, Yardley, PA, 19067
E-mail address : [email protected] MPROVED TIME WARP EDIT DISTANCE 15
Table 3. Raw Time to Solution

N         cuTWED   cuTWED_dev   twedc    Speedup
1048576   214.22   211.51       —        —
524288    54.317   54.084       —        —
262144    13.863   13.827       —        —
131072    3.6166   3.6344       —        —
65536     0.8545   0.8571       129.91   152x
32768     0.2852   0.2851       33.400   117x
16384     0.1152   0.1140       8.4386   73x
8192      0.0504   0.0499       2.1325   42x
4096      0.0238   0.0243       0.5439   23x
2048      0.0122   0.0118       0.1300   11x
1024      0.0067   0.0059       0.0301   4.5x
Figure 11. UCI Synthetic Control Time Series

Figure 12. UCI Synthetic Control Time Series 6-10