Scalable Data Series Subsequence Matching with ULISSE
Michele Linardi · Themis Palpanas
Abstract
Data series similarity search is an important operation, and is at the core of several analysis tasks and applications related to data series collections. Despite the fact that data series indexes enable fast similarity search, all existing indexes can only answer queries of a single length (fixed at index construction time), which is a severe limitation. In this work, we propose
ULISSE, the first data series index structure designed for answering similarity search queries of variable length (within some range). Our contribution is two-fold. First, we introduce a novel representation technique, which effectively and succinctly summarizes multiple sequences of different length. Based on the proposed index, we describe efficient algorithms for approximate and exact similarity search, combining disk-based index visits and in-memory sequential scans. Our approach supports non-Z-normalized and Z-normalized sequences, and can be used with no changes with both Euclidean Distance and Dynamic Time Warping, for answering both k-NN and ε-range queries. We experimentally evaluate our approach using several synthetic and real datasets. The results show that
ULISSE is several times, and up to orders of magnitude, more efficient in terms of both space and time cost, when compared to competing approaches. (Paper published in VLDBJ 2020)
1 Introduction

Data sequences are one of the most common data types, and they are present in almost every scientific and
social domain (example application domains include meteorology, astronomy, chemistry, medicine, neuroscience, finance, agriculture, entomology, sociology, smart cities, marketing, operation health monitoring, human action recognition, and others) [26,58,61,21,48]. This makes data series a data type of particular importance.

Informally, a data series (a.k.a. data sequence, or time series) is defined as an ordered sequence of points, each one associated with a position and a corresponding value. Recent advances in sensing, networking, data processing, and storage technologies have significantly facilitated the processes of generating and collecting tremendous amounts of data sequences from a wide variety of domains, at extremely high rates and volumes.

The SENTINEL-2 mission [15], conducted by the European Space Agency (ESA), represents such an example of a massive data series collection. The two satellites of this mission continuously capture multi-spectral images, designed to give a full picture of the earth's surface every five days at high resolution, resulting in over five trillion different data series. Such recordings will help monitor at fine granularity the evolution of the properties of the surface of the earth, and benefit applications such as land management, agriculture and forestry, disaster control, humanitarian relief operations, risk mapping, and security concerns.

Michele Linardi, LIPADE, Université de Paris. E-mail: [email protected]
Themis Palpanas, LIPADE, Université de Paris. E-mail: [email protected]
Data series analytics.
Once the data series have been collected, the domain experts face the arduous tasks of processing and analyzing them [75,52,6], in order to identify patterns, gain insights, detect abnormalities, and extract useful knowledge. (If the dimension that imposes the ordering of the sequence is time, then we talk about time series. Though, a series can also be defined over other measures, e.g., angle in radial profiles in astronomy, or mass in mass spectroscopy in physics. We use the terms data series, time series, and sequence interchangeably.) A critical part of this process is the data series similarity search operation, which lies at the core of several analysis and machine learning algorithms (e.g., clustering [46], classification [42], outliers [60,7,8], and others).

However, similarity search in very large data series collections is notoriously challenging [70,49,50,18,17,13,14,2], due to the high dimensionality (length) of the data series. In order to address this problem, a significant amount of effort has been dedicated by the data management research community to data series indexing techniques [51,13,14], which lead to fast and scalable similarity search [16,56,29,4,62,24,66,11,12,71,72,68,69,53,55,54,9,31,32,33].
Predefined constraints.
Despite the effectiveness and benefits of the proposed indexing techniques, which have enabled and powered many applications over the years, they are restricted in different ways: either they only support similarity search with queries of a fixed size, or they do not offer a scalable solution. The solutions working for a fixed length require that this length is chosen at index construction time (it should be the same as the length of the series in the index). Evidently, this is a constraint that penalizes the flexibility needed by analysts, who oftentimes need to analyze patterns of slightly different lengths (within a given data series collection) [24,25,57,39,40,41,44,52,6]. This is true for several applications. For example, in the
SENTINEL-2 mission data, oceanographers are interested in searching for similar coral bleaching patterns of different lengths; at Airbus, engineers need to perform similarity search queries for patterns of variable length when studying aircraft take-offs and landings [47]; and in neuroscience, analysts need to search in Electroencephalogram (EEG) recordings for Cyclic Alternating Patterns (CAP) of different lengths (duration), in order to get insights about brain activity during sleep [3]. In these applications, we have datasets with a very large number of fixed-length data series, on which analysts need to perform a large number of ad hoc similarity queries of (slightly) different lengths (as shown in Figure 1).

A straightforward solution for answering such queries would be to use one of the available indexing techniques. However, in order to support (exact) results for variable-length similarity search, we would need to (i) create several distinct indexes, one for each possible query length; and (ii) for each one of these indexes, index all overlapping subsequences (using a sliding window). We illustrate this in Figure 1, where we depict two similarity search queries of different lengths (ℓ and ℓ′). Given a data series from the collection, D_i (shown in black), we draw in red the subsequences that we need to compare to each query in order to compute the exact answer. Using an indexing technique implies inserting all the subsequences in the index: since we want to
answer queries of two different lengths, we are obliged to use two distinct indexes.

Fig. 1 Indexing for supporting queries of 2 different lengths.

Fig. 2 Search space evolution of variable-length similarity search (number of subsequences vs. dataset size, for the query length ranges S_{96,256}, S_{128,256}, S_{160,256}, S_{192,256}, S_{224,256}). Each dataset contains series of a fixed length.

Nevertheless, this solution is prohibitively expensive, in both space and time. Space complexity is increased, since we need to index a large number of subsequences for each one of the supported query lengths: given a data series collection C = D^1, ..., D^{|C|} and a query length range [ℓ_min, ℓ_max], the number of subsequences we would normally have to examine (and index) is: S_{ℓ_min,ℓ_max} = Σ_{ℓ=ℓ_min}^{ℓ_max} Σ_{i=1}^{|C|} (|D^i| − ℓ + 1). Figure 2 shows how quickly this number explodes as the dataset size and the query length range increase: considering the largest query length range (S_{96,256}) in the largest dataset, we end up with a collection of subsequences (that need to be indexed) 5 orders of magnitude larger than the original dataset! Computational time is significantly increased as well, since we have to construct different indexes for each query length we wish to support.

In the current literature, a technique based on multi-resolution indexes [25,24] has been proposed in order to mitigate this explosion in size, by creating a smaller number of distinct indexes and performing more post-processing. Nonetheless, this solution works exclusively for non-Z-normalized series (which means that it cannot return results with similar trends, but different absolute values), and thus renders the solution useless for a wide spectrum of applications. (Z-normalization transforms a series so that it has a mean value of zero, and a standard deviation of one. This allows similarity search to be effective, irrespective of shifting (i.e., offset translation) and scaling [28].) Besides, it only mitigates the problem, since it still leads to a space explosion (albeit, at a lower rate), and therefore, it is not scalable, either.

We note that the technique discussed above (despite its limitations) is indeed the current state of the art, and no other technique has been proposed since, even though during the same period of time we have witnessed lots of activity and a steady stream of papers on the single-length similarity search problem (e.g., [29,4,62,10,66,11,71,72,68,69,53,55,54,31,32,33]). This attests to the challenging nature of the problem we are tackling in this paper.
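To make the blow-up concrete, the count S_{ℓ_min,ℓ_max} above can be computed directly. The following is a minimal sketch (the function name is ours, not the paper's):

```python
def naive_index_size(series_lengths, l_min, l_max):
    """Number of subsequences a naive multi-index solution must index:
    one full sliding-window pass per supported query length."""
    return sum(n - l + 1
               for l in range(l_min, l_max + 1)
               for n in series_lengths
               if n >= l)

# 1000 series of length 256, query lengths in [96, 256]:
# 161 supported lengths, each adding another sliding-window pass.
total = naive_index_size([256] * 1000, 96, 256)  # 13,041,000 subsequences
```

Every additional supported query length contributes another O(|C| · |D|) batch of subsequences, which is exactly the linear-in-range growth that Figure 2 illustrates.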
Contributions.
In this work, we propose
ULISSE (ULtra compact Index for variable-length Similarity SEarch in data series), which is the first single-index solution that supports fast approximate and exact answering of variable-length (within a given range) similarity search queries, for both non-Z-normalized and Z-normalized data series collections. Our approach can be used with no changes with both Euclidean Distance and Dynamic Time Warping, for answering both k-NN and ε-range queries.
ULISSE produces exact (i.e., correct) results, and is based on the following key idea: a data structure that indexes data series of length ℓ already contains all the information necessary for reasoning about any subsequence of length ℓ′ < ℓ of these series. Therefore, the problem of enabling a data series index to answer queries of variable length becomes a problem of how to reorganize this information that already exists in the index. To this effect, ULISSE proposes a new summarization technique that is able to represent contiguous and overlapping subsequences, leading to succinct, yet powerful summaries: it combines the representation of several subsequences within a single summary, and enables fast (approximate and exact) similarity search for variable-length queries.

Our contributions can be summarized as follows:
– We introduce the problem of Variable-Length Subsequences Indexing, which calls for a single index that can inherently answer queries of different lengths.
– We provide a new data series summarization technique, able to represent several contiguous series of different lengths. This technique produces succinct, discretized envelopes for the summarized series, and can be applied to both non-Z-normalized and Z-normalized data series.
– Based on this summarization technique, we develop an indexing algorithm, which organizes the series and their discretized summaries in a hierarchical tree structure, namely, the ULISSE index.
– We propose efficient exact and approximate k-NN algorithms, suitable for the ULISSE index, which can compute similarity using either the Euclidean Distance or the Dynamic Time Warping measure.
– Finally, we perform an experimental evaluation with several synthetic and real datasets. The results demonstrate the effectiveness and scalability of ULISSE to dataset sizes that competing approaches cannot handle.

(A preliminary version of this work has appeared elsewhere [38,37].)
Paper Organization.
The rest of this paper is organized as follows. Section 2 discusses related work, and Section 3 formulates the problem. In Section 4, we describe the
ULISSE summarization techniques, and in Sections 5 and 6 we explain our indexing and query answering algorithms. Section 7 describes the experimental evaluation, and we conclude in Section 8.

2 Related Work
The literature includes several techniques for data series indexing [13], which are all based on the same principle: they first reduce the dimensionality of the data series by applying some summarization technique (e.g., Piecewise Aggregate Approximation (PAA) [27], or Symbolic Aggregate approXimation (SAX) [62]). However, all the approaches mentioned above share a common limitation: they can only answer queries of a fixed, predetermined length, which has to be decided before the index creation.

Faloutsos et al. [16] proposed the first indexing technique suitable for variable-length similarity search queries. This technique extracts subsequences that are grouped in MBRs (Minimum Bounding Rectangles) and indexed using an R-tree. We note that this approach works only for non-Z-normalized sequences. An improvement of this approach was proposed by Kahveci and Singh [25]. They described MRI (Multi Resolution Index), which is a technique based on the construction of multiple indexes for variable-length similarity search queries. Storing subsequences at different resolutions (building indexes for different series lengths) provided a significant improvement over the earlier approach, since a greater part of a single query is considered during the search. Subsequently, Kadiyala and Shiri [24] redesigned the MRI construction, in order to decrease the indexing size and construction time. This new indexing technique, called Compact Multi Resolution Index (CMRI), has a space requirement which is 99% smaller than the one of MRI. The authors also redefined the search algorithm, guaranteeing an improvement of the range search over the MRI index.

Loh et al. [43] proposed Index Interpolation for variable-length subsequence similarity search for ε-range queries. This approach uses a single index that supports ε-range search for subsequences of a fixed length that is smaller than the query length.
The search starts by considering a query prefix subsequence of the same length as the one supported by the index. During this process, the algorithm computes the distances between the original query and candidates of the same length, if the prefixes of these candidates have a distance to the query prefix smaller than the proposed bound. The authors proved the correctness of their solution, showing that their bounding strategy provides correct results
for both non-Z-normalized and Z-normalized subsequences. We note that, based on the ε-range search, this approach can also answer k-NN queries, thanks to the framework proposed by Han et al. [19].

More recently, Wu et al. [67] have proposed the KV-Match index, which supports ε-range similarity search queries of variable length, using both Z-normalized Euclidean and DTW distances. The idea of this technique is similar to the CMRI one, since many indexes are built for different subsequence window lengths, which are considered at query time using multiple query segments. We note that for Z-normalized sequences, this method provides exact answers only for constrained ε-range search. To this effect, two new parameters that constrain the mean and the standard deviation of a valid result are considered at query answering time.

In contrast to CMRI and KV-Match, our approach uses a single index that is able to answer similarity search queries of variable length over larger datasets, and works for both non-Z-normalized and Z-normalized series (a feature that is not supported by any of the previously introduced indexing techniques).
Sequential scan techniques.
Even though recent works have shown that sequential scans can be performed efficiently [57,45], such techniques are mostly applicable when the dataset consists of a single, very long data series, and queries are looking for potential matches in small subsequences of this long series. Such approaches, in general, do not provide any benefit when the dataset is composed of a large number of small data series, like in our case. Therefore, indexing is required in order to efficiently support data exploration tasks, which involve ad-hoc queries, i.e., where the query workload is not known in advance.

3 Problem Formulation
Let a data series D = d_1, ..., d_{|D|} be a sequence of numbers d_i ∈ ℝ, where i ∈ ℕ represents the position in D. We denote the length, or size, of the data series D with |D|. The subsequence D_{o,ℓ} = d_o, ..., d_{o+ℓ−1} of length ℓ is a contiguous subset of ℓ points of D starting at offset o, where 1 ≤ o ≤ |D| and 1 ≤ ℓ ≤ |D| − o + 1. A subsequence is itself a data series. A data series collection, C, is a set of data series.

We say that a data series D is Z-normalized, denoted D^n, when its mean µ is 0 and its standard deviation σ is 1. The normalized version of D = d_1, ..., d_{|D|} is computed as follows: D^n = {(d_1 − µ)/σ, ..., (d_{|D|} − µ)/σ}. Z-normalization is an essential operation in several applications, because it allows similarity search irrespective of shifting and scaling [28,57].

Euclidean Distance.
Given two data series D = d_1, ..., d_{|D|} and D′ = d′_1, ..., d′_{|D′|} of the same length (i.e., |D| = |D′|), we can calculate their Euclidean Distance as follows: ED(D, D′) = √(Σ_{i=1}^{|D|} d(d_i, d′_i)), where the distance function d, applied to two real values A and B, is: d(A, B) = (A − B)^2.

Fig. 3 (a) Euclidean and warping alignment in a squared matrix. (b) Valid index steps in a warping alignment.

Dynamic Time Warping.
The Euclidean distance is a lock-step measure, which is computed by summing up the distances between pairs of points that have the same positions in their respective series. Dynamic Time Warping (DTW) [34] represents a more elastic measure, allowing for small mis-alignments of the matched points on the x-axis. Given two data series D and D′, the DTW distance is computed by considering the differences between pairs of points (d(d_i, d′_j)), where the indexes i, j might be different. In this manner, a particular alignment of D and D′ is performed before computing the distance. We define a sequence alignment as a vector A of index pairs, where (i, j) ∈ A ⟹ 1 ≤ i, j ≤ ℓ, and ℓ is the length of the two series. The alignment of the Euclidean Distance is a special case, where the indexes are equal to their position in A. In the case of two series of length ℓ, the space of the possible alignments spans the paths that join the two corner cells of a squared matrix composed of ℓ × ℓ cells. In Figure 3(a), we depict the Euclidean distance alignment of two series of length 4, which exactly crosses the diagonal of the matrix, joining the cells (1,1) and (4,4). In the same figure, we report another possible alignment, which we call a warping alignment, that deviates from the diagonal. We use the terms warping path and warping alignment interchangeably.

In order to restrict the allowed paths, we can apply the following local constraints on the index pairs:
– We require that the first and last pairs of A correspond to the first and last pairs of points in D and D′, respectively. If |D| = |D′| = ℓ, we have A[1] = (1, 1) and A[|A|] = (ℓ, ℓ). Furthermore, for any (a, b), (c, d) ∈ A, (a ≠ c) ∨ (b ≠ d) must hold. The latter avoids considering the same index pair twice in a single path.
– Given k ∈ ℕ (1 < k ≤ |A|), we require that A[k][0] − A[k−1][0] ≤ 1 and A[k][1] − A[k−1][1] ≤ 1 always hold. This restricts each index to move by at most 1 unit to its next alignment position.
– Moreover, we always require that A[k][0] − A[k−1][0] ≥ 0 and A[k][1] − A[k−1][1] ≥ 0. This guarantees a monotonic movement of the path towards the last index pair.

In Figure 3(b), we depict the three possible steps that each index pair can perform in a valid alignment. These constraints bound the length of an alignment between two series of length ℓ to lie between ℓ and 2ℓ − 1.

Typically, warping paths are also subject to global constraints. We can thus set their maximum deviation from the matrix diagonal. In that regard, Sakoe and Chiba [59] and Itakura [23] proposed different warping path constraints, which restrict the matrix positions that a valid path can visit. The Sakoe-Chiba band [59] constraint allows each index of a warping path to be at most r points far from the diagonal (the Euclidean Distance alignment). On the other hand, the Itakura-parallelogram [23] constraint allows choosing different r values depending on the index position i. In general, r is called the warping window.

Given the valid warping paths A* that satisfy the previously introduced constraints, we can formally define the DTW distance between two series D and D′ of the same length ℓ as: DTW(D, D′) = min_{A*} √(Σ_{i=1}^{|A*|} d(d_{A*[i][0]}, d′_{A*[i][1]})). We note that computing the DTW distance corresponds to finding the valid alignment that minimizes the sum of the distances.

In Figure 4, we consider two series (D and D′), which are extracted from two nearby offsets in the same long sequence. In this manner, the prefix of D is equal to a suffix of D′.
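Before examining the example in Figure 4 further, the two measures can be made concrete with a small sketch (our own illustrative code, not the paper's implementation): ED as a lock-step sum, and DTW as a dynamic program over the ℓ × ℓ matrix with an optional Sakoe-Chiba band of half-width r.

```python
import math

def ed(D, Dp):
    # Lock-step Euclidean distance between two equal-length series.
    assert len(D) == len(Dp)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(D, Dp)))

def dtw(D, Dp, r=None):
    # DTW over the l x l cost matrix; each cell extends a path by one of
    # the three valid steps of Figure 3(b). r is the Sakoe-Chiba
    # half-width; r=None leaves the path unconstrained.
    l = len(D)
    assert len(Dp) == l
    r = l if r is None else r
    INF = float("inf")
    cost = [[INF] * l for _ in range(l)]
    for i in range(l):
        for j in range(max(0, i - r), min(l, i + r + 1)):
            d = (D[i] - Dp[j]) ** 2
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best = INF
            if i > 0:
                best = min(best, cost[i - 1][j])
            if j > 0:
                best = min(best, cost[i][j - 1])
            if i > 0 and j > 0:
                best = min(best, cost[i - 1][j - 1])
            cost[i][j] = d + best
    return math.sqrt(cost[l - 1][l - 1])
```

For instance, for two slightly shifted series, dtw([1, 2, 3, 3], [1, 1, 2, 3], r=1) returns 0.0, while ed on the same pair is positive: the warping path absorbs the one-point misalignment that the lock-step alignment penalizes.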
In the plots, the values of D span the right vertical axis, whereas those of D′ span the left one. If we compute the Euclidean distance, as depicted in Figure 4(a), the fixed alignment of points does not capture the similarity of the two series. On the other hand, when computing the DTW distance, the warping path aligns the two similar parts, as reported in Figure 4(b). At the bottom of the figure, we also report the warping path, which is constrained by a Sakoe-Chiba band.

Problem Definition.
The problem we wish to solve in this paper is the following:
Problem 1 (Variable-Length Subsequences Indexing)
Given a data series collection C, and a series length range [ℓ_min, ℓ_max], we want to build an index that supports exact similarity search, under the Euclidean and Dynamic Time Warping (DTW) measures, for queries of any length within the range [ℓ_min, ℓ_max].

Fig. 4 (a) Euclidean distance alignment between the data series D and D′ (ED = 4.92). (b) DTW alignment between D and D′ (DTW = 0.33). In the bottom part, the path is depicted in the |D| × |D′| matrix, contoured by the Sakoe-Chiba band.

In our case, similarity search is formally defined as follows:

Definition 1 (Similarity search)
Given a data series collection C = {D^1, ..., D^{|C|}}, a series length range [ℓ_min, ℓ_max], a query data series Q, where ℓ_min ≤ |Q| ≤ ℓ_max, and k ∈ ℕ, we want to find the set R = {D^i_{o,ℓ} | D^i ∈ C ∧ ℓ = |Q| ∧ (ℓ + o − 1) ≤ |D^i|}, where |R| = k. We require that ∀ D^i_{o,ℓ} ∈ R, ∄ D^{i′}_{o′,ℓ′} s.t. dist(D^{i′}_{o′,ℓ′}, Q) < dist(D^i_{o,ℓ}, Q), where ℓ′ = |Q|, (ℓ′ + o′ − 1) ≤ |D^{i′}|, and D^{i′} ∈ C. We informally call R the k nearest neighbors set of Q. Given two generic series of the same length, namely D and D′, the function dist(D, D′) = ED(D, D′). In this work, we perform similarity search using either Euclidean Distance (ED) or Dynamic Time Warping (DTW) as the dist function.

3.1 The iSAX Index

The Piecewise Aggregate Approximation (PAA) of a data series D, PAA(D) = {p_1, ..., p_w}, represents D in a w-dimensional space by means of w real-valued segments of length s, where the value of each segment is the mean of the
corresponding values of D [27]. We denote the first k dimensions of PAA(D) (k ≤ w) as PAA(D)_{1,..,k}. Then, the iSAX representation of a data series D, denoted by SAX(D, w, |alphabet|), is the representation of PAA(D) by w discrete coefficients, drawn from an alphabet of cardinality |alphabet| [62].

The main idea of the iSAX representation (see Figure 5, top) is that the real-valued space may be segmented by |alphabet| − 1 breakpoints into |alphabet| regions that are labeled by distinct symbols: binary values (e.g., with |alphabet| = 4 the available labels are {00, 01, 10, 11}). iSAX assigns symbols to the PAA coefficients, depending on the region in which they are located.

The iSAX data series index is a tree data structure [62,11], consisting of three types of nodes (refer to Figure 5). (i) The root node points to n children nodes (in the worst case n = 2^w, when the series in the collection cover all possible iSAX representations). (ii) Each inner node contains the iSAX representation of all the series below it. (iii) Each leaf node contains both the iSAX representation and the raw data of all the series inside it (in order to be able to prune false positives and produce exact, correct answers). When the number of series in a leaf node becomes greater than the maximum leaf capacity, the leaf splits: it becomes an inner node and creates two new leaves, by increasing the cardinality of one of the segments of its iSAX representation. The two refined iSAX representations (new bit set to 0 and 1) are assigned to the two new leaves.

Fig. 5 Indexing of series D (and an inner node split).

Fig. 6 (a) Master series of D in the length interval [ℓ_min, ℓ_max]. (b) Zero-aligned master series. (c) Envelope built over the master series.
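The PAA and SAX summarizations described above can be sketched as follows. This is illustrative code of our own, not the paper's implementation; the breakpoint values are the standard-normal quartiles commonly used for a 4-symbol alphabet:

```python
def paa(D, w):
    """Piecewise Aggregate Approximation: mean of each of w
    equal-length segments (assumes len(D) is a multiple of w)."""
    s = len(D) // w
    return [sum(D[k * s:(k + 1) * s]) / s for k in range(w)]

# Breakpoints for |alphabet| = 4: the standard-normal quartiles, so that
# each of the 4 regions is equiprobable for Z-normalized data.
BREAKPOINTS = [-0.6745, 0.0, 0.6745]

def sax(D, w):
    """Map each PAA coefficient to the index (symbol) of the region
    it falls into, counting regions from the bottom."""
    return [sum(c > b for b in BREAKPOINTS) for c in paa(D, w)]
```

E.g., paa([1, 1, 3, 3], 2) gives [1.0, 3.0], and a series whose two halves sit below the lowest and above the highest breakpoint maps to the extreme symbols 0 and 3.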
4 The ULISSE Summarization Technique

The key idea of the
ULISSE approach is the succinct summarization of sets of series, namely, overlapping subsequences. In this section, we present this summarization method.

4.1 Representing Multiple Subsequences

When we consider contiguous and overlapping subsequences of different lengths within the range [ℓ_min, ℓ_max] (Figure 6(a)), we obtain a set of similar series, whose differences are due to the misalignment and the different number of points. We conduct a simple experiment in Figure 6(b), where we zero-align all the series shown in Figure 6(a); we call those master series.

Definition 2 (Master Series)
Given a data series D, and a subsequence length range [ℓ_min, ℓ_max], the master series are subsequences of the form D_{i, min(|D|−i+1, ℓ_max)}, for each i such that 1 ≤ i ≤ |D| − (ℓ_min − 1), where 1 ≤ ℓ_min ≤ ℓ_max ≤ |D|.

We observe that the following property holds for the master series.

Lemma 1
For any master series of the form D_{i,ℓ′}, we have that PAA(D_{i,ℓ′})_{1,..,k} = PAA(D_{i,ℓ″})_{1,..,k} holds for each ℓ″ such that ℓ_min ≤ ℓ″ ≤ ℓ′ ≤ ℓ_max, where ℓ′ and ℓ″ are multiples of the segment length that accommodate at least k segments.

Proof It follows from the fact that each non-master series is always entirely overlapped by a master series. Since the subsequences are not subject to any scale normalization, their prefix coincides with the prefix of the equi-offset master series.

Intuitively, the above lemma says that by computing only the
PAA of the master series in D, we are able to represent the PAA prefix of any subsequence of D.

When we zero-align the
PAA summaries of the master series, we compute the minimum and maximum PAA values (over all the subsequences) for each segment: this forms what we call an Envelope (refer to Figure 6(c)). (When the length of a master series is not a multiple of the PAA segment length, we compute the PAA coefficients of the longest prefix that is a multiple of a segment.) We call containment area the space in between the segments that define the Envelope.

4.2 PAA Envelope for Non-Z-Normalized Subsequences

In this subsection, we formalize the concept of the
Envelope, introducing a new series representation. We denote by L and U the PAA coefficients which delimit the lower and upper parts, respectively, of a containment area (see Figure 6(c)). Furthermore, we introduce a parameter γ, which corresponds to the number of master series represented by the Envelope. This allows us to tune the number of subsequences of length in the range [ℓ_min, ℓ_max] that a single Envelope represents, influencing both the tightness of a containment area and the size of the index (number of computed Envelopes). We will show the effect of this tradeoff, i.e., tightness vs. index size, in the experimental evaluation. Given a, the point from where we start to consider the subsequences in D, and s, the chosen length of the PAA segment, we refer to an Envelope using the following signature:

paaENV_[D, ℓ_min, ℓ_max, a, γ, s] = [L, U]   (1)

4.3 PAA Envelope for Z-Normalized Subsequences

So far, we have considered that each subsequence in the input series D is not subject to any scale normalization, i.e., it is not Z-normalized. We introduce here a negative result, concerning the unsuitability of a generic paaENV_[D, ℓ_min, ℓ_max, a, γ, s] to describe subsequences that are Z-normalized. Intuitively, we argue that the PAA coefficients of a single master series D_{i,a} generate a containment area which may not embed the coefficients of the Z-normalized subsequences of the form D′_{i,a′}, for a′ < a. This happens because Z-normalization causes the subsequences of different lengths to change their shape, and even shift on the y-axis. Figure 7 depicts such an example. We can now formalize this negative result.
Fig. 7 Master series D (Z-normalized), with marked PAA coefficients.

Lemma 2 A paaENV_[D, ℓ_min, ℓ_max, a, γ, s] is not guaranteed to contain all the PAA coefficients of the Z-normalized subsequences of lengths in [ℓ_min, ℓ_max] of D.

Proof To prove the correctness of the lemma, it suffices to pick a case where a subsequence of D, namely D_{a,ℓ′}, with ℓ_min ≤ ℓ′ ≤ ℓ_max, is not encoded by paaENV_[D, ℓ_min, ℓ_max, a, γ, s]. Formally, we should consider the case where ∃k such that PAA(D_{i,ℓ′})_k > U_k or PAA(D_{i,ℓ′})_k < L_k. We may pick a Z-normalized series D, choosing ℓ_max = |D| = ℓ_min + 1 and γ = 0. The resulting paaENV_[D, ℓ_min=ℓ_max−1, ℓ_max=|D|, a=1, γ=0, s] has equal bounds, namely L = U. Consider the Z-normalized subsequence D_{1,ℓ_min}. Its PAA coefficients must be in the envelope. This implies that
P AA ( D ,(cid:96) min ) = L = U must hold. If s is the P AA segment length, in the caseof Z-normalization,
P AA ( D ,(cid:96) min ) = ((( (cid:80) si =1 d i ) − ( µ D ,(cid:96)min × s )) /σ D ,(cid:96)min ) /s and U = ((( (cid:80) si =1 d i ) − ( µ D × s )) /σ D ) /s . Therefore, the following equation: ( µ D ,(cid:96)min × s ) /σ D ,(cid:96)min = ( µ D × s ) /σ D holds, which is equivalent to µ D ,(cid:96)min /σ D ,(cid:96)min = µ D /σ D . At this point we may havethat µ D = µ D ,(cid:96)min , when D (cid:96)max, = µ D ,(cid:96)min . Thisclearly leads to have a smaller dispersion on D than D ,(cid:96)min and thus σ D < σ D ,(cid:96)min = ⇒ P AA ( D ,(cid:96) min ) (cid:54) = L (cid:54) = U .If we want to build an Envelope, containing all theZ-normalized sequences, we need to take into accountthe shifted coefficients of the Z-normalized subsequences,which are not master series. Hence, each P AA segment co-efficient (in a master series) will be represented by the set ofvalues resulting from the Z-normalizations of all the subse-quences of length in [ (cid:96) min , (cid:96) max ] that are not master seriesand contain that segment.Given a generic master series D i,(cid:96) = { d i , ..d i + (cid:96) − } ,and s the length of the segment, its k th P AA co-efficient set is computed by:
PAA*(D_{i,ℓ})_k = { (((Σ_{p=s(k−1)+1}^{s·k} d_p) − (μ_{D_{i,ℓ'}} × s)) / σ_{D_{i,ℓ'}}) / s  |  ℓ_min ≤ ℓ' ≤ ℓ_max, ℓ' ≥ s(k−1) + s − (i−1) }    (2)

In Figure 8, we depict an example of PAA* computation for the first segment of the master series D. We can then follow the same procedure as before (in the case of non Z-normalized sequences), computing the minimum and maximum PAA coefficients for each segment given by the above formula, in order to get the Envelope for the Z-normalized sequences (which we also denote with paaENV).
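As an illustration, the coefficient set of Equation 2 can be sketched as follows (our own simplified Python, assuming i = 1 so that the k-th segment is contained whenever ℓ' ≥ s·k; the data values are arbitrary):

```python
import math

def znorm_paa_star(D, i, s, k, lmin, lmax):
    """
    Set of Z-normalized values of the k-th PAA segment of the master
    series starting at position i (1-based), over all subsequence
    lengths l' in [lmin, lmax] that fully contain that segment
    (cf. Equation 2; containment condition simplified for i = 1).
    """
    # raw sum of the k-th segment: points s*(k-1)+1 .. s*k of D_{i,l'}
    start = (i - 1) + s * (k - 1)
    seg_sum = sum(D[start:start + s])
    out = []
    for lp in range(lmin, lmax + 1):
        if lp < s * k:        # subsequence too short to contain segment k
            continue
        sub = D[i - 1:i - 1 + lp]
        mu = sum(sub) / lp
        sigma = math.sqrt(sum(x * x for x in sub) / lp - mu * mu)
        # one Z-normalized PAA value per subsequence length l'
        out.append(((seg_sum - mu * s) / sigma) / s)
    return out

D = [float(v) for v in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]]
# One normalized coefficient per length l' in [8, 12]: 5 values in total.
print(znorm_paa_star(D, i=1, s=4, k=1, lmin=8, lmax=12))
```

Taking the minimum and maximum of each such set, per segment, yields the bounds L and U of the Z-normalized Envelope.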
Fig. 8  PAA*(D) computation. Since the first PAA segment (of length s) of the master series D is also the first one of the two non-master series of lengths |D|−1 and |D|−2, three PAA coefficients are computed with the different normalizations.
Fig. 9  uENV building, with input: a data series D, PAA segment size s = 20 points, γ = 20, ℓ_min = 40, and ℓ_max = 60.

The Envelopes we have introduced are indexed with the iSAX indexing mechanism (depicted in Figure 5). Given a paaENV, we can translate its PAA extremes into the relative iSAX representation:

uENV_paaENV[D, ℓ_min, ℓ_max, a, γ, s] = [iSAX(L), iSAX(U)],

where iSAX(L) (iSAX(U)) is the vector of the minimum (maximum) PAA coefficients of all the segments corresponding to the subsequences of D. The ULISSE Envelope, uENV, represents the principal building block of the ULISSE index. Note that we may omit, for brevity, the subscript containing the parameters from the uENV notation, when they are clear from the context.

In Figure 9, we show a small example of Envelope building, given an input series D. The picture shows the PAA coefficients computation of the master series. They are calculated using a sliding window starting at point a = 1, which stops after γ steps. Note that the Envelope generates a containment area that embeds all the subsequences of D of all lengths in the range [ℓ_min, ℓ_max].

Algorithm 1: uENV computation
Input: float[] D, int s, int ℓ_min, int ℓ_max, int γ, int a
Output: uENV [iSAX_min, iSAX_max]
1   int w ← ⌊ℓ_max / s⌋;
2   int segUpdateList[w] ← {0, ..., 0};
3   float U[w] ← {−∞, ..., −∞}, L[w] ← {∞, ..., ∞};
4   if |D| − (a − 1) ≥ ℓ_min then
5       float paaRSum ← 0;
6       for i ← a to min(|D|, a + ℓ_max + γ) do  // iterate the master series
7           paaRSum ← paaRSum + D[i];  // running sum of the PAA segment
8           if (i − a) ≥ s then paaRSum ← paaRSum − D[i − s];
9           for z ← 1 to min(⌊(i − (a − 1)) / s⌋, w) do
10              if segUpdateList[z] ≤ γ then
11                  segUpdateList[z]++;
12                  float paa ← paaRSum / s;
13                  L[z] ← min(paa, L[z]);
14                  U[z] ← max(paa, U[z]);
15      uENV ← [iSAX(L), iSAX(U)];
16  else uENV ← ∅;

Algorithm 1 describes the procedure that computes the uENV for non-Z-normalized subsequences. Maintaining the running sum of the last s points, i.e., the length of a PAA segment (refer to Line 7), allows us to compute all the PAA values of the expected envelope in O(w(ℓ_max + γ)) time in the worst case, where ℓ_max + γ is the window of points we need to take into account for processing each master series, and w is the number of PAA segments in the maximum subsequence length ℓ_max. Since w is usually a very small number (ranging between 8 and 16), it essentially plays the role of a constant factor. In order to consider no more than γ steps for each segment position, we store in the vector defined in Line 2 how many times each segment has been used to update the final envelope.

5.2 Indexing Z-Normalized Subsequences

In Algorithm 2, we show the procedure that computes an indexable Envelope for Z-normalized sequences, which we denote as uENV_norm. This routine iterates over the points of the overlapping subsequences of variable length (First loop in Line 7), and performs the computation in two parts. The first part consists of computing the sum of each PAA segment, which we keep in the vector PAAs defined in Line 2. When we encounter a new point, we update the sums of all the segments that contain that point (Lines 8-11). The second part, starting in Line 16 (Second loop), performs the segment normalizations, which depend on the statistics (mean and standard deviation) of all the subsequences of different lengths (master and non-master series) in which they appear. During this step, we keep the sum and the squared sum of the points in the current window.
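The constant-time statistics maintenance behind the Second loop can be sketched as follows (our own simplified Python: a single accumulated sum and squared sum yield the mean and standard deviation of any already-accumulated window in O(1), cf. Lines 19-20 of Algorithm 2):

```python
import math
import random

def running_stats(points):
    """For each prefix of `points`, return (mean, std) computed in O(1)
    per step from an incrementally maintained sum and squared sum."""
    acc_sum = acc_sq = 0.0
    stats = []
    for x in points:
        acc_sum += x
        acc_sq += x * x
        n = len(stats) + 1
        mu = acc_sum / n
        # population variance via E[x^2] - mu^2; clamp tiny negatives
        sigma = math.sqrt(max(acc_sq / n - mu * mu, 0.0))
        stats.append((mu, sigma))
    return stats

random.seed(0)
pts = [random.random() for _ in range(100)]
mus_sigmas = running_stats(pts)
mu, sigma = mus_sigmas[49]   # stats of the first 50 points, no rescan needed
```

Algorithm 2 additionally subtracts points that leave the window (its Line 28), so the same trick covers all overlapping subsequence prefixes, not just the prefixes shown here.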
Algorithm 2: uENV_norm computation
Input: float[] D, int s, int ℓ_min, int ℓ_max, int γ, int a
Output: uENV_norm [iSAX_min, iSAX_max]
1   int w ← ⌊ℓ_max / s⌋;
2   float PAAs[ℓ_max + γ − (s − 1)] ← {0, ..., 0};  // sums of the PAA segments
3   float U[w] ← {−∞, ..., −∞}, L[w] ← {∞, ..., ∞};
4   if |D| − (a − 1) ≥ ℓ_min then
5       int nSeg ← 1;
6       float accSum ← 0, accSqSum ← 0;
7       for i ← a to min(|D|, a + ℓ_max + γ) do  // First loop: iterate the points
8           if i − a ≥ s then  // update the sums of the PAA segments
9               nSeg++;
10              PAAs[nSeg] ← PAAs[nSeg − 1] − D[i − s];
11          PAAs[nSeg] ← PAAs[nSeg] + D[i];
12          accSum ← accSum + D[i], accSqSum ← accSqSum + D[i]²;  // keep sum and squared sum
13          if i − (a − 1) ≥ ℓ_min then  // the window contains enough points
14              acSAc ← accSum, acSqSAc ← accSqSum;
15              int nMse ← min(γ + 1, (i − (a − 1) − ℓ_min) + 1);
16              for j ← 1 to nMse do  // Second loop: normalizations of the PAA coefficients
17                  int wSubSeq ← i − (a − 1) − (j − 1);
18                  if wSubSeq ≤ ℓ_max then
19                      float µ ← acSAc / wSubSeq;
20                      float σ ← sqrt(acSqSAc / wSubSeq − µ²);
21                      int nSegs ← ⌊wSubSeq / s⌋;
22                      for z ← 1 to nSegs do
23                          float p ← PAAs[j + (z − 1) × s];
24                          float b ← s × µ;
25                          float paaNorm ← ((p − b) / σ) / s;
26                          L[z] ← min(paaNorm, L[z]);
27                          U[z] ← max(paaNorm, U[z]);
28                  acSAc ← acSAc − D[j], acSqSAc ← acSqSAc − D[j]²;
29      uENV_norm ← [iSAX(L), iSAX(U)];
30  else uENV_norm ← ∅;

These window statistics permit us to compute the mean and the standard deviation in constant time (Lines 19-20). We then compute the Z-normalizations of all the PAA coefficients in Line 25, using Equation 2.

In Figure 10, we show an example that illustrates the operation of the algorithm. In the first step, the First loop has iterated over the points marked with the dashed square. Since they form a subsequence of length ℓ_min, the Second loop starts to compute the Z-normalized PAA coefficients of the two segments, computing the mean and the standard deviation using the sum (acSAc) and squared sum (acSqSAc) of the points considered by the
First loop (gray circles). The second step takes place after the First loop has considered one more point (black circle) of the series. Here, the Second loop updates the sum and the squared sum with the new point, and then calculates the corresponding new Z-normalized PAA coefficients. At the last step, the algorithm considers the second subsequence of length ℓ_min, which is contained in the nine-point window. The Second loop considers in order all the overlapping subsequences, with different prefixes and lengths. This permits updating the statistics (and all possible normalizations) in constant time. The algorithm terminates when all the points have been considered by the First loop, and the Second loop either encounters a subsequence of length ℓ_min, or performs at most γ iterations, since all the subsequences starting at position a + γ + 1 or later (if any) will be represented by other Envelopes.

Fig. 10  Running example of Algorithm 2. Left column) Points iteration; the dashed square contours the subsequence used to normalize the PAA coefficients in the Second loop. Right column) Statistics update at each step, which serves the computation of µ and σ of each possible coefficient normalization.

Given w, the number of PAA segments in the window of length ℓ_max, and M = ℓ_max − ℓ_min + γ, the number of master series we need to consider, building a normalized Envelope, uENV_norm, takes O(Mγw) time.

Algorithm 3: ULISSE index computation
Input: Collection C, int s, int ℓ_min, int ℓ_max, int γ, bool bNorm
Output: ULISSE index I
1   foreach D in C do
2       int a' ← 1;
3       uENV E ← ∅;
4       while true do
5           if bNorm then
6               E ← uENV_norm(D, s, ℓ_min, ℓ_max, γ, a');
7           else
8               E ← uENV(D, s, ℓ_min, ℓ_max, γ, a');
9           a' ← a' + γ + 1;
10          if E == ∅ then
11              break;
12          bulkLoadingIndexing(I, E);
13          I.inMemoryList.add(maxCardinality(E));
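The extraction loop of Algorithm 3 can be sketched as follows (our own simplified Python; `make_envelope` is a placeholder for uENV / uENV_norm, returning None where the algorithm yields ∅):

```python
def make_envelope(D, lmin, lmax, gamma, a):
    """Stand-in for uENV / uENV_norm: fails when fewer than lmin
    points remain after position a (1-based)."""
    if len(D) - (a - 1) < lmin:
        return None
    return ("env", a)                # placeholder for [iSAX(L), iSAX(U)]

def extract_envelopes(D, lmin, lmax, gamma):
    """Starting points advance by gamma+1, so each Envelope covers
    gamma+1 master series; extraction stops when too few points remain."""
    envelopes, a = [], 1
    while True:
        env = make_envelope(D, lmin, lmax, gamma, a)
        if env is None:
            break
        envelopes.append(env)
        a += gamma + 1               # next group of master series
    return envelopes

series = list(range(100))
envs = extract_envelopes(series, lmin=40, lmax=60, gamma=20)
print(len(envs))   # 3 envelopes, starting at a = 1, 22, 43
```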
We now describe how to build the ULISSE index upon a data series collection. We maintain the structure of the iSAX index [11], introduced in the preliminaries. Each ULISSE internal node stores the Envelope uENV that represents all the sequences in the subtree rooted at that node. Leaf nodes contain several Envelopes, which by construction have the same iSAX(L). On the contrary, their iSAX(U) varies, since it gets updated with every new insertion in the node. (Note that inserting by keeping the same iSAX(U) and updating iSAX(L) represents a symmetric and equivalent choice.)

In Figure 11, we show the structure of the ULISSE index during the insertion of an Envelope (rectangular/yellow box). Note that insertions are performed based on iSAX(L) (underlined in the figure). Once we find a node with the same iSAX(L) (Figure 11, 1st step), if this is an inner node, we descend its subtree (always following the iSAX(L) representations) until we encounter a leaf. During this path traversal, we also update the iSAX representation of the Envelope we are inserting, by increasing the number of bits of the segments, as necessary. In our example, when the Envelope arrives at the leaf, it has increased the cardinality of its second segment to two bits, both for iSAX(L) and for iSAX(U) (Figure 11, 2nd step). Along with the Envelope, we store in the leaf a pointer to the location on disk of the corresponding raw data series. We note that, during this operation, we do not move any raw data into the index.

To conclude the insertion operation, we also update the iSAX(U) of the nodes visited along the path to the leaf where the insertion took place, raising their upper symbols as needed (Figure 11, 3rd step). This brings the ULISSE index to a consistent state after the insertion of the Envelope.

Algorithm 3 describes the procedure, which iterates over the series of the input collection C, and inserts them in the index.
Note that function bulkLoadingIndexing in Line 12 may use different bulk loading techniques. In our experiments, we used the iSAX 2.0 bulk loading algorithm [10]. Alongside the index, we also keep in memory (in the raw data order) all the Envelopes, represented by the symbols of the highest iSAX cardinality available (Line 13). This information is used during query answering.

Fig. 11  Envelope insertion in an ULISSE index. iSAX(L) is chosen to accommodate the Envelopes inside the nodes. 1st step: find the leaf node with the representative iSAX(L). 2nd step: insert the uENV in the leaf with the same iSAX(L), computing the new representation for the split symbols. 3rd step: update iSAX(U) in the nodes along the path to the updated leaf, raising the highest symbol values; each Envelope in a leaf points to the starting point of its subsequences on disk.

The index space complexity is equivalent for the cases of Z-normalized and non Z-normalized sequences. The choice of γ determines the number of Envelopes generated, and thus the index size. Hence, given a data series collection C = {D_1, ..., D_|C|}, and recalling that each Envelope covers γ + 1 master series, the number of extracted Envelopes is N = Σ_{i=1}^{|C|} ⌈(|D_i| − ℓ_min + 1) / (γ + 1)⌉. If w PAA segments are used to discretize the series, each iSAX symbol is represented by a single byte (binary label) and the disk pointer in each Envelope occupies b bytes, the final space complexity is O((2w + b)N).

In this section, we present the building blocks of the similarity search algorithms we developed for the
ULISSE index, for both the Euclidean and the DTW distances, and for both k-NN and ε-range queries. We note that the same index structure supports both distance measures. When a query arrives, depending on the distance measure chosen, we use the corresponding lower bounding and true distance formulas. We elaborate on these procedures in the following sections.

6.1 Lower Bounding Euclidean Distance

The iSAX representation allows the definition of a distance function that lower bounds the true Euclidean distance [62]. This function compares the PAA coefficients of the first data series against the iSAX breakpoints (values) that delimit the symbol regions of the second data series.
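This per-segment comparison can be sketched as follows (our own simplified Python, cf. Equations 3-4; the breakpoint pairs are illustrative placeholders, not the real iSAX breakpoint table):

```python
def dist_lb(paa_coeff, beta_lower, beta_upper):
    """Squared lower-bound distance between one PAA coefficient and one
    iSAX symbol region delimited by its lower/upper breakpoints."""
    if paa_coeff > beta_upper:
        return (paa_coeff - beta_upper) ** 2
    if paa_coeff < beta_lower:
        return (beta_lower - paa_coeff) ** 2
    return 0.0                      # coefficient falls inside the region

def mindist_paa_isax(paa, regions, series_len):
    """regions: per-segment (beta_lower, beta_upper) of the iSAX word."""
    w = len(paa)
    scale = (series_len / w) ** 0.5
    return scale * sum(dist_lb(p, lo, hi)
                       for p, (lo, hi) in zip(paa, regions)) ** 0.5

paa = [0.9, -0.3, 0.1]
regions = [(-0.5, 0.5), (-0.5, 0.5), (0.0, 1.0)]
# Only the first coefficient (0.9) leaves its region, by 0.4.
print(mindist_paa_isax(paa, regions, series_len=96))
```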
Let β_u(S) and β_l(S) be the upper and lower breakpoints of the iSAX symbol S. We can compute the distance between a PAA coefficient and an iSAX region using:

distLB(PAA(D)_i, iSAX(D')_i) =
    (β_u(iSAX(D')_i) − PAA(D)_i)²  if β_u(iSAX(D')_i) < PAA(D)_i,
    (β_l(iSAX(D')_i) − PAA(D)_i)²  if β_l(iSAX(D')_i) > PAA(D)_i,
    0  otherwise.    (3)

In turn, the lower bounding distance between two equi-length series D, D', represented by w PAA segments and w iSAX symbols, respectively, is defined as:

mindist_PAA_iSAX(PAA(D), iSAX(D')) = √(|D|/w) × √( Σ_{i=1}^{w} distLB(PAA(D)_i, iSAX(D')_i) ).    (4)

We rely on the following proposition [36]:

Proposition 1
Given two data series D, D', where |D| = |D'|, mindist_PAA_iSAX(PAA(D), iSAX(D')) ≤ ED(D, D').

Since our index contains Envelope representations, we need to adapt Equation 4 in order to lower bound the distances between a data series Q, which we call the query, and the set of subsequences whose iSAX symbols are described by the Envelope uENV_paaENV[D, ℓ_min, ℓ_max, a, γ, s] = [iSAX(L), iSAX(U)]. Therefore, given w, the number of PAA coefficients of Q that are computed using the Envelope PAA segment length s on the longest multiple prefix, we define the following function:

mindist_ULISSE(PAA(Q), uENV_paaENV[...]) = √s × √( Σ_{i=1}^{w} f_i ), where
    f_i = (PAA(Q)_i − β_u(iSAX(U)_i))²  if β_u(iSAX(U)_i) < PAA(Q)_i  (*)
    f_i = (PAA(Q)_i − β_l(iSAX(L)_i))²  if β_l(iSAX(L)_i) > PAA(Q)_i  (**)
    f_i = 0  otherwise.    (5)

In Figure 12, we report an example of the mindist_ULISSE computation between a query Q, represented by its PAA coefficients, and an Envelope in the iSAX space.

Proposition 2  Given two data series Q, D, mindist_ULISSE(PAA(Q), uENV_paaENV[D, ℓ_min, ℓ_max, a, γ, s]) ≤ ED(Q, D_{i,|Q|}), for each i such that a ≤ i ≤ a + γ + 1 and |D| − (i − 1) ≥ ℓ_min.

Fig. 12  Given the PAA representation of a query Q (a) and uENV_paaENV[D, ℓ_min, ℓ_max, a, γ, s] (b), we compute their mindist_ULISSE. The iSAX space is delimited with dashed lines and the relative breakpoints β_i.

Proof (sketch)  We may have two cases. When mindist
ULISSE is equal to zero, the proposition clearly holds, since the Euclidean distance is non-negative. Otherwise, the function yields a value greater than zero when one of the first two branches is true. Let us consider the first (the second is symmetric). If we denote by D'' the subsequence in D such that β_l(iSAX(U)_i) ≤ PAA(D'')_i ≤ β_u(iSAX(U)_i), we know that the upper breakpoint of the i-th iSAX symbol of each subsequence in D that is represented by the Envelope must be less than or equal to β_u(iSAX(U)_i). It follows that, for this case, Equation 5 is equivalent to distLB(PAA(Q)_i, iSAX(D'')_i), which yields the shortest lower bounding distance between the i-th segment of points in D and Q.

6.2 Lower Bounding Dynamic Time Warping

We present here a lower bound for the true DTW distance between two data series. Keogh et al. [30] introduced the LB_Keogh function, which provides a measure that is always smaller than or equal to the true DTW between two equi-length series. To compute this measure, we need to account for the valid warping alignments of the two data series. Recall that the indexes of a valid path are confined by the Sakoe-Chiba band, where they are at most r points away from the diagonal (the Euclidean distance alignment). Given a data series D, we can build an envelope, dtwENV_r(D), composed of two data series, L^DTW and U^DTW, which delimit the space generated by the points of D whose indexes appear in the valid warping paths constrained by the window r. Therefore, the i-th points of the two envelope sequences are computed as follows: L^DTW_i = min(D_{(i−r, 2r+1)}) and U^DTW_i = max(D_{(i−r, 2r+1)}), i.e., the minimum and maximum over the window of 2r + 1 points centered at position i. Intuitively, the i-th values of L^DTW and U^DTW represent the minimum and the maximum values, respectively, of the points in D that can be aligned with the i-th position of a matching series.
In Figure 13(a), we report a data series D (plotted using a dashed line), contoured by its dtwENV_r(D) envelope (r = 7).
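The DTW envelope and the LB_Keogh measure (Equation 6) can be sketched as follows (our own simplified Python, using a plain sliding min/max rather than an optimized streaming algorithm):

```python
import math

def dtw_envelope(D, r):
    """L_DTW / U_DTW: sliding min and max of D over a window of r
    points on each side (Sakoe-Chiba band of width r)."""
    n = len(D)
    L = [min(D[max(0, i - r):i + r + 1]) for i in range(n)]
    U = [max(D[max(0, i - r):i + r + 1]) for i in range(n)]
    return L, U

def lb_keogh(L, U, Dp):
    """LB_Keogh(dtwENV_r(D), D') for equi-length series: only points
    of D' that fall outside the envelope contribute."""
    acc = 0.0
    for x, lo, hi in zip(Dp, L, U):
        if x > hi:
            acc += (x - hi) ** 2
        elif x < lo:
            acc += (x - lo) ** 2
    return math.sqrt(acc)

D  = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0]
Dp = [0.5, 1.5, 1.5, 1.5, 0.5, -0.5, -1.5, -1.0]
L, U = dtw_envelope(D, r=2)
print(lb_keogh(L, U, Dp))   # 0.0: this D' stays inside the envelope
```

A series that leaves the envelope (e.g., a vertically shifted copy of D) yields a strictly positive bound, which can then be compared against a best-so-far distance before running the full O(ℓr) DTW computation.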
Fig. 13  (a) DTW Envelope (L^DTW, U^DTW) of a series D. (b) LB_Keogh distance between the DTW Envelope and D'. (c) LB_PaL between the DTW Envelope of Q (prefix of D) and the ULISSE Envelope of D'. The horizontal dashed lines delimit the iSAX breakpoints space.

Lower bounding DTW.
We can thus define the LB_Keogh distance [30], which is computed between the DTW envelope of a series D and a data series D', where |D| = |D'| and the warping window is r:

LB_Keogh(dtwENV_r(D), D') = √( Σ_{i=1}^{|D|} g_i ), where
    g_i = (D'_i − U^DTW_i)²  if D'_i > U^DTW_i
    g_i = (D'_i − L^DTW_i)²  if D'_i < L^DTW_i
    g_i = 0  otherwise.    (6)

The LB_Keogh distance between dtwENV_r(D) and D' is guaranteed to be always smaller than, or equal to, DTW(D, D'), computed with warping window r. In Figure 13(b), we depict the LB_Keogh distance between dtwENV_r(D) (from Figure 13(a)) and a new series D'. The vertical (blue) lines represent the positive differences between D' and the DTW envelope of D in Equation 6. Note that the computation of LB_Keogh takes O(ℓ) time (linear), whereas the true DTW computation runs in O(ℓr) time using dynamic programming [30,57].

Lower bounding DTW in ULISSE.
We now propose a new lower bounding measure for the true DTW distance between a data series and all the sequences (of the same length) represented by a ULISSE Envelope. To that extent, we first introduce a measure based on the LB_Keogh distance, which is computed between the PAA representation of dtwENV_r(D) and the iSAX representation of D'. Given w, the number of PAA coefficients of each DTW envelope series (U^DTW, L^DTW), which is equivalent to the number of iSAX coefficients of D', we have:

LB_Keogh_PAA_iSAX(PAA(dtwENV_r(D)), iSAX(D')) = √(|D|/w) × √( Σ_{i=1}^{w} h_i ), where
    h_i = (β_l(iSAX(D')_i) − PAA(U^DTW)_i)²  if β_l(iSAX(D')_i) > PAA(U^DTW)_i  (*)
    h_i = (PAA(L^DTW)_i − β_u(iSAX(D')_i))²  if PAA(L^DTW)_i > β_u(iSAX(D')_i)  (**)
    h_i = 0  otherwise.    (7)

We know that LB_Keogh_PAA_iSAX(PAA(dtwENV_r(D)), iSAX(D')) ≤ LB_Keogh(dtwENV_r(D), D'), as proven by Keogh et al. [30]. Given the PAA representation of dtwENV_r(D) (of w coefficients), and a ULISSE Envelope built on D', uENV_paaENV[D', ℓ_min, ℓ_max, a, γ, s] = [L, U], we define:

LB_PaL(PAA(dtwENV_r(D)), uENV_paaENV[D', ...]) = √s × √( Σ_{i=1}^{w} h'_i ), where
    h'_i = (β_l(iSAX(L)_i) − PAA(U^DTW)_i)²  if β_l(iSAX(L)_i) > PAA(U^DTW)_i  (*)
    h'_i = (PAA(L^DTW)_i − β_u(iSAX(U)_i))²  if PAA(L^DTW)_i > β_u(iSAX(U)_i)  (**)
    h'_i = 0  otherwise.    (8)

Lemma 3
Given two data series D and D', where ℓ_min ≤ |D| ≤ ℓ_max, the distance LB_PaL(PAA(dtwENV_r(D)), uENV_paaENV[D', ℓ_min, ℓ_max, a, γ, s]) is always smaller than or equal to DTW(D, D'_{i,|D|}), for each i such that a ≤ i ≤ a + γ + 1 and |D'| − (i − 1) ≥ ℓ_min.

Intuitively, the lemma states that the LB_PaL function always provides a measure that is smaller than the true DTW distance between D and each subsequence of D' of the same length that is represented by uENV_paaENV[D', ℓ_min, ℓ_max, a, γ, s].

Proof (sketch)  We want to prove that LB_PaL(PAA(dtwENV_r(D)), uENV_paaENV[D', ℓ_min, ℓ_max, a, γ, s]) is equal to min_i { LB_Keogh_PAA_iSAX(PAA(dtwENV_r(D)), iSAX(D'_{i,|D|})) }, where D'_{i,|D|} is a subsequence of D' represented by uENV_paaENV[D', ℓ_min, ℓ_max, a, γ, s]. The lemma clearly holds if LB_PaL yields zero, since the DTW distance between two series is always positive or equal to zero. We thus examine the case where Equation 8 provides a strictly positive value. In the first case, the i-th lower iSAX breakpoint of L in the ULISSE Envelope, β_l(iSAX(L)_i), is greater than the i-th PAA coefficient of U^DTW, namely PAA(U^DTW)_i. This implies that any other i-th iSAX coefficient contained in the ULISSE Envelope is necessarily greater than β_l(iSAX(L)_i), and thus than PAA(U^DTW)_i. Hence, Equation 8 is equivalent to the smallest value we can obtain from the first branch of LB_Keogh_PAA_iSAX, computed between the i-th iSAX coefficient of each subsequence in D' (represented in the ULISSE Envelope) and the i-th PAA coefficient of PAA(U^DTW). LB_Keogh_PAA_iSAX always yields a value that is smaller than or equal to the true DTW distance, with warping window r.

The second case is symmetric. Here, the β_u(iSAX(U)_i) breakpoint is the closest to PAA(L^DTW)_i, being greater than or equal to any other i-th iSAX coefficient of the ULISSE Envelope. Therefore, Equation 8 is equivalent to the smallest value we can obtain from the second branch of LB_Keogh_PAA_iSAX, computed between the i-th iSAX coefficient of each subsequence in D' (represented in the ULISSE Envelope) and the i-th coefficient of PAA(L^DTW). □

In Figure 13(c), we depict an example that shows the computation of LB_PaL between the DTW Envelope built around the prefix of D and the ULISSE Envelope of the series D'. For the latter, the settings are: a = 1, ℓ_min = 153, ℓ_max = 255, γ = 0 and s = 51. In the figure, we represent the iSAX coefficients of the ULISSE Envelope with (gray) rectangles delimited by their breakpoints (dashed horizontal lines). The coefficients of
PAA(U^DTW) and PAA(L^DTW) are represented by red and green solid segments.

6.3 Approximate search

Similarity search performed on the ULISSE index relies on Equation 5 (Euclidean distance) and Equation 8 (DTW distance) to prune the search space. This allows us to navigate the tree, visiting the most promising nodes first. We thus provide a fast approximate search procedure, reported in Algorithm 4. In Line 7 (or Line 9, if the DTW distance is used), we start to push the internal nodes of the index into a priority queue, where the nodes are sorted according to their lower bounding distance to the query. Note that in this comparison we use the largest prefix of the query that is a multiple of the PAA segment length used at the index building stage (Line 1). Recall that when the search is performed using the DTW measure, the PAA representation of the query is computed on the DTW envelope (dtwENV_r) of the segment-length multiple that completely contains the query (Line 2). This envelope is composed of two series, which encode the possible warping alignments according to the warping window r. Therefore, the PAA representation is composed of two sets of coefficients, i.e., PAA(L^DTW) and PAA(U^DTW), as we depict in Figure 13(c). Then, the algorithm pops the ordered nodes from the queue, visiting their children in the loop of Line 10. In this part, we still maintain the internal nodes ordered (Lines 34-35).

As soon as a leaf node is discovered (Line 12), we check whether its lower bounding distance to the query is shorter than the bsf. If it is not, the dataset does not contain any data series that are closer to the query than those already compared; in this case, the approximate search result coincides with that of the exact search. Otherwise, we load the raw data series pointed to by the Envelopes in the leaf, which are in turn sorted according to their position, to avoid random disk reads. We visit a leaf only if it contains Envelopes that represent sequences of the same length as the query. Each time we compute either the true Euclidean distance (Line 19) or the true DTW distance (Line 21), the best-so-far distance (bsf) is updated, along with the R_a vector. Since priority is given to the most promising nodes, we can terminate our visit when, at the end of a leaf visit, the k bsf's have not improved (Line 22). Hence, the vector R_a contains the k approximate query answers.

6.4 Exact search

Note that the approximate search described above may not visit leaves that contain answers better than the approximate answers already identified, and therefore, it may fail to produce exact, correct results.
We now describe an exact nearest neighbor search algorithm, which finds the k sequences with the absolute smallest distances to the query.

In the context of exact search, accessing disk-resident data following the order of the lower bounding distances may result in several leaf visits: this process can only stop after finding a node whose lower bounding distance is greater than the bsf, which guarantees the correctness of the results. This would penalize computational time, since performing many random disk I/O operations might degenerate unpredictably.

We can avoid this bottleneck by sorting the Envelopes, and in turn the disk accesses. Moreover, we can exploit the bsf provided by the approximate search, in order to perform a sequential scan with pruning over the sorted Envelope list (this list is stored alongside the ULISSE index). Intuitively, we rely on two aspects. First, the bsf can translate into a bound tight enough to prune the candidate answers. Second, since the list has no hierarchical structure, every Envelope is stored with the highest cardinality available,
Algorithm 4:
ULISSE k-NN-
Approx
Input: int k , float [] Q , ULISSE index I, int r // warping window Output: float [ k ][ | Q | ] R a , float [] bsf float [] Q ∗ ← P AA ( Q ,.., (cid:98)| Q | /I.s (cid:99) ) ; float [][] Q ∗ dtw ← P AA ( dtwENV r ( Q ,.., (cid:98)| Q | /I.s (cid:99) )) ; float[k] bsf ← {∞ , ..., ∞} ; PriorityQueue nodes; foreach node in I.root.children() do if Euclidean distance search then nodes.push ( node, mindist ULiSSE ( Q ∗ , node )) ; else if DTW search then nodes.push ( node, LB PaL ( Q ∗ dtw , node )) ; while n = nodes.pop() do if n.isLeaf() and n.containsSize( | Q | ) then if n. lowerBound < bsf[k] then // sort according disk pos. uENV [] Envelopes = sort ( n.Envelopes ) ; // iterate the Env. and compute true ED oldBSF ← bsf [k]; foreach E in Envelopes do float [] D ← readSeriesFromDisk(E); for i ← E.a to min (E.a+E. γ +1, | D | − ( | Q | − ) do if Euclidean distance search then ED updateBSF ( Q, E.D i, | Q | , k, bsf , R a ); else if DTW search then DT W updateBSF ( Q, E.D i, | Q | , k, bsf , R a , r ); // if bsf has not improved end visit. if oldBSF == bsf[k] then break ; else break; // Approximate search is exact. else LBleft ← , LBright ← ; if Euclidean distance search then LBleft ← mindist ULISSE ( Q ∗ , n.left ) ; LBright ← mindist ULISSE ( Q ∗ , n.right ) ; else if DTW search then LBleft ← LB PaL ( Q ∗ dtw , n.left ) ; LBright ← LB PaL ( Q ∗ dtw , n.right ) ; nodes.push ( n.left, LBleft ) ; nodes.push ( n.right, LBright ) ; Algorithm 5:
ULISSE k-NN-
Exact
Input: int k , float [] Q , ULISSE index I, int r // warping window Output: float [ k ][ | Q | ] R float [] Q ∗ ← P AA ( Q ,.., (cid:98)| Q | /I.s (cid:99) ) ; float [][] Q ∗ dtw ← P AA ( dtwENV r ( Q ,.., (cid:98)| Q | /I.s (cid:99) )) ; float [] bsf , float [ k ][ | Q | ] R ← k - NN - Approx ( k, Q, I ) ; if bsf is not exact then foreach E in I.inMemoryList do LBDist ← ; if Euclidean distance search then LBDist ← mindist ULiSSE ( Q ∗ , E ) ; else if DTW search then LBDist ← LB PaL ( Q ∗ dtw , E ) ; if LBDist< bsf[k] then float [] D ← readSeriesFromDisk(E); for i ← E.a to min (E.a+E. γ +1, | D | − ( | Q | − ) do if Euclidean distance search then ED updateBSF ( Q, D i, | Q | , k, bsf , R ) ; else if DTW search then l ← LB Keogh ( dtwENV r ( Q ) , D i, | Q | ) ; if l < bsf[k] then DT W updateBSF ( Q, D i, | Q | , k, bsf , R ) ; which guarantees a fine representation of the series, and cancontribute to the pruning process.Algorithm 5 describes the exact search procedure. In thecase of Euclidean distance search, in Line 8 we compute thelower bound distance between the Envelope and the query. On the other hand, when DTW distance is used, we com-pute the lower bound distance in Line 10. If it is not smallerthan the k th bsf , we do not access the disk, pruning Eu-clidean Distance computations as well. Note that while weare computing the true distances, we can speed-up compu-tations using the Early Abandoning technique [57], whichworks both for Euclidean and DTW distances. In the caseof DTW distance, prior to computing the raw distance, wehave a further possibility to prune computations using the LB Keogh (Equation 6) in Line 17. This permits to obtaina lower bounding measure in linear time, avoiding the fullDTW calculation.6.5 Complexity of query answeringWe provide now the time complexity analysis of queryanswering with
ULISSE . Both the approximate and exactquery answering time strictly depend on data distribution asshown in [74]. We focus on exact query answering, sinceapproximate is part of it.
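As an aside, the Early Abandoning test used when computing true distances in Algorithm 5 can be sketched as follows. This is a minimal illustration in C; the function and variable names are ours, not taken from the ULISSE code:

```c
#include <stddef.h>

/* Early-abandoning squared Euclidean distance: accumulate (q[i]-c[i])^2
 * and stop as soon as the partial sum exceeds the best-so-far squared
 * distance bsf2, since the candidate can no longer beat the bsf.
 * Returns the exact squared distance, or a value > bsf2 if abandoned. */
double ed2_early_abandon(const double *q, const double *c, size_t n, double bsf2)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = q[i] - c[i];
        sum += d * d;
        if (sum > bsf2)   /* abandon: candidate cannot improve the bsf */
            return sum;
    }
    return sum;
}
```

The same idea applies to a DTW accumulation: any monotonically non-decreasing partial cost can be tested against the bsf after each step.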
Best Case.
In the best case, an exact query visits one leaf at the stage of the approximate search (Algorithm 4), and during the second leaf visit fulfills the stopping criterion (i.e., the bsf distance is smaller than the lower bounding distance between the second leaf and the query). Given the number of first-layer nodes (root children) N, the length of the first leaf path L, and the number of Envelopes in the leaf S, the best case complexity is given by the cost of iterating over the first-layer nodes and descending to the leaf while keeping the nodes sorted in the heap: O(w(N + L log L)), where w is the number of symbols checked at each lower bounding distance computation. We recall that computing the lower bound of the Euclidean or DTW distance has the same time complexity. Moreover, we need to take into account the additional cost of sorting the disk accesses and computing the true distances in the leaf, which is O(S(log S + W)) in the case of the Euclidean distance, and O(S(log S + rW)) for the DTW distance, where W = ℓmin(ℓmax − ℓmin + γ + 1) represents the maximum number of points to check in each Envelope, and r is the warping window length. Note that we always perform disk accesses sequentially, avoiding random disk I/O. Each disk access in ULISSE reads at most Θ(ℓmax + γ) points.

Worst Case.
The worst case for exact search occurs when, at the approximate search stage, the complete set of leaves, which we denote by T, needs to be visited. This has a cost of O(w(N + T L log L)), plus the cost of computing the true distances, which takes O(T(S(log S + W))) for the Euclidean distance, or O(T(S log S + S r W)) for the DTW distance, where (as above) W = ℓmin(ℓmax − ℓmin + γ + 1). Note, though, that this worst case is pathological: it arises, for example, when all the series in the dataset are the same straight lines (only slightly perturbed). Evidently, the very notion of indexing does not make sense in this case, where all the data series look the same. As we show in our experiments on several datasets, in practice the approximate algorithm always visits a very small number of leaves.

ULISSE k-NN Exact complexity.
So far, we have considered exact k-NN search with regard to Algorithm 4 (approximate search). When this algorithm produces approximate answers, providing just an upper bound bsf, we must run Algorithm 5 (exact search) in order to compute the exact answers. The complexity of this procedure is given by the cost of iterating over the Envelopes and computing the mindist, which takes O(Mw) time, where M is the total number of Envelopes in the index. Let us denote by V the number of Envelopes for which the raw data are retrieved from disk and checked. Then, the algorithm takes an additional O(VW) time to compute the true Euclidean distances, or O(VrW) to compute the true DTW distances, with W = ℓmin(ℓmax − ℓmin + γ + 1).

ε-range search adaption. We note that Algorithm 5 can be easily adapted to support ε-range search, without affecting its time complexity. In order to retrieve all answers with distance less than a given threshold ε ∈ R, we just need to replace the bound bsf[k] with ε in the test of line 11. Subsequently, if the test is true, the algorithm computes the real distances between the query and all candidates in D (line 13), simply keeping the subsequences with distances lower than ε.

7 Experimental Evaluation

All the experiments presented in this section are completely reproducible: the code and datasets we used are available online [1]. We implemented all algorithms (indexing and query answering) in C (compiled with gcc 4.8.2). We ran the experiments on an Intel Xeon E5-2403 (4 cores @ 1.9GHz), using the x86_64 GNU/Linux OS environment.
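The ε-range adaptation described above amounts to swapping the k-th best-so-far bound for the fixed threshold ε. A minimal sketch of the resulting candidate filter follows; the names are ours, and for illustration the lower bounds and true distances are passed in as precomputed arrays (in ULISSE they come from mindist/LB_PaL and from on-demand distance computations):

```c
#include <stddef.h>

/* epsilon-range filtering: a candidate is examined only if its lower
 * bounding distance is below eps (pruning test), and it is reported
 * only if its true distance is below eps. */
size_t eps_range_filter(const double *lb, const double *dist,
                        size_t n, double eps, size_t *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        if (lb[i] >= eps)      /* pruned: true distance >= lb >= eps */
            continue;
        if (dist[i] < eps)     /* verified answer */
            out[m++] = i;
    }
    return m;
}
```

Because the threshold is fixed, the candidates can be scanned in any order, which is why the adaptation does not change the asymptotic cost of Algorithm 5.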
Algorithms.
We compare ULISSE to the Compact Multi-Resolution Index (CMRI) [24], which is the current state-of-the-art index for similarity search with varying-length queries (recall that CMRI constructs a limited number of distinct indexes for series of different lengths). We note, though, that in contrast to our approach, CMRI can only support non Z-normalized sequences. Furthermore, we compare ULISSE to KV-Match [67], which is the state-of-the-art indexing technique for ε-range queries that supports the Euclidean and DTW measures over non Z-normalized sequences (remember that, as we discussed in Section 2, for Z-normalized data KV-Match only supports exact search for constrained ε-range queries). Finally, we consider Index Interpolation (IND-INT) [43]. This method adapts an index-based ε-range algorithm, which supports Z-normalization, to answer k-NN queries of variable length. In addition, we compare to the current state-of-the-art algorithms for subsequence similarity search, the
UCR Suite [57] and MASS [45]. Note that only the UCR Suite works with both the Euclidean and DTW measures, whereas MASS supports only similarity search using the Euclidean distance. These algorithms do not use an index, but are based on optimized serial scans; they are natural competitors, since they can process overlapping subsequences very fast.
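For context, the LB_Keogh lower bound used to prune DTW computations (Section 6) can be sketched as follows. This is a minimal version, assuming the query's warping envelope (per-point minimum lo[i] and maximum hi[i] of the query within the warping window) has already been computed:

```c
#include <stddef.h>

/* LB_Keogh: a lower bound on the squared DTW distance between the
 * query and a candidate c, computed in linear time. Points of c that
 * fall inside the query's envelope [lo, hi] contribute nothing; points
 * outside contribute their squared distance to the envelope boundary. */
double lb_keogh_sq(const double *c, const double *lo, const double *hi, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (c[i] > hi[i]) {
            double d = c[i] - hi[i];
            sum += d * d;
        } else if (c[i] < lo[i]) {
            double d = lo[i] - c[i];
            sum += d * d;
        }
    }
    return sum;
}
```

A candidate whose LB_Keogh value already exceeds the best-so-far can be discarded without running the quadratic DTW computation.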
Datasets.
For the experiments, we used both synthetic and real data. We produced the synthetic datasets with a generator, where a random number is first drawn from a Gaussian distribution N(0,1); then, at each time point, a new number is drawn from this distribution and added to the value of the last number. This kind of data generation has been extensively used in the past [74,73], and has been shown to effectively model real-world financial data [16].

The real datasets we used are:
– (GAP), which contains the recording of the global active electric power in France for the period 2006-2008. This dataset is provided by EDF (the main electricity supplier in France) [35];
– (CAP), the Cyclic Alternating Pattern dataset, which contains the EEG activity occurring during the NREM sleep phase [64];
– (ECG) and (EMG), signals from Stress Recognition in Automobile Drivers [20];
– (ASTRO), which contains data series representing celestial objects [63];
– (SEISMIC), which contains seismic data series, collected from the IRIS Seismic Data Access repository [22].

In our experiments, we test queries of lengths points, since these cover at least of the ranges explored in works on data series indexing over the last two decades [28,5,65].

7.1 Envelope Building

In the first set of experiments, we analyze the performance of the ULISSE indexing algorithm. Note that the indexing algorithm is oblivious to the distance measure used at query time.

In Figure 14(a), we report the indexing time (Envelope building and bulk loading operations) when varying γ. We use a dataset containing series of length , fixing ℓmin = 160 and ℓmax = 256. Observe that, when γ = 0, the algorithm needs to extract as many Envelopes as the number of master series of length ℓmin. This generates a significant overhead for the index building process (due to the maximal Envelopes generation), and it also does not take into account the contiguous series of the same length when computing the statistics needed for Z-normalization.
A larger γ speeds up the Envelope building operation by several orders of magnitude, and this is true for a very wide range of γ values (Figure 14(a)). These results mean that the uENV_norm building algorithm can achieve good performance in practice, despite its complexity being quadratic in γ.

Fig. 14 (a) Construction and bulk loading time (log scale) of Envelopes in 5GB datasets, varying γ (5M series of length 256), ℓmin = 160, ℓmax = 256. (b) Construction and bulk loading time (log scale) of Envelopes in a 5GB dataset (2.5M series of length 512), varying ℓmax − ℓmin (length range), γ = 256, fixed ℓmax = 512.

In Figure 14(b), we report an experiment where γ is fixed, and the query length range (ℓmax − ℓmin) varies. We use a dataset of the same size as the previous one, which contains series of length . The results show that increasing the range has a linear impact on the final running time.

7.2 Exact Search Similarity Queries with Euclidean Distance

We now test ULISSE on exact 1-Nearest Neighbor queries using the Euclidean distance. We have repeated this experiment varying the
ULISSE param-eters along predefined ranges, which are (default inbold) γ : [0% , , , , , %] , wherethe percentage is referring to its maximum value, (cid:96) min : [96 , , , , , , (cid:96) max : [256] , datasetseries length ( (cid:96) S ): [ , , , , , anddataset size of GB . Here, we use synthetic datasetscontaining random walk data in binary format, where asingle point occupies 4 bytes. Hence, in each dataset C ,where | C | Bytes denotes the corresponding size in bytes,we have a number of subsequences of length (cid:96) given by N seq = ( (cid:96) S − (cid:96) + 1) × (( | C | Bytes / /(cid:96) S ) . For instance,in a dataset, containing series of length , we have ∼
500 Million subsequences of length .We record the average
CPU time , query disk I/O time (time to fetch data from disk: Total time - CPU time), and pruning power (percentage of the total number of Envelopesin the index that do not need to be read), of queries,extracted from the datasets with the addition of Gaussiannoise. For each index used, the building time and the relative size are reported. Note that we clear the main memory cachebefore answering each set of queries. We have conducted our
Method | γ | Indexing time (h) | Indexing size (GB) | No. of Records/Envelopes | No. of indexes
CMRI | - |
ULISSE | 0% |
ULISSE | 20% |
ULISSE | 40% |
ULISSE | 60% |
ULISSE | 80% |
ULISSE | 100% |

Fig. 15
Query answering time performance, varying γ, on non Z-normalized data series. a) ULISSE average query time (CPU + disk I/O). b) ULISSE average query disk I/O time. c) ULISSE average query pruning power. d) Comparison of ULISSE to other techniques (cumulative indexing + query answering time). e) Table summarizing the indexes' properties.

experiments using datasets that are both smaller and larger than the main memory. In all experiments, we report the cumulative running time of random queries for each query length.
Varying γ . We first present results for similarity searchqueries on
ULISSE when we vary γ, ranging from 0 to its maximum value, i.e., ℓmax − ℓmin. In Figure 15, we report the results concerning non Z-normalized series (for which we can compare to CMRI). We observe that grouping contiguous and overlapping subsequences under the same summarization (Envelope) by increasing γ positively affects the performance of index construction, as well as query answering (Figures 15(a) and (d)). The latter may seem counterintuitive, since γ negatively influences the pruning power, as depicted in Figure 15(c). Indeed, inserting more master series into a single Envelope is likely to generate large containment areas, which are not tight representations of the data series. On the other hand, it leads to an overall number of Envelopes that is several orders of magnitude smaller than the one for γ = 0%. In the latter case, when γ = 0, the algorithm inserts into the index as many records as the number of master series present in the dataset ( M), as reported in Figure 15(e).

We note that the disk I/O time on compact indexes is not negatively affected at the same rate as the pruning power. On the contrary, in certain cases it becomes faster. For example, the results in Figure 15(b) show that for query length 160, the γ = 100% index is more than 2x faster in disk I/O than the γ = 0% index, despite the fact that the latter index has a higher average pruning power (Figure 15(c)). This behavior is favored by disk caching, which translates into a higher hit ratio for queries with a slightly larger disk load. We note that we repeated this experiment several times, with different sets of queries that hit different disk locations, in order to verify this specific behavior. The results showed that this disk I/O trend always holds.

While disk I/O represents on average − % of the total query cost, computational time significantly affects the query performance. Hence, a compact index, containing a smaller number of Envelopes, permits a fast in-memory sequential scan, performed by Algorithm 5.

In Figure 15(d), we show the cumulative time performance (i.e., , queries in total), comparing ULISSE, CMRI, and
UCR Suite . Note that in this experiment,
ULISSE indexing time is negligible w.r.t. the query answering time.
ULISSE outperforms both the UCR Suite and CMRI, achieving a speed-up of up to x.

Further analyzing the performance of CMRI, we observe that it constructs four indexes (for four different lengths), generating more than B index records. Consequently, it is clear that the size of these indexes negatively affects the performance of CMRI, even if it achieves reasonable pruning ratios. These results suggest that the idea of generating multiple copies of an index for different lengths is not a scalable solution.

In Figure 16, we show the results of the previous experiment when using Z-normalization. We note that in this case the query answering time has an overhead generated by the Z-normalization, which is performed on-the-fly during the similarity search stage. Overall, we observe exactly the same trend as in non Z-normalized query answering.
ULISSE is still x faster than the state-of-the-art, namely the UCR Suite.

Varying Length of Data Series.
In this part, we present the results concerning the query answering performance of
ULISSE and
UCR Suite, as we vary the length of the sequences in the indexed datasets, as well as the query length (refer to Figure 17). In this case, varying the data series length in the collection leads to a growth of the search space, in terms of overlapping subsequences, as reported in Figure 17(e). This certainly penalizes index creation, due to the inflated number of Envelopes that need to be generated. On the other hand, the
UCR Suite takes advantage of the high overlap of the subsequences during the in-memory scan. Note that we do not report the results for CMRI in this experiment, since its index building time would take up to . In the same amount of time,
ULISSE answers more than , queries.

Fig. 16 ULISSE indexing and exact query answering performance on Z-normalized series, varying γ. a) Average query time (CPU + disk I/O). b) Average query disk I/O time. c) Average query pruning power. d) Comparison to other techniques (cumulative indexing + query answering time). e) Table summarizing the indexes' properties.

Observe that in Figures 17(a) and (c), ULISSE shows better query performance than the UCR Suite, growing linearly as the search space gets exponentially larger. This demonstrates that
ULISSE offers a competitive advantage in terms of pruning the search space, which eclipses the pruning techniques of the UCR Suite. The aggregated time for answering , queries ( , for each query length) is 2x in favor of ULISSE when compared to the UCR Suite (Figures 17(b) and (d)).

Comparison to Serial Scan Algorithms using Euclidean Distance. We now perform further comparisons to serial scan algorithms, namely,
MASS and
UCR Suite, with varying query lengths.
MASS [45] is a recent data series similarity search algorithm that computes the distances between a Z-normalized query of length l and all the Z-normalized overlapping subsequences of a single sequence of length n ≥ l. MASS works by calculating the dot products between the query and the n overlapping subsequences in the frequency domain, in O(n log n) time, which then permits computing each Euclidean distance in constant time. Hence, the time complexity of MASS is O(n log n), and is independent of the data characteristics and of the query length (l). In contrast, the pruning effectiveness of the UCR Suite may be significantly affected by the data characteristics.

We compared
ULISSE (using the default parameters), MASS, and UCR Suite on a dataset containing data series of length . In Figure 17(f), we report the average query time (CPU + disk I/O) of the three algorithms. We note that MASS, which in some cases is outperformed by the UCR Suite and ULISSE, is strongly penalized,
Fig. 17 Query answering time performance of ULISSE and UCR Suite, varying the data series size. Average query time (CPU + disk I/O): (a) for non Z-normalized, (c) for Z-normalized series. Cumulative indexing + query answering time: (b) for non Z-normalized, (d) for Z-normalized series. (e) Table summarizing the indexes' properties. (f) Comparison between the MASS algorithm, UCR Suite, and ULISSE.
Fig. 18
Query answering time, varying the range of query length on Z-normalized data series. (a) ULISSE average query time (CPU + disk I/O). (b) ULISSE average query disk I/O time. (c) ULISSE average query pruning power. (d) ULISSE comparison to other techniques (cumulative indexing + query answering time).

when run over a high number of non-overlapping series. The reason is that, although MASS has a low time complexity of O(n log n), the Fourier transforms (computed on each subsequence) have a non-negligible constant time factor that renders the algorithm suitable for computations on very long series.

Varying Range of Query Lengths.
In the last experiment of this subsection, we investigate how varying the length range [ℓmin, ℓmax] affects query answering performance. In Figure 18, we depict the results for Z-normalized sequences. We observe that enlarging the range of query lengths influences the number of Envelopes we need to accommodate in our index. Moreover, a larger query length range corresponds to a higher number of series (different normalizations), which the algorithm needs to consider for building a single Envelope (loop of line 16 of Algorithm 2). This leads to large containment areas and, in turn, coarse data summarizations. In contrast, Figure 18(c) indicates that the pruning power slightly improves as the query length range increases. This is justified by the higher number of Envelopes generated when the query length range gets larger; hence, there is an increased probability of saving disk accesses. In Figure 18(a), we show the average query time (CPU + disk I/O) on each index, observing that it is not significantly affected by the variations in the length range. The same is true when considering only the average query disk I/O time (Figure 18(b)), which accounts for − % of the total query cost. We note that the cost remains stable as the query range increases, when the query length varies between - . For queries of length and , when the range is the smallest possible, the disk I/O time increases. This is due to the high pruning power, which translates into a higher rate of cache misses. In Figure 18(d), the aggregated time comparison shows ULISSE achieving an up to x speed-up over the UCR Suite.

In Figure 19, we present the results for non Z-normalized sequences, where the same observations hold. Moreover, as we previously mentioned, when Z-normalization is not applied, the pruning power slightly increases. This leads ULISSE to a performance up to x faster than the UCR Suite.

7.3 Approximate Search Queries
Approximate Search with Euclidean Distance.
In this part, we evaluate
ULISSE approximate search. Since we compare our approach to CMRI, Z-normalization is not applied. Figure 20(a) depicts the cumulative query answering time for , queries. As previously, we note that the in-
Fig. 19
Query answering time, varying the range of query length on non Z-normalized data series. (a) ULISSE average query time (CPU + disk I/O). (b) ULISSE average query disk I/O time. (c) ULISSE average query pruning power. (d) ULISSE comparison to other techniques (cumulative indexing + query answering time).
Ranking position | Query length 160 | 192 | 224 | 256
Ulisse 1-25 | 92% | 90% | 92% | 100%
Cmri 1-25 | 8% | 10% | 8% | 0%
Fig. 20
Approximate query answering on non Z-normalized data series. (a) Cumulative indexing + approximate search query time (CPU + disk I/O) of , queries ( , per query length in [160,192,224,256]). (b) Approximate quality: percentage of answers in the relative exact search range.
ULISSE is relatively very small. On the otherhand, the time that CMRI needs for indexing is 2x more thanthe time during which
ULISSE has finished indexing and answering , queries.

In Figure 20(b), we measure the quality of the approximate search. In order to do this, we consider the exact query result ranking, showing how the approximate answers are distributed along this rank, which represents the ground truth. We note that the CMRI answers have better positions than the ULISSE ones. This happens thanks to the tighter representation generated by the complete sliding-window extraction of each subsequence, employed by CMRI. Nevertheless, this small penalty in precision is balanced out by considerable time performance gains:
ULISSE is up to 15x faster than CMRI. When we use a smaller γ (e.g., ), ULISSE shows its best time performance. This is due to a tighter
Envelope containment area, which permits finding a better best-so-far with a shorter tree index visit.
Approximate Search with DTW.
Here we evaluate the time performance of query answering, along with the quality of approximate search. We test the search using both the Euclidean and DTW measures, on a synthetic series composed of M points. We test a query length range between ℓmin = 1024 and ℓmax = 4096. The other parameters are set to their default values.

Fig. 21 Average query answering time and approximate quality, varying the query length. (a) Z-normalized search with Euclidean distance. (b) Non Z-normalized search with Euclidean distance. (c) Z-normalized search with DTW measure. (d) Non Z-normalized search with DTW measure.

In Figures 21(a) and (b), we report the average query answering time for the Z-normalized and non Z-normalized cases, respectively. The results show that ULISSE answers queries up to one order of magnitude faster than the
UCR Suite. Furthermore, we note that
ULISSE scales better as the query length increases. This shows that our pruning strategy over summarized data, as well as having a good bsf approximate answer early on, represents a concrete advantage when pruning the search space.

In Figures 21(c) and (d), we report the time performance of query answering with the DTW measure, considering both Z-normalized and non Z-normalized search. In Figure 21(c), we observe that
ULISSE answers queries slightly slower than the
UCR Suite for three of the query lengths. This behavior is explained by the fact that the (overlapping) subsequences represented by the Envelopes have a total size ∼ x bigger than the original data points. In this case, the pruning power does not mitigate this disadvantage.

Overall, the results show that ULISSE is a scalable solution. Moreover, the approximate search, which in this experiment does not visit more than leaves in the tree, represents a very fast solution, approximating well the exact answer (refer to the tables below each plot of Figure 21). We observed the same trend of visited leaves in all the experiments presented in this work. This means that in practice the time complexity of the approximate search is very close to its best case, having a constant query answering complexity.

(c) Number of Envelopes generated, varying γ:
γ (%) | Number of Envelopes
5 | 9,600,000
10 | 4,800,000
20 | 2,400,000
40 | 1,200,000
60 | 800,000
80 | 600,000
100 | 500,000

Fig. 22
Indexing time for five real datasets (ASTRO, EMG, EEG, ECG, GAP), varying the number of master series in the Envelope (γ). The datasets contain K data series of length , with ℓmin = 160 and ℓmax = 256. (a) Indexing of non Z-normalized series. (b)
Indexing time of Z-Normalized series.
ULISSE
Envelope on query answering time. Moreover, we want to analyze the impact of the DTW measure on query time performance.
Indexing.
For this experiment, we used five real datasets, where each one contains data series of length 256 (ASTRO, EMG, EEG, ECG, GAP). We show in Figure 22(a,b) the indexing time performance, varying γ, for both non Z-normalized and Z-normalized sequences. Recall that γ is expressed as a percentage of the maximum number of master series, which is ℓmax − ℓmin. The results confirm the trend depicted in Figure 14, where the time to build the ULISSE
Envelopes that contain all the master series of each series is one order of magnitude smaller than the time to build the most compact Envelopes, obtained with γ = 5%. We also note that the overhead generated by the Z-normalization operations, which add a γ factor to the time complexity of the indexing algorithm, is amortized by the generation of ∼ x fewer Envelopes in the index, as depicted in Figure 22(c).

Query Answering with Euclidean Distance.
We report in Figure 23 the results obtained for search over Z-normalized sequences, with Euclidean distance. All parameters are set to their default values. Therefore, in these experiments we used queries of lengths between ℓmin = 160 and ℓmax = 256; the series in the datasets have length . In Figure 23(a), we report the query pruning power as the number of master series (γ) in each Envelope varies. As expected, we can prune fewer candidates when the Envelopes contain more sequences. Recall that when a candidate (subsequence) is pruned, the search does not consider its raw values, thus avoiding both the Z-normalization and the Euclidean distance computations. If a candidate is not pruned, the search can still abandon the computations early, when the running Euclidean distance is greater than the k-th bsf distance.

In Figure 23(b), we report the average abandoning power, which measures the percentage of the total number of real Euclidean distance computations that are not performed. When the search processes an increased number of overlapping subsequences, we expect a decrease in the number of computations performed. We note that the search avoids more computations when the Envelopes contain a large number of subsequences, namely, as γ increases.

In Figure 23(c), we report the average query time, varying γ. We obtain the highest speed-up with the most compact index (largest γ value), which is more than x faster than the state-of-the-art (the UCR Suite algorithm). This confirms the trend we observed in the previous results conducted over synthetic data. We report the average query time for each dataset in Figure 23(d), and for each query length in Figure 23(e). In Figure 23(f), we show the average number of Euclidean distance and lower bound computations performed by
ULISSE (γ = 100%) and UCR Suite, as the query length varies (this corresponds to the average number of points on which the distance to the query is computed), as well as the number of points that are loaded from disk and Z-normalized (this corresponds to the overhead generated by the Z-normalization operations). The goal of this experiment is to quantify the overall benefit of ULISSE pruning and abandoning power. (Recall that UCR Suite does not perform any lower bound distance computations when using the Euclidean distance.)

First, we observe that ULISSE performs half of the Euclidean distance computations of UCR Suite, and considers up to seven times fewer points in the Z-normalization phase. Furthermore, we note that the computation of lower bound distances has a negligible impact on the query workload, especially when the query length is smaller than the length of the series in the dataset ( ), in which case the number of candidate subsequences can be orders of magnitude larger.

In Figure 24, we depict the results of query answering without the use of Z-normalization. In this case, the results exhibit a small difference in terms of absolute pruning power values, which is higher when the search is performed on absolute series values. The average query answering time maintains the same trend we observe in Z-normalized query answering. On average, ULISSE has a x speed-up factor when compared to UCR Suite.

Query Answering with DTW Distance.
We now report the results of query answering using the DTW measure (Figure 25). For this experiment, we used the default parameter settings, and the same real datasets considered in the previous two experiments.

Fig. 23 Exact (Z-normalized) query answering with Euclidean distance on real datasets. (a) Average pruning power. (b) Average abandoning power. (c) Average query answering time. (d) Average query answering time for each dataset. (e) Average query answering time for each query length. (f) Average query workload (number of points, with γ = 100%).

We study the efficiency of query answering ( query), which uses the DTW lower bounding measures to prune the search space. In Figure 25(a) we report the average pruning power, when varying the DTW warping window from to of the subsequence length. (These values for the warping window have commonly been used in the literature [30].) We vary γ between and of its maximum value, which gives the best running time in this experiment. To avoid cluttering the plot, we omit the results for γ smaller than .

Once again, we note that the pruning power is negatively affected by the size of the Envelope (γ), and under DTW search the abandoning power slightly decreases as γ and the warping window get larger (see Figure 25(b)). This suggests that the DTW lower bound measure we propose is more sensitive than the one used for Euclidean distance. Nevertheless, in the worst case ULISSE is still able to prune of the candidates, and to abandon more than of the
DTW computations on raw values.

In Figure 25(c) we report the average query answering time varying γ, and in Figures 25(d) and (e) the average time for each dataset and for different query lengths, respectively, for γ = 100%. For these last two experiments, we observed no significant difference for the other values of γ we tested.

Fig. 24 Exact (non Z-normalized) query answering with Euclidean distance on real datasets. (a) Average pruning power. (b) Average abandoning power. (c) Average query answering time. (d) Average query answering time for each dataset. (e) Average query answering time for each query length. (f) Average query workload (number of points, with γ = 100%).

We first note that, despite the loss of pruning power of ULISSE when increasing γ, the query answering time is not significantly affected (refer to Figures 25(c) and (e)). As in the case of Euclidean distance search, the compactness of the ULISSE index plays a fundamental role in determining the query time performance, along with the pruning and abandoning power.

In Figure 25(d), we note that only in the ECG and GAP datasets does enlarging the warping window have a substantial negative effect on query time ( x slower), whereas in the other datasets, even in the worst case, the time loss is equivalent to .

In Figure 25(f), we report the average query workload of ULISSE and
UCR Suite. In contrast to Euclidean distance queries, we notice that the largest amount of work corresponds to lower bounding distance computations. Recall that ULISSE prunes the search space in two stages: first comparing the query and the data in their summarized versions using LB_PaL (Equation 8), and then computing in linear time the LB_Keogh between the query and the non-pruned candidates. In the worst case, the DTW distance point-wise computations are 10% of those performed for calculating the lower bound (query length ). In general, the total number of points considered for the whole workload is up to x smaller than for UCR Suite. We note that the pruning strategy of UCR Suite is still very competitive, since it avoids a high number of true distance computations using the LB_Keogh lower bound. Nonetheless, it has to compute the lower bound distance on the entire set of candidates. The pruning strategy implemented in ULISSE achieves up to x speedup over UCR Suite.
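As an illustration of the lower bounding machinery discussed here, the sketch below implements the standard LB_Keogh bound together with a band-constrained DTW, and checks that the bound never exceeds the true distance. This is generic illustrative code (the function names and the simple envelope scan are ours), not the ULISSE or UCR Suite implementation.

```python
import numpy as np

def lb_keogh(query, candidate, w):
    """LB_Keogh: linear-pass lower bound to DTW under a warping window w.
    Each query point is compared against the min/max envelope of the
    candidate points reachable within the window."""
    total = 0.0
    for i, q in enumerate(query):
        lo, hi = max(0, i - w), min(len(candidate), i + w + 1)
        u, l = max(candidate[lo:hi]), min(candidate[lo:hi])
        if q > u:
            total += (q - u) ** 2
        elif q < l:
            total += (l - q) ** 2
    return total ** 0.5

def dtw(a, b, w):
    """Exact DTW distance, constrained to a Sakoe-Chiba band of width w."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] ** 0.5

rng = np.random.default_rng(0)
q, c = rng.standard_normal(64), rng.standard_normal(64)
assert lb_keogh(q, c, 5) <= dtw(q, c, 5)  # the bound never overshoots
```

Because LB_Keogh is a valid lower bound, a candidate can be safely discarded whenever `lb_keogh(q, c, w)` already exceeds the current best-so-far distance, skipping the quadratic DTW computation entirely.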
Fig. 25 Exact (Z-normalized) query answering with DTW measure on real datasets. (a) Average pruning power (varying the warping window). (b) Average abandoning power (varying the warping window). (c) Average query answering time. (d) Average query answering time for each dataset. (e) Average query answering time for each query length. (f) Average query workload (number of points, with γ = 100%).

In Figure 26, we report the results of DTW search, without the application of Z-normalization. Also in this case, we note that the average pruning power of ULISSE is higher than the one we previously observed in the Z-normalized search (Figure 26(a)). On the other hand, the average abandoning power is less effective, as shown in Figure 26(b). As a consequence, we can see that the ULISSE search performs more DTW distance computations (refer to Figure 26(c)). Nevertheless, Figure 26(e) shows that on average ULISSE is up to x faster than UCR Suite, for all query lengths we tested.
Query over Large datasets with Euclidean Distance.
Here, we test ULISSE on three large synthetic datasets of sizes 100 GB, 500 GB, and 750 GB, as well as on two real series collections, i.e., ASTRO and SEISMIC (described earlier). The other parameters are the default ones. For each generated index and for the UCR Suite, we ran a set of 100 queries, for which we report the average exact search time.

In Figure 27(a) we report the average query answering time ( ) on synthetic datasets, varying the query length. These results demonstrate that ULISSE scales better than
UCR Suite across all query lengths, being up to 5x faster.

In Figure 27(b), we report the k-NN exact search time performance, varying k and picking the smallest query length, namely . Note that this is the largest search space we consider in these datasets, since each query has billion possible candidates (subsequences of length ).

Fig. 26 Exact (non Z-normalized) query answering with DTW measure on real datasets. (a) Average pruning power (varying the warping window). (b) Average abandoning power (varying the warping window). (c) Average query answering time. (d) Average query answering time for each dataset. (e) Average query answering time for each query length. (f) Average query workload (number of points, with γ = 100%).

The experimental results on real datasets confirm the superiority of ULISSE, which scales with stable performance, also when increasing the number k of nearest neighbors. Once again, it is up to 5x faster than UCR Suite, whose performance deteriorates as k gets larger.

In Figure 27(c) we report the number of disk accesses of the queries considered in Figure 27(b). Here, we count the number of times that we follow a pointer from an Envelope to the raw data on disk, during the sequential scan in Algorithm 5. Note that the number of disk accesses is bounded by the total number of Envelopes, which is reported in Figure 27(d) (along with the number of leaves and the building time for each index). We observe that in the worst case, which takes place for the ASTRO dataset with k = 100, we retrieve from disk ∼ % of the total number of subsequences. This still guarantees a remarkable speed-up over UCR Suite, which needs to consider all the raw series.

Moreover, since ULISSE can use early abandoning during exact query answering, we observe in our empirical evaluation that having the approximate answer distance available before the exact search starts permits abandoning on average of points more than UCR Suite for the same query.
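The early-abandoning mechanism just mentioned can be illustrated with a short sketch (generic code, not the actual ULISSE implementation): the squared Euclidean distance accumulation stops as soon as it exceeds the current best-so-far (bsf), so a tight bsf obtained from the approximate index visit lets many candidates be abandoned after only a few points.

```python
import math

def early_abandon_ed(query, candidate, bsf):
    """Return the Euclidean distance, or None if the running sum of
    squared differences exceeds bsf^2 and the computation is abandoned."""
    limit = bsf * bsf
    acc = 0.0
    for q, c in zip(query, candidate):
        acc += (q - c) ** 2
        if acc > limit:
            return None  # candidate cannot improve on the best-so-far
    return math.sqrt(acc)

def one_nn(query, subsequences, bsf=math.inf):
    """1-NN scan seeded with a bsf, e.g. obtained from an approximate
    search; a smaller initial bsf means earlier abandons."""
    best, answer = bsf, None
    for s in subsequences:
        d = early_abandon_ed(query, s, best)
        if d is not None and d < best:
            best, answer = d, s
    return best, answer
```

Seeding `one_nn` with the approximate answer distance, instead of infinity, is exactly what makes the subsequent exact scan cheaper.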
Query over Large datasets with DTW.
We conclude this part of the evaluation by reporting the results of query answering on large datasets using the DTW distance. In Figure 28, we report the time performance of ( search) on the ASTRO, SEISMIC and synthetic datasets, each one containing M data series of length

Fig. 27 Exact and approximate similarity search on Z-normalized synthetic and real datasets. (a) Average exact query time (CPU + disk I/O) on synthetic datasets. (b) Average exact k-NN query time (CPU + disk I/O) on real datasets (100 GB), varying k. (c) Average disk accesses of k-NN queries. (d) Indexing measures for all datasets.
Fig. 28 Average exact query time with DTW distance (CPU + disk I/O) on real and synthetic datasets.

( GB). Also in this case, ULISSE guarantees a consistent speed-up over UCR Suite, being at least ∼ . x faster in the worst case (ASTRO dataset, query length ), and up to one order of magnitude faster (synthetic dataset, query length ).

7.5 ULISSE vs Index Interpolation

In this section, we compare ULISSE to the Index Interpolation (
IND-INT) method. IND-INT works by means of ε-range query answering at a fixed length, which serves the answering of variable-length k-NN queries. We consider the real datasets previously introduced, and each query is answered using k = 1 (Nearest Neighbor). We set the other parameters to their default values. The data series collections contain raw data series of fixed length 4096. Hence, the number of candidates changes according to the (variable) length of the query subsequence. We considered queries of lengths between 256 and 4096. In the left part of Figure 29(a), we report the total number of candidate answers for each of these lengths.

In order to test the scalability of the approaches as a function of the query length range, we test three different ranges: (256 - 512), (256 - 2048), and (256 - 4096). IND-INT must build an index using the smallest query length (256), extracting one record for each candidate. We thus have the same index for all three ranges. In contrast, for each one of the above three query length ranges, we can build a different
ULISSE index, whose sizes are reported in the right part of Figure 29(a). Observe that all three ULISSE indexes have the same number of records (Envelopes), and that their sizes are two orders of magnitude smaller than the IND-INT index.

In Figure 29(b), we report the total query answering time (CPU + disk I/O; y-axis in log scale), as we vary the query length in the three chosen ranges. ULISSE answers 1-NN queries more than x faster than IND-INT in all these cases. We observe the same in Figure 29(c), where we report only the disk I/O time. Recall that ULISSE starts by performing an approximate search that visits first the most promising nodes of the index. In contrast, IND-INT issues an ε-range query, and thus must probe each summarized record in the index, and then access the disk whenever the lower bounding distance between the query and the record is smaller than ε. This operation translates to a significant time cost.

Note that in this set of experiments, we use for IND-INT an ε equal to the distance between the query and its approximate answer, which we obtain by first running the query using ULISSE. This method for choosing ε (proposed in the RangeTopK algorithm [19]) favors IND-INT, since it provides a good initial value.

Our experiments show that ULISSE scales better as the query length increases. As we observe in Figure 29(d), ULISSE always prunes more records as the query length increases, whereas IND-INT exhibits an unstable pruning pattern that depends on the query length.

Fig. 29 Results of k-NN search comparison between ULISSE and IND-INT. (a) Data and index properties. (b) Query answering time. (c) Disk I/O. (d) Pruning power (%).

Overall, ULISSE uses a more succinct index that permits it to scale better (since there are fewer records to iterate over). This translates to reduced CPU time, as well as fewer disk accesses.

7.6 ε-Range Queries

In this last part, we test the ULISSE search algorithm on the ε-range query task. To this end, we adapted Algorithm 5 so that, given as input ε ∈ R, it computes the set of subsequences that have a distance to the query smaller than or equal to ε. Similarly, we also adapted the UCR Suite to support ε-range search. As additional competitors, we consider
IND-INT (only for Euclidean distance), and KV-Match, which is the state-of-the-art index-based solution for exact ε-range queries on non Z-normalized data series.

In this experiment, we used five different real datasets, each composed of a single data series of a different length, as reported in Figure 30(a). For each of these datasets, we can see that ULISSE builds its index times faster than KV-Match. This is because KV-Match is based on the construction of multiple indexes. Specifically, it builds different indexes by performing a sliding-window extraction at different lengths. At query answering time, KV-Match recombines the query answers coming from the different indexes.

For our ε-range queries, we set the ε parameter to twice the NN distance of each query. In this manner, we simulate an exploratory analysis task. We report the average query selectivity in Figure 30(b). We note that in the ECG dataset the selectivity is very high. This is due to the periodic/cyclical nature of this kind of data, which contains repeating heartbeat subsequences that are very similar. In the other datasets, we have different values of selectivity, ranging from . to , when using Euclidean distance. On the other hand, when the DTW measure is considered, we observe a significant increase of the answer-set cardinality.

In Figures 30(c) and (d), we show the average query answering time for Euclidean distance, when varying the query length and the dataset, respectively. The results show that IND-INT does not represent a competitive alternative. In fact, only in one case, when the number of candidates is the smallest (i.e., ), does it perform better than the other approaches. When the number of possible answers increases (smaller query lengths), we observe that the performance of IND-INT becomes more than an order of magnitude worse than the rest. We note that in this case ULISSE and KV-Match have no substantial difference in their time performance, with ULISSE being slightly better.

However, when we consider the DTW distance, ULISSE becomes up to one order of magnitude faster than KV-Match, as shown in Figures 30(e) and (f). This difference is pronounced for the two largest datasets: ULISSE is 3x faster for ECG, and 10x faster for GAP. It is important to note that since KV-Match needs to recombine the answers from the different index structures, its time performance is affected by this refinement phase of query answering, and is rather sensitive to dataset size and query selectivity.
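The ε-range adaptation described above (given ε, return every subsequence within distance ε of the query) can be sketched as a serial scan; this is illustrative code under our own naming, not the adapted Algorithm 5 itself. Note that the fixed threshold ε plays the role of the best-so-far in early abandoning.

```python
import math

def epsilon_range(query, subsequences, eps):
    """Return (distance, subsequence) pairs for all subsequences whose
    Euclidean distance to the query is <= eps; the fixed threshold
    allows abandoning a candidate as soon as its partial sum exceeds eps^2."""
    limit = eps * eps
    answers = []
    for s in subsequences:
        acc = 0.0
        for q, c in zip(query, s):
            acc += (q - c) ** 2
            if acc > limit:
                break  # candidate is already farther than eps
        else:
            answers.append((math.sqrt(acc), s))
    return answers
```

Setting `eps` to twice the 1-NN distance, as in the experiment above, turns the same scan into an exploratory query that returns the neighborhood of the best match.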
8 Conclusions

Similarity search is one of the fundamental operations in several data series analysis tasks. Even though much effort has been dedicated to the development of indexing techniques that can speed up similarity search, all existing solutions are limited by the fact that they can only support queries of a fixed length.

In this work, we proposed ULISSE, the first index able to answer similarity search queries of variable length, over both Z-normalized and non Z-normalized sequences, supporting the Euclidean and DTW distances, for answering both k-NN and ε-range queries, either exactly or approximately. We experimentally evaluated our indexing and similarity search algorithms on synthetic and real datasets, demonstrating the effectiveness and efficiency (in space and time cost) of the proposed solution.
References

2. Machine Learning in Time Series Databases (and Everything Is a Time Series!) Outline of Tutorial II. Update, pages 1–31.
3. Automatic detection of cyclic alternating pattern (CAP) sequences in sleep: preliminary results. Clinical Neurophysiology, 1999.
4. I. Assent, R. Krieger, F. Afschari, and T. Seidl. The TS-tree: Efficient time series search and retrieval. In EDBT, 2008.
5. A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. J. Keogh. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. DAMI, 2017.
6. A. J. Bagnall, R. L. Cole, T. Palpanas, and K. Zoumpatianos. Data series management. Dagstuhl Reports, 9(7), 2019.
7. P. Boniol, M. Linardi, F. Roncallo, and T. Palpanas. Automated Anomaly Detection in Large Sequences. In ICDE, 2020.
8. P. Boniol and T. Palpanas. Series2Graph: Graph-based Subsequence Anomaly Detection for Time Series. PVLDB, 2020.
9. Botao Peng (supervised by Panagiota Fatourou and Themis Palpanas). Data Series Indexing Gone Parallel. In ICDE PhD Workshop, 2020.
10. A. Camerra, T. Palpanas, J. Shieh, and E. J. Keogh. iSAX 2.0: Indexing and mining one billion time series. In ICDM, 2010.
11. A. Camerra, J. Shieh, T. Palpanas, T. Rakthanmanon, and E. J. Keogh. Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. KAIS, 2014.
12. M. Dallachiesa, T. Palpanas, and I. F. Ilyas. Top-k nearest neighbor search in uncertain data series. PVLDB, 8(1):13–24, 2014.
13. K. Echihabi, K. Zoumpatianos, T. Palpanas, and H. Benbrahim. The Lernaean Hydra of data series similarity search: An experimental evaluation of the state of the art. PVLDB, 12(2), 2018.
14. K. Echihabi, K. Zoumpatianos, T. Palpanas, and H. Benbrahim. Return of the Lernaean Hydra: Experimental evaluation of data series approximate similarity search. PVLDB, 13(3), 2019.
15. ESA. SENTINEL-2 mission. https://sentinel.esa.int/web/sentinel/missions/sentinel-2.
16. C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD, 1994.
17. A. Gogolou, T. Tsandilas, K. Echihabi, A. Bezerianos, and T. Palpanas. Data Series Progressive Similarity Search with Probabilistic Quality Guarantees. In SIGMOD, 2020.
18. A. Gogolou, T. Tsandilas, T. Palpanas, and A. Bezerianos. Progressive similarity search on time series data. In BigVis (in conjunction with EDBT/ICDT), 2019.
19. W. Han, J. Lee, Y. Moon, and H. Jiang. Ranked subsequence matching in time-series databases. In VLDB, 2007.
20. J. A. Healey and R. W. Picard. Detecting stress during real-world driving tasks using physiological sensors. ITS, 6(2):156–166, June 2016.
21. P. Huijse, P. A. Estévez, P. Protopapas, J. C. Principe, and P. Zegers. Computational intelligence challenges and applications on large-scale astronomical time series databases. 2014.
22. IRIS. Seismic Data Access. http://ds.iris.edu/data/access, 2016.
23. F. Itakura. Minimum prediction residual principle applied to speech recognition. IEEE Trans. Acoust. Speech Signal Process., 1975.
24. S. Kadiyala and N. Shiri. A compact multi-resolution index for variable length queries in time series databases. KAIS, 2008.
25. T. Kahveci and A. Singh. Variable length queries for time series data. In ICDE, 2001.
26. K. Kashino, G. Smith, and H. Murase. Time-series active search for quick retrieval of audio and video. In ICASSP, 1999.
27. E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. KAIS, 3, 2000.
28. E. J. Keogh and S. Kasetty. On the need for time series data mining benchmarks: A survey and empirical demonstration. DAMI, 2003.
29. E. J. Keogh, T. Palpanas, V. B. Zordan, D. Gunopulos, and M. Cardle. Indexing large human-motion databases. In VLDB, 2004.
30. E. J. Keogh and C. A. Ratanamahatana. Exact indexing of dynamic time warping. Knowl. Inf. Syst., 2005.
31. H. Kondylakis, N. Dayan, K. Zoumpatianos, and T. Palpanas. Coconut: A scalable bottom-up approach for building data series indexes. PVLDB, 11(6):677–690, 2018.
32. H. Kondylakis, N. Dayan, K. Zoumpatianos, and T. Palpanas. Coconut palm: Static and streaming data series exploration now in your palm. In SIGMOD, 2019.
33. H. Kondylakis, N. Dayan, K. Zoumpatianos, and T. Palpanas. Coconut: sortable summarizations for scalable indexes over static and streaming data series. VLDBJ, 28(6), 2019.
34. J. Kruskal and M. Liberman. The symmetric time-warping problem: From continuous to discrete. In Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, 1983.
35. M. Lichman. UCI machine learning repository, 2013.
36. J. Lin, E. Keogh, L. Wei, and S. Lonardi. Experiencing SAX: a novel symbolic representation of time series. DAMI, 2007.
37. M. Linardi and T. Palpanas. ULISSE: ULtra compact Index for Variable-Length Similarity SEarch in Data Series. In ICDE, 2018.
38. M. Linardi and T. Palpanas. Scalable, variable-length similarity search in data series: The ULISSE approach. PVLDB, 11(13):2236–2248, 2018.
39. M. Linardi, Y. Zhu, T. Palpanas, and E. J. Keogh. Matrix Profile X: VALMOD - scalable discovery of variable-length motifs in data series. In SIGMOD, 2018.
40. M. Linardi, Y. Zhu, T. Palpanas, and E. J. Keogh. VALMOD: A suite for easy and exact detection of variable length motifs in data series. In SIGMOD, 2018.
Fig. 30 Results of ε-range search on non Z-normalized real datasets. (a) Indexing time and dataset lengths. (b) Average selectivity of the queries in each dataset. (c) Average query answering time for each query length, using Euclidean distance. (d) Average query answering time for each dataset, using Euclidean distance. (e) Average query answering time for each query length, using DTW. (f) Average query answering time for each dataset, using DTW.

41. M. Linardi, Y. Zhu, T. Palpanas, and E. J. Keogh. Matrix Profile Goes MAD: Variable-Length Motif And Discord Discovery in Data Series. In
DAMI, 2020.
42. J. Lines and A. Bagnall. Time series classification with ensembles of elastic distance measures. DAMI, 2015.
43. W. Loh, S. Kim, and K. Whang. A subsequence matching algorithm that supports normalization transform in time-series databases. Data Min. Knowl. Discov., 2004.
44. Michele Linardi (supervised by Themis Palpanas). Effective and Efficient Variable-Length Data Series Analytics. In VLDB PhD Workshop, 2019.
45. A. Mueen, H. Hamooni, and T. Estrada. Time series join on subsequence correlation. In ICDM, 2014.
46. V. Niennattrakul and C. A. Ratanamahatana. On clustering multimedia time series data using k-means and dynamic time warping. In MUE '07, 2007.
47. A. G. H. of Operational Intelligence Department, Airbus. Personal communication, 2017.
48. T. Palpanas. Data series management: The road to big sequence analytics. SIGMOD Rec., 2015.
49. T. Palpanas. Big sequence management: A glimpse of the past, the present, and the future. In SOFSEM, 2016.
50. T. Palpanas. The parallel and distributed future of data series mining. In HPCS, 2017.
51. T. Palpanas. Evolution of a Data Series Index - The iSAX Family of Data Series Indexes. CCIS, 2020.
52. T. Palpanas and V. Beckmann. Report on the First and Second Interdisciplinary Time Series Analysis Workshop (ITISA). SIGMOD Rec., 48(3), 2019.
53. B. Peng, P. Fatourou, and T. Palpanas. ParIS: The next destination for fast data series indexing and query answering. In IEEE BigData, 2018.
54. B. Peng, T. Palpanas, and P. Fatourou. MESSI: In-memory data series indexing. In ICDE, 2020.
55. B. Peng, T. Palpanas, and P. Fatourou. ParIS+: Data series indexing on multi-core architectures. TKDE, 2020.
56. D. Rafiei and A. Mendelzon. Efficient retrieval of similar time sequences using DFT. In ICDE, 1998.
57. T. Rakthanmanon, B. J. L. Campana, A. Mueen, G. E. A. P. A. Batista, M. B. Westover, Q. Zhu, J. Zakaria, and E. J. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In SIGKDD, 2012.
58. U. Raza, A. Camerra, A. L. Murphy, T. Palpanas, and G. P. Picco. Practical data prediction for real-world wireless sensor networks. IEEE Trans. Knowl. Data Eng., 2015.
59. H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process., 1978.
60. P. Senin, J. Lin, X. Wang, T. Oates, S. Gandhi, A. P. Boedihardjo, C. Chen, and S. Frankenstein. Time series anomaly discovery with grammar-based compression. In EDBT, 2015.
61. D. Shasha. Tuning time series queries in finance: Case studies and recommendations. IEEE Data Eng. Bull., 1999.
62. J. Shieh and E. J. Keogh. iSAX: indexing and mining terabyte sized time series. In KDD, pages 623–631, 2008.
63. S. Soldi, V. Beckmann, W. H. Baumgartner, G. Ponti, C. R. Shrader, P. Lubinski, H. A. Krimm, F. Mattana, and J. Tueller. Long-term variability of AGN at hard X-rays. Astronomy & Astrophysics, 2014.
64. M. G. Terzano, L. Parrino, A. Sherieri, R. Chervin, S. Chokroverty, C. Guilleminault, M. Hirshkowitz, M. Mahowald, H. Moldofsky, A. Rosa, R. Thomas, and A. Walters. Atlas, rules, and recording techniques for the scoring of cyclic alternating pattern (CAP) in human sleep. Sleep Medicine, 2(6), 2001.
65. X. Wang, A. Mueen, H. Ding, G. Trajcevski, P. Scheuermann, and E. J. Keogh. Experimental comparison of representation methods and distance measures for time series data. DAMI, 2013.
66. Y. Wang, P. Wang, J. Pei, W. Wang, and S. Huang. A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB, 6(10):793–804, 2013.
67. J. Wu, P. Wang, N. Pan, C. Wang, W. Wang, and J. Wang. KV-match: A subsequence matching approach supporting normalization and time warping. In ICDE, 2019.
68. D. E. Yagoubi, R. Akbarinia, F. Masseglia, and T. Palpanas. DPiSAX: Massively distributed partitioned iSAX. In ICDM, 2017.
69. D.-E. Yagoubi, R. Akbarinia, F. Masseglia, and T. Palpanas. Massively distributed time series indexing and querying. TKDE, 32(1), 2020.
70. K. Zoumpatianos, S. Idreos, and T. Palpanas. Indexing for interactive exploration of big data series. In SIGMOD, 2014.
71. K. Zoumpatianos, S. Idreos, and T. Palpanas. RINSE: interactive data series exploration with ADS+. PVLDB, 8(12), 2015.
72. K. Zoumpatianos, S. Idreos, and T. Palpanas. ADS: the adaptive data series index. VLDB J., 25(6):843–866, 2016.
73. K. Zoumpatianos, Y. Lou, I. Ileana, T. Palpanas, and J. Gehrke. Generating data series query workloads. VLDB J., 27(6), 2018.
74. K. Zoumpatianos, Y. Lou, T. Palpanas, and J. Gehrke. Query workloads for data series indexes. In SIGKDD, 2015.
75. K. Zoumpatianos and T. Palpanas. Data series management: Fulfilling the need for big sequence analytics. In