Effective and Efficient Variable-Length Data Series Analytics

Abstract

In the last twenty years, data series similarity search has emerged as a fundamental operation at the core of several analysis tasks and applications related to data series collections. Many solutions to different mining problems work by means of similarity search. In this regard, all the proposed solutions require prior knowledge of the series length on which similarity search is performed. In several cases, the choice of the length is critical and significantly influences the quality of the expected outcome. Unfortunately, the obvious brute-force solution, which provides an outcome for all lengths within a given range, is computationally untenable. In this Ph.D. work, we present the first solutions that inherently support scalable and variable-length similarity search in data series, applied to sequence/subsequence matching, motif and discord discovery problems. The experimental results show that our approaches are up to orders of magnitude faster than the alternatives. They also demonstrate that we can remove the unrealistic constraint of performing analytics using a predefined length, leading to more intuitive and actionable results, which would have otherwise been missed.

Michele Linardi
Supervised by: Themis Palpanas

LIPADE, Université de Paris [email protected]

1. INTRODUCTION

Data series (i.e., ordered sequences of points) are one of the most common data types¹, present in almost every scientific and social domain (such as meteorology, astronomy, chemistry, medicine, neuroscience, finance, agriculture, entomology, sociology, smart cities, marketing, operation health monitoring, human action recognition and others) [20].

Once the data series have been collected, the domain experts face the arduous tasks of processing and analyzing them [30, 6] in order to gain insights, e.g., by identifying similar patterns, and performing classification, or clustering. A core operation that is part of all these analysis tasks is similarity search, which has attracted lots of attention because of its importance [2]. Nevertheless, all existing scalable and index-based similarity search techniques are restricted in that they only support queries of a fixed length, and they require that this length is chosen at index construction [10, 24, 1, 25, 28, 29, 26, 21, 11]. The same observation holds for techniques proposed to discover motifs [12] and discords (i.e., anomalous subsequences) [27]: they all assume a fixed sequence length, which has to be predefined.

Evidently, this is a constraint that penalizes the flexibility needed by analysts, who oftentimes need to analyze patterns of slightly different lengths (within a given data series collection) [7, 8, 5, 17, 16]. For example, in the SENTINEL-2 mission data, oceanographers are interested in searching for similar coral bleaching patterns of different lengths; at Airbus, engineers need to perform similarity search queries for patterns of variable length when studying aircraft takeoffs and landings [19]; and in neuroscience, analysts need to search in Electroencephalogram (EEG) recordings for Cyclic Alternating Patterns (CAP) of different lengths (duration), in order to get insights about brain activity during sleep [22].

In our work, we focus on three core problems that are based on similarity search: subsequence matching, and motif and discord discovery, organized under the ULISSE and MAD methods:

1. ULISSE (ULtra compact Index for variable-length Similarity SEarch in data series) is the first indexing technique that supports variable-length subsequence matching for non Z-normalized and Z-normalized data series [15, 13, 14].
2. MAD (Motif and Discord discovery framework) implements two novel algorithms for variable-length motif and discord discovery in large data series [17, 4, 16].

¹ If the dimension that imposes the ordering of the sequence is time then we talk about time series. Though, a series can also be defined over other measures (e.g., angle in radial profiles in astronomy, mass in mass spectroscopy in physics, position in genome sequences in biology, etc.). We use the terms data series, time series, and sequence interchangeably.

Paper published in the Ph.D. Workshop - VLDB Conference 2019

2. VARIABLE-LENGTH ANALYTICS

In this section, we describe our proposed approaches to the aforementioned problems. In the next part we describe the notions and the elements used in our solutions.

Preliminaries.

Let a data series $D = d_1, ..., d_{|D|}$ be a sequence of numbers $d_i \in \mathbb{R}$, where $i \in \mathbb{N}$ represents the position in $D$. We denote the length, or size, of the data series $D$ with $|D|$. The subsequence $D_{s,\ell} = d_s, ..., d_{s+\ell-1}$ of length $\ell$ is a contiguous subset of $\ell$ points of $D$ starting at offset $s$, where $1 \le s \le |D|$ and $1 \le \ell \le |D|$. A subsequence is itself a data series. A data series collection, $C$, is a set of data series. We say that a data series $D$ is Z-normalized, denoted $D^n$, when its mean $\mu$ is 0 and its standard deviation $\sigma$ is 1. Z-normalization is an essential operation in several applications, because it allows similarity search irrespective of shifting and scaling [5].

The Piecewise Aggregate Approximation (PAA) of a data series $D$, $PAA(D) = \{p_1, ..., p_w\}$, represents $D$ in a $w$-dimensional space by means of $w$ real-valued segments of length $s$, where the value of each segment is the mean of the corresponding values of $D$ [9]. We denote the first $k$ dimensions of $PAA(D)$, ($k \le w$), as $PAA(D)_{1,..,k}$. The iSAX representation of a data series $D$, denoted by $iSAX(D, w, |alphabet|)$, is the representation of $PAA(D)$ by $w$ discrete coefficients, drawn from an alphabet of cardinality $|alphabet|$ [24]. The main idea of the iSAX representation is that the real-value space may be segmented by $|alphabet| - 1$ breakpoints into $|alphabet|$ regions, which are labeled by discrete symbols (e.g., with $|alphabet| = 4$ the available labels may be $\{00, 01, 10, 11\}$).

The subsequence matching problem is defined as follows: Given a data series collection $C = \{D^1, ..., D^C\}$, a series length range $[\ell_{min}, \ell_{max}]$, a query data series $Q$, where $\ell_{min} \le |Q| \le \ell_{max}$, and $k \in \mathbb{N}$, we want to find the set $R = \{D^i_{o,\ell} \mid D^i \in C \wedge \ell = |Q| \wedge (\ell + o - 1) \le |D^i|\}$, where $|R| = k$. We require that $\forall D^i_{o,\ell} \in R$, $\nexists D^{i'}_{o',\ell'} \notin R$ s.t. $dist(D^{i'}_{o',\ell'}, Q) < dist(D^i_{o,\ell}, Q)$, where $\ell' = |Q|$, $(\ell' + o' - 1) \le |D^{i'}|$ and $D^{i'} \in C$. We informally call $R$ the $k$ nearest neighbors set of $Q$. Given two generic series of the same length, namely $D$ and $D'$, the function $dist(D, D')$ can be Euclidean Distance or Dynamic Time Warping.
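To make these definitions concrete, the following is a minimal Python sketch of Z-normalization, PAA, and a brute-force $k$ nearest neighbors scan over all subsequences of length $|Q|$. It is only an illustration under stated assumptions, not the method proposed in this work: the function names are hypothetical, the PAA assumes that the series length is a multiple of $w$, constant series are mapped to zeros by convention, and only the Euclidean distance is covered.

```python
import math
from statistics import fmean, pstdev


def z_normalize(series):
    """Return a copy of `series` with mean 0 and standard deviation 1."""
    mu = fmean(series)
    sigma = pstdev(series, mu)
    if sigma == 0:
        # Assumption: a constant series is mapped to all zeros.
        return [0.0] * len(series)
    return [(x - mu) / sigma for x in series]


def paa(series, w):
    """PAA(D) with w segments; assumes len(series) is a multiple of w."""
    if len(series) % w != 0:
        raise ValueError("series length must be a multiple of w")
    seg = len(series) // w
    return [fmean(series[j * seg:(j + 1) * seg]) for j in range(w)]


def euclidean(a, b):
    """Euclidean distance between two series of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_subsequences(collection, query, k):
    """Brute-force k-NN: scan every subsequence of length |Q| in C.

    Returns (distance, series index, offset) triples, with 1-based
    offsets as in the definition of D_{o,l}.
    """
    ell = len(query)
    candidates = []
    for i, series in enumerate(collection):
        for o in range(len(series) - ell + 1):
            d = euclidean(series[o:o + ell], query)
            candidates.append((d, i, o + 1))
    return sorted(candidates)[:k]
```

This exhaustive scan is the baseline that an index is meant to avoid: its cost grows with the total number of subsequences in the collection for every query.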

Variable Length Subsequences.

In a data series, when we consider contiguous and overlapping subsequences of different lengths within the range $[\ell_{min}, \ell_{max}]$, we expect to obtain a group of similar series, whose differences are caused by the misalignment and the different number of points. Given a data series $D$ and a subsequence length range $[\ell_{min}, \ell_{max}]$, we define the master series as the subsequences of the form $D_{i,\min(|D|-i+1,\ell_{max})}$, for each $i$ such that $1 \le i \le |D| - (\ell_{min} - 1)$, where $1 \le \ell_{min} \le \ell_{max} \le |D|$.

We observe that for any master series of the form $D_{i,\ell'}$, we have that $PAA(D_{i,\ell'})_{1,..,k} = PAA(D_{i,\ell''})_{1,..,k}$ holds for each $\ell''$ such that $\ell'' \ge \ell_{min}$, $\ell'' \le \ell' \le \ell_{max}$, $\ell' \bmod k = 0$ and $\ell'' \bmod k = 0$. Therefore, by computing only the PAA of the master series in $D$, we are able to represent the PAA prefix of any subsequence of $D$. When we zero-align the PAA summaries of the master series, we compute the minimum and maximum PAA values (over all the subsequences) for each segment: this forms what we call an Envelope. (When the length of a master series is not a multiple of the PAA segment length, we compute the PAA coefficients of the longest prefix that is a multiple of a segment.) We call containment area the space in between the segments that define the Envelope.

PAA Envelope.

We formalize the concept of the Envelope, introducing a new series representation. We denote by $L$ and $U$ the PAA coefficients which delimit the lower and upper parts, respectively, of a containment area. Furthermore, we introduce a parameter $\gamma$, which allows us to select the number of master series we represent by the Envelope. We refer to it using the following signature: $paaENV_{[D,\ell_{min},\ell_{max},a,\gamma,s]} = [L, U]$. It delimits the containment area generated by the PAA coefficients of the master series.
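The construction above can be sketched in a few lines of Python. This is a simplified illustration rather than the ULISSE implementation, and it rests on assumptions that the text does not spell out: `a` is taken to be the 1-based offset of the first master series, the Envelope is taken to cover the master series starting at offsets `a, ..., a + gamma` (so that `gamma = 0` yields a single master series), and `s` is the PAA segment length. All function names are hypothetical.

```python
from statistics import fmean


def master_series(D, i, l_max):
    """Master series at 1-based offset i: D_{i, min(|D| - i + 1, l_max)}."""
    length = min(len(D) - i + 1, l_max)
    return D[i - 1:i - 1 + length]


def paa_by_segment_length(series, s):
    """PAA coefficients of the longest prefix whose length is a multiple of s."""
    n_seg = len(series) // s
    return [fmean(series[j * s:(j + 1) * s]) for j in range(n_seg)]


def paa_envelope(D, l_min, l_max, a, gamma, s):
    """Per-segment minimum (L) and maximum (U) of the zero-aligned PAA
    summaries of the master series starting at offsets a .. a + gamma."""
    last_start = len(D) - (l_min - 1)  # largest valid master series offset
    L, U = [], []
    for i in range(a, min(a + gamma, last_start) + 1):
        coeffs = paa_by_segment_length(master_series(D, i, l_max), s)
        for j, p in enumerate(coeffs):
            if j == len(L):
                # First time segment j is seen: it opens the envelope.
                L.append(p)
                U.append(p)
            else:
                L[j] = min(L[j], p)
                U[j] = max(U[j], p)
    return L, U
```

For example, with `D = [1, 2, 3, 4, 5, 6, 7, 8]`, `l_min = 4`, `l_max = 6`, `a = 1`, `gamma = 2` and `s = 2`, the three master series have PAA summaries `[1.5, 3.5, 5.5]`, `[2.5, 4.5, 6.5]` and `[3.5, 5.5, 7.5]`, so the sketch returns `L = [1.5, 3.5, 5.5]` and `U = [3.5, 5.5, 7.5]`. Zero-alignment here simply means that segment `j` of every master series is compared with segment `j` of the others.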

Indexing the Envelopes.

Given a $paaENV$, we can translate its PAA extremes into the corresponding iSAX representation: $uENV_{paaENV_{[D,\ell_{min},\ell_{max},a,\gamma,s]}} = [iSAX(L), iSAX(U)]$, where $iSAX(L)$ ($iSAX(U)$) is the vector of the minimum (maximum) PAA coefficients of all the segments corresponding to the subsequences of $D$.

Figure 1: Query answering time performance, varying γ on non Z-normalized data series. (a) ULISSE average query time (CPU + disk I/O). (b) ULISSE average query disk I/O time. (b) Comparison of ULISSE to other techniques (cumulative indexing + query answering time). [Plot axes: average exact query time, CPU + disk I/O (seconds), over query length (160, 192, 224, 256); cumulative query time (hours), split into indexing (disk I/O + CPU time), query answering CPU and query answering disk I/O, for γ = 0%, 20%, 40%, 60%, 80%, 100% of (ℓmax − ℓmin).]

The Envelope $uENV$ represents the principal building block of the ULISSE index. In detail, ULISSE is a tree structure, where each internal node stores the Envelope $uENV$ representing all the sequences in the subtree rooted at that node. Leaf nodes contain several Envelopes, which by construction have the same $iSAX(L)$. On the contrary, their $iSAX(U)$ varies, since it gets updated with every new insertion in the node. Each Envelope in a leaf node points to the represented sequences in the original data series collection.
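The translation of the PAA extremes $L$ and $U$ into iSAX symbols can be illustrated as follows. This is a minimal sketch under assumptions: the breakpoints are taken as the quantiles of the standard normal distribution that split it into equiprobable regions (the usual choice for SAX-style representations of Z-normalized data), regions are labeled with binary strings in increasing order from the lowest region, and the function names are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist


def breakpoints(cardinality):
    """The cardinality - 1 breakpoints splitting N(0, 1) into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(j / cardinality) for j in range(1, cardinality)]


def isax_word(paa_coeffs, cardinality):
    """Map each PAA coefficient to the binary label of the region it falls in."""
    bits = max(1, (cardinality - 1).bit_length())
    bps = breakpoints(cardinality)
    # Assumption: the lowest region gets label 0, the highest cardinality - 1.
    return [format(bisect_right(bps, p), f"0{bits}b") for p in paa_coeffs]


def isax_envelope(L, U, cardinality):
    """uENV sketch: the iSAX words of the lower and upper PAA extremes."""
    return isax_word(L, cardinality), isax_word(U, cardinality)
```

With a cardinality of 4, the three breakpoints are approximately -0.67, 0 and 0.67, so a coefficient of -1.0 falls in region `00` and a coefficient of 1.2 falls in region `11`.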

Approximate Subsequence Matching.

Subsequence matching performed on the ULISSE index relies on the $mindist_{ULiSSE}()$ lower bounding function to prune the search space. This allows us to navigate the tree in order, visiting first the most promising nodes. As soon as a leaf node is discovered, we can load the raw data series pointed to by the Envelopes in the leaf. Each time we compute the true Euclidean or DTW distance between the query and the series in a leaf, the best-so-far distance (bsf) is updated, along with a vector containing the $k$ best matches, where $k$ refers to the $k$ nearest neighbors. Since priority is given to the most promising nodes, we can terminate our visit when, at the end of a leaf visit, the $k$ bsf's have not improved.

Exact Subsequence Matching.

Note that the approximate search described above may not visit leaves that contain answers better than the approximate answers already identified, and therefore, it will fail to produce exact, correct results.

The exact nearest neighbor search algorithm we propose finds the $k$ sequences with the absolute smallest distances to the query. In this case, the search algorithm may visit several leaves: the process stops after it has either visited, or pruned (when the lower bounding distance to the node is greater than the bsf), all the nodes of the index, guaranteeing the correctness of the results.

Experiments.

To evaluate ULISSE, we used synthetic and real data (but in the interest of space we only report results with the synthetic data). We record the average CPU time and disk I/O time (the time to fetch data from disk, i.e., total time − CPU time) for queries extracted from the datasets with the addition of Gaussian noise. We compare ULISSE with UCR Suite [5], the non index-based state-of-the-art technique for answering similarity search queries. Concerning the competitor indexing techniques, the state of the art is the Compact Multi-Resolution Index (CMRI) [7].

In Figure 1, we present results for subsequence matching queries on ULISSE when we vary $\gamma$, ranging from 0 to its maximum value in this dataset, i.e., $\ell_{max} - \ell_{min}$. In Figure 1, we report the results concerning non Z-normalized series. We observe that grouping contiguous and overlapping subsequences under the same summarization (Envelope) by increasing $\gamma$ positively affects the performance of index construction, as well as query answering.

The latter may seem counterintuitive, since inserting more master series into a single Envelope is likely to generate large containment areas, which are not tight representations of the data series. On the other hand, it leads to an overall number of Envelopes that is several orders of magnitude smaller than the one for $\gamma$ = 0%, where only a single master series is represented by each Envelope.

Motif and Discord are data mining primitives that represent frequent and rare (anomalous) patterns, respectively. Given a data series $D$, they are defined as follows:

• Data series motif: $D_{a,\ell}$ and $D_{b,\ell}$ is a motif pair iff $dist(D_{a,\ell}, D_{b,\ell}) \le dist(D_{i,\ell}, D_{j,\ell})$ $\forall i, j \in [1, 2, ..., |D| - \ell + 1]$, where $a \ne b$ and $i \ne j$, and $dist$ is a function that computes the z-normalized Euclidean distance between the input subsequences.
• Data series discord: We call the $k$ subsequences of $D$ with the $k$ largest distances to their $m^{th}$ Nearest Neighbor (according to the Euclidean distance) the Top-$k$ $m^{th}$ discords.

Variable length motif and discord discovery.

We provide solutions to the following problems:

• Variable-Length Motif Discovery: Given a data series $D$ and a subsequence length-range $[\ell_{min}, ..., \ell_{max}]$, we want to find the data series motif pairs of all lengths in $[\ell_{min}, ..., \ell_{max}]$, occurring in $D$.
• Variable-Length Top-$k$ $m^{th}$ Discord Discovery: Given a data series $D$, a subsequence length-range $[\ell_{min}, ..., \ell_{max}]$ and the parameters $a, b \in \mathbb{N}^+$, we want to enumerate the Top-$k$ $m^{th}$ discords for each $k \in \{1, .., a\}$ and each $m \in \{1, .., b\}$, and for all lengths in $[\ell_{min}, ..., \ell_{max}]$, occurring in $D$.

Fixed length motif and discord discovery.

The state-of-the-art algorithm for fixed length motif and discord discovery [3] requires the user to define the length of the desired motif or discord. This mining operation is supported by the computation of the Matrix profile, which is a meta data series storing the z-normalized Euclidean distance between each subsequence and its nearest neighbor. The Matrix profile does not only derive the motif, but also ranks and filters out the other pairs, giving also a convenient graphical representation of their occurrences and proximity. Unfortunately, this technique comes with an important shortcoming: it does not provide an effective solution for trying several different motif lengths. Therefore, the analyst is forced to run the algorithm using all possible lengths in a range of interest, and rank the various motifs discovered, eventually picking the patterns that contain the desired insight. Clearly, this possibility is not optimal for at least two reasons: the scalability, since finding the motif of one fixed length takes $O(|D|^2)$ time, and also because it does not provide an effective way to compare motifs of different lengths.

MAD Framework.

Our framework for Variable Length Motif and Discord Discovery (MAD) works by applying an incremental computing strategy, which aims to prune unnecessary distance computations for larger motif and discord lengths. Hence, given a data series $D$, we compute the Matrix profile using the smallest subsequence length, namely $\ell_{min}$, within a specified input range $[\ell_{min}, \ell_{max}]$. The key idea of our approach is to minimize the work that needs to be done for succeeding subsequence lengths ($\ell_{min} + 1$, $\ell_{min} + 2$, ..., $\ell_{max}$).

Matrix Profile Computation.

We start the computation of the Matrix profile, considering all the contiguous subsequences of length $\ell_{min}$, computing for each one the Distance profile in $O(|D|)$ time. The latter is a vector that contains the z-normalized Euclidean distances between a fixed subsequence and all the others in $D$ (excluding trivial matches).

Lower Bound Subsequences of Different Length.

Moreover, we introduce a new lower bounding distance [17], which lower bounds (is never greater than) the true Euclidean distances between subsequences longer than $\ell_{min}$. We initially compute this lower bound using the true Euclidean distance computations of subsequences with length $\ell_{min}$. For the larger lengths, we update the lower bound, considering only the variation generated by the trailing points in the longer subsequences. This measure enjoys an important property: if we rank the subsequences according to this measure, the same rank will be preserved along all the lower bound updates for the subsequences of greater length. We exploit this property in order to prune computations.
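For reference, the notions of Distance profile and Matrix profile, and the brute-force variable-length baseline that the incremental strategy is designed to avoid, can be sketched as follows. This is not the MAD algorithm and it does not implement the lower bound of [17]: it is a naive illustration in which each distance profile costs $O(|D| \cdot \ell)$ instead of $O(|D|)$, offsets are 0-based, trivial matches are excluded with an assumed exclusion zone of $\ell/2$ positions, and all function names are hypothetical.

```python
import math
from statistics import fmean, pstdev


def znorm(x):
    """Z-normalize a subsequence (constant subsequences map to zeros)."""
    mu = fmean(x)
    sigma = pstdev(x, mu)
    return [0.0] * len(x) if sigma == 0 else [(v - mu) / sigma for v in x]


def zdist(a, b):
    """Z-normalized Euclidean distance between two subsequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(znorm(a), znorm(b))))


def distance_profile(D, i, ell, excl=None):
    """Distances between D[i:i+ell] and every subsequence of length ell.

    Positions within `excl` of i are trivial matches and are set to inf.
    Assumption: the exclusion zone defaults to ell // 2.
    """
    excl = ell // 2 if excl is None else excl
    query = D[i:i + ell]
    n = len(D) - ell + 1
    return [math.inf if abs(j - i) <= excl else zdist(query, D[j:j + ell])
            for j in range(n)]


def matrix_profile(D, ell):
    """Nearest neighbor distance of every subsequence of length ell."""
    return [min(distance_profile(D, i, ell)) for i in range(len(D) - ell + 1)]


def motif_pair(D, ell):
    """(distance, offset a, offset b) of the closest pair of length ell."""
    best = (math.inf, None, None)
    for i in range(len(D) - ell + 1):
        dp = distance_profile(D, i, ell)
        j = min(range(len(dp)), key=dp.__getitem__)
        if dp[j] < best[0]:
            best = (dp[j], i, j)
    return best


def variable_length_motifs(D, l_min, l_max):
    """Brute-force baseline: recompute everything from scratch per length."""
    return {ell: motif_pair(D, ell) for ell in range(l_min, l_max + 1)}
```

The last function makes the cost of the naive approach explicit: every length in the range triggers a full recomputation, which is exactly the work that a lower bound preserved across lengths allows one to skip.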

Pruning the Search Space.

When we compute motifs and discords with length greater than $\ell_{min}$, instead of computing each distance profile from scratch, we update the true distances (in constant time) of the subsequences that have the $p$ smallest lower bounding distances (computed in the previous step). These distances form what we call a partial distance profile. In each partial distance profile, we also update the lower bound. After this operation, we may have two cases: if in a newly computed distance profile the minimum true distance ($minDist$) is shorter than the maximum lower bound ($maxLB$), we know that no distance among those not computed can be smaller than $minDist$, because each of them is at least as large as its own lower bound, which is in turn at least $maxLB$. In this case, a partial distance profile becomes a valid distance profile. On the other hand, when $maxLB$ is smaller than $minDist$, the latter is not guaranteed to be the nearest neighbor distance. For discord discovery, we need to test this condition for the $m$ smallest true distances in the partial distance profile. In this case a valid (partial) distance profile must contain the true $m^{th}$ best match distances, which are smaller than, or equal to, $maxLB$.

Exact Motif and Discord Discovery.

Once the partial distance profiles are computed, we pick the absolute smallest lower bounding value from all the non-valid distance profiles, namely $minLBAbs$ (if any). Therefore, the global minimum (true) distance of all the valid (partial) distance profiles, which is smaller than $minLBAbs$, is guaranteed to be the distance between the motif pair subsequences. Symmetrically, we consider the valid (partial) distance profiles to find the true $m^{th}$ best match distances, which are the greatest nearest neighbor distances that are larger than $maxLBAbs$. The latter is the largest lower bounding distance of the non-valid distance profiles.

In the motif discovery task, if no nearest neighbor distance is smaller than $minLBAbs$, we recompute only the distance profiles that have the $maxLB$ distance smaller than the smallest true distance computed.

On the other hand, for discord discovery, if no true nearest neighbor distances are found we need to iterate the non-valid (partial) distance profiles, which contain the $maxLB$ distance greater than the largest $m^{th}$ best match distance. We keep extracting in this manner the motif and the discord subsequences of each length, until $\ell_{max}$.

Figure 2: Time over motif length ranges (default $\ell_{min}$ = , data series length = ). [Plots for the ECG and ASTRO datasets: time (hours) over subsequence length range (100, 150, 200, 400, 600); time out after 24h.]

Figure 3: (a) Top-1 $m^{th}$ discords discovery, and (b) Top-$k$ $1^{st}$ discords discovery time performance. [Plots for the EMG and ASTRO datasets: time (hours) over $m$ (Top-1 $m^{th}$ discords) and over $k$ (Top-$k$ $1^{st}$ discords), for GrammarViz, MAD (Discord discovery) and DAD; time out after 48h.]

Motif Discovery Experimental Evaluation.

To benchmark the MAD framework, we used several different real datasets. Concerning the motif discovery problem, the competitors we considered are: QUICKMOTIF [12], STOMP [3], and MOEN [18]. We report in Figure 2 a sample of the experiments we conducted (detailed experimental results on several datasets are reported elsewhere [17]). Here, we show the results of MAD, which finds motifs in different real datasets. In the plots, we report the total execution time varying motif length ranges. From this experiment, we observe that MAD, whose motif discovery algorithm is VALMOD [17], maintains a good and stable performance across datasets and parameter settings, quickly producing results, even in cases where the competitors do not terminate within a reasonable amount of time.

Discord Discovery Experimental Evaluation.

We identify two state-of-the-art competitors to compare to our approach, the Motif And Discord (MAD) framework. The first one, DAD (Disk aware discord discovery) [27], implements an algorithm suitable to enumerate the fixed-length Top-1 $m^{th}$ discords. The second approach, GrammarViz [23], is the most recent technique, which discovers Top-$k$ $1^{st}$ discords. In Figure 3(a) we report the results of Top-1 $m^{th}$ discord discovery, varying $m$. We note that MAD gracefully scales over the number of discords to enumerate and is up to one order of magnitude faster than DAD. In Figure 3(b), we show the result of Top-$k$ $1^{st}$ discords discovery. Once again, MAD scales better over the number of discovered discords, as its execution time remains almost constant. A different trend is observed for GrammarViz, whose performance significantly deteriorates as $k$ increases.

3. CONCLUSIONS

Even though much effort has been dedicated to developing techniques for data series analytics, existing solutions for subsequence matching, motif and discord discovery are limited to fixed length queries/results. In this Ph.D. work, we propose the first scalable solutions to the variable-length version of these problems: ULISSE is the first index that supports variable-length subsequence matching over both Z-normalized and non Z-normalized sequences [15, 13, 14], while MAD is the first framework that implements variable-length motif and discord discovery [17, 4, 16].

References

[1] A. Camerra, T. Palpanas, J. Shieh, and E. J. Keogh. iSAX 2.0: Indexing and mining one billion time series. In ICDM, 2010.
[2] K. Echihabi, K. Zoumpatianos, T. Palpanas, and H. Benbrahim. The Lernaean hydra of data series similarity search: An experimental evaluation of the state of the art. PVLDB, 12(2), 2018.
[3] C. M. Y. et al. Matrix profile I: all pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets. In ICDM, 2016.
[4] M. L. et al. Matrix profile goes MAD: Variable-length motif and discord discovery in data series. Under submission, 2019.
[5] T. R. et al. Searching and mining trillions of time series subsequences under dynamic time warping. In SIGKDD, 2012.
[6] A. Gogolou, T. Tsandilas, T. Palpanas, and A. Bezerianos. Progressive similarity search on time series data. In BigVis, in conjunction with EDBT/ICDT, 2019.
[7] S. Kadiyala and N. Shiri. A compact multi-resolution index for variable length queries in time series databases. KAIS, 2008.
[8] T. Kahveci and A. Singh. Variable length queries for time series data. In ICDE, 2001.
[9] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. KAIS, 3, 2000.
[10] E. J. Keogh, T. Palpanas, V. B. Zordan, D. Gunopulos, and M. Cardle. Indexing large human-motion databases. In VLDB, 2004.
[11] H. Kondylakis, N. Dayan, K. Zoumpatianos, and T. Palpanas. Coconut: A scalable bottom-up approach for building data series indexes. PVLDB, 2018.
[12] Y. Li, L. H. U, M. L. Yiu, and Z. Gong. Quick-motif: An efficient and scalable framework for exact motif discovery. In ICDE, 2015.
[13] M. Linardi and T. Palpanas. Scalable data series subsequence matching with ULISSE. Under submission, 2019.
[14] M. Linardi and T. Palpanas. ULISSE: ULtra compact Index for Variable-Length Similarity SEarch in Data Series. In ICDE, 2018.
[15] M. Linardi and T. Palpanas. Scalable, variable-length similarity search in data series: The ULISSE approach. PVLDB, 11(13):2236–2248, 2018.
[16] M. Linardi, Y. Zhu, T. Palpanas, and E. J. Keogh. VALMOD: A suite for easy and exact detection of variable length motifs in data series. In SIGMOD Conference, 2018.
[17] M. Linardi, Y. Zhu, T. Palpanas, and E. J. Keogh. Matrix profile X: VALMOD - scalable discovery of variable-length motifs in data series. In SIGMOD, 2018.
[18] A. Mueen and N. Chavoshi. Enumeration of time series motifs of all lengths. Knowl. Inf. Syst., 2015.
[19] A. G. H. of Operational Intelligence Department Airbus. Personal communication, 2017.
[20] T. Palpanas. Data series management: The road to big sequence analytics. SIGMOD Rec., 2015.
[21] B. Peng, P. Fatourou, and T. Palpanas. ParIS: The next destination for fast data series indexing and query answering. In IEEE Big Data, 2018.
[22] A. Rosa, L. Parrino, and M. Terzano. Automatic detection of cyclic alternating pattern (CAP) sequences in sleep: preliminary results. Clinical Neurophysiology, 1999.
[23] P. Senin, J. Lin, X. Wang, T. Oates, S. Gandhi, A. P. Boedihardjo, C. Chen, and S. Frankenstein. Time series anomaly discovery with grammar-based compression. In EDBT, 2015.
[24] J. Shieh and E. J. Keogh. iSAX: indexing and mining terabyte sized time series. In KDD, 2008.
[25] Y. Wang, P. Wang, J. Pei, W. Wang, and S. Huang. A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB, 2013.
[26] D. E. Yagoubi, R. Akbarinia, F. Masseglia, and T. Palpanas. DPiSAX: Massively distributed partitioned iSAX. In ICDM, 2017.
[27] D. Yankov, E. J. Keogh, and U. Rebbapragada. Disk aware discord discovery: finding unusual time series in terabyte sized datasets. Knowl. Inf. Syst., 2008.
[28] K. Zoumpatianos, S. Idreos, and T. Palpanas. RINSE: interactive data series exploration with ADS+. PVLDB, 2015.
[29] K. Zoumpatianos, S. Idreos, and T. Palpanas. ADS: the adaptive data series index. VLDB J., 2016.
[30] K. Zoumpatianos and T. Palpanas. Data series management: Fulfilling the need for big sequence analytics. In