MESSI: In-Memory Data Series Indexing
Botao Peng
LIPADE, Université de Paris — [email protected]
Panagiota Fatourou
FORTH ICS & Dept. of Comp. Science, Univ. of Crete — [email protected]
Themis Palpanas
LIPADE, Université de Paris — [email protected]
Abstract—Data series similarity search is a core operation for several data series analysis applications across many different domains. However, the state-of-the-art techniques fail to deliver the time performance required for interactive exploration, or analysis of large data series collections. In this work, we propose MESSI, the first data series index designed for in-memory operation on modern hardware. Our index takes advantage of the modern hardware parallelization opportunities (i.e., SIMD instructions, multi-core and multi-socket architectures), in order to accelerate both index construction and similarity search processing times. Moreover, it benefits from a careful design in the setup and coordination of the parallel workers and data structures, so that it maximizes its performance for in-memory operations. Our experiments with synthetic and real datasets demonstrate that overall MESSI is up to 4x faster at index construction, and up to 11x faster at query answering than the state-of-the-art parallel approach. MESSI is the first to answer exact similarity search queries on 100GB datasets in ∼50msec.

Index Terms—Data series, Indexing, Modern hardware
I. INTRODUCTION

[Motivation]
Several applications across many diverse domains, such as finance, astrophysics, neuroscience, engineering, multimedia, and others [1]–[3], continuously produce big collections of data series, which need to be processed and analyzed. The most common type of query that different analysis applications need to answer on these collections of data series is similarity search [1], [4], [5]. The continued increase in the rate and volume of data series production renders existing data series indexing technologies inadequate. For example, ADS+ [6], the state-of-the-art sequential (i.e., non-parallel) indexing technique, requires more than 2min to answer exactly a single 1-NN (Nearest Neighbor) query on a (moderately sized) 100GB sequence dataset. For this reason, a disk-based data series parallel indexing scheme, called ParIS, was recently designed [7] to take advantage of modern hardware parallelization. ParIS effectively exploits the parallelism capabilities provided by multi-core and multi-socket architectures, and the Single Instruction Multiple Data (SIMD) capabilities of modern CPUs. In terms of query answering, experiments showed that ParIS is more than an order of magnitude faster than ADS+, and more than two orders of magnitude faster than the optimized serial scan method. (A data series, or data sequence, is an ordered sequence of data points. If the ordering dimension is time, then we talk about time series, though series can be ordered over other measures, e.g., angle in astronomical radial profiles, frequency in infrared spectroscopy, mass in mass spectroscopy, position in genome sequences, etc.)
Still, ParIS is designed for disk-resident data, and therefore its performance is dominated by the I/O costs it encounters. For instance, ParIS answers a 1-NN exact query on a 100GB dataset in 15sec, which is above the limit for keeping the user's attention (i.e., 10sec), let alone for supporting interactivity in the analysis process (i.e., 100msec) [8].

[Application Scenario]
In this work, we focus on designing an efficient parallel indexing and query answering scheme for in-memory data series processing. Our work is motivated and inspired by the following real scenario. Airbus currently stores petabytes of data series, describing the behavior over time of various aircraft components (e.g., the vibrations of the bearings in the engines), as well as that of pilots (e.g., the way they maneuver the plane through the fly-by-wire system) [9]. The experts need to access these data in order to run different analytics algorithms. However, these algorithms usually operate on a subset of the data (e.g., only the data relevant to landings from Air France pilots), which fit in memory. Therefore, in order to perform complex analytics operations (such as searching for similar patterns, or classification) fast, in-memory data series indices must be built for efficient data series query processing. Consequently, the time performance of both index creation and query answering become important factors in this process.

[MESSI Approach] We present MESSI, the first in-MEmory data SerieS Index, which incorporates the state-of-the-art techniques in sequence indexing. MESSI effectively uses multi-core and multi-socket architectures in order to concurrently execute the computations needed for both index construction and query answering, and it exploits SIMD. More importantly though, MESSI features redesigned algorithms that lead to a further
∼4x speedup in index construction time, in comparison to an in-memory version of ParIS. Furthermore, MESSI answers exact 1-NN queries on 100GB datasets 6-11x faster than ParIS across the datasets we tested, achieving for the first time interactive exact query answering times, at ∼50msec. These results are due to the careful design of the MESSI index, and the development of new algorithms for answering similarity search queries on this index.

For query answering in particular, we showed that adaptations of alternative solutions, which have proven to perform the best in other settings (i.e., disk-resident data [7]), are not optimal in our case, and we designed a novel solution that achieves a good balance between the amount of communication among the parallel worker threads, and the effectiveness of each individual worker. For instance, the new scheme uses concurrent priority queues for storing the data series that cannot be pruned, and for processing these series in order, starting from those whose iSAX representations have the smallest distance to the iSAX representation of the query data series. In this way, the parallel query answering threads achieve better pruning on the data series they process. Moreover, the new scheme uses the index tree to decide which data series to insert into the priority queues for further processing. In this way, the number of distance calculations performed between the iSAX summaries of the query and data series is significantly reduced (ParIS performs this calculation for all data series in the collection). We also experimented with several designs for reducing the synchronization cost among different workers that access the priority queues, and for achieving load balancing. We ended up with a scheme where workers use randomization to choose the priority queues they will work on. Consequently, MESSI answers exact 1-NN queries on 100GB datasets within 30-70msec across diverse synthetic and real datasets.

The index construction phase of MESSI differentiates from ParIS in several ways. For instance, ParIS was using a number of buffers to temporarily store pointers to the iSAX summaries of the raw data series before constructing the tree index [7]. MESSI allocates smaller such buffers per thread and stores in them the iSAX summaries themselves. In this way, it completely eliminates the synchronization cost in accessing the iSAX buffers. To achieve load balancing, MESSI splits the array storing the raw data series into small blocks, and assigns blocks to threads in a round-robin fashion. We applied the same technique when assigning to threads the buffers containing the iSAX summary of the data series. Overall, the new design and algorithms of MESSI led to ∼
4x improvement in index construction time when compared to ParIS.

[Contributions]
Our contributions are summarized as follows.
• We propose MESSI, the first in-memory data series index designed for modern hardware, which can answer similarity search queries in a highly efficient manner.
• We implement a novel, tree-based exact query answering algorithm, which minimizes the number of required distance calculations (both lower bound distance calculations for pruning true negatives, and real distance calculations for pruning false positives).
• We also design an index construction algorithm that effectively balances the workload among the index creation workers, by using a parallel-friendly index framework with low synchronization cost.
• We conduct an experimental evaluation with several synthetic and real datasets, which demonstrates the efficiency of the proposed solution. The results show that MESSI is up to 4.2x faster at index construction and up to 11.2x faster at query answering than the state-of-the-art parallel index-based competitor, up to 109x faster at query answering than the state-of-the-art parallel serial scan algorithm, and thus can significantly reduce the execution time of complex analytics algorithms (e.g., k-NN classification).

II. PRELIMINARIES
We now provide some necessary definitions, and introduce the related work on state-of-the-art data series indexing.
A. Data Series and Similarity Search

[Data Series]
A data series, S = {p_1, ..., p_n}, is defined as a sequence of points, where each point p_i = (v_i, t_i), 1 ≤ i ≤ n, is associated to a real value v_i and a position t_i. The position corresponds to the order of this value in the sequence. We call n the size, or length, of the data series. We note that all the discussions in this paper are applicable to high-dimensional vectors, in general.

[Similarity Search] Analysts perform a wide range of data mining tasks on data series, including clustering [10], classification and deviation detection [11], [12], and frequent pattern mining [13]. Existing algorithms for executing these tasks rely on performing fast similarity search across the different series. Thus, efficiently processing nearest neighbor (NN) queries is crucial for speeding up the above tasks. NN queries are formally defined as follows: given a query series S_q of length n, and a data series collection S of sequences of the same length, n, we want to identify the series S_c ∈ S that has the smallest distance to S_q among all the series in the collection S. (In the case of streaming series, we first create subsequences of length n using a sliding window, and then index those.) Common distance measures for comparing data series are Euclidean Distance (ED) [14] and dynamic time warping (DTW) [15]. While DTW is better for most data mining tasks, the error rate using ED converges to that of DTW as the dataset size grows [16]. Therefore, data series indexes for massive datasets use ED as a distance metric [6], [15]–[18], though simple modifications can be applied to make them compatible with DTW [16]. Euclidean distance is computed as the sum of distances between the pairs of corresponding points in the two sequences. Note that minimizing ED on z-normalized data (i.e., a series whose values have mean 0 and standard deviation 1) is equivalent to maximizing their Pearson's correlation coefficient [19].

[Distance calculation in SIMD] Single-Instruction Multiple-Data (SIMD) refers to a parallel architecture that allows the execution of the same operation on multiple data simultaneously [20]. Using SIMD, we can reduce the latency of an operation, because the corresponding instructions are fetched once, and then applied in parallel to multiple data. All modern CPUs support 256-bit wide SIMD vectors, which means that certain floating point (or other 32-bit data) computations can be up to 8 times faster when executed using SIMD.
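To make this concrete, here is a minimal C sketch (not the paper's code) of the squared Euclidean distance between two series, vectorized with 256-bit AVX intrinsics; it processes 8 float values per instruction and omits the early-abandoning optimization that real implementations typically add:

    #include <immintrin.h>

    /* Squared ED between two float series of length n (n a multiple of 8).
     * Compile with -mavx. */
    float squared_ed_avx(const float *a, const float *b, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            __m256 d  = _mm256_sub_ps(va, vb);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(d, d)); /* acc += d*d */
        }
        float out[8];                 /* horizontal sum of the 8 lanes */
        _mm256_storeu_ps(out, acc);
        return out[0] + out[1] + out[2] + out[3]
             + out[4] + out[5] + out[6] + out[7];
    }

Since the square root is monotonic, comparing squared distances suffices for nearest neighbor search, so the final square root is commonly skipped.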
Fig. 1. The iSAX representation, and the ParIS index: (a) raw data series; (b) PAA representation; (c) iSAX representation; (d) ParIS index.
In the data series context, SIMD has been employed for the computation of the Euclidean distance functions [21], as well as in the ParIS index, for the conditional branch calculations during the computation of the lower bound distances [7].
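As an illustration of how such conditional branches can be vectorized, here is a hedged C sketch (an assumption for exposition, not ParIS code): per segment, the lower bound contribution against the region [lo, hi] of a candidate's iSAX symbol is (lo − q)^2 if q < lo, (q − hi)^2 if q > hi, and 0 otherwise, which can be computed branch-free as max(0, max(lo − q, q − hi)) squared:

    #include <immintrin.h>

    /* Lower-bound contributions for w segments (w a multiple of 8);
     * q holds the query PAA values, lo/hi the region borders of the
     * candidate's iSAX symbols (this array layout is an assumption). */
    float lower_bound_avx(const float *q, const float *lo,
                          const float *hi, int w) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < w; i += 8) {
            __m256 vq    = _mm256_loadu_ps(q + i);
            __m256 below = _mm256_sub_ps(_mm256_loadu_ps(lo + i), vq);
            __m256 above = _mm256_sub_ps(vq, _mm256_loadu_ps(hi + i));
            __m256 d = _mm256_max_ps(_mm256_setzero_ps(),
                                     _mm256_max_ps(below, above));
            acc = _mm256_add_ps(acc, _mm256_mul_ps(d, d));
        }
        float out[8];
        _mm256_storeu_ps(out, acc);
        return out[0] + out[1] + out[2] + out[3]
             + out[4] + out[5] + out[6] + out[7];
    }

(The actual iSAX lower bound also scales by a factor depending on the segment length; that scaling is omitted here for brevity.)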
B. iSAX Representation and the ParIS Index

[iSAX Representation]
The iSAX representation (or summary) is based on the Piecewise Aggregate Approximation (PAA) representation [22], which divides the data series in segments of equal length, and uses the mean value of the points in each segment in order to summarize a data series. Figure 1(b) depicts an example of PAA representation with three segments (depicted with the black horizontal lines), for the data series depicted in Figure 1(a). Based on PAA, the indexable Symbolic Aggregate approXimation (iSAX) representation was proposed [16] (and later used in several different data series indices [6], [7], [11], [23], [24]). This method first divides the (y-axis) space in different regions, and assigns a bit-wise symbol to each region. In practice, the number of symbols is small: iSAX achieves very good approximations with as few as 256 symbols, the maximum alphabet cardinality, |alphabet|, which can be represented by eight bits [18]. It then represents each segment of the series with the symbol of the region the PAA falls into, forming the word 10_2 00_2 11_2 shown in Figure 1(c) (subscripts denote the number of bits used to represent the symbol of each segment).
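The discretization step above amounts to locating the region of each PAA value among sorted breakpoints. A minimal C sketch (illustrative, not the authors' code; the breakpoints are assumed precomputed, e.g., from the standard normal distribution):

    /* Map one PAA value to an iSAX symbol: the 2^bits regions are
     * separated by 2^bits - 1 sorted breakpoints; binary search
     * returns the region index, which is the symbol. */
    unsigned char paa_to_symbol(float paa, const float *breakpoints,
                                int bits) {
        int lo = 0, hi = (1 << bits) - 1;   /* symbols 0 .. 2^bits - 1 */
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (paa < breakpoints[mid]) hi = mid;
            else lo = mid + 1;
        }
        return (unsigned char)lo;
    }

With bits = 8, this yields the 256-symbol alphabet mentioned above.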
[ParIS Index] Based on the iSAX representation, the state-of-the-art ParIS index was developed [7], which proposed techniques and algorithms specifically designed for modern hardware and disk-based data. ParIS makes use of variable cardinalities for the iSAX summaries (i.e., variable degrees of precision for the symbol of each segment) in order to build a hierarchical tree index (see Figure 1(d)), consisting of three types of nodes: (i) the root node points to several children nodes, 2^w in the worst case (when the series in the collection cover all possible iSAX summaries); (ii) each inner node contains the iSAX summary of all the series below it, and has two children; and (iii) each leaf node contains the iSAX summaries of all the series inside it, and pointers to the raw data (in order to be able to prune false positives and produce exact, correct answers), which reside on disk. When the number of series in a leaf node becomes greater than the maximum leaf capacity, the leaf splits: it becomes an inner node and creates two new leaves, by increasing the cardinality of the iSAX summary of one of the segments (the one that will result in the most balanced split of the contents of the node to its two new children [6], [18]). The two refined iSAX summaries (new bit set to 0 and 1) are assigned to the two new leaves. In our example, the series of Figure 1(c) will be placed in the outlined node of the index (Figure 1(d)). Note that we define the distance of a query series to a node as the distance between the query (raw values, or iSAX summary) and the iSAX summary of the node.

In the index construction phase (see Figure 1(d)), ParIS uses a coordinator worker that reads raw data series from disk and transfers them into a raw data buffer in memory. A number of index bulk loading workers compute the iSAX summaries of these series, and insert <iSAX summary, file position> pairs in an array. They also insert a pointer to the appropriate element of this array in the receiving buffer of the corresponding subtree of the index root. When main memory is exhausted, the coordinator worker creates a number of index construction worker threads, each one assigned to one subtree of the root and responsible for further building that subtree (by processing the iSAX summaries stored in the corresponding receiving buffer). This process results in each iSAX summary being moved to the output buffer of the leaf it belongs to. When all iSAX summaries in the receiving buffer of an index construction worker have been processed, the output buffers of all leaves in that subtree are flushed to disk.

For query answering, ParIS offers a parallel implementation of the SIMS exact search algorithm [6]. It first computes an approximate answer by calculating the real distance between the query and the best candidate series, which is in the leaf with the smallest lower bound distance to the query. ParIS uses the index tree only for computing this approximate answer. Then, a number of lower bound calculation workers compute the lower bound distances between the query and the iSAX summary of each data series in the dataset, which are stored in the SAX array, and prune the series whose lower bound distance is larger than the approximate real distance computed earlier. The data series that are not pruned are stored in a candidate list for further processing. Subsequently, a number of real distance calculation workers operate on different parts of this array to compute the real distances between the query and the series stored in it (for which the raw values need to be read from disk). For details see [7]. In the in-memory version of ParIS, the raw data series are stored in an in-memory array.
Fig. 2. MESSI index construction and query answering.
Thus, there is no need for a coordinator worker. The bulk loading workers now operate directly on this array (split to as many chunks as the workers). In the rest of the paper, we use ParIS to refer to this in-memory version of the algorithm.

III. THE MESSI SOLUTION
Figure 2 depicts the MESSI index construction and query answering pipeline. The raw data are stored in memory into an array, called
RawData. This array is split into a predetermined number of chunks. A number, N_w, of index worker threads process the chunks to calculate the iSAX summaries of the raw data series they store. The number of chunks is not necessarily the same as N_w. Chunks are assigned to index workers one after the other (using Fetch&Inc). Based on the iSAX representation, we can figure out in which subtree of the index tree an iSAX summary will be stored. A number of iSAX buffers, one for each root subtree of the index tree, contain the iSAX summaries to be stored in that subtree. Each index worker stores the iSAX summaries it computes in the appropriate iSAX buffers. To reduce synchronization cost, each iSAX buffer is split into parts, and each worker works on its own part. (We have also tried an alternative technique where each buffer was protected by a lock and many threads were accessing each buffer. However, this resulted in worse performance, due to the encountered contention in accessing the iSAX buffers.) The number of iSAX buffers is usually a few tens of thousands and at most 2^w, where w is the number of segments in the iSAX summaries of each data series (w is fixed to 16 in this paper, as in previous studies [6], [7]).

When the iSAX summaries for all raw data series have been computed, the index workers proceed in the construction of the tree index. Each worker is assigned an iSAX buffer to work on (this is done again using Fetch&Inc). Each worker reads the data stored in (all parts of) its assigned buffer and builds the corresponding index subtree. Therefore, all index workers process distinct subtrees of the index, and can work in parallel and independently from one another, with no need for synchronization. (Parallelizing the processing inside each one of the index root subtrees would require a lot of synchronization due to node splitting.) When an index worker finishes with the current iSAX buffer it works on, it continues with the next iSAX buffer that has not yet been processed.

When the series in all iSAX buffers have been processed, the tree index has been built and can be used to answer similarity search queries, as depicted in the query answering phase of Fig. 2. To answer a query, we first perform a search for the query iSAX summary in the tree index. This returns a leaf whose iSAX summary has the closest distance to the iSAX summary of the query. We calculate the real distance of the (raw) data series pointed to by the elements of this leaf to the query series, and store the minimum of these distances into a shared variable, called BSF (Best-So-Far). Then, the index workers start traversing the index subtrees (one after the other), using BSF to decide which subtrees will be pruned. The leaves of the subtrees that cannot be pruned are placed into (a fixed number of) minimum priority queues, using the lower bound distance between the raw values of the query series and the iSAX summary of the leaf node, in order to be further examined. Each thread inserts elements in the priority queues in a round-robin fashion so that load balancing is achieved (i.e., all queues contain about the same number of elements). As soon as the necessary elements have been placed in the priority queues, each index worker chooses a priority queue to work on, and repeatedly calls DeleteMin() on it to get a leaf node, on which it performs the following operations. It first checks whether the lower bound distance stored in the priority queue is larger than the current BSF: if it is, then we are certain that the leaf node does not contain any series that can be part of the answer, and we can prune it; otherwise, the worker needs to examine the series contained in the leaf node, by first computing lower bound distances using the iSAX summaries, and if necessary also the real distances using the raw values. During this process, we may discover a series with a smaller distance to the query, in which case we also update the BSF. When a worker reaches a node whose distance is bigger than the BSF, it gives up this priority queue and starts working on another, because it is certain that all the other elements in the abandoned queue have an even higher distance to the query series. This process is repeated until all priority queues have been processed. During this process, the value of BSF is updated to always reflect the minimum distance seen so far. At the end of the calculation, the value of BSF is returned as the query answer.

Note that, similarly to ParIS, MESSI uses SIMD (Single-Instruction Multiple-Data) for calculating the distances of both the index iSAX summaries from the query iSAX summary (lower bound distance calculations), and the raw data series from the query data series (real distance calculations) [7].
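To make the chunk assignment above concrete, here is a minimal C sketch (an illustration under assumed names, not the MESSI source) of how a shared Fetch&Inc counter hands out chunks, so that no two workers ever process the same chunk:

    #include <stdatomic.h>

    atomic_long next_chunk = 0;           /* shared among index workers */

    void index_worker(long n_series, long chunk_size) {
        for (;;) {
            long c = atomic_fetch_add(&next_chunk, 1);   /* Fetch&Inc */
            long start = c * chunk_size;
            if (start >= n_series) break;                /* no chunks left */
            for (long j = start; j < start + chunk_size; j++) {
                /* compute the iSAX summary of series j and append it to
                 * this worker's private part of the matching iSAX buffer */
            }
        }
    }

The same pattern is reused later for assigning iSAX buffers to workers during tree construction, and root subtrees to workers during query answering.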
A. Index Construction

Algorithm 1: CreateIndex
Input: Index index, Integer N_w, Integer chunk_size
1: for i ← 0 to N_w − 1 do
2:   create a thread to execute an instance of IndexWorker(index, chunk_size, i, N_w);
3: wait for all these threads to finish their execution;

Algorithm 2: IndexWorker
Input: Index index, Integer chunk_size, Integer pid, Integer N_w
1: CalculateiSAXSummaries(index, chunk_size, pid);
2: barrier to synchronize the IndexWorkers with one another;
3: TreeConstruction(index, N_w);
4: exit();

Algorithm 1 presents the pseudocode for the initiator thread. The initiator creates N_w index worker threads to execute the index construction phase (line 2). As soon as these workers finish their execution, the initiator returns (line 3). We fix N_w to be 24 threads (Figure 9 in Section IV justifies this choice). We assume that the index variable is a structure (struct) containing the RawData array, all iSAX buffers, and a pointer to the root of the tree index. Recall that MESSI splits RawData into chunks of size chunk_size. We assume that the size of RawData is a multiple of chunk_size (if not, standard padding techniques can be applied).

The pseudocode for the index workers is in Algorithm 2. The workers first call the
CalculateiSAXSummaries function (line 1) to calculate the iSAX summaries of the raw data series and store them in the appropriate iSAX buffers. As soon as the iSAX summaries of all the raw data series have been computed (line 2), the workers call
TreeConstruction to construct the index tree.

The pseudocode of
CalculateiSAXSummaries is shown in Algorithm 3 and is schematically illustrated in Figure 3(a). Each index worker repeatedly does the following. It first performs a Fetch&Inc to get assigned a chunk of raw data series to work on (line 3). Then, it calculates the offset in the RawData array that this chunk resides (line 4), and starts processing the relevant data series (line 6). For each of them, it computes its iSAX summary by calling the ConvertToiSAX function (line 7), and stores the result in the appropriate iSAX buffer of index (lines 8-9). Recall that each iSAX buffer is split into N_w parts, one for each thread; thus, index.iSAXbuf is a two-dimensional array.

Each part of an iSAX buffer is allocated dynamically when the first element to be stored in it is produced. The size of each part has an initial small value (5 series in this work, as we discuss in the experimental evaluation), and it is adjusted dynamically based on how many elements are inserted in it (by doubling its size each time). We note that we also tried a design of MESSI with no iSAX buffers, but this led to slower performance (due to the worse cache locality). Thus, we do not discuss this alternative further.

As soon as the computation of the iSAX summaries is over, each index worker starts executing the TreeConstruction function. Algorithm 4 shows the pseudocode for this function, and Figure 3(b) schematically describes how it works.
Algorithm 3: CalculateiSAXSummaries
Input: Index index, Integer chunk_size, Integer pid
1: Shared integer F_c = 0;
2: while (TRUE) do
3:   b ← Atomically fetch and increment F_c;
4:   b = b ∗ chunk_size;
5:   if (b ≥ size of the index.RawData array) then break;
6:   for j ← b to b + chunk_size − 1 do
7:     isax = ConvertToiSAX(index.RawData[j]);
8:     ℓ = find appropriate root subtree where isax must be stored;
9:     append ⟨isax, j⟩ to index.iSAXbuf[ℓ][pid];

Algorithm 4: TreeConstruction
Input: Index index, Integer N_w
1: Shared integer F_b = 0;
2: while (TRUE) do
3:   b ← Atomically fetch and increment F_b;
4:   if (b ≥ 2^w) then break;  // the root has at most 2^w children
5:   for j ← 0 to N_w − 1 do
6:     for every ⟨isax, pos⟩ pair ∈ index.iSAXbuf[b][j] do
7:       targetLeaf ← Leaf of index tree to insert ⟨isax, pos⟩;
8:       while targetLeaf is full do
9:         SplitNode(targetLeaf);
10:        targetLeaf ← New leaf to insert ⟨isax, pos⟩;
11:      Insert ⟨isax, pos⟩ in targetLeaf;

In TreeConstruction, a worker repeatedly executes the following actions. It accesses F_b (using Fetch&Inc) to get assigned an iSAX buffer to work on (line 3). Then, it traverses all parts of the assigned buffer (lines 5-6) and inserts every ⟨iSAX summary, pointer to relevant data series⟩ pair stored there in the index tree (lines 7-11). Recall that the iSAX summaries contained in the same iSAX buffer will be stored in the same subtree of the index tree. So, no synchronization is needed among the index workers during this process. If a tree worker finishes its work on a subtree, a new iSAX buffer is (repeatedly) assigned to it, until all iSAX buffers have been processed.
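The dynamically-growing buffer parts described above can be sketched in C as follows (types and names are illustrative assumptions, not the MESSI source; error handling omitted). Because each (buffer, thread) part has a single writer, no locking is needed:

    #include <stdlib.h>

    typedef struct { unsigned char isax[16]; long pos; } isax_entry;
    typedef struct { isax_entry *data; long used, cap; } isax_part;

    void part_append(isax_part *p, const isax_entry *e) {
        if (p->cap == 0) {                    /* lazy first allocation */
            p->cap = 5;                       /* initial size: 5 series */
            p->data = malloc(p->cap * sizeof *p->data);
        } else if (p->used == p->cap) {       /* grow by doubling */
            p->cap *= 2;
            p->data = realloc(p->data, p->cap * sizeof *p->data);
        }
        p->data[p->used++] = *e;
    }

Lazy allocation matters here because 2^w buffers times N_w parts are reserved, but only the parts that actually receive summaries pay any memory cost.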
B. Query Answering

The pseudocode for executing an exact search query is shown in Algorithm 5. We first calculate the iSAX summary of the query (line 2), and execute an approximate search (line 3) to find the initial value of BSF, i.e., a first upper bound on the actual distance between the query and the series indexed by the tree. This process is illustrated in Figure 4(a).

Algorithm 5: ExactSearch
Input: QuerySeries QDS, Index index, Integer N_q
1: Shared float BSF;
2: QDS_iSAX = calculate iSAX summary for QDS;
3: BSF = approxSearch(QDS_iSAX, index);
4: for i ← 0 to N_q − 1 do
5:   queue[i] = Initialize the i-th priority queue;
6: for i ← 0 to N_s − 1 do
7:   create a thread to execute an instance of SearchWorker(QDS, index, queue[], i, N_q);
8: Wait for all threads to finish;
9: return (BSF);

Fig. 3. Workflow and algorithms for MESSI index creation: (a) CalculateiSAXSummaries; (b) TreeConstruction.

During a search query, the index tree is traversed, and the distance of the iSAX summary of each of the visited nodes to the iSAX summary of the query is calculated. If the distance of the iSAX summary of a node, nd, to the query iSAX summary is higher than BSF, then we are certain that the distances of all data series indexed by the subtree rooted at nd are higher than BSF. So, the entire subtree can be pruned. Otherwise, we go down the subtree, and the leaves with a distance to the query smaller than the BSF are inserted in the priority queue.

The technique of using priority queues maximizes the pruning degree, thus resulting in a relatively small number of raw data series whose real distance to the query series must be calculated. As a side effect, BSF converges fast to the correct value. Thus, the number of iSAX summaries that are tested against the iSAX summary of the query series is also reduced.

Algorithm 5 creates N_s = 48 threads, called the search workers (lines 6-7), which perform the computation described above by calling SearchWorker. It also creates N_q ≥ 1 priority queues (lines 4-5), where the search workers place those data series that are potential candidates for real distance calculation. After all search workers have finished (line 8), ExactSearch returns the current value of BSF (line 9).
We have experimented with two different settings regarding the number of priority queues, N_q, that the search workers use. The first, called Single-Queue (SQ), refers to N_q = 1, whereas the second focuses on the Multiple-Queue (MQ) case, where N_q > 1. Using a single shared queue imposes a high synchronization overhead, whereas using a local queue per thread results in severe load imbalance, since, depending on the workload, the size of the different queues may vary significantly. Thus, we choose to use N_q shared queues, where N_q > 1 is a fixed number (in our analysis, N_q is set to 24, as our experiments show that this is the best choice).

Algorithm 6: SearchWorker
Input: QuerySeries QDS, Index index, Queue queue[], Integer pid, Integer N_q
1: Shared integer N_b = 0;
2: q = pid mod N_q;
3: while (TRUE) do
4:   i ← Atomically fetch and increment N_b;
5:   if (i ≥ 2^w) then break;
6:   TraverseRootSubtree(QDS, index.rootnode[i], queue[], &q, N_q);
7: Barrier to synchronize the search workers with one another;
8: q = pid mod N_q;
9: while (true) do
10:   ProcessQueue(QDS, index, queue[q]);
11:   if all queue[].finished = true then
12:     break;
13:   q ← index such that queue[q] has not been processed yet;

The pseudocode of search workers is shown in Algorithm 6, and the work they perform is illustrated in Figures 4(b) and 4(c). At each point in time, each thread works on a single queue. Initially, each queue is shared by two threads. Each search worker first identifies the queue where it will perform its first insertion (line 2). Then, it repeatedly chooses (using Fetch&Inc) a root subtree of the index tree to work on by calling
TraverseRootSubtree (line 6). After all root subtrees have been processed (line 7), it repeatedly chooses a priority queue (lines 8, 13) and works on it by calling ProcessQueue (line 10). Each element of the queue array has a field, called finished, which indicates whether the processing of the corresponding priority queue has been finished. As soon as a search worker determines that all priority queues have been processed (line 12), it terminates.

We continue to describe the pseudocode for TraverseRootSubtree, which is presented in Algorithm 7 and illustrated in Figure 4(b).
Algorithm 7: TraverseRootSubtree
Input: QuerySeries QDS, Node node, Queue queue[], Integer *pq, Integer N_q
1: nodedist = FindDist(QDS, node);
2: if nodedist > BSF then
3:   return;
4: else if node is a leaf then
5:   acquire queue[*pq] lock;
6:   Put node in queue[*pq] with priority nodedist;
7:   release queue[*pq] lock;
8:   // next time, insert in the subsequent queue
9:   *pq ← (*pq + 1) mod N_q;
10: else
11:   TraverseRootSubtree(QDS, node.leftChild, queue[], pq, N_q);
12:   TraverseRootSubtree(QDS, node.rightChild, queue[], pq, N_q);

TraverseRootSubtree is recursive. On each internal node, nd, it checks whether the (lower bound) distance of the iSAX summary of nd to the raw values of the query (line 1) is smaller than the current BSF, and if it is, it examines the two subtrees of the node using recursion (lines 11-12). If the traversed node is a leaf node and its distance to the iSAX summary of the query series is smaller than the current BSF (lines 4-9), it places it in the appropriate priority queue (line 6). Recall that the priority queues are accessed in a round-robin fashion (line 9). This strategy maintains the size of the queues balanced, and reduces the synchronization cost of node insertions to the queues.
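A hedged C sketch of this round-robin insertion (illustrative names; pq_insert is an assumed priority-queue routine, not from the paper's code):

    #include <pthread.h>

    #define NQ 24                       /* number of shared queues */
    extern pthread_mutex_t qlock[NQ];   /* one lock per queue */
    void pq_insert(void *pq, void *leaf, float priority); /* assumed */

    /* Insert a non-pruned leaf into the worker's current queue, then
     * advance the worker's private queue index q (round-robin). */
    void insert_round_robin(void *queues[NQ], int *q,
                            void *leaf, float lb_dist) {
        pthread_mutex_lock(&qlock[*q]);
        pq_insert(queues[*q], leaf, lb_dist);  /* priority = lower bound */
        pthread_mutex_unlock(&qlock[*q]);
        *q = (*q + 1) % NQ;
    }

Each insertion touches a different queue than the previous one, which both balances the queue sizes and spreads lock acquisitions across the NQ locks.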
We implement this strategy by (1) passing a pointer to the local variable q of SearchWorker as an argument to TraverseRootSubtree, (2) using the current value of q for choosing the next queue to perform an insertion (line 6), and (3) updating the value of q (line 9). Each queue may be accessed by more than one thread, so a lock per queue is used to protect its concurrent access by multiple threads.

We next describe how ProcessQueue works (see Algorithm 8 and Figure 4(c)). The search worker repeatedly removes the (leaf) node, nd, with the highest priority from the priority queue, and checks whether the corresponding distance stored in the queue is still less than the BSF. We do so because the BSF may have changed since the time that the leaf node was inserted in the priority queue. If the distance is less than the BSF, then CalculateRealDistance (line 3) is called, in order to identify if any series in the leaf node (pointed to by nd) has a real distance to the query that is smaller than the current BSF. If we discover such a series (line 4), BSF is updated to the new value (line 6). We use a lock to protect BSF from concurrent update efforts (lines 5, 7). Previous experiments showed that the initial value of BSF is very close to its final value [25]. Indeed, in our experiments, the BSF is updated only 10-12 times (on average) per query. So, the synchronization cost for updating the BSF is negligible.
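The BSF update can be sketched in C as follows (a minimal illustration, not the paper's code); note the re-check under the lock, so that a concurrent improvement by another worker is never overwritten with a larger value:

    #include <pthread.h>

    extern float BSF;                   /* shared best-so-far distance */
    extern pthread_mutex_t bsf_lock;

    void update_bsf(float real_dist) {
        pthread_mutex_lock(&bsf_lock);
        if (real_dist < BSF)            /* re-check after acquiring lock */
            BSF = real_dist;
        pthread_mutex_unlock(&bsf_lock);
    }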
In Algorithm 9, we depict the pseudocode for CalculateRealDistance. Note that we perform the real distance calculation using SIMD. However, the use of SIMD does not have the same significant impact in performance as in ParIS [7]. This is because pruning is much more effective in MESSI, since for each candidate series in the examined leaf node, CalculateRealDistance first performs a lower bound distance calculation, and proceeds to the real distance calculation only if necessary (line 3). Therefore, the number of (raw) data series to be examined is limited in comparison to those examined in ParIS (we quantify the effect of this new design in our experimental evaluation).
Algorithm 8: ProcessQueue
Input: QuerySeries QDS, Index index, Queue Q
1: while node = DeleteMin(Q) do
2:   if node.dist < BSF then
3:     realDist = CalculateRealDistance(QDS, index, node);
4:     if realDist < BSF then
5:       acquire BSF lock;
6:       BSF = realDist;
7:       release BSF lock;
8:   else
9:     Q.finished = true;
10:    break;

Algorithm 9: CalculateRealDistance
Input: QuerySeries QDS, Index index, Node node, float BSF
1: for every ⟨isax, pos⟩ pair ∈ node do
2:   if LowerBound_SIMD(QDS, isax) < BSF then
3:     dist = RealDist_SIMD(index.RawData[pos], QDS);
4:     if dist < BSF then
5:       BSF = dist;
6: return (BSF);

IV. EXPERIMENTAL EVALUATION
In this section, we present our experimental evaluation. We use synthetic and real datasets in order to compare the performance of MESSI with that of competitors that have been proposed in the literature, and baselines that we developed. We demonstrate that, under the same settings, MESSI is able to construct the index up to 4.2x faster, and answer similarity search queries up to 11.2x faster than the competitors. Overall, MESSI exhibits a robust performance across different datasets and settings, and enables for the first time the exploration of very large data series collections at interactive speeds.
A. Setup
We used a server with 2x Intel Xeon E5-2650 v4 2.2GHz CPUs (12 cores/24 hyper-threads each) and 256GB RAM. All algorithms were implemented in C, and compiled using GCC v6.2.0 on Ubuntu Linux v16.04.

[Algorithms]
Fig. 4. Workflow and algorithms for MESSI query answering: (a) approximate search for calculating the first BSF; (b) tree traversal and node insertion in priority queues; (c) node distance calculation from priority queues.

We compared MESSI to the following algorithms: (i) ParIS [7], the state-of-the-art modern hardware data series index. (ii) ParIS-TS, our extension of ParIS, where we implemented in a parallel fashion the traditional tree-based exact search algorithm [16]. In brief, this algorithm traverses the tree, and concurrently (1) inserts in the priority queue the nodes (inner nodes or leaves) that cannot be pruned based on the lower bound distance, and (2) pops from the queue nodes for which it calculates the real distances to the candidate series [16]. In contrast, MESSI (a) first makes a complete pass over the index using lower bound distance computations and then proceeds with the real distance computations; (b)
8. Output the answer (c) Node distance calculation from priority queuesFig. 4. Workflow and algorithms for MESSI query answering it only considers the leaves of the index for insertion in thepriority queue(s); and (c) performs a second filtering step usingthe lower bound distances when popping elements from thepriority queue (and before computing the real distances). Theperformance results we present later justify the choices wehave made in MESSI, and demonstrate that a straight-forwardimplementation of tree-based exact search leads to sub-optimalperformance. (iii) UCR Suite-P, our parallel implementationof the state-of-the-art optimized serial scan technique, UCRSuite [15]. In UCR Suite-P, every thread is assigned a part ofthe in-memory data series array, and all threads concurrentlyand independently process their own parts, performing thereal distance calculations in SIMD, and only synchronize atthe end to produce the final result. (We do not consider thenon-parallel UCR Suite version in our experiments, since itis almost 300x slower.) All algorithms operated exclusively inmain memory (the datasets were already loaded in memory,as well). The code for all algorithms used in this paper isavailable online [26]. [Datasets]
In order to evaluate the performance of the proposed approach, we use several synthetic datasets for a fine grained analysis, and two real datasets from diverse domains. Unless otherwise noted, the series have a size of 256 points, which is a standard length used in the literature, and allows us to compare our results to previous work. We used synthetic datasets of sizes 50GB-200GB (with a default size of 100GB), and a random walk data series generator that works as follows: a random number is first drawn from a Gaussian distribution N(0,1), and then at each time point a new number is drawn from this distribution and added to the value of the last number. This kind of data generation has been extensively used in the past (and has been shown to model real-world financial data) [6], [16]–[18], [27]. We used the same process to generate 100 query series.
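A minimal C sketch of this generator (illustrative; the Box-Muller transform stands in for whatever Gaussian sampler the actual generator uses):

    #include <stdlib.h>
    #include <math.h>

    /* Fill s[0..n-1] with a random walk: each value adds a fresh
     * N(0,1) step to the previous value. */
    void random_walk(float *s, int n) {
        double v = 0.0;
        for (int i = 0; i < n; i++) {
            double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
            double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
            v += sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2); /* N(0,1) */
            s[i] = (float)v;
        }
    }

The generated series would then typically be z-normalized before indexing, as discussed in Section II.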
For our first real dataset, Seismic, we used the IRIS Seismic Data Access repository [28] to gather 100M series representing seismic waves from various locations, for a total size of 100GB. The second real dataset, SALD, includes neuroscience MRI data series [29], for a total of 200M series of size 128, of size 100GB. In both cases, we used as queries 100 series out of the datasets (chosen using our synthetic series generator). In all cases, we repeated the experiments 10 times and we report the average values. We omit reporting the error bars, since all runs gave results that were very similar (less than 3% difference). Queries were always run in a sequential fashion, one after the other, in order to simulate an exploratory analysis scenario, where users formulate new queries after having seen the results of the previous one.
B. Parameter Tuning Evaluation
In all our experiments, we use 24 index workers and 48 search workers. We have chosen the chunk size to be 20MB (corresponding to 20K series of length 256 points). Each part of any iSAX buffer initially holds a small constant number of data series, but its size changes dynamically depending on how many data series it needs to store. The capacity of each leaf of the index tree is 2000 data series (2MB). For query answering, MESSI-mq utilizes 24 priority queues (whereas MESSI-sq utilizes just one priority queue). In either case, each priority queue is implemented using an array whose size changes dynamically, based on how many elements must be stored in it. Below we present the experiments that justify the choices for these parameters.

Fig. 5. Index creation, vs. chunk size.
Fig. 6. Index creation, vs. leaf size.
Fig. 7. Query answering, vs. leaf size.
Fig. 8. Index creation, vs. initial iSAX buffer size.
Fig. 9. Index creation, varying number of cores.

Figure 5 illustrates the time it takes MESSI to build the tree index for different chunk sizes on a random dataset of 100GB. The required time to build the index decreases when the chunk size is small, and does not have any big influence in performance after the value of 1K (data series). Smaller chunk sizes than 1K result in high contention when accessing the fetch&increment object used to assign chunks to index workers. In our experiments, we have chosen a size of 20K, as this gives slightly better performance than setting it to 1K.

Figures 6 and 7 show the impact that varying the leaf size of the tree index has in the time needed for the index creation and for query answering, respectively. As we see in Figure 6, the larger the leaf size is, the faster index creation becomes. However, once the leaf size becomes 5K or more, this time improvement is insignificant. On the other hand, Figure 7 shows that the query answering time takes its minimum value when the leaf size is set to 2K (data series). So, we have chosen this value for our experiments.

Figure 7 indicates that the influence of varying the leaf size is significant for query answering. Note that when the leaf size is small, there are more leaf nodes in the index tree, and therefore, it is highly probable that more nodes will be inserted in the queues, and vice versa. On the other hand, as the leaf size increases, the number of real distance calculations that are performed to process each one of the leaves in the queue is larger. This causes load imbalance among the different search workers that process the priority queues. For these reasons, we see that at the beginning the time goes down as the leaf size increases, it reaches its minimum value for leaf size 2K series, and then it goes up again as the leaf size further increases.

Figure 8 shows the influence of the initial iSAX buffer size during index creation. This initialization cost is not negligible, given that we allocate 2^w iSAX buffers, each consisting of 24 parts (recall that 24 is the number of index workers in the system). As expected, the figure illustrates that smaller initial sizes for the buffers result in better performance. We have chosen the initial size of each part of the iSAX buffers to be a small constant number of data series. (We also considered an alternative design that collects statistics and allocates the iSAX buffers right from the beginning, but it was slower.)

We finally justify the choice of using more than one priority queue for query answering. As Figure 11 shows, MESSI-mq and MESSI-sq have similar performance when the number of threads is smaller than 24. However, as we go from 24 to 48 cores, the synchronization cost for accessing the single priority queue in MESSI-sq has a negative impact in performance. Figure 13 presents the breakdown of the query answering time for these two algorithms. The figure shows that in MESSI-mq, the time needed to insert and remove nodes from the list is significantly reduced. As expected, the time needed for the real distance calculations and for the tree traversal are about the same in both algorithms. This has the effect that the time needed for the distance calculations becomes the dominant factor. The figure also illustrates the percentage of time that goes on each of these tasks. Finally, Figure 14 illustrates the impact that the number of priority queues has in query answering performance. As the number of priority queues increases, the time goes down, and it takes its minimum value when this number becomes 24. So, we have chosen this value for our experiments.

C. Comparison to Competitors

[Index Creation]
Figure 9 compares the index creation time of MESSI with that of ParIS as the number of cores increases, for a dataset of 100GB. The time MESSI needs for index creation is significantly smaller than that of ParIS. Specifically, MESSI is 3.5x faster than ParIS. The main reasons for this are, on the one hand, that MESSI exhibits lower contention cost when accessing the iSAX buffers in comparison to the corresponding cost paid by ParIS, and on the other hand, that MESSI achieves better load balancing when performing the computation of the iSAX summaries from the raw data series. Note that due to synchronization cost, the performance improvement that both algorithms exhibit decreases as the number of cores increases; this trend is more prominent in ParIS, while MESSI manages to exploit to a larger degree the available hardware.

In Figure 10, we depict the index creation time as the dataset size grows from 50GB to 200GB. We observe that MESSI performs up to 4.2x faster than ParIS (for the 200GB dataset), with the improvement becoming larger with the dataset size.

[Query Answering] Figure 11 compares the performance of the MESSI query answering algorithm to its competitors, as the number of cores increases, for a random dataset of 100GB (y-axis in log scale).
Fig. 10. Index creation, vs. data size.
Fig. 11. Query answering, vs. number of cores.
Fig. 12. Query answering, vs. data size.
Fig. 13. Query answering with different queue types.
Fig. 14. Query answering, vs. number of queues.
The results show that both MESSI-sq and MESSI-mq perform much better than all the other algorithms. Note that the performance of MESSI-mq is better than that of MESSI-sq, so when we mention MESSI in our comparison below, we refer to MESSI-mq. MESSI is 55x faster than UCR Suite-P and 6.35x faster than ParIS when we use 48 threads (with hyperthreading). In contrast to ParIS, MESSI applies pruning when performing the lower bound distance calculations, and therefore it executes this phase much faster.
Fig. 15. Index creation for real datasets.
Fig. 16. Query answering for real datasets.
Fig. 17. Number of distance calculations: (a) lower bound distance calculations; (b) real distance calculations.
Moreover, the use of the priority queues results in even higher pruning power. As a side effect, MESSI also performs fewer real distance calculations than ParIS. Note that UCR Suite-P does not perform any pruning, thus resulting in a much lower performance than the other algorithms.

Figure 12 shows that this superior performance of MESSI is exhibited for different dataset sizes as well. Specifically, MESSI is up to 61x faster than UCR Suite-P (for 200GB), up to 6.35x faster than ParIS (for 100GB), and up to 7.4x faster than ParIS-TS (for 50GB).

[Performance Benefit Breakdown]
Given the above results, we now evaluate several of the design choices of MESSI in isolation. Note that some of our design decisions stem from the fact that in our index the root node has a large number of children. Thus, the same design ideas are applicable to the iSAX family of indices [4] (e.g., iSAX2+, ADS+, ULISSE). Other indices, however [4], use a binary tree (e.g., DSTree), or a tree with a very small fanout (e.g., SFA trie, M-tree), so new design techniques are required for efficient parallelization. However, some of our techniques, e.g., the use of (more than one) priority queue, the use of SIMD, and some of the data structures designed to reduce the synchronization cost, can be applied to all other indices. Figure 18 shows the results for the query answering performance. The leftmost bar (ParIS-SISD) shows the performance of ParIS when SIMD is not used. By employing SIMD, ParIS becomes 60% faster than ParIS-SISD. We then measure the performance for ParIS-TS, which is about 10% faster than ParIS. This performance improvement comes from the fact that using the index tree (instead of the SAX array that ParIS uses) to prune the search space and determine the data series for which a real distance calculation must be performed significantly reduces the number of lower bound distance calculations. ParIS calculates lower bound distances for all the data series in the collection, and pruning is performed only when calculating real distances, whereas in ParIS-TS pruning occurs when calculating lower bound distances as well.

MESSI-mq further improves performance by only inserting in the priority queue leaf nodes (thus, reducing the size of the queue), and by using multiple queues (thus, reducing the synchronization cost). This makes MESSI-mq 83% faster than ParIS-TS.

[Real Datasets]
Figures 15 and 16 reaffirm that MESSI exhibits the best performance for both index creation and query answering, even when executing on the real datasets, SALD and Seismic (for a 100GB dataset). The reasons for this are those explained in the previous paragraphs. Regarding index creation, MESSI is 3.6x faster than ParIS on SALD and 3.7x faster than ParIS on Seismic, for a 100GB dataset. Moreover, for SALD, MESSI query answering is 60x faster than UCR Suite-P and 8.4x faster than ParIS, whereas for Seismic, it is 80x faster than UCR Suite-P, and almost 11x faster than ParIS. Note that MESSI exhibits better performance than UCR Suite-P in the case of real datasets. This is so because working on random data results in better pruning than that on real data.

Figures 17(a) and 17(b) illustrate the number of lower bound and real distance calculations, respectively, performed by the different query algorithms on the three datasets. ParIS calculates the distance between the iSAX summaries of every single data series and the query series (because, as we discussed in Section II, it implements the SIMS strategy for query answering). In contrast, MESSI performs pruning even during the lower bound distance calculations, resulting in much less time for executing this computation. Moreover, this results in a significantly reduced number of data series whose real distance to the query series must be calculated. The use of the priority queues leads to even fewer real distance calculations, because they help the BSF converge faster to its final value. MESSI performs no more than 15% of the lower bound distance calculations performed by ParIS.

[MESSI with DTW]
In our final experiments, we demonstrate that MESSI not only accelerates similarity search based on Euclidean distance, but can also be used to significantly accelerate similarity search using the Dynamic Time Warping (DTW) distance measure [30]. We note that no changes are required in the index structure; we just have to build the envelope of the LB_Keogh method [31] around the query series, and then search the index using this envelope. Figure 19 shows the query answering time for different dataset sizes (we use a warping window size of 10% of the query series length, which is commonly used in practice [31]). The results show that MESSI-DTW is up to 34x faster than UCR Suite-P DTW (and more than 3 orders of magnitude faster than the non-parallel version of UCR Suite DTW).
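For reference, a simple C sketch of the LB_Keogh query envelope (a naive O(n·r) construction; real systems often use a streaming max/min algorithm instead):

    /* For warping window r, upper[i]/lower[i] are the max/min of the
     * query q within positions i-r .. i+r (clamped to [0, n-1]). */
    void lb_keogh_envelope(const float *q, int n, int r,
                           float *upper, float *lower) {
        for (int i = 0; i < n; i++) {
            float hi = q[i], lo = q[i];
            int from = (i - r < 0) ? 0 : i - r;
            int to   = (i + r > n - 1) ? n - 1 : i + r;
            for (int j = from; j <= to; j++) {
                if (q[j] > hi) hi = q[j];
                if (q[j] < lo) lo = q[j];
            }
            upper[i] = hi;
            lower[i] = lo;
        }
    }

The index is then searched with this envelope in place of the raw query, which keeps the tree and its summaries unchanged, as noted above.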
Fig. 18. Query answering per-formance benefit breakdown T i m e ( S e c ond s ) Data Size/GB
UCR Suite DTWUCR Suite−p DTWMESSI DTW
Fig. 19. MESSI query answering timefor DTW distance (synthetic data, 10%warping window) that MESSI-DTW is up to 34x faster than UCR Suite-p DTW(and more than 3 orders of magnitude faster than the non-paralell version of UCR Suite DTW).V. R
ELATED W ORK
Various dimensionality reduction techniques exist for dataseries, which can then be scanned and filtered [32], [33] orindexed and pruned [6], [7], [11], [16], [17], [23], [24], [34],[35] during query answering. We follow the same approachof indexing the series based on their summaries, though ourwork is the first to exploit the parallelization opportunitiesoffered by modern hardware, in order to accelerate in-memoryindex construction and similarity search for data series. Thework closest to ours is ParIS [7], which also exploits modernhardware, but was designed for disk-resident datasets. Wediscussed this work in more detail in Section II.FastQuery is an approach used to accelerate search oper-ations in scientific data [36], based on the construction ofbitmap indices. In essence, the iSAX summarization used inour approach is an equivalent solution, though, specificallydesigned for sequences (which have high dimensionalities).The interest in using SIMD instructions for improving theperformance of data management solutions is not new [37].However, it is only more recently that relatively complexalgorithms were extended in order to take advantage of thishardware characteristic. Polychroniou et al. [38] introduceddesign principles for efficient vectorization of in-memorydatabase operators (such as selection scans, hash tables, andpartitioning). For data series in particular, previous work hasused SIMD for Euclidean distance computations [21]. Follow-ing [7], in our work we use SIMD both for the computation ofEuclidean distances, as well as for the computation of lowerbounds, which involve branching operations.Multi-core CPUs offer thread parallelism through multiplecores and simultaneous multi-threading (SMT). Thread-LevelParallelism (TLP) methods, like multiple independent coresand hyper-threads are used to increase efficiency [39].A recent study proposed a high performance temporal indexsimilar to time-split B-tree (TSB-tree), called TSBw-tree,which focuses on transaction time databases [40]. Binna etal. [41], present the Height Optimized Trie (HOT), a general-purpose index structure for main-memory database systems,hile Leis et al. [42] describe an in-memory adaptive Radixindexing technique that is designed for modern hardware.Xie et al. [43], study and analyze five recently proposedindices, i.e., FAST, Masstree, BwTree, ART and PSL andidentify the effectiveness of common optimization techniques,including hardware dependent features such as SIMD, NUMAand HTM. They argue that there is no single optimizationstrategy that fits all situations, due to the differences in thedataset and workload characteristics. Moreover, they pointout the significant performance gains that the exploitationof modern hardware features, such as SIMD processing andmultiple cores bring to in-memory indices.We note that the indices described above are not suitablefor data series (that can be thought of as high-dimensionaldata), which is the focus of our work, and which pose veryspecific data management challenges with their hundreds, orthousands of dimensions (i.e., the length of the sequence).Techniques specifically designed for modern hardware andin-memory operation have also been studied in the context ofadaptive indexing [44], and data mining [45].VI. C
ONCLUSIONS
We proposed MESSI, a data series index designed for in-memory operation by exploiting the parallelism opportunitiesof modern hardware. MESSI is up to 4x faster in indexconstruction and up to 11x faster in query answering than thestate-of-the-art solution, and is the first technique to answerexact similarity search queries on 100GB datasets in ∼ Acknowledgments
Work supported by Chinese ScholarshipCouncil, FMJH Program PGMO, EDF, Thales and HIPEAC4. Part of work performed while P. Fatourou was visitingLIPADE, and while B. Peng was visiting CARV, FORTH ICS.R
EFERENCES[1] T. Palpanas, “Data series management: The road to big sequenceanalytics,”
SIGMOD Record , 2015.[2] K. Zoumpatianos and T. Palpanas, “Data series management: Fulfillingthe need for big sequence analytics,” in
ICDE , 2018.[3] T. Palpanas and V. Beckmann, “Report on the first and second interdisci-plinary time series analysis workshop (itisa),”
SIGMOD Rec. , ”Acceptedfor publication, 2019.[4] K. Echihabi, K. Zoumpatianos, T. Palpanas, and H. Benbrahim, “Thelernaean hydra of data series similarity search: An experimental evalu-ation of the state of the art,”
PVLDB , 2018.[5] ——, “Return of the lernaean hydra: Experimental evaluation of dataseries approximate similarity search,”
PVLDB , 2019.[6] K. Zoumpatianos, S. Idreos, and T. Palpanas, “Ads: the adaptive dataseries index,”
VLDB J. , 2016.[7] B. Peng, T. Palpanas, and P. Fatourou, “Paris: The next destination forfast data series indexing and query answering,”
IEEE BigData , 2018.[8] J.-D. Fekete and R. Primet, “Progressive analytics: A computationparadigm for exploratory data analysis,”
CoRR , 2016.[9] A. Guillaume, “Head of Operational Intelligence Department Airbus.Personal communication.” 2017.[10] T. Rakthanmanon, E. J. Keogh, S. Lonardi, and S. Evans, “Time seriesepenthesis: Clustering time series streams requires ignoring some data,”in
ICDM , 2011, pp. 547–556.[11] J. Shieh and E. Keogh, “iSAX: disk-aware mining and indexing ofmassive time series datasets,”
DMKD , no. 1, 2009. [12] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,”
CSUR , 2009.[13] A. Mueen, E. J. Keogh, Q. Zhu, S. Cash, M. B. Westover, and N. B.Shamlo, “A disk-aware algorithm for time series motif discovery,”
DAMI , 2011.[14] R. Agrawal, C. Faloutsos, and A. N. Swami, “Efficient similarity searchin sequence databases,” in
FODO , 1993.[15] T. Rakthanmanon, B. J. L. Campana, A. Mueen, G. E. A. P. A. Batista,M. B. Westover, Q. Zhu, J. Zakaria, and E. J. Keogh, “Searchingand mining trillions of time series subsequences under dynamic timewarping,” in
SIGKDD , 2012.[16] J. Shieh and E. Keogh, “i sax: indexing and mining terabyte sized timeseries,” in
SIGKDD , 2008.[17] Y. Wang, P. Wang, J. Pei, W. Wang, and S. Huang, “A data-adaptiveand dynamic segmentation index for whole matching on time series,”
VLDB , 2013.[18] A. Camerra, J. Shieh, T. Palpanas, T. Rakthanmanon, and E. Keogh,“Beyond One Billion Time Series: Indexing and Mining Very LargeTime Series Collections with iSAX2+,”
KAIS , vol. 39, no. 1, 2014.[19] A. Mueen, S. Nath, and J. Liu, “Fast approximate correlation for massivetime-series data,” in
SIGMOD , 2010.[20] C. Lomont, “Introduction to intel advanced vector extensions,”
IntelWhite Paper , 2011.[21] B. Tang, M. L. Yiu, Y. Li et al. , “Exploit every cycle: Vectorized timeseries algorithms on modern commodity cpus,” in
IMDM , 2016.[22] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, “Dimensionalityreduction for fast similarity search in large time series databases,”
KAIS ,2001.[23] H. Kondylakis, N. Dayan, K. Zoumpatianos, and T. Palpanas, “Co-conut: A scalable bottom-up approach for building data series indexes,”
PVLDB , 2018.[24] M. Linardi and T. Palpanas, “Scalable, variable-length similarity searchin data series: The ulisse approach,”
PVLDB , 2019.[25] A. Gogolou, T. Tsandilas, T. Palpanas, and A. Bezerianos, “Progressivesimilarity search on time series data,” in
EDBT , 2019.[26] http://helios.mi.parisdescartes.fr/ themisp/messi/, 2019.[27] B.-K. Yi and C. Faloutsos, “Fast time sequence indexing for arbitrarylp norms,” in
VLDB . Citeseer, 2000.[28] “Incorporated Research Institutions for Seismology – Seismic DataAccess,” http://ds.iris.edu/data/access/, 2016.[29] “Southwest university adult lifespan dataset (sald),” http://fcon 1000.projects.nitrc.org/indi/retro/sald.html, 2018.[30] D. J. Berndt and J. Clifford, “Using dynamic time warping to findpatterns in time series.” in
AAAIWS , 1994.[31] E. Keogh and C. A. Ratanamahatana, “Exact indexing of dynamic timewarping,”
Knowledge and information systems , 2005.[32] S. Kashyap and P. Karras, “Scalable knn search on vertically stored timeseries,” in
SIGKDD , 2011, pp. 1334–1342.[33] C. Li, P. S. Yu, and V. Castelli, “Hierarchyscan: A hierarchical similaritysearch algorithm for databases of long sequences,” in
ICDE , 1996.[34] A. Guttman, “R-trees: A dynamic index structure for spatial searching,”in
SIGMOD , 1984, pp. 47–57.[35] I. Assent, R. Krieger, F. Afschari, and T. Seidl, “The ts-tree: efficienttime series search and retrieval,” in
EDBT , 2008.[36] J. Chou, K. Wu et al. , “Fastquery: A parallel indexing system forscientific data,” in
CLUSTER . IEEE, 2011, pp. 455–464.[37] J. Zhou and K. A. Ross, “Implementing database operations using simdinstructions,” in
SIGMOD . ACM, 2002.[38] O. Polychroniou, A. Raghavan, and K. A. Ross, “Rethinking simdvectorization for in-memory databases,” in
SIGMOD . ACM, 2015.[39] P. Gepner and M. F. Kowalik, “Multi-core processors: New way toachieve high system performance,” in
PAR ELEC , 2006.[40] D. B. Lomet and F. Nawab, “High performance temporal indexing onmodern hardware,” in
ICDE , 2015.[41] R. Binna, E. Zangerle, M. Pichl, G. Specht, and V. Leis, “Hot: A heightoptimized trie index for main-memory database systems,” in
SIGMOD .ACM, 2018.[42] V. Leis, A. Kemper, and T. Neumann, “The adaptive radix tree: Artfulindexing for main-memory databases.” in
ICDE , 2013.[43] Z. Xie, Q. Cai, G. Chen, R. Mao, and M. Zhang, “A comprehensiveperformance evaluation of modern in-memory indices,” in
ICDE , 2018.[44] V. Alvarez, F. M. Schuhknecht, J. Dittrich, and S. Richter, “Mainmemory adaptive indexing for multi-core systems,” in
DaMoN , 2014.45] S. Tatikonda and S. Parthasarathy, “An adaptive memory consciousapproach for mining frequent trees: implications for multi-core archi-tectures,” in