Coconut Palm: Static and Streaming Data Series Exploration Now in your Palm
Haridimos Kondylakis, Niv Dayan, Kostas Zoumpatianos, Themis Palpanas
Haridimos Kondylakis
FORTH-ICS
Niv Dayan
Harvard University
Kostas Zoumpatianos
Harvard University
Themis Palpanas
Paris Descartes University
ABSTRACT
Many modern applications produce massive streams of data series and maintain them in indexes to be able to explore them through nearest neighbor search. Existing data series indexes, however, are expensive to operate as they issue many random I/Os to storage. To address this problem, we recently proposed Coconut, a new infrastructure that organizes data series based on a new sortable format. In this way, Coconut is able to leverage, for the first time, state-of-the-art indexing techniques that rely on sorting to build, maintain and query data series indexes using fast sequential I/Os. In this demonstration, we present Coconut Palm, a new exploration tool that allows users to interactively combine different indexing techniques from within the Coconut infrastructure and thereby seamlessly explore data series from across various scientific domains. We highlight the rich indexing design choices that Coconut opens up, and we present a new recommender tool that allows users to intelligently navigate them for both static and streaming data exploration scenarios.
Data Series Exploration.
Many applications today, ranging from finance and multimedia to astronomy, produce rapid streams of data series. To enable domain experts to monitor and explore these series as they are created and streamed (e.g., to detect anomalies or discover patterns), it is crucial to be able to efficiently perform similarity search against user-specified query targets.
Indexable Data Series Summarizations.
Performing similarity search by comparing a target query against every individual data series becomes intractable as data sizes grow. To address this problem, modern techniques compress data series into smaller summarizations that allow approximating the distance to the target, and they index these summarizations such that grossly dissimilar data series can be pruned out of the search [5]. Various indexes have been proposed for this purpose, including R-Tree, iSAX, ADS, DS-Tree, and SFA [1]. To enable interactive performance for applications, it is crucial to be able to construct, update and query such data series indexes as efficiently as possible.
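The pruning idea above can be illustrated with a minimal Python sketch. It assumes Piecewise Aggregate Approximation (PAA) as the summarization, chosen here only for brevity and not named in the text, and it scans the summaries linearly instead of using an index. All function names are hypothetical. The scaled PAA distance never exceeds the true Euclidean distance, so any candidate whose lower bound is no better than the best distance found so far can be skipped without reading its raw values.

```python
import math


def paa(series, segments):
    """Summarize a series by the mean of each equal-length segment.
    Assumes len(series) is divisible by segments."""
    size = len(series) // segments
    return [sum(series[i * size:(i + 1) * size]) / size for i in range(segments)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lower_bound(query_paa, cand_paa, series_len):
    """Scaled PAA distance; a lower bound on the true Euclidean distance."""
    return math.sqrt(series_len / len(query_paa)) * euclidean(query_paa, cand_paa)


def nearest_neighbor(query, dataset, segments):
    """Exact 1-NN search that skips candidates pruned by their summaries."""
    q = paa(query, segments)
    # In a real index the summaries are precomputed and organized on storage.
    summaries = [paa(s, segments) for s in dataset]
    best_idx, best_dist, pruned = None, math.inf, 0
    for i, series in enumerate(dataset):
        if lower_bound(q, summaries[i], len(query)) >= best_dist:
            pruned += 1          # cannot beat the current best: skip raw data
            continue
        dist = euclidean(query, series)
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx, best_dist, pruned
```

The better the summaries separate dissimilar series, the more raw-data accesses are avoided, which is why the quality and layout of the index matter for query cost.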
Problem.
Existing data series indexes do not scale well. The problem is that the summarizations on top of which these indexes are built cannot be sorted while keeping similar data series close to each other in the sorted order. The reason is that existing summarizations partition and tokenize data series into multiple independent segments that are laid out in their original order within the data series. Sorting based on these summarizations would therefore place together data series that are similar in terms of their beginning (i.e., the first segment), yet arbitrarily far in terms of the rest of the segments. Thus, state-of-the-art indexing techniques that use external sorting to create and maintain a compact and contiguous index using sequential I/Os to storage cannot be used. Instead, existing data series indexes are constructed using top-down insertions that lead to many random I/Os, creating a sparsely populated and non-contiguous index that requires many random I/Os to query [2].
Solution: Sortable Summarizations.
In this work, we show how to make data series summarizations sortable. The core idea is to interleave the bits in each summarization such that the more significant bits across all segments precede all the less significant bits. As a result, sorting based on these summarizations keeps data series that are similar in terms of all of their segments close to each other in the sorted order.
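The bit interleaving described above can be sketched in a few lines of Python. This is an illustration under simple assumptions, not the Coconut implementation: each summarization is a list of small unsigned integers (one symbol per segment) with the same number of bits, and the function names and example values are hypothetical.

```python
def interleave(segments, bits_per_segment):
    """Sortable key: the most significant bit of every segment comes first,
    then the next bit of every segment, and so on."""
    key = 0
    for bit in range(bits_per_segment - 1, -1, -1):   # from MSB down to LSB
        for symbol in segments:                        # segments keep their order
            key = (key << 1) | ((symbol >> bit) & 1)
    return key


def concatenate(segments, bits_per_segment):
    """Baseline key: segments laid out one after another in original order."""
    key = 0
    for symbol in segments:
        key = (key << bits_per_segment) | symbol
    return key


# Four summaries with two 2-bit segments each (made-up values).
# a and b are close to each other in both segments, and so are c and d.
summaries = {"a": [0, 3], "b": [1, 3], "c": [0, 0], "d": [1, 0]}

by_concat = sorted(summaries, key=lambda n: concatenate(summaries[n], 2))
by_interleave = sorted(summaries, key=lambda n: interleave(summaries[n], 2))

print(by_concat)      # ['c', 'a', 'd', 'b']: d separates the similar a and b
print(by_interleave)  # ['c', 'd', 'a', 'b']: similar summaries are adjacent
```

Sorting by the concatenated key orders the summaries by their first segment only, whereas the interleaved key lets every segment contribute its high-order bits before any low-order bit is considered.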
Coconut Infrastructure.
By making data series summarizations sortable, we built the Compact and Contiguous Sequence Table (Coconut) infrastructure [2], which leverages external sorting and log-structured updates to efficiently build and maintain a compact and contiguous data series index that is fast to query. Coconut is extensible and can allow any state-of-the-art database indexing technique that relies on sorting to support efficient data series similarity search.
Coconut Palm.
In this demonstration, we put the Coconut infrastructure at your palm. While there have been previous demonstrations on interactive data series exploration [3, 6], this is the first to demonstrate the significance of compact and contiguous data layouts, thereby further pushing the scalability envelope, not least with respect to applications with streaming data and/or with limited memory. We provide a teaser in the URL below.
Coconut is a novel data series indexing infrastructure that organizes data series based on sortable summarizations [2]. As a result, it offers new and appealing trade-offs along various performance and space dimensions. In this demo, we present a new recommender tool to allow users to navigate these trade-offs and to thereby tailor an index to the specific requirements of an application:
Better Read vs. Write Trade-Offs.
While existing data series indexes offer expensive and rigid cost balances between reads and writes on account of random I/Os, Coconut is able to harness state-of-the-art indexing techniques to achieve configurable and overall superior read/write costs by leveraging sequential I/Os to a much greater extent. CoconutTree (CTree), our read-optimized B-tree implementation, is a compact and contiguous data series index that is extremely efficient to sequentially query, and it can further be tuned to accommodate updates by controlling its leaf nodes’ fill-factor. On the other hand, CoconutLSM (CLSM), our write-optimized LSM-tree implementation [4], leverages log-structured sequential writes to efficiently ingest incoming data during runtime while still providing good read performance, and it allows fine-tuning the read/write cost balance by controlling the LSM-tree’s growth factor.
Better Memory vs. Construction Trade-Offs.
Existing data series indexes heavily rely on in-memory buffering to alleviate the costs of index construction and maintenance: they wait for similar data series (i.e., series that map to the same node) to gather, and then write them using one I/O as a batch update. Coconut alleviates this pressure on main memory by relying on two-pass external sorting and log-structured updates to construct and maintain an index.
Better Space vs. Time Trade-Offs.
As nodes in existing data series indexes are often sparsely populated, they can emerge as a storage cost bottleneck, especially on the cloud. Coconut not only alleviates such cost bottlenecks by building compact indexes, but it also provides a further option of constructing indexes as non-materialized (i.e., containing only the summarizations) or fully-materialized (i.e., also containing the original data series). The key trade-off is that non-materialized indexes take up less storage and are faster to build, but subsequent queries may be slower as the raw data file has to be accessed to fetch the original data series.
Teaser URL: https://tinyurl.com/y8j35rv4
In data exploration scenarios, queries often have temporal constraints; they must find the nearest neighbor from within a temporal window of interest. In contrast to traditional streaming data applications, where values inside the window of interest are treated as sets of distinct points, we treat the values in each window as sequences of time-ordered points. This allows us to construct sequential patterns and query historical data as such. We present three approaches for how to support such variable-sized window queries.
Post-Processing (PP) relies on examining the timestamp of every entry as it is encountered and discarding it if the timestamp falls outside the specified query window.
Temporal Partitioning (TP) creates a new index partition based on the in-memory buffer’s contents every time that the buffer fills up. The system gathers more and more temporal partitions over time, and it organizes them based on their creation time. This allows queries to access only partitions whose creation timestamp falls within or intersects with a specified query window.
Bounded Temporal Partitioning (BTP).
The ability to sort data series summarizations enables a new approach, Bounded Temporal Partitioning (BTP), that combines the best aspects of PP and TP. BTP creates a new temporal partition every time the buffer flushes, and it later sort-merges temporal partitions of similar sizes so that newer data resides in smaller partitions while older data gradually moves to larger contiguous partitions. As with TP, BTP allows queries over small windows to save storage bandwidth by skipping larger partitions. On the other hand, as with PP, it allows queries over larger windows to spatially prune data at larger runs more effectively, and it allows approximate queries over large windows to issue fewer I/Os by bounding the overall number of partitions that need to be accessed.
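The three approaches above can be related through a minimal in-memory Python sketch of BTP. It rests on assumptions that the text does not spell out: the merge rule (sort-merge the two newest partitions whenever the older one is no larger than the newer one), the class and method names, and the use of plain integers as stand-ins for sortable summarizations are all hypothetical. Skipping partitions by time range mirrors TP, and filtering timestamps inside a partition mirrors PP.

```python
import heapq


class Partition:
    def __init__(self, entries):
        # entries: list of (sortable_key, timestamp) tuples, sorted by key
        self.entries = entries
        self.min_ts = min(ts for _, ts in entries)
        self.max_ts = max(ts for _, ts in entries)


class BoundedTemporalPartitions:
    def __init__(self, buffer_capacity):
        self.buffer_capacity = buffer_capacity
        self.buffer = []
        self.partitions = []      # oldest (and largest) first

    def insert(self, key, ts):
        self.buffer.append((key, ts))
        if len(self.buffer) == self.buffer_capacity:
            self._flush()

    def _flush(self):
        # Every flush creates a new, small temporal partition.
        self.partitions.append(Partition(sorted(self.buffer)))
        self.buffer = []
        # Assumed policy: sort-merge the two newest partitions while the
        # older one is no larger than the newer one.
        while (len(self.partitions) >= 2 and
               len(self.partitions[-2].entries) <= len(self.partitions[-1].entries)):
            newer = self.partitions.pop()
            older = self.partitions.pop()
            merged = list(heapq.merge(older.entries, newer.entries))
            self.partitions.append(Partition(merged))

    def window_entries(self, start_ts, end_ts):
        """Return (entries inside the window, number of partitions read)."""
        result, touched = [], 0
        for p in self.partitions:
            if p.max_ts < start_ts or p.min_ts > end_ts:
                continue          # as with TP: skip partitions outside the window
            touched += 1
            # as with PP: discard entries whose timestamp is outside the window
            result.extend(e for e in p.entries if start_ts <= e[1] <= end_ts)
        return result, touched
```

With this doubling policy the number of partitions grows only logarithmically with the number of flushes, which is one way to obtain the bound on accessed partitions that the text mentions. A query over a recent window reads only the small, new partitions.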
The high-level architecture of Coconut Palm is shown in Figure 1. It consists of a GUI client and an algorithms server, which we describe in detail below.
Configurable Environment and Structures.
The GUI allows users to directly interact with the Coconut infrastructure across a range of configurable application scenarios. It allows choosing between synthetic or real scientific datasets, configuring the available main memory budget, anticipating a temporal window size, and combining different algorithmic elements from the Coconut infrastructure (e.g., materialized CTree with PP) or choosing an alternative for comparison (e.g., ADSFull or ADS+). The GUI allows constructing any index of choice while visually comparing construction speed and storage consumption across index variants. A detailed screenshot of the GUI is shown in Figure 2.
Figure 1: Coconut Palm high-level architecture. [Diagram components: GUI Client, Recommender, Web Service (Request / Results in JSON), Algorithms Server, Storage Layer. Static data series variants: ADS+, CTree, CLSM, ADSFull, CTreeFull, CLSMFull. Streaming data series variants: ADS+PP, CTreePP, ADS+TP, CTreeTP, CLSMBTP, ADSFullPP, CTreeFullPP, ADSFullTP, CTreeFullTP, CLSMFullBTP.]
Recommender.
Users can consult our new recommender tool for the best structural configuration for the chosen application scenario. The recommender is designed as a decision tree to be able to provide users with the rationale for its advice.
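A decision tree of this kind can be sketched as follows in Python. The sketch encodes only the three outcomes reported in the demonstration scenarios of this paper (non-materialized CTree, materialized CTree, and non-materialized CLSM with BTP). The function name, the parameters, and in particular the `query_threshold` value are hypothetical placeholders, since the actual rules and cut-off points of the recommender are not given in the text.

```python
def recommend(streaming, expected_queries, query_threshold=1000):
    """Return (configuration, rationale) for an application scenario.

    query_threshold is an assumed, illustrative cut-off; the real
    recommender's decision points are not specified in the paper.
    """
    if streaming:
        # Only this streaming outcome is described in the text.
        return ("non-materialized CLSM with BTP",
                "log-structured writes ingest updates while bounded temporal "
                "partitions keep window queries efficient")
    if expected_queries < query_threshold:
        return ("non-materialized CTree",
                "few queries: smaller storage footprint and faster "
                "construction outweigh slower raw-data lookups")
    return ("materialized CTree",
            "many queries: extra space and construction time are amortized "
            "by faster query answering")


config, why = recommend(streaming=False, expected_queries=50)
print(config, "-", why)
```

Returning the rationale together with the configuration reflects the stated design goal of explaining the advice to the user.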
Query and Performance Visualization.
Users can further draw data series, issue them as approximate or exact queries with any window size, and visually compare query performance across different index variants. To allow users to appreciate how the structural properties of an index affect query performance, we provide a heat map that visualizes a query’s access pattern.
Implementation.
The GUI client is developed using PHP, JavaScript and HTML. It communicates with a back-end server, on which the indexes are built and evaluated. Client-server communication takes place through REST web service calls. All algorithms are implemented in C/C++.
There are two goals for the demonstration. The first is to show through experiments that the Coconut infrastructure significantly speeds up the process of data series exploration through faster index construction, maintenance, and querying. The second goal is to instruct users on how to choose from among the different design combinations within the Coconut infrastructure to achieve the best possible performance and space properties for a target application. We will start by guiding participants through the different design choices in Coconut. We will then walk them through two data exploration scenarios.
Scenario 1: Big Static Data Series.
This exploration scenario commences with a large collection of raw astronomy data series. The goal is to find data series within this dataset that match several known patterns of interest (e.g., corresponding to a supernova, a binary star, etc.). We will first undertake the exploration workflow with the state-of-the-art approach, ADS+, and demonstrate that it exhibits performance lags for both construction and querying. We will then consult our recommender for advice on the best Coconut index for this scenario, and we will repeat the workflow with the recommender’s choice (in this case a non-materialized CTree with PP). Through first-hand experience and by visualizing performance metrics, we will demonstrate that CTree significantly speeds up the workflow. By using the heat map to analyze CTree’s access patterns and comparing them to those of ADS+, we will attribute the performance improvement to friendlier I/O patterns, which are enabled as a result of constructing CTree compactly and contiguously through external sorting.
We will show that as we increase the projected number of queries in the workload, our recommender changes its choice to using a materialized CTree, the reason being that the additional space and construction overheads for a materialized index become justified as the number of subsequent queries increases. We will allow users to issue the same set of queries to a pre-built materialized CTree on our server to appreciate the improved query performance of a materialized Coconut index.
Scenario 2: Dynamic Streaming Data Series.
The second exploration scenario commences from an empty dataset, with IRIS Seismic data series (http://ds.iris.edu/data/access/) continually arriving in batches. The goal is to find data series that match known patterns corresponding to earthquakes from within variable-sized temporal windows of interest. We will use ADS+ with both PP and TP as a baseline representing the state of the art and compare it to our recommender’s choice, in this case a non-materialized CLSM with BTP. By using the heat map, we will show that ongoing updates hamper the query performance of the ADS+ variants, whereas CLSM performs queries seamlessly while still being able to ingest the updates. We will further demonstrate through access pattern visualization that even in moments where updates are absent, CLSM still outperforms the ADS+ variants by virtue of using the BTP scheme to narrow the search to the partitions of interest and being able to effectively prune them to more quickly find a nearest neighbor.
Figure 2: A screenshot of the Coconut Palm GUI.
We demonstrate Coconut, a new infrastructure that accelerates the process of data series exploration. The core innovation is a sortable data series summarization, which allows using state-of-the-art indexing techniques for the first time to efficiently construct, maintain and query a data series index. We demonstrate the versatile new performance and space trade-offs that Coconut provides, and we allow users to experience and navigate the infrastructure with the aid of a new recommender tool.
This work was partially supported by the projects BOUNCE(H2020
REFERENCES
[1] K. Echihabi, K. Zoumpatianos, T. Palpanas, and H. Benbrahim. The Lernaean Hydra of data series similarity search: An experimental evaluation of the state of the art. PVLDB, 12(2), 2018.
[2] H. Kondylakis, N. Dayan, K. Zoumpatianos, and T. Palpanas. Coconut: A scalable bottom-up approach for building data series indexes. PVLDB, 11(6):677–690, 2018.
[3] M. Mannino and A. Abouzied. Qetch: Time series querying with expressive sketches. In SIGMOD, 2018.
[4] P. E. O’Neil, E. Cheng, D. Gawlick, and E. J. O’Neil. The log-structured merge-tree (LSM-tree). Acta Inf., 33(4), 1996.
[5] T. Palpanas. Big sequence management: A glimpse of the past, the present, and the future. In SOFSEM, 2016.
[6] K. Zoumpatianos, S. Idreos, and T. Palpanas. RINSE: Interactive data series exploration with ADS+. PVLDB, 8(12), 2015.