A Foundation for Spatio-Textual-Temporal Cube Analytics (Extended Version)
Mohsin Iqbal
Aalborg University
[email protected]

Matteo Lissandrini
Aalborg University
[email protected]

Torben Bach Pedersen
Aalborg University
[email protected]
ABSTRACT
Large amounts of spatial, textual, and temporal data are being produced daily. This is data containing an unstructured component (text), a spatial component (geographic position), and a time component (timestamp). Therefore, there is a need for a powerful and general way of analyzing spatial, textual, and temporal data together. In this paper, we define and formalize the Spatio-Textual-Temporal Cube structure to enable combined effective and efficient analytical queries over spatial, textual, and temporal data. Our novel data model over spatio-textual-temporal objects enables novel joint and integrated spatial, textual, and temporal insights that are hard to obtain using existing methods. Moreover, we introduce the new concept of spatio-textual-temporal measures with associated novel spatio-textual-temporal-OLAP operators. To allow for efficient large-scale analytics, we present a pre-aggregation framework for exact and approximate computation of spatio-textual-temporal measures. Our comprehensive experimental evaluation on a real-world Twitter dataset confirms that our proposed methods reduce query response time by 1-5 orders of magnitude compared to the No Materialization baseline and decrease storage cost between 97% and 99.9% compared to the Full Materialization baseline, while adding only a negligible overhead in the Spatio-Textual-Temporal Cube construction time. Moreover, approximate computation achieves an accuracy between 90% and 100% while reducing query response time by 3-5 orders of magnitude compared to
No Materialization.

1 INTRODUCTION

Due to the increased usage of mobile devices and advancements in accurate geo-tagging, more and more geo-tagged data is being produced [8]. In particular, social media platforms like Twitter and Facebook are some of the main sources of geo-tagged data, usually in the form of posts, comments, and reviews (e.g., Figure 1). This type of data contains spatial, textual, and temporal (STT) information. As a result,
STT data analysis is becoming increasingly important [9] since it allows us to extract new insights regarding customer satisfaction, user-generated content shared online, and brand reputation [27].

STT data contains information regarding topics discussed w.r.t. time and location, hence presenting an invaluable link between user opinions and the real world. For example, STT data can help us analyze an advertisement campaign to identify the best locations for ad placements. Traditionally, this information is accessed through spatial keyword-queries [4], e.g., to retrieve topics within a certain location, or to identify in which locations some topic is discussed. However, keyword or topic search are point-wise search tasks. Instead, there is a significant need to provide more extensive analytics analogous to traditional OLAP-style analytics. An example STT query is "find the top-k trending hashtags aggregated by topic within a user-defined region (i.e., polygon) around Paris this month".

Figure 1: Geo-tagged tweet: an example of an STT object, with its text ("A lovely evening here in Paris. So far from everyday stresses. I should travel here more often!" by Alex (@Alex1)), location, and timestamp.

© Copyright 2021 for this paper held by its author(s). Published in the proceedings of DOLAP 2021 (March 23, 2021, Nicosia, Cyprus, co-located with EDBT/ICDT 2021) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The traditional data cube model is one of the most widely used tools to analyze structured data. Since their introduction, data cubes have been extended to analyze different types of data, like sales [14], locations [15], time-series [6], and text [22], but separately. In particular, some works propose OLAP operators to analyze either textual data [3, 31, 36] or spatial data [14, 15]. However, no previous work proposed a unified model and set of operators enabling integrated and joint analysis of
STT data. Moreover, as we propose to jointly analyze STT dimensions together with other dimensions, we are also able to define novel families of measures that have not been studied before, namely spatio-textual and spatio-textual-temporal measures. These measures, as we show later, allow us to produce more advanced analytics instead of, e.g., simple keyword frequency.
Contributions.
In this paper, we introduce the Spatio-Textual-Temporal Cube (STTCube) to analyze STT data. Adding spatial, textual, and temporal support to a traditional data cube is not straightforward due to the presence of n-m relationships in textual hierarchies and because existing families of measures cannot support joint and integrated analysis involving spatial, textual, and temporal dimensions, e.g., finding the trending keywords grouped by regions, defined by geometry shapes, over a time interval (Section 3.3). Hence, we introduce new families of measures and OLAP operators that extract combined insights from STT dimensions and measures. STTCube provides specialized spatio-textual and spatio-textual-temporal measures such as Top-k Dense Keywords within an area and Top-k Volatile Keywords within an area that deliver integrated aggregates over STT data. Moreover, a set of analytical operators, namely STT slice, dice, roll-up, and drill-down, are proposed. This results in a data model able to support spatio-textual-temporal OLAP (STTOLAP) operators. Furthermore, we propose Partial Exact Materialization (PEM) and Partial Approximate Materialization (PAM) methods for efficient exact and approximate computation of STT measures, respectively. Among other things, we also provide a systematic set of solutions to handle n-m relationships in textual hierarchies. In this work, we present the following contributions:

Table 1: Spatio-Textual-Temporal Sample Dataset
Time | Location | Terms

Table 2: Presence (✓) or absence (✗) of support for spatial and textual data, dimensions, hierarchies, and measures in existing methods

Method | Textual: Data / Dim. / Hier. / Meas. | Spatial: Data / Dim. / Hier. / Meas. | ST Meas. | STT Meas.
EXODuS [7] | ✓ (JSON) / ✓ / ✓ / ✗ | ✗ / ✗ / ✗ / ✗ | ✗ | ✗
TextCube [22] | ✓ / ✓ / ✓ / ✓ | ✗ / ✗ / ✗ / ✗ | ✗ | ✗
Text OLAP [35] | ✓ / ✓ / ✓ / ✓ | ✗ / ✗ / ✗ / ✗ | ✗ | ✗
TextCubeTopKCells [10] | ✓ / ✓ / ✓ / ✓ | ✗ / ✗ / ✗ / ✗ | ✗ | ✗
GeoMiner [15] | ✗ / ✗ / ✗ / ✗ | ✓ / ✓ / ✓ / ✓ | ✗ | ✗
SpatialCube [14] | ✗ / ✗ / ✗ / ✗ | ✓ / ✓ / ✓ / ✓ | ✗ | ✗
StreamCube [12] | ✓ / ✗ / ✗ / ✗ | ✓ / ✗ / ✗ / ✗ | ✗ | ✗
TwitterSand [30] | ✓ / ✗ / ✗ / ✗ | ✓ / ✗ / ✗ / ✗ | ✗ | ✗
TextStreams [33] | ✓ / ✗ / ✗ / ✗ | ✓ / ✗ / ✗ / ✗ | ✗ | ✗
TopicExploration [38] | ✓ / ✗ / ✗ / ✗ | ✓ / ✗ / ✗ / ✗ | ✗ | ✗
SocialCube [24] | ✓ / ✗ / ✗ / ✗ | ✓ / ✗ / ✗ / ✗ | ✗ | ✗
TopicCube [37] | ✓ / ✓ / ✓ / ✓ | ✗ / ✗ / ✗ / ✗ | ✗ | ✗
ContextualizedWarehouse [29] | ✓ / ✓ / ✗ / ✗ | ✗ / ✗ / ✗ / ✗ | ✗ | ✗
STTCube | ✓ / ✓ / ✓ / ✓ | ✓ / ✓ / ✓ / ✓ | ✓ | ✓

• We extend the standard cube model to add support for spatial, textual, and temporal dimensions and hierarchies and spatio-textual and spatio-textual-temporal measures (Sections 3.1 to 3.3).
• We propose a set of analytical operators (STTOLAP) over spatio-textual-temporal data (Section 3.4).
• We introduce keyword density and keyword volatility as prototypical spatio-textual and spatio-textual-temporal measures (Section 3.3).
• We propose a pre-aggregation framework (STTCube materialization) for efficient exact (PEM) and approximate (PAM) computation of the proposed spatio-textual-temporal measures (Section 4).
• We propose techniques for processing spatio-textual-temporal objects and the construction of the STTCube (Section 5).
• We evaluate the pre-aggregation framework's (PEM and PAM) query response time, storage cost, and accuracy by comparing it with the No STTCube, Full Materialization, and No Materialization baselines. Our pre-aggregation framework provides 1-5 orders of magnitude improvement in query response time and a 97% to 99.9% reduction in storage cost with an accuracy between 90% and 100% (Section 6).
2 RELATED WORK

OLAP and the Data Cube [18] are used heavily in business intelligence to obtain insights over the historical, current, and future state of a business. With the emergence of the web and social media, an immense amount of unstructured data is being produced, which must be included in the analytical process. Table 2 summarizes the state of the art on spatial, textual, and temporal analytics by listing the properties and gaps in the current methods.

The TextCube [22] allows OLAP-like queries on text data by providing dimensions and hierarchies for terms. Moreover, it supports the computation of two information retrieval (IR) measures: inverted index and term frequency. EXODuS [7] processes semi-structured document stores (i.e., JSON) using a schema-on-read approach to allow exploratory OLAP on text. Text OLAP [35] extends traditional OLAP to support textual dimensions and keyword-based top-k search [10]. Yet, all these approaches lack support for spatial and temporal data and the advanced measures and operators required for spatio-textual-temporal analytics.
For spatial data, GeoMiner [15] proposes a cube structure for mining characteristics, comparisons, and association rules from geo-spatial data, and SpatialCube [14] allows performing spatial OLAP on the semantic web. Yet, these solutions focus on spatial data only and lack support for textual and temporal data.
There are solutions that combine more than one component of data, e.g., spatio-temporal [34], into the same model but do not provide combined STT analytics. Among those, the contextualized warehouse [29] combines traditional OLAP with a textual warehouse. This allows the user to provide some keywords, select a market (country or region), retrieve documents matching the keywords as context, and then analyze the facts related to those keywords and documents. Similarly, TopicCube [37] extends the functionality of a traditional cube and combines probabilistic topic modeling with OLAP by introducing the topic hierarchy. TwitterSand [30] and StreamCube [12] exploit textual and spatial information to gain insights by clustering Twitter hashtags and tweets in a region, respectively. STT data is also analyzed to extract event and topic information in TextStreams [33] and TopicExploration [38]. Finally, SocialCube [24] tries to capture human, social, and cultural behavior by performing linguistic analysis (sentiment analysis) over tweets. All these approaches focus on the unstructured nature of text along with spatial and temporal data, but they do not provide integrated STT analytics; for example, they do not provide the ability to compute aggregate spatial, textual, temporal, and spatio-textual-temporal measures over spatial, textual, and temporal dimensions and hierarchies. Spatial top-k keyword-queries [5, 9, 25] answer only point-wise queries and do not support aggregation functions or hierarchies. Thus, they do not support more complex OLAP-style analytical tasks, which we do. There are methods that solve a very specific task for a specific type of data [2, 21, 28]. These methods are fundamentally different from STTCube because STTCube provides a generic framework for a wide range of STT analytics over different kinds of STT data sources, including, but not limited to, geo-tagged tweets. Also, STTCube can take advantage of the improvements suggested over other cubes, e.g., Nanocubes [23] and DICE [17], making it a powerful tool for OLAP-style STT analytics.

Our summary of related work in Table 2 shows that no existing method provides integrated support for STT data, unlike STTCube. To the best of our knowledge, a proper formalization of a data cube model for STT data able to support complex analytics for STT objects at scale is missing. In particular, no previous method studies dimensions, hierarchies, and measures that allow processing STT data jointly. Furthermore, the main novel challenge for STT-OLAP is handling n-m relationships inside the STT dimensions effectively, since n-m relationships do not allow traditional pre-aggregation techniques to be used. Moreover, arbitrary temporal ranges with multiple levels of granularity add complexity to STT measure computations. As a remedy, we propose STTCube, which enables the joint and integrated analysis of STT objects by introducing new sets of measures, spatio-textual and spatio-textual-temporal measures, to gain in-depth insights using STTOLAP operators.
3 THE SPATIO-TEXTUAL-TEMPORAL CUBE

Here, we define the STTCube, an extension of the traditional data cube to allow storage and analysis of STT objects. Data cubes are used to model and analyze multi-dimensional data.

Figure 2: STTCube Example (facts with Location, Text, and Time dimensions; sample Term members include Apple, Fruit, Love, and Onion)
3.1 STTCube Model

Definition 1 (Data Cube). An n-dimensional data cube C_DC is a tuple C_DC = (D, M, F), with a set of dimensions D = {d_1, d_2, ..., d_n}, a set of measures M = {m_1, m_2, ..., m_j}, and a set of facts F. A dimension d_i ∈ D has a set of hierarchies H_{d_i}. Each hierarchy h ∈ H_{d_i} is organized into a set of levels L_h. Each level l ∈ L_h contains a set of members and has a set of attributes A_l. Each attribute a ∈ A_l is defined over a domain. Each measure m ∈ M is a function defined over a domain which can return either a single value or a complex object. The domain of a dimension d_i is denoted by L(d_i).

Spatio-Textual-Temporal (STT) Objects. An STT object records place (geo-coordinates or location where it was created), text (a review, or a user comment), and time (when it was created). Social networks with geo-tagged micro-blog posts are typical STT data sources (e.g., the geo-tagged tweet in Figure 1).
Definition 2 (STT object). A spatio-textual-temporal object is a tuple o_stt = ⟨l, x, t⟩ where l, x, and t represent the location, text, and time components, respectively.

The Location is represented as the latitude and longitude pair l ∈ (R × R). The Text is an ordered list x = ⟨w_1, w_2, ..., w_n⟩ where each w_i ∈ W is a string and is called a Term. Among all Terms, keywords are a user-defined subset of important Terms W_k ⊆ W. For instance, the user can decide that hashtags (terms starting with '#') are the keywords. Time specifies a precise instant (a timestamp) at some resolution (e.g., seconds). Table 1 contains examples of STT objects with their location, a set of keywords extracted from the text, and timestamp.
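To make Definition 2 concrete, the following is a minimal Python sketch of an STT object and a user-defined keyword rule (here, hashtags); the class and function names are illustrative and not part of the formal model.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class STTObject:
    """A spatio-textual-temporal object o_stt = <l, x, t> (Definition 2)."""
    location: Tuple[float, float]  # (latitude, longitude), l in R x R
    text: List[str]                # ordered list of Terms <w_1, ..., w_n>
    timestamp: str                 # time instant, e.g., an ISO-8601 string

def extract_keywords(obj: STTObject) -> List[str]:
    """User-defined keyword rule: keep hashtags (Terms starting with '#')."""
    return [w for w in obj.text if w.startswith("#")]

tweet = STTObject((48.8566, 2.3522),
                  ["lovely", "evening", "#Paris", "#travel"],
                  "2020-09-20T19:15:00")
print(extract_keywords(tweet))  # ['#Paris', '#travel']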
For analytical processing of STT objects we propose to model them as an STTCube. An STTCube C_STTC = (D, M, F) is a data cube (Definition 1) with three special dimensions, namely Location, Text, and Time, that is D = {d_Location, d_Text, d_Time, d_1, ..., d_n}.

Dimensions. An STTCube stores STT objects as facts, modeling their spatial, textual, and temporal features in the corresponding dimensions. Figure 2 shows a 3-dimensional STTCube built on the sample dataset in Table 1 where each row represents one fact (i.e., the members of F) with dimensions D = {d_Location, d_Text, d_Time}. Domains for the respective dimensions are L(d_Location) = {(57.016, 09.991), (56.187, 10.171), ...}, L(d_Text) = {apple, Fruit, ...}, and L(d_Time) = {...} (the timestamps of the objects in Table 1). Hence, w.r.t. Definition 2, the dimensions capturing l, x, and t are the spatial, textual, and temporal dimensions, respectively.
Dimension Hierarchy. A hierarchy is spatial, textual, or temporal if it contains spatial, textual, or temporal levels, respectively. In Figure 2, the Location dimension is a spatial dimension with a spatial hierarchy going from l to the City, Region, and Country levels, and the Text dimension is a textual dimension aggregating x into the Term, Theme, Topic, and Concept levels. Similarly, Time is a temporal dimension. Hierarchy steps HS_h = {hs_1, hs_2, hs_3, ..., hs_n} define the mechanism of moving from a lower (child) level to an upper (parent) level and vice versa. A hierarchy step hs_i = (l_c, l_p, cardinality) ∈ HS_h entails that members of a child level l_c can be aggregated together if they correspond to the same member at the parent level l_p, and that this correspondence between child and parent members has the given cardinality ∈ {1-1, 1-n, n-1, n-m}. For instance, the step from Date to Month has an n-1 cardinality, while the step from Term to Topic has an n-m cardinality (e.g., the Carrot Term corresponds both to the Gardening and Food Topics, while the Food Topic has as child members not only Carrot but also Apple).
Level Attributes. As mentioned earlier, a level l is associated with a set of attributes A_l = {a_1, a_2, ..., a_n} and has a set of members M_l = {m_1, m_2, ...}. Attribute values describe the different characteristics of each member of that level. Spatial, textual, and temporal levels are then usually characterized by spatial, textual, and temporal attributes. For instance, at the City level, the member Aalborg has the Boundary attribute whose value is the polygon defining the boundary of Aalborg. An example of a textual attribute is Sentiment, which captures the polarity of the associated textual member. Similarly, an integer value representing the number of days in a specific month is a temporal attribute.
3.2 Dimensions and Hierarchies

We now describe the STTCube's dimensions and hierarchies.

Spatial Dimensions. Spatial information can be analyzed at different levels and granularities. It is important to note that facts in an STTCube are composed only of geographical points (i.e., each tweet or user post is associated with a coordinate, not with shapes or polygons). Points can be aggregated either within a predefined spatial grid or based on semantic information.
Grid-Based Hierarchy. Here, the geographic area being analyzed is divided into small equal-size cells with a predefined resolution, e.g., 1 × 1 km. At the lowest level, each latitude and longitude point is assigned to the cell it falls in. To analyze data at a coarser granularity, neighboring cells are combined into a larger cell at the parent level (e.g., 3 × 3 km). This hierarchy can be built automatically, without the need for any meta-data.
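A minimal sketch of such a grid-based step function, assuming a flat-earth approximation with roughly 111 km per degree of latitude (the actual implementation in Section 5 uses MGRS); all names and the cell sizes are illustrative.

import math

def grid_cell(lat: float, lon: float, cell_km: float) -> tuple:
    """Base step function: map a point to the (row, col) index of its grid
    cell, using ~111 km per degree of latitude and cos(lat)-scaled longitude."""
    km_per_deg_lat = 111.0
    km_per_deg_lon = 111.0 * math.cos(math.radians(lat))
    return (int(lat * km_per_deg_lat // cell_km),
            int(lon * km_per_deg_lon // cell_km))

def parent_cell(cell: tuple, factor: int = 3) -> tuple:
    """Step to the parent level: group factor x factor neighboring cells
    (e.g., 1 x 1 km cells into 3 x 3 km cells)."""
    return (cell[0] // factor, cell[1] // factor)

c = grid_cell(57.016, 9.991, cell_km=1.0)  # base-level 1 x 1 km cell
print(c, parent_cell(c))                   # same point at the 3 x 3 km level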
Semantic-Based Hierarchy. Here, data is analyzed in a predefined taxonomy, e.g., an administrative division. Therefore, we move within the taxonomy, e.g., from the Location to the City level, from the City level to the Region level, and so on up to the All level. This hierarchy requires each object coordinate to be associated with a member of the lowest level in the hierarchy (usually in a pre-processing step) and requires the taxonomy information to build the entire hierarchy.
Textual Dimensions. Hierarchies in the textual dimension move from specific concepts to general ones. This follows a generic taxonomic structure connecting more specific terms to more general ones (i.e., hypernyms) [20]. In particular, Terms are the base level, which are grouped into Themes, Themes into larger categories called Topics, and Topics in turn grouped into Concepts. Differently from most hierarchies, the members in the levels of a textual hierarchy are typically in an n-m relationship. Hence, when moving between textual levels we need to decide how measure values get aggregated. Below we propose a set of aggregation techniques to address this issue.
Replication-Based Hierarchy. This is a common approach where each member of a child level is aggregated into all the parent members. Hence, its value is effectively replicated. This approach leads to a counting problem when parent levels are further aggregated. For example, the first data instance in Table 1 will be part of two Themes: 1) Fruits, because it contains the Terms {apple, fruit}, and 2) Emotion, because of the Term {love}.

Majority-Based Hierarchy. If a fact can be mapped to more than one parent member, then that fact will be part of the parent member which has the most representation (e.g., in terms of frequency). This scheme avoids double counting of facts in parent members. In case of ties, some tie-breaking heuristic or a user-defined criterion can be employed instead. E.g., the first fact in Table 1 will be part of only the Fruits Theme because it has the two representative Terms {apple, fruit}, as compared to Emotion having only one Term {love}.

Custom Hierarchy. In general, other user-specified criteria and rules can be defined to establish how child-parent level steps will be aggregated in case of ambiguities. For instance, a domain-specific importance score can be assigned to the hierarchy members during the STTCube construction. In this way, facts will be part of only the parent member with the highest importance.
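The following Python sketch contrasts the replication-based and majority-based schemes on the Table 1 example; the term-to-Theme mapping is a hypothetical stand-in for the real taxonomy.

from collections import Counter
from typing import Dict, List, Set

# Hypothetical Term -> parent Themes mapping (an n-m relationship).
TERM_THEMES: Dict[str, Set[str]] = {
    "apple": {"Fruits"}, "fruit": {"Fruits"}, "love": {"Emotion"},
    "carrot": {"Gardening", "Food"},
}

def replication_parents(terms: List[str]) -> Set[str]:
    """Replication-based: the fact counts toward every matching Theme."""
    return set().union(*(TERM_THEMES.get(t, set()) for t in terms))

def majority_parent(terms: List[str]) -> str:
    """Majority-based: the fact counts only toward the Theme with the most
    representative Terms; ties broken by name (a simple user-defined rule)."""
    votes = Counter(th for t in terms for th in TERM_THEMES.get(t, set()))
    return min(votes, key=lambda th: (-votes[th], th))

fact_terms = ["apple", "fruit", "love"]   # first object in Table 1
print(replication_parents(fact_terms))    # {'Fruits', 'Emotion'}
print(majority_parent(fact_terms))        # 'Fruits' (2 Terms vs 1)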
Temporal Dimensions. Similarly, temporal dimensions allow analyzing STT objects at different levels of granularity w.r.t. time and have the following two temporal hierarchies: t → Day → Month → Quarter → Year → All and t → Second → Minute → Hour → All. Here, the first is the Date hierarchy, aggregating by the temporal levels Day, Month, Quarter, and Year (5 levels in total including All), whereas the second is the TimeOfDay hierarchy, having 4 levels in total.
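A small sketch of the two temporal step functions, using standard library date handling; the level encodings are illustrative.

from datetime import datetime

def date_hierarchy(ts: datetime) -> dict:
    """Members of the Date hierarchy (Day -> Month -> Quarter -> Year -> All)."""
    return {"Day": ts.strftime("%Y-%m-%d"),
            "Month": ts.strftime("%Y-%m"),
            "Quarter": f"{ts.year}-Q{(ts.month - 1) // 3 + 1}",
            "Year": str(ts.year), "All": "All"}

def time_of_day_hierarchy(ts: datetime) -> dict:
    """Members of the TimeOfDay hierarchy (Second -> Minute -> Hour -> All)."""
    return {"Second": ts.strftime("%H:%M:%S"),
            "Minute": ts.strftime("%H:%M"),
            "Hour": ts.strftime("%H"), "All": "All"}

ts = datetime(2020, 9, 20, 10, 15, 42)
print(date_hierarchy(ts))         # {'Day': '2020-09-20', 'Month': '2020-09', ...}
print(time_of_day_hierarchy(ts))  # {'Second': '10:15:42', 'Minute': '10:15', ...}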
3.3 Spatio-Textual-Temporal Measures

As defined earlier, an n-dimensional STTCube has a set of measures M = {m_1, m_2, m_3, ..., m_j}, which permit analyzing STT objects by computing values at different levels of granularity. For instance, the STTCube in Figure 2 models Location, Text, and Time with Fact Count as a measure (i.e., Fact Count ∈ M). In practice, it maintains the count of STT objects at given spatial, textual, and temporal aggregation levels. Measure values at different levels in the hierarchies are obtained by applying an aggregation function over the STT objects. Examples of aggregation functions are SUM, COUNT, MIN, MAX, and AVG. The STTCube in Figure 2 uses COUNT as an aggregation function. For example, it reports that on September 20th at AAU Bus Terminal the Term apple was mentioned in 2 facts.

A measure is spatial if it is defined over a spatial domain. A spatial measure is then computed over a collection of spatial values (e.g., geographical points, or geometry shapes like polygons). A spatial measure can be a simple value, e.g., the (numeric) area of the convex hull of multiple shapes, or a complex spatial object, e.g., the polygon representing the convex hull itself. A measure is textual if it is defined over a textual domain, and can be either a simple numeric value or a complex textual object. Analogously, a measure is temporal if it is defined over a temporal domain. A measure is spatio-textual if it is defined over a spatial and textual domain and is a combination of spatial and textual measures. Finally, a measure is spatio-textual-temporal if it is defined over a spatial, textual, and temporal domain and is a combination of spatial, textual, and temporal measures. Below, we propose a list of spatio-textual and spatio-textual-temporal measures to be used as part of the STTCube to analyze STT objects effectively.

Top-k Keywords within an Area is a spatio-textual measure which returns a list of tuples ⟨s, kw⟩ consisting of a geometry shape s representing a geographical area and the list of the top-k most frequent keywords kw = ⟨w_1, w_2, ..., w_k⟩ in that area. Analogously to the other measures, it can be computed at different levels of aggregation, so that it can return the top-k keywords for each City or each Region.

Keyword Density is a spatio-textual measure which returns a list of tuples ⟨r_j, w_i, D_ij⟩ consisting of a geometry shape r_j representing a geographical area, a keyword w_i, and its density D_ij in the area r_j. The density D_ij of a keyword w_i over an area r_j is computed as D_ij = freq(r_j, w_i) / SurfaceArea(r_j), in which freq(r_j, w_i) is the frequency of the keyword w_i in the area r_j (i.e., the number of objects located within r_j in which w_i appears) and SurfaceArea(r_j) is the surface area of r_j. For example, if we have two Regions r_1, r_2 with surface areas a_1 and a_2, and the term Apple with frequency 5 and 30 in r_1 and r_2, respectively (see Figure 3), then the keyword densities are 5/a_1 and 30/a_2, respectively.

Top-k Dense Keywords within an Area is a spatio-textual measure which returns a list of tuples ⟨r_j, kw⟩, computing the keyword density as described in the measure above, but in this case it returns the top-k keywords kw = ⟨w_1, w_2, ..., w_k⟩ with the highest density.

Keyword Volatility is a spatio-textual-temporal measure (it becomes textual-temporal if no region is specified) which returns a list of tuples ⟨r_j, w_i, T_l, ΔD_ij⟩ consisting of a geometry shape r_j representing a geographical area, a keyword w_i, a time interval T_l, and its change in density ΔD_ij in the area r_j over the time interval T_l (divided into x equal intervals). The change in density ΔD_ij of a keyword w_i in an area r_j over a time interval T_l is computed as ΔD_ij = (Σ_{z=1}^{x} |D_ij^{t_z} − D_ij^{t_{z−1}}|) / x, where D_ij^{t_z} represents the density of the keyword w_i in the area r_j at a specific time instance t_z. Furthermore, the change-in-density formula can be adapted to the analysis requirements, e.g., it can be changed to a weighted density (assigning different weights to each interval in T_l) or to a rate-of-change computation using linear regression [19].

Top-k Volatile Keywords within an Area is a spatio-textual-temporal measure which returns a list of tuples ⟨r_j, kw⟩, computing the keyword volatility as described above, but in this case it returns the top-k volatile keywords kw = ⟨w_1, w_2, ..., w_k⟩ with the highest change in density.
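The two formulas translate directly into code. The sketch below computes Keyword Density and Keyword Volatility from per-interval frequency data and selects a top-k; the input data and function names are illustrative.

from typing import Dict, List, Tuple

def keyword_density(freq: Dict[str, int], surface_area: float) -> Dict[str, float]:
    """D_ij = freq(r_j, w_i) / SurfaceArea(r_j) for every keyword in an area."""
    return {w: f / surface_area for w, f in freq.items()}

def keyword_volatility(densities: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean absolute change of density over x consecutive intervals:
    Delta = (sum_z |D_{t_z} - D_{t_{z-1}}|) / x; missing keywords count as 0."""
    x = len(densities) - 1
    keywords = set().union(*densities)
    return {w: sum(abs(densities[z].get(w, 0.0) - densities[z - 1].get(w, 0.0))
                   for z in range(1, len(densities))) / x
            for w in keywords}

def top_k(measure: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    return sorted(measure.items(), key=lambda p: -p[1])[:k]

# Densities of two keywords in one area over three time instants.
d = [{"apple": 0.10, "carrot": 0.40},
     {"apple": 0.30, "carrot": 0.42},
     {"apple": 0.20, "carrot": 0.41}]
print(top_k(keyword_volatility(d), k=1))  # apple changes most: ~0.15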
Distributive, Algebraic, and Holistic Measures. There are three types (also known as additivity) of measures: distributive, algebraic, and holistic, depending on whether it is possible to compute the value of a measure at a parent level directly from the values at the child level [13]. For distributive and algebraic measures, this is possible. For instance, the Fact Count at the State level can be computed by summing up the Fact Counts at the City level. Keyword Density is instead an algebraic measure: we can compute the higher-level aggregate values of this measure if we store for each child level both the frequency of each keyword and the SurfaceArea. The Top-k Keywords, Top-k Dense Keywords, and Top-k Volatile Keywords within an area measures, instead, are holistic, since the value at a parent level cannot be computed directly from the values at the child level; it is necessary to recompute them directly from the base facts every time.

Consider the computation of Top-3 Dense Keywords within an Area in Figure 3, given the two Regions r_1 and r_2 and the computation at the parent level r = r_1 ∪ r_2 (grayed-out rows are not part of the computed measure value).

Figure 3: Example: Merging of Holistic Measure (per-region keyword counts, areas, and densities; the correct densities D for r = r_1 ∪ r_2 vs. the algebraically computed D_alg)

The values in the top-3 for the members r_1 and r_2 at the child level are not sufficient to compute the correct densities for region r. Both some of the computed densities (in column D_alg, while the correct values are reported in column D) and, consequently, the final ranking would be wrong. For instance, the keyword Strawberry would not have been returned (if computed algebraically) because it is neither in the top-3 for r_1 nor for r_2. To compute the correct response, either we have to store all the aggregate values for each possible cell or we have to reprocess all the facts covered by the query. When dealing with large datasets these approaches are not feasible. Hence, in Section 4 we provide a framework for the computation of an exact and an approximate solution with accuracy guarantees.

3.4 STTOLAP Operators

A data cube allows different OnLine Analytical Processing (OLAP) operators to group, filter, and analyze cells and subsets of cells at different levels of granularity and under different perspectives. These operators are known as Slice, Dice, Roll-Up, and
Drill-Down [18]. We extend the basic OLAP operators to STTOLAP operators, i.e., for spatial, textual, and temporal dimensions, hierarchies, and measures (handling of n-m relationships is explained in Sections 4 and 5). In general, an OLAP (and STTOLAP) operator OP accepts as input a cube C' and some parameters params, and outputs a new cube C'', i.e., OP(C', params) = C''. In this way, a new OLAP operator can be applied to C''. Among all cubes, we distinguish the initial or base cube C as the cube containing all the original information at the base level.

4 STTCUBE MATERIALIZATION

Cube materialization is the process of pre-aggregating measure values at different levels of granularity in the cube to compute query responses from pre-aggregated results instead of the raw data, and hence improve query response time for
STTOLAP operators [16]. In a data cube, a cuboid is a collection of level members and associated measure values for a unique combination of dimension hierarchy levels. Each unique combination is represented by a separate cuboid. For instance, if we request the
Fact Count for the State of Denmark and have stored the Fact Count at the Region level, we can avoid accessing the raw data and compute the aggregation from much fewer rows. This is an example of partial materialization, i.e., the actual cuboid at the State level, containing the answer to the query, was not materialized, but the system was still able to exploit the cuboid for Region.

What to materialize and how much to materialize depends on the trade-off between query response time and storage cost.
No Materialization (NM) only ma-terializes the base cuboid and does not require any extra space,but will require aggregated measure values to be recomputedfrom the base cuboid every time, hence incurring much slowerresponse times. A middle-ground solution is to partially mate-rialize the cube, i.e., to materialize only some of the possiblecuboids. In this strategy, some queries will be able to exploitpre-aggregated values at the current level, while other queriescan exploit pre-aggregated values at lower levels for distributiveor algebraic measures.
4.1 Cost Model

The core of the proposed partial materialization approach depends on the trade-off between the storage cost of materializing any particular cuboid and the actual benefit that the materialization of the cuboid provides. To evaluate this benefit, we have to estimate the (run time) cost of a query. To devise a cost model for this estimation, we performed a micro-benchmark which confirmed that the running time is directly proportional to the data size (the number of rows). (Due to space restrictions, the details and figures of the cost model experiments are available in Appendices B and C.) Hence, we can use the following linear cost model for benefit calculation:

Benefit(v) = Σ_{v' ∈ descendants(v) ∪ {v}} (cost(v') − size(v))

4.2 Partial Exact and Approximate Materialization

We propose an exact partial materialization technique for pre-computing the spatio-textual and spatio-textual-temporal measure values. To answer an STT query for these measures, we materialize two other distributive measures, namely Keyword Frequency f and SurfaceArea s. Then, since Keyword Density D and Keyword Volatility ΔD are algebraic measures, they can be computed from the values of Keyword Frequency f and SurfaceArea s. Finally, Top-k Dense Keywords and Top-k Volatile Keywords are holistic, but for an exact solution we materialize Top-ALL and hence compute them from the materialized measure values (Figure 3).

We adopt the chosen linear cost model (Section 4.1) and extend the greedy algorithm approach [16] to our task (Algorithm 1). Additionally, and different from [16], Algorithm 1 accepts an input parameter K and materializes only the top-K measure values in each cuboid. For instance, for K = 10, it will materialize the top-10 keywords in each cuboid. Then, any top-k query, with k ≤ K, for a materialized cuboid will return the pre-computed answer.

Algorithm 1: Greedy Materialization
Input: Budget B, STTCube C, desired top-k K
Output: Partially materialized STTCube C
1  GreedyMaterialization(B, C, K):
2      do
3          Candidates ← {v ∈ C | ¬v.isMaterialized};
4          v ← argmax_{v ∈ Candidates} Benefit(v);
5          C.materialize(v, K);
6      while size(C) ≤ B;
7      return C;

Algorithm 1, given a size budget B (measured in rows, cuboids, or GB), proceeds until the size of the current cube is as large as possible within the budget (Line 6). At each step, it selects among all the non-materialized cuboids (Line 3) the one with the highest benefit (Line 4) and materializes it (Line 5). The difference between the exact (PEM) and approximate (PAM) materialization using Algorithm 1 is the value of K. When K = ∞, the full sorted list of measure values will be stored, so that all top-k queries can be answered for that cuboid. We set K = ∞ and K = k (to materialize only the top k measure values) for PEM and PAM, respectively.
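A Python sketch of Algorithm 1 under the linear cost model, assuming a toy cuboid lattice where each cuboid knows the finer cuboids it can be computed from; the Benefit computation follows the formula of Section 4.1, clipping negative savings at zero. All names and sizes are illustrative.

from typing import Dict, Set

def finer_closure(v: str, finer: Dict[str, Set[str]]) -> Set[str]:
    """All cuboids strictly finer than v (from which v can be computed)."""
    out = set()
    for u in finer[v]:
        out |= {u} | finer_closure(u, finer)
    return out

def cost(v, sizes, finer, materialized):
    """Rows scanned to answer a query on v: its smallest materialized
    finer-or-equal cuboid (the base cuboid is always materialized)."""
    usable = ({v} | finer_closure(v, finer)) & materialized
    return min(sizes[u] for u in usable)

def benefit(v, sizes, finer, materialized):
    """Benefit(v): sum of savings cost(v') - size(v) over v's dependents."""
    dependents = {u for u in sizes if v in finer_closure(u, finer)} | {v}
    return sum(max(cost(u, sizes, finer, materialized) - sizes[v], 0)
               for u in dependents)

def greedy_materialization(budget, sizes, finer, base):
    """Algorithm 1 sketch: repeatedly materialize the highest-benefit cuboid
    while the total number of materialized rows stays within the budget."""
    materialized = {base}
    while True:
        used = sum(sizes[m] for m in materialized)
        cands = [v for v in sizes
                 if v not in materialized and used + sizes[v] <= budget]
        if not cands:
            return materialized
        materialized.add(max(
            cands, key=lambda v: benefit(v, sizes, finer, materialized)))

# Toy lattice: each cuboid maps to the finer cuboids it is computable from.
sizes = {"base": 1000, "city,day": 200, "region,day": 50, "region,month": 10}
finer = {"base": set(), "city,day": {"base"},
         "region,day": {"city,day"}, "region,month": {"region,day"}}
print(greedy_materialization(1300, sizes, finer, base="base"))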
Query rewriting.
Finally, as in [16], after STTCube materialization, queries are still formulated in terms of the base cuboid, but rewritten by the system to be evaluated over the smallest cuboid. As a result of the materialization performed by Algorithm 1, when querying a non-materialized cuboid, we can directly exploit values in the cuboid's materialized ancestors when computing all distributive and algebraic measures. On the other hand, for holistic measures, we have to perform some additional computation. For instance, as mentioned earlier, to compute the value for the Top-k Dense Keywords in an area we can exploit the pre-computed Keyword Density values, but then we need to perform the top-k selection. That is, if the top-k for the current view is not materialized, we cannot exploit the materialized top-k of the ancestor views without incurring the risk of returning the wrong result. Yet, it is possible to exploit the top-k computation in some materialized cuboid to retrieve an approximate top-k and estimate the result's accuracy [32]. In practice, for the Top-k Dense Keywords within an area, given a target k for the top-k computation, when materializing a cuboid we materialize the top-(k+1) most dense keywords for that cuboid (i.e., set K = k+1 in Algorithm 1). Then, to compute the top-k dense keywords for a descendant cuboid by exploiting a materialized ancestor cuboid, we determine which members of the list are guaranteed to be correct.

Algorithm 2: Top-K Volatile Keywords in an Area
Input: Set of top-(k+1) volatile keyword lists Φ = {⟨r_1, kw_1, T_1⟩, ..., ⟨r_n, kw_n, T_n⟩}, set of x timestamps T_x, integer k
Output: ⟨r, kw, T_x⟩ with the top-k keywords kw in the merged area r over time interval T_x, and δ, the number of guaranteed top positions
1  TopKVolatile(Φ, T_x, k):
2      r ← ∪_{i ∈ [1,n]} r_i; A ← SurfaceArea(r);     // merge areas
3      fw ← {}; Δf ← {}; prev_f ← {};                 // empty dictionaries
4      foreach t ∈ T_x do
5          foreach ⟨r_i, kw_i, T_i⟩ ∈ Φ do
6              foreach j ∈ [1, ..., k+1] do
7                  if t ∈ T_i then
8                      w ← kw_i.get(j);               // keyword at position j
9                      f ← kw_i.freq(j);              // frequency at position j
10                     fw[w] ← fw[w] + f;
11                     Δf[w] ← Δf[w] + |prev_f[w] − f|;
12                     prev_f[w] ← f;
13             θ ← θ + kw_i.freq(k+1);                // bound for unseen keywords
14     kw ← topK(fw, A, Δf);                          // top-k volatile keywords
15     δ ← max{j ∈ [1, ..., k] : kw.freq(j) ≥ θ};
16     return ⟨r, kw, T_x⟩, δ;

Algorithm 2 implements this computation for Top-k Volatile Keywords within an area. It receives as input the set Φ = {⟨r_1, kw_1, T_1⟩, ⟨r_2, kw_2, T_2⟩, ..., ⟨r_n, kw_n, T_n⟩} of lists of top-K (i.e., k+1) dense keywords in a specific area with respective time stamps, the time interval T_x divided into x equal-sized intervals (e.g., day or month), and the value for k. The output is the ranked list of top-k volatile keywords in the area r that is composed by merging the areas r_1, r_2, ..., r_n. It computes the SurfaceArea of the merged area r (line 2). Then it merges all the aggregated keyword frequencies (line 10) and changes in keyword frequencies (line 11) for each time instance in T_x (line 4) in the respective dictionaries fw and Δf (lines 4-13), by getting each keyword in each list (line 8) and the corresponding frequencies (line 9). If a keyword is not found in the fw, Δf, or prev_f dictionary, then its value is considered to be zero. Moreover, it keeps track of the upper-bound frequency θ for keywords outside the current materialized ranking, for possible error reporting (line 13). Once all frequencies and changes in frequencies are merged, we compute the top-k volatile keywords using the aggregated values (line 14). Finally, by comparing the value of θ with the frequencies of keywords in the aggregated top-k, we report how many positions in the current ranking are guaranteed to be exact (line 15). In the best case, the frequency of the keyword at position k will be at least θ and thus the computed top-k is guaranteed to be correct.
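The guarantee mechanism of Algorithm 2 can be illustrated without the time dimension: merge per-area top-(k+1) lists, accumulate the (k+1)-th frequencies into the bound θ, and count how many merged positions are safely above it. A minimal sketch, with illustrative names and data:

from typing import Dict, List, Tuple

def merge_topk(lists: List[List[Tuple[str, int]]], k: int):
    """Merge per-area top-(k+1) keyword frequency lists. theta accumulates
    the (k+1)-th frequencies: an upper bound on what any keyword hidden
    below a materialized ranking could still contribute."""
    agg: Dict[str, int] = {}
    theta = 0
    for lst in lists:
        for w, f in lst[:k + 1]:        # aggregate all visible entries
            agg[w] = agg.get(w, 0) + f
        if len(lst) > k:                # bound for the unseen tail
            theta += lst[k][1]
    ranked = sorted(agg.items(), key=lambda p: -p[1])[:k]
    # a position is guaranteed if no unseen keyword can overtake it
    delta = sum(1 for _, f in ranked if f >= theta)
    return ranked, delta

r1 = [("apple", 30), ("carrot", 25), ("onion", 9)]   # top-(2+1) of area r1
r2 = [("carrot", 40), ("love", 12), ("apple", 8)]    # top-(2+1) of area r2
print(merge_topk([r1, r2], k=2))
# ([('carrot', 65), ('apple', 38)], 2) -- both positions guaranteed (theta=17)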
5 STTCUBE CONSTRUCTION

Here, we describe the proposed approach for constructing an STTCube. Algorithm 3 takes a collection X of STT objects to be analyzed, a textual taxonomy T with semantic information about the terms, themes, topics, and concepts, and a geographical taxonomy G for cities, regions, and countries. Standard date functions are used for the temporal dimension processing. Moreover, it also receives as input the parameters B and K as the budget and the number of top-K keywords for the partial materialization.

Algorithm 3: STTCubeConstruction
Input: Collection of spatio-textual-temporal objects X, knowledge source T, geographical information G, materialization budget B, desired top-k K
Output: Spatio-Textual-Temporal Cube C
1  ConstructSTTCube(X, T, G, B, K):
2      C ← load empty or existing cube;
3      C.d_Time ← initialize or load temporal dimension;
4      C.d_Location ← initialize or load spatial dimension;
5      C.d_Text ← initialize or load textual dimension;
6      C.F ← initialize empty or load existing Fact Table;
7      foreach o ∈ X do
8          UpdateTemporalHierarchies(o.t, C.d_Time);
9          l' ← ProcessLocation(o.l);
10         UpdateSpatialHierarchies(l', G, C.d_Location);
11         x' ← ProcessText(o.x);
12         UpdateTextualHierarchies(x', T, C.d_Text);
13         InsertFact(o.t, l', x', C.F);
14     GreedyMaterialization(B, C, K);
15     return C;

Algorithm 3 constructs the STTCube in an incremental way: it initializes an empty cube (line 2), and then the corresponding spatial, textual, and temporal dimensions (lines 3-5) as well as the Fact Table (line 6). If the cube is already constructed, i.e., the cube is being updated instead of constructed for the first time, then Algorithm 3 loads the existing STTCube (lines 2-6) and updates it with new information. In particular, the spatial dimension has the grid-based hierarchy and the hierarchy with the base level at each object's Location (i.e., the geographical point), and then the levels City, Region, Country, and All (5 levels in total). The textual dimension, instead, has the hierarchy built from the base level Term, and then Theme, Topic, Concept, and All (5 levels in total). Finally, the temporal dimension contains the Date and TimeOfDay hierarchies mentioned in Section 3.

Once the basic structure is prepared, Algorithm 3 loops through each STT object in X (lines 7-13). In this loop, it extracts and initializes from each STT object the base-level members for each dimension. Then, once the base-level data has been extracted, it proceeds with building the various dimension hierarchies starting from the existing base-level members and exploiting the provided spatial and textual taxonomies (lines 8-12). Once the dimension hierarchies are built, the STT object itself is then inserted in the fact table of the STTCube (line 13) so that each fact is linked to the lowest (base) level members in the respective dimensions. In this step (line 13), the fact measure values are also computed (e.g., the keyword count). As the last step (line 14), Algorithm 3 executes the (partial) materialization procedure.
Spatial Hierarchies Construction. In our proposed STTCube, the base level for the spatial hierarchies is the Location present in the raw data, i.e., the longitude and latitude points. Hence, we use the Military Grid Reference System (MGRS) for the grid-based hierarchy, and, when building the semantic-based hierarchy, individual points are linked to the respective cities using the information in the available geographical taxonomy G, or to a special member for points that link to unknown locations. This corresponds to the step function from Location to City. The spatial taxonomy G is also used to generate the spatial hierarchy step functions for the higher levels.

Textual Hierarchies Construction. The unstructured nature of the text makes it a challenging task to convert it into a dimension of a cube. In Algorithm 3, the ProcessText function (line 11) implements the following steps: (1) it splits the text into individual words, (2) it removes stop words, and (3) it converts the remaining words to their base form (e.g., "works" and "working" have the same base form "work"). The final processed text is used to populate the Term base level in the textual dimension. This implements the base step function in the textual hierarchies and links every fact to one or more Terms; hence it has an n-m cardinality. Moreover, while constructing the higher levels, using the semantic taxonomy T (e.g., WordNet), each STT object is linked to one or more Themes, and similarly for Topics and Concepts.
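A minimal sketch of the ProcessText steps, with a toy stop-word list and lemma table standing in for the Stanford CoreNLP pipeline used in the implementation (Section 6):

import re

STOP_WORDS = {"a", "an", "the", "in", "so", "i", "here", "from", "should"}

# Tiny stand-in lemmatizer; the actual implementation uses Stanford CoreNLP.
LEMMAS = {"works": "work", "working": "work", "stresses": "stress"}

def process_text(text: str) -> list:
    """ProcessText sketch: tokenize, drop stop words, map words to base forms.
    The result populates the Term base level of the textual dimension."""
    tokens = re.findall(r"[#@]?\w+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

print(process_text("A lovely evening here in Paris. So far from everyday stresses."))
# ['lovely', 'evening', 'paris', 'far', 'everyday', 'stress']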
6 EXPERIMENTAL EVALUATION

Now, we report on the performance of STTCube analysis. In particular, we compare the different materialization strategies for the STTCube and a No STTCube (NC) implementation, in terms of query response time (QRT) and storage cost. NC answers the queries by computing the query response from base data without constructing the STTCube. NC stores pre-processed text (lemmatized text after removal of stop and invalid words), along with the geographical location (a longitude and latitude point) and timestamp. Specifically, NC uses user-defined functions for text processing (for retrieving individual terms) and location processing (e.g., identification of the city a particular longitude-latitude point belongs to) and built-in functions for timestamps. Further, NC filters on location and timestamp for the queried area and time and performs a series of joins, e.g., 4 joins for the Concept level, to retrieve information for the requested textual level. Finally, it groups results on the textual and temporal columns, computes the STT measure values, and performs the top-k selection. Also, we compare QRT and hierarchy construction time for different combinations of hierarchy schemes. Moreover, we report on the accuracy of PAM and demonstrate its performance advantage when compared to PEM. Lastly, we compare QRTs for different spatial and textual hierarchy schemes, showing that the combination of the Grid-based spatial and Majority-based textual (GM) hierarchy schemes achieves the fastest QRTs among all hierarchy combinations.
Experimental Setup. We evaluate the STTCube on a real-world Twitter dataset containing 125 million tweets collected over six weeks. Each tweet contains the tweet location, text, and time. We implemented the STTCube in a leading commercial RDBMS, called RDBMS-X as we cannot disclose the name. The proposed design is realized using a snowflake schema to avoid redundancy in the dimension data. We implemented the Pre-Processing (PP) component, where the whole raw dataset is parsed and the relational tables are populated, in Java (v11). All tests are run on a Windows Server machine with 2 Intel Xeon 2.50GHz CPUs and 16GB RAM.

We extracted the taxonomy for the spatial dimension from GeoNames [1]. For the City level, we considered all the cities having a population above a minimum threshold; for the Region level, we use the administrative division information available in the GeoNames dataset. We use a reverse geocoding process to find the city name for the Location coordinates. For the textual dimension, as a taxonomy for Terms, Themes, Topics, and Concepts, we use the widely used WordNet [11]. We use the direct HYPERNYM link of WordNet to decide the parent member for a Term, Theme, and Topic. If a term is present in WordNet and has a super-class (HYPERNYM), then the super-class becomes the parent of the term. Otherwise, it becomes its own parent (this avoids unbalanced hierarchies and UNKNOWN values in the hierarchy). For text pre-processing (tokenization, lemmatization, and stop word removal) we use the Stanford CoreNLP library [26]. We implemented the temporal dimension using the standard Date and Time functions supported in RDBMS-X. We implemented the semantic-based and grid-based hierarchy schemes for the spatial dimension, the replication-based and majority-based hierarchy schemes for the textual dimension (Section 3.2), and the Date hierarchy for the temporal dimension.
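A sketch of the Term-to-parent step using NLTK's WordNet interface (this assumes NLTK and its WordNet corpus are available; the function name is illustrative and the implementation details in the paper may differ):

# Assumes: pip install nltk, plus the WordNet corpus download below.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def parent_member(term: str) -> str:
    """Return a direct hypernym of a term, or the term itself when no
    hypernym exists (avoiding unbalanced hierarchies / UNKNOWN members)."""
    synsets = wn.synsets(term)
    if synsets and synsets[0].hypernyms():
        return synsets[0].hypernyms()[0].lemma_names()[0]
    return term

print(parent_member("carrot"))  # a WordNet hypernym lemma, e.g., a root sense
print(parent_member("xyzzy"))   # 'xyzzy' (not in WordNet: its own parent)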
Spatial, Textual, and Temporal Level Members. The base levels contain about 40 million Location points and 9 million Terms. The GeoNames taxonomy contains 132K cities, divided into 4K administrative divisions (regions) for 247 countries. Among those, we have tweets for 104K cities, about 3K regions, and 246 distinct countries. In the textual hierarchy, terms are grouped into about 23K Themes, 19K Topics, and 17K Concepts. Furthermore, the temporal dimension spans 37 days. Finally, for PAM we materialize the K = 31 densest keywords.

We implemented Keyword Density, Keyword Volatility, Top-k Dense Keywords within an area, and Top-k Volatile Keywords within an area as prototypical STT measures. We compare the PEM and PAM strategies with the following three baselines.
No STTCube (NC): the traditional RDBMS setup with all textual, spatial, and temporal functions implemented as built-in or user-defined functions. NC is the traditional solution one would go for without the STTCube. No Materialization (NM): constructs the STTCube and minimizes the storage cost by only materializing the base cuboid and computing all query responses from it. Full Materialization (FM): minimizes the QRT by materializing every cuboid in the STTCube. With this approach, queries are answered through a lookup in the pre-computed cuboid. These three baselines are at the extreme ends of the space-time trade-off and are usually infeasible for large datasets.
Queries. We perform experiments on five different dataset sizes using nine different STT queries. Each STT query, described in Table 3, targets a different level of spatial, textual, and temporal
Table 3: Spatio-Textual-Temporal Queries

Query | Description
Q1 | Top-k Dense/Volatile Terms in a City [time span]
Q2 | Top-k Dense/Volatile Topics in a City [time span]
Q3 | Top-k Dense/Volatile Concepts in a Country [time span]
Q4 | Top-k Dense/Volatile Terms in a Region [time span]
Q5 | Top-k Dense/Volatile Concepts in a Region [time span]
Q6 | Top-k Dense/Volatile Themes in a Region [time span]
Q7 | Top-k Dense/Volatile Terms in a Country [time span]
Q8 | Top-k Dense/Volatile Terms in a Country Group by Region [time span]
Q9 | Top-ALL Dense/Volatile Topics in a Country Group by Region [time span]
Figure 4: QRTs for spatio-textual-temporal measures for different combinations of hierarchy schemes over 125 million objects. Panels (a)-(d) show GR and panels (e)-(h) show GM results for Keyword Density, Keyword Volatility, Top-k Dense Keywords, and Top-k Volatile Keywords, respectively.
Figure 5: QRT vs. data size. Panels (a)-(e) show GR and panels (f)-(j) show GM results for Keyword Density (PEM), Top-k Dense Keywords within an Area (PEM and PAM), and Top-k Volatile Keywords within an Area (PEM and PAM), over 25-125 million objects.

granularity. Each query requests either dense or volatile keywords, with a time range which is used for volatile but not for dense keyword queries. We execute each query ten times with randomly generated parameters for each method and report the mean and standard deviation.
Query Response Time. For the Top-k Dense and Top-k Volatile Keywords within an area measures, we compare the QRT of the PEM and PAM methods with the NC, NM, and FM baselines. For Keyword Density and Keyword Volatility, no approximate solution is possible, so we only compare PEM with NC, NM, and FM. As the Majority-based textual hierarchy scheme does not process Terms (Section 3.2), we only evaluate for it five out of nine queries, requesting Theme, Topic, and Concept (Figures 4e-4h). Furthermore, we cannot evaluate PAM for Q9, as no approximate solution is possible for it.

We plot results in Figures 4a-4h for 100% (125M) of the data, as the results are similar for smaller data sizes. Each row in Figure 4 shows the QRTs for one particular combination of spatio-textual hierarchy schemes. Specifically, Figures 4a-4d show the QRTs for the Grid-based spatial and Replication-based textual (GR) hierarchy combination for all measures. Similarly, Figures 4e-4h show QRTs for the Grid-based spatial and Majority-based textual (GM) combination. Figure 4 has queries on the x-axis and QRTs in msec on the y-axis (note: log scale). (Due to space constraints we have omitted the figures for the Semantic-based spatial hierarchy, as we observed similar results; they can be found in Appendix Figures 9 and 10.)

Figure 4 confirms that NC is consistently slower than NM: regardless of the spatial hierarchy scheme, it is one or more orders of magnitude slower than NM for both the Replication-based and the Majority-based textual hierarchy. The Majority-based textual hierarchy scheme achieves faster QRTs because it does not process individual Terms but directly links a Theme to the fact, hence drastically reducing the number of rows to process (from millions to thousands). Furthermore, NM is orders of magnitude slower than PEM and PAM for all measures and combinations of hierarchy schemes. PEM is on average six times slower than FM, which achieves its fast QRTs at the expense of a highly increased storage cost (Figure 6a). PAM achieves near-optimal QRTs because it materializes only the K densest keywords in each cuboid, hence it has much fewer rows to process. QRTs for Q9 for the Top-k Dense and Top-k Volatile Keywords within an area measures are the worst for PAM (same as NM) for all combinations of hierarchy schemes, because Q9 requests ALL keywords' densities instead of the top-k, which cannot be computed from the approximate pre-aggregated information; to generate a response for Q9, we have to process all detail data directly from the base facts. In comparison, PEM and PAM materialize a subset of views (also a subset of rows for PAM) and use the pre-aggregated measure values in those views to efficiently generate a response for a query instead of processing base facts, thus improving the overall QRT. NC is the slowest of all (orders of magnitude slower than the STTCube-based NM) because it has to process the complete dataset for computing each query response and cannot take advantage of the STTCube optimizations for STT measures.

Figure 6: (a) Storage cost (number of rows) for NM, PAM, PEM, and FM; (b) views selection benefit (millions of rows); (c) cube construction time (minutes) for PP, PEM, PAM, FM, and incremental updates (PP_INC); (d) hierarchy construction time (minutes) for the Grid, Semantic, Replication, and Majority schemes.

Among all the hierarchy scheme combinations, GM has the fastest QRTs, mainly because the Majority-based scheme drastically reduces the row count by linking the Theme directly to each Fact instead of individual
Terms, whereas GR has the slowest QRTs due to the Replication-based scheme having far more rows to process than the Majority-based textual hierarchy. Furthermore, the Grid and Semantic-based spatial hierarchies have similar QRTs.

Figure 5 shows the scalability of PEM and PAM over growing data sizes for different combinations of hierarchy schemes and confirms that the QRTs are almost constant as the data grows. This is because the sizes of the materialized views do not increase much as the data grows: only new dimension members, e.g., new cities or topics, increase the size of materialized views, and only by a small fraction. Figures 5f-5j confirm that the GM hierarchy combination results in the fastest QRTs, i.e., all QRTs < 100 msec. On the contrary, Figures 5a-5e show that GR yields the slowest QRTs, with QRTs as high as 400 msec. Figure 5 confirms that PAM consistently achieves the fastest QRTs (mostly < 100 msec, with a few slightly over 100 msec) regardless of hierarchy schemes. Figure 5 also shows that PEM and PAM scale linearly w.r.t. data size.
Storage Cost. We now compare the storage cost for FM, PEM, PAM, and NM. We do not compare NC's storage cost because it does not construct the STTCube, and hence does not materialize anything. We only show the storage cost for up to 20 million tweets, because FM takes an unfeasible amount of time beyond that (shown in Figure 6c), while for the other methods and over the larger datasets we observe the same trend. We use the number of rows in a view as its storage cost.

The base cube's storage cost is always needed. Besides that, every additional materialized view adds to the storage cost, as displayed in Figure 6a, which shows the storage cost of NM, PAM, PEM, and FM over growing data sizes. The materialization of the STTCube using PEM and PAM only adds 13% and 0.1% to the storage cost of the base cube, respectively, whereas using FM increases the storage cost by more than an order of magnitude. PEM reduces the storage cost by only materializing a subset of views (four views) and still achieves 2-5 orders of magnitude improvement in QRT (Figure 4). PAM further reduces the storage cost by only materializing a subset of rows in a view (top-k) and gains an additional order of magnitude improvement in QRT. On the other hand, FM materializes all views in a cube, i.e., 500 (5 × 5 × 5 × 4) views in our case, which makes the view materialization storage cost much higher (one order of magnitude) than the base cube itself, as shown in Figure 6a. Figure 6a confirms that our proposed methods PEM and PAM reduce the storage cost between 97% and 99.9% compared to FM.
Views Selection for Materialization. Our proposed methods PEM and PAM are partial materialization methods that materialize only a subset of the cuboids. Hence, an important trade-off to be understood is between the number of cuboids to materialize, the corresponding storage cost, and the gain in query response time achieved. We empirically evaluate the benefit gained (improvements in QRT for all dependent cells which can be answered using a view) against the cost of materializing the view (Algorithm 1). We consider the base cube as a necessary view to be materialized and consider its benefit as zero. Figure 6b shows that materializing three cuboids ((Day, City, Term), (Day, Location, Theme), and (Day, Region, Term)) on top of the base cube gains the most benefit, after which we do not get a significant advantage from materializing further cuboids. The reason is that the materialized cuboids are already small enough, so the benefit of materializing any descendant cuboid is small. Hence, materializing 4 cuboids represents the best trade-off between QRT and storage cost.
Pre-Processing and Cube Construction.
Here, we report the time for the construction of the STTCube. Construction of an STTCube is divided into two steps: 1) Pre-Processing (PP) of the base facts (spatio-textual-temporal objects) and population of the relational tables, and 2) materialization of views. Further, the materialization of views can be done using either FM, PEM, or PAM. In Figure 6c, we have data sizes on the x-axis and time in minutes on the y-axis (note: log scale). FM is the most time-consuming of all, adds significant overhead on top of the PP time, and does not scale. On the contrary, the PEM and PAM times are negligible compared to the FM time. Hence, with PEM and PAM, the STTCube construction time scales linearly. To evaluate the STTCube's ability to handle updates, we performed several updates of 4M tweets each (PP_INC line in Figure 6c). The experiments confirm that the STTCube handles updates efficiently by processing only the new STT objects and updating the respective dimensions and measures.
Furthermore, we compare the different hierarchy schemes w.r.t. their construction time. Figure 6d shows the hierarchies' construction time for the different hierarchy schemes. It is evident from Figure 6d that, among all, the Replication-based textual hierarchy scheme takes the longest to construct because, for each single spatio-textual-temporal object, it has to process each individual Term and construct a hierarchy for it, whereas all other schemes process only one hierarchy instance per spatio-textual-temporal object. Figure 6d confirms that all of the hierarchy schemes are constructed in linear time w.r.t. data size, allowing the STTCube to support multiple hierarchy schemes.
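As a rough illustration of why the Replication-based textual hierarchy scheme is the most expensive to construct, the sketch below contrasts per-Term hierarchy instantiation with the one-instance-per-object processing of the other schemes; the object structure and the `build_hierarchy` stub are hypothetical simplifications, not the paper's actual implementation.

```python
# Contrast of hierarchy-construction effort per STT object (illustrative).
objects = [
    {"id": 1, "terms": ["paris", "travel", "evening"]},
    {"id": 2, "terms": ["food", "paris"]},
]

def build_hierarchy(value):
    # Stand-in for the real lookup (e.g., Term -> Theme -> Topic).
    return {"leaf": value}

# Replication-based scheme: one hierarchy instance per individual Term,
# so the work grows with the total number of terms across all objects.
replicated = [build_hierarchy(t) for obj in objects for t in obj["terms"]]

# Other schemes: a single hierarchy instance per object.
single = [build_hierarchy(obj["terms"]) for obj in objects]

print(len(replicated), "hierarchy instances vs.", len(single))
```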
Accuracy.
Given that PAM efficiently computes approximate measure values, it becomes necessary to evaluate its accuracy. To evaluate the accuracy of PAM, we use NM's results as the ground truth. Our evaluation results in Table 7a confirm that it achieves high accuracy: it is 100% for 6 out of 8 queries, and 90-97% for the other 2. The queries with 90-97% accuracy request as many keywords as are materialized, so the risk of wrong results near the border (the bottom of the top-k list) is higher.
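The accuracy numbers in Table 7a compare PAM's approximate top-k answers against NM's exact answers; a minimal sketch of an overlap-based accuracy metric of this kind is shown below (the metric definition and the sample lists are our illustrative assumptions, not necessarily the exact metric used in the evaluation).

```python
# Top-k accuracy as the fraction of the exact top-k list that the
# approximate method recovers (illustrative metric).
def topk_accuracy(exact, approximate):
    k = len(exact)
    return 100.0 * len(set(exact) & set(approximate)) / k

exact_top5 = ["paris", "food", "travel", "music", "art"]
approx_top5 = ["paris", "food", "travel", "art", "sport"]  # error near the border
print(topk_accuracy(exact_top5, approx_top5))  # 80.0
```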
QRT of STTOLAP Operators.
Our proposed materialization strategies (PEM and PAM) improve the QRTs of the STTOLAP operators. To demonstrate this, we perform a series of STTOLAP operations and measure their QRT under the different materialization strategies.
(Table 7a) PAM's accuracy (%) per query over growing data sizes (in millions).

Query   25      50      75      100     125
Q1      100.0   100.0   100.0   100.0   100.0
Q2      100.0   100.0   100.0   100.0   100.0
Q3      100.0   100.0   90.0    95.0    90.0
Q4      92.3    100.0   100.0   100.0   100.0
Q5      100.0   100.0   100.0   100.0   100.0
Q6      100.0   100.0   100.0   100.0   100.0
Q7      93.3    96.7    90.0    93.3    93.3
Q8      100.0   100.0   100.0   100.0   100.0
[Figure 7b: STTOLAP Operations' QRTs. QRT (ms) for the operation sequence Start, RU, RU, D, S, DD, DD under NM, PEM, PAM, and FM.]
[Figure 7c: Materialized K vs. QRT. QRT (msec) box plots for materialized K values of 10, 20, 50, 100, 200, 500, and 1000.]
Figure 7b shows the QRTs of multiple STTOLAP operations under the different materialization strategies, with the STTOLAP operations on the x-axis (RU, D, S, and DD denote the STT Roll-Up, Dice, Slice, and Drill-Down operators, respectively) and the QRT in msec on the y-axis. It is evident that NM is on average 3-5 orders of magnitude slower than PEM, which in turn is one order of magnitude slower than PAM. Furthermore, PAM achieves near-optimal QRTs, just a fraction higher than FM. These experiments confirm that the STTCube's materialization methods (PEM and PAM) improve the STTOLAP operators' QRTs by materializing only a subset of cuboids.
Top-K Value Estimation.
Here, we study the relationship between the QRT and the value of the materialized K. We create seven different STTCube materialization versions using 10, 20, 50, 100, 200, 500, and 1000 as the value of K. Next, we use the Gamma distribution to generate 100 random numbers in the range 1 to 1000, to be used as top-k values. We chose the Gamma distribution because it resembles a common long-tail distribution for top-k values. We execute each query for all 100 generated top-k values over all seven materialization versions. Figure 7c shows the QRT for all queries over the different materialization versions. For K = 10 and 20, the median value coincides with the box top and hence is not visible in the plot. It is evident from Figure 7c that a larger value of the materialized K achieves faster QRTs (a lower median value) because almost all queries are answered using the pre-computed measure values. In contrast, for a smaller K, all queries requesting k > K must be answered using non-pre-computed measure values from the base cuboid, resulting in slower QRTs (a higher median value). A large value of K such as 1000 is not recommended because 1) very few queries request such a large top-k, and 2) it requires more storage (Figure 6a). Specifically, between K = 50 and 100, and between K = 100 and 200, the QRT decreases by 35% and 0%, while the storage increases by 250% and 200%, respectively. Hence, these experiments confirm that choosing a value between 20 and 50 for K is a near-optimal choice in our current experimental settings.
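The trade-off just described follows from how top-k queries are answered; a minimal sketch, assuming a materialized top-K list and a fallback to recomputation from the base cuboid (all names and data below are illustrative):

```python
# Answer a top-k query from a materialized top-K list when possible.
MATERIALIZED_K = 50
materialized_topk = [("paris", 910), ("food", 850), ("travel", 640)]  # truncated example

def answer_topk(k, base_rows):
    if k <= MATERIALIZED_K:
        return materialized_topk[:k]  # fast path: pre-computed measure values
    # Slow path: re-aggregate the base cuboid and rank from scratch.
    from collections import Counter
    return Counter(base_rows).most_common(k)

print(answer_topk(2, base_rows=["paris", "food", "paris"]))
```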
CONCLUSION AND FUTURE WORK
In this paper, we defined and formalized the Spatio-Textual-Temporal Cube (STTCube) structure to effectively perform STTCube analytics. We introduced STT hierarchies, spatio-textual and spatio-textual-temporal measures, and STTOLAP operators to analyze STT data together. For efficient exact and approximate computation of STT measures, we proposed a pre-aggregation framework able to provide faster response times at the cost of a controlled amount of extra storage for pre-computed measure values. We observed that partial materialization provides a 1 to 5 orders of magnitude reduction in query response time, with a storage cost reduced by between 97% and 99.9% compared to full materialization techniques. Moreover, approximate materialization provides an accuracy between 90% and 100%, while requiring considerably less space than no-materialization techniques. In future work, we plan to enhance the STTCube with additional STT measures and a distributed implementation.
REFERENCES
[1] 2020. GeoNames. http://download.geonames.org/. Accessed: 2020-09-09.
[2] A. Almaslukh, A. Magdy, A. M. Aly, M. F. Mokbel, S. Elnikety, Y. He, S. Nath, and W. G. Aref. 2019. Local trend discovery on real-time microblogs with uncertain locations in tight memory environments. GeoInformatica (2019).
[3] M. Azabou, K. Khrouf, J. Feki, C. Soulé-Dupuy, and N. Vallès. 2016. Analyzing textual documents with new OLAP operators. AICCSA (2016).
[4] X. Cao, L. Chen, G. Cong, C. S. Jensen, Q. Qu, A. Skovsgaard, D. Wu, and M. L. Yiu. 2012. Spatial Keyword Querying. ER (2012).
[5] L. Chen, G. Cong, C. S. Jensen, and D. Wu. 2013. Spatial Keyword Query Processing: An Experimental Evaluation. PVLDB (2013).
[6] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. 2002. Multi-dimensional Regression Analysis of Time-series Data Streams. VLDB (2002).
[7] M. L. Chouder, S. Rizzi, and R. Chalal. 2019. Exploratory OLAP over Document Stores. IS (2019).
[8] G. Cong, K. Feng, and K. Zhao. 2016. Querying and mining geo-textual data for exploration: Challenges and opportunities. ICDEW (2016).
[9] G. Cong and C. S. Jensen. 2016. Spatial Keyword Queries and Beyond. SIGMOD (2016).
[10] B. Ding, B. Zhao, C. X. Lin, J. Han, C. Zhai, A. Srivastava, and N. C. Oza. 2011. Efficient Keyword-Based Search for Top-K Cells in Text Cube. TKDE (2011).
[11] C. Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press (1998).
[12] W. Feng, C. Zhang, W. Zhang, J. Han, J. Wang, C. Aggarwal, and J. Huang. 2015. STREAMCUBE: Hierarchical spatio-temporal hashtag clustering for event exploration over the Twitter stream. ICDE (2015).
[13] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. 1996. Data cube: a relational aggregation operator generalizing GROUP-BY, CROSS-TAB, and SUB-TOTALS. ICDE (1996).
[14] N. Gür, T. B. Pedersen, E. Zimanyi, and K. Hose. 2017. A foundation for spatial data warehouses on the Semantic Web. Semantic Web (2017).
[15] J. Han, K. Koperski, and N. Stefanovic. 1997. GeoMiner: A System Prototype for Spatial Data Mining. SIGMOD (1997).
[16] V. Harinarayan, A. Rajaraman, and J. D. Ullman. 1996. Implementing data cubes efficiently. SIGMOD (1996).
[17] P. Jayachandran, K. Tunga, N. Kamat, and A. Nandi. 2014. Combining User Interaction, Speculative Query Execution and Sampling in the DICE System. ICDE (2014).
[18] C. S. Jensen, T. B. Pedersen, and C. Thomsen. 2010. Multidimensional Databases and Data Warehousing. Morgan & Claypool Publishers.
[19] J. F. Kenney and E. S. Keeping. 1962. Mathematics of Statistics, Part 1, Chapter 15. Van Nostrand (1962).
[20] J. D. Knijff, F. Frasincar, and F. Hogenboom. 2013. Domain taxonomy learning from text: The subsumption method versus hierarchical clustering. DKE (2013).
[21] M. D. Lieberman, H. Samet, J. Sankaranarayanan, and J. Sperling. 2007. STEWARD: Architecture of a Spatio-textual Search Engine. GIS (2007).
[22] C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. 2008. Text Cube: Computing IR Measures for Multidimensional Text Database Analysis. ICDM (2008).
[23] L. Lins, J. T. Klosowski, and C. E. Scheidegger. 2013. Nanocubes for real-time exploration of spatiotemporal datasets. TVCG (2013).
[24] X. Liu, K. Tang, J. Hancock, J. Han, M. Song, R. Xu, and B. Pokorny. 2013. A Text Cube Approach to Human Social and Cultural Behavior in the Twitter Stream. SBP (2013).
[25] A. Magdy, L. Abdelhafeez, Y. Kang, E. Ong, and M. F. Mokbel. 2020. Microblogs data management: a survey. The VLDB Journal (2020).
[26] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. ACL (2014).
[27] R. Othman, R. Belkaroui, and R. Faiz. 2016. Customer Opinion Summarization Based on Twitter Conversations. WIMS (2016).
[28] B. Pat and Y. Kanza. 2017. Where's Waldo?: Geosocial Search over Myriad Geotagged Posts. SIGSPATIAL (2017).
[29] J. M. Pérez-Martínez, R. Berlanga-Llavori, M. J. Aramburu-Cabo, and T. B. Pedersen. 2008. Contextualizing Data Warehouses with Documents. DSS (2008).
[30] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling. 2009. TwitterStand: News in Tweets. SIGSPATIAL (2009).
[31] A. Simitsis, A. Baid, Y. Sismanis, and B. Reinwald. 2008. Multidimensional content eXploration. PVLDB (2008).
[32] A. Skovsgaard, D. Sidlauskas, and C. S. Jensen. 2014. Scalable top-k spatio-temporal term querying. ICDE (2014).
[33] M. Walther and M. Kaisser. 2013. Geo-spatial Event Detection in the Twitter Stream. ECIR (2013).
[34] S. Wang, J. Cao, and P. Yu. 2020. Deep learning for spatio-temporal data mining: A survey. TKDE (2020).
[35] D. Yu, D. Xu, D. Wang, and Z. Ni. 2019. Hierarchical Topic Modeling of Twitter Data for Online Analytical Processing. IEEE Access (2019).
[36] C. Zhang and J. Han. 2019. Multidimensional Mining of Massive Text Data. DMKD (2019).
[37] D. Zhang, C. X. Zhai, J. Han, A. Srivastava, and N. Oza. 2009. Topic Modeling for OLAP on Multidimensional Text Databases. Stat. Anal. Data Min. (2009).
[38] K. Zhao, L. Chen, and G. Cong. 2016. Topic Exploration in Spatio-Temporal Document Collections. SIGMOD (2016).
A STTOLAP OPERATORS
A data cube allows different OnLine Analytical Processing (OLAP) operators to group, filter, and analyze cells and subsets of cells at different levels of granularity and under different perspectives. These operators are known as Slice, Dice, Roll-Up, and Drill-Down [18], and they take as input a cube and produce as output another cube. In the following, we extend the basic OLAP operators to STTOLAP operators, i.e., operators for spatial, textual, and temporal dimensions, hierarchies, and measures. In general, an OLAP (and STTOLAP) operator $OP$ accepts as input a cube $C'$ and some parameters $params$, and outputs a new cube $C''$, i.e., $OP(C', params) = C''$. In this way, a new OLAP operator can be applied to $C''$. Among all cubes, we distinguish the initial or base cube $C$ as the cube containing all the original information at the base level. In the following, we generally assume every OLAP operator $OP$ to have access to $C$, since some operators need access to the base cube $C$ to produce the desired result.
A.1 STT-Slice
Slice operates over the current data cube $C'$ and takes as a parameter a dimension member $v_j$ in a specific level $l_j$ of a dimension $D_i$. It keeps only the cells in $C'$ corresponding to $v_j$ and finally removes dimension $D_i$. An example of the slice operator is "slice the location dimension on a user-defined polygon representing Aalborg".
STT-Slice. The STT-Slice operator is defined as $STTSlice(C'_{stt}, v_j) = C''_{stt}$. It takes an n-dimensional STTCube $C'_{stt}$ and a member $v_j$ of the level $l_j$ of the spatial, textual, or temporal dimension $D_{Location}$, $D_{Text}$, or $D_{Time}$, respectively, and produces a resulting cube $C''_{stt}$ with dimensions $D - D_i$, i.e., it keeps only the cells for the member $v_j$ and removes the respective dimension. The member $v_j$ could be an object in the taxonomy of a semantic-based hierarchy (e.g., Aalborg as a dimension member in $L_{City}$) or a grid cell at some granularity level. Similarly, $v_j$ could be a specific Theme or Topic for the textual dimension, or a specific day or month for the temporal dimension.
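A minimal Python sketch of STT-Slice follows, using a toy encoding of a cube as a list of (coordinates, measure) pairs; this encoding is an illustrative simplification of ours, not the paper's actual relational implementation.

```python
# STT-Slice: keep only cells matching member v of dimension d, then drop d.
cube = [
    ({"Day": "2021-03-23", "City": "Aalborg", "Term": "food"}, 12),
    ({"Day": "2021-03-23", "City": "Paris",   "Term": "food"}, 30),
]

def stt_slice(cube, dim, member):
    result = []
    for coords, measure in cube:
        if coords[dim] == member:
            rest = {k: v for k, v in coords.items() if k != dim}  # remove dimension
            result.append((rest, measure))
    return result

print(stt_slice(cube, "City", "Aalborg"))
# [({'Day': '2021-03-23', 'Term': 'food'}, 12)]
```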
A.2 STT-Dice
While the slice operator selects and removes a single dimension, the dice operator produces a new cube whose cell contents have been filtered based on a set of conditions (complex predicates, queries covering several cells), but without removing any dimension. That is, it produces a resulting cube with the same number of dimensions, based only on the facts that satisfy the provided set of conditions. Such conditions can use a combination of spatial, textual, temporal, and general-purpose functions. These functions can perform different computations: e.g., they can combine more than one object and return a new single aggregated object, compare two objects and return a Boolean value, or produce a numeric value based on some computation.
STT-Dice. The STT-Dice operator selects only the cell(s) that satisfy the provided spatial, textual, or temporal logical conditions. Given an STTCube $C'_{stt}$ and a set of logical conditions $COND_s$, STTDice is represented as $STTDice(C'_{stt}, COND_s) = C''_{stt}$, where $COND_s$ is a set of atomic or compound spatial, textual, or temporal logical conditions. For instance, it can return only the cell(s) that intersect with the polygon describing a custom region of interest and contain at least $n$ observations, or the cell(s) containing only objects whose relevance score with the topic Food is above 0.

[Figure 8: Spatio-Textual-Temporal Lattice. Each node is annotated with (number of rows, materialized T/F): DLT (100M, T), DL (15M, F), DT (4M, T), LT (96M, F), D (37, F), L (14M, F), T (2M, F), ∅ (1, F).]
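A companion sketch for the STT-Dice operator just defined, in the same toy (coordinates, measure) encoding: all dimensions are kept, and cells are filtered by a conjunction of predicates (the predicates below are illustrative stand-ins; real spatial conditions would use, e.g., polygon intersection).

```python
# STT-Dice: filter cells by a set of logical conditions, keeping all dimensions.
cube = [
    ({"Day": "2021-03-23", "City": "Paris",   "Term": "food"},   30),
    ({"Day": "2021-03-24", "City": "Aalborg", "Term": "travel"},  7),
]

def stt_dice(cube, conditions):
    # `conditions` is a list of predicates over (coords, measure).
    return [cell for cell in cube if all(cond(*cell) for cond in conditions)]

diced = stt_dice(cube, [
    lambda coords, m: coords["City"] in {"Paris", "Lyon"},  # stand-in spatial test
    lambda coords, m: m >= 10,                              # at least 10 observations
])
print(diced)
```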
A.3 STT-Roll-Up
The Roll-Up operator aggregates measure values along a hierarchy by moving from a child level to a parent level. This moves the analysis to a coarser granularity. An example of this operator is "Roll-Up from City to Country".
STT-Roll-Up. The STT-Roll-Up operator groups facts by aggregating the measure values of all child members that belong to the same parent member in a spatial, textual, or temporal hierarchy. Given an STTCube $C'_{stt}$, a child level $l_c$, and a target parent level $l_p$ (identifying a hierarchy step function $h_{l_c \rightarrow l_p}$ in the spatial, textual, or temporal dimension), the STTRollUp operator, defined as $STTRollUp(C'_{stt}, l_c, l_p) = C''_{stt}$, produces a new STTCube that has the same number of dimensions; for each measure $m \in M$, the aggregation function associated with $m$ is used to create an aggregated measure value at the higher parent level. For instance, when we Roll-Up from the City level to the Region level, the Fact Count values get summed to compute the new totals. Similarly, "Roll-Up to Topic level from Theme level" groups facts by aggregating the measure values of all Themes that belong to the same Topic, and "Roll-Up to Quarter level from Month level" groups facts by aggregating the measure values of all Months that belong to the same Quarter.
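To make the aggregation step concrete, here is a minimal sketch of STT-Roll-Up in the same toy encoding; the City-to-Region mapping and the use of sum (as for Fact Count) are illustrative assumptions.

```python
# STT-Roll-Up: re-key cells from a child level to its parent level and
# aggregate the measure (sum, as for Fact Count).
from collections import defaultdict

cube = [
    ({"Day": "2021-03-23", "City": "Aalborg"}, 12),
    ({"Day": "2021-03-23", "City": "Copenhagen"}, 5),
]
city_to_region = {"Aalborg": "North Jutland", "Copenhagen": "Capital"}

def stt_rollup(cube, child, parent, step):
    grouped = defaultdict(int)
    for coords, measure in cube:
        new_coords = dict(coords)
        new_coords[parent] = step[new_coords.pop(child)]  # apply hierarchy step
        grouped[tuple(sorted(new_coords.items()))] += measure
    return [(dict(k), v) for k, v in grouped.items()]

print(stt_rollup(cube, "City", "Region", city_to_region))
```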
A.4 STT-Drill-Down
The inverse of Roll-Up is the Drill-Down operator, which shows data at a finer granularity by dis-aggregating measure values along a hierarchy. An example of the Drill-Down operator is "Drill-Down to City level from Region level". For this operation, the base cube is required, as we cannot uniquely dis-aggregate measure values knowing only the values at the parent level.
STT-Drill-Down. Given an STTCube $C'_{stt}$ at a non-base level $l_p$ of the spatial ($D_{Location}$), textual ($D_{Text}$), or temporal ($D_{Time}$) dimension and a target child level $l_c$ (identifying a hierarchy step function $h_{l_c \rightarrow l_p}$), the STTDrillDown operator $STTDrillDown(C_{stt}, C'_{stt}, l_p, l_c) = C''_{stt}$ produces a new STTCube at a finer granularity, computing dis-aggregated measure values along the selected spatial, textual, or temporal hierarchy, respectively. For instance, STT-Drill-Down can move from the Region level to the City level, dis-aggregating the Fact Count of each Region into the counts for each City in that Region. "Drill-Down to Term from Theme", following the inverse of the textual hierarchy step Term → Theme, and "Drill-Down to Day from Year", following the inverse of the temporal hierarchy step Day → Year, are examples of STT-Drill-Down operators for the textual and temporal dimensions, respectively.
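The signature above takes the base cube precisely because parent-level totals cannot be uniquely split; a sketch in the same toy encoding, recomputing the finer-grained cells from the base cube (the data and level names are illustrative):

```python
# STT-Drill-Down: recompute child-level cells from the base cube.
from collections import defaultdict

base_cube = [  # base-level facts carrying both Region and City coordinates
    ({"Region": "North Jutland", "City": "Aalborg"}, 12),
    ({"Region": "North Jutland", "City": "Frederikshavn"}, 3),
]
parent_cube = [({"Region": "North Jutland"}, 15)]  # current cube at Region level

def stt_drilldown(base_cube, parent_cube, parent, child):
    # Only drill into parent members present in the current cube.
    wanted = {coords[parent] for coords, _ in parent_cube}
    grouped = defaultdict(int)
    for coords, measure in base_cube:
        if coords[parent] in wanted:
            grouped[(coords[parent], coords[child])] += measure
    return [({parent: p, child: c}, v) for (p, c), v in grouped.items()]

print(stt_drilldown(base_cube, parent_cube, "Region", "City"))
```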
B LATTICE EXAMPLE
Consider a simple lattice of cuboids for a cube with 3 dimensions (Figure 8), each with a single 2-level hierarchy, namely with the base levels Location (L) with 14M rows for the spatial dimension, Term (T) with 2M rows for the textual dimension, and Date (D) with 37 rows for the temporal dimension, each of which can then be rolled up to the all level (∅) with only one row.
[Figure 9: QRT vs. Data Size. QRT (ms) over data sizes of 25-125 million objects for queries Q1-Q9, with panels (a)-(j) per hierarchy combination (SR, SM, GR, GM) covering Keyword Density (PEM), Top-k Dense Keywords within an Area (PEM and PAM), and Top-k Volatile Keywords within an Area (PEM and PAM).]
[Figure 10: QRTs for spatio-textual-temporal measures for different combinations of hierarchy schemes (SR, GR, SM, GM) over 125 million objects. Panels cover Keyword Density, Keyword Volatility, Top-k Dense Keywords, and Top-k Volatile Keywords under NC, NM, PEM, PAM, and FM.]
Each node in the lattice is associated with two values: first, the number of rows in the cuboid, and second, a flag (T/F) marking whether the cuboid is materialized. At the top of the lattice, we have the base cuboid (which is always materialized) with Date, Location, and Term (DLT), containing in this example all rows (100M). If we Roll-Up the spatial dimension from Location to All, we obtain a new cuboid (DT) with 4M rows. The cuboid DLT is referred to as the ancestor of the cuboid DT. If the cube is partially materialized, i.e., not all cuboids are materialized, and the cuboid DT is materialized, then to obtain the
Fact Count for every Date and Term, the cuboid DT with 4M rows would already contain the pre-computed answer, without the need to compute it from the base cube DLT with 100M rows. Moreover, when the cuboid T is not materialized, we can still compute the Fact Count for every Term from the cuboid DT by accessing only 4M values instead of the 100M in DLT.
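This "answer from the smallest materialized ancestor" lookup can be sketched as follows over the Figure 8 lattice; the row counts and flags are taken from the figure, while the dictionary encoding is our illustrative assumption.

```python
# Answer cost for a cuboid = size of its smallest materialized ancestor
# (or itself). Lattice of Figure 8: cuboid -> (rows, materialized, ancestors).
lattice = {
    "DLT": (100_000_000, True,  []),
    "DL":  (15_000_000,  False, ["DLT"]),
    "DT":  (4_000_000,   True,  ["DLT"]),
    "LT":  (96_000_000,  False, ["DLT"]),
    "D":   (37,          False, ["DL", "DT", "DLT"]),
    "L":   (14_000_000,  False, ["DL", "LT", "DLT"]),
    "T":   (2_000_000,   False, ["DT", "LT", "DLT"]),
    "ALL": (1,           False, ["D", "L", "T", "DL", "DT", "LT", "DLT"]),
}

def answer_cost(cuboid):
    rows, mat, ancestors = lattice[cuboid]
    candidates = [rows] if mat else []
    candidates += [lattice[a][0] for a in ancestors if lattice[a][1]]
    return min(candidates)

print(answer_cost("T"))  # 4M rows via the materialized DT, not 100M via DLT
```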
C COST MODEL
The core of the proposed partial materialization approach depends on the trade-off between the storage cost of materializing any particular cuboid and the actual benefit that its materialization provides. To evaluate this benefit, we have to estimate the (run-time) cost of a query. To devise a cost model for this estimation, we selected a set of representative queries (Q1-Q9, details in Table 3) for the aforementioned spatio-textual-temporal measures and measured the runtime of these queries on increasing data sizes. The micro-benchmark (Figure 11) confirmed that the running time is directly proportional to the data size (the number of rows), i.e., it confirms that we can use the Linear Cost Model [16] and the associated benefit calculation. Then, to model the dependency relationships among all the possible cuboids, we use the lattice framework [16] (Figure 8).

[Figure 11: QRT vs. Data Size. Execution time (seconds) of queries Q1-Q9 over growing data sizes in millions.]

Hence, to compute the benefit of materializing a particular cuboid $c$, we compare the cost of answering queries at all levels of granularity (i.e., for the current cuboid $c$ and all its descendants in the lattice) with the current set of materialized cuboids against the cost when $c$ is also materialized:

$$\mathit{Benefit}(c) = \sum_{c' \in \mathit{descendants}(c) \cup \{c\}} \big(\mathit{cost}(c') - \mathit{size}(c)\big)$$

For instance, assume the lattice in Figure 8 and that only the base cuboid (DLT) with 100M rows is materialized. If we consider DT with 4M rows, we have that, if materialized, queries against the cuboids D, T, ∅, and DT itself can be answered through it (with a cost of 4M each), while without materializing DT we would need to compute the answers against DLT (with a cost of 100M each). Hence, materializing DT achieves a benefit of $(100M - 4M) \times 4 = 96M \times 4 = 384M$.
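The linear cost model itself can be fit directly from micro-benchmark measurements; a minimal sketch, with made-up (rows, seconds) pairs standing in for the Figure 11 data:

```python
# Fit cost(q) ~ a * rows, as suggested by the micro-benchmark (Figure 11).
# The (rows, seconds) pairs below are made-up placeholders, not measured data.
samples = [(25e6, 5.1), (50e6, 10.3), (75e6, 15.2), (100e6, 20.5), (125e6, 25.4)]

# Least-squares slope through the origin: a = sum(x*y) / sum(x*x).
a = sum(x * y for x, y in samples) / sum(x * x for x, _ in samples)
print(f"estimated cost per row: {a:.2e} s")
print(f"predicted cost for a 4M-row cuboid: {a * 4e6:.2f} s")
```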