A Foundation for Spatio-Textual-Temporal Cube Analytics (Extended Version)
Mohsin Iqbal
Aalborg University
[email protected]

Matteo Lissandrini
Aalborg University
[email protected]

Torben Bach Pedersen
Aalborg University
[email protected]
ABSTRACT
Large amounts of spatial, textual, and temporal data are being produced daily. This is data containing an unstructured component (text), a spatial component (geographic position), and a time component (timestamp). Therefore, there is a need for a powerful and general way of analyzing spatial, textual, and temporal data together. In this paper, we define and formalize the Spatio-Textual-Temporal Cube structure to enable combined effective and efficient analytical queries over spatial, textual, and temporal data. Our novel data model over spatio-textual-temporal objects enables novel joint and integrated spatial, textual, and temporal insights that are hard to obtain using existing methods. Moreover, we introduce the new concept of spatio-textual-temporal measures with associated novel spatio-textual-temporal-OLAP operators. To allow for efficient large-scale analytics, we present a pre-aggregation framework for exact and approximate computation of spatio-textual-temporal measures. Our comprehensive experimental evaluation on a real-world Twitter dataset confirms that our proposed methods reduce query response time by 1-5 orders of magnitude compared to the No Materialization baseline and decrease storage cost between 97% and 99.9% compared to the Full Materialization baseline, while adding only a negligible overhead in the Spatio-Textual-Temporal Cube construction time. Moreover, approximate computation achieves an accuracy between 90% and 100% while reducing query response time by 3-5 orders of magnitude compared to
No Materialization.

1 INTRODUCTION

Due to the increased usage of mobile devices and advancements in accurate geo-tagging, more and more geo-tagged data is being produced [8]. In particular, social media platforms like Twitter and Facebook are some of the main sources of geo-tagged data, usually in the form of posts, comments, and reviews (e.g., Figure 1). This type of data contains spatial, textual, and temporal (STT) information. As a result,
STT data analysis is becoming increasingly important [9] since it allows us to extract new insights regarding customer satisfaction, user-generated content shared online, and brand reputation [27].

STT data contains information regarding topics discussed w.r.t. time and location, hence presenting an invaluable link between user opinions and the real world. For example, STT data can help us analyze an advertisement campaign to identify the best locations for ad placements. Traditionally, this information is accessed through spatial keyword-queries [4], e.g., to retrieve topics within a certain location, or to identify in which locations some topic is discussed. However, keyword or topic search are point-wise search tasks. Instead, there is a significant need to provide more extensive analytics analogous to traditional OLAP-style analytics. An example STT query is "find the top-k trending hashtags aggregated by topic within a user-defined region (i.e., polygon) around Paris this month".

Figure 1: Geo-tagged tweet: an example of an STT object, with its text ("A lovely evening here in Paris. So far from everyday stresses. I should travel here more often!" by Alex (@Alex1)), location, and timestamp.

© Copyright 2021 for this paper held by its author(s). Published in the proceedings of DOLAP 2021 (March 23, 2021, Nicosia, Cyprus, co-located with EDBT/ICDT 2021) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The traditional data cube model is one of the most widely used tools to analyze structured data. Since their introduction, data cubes have been extended to analyze different types of data, like sales [14], locations [15], time-series [6], and text [22], but separately. In particular, some works propose OLAP operators to analyze either textual data [3, 31, 36] or spatial data [14, 15]. However, no previous work proposed a unified model and set of operators enabling integrated and joint analysis of
STT data. Moreover, as we propose to jointly analyze STT dimensions together with other dimensions, we are also able to define novel families of measures that have not been studied before, namely spatio-textual and spatio-textual-temporal measures. These measures, as we show later, allow us to produce more advanced analytics instead of, e.g., simple keyword frequency.
Contributions.
In this paper, we introduce the Spatio-Textual-Temporal Cube (STTCube) to analyze STT data. Adding spatial, textual, and temporal support to a traditional data cube is not straightforward due to the presence of n-m relationships in textual hierarchies and because existing families of measures cannot support joint and integrated analysis involving spatial, textual, and temporal dimensions, e.g., finding the trending keywords grouped by regions, defined by geometry shapes, over a time interval (Section 3.3). Hence, we introduce new families of measures and OLAP operators that extract combined insights from STT dimensions and measures. STTCube provides specialized spatio-textual and spatio-textual-temporal measures such as Top-k Dense Keywords within an area and Top-k Volatile Keywords within an area that deliver integrated aggregates over STT data. Moreover, a set of analytical operators, namely STT slice, dice, roll-up, and drill-down, are proposed. This results in a data model able to support spatio-textual-temporal OLAP (STTOLAP) operators. Furthermore, we propose Partial Exact Materialization (PEM) and Partial Approximate Materialization (PAM) methods for efficient exact and approximate computation of STT measures, respectively. Among other things, we also provide a systematic set of solutions to handle n-m relationships in textual hierarchies. In this work, we present the following contributions:

Table 1: Spatio-Textual-Temporal Sample Dataset
Time | Location | Terms

Table 2: Presence (✓) or absence (✗) of support for spatial and textual data, dimensions, hierarchies, and measures in existing methods

Method | Textual: Data / Dim. / Hier. / Meas. | Spatial: Data / Dim. / Hier. / Meas. | ST Meas. | STT Meas.
EXODuS [7] | ✓ (JSON) / ✓ / ✓ / ✗ | ✗ / ✗ / ✗ / ✗ | ✗ | ✗
TextCube [22] | ✓ / ✓ / ✓ / ✓ | ✗ / ✗ / ✗ / ✗ | ✗ | ✗
Text OLAP [35] | ✓ / ✓ / ✓ / ✓ | ✗ / ✗ / ✗ / ✗ | ✗ | ✗
TextCubeTopKCells [10] | ✓ / ✓ / ✓ / ✓ | ✗ / ✗ / ✗ / ✗ | ✗ | ✗
GeoMiner [15] | ✗ / ✗ / ✗ / ✗ | ✓ / ✓ / ✓ / ✓ | ✗ | ✗
SpatialCube [14] | ✗ / ✗ / ✗ / ✗ | ✓ / ✓ / ✓ / ✓ | ✗ | ✗
StreamCube [12] | ✓ / ✗ / ✗ / ✗ | ✓ / ✗ / ✗ / ✗ | ✗ | ✗
TwitterSand [30] | ✓ / ✗ / ✗ / ✗ | ✓ / ✗ / ✗ / ✗ | ✗ | ✗
TextStreams [33] | ✓ / ✗ / ✗ / ✗ | ✓ / ✗ / ✗ / ✗ | ✗ | ✗
TopicExploration [38] | ✓ / ✗ / ✗ / ✗ | ✓ / ✗ / ✗ / ✗ | ✗ | ✗
SocialCube [24] | ✓ / ✗ / ✗ / ✗ | ✓ / ✗ / ✗ / ✗ | ✗ | ✗
TopicCube [37] | ✓ / ✓ / ✓ / ✓ | ✗ / ✗ / ✗ / ✗ | ✗ | ✗
ContextualizedWarehouse [29] | ✓ / ✓ / ✗ / ✗ | ✗ / ✗ / ✗ / ✗ | ✗ | ✗
STTCube | ✓ / ✓ / ✓ / ✓ | ✓ / ✓ / ✓ / ✓ | ✓ | ✓

• We extend the standard cube model to add support for spatial, textual, and temporal dimensions and hierarchies and spatio-textual and spatio-textual-temporal measures (Sections 3.1 to 3.3).
• We propose a set of analytical operators (STTOLAP) over spatio-textual-temporal data (Section 3.4).
• We introduce keyword density and keyword volatility as prototypical spatio-textual and spatio-textual-temporal measures (Section 3.3).
• We propose a pre-aggregation framework (STTCube materialization) for efficient exact (PEM) and approximate (PAM) computation of the proposed spatio-textual-temporal measures (Section 4).
• We propose techniques for processing spatio-textual-temporal objects and the construction of the STTCube (Section 5).
• We evaluate the pre-aggregation framework's (PEM and PAM) query response time, storage cost, and accuracy by comparing it with the No STTCube, Full Materialization, and No Materialization baselines. Our pre-aggregation framework provides 1-5 orders of magnitude improvement in query response time and a 97% to 99.9% reduction in storage cost with an accuracy between 90% and 100% (Section 6).
2 RELATED WORK

OLAP and the Data Cube [18] are used heavily in business intelligence to obtain insights over the historical, current, and future state of a business. With the emergence of the web and social media, an immense amount of unstructured data is being produced, which must be included in the analytical process. Table 2 summarizes the state of the art on spatial, textual, and temporal analytics by listing the properties and gaps in the current methods.

The TextCube [22] allows OLAP-like queries on text data by providing dimensions and hierarchies for terms. Moreover, it supports the computation of two information retrieval (IR) measures: inverted index and term frequency. EXODuS [7] processes semi-structured document stores (i.e., JSON) using a schema-on-read approach to allow exploratory OLAP on text. Text OLAP [35] extends traditional OLAP to support textual dimensions and keyword-based top-k search [10]. Yet, all these approaches lack support for spatial and temporal data and the advanced measures and operators required for spatio-textual-temporal analytics.
For spatial data, GeoMiner [15] proposes a cube structure for mining characteristics, comparisons, and association rules from geo-spatial data, and SpatialCube [14] allows performing spatial OLAP on the semantic web. Yet, these solutions focus on spatial data only and lack support for textual and temporal data.
There are solutions that combine more than one component of data, e.g., spatio-temporal [34], into the same model but do not provide combined STT analytics. Among those, the contextualized warehouse [29] combines traditional OLAP with a textual warehouse. This allows the user to provide some keywords, select a market (country or region), retrieve documents matching the keywords as context, and then analyze the facts related to those keywords and documents. Similarly, TopicCube [37] extends the functionality of a traditional cube and combines probabilistic topic modeling with OLAP by introducing the topic hierarchy. TwitterSand [30] and StreamCube [12] exploit textual and spatial information to gain insights by clustering Twitter hashtags and tweets in a region, respectively. STT data is also analyzed to extract event and topic information in TextStreams [33] and TopicExploration [38]. Finally, SocialCube [24] tries to capture human, social, and cultural behavior by performing linguistic analysis (sentiment analysis) over tweets. All these approaches focus on the unstructured nature of text along with spatial and temporal data, but they do not provide integrated STT analytics; for example, they do not provide the ability to compute aggregate spatial, textual, temporal, and spatio-textual-temporal measures over spatial, textual, and temporal dimensions and hierarchies. Spatial top-k keyword-queries [5, 9, 25] answer only point-wise queries and do not support aggregation functions or hierarchies. Thus, they do not support more complex OLAP-style analytical tasks, which we do. There are methods that solve a very specific task for a specific type of data [2, 21, 28]. These methods are fundamentally different from STTCube because STTCube provides a generic framework for a wide range of STT analytics over different kinds of STT data sources, including, but not limited to, geo-tagged tweets. Also, STTCube can take advantage of the improvements suggested over other cubes, e.g., Nanocubes [23] and DICE [17], making it a powerful tool for OLAP-style STT analytics.

Our summary of related work in Table 2 shows that no existing method provides integrated support for STT data, unlike STTCube. To the best of our knowledge, a proper formalization of a data cube model for STT data able to support complex analytics for STT objects at scale is missing. In particular, no previous method studies dimensions, hierarchies, and measures that allow processing STT data jointly. Furthermore, the main novel challenge for STT-OLAP is handling n-m relationships inside the STT dimensions effectively, since n-m relationships do not allow traditional pre-aggregation techniques to be used. Moreover, arbitrary temporal ranges with multiple levels of granularity add complexity to STT measure computations. As a remedy, we propose STTCube, which enables the joint and integrated analysis of STT objects by introducing new sets of measures, spatio-textual and spatio-textual-temporal measures, to gain in-depth insights using STTOLAP operators.
3 THE SPATIO-TEXTUAL-TEMPORAL CUBE

Here, we define the STTCube, an extension of the traditional data cube to allow storage and analysis of STT objects. Data cubes are used to model and analyze multi-dimensional data.

Figure 2: STTCube Example (facts with Location, Text, and Time dimensions; sample Term members include Apple, Fruit, Love, and Onion)
3.1 STTCube Model

Definition 1 (Data Cube). An n-dimensional data cube C_DC is a tuple C_DC = (D, M, F), with a set of dimensions D = {d_1, d_2, ..., d_n}, a set of measures M = {m_1, m_2, ..., m_j}, and a set of facts F. A dimension d_i ∈ D has a set of hierarchies H_{d_i}. Each hierarchy h ∈ H_{d_i} is organized into a set of levels L_h. Each level l ∈ L_h contains a set of members and has a set of attributes A_l. Each attribute a ∈ A_l is defined over a domain. Each measure m ∈ M is a function defined over a domain which can return either a single value or a complex object. The domain of a dimension d_i is denoted by L(d_i).

Spatio-Textual-Temporal (STT) Objects. An STT object records place (geo-coordinates or location where it was created), text (a review, or a user comment), and time (when it was created). Social networks with geo-tagged micro-blog posts are typical STT data sources (e.g., the geo-tagged tweet in Figure 1).
Definition 2 (STT object). A spatio-textual-temporal object is a tuple o_stt = ⟨l, x, t⟩ where l, x, and t represent the location, text, and time components, respectively.

The Location is represented as the latitude and longitude pair l ∈ (R × R). The Text is an ordered list x = ⟨w_1, w_2, ..., w_n⟩ where each w_i ∈ W is a string and is called a Term. Among all Terms, keywords are a user-defined subset of important Terms W_k ⊆ W. For instance, the user can decide that hashtags (terms starting with '#') are the keywords. Time specifies a precise instant (a timestamp) at some resolution (e.g., seconds). Table 1 contains examples of STT objects with their location, a set of keywords extracted from the text, and timestamp.
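To make Definition 2 concrete, the following is a minimal Python sketch of an STT object and a user-defined keyword rule (here, hashtags); the class and function names are illustrative and not part of the formal model.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class STTObject:
    """A spatio-textual-temporal object o_stt = <l, x, t> (Definition 2)."""
    location: Tuple[float, float]  # (latitude, longitude), l in R x R
    text: List[str]                # ordered list of Terms <w_1, ..., w_n>
    timestamp: str                 # time instant, e.g., an ISO-8601 string

def extract_keywords(obj: STTObject) -> List[str]:
    """User-defined keyword rule: keep hashtags (Terms starting with '#')."""
    return [w for w in obj.text if w.startswith("#")]

tweet = STTObject((48.8566, 2.3522),
                  ["lovely", "evening", "#Paris", "#travel"],
                  "2020-09-20T19:15:00")
print(extract_keywords(tweet))  # ['#Paris', '#travel']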
For analytical processing of STT objects we propose to model them as an STTCube. An STTCube C_STTC = (D, M, F) is a data cube (Definition 1) with three special dimensions, namely Location, Text, and Time, that is D = {d_Location, d_Text, d_Time, d_1, ..., d_n}.

Dimensions. An STTCube stores STT objects as facts, modeling their spatial, textual, and temporal features in the corresponding dimensions. Figure 2 shows a 3-dimensional STTCube built on the sample dataset in Table 1 where each row represents one fact (i.e., the members of F) with dimensions D = {d_Location, d_Text, d_Time}. Domains for the respective dimensions are L(d_Location) = {(57.016, 09.991), (56.187, 10.171), ...}, L(d_Text) = {apple, Fruit, ...}, and L(d_Time) = {...} (the timestamps of the objects in Table 1). Hence, w.r.t. Definition 2, the dimensions capturing l, x, and t are the spatial, textual, and temporal dimensions, respectively.
Dimension Hierarchy. A hierarchy is spatial, textual, or temporal if it contains spatial, textual, or temporal levels, respectively. In Figure 2, the Location dimension is a spatial dimension with a spatial hierarchy going from l to the City, Region, and Country levels, and the Text dimension is a textual dimension aggregating x into the Term, Theme, Topic, and Concept levels. Similarly, Time is a temporal dimension. Hierarchy steps HS_h = {hs_1, hs_2, hs_3, ..., hs_n} define the mechanism of moving from a lower (child) level to an upper (parent) level and vice versa. A hierarchy step hs_i = (l_c, l_p, cardinality) ∈ HS_h entails that members of a child level l_c can be aggregated together if they correspond to the same member at the parent level l_p, and that this correspondence between child and parent members has the given cardinality ∈ {1-1, 1-n, n-1, n-m}. For instance, the step from Date to Month has an n-1 cardinality, while the step from Term to Topic has an n-m cardinality (e.g., the Carrot Term corresponds both to the Gardening and Food Topics, while the Food Topic has as child members not only Carrot but also Apple).
Level Attributes. As mentioned earlier, a level l is associated with a set of attributes A_l = {a_1, a_2, ..., a_n} and has a set of members M_l = {m_1, m_2, ...}. Attribute values describe the different characteristics of each member of that level. Spatial, textual, and temporal levels are then usually characterized by spatial, textual, and temporal attributes. For instance, at the City level, the member Aalborg has the Boundary attribute whose value is the polygon defining the boundary of Aalborg. An example of a textual attribute is Sentiment, which captures the polarity of the associated textual member. Similarly, an integer value representing the number of days in a specific month is a temporal attribute.
3.2 Dimensions and Hierarchies

We now describe the STTCube's dimensions and hierarchies.

Spatial Dimensions. Spatial information can be analyzed at different levels and granularities. It is important to note that facts in an STTCube are composed only of geographical points (i.e., each tweet or user post is associated with a coordinate, not with shapes or polygons). Points can be aggregated either within a predefined spatial grid or based on semantic information.
Grid-Based Hierarchy. Here, the geographic area being analyzed is divided into small equal-size cells with a predefined resolution, e.g., 1 × 1 km. At the lowest level, each latitude and longitude point is assigned to the cell it falls in. To analyze data at a coarser granularity, neighboring cells are combined into a larger cell at the parent level (e.g., 3 × 3 km). This hierarchy can be built automatically, without the need for any meta-data.
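A minimal sketch of such a grid-based step function, assuming a flat-earth approximation with roughly 111 km per degree of latitude (the actual implementation in Section 5 uses MGRS); all names and the cell sizes are illustrative.

import math

def grid_cell(lat: float, lon: float, cell_km: float) -> tuple:
    """Base step function: map a point to the (row, col) index of its grid
    cell, using ~111 km per degree of latitude and cos(lat)-scaled longitude."""
    km_per_deg_lat = 111.0
    km_per_deg_lon = 111.0 * math.cos(math.radians(lat))
    return (int(lat * km_per_deg_lat // cell_km),
            int(lon * km_per_deg_lon // cell_km))

def parent_cell(cell: tuple, factor: int = 3) -> tuple:
    """Step to the parent level: group factor x factor neighboring cells
    (e.g., 1 x 1 km cells into 3 x 3 km cells)."""
    return (cell[0] // factor, cell[1] // factor)

c = grid_cell(57.016, 9.991, cell_km=1.0)  # base-level 1 x 1 km cell
print(c, parent_cell(c))                   # same point at the 3 x 3 km level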
Semantic-Based Hierarchy. Here, data is analyzed in a predefined taxonomy, e.g., an administrative division. Therefore, we move within the taxonomy, e.g., from the Location to the City level, from the City level to the Region level, and so on up to the All level. This hierarchy requires each object coordinate to be associated with a member of the lowest level in the hierarchy (usually in a pre-processing step) and requires the taxonomy information to build the entire hierarchy.
Textual Dimensions. Hierarchies in the textual dimension move from specific concepts to general ones. This follows a generic taxonomic structure connecting more specific terms to more general ones (i.e., hypernyms) [20]. In particular, Terms are the base level, which are grouped into Themes, Themes into larger categories called Topics, and Topics in turn grouped into Concepts. Differently from most hierarchies, the members in the levels of a textual hierarchy are typically in an n-m relationship. Hence, when moving between textual levels we need to decide how measure values get aggregated. Below we propose a set of aggregation techniques to address this issue.
Replication-Based Hierarchy. This is a common approach where each member of a child level is aggregated into all the parent members. Hence, its value is effectively replicated. This approach leads to a counting problem when parent levels are further aggregated. For example, the first data instance in Table 1 will be part of two Themes: 1) Fruits, because it contains the Terms {apple, fruit}, and 2) Emotion, because of the Term {love}.

Majority-Based Hierarchy. If a fact can be mapped to more than one parent member, then that fact will be part of the parent member which has the most representation (e.g., in terms of frequency). This scheme avoids double counting of facts in parent members. In case of ties, some tie-breaking heuristic or a user-defined criterion can be employed instead. E.g., the first fact in Table 1 will be part of only the Fruits Theme because it has the two representative Terms {apple, fruit}, as compared to Emotion having only one Term {love}.

Custom Hierarchy. In general, other user-specified criteria and rules can be defined to establish how child-parent level steps will be aggregated in case of ambiguities. For instance, a domain-specific importance score can be assigned to the hierarchy members during the STTCube construction. In this way, facts will be part of only the parent member with the highest importance.
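The following Python sketch contrasts the replication-based and majority-based schemes on the Table 1 example; the term-to-Theme mapping is a hypothetical stand-in for the real taxonomy.

from collections import Counter
from typing import Dict, List, Set

# Hypothetical Term -> parent Themes mapping (an n-m relationship).
TERM_THEMES: Dict[str, Set[str]] = {
    "apple": {"Fruits"}, "fruit": {"Fruits"}, "love": {"Emotion"},
    "carrot": {"Gardening", "Food"},
}

def replication_parents(terms: List[str]) -> Set[str]:
    """Replication-based: the fact counts toward every matching Theme."""
    return set().union(*(TERM_THEMES.get(t, set()) for t in terms))

def majority_parent(terms: List[str]) -> str:
    """Majority-based: the fact counts only toward the Theme with the most
    representative Terms; ties broken by name (a simple user-defined rule)."""
    votes = Counter(th for t in terms for th in TERM_THEMES.get(t, set()))
    return min(votes, key=lambda th: (-votes[th], th))

fact_terms = ["apple", "fruit", "love"]   # first object in Table 1
print(replication_parents(fact_terms))    # {'Fruits', 'Emotion'}
print(majority_parent(fact_terms))        # 'Fruits' (2 Terms vs 1)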
Temporal Dimensions. Similarly, temporal dimensions allow analyzing STT objects at different levels of granularity w.r.t. time and have the following two temporal hierarchies: t → Day → Month → Quarter → Year → All and t → Second → Minute → Hour → All. Here, the first is the Date hierarchy, aggregating by the temporal levels Day, Month, Quarter, and Year (5 levels in total including All), whereas the second is the TimeOfDay hierarchy, having 4 levels in total.
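A small sketch of the two temporal step functions, using standard library date handling; the level encodings are illustrative.

from datetime import datetime

def date_hierarchy(ts: datetime) -> dict:
    """Members of the Date hierarchy (Day -> Month -> Quarter -> Year -> All)."""
    return {"Day": ts.strftime("%Y-%m-%d"),
            "Month": ts.strftime("%Y-%m"),
            "Quarter": f"{ts.year}-Q{(ts.month - 1) // 3 + 1}",
            "Year": str(ts.year), "All": "All"}

def time_of_day_hierarchy(ts: datetime) -> dict:
    """Members of the TimeOfDay hierarchy (Second -> Minute -> Hour -> All)."""
    return {"Second": ts.strftime("%H:%M:%S"),
            "Minute": ts.strftime("%H:%M"),
            "Hour": ts.strftime("%H"), "All": "All"}

ts = datetime(2020, 9, 20, 10, 15, 42)
print(date_hierarchy(ts))         # {'Day': '2020-09-20', 'Month': '2020-09', ...}
print(time_of_day_hierarchy(ts))  # {'Second': '10:15:42', 'Minute': '10:15', ...}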
3.3 Spatio-Textual-Temporal Measures

As defined earlier, an n-dimensional STTCube has a set of measures M = {m_1, m_2, m_3, ..., m_j}, which permit analyzing STT objects by computing values at different levels of granularity. For instance, the STTCube in Figure 2 models Location, Text, and Time with Fact Count as a measure (i.e., Fact Count ∈ M). In practice, it maintains the count of STT objects at given spatial, textual, and temporal aggregation levels. Measure values at different levels in the hierarchies are obtained by applying an aggregation function over the STT objects. Examples of aggregation functions are SUM, COUNT, MIN, MAX, and AVG. The STTCube in Figure 2 uses COUNT as an aggregation function. For example, it reports that on September 20th at AAU Bus Terminal the Term apple was mentioned in 2 facts.

A measure is spatial if it is defined over a spatial domain. A spatial measure is then computed over a collection of spatial values (e.g., geographical points, or geometry shapes like polygons). A spatial measure can be a simple value, e.g., the (numeric) area of the convex hull of multiple shapes, or a complex spatial object, e.g., the polygon representing the convex hull itself. A measure is textual if it is defined over a textual domain, and can be either a simple numeric value or a complex textual object. Analogously, a measure is temporal if it is defined over a temporal domain. A measure is spatio-textual if it is defined over a spatial and textual domain and is a combination of spatial and textual measures. Finally, a measure is spatio-textual-temporal if it is defined over a spatial, textual, and temporal domain and is a combination of spatial, textual, and temporal measures. Below, we propose a list of spatio-textual and spatio-textual-temporal measures to be used as part of the STTCube to analyze STT objects effectively.

Top-k Keywords within an Area is a spatio-textual measure which returns a list of tuples ⟨s, kw⟩ consisting of a geometry shape s representing a geographical area and the list of the top-k most frequent keywords kw = ⟨w_1, w_2, ..., w_k⟩ in that area. Analogously to the other measures, it can be computed at different levels of aggregation, so that it can return the top-k keywords for each City or each Region.

Keyword Density is a spatio-textual measure which returns a list of tuples ⟨r_j, w_i, D_ij⟩ consisting of a geometry shape r_j representing a geographical area, a keyword w_i, and its density D_ij in the area r_j. The density D_ij of a keyword w_i over an area r_j is computed as D_ij = freq(r_j, w_i) / SurfaceArea(r_j), in which freq(r_j, w_i) is the frequency of the keyword w_i in the area r_j (i.e., the number of objects located within r_j in which w_i appears) and SurfaceArea(r_j) is the surface area of r_j. For example, if we have two Regions r_1, r_2 with surface areas a_1 and a_2, and the term Apple with frequency 5 and 30 in r_1 and r_2, respectively (see Figure 3), then the keyword densities are 5/a_1 and 30/a_2, respectively.

Top-k Dense Keywords within an Area is a spatio-textual measure which returns a list of tuples ⟨r_j, kw⟩, computing the keyword density as described in the measure above, but in this case it returns the top-k keywords kw = ⟨w_1, w_2, ..., w_k⟩ with the highest density.

Keyword Volatility is a spatio-textual-temporal measure (it becomes textual-temporal if no region is specified) which returns a list of tuples ⟨r_j, w_i, T_l, ΔD_ij⟩ consisting of a geometry shape r_j representing a geographical area, a keyword w_i, a time interval T_l, and its change in density ΔD_ij in the area r_j over the time interval T_l (divided into x equal intervals). The change in density ΔD_ij of a keyword w_i in an area r_j over a time interval T_l is computed as ΔD_ij = (Σ_{z=1}^{x} |D_ij^{t_z} − D_ij^{t_{z−1}}|) / x, where D_ij^{t_z} represents the density of the keyword w_i in the area r_j at a specific time instance t_z. Furthermore, the change-in-density formula can be adapted to the analysis requirements, e.g., it can be changed to a weighted density (assigning different weights to each interval in T_l) or to a rate-of-change computation using linear regression [19].

Top-k Volatile Keywords within an Area is a spatio-textual-temporal measure which returns a list of tuples ⟨r_j, kw⟩, computing the keyword volatility as described above, but in this case it returns the top-k volatile keywords kw = ⟨w_1, w_2, ..., w_k⟩ with the highest change in density.
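The two formulas translate directly into code. The sketch below computes Keyword Density and Keyword Volatility from per-interval frequency data and selects a top-k; the input data and function names are illustrative.

from typing import Dict, List, Tuple

def keyword_density(freq: Dict[str, int], surface_area: float) -> Dict[str, float]:
    """D_ij = freq(r_j, w_i) / SurfaceArea(r_j) for every keyword in an area."""
    return {w: f / surface_area for w, f in freq.items()}

def keyword_volatility(densities: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean absolute change of density over x consecutive intervals:
    Delta = (sum_z |D_{t_z} - D_{t_{z-1}}|) / x; missing keywords count as 0."""
    x = len(densities) - 1
    keywords = set().union(*densities)
    return {w: sum(abs(densities[z].get(w, 0.0) - densities[z - 1].get(w, 0.0))
                   for z in range(1, len(densities))) / x
            for w in keywords}

def top_k(measure: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    return sorted(measure.items(), key=lambda p: -p[1])[:k]

# Densities of two keywords in one area over three time instants.
d = [{"apple": 0.10, "carrot": 0.40},
     {"apple": 0.30, "carrot": 0.42},
     {"apple": 0.20, "carrot": 0.41}]
print(top_k(keyword_volatility(d), k=1))  # apple changes most: ~0.15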
Distributive, Algebraic, and Holistic Measures. There are three types (also known as additivity) of measures: distributive, algebraic, and holistic, depending on whether it is possible to compute the value of a measure at a parent level directly from the values at the child level [13]. For distributive and algebraic measures, this is possible. For instance, the Fact Count at the State level can be computed by summing up the Fact Counts at the City level. Keyword Density is instead an algebraic measure: we can compute the higher-level aggregate values of this measure if we store for each child level both the frequency of each keyword and the SurfaceArea. The Top-k Keywords, Top-k Dense Keywords, and Top-k Volatile Keywords within an area measures, instead, are holistic, since the value at a parent level cannot be computed directly from the values at the child level; it is necessary to recompute them directly from the base facts every time.

Consider the computation of Top-3 Dense Keywords within an Area in Figure 3, given the two Regions r_1 and r_2 and the computation at the parent level r = r_1 ∪ r_2 (grayed-out rows are not part of the computed measure value).

Figure 3: Example: Merging of Holistic Measure (per-region keyword counts, areas, and densities; the correct densities D for r = r_1 ∪ r_2 vs. the algebraically computed D_alg)

The values in the top-3 for the members r_1 and r_2 at the child level are not sufficient to compute the correct densities for region r. Both some of the computed densities (in column D_alg, while the correct values are reported in column D) and, consequently, the final ranking would be wrong. For instance, the keyword Strawberry would not have been returned (if computed algebraically) because it is neither in the top-3 for r_1 nor for r_2. To compute the correct response, either we have to store all the aggregate values for each possible cell or we have to reprocess all the facts covered by the query. When dealing with large datasets these approaches are not feasible. Hence, in Section 4 we provide a framework for the computation of an exact and an approximate solution with accuracy guarantees.

3.4 STTOLAP Operators

A data cube allows different OnLine Analytical Processing (OLAP) operators to group, filter, and analyze cells and subsets of cells at different levels of granularity and under different perspectives. These operators are known as Slice, Dice, Roll-Up, and
Drill-Down [18]. We extend the basic OLAP operators to STTOLAP operators, i.e., for spatial, textual, and temporal dimensions, hierarchies, and measures (handling of n-m relationships is explained in Sections 4 and 5). In general, an OLAP (and STTOLAP) operator OP accepts as input a cube C' and some parameters params, and outputs a new cube C'', i.e., OP(C', params) = C''. In this way, a new OLAP operator can be applied to C''. Among all cubes, we distinguish the initial or base cube C as the cube containing all the original information at the base level.

4 STTCUBE MATERIALIZATION

Cube materialization is the process of pre-aggregating measure values at different levels of granularity in the cube to compute query responses from pre-aggregated results instead of the raw data, and hence improve query response time for
STTOLAP operators [16]. In a data cube, a cuboid is a collection of level members and associated measure values for a unique combination of dimension hierarchy levels. Each unique combination is represented by a separate cuboid. For instance, if we request the
Fact Count for the State of Denmark and have stored the Fact Count at the Region level, we can avoid accessing the raw data and compute the aggregation from much fewer rows. This is an example of partial materialization, i.e., the actual cuboid at the State level, containing the answer to the query, was not materialized, but the system was still able to exploit the cuboid for Region.

What to materialize and how much to materialize depends on the trade-off between query response time and storage cost.
No Materialization (NM) only ma-terializes the base cuboid and does not require any extra space,but will require aggregated measure values to be recomputedfrom the base cuboid every time, hence incurring much slowerresponse times. A middle-ground solution is to partially mate-rialize the cube, i.e., to materialize only some of the possiblecuboids. In this strategy, some queries will be able to exploitpre-aggregated values at the current level, while other queriescan exploit pre-aggregated values at lower levels for distributiveor algebraic measures.
4.1 Cost Model

The core of the proposed partial materialization approach depends on the trade-off between the storage cost of materializing any particular cuboid and the actual benefit that the materialization of the cuboid provides. To evaluate this benefit, we have to estimate the (run time) cost of a query. To devise a cost model for this estimation, we performed a micro-benchmark which confirmed that the running time is directly proportional to the data size (the number of rows). (Due to space restrictions, the details and figures of the cost model experiments are available in Appendices B and C.) Hence, we can use the following linear cost model for benefit calculation:

Benefit(v) = Σ_{v' ∈ descendants(v) ∪ {v}} (cost(v') − size(v))

4.2 Partial Exact and Approximate Materialization

We propose an exact partial materialization technique for pre-computing the spatio-textual and spatio-textual-temporal measure values. To answer an STT query for these measures, we materialize two other distributive measures, namely Keyword Frequency f and SurfaceArea s. Then, since Keyword Density D and Keyword Volatility ΔD are algebraic measures, they can be computed from the values of Keyword Frequency f and SurfaceArea s. Finally, Top-k Dense Keywords and Top-k Volatile Keywords are holistic, but for an exact solution we materialize Top-ALL and hence compute them from the materialized measure values (Figure 3).

We adopt the chosen linear cost model (Section 4.1) and extend the greedy algorithm approach [16] to our task (Algorithm 1). Additionally, and different from [16], Algorithm 1 accepts an input parameter K and materializes only the top-K measure values in each cuboid. For instance, for K = 10, it will materialize the top-10 keywords in each cuboid. Then, any top-k query, with k ≤ K, for a materialized cuboid will return the pre-computed answer.

Algorithm 1: Greedy Materialization
Input: Budget B, STTCube C, desired top-k K
Output: Partially materialized STTCube C
1  GreedyMaterialization(B, C, K):
2      do
3          Candidates ← {v ∈ C | ¬v.isMaterialized};
4          v ← argmax_{v ∈ Candidates} Benefit(v);
5          C.materialize(v, K);
6      while size(C) ≤ B;
7      return C;

Algorithm 1, given a size budget B (measured in rows, cuboids, or GB), proceeds until the size of the current cube is as large as possible within the budget (Line 6). At each step, it selects among all the non-materialized cuboids (Line 3) the one with the highest benefit (Line 4) and materializes it (Line 5). The difference between the exact (PEM) and approximate (PAM) materialization using Algorithm 1 is the value of K. When K = ∞, the full sorted list of measure values will be stored, so that all top-k queries can be answered for that cuboid. We set K = ∞ and K = k (to materialize only the top k measure values) for PEM and PAM, respectively.
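A Python sketch of Algorithm 1 under the linear cost model, assuming a toy cuboid lattice where each cuboid knows the finer cuboids it can be computed from; the Benefit computation follows the formula of Section 4.1, clipping negative savings at zero. All names and sizes are illustrative.

from typing import Dict, Set

def finer_closure(v: str, finer: Dict[str, Set[str]]) -> Set[str]:
    """All cuboids strictly finer than v (from which v can be computed)."""
    out = set()
    for u in finer[v]:
        out |= {u} | finer_closure(u, finer)
    return out

def cost(v, sizes, finer, materialized):
    """Rows scanned to answer a query on v: its smallest materialized
    finer-or-equal cuboid (the base cuboid is always materialized)."""
    usable = ({v} | finer_closure(v, finer)) & materialized
    return min(sizes[u] for u in usable)

def benefit(v, sizes, finer, materialized):
    """Benefit(v): sum of savings cost(v') - size(v) over v's dependents."""
    dependents = {u for u in sizes if v in finer_closure(u, finer)} | {v}
    return sum(max(cost(u, sizes, finer, materialized) - sizes[v], 0)
               for u in dependents)

def greedy_materialization(budget, sizes, finer, base):
    """Algorithm 1 sketch: repeatedly materialize the highest-benefit cuboid
    while the total number of materialized rows stays within the budget."""
    materialized = {base}
    while True:
        used = sum(sizes[m] for m in materialized)
        cands = [v for v in sizes
                 if v not in materialized and used + sizes[v] <= budget]
        if not cands:
            return materialized
        materialized.add(max(
            cands, key=lambda v: benefit(v, sizes, finer, materialized)))

# Toy lattice: each cuboid maps to the finer cuboids it is computable from.
sizes = {"base": 1000, "city,day": 200, "region,day": 50, "region,month": 10}
finer = {"base": set(), "city,day": {"base"},
         "region,day": {"city,day"}, "region,month": {"region,day"}}
print(greedy_materialization(1300, sizes, finer, base="base"))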
Query rewriting.
Finally, as in [16], after STTCube materialization, queries are still formulated in terms of the base cuboid, but rewritten by the system to be evaluated over the smallest cuboid. As a result of the materialization performed by Algorithm 1, when querying a non-materialized cuboid, we can directly exploit values in the cuboid's materialized ancestors when computing all distributive and algebraic measures. On the other hand, for holistic measures, we have to perform some additional computation. For instance, as mentioned earlier, to compute the value for the Top-k Dense Keywords in an area we can exploit the pre-computed Keyword Density values, but then we need to perform the top-k selection. That is, if the top-k for the current view is not materialized, we cannot exploit the materialized top-k of the ancestor views without incurring the risk of returning the wrong result. Yet, it is possible to exploit the top-k computation in some materialized cuboid to retrieve an approximate top-k and estimate the result's accuracy [32]. In practice, for the Top-k Dense Keywords within an area, given a target k for the top-k computation, when materializing a cuboid we materialize the top-(k+1) most dense keywords for that cuboid (i.e., set K = k+1 in Algorithm 1). Then, to compute the top-k dense keywords for a descendant cuboid by exploiting a materialized ancestor cuboid, we determine which members of the list are guaranteed to be correct.

Algorithm 2: Top-K Volatile Keywords in an Area
Input: Set of top-(k+1) volatile keyword lists Φ = {⟨r_1, kw_1, T_1⟩, ..., ⟨r_n, kw_n, T_n⟩}, set of x timestamps T_x, integer k
Output: ⟨r, kw, T_x⟩ with the top-k keywords kw in the merged area r over time interval T_x, and δ, the number of guaranteed top positions
1  TopKVolatile(Φ, T_x, k):
2      r ← ∪_{i ∈ [1,n]} r_i; A ← SurfaceArea(r);     // merge areas
3      fw ← {}; Δf ← {}; prev_f ← {};                 // empty dictionaries
4      foreach t ∈ T_x do
5          foreach ⟨r_i, kw_i, T_i⟩ ∈ Φ do
6              foreach j ∈ [1, ..., k+1] do
7                  if t ∈ T_i then
8                      w ← kw_i.get(j);               // keyword at position j
9                      f ← kw_i.freq(j);              // frequency at position j
10                     fw[w] ← fw[w] + f;
11                     Δf[w] ← Δf[w] + |prev_f[w] − f|;
12                     prev_f[w] ← f;
13             θ ← θ + kw_i.freq(k+1);                // bound for unseen keywords
14     kw ← topK(fw, A, Δf);                          // top-k volatile keywords
15     δ ← max{j ∈ [1, ..., k] : kw.freq(j) ≥ θ};
16     return ⟨r, kw, T_x⟩, δ;

Algorithm 2 implements this computation for Top-k Volatile Keywords within an area. It receives as input the set Φ = {⟨r_1, kw_1, T_1⟩, ⟨r_2, kw_2, T_2⟩, ..., ⟨r_n, kw_n, T_n⟩} of lists of top-K (i.e., k+1) dense keywords in a specific area with respective time stamps, the time interval T_x divided into x equal-sized intervals (e.g., day or month), and the value for k. The output is the ranked list of top-k volatile keywords in the area r that is composed by merging the areas r_1, r_2, ..., r_n. It computes the SurfaceArea of the merged area r (line 2). Then it merges all the aggregated keyword frequencies (line 10) and changes in keyword frequencies (line 11) for each time instance in T_x (line 4) in the respective dictionaries fw and Δf (lines 4-13), by getting each keyword in each list (line 8) and the corresponding frequencies (line 9). If a keyword is not found in the fw, Δf, or prev_f dictionary, then its value is considered to be zero. Moreover, it keeps track of the upper-bound frequency θ for keywords outside the current materialized ranking, for possible error reporting (line 13). Once all frequencies and changes in frequencies are merged, we compute the top-k volatile keywords using the aggregated values (line 14). Finally, by comparing the value of θ with the frequencies of keywords in the aggregated top-k, we report how many positions in the current ranking are guaranteed to be exact (line 15). In the best case, the frequency of the keyword at position k will be at least θ and thus the computed top-k is guaranteed to be correct.
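The guarantee mechanism of Algorithm 2 can be illustrated without the time dimension: merge per-area top-(k+1) lists, accumulate the (k+1)-th frequencies into the bound θ, and count how many merged positions are safely above it. A minimal sketch, with illustrative names and data:

from typing import Dict, List, Tuple

def merge_topk(lists: List[List[Tuple[str, int]]], k: int):
    """Merge per-area top-(k+1) keyword frequency lists. theta accumulates
    the (k+1)-th frequencies: an upper bound on what any keyword hidden
    below a materialized ranking could still contribute."""
    agg: Dict[str, int] = {}
    theta = 0
    for lst in lists:
        for w, f in lst[:k + 1]:        # aggregate all visible entries
            agg[w] = agg.get(w, 0) + f
        if len(lst) > k:                # bound for the unseen tail
            theta += lst[k][1]
    ranked = sorted(agg.items(), key=lambda p: -p[1])[:k]
    # a position is guaranteed if no unseen keyword can overtake it
    delta = sum(1 for _, f in ranked if f >= theta)
    return ranked, delta

r1 = [("apple", 30), ("carrot", 25), ("onion", 9)]   # top-(2+1) of area r1
r2 = [("carrot", 40), ("love", 12), ("apple", 8)]    # top-(2+1) of area r2
print(merge_topk([r1, r2], k=2))
# ([('carrot', 65), ('apple', 38)], 2) -- both positions guaranteed (theta=17)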
5 STTCUBE CONSTRUCTION

Here, we describe the proposed approach for constructing an STTCube. Algorithm 3 takes a collection X of STT objects to be analyzed, a textual taxonomy T with semantic information about the terms, themes, topics, and concepts, and a geographical taxonomy G for cities, regions, and countries. Standard date functions are used for the temporal dimension processing. Moreover, it also receives as input the parameters B and K as the budget and the number of top-K keywords for the partial materialization.

Algorithm 3: STTCubeConstruction
Input: Collection of spatio-textual-temporal objects X, knowledge source T, geographical information G, materialization budget B, desired top-k K
Output: Spatio-Textual-Temporal Cube C
1  ConstructSTTCube(X, T, G, B, K):
2      C ← load empty or existing cube;
3      C.d_Time ← initialize or load temporal dimension;
4      C.d_Location ← initialize or load spatial dimension;
5      C.d_Text ← initialize or load textual dimension;
6      C.F ← initialize empty or load existing Fact Table;
7      foreach o ∈ X do
8          UpdateTemporalHierarchies(o.t, C.d_Time);
9          l' ← ProcessLocation(o.l);
10         UpdateSpatialHierarchies(l', G, C.d_Location);
11         x' ← ProcessText(o.x);
12         UpdateTextualHierarchies(x', T, C.d_Text);
13         InsertFact(o.t, l', x', C.F);
14     GreedyMaterialization(B, C, K);
15     return C;

Algorithm 3 constructs the STTCube in an incremental way: it initializes an empty cube (line 2), and then the corresponding spatial, textual, and temporal dimensions (lines 3-5) as well as the Fact Table (line 6). If the cube is already constructed, i.e., the cube is being updated instead of constructed for the first time, then Algorithm 3 loads the existing STTCube (lines 2-6) and updates it with new information. In particular, the spatial dimension has the grid-based hierarchy and the hierarchy with the base level at each object's Location (i.e., the geographical point), and then the levels City, Region, Country, and All (5 levels in total). The textual dimension, instead, has the hierarchy built from the base level Term, and then Theme, Topic, Concept, and All (5 levels in total). Finally, the temporal dimension contains the Date and TimeOfDay hierarchies mentioned in Section 3.

Once the basic structure is prepared, Algorithm 3 loops through each STT object in X (lines 7-13). In this loop, it extracts and initializes from each STT object the base-level members for each dimension. Then, once the base-level data has been extracted, it proceeds with building the various dimension hierarchies starting from the existing base-level members and exploiting the provided spatial and textual taxonomies (lines 8-12). Once the dimension hierarchies are built, the STT object itself is then inserted in the fact table of the STTCube (line 13) so that each fact is linked to the lowest (base) level members in the respective dimensions. In this step (line 13), the fact measure values are also computed (e.g., the keyword count). As the last step (line 14), Algorithm 3 executes the (partial) materialization procedure.
Spatial Hierarchies Construction. In our proposed STTCube, the base level for the spatial hierarchies is the Location present in the raw data, i.e., the longitude and latitude points. Hence, we use the Military Grid Reference System (MGRS) for the grid-based hierarchy, and, when building the semantic-based hierarchy, individual points are linked to the respective cities using the information in the available geographical taxonomy G, or to a special member for points that link to unknown locations. This corresponds to the step function from Location to City. The spatial taxonomy G is also used to generate the spatial hierarchy step functions for the higher levels.

Textual Hierarchies Construction. The unstructured nature of the text makes it a challenging task to convert it into a dimension of a cube. In Algorithm 3, the ProcessText function (line 11) implements the following steps: (1) it splits the text into individual words, (2) it removes stop words, and (3) it converts the remaining words to their base form (e.g., "works" and "working" have the same base form "work"). The final processed text is used to populate the Term base level in the textual dimension. This implements the base step function in the textual hierarchies and links every fact to one or more Terms; hence it has an n-m cardinality. Moreover, while constructing the higher levels, using the semantic taxonomy T (e.g., WordNet), each STT object is linked to one or more Themes, and similarly for Topics and Concepts.
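A minimal sketch of the ProcessText steps, with a toy stop-word list and lemma table standing in for the Stanford CoreNLP pipeline used in the implementation (Section 6):

import re

STOP_WORDS = {"a", "an", "the", "in", "so", "i", "here", "from", "should"}

# Tiny stand-in lemmatizer; the actual implementation uses Stanford CoreNLP.
LEMMAS = {"works": "work", "working": "work", "stresses": "stress"}

def process_text(text: str) -> list:
    """ProcessText sketch: tokenize, drop stop words, map words to base forms.
    The result populates the Term base level of the textual dimension."""
    tokens = re.findall(r"[#@]?\w+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

print(process_text("A lovely evening here in Paris. So far from everyday stresses."))
# ['lovely', 'evening', 'paris', 'far', 'everyday', 'stress']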
6 EXPERIMENTAL EVALUATION

Now, we report on the performance of STTCube analysis. In particular, we compare the different materialization strategies for the STTCube and a No STTCube (NC) implementation, in terms of query response time (QRT) and storage cost. NC answers the queries by computing the query response from base data without constructing the STTCube. NC stores pre-processed text (lemmatized text after removal of stop and invalid words), along with the geographical location (a longitude and latitude point) and timestamp. Specifically, NC uses user-defined functions for text processing (for retrieving individual terms) and location processing (e.g., identification of the city a particular longitude-latitude point belongs to) and built-in functions for timestamps. Further, NC filters on location and timestamp for the queried area and time and performs a series of joins, e.g., 4 joins for the Concept level, to retrieve information for the requested textual level. Finally, it groups results on the textual and temporal columns, computes the STT measure values, and performs the top-k selection. Also, we compare QRT and hierarchy construction time for different combinations of hierarchy schemes. Moreover, we report on the accuracy of PAM and demonstrate its performance advantage when compared to PEM. Lastly, we compare QRTs for different spatial and textual hierarchy schemes, showing that the combination of the Grid-based spatial and Majority-based textual (GM) hierarchy schemes achieves the fastest QRTs among all hierarchy combinations.
Experimental Setup. We evaluate the STTCube on a real-world Twitter dataset containing 125 million tweets collected over six weeks. Each tweet contains the tweet location, text, and time. We implemented the STTCube in a leading commercial RDBMS, called RDBMS-X as we cannot disclose the name. The proposed design is realized using a snowflake schema to avoid redundancy in the dimension data. We implemented the Pre-Processing (PP) component, where the whole raw dataset is parsed and the relational tables are populated, in Java (v11). All tests are run on a Windows Server machine with 2 Intel Xeon 2.50GHz CPUs and 16GB RAM.

We extracted the taxonomy for the spatial dimension from GeoNames [1]. For the City level, we considered all the cities having a population above a minimum threshold; for the Region level, we use the administrative division information available in the GeoNames dataset. We use a reverse geocoding process to find the city name for the Location coordinates. For the textual dimension, as a taxonomy for Terms, Themes, Topics, and Concepts, we use the widely used WordNet [11]. We use the direct HYPERNYM link of WordNet to decide the parent member for a Term, Theme, and Topic. If a term is present in WordNet and has a super-class (HYPERNYM), then the super-class becomes the parent of the term. Otherwise, it becomes its own parent (this avoids unbalanced hierarchies and UNKNOWN values in the hierarchy). For text pre-processing (tokenization, lemmatization, and stop word removal) we use the Stanford CoreNLP library [26]. We implemented the temporal dimension using the standard Date and Time functions supported in RDBMS-X. We implemented the semantic-based and grid-based hierarchy schemes for the spatial dimension, the replication-based and majority-based hierarchy schemes for the textual dimension (Section 3.2), and the Date hierarchy for the temporal dimension.
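A sketch of the Term-to-parent step using NLTK's WordNet interface (this assumes NLTK and its WordNet corpus are available; the function name is illustrative and the implementation details in the paper may differ):

# Assumes: pip install nltk, plus the WordNet corpus download below.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def parent_member(term: str) -> str:
    """Return a direct hypernym of a term, or the term itself when no
    hypernym exists (avoiding unbalanced hierarchies / UNKNOWN members)."""
    synsets = wn.synsets(term)
    if synsets and synsets[0].hypernyms():
        return synsets[0].hypernyms()[0].lemma_names()[0]
    return term

print(parent_member("carrot"))  # a WordNet hypernym lemma, e.g., a root sense
print(parent_member("xyzzy"))   # 'xyzzy' (not in WordNet: its own parent)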
Spatial, Textual, and Temporal Level Members. The base levels contain about 40 million Location points and 9 million Terms. The GeoNames taxonomy contains 132K cities, divided into 4K administrative divisions (regions) for 247 countries. Among those, we have tweets for 104K cities, about 3K regions, and 246 distinct countries. In the textual hierarchy, terms are grouped into about 23K Themes, 19K Topics, and 17K Concepts. Furthermore, the temporal dimension spans 37 days. Finally, for PAM we materialize the K = 31 densest keywords.

We implemented Keyword Density, Keyword Volatility, Top-k Dense Keywords within an area, and Top-k Volatile Keywords within an area as prototypical STT measures. We compare the PEM and PAM strategies with the following three baselines.
No STTCube (NC): the traditional RDBMS setup with all textual, spatial, and temporal functions implemented as built-in or user-defined functions. NC is the traditional solution one would go for without the STTCube. No Materialization (NM): constructs the STTCube and minimizes the storage cost by only materializing the base cuboid and computing all query responses from it. Full Materialization (FM): minimizes the QRT by materializing every cuboid in the STTCube. With this approach, queries are answered through a lookup in the pre-computed cuboid. These three baselines are at the extreme ends of the space-time trade-off and are usually infeasible for large datasets.
Queries. We perform experiments on five different dataset sizes using nine different STT queries. Each STT query, described in Table 3, targets a different level of spatial, textual, and temporal
Table 3: Spatio-Textual-Temporal Queries

Query | Description
Q1 | Top-k Dense/Volatile Terms in a City [time span]
Q2 | Top-k Dense/Volatile Topics in a City [time span]
Q3 | Top-k Dense/Volatile Concepts in a Country [time span]
Q4 | Top-k Dense/Volatile Terms in a Region [time span]
Q5 | Top-k Dense/Volatile Concepts in a Region [time span]
Q6 | Top-k Dense/Volatile Themes in a Region [time span]
Q7 | Top-k Dense/Volatile Terms in a Country [time span]
Q8 | Top-k Dense/Volatile Terms in a Country Group by Region [time span]
Q9 | Top-ALL Dense/Volatile Topics in a Country Group by Region [time span]
Figure 4: QRTs for spatio-textual-temporal measures for different combinations of hierarchy schemes over 125 million objects. Panels (a)-(d) show GR and panels (e)-(h) show GM results for Keyword Density, Keyword Volatility, Top-k Dense Keywords, and Top-k Volatile Keywords, respectively.
Figure 5: QRT vs. data size. Panels (a)-(e) show GR and panels (f)-(j) show GM results for Keyword Density (PEM), Top-k Dense Keywords within an Area (PEM and PAM), and Top-k Volatile Keywords within an Area (PEM and PAM), over 25-125 million objects.

granularity. Each query requests either dense or volatile keywords, with a time range which is used for volatile but not for dense keyword queries. We execute each query ten times with randomly generated parameters for each method and report the mean and standard deviation.
Query Response Time. For the Top-k Dense and Top-k Volatile Keywords within an area measures, we compare the QRT of the PEM and PAM methods with the NC, NM, and FM baselines. For Keyword Density and Keyword Volatility, no approximate solution is possible, so we only compare PEM with NC, NM, and FM. As the Majority-based textual hierarchy scheme does not process Terms (Section 3.2), we only evaluate for it five out of nine queries, requesting Theme, Topic, and Concept (Figures 4e-4h). Furthermore, we cannot evaluate PAM for Q9, as no approximate solution is possible for it.

We plot results in Figures 4a-4h for 100% (125M) of the data, as the results are similar for smaller data sizes. Each row in Figure 4 shows the QRTs for one particular combination of spatio-textual hierarchy schemes. Specifically, Figures 4a-4d show the QRTs for the Grid-based spatial and Replication-based textual (GR) hierarchy combination for all measures. Similarly, Figures 4e-4h show QRTs for the Grid-based spatial and Majority-based textual (GM) combination. Figure 4 has queries on the x-axis and QRTs in msec on the y-axis (note: log scale). (Due to space constraints we have omitted the figures for the Semantic-based spatial hierarchy, as we observed similar results; they can be found in Appendix Figures 9 and 10.)

Figure 4 confirms that NC is consistently slower than NM: regardless of the spatial hierarchy scheme, it is one or more orders of magnitude slower than NM for both the Replication-based and the Majority-based textual hierarchy. The Majority-based textual hierarchy scheme achieves faster QRTs because it does not process individual Terms but directly links a Theme to the fact, hence drastically reducing the number of rows to process (from millions to thousands). Furthermore, NM is orders of magnitude slower than PEM and PAM for all measures and combinations of hierarchy schemes. PEM is on average six times slower than FM, which achieves its fast QRTs at the expense of a highly increased storage cost (Figure 6a). PAM achieves near-optimal QRTs because it materializes only the K densest keywords in each cuboid, hence it has much fewer rows to process. QRTs for Q9 for the Top-k Dense and Top-k Volatile Keywords within an area measures are the worst for PAM (same as NM) for all combinations of hierarchy schemes, because Q9 requests ALL keywords' densities instead of the top-k, which cannot be computed from the approximate pre-aggregated information; to generate a response for Q9, we have to process all detail data directly from the base facts. In comparison, PEM and PAM materialize a subset of views (also a subset of rows for PAM) and use the pre-aggregated measure values in those views to efficiently generate a response for a query instead of processing base facts, thus improving the overall QRT. NC is the slowest of all (orders of magnitude slower than the STTCube-based NM) because it has to process the complete dataset for computing each query response and cannot take advantage of the STTCube optimizations for STT measures.

Figure 6: (a) Storage cost (number of rows) for NM, PAM, PEM, and FM; (b) views selection benefit (millions of rows); (c) cube construction time (minutes) for PP, PEM, PAM, FM, and incremental updates (PP_INC); (d) hierarchy construction time (minutes) for the Grid, Semantic, Replication, and Majority schemes.

Among all the hierarchy scheme combinations, GM has the fastest QRTs, mainly because the Majority-based scheme drastically reduces the row count by linking the Theme directly to each Fact instead of individual
Terms, whereas GR has the slowest QRTs due to the Replication-based scheme having far more rows to process than the Majority-based textual hierarchy. Furthermore, the Grid and Semantic-based spatial hierarchies have similar QRTs.

Figure 5 shows the scalability of PEM and PAM over growing data sizes for different combinations of hierarchy schemes and confirms that the QRTs are almost constant as the data grows. This is because the sizes of the materialized views do not increase much as the data grows: only new dimension members, e.g., new cities or topics, increase the size of materialized views, and only by a small fraction. Figures 5f-5j confirm that the GM hierarchy combination results in the fastest QRTs, i.e., all QRTs < 100 msec. On the contrary, Figures 5a-5e show that GR yields the slowest QRTs, with QRTs as high as 400 msec. Figure 5 confirms that PAM consistently achieves the fastest QRTs (mostly < 100 msec, with a few slightly over 100 msec) regardless of hierarchy schemes. Figure 5 also shows that PEM and PAM scale linearly w.r.t. data size.
Storage Cost. We now compare the storage cost for FM, PEM, PAM, and NM. We do not compare NC's storage cost because it does not construct the STTCube, and hence does not materialize anything. We only show the storage cost for up to 20 million tweets, because FM takes an unfeasible amount of time beyond that (shown in Figure 6c), while for the other methods and over the larger datasets we observe the same trend. We use the number of rows in a view as its storage cost.

The base cube's storage cost is always needed. Besides that, every additional materialized view adds to the storage cost, as displayed in Figure 6a, which shows the storage cost of NM, PAM, PEM, and FM over growing data sizes. The materialization of the STTCube using PEM and PAM only adds 13% and 0.1% to the storage cost of the base cube, respectively, whereas using FM increases the storage cost by more than an order of magnitude. PEM reduces the storage cost by only materializing a subset of views (four views) and still achieves 2-5 orders of magnitude improvement in QRT (Figure 4). PAM further reduces the storage cost by only materializing a subset of rows in a view (top-k) and gains an additional order of magnitude improvement in QRT. On the other hand, FM materializes all views in a cube, i.e., 500 (5 × 5 × 5 × 4) views in our case, which makes the view materialization storage cost much higher (one order of magnitude) than the base cube itself, as shown in Figure 6a. Figure 6a confirms that our proposed methods PEM and PAM reduce the storage cost between 97% and 99.9% compared to FM.
Views Selection for Materialization. Our proposed methods PEM and PAM are partial materialization methods that materialize only a subset of the cuboids. Hence, an important trade-off to be understood is between the number of cuboids to materialize, the corresponding storage cost, and the gain in query response time achieved. We empirically evaluate the benefit gained (improvements in QRT for all dependent cells which can be answered using a view) against the cost of materializing the view (Algorithm 1). We consider the base cube as a necessary view to be materialized and consider its benefit as zero. Figure 6b shows that materializing three cuboids ((Day, City, Term), (Day, Location, Theme), and (Day, Region, Term)) on top of the base cube gains the most benefit, after which we do not get a significant advantage from materializing further cuboids. The reason is that the materialized cuboids are already small enough, so the benefit of materializing any descendant cuboid is small. Hence, materializing 4 cuboids represents the best trade-off between QRT and storage cost.
Pre-Processing and Cube Construction.
Here, we report the time for the construction of the STTCube. Construction of an STTCube is divided into two steps: 1) Pre-Processing (PP) of the base facts (spatio-textual-temporal objects) and population of the relational tables, and 2) materialization of views. Further, the materialization of views can be done using either FM, PEM, or PAM. In Figure 6c, we have data sizes on the x-axis and time in minutes on the y-axis (note: log scale). FM is the most time-consuming of all, adds significant overhead on top of the PP time, and does not scale. On the contrary, the PEM and PAM times are negligible compared to the FM time. Hence, with PEM and PAM, the STTCube construction time scales linearly. To evaluate the STTCube's ability to handle updates, we performed several updates of 4M tweets each (PP_INC line in Figure 6c). The experiments confirm that the STTCube handles updates efficiently by processing only the new STT objects and updating the respective dimensions and measures.
Furthermore, we compare the different hierarchy schemes w.r.t. their construction time. Figure 6d shows the hierarchies' construction time for the different hierarchy schemes. It is evident from Figure 6d that, among all, the Replication-based textual hierarchy scheme takes the longest to construct because, for each single spatio-textual-temporal object, it has to process each individual Term and construct a hierarchy for it, whereas all other schemes process only one hierarchy instance per spatio-textual-temporal object. Figure 6d confirms that all of the hierarchy schemes are constructed in linear time w.r.t. data size, allowing the STTCube to support multiple hierarchy schemes.
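As a rough illustration of why the Replication-based textual hierarchy scheme is the most expensive to construct, the sketch below contrasts per-Term hierarchy instantiation with the one-instance-per-object processing of the other schemes; the object structure and the `build_hierarchy` stub are hypothetical simplifications, not the paper's actual implementation.

```python
# Contrast of hierarchy-construction effort per STT object (illustrative).
objects = [
    {"id": 1, "terms": ["paris", "travel", "evening"]},
    {"id": 2, "terms": ["food", "paris"]},
]

def build_hierarchy(value):
    # Stand-in for the real lookup (e.g., Term -> Theme -> Topic).
    return {"leaf": value}

# Replication-based scheme: one hierarchy instance per individual Term,
# so the work grows with the total number of terms across all objects.
replicated = [build_hierarchy(t) for obj in objects for t in obj["terms"]]

# Other schemes: a single hierarchy instance per object.
single = [build_hierarchy(obj["terms"]) for obj in objects]

print(len(replicated), "hierarchy instances vs.", len(single))
```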
Accuracy.
Given that PAM efficiently computes approximate measure values, it becomes necessary to evaluate its accuracy. To evaluate the accuracy of PAM, we use NM's results as the ground truth. Our evaluation results in Table 7a confirm that it achieves high accuracy: it is 100% for 6 out of 8 queries, and 90-97% for the other 2. The queries with 90-97% accuracy request as many keywords as are materialized, so the risk of wrong results near the border (the bottom of the top-k list) is higher.
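The accuracy numbers in Table 7a compare PAM's approximate top-k answers against NM's exact answers; a minimal sketch of an overlap-based accuracy metric of this kind is shown below (the metric definition and the sample lists are our illustrative assumptions, not necessarily the exact metric used in the evaluation).

```python
# Top-k accuracy as the fraction of the exact top-k list that the
# approximate method recovers (illustrative metric).
def topk_accuracy(exact, approximate):
    k = len(exact)
    return 100.0 * len(set(exact) & set(approximate)) / k

exact_top5 = ["paris", "food", "travel", "music", "art"]
approx_top5 = ["paris", "food", "travel", "art", "sport"]  # error near the border
print(topk_accuracy(exact_top5, approx_top5))  # 80.0
```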
QRT of STTOLAP Operators.
Our proposed materialization strategies (PEM and PAM) improve the QRTs of the STTOLAP operators. To demonstrate this, we perform a series of STTOLAP operations and measure their QRT under the different materialization strategies.
(Table 7a) PAM's accuracy (%) per query over growing data sizes (in millions).

Query   25      50      75      100     125
Q1      100.0   100.0   100.0   100.0   100.0
Q2      100.0   100.0   100.0   100.0   100.0
Q3      100.0   100.0   90.0    95.0    90.0
Q4      92.3    100.0   100.0   100.0   100.0
Q5      100.0   100.0   100.0   100.0   100.0
Q6      100.0   100.0   100.0   100.0   100.0
Q7      93.3    96.7    90.0    93.3    93.3
Q8      100.0   100.0   100.0   100.0   100.0
[Figure 7b: STTOLAP Operations' QRTs. QRT (ms) for the operation sequence Start, RU, RU, D, S, DD, DD under NM, PEM, PAM, and FM.]
[Figure 7c: Materialized K vs. QRT. QRT (msec) box plots for materialized K values of 10, 20, 50, 100, 200, 500, and 1000.]
Figure 7b shows the QRTs of multiple STTOLAP operations under the different materialization strategies, with the STTOLAP operations on the x-axis (RU, D, S, and DD denote the STT Roll-Up, Dice, Slice, and Drill-Down operators, respectively) and the QRT in msec on the y-axis. It is evident that NM is on average 3-5 orders of magnitude slower than PEM, which in turn is one order of magnitude slower than PAM. Furthermore, PAM achieves near-optimal QRTs, just a fraction higher than FM. These experiments confirm that the STTCube's materialization methods (PEM and PAM) improve the STTOLAP operators' QRTs by materializing only a subset of cuboids.
Top-K Value Estimation.
Here, we study the relationship between the QRT and the value of the materialized K. We create seven different STTCube materialization versions using 10, 20, 50, 100, 200, 500, and 1000 as the value of K. Next, we use the Gamma distribution to generate 100 random numbers in the range 1 to 1000, to be used as top-k values. We chose the Gamma distribution because it resembles a common long-tail distribution for top-k values. We execute each query for all 100 generated top-k values over all seven materialization versions. Figure 7c shows the QRT for all queries over the different materialization versions. For K = 10 and 20, the median value coincides with the box top and hence is not visible in the plot. It is evident from Figure 7c that a larger value of the materialized K achieves faster QRTs (a lower median value) because almost all queries are answered using the pre-computed measure values. In contrast, for a smaller K, all queries requesting k > K must be answered using non-pre-computed measure values from the base cuboid, resulting in slower QRTs (a higher median value). A large value of K such as 1000 is not recommended because 1) very few queries request such a large top-k, and 2) it requires more storage (Figure 6a). Specifically, between K = 50 and 100, and between K = 100 and 200, the QRT decreases by 35% and 0%, while the storage increases by 250% and 200%, respectively. Hence, these experiments confirm that choosing a value between 20 and 50 for K is a near-optimal choice in our current experimental settings.
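The trade-off just described follows from how top-k queries are answered; a minimal sketch, assuming a materialized top-K list and a fallback to recomputation from the base cuboid (all names and data below are illustrative):

```python
# Answer a top-k query from a materialized top-K list when possible.
MATERIALIZED_K = 50
materialized_topk = [("paris", 910), ("food", 850), ("travel", 640)]  # truncated example

def answer_topk(k, base_rows):
    if k <= MATERIALIZED_K:
        return materialized_topk[:k]  # fast path: pre-computed measure values
    # Slow path: re-aggregate the base cuboid and rank from scratch.
    from collections import Counter
    return Counter(base_rows).most_common(k)

print(answer_topk(2, base_rows=["paris", "food", "paris"]))
```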
CONCLUSION AND FUTURE WORK
In this paper, we defined and formalized the Spatio-Textual-Temporal Cube (STTCube) structure to effectively perform STTCube analytics. We introduced STT hierarchies, spatio-textual and spatio-textual-temporal measures, and STTOLAP operators to analyze STT data together. For efficient exact and approximate computation of STT measures, we proposed a pre-aggregation framework able to provide faster response times at the cost of a controlled amount of extra storage for pre-computed measure values. We observed that partial materialization provides a 1 to 5 orders of magnitude reduction in query response time, with a storage cost reduced by between 97% and 99.9% compared to full materialization techniques. Moreover, approximate materialization provides an accuracy between 90% and 100%, while requiring considerably less space than no-materialization techniques. In future work, we plan to enhance the STTCube with additional STT measures and a distributed implementation.
REFERENCES
[1] 2020. GeoNames. http://download.geonames.org/. Accessed: 2020-09-09.
[2] A. Almaslukh, A. Magdy, A. M. Aly, M. F. Mokbel, S. Elnikety, Y. He, S. Nath, and W. G. Aref. 2019. Local trend discovery on real-time microblogs with uncertain locations in tight memory environments. GeoInformatica (2019).
[3] M. Azabou, K. Khrouf, J. Feki, C. Soulé-Dupuy, and N. Vallès. 2016. Analyzing textual documents with new OLAP operators. AICCSA (2016).
[4] X. Cao, L. Chen, G. Cong, C. S. Jensen, Q. Qu, A. Skovsgaard, D. Wu, and M. L. Yiu. 2012. Spatial Keyword Querying. ER (2012).
[5] L. Chen, G. Cong, C. S. Jensen, and D. Wu. 2013. Spatial Keyword Query Processing: An Experimental Evaluation. PVLDB (2013).
[6] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. 2002. Multi-dimensional Regression Analysis of Time-series Data Streams. VLDB (2002).
[7] M. L. Chouder, S. Rizzi, and R. Chalal. 2019. Exploratory OLAP over Document Stores. IS (2019).
[8] G. Cong, K. Feng, and K. Zhao. 2016. Querying and mining geo-textual data for exploration: Challenges and opportunities. ICDEW (2016).
[9] G. Cong and C. S. Jensen. 2016. Spatial Keyword Queries and Beyond. SIGMOD (2016).
[10] B. Ding, B. Zhao, C. X. Lin, J. Han, C. Zhai, A. Srivastava, and N. C. Oza. 2011. Efficient Keyword-Based Search for Top-K Cells in Text Cube. TKDE (2011).
[11] C. Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press (1998).
[12] W. Feng, C. Zhang, W. Zhang, J. Han, J. Wang, C. Aggarwal, and J. Huang. 2015. STREAMCUBE: Hierarchical spatio-temporal hashtag clustering for event exploration over the Twitter stream. ICDE (2015).
[13] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. 1996. Data cube: a relational aggregation operator generalizing GROUP-BY, CROSS-TAB, and SUB-TOTALS. ICDE (1996).
[14] N. Gür, T. B. Pedersen, E. Zimanyi, and K. Hose. 2017. A foundation for spatial data warehouses on the Semantic Web. Semantic Web (2017).
[15] J. Han, K. Koperski, and N. Stefanovic. 1997. GeoMiner: A System Prototype for Spatial Data Mining. SIGMOD (1997).
[16] V. Harinarayan, A. Rajaraman, and J. D. Ullman. 1996. Implementing data cubes efficiently. SIGMOD (1996).
[17] P. Jayachandran, K. Tunga, N. Kamat, and A. Nandi. 2014. Combining User Interaction, Speculative Query Execution and Sampling in the DICE System. ICDE (2014).
[18] C. S. Jensen, T. B. Pedersen, and C. Thomsen. 2010. Multidimensional Databases and Data Warehousing. Morgan & Claypool Publishers.
[19] J. F. Kenney and E. S. Keeping. 1962. Mathematics of Statistics, Part 1, Chapter 15. Van Nostrand (1962).
[20] J. D. Knijff, F. Frasincar, and F. Hogenboom. 2013. Domain taxonomy learning from text: The subsumption method versus hierarchical clustering. DKE (2013).
[21] M. D. Lieberman, H. Samet, J. Sankaranarayanan, and J. Sperling. 2007. STEWARD: Architecture of a Spatio-textual Search Engine. GIS (2007).
[22] C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. 2008. Text Cube: Computing IR Measures for Multidimensional Text Database Analysis. ICDM (2008).
[23] L. Lins, J. T. Klosowski, and C. E. Scheidegger. 2013. Nanocubes for real-time exploration of spatiotemporal datasets. TVCG (2013).
[24] X. Liu, K. Tang, J. Hancock, J. Han, M. Song, R. Xu, and B. Pokorny. 2013. A Text Cube Approach to Human Social and Cultural Behavior in the Twitter Stream. SBP (2013).
[25] A. Magdy, L. Abdelhafeez, Y. Kang, E. Ong, and M. F. Mokbel. 2020. Microblogs data management: a survey. The VLDB Journal (2020).
[26] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. ACL (2014).
[27] R. Othman, R. Belkaroui, and R. Faiz. 2016. Customer Opinion Summarization Based on Twitter Conversations. WIMS (2016).
[28] B. Pat and Y. Kanza. 2017. Where's Waldo?: Geosocial Search over Myriad Geotagged Posts. SIGSPATIAL (2017).
[29] J. M. Pérez-Martínez, R. Berlanga-Llavori, M. J. Aramburu-Cabo, and T. B. Pedersen. 2008. Contextualizing Data Warehouses with Documents. DSS (2008).
[30] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling. 2009. TwitterStand: News in Tweets. SIGSPATIAL (2009).
[31] A. Simitsis, A. Baid, Y. Sismanis, and B. Reinwald. 2008. Multidimensional content eXploration. PVLDB (2008).
[32] A. Skovsgaard, D. Sidlauskas, and C. S. Jensen. 2014. Scalable top-k spatio-temporal term querying. ICDE (2014).
[33] M. Walther and M. Kaisser. 2013. Geo-spatial Event Detection in the Twitter Stream. ECIR (2013).
[34] S. Wang, J. Cao, and P. Yu. 2020. Deep learning for spatio-temporal data mining: A survey. TKDE (2020).
[35] D. Yu, D. Xu, D. Wang, and Z. Ni. 2019. Hierarchical Topic Modeling of Twitter Data for Online Analytical Processing. IEEE Access (2019).
[36] C. Zhang and J. Han. 2019. Multidimensional Mining of Massive Text Data. DMKD (2019).
[37] D. Zhang, C. X. Zhai, J. Han, A. Srivastava, and N. Oza. 2009. Topic Modeling for OLAP on Multidimensional Text Databases. Stat. Anal. Data Min. (2009).
[38] K. Zhao, L. Chen, and G. Cong. 2016. Topic Exploration in Spatio-Temporal Document Collections. SIGMOD (2016).
A STTOLAP OPERATORS
A data cube allows different OnLine Analytical Processing (OLAP) operators to group, filter, and analyze cells and subsets of cells at different levels of granularity and under different perspectives. These operators are known as Slice, Dice, Roll-Up, and Drill-Down [18], and they take as input a cube and produce as output another cube. In the following, we extend the basic OLAP operators to STTOLAP operators, i.e., operators for spatial, textual, and temporal dimensions, hierarchies, and measures. In general, an OLAP (and STTOLAP) operator $OP$ accepts as input a cube $C'$ and some parameters $params$, and outputs a new cube $C''$, i.e., $OP(C', params) = C''$. In this way, a new OLAP operator can be applied to $C''$. Among all cubes, we distinguish the initial or base cube $C$ as the cube containing all the original information at the base level. In the following, we generally assume every OLAP operator $OP$ to have access to $C$, since some operators need access to the base cube $C$ to produce the desired result.
A.1 STT-Slice
Slice operates over the current data cube $C'$ and takes as a parameter a dimension member $v_j$ in a specific level $l_j$ of a dimension $D_i$. It keeps only the cells in $C'$ corresponding to $v_j$ and finally removes dimension $D_i$. An example of the slice operator is "slice the location dimension on a user-defined polygon representing Aalborg".
STT-Slice. The STT-Slice operator is defined as $STTSlice(C'_{stt}, v_j) = C''_{stt}$. It takes an n-dimensional STTCube $C'_{stt}$ and a member $v_j$ of the level $l_j$ of the spatial, textual, or temporal dimension $D_{Location}$, $D_{Text}$, or $D_{Time}$, respectively, and produces a resulting cube $C''_{stt}$ with dimensions $D - D_i$, i.e., it keeps only the cells for the member $v_j$ and removes the respective dimension. The member $v_j$ could be an object in the taxonomy of a semantic-based hierarchy (e.g., Aalborg as a dimension member in $L_{City}$) or a grid cell at some granularity level. Similarly, $v_j$ could be a specific Theme or Topic for the textual dimension, or a specific day or month for the temporal dimension.
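A minimal Python sketch of STT-Slice follows, using a toy encoding of a cube as a list of (coordinates, measure) pairs; this encoding is an illustrative simplification of ours, not the paper's actual relational implementation.

```python
# STT-Slice: keep only cells matching member v of dimension d, then drop d.
cube = [
    ({"Day": "2021-03-23", "City": "Aalborg", "Term": "food"}, 12),
    ({"Day": "2021-03-23", "City": "Paris",   "Term": "food"}, 30),
]

def stt_slice(cube, dim, member):
    result = []
    for coords, measure in cube:
        if coords[dim] == member:
            rest = {k: v for k, v in coords.items() if k != dim}  # remove dimension
            result.append((rest, measure))
    return result

print(stt_slice(cube, "City", "Aalborg"))
# [({'Day': '2021-03-23', 'Term': 'food'}, 12)]
```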
A.2 STT-Dice
While the slice operator selects and removes a single dimension, the dice operator produces a new cube whose cell contents have been filtered based on a set of conditions (complex predicates, queries covering several cells), but without removing any dimension. That is, it produces a resulting cube with the same number of dimensions, based only on the facts that satisfy the provided set of conditions. Such conditions can use a combination of spatial, textual, temporal, and general-purpose functions. These functions can perform different computations: e.g., they can combine more than one object and return a new single aggregated object, compare two objects and return a Boolean value, or produce a numeric value based on some computation.
STT-Dice. The STT-Dice operator selects only the cell(s) that satisfy the provided spatial, textual, or temporal logical conditions. Given an STTCube $C'_{stt}$ and a set of logical conditions $COND_s$, STTDice is represented as $STTDice(C'_{stt}, COND_s) = C''_{stt}$, where $COND_s$ is a set of atomic or compound spatial, textual, or temporal logical conditions. For instance, it can return only the cell(s) that intersect with the polygon describing a custom region of interest and contain at least $n$ observations, or the cell(s) containing only objects whose relevance score with the topic Food is above 0.

[Figure 8: Spatio-Textual-Temporal Lattice. Each node is annotated with (number of rows, materialized T/F): DLT (100M, T), DL (15M, F), DT (4M, T), LT (96M, F), D (37, F), L (14M, F), T (2M, F), ∅ (1, F).]
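A companion sketch for the STT-Dice operator just defined, in the same toy (coordinates, measure) encoding: all dimensions are kept, and cells are filtered by a conjunction of predicates (the predicates below are illustrative stand-ins; real spatial conditions would use, e.g., polygon intersection).

```python
# STT-Dice: filter cells by a set of logical conditions, keeping all dimensions.
cube = [
    ({"Day": "2021-03-23", "City": "Paris",   "Term": "food"},   30),
    ({"Day": "2021-03-24", "City": "Aalborg", "Term": "travel"},  7),
]

def stt_dice(cube, conditions):
    # `conditions` is a list of predicates over (coords, measure).
    return [cell for cell in cube if all(cond(*cell) for cond in conditions)]

diced = stt_dice(cube, [
    lambda coords, m: coords["City"] in {"Paris", "Lyon"},  # stand-in spatial test
    lambda coords, m: m >= 10,                              # at least 10 observations
])
print(diced)
```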
A.3 STT-Roll-Up
The Roll-Up operator aggregates measure values along a hierarchy by moving from a child level to a parent level. This moves the analysis to a coarser granularity. An example of this operator is "Roll-Up from City to Country".
STT-Roll-Up. The STT-Roll-Up operator groups facts by aggregating the measure values of all child members that belong to the same parent member in a spatial, textual, or temporal hierarchy. Given an STTCube $C'_{stt}$, a child level $l_c$, and a target parent level $l_p$ (identifying a hierarchy step function $h_{l_c \rightarrow l_p}$ in the spatial, textual, or temporal dimension), the STTRollUp operator, defined as $STTRollUp(C'_{stt}, l_c, l_p) = C''_{stt}$, produces a new STTCube that has the same number of dimensions; for each measure $m \in M$, the aggregation function associated with $m$ is used to create an aggregated measure value at the higher parent level. For instance, when we Roll-Up from the City level to the Region level, the Fact Count values get summed to compute the new totals. Similarly, "Roll-Up to Topic level from Theme level" groups facts by aggregating the measure values of all Themes that belong to the same Topic, and "Roll-Up to Quarter level from Month level" groups facts by aggregating the measure values of all Months that belong to the same Quarter.
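To make the aggregation step concrete, here is a minimal sketch of STT-Roll-Up in the same toy encoding; the City-to-Region mapping and the use of sum (as for Fact Count) are illustrative assumptions.

```python
# STT-Roll-Up: re-key cells from a child level to its parent level and
# aggregate the measure (sum, as for Fact Count).
from collections import defaultdict

cube = [
    ({"Day": "2021-03-23", "City": "Aalborg"}, 12),
    ({"Day": "2021-03-23", "City": "Copenhagen"}, 5),
]
city_to_region = {"Aalborg": "North Jutland", "Copenhagen": "Capital"}

def stt_rollup(cube, child, parent, step):
    grouped = defaultdict(int)
    for coords, measure in cube:
        new_coords = dict(coords)
        new_coords[parent] = step[new_coords.pop(child)]  # apply hierarchy step
        grouped[tuple(sorted(new_coords.items()))] += measure
    return [(dict(k), v) for k, v in grouped.items()]

print(stt_rollup(cube, "City", "Region", city_to_region))
```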
A.4 STT-Drill-Down
The inverse of Roll-Up is the Drill-Down operator, which shows data at a finer granularity by dis-aggregating measure values along a hierarchy. An example of the Drill-Down operator is "Drill-Down to City level from Region level". For this operation, the base cube is required, as we cannot uniquely dis-aggregate measure values knowing only the values at the parent level.
STT-Drill-Down. Given an STTCube $C'_{stt}$ at a non-base level $l_p$ of the spatial ($D_{Location}$), textual ($D_{Text}$), or temporal ($D_{Time}$) dimension and a target child level $l_c$ (identifying a hierarchy step function $h_{l_c \rightarrow l_p}$), the STTDrillDown operator $STTDrillDown(C_{stt}, C'_{stt}, l_p, l_c) = C''_{stt}$ produces a new STTCube at a finer granularity, computing dis-aggregated measure values along the selected spatial, textual, or temporal hierarchy, respectively. For instance, STT-Drill-Down can move from the Region level to the City level, dis-aggregating the Fact Count of each Region into the counts for each City in that Region. "Drill-Down to Term from Theme", following the inverse of the textual hierarchy step Term → Theme, and "Drill-Down to Day from Year", following the inverse of the temporal hierarchy step Day → Year, are examples of STT-Drill-Down operators for the textual and temporal dimensions, respectively.
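The signature above takes the base cube precisely because parent-level totals cannot be uniquely split; a sketch in the same toy encoding, recomputing the finer-grained cells from the base cube (the data and level names are illustrative):

```python
# STT-Drill-Down: recompute child-level cells from the base cube.
from collections import defaultdict

base_cube = [  # base-level facts carrying both Region and City coordinates
    ({"Region": "North Jutland", "City": "Aalborg"}, 12),
    ({"Region": "North Jutland", "City": "Frederikshavn"}, 3),
]
parent_cube = [({"Region": "North Jutland"}, 15)]  # current cube at Region level

def stt_drilldown(base_cube, parent_cube, parent, child):
    # Only drill into parent members present in the current cube.
    wanted = {coords[parent] for coords, _ in parent_cube}
    grouped = defaultdict(int)
    for coords, measure in base_cube:
        if coords[parent] in wanted:
            grouped[(coords[parent], coords[child])] += measure
    return [({parent: p, child: c}, v) for (p, c), v in grouped.items()]

print(stt_drilldown(base_cube, parent_cube, "Region", "City"))
```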
B LATTICE EXAMPLE
Consider a simple lattice of cuboids for a cube with 3 dimensions (Figure 8), each with a single 2-level hierarchy, namely with the base levels Location (L) with 14M rows for the spatial dimension, Term (T) with 2M rows for the textual dimension, and Date (D) with 37 rows for the temporal dimension, each of which can then be rolled up to the all level (∅) with only one row.
[Figure 9: QRT vs. Data Size. QRT (ms) over data sizes of 25-125 million objects for queries Q1-Q9, with panels (a)-(j) per hierarchy combination (SR, SM, GR, GM) covering Keyword Density (PEM), Top-k Dense Keywords within an Area (PEM and PAM), and Top-k Volatile Keywords within an Area (PEM and PAM).]
[Figure 10: QRTs for spatio-textual-temporal measures for different combinations of hierarchy schemes (SR, GR, SM, GM) over 125 million objects. Panels cover Keyword Density, Keyword Volatility, Top-k Dense Keywords, and Top-k Volatile Keywords under NC, NM, PEM, PAM, and FM.]
Each node in the lattice is associated with two values: first, the number of rows in the cuboid, and second, a flag (T/F) marking whether the cuboid is materialized. At the top of the lattice, we have the base cuboid (which is always materialized) with Date, Location, and Term (DLT), containing in this example all rows (100M). If we Roll-Up the spatial dimension from Location to All, we obtain a new cuboid (DT) with 4M rows. The cuboid DLT is referred to as the ancestor of the cuboid DT. If the cube is partially materialized, i.e., not all cuboids are materialized, and the cuboid DT is materialized, then to obtain the
Fact Count for every Date and Term, the cuboid DT with 4M rows would already contain the pre-computed answer, without the need to compute it from the base cube DLT with 100M rows. Moreover, when the cuboid T is not materialized, we can still compute the Fact Count for every Term from the cuboid DT by accessing only 4M values instead of the 100M in DLT.
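This "answer from the smallest materialized ancestor" lookup can be sketched as follows over the Figure 8 lattice; the row counts and flags are taken from the figure, while the dictionary encoding is our illustrative assumption.

```python
# Answer cost for a cuboid = size of its smallest materialized ancestor
# (or itself). Lattice of Figure 8: cuboid -> (rows, materialized, ancestors).
lattice = {
    "DLT": (100_000_000, True,  []),
    "DL":  (15_000_000,  False, ["DLT"]),
    "DT":  (4_000_000,   True,  ["DLT"]),
    "LT":  (96_000_000,  False, ["DLT"]),
    "D":   (37,          False, ["DL", "DT", "DLT"]),
    "L":   (14_000_000,  False, ["DL", "LT", "DLT"]),
    "T":   (2_000_000,   False, ["DT", "LT", "DLT"]),
    "ALL": (1,           False, ["D", "L", "T", "DL", "DT", "LT", "DLT"]),
}

def answer_cost(cuboid):
    rows, mat, ancestors = lattice[cuboid]
    candidates = [rows] if mat else []
    candidates += [lattice[a][0] for a in ancestors if lattice[a][1]]
    return min(candidates)

print(answer_cost("T"))  # 4M rows via the materialized DT, not 100M via DLT
```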
C COST MODEL
The core of the proposed partial materialization approach depends on the trade-off between the storage cost of materializing any particular cuboid and the actual benefit that its materialization provides. To evaluate this benefit, we have to estimate the (run-time) cost of a query. To devise a cost model for this estimation, we selected a set of representative queries (Q1-Q9, details in Table 3) for the aforementioned spatio-textual-temporal measures and measured the runtime of these queries on increasing data sizes. The micro-benchmark (Figure 11) confirmed that the running time is directly proportional to the data size (the number of rows), i.e., it confirms that we can use the Linear Cost Model [16] and the associated benefit calculation. Then, to model the dependency relationships among all the possible cuboids, we use the lattice framework [16] (Figure 8).

[Figure 11: QRT vs. Data Size. Execution time (seconds) of queries Q1-Q9 over growing data sizes in millions.]

Hence, to compute the benefit of materializing a particular cuboid $c$, we compare the cost of answering queries at all levels of granularity (i.e., for the current cuboid $c$ and all its descendants in the lattice) with the current set of materialized cuboids against the cost when $c$ is also materialized:

$$\mathit{Benefit}(c) = \sum_{c' \in \mathit{descendants}(c) \cup \{c\}} \big(\mathit{cost}(c') - \mathit{size}(c)\big)$$

For instance, assume the lattice in Figure 8 and that only the base cuboid (DLT) with 100M rows is materialized. If we consider DT with 4M rows, we have that, if materialized, queries against the cuboids D, T, ∅, and DT itself can be answered through it (with a cost of 4M each), while without materializing DT we would need to compute the answers against DLT (with a cost of 100M each). Hence, materializing DT achieves a benefit of $(100M - 4M) \times 4 = 96M \times 4 = 384M$.
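The linear cost model itself can be fit directly from micro-benchmark measurements; a minimal sketch, with made-up (rows, seconds) pairs standing in for the Figure 11 data:

```python
# Fit cost(q) ~ a * rows, as suggested by the micro-benchmark (Figure 11).
# The (rows, seconds) pairs below are made-up placeholders, not measured data.
samples = [(25e6, 5.1), (50e6, 10.3), (75e6, 15.2), (100e6, 20.5), (125e6, 25.4)]

# Least-squares slope through the origin: a = sum(x*y) / sum(x*x).
a = sum(x * y for x, y in samples) / sum(x * x for x, _ in samples)
print(f"estimated cost per row: {a:.2e} s")
print(f"predicted cost for a 4M-row cuboid: {a * 4e6:.2f} s")
```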