Short and Long-term Pattern Discovery Over Large-Scale Geo-Spatiotemporal Data

Abstract

Pattern discovery in geo-spatiotemporal data (such as traffic and weather data) is about finding patterns of collocation, co-occurrence, cascading, or cause and effect between geospatial entities. Simplistic definitions of spatiotemporal neighborhood (a common characteristic of the existing general-purpose frameworks) are not semantically representative of geo-spatiotemporal data. We therefore introduce a new geo-spatiotemporal pattern discovery framework that provides a semantically correct definition of neighborhood and offers two capabilities, one to explore propagation patterns and the other to explore influential patterns. Propagation patterns reveal common cascading forms of geospatial entities in a region. Influential patterns demonstrate the impact of temporally long-term geospatial entities on their neighborhood. We apply this framework to a large dataset of traffic and weather data at countrywide scale, collected for the contiguous United States over two years. Our important findings include the identification of 90 common propagation patterns of traffic and weather entities (e.g., rain --> accident --> congestion), which results in the identification of four categories of states within the US; and interesting influential patterns with respect to the "location", "duration", and "type" of long-term entities (e.g., a major construction --> more traffic incidents). These patterns and the categorization of the states provide useful insights on the driving habits and infrastructure characteristics of different regions in the US, and could be of significant value for applications such as urban planning and personalized insurance.


Sobhan Moosavi, Mohammad Hossein Samavatian, Arnab Nandi, Srinivasan Parthasarathy, and Rajiv Ramnath

Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210-1277
{moosavi.3,samavatian.1,nandi.9,parthasarathy.2,ramnath.6}@osu.edu

CCS CONCEPTS

• Applied computing → Transportation; • Information systems → Traffic analysis; Information integration; Data cleaning.

KEYWORDS

Propagation Patterns, Influential Patterns, Geo-Spatiotemporal Data

ACM Reference Format:

Sobhan Moosavi, Mohammad Hossein Samavatian, Arnab Nandi, Srinivasan Parthasarathy, and Rajiv Ramnath. 2019. Short and Long-term Pattern Discovery Over Large-Scale Geo-Spatiotemporal Data. In The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’19), August 4–8, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3292500.3330755


1 INTRODUCTION

Spatiotemporal pattern discovery has seen considerable interest over the past decade, with various frameworks proposed to process the data and find interesting patterns [3, 4, 14, 19–21, 24, 30, 33, 34]. The application domains of relevance include public safety, transportation, earth science, epidemiology, climatology, and environmental management [25]. These frameworks can be used to discover patterns of collocation and co-occurrence, interactions and correlations, and cascading, sequential, or cause-and-effect relationships. However, they all rely on a simplistic definition of spatiotemporal neighborhood, essentially spatial closeness based on a Euclidean or Cartesian system plus temporal overlap [4, 14, 21, 33], which often makes their use impractical for applications such as traffic, transportation, or weather analyses. For example, a traffic accident on one lane of a freeway has no impact on traffic flow on an opposite lane, yet general-purpose frameworks will locate both lanes in a single neighborhood. Another example arises when studying the impact of a snow event on traffic flow, an impact which continues well past the end of the snow event. The time-overlap constraint required by existing frameworks would hinder such a study. Note that there may be no trivial change that makes the existing frameworks semantically applicable to this type of data, because they are built on a specific way of defining spatiotemporal neighborhood, and changing it would make them unusable (e.g., regarding their pruning step) or expensive to employ.

To address these challenges, we propose a new framework for finding patterns in geo-spatiotemporal data. This framework consists of two parts, one to explore propagation patterns, and the other to reveal influential patterns. Identifying propagation patterns requires the exploration of partially ordered sets of geospatial entities that are spatially co-located and temporally co-occurring, with potential “cause and effect” relationships between the entities. An example of this type is a rain event, which causes an accident, with the accident then causing congestion. Identifying influential patterns, on the other hand, requires studying the impact of temporally long-term geospatial entities (e.g., a major construction) on their spatial neighborhoods. An example of this type of pattern is the increase in the number of congestion events in a region because of a long-term snow event.

To explore propagation patterns (also referred to as “cascading patterns” [21] or “spatiotemporal couplings” [25]), we propose a tree-pattern-mining-based process, which we term short-term pattern discovery. It employs a strict definition of spatial neighborhood to ensure spatial collocation, and a definition of temporal co-occurrence specific to geo-spatiotemporal data and application domain constraints. To explore influential patterns (also referred to as “tele-couplings” [25]), we propose a new process, which we term long-term pattern discovery, to examine the effect of long-term entities on their neighborhood and reveal any significant impact. As in, and drawing from, [11, 16], this process may be used to study impacts with respect to different types, different locations, and the duration of long-term geospatial entities.

To evaluate our framework, we used a large-scale, real-world geo-spatiotemporal dataset of traffic and weather data. This dataset covers the contiguous United States, includes data collected from August 2016 to August 2018, and contains about 13.1 million traffic entities and 2.2 million weather entities. Using this dataset, we identified 90 common propagation patterns and four categories of states based on these patterns. In addition, we carefully studied the impact of relatively long-term traffic or weather entities on traffic, and identified a variety of insights with respect to the “location”, “type”, and “duration” of the entities. The main contributions of this paper are as follows:

• Short-term pattern discovery: We propose a new process for discovering propagation patterns in geo-spatiotemporal data, which models spatiotemporal collocation and co-occurrence in terms of tree structures, and adopts an existing tree pattern mining approach to reveal prevalent patterns. In comparison to the general-purpose frameworks, this method better suits the application domain's requirement of a stricter definition of spatiotemporal neighborhood.

• Long-term pattern discovery: We propose a new process for discovering influential patterns in geo-spatiotemporal data, which examines the impact of long-term geospatial entities on their neighborhood in order to reveal significant influential patterns. Exploring such patterns with existing frameworks is not feasible, due to the lack of effective spatiotemporal neighborhood metrics for longer-term (or lagging) impacts.

• Data collection and processing: We present a set of processes for collecting real-time traffic and historic weather data, using which we built a publicly available “research dataset” of about 13.1 million traffic entities and 2.2 million weather entities.

• Findings and insights: By applying our new framework to the above dataset, we present a range of insights for different regions in the United States. These insights may be further utilized for applications such as urban planning, exploring flaws in transportation infrastructure design, traffic control and prediction, impact prediction, and personalized insurance, potentially with relevance to the creation of smart cities.

The rest of this paper is organized as follows: We review the related work in Section 2, and provide preliminaries in Section 3. Section 4 describes the dataset preparation, followed by a description of the framework in Section 5. Experiments and results are presented in Section 6, and Section 7 concludes the paper.

2 RELATED WORK

Spatiotemporal pattern discovery has been thoroughly discussed in the literature [3, 4, 14, 19–21, 24]. Earlier work focused more on spatial prevalence and paid less attention to temporal aspects [14], while later work considered both aspects simultaneously [25]. The common process of spatiotemporal pattern discovery is to first define spatiotemporal co-occurrence and collocation criteria; then introduce an interest measure (e.g., participation index); and finally outline a miner algorithm to find interesting patterns [14]. The techniques in these papers, being general-purpose solutions, rely on simplistic definitions of collocation (spatial) and co-occurrence (temporal), and are unable to reveal complex spatiotemporal correlations (such as influential patterns). Further, they have been developed and tested only on small-scale (real-world or synthetic) data. To address these challenges with respect to geo-spatiotemporal data, we propose a new framework which provides an appropriate and precise definition of collocation and co-occurrence criteria. Moreover, we outline the process of finding complex spatiotemporal patterns and demonstrate its applicability through extensive experiments. Lastly, we apply our framework to a large-scale, countrywide geo-spatiotemporal dataset of traffic and weather data to explore interesting patterns.

Footnote: The contiguous United States excludes Alaska and Hawaii, and considers the District of Columbia (DC) as a separate state.

Regarding the application domain, there are numerous studies for finding patterns in traffic and weather data, with the following goals: to study the impact of precipitation on the likelihood or severity of accidents [7, 16, 28]; to explore the impact of weather on traffic intensity [5, 31]; to reveal the effect of climate change and weather conditions on road safety [1, 11, 29]; to characterize road accident locations [18]; or to discover frequent spatiotemporal patterns in traffic data [15, 17, 19]. The scale of data in most of these studies is limited to one or at most a few cities. Moreover, interactions and correlations between the different types of traffic entities (accident, congestion, etc.) have not been studied before. Although similar ideas to explore long-term patterns have been previously suggested [7, 11, 16], we extend them by: 1) examining a wider range of weather and traffic entity types besides precipitation; 2) exploring properties of different “locations”; and 3) analyzing the impact of “duration length” on traffic flow.

3 PRELIMINARIES

In this section, we first provide preliminaries and definitions, and then present the problem statement. Note that some of the definitions are customized for our illustrative application domain (i.e., traffic and weather data). However, this does not limit their generalizability to other related domains.

• Geospatial Entity: a geospatial entity $e$ is represented by a tuple ⟨type, start, end, loc⟩, which denotes an entity of type type that happened in the time interval [start, end] at the location specified by loc. The definition of loc depends on the application domain. For traffic data, we have loc = ⟨latitude, longitude, Street_Name, Street_Side, Zipcode, City, State⟩, where Street_Side shows the relative side of a street (i.e., R or L). For weather data, we have loc = ⟨airport_code⟩, which represents the airport whose weather station reported $e$. A geospatial entity is called long if it takes place over a relatively long time interval (see Section 5.2).

• Weak-Dependency Relationship: two co-occurring and co-located geospatial entities are called weakly dependent. Co-occurrence for two entities $e_1$ and $e_2$ means $0 \le |e_1.\mathit{start} - e_2.\mathit{start}| \le \text{T-thresh}$, where T-thresh is a time threshold. Collocation for two traffic entities requires location matching as well as spatial closeness. The former means that all location fields except the GPS coordinates should be the same. By the latter, we require that $\mathit{dist}(e_1, e_2) \le \text{D-thresh}$, where dist is the Haversine distance function [13] based on GPS coordinates, and D-thresh is a distance threshold. With respect to matching a pair of weather and traffic entities, collocation means a match between the “airport station” at which the weather entity is reported and the “airport station” closest to the traffic entity’s location.

• Child-Parent Relationship: for two weakly dependent geospatial entities $e_1$ and $e_2$, $e_1$ is a parent of $e_2$ if $e_1$ begins before $e_2$. We treat the parent-child relationship as indicative of a cause-and-effect relation. A weather entity may only be the parent (or cause) of a traffic entity, and we do not define such a relationship between two weather entities.

• Tree Structure: given a set of vertices $\mathcal{V} = (v_1, v_2, \ldots, v_n)$, we define a tree $T = (V, E)$, where $V \subset \mathcal{V}$ and $E = \{e_1, e_2, \ldots, e_m\}$ is a set of edges, and each edge $e \in E$ connects a pair of vertices $v_i, v_j \in V$ using an undirected edge. A tree is an acyclic graph, and vertices with the same parent are siblings. Trees in this work have a root node, sibling nodes are unordered, and nodes are labeled. Figure 1-(a) shows several examples of such a tree structure. In this work, each node of a tree is a geospatial entity, and each edge shows a child-parent relationship between two entities.

• Embedded Subtree: given a tree $T = (V, E)$, we define a subtree as $S = (V', E')$, where $V' \subset V$ and $E' \subset E$. A subtree $S$ is said to be an embedded subtree of $T$ if for each edge $e = (v_a, v_b) \in E'$, $v_a$ is an ancestor (and not necessarily the parent) of $v_b$ in $T$.

Figure 1: (a) A forest of four unordered, rooted, labeled trees; (b) four of the frequently occurring embedded subtrees with a minimum support of 75%.

We now formalize the two related problems studied in this paper.
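As a concrete reading of the weak-dependency and child-parent definitions in this section, the following minimal Python sketch checks a pair of traffic entities. The `TrafficEntity` class, the function names, and the threshold values used in the usage note are hypothetical illustrations, not the authors' implementation; weather-to-traffic matching by airport station is left out.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


@dataclass(frozen=True)
class TrafficEntity:
    """Hypothetical container for the tuple <type, start, end, loc>."""
    type: str
    start: float   # start time in seconds
    end: float     # end time in seconds
    lat: float
    lon: float
    street: str
    side: str      # "R" or "L"
    zipcode: str
    city: str
    state: str


def haversine_m(a, b):
    """Haversine (great-circle) distance in meters between two entities."""
    phi_a, phi_b = math.radians(a.lat), math.radians(b.lat)
    d_phi = phi_b - phi_a
    d_lambda = math.radians(b.lon - a.lon)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def weakly_dependent(a, b, t_thresh_s, d_thresh_m):
    """Co-occurrence plus collocation for two traffic entities."""
    co_occurring = abs(a.start - b.start) <= t_thresh_s
    # Location matching: every location field except the GPS coordinates.
    same_place = ((a.street, a.side, a.zipcode, a.city, a.state)
                  == (b.street, b.side, b.zipcode, b.city, b.state))
    return co_occurring and same_place and haversine_m(a, b) <= d_thresh_m


def parent_of(a, b, t_thresh_s, d_thresh_m):
    """Return the parent of a weakly dependent pair, or None."""
    if not weakly_dependent(a, b, t_thresh_s, d_thresh_m):
        return None
    if a.start == b.start:
        return None  # neither entity begins before the other
    return a if a.start < b.start else b
```

With illustrative thresholds of 600 seconds and 200 meters (assumed values), an accident followed five minutes later by congestion about 111 meters away on the same street side yields the accident as parent, while the same congestion on the opposite street side yields no relationship.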

In short-term pattern discovery, we seek to find common short-term propagation patterns that indicate how geospatial entities cause other entities to happen. We represent a set of weakly dependent geospatial entities as unordered, rooted, labeled trees, where the entities are nodes, weak-dependency relations are the edges, and entity types (e.g., rain, accident, and congestion) are the labels of the nodes. Thus, given a forest $F = \{T_1, T_2, \ldots, T_k\}$ of such trees, the short-term pattern discovery problem is about finding all embedded subtrees in $F$ which occur relatively frequently. Formally, for a subtree $S$ and tree $T$ we define $\mathit{support}(S, T)$ by Equation 1:

$$\mathit{support}(S, T) = \begin{cases} 1 & \text{if } S \text{ is a subtree of } T \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

and $\mathit{support}(S, F)$ by Equation 2:

$$\mathit{support}(S, F) = \frac{\sum_{T \in F} \mathit{support}(S, T)}{|F|} \qquad (2)$$

For a subtree $S$, if $\mathit{support}(S, F) \ge \mathit{min\_sup}$, where $\mathit{min\_sup}$ is a minimum support threshold, then we say $S$ is a frequent embedded subtree in $F$. An example of a forest with some of its frequently occurring subtree patterns is shown in Figure 1. In this example, we have a forest which includes four trees. Using a minimum support threshold of 75%, we identified several frequently occurring embedded subtree patterns, four of which are shown in Figure 1-(b). We use “short-term pattern discovery” to indicate that we search for patterns of immediate or short-term impacts, as opposed to long-term impacts, which are discussed next.

Long-term pattern discovery is about exploring the magnitude of the impact of long-term geospatial entities on their neighborhood. As an example, consider a major construction event in region A, because of which we might observe more congestion events in the same region (when compared to a time when there was no such construction event). Given a long entity $L$, let $S_R = [e_1, e_2, \ldots]$ be the set of geospatial entities in its vicinity, where $R$ is the maximum distance threshold (for each $e_i \in S_R$, its location is within distance $R$ from $L$). Let $L.\mathit{start} < e.\mathit{start}$ and $L.\mathit{end} > e.\mathit{end}$, $\forall e \in S_R$. To study the impact of a long entity, we also define two other sets, $S\text{-before}_R$ and $S\text{-after}_R$. The former contains all geospatial entities which happened within distance $R$ from $L$, during a time interval of the same length as $L$, but before $L$ started. The latter contains all entities in the same neighborhood as $L$, during a time interval of the same length as $L$, but which happened after $L$ ended. Given sets $S_R$, $S\text{-before}_R$, and $S\text{-after}_R$, we define the problem of “long-term pattern discovery” as exploring any significant difference between the size of set $S_R$ and the sizes of the other two sets. In other words, a statistically significant difference between the number of entities when a long entity like $L$ is present, and the number of entities before or after $L$, shows the magnitude of the impact. We call such an occurrence a long-term or influential pattern.

Short-term pattern discovery is about finding immediate impacts, and long-term pattern discovery is about exploring the “long-lastingness” of impacts (i.e., lagging impacts). Hence, these two are complementary problems, with each one focused on a separate aspect of dependency and pattern discovery, while using the same set of input data.

4 DATASET

In this section, we describe the dataset preparation process. The resulting dataset includes about 13.1 million traffic entities and 2.2 million weather entities.

To begin with, traffic entities were collected in real time using a REST API provided by MapQuest [32] for a period of two years, from August 2016 to August 2018. To our knowledge, this API broadcasts traffic entities captured by a variety of mechanisms: the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. Traffic data was collected for the contiguous United States (49 states). As the raw traffic entities came with GPS coordinates, we employed the Nominatim tool [22] to perform reverse geocoding and translate GPS coordinates to addresses.

The following cleaning steps were employed:

i. Resolving duplicates: Duplicates were identified either explicitly by id (i.e., two entities have the same id), or implicitly by content (i.e., two entities of the same type occurring at the same time and location). We kept one entity and removed the other.

ii. Denoising the data: In this context, noise is related to the “type” of an entity, where the Traffic Message Channel (TMC) [8] code (part of the information for each traffic entity) was different from the default type reported by the MapQuest API. To deal with this mismatch, we first extracted 250 different TMC codes from our data, and manually created a new taxonomy by defining a unified type for each TMC code, using [8] as reference. Finally, we replaced the default taxonomy in the traffic data with the new one.
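The two cleaning steps can be sketched in a few lines of Python. The record layout (`id`, `type`, `start`, `lat`, `lon`, `tmc`) and the code-to-type mapping passed in by the caller are hypothetical; the actual field names of the raw feed and the 250-code taxonomy are not given here.

```python
def resolve_duplicates(entities):
    """Keep the first entity of every duplicate group."""
    seen_ids, seen_content, kept = set(), set(), []
    for e in entities:
        # Implicit duplicates share type, time, and location.
        content_key = (e["type"], e["start"], e["lat"], e["lon"])
        if e["id"] in seen_ids or content_key in seen_content:
            continue  # explicit (same id) or implicit (same content) duplicate
        seen_ids.add(e["id"])
        seen_content.add(content_key)
        kept.append(e)
    return kept


def unify_types(entities, tmc_to_type):
    """Replace each default type with the unified type of its TMC code.

    Entities whose code is missing from the mapping keep their default type.
    """
    return [{**e, "type": tmc_to_type.get(e["tmc"], e["type"])} for e in entities]
```

A typical call would be `unify_types(resolve_duplicates(raw), mapping)`, where `mapping` is the manually curated dictionary from TMC code to unified type.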

We defined the following taxonomy for traffic entities:

• Accident: a common type, which may involve one or more vehicles, and could result in fatality.
• Broken-Vehicle: refers to the situation when there is one (or more) disabled vehicle(s) on a road.
• Congestion: refers to the situation when the speed of traffic is slower than the expected speed. Using the TMC codes, we defined the severity of a congestion as light, moderate, or heavy.
• Construction: an on-going construction or maintenance project on a road.
• Event: situations such as sports events, concerts, or demonstrations, that could potentially impact traffic flow.
• Lane-Blocked: refers to the cases when we have blocked lane(s) due to traffic or weather conditions.
• Flow-Incident: refers to all other types of traffic entities. Examples are a broken traffic light and an animal in the road.

Figure 2: Relative frequency distribution of traffic and weather data, collected from Aug 2016 to Aug 2018, for the contiguous United States. (a) Monthly traffic distribution; (b) monthly weather distribution; (c) weekly traffic distribution.

Table 1: Details on the traffic dataset, collected for the contiguous United States from Aug 2016 to Aug 2018.

Entity Type       Raw Count     Relative Frequency
Accident           1,169,507     8.9%
Broken-Vehicle       308,112     2.34%
Congestion        10,542,020    80.18%
Construction         209,933     1.60%
Event                 32,817     0.25%
Lane-Blocked         246,832     1.88%
Flow-Incident        637,489     4.85%
Total             13,146,710   100%

Table 1 provides more details on the traffic dataset. The most frequent entity type is “congestion”, which makes up about 80% of the data, and “accident” is the second most frequent entity type. Figure 2a depicts the monthly frequency distribution, where the most entities are observed in March and September and the fewest in November. Additionally, the weekly frequency distribution of traffic entities is shown in Figure 2c, where “Friday” and “Sunday” are found to be the days with the most and the fewest recorded entities, respectively.

Raw weather data was collected from 1,973 weather stations located in airports all around the country. The raw data comes in the form of observation records, where each record consists of several attributes such as temperature, humidity, wind speed, pressure, precipitation (in millimeters), and condition (possible values are clear, snow, rain, fog, hail, and thunderstorm). For each weather station, we receive several observation records per day, which are recorded upon any significant change in any of the measured attributes.

To define the taxonomy of weather entities, we need to extract some threshold values. To do so, we used the United States observations of temperature, wind speed, and precipitation amount for rain and snow for a period of seven years, from January 2010 to January 2016, and applied the K-Means clustering algorithm [12] on each of these attributes. The obtained cluster centers are used as thresholds for these attributes. For temperature, we identified five cluster center values (in degrees Celsius): − . ◦ , − . ◦ , 6 . ◦ , 21 . ◦ , and 35 . ◦ , which we refer to as severe-cold, cold, cool, warm, and hot, respectively. For wind speed, we found three cluster centers, 13 . kmh, 36 . kmh, and 60 kmh, which we refer to as calm, moderate, and storm windy conditions, respectively. For rain, we identified three cluster centers (2.5, 7.1, and 11 . ), which we refer to as light, moderate, and heavy rainy conditions, respectively. Lastly, for snow we found three cluster centers (0.6, 1 . , and 2 . ), which we refer to as light, moderate, and heavy snowy conditions, respectively.

Given the above threshold values and the raw weather data records from August 2016 to August 2018, we processed each record to use it (if it represents an entity), merge it (if it is part of a previously found entity), or remove it (if it does not represent any entity), and defined the following taxonomy:

• Severe-Cold: extremely cold condition, with temperature ≤ − . ◦ .
• Fog: low visibility condition as a result of fog or haze.
• Hail: solid precipitation including ice pellets and hail.
• Rain: rain of any type, ranging from light to heavy.
• Snow: snow of any type, ranging from light to heavy.
• Storm: an extremely windy condition, where the wind speed is at least 60 kmh.
• Precipitation: any kind of solid or liquid deposit, but different from snow or rain. This was a generic label we frequently observed in raw weather data.

We extracted 2,178,949 weather entities for a period of two years. Table 2 provides more details on the weather data, where the most frequent entity types are “rain”, “fog”, and “snow”. Figure 2b shows the frequency distribution of weather entities by month; note that most of the entities occurred in January and the fewest in November.

Table 2: Details on the weather dataset, collected for the contiguous United States from Aug 2016 to Aug 2018.

Entity Type       Raw Count     Relative Frequency
Severe-Cold           67,285     3.09%
Fog                  454,704    20.87%
Hail                   1,252     0.06%
Rain               1,384,588    63.54%
Snow                 236,546    10.86%
Storm                 14,863     0.68%
Precipitation         19,711     0.9%
Total              2,178,949   100%
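The threshold extraction used for the weather taxonomy (K-Means on one attribute at a time, with the cluster centers serving as thresholds) can be illustrated with a small one-dimensional version of Lloyd's algorithm. This is a sketch under stated assumptions: the function name, the deterministic initialization, and the sample values are hypothetical, and a library implementation of K-Means would normally be used on the real observations.

```python
def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on scalar data; returns the sorted cluster centers."""
    data = sorted(values)
    # Deterministic initialization: spread the initial centers across the sorted data.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return sorted(centers)


def is_storm(wind_speed, wind_centers):
    """Storm label: wind speed at least the largest cluster center."""
    return wind_speed >= wind_centers[-1]
```

For example, clustering a list of wind-speed observations with `k=3` gives three centers, and the largest one plays the role of the storm threshold in the taxonomy above.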

5 PATTERN DISCOVERY FRAMEWORK

In this section we describe the pattern discovery framework, which consists of two major parts, one for the discovery of propagation patterns and the other for influential patterns. All the implementations in Python are available on GitHub: https://github.com/sobhan-moosavi/ShortLongTerm.

5.1 Short-term Pattern Discovery

We employed a multi-step process to discover short-term (propagation) patterns in geo-spatiotemporal data. Figure 3a illustrates the process, which includes: 1) finding child-parent relationships; 2) building relation trees; and 3) extracting frequent tree patterns.

Figure 3: Pattern discovery processes for geo-spatiotemporal data, (a) and (b); defining “before” and “after” time intervals, (c). (a) Short-term pattern discovery: geospatial entities → finding relationships → building relation trees → extracting tree patterns → outcome: propagation patterns. (b) Long-term pattern discovery: geospatial entities → extracting long entities → retrieving vicinity entities → statistical testing → outcome: influential patterns. (c) Definition of the before and after time intervals with respect to long entity L; “D” and “W” are short for Day and Week, respectively.

• Finding Child-Parent Relationships: The first step is to extract all the weakly dependent pairs of entities and define the child-parent relationship for each pair, using the definitions in Section 3.

• Building Relation Trees: The next step is to create relation trees from the extracted child-parent relations. Here, a tree is a rooted, labeled, unordered tree (see Section 3). This step results in a forest of relation trees.

• Extracting Frequent Tree Patterns: The last step is to perform frequent tree pattern mining. As described in Section 3, the goal is to extract all frequently observed unordered subtrees in our database of relation trees. Examples of such tree patterns with minimum support 75% are shown in Figure 1-(b). Here we adopt the SLEUTH algorithm, a growth-based approach proposed by Zaki [35], to extract frequent, embedded, unordered subtrees in our database of relation trees.
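The second and third steps can be illustrated with a deliberately simplified Python sketch. It is not SLEUTH: it only builds the forest from child-to-parent links and computes the support of Equation 2 for chain-shaped patterns (such as rain → accident → congestion), for which the embedded-subtree test reduces to finding the labels in order along one root-to-leaf path. The input format, the assumption that every child has a single parent, and the choice to skip isolated entities are all hypothetical.

```python
from collections import defaultdict


def build_forest(labels, parent_of):
    """Build relation trees from child -> parent links.

    labels:    node id -> entity type
    parent_of: child id -> parent id (assumes one parent per child)
    Returns (children, roots); isolated entities are skipped.
    """
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    roots = [n for n in labels if n not in parent_of and n in children]
    return children, roots


def contains_chain(root, children, labels, chain):
    """True if the label chain is embedded in the tree rooted at `root`.

    Each consecutive pair of the chain must map to an ancestor-descendant
    pair, so intermediate nodes may be skipped.
    """
    stack = [(root, 0)]
    while stack:
        node, matched = stack.pop()
        if labels[node] == chain[matched]:
            matched += 1  # matching as early as possible never hurts on a path
            if matched == len(chain):
                return True
        stack.extend((c, matched) for c in children.get(node, ()))
    return False


def support(chain, roots, children, labels):
    """Fraction of trees in the forest that contain the chain (Equation 2)."""
    hits = sum(contains_chain(r, children, labels, chain) for r in roots)
    return hits / len(roots)
```

A pattern would be reported as frequent when `support(...)` is at least the chosen minimum support. General unordered tree patterns with branching need a proper miner such as SLEUTH.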

In this section we describe the process of long pattern discovery, which studies the magnitude of the impact of long-term entities. This process (shown in Figure 3b) consists of three steps: 1) extracting long-term entities, 2) retrieving vicinity entities, and 3) performing statistical significance testing to explore influential patterns.

• Extracting Long Entities: A long (or long-term) entity is one that lasts for a long time interval, defined by a heuristic threshold. To define this threshold, we first obtain the distribution of entity durations over the input dataset, and then take the 99th percentile of the distribution as the threshold that defines long entities. Next, we resolve temporal and spatial overlaps between long geospatial entities using Algorithm 1, which identifies and merges overlaps. In this algorithm, we first identify all the conflicting cases for an entity l (lines 2–7); then merge the conflicting entities by updating the time, location, and type of l (lines 8–11); and finally update the list of long entities (line 12). Function co-occurrence(.) checks the time overlap between two entities, and function collocation(.) checks geographical collocation using distance threshold ρ.

• Retrieving Vicinity Entities: After extracting long entities and resolving overlaps, we retrieve the entities in the vicinity of each long entity. Given a long entity L, we need to find the subsets $S_R$, $S_R^{\text{before}}$, and $S_R^{\text{after}}$ as follows (R is the maximum vicinity distance):
  – $S_R$: all geospatial entities that happened within distance R of L, started strictly after the start time of L, and finished before the end time of L.
  – $S_R^{\text{before}}$: similar to the previous set, except that we pick a different time interval to define vicinity, as shown in Figure 3c. We move the start and end time of L to W + D days earlier, where W stands for one week and D is the duration of L in days. In that interval, we extract all the entities that happened within vicinity distance R of L.
  – $S_R^{\text{after}}$: similar to the previous set, except that we move the start and end time of L to W + D days later.

• Mining Patterns by Statistical Testing: Given the set of long entities, we first categorize them into disjoint buckets based on a common characteristic or criterion (e.g., their location or their type). Then, for each bucket, we compare the values of $S_R$, $S_R^{\text{before}}$, and $S_R^{\text{after}}$ over all long entities to determine whether there is any significant difference, and therefore an impact. For this purpose, we design six testing scenarios and use the two-sample t-test to test the difference between sample means. For a bucket B, we first calculate the mean values $\mu_L$, $\mu_{before}$, and $\mu_{after}$ as the averages of $S_R$, $S_R^{\text{before}}$, and $S_R^{\text{after}}$, respectively (measured by the number of entities in each set), over the long entities in bucket B. Further, we take the average of $S_R^{\text{before}}$ and $S_R^{\text{after}}$ for each long entity in bucket B, and denote the average of these per-entity averages by $\mu_{avg}$. We then define the following tests for bucket B:
  – $T_1$: $\mu_{avg} = \mu_L$ versus $\mu_{avg} < \mu_L$. A one-sided test that examines whether the number of geospatial entities during a long entity is larger than the number when there is no such long entity.
  – $T_2$: $\mu_{avg} = \mu_L$ versus $\mu_{avg} > \mu_L$. Similar to the previous one, but with the opposite alternative hypothesis.
  – $T_3$: $\mu_{before} = \mu_L$ versus $\mu_{before} < \mu_L$. A one-sided test that examines whether the number of geospatial entities during a long entity is larger than before the long entity has started.
  – $T_4$: $\mu_{before} = \mu_L$ versus $\mu_{before} > \mu_L$. Similar to the previous one, but with the opposite alternative hypothesis.
  – $T_5$: $\mu_{after} = \mu_L$ versus $\mu_{after} < \mu_L$. A one-sided test that examines whether the number of geospatial entities during a long entity is larger than after the long entity has ended.
  – $T_6$: $\mu_{after} = \mu_L$ versus $\mu_{after} > \mu_L$. Similar to the previous one, but with the opposite alternative hypothesis.

Note that in all of the above tests, the first condition is the null hypothesis and the second one is the alternative hypothesis.

Algorithm 1:

Merge Geospatial Overlaps

Input: Long entity set L, and distance threshold ρ
for l in L do
    List = []
    for l′ in L do
        if co-occurrence(l, l′) and collocation(l, l′, ρ) then
            List.add(l′)
        end
    end
    l.StartTime = min over e ∈ List of StartTime(e)
    l.EndTime = max over e ∈ List of EndTime(e)
    l.location = center over e ∈ List of (List)
    l.Type = concat over e ∈ List of (e.Type)
    L = L − List
end
Output: L

In this section, we describe how the proposed framework was employed to perform pattern discovery. We start with short-term pattern discovery, and then describe the results for long-term pattern discovery.
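The merge step of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Entity` fields, the haversine distance in miles, the mean of the coordinates as the "center", and the sorted, underscore-joined type label are all choices made for this example. The sketch also keeps the merged entity in the output, which is one reading of the final update step.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Entity:
    start: float  # start time, e.g. minutes since some epoch (assumed unit)
    end: float    # end time, same unit
    lat: float
    lon: float
    type: str


def haversine_miles(a: Entity, b: Entity) -> float:
    """Great-circle distance between two entities, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(h))  # 3958.8 mi is the mean Earth radius


def co_occurrence(a: Entity, b: Entity) -> bool:
    """True when the two time intervals overlap."""
    return a.start <= b.end and b.start <= a.end


def collocation(a: Entity, b: Entity, rho: float) -> bool:
    """True when the two entities lie within rho miles of each other."""
    return haversine_miles(a, b) <= rho


def merge_overlaps(entities: list[Entity], rho: float) -> list[Entity]:
    """Single-pass merge of long entities that conflict in time and space."""
    remaining = list(entities)
    merged = []
    while remaining:
        current = remaining.pop(0)
        group, rest = [current], []
        for other in remaining:
            # A conflict means overlapping in time AND close in space.
            if co_occurrence(current, other) and collocation(current, other, rho):
                group.append(other)
            else:
                rest.append(other)
        remaining = rest
        merged.append(Entity(
            start=min(e.start for e in group),
            end=max(e.end for e in group),
            lat=sum(e.lat for e in group) / len(group),  # assumed "center"
            lon=sum(e.lon for e in group) / len(group),
            type="_".join(sorted({e.type for e in group})),  # assumed label rule
        ))
    return merged
```

Like the pseudocode, this is a single pass: a merged entity is not re-checked against entities that did not conflict with the original one, so it does not compute a transitive closure of overlaps.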

KDD ’19, August 4–8, 2019, Anchorage, AK, USA. S. Moosavi et al.

Node labels of the 12 panels of Figure 4:
Pattern 1: Cng, Cng
Pattern 2: Cng, Cng, Cng
Pattern 3: Rain, Cng
Pattern 4: Cng, Cng, Cng
Pattern 5: Snw, Cng
Pattern 6: Snw, Cng, Cng
Pattern 7: Fog, Cng
Pattern 8: Cng, Cng, Cng, Cng
Pattern 9: Rain, Cng, Cng
Pattern 10: Cng, Cng, Cng, Cng
Pattern 11: Cng, Cng, Cng, Cng
Pattern 12: Rain, Acd

Figure 4: Top frequent embedded tree patterns found based on the short-term dependency relation trees. “Cng”, “Acd”, and “Snw” are short for congestion, accident, and snow. These patterns show the propagation of traffic/weather entities on a short-term basis.

First, we extracted all short-term child-parent relationships using a distance threshold D-thresh (in meters) and a time threshold T-thresh (in minutes); see Section 3. For entity type snow, we empirically set a separate T-thresh, because we expect to see a longer impact of snow on traffic flow. These thresholds were found empirically: D-thresh ensures spatial closeness, and T-thresh is large enough to accommodate the delay in a “cause and effect” type of relationship in our application domain. Using these settings, we found 5 […] 659 traffic and weather entities. In total, 39.33% of the traffic entities were found to have at least one weakly dependent weather or traffic entity, and 12. […] 637 trees out of 5 […] 729 child-parent relations. The maximum number of nodes in a tree was 25. Where a traffic entity t had more than one parent, we randomly picked one of them. Given the size of the data and the number of trees, we do not believe that any existing frequent pattern would be missed by this choice.

Finally, we employed SLEUTH (Zaki [35]) to extract frequent tree patterns at the city level. More scalable alternatives [26, 27] were an option but were not required for our purpose. After extracting frequent patterns for a city, we used these patterns as core frequent patterns of the corresponding state. This allows us to account for the potential diversity among different cities in a state (i.e., based on population, traffic, and/or weather conditions). Had we instead chosen core patterns at the state level, the framework might not identify patterns that are frequent in one city but infrequent in the others. To choose the minimum-support value, given the large size of the data and the potential seasonality in observations, we followed the approach proposed by Fournier-Viger [10]. Based on this approach, we used Equation 3 to find the minimum support, where a, b, and c are positive constants that we set empirically (0. […] 5, and 0.05, respectively). In this formula, x is the number of relation trees in a set, and the minimum relative support is 5%.

$$\text{min\_sup} = e^{-(ax + b)} + c \qquad (3)$$

Using the above setting, we extracted 708 frequent tree patterns for the contiguous United States. In total, there were 90 unique frequent patterns, with the minimum number of nodes in a tree pattern being 2 and the maximum being 7. Figure 4 shows the top 12 frequent tree patterns (see https://bit.ly/2Ef8tu7 for the list of all short-term frequent patterns in our data). Along with each pattern is shown the number of states which have occurrences of that pattern, the average support value, the peak time for instances of the pattern, and the type of road network in which instances of the pattern were common. A road network can be a road inside a city (cities), an interstate or freeway that connects different cities or states, or a mixture of both. Each pattern shows how short-term entities propagate in a region. For instance, pattern 1 shows a congestion that caused another congestion, and pattern 2 shows a propagation pattern of a chain of traffic congestion entities.

In total, 50 of the 90 unique frequent patterns were initiated by a weather event: 17 by rain, 14 by snow, 11 by fog, and 8 by other types of weather entities. These observations demonstrate the significant impact of weather on traffic. While this has been frequently discussed in prior research [2, 5, 6], our work reveals the propagation patterns that show how these weather entities impact traffic. For example, snow-initiated patterns usually happen on interstates and freeways, while rain-initiated patterns happen on roadways inside cities. Also, most complex congestion-related patterns happen within city road networks, and the average support of patterns that happen in a city is lower than that of patterns that happen on interstates and freeways or across the entire road network. The peak time for the majority of congestion-related patterns was the afternoon rush hour. For weather-initiated patterns (except for the rain-related cases), the peak time was the morning rush hour. It was interesting to note that some weather events caused more traffic issues in the morning than in the afternoon.

To further analyze the short-term patterns, we created a one-hot vector of size 90 for each state, which represents the presence or absence of each unique short-term pattern. By applying K-means clustering [12] to these vectors, we categorized the states based on their short-term propagation patterns. To find the best number of clusters, we adapted Description Length (DL) for K-Means [9], which is given by Equation 4. In this equation, p(.) is the probability density function based on the distance of each data point x from its cluster center $c_x$; P is the number of parameters of the distribution function; K is the number of clusters; and X is the set of all data points. Assuming that the distance from cluster centers follows a Gaussian distribution, we have P = 2. By choosing K from a range of candidate values, we found the optimal number of clusters to be 4, which provides the minimum description length.

$$DL(K) = -\sum_{x \in X} \log\big(p(\lVert x - c_x \rVert)\big) + P \log(|X|) + K \log(|X|) \qquad (4)$$

Table 3:

Clustering of 49 states into 4 clusters based on their short-term patterns, using K-Means.

| Cluster | States |
|---|---|
| Cluster 1 | AL, AR, CT, DC, DE, IA, IN, LA, MA, ME, MI, MN, MO, MS, NE, NH, OH, OK, RI, SD, TN, VT, WI |
| Cluster 2 | AZ, CO, ID, KS, KY, MD, MT, NC, ND, NJ, NM, NV, OR, PA, SC, UT, VA, WV, WY |
| Cluster 3 | FL, GA, IL, NY, TX, WA |
| Cluster 4 | CA |
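The model-selection step behind this clustering (Equation 4) can be sketched as below. This is a hedged, standard-library illustration rather than the authors' code: the small Lloyd-style `kmeans`, its seeded initialization, and the choice to fit a single Gaussian to all point-to-center distances by maximum likelihood are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]

    def assign():
        # Label each point with the index of its nearest center.
        return [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]

    for _ in range(iters):
        labels = assign()
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign()


def description_length(points, centers, labels, n_params=2):
    """DL(K) = -sum log p(||x - c_x||) + P log|X| + K log|X| (Equation 4)."""
    dists = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
    mu = sum(dists) / len(dists)
    var = sum((d - mu) ** 2 for d in dists) / len(dists)
    var = max(var, 1e-9)  # guard against a degenerate (zero-variance) fit
    log_lik = sum(
        -0.5 * math.log(2 * math.pi * var) - (d - mu) ** 2 / (2 * var)
        for d in dists
    )
    n, k = len(points), len(centers)
    return -log_lik + n_params * math.log(n) + k * math.log(n)


def best_k(points, candidates):
    """Return the candidate K with the smallest description length."""
    return min(candidates, key=lambda k: description_length(points, *kmeans(points, k)))
```

With one 90-dimensional 0/1 tuple per state in `points`, a call such as `best_k(points, range(1, 11))` would scan ten candidate values; that range is a placeholder, not the one used in the paper.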

Table 3 shows the result of clustering, in which we profile the clusters as follows:

• Cluster 1: mostly contains states with fewer traffic incidents (as related to weather). These are either states with a lower population (e.g., NE, SD, etc.) or states where the impact of weather is mitigated by effective road crews (e.g., OH, MN, etc.).
• Cluster 2: mostly contains states with considerably more traffic issues than the states in cluster 1. Distinctive patterns observed only for this cluster are chains of accidents and complex snow-initiated patterns.
• Cluster 3: contains states with at least one major city with significant traffic issues. Distinctive patterns observed for this cluster are those initiated by construction, rain, severe cold, and storm.
• Cluster 4: contains only one state, whose traffic patterns bore no similarity to those of any other state. The majority of the distinctive patterns of this cluster are complex congestion-related, fog-initiated, and flow-incident-related ones.

It is worth noting that the states that were clustered together are not necessarily located in the same geographical region, and might not have the same weather conditions during the different seasons. However, their traffic propagation patterns and the impact of weather on their traffic were found to be the same, which led to them being in the same cluster.
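For reference, the adaptive minimum support of Equation 3, which was used when mining the patterns that feed this clustering, is easy to evaluate directly. The sketch below is illustrative only: the values passed for `a` and `b` are hypothetical placeholders, and only `c = 0.05` (the 5% minimum relative support) is taken from the text.

```python
import math


def min_support(x, a, b, c=0.05):
    """Equation 3: min_sup = exp(-(a*x + b)) + c.

    x is the number of relation trees in a set; a, b, c are positive
    constants. The threshold decays toward the floor c as x grows.
    """
    return math.exp(-(a * x + b)) + c


# Hypothetical constants, chosen only to show the shape of the curve.
for n_trees in (10, 1_000, 100_000):
    print(n_trees, round(min_support(n_trees, a=0.001, b=1.0), 4))
```

The design intent is visible in the limit: small sets of trees get a stricter relative support, while very large sets settle at the 5% floor.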

As described in Section 5.2, we first use the 99th percentile of the distribution of entity durations across the entire dataset as the threshold to extract long entities, which results in a threshold of about 300 minutes. Using this threshold, we extracted 280,649 long entities. To merge the overlaps with Algorithm 1, we set ρ = R, where R is the spatial neighborhood distance used to define the sets $S_R$, $S_R^{\text{before}}$, and $S_R^{\text{after}}$. In this way, we ensure that after the merge no two long entities have overlapping spatial neighborhoods. Next we describe how we determine R and then merge the overlaps.

Extracting R for long traffic entities. To determine R, we use a random sample S of two million traffic entities, and apply DBSCAN [12] to cluster the entities in S. We take the radius of each cluster to be the maximum distance from its center, and obtain R as the average radius across all clusters. To set the two DBSCAN parameters, ϵ (the maximum neighborhood distance) and minPts (the minimum number of neighbors required for a point not to be an outlier), we use Algorithm 2, adapted from [23]. Using a random sample set of 0. […] S, we obtained ϵ = […] miles and minPts = […]; clustering S resulted in 191 clusters, with the average radius R of these clusters being 14.03 miles.

Note that we cannot quantitatively define R for long weather entities. Thus, we define a traffic entity t to be within the R-neighborhood of a long-term weather entity w if t's zip code can be mapped to the airport station from which w is reported, as the closest station. With ρ = 14.03, and after merging the overlaps, we ended up with 148,[…] long entities, and some of the types were combined to generate new type labels (e.g., Rain_Event). Next, setting R = 14.03, we created the vicinity sets $S_R$, $S_R^{\text{before}}$, and $S_R^{\text{after}}$ for each long-term entity.

Algorithm 2: Finding DBSCAN Parameters
Input: S, a large sample of traffic entities.
1. In S, obtain the closest-neighbor distance for each entity, and let $C_1$ be the 99th percentile of the distribution of the closest-neighbor distances.
2. For each entity, count the number of entities within distance $C_1$, and obtain the distribution of the count values over S.
3. Let $C_2$ be the 99th percentile of the distribution of the count values.
Output: $C_1$ as ϵ, and $C_2$ as minPts

Bucketing.

Prior to employing statistical significance testing to identify long-term patterns, we need to determine the buckets of long-term entities. We use three different criteria to create disjoint buckets, namely Location, Duration, and Type. Each “Location” bucket contains all the long entities that occurred in the same state. For “Duration” buckets, we first define several duration intervals, and then assign each long entity to its interval. For “Type” buckets, each bucket contains all the long entities of the same type.

Positive and Negative Impacts. A positive (negative) impact refers to the case where the value of $S_R$ is larger (lower) than $S_R^{\text{before}}$, $S_R^{\text{after}}$, or their average. A significant positive impact can be determined by tests $T_1$, $T_3$, or $T_5$. A significant negative impact can be determined by tests $T_2$, $T_4$, or $T_6$.

Table 4:

Top 15 long entity types and their frequency.

| Type (before merge) | Frequency | Type (after merge) | Frequency |
|---|---|---|---|
| Construction | 113,984 | Rain | 38,253 |
| Rain | 41,668 | Snow | 25,820 |
| Event | 32,144 | Construction | 22,373 |
| Snow | 27,723 | Fog | 19,553 |
| Fog | 20,847 | Event | 16,753 |
| Congestion | 17,314 | Severe-Cold | 11,671 |
| Flow-Incident | 13,099 | Congestion | 3,206 |
| Severe-Cold | 12,083 | Flow-Incident | 2,381 |
| Storm | 733 | Construction_Event | 1,332 |
| Other | 440 | Construction_Rain | 758 |
| Lane-Blocked | 253 | Storm | 709 |
| Accident | 226 | Congestion_Flow-Incident | 675 |
| Precipitation | 72 | Congestion_Event | 516 |
| Broken-Vehicle | 49 | Event_Rain | 514 |
| Hail | 14 | Congestion_Construction | 359 |
| Total | 280,649 | Total | 144,873 |
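The percentile heuristic of Algorithm 2, which produced the DBSCAN parameters used to derive R, can be sketched as follows. This is an illustrative brute-force version under stated assumptions: it uses Euclidean distance via `math.dist` (real coordinates would need a geographic distance such as haversine), a nearest-rank percentile, and an O(n²) scan that is only practical on a small subsample. The function and parameter names are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def dbscan_parameters(points, q=99, dist=math.dist):
    """Return (eps, min_pts) following the three steps of Algorithm 2."""
    # Step 1: closest-neighbor distance of every point; eps is its q-th percentile.
    nearest = [
        min(dist(p, o) for j, o in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    eps = percentile(nearest, q)
    # Step 2: number of other points within eps of each point.
    counts = [
        sum(1 for j, o in enumerate(points) if j != i and dist(p, o) <= eps)
        for i, p in enumerate(points)
    ]
    # Step 3: min_pts is the q-th percentile of those counts.
    return eps, percentile(counts, q)
```

Passing a different `dist` callable is the intended hook for swapping in a geographic distance without touching the percentile logic.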

We present the identified long-term patterns in three categories, each obtained using a particular bucketing criterion.
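Before turning to the results, the six one-sided tests applied to each bucket can be sketched as below. This is a simplified stand-in, not the authors' procedure: it computes a Welch-style two-sample statistic and approximates the p-value with the standard normal distribution (a large-sample assumption, used here to stay within the standard library), whereas the paper uses a two-sample t-test. The keys `T1` to `T6` follow the order in which the six tests are listed in the methodology, and the inputs are assumed to be per-entity counts of vicinity entities.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def one_sided_p(reference, during, alternative):
    """p-value for H0: mean(reference) == mean(during).

    alternative="less"    tests mean(reference) < mean(during)  (positive impact)
    alternative="greater" tests mean(reference) > mean(during)  (negative impact)
    Assumes both samples have at least two values and are not both constant.
    """
    se = sqrt(variance(reference) / len(reference) + variance(during) / len(during))
    stat = (mean(reference) - mean(during)) / se
    cdf = NormalDist().cdf(stat)  # normal approximation to the t distribution
    return cdf if alternative == "less" else 1.0 - cdf


def bucket_tests(before, during, after):
    """Run the six tests for one bucket; each list holds one count per long entity."""
    avg = [(b + a) / 2 for b, a in zip(before, after)]
    references = {"avg": avg, "before": before, "after": after}
    results, index = {}, 1
    for name in ("avg", "before", "after"):
        for alternative in ("less", "greater"):
            results[f"T{index}"] = one_sided_p(references[name], during, alternative)
            index += 1
    return results
```

A bucket would then be reported as showing a significant positive impact at, say, the 95% level when `1 - results["T1"]` exceeds 0.95, mirroring the 1 − p-value axis used in the figures.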

Location-based Patterns. Using location as the bucketing criterion, we applied two significance tests (one for a positive and one for a negative impact) to identify patterns of the form “long-term entity in location L → more (or less) traffic incidents”. Figure 5 shows the results of these tests. Here we report 1 − p-value, and also show three confidence levels (90%, 95%, and 99%) as three red lines; the results of the other tests are not presented because they showed no different trend. For a majority of states, we observed that a long-term weather/traffic entity had a significant impact on traffic flow. In the majority of cases we found the result of the positive-impact test to be significant. Out of the 49 states, we found 30 to be significant with a confidence of 99%, 8 with a confidence of 95%, and 5 with a confidence of 90%. We also found that the existence of long-term traffic or weather entities did not have much impact on traffic flow for AZ, CA, FL, LA, NM, and TX, although three of these states (i.e., CA, FL, and TX) are the top three states by number of observed traffic entities. This observation suggests that in a state with more traffic issues, the existence of a long-term incident does not have much impact on traffic flow. Incidentally, CA was the only state for which the p-value was lower for the negative-impact test (although insignificantly so). This could imply that CA has a unique condition where a long-term weather or traffic entity causes fewer traffic issues in comparison to the time when there is no such long-term entity.

Figure 5: Statistical significance testing by the two tests for Location buckets (x-axis: state; y-axis: 1 − p-value). Red lines show three confidence levels: 90%, 95%, and 99%.

Figure 6: Statistical significance testing for Duration buckets, panels (a) to (c) (x-axis: duration in hours; y-axis: 1 − p-value). Red lines show three confidence levels: 90%, 95%, and 99%.

Figure 7: Statistical significance testing for Type buckets, panels (a) to (c) (x-axis: entity type; y-axis: 1 − p-value). Red lines show three confidence levels: 90%, 95%, and 99%. Cng, Cnst, Incd, and Cold are short for Congestion, Construction, Flow-Incident, and Severe-Cold, respectively.

Duration-based Patterns.

Using the duration of long entities as the bucketing criterion, we applied all six significance tests to identify patterns of the form “long-term entity with duration D → more (or less) traffic incidents”. Figure 6 shows the results of these tests. We conclude that the shorter the duration of a long-term entity, the more significant its impact. Also, for long-term entities that lasted for more than 40 hours, we usually do not observe any significant impact. This observation might be due to the adaptation of driving habits to the new conditions. In addition, a comparison of the before-based tests with the after-based tests provided evidence of more positive impacts relative to the after interval than relative to the before interval, for long entities that lasted more than 28 hours. Given that a majority of such long entities were construction projects (about 75%), this means that after a long construction project we tended to observe smoother traffic flow, even in comparison to the time before the construction event. We posit two potential interpretations. First, road conditions may have improved after the construction. Second, after a long construction project, a significant group of drivers may have stuck with the alternative routes they discovered while the construction was active.

Entity-type-based Patterns.

Using the type of the long entities as the bucketing criterion, we applied all six significance tests to identify patterns of the form “long-term entity of type T → more (or less) traffic incidents”. Figure 7 shows the results of these tests. Regarding the weather-based long entities, we observe a significant impact for all available types of weather entities, except for the “storm” event. However, there is an interesting diversity among the impacts of different types of weather entities. Usually for “fog”, “snow”, and “rain”, based on one pair of tests, we see a positive impact on traffic, while for “severe-cold” we observe a negative impact. This observation suggests that in extremely cold temperatures we should expect to see smoother traffic flow, probably because of fewer vehicles on the roads. The remaining tests also support this conclusion. Regarding the traffic-based long entities, we observed significant impacts by “congestion”, “event”, and “flow-incident”. In the case of a long-term “congestion”, we have a positive impact in comparison to both before and after. For “flow-incident”, we observed a similar situation. However, for a long-term “event”, we only observed a positive impact in comparison to the time after the “event” has ended. It was interesting to note that a long-term construction had almost no significant impact on traffic flow. However, based on the before-based and after-based positive-impact tests, we could expect to see more traffic issues during a long-term construction than before it or after it.

To overcome the shortcomings of the existing general-purpose spatiotemporal pattern discovery frameworks, such as relying on a simplistic definition of spatiotemporal neighborhood, we present a new framework to extract propagation as well as influential patterns in geo-spatiotemporal data using improved and novel techniques. To extract propagation patterns, which indicate immediate impacts, we use a stricter definition of spatial collocation and co-occurrence relationships to create relation trees, and then perform tree pattern mining in a forest of relation trees. Influential patterns, which show lagging impacts, capture the impact of long-lived geospatial entities on their neighborhood, and we used statistical techniques to identify such patterns. Using a new and unique geo-spatiotemporal dataset of traffic and weather entities, which was collected, processed, and augmented for the contiguous United States over two years, we found 90 prevalent propagation patterns, of which 50 were initiated by weather (mostly observed in the morning) and the rest by traffic entities (mostly observed in the afternoon). Based on these patterns, we identified four categories of US states, which show similarity of driving behavior and transportation infrastructure between different states. We also studied the lagging impact of long-term traffic or weather entities with respect to the location, duration, and type of the entities. Interestingly, we identified a positive impact of long-term entities in a majority of the states, with a few exceptions such as CA, FL, and TX. In general, we found that long-term entities that lasted for at most 40 hours have the maximum impact on traffic flow. We found that long-term congestion, snow, rain, fog, severe cold, and flow-incidents cause the most significant lagging impact on traffic flow. In terms of future research, we plan to separately study the lagging impact of different entity types for different states and top cities.

ACKNOWLEDGMENT

This work is supported by a grant from the NSF (EAR-1520870) and one from the Ohio Supercomputer Center (PAS0536). Any findings and opinions are those of the authors. We also thank Mr. Abbas Shakiba for the preliminary discussions.

REFERENCES

[1] Anna K Andersson and Lee Chapman. 2011. The impact of climate change on winter road maintenance and traffic accidents in West Midlands, UK. Accident Analysis & Prevention 43, 1 (2011), 284–289.
[2] Tom Brijs, Dimitris Karlis, and Geert Wets. 2008. Studying the effect of weather conditions on daily crash counts using a discrete time-series model. Accident Analysis & Prevention 40, 3 (2008), 1180–1190.
[3] Mete Celik. 2015. Partial spatio-temporal co-occurrence pattern mining. Knowledge and Information Systems 44, 1 (2015), 27–49.
[4] Mete Celik, Shashi Shekhar, James P Rogers, and James A Shine. 2008. Mixed-Drove Spatio-Temporal Co-occurrence Pattern Mining. network 11 (2008), 15.
[5] Mario Cools, Elke Moons, and Geert Wets. 2010. Assessing the impact of weather on traffic intensity. Weather, Climate, and Society 2, 1 (2010), 60–68.
[6] Ye Ding, Yanhua Li, Ke Deng, Haoyu Tan, Mingxuan Yuan, and Lionel M Ni. 2017. Detecting and analyzing urban regions with high impact of weather change on transport. IEEE Transactions on Big Data 3, 2 (2017), 126–139.
[7] Daniel Eisenberg. 2004. The mixed effects of precipitation on traffic crashes. Accident Analysis & Prevention 36, 4 (2004), 637–647.
[8] SR Ely. 1990. RDS-ALERT: a DRIVE project to develop a proposed standard for the Traffic Message Channel feature of the radio data system RDS. In Car and its Environment-What DRIVE and PROMETHEUS Have to Offer, IEE Colloquium on. IET, 8–1.
[9] Erik Erlandson. 2016. http://erikerlandson.github.io/blog/2016/08/03/x-medoids-using-minimum-description-length-to-identify-the-k-in-k-medoids/. (2016). Accessed: 2019-01-31.
[10] Philippe Fournier-Viger. 2010. Un modèle hybride pour le support à l’apprentissage dans les domaines procéduraux et mal définis. Ph.D. Dissertation. Université du Québec à Montréal.
[11] Derrick Hambly, Jean Andrey, Brian Mills, and Chris Fletcher. 2013. Projected implications of climate change for road safety in Greater Vancouver, Canada. Climatic Change […]
[12] […] Data mining: concepts and techniques. Elsevier.
[13] Haversine. 2019. https://en.wikipedia.org/wiki/Haversine_formula. (2019). Accessed: 2019-01-31.
[14] Yan Huang, Shashi Shekhar, and Hui Xiong. 2004. Discovering colocation patterns from spatial data sets: a general approach. IEEE Transactions on Knowledge and Data Engineering 16, 12 (2004), 1472–1485.
[15] Ryo Inoue, Akihisa Miyashita, and Masatoshi Sugita. 2016. Mining spatio-temporal patterns of congested traffic in urban areas from traffic sensor data. In Intelligent Transportation Systems (ITSC), 2016 IEEE 19th International Conference on. IEEE, 731–736.
[16] David Jaroszweski and Tom McNamara. 2014. The influence of rainfall on road accidents in urban areas: A weather radar approach. Travel Behaviour and Society 1, 1 (2014), 15–21.
[17] Tanvi Jindal, Prasanna Giridhar, Lu-An Tang, Jun Li, and Jiawei Han. 2013. Spatiotemporal periodical pattern mining in traffic data. In Proceedings of the 2nd ACM SIGKDD International Workshop on Urban Computing. ACM, 11.
[18] Sachin Kumar and Durga Toshniwal. 2016. A data mining approach to characterize road accident locations. Journal of Modern Transportation 24, 1 (2016), 62–72.
[19] Wei Liu, Yu Zheng, Sanjay Chawla, Jing Yuan, and Xie Xing. 2011. Discovering spatio-temporal causal interactions in traffic data streams. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1010–1018.
[20] Pradeep Mohan, Shashi Shekhar, James A Shine, and James P Rogers. 2010. Cascading spatio-temporal pattern discovery: A summary of results. In Proceedings of the 2010 SIAM International Conference on Data Mining. SIAM, 327–338.
[21] Pradeep Mohan, Shashi Shekhar, James A Shine, and James P Rogers. 2012. Cascading spatio-temporal pattern discovery. IEEE Transactions on Knowledge and Data Engineering 24, 11 (2012), 1977–1992.
[22] Nominatim. 2019. https://wiki.openstreetmap.org/wiki/Nominatim. (2019). Accessed: 2019-01-31.
[23] DBSCAN parameter tuning. 2019. https://github.com/alitouka/spark_dbscan/wiki/Choosing-parameters-of-DBSCAN-algorithm/. (2019). Accessed: 2019.
[24] Feng Qian, Qinming He, and Jiangfeng He. 2009. Mining spread patterns of spatio-temporal co-occurrences over zones. In International Conference on Computational Science and Its Applications. Springer, 677–692.
[25] Shashi Shekhar, Zhe Jiang, Reem Y Ali, Emre Eftelioglu, Xun Tang, Venkata Gunturi, and Xun Zhou. 2015. Spatiotemporal data mining: a computational perspective. ISPRS International Journal of Geo-Information 4, 4 (2015), 2306–2338.
[26] Shirish Tatikonda and Srinivasan Parthasarathy. 2009. Mining Tree-Structured Data on Multicore Systems. PVLDB 2, 1 (2009), 694–705.
[27] Shirish Tatikonda, Srinivasan Parthasarathy, and Tahsin Kurc. 2006. TRIPS and TIDES: new algorithms for tree mining. In Proceedings of the 15th ACM international conference on Information and knowledge management. ACM, 455–464.
[28] Athanasios Theofilatos. 2017. Incorporating real-time traffic and weather data to explore road accident likelihood and severity in urban arterials. Journal of Safety Research 61 (2017), 9–21.
[29] Athanasios Theofilatos and George Yannis. 2014. A review of the effect of traffic and weather characteristics on road safety. Accident Analysis & Prevention […]
[30] […] Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 81.
[31] Ling Wang, Qi Shi, and Mohamed Abdel-Aty. 2015. Predicting crashes on expressway ramps with real-time traffic and weather data. Transportation Research Record: Journal of the Transportation Research Board […]
[33] […] Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 716–721.
[34] Wenhao Yu. 2016. Spatial co-location pattern mining for location-based services in road networks. Expert Systems with Applications 46 (2016), 324–335.
[35] Mohammed J Zaki. 2005. Efficiently mining frequent embedded unordered trees.