A Case Study on Visualizing Large Spatial Datasets in a Web-based Map Viewer
Alejandro Cortiñas, Miguel R. Luaces, and Tirso V. Rodeiro
Laboratorio de Bases de Datos, Universidade da Coruña, A Coruña, Spain
{alejandro.cortinas, luaces, tirso.varela.rodeiro}@udc.es

⋆ This work has been funded by Xunta de Galicia/FEDER-UE CSI: ED431G/01; GRC: ED431C 2017/58. MINECO-CDTI/FEDER-UE CIEN LPS-BIGGER: IDI-20141259; INNTERCONECTA uForest: ITC-20161074. MINECO-AEI/FEDER-UE Datos 4.0: TIN2016-78011-C4-1-R; Flatcity: TIN2016-77158-C4-3-R. EU H2020 MSCA RISE BIRDS: 690941.

Abstract.
Lately, many companies are using Mobile Workforce Management technologies combined with information collected by sensors from mobile devices in order to improve their business processes. Even for small companies, the information that needs to be handled grows at a high rate, and most of the data collected have a geographic dimension. Being able to visualize these data in real time within a map viewer is very important for these companies. In this paper we focus on this topic, presenting a case study on visualizing large spatial datasets. In particular, since most Mobile Workforce Management software is web-based, we propose a solution suitable for this environment.
Keywords: spatial big data, web-based GIS, software architectures
1 Introduction

Mobile Workforce Management (MWM) technologies are increasingly being used by companies to manage and optimize their workers' task schedules and to improve the performance of their business processes [2]. These technologies, used in combination with the information collected by current mobile technology (e.g., the geographic position using a GPS receiver, or the user activity using an accelerometer), are useful to detect patterns in the past activity of workers, or to predict trends that can improve future scheduling.

Datasets produced by mobile sensing and MWM technologies are large and complex. As an example, consider a small package delivery company with a fleet of 100 vehicles, each one producing a GPS position every 10 seconds (64 bytes, taking into account a device id, a timestamp, three geographic coordinates, speed, bearing, and accuracy). Supposing that each vehicle is active 8 hours per day, each one would produce 2,880 events generating 184,320 bytes of data every day, and the company would require over 17 MB of storage per day. Larger systems (e.g., MRW, a Spanish package delivery company, declares to have more than 3,300 vehicles) or the inclusion of additional sensor data (such as accelerometer data) would produce even larger datasets.
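As a back-of-the-envelope check, the following TypeScript snippet reproduces the estimate above; the 64-byte event size, the 10-second reporting interval, and the 8-hour workday are the figures stated in the text, the rest is plain arithmetic.

```typescript
// Rough storage estimate for the example fleet described above.
const bytesPerEvent = 64;      // device id, timestamp, x/y/z, speed, bearing, accuracy
const reportIntervalSec = 10;  // one GPS position every 10 seconds
const activeHoursPerDay = 8;
const vehicles = 100;

const eventsPerVehiclePerDay = (activeHoursPerDay * 3600) / reportIntervalSec; // 2,880
const bytesPerVehiclePerDay = eventsPerVehiclePerDay * bytesPerEvent;          // 184,320
const fleetMBPerDay = (vehicles * bytesPerVehiclePerDay) / (1024 * 1024);      // ~17.6 MB

console.log(eventsPerVehiclePerDay, bytesPerVehiclePerDay, fleetMBPerDay.toFixed(1));
```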
MWM technologies often require web-based dashboards to visualize and query the information stored in the system. Moreover, given that the information is of a geographic nature, these dashboards require GIS technology such as map servers and map viewers. Nah cites in [3] a number of studies proposing that web users accept waiting between 1 and 42 seconds for a web page to load, but concludes that, considering purposeful browsing for information retrieval tasks as opposed to open browsing, most users are willing to wait only about two seconds. Even though the study considers users browsing the web rather than using a web-based dashboard, we believe that a waiting interval of two seconds for a page refresh is sensible.

Data management technologies have been evolving during the last years to support horizontal scaling and distributed processing. Hence, storing and querying large geographic datasets can be achieved using different technologies. However, choosing the most appropriate technology to support these usage scenarios is a complex task. Furthermore, current web-based GIS technology is not designed to achieve browsing of large datasets with a latency of less than 2 seconds. For example, middleware software such as map servers has little support for NoSQL technologies, and visualization software such as map viewers aggregates geographic information on the client side, thus requiring large datasets to be transferred over the network and processed in the web browser. Hence, in order to support the visualization of large geographic datasets, middleware components and map viewers must support querying and aggregating geographic data using distributed processing systems.

In this paper, we present a case study on visualizing large spatial datasets in a web-based map viewer. We aim at identifying the most suitable technology, proposing an alternative to achieve data visualization with a latency smaller than two seconds. In Sect. 2 we describe our previous work and the system architecture that we propose. In Sect. 3 we present the research questions that we want to answer with the case study and the evaluation methodology. In Sect. 4 we show the experiments that we have performed and the results we have achieved. Finally, in Sect. 5 we present our conclusions and future work.

2 System Architecture

We have presented in a previous paper [1] the architecture of a system to store, query, and visualize on the web large datasets of geographic information (see Fig. 1). The architecture includes a component to simulate a large number of drivers that circulate through a road network and report their position to the server on a regular basis (Route Simulator). In addition, the architecture provides a Storage System with exchangeable storage subsystems so that they can be tested under the same load conditions and their performance evaluated with the same queries.
Fig. 1: System architecture (components: Route Simulator, Storage System, Query System; interfaces: DataWriter, DataStorage [1..*], DataProvider, DataRetriever)

{
  "driver_id": 3,
  "position": {
    "x": -4.013856742,
    "y": 40.358347874,
    "z": 517,
    "speed": 32.48,
    "bearing": 83.6,
    "accuracy": 4.5509996
  },
  "timestamp": 1513763866,
  "data": { ...additional data in JSON format... }
}

Fig. 2: Example of an event received by the Storage System component
Fig. 2 shows an example of an event received and stored by the system. It consists of the driver id, the GPS position of the worker, the timestamp of the position, and additional information in JSON format that is specific to the particular domain for which the architecture is being used. Finally, the architecture also includes a component to solve queries and cluster the data that is visualized in a web-based map viewer (Query System).
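For illustration, the event of Fig. 2 could be described with the following TypeScript type. The field names follow the JSON example; the type itself is our own sketch and not part of the published architecture.

```typescript
// Shape of a single event reported by a driver (see Fig. 2).
interface DriverEvent {
  driver_id: number;
  position: {
    x: number;        // longitude in degrees
    y: number;        // latitude in degrees
    z: number;        // altitude in meters
    speed: number;
    bearing: number;
    accuracy: number;
  };
  timestamp: number;  // Unix timestamp in seconds
  data?: Record<string, unknown>; // domain-specific payload in JSON format
}
```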
Figure 3 shows a detailed view of the querying architecture components. The components with a gray background are third-party components that are used without modifications. The communication with the Storage System component is managed by a component that implements the generic DataRetriever interface. We have currently implemented three alternatives: one that retrieves the events from Postgres + PostGIS (https://postgis.net/), the component PostgreSQL Retriever; another one that retrieves the data from MongoDB, the component MongoDB Retriever; and another one that retrieves the data from Druid [4] (http://druid.io/), the component Druid Retriever. Queries are sent from a Web Map Viewer component, implemented using Leaflet, by a client-side component called LeafletDataLayer that implements the Layer interface of Leaflet. A server-side component called Leaflet Backend receives the queries, delegates them to the appropriate data retrieving component, and sends back the results to the client side.
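A minimal sketch of what the client-side LeafletDataLayer could look like is shown below. It queries a hypothetical /aggregate endpoint exposed by the Leaflet Backend whenever the view changes; the endpoint name, types, and rendering are assumptions for illustration, not the actual implementation.

```typescript
import * as L from 'leaflet';

// Client-side layer that asks the Leaflet Backend for pre-aggregated points
// whenever the map view changes.
class LeafletDataLayer extends L.Layer {
  private markers = L.layerGroup();
  private map?: L.Map;

  onAdd(map: L.Map): this {
    this.map = map;
    this.markers.addTo(map);
    map.on('moveend zoomend', this.refresh, this);
    this.refresh();
    return this;
  }

  onRemove(map: L.Map): this {
    map.off('moveend zoomend', this.refresh, this);
    this.markers.remove();
    return this;
  }

  private async refresh(): Promise<void> {
    if (!this.map) return;
    const b = this.map.getBounds();
    const query = new URLSearchParams({
      xmin: String(b.getWest()), ymin: String(b.getSouth()),
      xmax: String(b.getEast()), ymax: String(b.getNorth()),
      zoom: String(this.map.getZoom()),
    });
    // Each result is an aggregated cluster: a representative location and a count.
    const clusters: { x: number; y: number; count: number }[] =
      await (await fetch(`/aggregate?${query}`)).json();

    this.markers.clearLayers();
    for (const c of clusters) {
      this.markers.addLayer(L.marker([c.y, c.x]).bindTooltip(String(c.count)));
    }
  }
}
```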
Fig. 3: Detailed querying architecture (Storage System: PostgreSQL with PostGIS, MongoDB with its query language; Query System: PostgreSQL Retriever via JDBC, MongoDB Retriever via the MongoDB Java Driver, Druid Retriever via a REST client, the DataRetriever interface, and the Leaflet Backend; Web Map Viewer: Leaflet, LeafletDataLayer, DataQuerying)
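To make the role of the DataRetriever interface concrete, the following TypeScript sketch shows one possible contract and the delegation performed by the Leaflet Backend. The actual retrievers are implemented against the specific stores (JDBC, MongoDB Java Driver, Druid REST API), so the names and signatures below are our own assumptions for illustration only.

```typescript
// Spatio-temporal range of a query, as sent by the Web Map Viewer.
interface QueryRange {
  xmin: number; ymin: number; xmax: number; ymax: number;
  tmin?: number; tmax?: number;   // optional time range
  zoom: number;                   // selects the precomputed aggregation level
}

// An aggregated element: a discretized location and how many events fall on it.
interface Cluster { x: number; y: number; count: number; }

// Generic contract implemented by the PostgreSQL, MongoDB, and Druid retrievers.
interface DataRetriever {
  aggregate(range: QueryRange): Promise<Cluster[]>;
}

// The Leaflet Backend simply delegates to whichever retriever is configured.
class LeafletBackend {
  constructor(private retriever: DataRetriever) {}

  handleAggregateRequest(range: QueryRange): Promise<Cluster[]> {
    return this.retriever.aggregate(range);
  }
}
```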
Considering that having a fluid visualization on the client side is a very important requirement, we have to aggregate the points on the server side of the application and send to the client side only the result of the aggregation, instead of transferring large collections of individual geographic points to be aggregated on the client side. Furthermore, considering that the user will define specific spatial and temporal ranges for the set of events that have to be retrieved, by means of zoom and pan operations on a map and a time range control, precomputed clusters cannot be used because the variation among queries is too large. The simplest alternative is to perform the query "get all points in the range (xmin, ymin, tmin) - (xmax, ymax, tmax)" and apply a clustering algorithm on the result, but it is a costly solution in terms of computation requirements. Instead, taking into account that in a geographic reference system where the coordinates represent longitude and latitude in degrees a value with an accuracy of 9 decimals represents a maximum of 1 millimeter on the surface of the Earth, in [1] we proposed to store 7 additional versions of the same geographic point with 7 different precisions (between 2 and 8 decimals). This makes the process of clustering as simple as grouping the events by equal values of coordinates and counting the number of elements. Moreover, computing additional versions of each geographic point is affordable in terms of storage cost and insertion time.

Our tests in [1] revealed that this approach cannot achieve a constant time in aggregation queries because truncating a decimal means that one point in a level of aggregation represents one hundred points in the next level of aggregation. Thus, the difference between the different levels of aggregation is too high. Furthermore, the aggregated data required another brief aggregation step on the client side in order to draw the different aggregated elements and make the map look nice to the user. Hence, we decided to follow a different approach to determine the aggregation levels, taking the final visualization into account.

When a user navigates in a web map viewer, it sends queries to the server depending on the current view to retrieve the data that has to be shown. Each of these queries is associated with the bounding box of the current view, that is, the maximum and minimum latitude and longitude of the view. Regarding aggregation, there is another parameter that affects the actual representation of the aggregated elements in the map viewer: the zoom or scale of the current view. If we are seeing the map with very little zoom, the map viewer needs to aggregate more in order to show a suitable view, and the other way around.

We propose to compute discretized versions of the geographic points according to the zoom level. We consider 18 different zoom levels, which is common in GIS visualization, so we store, for each point, 18 alternative versions of it. For each level of zoom, the separation between aggregated elements is calculated using Formula 1, which maintains the same on-screen distance between the aggregated elements independently of the zoom level in a map viewer. For example, for zoom level 11, the aggregated elements should be separated 0.043945312 degrees.
When we store the alternative version of a point for zoom level 11, we calculate the closest multiple of 0.043945312 to both the latitude and the longitude of the point. All the points that need to be aggregated at zoom level 11 thus share the same alternative location. Obviously, this approach makes each point take much more space, but since we are focusing on fluid visualization, we accept this drawback.

\[
separation(zoom) =
\begin{cases}
90 & \text{if } zoom = 0 \\
separation(zoom - 1)/2 & \text{if } zoom > 0
\end{cases}
\qquad (1)
\]

We are not considering different separations for latitude and longitude because, for our case study location, Spain, using the same separation for both is adequate.
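A small TypeScript sketch of Formula 1 and of the snapping step described above is shown below; the function names are ours, and the stored point would keep one snapped version per zoom level.

```typescript
// Formula 1: separation between aggregated elements, halved at each zoom level.
function separation(zoom: number): number {
  return zoom === 0 ? 90 : separation(zoom - 1) / 2;
}
// Equivalent closed form: 90 / 2 ** zoom; e.g. separation(11) = 0.0439453125 degrees.

// Snap a coordinate to the closest multiple of the separation for a zoom level.
function snap(value: number, zoom: number): number {
  const sep = separation(zoom);
  return Math.round(value / sep) * sep;
}

// Precompute the 18 alternative versions of a point (zoom levels 0..17).
function discretize(lon: number, lat: number): { x: number; y: number }[] {
  return Array.from({ length: 18 }, (_, zoom) => ({
    x: snap(lon, zoom),
    y: snap(lat, zoom),
  }));
}

// Clustering then reduces to grouping events by their snapped coordinates and counting.
```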
3 Research Questions and Evaluation Methodology

In order to validate the architecture of the system proposed in Sect. 2, we have identified two key aspects to be evaluated, described in the following research questions:

– Research question 1 (RQ1). Can we build a web-based map viewer for large geographic datasets with a latency lower than 2 seconds in data refreshes?
This research question will test whether we can use the simple approach proposed in Sect. 2 to improve the response time of aggregation queries.
– Research question 2 (RQ2). Which of the candidate storage technologies provides a faster answer to aggregation queries?
Even though the selection of a storage technology must take into account many requirements (e.g., transaction support, horizontal scaling, etc.), being able to answer aggregation queries is a very important requirement in our architecture.

To evaluate these research questions, we have run the
Route Simulator component on a desktop computer (Intel Core i7-3770, 4 cores, 3.40 GHz, 8 GB of RAM) to generate events for 2,000 simultaneous drivers driving for approximately 14 hours, resulting in a dataset of 47.8 million events. Each driver starts at a random position in the road network, computes a route to a random destination, and generates positions along the sections of the route every second assuming a random speed expressed as a percentage of the maximum allowed speed. For example, a driver can circulate at 80% of the maximum speed of a road segment, and then circulate at 105% of the maximum speed in the next road segment.
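The position generation can be pictured with the sketch below: given a route as a list of segments with a maximum speed, it emits one position per second at a random fraction of that speed. This is a simplified illustration (straight-line interpolation, hypothetical types), not the actual simulator.

```typescript
interface Point { x: number; y: number; }
interface Segment { from: Point; to: Point; lengthM: number; maxSpeedKmh: number; }

// Emit one position per second along a route, at a random fraction of the
// maximum allowed speed of each segment (e.g. between 80% and 105%).
function* simulateRoute(route: Segment[], minFactor = 0.8, maxFactor = 1.05): Generator<Point> {
  for (const seg of route) {
    const factor = minFactor + Math.random() * (maxFactor - minFactor);
    const speedMs = (seg.maxSpeedKmh * factor * 1000) / 3600;
    const seconds = Math.max(1, Math.round(seg.lengthM / speedMs));
    for (let s = 1; s <= seconds; s++) {
      const t = s / seconds; // linear interpolation along the segment
      yield {
        x: seg.from.x + (seg.to.x - seg.from.x) * t,
        y: seg.from.y + (seg.to.y - seg.from.y) * t,
      };
    }
  }
}
```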
In the Route Simulator component, we have separated the dataset generation step from the dataset ingestion step in order to ensure that exactly the same dataset is stored in each storage technology. Fig. 4a and Fig. 4b show the data distribution over time and space. The distribution over time shows that all drivers start simultaneously and finish smoothly. The distribution over space shows that the positions are distributed following the population density.

Fig. 4: Data distribution over (a) time and (b) space

In order to evaluate RQ1 and RQ2, one hundred queries were randomly generated with six different levels of zoom, from a higher zoom level (15) to a lower one (10). To generate realistic queries with different zoom levels, we used the same map viewer (same width and height) to calculate the spatial ranges. Therefore, a higher zoom level represents a smaller spatial range query, and the other way around. For each zoom level, the alternative version stored for it was used for the aggregation, as described in Sect. 2. Each query was executed exactly once to avoid the effects of any possible caching.
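The spatial range of a query can be derived from the zoom level and the viewer size. Below is a sketch under the usual web-map assumption of 256 px tiles and a fixed viewer; the 400 px height and the bounding box of Spain are our own illustrative values, not figures from the evaluation.

```typescript
// Degrees of longitude covered by one pixel at a given zoom level,
// assuming the usual 256 px web-map tiles.
const degreesPerPixel = (zoom: number): number => 360 / (256 * 2 ** zoom);

// Build a random bounding-box query for a zoom level, keeping the viewer size
// constant so that lower zoom levels produce larger spatial ranges.
function randomQuery(zoom: number, widthPx = 600, heightPx = 400) {
  // Rough bounding box of peninsular Spain, where the simulated drivers move.
  const lonMin = -9.3, lonMax = 3.3, latMin = 36.0, latMax = 43.8;
  const cx = lonMin + Math.random() * (lonMax - lonMin);
  const cy = latMin + Math.random() * (latMax - latMin);
  const halfW = (degreesPerPixel(zoom) * widthPx) / 2;
  const halfH = (degreesPerPixel(zoom) * heightPx) / 2;
  return { xmin: cx - halfW, ymin: cy - halfH, xmax: cx + halfW, ymax: cy + halfH, zoom };
}
```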
Fig. 5: Results of the experiments (average query time in seconds per zoom level, logarithmic scale; series: Druid, PostGIS, MongoDB)
4 Experiments and Results

The experiments were run on a machine (Intel Core i5-4440, 4 cores, 3.10 GHz, 16 GB of RAM) that hosted all the server-side components of the architecture (Storage System and Query System). Only one storage technology was running at a time (either Postgres+PostGIS, MongoDB, or Druid) in order to avoid competition for resources.

Figure 5 shows the results used to evaluate RQ1 and RQ2. The horizontal axis represents the different zoom levels from high (smaller spatial range queries) to low (larger spatial range queries). The vertical axis represents the average time in seconds to answer the 100 queries, using a logarithmic scale. We can see that Druid is the only technology able to answer queries in less than 2-3 seconds, but only when the zoom level is 11 or higher. In particular, when the zoom level is 11 the average query time is 2.93847 seconds. In our previous work, the Postgres+PostGIS results were close to the Druid ones, but it seems that the extra size required to store the 18 alternative versions, or the higher density of events, rules out Postgres+PostGIS in this case study.

The results obtained indicate that with our approach we can build a web-based map viewer for large geographic datasets with a latency close to or lower than 2 seconds, but only under certain conditions (zoom level 11 or higher, that is, approximately a width of at most 40 km in a 600 px wide map viewer). Zoom levels below 11 imply retrieving extremely large collections of geographic points, and the only suitable approach seems to be precomputing estimations for the clusters. This conclusion was also validated using a web-based map viewer to visualize the evaluation dataset. The results also determine that Druid is the best option for the storage technology, matching the conclusion from our previous work.
5 Conclusions
We have presented in this paper a case study of a web-based map viewer for large geographic datasets with a latency close to or lower than 2 seconds. The case study has shown that storing additional versions of each geographic point and using a columnar database designed to answer OLAP queries can be used to achieve this goal. The case study was also designed to help select the best technology to store and query large geographic datasets. Whereas our previous research showed that PostgreSQL+PostGIS was comparable to Druid in terms of efficiency, the extended dataset that we generated this time shows that PostgreSQL+PostGIS performs worse than Druid. The source code for our experiments can be found at the research group GitLab (https://gitlab.lbd.org.es/groups/massive-geo-data).

As future work, we need to compare these results with other technologies such as cstore_fdw (https://citusdata.github.io/cstore_fdw/), a PostgreSQL columnar extension, or NoSQL technologies oriented to storing time series, such as InfluxDB. Since we cannot find a technology able to solve large spatial range queries in an acceptable time, we are also working on designing a data structure able to resolve this kind of query. A future line of work is also testing the influence of the temporal dimension on these queries.

References
1. Cortiñas, A., Luaces, M.R., Rodeiro, T.V.: Storing and Clustering Large Spatial Datasets Using Big Data Technologies. In: Proceedings of the 16th International Symposium on Web and Wireless Geographical Information Systems (W2GIS 2018). A Coruña, Spain (2018). Pending publication
2. Creelman, D.: Top Trends in Workforce Management: How Technology Provides Significant Value Managing Your People (2014). Consulted on 08/03/2018
3. Nah, F.F.H.: A study on tolerable waiting time: how long are web users willing to wait? Behaviour & Information Technology 23(3), 153-163 (2004)
4. Yang, F., Tschetter, E., Léauté, X., Ray, N., Merlino, G., Ganguli, D.: Druid: A real-time analytical data store. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 157-168. SIGMOD '14, ACM, New York, NY, USA (2014). http://doi.acm.org/10.1145/2588555.2595631