HiVision: Rapid Visualization of Large-Scale Spatial Vector Data
Graphical Abstract

Mengyu Ma, Ye Wu, Xue Ouyang, Luo Chen, Jun Li, Ning Jing

[Graphical abstract: the visualization flow of traditional approaches versus the visualization flow of HiVision]

Highlights

• Proposes a display-driven computing model for large-scale data visualization
• Designs a spatial-index-based optimization for real-time data visualization
• Proposes a hybrid-parallel architecture for enhanced data processing
• Implements an open-source tool for rapid visualization of large-scale spatial vector data
Mengyu Ma, Ye Wu, Xue Ouyang∗, Luo Chen, Jun Li and Ning Jing
College of Electronic Science, National University of Defense Technology, Changsha 410073, China
ARTICLE INFO
Keywords: vector data visualization, big data, display-driven computing, parallel computing, real-time
ABSTRACT
Rapid visualization of large-scale spatial vector data is a long-standing challenge in Geographic Information Science. In existing methods, the computation overheads grow rapidly with data volumes, leading to the incapability of providing real-time visualization for large-scale spatial vector data, even with parallel acceleration technologies. To fill the gap, we present HiVision, a display-driven visualization model for large-scale spatial vector data. Different from traditional data-driven methods, the computing units in HiVision are pixels rather than spatial objects to achieve real-time performance, and efficient spatial-index-based strategies are introduced to estimate the topological relationships between pixels and spatial objects. HiVision can maintain exceedingly good performance regardless of the data volume due to the stable pixel number for display. In addition, an optimized parallel computing architecture is proposed in HiVision to ensure the ability of real-time visualization. Experiments show that our approach outperforms traditional methods in rendering speed and visual effects while dealing with large-scale spatial vector data, and can provide interactive visualization of datasets with billion-scale points/segments/edges in real-time with flexible rendering styles. The HiVision code is open-sourced at https://github.com/MemoryMmy/HiVision with an online demonstration.
1. Introduction

There has been an explosion in the amounts of spatial data in recent years, due to the development of data acquisition technology, the prevalence of location-based services, etc. (Yao and Li, 2018). Visualization can make intricate data more intuitive to human readers, and is thus important for discovering implicit information and supporting further decision-making (Maceachren, Gahegan, Pike, Brewer, Cai, Lengerich and Hardisty, 2004). For example, effective visualization of taxi trajectories can help people better understand the urban transportation system and find strategies to reduce the number of accidents and traffic jams (Zuchao, Min, Xiaoru, Junping and Huub, 2013); a scatter plot of the nationwide road network can help the government expose isolated areas and plan and construct new roads. As an important type of spatial data, spatial vector data is the abstraction of real-world geographical entities, generally expressed as points, linestrings, or polygons (areas) (Tong, Ben, Liu et al., 2013). In the big data era, the problem of efficient spatial vector data visualization becomes even more prominent, as visualizing spatial vector data involves a rasterizing process that can be extremely time-consuming when the data scale is large. Rapid visualization of large-scale spatial vector data has become a severe challenge in Geographic Information Science (GIS). With the development of computer hardware, there has been an expansion in processor numbers, and parallel computing has become increasingly important for processing large-scale spatial data. Optimizing data-intensive and computing-intensive spatial analysis using high-performance computing technologies has recently become a hot research topic in GIS (Yao and Li, 2018).
Parallel computing is an effective way to accelerate the visualization process, and representative works (Gao, Wang, Li and Shen, 2005; Tang, 2013; Guo, Guan, Xie, Wu, Luo and Huang, 2015; Guo, Huang, Guan, Xie and Wu, 2017) show that it can achieve highly improved performance compared with traditional serial methods. In addition, with the emergence of various parallel computing models (e.g., MPI, OpenMP, Hadoop, Storm, Spark), a series of high-performance spatial vector data visualization frameworks have been proposed

⋆ This document is the result of the research project funded by the National Natural Science Foundation of China under Grants No. 41871284, 41971362 and U19A2058.
⋆⋆ Mengyu Ma designed and implemented the algorithm; Mengyu Ma, Ye Wu and Luo Chen performed the experiments and analyzed the data; Jun Li and Ning Jing contributed to the construction of the experimental environment; Mengyu Ma and Xue Ouyang wrote the paper.
[email protected] (M. Ma); [email protected] (Y. Wu); [email protected] (X. Ouyang); [email protected] (L. Chen); [email protected] (J. Li); [email protected] (N. Jing)
ORCID (s): (M. Ma)
M Ma et al.: Preprint submitted to Elsevier. Page 1 of 19

and have seen some success (e.g., HadoopViz (Eldawy, Mokbel and Jonathan, 2016), GeoSparkViz (Yu, Zhang and Sarwat, 2018), etc.). However, not many existing methods can support real-time visualization of large-scale spatial vector data, even with the adopted high-performance computing technologies. Figure 1 presents the general processing flow in existing visualization methods: first, each spatial object in the range is plotted according to the image resolution, followed by a merge step to generate the final raster image. The computational scale of this data-driven processing flow expands rapidly with the volume of spatial objects in the image range; it therefore suffers a performance drop in big data scenarios and cannot meet real-time requirements.
Figure 1:
Processing flow of data-driven spatial vector data visualization.
To address the scale issue, we present HiVision, a display-driven vector data visualization model, as shown in Figure 2. In HiVision, the computing units are pixels rather than spatial objects, and the algorithms focus on determining the actual pixel value from the relevant spatial objects: as the number of pixels in the image range is limited and stable, the computational complexity of HiVision remains stable while dealing with spatial data of different scales. In addition, efficient spatial-index-based strategies are introduced to estimate the spatial topological relationships between pixels and spatial objects, thus determining the value of pixels. To the best of our knowledge, this is the first rapid vector data visualization approach whose performance is largely insensitive to data volumes.
Figure 2:
Spatial vector data visualization processing flow in HiVision.
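The display-driven flow of Figure 2 can be sketched in a few lines: iterate over the pixels of the requested image and judge each one against the data. The sketch below is illustrative only; it uses a brute-force nearest-distance search in place of HiVision's spatial indexes, and all names and parameters are ours, not HiVision's API.

```python
# Display-driven rendering sketch: the computing units are the pixels of
# the requested image, not the spatial objects.
from math import hypot

def render_tile(points, x0, y0, resolution, width, height, radius):
    """Return a height x width grid; a cell is 1 when a point lies within
    `radius` (map units) of the pixel centre, else 0."""
    image = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            # Map the pixel to its centre coordinate in map space.
            px = x0 + (col + 0.5) * resolution
            py = y0 + (row + 0.5) * resolution
            # Judge the spatial relationship pixel by pixel.
            d = min(hypot(px - x, py - y) for x, y in points)
            image[row][col] = 1 if d <= radius else 0
    return image
```

The work per frame depends on `width * height`, which is fixed by the display; replacing the linear scan with an index query is what keeps the per-pixel cost low.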
The contributions of this paper can be summarized as follows.

• Implements an open-source tool for rapid visualization of large-scale spatial vector data. HiVision can be used to provide an interactive exploration of massive raw spatial vector data with flexible rendering styles, so as to discover implicit information and parameter settings for further processing.

• Designs a display-driven vector data visualization approach that reduces the computational complexity dramatically (from O(𝑛) to O(log 𝑛)). HiVision calculates visualization results directly using a parallel per-pixel approach with efficient fine-grained spatial indexes. Our approach provides new research ideas for many related fields (e.g., map cartography (Kraak and Ormeling, 2013), spatial analysis and data visualization).
• Carries out extensive experiment evaluations and provides an online demonstration. Experiments show that HiVision dramatically outperforms traditional data-driven methods in tile rendering speed, and it is capable of handling billion-scale spatial vector data. In addition, the demonstration verifies that HiVision can support interactive exploration even on a normal 4-core CPU.

The rest of this paper proceeds as follows. Section 2 highlights the state of the art of big spatial data visualization. In Section 3 and Section 4, the techniques of HiVision are described in detail. The experimental results are presented and analyzed in Section 5, with an online demonstration of HiVision introduced in Section 6. The conclusions are drawn in Section 7.
2. Related Work

There are many studies on big spatial data that discuss the challenges brought by data volume and diverse application requirements. Most of these studies focus on the spatial query problems that emerge when processing large-scale spatial data (Bellur, 2014; Fries, Boden, Stepien and Seidl, 2014; Aly, Mahmood, Hassan, Aref, Ouzzani, Elmeleegy and Qadah, 2015; Zhu, Huo and Qiu, 2015; Eldawy, Yuan, Mokbel and Janardan, 2013; Scitovski, [...]). [...] k-nearest-neighbor queries, which achieved an order-of-magnitude enhancement compared with the state-of-the-art systems; Eldawy et al. (2013) proposed a set of efficient MapReduce algorithms for some basic spatial analysis operations, including polygon union, farthest/closest pair, skyline query and convex hull. Besides, many high-performance frameworks/systems have been proposed to process or analyze big spatial data, among them ScalaGiST (Lu, Chen, Ooi, Vo and Wu, 2014), Sphinx (Eldawy, Elganainy, Bakeer, Abdelmotaleb and Mokbel, 2015), Hadoop-GIS (Aji, Wang, Vo, Lee, Liu, Zhang and Saltz, 2013), SpatialHadoop (Eldawy and Mokbel, 2015) and GeoSpark (Yu, Wu and Sarwat, [...]); however, these frameworks do not focus on the visualization of large-scale spatial data in real-time.

Spatial data visualization, as an important means of spatial analysis, is a core issue in map cartography. To visualize large-scale spatial data, the Open Geospatial Consortium (OGC) has provided a standard Web Map Tile Service (WMTS) (OpenGIS, 2010), in which pre-rendered or run-time computed georeferenced map images are organized into the tile-pyramid structure and transferred as map tiles over the Internet. The tile-pyramid is a multi-resolution data structure widely used for map browsing on the web. At the lowest level of the tile-pyramid (level 0), a single tile summarizes the whole map. On each higher level there are up to 4^𝑧 tiles, where 𝑧 is the zoom level. Each tile has the same size of 𝑛 × 𝑛 pixels, and all tiles at a given level correspond to geographic ranges of the same size.
However, existing solutions to large-scale map data visualization are not ideal, due to the following problems: 1) (long generating time) on the one hand, it takes a long time to render a tile that intersects large amounts of spatial objects; on the other hand, it may take dozens of hours or even more to slice all the tiles needed to provide free exploration of one spatial dataset; 2) (massive tiles) a world-scale tile-pyramid with zoom levels from 0 to 16 contains billions of tiles, requiring TB-level storage; 3) (inflexible styles) the style of rendered tiles cannot be changed; in other words, a new set of tiles has to be re-generated if one wants to change the style.

Several studies focus on improving tile rendering performance, a typical yet important benchmark of spatial data processing. In the field of map cartography, there are many mature tools for spatial data visualization, such as Mapnik (A), GeoServer (B, b) and MapServer (B, a). These tools are widely used for generating maps due to their efficient rendering algorithms and rich rendering styles. In order to further improve the rendering performance of large-scale spatial data, researchers have provided several parallel methods, and various acceleration technologies have been adopted. For example, Gao et al. (2005) presented a parallel multi-resolution volume rendering algorithm for visualizing large data sets, in which the raw data is converted to a wavelet tree to achieve load-balanced rendering. Tang (2013) proposed a parallel construction method of large circular cartograms based on graphics processing units (GPUs) and achieved significant acceleration. In order to achieve load balance, Guo et al. (2015) developed a spatially adaptive decomposition approach for polyline and polygon visualization to divide the visualization domain into unequally sized sub-domains, such that they entail approximately the same amount of computational intensity. Guo et al.
(2017) proposed an approach of vector data rendering using the parallel computing capability of many-core GPUs, which involves a balancing allocation strategy to take full advantage of all processing cores of the GPU. OmniSci (OmniSci, [...]) can process billions of points and millions of polygons and generate custom pointmaps, heatmaps, choropleths, scatterplots, and other visualizations, enabling zero-latency visual interaction at any scale. Eldawy et al. (2016) proposed a MapReduce-based framework, HadoopViz, for visualizing big spatial data, and experiments showed that HadoopViz can efficiently produce giga-pixel images for billions of input records. Yu et al. (2018) proposed a big spatial data visualization framework, GeoSparkViz, which takes advantage of the in-memory architecture of Spark, and experiments verified that GeoSparkViz can generate a gigapixel image of 1.3 billion taxi trips in 5 minutes on a four-node commodity cluster.

Table 1
Performance of data-driven methods and HiBuffer (Ma et al., 2018).

Algorithm | 40,927 linestrings | 208,067 linestrings | 597,829 linestrings | 21,898,508 linestrings
(rows: parallel methods a–d; a (Shen, Chen, Wu and Jing, 2018), b (Fan, Ji, Gu and Sun, 2014), c (Huang, 2013), d (Wang, Zhao, Wang, Chen and Cao, 2016))

In addition, researchers have proposed several other approaches to improve data visualization effects. Yang, Wong, Yang, Kafatos and Li (2005) listed several possible techniques to improve the performance of data exploration on Web-based GIS, including pyramids and hash indices for large images, multi-threading, data caching and binary compression. To manage the massive map tiles, Wan, Huang and Peng (2016) developed a tile storage approach based on a NoSQL database. To provide interactive exploration of large-scale spatial data while avoiding generating all the image tiles, Ghosh, Eldawy and Jais (2019) proposed an adaptive image data index, which pre-generates image tiles for the regions where spatial objects are dense; other typical methods generate tile-like intermediate variables through precomputing so as to compute requested tiles on the fly (Liu, Jiang and Heer, 2013; Pahins, Stephens, Scheidegger and Comba, 2016; Lins, Klosowski and Scheidegger, 2013). Vector tile technology (Wikipedia, 2019) has become a popular approach in recent years for visualizing large-scale spatial vector data; it transfers packets of geographic data rather than images to clients and can change map styles without generating new tiles. However, as it involves complex cartographic generalization operations, generating vector tiles is more time-consuming than generating image tiles.
To summarize, the existing solutions to rapid visualization of large-scale vector data are normally data-driven, with computational scales expanding rapidly with the volume of spatial objects; as a result, it is difficult for them to provide visualization of large-scale vector data in real-time.
Display-driven computing (DisDC) is a computing model that is especially suitable for data-intensive problems in GIS. In DisDC, the computing units are pixels rather than spatial objects. The core issue in DisDC is to identify the spatial topological relationships between pixels and spatial objects, thus determining the value of pixels for display. DisDC has broad application and research prospects in big data analysis.
In our previous works (Ma, Wu, Luo, Chen, Li and Jing, 2018; Ma, Wu, Chen, Li and Jing, 2019), the primary idea of DisDC was first proposed and applied to solve some basic analysis problems in GIS. We successively brought forward HiBuffer and HiBO to provide interactive buffer and overlay analysis of large-scale spatial data. In (Ma et al., 2018), the mainstream buffer analysis methods proposed in recent years and the popular GIS software programs are discussed and compared (key results are summarized in Table 1), and the display-driven buffer analysis method, HiBuffer, is deployed and tested in the same hardware environment. Experiments verified that HiBuffer reduces computation time by up to orders of magnitude while dealing with large-scale spatial vector data, and that DisDC has significant advantages compared with data-driven computing (DataDC). In this paper, we apply DisDC to the rapid visualization of large-scale spatial vector data to explore its effects.
3. Methodology
In this section, the key ideas for spatial vector data visualization in HiVision are introduced. Given that the core task of visualizing spatial vectors is to rasterize the spatial objects and render the final raster images for display, we apply DisDC to plot spatial point, linestring and polygon objects in HiVision. To provide better visualization effects, points, linestrings and boundaries of polygons are generally plotted with widths, and an anti-aliasing process is needed (see Figure 3). In HiVision, we process each pixel of the final raster image as an independent computing unit, and spatial indexes are utilized to identify the spatial topological relationships between pixels and spatial objects, thus determining the value of pixels for display. We design a DisDC-oriented vector data organization structure for data visualization. Specifically, for points, linestrings and polygon edges, we propose a visualization method named Spatial-Index-Based Visualization (SIBV); and for the filling problem in polygon visualization, we present the Spatial-Index-Based Filling (SIBF) algorithm.

(a) Point objects; (b) linestring objects; (c) polygon objects
Figure 3:
Plot spatial objects for visualization.
The core issue in DisDC is to identify spatial topological relationships between pixels and spatial objects, so as to calculate the value of pixels for display. To support rapid visualization of large-scale spatial vector data using DisDC, we design a specialized data organization structure in HiVision. Spatial indexes are widely used to organize spatial data so that efficient spatial object accessing can be guaranteed. The R-tree, an efficient tree data structure widely used for indexing and querying spatial data, can be built efficiently by grouping nearby objects and representing them with their Minimum Bounding Rectangle (MBR) in the next higher level of the tree (Choubey, Chen and Rundensteiner, [...]). As linestring and polygon objects have complex structures and different shapes that are difficult to identify accurately by their MBRs, using the MBR of each spatial object directly as an R-tree record node causes low query performance in the display-driven analyzing process. Accordingly, linestring and polygon objects are separated into segments and edges to be stored in the R-tree indexes in HiVision.
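The segment-level organization can be sketched as follows: instead of one MBR per linestring, each segment becomes its own index record, so the bounding boxes hug the geometry. The record layout below is illustrative; HiVision's actual memory-mapped R-tree storage is not reproduced here.

```python
# Split a linestring into per-segment index records, each carrying the
# segment's own MBR instead of the whole object's MBR.
def segment_records(object_id, coords):
    """Return a list of (bbox, payload) records, one per segment of the
    linestring given by `coords` (a sequence of (x, y) vertices)."""
    records = []
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        # MBR of this single segment only.
        bbox = (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
        records.append((bbox, {"id": object_id,
                               "segment": (x1, y1, x2, y2)}))
    return records
```

Bulk-loading such records into any R-tree implementation gives the fine-grained index that the per-pixel queries rely on.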
As shown in Figure 4, for point and linestring objects, we create R-tree indexes with points and segments as node types; for polygon objects, which involve a filling step, we design a multi-level index architecture (MLIA). In MLIA, each edge of the polygon objects is stored as a segment in 𝑅𝑡𝑟𝑒𝑒𝐸, and the polygon MBRs are stored as boxes in 𝑅𝑡𝑟𝑒𝑒𝑀𝐵𝑅. In particular, to support spatial judging in SIBF, two operations are executed: 1) node information (𝐼𝑠𝐿𝑒𝑣𝑒𝑙) is included in 𝑅𝑡𝑟𝑒𝑒𝐸 to identify whether the edge is parallel to the x-axis; 2) for the edges which monotonically increase or decrease, the segment cutting process is adopted (see Figure 5).
[Figure 4 shows the record-node layouts: 𝑅𝑡𝑟𝑒𝑒𝑃 stores point(x, y) records, 𝑅𝑡𝑟𝑒𝑒𝐿 stores segment records, 𝑅𝑡𝑟𝑒𝑒𝐸 stores segment records with the 𝐼𝑠𝐿𝑒𝑣𝑒𝑙 flag, and 𝑅𝑡𝑟𝑒𝑒𝑀𝐵𝑅 stores box(minx, miny, maxx, maxy) records; each record carries an ID linking to the object attributes.]

Figure 4:
Vector data organization in HiVision.

Figure 5: Segment cutting for polygon edges: (a) edges reverse direction at the endpoint (cutting process not required); (b) edges monotonically increase or decrease (cutting process required; the segment is cut with a tolerance 𝑡, 𝑡 ≪ segment length).
As points, linestrings and polygon edges are visualized with widths, this can be regarded as generating spatial buffers (Sommer and Wade, 2006) of the objects. Different from general spatial buffer analysis, which identifies areas by surrounding geographic features within a given spatial distance, for visualization the widths of spatial objects are measured in pixels. In SIBV, we extend the buffer generation method in HiBuffer (Ma et al., 2018) to visualize spatial points, linestrings and the boundaries of polygon objects; moreover, we design a super-sampling approach for anti-aliasing: as shown in Figure 6, the pixel 𝑃 is split into four sub-pixels and a sample is taken from each sub-center. The value of 𝑃 is generated by weighting the values of the sub-pixels (Figure 7 shows the improvement of visual effects in SIBV with the anti-aliasing approach).

Figure 6:
Super-sampling of pixel 𝑃 for anti-aliasing in SIBV.

(a) Before anti-aliasing; (b) after anti-aliasing

Figure 7:
Improvement of visual effects with anti-aliasing in SIBV.
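The super-sampling step can be sketched as follows: the four sub-pixel samples, taken at offsets of ±𝑅𝑧/4 from the pixel center, are averaged into a coverage weight. `dist_to_nearest` is a hypothetical stand-in for the index-based nearest-object query; this is an illustrative sketch, not HiVision's implementation.

```python
# Anti-aliasing by super-sampling: sample the four sub-pixel centres of
# pixel P and return the fraction that fall inside the plotted radius.
def pixel_alpha(px, py, rz, radius, dist_to_nearest):
    """px, py: pixel centre; rz: resolution (map units per pixel);
    radius: plotting radius in map units; dist_to_nearest(x, y):
    distance from (x, y) to the nearest spatial object."""
    offsets = (-rz / 4, rz / 4)
    covered = sum(
        1 for dx in offsets for dy in offsets
        if dist_to_nearest(px + dx, py + dy) <= radius)
    return covered / 4.0  # 0, 0.25, 0.5, 0.75 or 1
```

Pixels with a fractional weight are exactly the color transition regions; fully covered and background pixels skip the extra samples in the real algorithm.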
The details of SIBV are shown in Algorithm 1, and the query boxes used in SIBV are illustrated in Figure 8. Two main factors are considered to optimize the algorithm: 1) the super-sampling process should only be used in the color transition regions, as it increases the amount of calculation; 2) when an R-tree is used, intersect operators work well for queries using bounding boxes rather than other shapes, and nearest-neighbor search has much higher computational complexity than the bounding-box query. We introduce 𝑅1 (= 𝑅 − (√2∕4)𝑅𝑧) and 𝑅2 (= 𝑅 + (√2∕4)𝑅𝑧), where 𝑅 is the plotting radius in map units and 𝑅𝑧 is the resolution at zoom level 𝑍. If the distance from 𝑃 to the nearest spatial object, defined as 𝐷, is less than 𝑅1, all the sub-pixels of 𝑃 are in the zones of rasterized spatial objects; if 𝐷 is between 𝑅1 and 𝑅2, 𝑃 belongs to the color transition regions; otherwise, 𝑃 belongs to the background. The query process in SIBV can be divided into two steps:

Step 1: Determine whether 𝑃 is in the buffer area of spatial objects with 𝑅1 as radius. We introduce 𝐼𝑛𝑛𝑒𝑟𝐵𝑜𝑥 and 𝑂𝑢𝑡𝑒𝑟𝐵𝑜𝑥 to deal with different situations in this step: if there are lots of spatial objects within the distance 𝑅1 from 𝑃, we query the spatial objects intersecting the 𝐼𝑛𝑛𝑒𝑟𝐵𝑜𝑥, as a high density of spatial objects in the neighborhood is very likely to intersect the inner box; and if there are few spatial objects in the neighborhood of 𝑃, we use the 𝑂𝑢𝑡𝑒𝑟𝐵𝑜𝑥 to filter out the spatial objects that are far from 𝑃.
Figure 8:
Query boxes for calculating pixel 𝑃 with 𝑁 as radius in SIBV (𝑅𝑧: resolution at zoom level 𝑍).

Step 2:
𝑂𝑢𝑡𝑒𝑟𝐵𝑜𝑥 is used as a filter to determine whether 𝑃 belongs to the color transition regions. If so, we calculate the number of sub-pixels that are in the plotting region. Specifically, the number of sub-pixels in the plotting region is used as an indicator of the degree to which 𝑃 belongs to the zones of rasterized spatial objects.

Algorithm 1: Spatial-Index-Based Visualization
Input: Pixel 𝑃, zoom level 𝑍, radius 𝑁 (pixels) and spatial index 𝑅𝑡𝑟𝑒𝑒 (𝑅𝑡𝑟𝑒𝑒𝑃, 𝑅𝑡𝑟𝑒𝑒𝐿 or 𝑅𝑡𝑟𝑒𝑒𝐸).
Output: 0: 𝑃 belongs to the background region; 1–3: 𝑃 belongs to the color transition regions; 4: 𝑃 totally belongs to the zones of rasterized spatial objects.

 1: 𝑅𝑧 ← RESOLUTION(𝑍)
 2: 𝑅 ← 𝑁 × 𝑅𝑧
 3: 𝑅1 ← 𝑅 − (√2∕4)𝑅𝑧
 4: 𝑅2 ← 𝑅 + (√2∕4)𝑅𝑧
 5: 𝑟 ← (√2∕2)𝑅1
 6: 𝐼𝑛𝑛𝑒𝑟𝐵𝑜𝑥 ← BOX(𝑃.𝑥 − 𝑟, 𝑃.𝑦 − 𝑟, 𝑃.𝑥 + 𝑟, 𝑃.𝑦 + 𝑟)
 7: 𝑇𝑚𝑝 ← an object satisfying 𝑅𝑡𝑟𝑒𝑒.INTERSECT(𝐼𝑛𝑛𝑒𝑟𝐵𝑜𝑥).LIMIT(1)
 8: if 𝑇𝑚𝑝 is not 𝑛𝑢𝑙𝑙 then return 4
 9: 𝑂𝑢𝑡𝑒𝑟𝐵𝑜𝑥 ← BOX(𝑃.𝑥 − 𝑅1, 𝑃.𝑦 − 𝑅1, 𝑃.𝑥 + 𝑅1, 𝑃.𝑦 + 𝑅1)
10: 𝑇𝑚𝑝 ← an object satisfying 𝑅𝑡𝑟𝑒𝑒.INTERSECT(𝑂𝑢𝑡𝑒𝑟𝐵𝑜𝑥) and 𝑅𝑡𝑟𝑒𝑒.NEAREST(𝑃)
11: if 𝑇𝑚𝑝 is not 𝑛𝑢𝑙𝑙 && DISTANCE(𝑇𝑚𝑝, 𝑃) ≤ 𝑅1 then return 4
12: 𝑂𝑢𝑡𝑒𝑟𝐵𝑜𝑥 ← BOX(𝑃.𝑥 − 𝑅2, 𝑃.𝑦 − 𝑅2, 𝑃.𝑥 + 𝑅2, 𝑃.𝑦 + 𝑅2)
13: 𝑇𝑚𝑝 ← an object satisfying 𝑅𝑡𝑟𝑒𝑒.INTERSECT(𝑂𝑢𝑡𝑒𝑟𝐵𝑜𝑥) and 𝑅𝑡𝑟𝑒𝑒.NEAREST(𝑃)
14: if 𝑇𝑚𝑝 is not 𝑛𝑢𝑙𝑙 && DISTANCE(𝑇𝑚𝑝, 𝑃) ≤ 𝑅2 then
15:     𝑃1 ← POINT(𝑃.𝑥 − 1∕4 × 𝑅𝑧, 𝑃.𝑦 + 1∕4 × 𝑅𝑧)    ⊳ Super-sampling for anti-aliasing.
16:     𝑃2 ← POINT(𝑃.𝑥 + 1∕4 × 𝑅𝑧, 𝑃.𝑦 + 1∕4 × 𝑅𝑧)
17:     𝑃3 ← POINT(𝑃.𝑥 − 1∕4 × 𝑅𝑧, 𝑃.𝑦 − 1∕4 × 𝑅𝑧)
18:     𝑃4 ← POINT(𝑃.𝑥 + 1∕4 × 𝑅𝑧, 𝑃.𝑦 − 1∕4 × 𝑅𝑧)
19:     return ∑ᵢ₌₁⁴ (DISTANCE(𝑇𝑚𝑝, 𝑃ᵢ) ≤ 𝑅 ? 1 ∶ 0)
20: return 0

We design SIBF to determine whether the pixel 𝑃 is inside a polygon object, so as to visualize the zones inside polygon objects. The details of SIBF are shown in Algorithm 2: we use 𝑅𝑡𝑟𝑒𝑒𝑀𝐵𝑅 to find the candidate polygons and then measure the spatial relationship between the pixel and each candidate polygon one by one until the polygon which contains the pixel is found. We apply the ray casting algorithm (Shimrat, 1962) to determine whether a pixel is inside a polygon. To be more specific, given a pixel and a polygon, a segment (
𝑄𝑢𝑒𝑟𝑦𝑆𝑒𝑔𝑚𝑒𝑛𝑡 ) is drawn
from the MBR boundary of the polygon to the pixel, parallel to the x-axis; then
𝑅𝑡𝑟𝑒𝑒𝐸 is used to calculate how many times the segment intersects the edges of the polygon (edges parallel to the x-axis are processed as invalid edges). The pixel is classified as 'inside the polygon' if the number of crossings is odd, or 'outside' if it is even. The result holds for polygons with inner rings. Moreover, as a longer 𝑄𝑢𝑒𝑟𝑦𝑆𝑒𝑔𝑚𝑒𝑛𝑡 may intersect large amounts of edges belonging to other polygons and thus cause performance degradation, two optimizations have been made to minimize the length of the 𝑄𝑢𝑒𝑟𝑦𝑆𝑒𝑔𝑚𝑒𝑛𝑡: 1) polygons with smaller x spans are used for spatial judging preferentially (line 2 in Algorithm 2); 2) the horizontal segment from the pixel to the closer vertical edge of the polygon MBR is used as the 𝑄𝑢𝑒𝑟𝑦𝑆𝑒𝑔𝑚𝑒𝑛𝑡 (details are given in lines 6–9). When the spatial relationships are determined, we can render the pixels inside polygon objects according to the given styles. In the current implementation, monochromatic color filling and pattern filling are both supported. Figure 9 shows the visual effects of polygon objects in HiVision.
Algorithm 2: Spatial-Index-Based Filling
Input: Pixel 𝑃, 𝑅𝑡𝑟𝑒𝑒𝐸 and 𝑅𝑡𝑟𝑒𝑒𝑀𝐵𝑅.
Output: True or False (whether 𝑃 is inside a polygon).

 1: 𝑇𝑚𝑝𝑀𝐵𝑅 ← boxes satisfying 𝑅𝑡𝑟𝑒𝑒𝑀𝐵𝑅.INTERSECT(𝑃)
 2: SORT(𝑇𝑚𝑝𝑀𝐵𝑅)    ⊳ Polygon with smaller x span has higher priority.
 3: for 𝑣 ∈ 𝑇𝑚𝑝𝑀𝐵𝑅 do
 4:     𝐸𝑑𝑔𝑒𝐶𝑜𝑢𝑛𝑡 ← 0
 5:     𝑣𝑀𝑖𝑛𝑥 ← 𝑣.𝐵𝑜𝑥.𝑚𝑖𝑛𝑥, 𝑣𝑀𝑎𝑥𝑥 ← 𝑣.𝐵𝑜𝑥.𝑚𝑎𝑥𝑥
 6:     if 𝑃.𝑥 − 𝑣𝑀𝑖𝑛𝑥 < 𝑣𝑀𝑎𝑥𝑥 − 𝑃.𝑥 then
 7:         𝑄𝑢𝑒𝑟𝑦𝑆𝑒𝑔𝑚𝑒𝑛𝑡 ← SEGMENT(𝑣𝑀𝑖𝑛𝑥, 𝑃.𝑦, 𝑃.𝑥, 𝑃.𝑦)
 8:     else
 9:         𝑄𝑢𝑒𝑟𝑦𝑆𝑒𝑔𝑚𝑒𝑛𝑡 ← SEGMENT(𝑃.𝑥, 𝑃.𝑦, 𝑣𝑀𝑎𝑥𝑥, 𝑃.𝑦)
10:     𝑇𝑚𝑝𝑆 ← segments satisfying 𝑅𝑡𝑟𝑒𝑒𝐸.INTERSECT(𝑄𝑢𝑒𝑟𝑦𝑆𝑒𝑔𝑚𝑒𝑛𝑡)
11:     for 𝑠 ∈ 𝑇𝑚𝑝𝑆 do
12:         if (not 𝑠.𝐼𝑠𝐿𝑒𝑣𝑒𝑙) && 𝑠.𝐼𝐷 == 𝑣.𝐼𝐷 then 𝐸𝑑𝑔𝑒𝐶𝑜𝑢𝑛𝑡++
13:     if 𝐸𝑑𝑔𝑒𝐶𝑜𝑢𝑛𝑡 is odd then return True
14: return False

(a) Monochromatic color filling; (b) pattern filling
Figure 9:
Visualization of polygon objects in HiVision.
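The even-odd test at the heart of SIBF can be sketched without the R-tree machinery. As in the text, edges parallel to the x-axis are skipped as invalid; the degenerate vertex cases that the segment cutting of Figure 5 is designed to handle are ignored in this simplified sketch.

```python
# Even-odd ray casting (Shimrat, 1962): cast a horizontal ray to the
# left of pixel P and count crossings with the polygon edges; an odd
# count means P is inside.
def point_in_polygon(px, py, ring):
    """`ring` is a list of (x, y) vertices of a closed polygon ring."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if y1 == y2:          # edge parallel to the x-axis: invalid
            continue
        # Does the edge straddle the horizontal line y = py?
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if x_cross < px:  # crossing lies to the left of P
                inside = not inside
    return inside
```

Running the test over every ring of a polygon (outer ring plus inner rings) and combining the parities gives the behavior described above for polygons with holes.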
HiVision outperforms traditional data-driven solutions in the following two aspects:

• (low computation complexity) Let 𝑛 be the number of spatial objects for visualization. In data-driven solutions, as each object is computed and analyzed successively, the total computation complexity is O(𝑛). In contrast, the computing units in HiVision are pixels, and we introduce R-tree indexes to accelerate the process of finding the objects that determine the value of each pixel; as a result, the computation complexity is reduced to O(log 𝑛).
• (easy to parallelize) Real-world spatial datasets have spatially unbalanced distributions, which creates challenges for efficient parallel processing. Using DataDC, complex partitioning and merging strategies generally need to be designed; however, few strategies are capable of handling all kinds of spatial distributions with good load balancing. In contrast, as the computation complexity per pixel is O(log 𝑛), our approach is less sensitive to spatial distributions, and simply partitioning the analysis task by dividing the pixels equally can achieve good load balancing.
4. Architecture
To provide an interactive exploration of large-scale spatial vector data, we design a high-performance parallel processing architecture, as illustrated in Figure 10. The DisDC-oriented vector data organization structure is stored as memory-mapped files (Wikipedia, 2020), which do not need to be fully loaded into memory while being accessed. The architecture of HiVision adopts the browser-server application model. In HiVision, visualization results are organized into the tile-pyramid structure with 256 × 256 pixels as the tile size. When a user browses the spatial datasets, tiles in the display range are rendered on the fly according to the rendering styles. The server side of the architecture consists of three parts: the Multi-Thread Visualization Server (MTVS), the In-Memory Messaging Framework (IMMF) and the Hybrid-Parallel Visualization Engine (HPVE).
[Figure 10 shows three components: the Hybrid-Parallel Visualization Engine, whose multi-threaded compute processes load the spatial indexes and schedule parallel tasks; the In-Memory Messaging Framework with its Task Pool and Result Pool; and the Multi-Thread Visualization Server, which parses tasks and renders tiles, exposes the vector data visualization WMTS and the vector data registration service, and holds the Data Meta Info and the Pattern Library.]
Figure 10:
Architecture of HiVision.
In MTVS, the spatial data visualization service is encapsulated as a WMTS, which can easily be added to web maps as a raster layer. The visualizing process of each tile is treated as an independent task. The Parse Tasks process analyzes tile requests and generates visualization tasks in the Task Pool; in particular, the following types of tiles do not lead to new tasks: 1) tiles outside the spatial scope of the data MBRs; 2) tiles that were previously processed and whose visualization results are still preserved in the Result Pool; 3) tiles with malformed request expressions. The Render Tiles process fetches visualization results from the Result Pool once the visualizing process is done in HPVE, and renders tiles according to the style provided by users. The Pattern Library stores the patterns used for the pattern-filling style of polygon objects. MTVS also provides a data registration interface: in the Register Vector Data process, MTVS creates spatial indexes and writes dataset meta-data (e.g., MBRs) to Data Meta Info. In addition, multi-thread technology is adopted in MTVS to improve concurrency.
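The three task-filtering rules above can be condensed into a small predicate. The signature below is hypothetical (the real MTVS parses WMTS request strings and consults Redis), but the logic mirrors the three conditions:

```cpp
// Sketch of MTVS task filtering: decide whether a tile request should
// spawn a new rendering task. All types and names are illustrative.
#include <cassert>
#include <string>
#include <unordered_map>

struct BBox { double xmin, ymin, xmax, ymax; };

bool spawns_task(const BBox& tile, const BBox& data_mbr,
                 const std::string& key,
                 const std::unordered_map<std::string, std::string>& result_pool,
                 bool well_formed) {
    if (!well_formed) return false;                    // rule 3: bad request expression
    if (tile.xmax < data_mbr.xmin || tile.xmin > data_mbr.xmax ||
        tile.ymax < data_mbr.ymin || tile.ymin > data_mbr.ymax)
        return false;                                  // rule 1: outside data MBR
    if (result_pool.count(key)) return false;          // rule 2: result already cached
    return true;
}
```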
To reduce message transmission time, tasks, results and control messages are delivered in memory, without disk I/O, in IMMF. IMMF is implemented on Redis, a distributed in-memory key-value database.
The Task Pool is a first-in-first-out queue that stores the requested tile tasks; tasks are pushed onto the queue by MTVS and popped by the task processors in HPVE, and both operations are executed in blocking mode to avoid the repeated allocation of tasks.

Table 2
Experimental environment.

Item  Description
CPU  cores, Intel(R) Xeon(R) CPU @ 2.20 GHz
Memory  GB
Operating System  CentOS 7.1
The Result Pool stores the visualization results in a key-value structure: the key identifies the tile request expression, while the value stores the visualization result (a two-dimensional array indicating different zones for the rendering process). Once a task is finished in HPVE, the visualization result is written to the Result Pool, and a task completion message is sent to MTVS through Redis subscribe/publish. To avoid overwhelming memory consumption, visualization results are given an expiry time window, and expired results are cleaned up when memory usage reaches the upper limit.
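The expiry behavior of the Result Pool (handled by Redis TTLs in HiVision) can be mimicked with a minimal cache; the logical integer clock and all identifiers here are illustrative:

```cpp
// Minimal stand-in for the Result Pool's expiry policy: each entry
// carries an expiry timestamp, and stale entries are dropped on lookup.
#include <cassert>
#include <string>
#include <unordered_map>

struct ResultPool {
    struct Entry { std::string tile; long expires_at; };
    std::unordered_map<std::string, Entry> store;

    void put(const std::string& key, const std::string& tile,
             long now, long ttl) {
        store[key] = {tile, now + ttl};
    }
    // Returns nullptr when the entry is absent or expired.
    const std::string* get(const std::string& key, long now) {
        auto it = store.find(key);
        if (it == store.end()) return nullptr;
        if (it->second.expires_at <= now) { store.erase(it); return nullptr; }
        return &it->second.tile;
    }
};
```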
HPVE adopts a hybrid MPI-OpenMP parallel processing model and a dynamic task partitioning strategy to achieve real-time exploration of large-scale spatial vector data. In HiVision, each tile is partitioned by lines and processed by multiple OpenMP threads within one MPI process. When a user browses the spatial datasets, tasks are generated in a streaming fashion and handled on a first-come, first-served basis. An MPI process is suspended when it has no assigned tasks or when its assigned tasks are finished. Tasks are dynamically allocated to suspended MPI processes; if no free MPI process is available, extra tasks are temporarily stored in the Task Pool waiting for idle processes.
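The per-tile line partitioning can be sketched as follows; std::thread stands in for the OpenMP threads of HPVE to keep the sketch dependency-free, and the shading function is a placeholder for the per-pixel index query:

```cpp
// Sketch of partitioning one 256 x 256 tile by lines: thread t renders
// rows t, t+T, t+2T, ... so the rows written by each thread are disjoint.
#include <cassert>
#include <thread>
#include <vector>

constexpr int W = 256, H = 256;

// Placeholder for the per-pixel spatial-index query in HiVision.
static int shade(int x, int y) { return (x + y) % 2; }

std::vector<int> render_tile(int threads) {
    std::vector<int> img(W * H);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&img, t, threads] {
            for (int y = t; y < H; y += threads)
                for (int x = 0; x < W; ++x)
                    img[y * W + x] = shade(x, y);
        });
    for (auto& th : pool) th.join();
    return img;
}
```

Because pixel values are independent, the parallel result is identical to the serial one, which is why this simple equal split balances well regardless of the data distribution.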
5. Experimental Evaluation
In this section, we conduct several experiments to evaluate the performance of HiVision. First, we compare HiVision with typical data-driven methods that have been popular in recent years; then, we test the ability of HiVision to support interactive exploration of large-scale spatial vector data; next, we analyze the influence of request rates on interactive visualization in HiVision; finally, the parallel scalability of HiVision is tested by running with varying numbers of MPI processes and OpenMP threads.
All the experiments are conducted on a cluster with four nodes (Table 2). HiVision is implemented in C++, based on Boost C++ 1.64, MPICH 3.04, GDAL 2.1.2 and Redis 3.2.12. The data-driven methods are based on Hadoop 2.9.2, Spark 2.3.3, SpatialHadoop 2.4.2, GeoSpark 1.2.0 and Mapnik 3.0.22. In HiVision, the spatial indexes can be constructed quickly (Fernández, 2018), and the experiments are conducted on the pre-built DisDC-oriented vector data organization structure. To deploy HiVision on the cluster, we keep a copy of the index files on each cluster node so that all processes can access the spatial vector data efficiently. Table 3 shows the datasets used in the experiments; all datasets are at the planet level. P1 is from OpenCellID (https://opencellid.org), the world's largest collaborative community project collecting GPS positions of base stations. The other datasets are from OpenStreetMap, a digital map database built through crowdsourced volunteered geographic information. L7, P2 and A2 respectively contain all the linestrings, points and polygons on the planet from OpenStreetMap, and each of these datasets contains more than 1 billion segment/point/edge items. To highlight the superiority of HiVision, we compare it with three typical data-driven methods, namely HadoopViz, GeoSparkViz and Mapnik. All the methods are deployed on the cluster with four nodes.
HadoopViz and GeoSparkViz are implemented on Hadoop and on the in-memory Spark, respectively, with well-load-balanced task partitioning strategies; given a spatial dataset, these methods generate all the tiles of the given zoom levels. Mapnik is an open-source toolkit for rendering maps: its inputs are the spatial objects in the tile range and the rendering styles, and its output is the rendered map tile. In the experiment, the tile rendering tasks are stored in a queue and multiple Mapnik rendering processes are launched to process the tasks successively; in total, we started 128 Mapnik rendering processes in the cluster. HiVision is set to run with 128 MPI processes and 2 OpenMP threads in each process.

Table 3
Datasets used in the experiments.

Dataset  Type  Records  Size
L1: OSM postal code areas boundaries  Linestring  171,226  65,334,342 segments
L2: OSM boundaries of cemetery areas  Linestring  193,076  1,800,980 segments
L3: OSM sporting areas boundaries  Linestring  1,767,137  18,969,047 segments
L4: OSM water areas boundaries  Linestring  8,419,324  376,208,235 segments
L5: OSM parks green areas boundaries  Linestring  9,961,896  454,636,308 segments
L6: OSM roads and streets  Linestring  72,339,945  717,048,198 segments
L7: OSM all linestrings on the planet  Linestring  106,269,321  1,578,947,752 segments
P1: OpenCelliD cell tower locations  Point  40,719,479  40,719,478 points
P2: OSM all points on the planet  Point  2,682,401,763  2,682,401,763 points
A1: OSM buildings  Polygon  114,796,734  689,197,342 edges
A2: OSM all polygons on the planet  Polygon  177,662,806  2,077,524,465 edges

For each dataset, we generate tiles of zoom levels 1, 3, 5, 7 and 9 with the different methods; the numbers of tiles in these levels are 4, 64, 1024, 16384 and 262144, respectively. Figure 11 shows the total rendering time of all the tiles in zoom levels 1, 3, 5, 7 and 9 with the different methods.
GeoSparkViz shows high performance among the data-driven methods: for all the datasets, it takes the shortest time of the three. Comparing HiVision with GeoSparkViz, GeoSparkViz performs better when the dataset scale is small (such as the smaller linestring datasets), while HiVision outperforms GeoSparkViz on the larger datasets. For the billion-scale datasets L7, P2 and A2, HiVision shows high performance, taking only a fraction of the GeoSparkViz rendering time on each of them (about 38.33% in the first case). From L1 to L7, the data size increases sequentially, yet there is no significant upward trend in the tile rendering time using HiVision; in contrast, the tile rendering time of the data-driven methods grows rapidly as the data scale increases. Surprisingly, using HiVision, L7, the largest linestring dataset with more than 1 billion segments, produces better performance than some datasets of much smaller scale. The experimental results show that HiVision is less sensitive to data volumes. In addition, HiVision produces better visual effects: neither HadoopViz nor GeoSparkViz performs anti-aliasing or polygon filling, and both treat polygon objects as linestring objects. Mapnik, as a mature map cartography tool, can visualize spatial objects with various styles; however, it failed to process the billion-scale datasets L7, P2 and A2. In conclusion, compared with traditional data-driven methods, HiVision delivers higher performance with better visual effects when dealing with large-scale spatial vector data. Figure 12 shows the tile rendering speed at different zoom levels. As the results illustrate, the data-driven methods render tiles quickly at high zoom levels, but their speed drops rapidly at low zoom levels. This is because the number of spatial objects in a tile range can be extremely large when the zoom level is low, and using DataDC each object must be plotted and merged successively to generate the final visual effect. Given a large-scale spatial dataset, as the number of spatial objects in a view is unpredictable, it is difficult for data-driven methods to provide efficient visualization at all zoom levels; by contrast, with the display-driven HiVision, the tile rendering speed remains stable across zoom levels, so interactive exploration of the dataset is easy to provide at every zoom level. Compared with data-driven methods, HiVision shows obvious advantages where the density of spatial objects is high, but tends to be slower where the density is low. In future work, we will consider combining display-driven and data-driven computing to provide interactive spatial visual analysis with better performance.
In this experiment, we test the ability of HiVision to support real-time exploration of large-scale spatial vector data. HiVision is set to run with 64 MPI processes and 4 OpenMP threads in each process. For each dataset, we generate 10000 tile rendering tasks through a test program that randomly requests tiles from zoom levels 3 to 15. HiVision is configured with no cache preserved in the Result Pool, so that each requested tile leads to a new task in the Task Pool; this evaluates the performance of the visualization engine more precisely.
Figure 11: Total rendering time of generating all the tiles in zoom levels 1, 3, 5, 7 and 9.

Figure 13 (a) shows the total rendering time of 10000 tiles on different datasets using HiVision. Of all the datasets, A2 produces the poorest performance, at a speed of 356.69 tiles/s; since the number of tiles displayed on a screen is generally no more than 100 (far fewer than the number of tiles generated per second for A2), it is possible to perform real-time visualization with HiVision on all the datasets. As shown in Figure 13 (b), the rendering time distributions of each tile on different datasets are visualized as box plots ('×' marks the average rendering time of each tile). For A2, which produces the poorest performance, most tiles are rendered within 0.24s. Assume a browser requests 100 different tiles of A2: as there are 64 running MPI processes in total, the 100 tasks will be processed in two rounds, with 28 (= 64 processes × 2 − 100 tasks) MPI processes idle in the second round, and all the tasks will most likely complete in less than 0.48s (= 0.24s × 2). In conclusion, HiVision is able to provide interactive exploration of large-scale spatial vector data.
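The round-based latency estimate above generalizes to simple ceiling arithmetic; this is a sketch of the reasoning, not HiVision code:

```cpp
// Worst-case latency estimate for a burst of tile requests: with P
// worker processes, a burst of T tasks is served in ceil(T/P) rounds,
// and the last round leaves rounds*P - T processes idle.
#include <cassert>

int rounds(int tasks, int procs) { return (tasks + procs - 1) / procs; }

int idle_in_last_round(int tasks, int procs) {
    return rounds(tasks, procs) * procs - tasks;
}
```

With 100 tasks and 64 processes this gives 2 rounds and 28 idle processes, matching the estimate in the text; the expected latency is then roughly the per-tile time multiplied by the number of rounds.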
In the experiments other than this one, all tile requests are dispatched to HiVision simultaneously, which means the request rate is effectively infinite, and HiVision runs at full load until all tasks are finished. In practical applications, tile requests are generally generated at much lower rates. In this experiment, HiVision is set to run with 64 MPI processes and 4 OpenMP threads in each process; thus 64 tile rendering processes can render 64 tiles at the same time (redundant tasks are stored in the Task Pool waiting for idle processes). The number of tiles requested per second is set to a multiple of 64: the request rate is set to 128, 256, 512, 1024, 2048, 4096 and infinity (INF) tiles/s, respectively. For each rate, we generate 10000 random tile tasks from zoom levels 3 to 15.
The rendering time distributions of each tile at different request rates are shown in Figure 14. The performance limit is the number of tiles rendered per second at a rate of INF tiles/s (Figure 13 (a)). For all the datasets, the rendering time of tiles increases noticeably with the request rate when the rate is below the performance limit, because of growing resource competition between processes. When the request rate exceeds the performance limit, the rendering time of tiles does not change significantly, because the visualization engine is already running at full load and further increases in the request rate have little effect on per-tile performance; however, as extra tasks accumulate in the Task Pool, the response time of HiVision increases rapidly once the request rate exceeds the performance limit. In conclusion, compared with the results in the other experiments, higher performance can be achieved in practical applications thanks to the lower request rates.
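This queueing behavior can be captured by a one-line fluid model: the backlog, and hence the response time, grows only when the request rate exceeds the performance limit. A sketch with illustrative numbers:

```cpp
// Fluid approximation of the Task Pool backlog: tiles arrive at `rate`
// per second and the engine serves at most `capacity` per second, so
// after t seconds the queue holds max(0, (rate - capacity) * t) tasks.
#include <cassert>

long backlog(long rate, long capacity, long t) {
    long excess = rate - capacity;
    return excess > 0 ? excess * t : 0;
}
```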
To evaluate the parallel scalability, HiVision is run with 4, 8, 16, 32, 64, 128 and 256 MPI processes and with 1, 2, 4 and 8 OpenMP threads in each process. For each pair of MPI processes and OpenMP threads,
Figure 12: Tile rendering speed at different zoom levels ((a)–(k): L1, L2, L3, L4, L5, L6, L7 (Mapnik failed), P1, P2 (Mapnik failed), A1, A2 (Mapnik failed); each panel compares HiVision, HadoopViz, GeoSparkViz and Mapnik).

Figure 13: Tile rendering time of HiVision on different datasets ((a) rendering time of 10000 tiles; (b) rendering time of each tile).

Figure 14: Rendering time of each tile at different request rates in HiVision (performance limits in tiles/s: L1 1357.11, L2 1329.82, L3 1188.54, L4 717.92, L5 1057.94, L6 785.56, L7 703.87, P1 1238.77, P2 480.73, A1 683.49, A2 356.69).

we generate 1000 random tile requests of different zoom levels on each dataset. The experimental results are plotted in Figure 15.
We analyze the rendering time of 1000 tiles. HiVision achieves nearly linear parallel acceleration when the number of processes is at most 32; the parallel acceleration degrades once the process count exceeds 32, especially when running with 8 OpenMP threads in each process. This is because increasing the number of processes intensifies resource competition, and the competition is even more intense when running with multiple threads: for example, in Figure 15 (d), the rendering time of 1000 tiles with 8 threads increases as the process count grows from 128 to 256. We then analyze the average rendering time of each tile with different numbers of OpenMP threads. As the figures show, multi-threaded processing reduces the rendering time of a tile when resource competition is not intense. Surprisingly, for the smaller-scale datasets, running with 2 threads produces weaker performance than 1 thread even when resource competition is not intense, because the initialization cost of multiple threads outweighs the parallel acceleration. Based on these results and the analysis, a conclusion can be drawn about deploying HiVision in the given hardware environment: 1) if the number of requested tiles is high, 256 processes with few OpenMP threads each are preferable; 2) if the number of requested tiles is low, 32 processes with more OpenMP threads each are preferable.

Figure 15: Parallel performance of HiVision with different numbers of MPI processes (4–256) and OpenMP threads (1, 2, 4 and 8): rendering time of 1000 tiles and average rendering time of each tile.
Table 4
Environment of the online demo.
Item Description
CPU  4 cores, Intel(R) Xeon(R) CPU
Memory  32 GB
Operating System  CentOS 7
Table 5
Datasets of China.
Dataset Type Records Size
China roads  Linestring  21,898,508  163,171,928 segments
China points  Point  20,258,450  20,258,450 points
China farmland  Polygon  10,520,644  133,830,561 edges
6. Online Demo
An online demonstration of HiVision is provided. The 10-million-scale datasets (see Table 5) used in the demonstration are provided by map service providers. As the datasets are not openly published, the raw datasets are encrypted by adding offsets. Note that the current demonstration is deployed on a stand-alone server with a 4-core CPU and 32 GB of memory (see Table 4), a configuration comparable to an up-to-date personal computer. Even so, as the demonstration illustrates, HiVision can still provide interactive visualization of the 10-million-scale datasets.
7. Conclusions and Future Work
In this paper, we present HiVision, a display-driven visualization model for interactive exploration of large-scale spatial vector data. Different from traditional methods, the computing units in HiVision are pixels rather than spatial objects, which makes it far less sensitive to data volumes. Several experiments were designed and conducted to evaluate different aspects of system performance: experiment 1 shows that, compared with traditional data-driven methods, our approach achieves higher performance with better visual effects; experiment 2 demonstrates the ability of HiVision to provide interactive exploration of large-scale spatial vector data; experiment 3 analyzes the impact of the request rate in HiVision and shows that higher performance can be achieved in practical applications than in the experiments; experiment 4 tests the parallel scalability of HiVision, and the results show that HiVision achieves high parallel acceleration as long as resource competition is not intense. Moreover, an online demonstration of HiVision is provided on the Web, which verifies that HiVision is capable of handling 10-million-scale spatial data even when deployed on a personal computer. Our future work will focus on extending HiVision to support more complex visualization styles and on applying HiVision to the field of map cartography.
Source code availability
The source code of HiVision, including test data and user manuals, is available for download from GitHub (https://github.com/MemoryMmy/HiVision).
References
A, P., . Mapnik. URL: https://mapnik.org.
Aji, A., Wang, F., Vo, H., Lee, R., Liu, Q., Zhang, X., Saltz, J., 2013. Hadoop GIS: a high performance spatial data warehousing system over MapReduce. Proceedings of the VLDB Endowment 6, 1009–1020.
Aly, A.M., Mahmood, A.R., Hassan, M.S., Aref, W.G., Ouzzani, M., Elmeleegy, H., Qadah, T., 2015. Aqwa: adaptive query workload aware partitioning of big spatial data. Proceedings of the VLDB Endowment 8, 2062–2073.
B, K., a. Mapserver. URL: https://mapserver.org.
B, Y., b. Geoserver. URL: http://geoserver.org.
Bellur, U., 2014. On parallelizing large spatial queries using map-reduce, in: International Symposium on Web and Wireless Geographical Information Systems, Springer. pp. 1–18.
Figure 16: Visualized results of the online demo ((a) China POI; (b) China roads; (c) China farmlands; (d) pattern filling for polygon objects).
Choubey, R., Chen, L., Rundensteiner, E.A., 1999. Gbi: A generalized r-tree bulk-insertion strategy, in: International Symposium on Spatial
Databases, Springer. pp. 91–108.
Eldawy, A., Elganainy, M., Bakeer, A., Abdelmotaleb, A., Mokbel, M., 2015. Sphinx: Distributed execution of interactive sql queries on big spatial data, in: Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM. p. 78.
Eldawy, A., Mokbel, M.F., 2015. Spatialhadoop: A mapreduce framework for spatial data, in: 2015 IEEE 31st international conference on Data
Engineering, IEEE. pp. 1352–1363.
Eldawy, A., Mokbel, M.F., Jonathan, C., 2016. Hadoopviz: A mapreduce framework for extensible visualization of big spatial data, in: 2016 IEEE
Eldawy, A., Yuan, L., Mokbel, M.F., Janardan, R., 2013. Cg-hadoop: Computational geometry in mapreduce, in: Acm Sigspatial International
Conference on Advances in Geographic Information Systems.
Fan, J., Ji, M., Gu, G., Sun, Y., 2014. Optimization approaches to mpi and area merging-based parallel buffer algorithm. Boletim de Ciências
Geodésicas 20, 237–256.
Fries, S., Boden, B., Stepien, G., Seidl, T., 2014. Phidj: Parallel similarity self-join for high-dimensional vector data with mapreduce, in: 2014
IEEE 30th International Conference on Data Engineering, IEEE. pp. 796–807.
Gao, J., Wang, C., Li, L., Shen, H.W., 2005. A parallel multiresolution volume rendering algorithm for large data visualization. Parallel Computing
31, 185–204.
Ghosh, S., Eldawy, A., Jais, S., 2019. Aid: An adaptive image data index for interactive multilevel visualization, in: 2019 IEEE 35th International
Conference on Data Engineering (ICDE), IEEE. pp. 1594–1597.
Guo, M., Guan, Q., Xie, Z., Wu, L., Luo, X., Huang, Y., 2015. A spatially adaptive decomposition approach for parallel vector data visualization of polylines and polygons. International Journal of Geographical Information Science 29, 1419–1440.
Guo, M., Huang, Y., Guan, Q., Xie, Z., Wu, L., 2017. An efficient data organization and scheduling strategy for accelerating large vector data rendering. Transactions in GIS 21, 1217–1236.
Huang, X., 2013. Parallel buffer generation algorithm for gis. Journal of Geology and Geosciences 02.
Kraak, M.J., Ormeling, F.J., 2013. Cartography: visualization of spatial data. Routledge.
Lins, L., Klosowski, J.T., Scheidegger, C., 2013. Nanocubes for real-time exploration of spatiotemporal datasets. IEEE Transactions on Visualization and Computer Graphics 19, 2456–2465.
Liu, Z., Jiang, B., Heer, J., 2013. immens: Real-time visual querying of big data, in: Computer Graphics Forum, Wiley Online Library. pp. 421–430.
Lu, P., Chen, G., Ooi, B.C., Vo, H.T., Wu, S., 2014. Scalagist: scalable generalized search trees for mapreduce systems [innovative systems paper].
Proceedings of the VLDB Endowment 7, 1797–1808.
Ma, M., Wu, Y., Chen, L., Li, J., Jing, N., 2019. Interactive and online buffer-overlay analytics of large-scale spatial data. ISPRS International
Journal of Geo-Information 8.
Ma, M., Wu, Y., Luo, W., Chen, L., Li, J., Jing, N., 2018. Hibuffer: Buffer analysis of 10-million-scale spatial data in real time. ISPRS International Journal of Geo-Information 7.
Maceachren, A.M., Gahegan, M., Pike, W., Brewer, I., Cai, G., Lengerich, E., Hardisty, F., 2004. Geovisualization for knowledge construction and decision support. IEEE Computer Graphics & Applications 24, 13–17.
OmniSci, 2020. OmniSci technical white paper.
OpenGIS, 2010. Web map tile service implementation standard. URL: http://portal.opengeospatial.org/files/?artifact_id=35326.
Pahins, C.A., Stephens, S.A., Scheidegger, C., Comba, J.L., 2016. Hashedcubes: Simple, low memory, real-time visual exploration of big data.
IEEE transactions on visualization and computer graphics 23, 671–680.
Scitovski, S., 2018. A density-based clustering algorithm for earthquake zoning. Computers & Geosciences 110, 90–95.
Shen, J., Chen, L., Wu, Y., Jing, N., 2018. Approach to accelerating dissolved vector buffer generation in distributed in-memory cluster architecture.
ISPRS International Journal of Geo-Information 7, 26.
Shimrat, M., 1962. Algorithm 112: position of point relative to polygon. Communications of the ACM 5, 434.
Sommer, S., Wade, T., 2006. A to Z GIS: An Illustrated Dictionary of Geographic Information Systems. Esri Press.
Tang, W., 2013. Parallel construction of large circular cartograms using graphics processing units. International Journal of Geographical Information
Science 27, 2182–2206.
Tong, X., Ben, J., Liu, Y., et al., 2013. Modeling and expression of vector data in the hexagonal discrete global grid system. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 4, W2.
Wan, L., Huang, Z., Peng, X., 2016. An effective nosql-based vector map tile management approach. ISPRS International Journal of Geo-Information
5, 215.
Wang, T., Zhao, L., Wang, L., Chen, L., Cao, Q., 2016. Parallel research and opitmization of buffer algorithm based on equivalent arc partition.
Remote Sensing Information , 147–152.
Wikipedia, 2019. Vector tiles. URL: https://wiki.openstreetmap.org/wiki/Vector_tiles.
Wikipedia, 2020. Memory-mapped file. URL: https://en.wikipedia.org/wiki/Memory-mapped_file.
Xie, D., Li, F., Yao, B., Li, G., Chen, Z., Zhou, L., Guo, M., 2016. Simba: Spatial in-memory big data analysis, in: Proceedings of the 24th ACM
SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM. p. 86.
Yang, C., Wong, D.W., Yang, R., Kafatos, M., Li, Q., 2005. Performance-improving techniques in web-based gis. International Journal of Geo- graphical Information Science 19, 319–342.
Yao, X., Li, G., 2018. Big spatial vector data management: a review. Big Earth Data 2, 108–129.
Yu, J., Wu, J., Sarwat, M., 2015. Geospark: A cluster computing framework for processing large-scale spatial data, in: Proceedings of the 23rd
SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM. p. 70.
Yu, J., Zhang, Z., Sarwat, M., 2018. Geosparkviz: a scalable geospatial data visualization framework in the apache spark ecosystem, in: Proceedings of the 30th International Conference on Scientific and Statistical Database Management, ACM. p. 15.
Zhu, X., Huo, J., Qiu, Q., 2015. A novel methodology for parallel spatial overlay over vector data - a case study with shape file, in: 2015 IEEE
International Geoscience and Remote Sensing Symposium (IGARSS), IEEE. pp. 4522–4525.
Zuchao, W., Min, L., Xiaoru, Y., Junping, Z., Huub, V.D.W., 2013. Visual traffic jam analysis based on trajectory data. IEEE Transactions on
Visualization & Computer Graphics 19, 2159–2168.