Geo-Spatial Data Visualization and Critical Metrics Predictions for Canadian Elections
Mohammad Abdul Hadi
Department of Computer Science, University of British Columbia
British Columbia, [email protected]
Fatemeh Hendijani Fard
Department of Computer Science, University of British Columbia
British Columbia, [email protected]
Irene Vrbik
Department of Computer Science, University of British Columbia
British Columbia, [email protected]
Abstract: Open data published by various organizations is intended to make the data available to the public. All over the world, numerous organizations maintain a considerable number of open databases containing many facts and figures. However, most of them do not offer a concise and insightful data interpretation or visualization tool that can help users process all of the information in a consistently comparable way. Canadian Federal and Provincial Elections data is an example of such databases. This information exists on numerous websites as separate tables, so the user needs to traverse a tree structure of scattered information on the site and is left to make comparisons without proper tools, data interpretation, or visualizations. In this paper, we provide technical details of addressing this problem, using the Canadian Elections data (since 1867) as a specific case study, as it poses numerous technical challenges. We hope that the methodology used here can help in developing similar tools to achieve some of the goals of publicly available datasets. The developed tool contains data visualization, trend analysis, and prediction components. The visualization enables users to interact with the data through various techniques, including geospatial visualization. To make the results reproducible, we have open-sourced the tool.
Index Terms: Open data; Geo-spatial visualization; Canada-map; Election-visualization; Election-database-scraper; Open-source tool
I. INTRODUCTION
Open data is a movement to make data accessible for public use, intending to allow people to manipulate the data (e.g., using software tools) for, among other purposes, linking datasets, mapping, and visualization [1]. Canada is one of the leading countries in this effort, with Open Parliament as an example of making data available to the public and of data-enriched citizen engagement in policy [3]. The effective use of open data requires many considerations, including technical challenges, interpretation of the data, and availability of visualizations [1]. Although a massive amount of data is available and is maintained by the government as well as independent organizations in the form of web-based databases, concise and insightful data interpretation using adequate visualization techniques is lacking [4], [2]. There are multiple efforts toward making publicly available data usable, such as DBPedia, which extracts structured information from Wikipedia [5], and narrative visualization of Swiss open data [2].
Visualizing the data can help in interpreting the results [2], and it can be powered by data analytics and prediction models. However, few works integrate visualizations with more advanced techniques such as trend analysis and prediction components in one place. Moreover, when the geospatial data is not included in the metadata of the tables, plotting the data points on maps is a challenge, as extracting and presenting the correct information is difficult. Therefore, in this paper, we provide details of the tool we developed to collect, visualize, and analyze open data. We made the tool available as open source.
We use the Canadian Federal and Provincial Elections data as a case study, as it contains various smaller tables and numerous branches that a user needs to traverse. The dataset includes the election results since 1867 and is challenging to interpret as separate tables. Therefore, it is a valuable case of publicly open data that cannot be used efficiently.
Motivation of the case study. The 2019 Canadian Federal Election took place on October 21, 2019, to elect members of the House of Commons to the 43rd Canadian Parliament. As part of the election campaign, participating parties distributed a large number of pamphlets and leaflets among voters. But to make an informed decision, voters must know the outcomes and details of past elections. Voters use different websites to gather data about the number of candidates, party popularity over time, or the geographic trends of party support for any given election, and then analyze the data further for trends. To the best of our knowledge, no data-interpretation or visualization tool has been developed that lets voters gather all this information effortlessly.
Objective. We intend to design an interactive platform that would provide a geospatial visualization of Canadian provinces and color-code them according to the party-wise election outcomes in any chosen election year. The Canadian map representing election data would also be accompanied by graphs that improve users' comprehension of past elections and help them see underlying trends for forthcoming elections. Users would be provided options from all the available federal and provincial elections over the past 152 years.
We also intend to integrate an auxiliary component in our tool to predict some important metrics for future elections by using the available data from past elections. Moreover, the technical details and the approach that we provide in this paper can help to develop similar tools and overcome some challenges in interpreting open data for non-expert users.
Open-source repository: https://github.com/Mohammad-Abdul-Hadi/scraper-for-canadianelectionsdatabase.ca
Contribution. We deliver a visualization tool in which a Choropleth map shows the outcome of elections in a specific election year, with geographical regions colored according to the winning party in each region. Users can also explore the massive amount of election data efficiently, as all correlated factors or metrics, such as seats won by political parties, votes shared by political parties, and seats shared by political parties for any given election, are categorically and graphically represented.
For the election data, we have relied on a web-embedded database created by The School of Public Policy of the University of Calgary, where we can get all the relevant data for all the elections since 1867.
The rest of this paper is organized as follows. In Section II, the methodology and the architecture of the tool are explained, followed by Results of User Experiments and Discussions and Conclusions in Sections III and IV.
II. METHODOLOGY AND TOOL ARCHITECTURE
The architecture of the tool consists of three main components. Each component requires different technologies to accomplish a given job. The functions of these components are explained below:
A. Python Scraper: We designed a Python scraper to efficiently scrape the web-embedded database and store all the information in distinct comma-separated values (.csv) files. These files are the input of the other parts of the tool.
B. Geo-spatial Map Visualization: An R script integrates various packages and libraries to enable the visualization of the geospatial map of Canada. It ensures that the maps represent the election data gathered by the scraper.
C. Trend Analysis Component: Tableau is used to generate different graphs that help the user comprehend the information better. This component also provides forecasting and trend analysis for certain metrics, e.g., the number of candidates or the number of seats won by a certain party. Different options are provided so that each user can choose among various representations of the same information and visualize different graph types (e.g., horizontal/vertical bar chart, pie/donut chart, and heat map).
In the following, we describe the details of these three components.
A. Python Scraper
For this case study, we need to extract a validated dataset of Canadian elections from different resources. The database developed by Dr. Anthony Sayers at the Department of Political Science at the University of Calgary is one of the most reliable, consistent, and accessible databases for Canadian elections [6]. The details of the database are given here: http://canadianelectionsdatabase.ca/. This database contains information on federal, provincial, and territorial elections since 1867 and is arranged by election, party, candidate, and district. The database allows users to explore the data in numerous ways. We have chosen this database as it contains around four million data points [6] and represents the difficulties of interpreting open data for users.
Fig. 1. View of web-embedded Database to .csv file extraction (static pages)
A challenge of working with this data is the lack of a download option to retrieve and use this information (e.g., to import it into analysis tools). Therefore, we developed a scraper to scrape and extract the necessary data and store it in an appropriate format that reduces further data cleaning and polishing, whereas storing the data as it is found in the database may lead to exhaustive data formatting for reuse. The extracted data is intended to be fed to the other two components. An example of one of the tables in this data (from the database website) is shown in the top part of Fig. 1. After the whole scraping process, we acquire all the data stored in the web-embedded tables and save them as .csv files (shown in the bottom part of Fig. 1) so that the data can be easily accessed later by the data interpretation or visualization tool.
Fig. 2. Basic Map of Canada (without color-coding)
We have chosen Python to develop the scraper component as it provides a powerful, robust package for web crawling and data scraping, namely "Beautiful Soup." Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup such as non-closed tags (the name comes from "tag soup"). It creates a parse tree for parsed pages that can be used to extract data from HTML web pages [7]. Once the HTML page has been downloaded by sending a request to the server with the corresponding URL, Beautiful Soup lets us search, identify, and retrieve the required components (document segments) of the page, such as "table", using their designated class or id. Once we locate the required element, the data is parsed into a list to be stored later in a .csv file. The intermediate list makes it possible to modify large chunks of data as required by the other components in the project (for convenience and reusability). The scraped election data is stored as census-district-level data and province-territory-level data, shown in the top part of Fig. 1. We have named the tables storing different election information in the following format: "⟨election-type⟩ ⟨year⟩", where election-type refers to either a Federal or Provincial election and year refers to the year when the election took place.
A particular challenge was that some pages communicate with the database using dynamic AJAX and JS requests. Beautiful Soup does not provide support for such scenarios.
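The static-page extraction described above (locate a table by its id, parse the cells into an intermediate list, then write a .csv file) can be sketched as follows. This is a minimal illustration and not the authors' scraper: to stay self-contained it uses the standard library's html.parser as a stand-in for Beautiful Soup, and the table id "results", the sample HTML, and the party names are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of one <table>, selected by its id attribute."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0        # > 0 while the parser is inside the target table
        self.in_cell = False  # True while inside a <td> or <th>
        self.rows = []        # intermediate list of rows (lists of strings)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth or dict(attrs).get("id") == self.table_id:
                self.depth += 1
        elif self.depth and tag == "tr":
            self.rows.append([])
        elif self.depth and tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.depth and self.in_cell:
            self.rows[-1][-1] += data


def table_to_csv(html, table_id):
    """Return (rows, csv_text) for the table with the given id."""
    parser = TableExtractor(table_id)
    parser.feed(html)
    # Clean the intermediate list before it is written out.
    rows = [[cell.strip() for cell in row] for row in parser.rows if row]
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return rows, buffer.getvalue()


# Hypothetical page fragment; the last row deliberately lacks its closing </tr>.
SAMPLE_HTML = """
<table id="results">
  <tr><th>Party</th><th>Seats</th></tr>
  <tr><td> Party A </td><td>12</td></tr>
  <tr><td>Party B</td><td>7</td>
</table>
"""

rows, csv_text = table_to_csv(SAMPLE_HTML, "results")
print(csv_text)
```

Keeping the rows in an in-memory list before writing them mirrors the role of the intermediate list in the text: values can be cleaned (here, whitespace is stripped) before anything reaches the .csv file.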
Fig. 3. Basic Map of Canada (with color-coded provinces)
The pages that rely on AJAX and JS requests were inspected separately for the specific requests that return the data we require. To retrieve the data fetched by an AJAX or JS request, our code listens on the port that communicates with the database and downloads the database response. To accomplish this task, we utilized another powerful, robust package, Selenium, which enabled us to capture all the AJAX and JS responses. Lastly, these methods are highly dependent on the "connection-time." If the response is not received within a specified period, the method stops listening to the port. To address this issue, a try-catch block wraps the method and treats the connection-time as an incrementing variable. As a result, if the method cannot capture a response within the specified connection-time, it tries again with an increased connection-time, and the connection-time returns to its base value after each loop.
The static and dynamic pages are handled differently, as explained above. Each program extracts and stores the retrieved data separately.
B. Geo-spatial Map Visualization
For this part of the project, we developed a Choropleth map (i.e., a map in which areas are shaded in proportion to a statistical variable) of Canada and integrated it with the election information. The gist of the process is grabbing a shapefile (a geospatial vector data format for geographic information system software) and converting it to simple-features objects to be used in R.
The tidyverse is used as the core package to convert map data into a plot. Converting map data into a format that R packages can use requires several technical steps, as the data cannot be used directly with the methods provided in the tidyverse libraries. Other packages (sf, rgdal, geojsonio, spdplyr, rmapshaper) are also used, which provide functionality for the conversion and mapping process. As part of this process, we have built a separate function, theme_map(), as a ggplot theme that turns off insignificant pieces of the plot (so that it looks neat).
Fig. 4. Basic Map of Canada (with color-coded census-district)
The next step concerns the actual map data. The shapefile (.shp) is the most popular and widely used standard format for map data. In our tool, we selected ArcGIS .shp files as they contain the category Census divisions and cartographic boundaries, which is convenient for integrating the election data. These shapefiles are imported into our R script as an object using the readOGR function of the rgdal package and are then converted into GeoJSON format to simplify the polygons. This GeoJSON file becomes the building block for further components. After that, we read the GeoJSON file back as an sf (simple features) object.
Because these steps of data importing, converting, and thinning take a long time to execute, the resulting data are saved for further use and made available to other researchers. The .geojson file is also included in the repository, so for test purposes, one can load the data and start working from this point.
Figs. 2, 3, and 4 show the generated maps before the election data is merged. Fig. 2 shows the map without any color code applied to the region; Fig. 3 shows the map with color-coded provinces; and Fig. 4 shows the map with color-coded census districts throughout Canada.
Finally, the scraped election data, stored as census-district-level data and province-territory-level data, are converted into data frames and merged separately with the map data produced as an sf (simple features) object, namely canada-cd (this is essentially a big data frame specifying a large number of lines that need to be drawn on the plot).
This merging step required careful attention while matching the key variables to avoid introducing missing values; otherwise, the lines on the map would not have joined smoothly.
TABLE I. CORRESPONDING ELECTION DATA EXTRACTED FROM WEB-EMBEDDED DATABASE FOR MAP GENERATED IN FIGURE
If missing values are introduced, the result is a shredded map, as R automatically tries to fill the missing portion of the polygons [8]. For this reason, when plotting data on maps, joining two datasets on a character-typed column should be avoided [8]. An example illustrates the issue. If we had taken the PRNAME (province name) column to merge the data, it would have introduced null values in the resulting data frame, because the province "British Columbia" is referred to as "British Columbia (BC)/Colombie britannique" in the election data frame but as "British Columbia (BC)" in the corresponding map data frame. Broken maps are usually caused by this kind of merge error. As another example, one of the province names could contain a leading or trailing space as a result of data-extraction limitations (like "Alberta" and "Alberta "), which would cause the join to fail. Therefore, another attribute is used to merge the data (in our case the province ID, PRID), as it has a numeric value. Other issues that we considered for plotting maps are converting the numeric columns back to factors and using a specific data type (discrete or continuous) to avoid errors.
Fig. 5. View of web-embedded Database (static pages)
Fig. 6. Information about Canadian Federal Election in 1963 is represented in Vertical Bar-chart
Once the joining is complete, our R script generates the map with conventional ggplot, filling by the PRID (Province ID) for federal elections or the CDUID (Census-District ID) for provincial elections. Our interactive platform lets the user choose any election (federal or provincial) and the year. We then select the table storing the specified information, merge it with the map data, and generate the Choropleth map for visualization.
For demonstration purposes, we have randomly selected one map (for the sake of brevity) that colors the regions based on the winning parties in the latest Provincial and Territorial election; it is shown in Fig. 5. This figure is produced from the data shown in Table I. At the bottom of the figure, all the party names are assigned different colors, and the regions of the geospatial map are painted with the color of the party that won in each region.
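The merge-key issue discussed above can be illustrated with a small, self-contained sketch. It is written in Python for illustration only (the tool itself performs this merge in R), the left_join helper is hypothetical, and the party names and numeric PRID values are illustrative assumptions. The province-name strings are the ones quoted in the text.

```python
def left_join(map_rows, election_rows, key):
    """Left-join election rows onto map rows; unmatched rows get winner=None."""
    lookup = {row[key]: row for row in election_rows}
    joined = []
    for row in map_rows:
        match = lookup.get(row[key])
        joined.append({**row, "winner": match["winner"] if match else None})
    return joined


# Map-side attributes (names as they appear in the map data frame).
map_rows = [
    {"PRID": 59, "PRNAME": "British Columbia (BC)"},
    {"PRID": 48, "PRNAME": "Alberta"},
]

# Election-side rows; the names differ slightly from the map side.
election_rows = [
    {"PRID": 59, "PRNAME": "British Columbia (BC)/Colombie britannique",
     "winner": "Party A"},
    {"PRID": 48, "PRNAME": "Alberta ", "winner": "Party B"},  # trailing space
]

by_name = left_join(map_rows, election_rows, "PRNAME")
by_id = left_join(map_rows, election_rows, "PRID")

# Regions that would be left without a fill value on the map.
missing_by_name = [row["PRNAME"] for row in by_name if row["winner"] is None]
missing_by_id = [row["PRNAME"] for row in by_id if row["winner"] is None]

print("unmatched when joining on PRNAME:", missing_by_name)
print("unmatched when joining on PRID:", missing_by_id)
```

Joining on the name column leaves both regions unmatched, which corresponds to the null values behind a broken map, while the numeric key matches every region. Counting unmatched rows after a merge is a cheap check to run before plotting.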
At the bottom right corner of Fig. 5, we have attached the associated table data that was retrieved from the web-embedded database. Please note that this map is selected randomly from over 400 maps generated from the scraped data (31 elections for each of the 13 provinces and territories).
Fig. 7. Information about Canadian Federal Election in 1963 is represented in Pie-chart
C. Trend Analysis Graphs
Visualizing election data on maps is helpful, but maps alone cannot provide all of the required information and analysis. We developed the last component of our tool architecture to mitigate the limitations of the map and to provide predictions on particular metrics for both federal and provincial elections [9]. We have used Tableau to generate different graphs of the users' choice.
A demonstration of how a user can choose from different data-interpretation or plotting techniques is provided in Figs. 6 and 7. The user can choose any of these illustrations from the platform. Both figures represent the Canadian election data from 1963. In Fig. 6, we have used a vertical bar chart to interpret three important metrics: "percentage of seats won," "the number of votes won," "percentage of seats won." The same metrics are also represented by the pie chart in Fig. 7. Using these representations, any user can quickly get the gist of the whole election and easily identify the most and least prominent players in the election. We acknowledge that not every user is comfortable with the same representation. Keeping that in mind, in addition to these two representations, we provide users with other conventional plotting techniques such as horizontal bar charts, donut charts, scatter plots, and line plots. From our iterative user study, we understood that providing different options for users to explore the data is essential, as the data has different aspects and covers a wide range.
Fig. 8. Total candidate numbers in the federal elections since 1867
Fig. 9. Federal parties' history of winning in the past elections
The next illustration, Fig. 8, shows the total number of candidates in the federal elections since 1867. We can see that the number of candidates is ever increasing. We have implemented a simple Linear Regression algorithm to predict the number of candidates in the 2019 election [11]. The prediction in the illustration (number of candidates in the 2019 election: 336) is close to the actual result. We have also added information about the median and average number of candidates. If anyone wants details of an election since 1867, they can hover the mouse over the specific point on the line, and a text box pops up with all the relevant information about that election. For future predictions, the user can hover over any point on the best-fit line.
The illustration in Fig. 9 shows the federal parties' history of winning in past elections. We have chosen a heat map as it has previously been used to compare volumetric differences using color intensities. Therefore, a user can easily grasp information like the following: "In the history of Canada, a specific party has won the most elections." This is a comment we received from a user during the testing phase of the tool [12]. Users can also see that the second and third most popular parties in Canada have always been neck and neck when it comes to winning federal elections, although the user may need certain knowledge about the party names and which parties are still in play. Users can gain this knowledge from the other graphs discussed in the project.
From our initial study, users revealed that having only year-wise or election-wise data representation is not helpful enough. Our users requested a progression map where all the practical data is present and accessible at their convenience. In Fig. 10, a representation is shown where the information of all of the parties is put together (e.g., "number of seats won," "percentage of seats won," "percentage of votes won") along with the regression lines generated for each party. This graph can identify which party has been consistent over the years, which party has been trending over the past couple of elections, and how each party's popularity (a feature depicted by other attributes like percentage of seats and votes won) has changed [13]. For example, from this illustration, we can see that although the Liberal Party has won most of the elections throughout history, its popularity has been slowly decreasing over the long run. In contrast, the Conservative Party of Canada has been gaining tremendous support in recent decades, and Unionists have been holding their position steadily throughout history [14].
Fig. 10. Merged representation for 3 metrics (number of seats won, percentage of seats won, percentage of votes won)
The first part of Fig. 11 deals with provincial elections, unlike the other graphs. Here, all the provincial parties that participated are shown to demonstrate which ones gained most of the votes throughout history. The same information is presented using the heat map in the second part of Fig. 11, and the user is free to choose either representation.
III. USER EXPERIMENTS' RESULTS
Although we are not reporting directly on the usability results of the study, we have tested the tool with ten users. The goal was to introduce the user to a data visualization tool that interprets the Canadian election data in a sensible, comprehensible, reproducible way using different graphs and visualization methods. We received positive feedback from the ten users who used our tool for gathering information about Canadian elections. Moreover, we recruited a small sample (14 people) from the graduate and undergraduate students at random as users (6 Canadian and 8 international/immigrant students). All of them were paid for giving us feedback on the use of the tool. Six of the international students had almost no knowledge about Canadian elections. They were asked to use the tool for 10 minutes and to surf around the database for another 10 minutes. All of them reported that the visualization tool was far more effective for gathering knowledge, comprehension, and comparison. All of the 14 users preferred the visualization tool to exploring the raw data provided in the web-embedded database.
Fig. 11. Provincial parties that gained most of the votes throughout history
Source Codes and Graphs. To encourage replicability, we uploaded all scripts, code, and graphs to the following link and provide the annotated dataset upon request.
IV. CONCLUSION
Open data is meant to be used by the public and to support data-driven decisions. However, visualization tools are required to make the data interpretable. Another challenge in providing analysis and visualization tools for open data is that the data is published in different formats by various parties in separate databases. In this paper, we provided the architecture and the technical details of an open-source tool that we developed for collecting data and for visualizing and analyzing information. Although the tool is developed explicitly for Canadian election data, the technical details and the approach can be used by researchers from various fields and by developers to address the issue of open data (i.e., having separate databases with no interpretation tool).
As a continuation of the work, we will run an empirical study to acquire user feedback for usability studies. Furthermore, we acknowledge that many other machine learning approaches can be adopted to predict numerous other metrics of Canadian elections. Separately, we have been developing and implementing different algorithms, but the benchmark result is yet to be determined.
V. ACKNOWLEDGMENT
This work is supported by NSERC Grant 05175.
REFERENCES
[1] M. B. Gurstein, "Open data: Empowering the empowered or effective data use for everyone?" First Monday, vol. 16, no. 2, 2011.
[2] P. Ackermann and K. Stockinger, "Narrative visualization of open data," in Applied Data Science. Springer, Cham, 2019, pp. 251-264.
[3] (2020) Keep tabs on parliament. [Online]. Available: https://openparliament.ca/
[4] T. A. Slocum, Thematic Cartography And Visualization. August: Princeton Hall Press, 1995.