Road Accidents in the UK (Analysis and Visualization)
Anjul K. Tyagi, Ayush Kumar, Anshul Gandhi, Klaus Mueller*
Department of Computer Science, Stony Brook University, New York
Figure 1: Multiple Correspondence Analysis of road accidents based on Postcode, Day of the week and Age of Drivers. Liverpool (Postcode: L) has more accidents on Saturdays, with the majority of drivers in the age group of 26 to 35 years.
ABSTRACT
Analysis of road accidents is crucial to understand the factors involved and their impact. Accidents usually involve multiple variables, such as time, weather conditions and age of the driver, and hence the data is challenging to analyze. To solve this problem, we use Multiple Correspondence Analysis (MCA) to first select the largest set of variables that can be visualized effectively in two dimensions, and then study the correlations among these variables in a two-dimensional scatter plot. For the other variables, for which MCA cannot capture ample variance in the projected dimensions, we use hypothesis testing and time series analysis.
Index Terms:
Multiple Correspondence Analysis, Dimension Reduction, Time Series Analysis, Hypothesis Testing
INTRODUCTION
Analysis of road accident data can reveal various hidden facts. Accident datasets are high dimensional in nature, and techniques like MDS and PCA can be used to project the data into lower dimensions for visualization. However, these techniques don't preserve the correlation among variables. Instead, Multiple Correspondence Analysis [2] (MCA) can be used to visualize the correlations between variables from the high-dimensional data in two dimensions. It also gives a discrimination measure of how correctly each variable from the dataset is represented in lower dimensions. We use this measure to effectively visualize some variables from the dataset. For other variables, which can't be correctly visualized by MCA, we use hypothesis testing and time series analysis to gain further insights.
* e-mail: {aktyagi, aykumar, anshul, mueller}@cs.stonybrook.edu
DATASET
The dataset is taken from Kaggle [1] and contains the details of every recorded accident in the UK from 2005 till 2015. The full dataset is divided into three major categories, i.e. accident information, casualty information, and vehicle information.
RELATED WORKS
Ljubič et al. [3] used time series analysis to study accident data in the UK. Sikdar et al. [4] used hypothesis testing to study accident data in India. In contrast, our work first uses data visualization to identify the smaller set of features that can't be effectively visualized in lower dimensions, and reserves hypothesis testing and time series analysis for those features.
APPROACH
We use the discrimination measure plot generated with MCA to see which variables can be represented accurately in two-dimensional visualizations of the dataset. As shown in Figure 2, the larger the value of a variable along a dimension, the easier it is to represent that variable along that dimension. As the circles in Figure 2 show, the main variables which can be visualized using MCA are age of the driver, location of the accident and day of the week. Other variables, such as vehicle type, date, weather conditions and sex of the driver, cannot be accurately represented using MCA, so we use hypothesis testing and time series analysis to analyze them.
Figure 2: Discrimination Measure plot over variables in accidents data.
Figure 3: Prediction of monthly accidents using autoregression from 2005 till 2014.
We choose the three variables which had high variance along Dimensions 1 and 2 in the Discrimination Measure analysis, namely location of the accident (Postcode), day of the week and age group of the driver, and project them using MCA with all the features in a single plot to study the correlation. Figure 1 shows how each of these categories is related to the others in the form of a scatter plot. Several insights obtained from this analysis are discussed in the results section.
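To make the role of the discrimination measure concrete, the following is a minimal sketch of MCA computed as a correspondence analysis of the indicator (one-hot) matrix, which is one common formulation and not necessarily the implementation used for Figures 1 and 2. The function names, the column names and the ten toy records are hypothetical placeholders, not values from the accident dataset. The discrimination measure of a variable along a dimension is computed here as the share of the variance of the row scores that is explained by that variable's categories.

```python
import numpy as np

def indicator_matrix(columns):
    """One-hot encode each categorical column of a dict {name: values}."""
    blocks, labels, owner = [], [], []
    for q, (name, values) in enumerate(columns.items()):
        cats, codes = np.unique(np.asarray(values), return_inverse=True)
        blocks.append(np.eye(len(cats))[codes.ravel()])
        labels += [f"{name}={c}" for c in cats]
        owner += [q] * len(cats)          # which variable each category belongs to
    return np.hstack(blocks), labels, np.array(owner)

def mca(columns, n_dims=2):
    """Correspondence analysis of the indicator matrix (a sketch of MCA)."""
    Z, labels, owner = indicator_matrix(columns)
    n = Z.shape[0]
    P = Z / Z.sum()                       # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)   # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    _, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    sigma, Vt = sigma[:n_dims], Vt[:n_dims]
    # Category coordinates; each one is the mean of the unit-variance
    # row scores over the records that fall in that category.
    cat_coords = (Vt.T / np.sqrt(c)[:, None]) * sigma
    # Discrimination measure: between-category variance of the row scores,
    # per variable and per dimension (values lie between 0 and 1).
    freq = Z.sum(axis=0) / n
    disc = np.array([
        (freq[owner == q][:, None] * cat_coords[owner == q] ** 2).sum(axis=0)
        for q in range(len(columns))
    ])
    return labels, cat_coords, disc, sigma ** 2

# Hypothetical toy records (NOT taken from the accident dataset).
toy = {
    "postcode": ["L", "L", "L", "L", "WA", "WA", "WA", "GU", "GU", "GU"],
    "day": ["Sat", "Sat", "Sat", "Sun", "Tue", "Tue", "Sat", "Tue", "Tue", "Sun"],
    "age": ["26-35", "26-35", "26-35", "18-25", "46-55",
            "46-55", "26-35", "46-55", "46-55", "18-25"],
}
labels, coords, disc, inertia = mca(toy)
for name, row in zip(toy, disc):
    print(name, np.round(row, 3))         # one discrimination value per dimension
```

Variables whose discrimination values are high on both dimensions are the ones worth plotting; plotting `coords` with `labels` gives a category scatter plot of the kind shown in Figure 1.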
Not all the variables can be efficiently represented in lower dimensions using MCA, hence further techniques to analyze the data are required. Hypothesis testing can be used to further understand how the variables are related to each other. We used Welch's t-test statistic to study several hypotheses on this dataset. Furthermore, because this dataset is recorded over time, we can make some predictions on the data using time series analysis. We applied autoregression on our dataset to analyze and predict the trend in accidents over the years. Results are discussed in the next section.
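As an illustration of the kind of test involved, the sketch below computes Welch's t statistic, the Welch-Satterthwaite degrees of freedom and an interval for the difference in mean daily accidents between two groups. It is a sketch under stated assumptions, not the code used for this study: the function name `welch_t_test` and the two series of daily counts are hypothetical, and the p-value and interval use a normal approximation to the t distribution, which is only reasonable for large samples such as several years of daily counts.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_t_test(a, b, confidence=0.95):
    """Welch's unequal-variance t-test for the difference mean(a) - mean(b).

    The p-value and the confidence interval use a normal approximation
    to the t distribution (an assumption that suits large samples only).
    """
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb   # squared standard errors
    diff = mean(a) - mean(b)
    se = sqrt(va + vb)
    t = diff / se
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    z = NormalDist()
    p_value = 2 * (1 - z.cdf(abs(t)))             # two-sided
    half_width = z.inv_cdf(0.5 + confidence / 2) * se
    return {
        "t": t,
        "df": df,
        "p_value": p_value,
        "diff": diff,
        "interval": (diff - half_width, diff + half_width),
    }

# Deterministic placeholder series (NOT real accident counts).
summer_daily = [400 + (37 * i) % 50 for i in range(120)]
winter_daily = [380 + (13 * i) % 40 for i in range(120)]

result = welch_t_test(summer_daily, winter_daily)
low, high = result["interval"]
print(f"t = {result['t']:.2f}, df = {result['df']:.1f}, p = {result['p_value']:.3g}")
print(f"difference in daily accidents: {low:.1f} to {high:.1f}")
```

When the interval excludes zero, the null hypothesis of equal means is rejected and the interval can be reported as a range of additional daily accidents; when the p-value is large, the null hypothesis is retained.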
RESULTS
• The number of accidents on Sundays and Wednesdays is lower than on other days in any postcode.
• Age groups 11-15 years, 26-35 years and 36-45 years have a similar number of accident records, and the major day of accidents for these age groups is Saturday.
• Warrington (WA) and Guildford (GU) have more accidents on Tuesdays, and the most common age group of people causing accidents there is 46 to 55 years.
• Age group 6-10 years is responsible for fewer accidents than other age groups.
Table 1: Results from hypothesis testing.
Null Hypothesis | Result
Number of daily accidents in summer and winter are equal | 15 to 30 more daily accidents in summer.
Number of daily accidents by young drivers (Age 18-25 years) and old drivers (Age 65-85 years) are equal | 85 to 89 more accidents by young people.
The number of daily accidents before and during the London Summer Olympics (2012) were the same | Accept Null Hypothesis. P-value 0.197.
Number of daily accidents in areas close to subway stations is the same as in other areas | 9 to 29 more accidents daily in areas close to subway stations.
Males cause an equal number of daily accidents as females | 428 to 439 more accidents by males.
We found that the number of accidents before and during the London Summer Olympics remained the same. Similarly, other interesting hypotheses were tested and are summarized in Table 1.
Figure 3 shows the prediction of the number of monthly accidents over the years. We see that the number of accidents has decreased over the years. The prediction accuracy can be measured by the root mean square error value, which was 699.84.
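The following sketch shows one way such a prediction and its root mean square error could be computed: an autoregressive model fitted by ordinary least squares and then rolled forward over a held-out year. The model order is not stated here, so the order of 12 (one year of monthly lags), the helper names and the synthetic monthly series are all assumptions made for illustration; the output is unrelated to the value of 699.84 reported above.

```python
import numpy as np

def fit_ar(series, order):
    """Least-squares fit of y[t] = c + sum_k phi_k * y[t-k], k = 1..order."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    lagged = [y[order - k: n - k] for k in range(1, order + 1)]
    X = np.column_stack([np.ones(n - order)] + lagged)
    coef, *_ = np.linalg.lstsq(X, y[order:], rcond=None)
    return coef                            # [c, phi_1, ..., phi_order]

def forecast(history, coef, steps):
    """Roll the fitted model forward, feeding predictions back in as lags."""
    order = len(coef) - 1
    window = [float(v) for v in history[-order:]]
    out = []
    for _ in range(steps):
        lags = window[::-1]                # most recent value first (lag 1)
        nxt = coef[0] + float(np.dot(coef[1:], lags))
        out.append(nxt)
        window = window[1:] + [nxt]
    return np.array(out)

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic monthly counts (NOT real data): downward trend, yearly cycle, noise.
rng = np.random.default_rng(0)
t = np.arange(120)
monthly = (14000 - 30 * t + 800 * np.sin(2 * np.pi * t / 12)
           + rng.normal(0, 200, t.size))

train, held_out = monthly[:108], monthly[108:]   # hold out the last 12 months
coef = fit_ar(train, order=12)
predicted = forecast(train, coef, steps=len(held_out))
error = rmse(held_out, predicted)
print(f"RMSE on held-out year: {error:.2f}")
```

The RMSE is expressed in the same unit as the series, here accidents per month, so it should be read relative to the typical monthly count.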
CONCLUSION
In this paper, we combined visualization and data analysis techniques for the effective study of a dataset. We visualized the correlation between the location of the accident, day of the week and age of the drivers using MCA. Further, we studied other important features using hypothesis testing and predicted the trend in accidents using time series analysis. Future work will include more detailed analysis of the data using Machine Learning and other advanced visualization techniques.
ACKNOWLEDGMENTS
This research was partially supported by NSF grant IIS 1527200 and MSIT, Korea, under the ICTCC Program (IITP-2017-R0346-16-1007).
REFERENCES
Encyclopedia of measurement and statistics, pp. 651–657, 2007.
[3] P. Ljubič, L. Todorovski, N. Lavrač, and J. C. Bullas. Time-series analysis of UK traffic accident data. In Proceedings of the Fifth International Multi-conference Information Society, pp. 131–134, 2002.
[4] P. Sikdar, A. Rabbani, N. Dhapekar, and D. G. Bhatt. Hypothesis testing of road traffic accident data in India.