Dynamic and interpretable hazard-based models of traffic incident durations
Kieran Kalair
Centre for Complexity Science, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, United Kingdom. [email protected]

Colm Connaughton∗
Mathematics Institute, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, United Kingdom. [email protected]

February 18, 2021

Abstract
Understanding and predicting the duration or “return-to-normal” time of traffic incidents is important for system-level management and optimisation of road transportation networks. Increasing real-time availability of multiple data sources characterising the state of urban traffic networks, together with advances in machine learning, offer the opportunity for new and improved approaches to this problem that go beyond static statistical analyses of incident duration. In this paper we consider two such improvements: dynamic update of incident duration predictions as new information about incidents becomes available, and automated interpretation of the factors responsible for these predictions. For our use case, we take one year of incident data and traffic state time-series data from the M25 motorway in London. We use it to train models that predict the probability distribution of incident durations, utilising both time-invariant and time-varying features of the data. The latter allow predictions to be updated as an incident progresses and more information becomes available. For dynamic predictions, time-series features are fed into the Match-Net algorithm, a temporal convolutional hitting-time network recently developed for dynamical survival analysis in clinical applications. The predictions are benchmarked against static regression models for survival analysis and against an established dynamic technique known as landmarking, and found to perform favourably by several standard comparison measures. To provide interpretability, we utilise the concept of Shapley values, recently developed in the domain of interpretable artificial intelligence, to rank the features most relevant to the model predictions at different time horizons. For example, the time of day is always a significantly influential time-invariant feature, whereas the time-series features strongly influence predictions at 5 and 60-minute horizons.
Although we focus here on traffic incidents, the methodology we describe can be applied to many survival analysis problems where time-series data is to be combined with time-invariant features.

Keywords: big data · traffic incident duration prediction · landmarking · survival analysis · deep learning

∗ London Mathematical Laboratory, London, United Kingdom

Managing and reducing congestion on roads is a fundamental challenge faced across the world. In many countries, one being the UK, there is limited scope to build additional infrastructure to cope with the demand for road traffic. Instead, the focus is on an approach referred to as ‘Intelligent Mobility’, broadly described as combining real-time data and modelling to improve the management of existing physical infrastructure. An appropriate test-bed for such approaches is the UK Strategic Road Network (SRN), which constitutes around 4,400 miles of motorways and major trunk roads across England [see Peluffo, 2015]. Further, the SRN carries 30% of all traffic in the UK, with 4 million vehicles using it every day and 1 billion tonnes of freight being transported across it each year [see Highways England, 2017]. There is little chance that in the short term the SRN will see significant infrastructure changes; however, data describing the traffic state across the network is already being collected and made available for analysis. Whilst a significant component of the UK transport infrastructure, congestion remains a major problem on the network, with the cited report suggesting 75% of businesses consider tackling congestion on the SRN to be important or critical to their business.

Traffic congestion can be broadly separated into two types: recurrent and non-recurrent. Recurrent congestion is simply the result of demand regularly exceeding capacity on busy sections of road during ‘rush hour’ periods. Non-recurrent congestion, on the other hand, is mainly caused by traffic incidents and rare events [Li et al., 2018]. To better manage traffic during these incidents, traffic operators require reliable estimates of how long a particular incident will last. Whilst there is significant existing work on modelling incident duration, many fundamental challenges remain that are both of practical interest to traffic management centres and active areas of academic research. A review of existing work on this problem is found in Li et al. [2018], where six future challenges for incident duration prediction are listed. These are: combining multiple data sources, time sequential prediction models, outlier prediction, improvement of prediction methods through machine learning or alternative frameworks, combining recovery times, and accounting for unobserved factors. Our work aims to address five of these challenges, combining time-series from sensor networks with incident reports to issue dynamically updated duration predictions.
Inspired by approaches adopted in medical applications, we consider classical survival approaches and non-linear machine learning methods for prediction, understanding where gains in performance are attained. Finally, as we are able to observe traffic behaviour over long periods of time through the sensor network, we are able to judge not just when a traffic incident has been cleared, but when the traffic state has returned to normal operating conditions, thereby incorporating recovery times into our modelling approach.

The rest of this paper is structured as follows. Firstly, we overview existing work on incident duration prediction, specifying how the duration is defined and existing methods, and offer more detail on challenges highlighted in the literature. Secondly, we detail our dataset, its collection and processing, and an initial exploratory analysis of it. We then describe the considered modelling approaches, both for static and dynamic predictions, and results for our dataset. Finally, we consider what variables are important for the models and end by summarising our main findings.

Note that throughout this paper, an ‘event’ is defined as the traffic state on a section of road returning to a baseline behaviour. As such, when an event has ‘occurred’ we really mean the traffic state has recovered. This is a note of terminology commonly used in the survival analysis literature.
In this section, we summarise existing work on traffic incident duration analysis that is relevant to our own, along with relevant research from other disciplines that has influenced our approach.

Before any methodologies are considered, it is first important to define exactly what is being modelled. A traffic incident is considered to have four different time-phases: the time taken to detect and report an incident, the time to dispatch an operator to the scene, the travel time of the operator to the scene, and finally the time to clear the incident. Such a framework is described in Li et al. [2018]. We consider an incident to have ‘started’ when the incident is reported by the human operator, who would use a series of cameras to observe the state of the road in the case of the SRN. Our focus is to model the time from this point until both the incident has been physically cleared and traffic behaviour on the road has returned to some sense of ‘normal’ behaviour. We describe exactly how we determine normal behaviour in section 3.1. In brief, we use a seasonal model of the speed time-series to estimate when the traffic speed on a section of road has recovered to a level close to what would normally be expected for that section at that time of day. The idea of using such a speed profile is also considered in Hojati et al. [2014], where the ‘total incident duration’ is defined as the time from incident start until the speed has recovered to the profile.

Whatever explicit definition of duration is used, there is an enormous amount of work on predicting incident durations. An initial step in many of these works is to determine an appropriate distributional form for the durations. These are typically heavy tailed and empirically show significant variation. Examples include modelling the distribution of incident durations as log-normal in Golob et al. [1987] and Chung and Yoon [2012], log-logistic in Zhang and Khattak [2010] and Junhua et al. [2013], Weibull in Hojati et al.
[2014] and Kaabi et al. [2012], and generalised F in Ghosh et al. [2014]. In the latter, it is noted that the generalised F distribution can be equivalent to many other distributional forms for particular parameter choices, including the exponential, Weibull, log-normal, log-logistic and gamma distributions. Hence, it offers more freedom than choosing any single one of these forms. Indeed, the authors state that the increased flexibility it offers allows it to fit the data better. Even further flexibility in the distributional choice is given in Zou et al. [2016], where it was shown that modelling the distribution as a mixture, that is the sum of multiple components, may improve model performance. Specifically, they consider a 2-component log-logistic mixture model, where the final distribution is the weighted average of two log-logistic distributions. Finally, other authors Smith and Smith [2001] have had difficulty finding statistically defensible distributional fits to their data, although it should be noted that different definitions of incident durations will likely impact this.

Using common probability distributions is appealing, as it limits a model’s freedom and can be easier to fit to data. However, it is clear that authors are exploring more complex distributional forms to better model the data, and seeing better results when they do so, as in Ghosh et al. [2014] and Zou et al. [2016]. Mixture distributions are a naturally appealing form, as we assume the data is generated by multiple sub-populations, and covariates can have different effects for different populations. We incorporate ideas from this section of the literature by considering models that assume log-normal and Weibull distributions on incident durations, as well as mixture distributions where one supposes the data is generated by one of many sub-populations, each of which has a parametric form. Recent applications of survival analysis in healthcare Lee et al. [2018] have removed distributional assumptions entirely, and instead formulate models that output distributions with no closed form. This is done by treating the output space as discrete, and treating the model output as a probability mass function (PMF) defined over it, allowing for construction of a fully non-parametric estimate. Such an approach offers even more freedom, and as we see more complex distributions used in the traffic literature to provide more freedom, one could ask if removing the distributional assumption entirely can improve model performance.

After a distribution is determined, many works apply methods from survival analysis, with a common choice being the Accelerated Failure Time (AFT) model. Example applications of this are given in Nam and Mannering [2000], Chung et al. [2010] and Chung et al.
[2015]. Such a model assumes that each covariate either accelerates or decelerates the life-time of a particular individual. These models are widely used, and offer an interpretable means of investigating which factors strongly or weakly influence incident duration. However, it can be difficult to incorporate time-series features into them. Whilst it is possible to model time-varying effects of covariates, for example in Zeng and Lin [2007], it is more complex to derive optimal features from a time-series that are also interpretable. Clearly, AFT models are useful in that they produce interpretable outputs and relationships between variables, and can model the non-Gaussian distributions empirically observed in incident duration data. The alternative, well-known classical survival model one could apply is the Cox regression model Cox [1972]. Such a model assumes a baseline hazard function for the population, describing the instantaneous rate of incidents, from which survival probabilities can be calculated; see section 4.2 for more details. Covariate vectors for individuals shift this baseline hazard, allowing for individualised predictions. Applications of this to transportation problems are given in Oralhan and Göktolga [2018], Wu et al. [2013] and Li and Guo [2014].

Whilst these two methods are widely used, a number of alternatives exist. One such method is a sequential regression approach, an early example being Khattak et al. [1995]. Importantly, the authors identify that more information describing an incident will become available over time, and hence consider a series of models to make sequential predictions. However, their sequential information was more descriptive from an operational standpoint, for example identifying damage to the road and response time of rescue vehicles. We do not have this, and instead focus on the minute-to-minute updates provided through traffic time-series recorded by sensors along the road, and how to engineer features from these.
Truncated regression approaches were discussed in Zhang and Khattak [2010], where there was a specific effort to model ‘cascading’ incidents (referred to as primary and secondary incidents in some literature). The thought here was that incidents that occur nearby in time and space would lead to a significantly longer clearance time for the road segment, and hence this should be accounted for in modelling. We performed an extensive analysis of primary and secondary incidents in our dataset in Kalair et al. [2021]. Building on this and the previously cited work, we include a cascade variable in the models, allowing this to influence duration predictions.

Further regression approaches are explored in Garib et al. [1997] and Weng et al. [2015], and switching regression models are used in Ding et al. [2015]. Note that in Weng et al. [2015], the authors first cluster the incident data, then use this clustering as additional features for a model, further suggesting that there is some element of sub-population structure in the data. A final relevant regression-based work is Khattak et al. [2016], where quantile regression is used to model incident durations. This is a natural choice, as there is a clear skew in the empirically observed duration distributions, and if one does not want to assume a particular distributional form, they can instead model properties of the distribution, in this case quantiles of the data.

Further methodologies to note are those based on tree models or ensembles of them, particularly because we apply such a method later in this work. Tree-based models are discussed in Smith and Smith [2001], where the authors compare models that assume particular incident duration distributions, a k-nearest-neighbour approach and a classification tree method based on predicting ‘short’, ‘medium’ and ‘long’ incidents.
They concluded that no model provided accurate enough results on their dataset to warrant industrial implementation, but found the classification tree was the preferred model of those considered. Further, Zhan et al. [2011] considered a regression tree approach, where the terminal nodes of each tree were themselves multivariate linear models. Such an approach avoids binning of incidents into pre-defined categories, and achieved 42.70% mean absolute percentage error, better than the compared reference models. From an interdisciplinary setting, alternative tree methods have been considered, namely one known as ‘random survival forests’ Ishwaran et al. [2008], an extension of random forests to a survival analysis setting. In such a framework, the terminal nodes of each tree specify cumulative hazard functions for all data-points that fall into that node, and these hazards are combined across many trees to determine an ensemble hazard. There is no defined distributional assumption in such a model, again leaning towards the side of freedom in allowing the data to construct its own hazard function estimate rather than parametrizing an estimated form.

Neural networks are a rapidly developing methodology in machine learning, have been used extensively in incident duration prediction, and form the basis for some of our considered approaches. Examples include Wang et al. [2005], Wei and Lee [2005] and Lee and Wei [2010]. Each of these applies feed-forward neural networks to determine estimates of incident durations, and particularly in Lee and Wei [2010] sequential prediction was considered, using two models. The first took standard inputs, and the second took these along with detector information near the incident. These were input into feed-forward neural networks, and used to generate point predictions. Additional neural network applications are given in Valenti et al. [2010], where their performance is compared to that of linear regressions, decision trees, support vector machines and k-nearest-neighbour methods. The authors find that different models have optimal performance at different incident durations, suggesting there is still much to improve on feed-forward networks. A final point to note is that neural networks have been applied to survival analysis problems in healthcare a number of times. Examples of this include Katzman et al.
[2018], which develops a Cox model, replacing the linear regression with a neural network output; Lee et al. [2018], which removes any distributional assumptions; and Jarrett et al. [2020], which uses a sliding window mechanism and temporal convolutions for dynamic predictions. We consider whether the latter two are useful in the application to traffic incidents later in this work. Specifically, using Jarrett et al. [2020] offers an automated way to engineer features from our sensor network data, whilst being able to model a parametric or non-parametric output.

Whilst we have discussed a number of different methodological approaches, the actual features used to make these predictions, regardless of approach, appear quite consistent across different works. In Garib et al. [1997], the authors state that using number of lanes affected, number of vehicles involved, truck involvement, time of day, police response time and weather condition, one can explain 81% of the variation in incident duration. An overview of various feature types is given in Li et al. [2018], identifying incident characteristics, environmental conditions, temporal characteristics, road characteristics, traffic flow measurements, operator reactions and vehicle characteristics as important factors when modelling incident durations.

We note that we are not the first to use sensor data in incident duration analysis. Speed data collected from roads was used in Ghosh et al. [2014], where the authors included two features based on the speed series: whether the difference between the 15th and 85th percentiles of the speed data was greater than 7 mph, and whether the 85th percentile for speed was less than 70 mph. Further, in Wei and Lee [2007] and Lee and Wei [2010] the authors train two feed-forward neural networks, with input features that include the speed and flow for detectors near the incident.
The first model provided a forecast just before the incident occurred, whereas new data was fed into the second whenever available, updating predictions as time progressed. In the second paper, the focus was on reducing the dimensionality of the problem through feature selection via a genetic algorithm. We fundamentally differ from these works for multiple reasons. With regard to Ghosh et al. [2014], this was an analysis of which factors impact incident duration the most, and whereas they used hazard-based models for analysis, we use the sensor data to engineer dynamic features, either manually or through temporal convolutions. Additionally, we differ from Wei and Lee [2007] and Lee and Wei [2010] in how we determine features and network structure, and in predicting an output distribution, not just a point estimate.

With dynamic prediction being highlighted as an area to address in the literature, some recent papers have looked at this problem through methods different to those already discussed. One example is Li et al. [2015], where a topic model was used to interpret written report messages and predictions were made as new textual information arrived. Further, multiple regression models were built in Ghosh et al. [2019], and as different features became available, data-points were assigned to clusters and a prediction was generated using a regression model tailored to each cluster. Lastly, Li et al. [2020] considered a five-stage approach, where a prediction was made at each stage and different features were available defining these stages. These included vehicles involved, agency response time and number of agencies responding. While this structured approach addresses some aspects of dynamic prediction, the purely data-driven approach which we present in this paper provides much more flexibility.
As discussed, various traffic data for the SRN is provided by a system called the National Traffic Information Service (NTIS). This is both historic and real-time data. Whilst the SRN includes all motorways and major A-roads in the UK, we focus our analysis on one of the busiest UK motorways, the M25 London Orbital, pictured in Figure 1.

Technical details of the NTIS data feeds are available at

Figure 1: The M25 London Orbital, roughly 180 kilometres in length. The Dartford crossing, located in the east, is a short segment of road that is not included in the data set.

Inside NTIS, roads are represented by a directed graph, with edges (referred to as links from now on) being segments of road that have a constant number of lanes and no slip roads joining or leaving. We extract all links that lie on the M25 in the clockwise direction, yielding a subset of the road network to collect data for. Placed along links are physical sensors called ‘loops’, which record passing vehicles and report averaged quantities each minute.

The most relevant components of NTIS to our work are incident flags that are manually entered by traffic operators. These flags specify an incident type, for example accident or obstruction, the start and end time, date, and the link the incident occurred on. Accompanying this information are further details on the incident, for example what vehicles it involved and how many lanes were blocked, if any. We extract all incidents that occurred on the chosen links between September 1st 2017 and September 30th 2018. Along with this, NTIS provides time-series data for each link, recording the average travel time, speed and flow for vehicles and publishing it at 1-minute intervals. These values are determined by a combination of floating vehicle data and loop sensors. As well as the incident flags, we extract the time-series of these quantities in the specified time period. In total, our dataset has 4415 incidents that we train on, and 2011 incidents that we use for out-of-sample validation of the models. Note that of these 4415 training incidents, we use 1324 as hold-out data to judge when to stop training our machine learning models.
As discussed, incident duration consists of 4 distinct phases, and we want to model the time it takes for a link to recover to some baseline state. How to determine this baseline, and ensure it is robust to outliers whilst retaining important features of the data, is an open problem. However, a natural way to approach it is to develop some seasonal model of behaviour on a link and use this seasonality as a baseline behaviour. We define such a baseline by first taking the speed time-series and pre-filtering it, removing the periods impacted by incidents. After this, we account for any potentially missing incident flags by further removing any remaining periods with a severity higher than 0.3. Severity is defined as in Kalair and Connaughton [2020], which, in short, considers the joint behaviour of the speed-flow time-series and questions what points correspond to large fluctuations from a region of typical behaviour in this relationship. We then take this filtered series and extract the seasonal components, in our case daily and weekly components, to capture natural variability on the link. Note that inspection of the data shows no trend.

To construct a seasonal estimate, we consider simple phase averaging, taking the median of data collected at a given time-point in a week, and STL decomposition Cleveland et al. [1990]. We see little difference between the two methods, so choose the phase-average baseline for simplicity. We define one speed baseline for each link, establishing a robust profile describing the speed behaviour in a ‘typical week’, and replicate this over the entire data period. It is robust in the sense that we have pre-filtered the extreme outliers. It also captures the clear seasonality in the problem and can be applied to new test data without any difficulty. An example of this profile for a single link, along with the residuals from it, are given in
Figure 2A, along with an example incident in Figure 2B. We specifically mark where the NTIS flag was raised, where it was closed, and where the speed behaviour returned to the baseline.

Figure 2: The baseline for a single link and an example incident. The baseline captures clear seasonality in the data. On a much shorter time-scale, we see a drop in speed in the wake of an NTIS incident flag, and a recovery after a sustained period of low speed. (A): An example weekly baseline, and residuals (data - baseline) for a single link, showing 4 weeks of data. (B): An example comparison between NTIS end times and the Return to Normal (RTN) end time. We are trying to predict the time until we return to normal operating conditions, rather than the time at which the NTIS flag is turned off.

Using this methodology, we process our dataset such that we have a set of records with the start time of each incident and the time at which the link returned to normal. We include a safety margin in this baseline to account for any persistent but minor problems, shifting it down by 8 km/hr (≈ 5 mph). A link is considered to have returned to normal when its speed is above this shifted baseline for at least 3 consecutive minutes, acting as a persistence check.

Our modelling approach compares multiple methodologies to predict incident durations. A concise summary with all models considered and the main points of note about them is available in the supplementary material.
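The baseline construction and return-to-normal rule described above can be sketched in a few lines. This is a minimal illustration rather than the paper's actual pipeline; all function names and inputs are ours, and the series here would in practice be the pre-filtered NTIS speed data.

```python
# Sketch (illustrative, not the paper's implementation) of the phase-average
# weekly baseline and the return-to-normal (RTN) rule with a persistence check.
from statistics import median

MINUTES_PER_WEEK = 7 * 24 * 60

def weekly_baseline(speed_obs):
    """Median speed at each minute-of-week, from (minute_index, speed) pairs
    with incident-affected periods already filtered out."""
    phases = {}
    for t, v in speed_obs:
        phases.setdefault(t % MINUTES_PER_WEEK, []).append(v)
    return {p: median(vs) for p, vs in phases.items()}

def return_to_normal(speed_series, baseline, start, margin=8.0, persist=3):
    """First minute >= `start` at which the speed stays above the baseline
    shifted down by `margin` km/hr for `persist` consecutive minutes."""
    run = 0
    for t in range(start, len(speed_series)):
        if speed_series[t] > baseline[t % MINUTES_PER_WEEK] - margin:
            run += 1
            if run == persist:
                return t - persist + 1  # first minute of the recovery run
        else:
            run = 0
    return None  # no recovery observed within the data (censored)
```

For an incident flagged at minute `start`, the modelled duration is then `return_to_normal(...) - start`, i.e. the time until the RTN point rather than until the NTIS flag is closed.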
As discussed, there is much work in the literature identifying the most influential features for incidents. However, we are restricted in two ways. The first is that at any time-step of a prediction, we only incorporate features that would be known at that step. The second is that our dataset does not provide as comprehensive an overview of some features that are likely to be highly informative of the duration; for example, we do not have in-depth injury reports or the arrival time of police forces. The features we do use are separated into time-invariant and time-varying categories. The time-varying features are derived from time-series of speed, flow and travel time provided at a 1-minute resolution, which we take for the link an incident occurs on. We first remove the seasonality from each of these series by determining ‘typical weeks’, just as in the case of the speed baseline, then subtracting this from the time-series to generate a set of residual series. We then hand-engineer features that may be of use for some simple dynamic models, computing the gradient of the residuals at some time t using the previous 5 minutes of data from t, as well as simply recording the value of the series. These are used as initial features from the time-series as they provide a sensible and intuitive summary of an incident from the sensors; however, of course, more complex features might be derived from the series. We consider models that do just this by applying temporal convolutions across the residual series, and compare the modelling results in section 5. The time-invariant features, on the other hand, are detailed in Table 1, and are a combination of what an operator might know using existing camera and phone coverage of the SRN. For completeness, we also detail the time-varying features there. Whilst not exhaustive, the features in
Table 1 offer a combination of contextual information in time, space and specific to the incident. Our choice of bins for time of day reflects typical commuting patterns in the UK. Some authors use day and night time as the separation, as in Zou et al. [2016], whereas others account for peak times in their binning, as in Nam and Mannering [2000] and Khattak et al. [1995], as we have.

Table 1: Overview of the features considered for incident duration modelling. These would be recorded by an NTIS operator when an incident is declared in the system and observed on the network. The time-series features are recorded by inductive loop sensors along the road, and reported each minute. We further consider machine learning models that engineer time-series features automatically.
Feature | Variable Type | Description
Binned daily time | Categorical | An indicator of the time of day. Bins: Morning Rush (6a.m.-9a.m.), Afternoon (9a.m.-3p.m.), Evening Rush (3p.m.-6p.m.) and Night (6p.m.-6a.m.).
Capacity reduction | Categorical | The fraction of lanes that are blocked due to the incident, binned into 0-25%, 25-50%, 50-75% and 75-100%.
Incident type | Categorical | Specified incident type, coded as Accident, Vehicle Obstruction, Non-Vehicle Obstruction, and Abnormal Traffic.
Link length | Continuous | Length of the link the incident occurred on, in metres.
Link downstream atypical? | Binary | Is the link downstream perturbed to some atypical state at the time of the incident flag?
Link upstream atypical? | Binary | Is the link upstream perturbed to some atypical state at the time of the incident flag?
Number of vehicles | Categorical | How many vehicles are involved in the incident?
Has cascade? | Binary | Did the incident occur immediately after another incident nearby in space and time?
Has roadworks? | Binary | Did the incident occur on a link with roadworks active?
Spatial location | Categorical | The network is split into 8 sections (North, North East, East, ..., North West) and the incident location is specified with this encoding.
Season | Categorical | What season did the incident occur during?
Vehicle types involved | Binary | Binary variables indicating if an incident involved a car, motorcycle, lorry, trailer or articulated vehicle.
Weekend indicator | Binary | 1 if the incident occurs on a weekend, 0 otherwise.
Time-series residual | Continuous | Speed, flow and travel time residuals from their weekly baselines (used only in landmarking models).
Gradient of time-series residual | Continuous | Gradient of speed, flow and travel time residuals from their weekly baselines (used only in landmarking models).
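The hand-engineered time-varying features, i.e. the residual value at time t and the gradient of the residual over the previous 5 minutes, can be sketched as below. This is an illustrative Python implementation with names of our own choosing; the paper does not specify how the gradient is computed, so we use a least-squares slope over the window as one plausible choice.

```python
# Illustrative sketch of the time-varying features: residuals from a
# 'typical week' profile, and their gradient over the last 5 minutes.

def residual_series(series, baseline_week):
    """Subtract the replicated 'typical week' profile from a raw
    minute-resolution series."""
    week = len(baseline_week)
    return [v - baseline_week[t % week] for t, v in enumerate(series)]

def gradient_feature(residuals, t, window=5):
    """Least-squares slope (units per minute) of the residuals over the
    window ending at t. Returns 0.0 if there is not enough history."""
    if t + 1 < window:
        return 0.0
    ys = residuals[t - window + 1 : t + 1]
    xs = range(window)
    xbar = (window - 1) / 2
    ybar = sum(ys) / window
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den
```

At each prediction step t during an incident, the landmarking models would then receive `residuals[t]` and `gradient_feature(residuals, t)` for each of the speed, flow and travel-time series.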
We first offer more detail on models in the vein of classic survival analysis that we will consider for our problem. Weremind the reader that, following the convention in the survival analysis literature, we use the word “event" to mean theoccurrence of the outcome of interest. In our case, this is the end of a traffic incident as determined by the return of thespeed to within some threshold difference from the the profile value. Survival analysis methods aim to model someproperty of the duration distribution. Let f ( t ) be the PDF of incident durations, and F ( t ) be the cumulative distributionfunction (CDF). A key component in survival analysis is the survival function S ( t ) , which in our context describes theprobability an incident has not ended by time t , where time is measured from the start of the incident. Denoting theevent time (the incident end time) by T , we formally write: S ( t ) = P ( T ≥ t )= 1 − F ( t )= (cid:90) ∞ t f ( x ) dx. (1)7 PREPRINT - F
Further, many survival analysis methods are concerned with the hazard function λ(t), describing the instantaneous rate of occurrence of events. One can show that:

λ(t) = lim_{dt → 0} [ P(t ≤ T < t + dt | T ≥ t) / dt ] = f(t) / S(t).   (2)

In practice, this means that the instantaneous rate of events is equal to the density of events at that time divided by the probability of surviving to that time. A final concept of note is the cumulative hazard function Λ(t), which is the integral of the hazard function between time 0 and t:

Λ(t) = ∫_0^t λ(s) ds.   (3)

Using these concepts, the first model we apply is a Cox regression model, reviewed in Kumar and Klefsjö [1994]. Suppose some 'individual' i (an incident in this application) has covariate vector x_i. A Cox model specifies the hazard function for individual i as:

λ_i(t | x_i) = λ_0(t) e^{x_i'β}   (4)

where λ_0(t) is some baseline hazard at time t, and β is a vector of regression coefficients. The baseline hazard describes the hazard function for an individual with covariates all equal to 0, and it is then adjusted for a particular individual by the exponential of the regression term. In this original formulation, the covariate effect is constant in time, but the baseline hazard varies in time. Various methods exist for estimating a baseline hazard function, with more details found in Kleinbaum and Klein [2012]. In short, the baseline hazard is assumed to be piecewise constant and determined without any distributional assumptions, allowing the data to construct an approximation. One can determine β by optimizing the partial likelihood:

PL(β) = ∏_{i=1}^N [ e^{x_i'β} / Σ_{j=1}^N e^{x_j'β} Y_j(τ_i) ]^{δ_i}   (5)

where δ_i is 1 if the event time is observed and 0 if it is censored, and Y_j(τ_i) is 1 if individual j is still at risk at time τ_i. Here τ_i represents the recorded incident duration for incident i.
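The log of the partial likelihood in Eq. (5) can be evaluated directly; a minimal numpy sketch (illustrative only, not the optimised routine used by the R package 'survival'):

```python
import numpy as np

def cox_partial_log_likelihood(beta, X, durations, observed):
    """Log of Eq. (5): for each observed event i, the linear predictor
    x_i'beta minus the log-sum of exp(x_j'beta) over the risk set,
    i.e. the incidents still ongoing at time tau_i."""
    eta = X @ beta                               # linear predictors x_i'beta
    log_lik = 0.0
    for i in range(len(durations)):
        if not observed[i]:
            continue                             # censored: no factor in Eq. (5)
        at_risk = durations >= durations[i]      # Y_j(tau_i): still at risk
        log_lik += eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return log_lik
```

In practice this would be maximised over beta with a numerical optimiser; with beta = 0, each observed event contributes minus the log of its risk-set size.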
The baseline hazard function can be determined using the Breslow estimator:

λ_0(τ_i) = d_i / Σ_{j ∈ R(τ_i)} e^{x_j'β}   (6)

where R(τ_i) is the set of at-risk individuals at time τ_i and d_i is the number of events that have occurred at the i-th time. We use the implementation of Cox models provided in the R package 'survival' [Therneau, 2015]. Ties in event times are handled using the 'Efron' method, detailed in Efron [1977], altering the likelihood in Eq. (5).

The next model we apply is an accelerated failure time model (recall AFT from the literature review). Such a model supposes relationships between the survival and hazard functions of the form:

s_i(t) = s_0(t e^{x_i'β}),   λ_i(t) = λ_0(t e^{x_i'β}) e^{x_i'β}.   (7)

Here, s_0(t) and λ_0(t) represent assumed baseline survival and hazard forms, and covariates 'accelerate' or 'decelerate' the survival time of particular individuals. Given an assumed form, for example Weibull, log-normal or so on, with parameters θ, one can then fit this model through maximum likelihood, optimizing:

L(θ) = ∏_{i=1}^N [f(τ_i)]^{δ_i} s(τ_i)^{1−δ_i}.   (8)

A common way to interpret the AFT model is as a regression on the log of the durations:

log(τ_i) = τ_0 + x_i'β + ε_i   (9)

where ε_i is noise with some assumed form. We implement the models using the R package 'flexsurv', detailed in Jackson [2016].

As Cox and AFT models involve, in some way, a linear regression on covariates of interest, they are unable to account for potentially non-linear, complex interactions and effects of variables without manual investigation and specification. One way to account for this is to instead use random survival forests (recall RSF from the literature review), which are non-linear models based on an ensemble of individual tree models. The basic idea is as follows. Firstly, one takes a
training set and generates B bootstrap samples from it, that is, samples with replacement. Each of these samples is used to grow a decision tree; however, randomness is introduced into the growing of the tree by selecting a set of potential split variables at each point at which the tree needs to be split. The optimal split variable from this set is chosen to optimise some survival criterion. One of the most commonly used is the log-rank splitting rule [Ishwaran et al., 2008]. The tree is then grown until some criterion is met, either a maximal size or a minimum number of cases remaining, and the output at the end of any branch is the cumulative hazard function for all data-points that are placed into that branch when passed through the tree. This process is repeated several times and the collection of trees is referred to as a forest. Each decision tree is a non-linear mapping from input covariates to the output cumulative hazard function, and the collection of many trees acts as an ensemble learner. Ensemble models are known to show promising performance in a range of tasks, and this, in addition to the non-linear decision tree models, suggests such models may improve upon Cox and AFT models for certain datasets. For our work, we use the R implementation found in Ishwaran and Kogalur [2007].

Alternative approaches in survival analysis have focused on applying methods from deep learning to incorporate non-linear covariate effects and behaviours. One of the first such methods was in Faraggi and Simon [1995], where a Cox model was considered, but the term x_i'β was replaced with g(x_i), the output from a neural network given input x_i. Similarly, Katzman et al. [2018] and Luck et al. [2017] extended the Cox model to a neural network setting; however, fundamentally such models are still somewhat restrictive in that they assume a form of the hazard function. More recently, Lee et al.
[2018] suggested making far fewer assumptions, and instead training a network to directly model the function F(t | x) = 1 − P(T > t), referred to as the failure function. To avoid specifying any particular form of this function, the output space was treated as discrete, defined on times { t_1, t_2, . . . , t_max }. We suppose a single output value in this discrete space at time t_j gives P(t_j | x_i), and hence we derive F(t_j | x_i) as:

F(t_j | x_i) = Σ_{t = t_1}^{t_j} P(t | x_i).   (10)

However, we still need to enforce that the output vector actually defines a discrete probability distribution. A natural way to enforce this is to apply a softmax function on the output layer, normalizing the sum of the values to 1 through:

σ(z)_j = e^{z_j} / Σ_{k=1}^{t_max} e^{z_k}.   (11)

In particular, Lee et al. [2018] considered an application with competing risks, where individuals experienced one of many possible events. Here we consider a simpler case, having only one event (the traffic state returning to normal); however, the methodology remains consistent in this application. As we are only able to measure whether an event has occurred each minute from our sensor data, the discrete nature of the model is not restrictive in our context, yet the non-parametric output is appealing as we have seen in an exploratory analysis (described in the supplementary material) that the data does not appear to be generated from any particular, simple closed-form distribution. We adapt our implementation from that found in Lee [2020].

For our implementation, we specify a t_max value equal to the longest duration plus a 20% margin, as in the original implementation, and define the output grid at a 1-minute resolution. However, doing so leads to a very large output space for the model, and could potentially lead to over-fitting. To combat this, we apply dropout after every fully connected layer in the network, elastic net regularization on the weights, and early stopping based on holdout data.
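As a sketch of Eqs. (10)-(11), the discrete output head can be simulated with plain numpy, with the logits standing in for the network's final-layer outputs:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def failure_function(logits):
    """Turn raw network outputs over a discrete time grid into the
    failure function F(t_j | x) = sum over t <= t_j of P(t | x), Eq. (10)."""
    pmf = softmax(logits)            # Eq. (11): normalise to a distribution
    return np.cumsum(pmf)            # discrete CDF: monotone, ends at 1
```

By construction the result is non-decreasing and reaches 1 at t_max, so it is a valid (discrete) failure function whatever the raw logits are.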
We further consider whether a parametric distribution may be sufficiently flexible, when attached to a neural network model, to perform well in the prediction task. To test this, we build another model as described above, but remove the softmax output layer and replace it with a mixture distribution layer, influenced by Zou et al. [2016]. We choose our mixture components to be log-normal, avoiding the other specified distributions for numerical stability. This alters the output size to be a small multiple of N_m, where N_m is the number of mixtures, a hyper-parameter to tune.

A final alternative that compromises between the fully non-parametric discrete output and the parametric mixture is to allow the output layer of the network to define a set of weights, and construct a probability distribution from these weights using kernel smoothing. Kernel smoothing is a non-parametric technique that aims to construct a distribution by summing a set of kernel functions evaluated at given data-points. Formally, we can write a kernel smoothed result for some desired point z, with kernel centres Z_i and weights ω_i, as:

ν̂(z) = (1 / h_bw) Σ_{i=1}^N ω_i K( (z − Z_i) / h_bw ).   (12)

Here h_bw is the smoothing bandwidth, the kernel K(x) is often taken to be Gaussian, and the resulting estimate essentially builds a distribution as a weighted, smoothed sum over all kernel centres. A point with high weighting
will result in a significant amount of mass near this location, and a wide bandwidth will smooth this mass out to the surrounding area. Applying this to our problem, it allows us to avoid treating the output space as discrete; instead we place a kernel centre at each point in the formerly discrete grid, and treat the neural network output (including having a softmax applied) as defining the weights ω_i. Doing this also enforces some amount of smoothness in the output distribution, determined by the choice of h_bw. We choose to use a bandwidth of 3 minutes, which still allows significant freedom in the distribution.

The actual function proposed in Lee et al. [2018] to optimize in order to train the network is a combination of two loss values, the first accounting for the likelihood of the observed data and the second enforcing ordering. The likelihood loss function is given as:

L_1 = − Σ_{i=1}^N [ δ_i log( f̂(τ_i | x_i) ) + (1 − δ_i) log( 1 − F̂(τ_i | x_i) ) ]   (13)

where f̂ is the PDF (or PMF in the discrete case) implied by the model output, and F̂ is the CDF (or cumulative mass function in the discrete case) implied by the model output, given a particular input x_i. This is exactly as in Eq. (8), but taking logs, describing the likelihood of survival data. The second loss function is written:

L_2 = Σ_{i ≠ j} 1(τ_i < τ_j) η( F̂(τ_i | x_i), F̂(τ_i | x_j) ),   η(x, y) = e^{−(x − y)/σ}.   (14)

This loss penalizes the incorrect ordering of pairs in terms of the cumulative probability at their event time. If an incident i ends before j, then we would expect F̂(τ_i | x_i) to be larger than F̂(τ_i | x_j), and if so this pair is considered correctly ordered. Large deviations from correct ordering are penalized by η(x, y). The total loss function is then the sum of L_1 and L_2.
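A sketch of the two loss terms for the discrete output case; the arrays pmf, cdf and event_bins (the grid index of each observed duration) are illustrative names, not the authors' implementation:

```python
import numpy as np

def likelihood_loss(pmf, cdf, event_bins, observed):
    """L_1 of Eq. (13): negative log-likelihood of the observed (or
    censored) durations under the model's discrete output distribution."""
    idx = np.arange(len(event_bins))
    ll = np.where(observed,
                  np.log(pmf[idx, event_bins] + 1e-12),
                  np.log(1.0 - cdf[idx, event_bins] + 1e-12))
    return -ll.sum()

def ranking_loss(cdf, durations, event_bins, sigma=0.1):
    """L_2 of Eq. (14): penalise pairs whose predicted CDF values at the
    earlier event time are ordered incorrectly, eta(x, y) = exp(-(x - y)/sigma)."""
    loss = 0.0
    for i in range(len(durations)):
        for j in range(len(durations)):
            if i != j and durations[i] < durations[j]:
                x = cdf[i, event_bins[i]]   # F(tau_i | x_i)
                y = cdf[j, event_bins[i]]   # F(tau_i | x_j)
                loss += np.exp(-(x - y) / sigma)
    return loss
```

Correctly ordered pairs (x well above y) contribute almost nothing, while inverted pairs are penalised heavily; the small epsilon in L_1 guards against log(0) on the final grid point.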
The hyper-parameter grids used for all machine learning models can be found in the supplementary material, and all models are trained using 100 instances of random search.

To this point, all models discussed in this section have been static: an individual has a covariate vector x_i, it is passed through some model, and an estimate of its hazard function, failure function or an alternative is attained. However, in practice our specific application contains a significant amount of information that may be useful in determining the duration that is not available at the start of the incident. Such information in the traffic domain includes police reports made on the scene, recovery information and, specifically of use to us, the time-series provided by the sensors along the road. A significant incident on a road network could lead to speed drops, flow breakdown and travel time spikes, all of which will be evident when we inspect the time-series as the incident progresses. Moreover, the recovery of the link to normal operating conditions is closely tied to these time-series, firstly through the level of the speed series (as this defines how far from a baseline we are), but one could imagine much richer indicators of traffic state can be mined from them. Recalling Figure 2B, we see significant structure in the series: a linear drop near the start of the incident, an unstable oscillation period at lower speeds, then a recovery to normal conditions. Further examples of these series for incident periods can be found in the supplementary material. A number of methods have been suggested to handle dynamic predictions in a survival analysis setting.
With any dynamic prediction approach, the goal is to provide estimates of a hazard function, survival function or similar at some time t, conditioned on the fact that the individual has survived to time t and any covariates they provide. A simple method to do this is known as 'landmarking' and is discussed in Dafni [2011]. We note from the outset that landmarking is similar to the truncated regression discussed in section 2; however, this terminology is consistent with the wider survival analysis literature. To carry out landmarking, one first specifies a set of 'landmark times' { t_LM_1, t_LM_2, . . . , t_LM_K } at which we want to make dynamic predictions. One then chooses some survival model, for example a Cox model, and the hazard function at landmark time t_LM_j becomes:

λ_i(t | x_i(t_LM_j), t_LM_j) = λ_0(t | t_LM_j) e^{x_i(t_LM_j)' β(t_LM_j)}   (15)

with t_LM_j ≤ t < t_LM_j + ∆t, for some ∆t defining how far ahead we are interested in looking. Notice how, compared to Eq. (4), the covariate values x_i are replaced with those known at time t_LM_j, and the regression coefficients and baseline hazard can vary based on the landmark time. At each landmark time, only incidents that are still ongoing are retained, so the model is therefore conditioned on surviving up to this landmark time. To account for potential time-varying
effects and avoid misspecification of the regression parameters, events that occur after t_LM_j + ∆t are administratively censored, that is, they are marked as censored if they survived past the look-ahead time of interest. Such a model is simple to implement, as one can refine the dataset at different times to produce dynamic models. However, as the landmark time grows and less data becomes available, some power may be lost when drawing statistical conclusions. To implement these models, we choose landmark times t_LM of {0, 15, 30, 45, 60, 120} minutes and horizons ∆t of {5, 15, 30, 45, 60, 120, 180, 240} minutes, and display results throughout section 5. Finally, the landmarking framework can be applied with models other than a Cox model, so we consider both Cox and RSF landmarking models as two candidate dynamic prediction models, with RSF offering a non-linear alternative. The same was done to compare dynamic models in Lee et al. [2020].

The models considered so far require us to manually engineer features from the time-series variables to incorporate them as covariates. As discussed, we use the level and gradient of the residual series, as these indicate both how close the link is to reaching standard behaviour, and whether the situation is getting better or worse. However, gradients computed in short windows can be noisy, yet computed on large windows they may lead to significant delay in identifying features. Instead, we would like some automated method that, given a time-series, is able to learn meaningful features from it and incorporate them into predictions. Such a method is proposed in Jarrett et al. [2020], where the authors detail a sliding window model which they name MATCH-Net. We note that the algorithm is designed to make dynamic predictions accounting for missing data; in our application we do not have missing data, so we are interested in the dynamic prediction aspect only.
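Constructing the training set for a single landmark time amounts to filtering the still-active incidents and applying the administrative censoring described above; a sketch, where covariates_at is a hypothetical accessor for the features known at the landmark time:

```python
def landmark_dataset(durations, covariates_at, t_lm, delta_t):
    """Build the risk set for landmark time t_lm, as used in Eq. (15):
    keep only incidents still ongoing at t_lm, freeze their time-varying
    covariates at t_lm, and administratively censor events that occur
    after t_lm + delta_t."""
    rows = []
    for i, tau in enumerate(durations):
        if tau <= t_lm:
            continue                             # already ended: not at risk
        x = covariates_at(i, t_lm)               # features known at t_lm
        if tau <= t_lm + delta_t:
            rows.append((x, tau, 1))             # event observed in the horizon
        else:
            rows.append((x, t_lm + delta_t, 0))  # administratively censored
    return rows
```

For example, with durations of 10, 40 and 100 minutes, a landmark time of 30 and a horizon of 30, the first incident is dropped, the second is kept as an observed event at 40 minutes, and the third is censored at 60 minutes.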
In this model, a window of longitudinal measurements is fed through a convolutional neural network (CNN), with the convolutions learning features from the time-series that are then used for prediction of risk in some look-ahead window. The model slides across the data, shifting the window each time, meaning features are updated as time progresses and predictions are also shifted forward. It is upon this that we base our sliding window methodology.

Specifically, we take a historical window of length w, and at time t feed the time-series from time t − w to t through a CNN, where the filters in the CNN aim to derive features from the series without any manual specification of what they should be. We then concatenate the features output by the CNN with the time-invariant features, and pass these through a series of fully connected layers. In Jarrett et al. [2020], the output layer was a discrete space upon which a softmax activation was applied, and we again consider this, a mixture of log-normal distributions, and a kernel smoothed output distribution. At each input time, we consider a window ahead of the same length as in the fixed case, and treat w as a hyper-parameter to optimize. Since this model is more complex and has far more parameters than in the former case, we consider the discrete distribution to be piecewise constant over 5-minute intervals. As a result, the output space decreases in size by 80% without sacrificing too much freedom. A schematic of the network architecture is given in Figure 3, with the output layer left intentionally vague to be clear that we consider multiple different forms of output.
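The windowing and concatenation step can be sketched as follows, with fixed arrays standing in for the learned CNN filters (in the real model these weights are trained end-to-end):

```python
import numpy as np

def sliding_window_features(series, filters, static_features, t, w):
    """Sketch of the sliding-window input at time t: convolve the last w
    minutes of each residual series with a bank of filters (here fixed
    arrays stand in for learned CNN weights), max-pool each response over
    time, and concatenate with the time-invariant features."""
    window = series[:, t - w:t]                       # (n_series, w)
    responses = []
    for f in filters:                                 # each filter: 1-D array
        for s in window:
            conv = np.convolve(s, f, mode="valid")    # temporal convolution
            responses.append(conv.max())              # max-pool over time
    return np.concatenate([np.array(responses), static_features])
```

As t advances, the same filters are re-applied to the shifted window, so the derived features update automatically as the incident progresses.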
[Figure 3 schematic: window of time-series → CNN; time-invariant features → concatenate → fully connected layers → output layer]
Figure 3: Network schematic for sliding window model. We pass filters across the residual time-series to engineerfeatures from each time-series, then concatenate these with the time invariant features to create a feature vector, whichis passed through a series of fully connected layers, and some output layer is applied to the result. The example shownis for a single traffic incident being passed through the network. The number of boxes for features is not to scale. Thewindow of time-series represents 3 variables and a window size of 7 in this simple example.
A point infrequently discussed in the context of traffic incidents is that there are multiple criteria that define a 'good' hazard model, and multiple ways to measure this in the dynamic setting. We discuss some of these ways in the text below. We note also that elastic net regularization is applied to all deep learning methods, and the optimal Cox and AFT models are selected through inspection of the sample-size adjusted Akaike information criterion (AIC) to avoid over-fitting.
The concordance index (C-index) has different definitions in the static and dynamic settings. In the static setting, we write it as:

C = P( F̂(τ_i | x_i) > F̂(τ_i | x_j) | τ_i < τ_j ).   (16)

Eq. (16) is the so-called 'time dependent' definition used in Lee et al. [2018], accounting for the fact that we care about the entire function F̂ and not a single point value. In the dynamic setting, it is written, given prediction time t and evaluation time ∆t, as:

C(t, ∆t) = P( F̂(t + ∆t | x_i(t)) > F̂(t + ∆t | x_j(t)) | τ_i < τ_j, τ_i < t + ∆t ).   (17)

The only difference here is that we are now specifically computing the C-index at a given prediction time and horizon rather than over the entire dataset; this is the definition given in Lee et al. [2020]. In computing this, we take the F̂ values at some time t + ∆t and compare pairs where an incident i actually ended within the horizon.

As described in Lee et al. [2020], such a measure compares the ordering of pairs. If individual i experienced an event before individual j, then we would expect a good model to correctly assign more chance of an event to individual i than j. A model with a perfect C-index, given N traffic incidents, will perfectly predict the order in which the incidents end. This idea stems from viewing survival analysis as a ranking problem, and since we compare the CDF for two events, we see that it incorporates the entire history from a prediction time up to an evaluation time, not just a single point measurement. A random model will achieve a C-index of 0.5 on average, and a perfect model will attain a value of 1.0, so these are reference values to consider when interpreting this measure.

We formally compute Eq.
(17) given our dataset by evaluating:

C(t, ∆t) ≈ [ Σ_{i ≠ j} A_{i,j} · 1( F̂(t + ∆t | x_i(t)) > F̂(t + ∆t | x_j(t)) ) ] / [ Σ_{i ≠ j} A_{i,j} ],   A_{i,j} = 1(τ_i < τ_j, τ_i < t + ∆t)   (18)

where we simply evaluate empirically how often the ordering is correct, conditioned on the requirements. The same is true for the static case. If two incidents happen to give exactly the same CDF values when evaluating, we take the convention of adding 0.5 to the total rather than 0 or 1, following the convention in Harrell Jr et al. [1996].

The Brier score measures how well calibrated a model is, and compares the binary label (1 if an event has happened at some time, 0 otherwise) with the model prediction at that time. Formally, we write:

BS(t, ∆t) = (1/N) Σ_{i=1}^N ( 1(τ_i < t + ∆t) − F̂(t + ∆t | x_i(t)) )²   (19)

where we sum over all events still active at time t, and ask whether incident i ended before t + ∆t. If it did, then we would expect a good model to have a high CDF value at this point, with 1 being perfect (i.e. predicting with certainty that the incident would end by t + ∆t). On the other hand, if the incident did not end before t + ∆t, then we would expect a low CDF value. This definition is that proposed in the supplementary material of Lee et al. [2020]. In a sense, this measures the mean squared error of a probabilistic forecast of a binary outcome. In terms of reference values, a model that outputs a survivor function value equal to 0.5 at a particular time will have a Brier score of 0.25, so values lower than this are desirable, and a perfect model will achieve a score of 0.

Whilst the C-index and Brier score are used throughout the survival literature, we also note that the mean absolute percentage error (MAPE) is used throughout the traffic literature and has some practical relevance to our application. Since we have no censoring, we know the true duration for each traffic incident.
As such, we can evaluate, for every data-point, the error between a point-prediction and the true duration. A natural choice for such a point prediction is the median of each model's output distribution [Yuan, 2008, Qi and Teng, 2008, Li et al., 2015], as the distribution of traffic incident durations is known to be heavy-tailed. We can then ask what the point-wise error is for each model. Note that the C-index and Brier score asked about the accuracy of the output distribution, whereas this asks about a single point taken from the distribution. Highways England currently states that NTIS are measured on their prediction of
the 'likely delay associated with an event.' Specifically, NTIS is scored as follows. One aggregates all incidents that have a predicted return-to-profile time made at their half-way point and lasted over 1 hour. The MAPE between these predictions at the half-way point and the true values is computed. The target for NTIS predictions is for this to be below 35%, and it is stated in IBI Group UK [2019] that the current value in practice is 35.49%.

There is of course a problem with this criterion, in that a 'perfect' model by this standard would just always predict double the elapsed duration, which would optimise the prediction at the mid-point but be of no practical use aside from this. Regardless, this rough measure allows us to frame our work in the context of the practical considerations traffic operators are currently working towards.
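Both survival metrics can be computed directly from the predicted CDF values at t + ∆t; a minimal sketch of Eqs. (18) and (19), where F_vals holds one such CDF value per incident:

```python
import numpy as np

def dynamic_c_index(F_vals, durations, t, dt):
    """Empirical C-index of Eq. (18): over admissible pairs (tau_i < tau_j
    and tau_i < t + dt), count how often the model assigns i the larger
    CDF value at t + dt; ties contribute 0.5."""
    num, den = 0.0, 0
    for i in range(len(durations)):
        if durations[i] >= t + dt:
            continue                               # i must end within the horizon
        for j in range(len(durations)):
            if i == j or durations[i] >= durations[j]:
                continue
            den += 1
            if F_vals[i] > F_vals[j]:
                num += 1.0
            elif F_vals[i] == F_vals[j]:
                num += 0.5                         # tie convention
    return num / den if den else float("nan")

def brier_score(F_vals, durations, t, dt):
    """Eq. (19): squared error between the binary outcome 'ended before
    t + dt' and the predicted CDF at t + dt, over incidents active at t."""
    active = durations > t
    y = (durations[active] < t + dt).astype(float)
    return np.mean((y - F_vals[active]) ** 2)
```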
We begin by considering how a range of models perform in the static sense. For this, we use only the fixed covariateinformation available at the start of the incident to fit the models. We apply each of the discussed models, and showresults for all metrics in
Table 2.

Table 2: Performance measures for models in a static setting, where we make only a single prediction per incident using a set of time-invariant covariates. Optimal values are highlighted in bold. AFT (LN) - an accelerated failure time model assuming a log-normal distribution of incident durations. AFT (W) - an accelerated failure time model assuming a Weibull distribution of incident durations. Cox - a linear Cox regression model. RSF - random survival forest. NN (LN) - a feed-forward neural network model with an output layer that parametrises a mixture of log-normal distributions. NN (NP) - a feed-forward neural network model with a non-parametric output layer. NN (Kernel) - a feed-forward neural network with a kernel smoothed output. Brier scores are reported at prediction horizons of 5, 15, 30, 45, 60, 120, 180 and 240 minutes, together with their mean.

Model        C-Index   Point-wise MAPE
AFT (LN)     0.624     40.677
AFT (W)      0.624     38.543
Cox          0.626     38.545
RSF
NN (LN)
NN (NP)
NN (Kernel)  0.659     39.332

We see that the ordering of incident durations, measured by the C-index, attains values between 0.624 and 0.676. All models are informative, beating the 0.5 reference value, and the biggest gains in C-index are seen when we go from the linear to non-linear modelling frameworks. The RSF achieves the optimal C-index, followed by the neural network with a mixture of log-normals.

In terms of point-wise error, we do not make predictions at the half-way point of incidents; we only make them at the start of an incident in this setting. Doing so and measuring for all incidents longer than 60 minutes, we see that all models achieve a MAPE between 37% and 41%, with the best model being the non-parametric neural network. No model achieves a MAPE of less than 35%. Finally, the optimal Brier score is always achieved by the RSF method, with the most noticeable differences observed at horizons of 120 and 180 minutes. There is not much to distinguish many of these models, and ultimately one might suggest that in a static setting, an RSF offers a good compromise between performance measured by C-index, Brier score and MAPE; however, if MAPE is the single desired criterion, a non-parametric neural network model would be preferred.
We now consider the models in a dynamic setting. We consider the C-index as defined in Eq. (18), and show results for it at various prediction times and horizons in
Table 3. Initially, the RSF achieves the optimal C-index across all horizons when predicting at t = 0. As the time of prediction increases, we see a strong favouring of neural network models; specifically, the sliding window model with a kernel smoothed output achieves the optimal C-index in most cases. There is a systematic preference for the non-parametric sliding window models compared to others at all prediction horizons when considering prediction times of 30 minutes or greater. Even at a prediction time of 15 minutes, the non-parametric sliding window models are preferred when considering 180 and 240 minute horizons. As a general summary of Table 3,
one should see that, out of all prediction time and prediction horizon pairs considered, the optimal model in terms of C-index is the RSF roughly 34% of the time, the sliding window neural network with kernel output 43% of the time, and the sliding window neural network with non-parametric output the remainder of the time.

Averaged over all horizons, we see that one would prefer the RSF model when initially making predictions, but all prediction times after 15 minutes favour the kernel smoothed output, with the non-parametric neural network often similar in performance. The neural network model parametrising a mixture of log-normals achieved the highest C-index of the neural network models in the static case, closely following the RSF model; however, it never wins in the dynamic case, suggesting that when we provide the time-series features, the RSF makes better use of them initially, and then the other neural network models make better use of them as time passes. All models achieve C-index values higher than the reference value of 0.5 across all prediction times and horizons, showing their predictions remain informative.

One point of note from Table 3 is that the Cox model has quite a poor C-index compared to the alternatives considered when predicting at a horizon of 5 minutes. We believe this is due to the amount of administrative censoring introduced at such a short horizon. If we look to van Houwelingen and Putters [2012], an assumption of the Cox landmarking model is that there is not too much censoring at the horizon time. For a very short horizon of 5 minutes, almost all incidents last longer than this, and hence, when we apply our administrative censoring, this assumption becomes invalid; we suspect this is why the Cox model has poor results at this horizon.

We further show the Brier scores for each model in
Table 4. Again, we observe that initially the random survival forest achieves optimal scores across all horizons; however, as the time of prediction increases we gradually see the sliding window neural network with kernel smoothed output start to achieve better Brier scores for short prediction horizons. This is again systematic, and we see for a prediction time of 120 minutes that the optimal model is the sliding window neural network with kernel smoothed output at horizons up to and including 45 minutes, but for a prediction time of 45 minutes it is only optimal for horizons up to and including 15 minutes. One could postulate that initially the time-series provide less useful information than the fixed features; that is, at the very start of an incident, we see only the state of the link before the incident, which might have been reasonably seasonal. However, as time progresses, we attain more informative features specific to single incidents, and in this case the fact that the sliding window method engineers its own features, rather than using the noisy gradient and level values we manually input to the RSF model, may prove more useful. Despite this, if a duration is very long, say 4 hours, and we make a prediction 60 minutes into it, how much do we truly expect to gain from inspecting the time-series so far? It may be that there is simply no sign of recovery, and all we can really conclude is that speed has been slow for a long time and shows no other clear features.

Another point of note when considering the Brier score is that RSF appeared to perform well compared to a non-parametric neural network model in other works. If we look to the supplementary material of Lee et al. [2020], we see that a neural network with a non-parametric output did not consistently improve upon the Brier score achieved by an RSF model (see table VI in the supplementary material of the cited reference).
It is therefore unclear if there is some fundamental reason for this in the modelling framework, as two entirely different datasets and applications appear to have observed the same behaviour. Despite this, all models achieve Brier scores below the reference value of 0.25.

Finally, we show the error in a point prediction made at various times throughout incidents in
Table 5. For reference, we also include the value achieved by the fixed model, to get an idea of what we are gaining from making dynamic predictions. From
Table 5, we see that when making a prediction after 30% of the duration of an incident has passed, we can expect between 30% and 33% MAPE in that prediction. This is around a 5-10% improvement on the prediction made by the corresponding static models at the start of the incident. If we predict half way through an incident, we see that the neural network models all achieve a significantly better MAPE value than the landmarking models, with an optimal MAPE of 21.6% achieved by the mixture of log-normals model, followed by 22% for the non-parametric model. The discrepancy between the sliding window and landmarking models grows as we make predictions later and later, with the sliding window models achieving a MAPE of between 16.5% and 17.3%, compared to a value of 26.5% for the optimal landmarking model (RSF). Additionally, the prediction error shows very little improvement moving beyond the 50th percentile of an incident's duration for the RSF model, and actually increases for the Cox model, suggesting that they are not sufficiently capturing signs in the time-series that indicate the end is near. A key point of practical interest is that with the dynamic models, we do indeed achieve a MAPE value below 35%, as desired by Highways England. Of course, we would need to attain data for all incidents across the UK to truly ensure that we are able to maintain this on a wider scale, but as far as we are able to measure, we achieve what would be considered industrially satisfactory error rates with the dynamic models.

Whilst the landmarking models appear to plateau in point-wise performance here, we note that this is partly due to the plateauing or noisy error they exhibit when making predictions at large landmarking times, when only very few incidents remain active. We visualize the error per minute into incidents in the supplementary material, which shows this.
The plots within the supplementary material are more akin to those found in other dynamic works, for example Figure 2 in Li et al. [2015].
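The point predictions scored above are medians of each model's output distribution; a minimal sketch of the scoring, with illustrative horizons, probability masses and durations (not values from our data):

```python
# Point prediction = median of the predicted distribution, scored with MAPE.

def median_from_pmf(grid, pmf):
    """Median of a distribution given as probability mass on a time grid."""
    total = 0.0
    for t, p in zip(grid, pmf):
        total += p
        if total >= 0.5:
            return t
    return grid[-1]

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

grid = [5, 15, 30, 45, 60, 120, 180, 240]                # horizons (minutes)
pmf = [0.05, 0.10, 0.25, 0.25, 0.15, 0.10, 0.06, 0.04]   # one incident's predicted PMF
pred = median_from_pmf(grid, pmf)                        # -> 45
print(mape([50.0], [pred]))                              # -> 10.0
```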
Table 3: C-index values for the considered models, across a range of prediction times (when predictions are made) and prediction horizons (at what time after the prediction time they are evaluated). Higher values indicate a better model. Optimal values for each prediction time and prediction horizon pair are shown in bold. Cox: linear Cox landmarking model. RSF: random survival forest landmarking model. SW (LN): sliding window with log-normal mixture output. SW (NP): sliding window with non-parametric output. SW (Kernel): sliding window with kernel-smoothed output.
| Prediction Time | Model | 5 | 15 | 30 | 45 | 60 | 120 | 180 | 240 | Mean Over Horizons |
|---|---|---|---|---|---|---|---|---|---|---|
| t = 0 | Cox | – | 0.851 | 0.799 | 0.754 | 0.712 | 0.654 | 0.642 | 0.638 | 0.721 |
| | RSF | – | … | | | | | | | |
| | SW (LN) | – | 0.766 | 0.709 | 0.673 | 0.650 | 0.606 | 0.599 | 0.597 | 0.657 |
| | SW (NP) | – | 0.798 | 0.743 | 0.705 | 0.682 | 0.642 | 0.637 | 0.634 | 0.692 |
| | SW (Kernel) | – | 0.823 | 0.757 | 0.717 | 0.689 | 0.641 | 0.634 | 0.630 | 0.699 |
| t = 15 | Cox | 0.513 | 0.864 | 0.784 | 0.732 | 0.698 | 0.648 | 0.639 | 0.637 | 0.689 |
| | RSF | … | | | | | | | | |
| | SW (LN) | 0.927 | 0.851 | 0.773 | 0.732 | 0.694 | 0.644 | 0.633 | 0.632 | 0.736 |
| | SW (NP) | 0.947 | 0.884 | 0.811 | 0.772 | 0.731 | 0.678 | … | | |
| t = 30 | Cox | 0.529 | 0.861 | 0.770 | 0.731 | 0.702 | 0.662 | 0.652 | 0.648 | 0.664 |
| | RSF | 0.921 | 0.880 | 0.795 | 0.760 | 0.738 | … | | | |
| t = 45 | Cox | 0.504 | 0.859 | 0.796 | 0.764 | 0.724 | 0.679 | 0.667 | 0.661 | 0.707 |
| | RSF | 0.970 | 0.884 | 0.817 | 0.788 | … | | | | |
| t = 60 | Cox | 0.578 | 0.872 | 0.796 | 0.755 | 0.723 | 0.684 | 0.670 | 0.668 | 0.718 |
| | RSF | 0.954 | 0.887 | 0.811 | 0.782 | 0.742 | 0.700 | 0.674 | 0.654 | 0.776 |
| | SW (LN) | 0.928 | 0.871 | 0.796 | 0.759 | 0.726 | 0.681 | 0.672 | 0.671 | 0.763 |
| | SW (NP) | 0.953 | 0.903 | 0.830 | 0.787 | 0.755 | 0.711 | … | | |
| t = 120 | Cox | 0.522 | 0.850 | 0.804 | 0.781 | 0.750 | 0.706 | 0.692 | 0.683 | 0.724 |
| | RSF | 0.961 | 0.889 | 0.839 | 0.807 | 0.777 | 0.731 | 0.697 | 0.688 | 0.799 |
| | SW (LN) | 0.944 | 0.878 | 0.822 | 0.799 | 0.769 | 0.718 | 0.715 | 0.713 | 0.795 |
| | SW (NP) | 0.968 | 0.896 | 0.852 | 0.824 | 0.791 | … | | | |

(Column headers are the prediction horizons in minutes. Cells marked – were empty in the original; … marks entries and rows, including the SW (Kernel) rows beyond t = 0, that could not be recovered from the source.)
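The C-index reported in Table 3 is a concordance measure over pairs of incidents; a minimal sketch of Harrell's C for right-censored data (the paper evaluates a time-dependent variant; this illustrative version skips pairs with tied times):

```python
from itertools import combinations

def c_index(times, events, risks):
    """Harrell's concordance index. A pair (i, j) is comparable if the shorter
    observed time had an event; it is concordant if that subject also received
    the higher predicted risk. Ties in risk count as half-concordant."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[j] < times[i]:
            i, j = j, i          # order so i has the shorter observed time
        if times[i] == times[j] or not events[i]:
            continue             # not a comparable pair
        comparable += 1
        if risks[i] > risks[j]:
            concordant += 1.0
        elif risks[i] == risks[j]:
            concordant += 0.5
    return concordant / comparable

times  = [10, 25, 40, 60]        # observed durations (minutes)
events = [1, 1, 0, 1]            # 1 = incident ended, 0 = censored
risks  = [0.9, 0.7, 0.2, 0.4]    # higher risk should mean shorter duration
print(c_index(times, events, risks))  # -> 1.0
```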
As we have applied methods from Jarrett et al. [2020] to formulate a sliding window model, we have naturally included temporal convolutions to generate information from the time-series. However, a valid question is whether these are required in our application, or whether simple levels and gradients retain enough information to make informative predictions. We test this by implementing a model without the CNN structure, instead feeding a feed-forward network an input vector consisting of the time-invariant features and the levels and gradients computed as in the landmarking case. Doing so results in a model that achieves a worse Brier score and C-index across all prediction time and horizon pairs, and a worse error at the halfway point. Exact results are given in the supplementary material.
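The "levels and gradients" inputs used in this comparison can be sketched as follows; the window length and values are illustrative, not from our data:

```python
# For each traffic series (speed, flow, travel time), the level is the latest
# observed value and the gradient is a least-squares slope over a trailing window.

def level_and_gradient(window):
    """Return (level, gradient) for a list of equally spaced observations."""
    n = len(window)
    level = window[-1]
    mean_x = (n - 1) / 2.0
    mean_y = sum(window) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(window))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return level, num / den

speed_window = [110, 104, 96, 90, 82]   # km/h over the last five observations
level, grad = level_and_gradient(speed_window)
print(level, round(grad, 2))            # -> 82 -7.0
```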
Table 4: Brier scores for the considered models, across a range of prediction times (when predictions are made) and prediction horizons (at what time after the prediction time they are evaluated). Lower values indicate a better model. Optimal values for each prediction time and prediction horizon pair are shown in bold. All keys are as defined in Table 3.

| Prediction Time | Model | 5 | 15 | 30 | 45 | 60 | 120 | 180 | 240 | Mean Over Horizons |
|---|---|---|---|---|---|---|---|---|---|---|
| t = 0 | Cox | … | | | | | | | | |
| | SW (LN) | 0.006 | 0.096 | 0.197 | 0.260 | 0.315 | 0.317 | 0.204 | 0.115 | 0.189 |
| | SW (NP) | 0.007 | 0.082 | 0.160 | 0.213 | 0.262 | 0.290 | 0.195 | 0.113 | 0.165 |
| | SW (Kernel) | 0.006 | 0.079 | 0.159 | 0.211 | 0.261 | 0.292 | 0.197 | 0.114 | 0.165 |
| t = 15 | Cox | 0.027 | 0.062 | 0.113 | 0.152 | 0.192 | 0.218 | 0.149 | 0.088 | 0.125 |
| | RSF | … | | | | | | | | |
| | SW (LN) | 0.020 | 0.077 | 0.157 | 0.212 | 0.266 | 0.280 | 0.176 | 0.097 | 0.161 |
| | SW (NP) | 0.019 | 0.069 | 0.134 | 0.180 | 0.230 | 0.259 | 0.170 | 0.096 | 0.145 |
| | SW (Kernel) | … | | | | | | | | |
| t = 30 | Cox | 0.025 | 0.068 | 0.127 | 0.165 | 0.198 | 0.204 | 0.138 | 0.082 | 0.126 |
| | RSF | 0.020 | … | | | | | | | |
| | SW (LN) | 0.019 | 0.074 | 0.154 | 0.204 | 0.250 | 0.255 | 0.161 | 0.090 | 0.151 |
| | SW (NP) | 0.018 | 0.067 | 0.138 | 0.183 | 0.223 | 0.236 | 0.155 | 0.088 | 0.139 |
| | SW (Kernel) | … | | | | | | | | |
| t = 45 | Cox | 0.028 | 0.073 | 0.131 | 0.162 | 0.196 | 0.194 | 0.132 | 0.082 | 0.125 |
| | RSF | 0.018 | … | | | | | | | |
| | SW (LN) | 0.021 | 0.082 | 0.157 | 0.201 | 0.245 | 0.239 | 0.154 | 0.090 | 0.149 |
| | SW (NP) | 0.019 | 0.074 | 0.138 | 0.176 | 0.217 | 0.221 | 0.148 | 0.089 | 0.135 |
| | SW (Kernel) | … | | | | | | | | |
| t = 60 | Cox | 0.034 | 0.083 | 0.139 | 0.168 | 0.196 | 0.185 | 0.129 | 0.083 | 0.127 |
| | RSF | 0.023 | 0.079 | … | | | | | | |
| | SW (LN) | 0.025 | 0.082 | 0.154 | 0.199 | 0.240 | 0.228 | 0.145 | 0.090 | 0.145 |
| | SW (NP) | 0.023 | 0.073 | 0.136 | 0.176 | 0.212 | 0.210 | 0.139 | 0.089 | 0.132 |
| | SW (Kernel) | … | | | | | | | | |
| t = 120 | Cox | 0.034 | 0.090 | 0.146 | 0.168 | 0.184 | 0.169 | 0.117 | 0.083 | 0.124 |
| | RSF | 0.024 | 0.084 | 0.132 | 0.157 | … | | | | |
| | SW (LN) | 0.026 | 0.087 | 0.148 | 0.179 | 0.213 | 0.199 | 0.137 | 0.091 | 0.135 |
| | SW (NP) | 0.023 | 0.082 | … | | | | | | |

(Column headers are the prediction horizons in minutes. … marks entries not recoverable from the source.)
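The Brier score in Table 4 measures squared error of predicted survival probabilities at a horizon; a minimal sketch, assuming no censoring for simplicity (the full evaluation would additionally use inverse-probability-of-censoring weights):

```python
def brier_score(t_star, times, surv_probs):
    """Brier score at horizon t_star for uncensored data.
    surv_probs[i] is the predicted P(duration_i > t_star)."""
    n = len(times)
    return sum((float(T > t_star) - s) ** 2 for T, s in zip(times, surv_probs)) / n

durations = [20, 50, 90]        # actual durations (minutes), illustrative
surv_at_60 = [0.1, 0.3, 0.8]    # predicted P(duration > 60) per incident
print(round(brier_score(60, durations, surv_at_60), 4))  # -> 0.0467
```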
Variable importance is a topic often addressed in the literature, and we offer some discussion of it here for both the RSF model and the non-parametric neural network model, the latter chosen over the kernel model for simplicity.
Table 5: MAPE at various points into incidents, all of which are at least 60 minutes long. The optimal model for each prediction point is shown in bold. The point prediction is generated as the median of the output distribution from each model. All keys are as defined in Table 3.

| Model | MAPE, Static Model | Dynamic, prediction at 30th pct. | 50th pct. | 70th pct. | 90th pct. |
|---|---|---|---|---|---|
| Cox | 38.545 | 32.513 | 31.002 | 32.180 | 35.923 |
| RSF | 41.607 | … | | | |

(Percentiles refer to the point into an incident's duration at which the dynamic prediction is made; the remaining rows were not recoverable from the source.)

As RSFs are adaptations of random forest methods, standard variable importance metrics are well explored and readily implemented. Recall that each tree is trained on a bootstrap-sampled dataset, so for each tree a set of 'out of bag' observations remains, which is used to measure variable importance. Given a trained forest and some variable of interest x, we drop the out-of-bag data for each tree down the tree and, whenever a split on x is encountered, assign a daughter node at random instead of evaluating the split on the value of x. We then compute the model's estimates under this procedure, and the variable importance for x is the prediction error of the new ensemble, which ignores the value of x, minus the prediction error of the original ensemble. A large variable importance suggests that a variable is highly useful in accurately predicting the output. We compute the importance for all features, then scale the importance values by dividing by the largest. This yields variable importance on a scale from 0 to 1, and we plot particularly important variables in
Figure 4.

Figure 4 (bar-chart panels not reproduced here; each panel ranks the variables Time Of Day, Weekday Y/N, Season, Link Length, Location, Incident Type, Speed, Flow, Travel Time and their gradients by relative importance): Variable importance, as measured for the random survival forest model, for a subset of the important variables. Each plot is a different prediction horizon h; all shown are at a prediction time of t = 30. The importance at each prediction horizon has been normalized such that the largest importance at any time is 1, and all others are relative to this. The rank of each variable at a given time is written beside each bar. Due to the scaling, one should focus on the ranking and the relative difference between bars in each plot. (A): h = 5. (B): h = 30. (C): h = 60. (D): h = 180.
From
Figure 4, we see that the speed value is the most important variable for horizons of 5, 30 and 60 minutes. At 180 minutes, the most important variable becomes the time of day. This makes intuitive sense, as we expect the time-varying features to provide more useful information about the immediate future than about times far ahead. Similar reasoning applies to the importance of travel time and flow. It is interesting to note that flow is often less important than other, time-invariant, features across all horizons. The gradients are always less important than the residual values themselves, which may be a consequence of noise when estimating them, or of the fact that we can see short-term rises and falls in the traffic variables that do not indicate the incident is actually near ending, but rather that the traffic state is unstable, as in Figure 4. The location of an incident is always somewhat important, ranking between fourth and sixth across all horizons. This suggests clear heterogeneity in durations by location. Note that with a location and a length, a model should be able to identify single links in the network, so predictions can be specific to these if it improves performance. However, the importance of length decays over horizons and is always less than that of location itself, so the coarse segmentation we have introduced for location seems more important than the specific link an incident occurs on. We see that the season becomes increasingly important as the horizon increases. The type of incident is the sixth, fifth, eighth and sixth most important variable at horizons of 5, 30, 60 and 180 minutes respectively.
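The post-processing applied to the raw importance values (error of the randomized ensemble minus error of the original, scaled by the largest value) can be sketched as follows; the raw numbers below are illustrative only, not taken from our results:

```python
# Scale raw RSF importances so the most important variable has importance 1,
# as done for Figure 4, then rank.

def normalise_importance(raw):
    top = max(raw.values())
    return {name: v / top for name, v in raw.items()}

raw_vimp = {"Speed": 0.042, "Time Of Day": 0.035, "Flow": 0.007, "Season": 0.003}
scaled = normalise_importance(raw_vimp)
ranking = sorted(scaled, key=scaled.get, reverse=True)
print(ranking[0], round(scaled["Time Of Day"], 3))  # -> Speed 0.833
```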
Recently, there has been a significant effort to improve the interpretability of prediction models, both those involving neural networks and more general frameworks. Examples include Ribeiro et al. [2016] and Shrikumar et al. [2017]. In the first, the general idea is to build a simpler 'explainer' model g that locally approximates some complex model f. One optimizes g by penalizing complexity and assigning more weight to data-points near the one whose prediction we wish to explain, hence resulting in local accuracy. It was then shown in Lundberg and Lee [2017] that many existing model interpretability methods could be phrased in terms of a concept from game theory known as Shapley values. In short, they proposed explainer models of the form:

g(z') = φ_0 + ∑_{i=1}^{M} φ_i z'_i (20)

where z'_i is a binary value indicating the inclusion or exclusion of a particular feature and we have M features. They then specified three properties one might desire in a feature attribution method: local accuracy (the explainer's output matches the model's output for the instance being explained), missingness (no attributed impact for missing features) and consistency (between two models, if one's output is more sensitive than the other's to the change of a particular feature, that feature achieves a higher attribution value in it). The authors showed that under these properties and Eq. (20), the φ_i values coincide with Shapley values from game theory. This approach unified many existing methods, including Ribeiro et al. [2016] and Shrikumar et al. [2017]. They denoted φ_i a 'SHAP value'; these are appealing as they are additive, showing how particular features shift a model's prediction away from some mean φ_0 to the final result for a particular data-instance. Their absolute value shows the size of a particular feature's importance; however, one can go deeper and ask, for a given feature, whether this value increases or decreases the final prediction of the model.
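For a toy model with a handful of features, the Shapley attribution can be computed exactly by enumerating feature subsets; a minimal sketch, where the model F and the feature names are illustrative (cost is exponential in the number of features, which is why practical implementations approximate):

```python
from itertools import combinations
from math import factorial

def shap_values(F, features):
    """Exact Shapley values: F maps a frozenset of feature names to a model
    output evaluated with only those features present."""
    M = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for r in range(M):
            for S in combinations(others, r):
                weight = factorial(len(S)) * factorial(M - len(S) - 1) / factorial(M)
                total += weight * (F(frozenset(S) | {i}) - F(frozenset(S)))
        phi[i] = total
    return phi

# Toy additive model: output = 2*x1 + 3*x2 when those features are present.
def F(S):
    return 2.0 * ("x1" in S) + 3.0 * ("x2" in S)

print(shap_values(F, ["x1", "x2"]))  # -> {'x1': 2.0, 'x2': 3.0}
```

For an additive model, each feature's Shapley value recovers its individual contribution, which is the sanity check above.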
One computes the SHAP value for feature i as:

φ_i = ∑_{S ⊆ M \ {i}} [ |S|! (|M| − |S| − 1)! / |M|! ] [ F(S ∪ {i}) − F(S) ] (21)

where M is the set of all features. In Eq. (21), we sum over all subsets S of features that do not include feature i. For any such set S, we compute the difference between the model output using the features in S together with feature i, and the model output using only the features in S, given by F(S ∪ {i}) − F(S). The remaining term |S|!(|M| − |S| − 1)!/|M|! accounts for all possible orderings of the feature vector.

It is upon this that we base our feature importance exploration for the neural network model. We compute the SHAP values of the network, for the fixed features and for the features output from the CNN that derives information from the time-series. We use the implementation provided by the original authors (implementation and examples are given at https://github.com/slundberg/shap), specifically the permutation method, for computational speed and its handling of structured inputs. More details on SHAP values are given in the supplementary material.

A point of note for this method of feature importance is that we are computing values for each output neuron of the network, corresponding to particular horizons, and how this output value changes, not how some performance metric changes. As a result, we can ask whether different features have more or less impact on different parts of the output distribution for a single input data-point. First, we consider raw feature importance, that is, whether a variable has a large or small impact on the output of the model at particular horizons, showing results in Figure 5. From
Figure 5, we see that for very short horizons (5 minutes) there are a large number of time-series features with high importance. This makes intuitive sense, for the same reasoning as in the RSF case. After the time-series features, the time of day, location and incident type are the features with the highest impact. Moving to a horizon of 30 minutes, the time-series features become less important, and location and time of day dominate the other features. Note that at the 5-minute horizon there were many features with quite high importance, showing that a large number influenced the model's output, whereas at a horizon of 30 minutes two features have large importance and many others far less. At a horizon of 60 minutes, we again see the importance of the time-series features increase, but the time of day and location are still the two most important of the time-invariant features, ranking first and fifth respectively.

Figure 5 (bar-chart panels not reproduced here; each panel ranks the time-invariant features and the time-series features TSF 0 to TSF 11 by normalized SHAP value): Variable importance, as measured for the sliding window neural network model, for a subset of the important variables. Each plot is a different prediction horizon; all shown are at a prediction time of t = 30. The importance at each prediction time is computed by taking the average of the absolute SHAP values for each feature across all query instances, and has been normalized such that the largest importance at any horizon is 1, with all others relative to this. The rank of each variable at a given time is written beside each bar. Due to the scaling, one should focus on the ranking and the relative difference between bars in each plot. TSF stands for time-series feature, extracted by passing temporal convolutions across the data. (A): h = 5. (B): h = 30. (C): h = 60. (D): h = 180.
One might question why the time-series features resurge in importance here; we explore this further in Figure 6 and its analysis. At a long horizon of 180 minutes, the time of day is by far the most important feature, and the location is second, though relatively less important than it was at a horizon of 30 minutes. A natural interpretation is that, for long horizons into the future, knowing whether an incident will overlap with rush hour, lunchtime or the night is a good indication of whether it might last a long time.

Having inspected the magnitude of the SHAP values, we now ask how actual feature values shift the network output, either increasing or decreasing it. Visualizing this is more complex, because we want to plot the impact of various features, the value they attain, and whether this shifted the prediction up or down. A standard way to do this for SHAP values is a 'beeswarm' plot, in which each data-instance is plotted as a single dot, once for each
feature. Examples of making such plots for our dataset are given in
Figure 6. One should read these plots as follows. Firstly, along the y-axis are the features the model used, where categorical features have been split into their one-hot encoded states. Secondly, the x-axis displays the SHAP values, not normalized as they were in the previous analysis, showing how the particular feature shifts the model output up or down. Thirdly, the colour indicates the feature value; for binary features such as 'Morning Rush', a high value indicates the data-point was in the morning rush. Fourthly, where many data-points had similar SHAP values for the same feature, points are expanded outwards along the y-axis, so a large vertical strip of points for a single feature indicates a high density of points at that SHAP value. The horizon is indicated by h in each sub-caption, and corresponds to looking at a particular output node of the network. We split the fixed and series features to aid readability. Note that pushing the output 'up' (a feature with a positive SHAP value) means increasing the probability mass function's value at this time-horizon.

Whilst there is a large amount of information in Figure 6, we break down the main points here. Firstly, we see that at a horizon of 5 minutes (
Figures 6a, 6b), there is clear coherence in the features, shown by there not being a random assortment of colours across single features. Earlier, we saw that the time-series features, the time of day and the location were influential factors at this horizon. Considering the time of day more finely, we see from Figure 6a that when incidents are in the morning rush period, the value of the PMF at this time is decreased, suggesting the model believes it is unlikely that incidents at this time of day will end very quickly. Inspecting horizons of 30 and 60 minutes, in Figures 6c and 6e, we see that when incidents occur in the morning rush, this increases the output values at these horizons. Finally, there appears to be more complex behaviour at a horizon of 180 minutes, as incidents occurring in the morning rush sometimes increase, and sometimes decrease, the output at this time. If we turn instead to how a location impacts the result, for example inspecting the SHAP values for 'West', we see that attaining a value of 1 here decreases the model output at a horizon of 5 minutes (Figure 6a), increases it at horizons of 30 and 60 minutes (Figures 6c, 6e) and then decreases it again at a horizon of 180 minutes (Figure 6g). Note, however, that since some of these features are in fact categorical and have been one-hot encoded for use with a neural network, care must be taken in interpreting the impact of such values. In this analysis we obtain a SHAP value for each encoded feature, yet every data-point has a single location value equal to 1 and the rest equal to 0. Each location feature here will alter the neural network output, but the total impact of a data-point having a particular location is the sum of the SHAP values for that data-point over all encoded categories. As such, we can better visualize the impact of categorical variables, for example location, by first summing the SHAP values for a data-point over all encoded groups of a particular feature, then visualizing how the overall feature impacted predictions. We do so in
Figure 7.

The more intuitive description offered in Figure 7 allows one to see the overall impact of the location variable. An example interpretation one could read from Figure 7 is: for data-points where the spatial location was east, the overall impact of this location on the model output was an increase in the output at horizons of 5 and 30 minutes, and a decrease at 60 and 180 minutes. From this more refined view, we can see that incidents in the north east are quite varied, as their SHAP values sometimes shift up and sometimes shift down at all horizons. Incidents in the north typically have the output increased at horizons of 5 and 30 minutes (Figures 7a, 7b) and then decreased at horizons of 60 and 180 minutes (Figures 7c, 7d). This suggests, for example, that the model has learnt that incidents in the north are of shorter duration than incidents in the south, though this can then be adjusted further by other observed features. Locations can be compared in this way for all possible pairs.

A further question of interpretability relates to the features engineered from the time-series: what effect do these have on the model? We previously saw that they were highly important at 5 and 60 minute horizons, but we can now use
Figure 6 to consider what impact they are having. If we start with a horizon of 5 minutes, we see from Figure 6b that when the time-series features attain a high value, the model output at this horizon is increased. We do not have an interpretable explanation of these features, but we see that they can provide quite significant upward shifts in the output when their values are high. Moving to a horizon of 30 minutes, we see from Figure 6d that the series features become less coherent. Some features attain a high value and shift the output up, others down, and the overall impact is small for many features compared to that of the fixed features. This does indicate that the features learnt from the series are distinct in some sense, providing different impacts on the model output. At a horizon of 60 minutes, we see from Figure 6f that the features attaining a high value indicates a decrease in the model output at this point. From this we can interpret that when high values of the time-series features are attained, they make the model put more mass in the immediate future and less at horizons of 60 minutes or longer. This is perhaps an intuitive result: the time-series provide information that can significantly increase the model output at short-term horizons, and attaining these same values shifts the predictions down at long horizons.

Of course, the sheer amount of data available here is somewhat overwhelming; however, using SHAP values one can gain a significant understanding of why the machine learning model outputs particular values. For interested readers, we show plots of the overall impact of the categorical features in the supplementary material, omitted here for brevity. Care must always be taken to ensure that feature importance and causality are not conflated; rather, we are able to question why the model gives a set of outputs for a particular set of inputs.
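The categorical aggregation used above, summing SHAP values over a feature's one-hot columns per data-point and reporting the total against the category that point actually attained, can be sketched as follows (feature names and values illustrative):

```python
def aggregate_onehot_shap(shap_rows, onehot_rows, categories):
    """For each data-point, sum SHAP values over the one-hot group and attach
    the active category. Rows are dicts keyed by category name."""
    out = []
    for shap_row, hot_row in zip(shap_rows, onehot_rows):
        total = round(sum(shap_row[c] for c in categories), 6)
        active = next(c for c in categories if hot_row[c] == 1)
        out.append((active, total))
    return out

cats = ["North", "South", "East", "West"]
shap_rows = [{"North": 0.02, "South": -0.01, "East": 0.0, "West": -0.005}]
onehot_rows = [{"North": 1, "South": 0, "East": 0, "West": 0}]
print(aggregate_onehot_shap(shap_rows, onehot_rows, cats))  # -> [('North', 0.005)]
```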
Figure 6 (beeswarm panels not reproduced here; each panel plots per-data-point SHAP values, coloured from low to high feature value, for either the time-invariant features or the time-series features TSF 0 to TSF 21): SHAP values visualising how feature values shift the output of the network at particular horizons h. (A): Time-invariant features, h = 5. (B): Time-varying features, h = 5. (C): Time-invariant features, h = 30. (D): Time-varying features, h = 30. (E): Time-invariant features, h = 60. (F): Time-varying features, h = 60. (G): Time-invariant features, h = 180. (H): Time-varying features, h = 180.
Figure 7 (beeswarm panels not reproduced here): SHAP values for the location feature. In each plot, we have summed the SHAP values over each one-hot encoded value, and then plotted the resulting value only in the row corresponding to each data-point's true feature value. This shows the overall impact of the location feature, and allows one to view this impact separately for each value it attains. (A): h = 5. (B): h = 30. (C): h = 60. (D): h = 180.

Conclusions

In this work, we have addressed a number of issues raised in the literature regarding traffic incident duration prediction. Firstly, we considered a method to determine incident duration as not only when an operator declared it cleared, but also when the traffic state had returned to some typical behaviour. This ensured that our predictions reflected when commuters could expect normal traffic conditions to resume if an incident had a significant impact even after it was cleared. Secondly, we considered a range of models, some based on classic survival analysis and others on machine learning, and assessed how they performed on our dataset in both a static and a dynamic setting. In particular, we took inspiration from work in the domain of healthcare and applied emerging methods used there to problems in traffic incident analysis. We saw that in a static setting there was little to choose between the models, but generally either a neural network or a random survival forest method would be preferred to the others considered.
We moved into the dynamic setting by utilising landmarking and sliding window neural networks that applied temporal convolutions to the time-series, both of which were inspired by success in healthcare applications. We saw that non-parametric methods making no distributional assumptions were preferred to methods that parametrized mixture distributions, and saw some benefit in applying kernel smoothing to enforce some minimal structure on a non-parametric output. Utilizing models with non-parametric distributional forms was influenced by the increasingly complex distributions being considered for this problem in the traffic literature, and showed significant promise in our results.

We assessed how each model performed using three different scoring criteria, and in the dynamic sense we saw clear structure in the results, with the kernel-smoothed neural network model achieving an optimal C-index, the Brier score depending on prediction time and horizon as to whether the preferred model was the random survival forest or the kernel
smoothed neural network, and finally saw that the neural network models showed much better performance in terms of point-wise error than the comparison models. We saw that all score criteria were improved by engineering features from windows of sensor data through temporal convolutions, compared to feeding in local levels and gradients, suggesting future work should continue to explore methods of deriving features from sensor data rather than inputting raw values. Afterwards, we considered variable importance for our models in the dynamic sense, assessing how the random survival forest and neural network models were influenced by the derived features. Whilst we are aware of variable importance having been studied in the traffic literature previously, we are not aware of SHAP values being applied to neural network models for incident duration analysis. Time of day and location were generally important across models, particularly at long horizons, and the time-series features were shown to have significant impact on the neural network output at 5 and 60 minute horizons.
It is likely these will have some predictive power for the duration of an incident, and considering how best to incorporate them is an interesting problem. Finally, from our analysis in section 5, we suspect that if one were able to derive robust, complex features from the time-series and feed them into a random survival forest model, we might see further improvements. One way to do this may be through an auto-encoder framework, in which we pre-train a model to determine hidden representations of the series; however, it is unclear whether these would offer the same predictive power as we have observed from the time-series here.
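The kernel-smoothing construction mentioned above, in which the network's discrete output mass weights kernels placed on a time grid to yield a continuous density, can be sketched as follows, assuming Gaussian kernels and an illustrative bandwidth:

```python
from math import exp, pi, sqrt

def smoothed_density(t, grid, weights, bandwidth):
    """f(t) = sum_k w_k * N(t; t_k, bandwidth^2), with the w_k summing to 1."""
    norm = 1.0 / (bandwidth * sqrt(2.0 * pi))
    return sum(w * norm * exp(-0.5 * ((t - tk) / bandwidth) ** 2)
               for tk, w in zip(grid, weights))

grid = [15, 30, 60, 120]          # grid times (minutes)
weights = [0.2, 0.4, 0.3, 0.1]    # network output mass per grid point
print(round(smoothed_density(30.0, grid, weights, bandwidth=10.0), 4))  # -> 0.0187
```

The bandwidth controls how much structure the smoothed density retains; selecting it automatically is the further work suggested above.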
Funding
This work was supported by the EPSRC (grant number EP/L015374/1).
Acknowledgments
We thank Dr. Steve Hilditch, Thales UK for sharing expertise on NTIS and UK transportation systems.
A Distribution Information

In Table 6 we summarize the various probability distributions discussed throughout this work.
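As a sanity check on the log-normal entry of Table 6, the numerical derivative of the closed-form CDF should match the PDF; a standard-library sketch with illustrative parameter values:

```python
from math import erf, exp, log, pi, sqrt

def lognormal_pdf(x, mu, sigma):
    return exp(-((log(x) - mu) ** 2) / (2.0 * sigma ** 2)) / (x * sigma * sqrt(2.0 * pi))

def lognormal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + erf((log(x) - mu) / (sigma * sqrt(2.0))))

x, mu, sigma, h = 2.0, 0.5, 0.75, 1e-6
numeric = (lognormal_cdf(x + h, mu, sigma) - lognormal_cdf(x - h, mu, sigma)) / (2 * h)
print(abs(numeric - lognormal_pdf(x, mu, sigma)) < 1e-6)  # -> True
```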
B Incident Visualization
We saw in the main text that some structure clearly existed in the time-series. One could question what features we might aim to discover in the series, and how varied their behaviour is during incidents. To aid in this, we visualize a number of windows containing incidents, plotting the speed and showing some representative incident series in Figure 8. From
Figure 8, we see that different incidents show quite different signatures in the speed time-series. Figures 8A, 8B and 8C show short-lived but large drops in speed associated with incidents, each at a different time of day. The speed profile is reasonably symmetric in time, and the traffic state starts to recover soon after reaching its minimal speed. Figures 8D, 8E and 8F, on the other hand, show very asymmetric incidents. In Figures 8D and 8E there is quite a sudden drop in speed, but a comparatively slow recovery. In particular, Figure 8E shows two periods where the speed recovers somewhat but then stalls at particular values, before later recovering again. Figure 8F shows the opposite case: a slow fall in speed to 85 km/hr and then a comparatively rapid recovery. Finally, Figures 8G, 8H and 8I show extreme incidents, with a long period of very low speed. In the case of Figure 8H, the speed recovers somewhat, to 40 km/hr, but remains around this value for an extended period of time, whereas in Figures 8G and 8I the speed remains around 20 km/hr for a significant period.
Table 6: Summary of distributions considered throughout this work. All distributions are defined for positive x. Log-normal: can be seen as assuming the logarithm of x is normally distributed; erf(x) is the error function. Generalized F: B̃_z(d₁, d₂) represents the regularized incomplete beta function; if B(z; x, y) is the incomplete beta function and B(x, y) is the beta function, then B̃_z(d₁, d₂) = B(z; d₁, d₂) / B(d₁, d₂). Mixture: each mixture component has some PDF f̃ᵢ(x; θᵢ) and CDF F̃ᵢ(x; θᵢ) that depend on a parameter set θᵢ, and a component weight πᵢ; the weighted sum of all components generates the final distribution.

Log-normal: μ ∈ (−∞, ∞), σ ∈ (0, ∞). PDF: (1/(xσ√(2π))) exp(−(log(x) − μ)²/(2σ²)). CDF: ½(1 + erf((log(x) − μ)/(σ√2))).

Log-logistic: α ∈ (0, ∞), β ∈ (0, ∞). PDF: (β/α)(x/α)^(β−1) / (1 + (x/α)^β)². CDF: 1/(1 + (x/α)^(−β)).

Weibull: λ ∈ (0, ∞), k ∈ (0, ∞). PDF: (k/λ)(x/λ)^(k−1) e^(−(x/λ)^k). CDF: 1 − e^(−(x/λ)^k).

Generalized F: μ ∈ (0, ∞), σ ∈ (0, ∞), d₁ ∈ (0, ∞), d₂ ∈ (0, ∞). Let w = e^((log(x) − μ)/σ); then PDF: (d₁/d₂)^(d₁) w^(d₁) (1 + d₁w/d₂)^(−(d₁+d₂)) / (xσ B(d₁, d₂)). CDF: B̃_{d₁w/(d₁w+d₂)}(d₁, d₂).

Mixture: θᵢ, πᵢ ∀ i ∈ {1, …, k}. PDF: Σᵢ₌₁ᵏ πᵢ f̃ᵢ(x; θᵢ). CDF: Σᵢ₌₁ᵏ πᵢ F̃ᵢ(x; θᵢ).

C Exploratory Analysis
C.1 Incident Duration Analysis
An initial exploratory analysis of our data reveals a number of properties. As discussed, an initial step for many works in this field is to take all of the durations and determine if they are well modelled by a particular statistical distribution. Our review of the literature suggested fitting log-normal, log-logistic, Weibull and generalized F distributions, so we do exactly this and show our results in Figure 9. Specifically, Figure 9A shows the probability density functions (PDFs) compared to the data and Figure 9B shows a quantile-quantile plot. An overview of each distribution considered is given in the supplementary material. From Figure 9A, we see that the Weibull PDF follows the data most closely, but inspecting the quantiles we see only the generalized F distribution gives reasonable estimates of the extreme values. When we apply a Kolmogorov-Smirnov test with estimated parameters Babu and Rao [2004], we see no significant evidence that any of these candidate distributions are statistically valid fits. Note that in Smith and Smith [2001], a chi-squared test was applied to determine the statistical significance of distributional fits; however, doing so requires binning the data, which we avoid here. These results suggest that to properly describe the data, we may want to consider more complex distributional representations, either non-parametric or mixtures of many components. Note also that the tail of the durations contains around 6 outliers, corresponding to very rare incidents where the link was at a reduced speed for almost an entire day; the bulk of the incidents have durations of less than 300 minutes.
C.2 Incident Clustering Analysis
We saw throughout the literature that time-invariant features were sometimes clustered to reveal different sub-populations of incidents, and provided useful features in modelling. This was particularly evident in Weng et al. [2015] and Araghi et al. [2014], for example. We ask whether these clusters are also reflected in the time-series for the data. To do so, we require a distance metric between time-series. A common choice is dynamic time warping (DTW), discussed in Berndt and Clifford [1994] and used in, for example, Izakian et al. [2015]. We specifically use the implementation in Montero and Vilar [2014]. The idea is to stretch or squeeze a pair of time-series such that they are as similar as possible. If one time-series has the same shape as another, it will have a low DTW distance, whereas if two are fundamentally different, they will have a large distance. We first standardize all link data by its mean and standard deviation, then compute the DTW distance between all pairs of series that represent a window where an incident occurred. From this, we obtain a distance matrix, and we use hierarchical agglomerative clustering (HAC) with a Ward linkage function Ward Jr. [1963] to construct a dendrogram to visualize distances between elements. The results are shown in Figure 10, where Figure 10A clusters the data by their time-invariant features and Figure 10B clusters the data by their time-series. From Figure 10, there is clear structure in the distance matrix, suggesting that incidents cluster both by time-invariant features and by the observed traffic metrics on the links. The number of clusters is somewhat subjective, but if we inspect the
Figure 8: Various example speed time-series from some incidents in the dataset. (Each of the nine panels A-I plots speed in km/hr against time of day; axis tick labels omitted.)
Figure 9: Analysis of various distributions fit to incident duration data across the M25. Overall, the Weibull distribution fits the bulk of the data most closely, but the heavy tail is best captured by the generalized F distribution. (A): Comparison of probability density functions (normalised frequency against duration in minutes). (B): Quantile-quantile plot (empirical against theoretical quantiles).
Figure 10: Clustering dendrograms using all incidents in the NTIS data. We see clear structure, perhaps suggesting 4 distinct groups exist. (A): Clustering of incidents using the time-invariant features. (B): Clustering of incidents using the speed time-series.

average within cluster sum of squared distances, we see that the reduction in this begins to diminish after around 6 clusters are identified for both clustering instances. Examples of different series observed in incident windows are given in the supplementary material, where we observe differences in the severity, symmetry, time-scales and stability of speed behaviour during incidents, and more details on the clustering are discussed.
C.3 Clustering Details
Whilst we saw that the dataset appeared to cluster on both the time-series and the time-invariant features, we offer more explanation of the clustering here. When clustering the data by their time-invariant features, we use the 'daisy' dissimilarity measure Kaufman and Rousseeuw [1990], as it is able to handle features of varying types. To do so, it uses the Gower dissimilarity coefficient Gower [1971]. Inspecting the average within cluster sum of squares distance values on both clustering results, we construct a so-called 'elbow plot', shown in Figure 11. From Figure 11 we see diminishing reduction in the within cluster sum of squared distances after around 6 clusters in each case, suggesting that one should not segment the data further than this.
D Models Summary

In Table 7, we summarize the models considered and note some key points one should consider when comparing them. They have been chosen to reflect existing work in the transportation literature and recent advancements in other disciplines.
E Hyper Parameter Grids
For each of our deep learning models, we consider a range of hyper-parameters and apply 100 iterations of random search to determine optimal choices. In Table 8 we detail the possible choices for various hyper-parameters. For all fitting, we apply early stopping based on the loss on a holdout set, and we train using a batch size of 128. The learning rate for all models was decayed when learning stalled, and the Adam optimizer was used. Models are implemented in Keras.
Figure 11: Plots comparing the within cluster sum of squared distances when clustering on fixed and time-series features. (Each panel plots the average within-cluster sum of squares against the number of clusters.) The distance measures used are very different, so the numerical values on the y-axis should not be compared. (A): Time-invariant features. (B): Time-series features.
F Joint Models
Here we discuss the main alternative for dynamic predictions in survival analysis found in the literature: joint models. A more detailed discussion for interested readers can be found in Rizopoulos [2012]. The primary idea behind joint modelling is to couple two models: a linear mixed effects model for the time-series available, and a survival model for the incident times. A general linear mixed effects model is written as:

y_i = x̃_i′ β̃ + z_i′ b_i + ε_i,    b_i ∼ N(0, D),    ε_i ∼ N(0, σ² I_{n_i}).    (22)

Here, β̃ describes the average longitudinal evolution through time of the population (fixed effects) and b_i represents the deviation of a particular individual i from the population average (random effects). The random effect regression coefficients are assumed to be Gaussian with covariance matrix D, and the error terms are assumed to be Gaussian with some fixed variance σ² across the population. Finally, x̃_i and z_i are covariates of interest, repeatedly measured through time.

The survival model used in joint models is commonly a Cox model, meaning the full specification of the 'basic' joint model can be written as:

h_i(t | M_i(t), w_i) = h_0(t) exp(w_i′ β + α m_i(t)),
y_i(t) = m_i(t) + ε_i(t),
m_i(t) = x̃_i′(t) β̃ + z_i′(t) b_i,
b_i ∼ N(0, D),    ε_i ∼ N(0, σ²),    (23)

where M_i(t) represents the history of the longitudinal process up to time t, and the covariate effects in the Cox model are augmented by the term α m_i(t), representing the influence of the longitudinal data. Given Eq. (23), we can incorporate the impact of time-varying covariates and uncertainty into the survival model; however, we are really interested in dynamic predictions: determining the probability of surviving to a particular survival time u > t, conditioned on surviving up to t.
It is detailed in Rizopoulos [2012] how to do this, requiring Monte Carlo sampling to integrate over all possible random effect values.

We do not consider such a model for comparison in this work for a number of reasons. Firstly, particularly in extreme accident scenarios, the behaviour of traffic metrics is likely highly non-linear as a function of the known covariates. Secondly, we would be required to specify the model for the multivariate time-series, the survival outcome and how to link the two models when applying this, leaving significant room for model misspecification to accumulate without
Table 7: Summary of models considered with key points and origins highlighted.

Model | Distributional Assumptions | Complexity | Notes
AFT (LN) | Log-normal | Linear | Accelerated Failure Time model assuming a log-normal duration distribution, used in previous traffic studies, for example Chung and Yoon [2012].
AFT (W) | Weibull | Linear | Accelerated Failure Time model assuming a Weibull duration distribution, used in previous traffic studies, for example Nam and Mannering [2000] and Kaabi et al. [2012].
Cox | Non-parametric | Linear | Cox regression model; the baseline hazard is non-parametric and covariate effects act as an exponential of a linear regression. Used in previous traffic studies such as Oralhan and Göktolga [2018] and Wu et al. [2013].
RSF | Non-parametric | Non-linear | Random survival forest model, using an ensemble of 1000 trees. Used in various other applications such as Dietrich et al. [2016] and Miao et al. [2015].
NN (NP) | Non-parametric | Non-linear | Neural network model with a non-parametric distribution modelled by a softmax output layer. Used in healthcare applications in Lee et al. [2018].
NN (LN) | Mixture of log-normals | Non-linear | Neural network model parametrizing a mixture distribution with log-normal components. A mixture is used due to work in Zou et al. [2016] suggesting these may offer predictive power for traffic incident datasets.
NN (Kernel) | Non-parametric with smoothing | Non-linear | Neural network model with a kernel smoothed output distribution, compromising between the strong assumption of NN (LN) and the complete freedom of NN (NP).
SW (NP) | Non-parametric | Non-linear | Neural network model with a sliding window CNN architecture to capture time-series features. Output is non-parametric as in NN (NP). Used in healthcare applications in Jarrett et al. [2020].
SW (LN) | Mixture of log-normals | Non-linear | Neural network model as in SW (NP), with an output layer specifying a mixture of log-normals as in NN (LN). Combines ideas in Jarrett et al. [2020] with the mixture observations in Zou et al. [2016].
SW (Kernel) | Non-parametric with smoothing | Non-linear | Neural network model as in SW (NP), with an output layer specifying a kernel smoothed distribution, compromising between the strong assumption of SW (LN) and the complete freedom of SW (NP).

exhaustive searches. Doing so without significant prior work to base our assumptions on is not desirable. Finally, the computational complexity is prohibitive for a dataset of the size we are considering. Further discussions on the challenges of joint models are given in Hickey et al. [2016].
G Comparison of Temporal Convolutions to Manually Engineered Time-Series Values

In Table 9 we show the C-index and Brier score comparing a sliding window model that engineers time-series features through temporal convolutions to a model that simply takes the most recent information (the level and gradient). The latter model is denoted 'Raw (NP)' here, as it does not derive features itself from the series and has a non-parametric output. The same two models are compared in Table 10, showing instead the MAPE at various percentiles into an incident.
H SHAP Discussion
Shapley values arise from the theory of coalitional games, discussed at length in Shoham and Leyton-Brown [2008], which states that there is a unique way to divide the pay-off between a set of players whilst satisfying reasonable constraints. In the machine learning context, the 'game' becomes the model output, and the 'players' are the features.
Table 8: Hyper parameters considered when fitting neural network models.

Hyper Parameter | Values Considered
Dropout | 0.5
Kernel size | {5, 10}
L1, L2 regularization | −
Learning rate | {−, −}
Number of dense layers | {1, 2, 3}
Number of neurons per dense layer | {32, 64, 128, 256}
Number of mixtures | {1, 2, 3}
Number of CNN layers | {1, 2, 3}
Number of filters per layer | {4, 8, 16}
Smoothing bandwidth | 3 (minutes)
Time-series window size | {30, 60} (minutes)
η, σ | {0.1, 1.0}

Table 9: Results comparing a sliding window model and a model where we input the most recent time-series values and the gradient computed as previously discussed. The sliding window model achieves better C-index and Brier values across all considered prediction time and horizon pairs.

Prediction Time (minutes) | Metric | Model | Prediction Horizon (minutes): 5, 15, 30, 45, 60, 120, 180, 240 | Mean Over Horizons
t = 0 | C-Index | Raw (NP) | -, 0.755, 0.701, 0.669, 0.646, 0.599, 0.592, 0.588 | 0.650
t = 0 | Brier | Raw (NP) | 0.015, 0.122, 0.208, 0.260, 0.306, 0.310, 0.203, 0.116 | 0.193
t = 15 | C-Index | Raw (NP) | 0.946, 0.849, 0.773, 0.729, 0.689, 0.637, 0.624, 0.620 | 0.733
t = 15 | Brier | Raw (NP) | 0.022, 0.091, 0.163, 0.210, 0.257, 0.272, 0.175, 0.098 | 0.161
t = 32 | C-Index | Raw (NP) | 0.948, 0.870, 0.773, 0.730, 0.702, 0.655, 0.645, 0.641 | 0.746
t = 32 | Brier | Raw (NP) | 0.021, 0.083, 0.158, 0.203, 0.241, 0.248, 0.160, 0.090 | 0.151
t = 45 | C-Index | Raw (NP) | 0.953, 0.851, 0.777, 0.748, 0.715, 0.667, 0.654, 0.651 | 0.752
t = 45 | Brier | Raw (NP) | 0.021, 0.087, 0.158, 0.196, 0.234, 0.230, 0.152, 0.090 | 0.146
t = 60 | C-Index | Raw (NP) | 0.938, 0.870, 0.799, 0.763, 0.729, 0.679, 0.667, 0.665 | 0.764
t = 60 | Brier | Raw (NP) | 0.025, 0.087, 0.153, 0.192, 0.228, 0.221, 0.144, 0.091 | 0.143
t = 120 | C-Index | Raw (NP) | 0.958, 0.847, 0.795, 0.773, 0.745, 0.697, 0.692, 0.689 | 0.775
t = 120 | Brier | Raw (NP) | 0.026, 0.101, 0.159, 0.180, 0.207, 0.194, 0.135, 0.091 | 0.137
Table 10: Results comparing a sliding window model and a model where we input the most recent time-series values and the gradient computed as previously discussed. The sliding window model achieves better MAPE at the 30th and 50th percentiles, but marginally worse at the 70th and 90th percentiles. (Columns: MAPE at the 30th, 50th, 70th and 90th percentiles into an incident at which the dynamic prediction is made; rows: SW (NP) and Raw (NP).)
To explain a complex model, in our case a neural network, one writes a simpler 'explainer model' of the form:

g(z′) = φ₀ + Σᵢ₌₁ᴹ φᵢ z′ᵢ.    (24)
Here, φ₀ represents the baseline model output, and then each feature i shifts this output up or down until a final value is reached; we have M features in total. This output can be measured for each neuron in the output layer. Recall from the main text that one computes the SHAP value for feature i as:

φᵢ = Σ_{S ⊆ M∖{i}} [ |S|! (|M| − |S| − 1)! / |M|! ] [ F(S ∪ {i}) − F(S) ],    (25)

where M is the set of all features. We can intuitively consider how Eq. (25) arises as follows. Imagine at first we have an empty feature vector. Then, at random, we start adding features to it. Whenever we happen to add feature i, the change in output will be:

F(S ∪ {i}) − F(S).    (26)

Now consider, at the point we add i into S, how many ways S could have formed into its current state: there are |S|! ways this could have happened. Further, there are (|M| − |S| − 1)! ways that the remaining features could be added after i has been. This suggests for a single ordering we have:

|S|! (|M| − |S| − 1)! [ F(S ∪ {i}) − F(S) ].    (27)

However, this is just for a single S, so we then sum over all possible sets S and average by dividing by |M|!, the total number of possible orderings of players. Combining all of this, we obtain Eq. (25).

However, there is a clear problem in that neural networks cannot take arbitrary missing features as an input, so using Eq. (25) alone one cannot compute the SHAP values for such models. To address this, one considers some 'reference' or 'background' dataset, from which values are taken as replacements when considering alterations of feature vectors. Such an approach is discussed in Lundberg and Lee [2017]. The idea is that, since we cannot omit a feature fully, we instead set its value to a reasonable reference value. The appropriate choice of background is an open problem in applying these methods.
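As a concrete check of Eq. (25), the combinatorial sum can be evaluated exactly for a toy model with a handful of features. Here a missing feature is simply replaced by a zero reference value, a crude stand-in for the background-dataset approach; the model and evaluation point are hypothetical.

```python
import itertools
import math

def shapley_values(F, n_features):
    # Exact Shapley values via Eq. (25): for each feature i, sum
    # |S|! (M - |S| - 1)! / M! * (F(S u {i}) - F(S)) over subsets S
    # of the remaining features.
    M = n_features
    phis = []
    for i in range(M):
        others = [j for j in range(M) if j != i]
        phi = 0.0
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                weight = (math.factorial(r) * math.factorial(M - r - 1)
                          / math.factorial(M))
                phi += weight * (F(set(S) | {i}) - F(set(S)))
        phis.append(phi)
    return phis

# Toy 'model' evaluated at the point x = (1, 1, 1): output 2*x0 + x1,
# with a missing feature replaced by the reference value 0.
x = (1.0, 1.0, 1.0)
F = lambda S: 2.0 * (x[0] if 0 in S else 0.0) + (x[1] if 1 in S else 0.0)
vals = shapley_values(F, 3)
print(vals)  # feature 2 never changes the output, so its value is 0.0
```

For an additive model like this one, each feature's Shapley value is exactly its individual contribution, and the values sum to F of the full set minus the baseline, the efficiency property the explainer model in Eq. (24) relies on.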
For image tasks, it might be clear that one can use a blank image as a reference, and then we are evaluating a pixel's importance relative to it being blank. However, in many domains such a reasonable and intuitive reference value does not exist. Instead, it is common to provide a dataset with many records as the background and average over it, which is the approach we take in our work. Due to computational constraints, we set the background dataset to 10000 random instances from the training dataset, and compute the SHAP values for 1000 randomly selected points in the unseen data at specific prediction times.

A second complication of note is that there is often structure to feature vectors, and elements are not simply a random collection of possible values for each entry. One can see this structure most clearly with a one-hot encoded example. Suppose our feature vector contained three binary values indicating morning, afternoon or evening, corresponding to the time of day being discretised and then one-hot encoded. In the raw methodology for SHAP in Lundberg and Lee [2017], one would go through each feature, consider setting it to a reference value, and use the change in model output as a measure of importance. However, suppose an entry occurred during the morning, meaning the relevant subset of our feature vector was [1, 0, 0]. We may then consider altering the entry corresponding to the binary label for afternoon, leading us to consider a feature vector of [1, 1, 0]. Of course, such an input is not practically possible, as a data-point can only come from one time of day, and so the model output for it is irrelevant to our problem. To avoid this, recent work in Lundberg [2017] incorporates the structure in the data before performing any perturbations.
A partition of the data into highly correlated components is performed, and then when considering feature importance, instead of altering a single element, a set of elements is altered together if they are grouped. This is a natural requirement for our problem, where locations, time of day and incident types have been one-hot encoded. We do however note that we computed the SHAP values both incorporating this structure and neglecting it, and see similar results in both cases.

I SHAP Values for Categorical Features
In the main text, we considered SHAP values as a measure of variable importance and gave example interpretations of why the neural network model was outputting particular predictions. Here we plot further examples of what the categorical values contribute overall to the model. In Figure 12, we show the results for the time of day variable. Generally, we see that the time of day variable attaining the value 'afternoon' pushes down predictions at short horizons (5 and 30 minutes, Figures 12A and 12B). At a horizon of 60 minutes (Figure 12C), we see that sometimes the model output is decreased and sometimes it is increased, and at a horizon of 180 minutes (Figure 12D) every instance had the model output increased. From this we might be led to believe that incidents occurring in the afternoon can spill over into the evening rush hour and last a significant amount of time. The same analysis can be done for alternative times of day; for example, we see that incidents in the night have the output increased at horizons of 5, 30 and 60 minutes (Figures 12A, 12B, 12C) and decreased at a horizon of 180 minutes (Figure 12D).
Figure 12: SHAP values for the time of day feature. In each plot, we have summed the SHAP values for each one-hot encoded value, and then only plotted the resulting value in the row corresponding to each data-point's true feature value. This shows the overall impact of the time of day feature, and allows one to view this impact separately for each value it attains. (A): h = 5, (B): h = 30, (C): h = 60, (D): h = 180.
Figure 13 . We first note that the NTIS incidentsdata is dominated by abnormal traffic incidents, which is clearly seen by the random sample of 1000 incidents forexplanation containing a large number of abnormal traffic cases. When we inspect the data, the abnormal traffic casesare quite heavy tailed, suggesting either they are not always turned off by the operator in a timely manner, or thereare some significant drops in link performance that are not directly attributed to a physical incident on the link. Wegenerally see that general obstructions increases the model output at horizons of 30 minutes (
Figure 13B ) but decreasesit after this. Abnormal traffic incidents generally increase the model output at higher horizons and decrease it at lowerones, however the time-series and other information could then account for some of the short-lived abnormal trafficincidents we observe, or indeed some of the heavy tailed aspects we see.Finally, we can also inspect the seasons impact on the model, with results shown in
Figure 14 . It is first important tonote that we only have one year of data, so the season variable here could actually capture some long term systematicchange in response time or incident severity. However, we are not told to expect this from industry experts. Whenwe consider incidents in the winter, we see the output is mixed at a 5 minute horizon (
Figure 14A ), increased at ahorizon of 30 minutes (
Figure 14B ) where as it is decreased at the later horizons. The effect of spring appears to besystematically smaller in size than the other seasons. 31
Figure 13: SHAP values for the incident type feature. In each plot, we have summed the SHAP values for each one-hot encoded value, and then only plotted the resulting value in the row corresponding to each data-point's true feature value. This shows the overall impact of the incident type feature, and allows one to view this impact separately for each value it attains. (A): h = 5, (B): h = 30, (C): h = 60, (D): h = 180.

J AUROC Validation
Whilst we have looked at the C-index and Brier score for different prediction time and prediction horizon pairs, as was done in Lee et al. [2020], we can equally look at the area under the receiver operator curve (AUROC) as a performance metric, as was done in Jarrett et al. [2020]. We show results for this, at particular horizons, averaged over all input windows, in Table 11, computed using the methods discussed in Pölsterl [2020] and Lambert and Chevret [2016].
K MAPE as a Function of Time
We have looked at the error at different percentiles into an incident because of the national criteria set by Highways England for incident duration modelling. However, we note that some other papers consider the error at different minutes into an incident. These are clearly two different methods of aggregation. The first averages across different events, taking predictions from potentially very different time points, but that represent some fixed fraction of the entire incident. The second averages across the same prediction time for every incident, but disregards the fact that this might be a very short time relative to the total duration of some events, and a very long time relative to others. We present results for the second averaging here to give a clear comparison and show our results align generally with the expected behaviour of other dynamic prediction works. The absolute percentage error (APE), computed as a function of minutes into an incident for some of the models considered, is given in Figure 15 for events longer than 60 minutes. This behaviour is generally consistent with that seen in Figure 2 in Li et al. [2015].
Figure 14: SHAP values for the season feature. In each plot, we have summed the SHAP values for each one-hot encoded value, and then only plotted the resulting value in the row corresponding to each data-point's true feature value. This shows the overall impact of the season feature, and allows one to view this impact separately for each value it attains. (A): h = 5, (B): h = 30, (C): h = 60, (D): h = 180.
Figure 15: APE (both mean and median shown) for some dynamic models (Cox, RSF and SW Kernel) as a function of the minutes into an incident. We show 30 minutes onwards, which is the point from which only incident-related information is being fed into the sliding window model, and the landmarking models are no longer fit with minor events, as these have already ended.
Table 11: AUROC values for the dynamic models. Larger values indicate a better model. We see that similar conclusions hold here to those drawn when inspecting the time-varying C-index values.

Model | Prediction Horizon (minutes): 15, 30, 45, 60, 120, 180, 240 | Mean Over Horizons
Cox | 0.693, 0.716, 0.655, 0.615, 0.558, 0.557, 0.540 | 0.619
RSF | 0.805, 0.745, 0.692, 0.653, 0.582, 0.585, - | -
SW (Raw) | 0.877, 0.809, 0.764, 0.712, 0.613, 0.571, 0.544 | 0.698
References
N. Peluffo. Strategic road network statistics. Technical report, United Kingdom Department for Transport, January2015. URL .Highways England. Strategic road network initial report. Technical report, December 2017. URL .R. Li, F. C. Pereira, and M. E. Ben-Akiva. Overview of traffic incident duration analysis and prediction.
EuropeanTransport Research Review , 10(2):22, May 2018.A. T. Hojati, L. Ferreira, S. Washington, P. Charles, and A. Shobeirinejad. Modelling total duration of traffic incidentsincluding incident detection and recovery time.
Accident Analysis & Prevention , 71:296–305, October 2014.T. F. Golob, W. W. Recker, and J. D. Leonard. An analysis of the severity and incident duration of truck-involvedfreeway accidents.
Accident Analysis & Prevention , 19(5):375–395, October 1987.Y. Chung and B-J. Yoon. Analytical method to estimate accident duration using archived speed profile and its statisticalanalysis.
KSCE Journal of Civil Engineering , 16(6):1064–1070, September 2012.H. Zhang and A. J. Khattak. Analysis of cascading incident event durations on urban freeways.
Transportation ResearchRecord , 2178(1):30–39, January 2010.W. Junhua, C. Haozhe, and Q. Shi. Estimating freeway incident duration using accelerated failure time modeling.
SafetyScience , 54:43–50, April 2013.A. A. Kaabi, D. Dissanayake, and R. Bird. Response time of highway traffic accidents in abu dhabi: Investigation withhazard-based duration models.
Transportation Research Record , 2278(1):95–103, January 2012.I. Ghosh, P. T. Savolainen, and T. J. Gates. Examination of factors affecting freeway incident clearance times: acomparison of the generalized F model and several alternative nested models.
Journal of Advanced Transportation ,48(6):471–485, April 2014.Y. Zou, K. Henrickson, D. Lord, Y. Wang, and K. Xu. Application of finite mixture models for analysing freewayincident clearance time.
Transportmetrica A: Transport Science , 12(2):99–115, January 2016.K. W. Smith and B. L. Smith. Forecasting the clearance time of freeway accidents. Technical report, September 2001.C. Lee, W. R. Zame, J. Yoon, and M. van der Schaar. Deephit: A deep learning approach to survival analysis withcompeting risks. In
AAAI , pages 2314–2321, Palo Alto, California USA, February 2018. Association for theAdvancement of Artificial Intelligence, AAAI Press.D. Nam and F. Mannering. An exploratory hazard-based analysis of highway incident duration.
Transportation Research Part A: Policy and Practice, 34(2):85–102, February 2000.
Y. Chung, L. F. Walubita, and K. Choi. Modeling accident duration and its mitigation strategies on South Korean freeway systems. Transportation Research Record, 2178(1):49–57, January 2010.
Y-S. Chung, Y-C. Chiou, and C-H. Lin. Simultaneous equation modeling of freeway accident duration and lanes blocked. Analytic Methods in Accident Research, 7:16–28, July 2015.
D. Zeng and D. Y. Lin. Efficient estimation for the accelerated failure time model. Journal of the American Statistical Association, 102(480):1387–1396, January 2007.
D. R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society. Series B (Methodological), 34(2):187–220, January 1972.
B. Oralhan and Z. Göktolga. Determination of the risk factors that influence occurrence time of traffic accidents with survival analysis. Iranian Journal of Public Health, 47(8):1181–1191, August 2018.
J. Wu, R. Subramanian, M. Craig, M. Starnes, and A. Longthorne. The effect of earlier or automatic collision notification on traffic mortality by survival analysis. Traffic Injury Prevention, 14(sup1):S50–S57, August 2013.
R. Li and M. Guo. Competing risks analysis on traffic accident duration time. Journal of Advanced Transportation, 49(3):402–415, June 2014.
A. J. Khattak, J. L. Schofer, and M-H. Wang. A simple time sequential procedure for predicting freeway incident duration. IVHS Journal, 2(2):113–138, 1995.
K. Kalair, C. Connaughton, and P. A. Di Loro. A non-parametric Hawkes process model of primary and secondary accidents on a UK smart motorway. Journal of the Royal Statistical Society: Series C (Applied Statistics), 70(1):80–97, January 2021.
A. Garib, A. E. Radwan, and H. Al-Deek. Estimating magnitude and duration of incident delays. Journal of Transportation Engineering, 123(6):459–466, November 1997.
J. Weng, W. Qiao, X. Qu, and X. Yan. Cluster-based lognormal distribution model for accident duration. Transportmetrica A: Transport Science, 11(4):345–363, January 2015.
C. Ding, X. Ma, Y. Wang, and Y. Wang. Exploring the influential factors in incident clearance time: Disentangling causation from self-selection bias. Accident Analysis & Prevention, 85:58–65, December 2015.
A. J. Khattak, J. Liu, B. Wali, X. Li, and M. Ng. Modeling traffic incident duration using quantile regression. Transportation Research Record, 2554(1):139–148, January 2016.
C. Zhan, A. Gan, and M. Hadi. Prediction of lane clearance time of freeway incidents using the M5P tree algorithm. IEEE Transactions on Intelligent Transportation Systems, 12(4):1549–1557, December 2011.
H. Ishwaran, U. B. Kogalur, E. H. Blackstone, and M. S. Lauer. Random survival forests. Annals of Applied Statistics, 2(3):841–860, September 2008.
W. Wang, H. Chen, and M. C. Bell. Vehicle breakdown duration modelling. Journal of Transportation and Statistics, 8(1):75–84, 2005.
C-H. Wei and Y. Lee. Applying data fusion techniques to traveler information services in highway network. Journal of the Eastern Asia Society for Transportation Studies, 6:2457–2472, 2005.
Y. Lee and C-H. Wei. A computerized feature selection method using genetic algorithms to forecast freeway accident duration times. Computer-Aided Civil and Infrastructure Engineering, 25(2):132–148, January 2010.
G. Valenti, M. Lelli, and D. Cucina. A comparative study of models for the incident duration prediction. European Transport Research Review, 2(2):103–111, June 2010.
J. L. Katzman, U. Shaham, A. Cloninger, J. Bates, T. Jiang, and Y. Kluger. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology, 18(1), February 2018.
D. Jarrett, J. Yoon, and M. van der Schaar. Dynamic prediction in clinical survival analysis using temporal convolutional networks.
IEEE Journal of Biomedical and Health Informatics, 24(2):424–436, February 2020.
C-H. Wei and Y. Lee. Sequential forecast of incident duration using artificial neural network models. Accident Analysis & Prevention, 39(5):944–954, September 2007.
R. Li, F. C. Pereira, and M. E. Ben-Akiva. Competing risk mixture model and text analysis for sequential incident duration prediction. Transportation Research Part C: Emerging Technologies, 54:74–85, May 2015.
B. Ghosh, M. T. Asif, J. Dauwels, U. Fastenrath, and H. Guo. Dynamic prediction of the incident duration using adaptive feature set. IEEE Transactions on Intelligent Transportation Systems, 20(11):4019–4031, 2019.
X. Li, J. Liu, A. Khattak, and S. Nambisan. Sequential prediction for large-scale traffic incident duration: Application and comparison of survival models. Transportation Research Record, 2674(1):79–93, 2020.
K. Kalair and C. Connaughton. Anomaly detection and classification in traffic flow data from fluctuations in the flow-density relationship, 2020.
R. B. Cleveland, W. S. Cleveland, and I. Terpenning. STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 6(1):3–73, March 1990.
D. Kumar and B. Klefsjö. Proportional hazards model: a review. Reliability Engineering & System Safety, 44(2):177–188, 1994.
D. G. Kleinbaum and M. Klein. Survival Analysis. Statistics for Biology and Health. Springer-Verlag, New York, 3rd edition, 2012.
T. M. Therneau. A Package for Survival Analysis in S, 2015. URL https://CRAN.R-project.org/package=survival. Version 2.38.
B. Efron. The efficiency of Cox's likelihood function for censored data. Journal of the American Statistical Association, 72(359):557–565, 1977.
C. Jackson. flexsurv: A platform for parametric survival modeling in R. Journal of Statistical Software, 70(8):1–33, 2016.
H. Ishwaran and U. B. Kogalur. Random survival forests for R. R News, 7(2):25–31, October 2007.
D. Faraggi and R. Simon. A neural network model for survival data. Statistics in Medicine, 14(1):73–82, January 1995.
M. Luck, T. Sylvain, H. Cardinal, A. Lodi, and Y. Bengio. Deep learning for patient-specific kidney graft survival analysis, 2017.
C. Lee. DeepHit. https://github.com/chl8856/DeepHit, 2020.
U. Dafni. Landmark analysis at the 25-year landmark point. Circulation: Cardiovascular Quality and Outcomes, 4(3):363–371, May 2011.
C. Lee, J. Yoon, and M. van der Schaar. Dynamic-DeepHit: A deep learning approach for dynamic survival analysis with competing risks based on longitudinal data. IEEE Transactions on Biomedical Engineering, 67(1):122–133, January 2020.
F. E. Harrell Jr, K. L. Lee, and D. B. Mark. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine, 15(4):361–387, 1996.
Y. Yuan. Prediction Performance of Survival Models. PhD thesis, Waterloo, Ontario, Canada, 2008.
Y. Qi and H. Teng. An information-based time sequential approach to online incident duration prediction. Journal of Intelligent Transportation Systems, 12(1):1–12, 2008.
IBI Group UK. Highways England's provision of information to road users. Technical report, Office of Rail and Road and Highways England, March 2019. URL .
H. C. van Houwelingen and H. Putter. Dynamic Prediction in Clinical Survival Analysis. Monographs on Statistics and Applied Probability 123. CRC Press, 1st edition, 2012.
M. T. Ribeiro, S. Singh, and C. Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1135–1144, New York, NY, USA, 2016. Association for Computing Machinery.
A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. Volume 70 of Proceedings of Machine Learning Research, pages 3145–3153, International Convention Centre, Sydney, Australia, August 2017. PMLR.
S. M. Lundberg and S. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., December 2017.
G. Jogesh Babu and C. R. Rao. Goodness-of-fit tests when parameters are estimated.
Sankhyā: The Indian Journal of Statistics (2003-2007), 66(1):63–74, February 2004.
B. N. Araghi, S. Hu, R. Krishnan, M. Bell, and W. Ochieng. A comparative study of k-NN and hazard-based models for incident duration prediction. In , pages 1608–1613, Qingdao, China, October 2014. IEEE Intelligent Transportation Systems Society, IEEE.
D. J. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, AAAIWS'94, pages 359–370, Seattle, WA, 1994. AAAI Press.
H. Izakian, W. Pedrycz, and I. Jamal. Fuzzy clustering of time series data using dynamic time warping distance. Engineering Applications of Artificial Intelligence, 39:235–244, March 2015.
P. Montero and J. A. Vilar. TSclust: An R package for time series clustering. Journal of Statistical Software, 62(1):1–43, 2014.
J. H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, April 1963.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data. Wiley Series in Probability and Statistics. Wiley, March 1990.
J. C. Gower. A general coefficient of similarity and some of its properties. Biometrics, 27(4):857–871, December 1971.
S. Dietrich, A. Floegel, M. Troll, T. Kühn, W. Rathmann, A. Peters, D. Sookthai, M. von Bergen, R. Kaaks, J. Adamski, C. Prehn, H. Boeing, M. B. Schulze, T. Illig, T. Pischon, S. Knüppel, R. Wang-Sattler, and D. Drogan. Random survival forest in practice: a method for modelling complex metabolomics data in time to event analysis. International Journal of Epidemiology, 45(5):1406–1420, October 2016.
F. Miao, Y. Cai, Y. Zhang, and C. Li. Is random survival forest an alternative to Cox proportional model on predicting cardiovascular disease? In , volume 45, pages 740–743, Springer, Cham, September 2015. Springer International Publishing.
D. Rizopoulos. Joint Models for Longitudinal and Time-to-Event Data. CRC Biostatistics Series. Chapman & Hall/CRC, 2012.
G. L. Hickey, P. Philipson, A. Jorgensen, and R. Kolamunnage-Dona. Joint modelling of time-to-event and multivariate longitudinal outcomes: recent developments and issues. BMC Medical Research Methodology, 16(1):117, September 2016.
Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, Cambridge, June 2008.
S. Lundberg. SHAP (SHapley Additive exPlanations). https://github.com/slundberg/shap, 2017. Accessed: 2019-12-14.
S. Pölsterl. scikit-survival: A library for time-to-event analysis built on top of scikit-learn. Journal of Machine Learning Research, 21(212):1–6, 2020.
J. Lambert and S. Chevret. Summary measure of discrimination in survival models based on cumulative/dynamic time-dependent ROC curves.