A Causal Inference Approach to Measure the Vulnerability of Urban Metro Systems

Abstract

Transit operators need vulnerability measures to understand the level of service degradation under disruptions. This paper contributes to the literature with a novel causal inference approach for estimating station-level vulnerability in metro systems. The empirical analysis is based on large-scale data on historical incidents and population-level passenger demand. This analysis thus obviates the need for assumptions made by previous studies on human behaviour and disruption scenarios. We develop four empirical vulnerability metrics based on the causal impact of disruptions on travel demand, average travel speed and passenger flow distribution. Specifically, the proposed metrics based on the irregularity in passenger flow distribution extend the scope of vulnerability measurement to the entire trip distribution, instead of just analysing the disruption impact on the entry or exit demand (that is, moments of the trip distribution). The unbiased estimates of disruption impact are obtained by adopting a propensity score matching method, which adjusts for the confounding biases caused by the non-random occurrence of disruptions. An application of the proposed framework to the London Underground indicates that the vulnerability of a metro station depends on the location, topology, and other characteristics. We find that, in 2013, central London stations are more vulnerable in terms of travel demand loss. However, the loss of average travel speed and irregularity in relative passenger flows reveal that passengers from outer London stations suffer from longer individual delays due to a lack of alternative routes.


A Causal Inference Approach to Measure the Vulnerability of Urban Metro Systems

Nan Zhang, Daniel J. Graham*, Daniel Hörcher, Prateek Bansal

Transport Strategy Centre, Department of Civil and Environmental Engineering, Imperial College London, London, UK

* Corresponding author

Abstract

Transit operators and passengers need vulnerability measures to understand the level of service degradation under disruptions. This paper contributes to the literature with a novel causal inference approach for estimating station-level vulnerability in metro systems. The empirical analysis is based on large-scale data on historical incidents and population-level passenger demand, and thus obviates the need for assumptions made by previous studies on human behaviour and disruption scenarios. We develop three empirical vulnerability metrics based on the causal impact of disruptions on travel demand and the average travel speed. The unbiased estimates of disruption impact are obtained by adopting a propensity score matching method, which adjusts for the confounding biases caused by the non-random occurrence of disruptions. An application of the proposed framework to the London Underground indicates that the vulnerability of a metro station depends on the location, topology, and other characteristics. We find that in 2013 central London stations are more vulnerable in terms of travel demand loss. However, the loss of average travel speed reveals that passengers from outer London stations suffer from longer individual delays due to a lack of alternative routes.

Key words: vulnerability, urban metro system, causal inference, propensity score matching

Introduction

Metros, also known as subways or rapid transit, have become a vital component of public transport. With the advantage of large capacity and high-frequency services, 178 metro systems worldwide carried a total of 53,768 million trips in 2017 (International Union of Public Transport, 2018). Incidents occur frequently in urban metro systems, mainly due to supply-side failures (e.g., signal failures), sudden increase in travel demand (e.g., public concert or football matches) and change in weather conditions (Brazil et al., 2017; Melo et al., 2011; Wan et al., 2015). These incidents can cause service delays and overcrowding, which in turn lead to safety concerns and potential losses in social welfare. For instance, the London Underground encountered 7973 service disrupting incidents of above 2 minutes duration between April 2016 and April 2017, causing a total loss of around 34 million customer hours (Transport for London, 2016-2017; Transport for London, 2019). The Singapore Mass Rapid Transit experienced 47 severe delays that lasted over 30 minutes between 2015 and 2017 (Land Transport Authority, 2017). Operators may consider investing in new technologies to improve metro facilities and mitigate the effect of incidents. For instance, the New York City Subway was in a state of emergency in June 2017 after a series of derailments, track fires and overcrowding incidents. The Metropolitan Transportation Authority invested over $8 billion to stabilise and modernise the incident-plagued metro system (Metropolitan Transportation Authority, 2019).

It is apparent that metros are willing to invest in their infrastructure systems, but it is often not known how those investments compare in achieving improvements.

To facilitate project selection, metros are increasingly relying on disaggregate performance metrics that reveal the most vulnerable parts of the network. Performance can be measured in various ways. Popular examples are risk, resilience, reliability and vulnerability related metrics. These concepts are often confused by researchers as well as practitioners. Interested readers can refer to Faturechi and Miller-Hooks (2015) to understand the most widely agreed relationship among these concepts. In this paper, we focus on the vulnerability of urban metro systems, where the service outputs of interest are passenger demand and the average travel speed. Since the 1990s, the concept of vulnerability has been widely used to characterise the performance of transport systems (Mattsson and Jenelius, 2015). Vulnerability is often defined as a measure of susceptibility of the transport system to incidents (Berdica, 2002; Jenelius et al., 2006). Vulnerability metrics may measure the consequences of disturbances, in the form of service outputs such as train kilometres, passenger volumes or the quality of travelling.

Such metrics have important implications in identifying weak stations or links in metro systems and efficiently allocating resources to the most affected areas. Vulnerability metrics can also be used to inform passengers about the general level of inconvenience at disrupted stations, which can help them in adjusting their travel plans to avoid potential delays or crowding. Given the rising interest in utilising vulnerability metrics in disruption prevention and management, obtaining a correct measure of such metrics is crucial.

Traditionally, vulnerability in urban metros is investigated based on complex network theory and graph theory. Complex network theory converts metro networks into graphs, which enables the quantitative measurement of vulnerability in metro systems (Chopra et al., 2016; Derrible and Kennedy, 2010; Yang et al., 2015). The adoption of graph theory has facilitated the evolution of vulnerability indicators from simply capturing the characteristics of network topology to also considering travel demand patterns and their land use dependencies (Jiang et al., 2018). However, most of these studies rely on simulation-based approaches to quantify vulnerability under hypothetical scenarios of disruptions. These simulation experiments are based on assumptions, both in terms of passenger behaviour and the type and scale of disruptions (Lu, 2018; Sun and Guan, 2016; Sun et al., 2015; Sun et al., 2018). With an empirical approach, such assumptions can be avoided, and thus more reliable metrics of vulnerability can be achieved using historical evidence. The empirical approach is rare but not unique in the literature. The exception we are aware of is Sun et al. (2016), who first detect incidents based on abnormal ridership and apply these real incident data to assess the vulnerability of the metro system. However, their method has some limitations. First, they assume the occurrence of incidents to be random, which is a strict and unrealistic assumption, as we demonstrate in this study. Also, abnormal ridership may not be a good indicator of incidents if the fluctuations in ridership are merely manifestations of changes in travel demand.

This paper proposes a novel alternative methodology to quantify vulnerability, by empirically estimating the causal impact of service disruptions on travel demand and average travel speed at the station level. The application of a propensity score matching method accounts for the non-randomness of disruptions and ensures unbiasedness of the causal estimates. We make this approach comprehensive for the entire network, including stations where incidents are not observed, by predicting the level of vulnerability at these stations with a random forest algorithm. In this way, we eliminate the need for ad hoc assumptions on passenger behaviour and the nature of disruptions. We use the London Underground as a case study and apply the methodology with large-scale automated ticketing and incident data. We find that among the most incident-affected stations, the gross travel speed of the trips initiated at Canary Wharf, Victoria and Liverpool Street decreases by 156%, 121% and 115%, respectively, relative to regular conditions. In terms of passenger demand, the highest reductions in station inflows and outflows are measured at Oxford Circus, Waterloo and London Bridge, which are in this sense the most vulnerable stations of the London Underground. In more isolated parts of the network, where alternative routes may not be available, stations may lose up to 98% of their demand due to incidents. These results have the potential to aid investment decisions of the metro operator.

The rest of the paper is organised as follows. Section 2 reviews the literature on vulnerability measurement and disruption impact analysis in urban metro systems. Section 3 presents our empirical framework to compute vulnerability metrics. This section discusses the proposed causal inference approach to estimate the unbiased disruption impact, which is the key input in building vulnerability metrics. In Section 4, we analyse the vulnerability of the London Underground as a case study. Results are discussed in Section 5. Finally, Section 6 concludes and highlights the potential avenues for future research.

Literature review

Below we provide a contextual review of previous studies related to vulnerability measurement. In Section 2.1, we review the literature on vulnerability quantification in rail transit networks, while Section 2.2 investigates previous attempts to estimate the impact of disruptions.

There are two traditional methods used to build vulnerability indicators of metro systems: topology-based and system-performance-based analysis. The topological methods rely on complex network theory to convert the metro network into a scale-free graph, in which nodes represent metro stations, edges represent links between directly connected stations and the weight associated with each edge is computed based on travel time or distance (Derrible and Kennedy, 2010; Mattsson and Jenelius, 2015; Zhang et al., 2011). The changes in the system’s connectivity are reflected on graphs by removing nodes or links, and vulnerability is entirely governed by the topological structure. For instance, the location importance of metro stations or links is indicated by the number of edges connected to a specific node and the fraction of shortest paths passing through the given node/edge (Sun and Guan, 2016; Sun et al., 2018; Yang et al., 2015; Zhang et al., 2018). Network-level efficiency is indicated by the average of reciprocal shortest path length between any origin-destination (OD) pair. Such global indicators capture the overall reachability as well as the service size of a metro system (Sun et al., 2015; Yang et al., 2015).
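The topological indicators described above can be illustrated on a toy graph. The following is a minimal sketch, assuming an unweighted, undirected network of five hypothetical stations (A to E) rather than the travel-time or distance weights used in the cited studies. It computes node degree and network-level efficiency as the average reciprocal shortest-path length over all ordered station pairs, before and after closing one station. The station names and helper functions are illustrative assumptions and are not taken from any of the cited papers.

```python
from collections import deque

# Hypothetical toy network: a line A-B-C-D with a branch C-E.
EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]


def build_adjacency(edges):
    """Return an undirected adjacency mapping: station -> set of neighbours."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def hop_distances(adjacency, source):
    """Breadth-first search: hops from source to every reachable station."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


def network_efficiency(adjacency):
    """Average of 1/d(u, v) over ordered pairs u != v; unreachable pairs add 0."""
    nodes = list(adjacency)
    if len(nodes) < 2:
        return 0.0
    total = 0.0
    for u in nodes:
        distances = hop_distances(adjacency, u)
        for v in nodes:
            if v != u and v in distances:
                total += 1.0 / distances[v]
    return total / (len(nodes) * (len(nodes) - 1))


def remove_station(adjacency, station):
    """Return a copy of the network with one station (node) closed."""
    return {
        node: neighbours - {station}
        for node, neighbours in adjacency.items()
        if node != station
    }


network = build_adjacency(EDGES)
degree = {node: len(neighbours) for node, neighbours in network.items()}
baseline = network_efficiency(network)
after_closing_c = network_efficiency(remove_station(network, "C"))
```

In this toy network, closing the interchange station C lowers the efficiency from 2/3 to 1/6, which is the kind of node-removal comparison that topology-based studies report.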

System-performance-based analyses not only consider the network topology but also incorporate real data on metro operations (e.g., ridership distribution) into vulnerability measurement (LaniM’cleod et al., 2017; Mattsson and Jenelius, 2015). For instance, Sun et al. (2018) use a ridership-based indicator (a sum of flows in edges connected with the given node) to complement the topological measures by integrating passengers’ travel preferences. Other studies use passenger delay and demand loss as vulnerability indicators (Adjetey-Bahun et al., 2016; LaniM’cleod et al., 2017; Rodríguez-Núñez and García-Palomares, 2014). Specifically, passenger delay is summarised by changes in the weighted average of travel time between all OD pairs due to disruptions, where weights are station-level passenger loads. Jiang et al. (2018) suggest integrating land use characteristics around stations into vulnerability measurement because metro systems interact with the external environment during incidents.

To quantify vulnerability based on the aforementioned indicators of the system’s performance, almost all previous studies adopt simulation-based approaches and assume hypothetical disruption scenarios. The simplest disruption scenario involves a single station or link closure, assuming one node or edge in the graph is out of service. This incident affects the topological structure and passengers’ route choice, and the differences in the corresponding performance indicators under normal and disrupted scenarios are quantified to measure vulnerability (Sun et al., 2015). More complex disruption scenarios include the closure of two or more non-adjacent stations, failure of an entire line, and sequential closure of stations until the network crashes (Adjetey-Bahun et al., 2016; Chopra et al., 2016; Sun and Guan, 2016; Zhang et al., 2018; Zhang et al., 2018). Ye and Kim (2019) also discuss the case of partial station closure.
Simulation-based studies gained popularity because they do not require incident data and can flexibly control simulation settings to imitate a wider range of possible situations. However, researchers have to make many assumptions to infer passengers’ response to virtual disruptions. Without observing passengers’ movements during real incidents, the validity of the simulation assumptions is questionable. For example, while quantifying passenger delay indicators, Rodríguez-Núñez and García-Palomares (2014) and Adjetey-Bahun et al. (2016) assume that all passengers have the same travel speed and that they do not change their destinations under disruptions unless there is no available route. However, in reality, passengers can travel at different speeds, leave the metro system, change their destinations, or reroute during disruptions. As a result, especially for system-based analyses, vulnerability metrics obtained from simulation-based studies may not reflect the true changes in the level of service due to disruptions. There is, therefore, scope to improve vulnerability measurement by empirically estimating the impact of disruptions.

In an urban rail transit context, early attempts to analyse disruption impact relied on surveys. Rubin et al. (2005) conducted a stated preference survey to understand the psychological and behavioural reactions of travellers to the bombings that took place in London in July 2005. They consider passengers’ reduced intention of travelling by the London Underground after the attack as the key indicator. Since stated willingness may not reflect real travel behaviour, Zhu et al. (2017) performed a revealed preference survey to investigate travellers’ reactions to transit service disruptions in the Washington D.C. Metro. By comparing travellers’ actual travel choices before and during the metro shutdown, they find a 20% reduction in demand.
Results from such surveys are usually presented as the percentage change in passengers’ preferences for travel modes, departure time, and destinations. Although this information is useful, we still need detailed information about delays or demand losses to quantify true disruption impacts. Furthermore, there are inherent limitations of survey-based studies. For instance, repeated observations of a respondent are difficult to collect for a long period because of constraints associated with cost, manpower, recording accuracy, and privacy protection of respondents (Kusakabe and Asakura, 2014). A survey sample also cannot cover all passengers, which may lead to biased estimates of disruption impact if the sample is not representative of the population. With the wide use of automated fare collection facilities in metro systems, smart card data have become a powerful tool for research related to transit operations and travel behaviour (Pelletier, Trépanier and Morency, 2011). Compared to survey data, the key advantages of smart card data are cost-effectiveness, continuous long-term recording and accurate travel information for each passenger within the system (Kusakabe and Asakura, 2014). Therefore, researchers have started using smart card data to analyse disruption impacts. For instance, Sun et al. (2016) develop a method to identify incidents and conduct trip assignments with/without incidents. They estimate the disruption impact by computing the differences between two assignments in terms of ridership distribution and travel time across all OD pairs. This study does not require extra assumption about passengers’ reaction because their actual locations and movements are revealed from smart card data. 
However, they assume that metro disruptions occur randomly, while in reality, factors such as travel demand, signalling type, passenger behaviour, operating years, rolling stock characteristics and weather conditions have a significant influence on the likelihood of metro failures (Brazil et al., 2017; Melo et al., 2011; Wan et al., 2015). This is a particularly important consideration because the impact estimated from direct comparison of performance indicators before and after disruptions will be biased under non-random occurrence of disruptions. Specifically, a few factors affecting the impact of disruptions (e.g., passenger behaviour and weather conditions) may also affect the occurrence of disruptions, leading to confounding bias in pre-post comparison estimates (Imbens and Rubin, 2015). Some researchers also adopt prediction-based approaches to quantify disruption impact using smart card data. For instance, Silva et al. (2015) propose a framework to predict the exit ridership and model behaviours of passengers under station closure and line segment closure. In a very recent study, Yap and Cats (2020) apply supervised learning approaches to predict the passenger delay caused by incidents. However, these prediction-based studies also cannot disentangle the causal effect of disruptions and can result in biased estimates due to the existence of confounding factors.

Table 1 shows a comparison of recent vulnerability studies and also illustrates the contribution of this research. We conclude this section with a summary of gaps in the literature that we address to obtain more accurate measures of vulnerability:

1. Previous studies on vulnerability metrics of transit systems are largely based on simulation approaches. These studies do not account for the actual behaviour of passengers under disruptions. Basing analyses on empirical data, rather than simulations, obviates the need for making potentially unrealistic assumptions on passengers’ movement.
2. In urban metro systems, disruption occurrences can be non-random. Therefore, empirical studies on quantifying disruption impacts should account for this non-randomness to eliminate confounding biases in estimation.

In this paper, we show that both improvements can be made by adopting causal inference methods and calibrating them using large-scale smart card data and incident data. Specifically, the proposed method allows for the non-random occurrence of disruptions and adjusts for potential bias caused by confounding factors. Subsequently, unbiased empirical estimates of disruption impact are used to accurately compute vulnerability metrics of metro systems.

Table 1: A comparison of recent research on metro vulnerability.

Research | Vulnerability metrics or disruption impacts (Topology-based; System performance-based) | Analysis approach (Simulation-based; Empirical (real incidents)) | Smart card or OD data | Land-use | Non-random disruptions

Derrible and Kennedy, 2010 √ √
Zhang et al., 2011 √ √
Yang et al., 2015 √ √
Chopra et al., 2016 √ √
Zhang et al., 2018 √ √
Zhang et al., 2018 √ √
Ye and Kim, 2019 √ √
Rodríguez-Núñez and García-Palomares, 2014 √ √ √
Adjetey-Bahun et al., 2016 √ √ √
LaniM’cleod et al., 2017 √ √ √
Sun et al., 2015 √ √ √ √
Sun and Guan, 2016 √ √ √ √
Sun et al., 2018 √ √ √ √
Lu, 2018 √ √ √ √
Jiang et al., 2018 √ √ √ √
Sun et al., 2016 √ √ √
Our approach √ √ √ √ √

Methodology

From a methodological point of view, our empirical approach has three stages. First, we apply a causal inference method to estimate the impact of disruptions on station-level travel demand and travel speed (see Section 3.1). Then, in Section 3.2, we construct vulnerability metrics based on the disruption impact estimated in the first stage. Finally, the third stage imputes missing vulnerability metrics for non-disrupted stations using machine learning algorithms. Figure 1 illustrates all steps of the proposed empirical framework.

Figure 1: Flowchart of the paper’s methodological framework.

To evaluate the impact of a disruption on a metro system, we use Rubin’s potential outcome framework to establish causality (Rubin, 1974). We define metro disruptions as ‘treatments’ and the objective of our analysis is to quantify the causal effect of treatments on ‘outcomes’ related to system performance. Specifically, we are interested in estimating causal effects on travel speed and ridership. From the literature, we know that factors such as passenger demand, weather conditions, network topology and engineering design influence the likelihood of disruption occurrence (Brazil et al., 2017; Melo et al., 2011; Wan et al., 2015). Therefore, the assignment of the treatment is not random. This is important in our context because the factors associated with the assignment of the treatment are also likely to affect the outcomes of interest, and are thus potential confounders in estimation of impacts. Since previous studies on disruption impact have ignored the non-randomness of treatments, their estimated impact may be biased. We adopt propensity score matching (PSM) methods to address this issue, which potentially eliminates such confounding biases. The propensity score is defined as the conditional probability that a unit receives treatment given its baseline confounding characteristics. If the observed characteristics sufficiently capture the sources of confounding, then the propensity score can be used to consistently estimate impacts given conditional independence between treatment assignment and outcomes (e.g. conditional on the propensity score) (Imbens and Rubin, 2015). This index is obtained by estimating a relationship between treatment assignment and baseline confounding characteristics using a regression model. The estimated propensity score is then used to form various semi-parametric estimators of the treatment effect such as weighting, regression, and matching. 
In this section, we first provide a contextual formulation of PSM and then describe how we apply PSM to quantify the causal impact of metro disruptions on the performance of metro systems.
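To see why non-random treatment assignment matters, consider a minimal sketch with a single binary confounder, here a hypothetical peak-hour flag. All records and numbers below are invented for illustration. Because the confounder is binary, the propensity score takes only two values, so the sketch uses stratification on the propensity score (a simpler relative of matching) rather than the matching procedure itself. In the toy data, disruptions reduce entries by 20 in both strata, but disruptions are more frequent at peak times, when demand is higher.

```python
from collections import defaultdict

# Hypothetical station-interval records: (peak_hour, disrupted, entries).
RECORDS = [
    (1, 1, 80), (1, 1, 80), (1, 1, 80), (1, 0, 100),
    (0, 1, 20), (0, 0, 40), (0, 0, 40), (0, 0, 40),
]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def naive_difference(records):
    """Difference in mean outcome between disrupted and undisrupted units."""
    treated = [y for _, w, y in records if w == 1]
    control = [y for _, w, y in records if w == 0]
    return mean(treated) - mean(control)


def propensity_by_stratum(records):
    """Share of disrupted units for each value of the confounder."""
    groups = defaultdict(list)
    for x, w, _ in records:
        groups[x].append(w)
    return {x: mean(ws) for x, ws in groups.items()}


def stratified_effect_on_treated(records):
    """Within-stratum differences, weighted by the number of treated units."""
    groups = defaultdict(lambda: {0: [], 1: []})
    for x, w, y in records:
        groups[x][w].append(y)
    weighted_sum, n_treated = 0.0, 0
    for outcomes in groups.values():
        if outcomes[0] and outcomes[1]:  # overlap: both states are observed
            diff = mean(outcomes[1]) - mean(outcomes[0])
            weighted_sum += diff * len(outcomes[1])
            n_treated += len(outcomes[1])
    return weighted_sum / n_treated


propensity = propensity_by_stratum(RECORDS)        # {1: 0.75, 0: 0.25}
naive = naive_difference(RECORDS)                  # 10.0
adjusted = stratified_effect_on_treated(RECORDS)   # -20.0
```

The naive comparison suggests that disruptions raise entries by 10, simply because disrupted intervals are concentrated in high-demand peak periods. Comparing like with like within each propensity stratum recovers the reduction of 20 that was built into the toy data.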

The system-level impact, which averages the impact of all disruptions that occurred within the metro system, is too generic to represent network vulnerability. Thus, we focus instead on estimating station-level disruption impacts. We define study unit $i$ as the observation of a metro station within a 15-minute interval. The treatment variable, denoted by $W_{it} \in \{0, 1\}$, records whether study unit $i$ at time $t$ is observed in a disrupted ($W_{it} = 1$) or undisrupted ($W_{it} = 0$) state. To quantify disruption impacts, we define outcomes of interest as the changed travel demand and average speed of trips that start from the given study unit, denoted by $Y_{it}$:

$$
Y_{it} = Y_{it}(W_{it}) = Y_{it}(0)\,(1 - W_{it}) + Y_{it}(1)\,W_{it} =
\begin{cases}
Y_{it}(0) & \text{if } W_{it} = 0 \\
Y_{it}(1) & \text{if } W_{it} = 1
\end{cases}
\tag{1}
$$

$$
i = 1, \dots, n, \qquad t = 1, \dots, T,
$$

where $n$ is the total number of stations within the metro system, and $T$ is the total number of time intervals during the study period (for example, $T = 4$ if the study period is 1 hour). $Y_{it}(0)$ and $Y_{it}(1)$ are counterfactual potential outcomes, only one of which is observed. The propensity score, denoted by $e(X_{it})$, is obtained by regressing $W_{it}$ on confounding factors, denoted by $X_{it}$. We discuss potential confounding factors in the empirical study in Section 4.

To derive valid causal inference using PSM, our model needs to satisfy three key assumptions. The first one is the conditional independence assumption (CIA), $W_{it} \perp (Y_{it}(0), Y_{it}(1)) \mid X_{it}$, which states that conditional on the observed confounding factors $X_{it}$, the treatment assignment should be independent of the potential outcomes. The advantage of the propensity score stems from the property that this conditional independence can be achieved by just conditioning on a scalar rather than on high-dimensional baseline covariates (Rosenbaum and Rubin, 1983). Thus, the CIA based on the propensity score can be written as $W_{it} \perp (Y_{it}(0), Y_{it}(1)) \mid e(X_{it})$.

The second assumption is the common support (overlap) condition, $0 < \mathrm{pr}(W_{it} = 1 \mid X_{it} = x) < 1$ for all $x$, which states that the conditional distribution of $X_{it}$ given $W_{it} = 1$ should overlap with the conditional distribution of $X_{it}$ given $W_{it} = 0$. This assumption can be tested by comparing the distributions of propensity scores between treatment and control groups. The third assumption, also known as the stable unit treatment value assumption (SUTVA), requires that the outcome for each unit should be independent of the treatment status of other units (Graham et al., 2014).

If all three assumptions hold, the average treatment effect (ATE) of disruptions on a station $i$ can be derived as (Imbens and Wooldridge, 2009):

$$
\tau_i^{\mathrm{ATE}} = \hat{\tau}_i^{\mathrm{match}} = \frac{1}{T_d} \sum_{t=1}^{T_d} \left( \hat{Y}_{it}(1) - \hat{Y}_{it}(0) \right),
\tag{2}
$$

$$
\hat{Y}_{it}(1) = Y_{it}, \qquad
\hat{Y}_{it}(0) = \frac{1}{M} \sum_{j \in J_M(it)} Y_j, \quad W_j \neq W_{it},
$$

$$
i = 1, \dots, n, \qquad t = 1, \dots, T_d,
$$

where $t \in \{1, \dots, T_d\}$ denotes all the disrupted time intervals of station $i$ during the study period (i.e., $W_{it} = 1$). $J_M(it)$ is a set of indices of the closest $M$ control units (in terms of propensity scores) for station $i$ disrupted at time $t$. Thus, $\hat{\tau}_i^{\mathrm{match}}$ represents the average of the difference between the outcomes of treated and matched control units.

In the next subsection, we explain how the causal inference framework introduced in Equations (1) and (2) can be implemented in the present application. The sequence of steps we follow is shown in Figure 1. We first provide details of the propensity score model, followed by a description of our matching algorithms and the estimation of disruption impact.

To predict the propensity score, i.e. the probability of encountering disruptions at a metro station within a 15-minute interval conditional on the baseline confounding characteristics, we use a logistic regression model with a linear predictor:

$$
e(X_{it}) = \mathrm{pr}(W_{it} = 1 \mid X_{it} = x^{\{c\}}) = p(it),
\tag{3}
$$

$$
\log\left[ \frac{p(it)}{1 - p(it)} \right] = \alpha + \beta x^{\{c\}}, \qquad i = 1, \dots, n, \quad t = 1, \dots, T,
$$

where $\alpha$ is the intercept and $\beta$ is the vector of regression coefficients related to the vector of confounding factors $x^{\{c\}}$. In our empirical study, a station with a higher number of incidents in the past is more likely to encounter a new disruption in the future, much like accident black spots on highways. To account for this temporal correlation among disruption occurrences, we ensure that the confounding factors contain the history of past disruptions that happened on the same day.

Additionally, we also consider a more advanced generalised additive model (GAM), in which the logarithm of the odds is modelled via semi-parametric smoothing splines. A GAM has the potential to uncover flexible relationships between the likelihood of disruption occurrence and confounding factors. The GAM with temporal correlation is presented in Equation (4):

$$
e(X_{it}) = \mathrm{pr}(W_{it} = 1 \mid X_{it} = x^{\{c\}}) = p(it),
\tag{4}
$$

$$
\log\left[ \frac{p(it)}{1 - p(it)} \right] = \alpha + f(x^{\{c\}}; \beta), \qquad i = 1, \dots, n, \quad t = 1, \dots, T,
$$

where $f(x^{\{c\}}; \beta)$ is a flexible spline function of baseline characteristics. After estimating propensity scores, we check the common support (overlap) assumption to ensure effective matching and the reliability of the propensity score estimates (Lechner, 2001).

The next step is matching. Every treated unit $i$ at time $t$ is paired with $M$ similar control units using the value of their propensity scores. Since there is no theoretical consensus on the superiority of matching algorithms, we adopt two commonly used approaches: Subclassification Matching and Nearest Neighbour Matching.
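The propensity model in Equation (3) and the matching estimator in Equation (2) can be sketched in a few lines. This is a simplified illustration, not the authors' implementation. It assumes the propensity scores have already been estimated, uses nearest neighbour matching with replacement, and works on invented records for a single station. The helper names `logistic_propensity`, `match_controls` and `station_effect`, and all coefficients and data values, are hypothetical.

```python
import math


def logistic_propensity(x, alpha, beta):
    """Equation (3) solved for p: p = 1 / (1 + exp(-(alpha + beta . x)))."""
    linear = alpha + sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical units for one station: (propensity_score, disrupted, outcome).
UNITS = [
    (0.80, 1, 50.0),  # disrupted interval
    (0.30, 1, 70.0),  # disrupted interval
    (0.78, 0, 90.0),
    (0.85, 0, 94.0),
    (0.33, 0, 80.0),
    (0.25, 0, 84.0),
    (0.55, 0, 60.0),
]


def match_controls(treated_score, controls, m):
    """Return the m controls with the closest propensity scores (with replacement)."""
    ranked = sorted(controls, key=lambda unit: abs(unit[0] - treated_score))
    return ranked[:m]


def station_effect(units, m=2):
    """Equation (2): mean over disrupted units of outcome minus matched-control mean."""
    treated = [u for u in units if u[1] == 1]
    controls = [u for u in units if u[1] == 0]
    if not treated or len(controls) < m:
        raise ValueError("need at least one treated unit and m control units")
    differences = []
    for score, _, outcome in treated:
        matched = match_controls(score, controls, m)
        counterfactual = sum(c[2] for c in matched) / m
        differences.append(outcome - counterfactual)
    return sum(differences) / len(differences)


# A score from made-up coefficients, only to show how Equation (3) is evaluated.
example_score = logistic_propensity([1.0, 0.0], alpha=-1.0, beta=[2.0, 0.5])

effect = station_effect(UNITS, m=2)  # (-42.0 + -12.0) / 2 = -27.0
```

With M = 2, the first disrupted interval is matched to the controls with scores 0.78 and 0.85, and the second to those with scores 0.33 and 0.25, so the estimated station-level effect in this toy example is -27.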
We then compare these matching algorithms under different replacement conditions and pairing ratios, and select the one that achieves the best balance in the means of the confounding factors between treated and control units. It is also necessary to check the conditional independence assumption after matching. We conduct balancing tests to check whether the disrupted units and the matched units are statistically similar across the domain of confounders. If significant differences are found, we try another specification of the propensity score model and repeat the above-discussed procedure.

In the last step, we estimate station-level disruption impact using Equation (2). Given the matched pairs, the treatment effect for a station at a specific period is estimated as the difference between outcomes of the treated unit and its matched control units. Then the average station-level disruption impact is obtained by averaging these differences across disrupted periods. We separately estimate the average treatment effects for the following three measures of metro performance:

1. Entry ridership: the number of passengers who enter the study unit.
2. Exit ridership: the number of passengers who exit the study unit.
3. Average travel speed: the average speed of all trips that start from the study unit. This speed is computed as total travel distance divided by total journey time, where travel distances (track length) are recovered using the shortest path assignment. If the origin station is entirely closed, no passenger can continue their trip and the average speed will be zero. If the origin station is partially closed, this average speed reflects the trips of passengers who remain in the system and keep travelling on alternative lines.

Stage 2: Constructing vulnerability metrics

We propose three station-level vulnerability metrics that are constructed from the empirical estimates of disruption impacts.

i). The loss of travel demand is expressed as:

$$
d_i = \left| \tau_i^{\mathrm{ATE}}(\mathrm{exit}) - \tau_i^{\mathrm{ATE}}(\mathrm{entry}) \right|,
\tag{5}
$$

where $\tau_i^{\mathrm{ATE}}(\mathrm{exit})$ and $\tau_i^{\mathrm{ATE}}(\mathrm{entry})$ denote the change in the number of passenger exits and entries, respectively. $d_i$ is the net change in station-level demand during a 15-minute interval.

ii). The loss of average travel speed quantifies the decline in level of service experienced by each passenger at a metro station (individual delay), which is expressed as:

$$
s_i^{\mathrm{avg}} = \tau_i^{\mathrm{ATE}}(\mathrm{speed}),
\tag{6}
$$

where $\tau_i^{\mathrm{ATE}}(\mathrm{speed})$ denotes the decrease in average travel speed of each trip starting from station $i$ during a 15-minute disruption period. By definition, $s_i^{\mathrm{avg}}$ accounts for the changes in both travel distance and journey time of passengers.

iii). The loss of gross travel speed reflects the loss of passenger kilometres per unit time for the entire station (overall delays), which is expressed as:

$$
s_i^{\mathrm{gross}} = \tau_i^{\mathrm{ATE}}(\mathrm{speed}) \times r_i,
\tag{7}
$$

where $r_i$ denotes the average entry ridership of all disrupted 15-minute intervals at the corresponding station. Thus, $s_i^{\mathrm{gross}}$ denotes the total decrease in average travel speed for all passengers who start their journeys from station $i$ during a 15-minute service disruption.

Stage 3: Imputing Missing Vulnerability Metrics
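Before turning to imputation, the three Stage 2 metrics in Equations (5) to (7) can be made concrete with a small numeric sketch. The treatment-effect estimates, the ridership figure and the speed unit (km/h) below are invented for a single hypothetical station. None of them come from the London Underground data, and the function names are illustrative.

```python
def demand_loss(tau_exit, tau_entry):
    """Equation (5): absolute difference between the exit and entry effects."""
    return abs(tau_exit - tau_entry)


def average_speed_loss(tau_speed):
    """Equation (6): the estimated effect on average travel speed."""
    return tau_speed


def gross_speed_loss(tau_speed, mean_entry_ridership):
    """Equation (7): speed effect scaled by average entry ridership when disrupted."""
    return tau_speed * mean_entry_ridership


# Invented estimates for one station and one 15-minute interval.
tau_exit = -30.0    # change in passenger exits
tau_entry = -120.0  # change in passenger entries
tau_speed = -4.0    # change in average travel speed (assumed km/h)
r = 250.0           # average entry ridership over disrupted intervals

d = demand_loss(tau_exit, tau_entry)      # 90.0
s_avg = average_speed_loss(tau_speed)     # -4.0
s_gross = gross_speed_loss(tau_speed, r)  # -1000.0
```

The gross metric scales the per-trip speed effect by ridership, so a busy station with a modest per-trip slowdown can rank above a quiet station with a larger one.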

Some stations may not encounter any incidents within the study period. Thus, the empirical disruption impact and the vulnerability metrics cannot be estimated directly for these stations. To predict the missing metrics of non-disrupted stations, we estimate a random forest regression model (Hastie et al. 2009):

$$\hat{f}_{rf}^{B}(x^{\{s\}}) = \frac{1}{B} \sum_{b=1}^{B} T(x^{\{s\}}; \theta_b), \tag{8}$$

where $\hat{f}_{rf}^{B}(x^{\{s\}})$ denotes the random forest predictor. In the equation above, $B$ is the number of trees and $x^{\{s\}}$ is a vector of input features (see Table 2 for details). Furthermore, $T(x^{\{s\}}; \theta_b)$ is the output of the $b$-th random forest tree, and $\theta_b$ characterizes the $b$-th random forest tree. The random forest regression that we apply here is a combination of a bagging algorithm and ensemble learning techniques. By averaging the output of several trees (or weak learners in boosting terminology), it reduces the overfitting problem. Interested readers may refer to Hastie et al. (2009) for details of random forest regression algorithms and the reasons behind their superior prediction accuracy. We also benchmark the prediction performance of random forest regression against competing methods such as linear regression and support vector machines.

Case study: London Underground

In 2013, the London Underground (LU) had 270 stations and 11 lines, with a total length of 402 km stretching deep into Greater London. With its circle-radial network structure, shown in Figure 2 (Wikimedia Commons, 2013), LU is one of the largest and most complex metro systems in the world. Of all lines within the network, one is circular (Circle Line), covering Central London, and the remaining 10 are radial routes converging at the centre of the system. In terms of connectivity among stations, LU has 56 stations connecting 2 lines, 16 stations connecting 3 lines and 8 stations connecting more than 4 lines. LU is also one of the busiest metro systems, with 1.265 billion journeys by the end of 2013 (Transport for London, 2019). Owing to more than 150 years of operation and enormous passenger demand, disruptions occur frequently in LU.

Figure 2: London Underground network [adapted from (Wikimedia Commons, 2013)].

We use the following data to analyse the station-level vulnerability of the LU system. We conducted data processing and analysis using open-source R software (version 3.6.2).

Pseudonymised smart card data:

Transport for London (TfL) provided automated fare collection data from 28/10/2013 to 13/12/2013 (35 weekdays) between 6:00 and 24:00. We consider this duration as our study period. The smart card data contain information on transaction date and time, entry and exit locations, encrypted card ID and ticket type (pay as you go/season ticket). Time stamps are accurate to one minute. Using the smart card data, we compute the entry/exit ridership of each station and obtain passengers’ journey time and travel speed.

Incidents and service disruption information:

TfL also provided incident information data for our study period. By mining the provided incident logs, we construct an accurate database of service disruptions, which includes the occurrence time, location and duration of disruptions.

LU network topology information:

We collect data on station coordinates, topology structure and the length of tracks between adjacent stations from open databases authorised by TfL.

Weather data:

We collect temperature (°C), wind speed (km/h) and rain status from the Weather Underground web portal. Based on the observations of over 1000 weather stations around London, we estimate weather conditions for all LU stations at 15-minute resolution for our study period.

LU station characteristics:

These station-level features include daily ridership, station age, sub-surface/deep-tube stations, terminal stations and screen doors. We also compute supplementary factors for the affected area of each station, defined as a circle with a radius of 500 metres around the station. We use 2011 UK Census data at Lower Super Output Area (LSOA) level to calculate these supplementary factors. We select all LSOAs whose centroids are within the 500-metre radius of the affected area. We then average the related statistics of the selected LSOAs according to their areas in the circle. Figure 3 illustrates the above process of calculations.

Figure 3: The illustration of calculating station-level supplementary factors.

To construct the causal inference framework for LU, our study unit is the observation of a metro station during each 15-minute interval within the system service time. We define a metro disruption as the state in which scheduled train services are interrupted for at least 10 minutes at a station. Over the study period, LU encountered 2894 disruptions lasting from 10 minutes to 11 hours. The aim of causal inference is to estimate the unbiased impact of these observed disruptions (i.e., treatment) on system-performance measures (outcome).

The treatment status $W_{it}$ is constructed according to the disruption database mentioned in Section 4. To match the disruption duration with the timeframe of study units, we define the following rule to assign the treatment status: if a disruption occurs within a 15-minute interval $t$ at a given station $i$, we regard this study unit as disrupted (i.e., $W_{it} = 1$), regardless of whether the disruption starts or ends in the middle of the interval or lasts for the entire interval. Conversely, if the station is under normal service during the entire 15-minute interval, we regard this study unit as un-disrupted (i.e., $W_{it} = 0$). The treatment outcomes $Y_{it}$ are three station-level performance indicators: entry ridership, exit ridership, and average travel speed.

As discussed earlier, metro disruptions may not occur randomly. We list all potential confounding factors for LU in Table 2, which we use in estimating the propensity score model (Section 3.1). These confounders are selected according to the literature and expert knowledge, and include travel demand, weather conditions, engineering design, time of day and past disruptions (Brazil et al., 2017; Melo et al., 2011; Wan et al., 2015).
Table 2 also shows the covariates available for the imputation of missing vulnerability metrics in Stage 3 (Section 3.3), which include not only some of the confounders but also the supplementary factors of LU station characteristics.

Table 2: Available covariates for PSM and vulnerability imputation.

| Category | Variable | Description |
|---|---|---|
| | Real-time travel demand | |
| Average travel demand | Daily entry ridership | The daily average number of passengers that enter a station during the study period. |
| Average travel demand | Daily exit ridership | The daily average number of passengers that exit a station during the study period. |
| Weather conditions | Temperature | Atmospheric temperature around study units. Observations range from -3 ℃ to 20 ℃. |
| Weather conditions | Wind speed | The wind speed around study units (km/h), ranges from 0 to 88 km/h. |
| Weather conditions | Rain status | Dummy variable, representing whether it was raining at study units. |
| Engineering design characteristics | Rail connectivity | Dummy variable, representing whether the station is connected to other rail systems. |
| Engineering design characteristics | Overground | Dummy variable, representing whether the station is on surface or closed deep in tube. |
| Engineering design characteristics | Terminal | Dummy variable, representing whether the station is an origin or terminal station. |
| Engineering design characteristics | Screen door | Dummy variable, representing whether the station has screen doors on the platform. |
| Engineering design characteristics | Number of lines | The number of lines within the given station, ranges from 1 to 6 in LU. |
| Engineering design characteristics | Average adjacent distance | The average distance between the given station and its adjacent stations (km). |
| Engineering design characteristics | Station age | Age of the oldest metro line served by the station. |
| Engineering design characteristics | Zone | Categorical variable, the zone where the station is located, ranges from 1 to 9 in London Underground. |
| | Time of day | Time of day divided into nine intervals; AM peak: 6:30 to 9:30, PM peak: 16:00 to 19:00 |
| Past disruptions | Number of past disruptions occurred in the same day | Representation of the temporal correlation of disruption occurrence. |
| Station supplementary factors: socio-economic characteristics | Total population * | |
| Station supplementary factors: socio-economic characteristics | IMD * | Index of Multiple Deprivation scores |
| Station supplementary factors: land use characteristics | Domestic buildings * | Area of domestic buildings (10 m ) |
| Station supplementary factors: land use characteristics | Non-domestic buildings * | Area of non-domestic buildings (10 m ) |
| Station supplementary factors: land use characteristics | Other land use * | Area of other land use (10 m ) |
| Station supplementary factors: accessibility measures | Number of bus stops * | |
| Station supplementary factors: accessibility measures | Biking * | Sharing bicycle facility dummy |
| Station supplementary factors: accessibility measures | Parking * | Car parking facility dummy |
| Station supplementary factors: accessibility measures | Road area (m²) * | |
| Station supplementary factors: accessibility measures | Path area (m²) * | |

* Computed for the affected area around each station.

Census data source: London Datastore, published by Greater London Authority: https://data.london.gov.uk/census/.

Figure 3 panels: (a) Station affected areas; (b) An example of LSOA data.

Results

Out of 270 stations of the LU system, TfL provided the required datasets for 265 stations during the study period (28/10/2013 – 13/12/2013). Smart card data were missing for the remaining five stations. Our analysis only covers weekdays, during which the system is open for 18 hours per day, from 6:00 a.m. to midnight. Based on the assumption of exchangeability of weekdays (Silva et al., 2015), we generate a panel dataset with a total of 265×35×18×60/15=667,800 study units. Although PSM is a data-hungry method, the untreated pool (control group) is large enough to ensure adequate matches for treated units. Specifically, the ratio of control to treatment units is around 15:1.
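As a rough illustration of how the study units, the treatment rule and the speed outcome fit together, the following Python sketch enumerates the 15-minute intervals of one service day, flags the intervals that overlap a disruption, and reproduces the study-unit count. The function names, the example disruption window and the example trips are hypothetical and are not taken from the TfL data.

```python
from datetime import date, datetime, timedelta

INTERVAL = timedelta(minutes=15)

def day_intervals(day, start_hour=6, end_hour=24):
    """Start times of all 15-minute study intervals in one service day."""
    start = datetime(day.year, day.month, day.day, start_hour)
    n_intervals = (end_hour - start_hour) * 60 // 15
    return [start + k * INTERVAL for k in range(n_intervals)]

def treatment_status(interval_start, disruptions):
    """W_it = 1 if any disruption overlaps the interval, even partially."""
    interval_end = interval_start + INTERVAL
    return int(any(d_start < interval_end and d_end > interval_start
                   for d_start, d_end in disruptions))

def average_speed(trips):
    """Total distance (km) divided by total journey time (h); 0 if no trips."""
    total_km = sum(km for km, _ in trips)
    total_hours = sum(minutes for _, minutes in trips) / 60.0
    return total_km / total_hours if total_hours > 0 else 0.0

# Hypothetical disruption at one station, 08:05 to 08:40 on the first study day.
intervals = day_intervals(date(2013, 10, 28))
disruptions = [(datetime(2013, 10, 28, 8, 5), datetime(2013, 10, 28, 8, 40))]
flags = [treatment_status(t, disruptions) for t in intervals]

# Panel size: stations x weekdays x intervals per day.
n_units = 265 * 35 * len(intervals)

# Hypothetical trips as (distance in km, journey time in minutes).
speed = average_speed([(10.0, 30), (5.0, 15)])
```

Because the rule flags partial overlaps, a 35-minute disruption that starts mid-interval marks three study units as treated in this example.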

Propensity score models

We initially include three key baseline covariates (past disruptions, time of day and real-time travel demand) in the logistic regression. We then iteratively add one of the remaining covariates listed in Table 2 at a time, and conduct the likelihood ratio test to decide whether the additional covariate should be included in the final specification. Generalised additive models (GAM) were also tested, but we do not observe any gains in the model fit. A high proportion of dummy variables (12 out of 18) may limit the gains from a flexible spline specification of the link function. The estimation results of the logistic regression model are summarised in Table 3.

The role of propensity score models is to establish a comprehensive index to represent all confounding factors, rather than to predict treatment assignment. While noting that the logistic regression model does not reveal the causal effect of covariates on the likelihood of incident occurrence, we succinctly discuss the multivariate correlations uncovered by this model. The coefficients of time dummies indicate that incidents are more likely to occur in morning peak hours. Positive signs on the coefficients of the remaining confounders (except the Rail dummy) indicate that these factors are associated with a higher probability of encountering a disruption. Specifically, origin or terminal stations are more likely to face vehicle dispatching problems and depot-related incidents. Surface stations are more susceptible to the surrounding environment than those in tubes. A larger accumulated number of past disruptions on the same day is associated with a higher probability of encountering another incident. In conclusion, the propensity score model reveals that the occurrence of metro disruptions is non-random, which, in turn, also justifies the application of causal inference methods in estimating disruption impacts.
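The covariate selection step described above can be sketched as a likelihood ratio test between two nested logistic regressions. This is a minimal sketch, assuming the candidate covariate adds exactly one parameter (a categorical covariate coded with several dummies would need more degrees of freedom); the log-likelihood values and the 5% threshold are hypothetical, not taken from the fitted models.

```python
import math

def likelihood_ratio_test(loglik_reduced, loglik_full):
    """Return (statistic, p-value) for adding ONE parameter to a nested model.

    For one degree of freedom the chi-square tail probability has the
    closed form P(X > x) = erfc(sqrt(x / 2)).
    """
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

def keep_covariate(loglik_reduced, loglik_full, alpha=0.05):
    """Keep the added covariate if the improvement in fit is significant."""
    _, p_value = likelihood_ratio_test(loglik_reduced, loglik_full)
    return p_value < alpha

# Hypothetical log-likelihoods before and after adding one covariate.
stat, p_value = likelihood_ratio_test(-1000.0, -995.0)
decision = keep_covariate(-1000.0, -995.0)
```

In a forward procedure, the test would be repeated for each remaining covariate, keeping those that pass and refitting the model before the next candidate is tried.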

Table 3: The results of propensity score model (logistic regression).

| Confounders | Coef. | S.E. |
|---|---|---|
| Intercept | -4.547*** | 0.029 |
| Past disruptions | 0.275*** | 1.609e-03 |
| Time0 (6:00-6:30) (1) | 1.883*** | 0.027 |
| Time1 (6:30-7:45) (1) | 1.642*** | 0.021 |
| Time2 (7:45-8:45) (1) | 1.640*** | 0.022 |
| Time3 (8:45-9:30) (1) | 1.301*** | 0.026 |
| Time4 (9:30-16:00) (1) | 0.814*** | 0.016 |
| Time5 (16:00-17:15) (1) | 0.240*** | 0.026 |
| Time6 (17:15-18:15) (1) | 0.219*** | 0.028 |
| Time7 (18:15-19:00) (1) | 0.469*** | 0.029 |
| Temperature (℃) | 0.033*** | 1.920e-03 |
| Wind speed (km/h) | 9.942e-03*** | 1.187e-03 |
| Rain (1) | 0.326*** | 0.015 |
| Rail (1) | -0.120*** | 0.013 |
| Overground (1) | 0.065*** | 0.011 |
| Ave distance (km) | 0.046*** | 4.801e-03 |
| Max age | 1.497e-04*** | 1.699e-04 |
| Terminal (1) | 0.044* | 0.017 |
| Pre 15-minute entry ridership | 5.717e-05** | 2.079e-05 |
| Pseudo R-squared | 0.183 | |

Note: (1) represents dummy variables. The base dummy for time of the day is Time8 (19:00-24:00). * p < 0.1; ** p < 0.05; *** p < 0.01.

Alternatively, the estimated propensity score model can also be viewed as a binary classifier that predicts whether metro disruptions occur or not. To illustrate its diagnostic ability, we compute the area under the receiver operating characteristic curve: AUC=0.795, which again indicates that the occurrence of metro disruptions is non-random.
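To make the use of Table 3 concrete, the sketch below plugs the reported coefficients into the logistic function to obtain a propensity score for one study unit, and computes an AUC by pairwise comparison of scores. The dictionary keys, the example unit and the toy scores and labels are hypothetical; only the coefficient values are taken from Table 3.

```python
import math

# Coefficients as reported in Table 3 (Time8, 19:00-24:00, is the base level).
COEF = {
    "intercept": -4.547, "past_disruptions": 0.275,
    "time0": 1.883, "time1": 1.642, "time2": 1.640, "time3": 1.301,
    "time4": 0.814, "time5": 0.240, "time6": 0.219, "time7": 0.469,
    "temperature": 0.033, "wind_speed": 9.942e-03, "rain": 0.326,
    "rail": -0.120, "overground": 0.065, "ave_distance": 0.046,
    "max_age": 1.497e-04, "terminal": 0.044, "pre15_entry": 5.717e-05,
}

def propensity_score(unit):
    """Probability of disruption for a study unit given as {covariate: value}.

    Covariates missing from `unit` are treated as zero.
    """
    z = COEF["intercept"] + sum(COEF[name] * value for name, value in unit.items())
    return 1.0 / (1.0 + math.exp(-z))

def auc(scores, labels):
    """Share of (treated, control) pairs in which the treated unit scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical unit: surface station, rainy interval in Time1, one earlier disruption.
unit = {"past_disruptions": 1, "time1": 1, "temperature": 10.0, "wind_speed": 15.0,
        "rain": 1, "overground": 1, "ave_distance": 1.2, "max_age": 120,
        "pre15_entry": 300}
score = propensity_score(unit)
baseline = propensity_score({})  # all covariates zero, base time period

# Toy scores and treatment labels, only to exercise the AUC function.
toy_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

An AUC of 0.5 would correspond to disruptions that the covariates cannot distinguish from normal service, which is why a value well above 0.5 is read as evidence of non-random occurrence.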

Matching results

Before the estimated propensity scores are utilised for matching, we inspect the common support condition (assumption 2 of the PSM method). Figure 4 presents the propensity score distributions for both disrupted and normal observations. The histograms display apparent overlap between the treatment and control groups, even for large propensity scores. There is no treated unit outside the range of common support, which means we do not need to discard any observations. We thus conclude that the overlap assumption is tenable in our empirical study.

The PSM method aims to balance the distribution of confounders between the treatment and control groups after the matching stage. To assess the quality of matching, we perform balance tests for four algorithms: subclassification matching, nearest neighbour matching without replacement ($M = 1$), nearest neighbour matching with replacement ($M = 1$) and nearest neighbour matching with replacement ($M = 2$), where $M$ is the number of matched control units for each treatment unit. We find that nearest neighbour matching with replacement ($M = 1$) performs the best, improving the overall balance of all confounding factors by 99.97%. This improvement indicates that, within matched pairs, the difference in propensity scores between treatment and control units is reduced by 99.97% compared with the original data before matching.
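Nearest neighbour matching with replacement ($M = 1$) can be sketched in a few lines. This is a simplified illustration, not the estimator used in the study: it matches only treated units to controls, so the resulting mean difference is an effect on the treated units, and the balance measure is a plain percent reduction in the mean propensity-score gap. All scores and outcomes below are hypothetical.

```python
from statistics import mean

def nearest_control(score, control_scores):
    """Index of the control unit whose propensity score is closest to `score`."""
    return min(range(len(control_scores)),
               key=lambda j: abs(control_scores[j] - score))

def match_with_replacement(treated_scores, control_scores):
    """One nearest control per treated unit (M = 1); controls may be reused."""
    return [nearest_control(s, control_scores) for s in treated_scores]

def matched_effect(treated_outcomes, control_outcomes, matches):
    """Mean outcome difference between treated units and their matched controls."""
    return mean([y - control_outcomes[j] for y, j in zip(treated_outcomes, matches)])

def balance_improvement(treated_scores, control_scores, matches):
    """Percent reduction in the mean propensity-score gap after matching."""
    before = abs(mean(treated_scores) - mean(control_scores))
    after = abs(mean(treated_scores) - mean([control_scores[j] for j in matches]))
    return 100.0 * (1.0 - after / before)

# Hypothetical propensity scores and entry ridership per 15-minute unit.
treated_scores = [0.30, 0.60]
treated_entry = [80.0, 50.0]
control_scores = [0.05, 0.28, 0.31, 0.58, 0.90]
control_entry = [120.0, 100.0, 96.0, 75.0, 40.0]

matches = match_with_replacement(treated_scores, control_scores)
effect = matched_effect(treated_entry, control_entry, matches)
improvement = balance_improvement(treated_scores, control_scores, matches)
```

Matching with replacement lets a well-placed control serve several treated units, which tends to tighten the score gap at the cost of reusing observations.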

Figure 4: Histogram of propensity scores to test the Common Support condition.

Imputation of missing vulnerability metrics

During the study period, 21 out of 265 stations did not encounter any service disruptions. We apply the random forest regression model to predict the missing vulnerability metrics of these stations. The input features of the model are indicated in the Stage 3 column of Table 2. Table 4 compares the prediction performance of random forest regression, linear regression and support vector machines. We consider four measures to benchmark the performance of random forest regression against other methods: mean absolute error (MAE), root mean squared error (RMSE), relative absolute error (RAE), and relative squared error (RSE). Whereas MAE measures the average magnitude of the prediction errors, RMSE is the square root of the mean squared prediction error and therefore penalises large errors more heavily (Willmott and Matsuura, 2005). A better prediction model produces lower values of these measures.

Note on Figure 4: Owing to the larger share of the control group, the frequency in Figure 4 ranges up to 60,000 for lower propensity scores. However, we truncate the frequency at 2,000 to clearly show the validity of the overlap condition across the entire domain of the propensity score.

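Table 4: Prediction accuracy of different regression methods.

| Imputation method | Vulnerability metric | MAE | RMSE | RAE | RSE |
|---|---|---|---|---|---|
| Random Forest Regression | Demand loss | 21.286 | 10.133 | 0.266 | 0.101 |
| Random Forest Regression | Avg. travel speed loss | 0.234 | 0.163 | 0.353 | 0.138 |
| Random Forest Regression | Gross travel speed loss | 52.533 | 27.663 | 0.247 | 0.078 |
| Linear Regression | Demand loss | 44.107 | 26.000 | 0.683 | 0.431 |
| Linear Regression | Avg. travel speed loss | 0.557 | 0.395 | 0.856 | 0.783 |
| Linear Regression | Gross travel speed loss | 101.625 | 66.381 | 0.593 | 0.291 |
| Support Vector Machines | Demand loss | 49.789 | 18.809 | 0.494 | 0.550 |
| Support Vector Machines | Avg. travel speed loss | 0.515 | 0.317 | 0.688 | 0.668 |
| Support Vector Machines | Gross travel speed loss | 116.494 | 53.754 | 0.480 | 0.382 |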
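The four performance measures can be computed as in the sketch below. The text does not spell out the formulas for RAE and RSE, so this assumes the common definitions in which the absolute (or squared) errors are divided by the corresponding errors of a predictor that always returns the mean of the observed values. The example values are hypothetical.

```python
import math

def error_measures(actual, predicted):
    """Return MAE, RMSE, RAE and RSE for paired observed and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return {
        "MAE": sum(abs_err) / n,
        "RMSE": math.sqrt(sum(sq_err) / n),
        # Relative measures: errors scaled by those of the mean-only predictor.
        "RAE": sum(abs_err) / sum(abs(a - mean_actual) for a in actual),
        "RSE": sum(sq_err) / sum((a - mean_actual) ** 2 for a in actual),
    }

# Hypothetical observed and imputed metric values for four stations.
measures = error_measures([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

Under these definitions, RAE and RSE below 1 mean the model improves on simply predicting the mean, which makes them comparable across metrics with different units.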

LU vulnerability metrics

The estimated vulnerability metrics vary across stations in the LU system. For the 265 stations operating in 2013, during a 15-minute period of service disruption, the loss of station demand ranges from 0 to 743 passengers, the loss of average travel speed ranges from 0 to 4.78 kilometres/hour, and the loss of gross travel speed ranges from 0 to 1790 passenger-kilometres/hour. The spatial distribution of the three vulnerability metrics is visualised in Figure 5. For the demand loss and gross speed loss, the most vulnerable stations are in central London areas, while a small number of vulnerable stations are also located in suburban areas. Conversely, for the loss of average travel speed, the most vulnerable stations are scattered around outer London areas. These stations usually have only one metro line and have very limited access to other transport modes compared to Central London areas. When passengers encounter disruptions, they need to wait longer in the system until train services are recovered in order to continue their trips. In other words, due to the lack of alternative routes, passengers at these stations tend to experience longer individual delays. In this study, there are two types of alternative routes under disruptions: i) within the metro system (interchange to use other operated lines) and ii) outside of it, in the form of other modes.

We sort all 265 stations based on demand and speed loss, and present the vulnerability metrics of the top 15 stations in Table 5. Six stations (Canary Wharf, Oxford Circus, Victoria, London Bridge, Bond Street, Green Park) are ranked among the top fifteen vulnerable stations based on both the demand loss (first) and gross speed loss (third) metrics. Oxford Circus, London Bridge and Canary Wharf are the most vulnerable stations in the LU network. Under disruptions, they not only encounter significant demand loss, but also lose a large amount of passenger distance per unit time. However, if we look purely at the metric of average travel speed, the most vulnerable stations are South Kenton, Latimer Road and Chesham in outer London areas, where each passenger suffers the longest delay due to lack of alternative routes. The above rankings based on different vulnerability metrics can assist metro operators in preparing effective plans for ridership evacuation and service recovery.

Figure 5: Spatial distribution of station-level vulnerability metrics in London Underground. Panels include (b) The loss of average travel speed and (c) The loss of gross travel speed.
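The three metrics reported in this section follow from Equations (5) to (7) once the station-level effects have been estimated. A minimal sketch, assuming the estimated effects for a station are already available; the function names and all numeric values are hypothetical and do not correspond to any LU station.

```python
def demand_loss(ate_exit, ate_entry):
    """Equation (5): absolute difference between the exit and entry effects."""
    return abs(ate_exit - ate_entry)

def average_speed_loss(ate_speed):
    """Equation (6): decrease in average travel speed per trip (km/h)."""
    return ate_speed

def gross_speed_loss(ate_speed, mean_entry_ridership):
    """Equation (7): speed loss scaled by average entry ridership
    over the disrupted 15-minute intervals (passenger-km/h)."""
    return ate_speed * mean_entry_ridership

# Hypothetical estimated effects for one station during a disrupted interval.
ate_exit, ate_entry = -120.0, -200.0   # change in exits and entries (passengers)
ate_speed = 2.5                        # decrease in average speed (km/h)
mean_entry = 150.0                     # average entry ridership when disrupted

d_i = demand_loss(ate_exit, ate_entry)
s_avg = average_speed_loss(ate_speed)
s_gross = gross_speed_loss(ate_speed, mean_entry)
```

The scaling in Equation (7) explains why busy central stations can rank high on gross speed loss even when their per-passenger speed loss is modest.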

Table 5: Top 15 vulnerable stations based on empirical vulnerability metrics.

| Station | Demand loss in passenger/15-minute (% of baseline) | Station | Avg. travel speed loss in km/h (% of baseline) | Station | Gross travel speed loss in passenger-km/h (% of baseline) |
|---|---|---|---|---|---|
| Oxford Circus | 742.5 (25.9%) | South Kenton | 4.78 (30.6%) | Canary Wharf | 1790.0 (155.9%) |
| Waterloo | 294.5 (28.7%) | Latimer Road | 4.21 (26.6%) | Piccadilly Circus | 950.3 (51.3%) |
| London Bridge | 287.6 (14.4%) | Chesham | 4.17 (15.5%) | Oxford Circus | 882.2 (36.5%) |
| Moorgate | 261.3 (30.0%) | Canons Park | 2.78 (10.9%) | Victoria | 781.4 (120.8%) |
| Bank | 218.9 (16.8%) | North Harrow | 2.34 (10.0%) | Liverpool Street | 780.8 (115.0%) |
| Canary Wharf | 197.6 (10.9%) | Northwood Hills | 2.25 (9.0%) | London Bridge | 653.3 (29.0%) |
| Upton Park | 194.0 (47.6%) | Ealing Common | 2.21 (11.3%) | South Kensington | 631.4 (30.5%) |
| Warwick Avenue | 180.9 (98.0%) | Harrow On The Hill | 2.00 (8.2%) | Stratford | 596.7 (57.5%) |
| Bond Street | 158.5 (11.4%) | Canary Wharf LU (E1) | 1.98 (10.0%) | Bond Street | 588.5 (29.4%) |
| Camden Town | 151.4 (22.2%) | Hanger Lane | 1.97 (9.3%) | Westminster | 504.7 (9.5%) |
| Green Park | 148.7 (11.6%) | Moor Park | 1.96 (7.2%) | Green Park | 497.7 (6.3%) |
| Warren Street | 142.9 (19.4%) | Southwark | 1.96 (11.9%) | Southwark | 497.1 (38.7%) |
| Holborn | 139.7 (10.8%) | Loughton | 1.91 (7.6%) | North Greenwich | 479.5 (22.1%) |
| St. James's Park | 139.0 (23.6%) | Greenford | 1.91 (8.4%) | Farringdon | 422.6 (13.4%) |
| Victoria | 130.3 (8.1%) | West Harrow | 1.90 (8.5%) | Tottenham Court Road | 373.4 (20.2%) |

Conclusions and Future Work

Incidents occur frequently in urban metro systems, causing delays, crowding and substantial loss of social welfare. Operators need accurate estimates of vulnerability measures to identify the bottlenecks in the network. We propose a novel causal inference framework to estimate station-level vulnerability metrics in urban metro systems and empirically validate it for the London Underground system. In contrast to previous simulation-based studies, which largely assume virtual incident scenarios and require unrealistic assumptions on passenger behaviour, our approach relies on real incident data and avoids behavioural assumptions by leveraging automated fare collection (smart card) data. We also illustrate that incidents can occur non-randomly, which further justifies the importance of the proposed causal inference framework in obtaining unbiased estimates of disruption impacts.

The proposed empirical framework consists of three stages. First, we apply propensity score matching methods and estimate unbiased disruption impacts at the station level. The estimated impacts are subsequently used to establish vulnerability metrics. In the last stage, for non-disrupted stations, we impute their vulnerability metrics by using the random forest regression model. We propose three empirical vulnerability metrics at station level: loss of travel demand, loss of average travel speed and loss of gross travel speed. The demand loss metric reflects the number of passengers who i) switched to other transport modes, ii) switched their departure time, trip origin or destination, or iii) ended their trip, before and after entering the disrupted metro system. In other words, it implies the demand for alternative transport services during disruptions, which can guide metro operators in preparing effective service replacement plans. The two speed-related metrics reflect the degradation in the level of service for passengers who still use the metro system under disruptions. These metrics provide essential information for service recovery to mitigate the adverse influence on passengers and the overall performance of stations.

The results of the case study of London Underground in 2013 indicate that the effect of service disruption is heterogeneous across metro stations and depends on the location of a station in the network and other station-level characteristics. In terms of the travel demand loss and gross speed loss (overall delay), the most affected stations are more likely to be found in central London areas, such as Oxford Circus, London Bridge and Canary Wharf. On the other hand, considering average speed loss (individual delay), the most affected stations are scattered around outer London areas (e.g., South Kenton and Chesham) due to lack of alternative routes. For metro operators, the demand metric is particularly useful when preparing service replacement plans, and the two speed loss metrics can help mitigate the adverse disruption impact on passengers.

This study addresses a research gap in measuring metro vulnerability from an empirical perspective. The empirically estimated vulnerability metrics are practical because they reveal the actual impact of real incidents, rather than virtually simulated disruptions. The proposed methodology to obtain unbiased estimates of disruption impact thus provides crucial information to metro operators for disruption management. It helps in identifying the bottlenecks in the network and in preparing targeted plans to evacuate ridership as well as to recover services in case of incidents. The direct integration of the estimated vulnerability metrics in preparing these targeted plans remains an avenue for future research.

It is worth noting that the proposed framework can be applied to other metro systems, conditional on the availability of the required data on incident logs and confounding characteristics, among others. Future empirical studies can also incorporate other context-specific and relevant confounders in their analysis. For example, they can include interchange ridership as a confounder in the propensity score estimation. We do not include this covariate in our LU case study because it cannot be directly derived from smart card data; rather, an advanced assignment algorithm is required to identify passengers’ routes by matching smart card data with vehicle location data.

Moreover, disruption impact estimates are probabilistic relative to the sample data, that is, causal estimates and vulnerability metric estimates have sampling distributions. Since our analysis is based on LU data from October 28 to December 13, 2013, the results of our case study reflect the vulnerability status of LU for this specific period. If we use data from other periods, the estimates of vulnerability metrics might change due to inherent temporal variations in travel demand and incidents. Therefore, to improve the generalisability of vulnerability metric estimates, the study period needs to be long enough for the sample to be representative of the population. That is, a sample should capture supply-side interruptions as much as possible, including service disruptions due to maintenance. In addition, the sample should also reflect the possible fluctuations of travel demand.

Future research has three potential directions. First, stations surrounding the disrupted stations may also be affected indirectly, but this study does not account for such spill-over effects. Modelling the spatiotemporal propagation of disruption impact requires significant methodological developments, which would be an important improvement over the current method. Second, the proposed vulnerability metrics can reveal static disruption impacts at different stations, but passengers need real-time service information to reschedule their trips. Thus, the current framework can be extended to update the vulnerability metrics dynamically. Considering the interaction between information provision and how it influences passengers’ decisions under disruptions, this advancement would improve the dissemination of incident alerts to passengers in real time. Finally, by merging data from other travel modes (e.g., bus) with metro datasets, we can estimate multi-modal vulnerability metrics in the same causal inference framework and understand the characteristics of the mode shift due to disruptions.

Acknowledgement

The authors are grateful for the support of Transport for London (TfL), the data provider of this research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Transport for London (TfL).

References

Adjetey-Bahun, K., Birregah, B., Châtelet, E. and Planchet, J.L., 2016. A model to quantify the resilience of mass railway transportation systems. Reliability Engineering & System Safety, 153, pp.1-14.

Ben-Akiva, M.E. and Lerman, S.R., 1985. Discrete choice analysis: theory and application to travel demand (Vol. 9). MIT press.

Berdica, K., 2002. An introduction to road vulnerability: what has been done, is done and should be done. Transport policy, 9(2), pp.117-127.

Brazil, W., White, A., Nogal, M., Caulfield, B., O'Connor, A. and Morton, C., 2017. Weather and rail delays: Analysis of metropolitan rail in Dublin. Journal of Transport Geography, 59, pp.69-76.

Chopra, S.S., Dillon, T., Bilec, M.M. and Khanna, V., 2016. A network-based framework for assessing infrastructure resilience: a case study of the London metro system. Journal of The Royal Society Interface

Physica A: Statistical Mechanics and its Applications, 389(17), pp.3678-3691.

Faturechi, R. and Miller-Hooks, E., 2015. Measuring the performance of transportation infrastructure systems in disasters: A comprehensive review. Journal of infrastructure systems, 21(1), p.04014025.

Graham, D.J., McCoy, E.J. and Stephens, D.A., 2014. Quantifying causal effects of road network capacity expansions on traffic volume and density via a mixed model propensity score estimator. Journal of the American Statistical Association, 109(508), pp.1440-1449.

Hastie, T., Tibshirani, R. and Friedman, J., 2009. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.

Imbens, G.W. and Rubin, D.B., 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.

Imbens, G.W. and Wooldridge, J.M., 2009. Recent developments in the econometrics of program evaluation. Journal of economic literature, 47(1), pp.5-86.

Jenelius, E., Petersen, T. and Mattsson, L.G., 2006. Importance and exposure in road network vulnerability analysis. Transportation Research Part A: Policy and Practice, 40(7), pp.537-560.

Jiang, R., Lu, Q.C. and Peng, Z.R., 2018. A station-based rail transit network vulnerability measure considering land use dependency. Journal of Transport Geography, 66, pp.10-18.

Kusakabe, T. and Asakura, Y., 2014. Behavioural data mining of transit smart card data: A data fusion approach. Transportation Research Part C: Emerging Technologies

Physica, Heidelberg.

Mattsson, L.G. and Jenelius, E., 2015. Vulnerability and resilience of transport systems–A discussion of recent research. Transportation Research Part A: Policy and Practice, 81, pp.16-34.

M’cleod, L., Vecsler, R., Shi, Y., Levitskaya, E., Kulkarni, S., Malinchik, S. and Sobolevsky, S., 2017. Vulnerability of Transportation Networks: The New York City Subway System under Simultaneous Disruptive Events. Procedia computer science, 119, pp.42-50.

Melo, P.C., Harris, N.G., Graham, D.J., Anderson, R.J. and Barron, A., 2011. Determinants of delay incident occurrence in urban metros. Transportation research record

Transportation Research Part C: Emerging Technologies, 19(4), pp.557-568.

Rodríguez-Núñez, E. and García-Palomares, J.C., 2014. Measuring the vulnerability of public transport networks. Journal of transport geography, 35, pp.50-63.

Rosenbaum, P.R. and Rubin, D.B., 1983. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), pp.41-55.

Rubin, D.B., 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology, 66(5), p.688.

Rubin, G.J., Brewin, C.R., Greenberg, N., Simpson, J. and Wessely, S., 2005. Psychological and behavioural reactions to the bombings in London on 7 July 2005: cross sectional survey of a representative sample of Londoners. Bmj, 331(7517), p.606.

Silva, R., Kang, S.M. and Airoldi, E.M., 2015. Predicting traffic volumes and estimating the effects of shocks in massive transportation systems. Proceedings of the National Academy of Sciences, 112(18), pp.5643-5648.

Sun, D.J. and Guan, S., 2016. Measuring vulnerability of urban metro network from line operation perspective. Transportation Research Part A: Policy and Practice, 94, pp.348-359.

Sun, D.J., Zhao, Y. and Lu, Q.C., 2015. Vulnerability analysis of urban rail transit networks: A case study of Shanghai, China. Sustainability, 7(6), pp.6919-6936.

Sun, H., Wu, J., Wu, L., Yan, X. and Gao, Z., 2016. Estimating the influence of common disruptions on urban rail transit networks. Transportation Research Part A: Policy and Practice, 94, pp.62-75.

Sun, L., Huang, Y., Chen, Y. and Yao, L., 2018. Vulnerability assessment of urban rail transit based on multi-static weighted method in Beijing, China. Transportation Research Part A: Policy and Practice

Accident Analysis & Prevention, 82, pp.90-100.

Wikimedia Commons, 2013. London Underground with Greater London map. Available from: https://commons.wikimedia.org/wiki/File:London_Underground_with_Greater_London_map.svg [Accessed 19th May 2020].

Willmott, C.J. and Matsuura, K., 2005. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate research, 30(1), pp.79-82.

Yang, Y., Liu, Y., Zhou, M., Li, F. and Sun, C., 2015. Robustness assessment of urban rail transit based on complex network theory: A case study of the Beijing Subway. Safety science, 79, pp.149-162.

Yap, M. and Cats, O., 2020. Predicting disruptions and their passenger delay impacts for public transport stops. Transportation, pp.1-29.

Ye, Q. and Kim, H., 2019. Assessing network vulnerability of heavy rail systems with the impact of partial node failures. Transportation, 46(5), pp.1591-1614.

Zhang, D.M., Du, F., Huang, H., Zhang, F., Ayyub, B.M. and Beer, M., 2018. Resiliency assessment of urban rail transit networks: Shanghai metro as an example. Safety Science, 106, pp.230-243.

Zhang, J., Wang, S. and Wang, X., 2018. Comparison analysis on vulnerability of metro networks based on complex network. Physica A: Statistical Mechanics and its Applications, 496, pp.72-78.

Zhang, J., Xu, X., Hong, L., Wang, S. and Fei, Q., 2011. Networked analysis of the Shanghai subway network, in China. Physica A: Statistical Mechanics and its Applications, 390(23-24), pp.4562-4570.

Zhang, X., Deng, Y., Li, Q., Skitmore, M. and Zhou, Z., 2016. An incident database for improving metro safety: The case of shanghai. Safety science, 84, pp.88-96.

Zhu, S., Masud, H., Xiong, C., Yang, Z., Pan, Y. and Zhang, L., 2017. Travel Behavior Reactions to Transit Service Disruptions: Study of Metro SafeTrack Projects in Washington, DC. Transportation Research Record, 2649(1), pp.79-88.