Deep Learning for Flight Demand and Delays Forecasting
Liya Wang, Amy Mykityshyn, Craig Johnson, Benjamin D. Marple
Liya Wang, Amy Mykityshyn, Craig Johnson, The MITRE Corporation, McLean, VA, 22102, United States
Benjamin D. Marple, Federal Aviation Administration
The last few years have seen an increased interest in deep learning (DL) due to its success in applications such as computer vision, natural language processing (NLP), and self-driving cars. Inspired by this success, this paper applies DL to predict flight demand and delays, which have been a concern for airlines and the other stakeholders in the National Airspace System (NAS). Demand and delay prediction can be formulated as a supervised learning problem where, given an understanding of past demand and delays, a deep learning network examines sequences of historical data to predict current and future sequences. With that in mind, we applied a well-known DL method, sequence to sequence (seq2seq), to solve the problem. Our results show that the seq2seq method can reduce demand prediction mean squared error (MSE) by 50% compared to two classical baseline algorithms.
I. Nomenclature

𝑿 = exogenous time series vector
𝑭 = factors feature vector
𝒚 = response variable
𝒕 = time step
𝒑 = time lag
𝒏 = look-ahead time
𝒇 = function between input and output

II. Introduction
Flight demand and delays have been a concern for airlines and the Federal Aviation Administration (FAA). Surface demand is comprised of aircraft from scheduled commercial flights and unscheduled General Aviation (GA) operations. Commercial flights are scheduled months in advance and their schedule is shared between the airlines, the airport, and the FAA, whereas GA flights’ schedules are more flexible and, therefore, less predictable. For airports that experience large increases in GA traffic due to special public events (e.g., professional sports games) or airports that have a consistently high percentage of GA traffic (e.g., VNY, TEB), it can be quite challenging for FAA traffic managers to strategically manage expected demand.

[Author footnotes: Data Scientist, Lead, Department of Operation Performance; Aviation Systems Engineering, Lead, Department of Safety Intelligence; Concepts and Evolution Group Leader, Department of NAS Future Vision & Research; Operations Research Analyst, Surveillance Branch, ANG-C52]

The MITRE Corporation has developed Pacer [1], a mobile application, to support the FAA in efficient surface management. Pacer displays future departure demand information to GA pilots and then collects pilots’ intended departure times, which are subsequently incorporated to update departure demand. Field tests have shown that the current Pacer demand prediction model still has room for improvement; this research is one of several efforts to support the mission in this regard. The current implementation of Pacer relies on a rule-based method to forecast departure demand. However, the latest trend in forecasting techniques has shifted from rule-based methods to intelligent deep learning (DL) methods. There are some noticeable differences between traditional rule-based methods and machine learning (ML) methods, as shown in Fig. 1.
In traditional rule-based methods, experts from the domain define the rules, and then the computer makes predictions according to those defined rules. However, in ML, the rules are learned from data and are applied to make predictions.
Fig. 1. Traditional programming vs. machine learning methods comparison
There are some obvious deficiencies with the rule-based methods. For example, when the situation is extremely complicated (e.g., image classification for various types of cats vs. dogs), humans often cannot enumerate all the rules correctly. Table 1 provides the shortcomings of rule-based methods in Pacer and the benefits DL can bring to fill in those gaps.
Table 1. Comparison of rule-based methods and DL methods Deficiencies of Rule-based Methods Promises of Proposed DL Methods • Rules may not accurately describe reality • Rules may not be complete • There is no learning ability • Rules cannot be easily extended to other airports • Few factors are considered • Rules will be automatically learned from historical data • Rules can be adaptively upgraded with latest available training data • Rules can be easily extended to multiple airports • Multiple factors such as runway configurations, weather, seasons, and events (e.g., sports, conferences, pandemic) can be included
Aviation demand and delay problems can be thought of as a set of sequential and temporal relationships. Because of this nature, our research adopted a DL method termed sequence-to-sequence (seq2seq), designed to extrapolate from historical sequences to future sequential conditions. Intuitively, the delay states of previous hours’ flights may affect subsequent hours’ flight delays. ML has proven to be a powerful tool for finding patterns in complex data and has been used successfully across multiple domains such as natural language processing (NLP), computer vision, self-driving cars, playing the complex game of Go, and time series forecasting. In our daily routine, we also enjoy the benefits and convenience that artificial intelligence (AI) has brought us (e.g., Google Translate, email spam filters, smart assistants on mobile phones, text-to-speech natural language processing). Moreover, aviation researchers have also begun to adopt ML to advance the domain. For example, researchers have applied these techniques to predict unstable approach risk at a specific distance from the runway threshold [2], predict air travel demand for a representative city pair [3], predict flight delays [4], and predict taxi-out times at Charlotte Airport [5]. The remainder of the paper examines the application of the seq2seq methodology to demand and delay prediction and is organized as follows: Section III gives a short introduction to the data sources used in the research; Section IV describes our problem formulations; Section V provides details of the deep learning modeling process; the results are shown in Section VI; and we conclude and describe next steps in Section VII.
III. Data Source
The FAA keeps databases of operations and performance data. The Aviation System Performance Metrics (ASPM) database is one such source. ASPM integrates multiple sources of flight data and can provide information about flights arriving at and departing from an airport [6]. Therefore, it was selected as our primary data source for predicting demand and delays. ASPM datasets were used to provide the quarter-hour aggregated demand, delays, runway configuration, and weather information needed for forecasting. Table 2 explains the key data items used in our study [7]. In particular, quarter-hour departure demand (DEPDEMAND) and average taxi-out delay (DLATOA) are our targeted forecasting objectives.
Table 2. ASPM airport quarter-hour data dictionary

Column Name       Description
Slice_Start_Loc
DepDemand         Number of aircraft intending to depart for the period
ArrDemand         Number of aircraft intending to arrive for the period
ADR               Airport-supplied departure rate
AAR               Airport-supplied arrival rate
DLATOA            Average taxi-out delay in minutes
RWYCONF           Airport-supplied runway configuration (arrival | departure)
CEILING           Ceiling in hundreds of feet
Visibility        Visibility in statute miles
IV. Problem Formulation
Flight departure demand and taxi-out delay forecasting problems can be formulated as multivariate multi-step time series forecasting problems (Eqs. 1-3). The time lag (𝑝) was learned from the data, and the look-ahead time (𝑛) was determined by the needs of Pacer. The current look-ahead time was set to 31 hours, in consideration of the fact that ASPM data is updated at 7 a.m. daily. In addition, the following input variables were set up to model demand and delays, respectively (Fig. 2 and Fig. 3).

y_t     = f(y_{t-1}, y_{t-2}, …, y_{t-p}, X_{t-1}, X_{t-2}, …, X_{t-p}, F_t)        (1)
y_{t+1} = f(y_t, y_{t-1}, …, y_{t-p}, X_{t-1}, X_{t-2}, …, X_{t-p}, F_{t+1})        (2)
⋮
y_{t+n} = f(y_{t+n-1}, …, y_{t-p}, X_{t-1}, X_{t-2}, …, X_{t-p}, F_{t+n})           (3)

Fig. 2. Demand forecasting inputs and outputs
Fig. 3. Delays forecasting inputs and outputs
[Fig. 2 contents: Y = departure demand history; X (exogenous series) = arrival demand, scheduled departure demand; F (factors) = event_flag, Quarter_Hour_Index, hour, day of week, month.]

[Fig. 3 contents: Y = taxi-out delay history; X (exogenous series) = arrival demand, departure demand, taxi-in delay; F (factors) = event_flag, ceiling, visibility, runway configuration, Quarter_Hour_Index, hour, day of week, month.]

Currently the factor values (see Figs. 3 and 4) are derived from our ASPM data source. For example, event_flag is a binary variable (1 or 0) derived from box-plot outliers of quarter-hour departure demand (Fig. 4): when quarter-hour departure demand is a right-side outlier, event_flag for that quarter hour is set to 1; otherwise, it is set to 0. Table 3 shows some sample data that was used to train our models. If the trained model is deployed in the field and factors such as event_flag and runway configuration cannot be obtained directly from historical data, we will design new ways to obtain them. For example, we can use the FAA’s System Wide Information Management (SWIM) live data [8] to get the runway configuration and the demand data from which to derive event_flag.
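The box-plot outlier rule described above can be sketched in plain Python. The function names and the standard 1.5 × IQR whisker multiplier are illustrative assumptions, not the exact Pacer implementation:

```python
# Hypothetical sketch of the event_flag derivation: a quarter hour is flagged as
# an "event" when its departure demand is a right-side (upper) box-plot outlier,
# i.e., above Q3 + 1.5 * IQR.

def quantile(sorted_vals, q):
    """Linear-interpolation quantile of a pre-sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = int(pos), min(int(pos) + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def event_flags(demand):
    """Return a 0/1 flag per quarter hour marking upper box-plot outliers."""
    s = sorted(demand)
    q1, q3 = quantile(s, 0.25), quantile(s, 0.75)
    upper_fence = q3 + 1.5 * (q3 - q1)   # right whisker of the box plot
    return [1 if d > upper_fence else 0 for d in demand]

# Mostly typical demand with one surge (e.g., a special public event)
demand = [10, 12, 11, 9, 13, 10, 12, 11, 40, 10]
flags = event_flags(demand)   # only the surge quarter hour is flagged
```
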
Fig. 4. Event_flag defined by box-plot outliers

Table 3. Processed sample data to train DL models
After properly formulating the problems, we can deploy deep learning methods to find the function 𝑓. With that, we made multi-step predictions with a recursive strategy, where the prediction for the prior time step is used as an input for making a prediction on the following time step [9]. Specifically, we used the function 𝑓 to predict the value at time 𝑡, and this prediction was subsequently used as an observed input to predict the value at time 𝑡 + 1.

V. Deep Learning for Time Series Forecasting
Solving real-world time series forecasting problems is challenging. The challenges are multifaceted and include, for example, properly selecting exogenous variables, accounting for external factors (e.g., seasons, events), and the need to perform the same type of prediction for multiple physical sites, such as different airports in our case [10]. To tackle difficult time series forecasting problems, the latest research has turned to DL. There are many benefits to the use of DL, such as the ability to handle multiple exogenous variables with complex dependencies, the ability to identify the multifaceted relationship between input and output variables, the ability to learn and adapt, and the ability to extend to multiple sites with ease. Our research sought to apply DL to predict aviation flight departure demand and delays for Pacer. Fig. 5 depicts the designed DL model training architecture for Pacer’s demand and delays forecasting.
Fig. 5. Deep learning architecture for Pacer’s demand and delay forecasting

A. Feature Preparation
When we train ML models, feature preparation plays an essential role. Simply stated, feature preparation transforms raw data into the format required by the ML algorithm and improves model performance. Domain knowledge and feature engineering knowledge are important components of this step. We designed a five-stage feature processing procedure (Fig. 6) to transform raw time series data into the sequence data required by our sequence-to-sequence (seq2seq) model, which is described in the following section.
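The final "series to sequence" stage can be illustrated with a small sketch: sliding a window over the (already scaled) series to pair a past window of length p with a future window matching Eqs. (1)-(3). The function name and window sizes are illustrative assumptions, not the exact Pacer implementation:

```python
# Sketch of windowing a univariate series into supervised (past, future) pairs.

def series_to_sequences(series, p, n):
    """Split a series into (encoder input, decoder target) pairs.

    Each sample pairs [y_{t-p}, ..., y_{t-1}] with [y_t, ..., y_{t+n}].
    """
    samples = []
    for t in range(p, len(series) - n):
        past = series[t - p:t]          # encoder input: p lagged observations
        future = series[t:t + n + 1]    # decoder target: now through t + n
        samples.append((past, future))
    return samples

series = list(range(10))                # stand-in for scaled quarter-hour demand
pairs = series_to_sequences(series, p=4, n=2)
# first sample: past = [0, 1, 2, 3], future = [4, 5, 6]
```
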
Fig. 6. Features processing procedure for Pacer demand and delays forecasting (raw time series data → feature creation → data scaling and transformation → train/test split → series-to-sequence data)

B. Sequence-to-sequence (seq2seq) Method
Several DL techniques have been proposed for time series problems, and seq2seq is one of the few proven to be effective. Seq2seq was originally designed by Google for machine translation problems and has achieved success in tasks like machine translation, text summarization, and image captioning. One famous example is Google Translate, which started using such a model in production in late 2016. These models are explained in two pioneering papers ([11], [12]). Inspired by the success of seq2seq in the NLP field, researchers have explored seq2seq for time series forecasting problems because they have a similar problem structure (e.g., [13]). Seq2seq can achieve better performance than traditional statistical time series forecasting methods, which usually impose strong restrictions and assumptions. As the name suggests, seq2seq takes an input sequence (e.g., sentences in English) and generates an output sequence (e.g., the same sentences translated to Chinese). It does so using a recurrent neural network (RNN). Two commonly used types of RNN cell are Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), because they can better handle the vanishing gradient problem [14], which can prevent the whole neural network from being trained properly. Seq2seq is made up of two parts, the encoder and the decoder, and is sometimes referred to as the Encoder-Decoder Network (Fig. 7).
• Encoder: uses deep neural network layers to convert the input sequence into a corresponding hidden vector, which serves as the initial state of the first recurrent layer of the decoder.
• Decoder: takes as input a) the hidden vector generated by the encoder, b) its own hidden states, c) its own current output, and d) the other factors 𝐹, and produces the next hidden vector, from which it predicts the next output.
Specifically, our input sequence is the time series data [X_{t-p}, X_{t-p+1}, …, X_{t-1}] comprising previous time steps, and our output sequence is the present and future predictions [Y_t, Y_{t+1}, …, Y_{t+n}]. In our research, we found that LSTM cells performed better than GRU cells. Therefore, we decided to proceed with LSTM in our seq2seq model building.

Fig. 7. Seq2seq architecture for our time series forecasting
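The encoder-decoder flow can be made concrete with a toy forward pass. This sketch uses plain tanh RNN cells instead of LSTM, scalar inputs, and tiny fixed weights; all weights are arbitrary illustrations of what a real model would learn. Note how the decoder feeds each prediction back in as the next input, matching the recursive strategy of Section IV:

```python
# Toy numeric illustration of the encoder-decoder (seq2seq) forward pass.
import math

HIDDEN = 3
W_IN  = [0.5, -0.3, 0.2]      # input-to-hidden weights
W_REC = [0.1, 0.4, -0.2]      # (diagonal) hidden-to-hidden weights
W_OUT = [0.6, -0.1, 0.3]      # hidden-to-output weights

def step(x, h):
    """One simple RNN step: h'_j = tanh(w_in_j * x + w_rec_j * h_j)."""
    return [math.tanh(W_IN[j] * x + W_REC[j] * h[j]) for j in range(HIDDEN)]

def seq2seq_forecast(history, horizon):
    # Encoder: fold the input sequence into a final hidden state.
    h = [0.0] * HIDDEN
    for x in history:
        h = step(x, h)
    # Decoder: start from the encoder state; feed each output back as input.
    y = history[-1]
    outputs = []
    for _ in range(horizon):
        h = step(y, h)
        y = sum(W_OUT[j] * h[j] for j in range(HIDDEN))
        outputs.append(y)
    return outputs

preds = seq2seq_forecast([0.2, 0.4, 0.6, 0.8], horizon=3)
```

A trained LSTM seq2seq model has gated cells and learned weight matrices rather than these hand-picked scalars, but the encode-then-unroll structure is the same.
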
VI. Results
This section presents our modeling results. We selected Dallas/Fort Worth International Airport (DFW) and Las Vegas McCarran International Airport (LAS) for case studies: DFW represents a busy airport that serves as a hub for a major airline, with a small but consistent number of GA operations, while LAS represents a busy airport with a large number of airline operations as well as a significant number of GA operations that can surge around certain events. To better understand how DL would improve forecasting performance, we also chose two baseline algorithms, linear regression (LR) [15] and vector autoregression (VAR) [16], for comparison. The following sub-sections give a short introduction to the baseline algorithms and present all models’ prediction results.

A. Baseline Algorithms

A.1 Linear Regression
In statistics, linear regression is used to model the relationship between a scalar response and one or more explanatory variables. The simplest case, a single scalar predictor variable x and a single scalar response variable y, is known as simple linear regression (see Fig. 8). The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also called multivariable linear regression. Nearly all real-world regression models are multiple regression models; note, however, that in these cases the response variable y is still a scalar. Another term, multivariate linear regression, refers to cases where y is a vector [15].

Fig. 8. Example of simple linear regression, which has one independent variable [15]
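As a minimal sketch of this baseline's form, simple linear regression can be fit in closed form by ordinary least squares. This is only meant to illustrate the one-predictor case y = a + b·x, not the multivariable setup actually used for forecasting:

```python
# Closed-form ordinary least squares for simple linear regression.

def fit_simple_lr(xs, ys):
    """Return intercept a and slope b minimizing sum((y - (a + b*x))^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    b = cov / var                 # slope
    a = mean_y - b * mean_x       # intercept
    return a, b

# Noise-free example: y = 1 + 2x is recovered exactly.
a, b = fit_simple_lr([0, 1, 2, 3], [1, 3, 5, 7])
```
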
A.2 Vector Autoregression (VAR)
Vector autoregression (VAR) is a stochastic process model designed for time series forecasting. It can capture the linear interdependencies among multiple time series. VAR models generalize the univariate autoregressive (AR) model to multiple input time series variables. All variables in a VAR enter the model in the same way: each variable has an equation explaining its evolution based on its own lagged values, the lagged values of the other model variables, and an error term (Eq. 4). Unlike the DL and LR methods, VAR cannot incorporate factor variables. The only prior knowledge required is a list of variables which can be hypothesized to affect each other intertemporally [16].
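This one-equation-per-variable structure can be sketched as a single prediction step of a first-order VAR with two series. The coefficient values below are invented for illustration (and the error term is dropped); in practice they are estimated from the training data:

```python
# Toy one-step prediction for a first-order VAR with k = 2 series:
# y_t = c + A @ y_{t-1}, where each row of A says how one series depends on
# both series' lagged values.

def var1_step(y_prev, c, A):
    """One VAR(1) prediction step (error term omitted)."""
    k = len(y_prev)
    return [c[i] + sum(A[i][j] * y_prev[j] for j in range(k)) for i in range(k)]

c = [1.0, 0.5]                      # intercepts
A = [[0.5, 0.1],                    # coefficients on lagged values
     [0.2, 0.4]]
y_prev = [10.0, 20.0]               # e.g., two demand series at t-1
y_next = var1_step(y_prev, c, A)    # [1 + 5 + 2, 0.5 + 2 + 8] = [8.0, 10.5]
```

Multi-step forecasts are then produced recursively by feeding y_next back in as the next y_prev.
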
y_t = c + A y_{t-1} + e_t        (4)

where y_t is the vector of observations, c is a constant vector (the intercepts), A is a time-invariant (k × k) coefficient matrix, and e_t is the error term.

B. Train and Test Data Split
In ML, the dataset is commonly split into two sets: training and test datasets. For example, in Fig. 9, the red dots represent our fourteen months of DFW training data (1/1/2019-2/28/2020), and the blue dots represent our one month of DFW testing data (3/1/2020-3/31/2020). Two things stand out in the data. First, 3/13/2020 is an abnormal day, with very high departure demand reaching up to 87 flights in a quarter hour; during the 14-month training period, departure demand never reached such high levels. Second, from 3/20/2020, COVID-19 caused the departure demand to decrease greatly, which is also a situation that never occurred during our training period. These two observations can help us understand why the trained models may not perform well during those times.
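The chronological split described above can be sketched as follows. The record layout is a stand-in for the processed ASPM samples, not the actual pipeline code:

```python
# Date-based train/test split: rows before the cutoff form the training set,
# rows on or after the cutoff form the test set.
from datetime import date

def train_test_split_by_date(rows, cutoff):
    """Split (date, value) rows chronologically at a cutoff date."""
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test

rows = [
    (date(2019, 1, 1), 12),
    (date(2020, 2, 28), 15),
    (date(2020, 3, 1), 14),   # first test-period day
    (date(2020, 3, 31), 6),
]
train, test = train_test_split_by_date(rows, cutoff=date(2020, 3, 1))
# train holds the 2019-01-01..2020-02-28 rows; test holds the March 2020 rows
```

A chronological split (rather than a random shuffle) is essential for time series, since the model must be evaluated on data from after its training period.
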
Fig. 9. DFW train/test data split

C. DFW Departure Demand Forecasting Results
Fig. 10 shows an example of DFW quarter-hour departure demand forecasting results on a normal day. The red line represents the true departure demand, and the dashed green line is the forecasted demand. From the graph, it can be seen that the seq2seq model did a good job on normal days. By normal, we mean that the daily demand follows the common patterns in the training dataset; 3/13, by contrast, is an abnormal day, with very high demand of up to 87 departures in a quarter hour.
Fig. 10. DFW quarter-hour departure demand forecasting results on a normal day
Fig. 11 compares the three algorithms’ forecasted quarter-hour departure demand at DFW airport (dark blue, light blue, green) and the 5-year (2015-2019) average departure demand (black) with the true demand (red). According to Fig. 11, VAR (light blue) and the simple average method (black) did not account for the sudden demand increase on 3/13/2020. Table 4 lists the evaluation metrics, mean squared error (MSE), mean absolute error (MAE), and the explained variance score, for these four methods. For MSE and MAE, lower values are better; for the explained variance score, a higher value is better. As shown in Table 4, the seq2seq model achieves the best performance across all three metrics. Compared to the other methods, seq2seq reduces MSE by over 50%.
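For clarity, the three evaluation metrics reported in Table 4 can be written out directly (these match the usual definitions, e.g., as implemented in scikit-learn):

```python
# Evaluation metrics: MSE and MAE (lower is better), explained variance
# (higher is better, 1.0 is a perfect fit).

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def explained_variance(y_true, y_pred):
    """1 - Var(y_true - y_pred) / Var(y_true)."""
    n = len(y_true)
    resid = [t - p for t, p in zip(y_true, y_pred)]
    mean_r = sum(resid) / n
    var_r = sum((r - mean_r) ** 2 for r in resid) / n
    mean_t = sum(y_true) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    return 1.0 - var_r / var_t

y_true = [3.0, 5.0, 4.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
# mse(y_true, y_pred) = 0.3125; mae(y_true, y_pred) = 0.375
```
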
Fig. 11. DFW models quarter-hour departure demand forecasting results comparison

Table 4. DFW models evaluation results for quarter-hour demand forecasting
Additionally, we aggregated the quarter-hour forecasted demand into hourly and daily forecasts. Fig. 12 and Fig. 13 show the comparison of the hourly and daily forecasting results. According to the evaluation metrics, seq2seq still performs the best. In comparison, VAR performs the worst for daily demand prediction, forecasting an almost constant daily demand for the whole testing period, which further indicates that factors such as unusual events play important roles in the prediction. Note that all four methods over-forecasted departure demand once traffic demand was suppressed by the COVID-19 pandemic. This is to be expected, since these methods can only reproduce sequences based on the historical data on which they were trained and cannot predict entirely novel departure demand sequences. Similarly, on 3/13/2020 the demand was so high, beyond anything observed during the training period, that the forecasted values did not match the true values well.
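The aggregation just described amounts to summing consecutive chunks of the quarter-hour forecasts, four per hour (or 96 per day). A small sketch:

```python
# Sum consecutive chunks of quarter-hour values to obtain hourly (chunk=4)
# or daily (chunk=96) totals.

def aggregate(values, chunk):
    """Sum consecutive groups of `chunk` values."""
    return [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]

quarter_hour_demand = [3, 5, 4, 6,   2, 2, 3, 1]    # two hours of forecasts
hourly = aggregate(quarter_hour_demand, chunk=4)    # [18, 8]
```
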
Fig. 12. DFW hourly departure demand forecasting results comparison

Fig. 13. DFW daily departure demand forecasting results comparison

D. DFW Taxi-out Delay Forecasting Results
Fig. 14 shows DFW quarter-hour average taxi-out delay forecasting results on a normal day. In ASPM, average taxi-out delay is calculated with Eq. 5. During non-peak times, the counts of delayed flights are very small, which can introduce noise into the delay data and make predictions more difficult.
Fig. 14. DFW quarter-hour average taxi-out delay forecasting results on a normal day

DLATOA = total taxi-out delays / number of delayed flights        (5)

E. LAS Departure Demand Forecasting Results
The DL demand forecasting model was also trained for LAS. Fig. 15 presents an example of LAS quarter-hour departure demand forecast results on a normal day. Table 5 lists the evaluation metrics of the four methods for quarter-hour demand forecasting. The results show that our DL model, seq2seq, still performs the best when compared to the other models, improving MSE by 30%-40% over the baseline models. Fig. 16 (hourly demand forecasting) and Fig. 17 (daily demand forecasting) also demonstrate that seq2seq achieves the best performance. Although seq2seq still performs best at LAS, comparing the prediction results for DFW (see Fig. 12) with those for LAS (see Fig. 17), we can see that the LAS predictions do not perform as well as those of the DFW models. We attribute this difference to the challenge that large numbers of unscheduled GA operations at LAS present when it comes to predicting demand.
Fig. 15. LAS quarter-hour departure demand forecasting results on a normal day

Table 5. LAS four models’ evaluation results for quarter-hour demand forecasting

Fig. 16. LAS hourly demand forecasting results
Fig. 17. LAS daily demand forecasting results

F. LAS Taxi-out Delays Forecasting Results
Fig. 18 shows LAS quarter-hour average taxi-out delay forecasting results on a normal day. Although the forecast does not exactly follow the true trendline, the seq2seq algorithm still captures the sharp vertical movements of the trend.
Fig. 18. LAS quarter hour average taxi-out delays forecasting results on a normal day
In summary, the two case studies of DFW and LAS demonstrate that the DL method, seq2seq, achieves better performance than the two baseline models.
VII. Conclusions and Future Work
This research has explored a cutting-edge DL technique, seq2seq, to forecast flight departure demand and taxi-out delays. The results have highlighted that the seq2seq method can achieve much better performance (e.g., a 50% decrease in MSE) than the baseline algorithms of LR and VAR. In addition, DL can flexibly train models for multiple sites, as demonstrated by training separate models for two airports, DFW and LAS. We recommend additional research in three areas: 1) retrain the models to account for the pandemic period, 2) explore more advanced DL techniques to improve model performance, and 3) bring in arrival prediction services. Regarding more advanced DL techniques, we recommend exploring the use of a transformer model and the application of transfer learning. Although the seq2seq model has demonstrated much better performance than traditional forecasting methods for our problems, it still has certain limitations [17]:
• The encoder converts the entire input sequence into a fixed-length vector, and then the decoder predicts the output sequence. This works only for short sequences, since the decoder relies on that single vector for the entire prediction.
• A problem therefore emerges with long sequences: it is difficult for the encoder to compress a long sequence into a fixed-length vector.
The transformer model can be used to overcome this problem of long sequences and may provide additional performance improvements. The transformer model represents one of the latest technological advancements in DL research, and it uses a self-attention mechanism [18]. In addition, transfer learning techniques can be applied to improve modeling efficiency when models are trained for multiple airports [19].
Acknowledgments
We thank the following MITRE colleagues: Paul Diffenderfer, Kevin Long, Joey Menzenski, Dr. Ronald Chong, Dr. Travis Gaydos, Diane Baumgartner, Caroline Abramson, Emily Stelzer, Erik Vargo, Dr. Alex Tien, Brennan Haltli, and Suzanne Porter for their valuable discussions and insights.
NOTICE
This work was produced for the U.S. Government under Contract DTFAWA-10-C-00080 and is subject to Federal Aviation Administration Acquisition Management System Clause 3.5-13, Rights In Data-General, Alt. III and Alt. IV (Oct. 1996). The contents of this document reflect the views of the author and The MITRE Corporation and do not necessarily reflect the views of the Federal Aviation Administration (FAA) or the Department of Transportation (DOT). Neither the FAA nor the DOT makes any warranty or guarantee, expressed or implied, concerning the content or accuracy of these views. For further information, please contact The MITRE Corporation, Contracts Management Office, 7515 Colshire Drive, McLean, VA 22102-7539, (703) 983-6000. © Approved for Public Release; Distribution Unlimited. PRS Case 20-3083.
References

[1] MITRE, "Pacer – Departure Readiness," [Online]. Available: https://sites.mitre.org/mobileaviationresearch/pacer-original/. [Accessed 22 Oct 2020].
[2] Z. Wang, S. Lance and J. Shortle, "Improving the Nowcast of Unstable Approaches," in Integrated Communications, Navigation, Surveillance (ICNS), Dulles, VA, 2016.
[3] A. Maheshwari, N. Davendralingam and D. DeLaurentis, "A Comparative Study of Machine Learning Techniques for Aviation Applications," in Aviation Technology, Integration, and Operations Conference, Atlanta, GA, 2018.
[4] Y. J. Kim, S. Choi, S. Briceno and D. Mavris, "A deep learning approach to flight delay prediction," in IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), Sacramento, CA, 2016.
[5] H. Lee and W. Malik, "Taxi-Out Time Prediction for Departures at Charlotte Airport Using Machine Learning Techniques."
[11] I. Sutskever, O. Vinyals and Q. V. Le, "Sequence to Sequence Learning with Neural Networks," in Advances in Neural Information Processing Systems, Montréal, Canada, 2014.
[12] K. Cho, B. V. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk and Y. Bengio, "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation," in Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014.
[13] W. Wang, "The Amazing Effectiveness of Sequence to Sequence Model for Time Series," [Online]. Available: https://weiminwang.blog/2017/09/29/multivariate-time-series-forecast-using-seq2seq-in-tensorflow/. [Accessed 1 May 2020].
[14] S. Madhu, "Chapter 10: DeepNLP - Recurrent Neural Networks with Math," 10 Jan 2018. [Online]. Available: https://medium.com/deep-math-machine-learning-ai/chapter-10-deepnlp-recurrent-neural-networks-with-math-c4a6846a50a2. [Accessed 8 Aug 2020].
[15] Wikipedia, "Linear Regression," [Online]. Available: https://en.wikipedia.org/wiki/Linear_regression.
[18] A. Vaswani et al., "Attention Is All You Need," in Advances in Neural Information Processing Systems, Long Beach, CA, 2017.
[19] Wikipedia, "Transfer learning," [Online]. Available: https://en.wikipedia.org/wiki/Transfer_learning.