Integrated Dataset of Brazilian Flights
Claudio Teixeira, Lucas Giusti, Jorge Soares, Joel dos Santos, Glauco Amorim, Eduardo Ogasawara
Affiliation (all authors): CEFET/RJ
March 1, 2021

Abstract
The Brazilian commercial aviation system achieved the first position among Latin American countries and the fifteenth place worldwide on the Revenue Passenger-Kilometer (RPK) ranking. The availability of flight data, including flight information and meteorological conditions, enables studies about the Brazilian flight system, such as flight delays and timetabling. This paper contributes to such studies by offering an integrated dataset containing departure and arrival data for flights departing from and arriving at Brazilian airports from 2000 to 2019. The dataset is composed of 15 , , 922 records of flight data, each containing 45 attributes. The attributes include data regarding the airline, flight, airports, meteorological conditions, and scheduled and elapsed times for departure and arrival.

Keywords: Flight delays · Commercial aviation · Brazilian system
The Brazilian commercial aviation system contains more than one hundred airports. It transported 95.9 million revenue passengers during 2014 and achieved the first position among Latin American countries and the fifteenth place worldwide on the Revenue Passenger-Kilometer (RPK) ranking (3). The commercial aviation network in Brazil is organized around regional hubs rather than airline hubs. The main reasons are the Brazilian territorial extension and the fact that few Brazilian states have more than one major airport. One exception to this rule is Campinas (in the state of São Paulo), where the airline Azul holds 77% of its commercial flights. Besides, due to the market deregulation instituted in 2005, the Brazilian commercial aviation system experienced significant changes in its players, leading to changes in market share and flight availability.

The National Civil Aviation Agency (ANAC) is responsible for regulating and supervising Brazilian civil aviation activities. Since 2000, ANAC has kept track of departure and arrival data for Brazilian flights in its Active Regular Flight (VRA) dataset (1). The data available in VRA are registered by the airlines and consolidated by ANAC. VRA contains data about each flight stage, i.e., the steps an aircraft performs from its takeoff to the next landing. These steps are established regardless of where the object of transport has been loaded or unloaded. For each flight step, VRA provides data such as airline, flight number, type (such as international, domestic, and cargo), class (such as regular, extra, charter, and instruction), airports, and scheduled and elapsed times for departure and arrival. ANAC publishes VRA data monthly on its webpage.

arXiv [stat.AP] preprint - March 1, 2021

VRA enables studying the Brazilian commercial aviation system. Examples of such studies are flight delay patterns (7) and their prediction (4,5). Although meteorological conditions play an essential role in analyzing flight information, such data are not present in VRA. Thus, this paper presents a dataset that integrates Brazilian flight data. It fuses all monthly data available in VRA and enriches them with meteorological data from the ASOS (Automated Surface Observing Systems) dataset (2) provided by Iowa State University in the USA. ASOS contains weather sensor data from airports around the world. Throughout the data integration, data cleaning and data preprocessing techniques were also applied to improve data quality.
According to the flight regulation of ANAC, commercial airline companies must register flight metadata indicating changes in flight time, whether delay, anticipation, or cancellation. They have to log the time a flight happened and a justification for the alteration. Table 1 indicates the flight metadata together with their semantics.

Table 1: Flight metadata registered by airline companies available in VRA

Attribute            Description
Airline              ICAO code representing the airline company
Flight               Flight number
Authorization code   Identifies the authorization type for each flight step
Flight type          Identifies the type of operation performed
Origin               ICAO code of origin airport
Destination          ICAO code of destination airport
Expected Departure   Date and time of scheduled departure
Real departure       Date and time of departure performed, informed by the airline
Estimated Arrival    Date and time of estimated arrival
Real Arrival         Date and time of arrival, informed by the airline
Flight status        Informs if the flight was performed or canceled
Justification Code   Identifies the delay, cancellation, and other changes concerning the planned flight

According to the regulation of ANAC, the metadata indicated in Table 1 must be registered in a paper form, either typed or handwritten. ANAC then consolidates the data sent by the airline companies into the VRA dataset. VRA is published monthly, comprising all flight steps expected to depart in a given month.

The primary goal of ANAC is to use the recorded metadata to compute the punctuality rate of airlines. Thus, sector regulation obliges airline companies to provide the data presented in Table 1. Therefore, VRA comprises all flight steps that took place in a given period. However, around 20% of the records may be considered inconsistent due to errors made while filling in the report form. As will be presented in Section 3.1, the causes of errors include arrival time before departure or flight duration inconsistent with the regulation of ANAC.

Meteorological conditions play an important role in aviation operations. The Automated Surface Observing Systems (ASOS) is a program that involves several American government agencies. It was created to become an official network of meteorological information, primarily to support aviation entities. It includes meteorological, climatological, and hydrological components. ASOS data come from weather sensors in locations all over the planet.
In Brazil, ASOS covers all 154 airports available in VRA, as seen in Figure 1.

The Department of Agronomy at Iowa State University, in the United States, compiles daily information from the US ASOS system. It creates an hourly report of meteorological observations in all of its sites. Table 2 indicates the meteorological data together with their semantics.

The integrated Brazilian Flight Dataset (BFD) presented in this paper includes both the flight data present in VRA and the meteorological information present in ASOS. It is intended to enable studies regarding the Brazilian commercial aviation system. BFD is composed of 15 , , 922 records of flight data, each containing 45 attributes. The dataset, together with its integration process description and R scripts, is available on IEEE DataPort at http://dx.doi.org/10.21227/k10b-qn21. Additional information can be found at (8).

Figure 1: Brazilian airports included in the ASOS dataset (2)
Table 2: ASOS meteorological data

Attribute                Description
Sky condition            Cloud height and amount (clear, scattered, broken, overcast) up to 12,000 feet
Visibility               To at least ten statute miles
Weather                  Type and intensity for rain, snow, and freezing rain
Obstructions to vision   Fog, haze
Pressure                 Sea-level pressure, altimeter setting
Temperature              Ambient and dew point temperature
Wind                     Direction, speed, and character (gusts, squalls)
Precipitation            Accumulation

Figure 2 presents the data model of BFD, which is detailed in the following sections. As can be seen, BFD aggregates data from VRA and ASOS for flight information and meteorological information, respectively. It also includes data currently unavailable in VRA, such as descriptions of the ANAC justification codes, airline and airport names, and ISO codes for country names.

BFD focuses on flights that departed from or arrived in Brazil. When both the origin and destination airports are located in Brazil, the flight is considered domestic. Conversely, when either the origin or the destination airport is located outside of Brazil, the flight is considered international. The data integration process for creating BFD was organized into three main activities: (i) data preprocessing, (ii) data enrichment, and (iii) data fusion. Those activities resemble the traditional Extraction, Transformation, and Load (ETL) process (10).

The preprocessing stage was performed in three parts. First, VRA attribute names were translated from Brazilian Portuguese to English. It was unnecessary to translate the acronyms used in each variable since they already followed the International Civil Aviation Organization (ICAO) standards. Regarding the ASOS data, it was necessary to convert temperature and dew point data to the International System of Units. Data from ASOS was filtered to consider only the 154 airports available in VRA.

The second part consisted of data cleaning for both VRA and ASOS datasets.
Given that flight information is usually recorded by hand, VRA data was cleaned to remove inconsistent data. During cleaning, records with missing variables were removed. Also, records with departure time (either elapsed or expected) greater than or equal to arrival time were removed. They corresponded to approximately 0.02% of the records. Approximately 3.77% of VRA records were removed for being out of BFD scope, i.e., with both origin and destination outside of Brazil. Finally, the regulation of ANAC prohibits delays longer than 24 hours. Thus, records with departure or arrival delays exceeding this limit were also removed. The complete data cleaning removed 21.07% of VRA records.

Figure 2: The data model for the BFD

The third part of the preprocessing stage consisted of removing outliers. For each pair of airports ⟨o, d⟩ in VRA, both the expected and elapsed durations of flights from origin o to destination d were considered. Flights whose duration (either elapsed or expected) was not in the interval [Q1 − k · IQR, Q3 + k · IQR] were considered outliers, where Q1 and Q3 are the first and third quartiles of the durations for that pair, IQR = Q3 − Q1, and k is a constant multiplier (1.5 in the conventional boxplot rule). They corresponded to 2.76% of VRA records. The preprocessing step resulted in 15 , , 922 flight records from VRA to be used in the fusion stage.
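The cleaning rules and the per-route outlier filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' R implementation: the record field names (`origin`, `expected_departure`, and so on) are hypothetical, the multiplier `k=1.5` is the conventional boxplot value rather than a value confirmed by the text, and the quartile method (`"inclusive"`) is an arbitrary choice.

```python
from collections import defaultdict
from datetime import timedelta
from statistics import quantiles

MAX_DELAY = timedelta(hours=24)  # delay ceiling mentioned in the text
# Hypothetical field names for the four date-time attributes of Table 1.
TIME_FIELDS = ("expected_departure", "real_departure",
               "expected_arrival", "real_arrival")


def is_consistent(flight):
    """Cleaning rules: no missing times, departure strictly before arrival,
    and departure/arrival delays of at most 24 hours."""
    if any(flight.get(name) is None for name in TIME_FIELDS):
        return False
    if flight["expected_departure"] >= flight["expected_arrival"]:
        return False
    if flight["real_departure"] >= flight["real_arrival"]:
        return False
    departure_delay = flight["real_departure"] - flight["expected_departure"]
    arrival_delay = flight["real_arrival"] - flight["expected_arrival"]
    return departure_delay <= MAX_DELAY and arrival_delay <= MAX_DELAY


def minutes(start, end):
    """Duration between two datetimes, in minutes."""
    return (end - start).total_seconds() / 60.0


def iqr_bounds(values, k=1.5):
    """Return (Q1 - k*IQR, Q3 + k*IQR); k=1.5 is an assumed multiplier."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_duration_outliers(flights, k=1.5):
    """Keep flights whose expected and elapsed durations both fall inside
    the IQR interval computed for their (origin, destination) pair."""
    routes = defaultdict(list)
    for flight in flights:
        routes[(flight["origin"], flight["destination"])].append(flight)

    kept = []
    for group in routes.values():
        if len(group) < 2:  # quartiles need at least two observations
            kept.extend(group)
            continue
        expected = [minutes(f["expected_departure"], f["expected_arrival"])
                    for f in group]
        elapsed = [minutes(f["real_departure"], f["real_arrival"])
                   for f in group]
        exp_lo, exp_hi = iqr_bounds(expected, k)
        ela_lo, ela_hi = iqr_bounds(elapsed, k)
        for f, exp, ela in zip(group, expected, elapsed):
            if exp_lo <= exp <= exp_hi and ela_lo <= ela <= ela_hi:
                kept.append(f)
    return kept
```

A typical use would be `remove_duration_outliers([f for f in flights if is_consistent(f)])`, which mirrors the order in the text: cleaning first, outlier removal second. The scope filter (flights with both endpoints outside Brazil) is left out because it depends on an airport-to-country lookup that the sketch does not assume.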
After preprocessing, the dataset is enriched as follows. The dataset schema is changed by separating the departure and arrival date-time attributes (see Table 1) into hour and date attributes. In addition, attributes related to flight duration and to departure and arrival delays are included.

Additionally, two discrete attributes were included for the time of day of departures and arrivals. They divide the time attribute into seven ranges, as presented in Table 3.

Table 3: Time attribute discretization

Period          Start Time   End Time
Night           23:00        04:00
Early Morning   05:00        08:00
Mid Morning     09:00        10:00
Late Morning    11:00        12:00
Afternoon       13:00        16:00
Early Evening   17:00        19:00
Late Evening    20:00        22:00
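As an illustration of the discretization in Table 3, the following Python sketch maps a time of day to its period. It assumes the table boundaries are whole hours, so that an end time such as 04:00 covers every minute up to 04:59; the text does not state this explicitly. The function name `period_of_day` is hypothetical.

```python
from datetime import time

# (period, first hour, last hour), taken from Table 3.
# Assumption: each end hour is inclusive through minute 59.
PERIODS = [
    ("Night", 23, 4),
    ("Early Morning", 5, 8),
    ("Mid Morning", 9, 10),
    ("Late Morning", 11, 12),
    ("Afternoon", 13, 16),
    ("Early Evening", 17, 19),
    ("Late Evening", 20, 22),
]


def period_of_day(moment):
    """Return the Table 3 period for a datetime.time (or datetime) value."""
    hour = moment.hour
    for name, first, last in PERIODS:
        if first <= last:
            if first <= hour <= last:
                return name
        elif hour >= first or hour <= last:  # range wraps past midnight
            return name
    raise ValueError(f"hour {hour} is not covered by any period")
```

For example, `period_of_day(time(23, 30))` and `period_of_day(time(4, 59))` both fall in the Night period, which is the only range that wraps around midnight.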
Two discrete attributes are included in ASOS while enriching the dataset. The first uses the wind speed in knots to derive the wind intensity on the Beaufort scale. The second uses the wind direction in degrees to derive the wind direction on a Wind Rose with 16 cardinal directions (N, NNE, NE, ENE, E, ESE, SE, SSE, S, SSW, SW, WSW, W, WNW, NW, and NNW).

Data fusion was applied over VRA data from 2000 to 2019, except for June and July 2014 and March 2018, when ANAC did not collect the data. It is worth mentioning that ASOS provides hourly meteorological data.

During the fusion of meteorological and flight data, it was necessary to group all flight data in a given hour. The grouping was performed for both the elapsed departure and the elapsed arrival of each flight to determine its meteorological information.

Furthermore, the fusion stage resolved airport and airline names from VRA data. It also included an ISO code for country names whenever the flight departs from or arrives at a non-Brazilian airport. Finally, the justification codes for flight delay were also expanded to their descriptions.
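The wind discretization and the hourly fusion described above could look roughly like the following Python sketch. It is not the authors' R code. The assumptions are: the Beaufort thresholds are the standard ones expressed in knots; a flight is matched to the observation of the same airport whose timestamp equals the flight time truncated to the hour (the text only says flights are grouped by hour); and all field names (`station`, `valid`, `wind_knots`, `wind_deg`, `real_departure`, and so on) are hypothetical.

```python
from bisect import bisect_right

WIND_ROSE_16 = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
                "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]

# Lower bound, in knots, of Beaufort forces 1 to 12 (force 0 is below 1 knot).
BEAUFORT_LOWER_KNOTS = [1, 4, 7, 11, 17, 22, 28, 34, 41, 48, 56, 64]


def wind_rose(direction_deg):
    """Map a direction in degrees to one of 16 sectors of 22.5 degrees,
    each centered on its compass point (N is centered on 0 degrees)."""
    sector = int((direction_deg % 360 + 11.25) // 22.5) % 16
    return WIND_ROSE_16[sector]


def beaufort(speed_knots):
    """Map a wind speed in knots to a Beaufort force from 0 to 12."""
    return bisect_right(BEAUFORT_LOWER_KNOTS, speed_knots)


def hour_key(station, moment):
    """Key used for the join: airport code plus the time truncated to the hour."""
    return station, moment.replace(minute=0, second=0, microsecond=0)


def attach_weather(flights, observations):
    """Attach the hourly observation of the origin (at the real departure)
    and of the destination (at the real arrival) to each flight."""
    index = {}
    for obs in observations:
        enriched = dict(obs,
                        beaufort=beaufort(obs["wind_knots"]),
                        wind_rose=wind_rose(obs["wind_deg"]))
        index[hour_key(obs["station"], obs["valid"])] = enriched

    fused = []
    for flight in flights:
        row = dict(flight)
        row["departure_weather"] = index.get(
            hour_key(flight["origin"], flight["real_departure"]))
        row["arrival_weather"] = index.get(
            hour_key(flight["destination"], flight["real_arrival"]))
        fused.append(row)
    return fused
```

Flights without a matching observation get `None` in this sketch; whether BFD drops, keeps, or imputes such rows is not stated in the text.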
BFD allows for studies regarding the Brazilian commercial aviation system. In this section, we present an exploratory analysis of BFD data together with previous and ongoing work conducted on top of BFD.

As discussed before, the Brazilian flight system is oriented towards regional hubs instead of company hubs. Figure 3 presents the number of flights per airport, considering only the 25 busiest airports by number of flights. It also divides flights into domestic (D), international (I), and cargo (C) flights.

As can be seen in Figure 3, among the top five busiest airports, the first two are in São Paulo (SBSP and SBGR), the third is in Brasília (SBBR), and the last two are in Rio de Janeiro (SBGL and SBRJ). Rio and São Paulo have the two highest Gross Domestic Products (GDPs) in Brazil. They are two major gateways for flights entering and leaving Brazil. Approximately one-third of the flights at Guarulhos Airport (SBGR) and Galeão Airport (SBGL) are international.

Figure 3: Number of flights per airport, for the top-25 most active airports

Footnote (Wind Rose): Wind Rose Data - US Department of Agriculture - Natural Resources Conservation Service (NRCS).

Brasília is the capital of the country and is located in the middle of Brazil. It acts as a hub for flights from and to cities in the north and northeast regions. It can be seen, however, that it has few international flights.

Brazil and Argentina have strong tourism ties. Thus, the Buenos Aires international airport (SAEZ) appears among the top-25 busiest airports. Since BFD has only flights from and to Brazil, SAEZ has only international and cargo flights.

Figure 4 presents the takeoff and arrival delay per airport for the top-25 busiest airports. It indicates whether an airport has recovery capabilities for arrival delays. The radius of the circle representing each airport also indicates its level of punctuality: the larger the radius, the more punctual the airport.

Figure 4: Mean takeoff delay and punctuality rate per mean arrival delay for the top-25 busiest airports

Figure 5 presents the distribution of flights according to the period of the day. As shown, most flight departures (Figure 5.a) occur in the afternoon and early evening. Most arrivals (Figure 5.b) occur in the afternoon and early morning. During the mid and late morning, the number of flights decreases significantly for both departure and arrival.

Figure 5: Number of flights per period of the day: (a) departure; (b) arrival

According to ANAC regulation, a flight is considered delayed when its departure or arrival time surpasses, respectively, the expected departure or arrival by more than 30 minutes. Figure 6 presents the punctuality rate of the entire Brazilian flight system per year, from 2000 to 2019. It is possible to observe that the Brazilian flight crises that occurred in 2007 affected both the punctuality rate and the mean delay (9).

Figure 7 presents the same analysis of the Brazilian system per month of the year. Historically, months of school break (December, January, and July) have the lowest punctuality rates and the highest mean delay. August is the month with the highest level of punctuality and the lowest mean delay.
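The 30-minute delay rule quoted above, and a punctuality rate built on it, can be sketched as follows. The text does not give the exact formula ANAC uses, so the sketch assumes the punctuality rate is simply the share of flights that are not delayed at departure or arrival; the field names and the helper `punctuality_by` are hypothetical.

```python
from collections import defaultdict
from datetime import timedelta

DELAY_THRESHOLD = timedelta(minutes=30)  # ANAC threshold quoted in the text


def is_delayed(flight):
    """A flight is delayed when its real departure or arrival exceeds the
    expected one by more than 30 minutes."""
    departure_delay = flight["real_departure"] - flight["expected_departure"]
    arrival_delay = flight["real_arrival"] - flight["expected_arrival"]
    return departure_delay > DELAY_THRESHOLD or arrival_delay > DELAY_THRESHOLD


def punctuality_rate(flights):
    """Share of flights that are not delayed (assumed definition)."""
    flights = list(flights)
    if not flights:
        raise ValueError("punctuality rate is undefined for zero flights")
    on_time = sum(1 for f in flights if not is_delayed(f))
    return on_time / len(flights)


def punctuality_by(flights, key):
    """Punctuality rate per group, e.g. per year, month, airport, or airline."""
    groups = defaultdict(list)
    for flight in flights:
        groups[key(flight)].append(flight)
    return {group: punctuality_rate(items) for group, items in groups.items()}
```

Passing `key=lambda f: f["expected_departure"].month` gives a per-month breakdown of the kind summarized in Figure 7, and `key=lambda f: f["expected_departure"].year` a per-year breakdown as in Figure 6.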
Figure 6: Punctuality rate and mean delay per year. The charts present the mean delay together with its 95% confidence interval

Figure 7: Punctuality rate and mean delay per month of the year

Finally, Figure 8 presents the punctuality rate (circle size) and the average delay in minutes per number of flights for the top-25 companies. According to Figure 8, two airlines account for the largest number of flights, TAM and Gol (GLO). It is also possible to observe that airlines with lower punctuality rates tend to have a higher mean delay.

Figure 8: Mean delay and punctuality rate per number of flights for the top-25 airline companies

Given the various inconveniences that flight delays cause for airlines, airports, and passengers, it is fundamental to mitigate their occurrence and optimize the decision-making process of an air transport system. In particular, airlines, airports, and users may be more interested in when delays are likely to occur than in accurately predicting the absence of delays. In that context, Moreira et al. (4) use BFD to analyze flight delays in the period between 2009 and 2015. The authors present a classification model capable of predicting delays with a hit rate of about 60%.

Flight delays fall into two main categories: root delay and delay propagation. Root delays are related to events that are intrinsic to a particular flight. In delay propagation, it is presumed that a delay has already occurred at some point in the network, i.e., new delays occur due to previous delays. Understanding delay propagation patterns among airports is essential for decision-making processes.

Such understanding may reveal patterns in flight delays and in the way the system recovers from them. Focusing on unveiling those patterns, Sternberg et al. (6) apply data indexing techniques combined with association rules to BFD data. The authors observed that the Brazilian flight system has difficulties recovering from previous delays when operating under adverse meteorological conditions, when delay occurrences may increase up to 216%.

This work aimed to create a reliable and enriched database on national and international flights that arrived at and departed from Brazilian airports. With the data offered by this database, it is possible to carry out several studies to aid the decision-making process. For example, it is possible to answer the following questions: (i) "Which airport suffers the most delays?"; (ii) "In what month of the year is an airport most likely to experience delays?"; or (iii) "In what part of the day is a particular airport most likely to experience a delay in departure?"

The answers to these questions can help companies and governments review their protocols and optimize their services. Additionally, we intend to update this dataset yearly, repeating the entire data integration process.

Acknowledgments
The authors thank CNPq, CAPES (finance code 001), FAPERJ, and CEFET/RJ for partially funding this research.
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Authors' contributions
All authors contributed equally to the study. EO conceptualized the study design. CT acquired the data. LT and JS conducted data analysis and interpretation. Furthermore, JAS and GA revised it critically for intellectual content. All authors approved the final version.
References
[1] ANAC. Agência Nacional de Aviação Civil. Technical report, 2015.
[2] ASOS. Automated Surface Observing System. Technical report, https://mesonet.agron.iastate.edu/ASOS/, 2000.
[3] ICAO. Annual Report of the Council 2014. Technical report, 2015.
[4] L. Moreira, C. Dantas, L. Oliveira, J. Soares, and E. Ogasawara. On Evaluating Data Preprocessing Methods for Machine Learning Models for Flight Delays. In Proceedings of the International Joint Conference on Neural Networks, volume 2018-July, 2018.
[5] R. A. Scarpel and L. Pelicioni. A data analytics approach for anticipating congested days at the São Paulo International Airport. Journal of Air Transport Management, 72:1–10, 2018.
[6] A. Sternberg, D. Carvalho, L. Murta, J. Soares, and E. Ogasawara. An analysis of Brazilian flight delays based on frequent patterns. Transportation Research Part E: Logistics and Transportation Review, 95:282–298, 2016.
[7] A. Sternberg, D. Carvalho, L. Murta, J. Soares, and E. Ogasawara. Experimental Evaluation. Technical report, https://eic.cefet-rj.br/~dal/an-analysis-of-brazilian-flight-delays-based-on-frequent-patterns/, 2016.
[8] C. Teixeira, L. Teixeira, J. dos Santos, G. Amorim, J. Soares, and E. Ogasawara. Integrated Brazilian Flight Datasets Description. Technical report, https://eic.cefet-rj.br/~dal/brazilian-flight-dataset-description, 2020.
[9] N. Y. Times. Brazil Demands Solution to Aviation Crisis. Technical report, 2007.
[10] P. Vassiliadis. A survey of extract-transform-load technology.