Machine Learning in Airline Crew Pairing to Construct Initial Clusters for Dynamic Constraint Aggregation
Yassine Yaakoubi a,b,d,∗, François Soumis a,b, Simon Lacoste-Julien c,d

a Department of Mathematics and Industrial Engineering, Polytechnique Montréal, Canada
b GERAD, Montréal, Canada
c Department of Computer Science and Operations Research, University of Montréal, Canada
d MILA, Montréal, Canada
Abstract
The crew pairing problem (CPP) is generally modelled as a set partitioning problem where the flights have to be partitioned into pairings. A pairing is a sequence of flight legs separated by connection times and rest periods that starts and ends at the same base. Because of the extensive list of complex rules and regulations, determining whether a sequence of flights constitutes a feasible pairing can be quite difficult by itself, making the CPP one of the hardest of the airline planning problems. In this paper, we first propose to improve the prototype Baseline solver of Desaulniers et al. [1] by adding dynamic control strategies to obtain an efficient solver for large-scale CPPs: Commercial-GENCOL-DCA. These solvers are designed to aggregate the flight-covering constraints to reduce the size of the problem. Then, we use machine learning (ML) to produce clusters of flights having a high probability of being performed consecutively by the same crew. The solver combines several advanced Operations Research techniques to assemble and modify these clusters, when necessary, to produce a good solution. We show, on monthly CPPs with up to 50 000 flights, that Commercial-GENCOL-DCA with clusters produced by ML-based heuristics outperforms Baseline fed by initial clusters that are pairings of a solution obtained by rolling horizon with GENCOL. The reduction of solution cost averages between 6.8% and 8.52%; it is mainly due to the reduction in the cost of global constraints, between 69.79% and 78.11%.

∗ Corresponding author.
Email addresses: [email protected] (Yassine Yaakoubi), [email protected] (François Soumis), [email protected] (Simon Lacoste-Julien)
First published in the "Cahiers du GERAD" series in February 2020. Submitted to the EURO Journal on Transportation and Logistics on January 17, 2020. Accepted August 24, 2020; available online September 2, 2020. https://doi.org/10.1016/j.ejtl.2020.100020
Keywords: Machine Learning, Column Generation, Constraint Aggregation, Airline Crew Scheduling, Crew Pairing
1. Introduction
Crew scheduling is of crucial importance to airlines, as crew costs represent their second-largest source of expenditure, after fuel costs. As the global airline industry grows in size and volume, the complexity of scheduling problems increases significantly over time. Increasing computing capacity is not enough, and innovation in algorithms is needed to solve the growing problems.

In practice, the planning processes begin with the construction of the flight schedule, a list of flights to be operated in a period of time. Then, aircraft are allocated to the scheduled flight legs (non-stop flights) based on their type and capacity in order to maximize the net profit; this is called the fleet assignment problem. The next step is to solve the aircraft routing problem, where maintenance requirements of the aircraft are considered. Next, the airline must assign a cockpit and cabin crew to each of the scheduled flights, which is where the crew scheduling problems arise. The objective of the crew scheduling problem is to minimize crew-related costs.

Due to its complexity, the crew scheduling problem is tackled in two sequential phases, both in practice and in the literature. First, the crew pairing problem (CPP) forms a minimum-cost set of anonymous feasible pairings from the scheduled flights, such that all flights are covered exactly once and all the pairing regulations and contractual rules are respected. Then, crew assignment combines the anonymous pairings with rest periods, vacations, preassigned activities such as training, and other breaks over a standardized month to produce a set of individual schedules for crew members.

In the context of the CPP, a pairing is a sequence of flight legs that begins and ends at the same crew base (an airport where crews are stationed). A pairing contains multiple duties: sequences of flights and deadheads (repositioning) that form a day of work. Two consecutive duties inside a pairing are separated by a layover.
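As an illustration, the flight, duty, and pairing structure just described can be sketched with simple Python classes. The rule thresholds below (30-minute connection, 14-hour duty span, 10-hour layover) are illustrative placeholders, not actual contractual values, and the feasibility test covers only a small subset of the real rules:

```python
from dataclasses import dataclass

# Minimal sketch of the pairing structure described above. All thresholds
# are illustrative placeholders, not actual airline or contractual values.

@dataclass
class Flight:
    origin: str
    destination: str
    dep: int  # departure time, in minutes from the start of the horizon
    arr: int  # arrival time

@dataclass
class Duty:
    flights: list  # a day of work: a sequence of flights (or deadheads)

    def feasible(self, min_connection=30, max_span=14 * 60):
        for a, b in zip(self.flights, self.flights[1:]):
            # Consecutive flights must connect in space and in time.
            if a.destination != b.origin or b.dep - a.arr < min_connection:
                return False
        return self.flights[-1].arr - self.flights[0].dep <= max_span

@dataclass
class Pairing:
    duties: list  # consecutive duties separated by layovers

    def feasible(self, min_rest=10 * 60):
        first = self.duties[0].flights[0]
        last = self.duties[-1].flights[-1]
        if first.origin != last.destination:
            return False  # must start and end at the same crew base
        for d1, d2 in zip(self.duties, self.duties[1:]):
            if d2.flights[0].dep - d1.flights[-1].arr < min_rest:
                return False  # layover too short
        return all(d.feasible() for d in self.duties)
```

For instance, a one-day round trip YUL to JFK and back with a 60-minute connection forms a feasible duty and, on its own, a feasible pairing under these placeholder rules.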
Pairings must comply with airline regulations as well as collective agreements, such as the maximum flying time, the maximum duration of a duty, the minimum rest time in a layover, the maximum number of calendar days in a pairing, etc. The cost of a pairing approximates the crew's salary as well as other expenditures, such as hotel costs. When pilots and copilots are trained to operate a single type of aircraft, as in the considered applications, the CPP can be decomposed by aircraft type.

Since the 1990s, the most prevalent method used to solve this problem has been branch-&-price: column generation embedded in a branch-&-bound structure [2, 3]. Column generation is an iterative method alternating between a restricted master problem (RMP) for pairing selection and several subproblems (SP) for pairing generation. At each iteration, the RMP is simply the master problem restricted to a subset of its variables. The subproblem is a shortest path problem with resource constraints that aims at finding a feasible pairing starting on the corresponding day at the corresponding base with the least reduced cost. For more details on the method, see the survey on CPPs by Cohn and Barnhart [4], and Deveci and Demirel [5] for a more recent survey.

In the process of developing a solver capable of dealing with much larger problems than those that have been addressed to date, GENCOL, a commercial optimization solver, was integrated into a rolling horizon approach. Because the original solver can only handle a few thousand flights per window, windows are constrained to be two days long in the case of a large CPP. The length of the windows is too short to produce good solutions to monthly problems, as the constraints cover longer periods of time.

Elhallaoui et al. [6] introduce the dynamic constraint aggregation (DCA) approach to reduce the number of constraints simultaneously considered in the master problem. The improved primal simplex (IPS) relies on a complementary problem built from the matrix Ā = B⁻¹A; the computation of Ā becomes expensive for large-scale problems. Desaulniers et al.
[1] combine multiple methods developed and tested on small datasets to obtain an efficient algorithm for large-scale CPPs; namely, they combined column generation (GENCOL), DCA, multi-phase DCA (MPDCA), and IPS. This algorithm is adapted to the set partitioning problem with a few supplementary linear constraints, but aggregates only the set partitioning constraints, permitting the use of the network methods of DCA to identify the compatible and incompatible variables. A variable is said to be compatible with respect to the partition if the flights covered by the pairing correspond to a concatenation of some clusters; otherwise, the variable is declared incompatible. This permits the construction of the CP of IPS in less time than computing B⁻¹A. The CP is solved to obtain a negative reduced-cost combination of incompatible variables and to obtain dual variables. These central dual solutions permit speeding up column generation compared to DCA. Furthermore, the algorithm combines some techniques of partial pricing and branch-&-bound to solve large-scale problems where DCA, MPDCA, and IPS had not been applied before: monthly CPPs with complex industrial constraints. The partial pricing method MPDCA (Elhallaoui et al. [8]) will be explained in Section 3, where we present the main ideas of Baseline. We consider this algorithm as
Baseline in Section 3.3, where we discuss the improvements, and in Section 5, where we compare the performance of our improved algorithm with respect to this Baseline.

Since the difficulty with DCA, MPDCA, and IPS is to produce good initial clusters, Baseline [1] is fed with clusters from pairings of a solution obtained by rolling horizon with GENCOL. The drawback of Baseline is that it takes more time to produce the initial clusters than to re-optimize with Baseline. Following the flaw in this method, and with the emergence of machine learning (ML), we propose the following idea: use ML models to find good clusters, which are provided as initial information to OR (Operations Research) algorithms, thus improving the quality of solutions and the speed with which these solutions are discovered. Yaakoubi et al. [9, 10] build on this idea by studying the performance of several ML algorithms on the flight-connection problem, in which the objective is to predict the next flight that a crew has to follow in its schedule.

More generally, prior to Yaakoubi et al. [9, 10], several other studies combined ML and OR to solve combinatorial optimization (CO) problems. Indeed, in cases where there is a structural understanding of the CO problem, and thus a theoretical knowledge of the decisions to be made by the optimization algorithm, ML can be used to provide quick approximations of these decisions, thus reducing the computation time required. In the literature, this is called learning by imitation: samples of the expected behavior are available and are used by the ML model as demonstrations for learning.

There are various architectural approaches for implementing ML models for CO problems. In the simplest case, the ML model is the solver itself. This approach has been successfully used to solve some Euclidean TSP classes using deep learning. Vinyals et al. [11] use this approach in combination with imitation learning: using exact TSP solutions for smaller graphs and approximations for larger graphs, they generate a set of demonstrations which are then encoded by an RNN (
Recurrent Neural Network). Another RNN serves as a decoder that produces a permutation of the TSP nodes. This method creates a model capable of handling inputs of different sizes.

In more complex cases, ML is used to enrich the decision process of an existing CO algorithm. An ML model can be used to process the problem definition in order to extract a meaningful structure or to reduce the problem space. In these cases, ML plays a preparatory role for the CO algorithm. An example can be found in Kruber et al. [12] for solving mixed-integer linear programming (MILP) problems. Using the Dantzig-Wolfe decomposition of a MILP instance, they train ML models to predict which of two solvers will solve the instance optimally and faster. However, the ML models are not used to find the final solutions. Similarly, Lodi et al. [13] examine a facility location problem where an ML model is used to predict whether a derived problem instance will have a solution similar to its reference. They then add this additional data into the final solver, allowing for a faster computation time, thus allowing energy companies to assess the viability of facility locations without having to perform the possible but costly computation. We refer the interested reader to Bengio et al. [14] for a more detailed overview of the use of ML models in OR.

The application of ML techniques to CO problems has led to progress in solving airline-specific problems. The ML models used are frequently encountered in other problem areas, e.g., certain varieties of genetic algorithms (GA). Graf et al. [15] examine flight scheduling and use traditional linear programming to solve it. The same problem is addressed in Tsai et al. [16] using a GA with chromosomes defined as two-dimensional objects representing possible schedules. Pandit et al. [17] study a new approach to aircraft fleet planning using a simple 3-layer neural network (NN). Various studies have addressed CPPs using GAs. Zeren et al.
[18] introduce a "perturbation" operator for their GA. Another study on the CPP using evolutionary algorithms can be found in Azadeh et al. [19], who implement a particle swarm optimization algorithm. Deveci and Demirel [5] address a variant of the CPP, which consists of finding a set of minimum-cost pairings covering all flights, using various evolutionary algorithms. They improved on previous solutions by using a hybrid algorithm (GA and hill-climbing method). Thus, their use of ML created a richer dataset for meta-heuristics. Although the proposed approach outperforms the two variants of GA presented in the paper, the instances used are small, containing up to 750 flights. Note that in all cases, ML alone cannot solve the CPP: solving a 50 000-flight problem requires finding 50 000 crew connections. Even if ML yields a 99.9% accuracy for each connection, the probability of finding a good feasible solution is (0.999)^50 000 ≈ 2 × 10⁻²².

As a first contribution, we propose Commercial-GENCOL-DCA, a new implementation of Baseline [1], including new control strategies for the column generation, the constraint aggregation, and the branch-&-bound. In particular, a dynamic control strategy is used to identify a "neighborhood" that is large enough to reach a good LP solution, but small enough to maintain a small number of fractional variables, permitting an efficient heuristic branch-&-bound. Note that a limited number of variables connecting consecutive flights in the current solution are unfrozen to define the neighborhood around the current solution.
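The back-of-the-envelope estimate above, that even 99.9% accuracy per connection leaves a negligible chance of an entirely correct 50 000-connection solution, is easy to check numerically:

```python
# Numerical check of the probability estimate above: even with 99.9%
# accuracy on each of the 50 000 crew connections of a monthly problem,
# the chance that every connection is predicted correctly is negligible.
p_connection = 0.999
n_connections = 50_000
p_all_correct = p_connection ** n_connections
print(f"{p_all_correct:.1e}")  # about 1.9e-22
```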
In addition, a variable aggregation and disaggregation strategy is proposed, reducing the size of the problem without negative effects on the number of column generation iterations or on the complexity of branching decisions.

The second contribution is to modify the column generation solver to take advantage of an initial solution and of clusters of flights with a large probability of being consecutive in a solution, to speed up the column generation algorithm; this initial information can be obtained by ML. In addition, we modify the complementary problem of IPS to reduce its density using the good information in the clusters.

As a third contribution, we show that modern ML techniques can help to automatically learn highly efficient crew pairing schedules and develop initial clusters for the solver. We use a modified version of the NNs proposed by Yaakoubi et al. [9]: first, we use multi-layer convolutional neural networks (CNNs); second, we use dropout (randomly dropping units from the neural network during training) to prevent overfitting, and batch normalization to normalize the activations over the current batch in each hidden layer; third, since we use a more complex ML model, we use extensive Bayesian optimization, a sequential procedure in which a probabilistic model of the performance is maintained using a Gaussian process.

As a fourth contribution, we detail a proof-of-concept study on a huge monthly problem with up to 50 000 flights. Starting with ML and finishing with mathematical programming permits solving larger problems globally and avoids the loss of optimality due to heuristic decomposition into small time slices in the rolling horizon approach. In addition, the rolling-horizon approach was improved to use clusters tailored to the flights of the current window that connect well with the schedule of the previous window.

Note that our ML predictor does not have access to the collective agreements and exact cost function of the airline company.
Furthermore, it does not use airline-dependent information in any step of the learning process, as it would be a considerable amount of work to code it. In addition, doing so would reduce the generality of the predictor and complicate its application across different airlines.

The remainder of this article is structured as follows. A literature review on crew pairing is presented in Section 2. Section 3 presents
Baseline [1] and the new implementation, Commercial-GENCOL-DCA. Next, Section 4 describes the ML predictor and introduces the cluster construction methodology, including different heuristics, to avoid constructing infeasible pairings. Section 5 reports computational results. Finally, a conclusion is drawn in Section 6.

2. Literature review on the crew pairing problem
This section reviews a collection of relevant literature on the CPP to provide an overview of how researchers suggest approaching this problem. We focus on approaches developed to solve large-scale industrial problems.

According to Desaulniers et al. [3], for each category of crew and each type of aircraft fleet, the CPP aims to find a set of pairings at minimal cost so that each planned flight is performed by a crew. The methodology for solving the problem depends on the size of the airline's network structure, rules, collective agreements, and cost structure [20].

In the CPP, two flight legs can be operated by the same crew if the arrival station of the first flight leg is the same as the departure station of the second one, and the time between the flights is adequate to satisfy the crew feasibility rules. A sequence of flights forms a duty, where every two consecutive flights are separated by idle time. The flights in a pairing are operated by a single crew, and a consecutive sequence of duty periods is referred to as a pairing, as long as the first duty period starts and the last duty period ends at the same station (base). Idle time between duty periods is called a layover, and a pairing is feasible if it satisfies all safety and collective agreement rules, such as:

• minimum connection time between two consecutive flights;
• maximum number of landings, maximum flying time, and maximum number of flights per duty;
• maximum number of days and maximum number of duties in a pairing;
• minimum rest time between two duties and maximum span of a duty.

Finally, a set of feasible pairings constitutes a feasible solution if each scheduled flight is covered by at least one pairing. When a flight appears in more than one pairing, one crew operates the flight while the other crews are transferred between two stations for repositioning purposes.
This is called deadheading. Deadheads are also used to relocate crew members either at the end of a pairing (to bring crews back to base) or at the beginning of a pairing (to cover a flight departing from a non-base station).

In the literature, the CPP has traditionally been modelled as a set covering problem (SCP) or a set partitioning problem (SPP), with a covering constraint for each flight and a variable for each feasible pairing [3, 21, 22].

Formally, we consider F to be the set of legs that must be operated during a given period and Ω to be the set of all feasible pairings that can be used to cover these legs. For each pairing p ∈ Ω, let c_p be its cost and a_fp, f ∈ F, be a constant equal to 1 if the pairing contains leg f and 0 otherwise. Moreover, let x_p be a binary variable that takes value 1 if pairing p is selected, and 0 otherwise. Using a set-partitioning formulation, the CPP can be modelled as follows:

    minimize_x   Σ_{p ∈ Ω} c_p x_p                      (1)
    subject to   Σ_{p ∈ Ω} a_fp x_p = 1    ∀ f ∈ F      (2)
                 x_p ∈ {0, 1}              ∀ p ∈ Ω      (3)

The objective function (1) minimizes the total pairing costs. Constraints (2) ensure that each leg is covered exactly once, and constraints (3) enforce binary requirements on the pairing variables.

Marsten et al. [23] present a first commercial system with a specialized MILP solver for the set partitioning formulation. They present results for various companies and propose a heuristic decomposition for larger problems. Anbil et al. [24] introduce a comprehensive method with a cost-minimization goal. In this research, presenting collaborative research between American Airlines Decision Technologies and IBM, numerous possible pairings (columns) are considered in a column pool, and a substantial amount is fed into the MILP solver. Then, several of the non-basic pairings are rejected, and pairings from the column pool are inserted. The procedure is repeated until all pairings from the column pool have been taken into consideration [21].
This is only possible if the number of feasible pairings is not too large. Because not all viable pairings are considered, the divergence from optimality can be significant. The same assessment can be made regarding all heuristic approaches.

To overcome the above limitations, more sophisticated approaches have been proposed over the years. To solve the CPP modelled as an SPP, three time horizons are generally studied: a daily, a weekly, and a monthly horizon [25], and the most prevalent resolution method since the 1990s has been column generation inserted in branch-&-bound [3, 26]. The daily problem assumes that the flights are identical or relatively similar for every day of the planning horizon, and that minimum-cost pairings are generated for flights scheduled for a day. A daily cyclic solution is produced, where the number of crews present in every city in the evening is the same as in the morning. The weekly and monthly problems also assume repetitive schedules and function analogously, except for the obvious longer time span that makes the master problem larger in row size. The daily solution is then unfolded over a typical week, such that one copy of each pairing is kept for each weekday, and infeasible copies are removed. Then, a cyclic weekly problem is solved, preserving as much as possible the unfolded daily solution. Similarly, the computed weekly solution is unfolded over the month, infeasible copies are removed, and the monthly problem is solved, preserving as much as possible the unfolded weekly solution.

For large fleets, it may take too long to globally solve the weekly problem or to globally re-optimize the monthly problem. The rolling horizon approach is used to speed up the solution process [27]. The horizon is divided into time slices of equal length (except maybe the last one), each one overlapping partially with the previous one.
Then a solution is constructed greedily in chronological order by solving the problem restricted to each time slice sequentially, taking into account the solution of the previous slice, and of the next one if it is a re-optimization, through additional constraints. When the size of the problem increases, the time slices need to be shorter to obtain problems that can be solved within a reasonable time period. When the time slices become shorter than the pairings, the quality of the solution is deeply impacted. Indeed, the pairings partially outside of the time slice cannot be moved to another base, and many deadheads are used to balance the workload between bases. Recent studies have concentrated on weekly and monthly problems. Owing to the vacation period and differences in flight schedules, the monthly time horizon is the most accurate [25].
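A minimal sketch of this rolling-horizon scheme follows; the slice solver is a trivial stand-in for the restricted CPP optimizer, and all names and window parameters are illustrative:

```python
# Sketch of the rolling-horizon decomposition described above: the horizon
# is cut into overlapping slices, each solved while taking the solution of
# the previous slice into account. `solve_slice` stands in for the
# restricted optimizer; here it simply accumulates the flights it sees.

def rolling_horizon(flights, slice_len, overlap, solve_slice):
    """flights: list of (departure_time, flight_id); overlap < slice_len."""
    solution, start = [], 0
    horizon_end = max(t for t, _ in flights)
    while start <= horizon_end:
        end = start + slice_len
        window = [f for t, f in flights if start <= t < end]
        # Pass the partial solution so the slice connects with earlier work.
        solution = solve_slice(window, solution)
        start = end - overlap  # slices overlap to limit boundary effects
    return solution

# Trivial stand-in solver: keep each flight the first time it is seen.
def solve_slice(window, previous):
    return previous + [f for f in window if f not in previous]
```

With six flights departing on days 0 through 5, two-day slices and a one-day overlap cover every flight exactly once while each slice sees the tail of the previous one.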
3. The improved algorithm for the crew pairing problem
In this section, we present the structure of the
Baseline algorithm proposed by Desaulniers et al. [1] in order to locate and explain the improvements introduced in this paper and to help understand the results on the performance of the algorithm. First, Section 3.1 presents the structure of the algorithm in Figure 1 and focuses on the dynamic control strategies introduced to obtain a more stable algorithm producing better results. Then, Section 3.2 presents modifications to take advantage of clusters of flights with a high probability of being consecutive in a solution that may be different from the initial solution, as well as the modified complementary problem of IPS, whose density is reduced using the good information in the clusters. Finally, Section 3.3 presents improvements to the rolling horizon algorithm for the monthly problem.

Before discussing the algorithmic improvements, we first briefly describe software improvements. Baseline works with GENCOL 4.5 using CPLEX 12.4; Commercial-GENCOL-DCA works with GENCOL 4.10 using CPLEX 12.6.3. Many adaptations were required because there was a three-year period between GENCOL 4.5 and GENCOL 4.10. During the experimentation of Commercial-GENCOL-DCA, we did not use parallel processors, as the emphasis was on improving the quality of the solution rather than reducing CPU time. This time will be reduced in industry by using a workstation and parallel computing.
As shown in Figure 1, in box 1, the initial solution is a set of feasible pairings. These pairings do not necessarily cover all flights; it is better to have a partial solution than to start from scratch. This solution permits a warm start of CPLEX with few artificial variables. The flight partition is a set of clusters of flights. Within each cluster, flights have a high probability of being flown consecutively by the same crew. Unlike Baseline, where clusters are the pairings of the initial solution, in Commercial-GENCOL-DCA the clusters may be different from the pairings of the initial solution, allowing the use of clusters with fewer mispredicted flight connections. Indeed, a heuristic solution contains some mispredicted connections to include all flights and still be feasible. This permits using clusters produced by ML containing only sequences of flights with a high probability of staying together. Note that we are not aware of an ML approach that can directly construct a feasible solution (taking into account the many complex airline-dependent costs and constraints that are not necessarily available to the ML system at training time). The construction of clusters by ML will be explained in Section 4. The mathematical programming part of the algorithm is a branch-&-bound using column generation at each node of the branching tree.

In box 2, the aggregated reduced master problem (ARMP) contains only the compatible pairing variables generated so far in the column generation process. Recall that a variable is said to be compatible with respect to the partition if the flights covered by the pairing correspond to a concatenation of some clusters; otherwise, the variable is declared incompatible. Observe that the covering constraints of the flights in a cluster are identical for the compatible variables of the ARMP, which justifies keeping only the covering constraint of the first flight of each cluster in the ARMP.
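The compatibility test just recalled (a pairing is compatible when its flight sequence is exactly a concatenation of whole clusters) can be sketched as follows; identifying flights by integers and replacing the pointer-based data structure of the solver with a dictionary lookup are illustrative simplifications:

```python
# Sketch of the compatibility test: a pairing (an ordered list of flight
# ids) is compatible with the current partition when it is exactly a
# concatenation of whole clusters.

def is_compatible(pairing, clusters):
    """pairing: list of flight ids; clusters: list of lists of flight ids."""
    # Map each flight to the cluster that starts with it.
    starts = {c[0]: c for c in clusters}
    i = 0
    while i < len(pairing):
        cluster = starts.get(pairing[i])
        if cluster is None or pairing[i:i + len(cluster)] != cluster:
            return False  # the pairing enters or leaves a cluster mid-way
        i += len(cluster)
    return True
```

For clusters [1, 2, 3] and [4, 5], the pairing [1, 2, 3, 4, 5] is compatible, while [1, 2, 4, 5] is not, since it leaves the first cluster before flight 3.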
The ARMP also contains a few supplementary linear constraints, for example, the base constraints limiting the flight time per base. The ARMP, with only one constraint per cluster and only the compatible variables, permits rapidly improving the solution. This smaller problem, non-degenerate or with very little degeneracy, is solved efficiently with the primal simplex.

In box 3, the complementary problem (CP) contains only the incompatible pairing variables generated so far. This linear program finds a compatible convex combination of incompatible columns with minimal reduced cost.

[Figure 1: The improved algorithm for the SPP. Flowchart: (1) solve the ARMP over the compatible variables; (2) solve the CP over the incompatible variables; (3) test whether the minimum reduced cost is negative.]

Let S be the set of non-zero variables in this solution. These variables have the same value. Let A_S be the surrogate column obtained by the summation of the columns in S. The dual solution of the CP is combined with the dual solution of the ARMP to obtain dual variables for all flights in the problem. Section 3.2 gives more details on the improvements added to solve the CP more efficiently.

In box 4, we select between two options. The first is to add the surrogate column to the ARMP. This compatible column can be added without changing the aggregation and the number of constraints in the ARMP. When the ARMP is non-degenerate, the cost of the solution decreases at the first pivot. The ARMP of the CPP with few additional constraints has little degeneracy, and the solution is improved most of the time with one or very few pivots. The second option is to add to the ARMP the columns of the variables in the set S. Some clusters of the partition are broken to make the added variables compatible. The incompatible variables becoming compatible are also added to the ARMP. When the ARMP is non-degenerate, the first |S| − 1 pivots may be degenerate and the |S|-th pivot improves the solution.

We develop the following strategy. In the first iterations of the column generation, we use the first option.
The cost decreases rapidly with this small ARMP. When the tailing effect appears, we observe that some surrogate variables enter and leave the basis with small step sizes in the simplex pivots. This is due to the fact that the minimization over the many terms of a dense column in the exit criterion of the simplex yields smaller values and creates pivots with small steps.

Indeed, a simplex tableau with few non-zeros in it is said to have a low density, and to be sparse. Informally, a problem with a sparse initial tableau may be referred to as a sparse problem. Singleton columns (containing a single entry) do not increase the density of the tableau when they are pivoted on. Similarly, doubleton columns (containing two entries) can only increase the density slightly when they are pivoted on. Singleton and doubleton columns are called sparse columns. The density of the simplex tableau affects the density of the pivot column and row. As pivoting uses the pivot column and row across the tableau, the density of the pivot row and column affects the amount of work required. It is therefore desirable to preserve the sparsity of the tableau during the solution process.

When the tailing effect appears, we replace the surrogate variables that take a positive value in the solution of the ARMP by the columns composing these combinations (which may require adjusting the current partition), while the other surrogate variables are discarded from the ARMP. The optimization continues with the second option. The cost function of this less dense problem restarts decreasing more rapidly. Furthermore, the solution becomes less fractional with this less dense problem. Replacing the surrogate
Replacing the surrogatevariables by the original variables before branching permits only branching onoriginal variables.In box 5, when the solution of CP has null reduced cost, the sub-problemneeds to be solved to generate columns with negative reduced cost, if possible.To be more efficient, we introduce the following dynamic strategy: we go tothe sub-problem when the reduced cost is above a negative threshold. Insteadof adding existing incompatible columns with a small potential of solution im-provement, we give priority to generating columns with a better potential forimprovement. At each column generation iteration, this threshold is increasedup to zero to reach the optimality criteria. There is a sub-problem per crewbase and per starting day for pairings. Each sub-problem has a shortest pathproblem structure with resource constraints modeling the rules, ensuring thatthe paths are feasible pairings.In box 6, we control the partial pricing strategy of MPDCA (Elhallaoui etal. [8]). This strategy uses the degree of incompatibility of a column, whichis the number of times an incompatible column enters or exits in the middleof a cluster. This value can be computed in the sub-problem when a columnis generated. MPDCA proceeds through a predetermined sequence of phases,typically, phases k = 0, 1, 2, . . . In phase k , only pairings with a degree ofincompatibility not exceeding k can be generated by the pricing problems. Toimpose this constraint in the pricing problem, an additional incompatibilitydegree resource is considered. In Baseline , the value of k is increased of one16nit, from 0 to 2, when the objective improvement becomes too small. Moreexplanation can be found in Desaulniers et al. [1]. In Commercial-GENCOL-DCA, a more dynamic strategy manages the value of k using T , a thresholdon T , the min reduced cost in the sub-problem, and M a maximum on N thenumber of fractional variables in the solution of ARMP. We used M = p × number of flights, with p = . p = .
6. The rule is the following:

1. if T > T₀ and N < M, then increase k by one unit, up to 3;
2. if T > T₀ and N > M, replace T₀ by T;
3. if we reach T > T₀ and N > M and k = 3, we take a few branching decisions on variables very close to 1. This reduces the number of fractional variables; we then take k = 1 and come back to rule 1.

This rule maintains k as small as possible and therefore gives priority to less dense columns, producing less fractional solutions in the ARMP by combining less dense columns. This point will be explained in Section 3.2.

In box 7, the compatible and incompatible variables are identified with a good data structure according to the definition of compatibility given in box 2. This easily applies to CPPs, since flights can be ordered by departure time. The path corresponding to a variable is stored as a sequence of pointers to the next flight using the departure-time ordering, and clusters are stored in a similar manner. In addition, in clusters, each flight in a path has a pointer to itself. Starting at the beginning of the path of a variable, it is easy to check if the path is a concatenation of clusters, and thus if it is compatible. Compatible variables are added to the ARMP; incompatible ones are added to the CP. After adding new variables, the ARMP is re-solved first, without changing the partition. The ARMP has priority as long as its least reduced-cost variable is less than the least reduced cost of the incompatible variables multiplied by a predetermined multiplier (smaller than 1).

In box 8, we use a partial exploration of the branching tree. Since column generation solves the subproblem at optimality and has access to a very large number of columns, the LP relaxation provides a very good lower bound. Furthermore, MPDCA provides a primal solution with a small number of fractional variables. The lower bound and the primal solution give useful information to efficiently explore the branching tree.
We first use the column-fixing algorithm as follows (for instance, see [27]): at each node of the search tree, several columns taking a fractional value in the current ARMP solution are selected, and their values are set to one. These columns are selected in decreasing order of their values, because columns with larger values, in general, increase the value of the lower bound less, when set to one, than columns with smaller values. After imposing a pairing p, we remove from the ARMP and CP all columns containing the flights covered by p, and from the sub-problem networks all arcs representing those flights. These decisions rapidly reduce the size of the problems to solve. When there are no variables large enough to fix, we use the arc-fixing strategy: we fix to one the arcs with a large value. Unfortunately, such a diving heuristic may sometimes make bad decisions and produce a poor integer solution. To reduce the weakness of the diving branching, we use the Retrospective Branching [28]. This algorithm detects and revises, without backtracking, poor decisions made previously in the search tree. These decisions are selected from a list of risky decisions that is maintained during the branching process. A risky decision is a column or an arc that was fixed even though its value was smaller than a threshold. When the relative gap q_i between the values z_i and z of the solutions computed at a node i in the search tree and at the root node (i.e., q_i = (z_i − z)/z) exceeds a relative estimate of the gap for a good integer solution, the algorithm ejects risky decisions from the current solution without backtracking. This ejection is performed by adding a constraint on the number of risky decisions in the solution. The Retrospective Branching improves the integer solution by 25% of the optimality gap compared to the diving branching.

IPS and Baseline use a basis B compatible with the initial solution to compute Ā = B⁻¹A in the construction of CP. This computation can be time-consuming for large problems. To reduce the density, we use a transformation matrix M obtained from the information in the clusters. Note that multiplying by M does not change the set of feasible solutions, since we will construct M to be injective. Let us then consider that the flight covering constraints are in the order of flight start time. The transformation matrix M is block-diagonal, with a block for each cluster. Figure 2 presents the structure of M and the block structure associated with a cluster containing the flights ℓ₁, . . . , ℓ_k: each block has 1 on the diagonal and −1 on the subdiagonal.

Figure 2: The transformation matrix M and the block for one cluster

Computing Ā = M·A consists in modifying A by subtracting row i from row i + 1 within each cluster. Figure 3 presents an example of M·A for a sample of typical columns of A, with two clusters of flights.

Figure 3: Example of computing the matrix Ā

The matrix Ā has the following properties:

1. a compatible column has a 1 on the row of the first flight of each cluster covered by this column and a 0 for the other flights (columns 1 and 2);
2. an incompatible column leaving a cluster just before the flight i has −1 on the row of flight i (columns 3 and 5);
3. an incompatible column entering a cluster just before the flight i has +1 on the row of flight i (columns 4 and 5);
4. an incompatible column covering the first flights of a cluster has +1 on the row of the first flight.

We reorganize Ā by moving to the upper part the rows of the first flights of each cluster and, to the lower part, the remaining rows.
We obtain the following block matrix:

[ Ā_C¹  Ā_I¹ ]
[ Ā_C²  Ā_I² ]

where Ā_C¹ (Ā_C²) is the upper (lower) part of the compatible columns, and Ā_I¹ (Ā_I²) is the upper (lower) part of the incompatible columns. In what follows, we detail a few additional remarks:

• Ā_C¹ is the matrix of the ARMP;
• Ā_C² = 0, which justifies removing these rows in the ARMP;
• M·A_j = Ā_j with Ā_j² = 0 is an easy criterion to identify if column j is compatible;
• the number of non-null elements in a column Ā_j, j ∈ I, is the number of incompatibilities in this column;
• Ā_I² has low density when using the strategy described in box 6, which gives priority to the generation of columns with few incompatibilities.

The complementary problem CP finds a compatible combination of incompatible columns with a minimum reduced cost. After the matrix transformation, the constraints of CP can be rewritten as:

Ā_I² v = 0    (4)
w·v = 1       (5)
v ≥ 0         (6)

Asking the combination Ā_I v to be compatible is equivalent to asking Ā_I² v = 0. Without constraint (5), where w = (1, · · · , 1)ᵀ is of the dimension dictated by the context, the problem is unbounded and produces an extreme ray. With this constraint, the problem produces an extreme point. The columns of Ā_I² are obtained easily by transforming the generated columns using the data structures described in box 7.

IPS and Baseline use a basis B compatible with the initial solution to compute Ā = B⁻¹A in the construction of CP. There are many bases B compatible with the solution for a degenerate problem. We can select a basis that is easier to invert by replacing the null variables in the basis with high-cost artificial variables. The original zero variables become non-basic. This permits replacing a part of the basis by an identity matrix (Zaghrouti et al. [29]). This new basic matrix B is used by Baseline and Commercial-GENCOL-DCA. We use B to speed up the ARMP. For CP, there is still a difficult part in the matrix B, and the computation of Ā = B⁻¹A can be expensive.
In Commercial-GENCOL-DCA, we use the transformation matrix M to reduce the density of Ā. Indeed, this problem, with fewer constraints and a low-density matrix, can be solved more easily. Because CP is severely degenerate, the primal simplex is not a good algorithm to solve it; the dual simplex is more appropriate. The dual is likely not degenerate, because the reduced costs in the objective are real numbers, such that the probability of having two subsets of variables with the same cost is very low. Furthermore, the use of the less dense matrix Ā is an adaptation of the algorithm to take advantage of the good information provided by ML. Indeed, the clusters provided by ML contain more reliable information than the initial solution produced with a heuristic: while the heuristic makes a set of compromises to satisfy the constraints, ML, using the abstention strategy presented in Section 4.4, keeps only the good information and discards what is less reliable.

We improve the integration in a rolling-horizon approach, in order to use clusters tailored to the flights of each window and connecting well with the schedule of the previous window. In order to present our contributions, we first present the optimization process used in GENCOL init and Baseline:
GENCOL init.
Consider a standard monthly solution called “GENCOL init”, obtained with the GENCOL solver (without DCA). In this approach, the problem is solved by a “rolling-horizon” approach. Because the GENCOL solver (without DCA) is able to solve only up to a few thousand flights per window, it is constrained to use two-day windows and a one-day overlap period. This means that the month is divided into overlapping time slices of equal length. Then a solution is constructed greedily in chronological order by solving the problem restricted to each time slice sequentially, taking into account the solution of the previous slice through additional constraints.
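The rolling-horizon slicing can be sketched as follows; the function name is ours, and days are indexed from 0 (in the actual solver, each window is solved with the previous slice's solution fixed through additional constraints, which this sketch omits):

```python
def rolling_windows(n_days, window_len=2, overlap=1):
    """Overlapping time slices for the rolling-horizon approach.

    Defaults mirror GENCOL init: two-day windows, one-day overlap.
    Each window is a half-open (start_day, end_day) pair; windows are
    solved greedily in chronological order.
    """
    windows, start = [], 0
    step = window_len - overlap  # must be positive to advance
    while start < n_days:
        windows.append((start, min(start + window_len, n_days)))
        if start + window_len >= n_days:
            break
        start += step
    return windows

rolling_windows(7)  # -> [(0, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]
```

With the seven-day windows and two-day overlap used later for Baseline, `rolling_windows(30, 7, 2)` yields windows advancing five days at a time.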
Baseline.
The constraint aggregation approach is used in a rolling-horizon procedure (seven-day windows and a two-day overlap period) for the monthly CPP.
Baseline is fed with the pairings of the solution “GENCOL init” as an initial solution, thus obtaining a solution that we consider as a baseline for comparison. These pairings are used as initial clusters. These clusters are inadequate, since (1) “GENCOL init” uses narrow windows and makes many compromises to produce a feasible solution, and (2) the initial pairings in the next seven-day window to optimize are in conflict with the new pairings in the overlapping days, generated by the optimizer in the previous window.
Commercial-GENCOL-DCA.
For each experiment, we use ML to construct an initial DCA partition and feed it to Commercial-GENCOL-DCA as initial clusters to solve the monthly CPP. In Commercial-GENCOL-DCA, the clusters are dynamically generated by ML after the solution of the previous windows, which makes it possible to obtain clusters that connect well with the pairings starting in the previous five days.
4. Machine Learning model
This section describes the ML predictor constructed to provide the probabilities of flights being performed consecutively by the same crew. The prediction problem is first stated and formulated in Section 4.1. The proposed prediction model is presented in Section 4.2. Section 4.3 presents the input transformations needed to get an effective prediction model. Upon finalization of the model training, and in order to construct monthly pairings, we use four heuristics presented in Section 4.4.
In our solution method for the overall CPP, we provide the optimizer with an initial partition of flights into clusters. Each cluster represents a sequence of flights with a high probability of being consecutive in the same pairing in the solution. To construct each cluster, we need the following:

• Information on where and when a crew begins a pairing. This information makes it possible to identify whether a flight is the beginning of a pairing;
• For each incoming flight to a connecting city, a prediction of what the crew is going to do next: layover, flight, or end of a pairing. In the second case (flight), we further predict which flight the crew will undertake.

Since the end of a pairing depends on the maximum number of days in a pairing permitted by the airline company, we solely rely on this number as a hyperparameter; in other words, there is no need to predict the end of the pairing. Therefore, with regards to the pairing construction, we propose to decompose the second task into two sub-tasks: the first is predicting whether the crew will make a layover; the second is predicting the next flight, under the assumption that the crew will always take another flight.

On the one hand, layovers are highly correlated with the duration of the connection, although there is no fixed threshold for the number of hours a crew must stay in an airport to make a layover. On the other hand, predicting the next flight is a much more complicated prediction problem. We call this the flight-connection problem, which is the focus of our ML approach described in the next sections (see Figure 4).
Figure 4: Illustration of a crew pairing. The flight-connection variable (in red) determines the next flight that a crew is going to perform, given an incoming flight.
Similar to Yaakoubi et al. [9], we transform the data to build a flight-connection prediction problem (a supervised ML multiclass classification problem) where the goal is to predict the next flight that an airline crew should follow in their schedule, given the previous flight. The classification problem is thus the following: given the information about an incoming flight in a specific connecting city, choose among all the possible departing flights from this city (which can be restricted to the next 48 hours) the one that the crew should follow. These departing flights are identified by their flight code, departing city, and day (about 2 000 possible flight codes in our dataset). Note that different flights may share the same flight code in some airline companies, as flights performed multiple times a week usually use the same flight code. Nevertheless, a flight is uniquely identified by its flight code, departing city, and day.

Each flight can be described by the following five features that we can use in our classification algorithm:

• City of origin and city of destination (∼ 200 categories);
• Aircraft type (5 categories);
• Duration of flight (in minutes);
• Time (with date) of arrival (for an incoming flight) or of departure (for a departing flight).
As shown in Yaakoubi et al. [9, 10], Neural Networks (NNs) have shown great potential to extract meaningful features from complex inputs and obtain high accuracy. Therefore, we use a modified version of their NN predictor, as shown in Figure 5.

First, using standard encoding, the city code and aircraft type features are treated as numeric values. This means that cities with close values are treated similarly by the NN even though the codes are somewhat arbitrary. A more meaningful encoding for such categorical features is to use one-hot encoding. By fully connecting each one-hot encoding of a categorical feature to a separate hidden layer, we get an embedding layer of dimension d for this feature (d is a hyperparameter). The C × d parameter matrix (where C is the number of possible categorical values) represents the d-dimensional encoding for each of the C values, and this encoding is learned during the NN training. The embedding layer approach thus learns a d-dimensional representation of each city, and one could explore which city is similar to another one in this space. By concatenating the embedding layer for each categorical feature with the other numeric features (such as hours and minutes), we get an n_d-vector, where n_d is the dimensionality of the representation of one flight that is fed into the NN.

Figure 5: The architecture diagram for the NN predictor: input, embedding layers, convolutional layers with dropout, a dense layer (120 units) with ReLU activation and dropout, and a dense output layer (20 units) with softmax activation. Note that in this example, we use 2 convolutional layers.

Second, we use a multi-layer CNN. CNNs are a particular type of NN for processing grid-type data, e.g., time series (1D), image data (2D), and medical CT or MRI scans (3D). The convolution layer convolves the input matrix with a filter (kernel) of predefined size. The kernel weights are updated during the learning process. Our convolutional part consists of successive layers organized hierarchically; each layer has convolutions with learned filters, followed by a pointwise nonlinearity and a sub-sampling operation called pooling. Successive convolution layers automatically extract the most relevant features from the input data and feed them into the final classification layers of the network. Note that if we use a large number of hidden layers and neurons per layer, NNs give similar results to CNNs in our case. But using CNNs gives more stable results, is far less time-consuming, and takes the input format into consideration. More intuition behind using CNNs is given in Section 4.3. The output layer is a dense layer with softmax as the activation function, defined by Eq. (7), where x is a vector of dimension K given by the last hidden layer. We use categorical cross-entropy as a loss function and standard categorical accuracy on the test set as a performance metric.

softmax(x)_i = e^{x_i} / Σ_{k=1}^{K} e^{x_k}, ∀ i ∈ {1, . . . , K}    (7)

Third, we use dropout [30] to prevent overfitting in NNs. It is a generalization method that consists of applying (by element-wise product) a binary mask to each layer's outputs during training, cached for use in back-propagation. Since each neuron is susceptible to being masked, neurons learn the ability to adapt to the lack of information, thus making the NN more robust and avoiding overfitting.
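The softmax output layer defined by Eq. (7) can be sketched in a few lines of pure Python (the max-subtraction is a standard numerical-stability trick; it leaves the result unchanged because the common factor cancels in the numerator and denominator):

```python
import math

def softmax(x):
    """Softmax of Eq. (7), computed in a numerically stable way."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]  # shift by max(x) to avoid overflow
    s = sum(exps)
    return [e / s for e in exps]

p = softmax([1.0, 2.0, 3.0])
# the entries sum to 1, and the largest logit gets the largest probability
```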
We also use batch normalization [31] between the activation function and the dropout, to normalize the activations over the current batch in each hidden layer. Indeed, the layers adjust their parameters during back-propagation under the assumption that their input distribution stays the same. But all the layers are updated by back-propagation, thereby changing each of their outputs; this changes the inputs of the layers, as each layer gets its inputs from the previous layer. This is called internal covariate shift: the layers experience a constantly changing input distribution. Batch normalization standardizes the distribution of the output activations. Although using batch normalization is not crucial in our case, it produces the desired distribution of activations that does not vary too much when the weights change. In other words, it minimizes the internal covariate shift and helps the model learn faster.

Finally, because we use a different and more complex architecture than that proposed by Yaakoubi et al. [9], our predictor needs much more fine-tuning. Therefore, we use extensive Bayesian optimization based on a Gaussian process (500 iterations) to find a good configuration of hyperparameters. Note that one can use random search or grid search to fine-tune, but Bayesian optimization was shown to provide much better results for hyperparameter tuning [32, 33].

Given that the aircrew arrived at a specific airport at a given time, we can use a priori knowledge to define which flights are possible. For example, as shown in Figure 6, it is not possible to take a flight that starts ten minutes after the arrival, nor one five days later. Furthermore, it is rare that the type of aircraft changes between flights, since each aircrew is trained to use at most one or two types of aircraft. The reader is referred to [22] for further details on the likelihood of these scenarios.
Note that these simple conditions do not depend on the airline company, are applicable to various airline industry problems, and represent soft feasibility conditions. They are not the same conditions that are used by the GENCOL optimizer, which uses the complete list of collective agreements and the exact cost function. In our work, we use the following conditions, which must always be satisfied by the next flight performed by the crew:

• The departure time of the next flight should follow the arrival time of the previous flight to the connecting city;
• The departure time of the next flight should not exceed 48 hours following the arrival time of the previous flight to the connecting city;
• The departure city of the next flight should be identical to the connecting city of the previous flight;
• The aircraft type should be the same. Indeed, crew scheduling is separable by crew category and aircraft type or family [22].
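The four conditions above amount to a simple filter over the candidate departing flights. A minimal sketch follows; the dictionary field names are illustrative, and times are assumed to be in minutes since a common origin:

```python
def feasible_next_flights(incoming, candidates, max_connect_hours=48):
    """Keep only the departing flights satisfying the soft feasibility
    conditions: positive connection time, at most 48 hours, same city,
    same aircraft type. Flights are plain dicts (field names are ours)."""
    ok = []
    for f in candidates:
        connect = f["dep_time"] - incoming["arr_time"]  # minutes
        if (connect > 0
                and connect <= max_connect_hours * 60
                and f["dep_city"] == incoming["arr_city"]
                and f["aircraft"] == incoming["aircraft"]):
            ok.append(f)
    return ok
```

This filter plays the role of the masking constraints mentioned below: only arcs passing it are kept as candidate flight connections.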
Figure 6: Illustration of different feasible possibilities for an incoming flight
We can, therefore, design a more useful encoding for the output classes by normalizing their representations across different instances. Given a vertex v, we would like to predict which arc most probably follows v in a pairing that contains v. One of the main difficulties, when one wants to use NNs to predict relevant information on a combinatorial optimization problem, is that an NN accepts only standardized inputs (i.e., vectors in R^d with d fixed). In contrast, an instance of a combinatorial optimization problem is generally not a vector in R^d with d fixed. Here, the number of arcs outgoing from v depends on v.

We sparsify the graph to overcome this difficulty. Sparsifying the graph is indeed a generic technique to scale up column generation for a vehicle or crew scheduling problem without deteriorating too much the quality of the returned solution. It consists of removing most of the arcs of the graph and keeping only those that are the best candidates to be in the solution. For the CPP, the most relevant arcs are those corresponding to connections that meet our masking constraints. Previous work on the weekly CPP [9] has shown that keeping only the 20 most probable outgoing arcs of each vertex leads to excellent solutions. We exploit this technique to obtain a standardized input for the NN: indeed, after sparsification, all vertices have the same maximal number of outgoing arcs. A vector of d features for each of these arcs will therefore always have the same dimension, 20d, and can therefore be used as an input of the NN (which predicts, for a pairing that contains a vertex v, the most probable arc after v).

We use the embedding layer described earlier to construct a feature representation for each of these 20 flights. Then, we concatenate them to obtain an n_d × 20 input matrix for the next layers, where n_d is the dimension of the embedded representation of the information on one flight. Consequently, we not only reduce the number of possible next flights but also construct a similarity-based input, where neighboring factors have similar features, which in turn allows the usage of CNNs. The intuition here is that we can consider each next flight as a different time step, enabling the use of a convolutional architecture across time [34, 35].

Upon the finalization of the flight-connection prediction model training, we can use the same architecture to solve two other prediction problems on the test set (50 000 flights): (i) predict whether each of the scheduled flights is the beginning of a pairing or not; and (ii) predict whether each flight is performed after a layover or not. In reality, the three predictors share the same representation. To solve these independent classification problems, we sum the three prediction problems' cross-entropy losses when learning, therefore performing multi-output classification.

Start in a base.
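The sparsification step that turns a variable-size set of outgoing arcs into a fixed-size NN input can be sketched as follows (function name and padding scheme are our assumptions; in the paper the 20 slots are filled with the arcs' embedded feature vectors):

```python
def sparsify(outgoing_arcs, scores, k=20, pad_features=None):
    """Keep the k most probable outgoing arcs of a vertex, then pad to
    exactly k entries so every vertex yields a fixed-size input.

    outgoing_arcs -- candidate arcs (any hashable description)
    scores        -- one probability/score per arc
    pad_features  -- placeholder used when fewer than k arcs exist
    """
    ranked = [arc for _, arc in
              sorted(zip(scores, outgoing_arcs), key=lambda t: -t[0])]
    kept = ranked[:k]
    if pad_features is not None:
        kept += [pad_features] * (k - len(kept))  # fixed-size output
    return kept
```

After this step, each vertex contributes exactly k arc descriptions, so stacking their n_d-dimensional embeddings yields the n_d × k input matrix fed to the CNN.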
While training the NN predictor to recognize whether a flight is the beginning of a pairing or not, it may misclassify a flight departing from a non-base city as the beginning of a pairing. It is imperative to correct such false predictions in order to avoid pairings starting away from the base. Even though it is possible to construct a predictor given only flights departing from the bases, it is more efficient and robust to use all flights in the training step; that way, the predictor learns a better representation of the input.
No layover below a threshold.
While training the NN predictor to recognize whether there is a layover or not between two flights, and since this decision is independent of finding the next flight, we can use a threshold on the number of hours between the previous and the next flight, below which it does not make sense to make a layover. This threshold should be defined from previous solutions or by a practitioner.

To build the crew pairing, one can use the following heuristics:

• Heuristic 1 (H1): We use a greedy heuristic to build a crew pairing. Specifically, we consider each flight that the model predicts as the beginning of a pairing as a first flight. Given this incoming flight, we predict whether the crew is making a layover or not. In both cases, we consider the incoming flight and predict the next one. The pairing ends when the maximal number of days permitted per pairing is approached. We can use this heuristic to construct a solution for the testing data, obtaining a crew pairing that can be fed as initial clusters to the solver. Unfortunately, if one flight in the pairing is poorly predicted, since the flights are predefined, the crew can finish its pairing away from the base; such pairings are discarded.
• Heuristic 2 (H2): Like H1, but we also discard pairings where the predictor abstains. Indeed, instead of taking all the predictions into account when constructing the initial clusters for DCA, one can discard a percentage of these predictions to enhance accuracy. We consider augmenting our classifier with an “abstain” option [36]. Since, in phase k, MPDCA adds in the sub-problem a constraint forbidding the generation of pairings breaking clusters more than k times, breaking a connection in a cluster takes more time than building one. Therefore, removing low-confidence links (flights) using this “abstain” option can be advantageous.
• Adaptable Heuristic 1 (Adapt-H1) and Adaptable Heuristic 2 (Adapt-H2): Because the monthly problem is solved using a rolling-horizon approach with one-week windows and a two-day overlap period, constructing an initial partition for the entire month and using its subset in each window to feed DCA can be a major flaw. Such an initial partition will have many inconsistencies with the solution of the previous window, particularly during the overlap period, such as legs belonging to two different clusters. We propose to adapt the proposed clusters to the solution of the previous window, using the heuristics proposed in Section 4.4 to construct the clusters of the current window in accordance with the solution found for the previous window; any inconsistency with the previous window is thus avoided, and the proposed partition is adapted to the current resolution. In addition, instead of only considering the flights that the model predicts as the beginning of a pairing as first flights, incomplete clusters from the solution of the previous window starting during the overlap period are completed. In the next section, we denote by Adapt-H1 and Adapt-H2 the heuristics that apply this adaptation to the partitions produced with H1 and H2.
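The greedy construction in H1 can be sketched as follows. This is a simplified illustration: the function names, the dict-based flight encoding, and the day-based stopping test are our assumptions (the layover decision, which does not change the chain of flights, is omitted):

```python
def build_pairing(first_flight, predict_next, is_base, max_days):
    """Greedy construction of one pairing (Heuristic 1, simplified).

    predict_next(flight) -- returns the predicted next flight, or None
    is_base(city)        -- True if the city is a crew base
    A pairing is kept only if it ends back at a base within max_days;
    otherwise it is discarded (None), as in H1.
    """
    pairing = [first_flight]
    while True:
        last = pairing[-1]
        if last["day"] - first_flight["day"] >= max_days - 1:
            break  # maximal number of days per pairing is approached
        nxt = predict_next(last)
        if nxt is None:
            break
        pairing.append(nxt)
    return pairing if is_base(pairing[-1]["arr_city"]) else None
```

H2 would additionally pass each `predict_next` call through the abstention test and discard the pairing when the predictor abstains.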
5. Computational experiments
In this section, we report the results of the computational experiments we conducted using
Baseline and Commercial-GENCOL-DCA for large-scale CPPs. First, we present the instances used for training and testing in Section 5.1, then the hyperparameters and hardware used in the experimentation in Section 5.2. Finally, Sections 5.3 and 5.4 report our ML prediction results and the crew pairing problem resolution using Commercial-GENCOL-DCA.
Our training instances originate from a major airline and consist of six monthly crew pairing solutions with approximately 200 cities and 50 000 flights per month. The test set is a different instance, which is a benchmark used by airlines to decide which commercial solver to use. Each instance contains a one-month flight schedule, crew pairings, as well as a list of airports and bases. This information is used to create inputs for the learning phase. The dataset contains approximately 1 000 different source-destination pairs and 2 000 different flight codes, and pairings start from 7 different bases.
The parameter setting concerns the hyperparameter optimization, the evaluation process, and the hardware description. To find the best configuration of hyperparameters, we use Bayesian optimization with k-fold cross-validation on the training set to measure the configuration quality. The hyperparameters are presented in Table 1. We use different months for different folds (6 folds) to simulate the more realistic scenario where we make a prediction over a new period of time. More specifically, we optimize the hyperparameters listed in Table 1 [37] with an implementation of Gaussian process-based Bayesian optimization provided by the GPyOpt Python library, version 1.2.1 [38]. Bayesian optimization constructs a probabilistic model of the function mapping hyperparameter settings to model performance and provides a systematic way to explore the space more efficiently [39]. To identify the next best candidate to sample, we choose the point that maximizes an acquisition function. One of the most popular acquisition functions is Expected Improvement (EI), which represents the belief that new trials will improve upon the current best configuration. The candidate with the highest EI is tested next. Maximizing EI gives a clear indicator of the region from which we should sample, in order to gain the maximum information about the location of the global maximum of the model performance function. As mentioned above, one can use random search or grid search to fine-tune, but Bayesian optimization was shown to provide much better results for hyperparameter tuning [32, 33]. In our approach, the optimization was initialized with 50 random search iterations, followed by up to 450 iterations of standard Gaussian process optimization. Here, the test accuracy is used as the surrogate function and EI as the acquisition function.

Table 1: Hyperparameters used in optimization
Parameter                      Search space                          Type
Optimizer                      Adadelta; Adam; Adagrad; Rmsprop      Categorical
Learning rate                  0.001, 0.002, . . . , 0.01            Float
Dimensions of the embeddings   5, 10, 15, . . . , 50                 Integer
Number of dense layers         1, 2, 3, 4, 5                         Integer
Neurons per layer              100, 200, . . . , 1000                Integer
Dropout rate                   0.1, 0.2, . . . , 0.9                 Float
Convolutional layers           0, 1, 2, 3                            Integer
Filters n                      50, 100, 250, 500, 1000               Integer
Filter size h                  3, 4, 5                               Integer
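The Expected Improvement acquisition function used to pick the next hyperparameter configuration has a closed form under a Gaussian posterior. A minimal pure-Python sketch (for maximization of accuracy; the function names are ours):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def expected_improvement(mu, sigma, best):
    """EI for maximization under a posterior N(mu, sigma^2):
    EI = (mu - best) * Phi(z) + sigma * phi(z), with z = (mu - best) / sigma."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)  # degenerate (noise-free known point)
    z = (mu - best) / sigma
    return (mu - best) * norm_cdf(z) + sigma * norm_pdf(z)
```

The candidate configuration maximizing this quantity over the surrogate model is the one evaluated next.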
All experiments were executed on a 40-core machine with 384 GB of memory. Each method is executed in an asynchronously parallel setup of 2-4 GPUs; that is, it can evaluate multiple models in parallel, with each model on a single GPU. When the evaluation of one model is completed, the method can incorporate the result and immediately re-deploy the next job without waiting for the others to be finalized. We use four K80 (12 GB) GPUs with a time allocation of 10 hours. All algorithms were implemented in Python using the Keras [40] and TensorFlow [41] libraries.
We perform the Gaussian process search for the best configuration of hyperparameters. To warm-start the method with initial samples, we first use random search. After only a few iterations of random search, we are able to get an accuracy of 99.35%. Random search then boosts the total return very quickly, up to 99.62% after 18 iterations, where it remains until the end of the random search cycle (iteration 50). Using Bayesian optimization, we continuously improve the search for the configuration of hyperparameters that maximizes the overall return. In our case, we stopped at iteration 500, with the best architecture providing an accuracy of 99.68%.

Using the “abstain” option, the accuracy increases from 99.62% to 99.94%, estimating confidence with dropout: the prediction tasks can be carried out N times while applying dropout in all layers of the NN [42], yielding N probability vectors for each test sample. A rough estimate of the certainty of a prediction is obtained by computing the mean of these N probability vectors and subtracting their component-wise estimated standard deviation (computed from the same N vectors). This gives a lower bound on the certainty of our prediction. If the maximum value of these confidences is too low, we decide to abstain. For the next subsection, we use a 0.5% rejection rate (abstention) for Heuristic 2 (H2).

Computational results per window are reported in Table 2 for all algorithms, namely Baseline, H1, H2, Adapt-H1, and Adapt-H2. For each window and each algorithm, we provide the LP value at the root node of the search tree N0 (LP-N0), the computational time at N0 (N0 time), and the number of fractional variables.
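The mean-minus-standard-deviation confidence estimate described above can be sketched in pure Python (the function names and the boolean return convention are ours):

```python
def abstain_score(prob_vectors):
    """Lower-bound confidence from N stochastic (dropout) forward passes:
    component-wise mean minus standard deviation, then take the maximum
    over classes."""
    n = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    conf = []
    for j in range(n_classes):
        vals = [p[j] for p in prob_vectors]
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n
        conf.append(mean - var ** 0.5)  # mean minus standard deviation
    return max(conf)

def predict_or_abstain(prob_vectors, threshold):
    """True if the prediction is kept, False if the classifier abstains."""
    return abstain_score(prob_vectors) >= threshold
```

When the N passes agree, the standard deviation is near zero and the score is close to the mean probability; when they disagree, the score collapses and the prediction is rejected.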
Table 2: Computational results per window

              Solution cost   Diff. vs. Baseline (%)     Cost of global constraints   Diff. vs. Baseline (%)     Number of deadheads   Diff. vs. Baseline (%)
GENCOL init   30 681 120.5    —                          9 465 982.28                 —                          1 725                 —
Baseline      20 639 814.6    −32.73 (vs. GENCOL init)   —                            —                          992                   −42.49 (vs. GENCOL init)

Table 3: Computational results on monthly solution
We start by comparing GENCOL init and Baseline. Observe first that a significantly better solution was found using Baseline, compared to the initial solution GENCOL init. Indeed, the solver significantly reduced both the solution cost and the cost of global constraints, by 32.73% and 77.53%, respectively, while reducing the number of deadheads by 42.49%. This attests to the optimizer's capacity to tackle larger windows, finding better solutions and improving industrial-scale solutions.

Next, we compare Baseline, H1, and H2. While the solution cost and the cost of global constraints found with H1 are slightly worse than Baseline, with increases of 2.32% and 3.55%, H2 outperforms Baseline, reducing the solution cost by 6.8%. Furthermore, H2 reduced the cost of global constraints by 69.79%, which supports and justifies the larger number of fractional variables at the root node of the search tree and the larger number of nodes, yielding larger computational times. We believe that this trade-off is acceptable, since the improvement in the cost of global constraints is significant and the larger computational times are due to the tight constraints on the number of worked hours per base. Relaxing these constraints may yield better results than Baseline while reducing the computational time. Finally, note that the poor results of H1 and the good results of H2 are explained by the ability of H2 to tackle and revise poor predictions and discard poorly constructed clusters.

Then, we compare Baseline, Adapt-H1, and Adapt-H2. The solutions found by both heuristics present better statistics than Baseline, H1, and H2. Adapt-H1 yields better solutions than any other heuristic, with reductions in the solution cost and the cost of global constraints of 8.52% and 78.11%, respectively. Adapt-H2 presents similar results, with reductions of 7.44% and 76.93%.
It is also worth noting that, for all heuristics, the number of deadheads used is slightly larger than with Baseline, with increases between 2.22% and 14.52%. The cost of deadheads is accounted for in the solution cost. Because the solution cost is a multi-objective function, we believe that using slightly more deadheads allowed the solver to find better solutions, improving both the solution cost and the cost of global constraints.
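The adaptive heuristics (Adapt-H1 and Adapt-H2) revise the ML predictor's clusters from one window to the next. As a rough illustration only (the function names and interfaces below are hypothetical, not the authors' implementation), the rolling-horizon adaptation loop can be sketched as:

```python
def rolling_horizon(windows, predict_clusters, solve_window):
    """Solve a sequence of time windows, feeding each window's solution
    back to the cluster predictor for the next window (the adaptation step)."""
    solution, previous_pairings = [], None
    for window in windows:
        # Ask the ML predictor for clusters; passing the previous window's
        # pairings lets it revise poor predictions online.
        clusters = predict_clusters(window, previous_pairings)
        # Hand the clusters to the optimizer (Commercial-GENCOL-DCA here).
        pairings = solve_window(window, clusters)
        solution.extend(pairings)
        previous_pairings = pairings
    return solution
```

With trivial stand-ins for the predictor and solver, `rolling_horizon([[1, 2], [3]], lambda w, p: w, lambda w, c: c)` simply concatenates the windows, which shows the control flow without any optimization machinery.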
6. Conclusion
In this paper, we present Commercial-GENCOL-DCA, an improved implementation of Baseline [1] that relies on column generation and constraint aggregation: DCA, MPDCA, and IPS. On the one hand, DCA only incorporates set partitioning constraints. On the other hand, IPS is capable of handling additional linear constraints but does not provide fast techniques for identifying compatible columns.
Baseline constructs the complementary problem of IPS by working on clusters and flights with the network techniques of DCA to identify compatible and incompatible variables. It also combines some new techniques of partial pricing and Branch-&-Bound to solve a large-scale problem where DCA, MPDCA, and IPS had not been applied before: the monthly CPP with complex industrial constraints.

Commercial-GENCOL-DCA improves upon Baseline by using a dynamic control strategy and a variable aggregation and disaggregation strategy, reducing the size of the problem without negative effects on the number of column generation iterations or the complexity of branching decisions. In addition, the column generation solver is modified to take full advantage of an initial solution and of clusters of flights with a high probability of being consecutive in a solution, speeding up the column generation algorithm.

We developed ML-based heuristics capable of constructing adapted initial clusters for the optimizer, taking into account multiple past solutions. This work is the first attempt to embed recently developed ML methods into a column generation framework. We also proposed an adaptation mechanism that produces clusters in an online manner, taking into account the solution of the previous window.

We compared the performance of the
Baseline solver [1] (using a standard initial solution as clusters) with Commercial-GENCOL-DCA (using clusters proposed by ML-based heuristics). The main computational results show that Commercial-GENCOL-DCA yields better results than Baseline using a prototype rolling-horizon approach with narrow windows. In addition, the ML-based heuristics, taking advantage of abstention or adaptation, yield better results with significantly smaller costs, reducing the solution cost by 8.52% and the cost of global constraints by 78.11%.

We believe that the proposed solution is able to handle larger problems, as well as to learn from multiple airline companies to construct initial clusters for a new company, since it does not use flight codes in any part of the pre-processing, learning, or prediction process. We also believe that the combination of ML and optimization with constraint aggregation can easily be adapted to other types of optimization problems, such as railway or bus shift scheduling. Indeed, it has been observed numerous times that warm-starting column generation with an optimal integer basis can even be counterproductive to the overall branch-&-price. This paper presents results illustrating that algorithms using constraint aggregation (DCA, MPDCA, and IPS) can achieve high acceleration with good initial information, which is useful for the large community using column generation.

A potential improvement to our work is to propose an ML predictor capable of providing a monthly solution that can be used both as initial clusters and as an initial solution. Unlike our NN predictor, such an ML predictor cannot be greedy and must take into consideration the graph structure to maximize the feasibility of the proposed solution.

Finally, our long-term goal is to develop efficient new learning techniques, capable of incorporating global and local constraints in the ML-predictor learning process, that could handle and learn from the flight-based network structure, where nodes correspond to time-space coordinates and arcs represent tasks performed by crew members (legs, deadheads, connections, rests, etc.).
7. Acknowledgements
The authors thank the anonymous reviewers for their valuable comments that improved the quality of this work. This work was supported by IVADO and a Collaborative Research and Development Grant from the Natural Sciences and Engineering Research Council of Canada (NSERC) and AD OPT, a division of IBS Software. The authors would like to thank these organizations for their support and confidence.
References

[1] G. Desaulniers, F. Lessard, M. Saddoune, F. Soumis, Dynamic constraint aggregation for solving very large-scale airline crew pairing problems, Les Cahiers du GERAD G2020 (21).

[2] M. Desrochers, F. Soumis, A column generation approach to the urban transit crew scheduling problem, Transportation Science 23 (1989) 1–13.

[3] G. Desaulniers, J. Desrosiers, Y. Dumas, S. Marc, B. Rioux, M. M. Solomon, F. Soumis, Crew pairing at Air France, European Journal of Operational Research 97 (2) (1997) 245–259.

[4] A. M. Cohn, C. Barnhart, Improving crew scheduling by incorporating key maintenance routing decisions, Operations Research 51 (3) (2003) 387–396.

[5] M. Deveci, N. Ç. Demirel, Evolutionary algorithms for solving the airline crew pairing problem, Computers & Industrial Engineering 115 (2018) 389–406.

[6] I. Elhallaoui, D. Villeneuve, F. Soumis, G. Desaulniers, Dynamic aggregation of set-partitioning constraints in column generation, Operations Research 53 (4) (2005) 632–645.

[7] I. Elhallaoui, A. Metrane, G. Desaulniers, F. Soumis, An improved primal simplex algorithm for degenerate linear programs, INFORMS Journal on Computing 4 (2011) 569–577.

[8] I. Elhallaoui, A. Metrane, F. Soumis, G. Desaulniers, Multi-phase dynamic constraint aggregation for set partitioning type problems, Mathematical Programming 123 (2) (2010) 345–370.

[9] Y. Yaakoubi, F. Soumis, S. Lacoste-Julien, Flight-connection prediction for airline crew scheduling to construct initial clusters for OR optimizer, Les Cahiers du GERAD G2019 (26). arXiv:2009.12501.

[10] Y. Yaakoubi, Combiner intelligence artificielle et programmation mathématique pour la planification des horaires des équipages en transport aérien, Ph.D. thesis, Polytechnique Montréal (2019).

[11] O. Vinyals, M. Fortunato, N. Jaitly, Pointer networks, in: Advances in Neural Information Processing Systems, 2015, pp. 2692–2700.

[12] M. Kruber, M. E. Lübbecke, A. Parmentier, Learning when to use a decomposition, in: International Conference on AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, Springer, 2017, pp. 202–210.

[13] A. Lodi, L. Mossina, E. Rachelson, Learning to handle parameter perturbations in combinatorial optimization: an application to facility location, arXiv preprint arXiv:1907.05765.

[14] Y. Bengio, A. Lodi, A. Prouvost, Machine learning for combinatorial optimization: a methodological tour d'horizon, arXiv preprint arXiv:1811.06128.

[15] V. Graf, D. Teichmann, M. Dorda, Mathematical model for charter flights planning at airports with high air traffic volume, Communications – Scientific Letters of the University of Zilina 18 (3) (2016) 22–27.

[16] M.-W. Tsai, T.-P. Hong, W.-T. Lin, A two-dimensional genetic algorithm and its application to aircraft scheduling problem, Mathematical Problems in Engineering 2015 (2015) 1–12.

[17] P. K. Pandit, M. Hasin, A. Akhtar, Business model of aircraft fleet planning using ANN, in: Hamburg International Conference of Logistics (HICL) 2018, epubli, 2018, pp. 221–247.

[18] B. Zeren, İ. Özkol, An improved genetic algorithm for crew pairing optimization, Journal of Intelligent Learning Systems and Applications 4 (01) (2012) 70.

[19] A. Azadeh, M. H. Farahani, H. Eivazy, S. Nazari-Shirkouhi, G. Asadipour, A hybrid meta-heuristic algorithm for optimization of crew scheduling, Applied Soft Computing 13 (1) (2013) 158–164.

[20] J. W. Yen, J. R. Birge, A stochastic programming approach to the airline crew scheduling problem, Transportation Science 40 (1) (2006) 3–14.

[21] S. Qiu, Airline crew pairing optimization problems and capacitated vehicle routing problems, Ph.D. thesis, Georgia Institute of Technology (2012).

[22] A. Kasirzadeh, M. Saddoune, F. Soumis, Airline crew scheduling: models, algorithms, and data sets, EURO Journal on Transportation and Logistics 6 (2) (2017) 111–137.

[23] R. E. Marsten, F. Shepardson, Exact solution of crew scheduling problems using the set partitioning model: Recent successful applications, Networks 11 (1981) 165–177.

[24] R. Anbil, R. Tanga, E. L. Johnson, A global approach to crew-pairing optimization, IBM Systems Journal 31 (1) (1992) 71–78.

[25] J. O. Brunner, J. F. Bard, Flexible weekly tour scheduling for postal service workers using a branch and price, Journal of Scheduling 16 (1) (2013) 129–149.

[26] C. Barnhart, E. L. Johnson, G. L. Nemhauser, M. W. Savelsbergh, P. H. Vance, Branch-and-price: Column generation for solving huge integer programs, Operations Research 46 (3) (1998) 316–329.

[27] M. Saddoune, G. Desaulniers, F. Soumis, Aircrew pairings with possible repetitions of the same flight number, Computers & Operations Research 40 (3) (2013) 805–814.

[28] F. Quesnel, G. Desaulniers, F. Soumis, A new heuristic branching scheme for the crew pairing problem with base constraints, Computers & Operations Research 80 (2017) 159–172.

[29] A. Zaghrouti, I. El Hallaoui, F. Soumis, Improved integral simplex using decomposition for the set partitioning problem, EURO Journal on Computational Optimization 6 (2) (2018) 185–206.

[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958.

[31] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Proceedings of the 32nd International Conference on Machine Learning, ICML'15, JMLR.org, Lille, France, 2015, pp. 448–456.

[32] J. Snoek, H. Larochelle, R. P. Adams, Practical Bayesian optimization of machine learning algorithms, in: Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.

[33] G. E. Dahl, T. N. Sainath, G. E. Hinton, Improving deep neural networks for LVCSR using rectified linear units and dropout, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 8609–8613.

[34] N. Hatami, Y. Gavet, J. Debayle, Classification of time-series images using deep convolutional neural networks, in: Tenth International Conference on Machine Vision (ICMV 2017), Vol. 10696, 2018.

[35] A. Borovykh, S. Bohte, K. Oosterlee, Conditional time series forecasting with convolutional neural networks, in: Lecture Notes in Computer Science/Lecture Notes in Artificial Intelligence, Springer, USA, 2017, pp. 729–730.

[36] M. S. A. Nadeem, J.-D. Zucker, B. Hanczar, Accuracy-rejection curves (ARCs) for comparing classification methods with a reject option, in: S. Džeroski, P. Geurts, J. Rousu (Eds.), Proceedings of the Third International Workshop on Machine Learning in Systems Biology, Vol. 8 of Proceedings of Machine Learning Research, PMLR, Ljubljana, Slovenia, 2009, pp. 65–81.

[37] F. Dernoncourt, J. Y. Lee, Optimizing neural network hyperparameters with Gaussian processes for dialog act classification, in: 2016 IEEE Spoken Language Technology Workshop (SLT), 2016, pp. 406–413.

[38] J. González, GPyOpt: a Bayesian optimization framework in Python, http://github.com/SheffieldML/GPyOpt (2016).

[39] F. Hutter, H. H. Hoos, K. Leyton-Brown, Sequential model-based optimization for general algorithm configuration, in: Proceedings of the 5th International Conference on Learning and Intelligent Optimization, LION'05, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 507–523.

[40] F. Chollet, et al., Keras: Deep learning library for Theano and TensorFlow, https://keras.io/