Generative deep learning for decision making in gas networks
Lovis Anderson, Mark Turner, Thorsten Koch

February 4, 2021
Abstract
A decision support system relies on frequent re-solving of similar problem instances. While the general structure remains the same in corresponding applications, the input parameters are updated on a regular basis. We propose a generative neural network design for learning integer decision variables of mixed-integer linear programming (MILP) formulations of these problems. We utilise a deep neural network discriminator and a MILP solver as our oracle to train our generative neural network. In this article, we present the results of our design applied to the transient gas optimisation problem. With the trained network we produce a feasible solution in 2.5s, use it as a warm-start solution, and thereby decrease the time to the globally optimal solution by 60.5%.
1 Introduction

Mixed-Integer Linear Programming (MILP) is concerned with the modelling and solving of problems from discrete optimisation. These problems can represent real-world scenarios, where discrete decisions can be appropriately captured and modelled by the integer variables. In real-world scenarios a MILP model is rarely solved only once. More frequently, the same model is used with varying data to describe different instances of the same problem, which are solved on a regular basis. This holds true in particular for decision support systems, which can utilise MILP to provide real-time optimal decisions on a continual basis; see [4] and [40] for examples in nurse scheduling and vehicle routing. The MILPs that these decision support systems solve have identical structure, due to both their underlying application and their cyclical nature, and thus often have similar optimal solutions. Our aim is to exploit this repetitive structure, and create generative neural networks that generate binary decision encodings for subsets of important variables. These encodings can then be used in a primal heuristic by solving the sub-problem induced by the variable fixings. Additionally, the result of the primal heuristic can be used as a warm start to help improve solver performance when solving to global optimality. We demonstrate the performance of our neural network (NN) design on the transient gas optimisation problem [38], specifically on real-world instances embedded in day-ahead decision support systems.

The design of our framework is inspired by the recent development of Generative Adversarial Networks (GANs) [17]. Our design consists of two NNs, a Generator and a Discriminator. The Generator is responsible for generating the binary decision values, while the Discriminator is tasked with predicting the optimal objective function value of the MILP induced by fixing these binary variables to their generated values.

Our NN design and its application to transient gas-network MILP formulations is an attempt to integrate Machine Learning (ML) into the MILP solving process. This integration has recently received increased focus [7, 16, 43], encouraged by the success of ML integration into other facets of combinatorial optimisation; see [5] for a thorough overview. Our contribution to this intersection of the two fields is as follows: We introduce a new generative NN design for learning integer variables of parametric MILPs, which interacts with the MILP directly during training. We also apply our design to a much more difficult and convoluted problem than traditionally seen in similar publications, namely the transient gas transportation problem. To the best of our knowledge, this paper is the first successful implementation of ML applied to discrete control in gas transport.
2 Related Work

As mentioned in the introduction, the intersection of MILP and ML is currently an area of active and growing research. For a thorough overview of Deep Learning (DL), the relevant subset of ML used throughout this article, we refer readers to [18], and for MILP to [1]. We highlight previous research from this intersection that we believe is either tangential to, or may have shared applications with, the work presented in this paper. Additionally, we briefly detail the state of the art in transient gas transport, and highlight why our design is of practical importance. It should be noted as well that there are recent research activities in the reverse direction, with MILP applied to ML instead of the orientation we consider; see [45] for an interesting example.

Firstly, we summarise applications of ML to adjacent areas of the MILP solving process. [16] create a method for encoding MILP structure in a bipartite graph representing variable-constraint relationships. This structure is the input to a Graph Convolutional Neural Network (GCNN), which imitates strong branching decisions. The strength of their results stems from intelligent network design and the generalisation of their GCNN to problems of a larger size, albeit with some generalisation loss. [47] take a different approach, and use a NN design that incorporates the branch-and-bound tree state directly. In doing so, they show that information contained in the global branch-and-bound tree state is an important factor in variable selection. Furthermore, they are one of the few publications to present techniques on heterogeneous instances. [12] show a successful implementation of reinforcement learning for variable selection. [43] show preliminary results on how reinforcement learning can be used in cutting-plane selection. By restricting themselves exclusively to Gomory cuts, they are able to produce an agent capable of selecting better cuts than default solver settings for specific classes of problems.

There exists a continuous trade-off between model exactness and complexity in the field of transient gas optimisation, and as such, there is no standard model for transient gas transportation problems. [31] present a piecewise-linear MILP approach to the transient gas transportation problem, [8] a non-linear approach with a novel discretisation scheme, and [24] and [26] a linearised approach. For the purpose of our experiments, we use the model of [24], which uses linearised equations and focuses on subnetworks rich in active elements. The current research on ML in gas transport is still preliminary. [37] use a dual NN design to perform online calculations of a compressor's operating point, avoiding re-solving the underlying model. The approach constrains itself to continuous variables, and experimental results are presented for a gunbarrel-type network. [30] present a NN combined with a genetic algorithm for learning the relationship between compressor speeds and the fuel consumption rate in the absence of complete data. More often, ML has been used in fields closely related to gas transport, as in [20], where ML is used to track the degradation of compressor performance, and in [35], to forecast demand values at the boundaries of the network. For a more complete overview of the transient gas literature, we refer readers to [38].

Our Discriminator design, which predicts the optimal objective value of an induced sub-MILP, can be considered similar to [3] in what it predicts, and similar to [14] in how it works.
In the first paper [3], a neural network is used to predict the associated objective value improvements of cuts. This is a smaller scope than our prediction, but is still heavily concerned with the MILP formulation. In the second paper [14], a technique is developed that performs backward passes directly through a MILP. It does this by solving MILPs exclusively with cutting planes, and then receiving gradient information from the KKT conditions of the final linear program. This application of a neural network, which produces input to the MILP, is very similar to our design. The differences are that we rely on a NN Discriminator to appropriately distribute the loss instead of solving a MILP directly, and that we generate variable values instead of parameter values with our Generator.

While our discriminator design is heavily inspired by GANs [17], it is also similar to actor-critic algorithms, see [36]. These algorithms have shown success for variable generation in MILP, and are notably different in that they sample from a generated distribution for downstream decisions instead of always taking the decision with the highest probability. Recently, [9] generated a series of coordinates for a set of UAVs using an actor-critic based algorithm, where these coordinates were continuous variables in a MINLP formulation. The independence of separable sub-problems and the easily realisable value function within their formulation resulted in a natural Markov Decision Process interpretation. For a better comparison of the similarities between actor-critic algorithms and GANs, we refer readers to [36].

Finally, we summarise existing research that also deals with the generation of decision variable values for MIPs. [6, 7] attempt to learn optimal solutions of parametric MILPs and MIQPs, which involves outputting both all integer decision variable values and the active set of constraints. They mainly use Optimal Classification Trees in [6] and NNs in [7]. Their aim is tailored towards smaller problem classes, where speed is an absolute priority and parameter value changes are limited. [29] learn binary warm-start decisions for MIQPs. They use NNs with a loss function that combines binary cross entropy and a penalty for infeasibility. Their goal of a primal heuristic is similar to ours, and while their design is much simpler, it has been shown to work effectively on very small problems. Our improvement over this design is our non-reliance on labelled optimal solutions, which are needed for binary cross entropy. [11] present a GCNN design that extends [16], and use it to generate binary decision variable values. Their contributions are a tripartite graph encoding of MILP instances, and the inclusion of their aggregated generated values as branching decisions in the branch-and-bound tree, both in an exact approach and in an approximate approach with local branching [15]. Very recently, [32] combined the branching approach of [16] with a novel neural diving approach, in which integer variable values are generated. They use a GCNN both for generating branching decisions and for generating integer variable values. Different from our generator-discriminator based approach, they generate values directly from a learned distribution, which is based on an energy function that incorporates the resulting objective values.
3 Solution Framework

We begin by formally defining both a MILP and a NN. Our definition of a MILP is an extension of more traditional formulations, see [1], but still encapsulates general instances.
Definition 1.
Let $\pi \in \mathbb{R}^p$ be a vector of problem-defining parameters. We call the following a MILP parameterised by $\pi$:

$P_\pi := \min \; c_1^T x_1 + c_2^T x_2 + c_3^T z_1 + c_4^T z_2 \quad \text{s.t.} \quad A_\pi (x_1, x_2, z_1, z_2)^T \le b_\pi,$
$c_k \in \mathbb{R}^{n_k}, \; k \in \{1,2,3,4\}, \quad A_\pi \in \mathbb{R}^{m \times n}, \quad b_\pi \in \mathbb{R}^m,$
$x_1 \in \mathbb{R}^{n_1}, \quad x_2 \in \mathbb{R}^{n_2}, \quad z_1 \in \mathbb{Z}^{n_3}, \quad z_2 \in \mathbb{Z}^{n_4}$ (1)

Furthermore, let $\Sigma \subset \mathbb{R}^p$ be a set of valid problem-defining parameters. We then call $\{P_\pi \mid \pi \in \Sigma\}$ a problem class for $\Sigma$. Note that the explicit parameter space $\Sigma$ is usually unknown, but we assume in the following to have access to a random variable $\Pi$ that samples from $\Sigma$. In addition, note that $c$, $n_1$, $n_2$, $n_3$, and $n_4$ are not parameterised by $\pi$, and as such the objective function and variable dimensions do not change between scenarios.

Definition 2. A $k$-layer NN $N_\theta$ is given by the following:

$N_\theta : \mathbb{R}^{|a_1|} \longrightarrow \mathbb{R}^{|a_{k+1}|},$
$h_i : \mathbb{R}^{|a_i|} \longrightarrow \mathbb{R}^{|a_i|}, \quad \forall i \in \{2, \ldots, k+1\},$
$a_{i+1} = h_{i+1}(W_i a_i + b_i), \quad \forall i \in \{1, \ldots, k\}$ (2)

Here $\theta$ fully describes all weights ($W_i$) and biases ($b_i$) of the network. The $h_i$ are called activation functions and are non-linear element-wise functions.

An outline of our framework is depicted in Figure 1. The Generator $G_{\theta_1}$ is a NN that takes $\pi$ as input and outputs values for the variables $z_1$, which we denote by $\hat{z}_1$. These variable values $\hat{z}_1$, alongside $\pi$, are then input into another NN, namely the Discriminator $D_{\theta_2}$. $D_{\theta_2}$ finally outputs a prediction of the optimal objective function value of $P_\pi$ with the values of $z_1$ fixed to $\hat{z}_1$, namely $\hat{f}(P_\pi^{\hat{z}_1})$. More formally:

Definition 3.
The generator $G_{\theta_1}$ and discriminator $D_{\theta_2}$ are both NNs, defined by the following:

$G_{\theta_1} : \mathbb{R}^p \longrightarrow \mathbb{Z}^{n_3}, \qquad D_{\theta_2} : \mathbb{R}^p \times \mathbb{Z}^{n_3} \longrightarrow \mathbb{R}$ (3)

Furthermore, a forward pass of both $G_{\theta_1}$ and $D_{\theta_2}$ is defined as follows:

$\hat{z}_1 = G_{\theta_1}(\pi)$ (4)
$\hat{f}(P_\pi^{\hat{z}_1}) = D_{\theta_2}(\hat{z}_1, \pi)$ (5)

The hat notation is used to denote quantities that were approximated by a NN, and $f(P_\pi)$ refers to the optimal objective function value of $P_\pi$. We use superscript notation to create the following instances:

$P_\pi^{\hat{z}_1} := P_\pi \;\; \text{s.t.} \;\; z_1 = \hat{z}_1$ (6)

Note that the values of $\hat{z}_1$ must be appropriately rounded when explicitly solving $P_\pi^{\hat{z}_1}$, so that they are feasible w.r.t. their integrality constraints. As such, it is a slight abuse of notation to claim that $G_{\theta_1}(\pi)$ lies in $\mathbb{Z}^{n_3}$.

The goal of this framework is to produce good initial solution values for $z_1$, which lead to an induced sub-MILP, $P_\pi^{z_1}$, whose optimal solution is a good feasible solution to the original problem. Further, the idea is to use this feasible solution as a first incumbent for warm-starting $P_\pi$. To ensure feasibility for all choices of $z_1$, we divide the continuous variables into two sets, $x_1$ and $x_2$, as seen in Definition 1. The variables $x_2$ are potential slack variables, which ensure that all generated decisions result in feasible $P_\pi^{\hat{z}_1}$ instances. Penalising these slacks in the objective then feeds naturally into our design, where $G_{\theta_1}$ aims to minimise the induced optimal objectives. For the purpose of our application, it should be noted that $z_1$ and $z_2$ are binary variables instead of general integers.

Next we describe the design of $G_{\theta_1}$ and $D_{\theta_2}$. They are NNs whose structure is inspired by [17], as well as by inception blocks and residual NNs, which have greatly increased large-scale model performance [42]. We use the block design of ResNet-v2 [42], see Figure 3, albeit with slight modifications for the case of transient gas-network optimisation. Namely, we primarily use 1-D convolutions, with that dimension being time. Additionally, we separate initial input streams by their characteristics, and when joining two streams, use 2-D convolutions, where the second dimension is of size 2 and quickly becomes one-dimensional again. See Figure 2 for an example of this process. The final layer of $G_{\theta_1}$ contains a softmax activation function with temperature. As the softmax temperature increases, this activation function's output approaches a one-hot vector encoding. The final layer of $D_{\theta_2}$ contains a softplus activation function. All other intermediate layers of $N_{\theta_1,\theta_2}$ use the ReLU activation function. We refer readers to [18] for a thorough overview of deep learning, and to Figure 14 in Appendix A for our complete design.

For a vector $x = (x_1, \cdots, x_n)$, the softmax function with temperature $T \in \mathbb{R}$ (7), the ReLU function (8), and the softplus function with parameter $\beta \in \mathbb{R}$ (9) are:

$\sigma(x, T)_i := \frac{\exp(T x_i)}{\sum_{j=1}^{n} \exp(T x_j)}$ (7)
$\sigma(x_i) := \max(0, x_i)$ (8)
$\sigma(x_i, \beta) := \frac{1}{\beta} \log(1 + \exp(\beta x_i))$ (9)
Figure 2: Method of merging two 1-D input streams
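For concreteness, the following minimal PyTorch snippet (our own illustration, not code from the paper) shows (7)-(9) and the near-one-hot behaviour of the temperature softmax for large $T$:

    import torch
    import torch.nn.functional as F

    def softmax_with_temperature(x: torch.Tensor, t: float) -> torch.Tensor:
        # Equation (7): as t grows, the output approaches a one-hot encoding
        return torch.softmax(t * x, dim=-1)

    x = torch.tensor([1.0, 2.0, 0.5])
    print(softmax_with_temperature(x, t=1.0))    # smooth distribution
    print(softmax_with_temperature(x, t=50.0))   # approximately one-hot
    print(F.relu(x))                             # equation (8)
    print(F.softplus(x, beta=1.0))               # equation (9)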
Figure 3: 1-D ResNet-v2 block design

We can compose $G_{\theta_1}$ with $D_{\theta_2}$, as in Figure 1, so that the combined resulting NN is defined as:

$N_{\theta_1,\theta_2}(\pi) := D_{\theta_2}(G_{\theta_1}(\pi), \pi)$ (10)

In a similar manner to GANs and actor-critic algorithms, see [36], the design of $N_{\theta_1,\theta_2}$ has a bi-level optimisation interpretation; see [10] for an overview of bi-level optimisation. Here we list the explicit objectives of both $G_{\theta_1}$ and $D_{\theta_2}$, and how their loss functions represent these objectives.

The objective of $D_{\theta_2}$ is to predict $f(P_\pi^{\hat{z}_1})$, the optimal objective value of the induced sub-MILP $P_\pi^{\hat{z}_1}$. Its loss function is thus:

$L(\theta_2, \pi) := \left| D_{\theta_2}(G_{\theta_1}(\pi), \pi) - f(P_\pi^{G_{\theta_1}(\pi)}) \right|$ (11)

The objective of $G_{\theta_1}$ is to minimise the induced prediction of $D_{\theta_2}$. Its loss function is thus:

$L'(\theta_1, \pi) := D_{\theta_2}(G_{\theta_1}(\pi), \pi)$ (12)

The corresponding bi-level optimisation problem can then be viewed as:

$\min_{\theta_1} \; \mathbb{E}_{\pi \sim \Pi}\left[ D_{\theta_2}(G_{\theta_1}(\pi), \pi) \right] \quad \text{s.t.} \quad \min_{\theta_2} \; \mathbb{E}_{\pi \sim \Pi}\left[ \left| D_{\theta_2}(G_{\theta_1}(\pi), \pi) - f(P_\pi^{G_{\theta_1}(\pi)}) \right| \right]$ (13)

For effective training of $G_{\theta_1}$, a capable $D_{\theta_2}$ is needed. We therefore pre-train $D_{\theta_2}$. The following loss function, which replaces $G_{\theta_1}(\pi)$ in (11) with previously generated $z_1$ values, is used for this pre-training:

$L''(\theta_2, \pi) := \left| D_{\theta_2}(z_1, \pi) - f(P_\pi^{z_1}) \right|$ (14)

However, performing this initial training requires generating instances of $P_\pi^{z_1}$. Here we do supervised training in an offline manner on previously generated data. After the initial training of $D_{\theta_2}$, we train $G_{\theta_1}$ as a part of $N_{\theta_1,\theta_2}$, using samples $\pi \sim \Pi$, the loss function (12), and fixed $\theta_2$. The issue of $G_{\theta_1}$ outputting continuous values for $\hat{z}_1$ is overcome by the activation function of the final layer of $G_{\theta_1}$: the softmax with temperature (7) ensures that adequate gradient information still exists to update $\theta_1$, and that the results are near-binary. When using these results to explicitly solve $P_\pi^{\hat{z}_1}$, we round our result to a one-hot vector encoding along the appropriate dimension.

After the completion of both initial trainings, we alternately train both NNs using updated loss functions in the following way:

• $D_{\theta_2}$ training:
  – As in the initial training, using loss function (14).
  – In an online fashion, using predictions from $G_{\theta_1}$ and loss function (11).
• $G_{\theta_1}$ training:
  – As explained above, with loss function (12).

Our design allows the loss to be back-propagated through $D_{\theta_2}$ and distributed to the individual nodes of the final layer of $G_{\theta_1}$, i.e., those representing $z_1$. This is largely different from other methods, many of which rely on a binary cross entropy loss against optimal solutions of $P_\pi$. Our advantage over these is that the contribution of each variable decision in $z_1$ to the objective function we are trying to minimise can be calculated. This has the added benefit that generated suboptimal solutions are much more likely to be near-optimal, as they are trained in a manner that minimises the objective rather than copies previously observed optimal solutions.

For our application, transient gas network optimisation, methods for sampling instances currently do not exist. In fact, even gathering data is notoriously difficult, see [28] and [46]. For this reason, we introduce a new method for generating training data in Section 5.
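To make the training objectives above concrete, here is a minimal PyTorch sketch of losses (11), (12), and (14); the generator, discriminator, and the oracle value f_value are placeholders, since in our setting $f(P_\pi^{z_1})$ comes from the MILP solver:

    def discriminator_loss(discriminator, z1, pi, f_value):
        # Equations (11)/(14): L1 distance between the predicted and the
        # true optimal objective value of the induced sub-MILP
        return (discriminator(z1, pi) - f_value).abs().mean()

    def generator_loss(generator, discriminator, pi):
        # Equation (12): minimise the discriminator's prediction; the loss
        # back-propagates through the (frozen) discriminator into the generator
        z1_hat = generator(pi)
        return discriminator(z1_hat, pi).mean()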
4 The Transient Gas Optimisation Problem

To evaluate the performance of our approach, we test our framework on the transient gas optimisation problem; see [38] for an overview of the problem and associated literature. This problem is difficult to solve as it combines a transient flow problem with complex combinatorics representing switching decisions. However, the natural modelling of transient gas networks as time-expanded networks lends itself well to our framework, due to the static underlying network and repeated constraints at each time step.

We use the description of transient gas networks of [24]. The advantage of this description for our framework is a natural separation of the $z_1$ variables, which induce feasible $P_\pi^{z_1}$ for all choices due to the existence of slack variables in the description. These slack variables are then represented by $x_2$ in Definition 1.

The gas network is modelled as a directed graph $G = (V, A)$, where $A$ is the set of arcs representing network elements, e.g. pipes, and the nodes $V$ represent junctions between adjacent elements. Every arc $a \in A$ models a specific element, with $A = A_{pi} \cup A_{va} \cup A_{rs} \cup A_{rg} \cup A_{cs}$, i.e., pipes, valves, resistors, regulators, and compressors. Additionally, the node set contains multiple element types, with $V = V_b \cup V_0$ partitioned into boundary and inner nodes respectively. The boundary nodes represent the sources and sinks of the flow network; thus, flow and pressure forecasts are given for each $v \in V_b$.

It should be noted that this description focuses on network stations, the beating hearts of gas networks. Network stations are commonly located at the intersections of major pipelines and contain nearly all elements that can be used to control the gas flow. Next, we briefly explain the most important constraints of the model of [24], particularly those we exploit with our approach. For a full definition of the MILP, please see [24].

As we optimise a transient problem, we deal with a time horizon, namely $\mathcal{T}_0 := \{0, \ldots, k\}$. We aim to calculate a network state for each $t \in \mathcal{T} := \mathcal{T}_0 \setminus \{0\}$, i.e. control decisions for all future time steps. As such, the initial gas network state at time 0 contains a complete description of that time step and is immutable. On the other hand, all future time steps contain, before optimising, only forecasted pressure and flow values at $V_b$. We denote by $\tau(t)$ the time difference in seconds from time step 0.

4.1 Pipes

Pipes constitute the majority of elements in any gas transmission network. The dynamics of flow through pipes are governed by the Euler equations, a set of nonlinear hyperbolic partial differential equations, see [33]. We consider the isothermal case and discretise as in [23]. Consider the pipe $a = (u, v) \in A_{pi}$, where $u, v \in V$ are the two incident nodes. We attach a flow-in variable $q_{u,a,t}$ and a flow-out variable $q_{v,a,t}$ to each pipe. Additionally, each incident node has an attached pressure variable, namely $p_{u,t}$ and $p_{v,t}$. These flow-in, flow-out, and pressure values appear for each time step. The specific gas constant $R_s$, compressibility factor $z_a$, and temperature $T$ are assumed to be constant, and the diameter $D_a$, length $L_a$, slope $s_a$, cross-sectional area $A_a$, gravitational acceleration $g$, and friction factor $\lambda_a$ are themselves constant. The above constant assumptions are quite common in practice [38]. However, it is only after also setting the velocity of gas within each individual pipe, $|v_{w,a}|$, to be constant that all non-linearities are removed. We do this via a method developed in [23] and seen in [13].
The resulting pipe equations are:

$p_{u,t} + p_{v,t} - p_{u,t-1} - p_{v,t-1} + \frac{2 R_s T z_a (\tau(t) - \tau(t-1))}{L_a A_a} (q_{v,a,t} - q_{u,a,t}) = 0$ (15)

$p_{v,t} - p_{u,t} + \frac{\lambda_a L_a}{D_a A_a} (|v_{u,a}| q_{u,a,t} + |v_{v,a}| q_{v,a,t}) + \frac{g s_a L_a}{R_s T z_a} (p_{u,t} + p_{v,t}) = 0$ (16)

As nodes represent junctions between network elements and thus have no volume in which to store any gas, the flow conservation constraints (17) and (18) are required. In the equations below, $d_{w,t}$ represents the inflow resp. outflow of entry and exit nodes in the network at time $t \in \mathcal{T}$. Note that network elements that are not pipes have only one associated flow variable, instead of the in- and outflow variables of pipes. This is because they have no volume, and as such no ability to store gas over time, i.e. no line-pack.

$\sum_{(u,w)=a \in A_{pi}} q_{w,a,t} - \sum_{(w,v)=a \in A_{pi}} q_{w,a,t} + \sum_{(u,w)=a \in A \setminus A_{pi}} q_{a,t} - \sum_{(w,v)=a \in A \setminus A_{pi}} q_{a,t} + d_{w,t} = 0 \quad \forall w \in V_b$ (17)

$\sum_{(u,w)=a \in A_{pi}} q_{w,a,t} - \sum_{(w,v)=a \in A_{pi}} q_{w,a,t} + \sum_{(u,w)=a \in A \setminus A_{pi}} q_{a,t} - \sum_{(w,v)=a \in A \setminus A_{pi}} q_{a,t} = 0 \quad \forall w \in V_0$ (18)

4.2 Operation Modes

Operation modes represent binary decisions in our gas network. We identify the corresponding binary variables with the $z_1$ variables of our MILP formulation (1). Let $O$ represent the set of operation modes, and $m^{om}_{o,t}$ the associated variables. Operation modes are very important in our modelling context, as they describe every allowable combination of discrete decisions associated with valves and compressors.

Compressors are typically set up as a compressor station consisting of multiple compressor units, each of which represents the union of a single compressor machine and its associated drive. These compressor units are dynamically switched on or off and used in different sequences to meet the current needs in terms of compression ratios and flow rates. Out of the theoretically possible arrangements of compressor units, the set of technically feasible arrangements is known as the configurations of a compressor station.

Selecting an operation mode results in fixed configurations for all compressor stations. The binary variables associated with a compressor station $a = (u, v) \in A_{cs}$ at time $t \in \mathcal{T}$ are $m^{by}_{a,t}$ (bypass), $m^{cl}_{a,t}$ (closed), and $m^{cf}_{c,a,t} \; \forall c \in C_a$ (active). $C_a$ denotes the set of configurations associated with compressor station $a$ available in active mode, where each configuration's operating range is a polytope in the space $(p_{u,t}, p_{v,t}, q_{u,a,t})$. The polytope of configuration $c$ is represented by an intersection of half-spaces, described by $H_c = \{(\alpha_1, \alpha_2, \alpha_3, \alpha_4) \in \mathbb{R}^4\}$:

$1 = \sum_{c \in C_a} m^{cf}_{c,a,t} + m^{by}_{a,t} + m^{cl}_{a,t}$ (19)

$\alpha_1 p^{u\text{-}cf}_{c,a,t} + \alpha_2 p^{v\text{-}cf}_{c,a,t} + \alpha_3 q^{cf}_{c,a,t} + \alpha_4 m^{cf}_{c,a,t} \le 0 \quad \forall (\alpha_1, \alpha_2, \alpha_3, \alpha_4) \in H_c, \; \forall c \in C_a$ (20)

Note that the variables in (20) have an extra subscript and superscript compared to those in (15) and (16). This is due to our use of the convex-hull reformulation, see [2]. The additional subscript refers to the configuration in question, and the superscript to the mode, with the pressure variables having an additional node identifier. It should also be noted that the continuous variables attached to a compressor station are not fixed by a choice of operation mode or configuration; rather, the operation mode restricts the variables to some polytope.

Valves decide the allowable paths through a network, and can separate areas, decoupling their pressure levels. They are modelled as arcs $a = (u, v)$ whose discrete decisions can be decided by an operation mode choice. Valves have two modes, namely open and closed. When a valve is open, similar to a compressor station in bypass, flow is unrestricted and there is no pressure difference between the valve's start- and endpoints. In the closed mode, a valve allows no flow to pass, and decouples the pressures of the start- and endpoints of the arc. The variable $m^{op}_{a,t}$ represents a valve being open with value 1 and closed with value 0. The notation $\underline{x}$ and $\overline{x}$ refers to the lower and upper bounds of a variable $x$. The constraints describing valves are then as follows (a code sketch of these constraints is given at the end of this section):

$p_{u,t} - p_{v,t} \le (1 - m^{op}_{a,t})(\overline{p}_{u,t} - \underline{p}_{v,t})$ (21)
$p_{u,t} - p_{v,t} \ge (1 - m^{op}_{a,t})(\underline{p}_{u,t} - \overline{p}_{v,t})$ (22)
$q_{a,t} \le m^{op}_{a,t} \, \overline{q}_{a,t}$ (23)
$q_{a,t} \ge m^{op}_{a,t} \, \underline{q}_{a,t}$ (24)

As mentioned earlier, not all combinations of compressor station configurations and valve states are possible. We thus define a mapping $M(o, a)$ from an operation mode $o \in O$ to the discrete states of all $a \in A_{va} \cup A_{cs}$:

$M(o, a) := m$, the mode or configuration of arc $a$ in operation mode $o$, $\forall o \in O, \; \forall a \in A_{va} \cup A_{cs}$,
with $m \in \{op, cl\}$ if $a \in A_{va}$, and $m \in \{by, cl\} \cup C_a$ if $a \in A_{cs}$.

Using this mapping, we can then define a set of constraints for all valid combinations of compressor station and valve discrete states for each $t \in \mathcal{T}$. The variable $m^{om}_{o,t}$, $o \in O$, $t \in \mathcal{T}$, is a binary variable, where the value 1 represents the selection of operation mode $o$ at time step $t$:

$\sum_{o \in O} m^{om}_{o,t} = 1$ (25)
$m^{op}_{a,t} = \sum_{o \in O : M(o,a)=op} m^{om}_{o,t} \quad \forall a \in A_{va}$ (26)
$m^{by}_{a,t} = \sum_{o \in O : M(o,a)=by} m^{om}_{o,t} \quad \forall a \in A_{cs}$ (27)
$m^{cl}_{a,t} = \sum_{o \in O : M(o,a)=cl} m^{om}_{o,t} \quad \forall a \in A_{cs}$ (28)
$m^{cf}_{c,a,t} = \sum_{o \in O : M(o,a)=c} m^{om}_{o,t} \quad \forall c \in C_a, \; \forall a \in A_{cs}$ (29)
$m^{om}_{o,t} \in \{0, 1\} \quad \forall o \in O$

4.3 Flow Directions

Flow directions define the sign of flow values over the boundary nodes of a network station. With regard to our MILP, they are a further set of decision variables. We avoid generating these decisions with our deep learning framework, as not all combinations of operation modes and flow directions are feasible. These variables thus exist as integer variables in $P_\pi^{z_1}$, namely as a subset of $z_2$, see (1). They are few in number, however, due to the limited combinations that remain after the operation modes are fixed.

4.4 Boundary Nodes

Boundary nodes, unlike inner nodes, have prescribed flow and pressure values for all future time steps. For each boundary node $v \in V_b$ and $t \in \mathcal{T}$, we have the slack variables $\sigma^{p+}_{v,t}$ and $\sigma^{p-}_{v,t}$, which capture the positive and negative difference between the prescribed and realised pressure. In addition to these pressure slack variables, we have the inflow slack variables $\sigma^{d+}_{v,t}$ and $\sigma^{d-}_{v,t}$, which act in a similar manner but for inflow. The relationships between the slack values, prescribed values, and realised values can be modelled for each $v \in V_b$ and $t \in \mathcal{T}$ as:

$\hat{p}_{v,t} = p_{v,t} - \sigma^{p+}_{v,t} + \sigma^{p-}_{v,t} \quad \forall v \in V_b$ (30)
$\hat{d}_{v,t} = d_{v,t} - \sigma^{d+}_{v,t} + \sigma^{d-}_{v,t} \quad \forall v \in V_b$ (31)

Note that, unlike the model of [24], we do not allow the inflow over a set of boundary nodes to be freely distributed according to the group they belong to. This is an important distinction, as each single node has a complete forecast.

In addition to the forecast mentioned above, we also start our optimisation problem with an initial state. This initial state contains complete information on all discrete states and continuous values for all network elements at $t = 0$.

The objective of our formulation is to minimise both slack usage and changes in network operation. Specifically, it is a weighted sum of changes in the active element modes, changes in the continuous active points of operation, and the deviations from given pressure and flow demands. For the exact objective function, we refer readers to [24].
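As promised above, here is a Pyomo sketch of the valve constraints (21)-(24) for a single valve and a single time step; all bounds and names are illustrative placeholders, not data from [24]:

    import pyomo.environ as pyo

    m = pyo.ConcreteModel()
    # hypothetical pressure bounds for the valve's two endpoints u and v
    p_lb, p_ub = {"u": 1.0, "v": 1.0}, {"u": 80.0, "v": 80.0}
    q_lb, q_ub = -1000.0, 1000.0

    m.p = pyo.Var(["u", "v"], bounds=lambda _, n: (p_lb[n], p_ub[n]))
    m.q = pyo.Var(bounds=(q_lb, q_ub))
    m.open = pyo.Var(domain=pyo.Binary)  # m^op: 1 = open, 0 = closed

    # (21)-(22): pressures are coupled when open, decoupled when closed
    m.c21 = pyo.Constraint(expr=m.p["u"] - m.p["v"] <= (1 - m.open) * (p_ub["u"] - p_lb["v"]))
    m.c22 = pyo.Constraint(expr=m.p["u"] - m.p["v"] >= (1 - m.open) * (p_lb["u"] - p_ub["v"]))
    # (23)-(24): flow is forced to zero when closed
    m.c23 = pyo.Constraint(expr=m.q <= m.open * q_ub)
    m.c24 = pyo.Constraint(expr=m.q >= m.open * q_lb)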
5 Experimental Design

In this section we propose an experimental design to determine the effectiveness of our neural network approach. We outline how we generate synthetic training data, and show the exact architecture and training method we use for our neural network. Our final test set consists of 15 weeks of real-world data provided by our project partner OGE.
5.1 Synthetic Data Generation

As mentioned previously, acquiring gas network data is notoriously difficult [28, 46]. Perhaps because of this difficulty, there exists no standard method for generating valid states for a fixed gas network. Below we outline our methods for generating synthetic transient gas instances for training purposes, i.e. generating $\pi \in \Sigma$ and artificial $z_1$ values. For our application of transient gas instances, $\pi$ is a tuple of a boundary forecast and an initial state.

We consider network stations as our gas network topology. They contain all heavy machinery and at most only short segments of large-scale transport pipelines. As such, our gas networks cannot be used to store large amounts of gas. We thus aim to generate balanced demand scenarios, with the requirement described as follows:

$\sum_{v \in V_b} \hat{d}_{v,t} = 0 \quad \forall t \in \mathcal{T}$ (32)

The distribution of gas demand scenarios is not well known. Hence we naively assume a uniform distribution and, using the largest absolute flow value found over any node and time step in our real-world data, create a sampling interval as follows:

$M_q = \max_{v \in V_b, t \in \mathcal{T}} |\hat{d}_{v,t}|, \qquad \hat{d}_{v,t} \in [-c_q M_q, \; c_q M_q]$ (33)

where $c_q$ is a fixed scaling constant. In addition to the above, we impose three further requirements specific to our MILP formulation. Firstly, the absolute difference between the flow values of a node must not be too large for any adjacent time steps. Secondly, the sign of the generated flow values must match the attribute of the boundary node, i.e., entry (+) or exit (-). Thirdly, the flow values must not differ too much between boundary nodes of the same fence group within the same time step. A fence group, denoted by $g \in \mathcal{G}$, enforces the sign of all nodes in the group to be identical. These constraints are described below, for fixed tolerances $\delta_q$ and $\delta_g$:

$|\hat{d}_{v,t} - \hat{d}_{v,t-1}| \le \delta_q \quad \forall t \in \mathcal{T}, \; v \in V_b$
$\text{sign}(\hat{d}_{v,t}) = +1 \;\text{if}\; v \in V^+, \quad -1 \;\text{if}\; v \in V^- \quad \forall t \in \mathcal{T}, \; v \in V_b$
$|\hat{d}_{v_1,t} - \hat{d}_{v_2,t}| \le \delta_g \quad \forall t \in \mathcal{T}, \; v_1, v_2 \in g, \; g \in \mathcal{G}, \; v_1, v_2 \in V_b$ (34)

To generate demand scenarios that satisfy constraints (32) and (33), we use the method proposed in [39]. Its original purpose was to generate samples from the Dirichlet distribution, but a special case of the Dirichlet distribution is equivalent to a uniform distribution over a simplex in 3 dimensions. Such a simplex is exactly described by (32) and (33) for each time step. Hence we can apply the method for all time steps and reject all samples that do not satisfy constraints (34). Note that this method is insufficient for network stations with more than three boundary nodes.

In addition to flow demands, we require a pressure forecast for all boundary nodes. Our only requirements here are that the pressures of a single node do not fluctuate heavily between adjacent time steps and that the bounds are respected. We create a bound on the range of pressure values by finding maximum and minimum values over all nodes and time steps in our test set. We once again assume our samples to be uniformly distributed, and sample appropriately over (35), rejecting samples that do not respect constraint (36). Note that many scenarios generated by this approach are unlikely to happen in practice, as the pressure and flow profiles may not match.

$M^+_p = \max_{v \in V_b, t \in \mathcal{T}} \hat{p}_{v,t}, \qquad M^-_p = \min_{v \in V_b, t \in \mathcal{T}} \hat{p}_{v,t},$
$\hat{p}_{v,t} \in [M^-_p - c_p (M^+_p - M^-_p), \; M^+_p + c_p (M^+_p - M^-_p)]$ (35)

$|\hat{p}_{v,t} - \hat{p}_{v,t-1}| \le \delta_p \quad \forall t \in \mathcal{T}, \; v \in V_b$ (36)

Here $c_p$ and $\delta_p$ are again fixed constants. Combining the two procedures above yields the artificial forecast data generation method described in Algorithm 1.
Algorithm 1: Boundary Value Forecast Generator
Result: A forecast of pressure and flow values over the time horizon
  flow_forecast = sample the simplex given by (32) and (33) uniformly, rejecting via (34);
  pressure_forecast = sample (35) uniformly, rejecting via (36);
  return (flow_forecast, pressure_forecast)
During offline training, $D_{\theta_2}$ requires optimal solutions for fixed $z_1$ values. In Algorithm 2 we outline a naive yet effective approach for generating reasonable $z_1$ values, i.e., operation mode sequences.
Algorithm 2: Operation Mode Sequence Generator
Result: An operation mode per time step
  operation_modes = [];
  for t = 1; t < |T_0|; t = t + 1 do
    if t == 1 then
      new_operation_mode = rand(O);
    else if rand(0,1) >= p then
      new_operation_mode = rand(O \ {new_operation_mode});
    end
    operation_modes.append(new_operation_mode);
  end
  return operation_modes

Here p is a fixed switching threshold, and rand denotes uniform random selection.

Many coefficients of $A_\pi$ are invariant due to the static network topology. Many others, however, are found by substituting multiple parameters into equations describing gas properties. This information is contained in the initial state, and we generate these values similarly to the boundary forecasts:

$c_{state} \in \{\text{Temperature, Inflow Norm Density, Molar Mass, Pseudo-critical Temperature, Pseudo-critical Pressure}\}$ (37)

$M^+_c = \max_{state} c_{state}, \qquad M^-_c = \min_{state} c_{state},$
$c_{state} \in [M^-_c - c_c (M^+_c - M^-_c), \; M^+_c + c_c (M^+_c - M^-_c)]$ (38)

with $c_c$ again a fixed constant, and where the maximum and minimum are taken over the initial states in our data. We now have the tools to generate synthetic initial states, see Algorithm 3. Algorithm 3 is designed to output varied and valid initial states w.r.t. our MILP formulation. However, it comes with some drawbacks. Firstly, the underlying distributions of demand scenarios for both flow and pressure are probably neither uniform nor conditionally independent. Moreover, the sampling range we use is significantly larger than that of our test set, as we take single maximum and minimum values over all nodes. Secondly, the frequencies with which operation modes occur in reality are also not uniform: some operation modes occur much more often than others. Our data is thus more dynamic than reality, and likely to contain operation mode choices that do not match the demand scenarios. Finally, we rely on a MILP solver to generate new initial states in our final step. Hence we cannot rule out the possibility of a slight bias. One example would be the case of a repeated scenario that has multiple optimal solutions, but for which the MILP solver always returns an identical solution.
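Before turning to Algorithm 3, here is a minimal Python rendering of Algorithm 2's random walk over operation modes; the switching threshold p_switch is a placeholder, as is the example mode set:

    import random

    def operation_mode_sequence(modes, n_steps, p_switch=0.5):
        """Sketch of Algorithm 2: keep the current operation mode, or switch
        to a different, uniformly chosen one with probability 1 - p_switch."""
        sequence = [random.choice(modes)]
        for _ in range(n_steps - 1):
            current = sequence[-1]
            if random.random() >= p_switch:
                current = random.choice([m for m in modes if m != current])
            sequence.append(current)
        return sequence

    # e.g. 12 future time steps over the 56 operation modes of Station D
    print(operation_mode_sequence(modes=list(range(56)), n_steps=12))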
Algorithm 3: Initial State Generator
Input: Desired time-step distance j ∈ [1, ..., k]
Result: An initial state to the transient gas optimisation problem
  flow_forecast, pressure_forecast = Boundary Value Forecast Generator();   // Algorithm 1
  gas_constants = sample (38) uniformly;
  initial_state = select a random state from the real-world data;
  π = (flow_forecast, pressure_forecast, gas_constants, initial_state);
  z_1 = Operation Mode Sequence Generator();   // Algorithm 2
  P^{z_1}_π = generate from π and z_1;
  (state_1, ..., state_k) = optimal solution states from solving P^{z_1}_π;
  return state_j

Note that in general our π does not include gas constants, as this information is encoded in the initial state. Here, however, the gas constants are randomly generated and may not match the initial state. This does not affect solving, as these values are simply taken as truths.
Algorithm 4: Synthetic Gas Data Generator
Input: num_states, num_scenarios, time_step_difference
Result: num_scenarios many gas instances and their optimal solutions
  initial_states = [];
  for i = 0; i < num_states; i = i + 1 do
    initial_states.append(Initial State Generator(time_step_difference));   // Algorithm 3
  end
  forecasts = [];
  for i = 0; i < num_scenarios; i = i + 1 do
    flow_forecast, pressure_forecast = Boundary Value Forecast Generator();   // Algorithm 1
    forecasts.append((flow_forecast, pressure_forecast));
  end
  solve_data = [];
  for i = 0; i < num_scenarios; i = i + 1 do
    z_1 = Operation Mode Sequence Generator();   // Algorithm 2
    initial_state = uniformly select from initial_states;
    π = (forecasts[i], initial_state);
    P^{z_1}_π = create MILP from π and z_1;
    solution = solve P^{z_1}_π;
    solve_data.append((z_1, π, solution));
  end
  return solve_data
In the case of initial state generation, we believe that further research needs to be performed. Our method is effective in the context of machine learning, where we aim for a diverse data set, but it is naive and incapable of ensuring that the generated boundary scenarios are realistic.
5.2 Training

To train $D_{\theta_2}$ and $G_{\theta_1}$, we need both the transient gas transportation scenario and an optimal solution for it. Combining the generation methods for synthetic data in subsection 5.1 with the solving process of the created instances, we derive Algorithm 4.

We generated our initial training and validation sets offline, using Algorithm 4 with inputs num_states = 10 and time_step_difference = 8. This initial data is used exclusively for training $D_{\theta_2}$, and is split into training, test, and validation sets in the ratio 8:1:1. The test set is checked against at every epoch, while the validation set is only referred to at the end of the initial training. Following this initial training, we begin to train $N_{\theta_1,\theta_2}$ as a whole, alternating between $G_{\theta_1}$ and $D_{\theta_2}$. The exact method is given in Algorithm 5, which references functions provided in Appendix A. For training, we used the Adam algorithm [27] as our descent method. The parameters of this algorithm, and a complete set of other training parameters, are listed in Table 4; where a parameter is not listed, the default value was used. The intention behind our training method is to ensure that $N_{\theta_1,\theta_2}$ receives no real-world data prior to its final evaluation. With this method we hope to show that synthetic data is sufficient for training purposes and that $N_{\theta_1,\theta_2}$ successfully generalises to additional data sets. We note, however, that Algorithm 3 does use real-world data as a starting point from which to generate artificial data.

Algorithm 5: Neural Network Training
Input: Neural network $N_{\theta_1,\theta_2}$, prelabelled data
Result: Trained neural network $N_{\theta_1,\theta_2}$
  set_trainable($D_{\theta_2}$); set_untrainable($G_{\theta_1}$);
  Discriminator Pretraining($D_{\theta_2}$, prelabelled_data);   // Algorithm 9
  softmax_temperature = 0;
  data = [];
  for i = 0; i < num_epochs do
    set_trainable($G_{\theta_1}$); set_untrainable($D_{\theta_2}$);
    for i = 0; i < num_generator_epochs do
      softmax_temperature += 1;
      set($G_{\theta_1}$, softmax_temperature);
      loss = Generator Training($N_{\theta_1,\theta_2}$);   // Algorithm 11
      if loss <= stopping_loss_generator then break;
    end
    set_trainable($D_{\theta_2}$); set_untrainable($G_{\theta_1}$);
    data = Prepare Discriminator Training Data($N_{\theta_1,\theta_2}$, data);   // Algorithm 8
    mixed_data = MixData(data, prelabelled_data, num_prelabelled);
    training_data, test_data = split_data(mixed_data, ratio_test);
    optimizer = Adam(learning_rate, weight_decay);   // [27]
    lr_scheduler = ReduceLROnPlateau(patience, factor);
    dataloader = DataLoader(training_data, batch_size, shuffle=True);
    for i = 0; i < num_discriminator_epochs do
      Discriminator Training Loop($D_{\theta_2}$, dataloader, optimizer);   // Algorithm 10
      lr_scheduler.step();
      test_loss = compute_L1Loss($D_{\theta_2}$, test_data);
      if test_loss <= stopping_loss_discriminator then break;
    end
  end
  return $N_{\theta_1,\theta_2}$

(Adam and ReduceLROnPlateau refer to the standard PyTorch optimiser and learning rate scheduler; see pytorch.org/docs/stable/optim.html.)
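The core alternation of Algorithm 5 can be condensed into the following PyTorch sketch; the network objects, the data iterators, and the set_softmax_temperature helper are placeholders for the components defined in Appendix A of the paper:

    import torch

    def alternating_training(generator, discriminator, scenarios, labelled_data,
                             num_epochs, num_gen_epochs, num_disc_epochs):
        temperature = 0
        for _ in range(num_epochs):
            # Generator phase: discriminator frozen, loss (12), temperature raised
            for p in discriminator.parameters():
                p.requires_grad_(False)
            opt_g = torch.optim.Adam(generator.parameters())
            for _ in range(num_gen_epochs):
                temperature += 1
                generator.set_softmax_temperature(temperature)  # assumed helper
                for pi in scenarios:
                    loss = discriminator(generator(pi), pi).mean()
                    opt_g.zero_grad()
                    loss.backward()  # gradients flow through frozen D into G
                    opt_g.step()
            # Discriminator phase: loss (11)/(14) against MILP-solver labels
            for p in discriminator.parameters():
                p.requires_grad_(True)
            opt_d = torch.optim.Adam(discriminator.parameters())
            for _ in range(num_disc_epochs):
                for z1, pi, f_value in labelled_data:  # includes fresh G outputs
                    loss = (discriminator(z1, pi) - f_value).abs().mean()
                    opt_d.zero_grad()
                    loss.backward()
                    opt_d.step()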
We consider the solution of $P_\pi^{\hat{z}_1}$ as a primal heuristic for the original problem $P_\pi$. Due to our usage of slack, i.e. the variables $x_2$, any valid solution of $P_\pi^{z_1}$ is a valid solution of $P_\pi$. We aim to incorporate $N_{\theta_1,\theta_2}$ in a global MIP context, and do this by using a partial solution of $P_\pi^{\hat{z}_1}$ as a warm-start suggestion for $P_\pi$. The partial solution consists of $\hat{z}_1$; an additional set of binary variables called the flow directions, which are a subset of $z_2$ in (1); and $p_{v,t} \; \forall v \in V_b, t \in \mathcal{T}$, which are a subset of $x_1$ in (1). Note that partial solutions are used because the instances are numerically difficult. In doing so, we hope to generate valid solutions quickly and speed up the global solution process. The primal heuristic and the warm-start algorithm can be seen in Algorithms 6 and 7 respectively.

Algorithm 6:
Primal Heuristic
Input: $P_\pi$
Result: Optimal solution of $P_\pi^{\hat{z}_1}$, a primal solution of $P_\pi$
  $\hat{z}_1$ = $G_{\theta_1}(\pi)$;
  $P_\pi^{\hat{z}_1}$ = create MILP from $\pi$ and $\hat{z}_1$;
  solution = solve $P_\pi^{\hat{z}_1}$;
  return solution

Algorithm 7:
Warm Start Algorithm
Input: $P_\pi$
Result: Optimal solution of $P_\pi$
  primal_solution = Primal Heuristic($P_\pi$);   // Algorithm 6
  optimum = solve $P_\pi$ with primal_solution as a warm-start suggestion;
  return optimum
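A sketch of Algorithm 7 with Pyomo and Gurobi, matching the toolchain described below; the model construction and the partial solution are placeholders:

    import pyomo.environ as pyo

    def warm_started_solve(model, partial_solution):
        """Solve P_pi with a partial solution (e.g. from Algorithm 6) as a MIP start.

        `partial_solution` maps a subset of the model's variables (here: the
        generated z_1, the flow directions, and the boundary pressures) to
        suggested values.
        """
        for var, value in partial_solution.items():
            var.value = value                   # load the suggestion
        solver = pyo.SolverFactory("gurobi")
        # warmstart=True passes the current variable values to Gurobi as a MIP start
        return solver.solve(model, warmstart=True, tee=False)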
For our experiments, we used PyTorch 1.4.0 [34] as our ML modelling framework, Pyomo v5.5.1 [21, 22] as our MILP modelling framework, and Gurobi v9.02 [19] as our MILP solver. The MILP solver settings are available in Table 5 in Appendix A. $N_{\theta_1,\theta_2}$ was trained on a machine running Ubuntu 18, with 384 GB of RAM, two Intel(R) Xeon(R) Gold 6132 CPUs running at 2.60 GHz, and four NVIDIA Tesla V100 (16 GB) GPUs. The final evaluations were performed on a cluster, using 4 cores and 16 GB of RAM of a machine with two
Intel Xeon E5-2680 CPUs running at 2.70 GHz.

Our validation set for the final evaluation of $N_{\theta_1,\theta_2}$ consists of 15 weeks of live real-world data from our project partner OGE. Instances are on average 15 minutes apart over this period and total 9291.

All instances, both in training and test, contain 12 time steps (excluding the initial state) with 30 minutes between each step. Additionally, we focus on Station D from [24], and present results only for this station. The statistics of Station D can be seen in Table 1, and its topology in Figure 4. Station D can be thought of as a T-intersection, and is of average complexity compared to the stations presented in [24]. The station contains 6 boundary nodes, but they are paired, such that for each pair only one can be active, i.e., have non-zero flow. Due to this, our sampling method in subsection 5.1 operates in 3 dimensions and is uniform for all $t \in \mathcal{T}$.

Name | |V| | |A| | Σ_{a∈A_pi} L_a | |A_pi| | |C_a| ∀a ∈ A_cs | |O| | |V_b| | |A_va|
D    | 31  | 37  | 0.404 km       | 2      | 6               | 56  | 3×2   | 11

Table 1: Overview of different properties of Station D.

Figure 4: Topology of Station D.
As a large portion of our input data for both $G_{\theta_1}$ and $D_{\theta_2}$ is time-expanded, we originally believed that the ideal design would be a series of LSTMs [25]. Preliminary results showed, however, that convolutional neural networks (CNNs) were more effective for our problem, in particular when using inception blocks, see [42]. The exact block design used in $N_{\theta_1,\theta_2}$ can be seen in Figure 3, and the general layout in Figure 1. For the complete network design, we refer readers to Figure 14 and Table 6 in the Appendix.
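A minimal PyTorch sketch of a 1-D convolutional block in the spirit of Figure 3; the channel counts and the residual projection are our illustrative choices, and the paper's exact architecture is the one given in its Figure 14 and Table 6:

    import torch
    import torch.nn as nn

    class Block1D(nn.Module):
        """Inception/ResNet-v2-flavoured 1-D block: parallel kernel-1 and
        kernel-3 branches over the time dimension, concatenated and projected."""
        def __init__(self, channels):
            super().__init__()
            self.branch1 = nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.ReLU(),
                nn.Conv1d(channels, channels, kernel_size=1),
            )
            self.branch3 = nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=1),
                nn.ReLU(),
                nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            )
            self.project = nn.Conv1d(2 * channels, channels, kernel_size=1)

        def forward(self, x):                    # x: (batch, channels, time)
            y = torch.cat([self.branch1(x), self.branch3(x)], dim=1)
            return torch.relu(x + self.project(y))   # residual connection

    print(Block1D(8)(torch.randn(2, 8, 12)).shape)  # torch.Size([2, 8, 12])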
6 Computational Results

We partition our results into three subsections. The first focuses on the training results of $N_{\theta_1,\theta_2}$, the second on our data generation methods, while the third is concerned with our results on the 15 weeks of real-world transient gas data. Note that during training we scaled the $f(P_\pi^{z_1})$ values by 500 to reduce the magnitude of the losses. For the purpose of visually comparing the performance of $N_{\theta_1,\theta_2}$ and our data generation methods, we re-scaled all results.

6.1 Training Results

Figure 5: The loss per epoch of $D_{\theta_2}$ during the initial training of Algorithm 9. The dashed lines show the performance of $D_{\theta_2}$ after $N_{\theta_1,\theta_2}$ has been completely trained.

Figure 5 shows the training loss throughout the initial offline training. We see that $D_{\theta_2}$ learns how to accurately predict $f(P_\pi^{z_1})$, as the loss decreases. This is a required result, as without a trained discriminator we cannot expect to train a generator. Both the training and test loss converge to approximately 1000, which is excellent considering the generated $f(P_\pi^{z_1})$ values range well into the millions. As visible from both the test loss and the final validation loss, $D_{\theta_2}$ generalises to $P_\pi^{z_1}$ instances of our validation set that it has not seen. This generalisation ability does not translate perfectly to real-world data, however, because the underlying distributions of the real-world data and our generated data are substantially different. Despite this, we believe that an L1 loss of 10000, here simply the average distance between $\hat{f}(P_\pi^{z_1})$ and $f(P_\pi^{z_1})$, is still very good. We discuss the issues of the differing distributions in subsection 6.2.

Figure 6: The loss per epoch of $D_{\theta_2}$ as it is trained using Algorithm 5.

The loss during training with Algorithm 5 is shown in Figure 6 for $D_{\theta_2}$, and in Figure 7 for $G_{\theta_1}$. The cyclical nature of the $D_{\theta_2}$ loss is caused by the re-training of $G_{\theta_1}$, which learns how to induce sub-optimal predictions from the then-static $D_{\theta_2}$. These sub-optimal predictions are quickly re-learned, but they highlight that learning how to perfectly predict $f(P_\pi^{\hat{z}_1})$ over all possibilities, potentially due to the rounded nature of $\hat{z}_1$, is unlikely without some error. Figure 7 (left) shows the loss over time of $G_{\theta_1}$ as it is trained, with Figure 7 (right) displaying magnified losses for the final epochs. We observe that $G_{\theta_1}$ quickly learns important $z_1$ decision values. We hypothesise that this quick descent is helped by $\hat{z}_1$ values that are unlikely given our generation method in Algorithm 2. The loss increases following this initial decrease in the case of $G_{\theta_1}$, showing the ability of $D_{\theta_2}$ to further improve. It should also be noted that significant step-like decreases in loss are absent in both plots of Figure 7. Such steps would indicate $G_{\theta_1}$ discovering new important $z_1$ values (operation modes). The diversity of produced operation modes, however (see Figure 12), implies that a complete spanning set of operation modes is derived early in training, and that the usage of their ratios is then learned and improved.

6.2 Data Generation Results

As an interlude between results for $N_{\theta_1,\theta_2}$, we outline the performance of our synthetic gas network data generation methods. Figure 8 (left) shows how our generated flow prognosis compares to that of historic real-world data. We see that Nodes A, B, and C are not technically entries or exits, but over the historical data each is dominated by a single orientation. Specifically, Node C is the general entry, and Nodes A and B are the exits. In addition to the general orientation, we see that each node has significantly different ranges and distributions.
These observations highlight the simplicity of our data generation methods, as we see near-identical distributions for all nodes over the artificial data. We believe this calls for further research into prognosis generation methods.

Figure 7: The loss per epoch of $G_{\theta_1}$ as it is trained using Algorithm 5. (Left) The loss over all epochs. (Right) A magnified view of the loss starting from epoch 20.

Figure 8 (right) shows our pressure prognosis compared to historic values. Unlike the historic flow values, we observe little difference between the historic pressure values of different nodes. This is supported by the optimal choices $z_1^*$ over the historic data, see Figure 12, as in a large number of cases compression is not needed and the network station is in bypass. Note that each corresponding entry (+) and exit (-) have identical pressure distributions due to the way they are constructed.

A further comparison of our generated data to historic data can be seen in Figure 9. It shows the distributions of $\hat{f}(P_\pi^{\hat{z}_1})$ and $f(P_\pi^{\hat{z}_1})$ for the generated validation set, and of $\hat{f}(P_\pi^{z_1^*})$ and $f(P_\pi)$ for the real-world data. As expected, the distributions differ depending on whether the data is artificial or not. Our data generation was intended to be simplistic, and as independent as possible from the historic data. As such, the average generated scenario has an optimal solution value larger than that of any real-world data point. The performance of $D_{\theta_2}$ is again clearly visible here, with $\hat{f}(P_\pi^{\hat{z}_1})$ and $f(P_\pi^{\hat{z}_1})$ being near identical over the artificial data, keeping in mind that these data points were never used in training. This ability to generalise is relatively much worse on real-world data, mainly due to the lower values of $f(P_\pi)$ over this data. Figure 9 (right) shows the results with log-scale axes to better highlight this disparity. It should be noted that the real-world instances with larger $f(P_\pi)$ are predicted quite well, and that all real-world instances have an L1 distance between $\hat{f}(P_\pi^{z_1^*})$ and $f(P_\pi)$ that is small in terms of absolute differences.

Figure 8: Comparison of generated flow (left) and pressure (right) value distributions per node vs. the distributions seen in real-world data.

6.3 Results on Real-World Data

We now present the results of our fully trained $N_{\theta_1,\theta_2}$ applied to the 15 weeks of real-world data. Note that we had to remove 651 of our 9291 instances, as the warm-start resulted in an optimal solution value further away than the optimality tolerances we set. These instances have been kept in the graphics, but are marked, and conclusions will not be drawn from them. We believe the problems with reproducibility are caused by numerical difficulties in managing the pipe equality constraints.

Figure 10 shows the comparison of $f(P_\pi^{\hat{z}_1})$ and $f(P_\pi)$. In a similar manner to $D_{\theta_2}$, we see that $G_{\theta_1}$ struggles with instances where $f(P_\pi)$ is small. This is visible in the bottom left, where we see $f(P_\pi^{\hat{z}_1})$ values much larger than $f(P_\pi)$ for like $\pi$. This comes as little surprise given the struggle of $D_{\theta_2}$ with small $f(P_\pi)$ values. Drawing conclusions becomes more complicated for instances with larger $f(P_\pi)$ values, because the majority hit the time limit. We can clearly see, however, the value of our primal heuristic. There are many cases, those below the line $f(P_\pi^{\hat{z}_1}) = f(P_\pi)$, where our primal heuristic retrieves a better solution than the MILP solver does in one hour.
Additionally, we see that no unsolved point above the line is very far from the line, showing that our primal heuristic produced a comparable, sometimes equivalent, solution in a much shorter time frame. For a comparison of solve times, see Table 2.

Figure 9: $\hat{f}(P_\pi^{\hat{z}_1})$ for the validation set, and $\hat{f}(P_\pi^{z_1^*})$ for real-world data, compared to $f(P_\pi^{\hat{z}_1})$ and $f(P_\pi)$ respectively. Linear scale (left) and log scale (right).

                                                      Mean     Median  STD      Min    Max
$N_{\theta_1,\theta_2}$ inference time (s)            0.009    0.008   0.001    0.008  0.017
Warm-started $P_\pi$ time (s)                         100.830  9.380   421.084  0.130  3600.770
$P_\pi$ time (s)                                      147.893  24.380  463.279  3.600  3601.280
$P_\pi^{\hat{z}_1}$ + warm-started $P_\pi$ time (s)   103.329  12.130  424.543  0.190  3726.110
$P_\pi^{\hat{z}_1}$ time (s)                          2.499    1.380   12.714   0.060  889.380
Table 2: Solve time statistics for different solving strategies.

Figure 11 shows the performance of the predictions $\hat{f}(P_\pi^{\hat{z}_1})$ compared to $f(P_\pi^{\hat{z}_1})$. Interestingly, $D_{\theta_2}$ generally predicts $\hat{f}(P_\pi^{\hat{z}_1})$ values slightly larger than $f(P_\pi^{\hat{z}_1})$. We expect this for the smaller-valued instances, as we know that $D_{\theta_2}$ struggles with $f(P_\pi^{\hat{z}_1})$ values near 0, but the trend is evident for larger-valued instances too. The closeness of the data points to the line $\hat{f}(P_\pi^{\hat{z}_1}) = f(P_\pi^{\hat{z}_1})$ shows that $D_{\theta_2}$ can adequately predict the $\hat{z}_1$ solutions from $G_{\theta_1}$ despite the change in data sets. Figure 10 showed that $G_{\theta_1}$ successfully generalised to a new data set, albeit with difficulties around instances with $f(P_\pi)$ valued near 0. From Figures 10 and 11, we can see that the entire network $N_{\theta_1,\theta_2}$ generalises to unseen real-world instances, despite some generalisation loss.

We now compare the operation modes $\hat{z}_1$ generated by $G_{\theta_1}$ with the $z_1^*$ produced by our MILP solver. To do so, we use the following naming convention: we name the three pairs of boundary nodes N (north), S (south), and W (west). Using W NS C 2 as an example, flow comes from W and goes to N and S. The C in the name stands for active compression, and the final index differentiates between duplicate names.

Figure 10: A comparison of $f(P_\pi^{\hat{z}_1})$ and $f(P_\pi)$ for all real-world data instances.

               (1)  (2)  (3) (4)    (5)  (6)  (7)   (8) (9) (10) (11)
(1)  NW NS 1   884  22   0   9529   31   37   2436  4   24  397  82
(2)  NS SW 2   48   102  1   40298  0    114  630   24  0   51   13
(3)  N SW C 1  0    27   65  11008  0    4    0     2   0   0    55
(4)  NS NSW 1  41   29   0   26509  0    28   557   9   0   49   15
(5)  W NS C 1  0    0    0   0      0    0    0     0   0   0    0
(6)  NS SW 1   0    0    0   76     0    1    0     0   0   0    0
(7)  NW S 2    4    0    0   0      0    0    2     0   0   1    1
(8)  NS SW 3   6    7    0   5220   0    7    108   1   0   4    5
(9)  W NS C 2  28   0    0   0      136  0    0     0   93  0    0
(10) NW S 1    30   11   0   2880   0    12   315   2   0   30   6
(11) Other     0    1    0   78     0    0    0     0   0   0    1

Rows give the operation mode generated by $G_{\theta_1}$ ($\hat{z}_1$); columns give the operation mode selected by the MILP solver ($z_1^*$), in the same order as the rows.
Table 3: Operation mode correlation matrix between $\hat{z}_1$ and $z_1^*$.

Figure 11: A comparison of $\hat{f}(P_\pi^{\hat{z}_1})$ and $f(P_\pi^{\hat{z}_1})$ for all real-world data instances.

As seen in Figure 12, which plots the frequency of specific $z_1$ choices that occurred more than 50 times, a single choice dominates $z_1^*$. This is interesting, because we expected there to be a lot of symmetry between $z_1$ choices, with the MILP solver selecting symmetric solutions with equal probability. For instance, take W NS C 1 and W NS C 2. $N_{\theta_1,\theta_2}$ only ever predicts W NS C 2, while the MILP solver selects each of them with half the frequency. This indicates that from the MILP's point of view they are symmetric, and either can be chosen, while $N_{\theta_1,\theta_2}$ has recognised this and converged to a single choice. We can support this by analysing the data: the difference between W NS C 1 and W NS C 2 is which compressor machine is used, and both machines are identical. This duplicate choice apparently does not exist in bypass modes, however, where the uniqueness of $z_1$, determined by the valve states, results in different $f(P_\pi^{z_1})$ values. It is observable that for the majority of instances NS NSW 1 is the optimal choice, and that $N_{\theta_1,\theta_2}$ has failed to identify its central importance. We believe this is due to the training method, where over-generalisation to a single choice is strongly punished. For a comprehensive overview of the selection of operation modes and the correlation between $\hat{z}_1$ and $z_1^*$, we refer interested readers to Table 3.

As discussed above, $N_{\theta_1,\theta_2}$ cannot reliably produce $z_1^*$. Nevertheless, it produces near-optimal $\hat{z}_1$ suggestions, which are still useful in a warm-start context, see Algorithm 7. The results of our warm-start algorithm are displayed in Figure 13. Our warm-start suggestion was successful 72% of the time, and the algorithm resulted in an average speed-up of 60.5%. For this measurement we use the shifted geometric mean with a shift of 1, i.e. $\left(\prod_{i=1}^{n}(t_i + 1)\right)^{1/n} - 1$ for solve times $t_1, \ldots, t_n$, to avoid distortion by relative variations of the smaller-valued instances. Especially surprising is that some instances that were previously unsolvable within the time limit became easily solvable given the warm-start suggestion. In addition, many of the solvable but complicated instances are also solved near-instantly with the warm-start suggestion. As such, we have created an effective primal heuristic that is both quick to run and beneficial in the context of locating a globally optimal solution.

Figure 12: Frequency of operation mode choices by $G_{\theta_1}$ compared to the MILP solver for all real-world instances. (Left) Linear scale, and (right) log scale.

7 Conclusion

In this paper, we presented a dual neural network design for generating decisions in a MILP. This design is trained without ever solving the MILP with unfixed decision variables. The neural network is used both as a primal heuristic and to warm-start the MILP solver for the original problem. We showed the usefulness of our design on the transient gas transportation problem. In doing so, we created methods for generating synthetic transient gas data for training purposes, reserving an unseen set of 9291 real-world instances for validation purposes.
Especially surprising is that some instances that were previously unsolvable within the time limit were easily solvable given the warm-start suggestion. In addition, many of the solvable but complicated instances are also solved near instantly with the warm-start suggestion. As such, we have created an effective primal heuristic that is both quick to run and beneficial in the context of locating a globally optimal solution.

Figure 13: The combined running time of solving $P_{\pi}^{\hat{z}}$ and solving a warm-started $P_{\pi}$, compared to solving $P_{\pi}$ directly.

Conclusion

In this paper, we presented a dual neural network design for generating discrete decisions in a MILP. This design is trained without ever solving the MILP with unfixed decision variables. The neural network is used both as a primal heuristic and to warm-start the MILP solver for the original problem. We demonstrated the usefulness of our design on the transient gas transportation problem. In doing so, we created methods for generating synthetic transient gas data for training purposes, reserving an unseen 9291 real-world instances for validation purposes. Despite some generalisation loss, our trained neural network results in a primal heuristic that takes on average 2.5s to run, and yields a 60.5% decrease in global optimal solution time when used in a warm-start context.

While our approach is an important step forward in neural network design and in ML's application to gas transport, we believe there exist four primary directions for future research. The first is to convert our approach into more traditional reinforcement learning, and then utilise policy gradient approaches, see [44]. The major hurdle here is that much of the computation would be shifted online, requiring many more calls to solve the induced MILPs. This could be offset, however, by using our technique to initialise the weights for such an approach, thereby avoiding early-stage training difficulties of policy gradient methods. The second concerns the recent improvements in Graph Neural Networks, see [16]: their ability to generalise to different input sizes would permit the creation of a single NN over multiple network stations or gas network topologies. Thirdly, there exists a large gap in the literature w.r.t. data generation for transient gas networks; improved methods are needed that are scalable and produce realistic, real-world-like data. Finally, although we focused on the transient gas transportation problem, our approach can be generalised to arbitrary problem classes.

Acknowledgements
The work for this article has been conducted in the Research Campus MODAL funded by the German Federal Ministry of Education and Research (BMBF) (fund numbers 05M14ZAM, 05M20ZBM), and was supported by the German Federal Ministry of Economic Affairs and Energy (BMWi) through the project UNSEEN (fund no. 03EI1004D).

References

[1] T. Achterberg. Constraint Integer Programming. PhD thesis, Technische Universität Berlin, 2007.
[2] E. Balas. The convex hull of a disjunctive set. In Disjunctive Programming, pages 17–39. Springer International Publishing, Cham, 2018.
[3] R. Baltean-Lugojan, P. Bonami, R. Misener, and A. Tramontani. Scoring positive semidefinite cutting planes for quadratic optimization via trained neural networks. Optimization Online preprint 2018/11/6943, 2019.
[4] J. Beliën, E. Demeulemeester, and B. Cardoen. A decision support system for cyclic master surgery scheduling with multiple objectives. Journal of Scheduling, 12(2):147, 2009.
[5] Y. Bengio, A. Lodi, and A. Prouvost. Machine learning for combinatorial optimization: a methodological tour d'horizon. arXiv preprint arXiv:1811.06128, 2018.
[6] D. Bertsimas and B. Stellato. The voice of optimization. arXiv preprint arXiv:1812.09991, 2018.
[7] D. Bertsimas and B. Stellato. Online mixed-integer optimization in milliseconds. arXiv preprint arXiv:1907.02206, 2019.
[8] R. Burlacu, H. Egger, M. Groß, A. Martin, M. E. Pfetsch, L. Schewe, M. Sirvent, and M. Skutella. Maximizing the storage capacity of gas networks: a global MINLP approach. Optimization and Engineering, 20(2):543–573, 2019.
[9] Z. Chen, Y. Zhong, X. Ge, and Y. Ma. An actor-critic-based UAV-BSs deployment method for dynamic environments. arXiv preprint arXiv:2002.00831, 2020.
[10] S. Dempe. Foundations of Bilevel Programming. Springer Science & Business Media, 2002.
[11] J.-Y. Ding, C. Zhang, L. Shen, S. Li, B. Wang, Y. Xu, and L. Song. Optimal solution predictions for mixed integer programs. arXiv preprint arXiv:1906.09575, 2019.
[12] M. Etheve, Z. Alès, C. Bissuel, O. Juan, and S. Kedad-Sidhoum. Reinforcement learning for variable selection in a branch and bound algorithm. arXiv preprint arXiv:2005.10026, 2020.
[13] J. Fang, Q. Zeng, X. Ai, Z. Chen, and J. Wen. Dynamic optimal energy flow in the integrated natural gas and electrical power systems. IEEE Transactions on Sustainable Energy, 9(1):188–198, 2017.
[14] A. Ferber, B. Wilder, B. Dilkina, and M. Tambe. MIPaaL: Mixed integer program as a layer. arXiv preprint arXiv:1907.05912, 2019.
[15] M. Fischetti and A. Lodi. Local branching. Mathematical Programming, 98(1-3):23–47, 2003.
[16] M. Gasse, D. Chételat, N. Ferroni, L. Charlin, and A. Lodi. Exact combinatorial optimization with graph convolutional neural networks. In Advances in Neural Information Processing Systems, pages 15554–15566, 2019.
[17] I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
[18] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[19] Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2020.
[20] H. Hanachi, C. Mechefske, J. Liu, A. Banerjee, and Y. Chen. Performance-based gas turbine health monitoring, diagnostics, and prognostics: A survey. IEEE Transactions on Reliability, 67(3):1340–1363, 2018.
[21] W. E. Hart, C. D. Laird, J.-P. Watson, D. L. Woodruff, G. A. Hackebeil, B. L. Nicholson, and J. D. Siirola. Pyomo - Optimization Modeling in Python, volume 67. Springer Science & Business Media, second edition, 2017.
[22] W. E. Hart, J.-P. Watson, and D. L. Woodruff. Pyomo: modeling and solving mathematical programs in Python. Mathematical Programming Computation, 3(3):219–260, 2011.
[23] F. Hennings. Benefits and limitations of simplified transient gas flow formulations. In Operations Research Proceedings 2017, pages 231–237. Springer, 2018.
[24] F. Hennings, L. Anderson, K. Hoppmann-Baum, M. Turner, and T. Koch. Controlling transient gas flow in real-world pipeline intersection areas. Optimization and Engineering, pages 1–48, 2020.
[25] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[26] K. Hoppmann, F. Hennings, R. Lenz, U. Gotzes, N. Heinecke, K. Spreckelsen, and T. Koch. Optimal operation of transient gas transport networks. Technical Report 19-23, ZIB, Takustr. 7, 14195 Berlin, 2019.
[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[28] F. Kunz, M. Kendziorski, W.-P. Schill, J. Weibezahn, J. Zepter, C. R. von Hirschhausen, P. Hauser, M. Zech, D. Möst, S. Heidari, et al. Electricity, heat, and gas sector data for modeling the German system. Technical report, DIW Data Documentation, 2017.
[29] D. Masti and A. Bemporad. Learning binary warm starts for multiparametric mixed-integer quadratic programming. Pages 1494–1499. IEEE, 2019.
[30] M. MohamadiBaghmolaei, M. Mahmoudy, D. Jafari, R. MohamadiBaghmolaei, and F. Tabkhi. Assessing and optimization of pipeline system performance using intelligent systems. Journal of Natural Gas Science and Engineering, 18:64–76, 2014.
[31] S. Moritz. A Mixed Integer Approach for the Transient Case of Gas Network Optimization. PhD thesis, Technische Universität, 2007.
[32] V. Nair, S. Bartunov, F. Gimeno, I. von Glehn, P. Lichocki, I. Lobov, B. O'Donoghue, N. Sonnerat, C. Tjandraatmadja, P. Wang, R. Addanki, T. Hapuarachchi, T. Keck, J. Keeling, P. Kohli, I. Ktena, Y. Li, O. Vinyals, and Y. Zwols. Solving mixed integer programs using neural networks, 2020.
[33] A. J. Osiadacz. Different transient flow models - limitations, advantages, and disadvantages. In PSIG-9606. Pipeline Simulation Interest Group, 1996.
[34] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8026–8037, 2019.
[35] M. Petkovic, Y. Chen, I. Gamrath, U. Gotzes, N. S. Hadjidimitriou, J. Zittel, and T. Koch. A hybrid approach for high precision prediction of gas flows. Technical Report 19-26, ZIB, Takustr. 7, 14195 Berlin, 2019.
[36] D. Pfau and O. Vinyals. Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945, 2016.
[37] A. Pourfard, H. Moetamedzadeh, R. Madoliat, and E. Khanmirza. Design of a neural network based predictive controller for natural gas pipelines in transient state. Journal of Natural Gas Science and Engineering, 62:275–293, 2019.
[38] R. Z. Ríos-Mercado and C. Borraz-Sánchez. Optimization problems in natural gas transportation systems: A state-of-the-art review. Applied Energy, 147:536–555, 2015.
[39] D. B. Rubin. The Bayesian bootstrap. The Annals of Statistics, pages 130–134, 1981.
[40] R. Ruiz, C. Maroto, and J. Alcaraz. A decision support system for a real vehicle routing problem. European Journal of Operational Research, 153(3):593–606, 2004.
[41] L. N. Smith. Cyclical learning rates for training neural networks, 2017.
[42] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[43] Y. Tang, S. Agrawal, and Y. Faenza. Reinforcement learning for integer programming: Learning to cut. arXiv preprint arXiv:1906.04859, 2019.
[44] P. S. Thomas and E. Brunskill. Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines. arXiv preprint arXiv:1706.06643, 2017.
[45] E. Wong and J. Z. Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 2017.
[46] I. Yueksel Erguen, J. Zittel, Y. Wang, F. Hennings, and T. Koch. Lessons learned from gas network data preprocessing. Technical Report 20-13, ZIB, Takustr. 7, 14195 Berlin, 2020.
[47] G. Zarpellon, J. Jo, A. Lodi, and Y. Bengio. Parameterizing branch-and-bound search trees to learn branching policies. arXiv preprint arXiv:2002.05120, 2020.
A Appendix
Algorithm 8: Prepare Discriminator Training Data
Input: Neural network $N_{\{\theta_1,\theta_2\}}$, labelled_data
Result: Data for training $D_{\theta_2}$
    new_labelled_data = [];
    for i = 0; i < num_data_new do
        initial_state = uniformly select from generated offline data;
        flow_forecast, pressure_forecast = Boundary Prognosis Generator();  (see Algorithm 1)
        $\pi$ = (flow_forecast, pressure_forecast, initial_state);
        $\hat{z}$, $\hat{f}(P_{\pi}^{\hat{z}})$ = $N_{\{\theta_1,\theta_2\}}(\pi)$;
        $f(P_{\pi}^{\hat{z}})$ = solve $P_{\pi}^{\hat{z}}$;
        new_labelled_data.append(($\pi$, $\hat{z}$, $f(P_{\pi}^{\hat{z}})$));
    end
    return concatenate(labelled_data[-num_data_old:], new_labelled_data)
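Read as Python, Algorithm 8 amounts to the sketch below. Here network, oracle_solve (solving the induced MILP $P_{\pi}^{\hat{z}}$), and prognosis_generator (Algorithm 1) are placeholders standing in for the paper's components; the hyperparameter defaults follow the parameter table at the end of this appendix:

import random

def prepare_discriminator_data(network, oracle_solve, prognosis_generator,
                               offline_states, labelled_data,
                               num_data_new=2048, num_data_old=8192):
    new_labelled_data = []
    for _ in range(num_data_new):
        initial_state = random.choice(offline_states)
        flow_forecast, pressure_forecast = prognosis_generator()
        pi = (flow_forecast, pressure_forecast, initial_state)
        z_hat, f_hat = network(pi)        # generated decisions and predicted objective
        f_true = oracle_solve(pi, z_hat)  # oracle label: solve the MILP with z fixed
        new_labelled_data.append((pi, z_hat, f_true))
    # Keep a tail of older labelled data so training does not forget it
    return labelled_data[-num_data_old:] + new_labelled_data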
Algorithm 9: Discriminator Pretraining
Input: Discriminator $D_{\theta_2}$, data
    optimizer = Adam(learning_rate, weight_decay);
    dataloader = DataLoader(data, batch_size, shuffle=True);
    lr_scheduler = ReduceLROnPlateau();
    for i = 0; i < num_epochs do
        Discriminator Training Loop($D_{\theta_2}$, dataloader, optimizer);  (Algorithm 10)
        lr_scheduler.step();
    end
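A PyTorch rendering of this driver might look as follows. Note that torch.optim.lr_scheduler.ReduceLROnPlateau expects the monitored metric as an argument to step(), so we assume the training loop (Algorithm 10, sketched below) returns its mean epoch loss; the learning rate, weight decay, batch size, and epoch count follow the parameter table:

import torch
from torch.utils.data import DataLoader

def pretrain_discriminator(discriminator, dataset, num_epochs=500):
    optimizer = torch.optim.Adam(discriminator.parameters(),
                                 lr=0.005, weight_decay=5e-06)
    dataloader = DataLoader(dataset, batch_size=2048, shuffle=True)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
    for _ in range(num_epochs):
        epoch_loss = discriminator_training_loop(discriminator, dataloader, optimizer)
        scheduler.step(epoch_loss)  # plateau scheduler steps on the epoch loss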
Algorithm 10: Discriminator Training Loop
Input: Discriminator $D_{\theta_2}$, dataloader, optimizer
    for batch in dataloader do
        optimizer.zero_grad();
        $\hat{f}(P_{\pi}^{z})$ = $D_{\theta_2}$(batch);
        loss = L1Loss($\hat{f}(P_{\pi}^{z})$, $f(P_{\pi}^{z})$);
        loss.backward();
        optimizer.step();
    end
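In PyTorch the loop is a standard L1 regression step. We assume each batch is a (features, objective) pair, as the exact batch layout of the paper's implementation is not reproduced here:

import torch

def discriminator_training_loop(discriminator, dataloader, optimizer):
    criterion = torch.nn.L1Loss()
    epoch_losses = []
    for features, objective in dataloader:
        optimizer.zero_grad()
        prediction = discriminator(features)     # predicted objective value
        loss = criterion(prediction, objective)  # compare with solver objective
        loss.backward()
        optimizer.step()
        epoch_losses.append(loss.item())
    return sum(epoch_losses) / len(epoch_losses)  # metric for ReduceLROnPlateau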
Algorithm 11: Generator Training
Input: Neural network $N_{\{\theta_1,\theta_2\}}$
Result: Average loss in training
    optimizer = Adam();
    lr_scheduler = CyclicLR(max_lr, base_lr, step_size_up);  (introduced by Smith [41]; see pytorch.org/docs/stable/optim.html)
    data = [];
    for i = 0; i < num_scenarios do
        initial_state = uniformly select from generated offline data;
        flow_forecast, pressure_forecast = Boundary Prognosis Generator();  (see Algorithm 1)
        $\pi$ = (flow_forecast, pressure_forecast, initial_state);
        data.append($\pi$);
    end
    dataloader = DataLoader(data, batch_size, shuffle=True);
    losses = [];
    for batch in dataloader do
        optimizer.zero_grad();
        $\hat{z}$, $\hat{f}(P_{\pi}^{\hat{z}})$ = $N_{\{\theta_1,\theta_2\}}$(batch);
        loss = L1Loss($\hat{f}(P_{\pi}^{\hat{z}})$, 0);
        loss.backward();
        optimizer.step();
        lr_scheduler.step();
        losses.append(loss);
    end
    return mean(losses)
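A PyTorch sketch of the training pass over the batches: only the generator's parameters are updated (we assume the combined network exposes them as full_network.generator, which is a placeholder for the paper's implementation), and the cyclic schedule [41] is stepped after every batch. cycle_momentum=False is required when pairing CyclicLR with Adam, which has no momentum parameter to cycle:

import torch
from torch.utils.data import DataLoader

def train_generator(full_network, scenarios):
    optimizer = torch.optim.Adam(full_network.generator.parameters())
    scheduler = torch.optim.lr_scheduler.CyclicLR(
        optimizer, base_lr=5e-06, max_lr=0.0005,
        step_size_up=10000, cycle_momentum=False)
    criterion = torch.nn.L1Loss()
    dataloader = DataLoader(scenarios, batch_size=2048, shuffle=True)
    losses = []
    for batch in dataloader:
        optimizer.zero_grad()
        z_hat, f_hat = full_network(batch)
        # Push the predicted objective towards 0, as in Algorithm 11
        loss = criterion(f_hat, torch.zeros_like(f_hat))
        loss.backward()
        optimizer.step()
        scheduler.step()
        losses.append(loss.item())
    return sum(losses) / len(losses)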
Figure 14: Neural Network Architecture. (The recoverable labels from the diagram: the Generator maps the constants, boundary flows and pressures, and the initial operation mode through Linear & ReLU and (Res)Inception blocks, a kernel-1 convolution, and a temperature-scaled softmax to the operation modes; the Discriminator combines the same inputs through Linear & ReLU and (Res)Inception blocks, flattening, and a final size-1 linear layer with softplus to output the objective value.)

Parameter                      Method                                  Value
batch size                     Algorithm 9                             2048
num epochs                     Algorithm 9                             500
learning rate                  Algorithm 9 / Adam                      0.005
weight decay                   Algorithm 9 / Adam                      5e-06
batch size                     Algorithm 11                            2048
max lr                         Algorithm 11 / CyclicLR                 0.0005
base lr                        Algorithm 11 / CyclicLR                 5e-06
step size up                   Algorithm 11 / CyclicLR                 10000
num scenarios                  Algorithm 11                            3200000
num data new                   Algorithm 8                             2048
num data old                   Algorithm 8                             8192
num epochs                     Algorithm 5                             10
num generator epochs           Algorithm 5                             25
num discriminator epochs       Algorithm 5                             25
stopping loss discriminator    Algorithm 5                             3 * 1022.5
stopping loss generator        Algorithm 5                             0.9 * 121848.27
num prelabelled                Algorithm 5 / mix in prelabelled data   8192
ratio test                     Algorithm 5 / split data                0.1
learning rate                  Algorithm 5 / Adam                      0.001
weight decay                   Algorithm 5 / Adam                      5e-06
patience                       Algorithm 5 / ReduceLROnPlateau         2
factor                         Algorithm 5 / ReduceLROnPlateau         0.5