An Outer-approximation Guided Optimization Approach for Constrained Neural Network Inverse Problems
Myun-Seok Cheon
Corporate Strategic Research, ExxonMobil Research and Engineering
February 25, 2020
Abstract
This paper discusses an outer-approximation guided optimization method for constrained neural network inverse problems with rectified linear units. A constrained neural network inverse problem is an optimization problem that seeks the best set of input values of a given trained neural network in order to produce a predefined desired output in the presence of constraints on the input values. This paper analyzes the characteristics of optimal solutions of neural network inverse problems with rectified activation units and proposes an outer-approximation algorithm that exploits these characteristics. The proposed outer-approximation guided optimization comprises primal and dual phases. The primal phase incorporates neighbor curvatures with neighbor outer-approximations to expedite the process. The dual phase identifies and utilizes the structure of local convex regions to improve the convergence to a local optimal solution. Finally, computational experiments demonstrate the superiority of the proposed algorithm compared to a projected gradient method.
Neural networks are the most essential and widely used ingredient of modern machine learning applications [Schmidhuber, 2015]. The primary reason for their prevalence is their flexibility and capability to approximate any continuous function, known as the universal approximation theorem [Hornik et al., 1990]. Deep neural networks have demonstrated the strength of this universal approximation capability in many machine learning applications, including image processing, natural language processing and reinforcement learning problems. In image classification, convolutional neural networks are commonly adopted to describe the spatial and temporal dependencies in images without any tailored feature engineering [Krizhevsky et al., 2012]. Recurrent neural networks are another example, widely used for analyzing temporal data [Hochreiter and Schmidhuber, 1997]. In reinforcement learning, various types of neural networks are employed to approximate intricate value and policy functions [Silver et al., 2017, Peters and Schaal, 2008].

Neural network inverse problems refer to a class of optimization problems that find the right set of input parameters to achieve a desired output with a trained neural network. The trained neural network can be viewed as a surrogate function that predicts outputs for an input. For example, in the material design context, neural networks can be trained to predict material properties or performances from material fingerprints such as molecular structures. The inverse problem is to find the right molecular structure to produce a certain material property. The input parameters are the main optimization variables for the inverse problem, whereas the forward problem optimizes the weights and biases of a neural network with a large amount of input and output data to improve the prediction accuracy.

1 Constrained neural network inverse problems

Let f(x): R^n → R^m be a neural network function whose input and output dimensions are n and m, respectively. Let f̂ ∈ R^m be a desired output. Let L(·,·): R^m × R^m → R be a function that measures the difference, or loss, between two vectors. Let X be the feasible space for the input values. The constrained neural network inverse problem can be described as follows:

    (P)   min_{x ∈ X} L(f(x), f̂).   (1.1)

The optimization problem is to find the input values x that minimize the difference between the desired output f̂ and the corresponding neural network output f(x). For the sake of simplicity, let g(x): R^n → R denote L(f(x), f̂) ∈ R and let ∇g(x) ∈ R^n be the gradient of L(f(x), f̂) at x.

Adversarial examples for a trained neural network are crafted to increase the robustness of neural networks. The main idea comes from the observation that trained neural networks often make mistakes on classification or prediction tasks for images that contain a small perturbation of the original image, imperceptible to human eyes [Szegedy et al., 2014, Linden and Kindermann, 1989]. Including such adversarial examples in the training set improves the robustness. Generating adversarial examples can be formulated as a neural network inverse problem. For a given image and the corresponding classification, the main decision variable of the problem is the perturbation of the image and the main objective is to minimize the prediction accuracy for the corresponding classification.
In this problem, the magnitude of the perturbation can be controlled by constraints [Fischetti and Jo, 2018, Anderson et al., 2019].

The second example is any engineering or scientific analysis that consists of forward and inverse modeling. For example, material design and discovery activities can be described with two major parts: one is the forward (prediction) problem that predicts the property of a given material, and the other is the inverse problem that finds a set of materials for a given desired property. With a large amount of data or simulation, deep neural networks can be trained to capture the core characteristics of the forward problem. The inverse problem with the trained neural network can then take the form of problem (P). The problem optimizes the input, which is a choice of materials, to achieve a desired output such as a target property [Chen and Gu, 2020]. Similar two-phase approaches for engineering problems have been discussed in several studies [Rezaee and Dadkhah, 2019, Corts et al., 2009].

Another motivating example is physics-based inversion with a lower-dimensional representation. Physics-based inverse problems seek the right set of parameters of a physics model that explains observations. For example, X-ray tomography is a nondestructive method to analyze the internal property and structure of an object with X-ray measurements. Another example is subsurface analysis in the oil and gas industries, which utilizes seismic and gravity data to characterize the subsurface. One of the main challenges in such analyses is the ill-posedness of the inverse problem: the number of model parameters to fit can be several orders of magnitude larger than the number of independent observations. This phenomenon causes overfitting, in which the resulting model explains the observation data extremely well but is not a physically plausible solution. One idea to circumvent the issue is to incorporate additional information into the inversion process through machine learning techniques. The main role of machine learning techniques such as variants of variational autoencoders is to capture the main characteristics of reasonable physical realizations from many prior examples. Variational autoencoders consist of an encoder that maps or abstracts physical realizations to latent space variables and a decoder that projects latent variables back to a physical realization. The decoder of a successfully trained variational autoencoder can produce a plausible physical realization for any value of the latent space along with a probabilistic measure. In this approach, the resulting problem takes the form of a neural network inverse problem that combines the physics-based inverse problem with the trained decoder. The problem optimizes the latent space variables of the decoder to minimize the difference between the given observation and the physics response of the corresponding physics model obtained from the decoder [O'Malley et al., 2019].

1.3 Prior solution methods
Solution techniques for neural network inverse problems can be categorized into three classes. One group consists of methods that utilize gradient information, such as steepest descent algorithms and the projected gradient method. The second group consists of methods based on forward function calls, such as derivative-free methods and meta-heuristics. The last group is based on explicit mathematical formulations such as mixed integer nonlinear programming.

The gradient with respect to the input parameters can be computed with back-propagation through the neural network [Linden and Kindermann, 1989]. Unconstrained problems, i.e., problems with no additional constraints on the input parameters, can be tackled with gradient based methods such as steepest descent methods, quasi-Newton methods, etc. For constrained problems, the projected gradient method can be considered.

Meta-heuristics such as particle swarm optimization or genetic algorithms can also be considered. Meta-heuristics are attractive because forward evaluations of a trained neural network are inexpensive. Compared to gradient based approaches, these methods can avoid being trapped in a local optimum [Rezaee and Dadkhah, 2019].

When a neural network has only rectified linear activation units, the inverse problem can be described as a mixed integer nonlinear programming problem [Fischetti and Jo, 2018, Anderson et al., 2019]. Section 2.1 will discuss a mathematical model.
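As a concrete illustration of the gradient-based route, the following is a minimal sketch (not the implementation used later in this paper, which relies on Tensorflow) of evaluating g(x) for a squared-error loss and back-propagating to the input of a fully connected network with rectified linear units. The list-of-matrices weight layout and the function name are assumptions made for the example.

```python
import numpy as np

def forward_backward(x, weights, biases, f_hat):
    """Evaluate g(x) = ||f(x) - f_hat||^2 for a ReLU network and back-propagate
    to obtain the gradient of g with respect to the *input* x.  The weights and
    biases stay fixed; weights[l] has shape (n_out, n_in) for layer l."""
    # Forward pass, caching pre-activations for the backward pass.
    a, pre = np.asarray(x, dtype=float), []
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ a + b
        pre.append(z)
        a = np.maximum(z, 0.0)              # rectified linear activation
    y = weights[-1] @ a + biases[-1]        # linear output layer
    g = float(np.sum((y - f_hat) ** 2))     # squared-error loss

    # Backward pass: chain rule through the output layer and each ReLU layer.
    grad = 2.0 * (y - f_hat)
    grad = weights[-1].T @ grad
    for W, z in zip(reversed(weights[:-1]), reversed(pre)):
        grad = grad * (z > 0)               # ReLU sub-gradient
        grad = W.T @ grad
    return g, grad                          # grad is the gradient with respect to x
```

A projected gradient step would then move x along the negative of this gradient and project the result back onto the feasible set X.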
This paper proposes an optimization algorithm in the context of a multi-start optimization framework for neural network inverse problems. The multi-start concept is employed to mitigate the local optimality issue of neural network inverse problems. In the framework, the initial optimization starting points are collected from the training set, e.g., the n closest data points to a given target or n cluster center points. There are two benefits of using the training set to select starting points. One is that it is easy to collect meaningful starting points with a polynomial time examination. The other is that the solutions stay within the training region. Since a neural network is a surrogate function built from limited training data, if a solution is far from the training data region, the corresponding prediction might not be accurate. By using the training data for starting solutions, the resulting solutions tend to remain within the comfort zone of the neural network. The multi-start optimization framework requires solving many optimization problems with various starting points. Therefore, it is important to have an efficient solution method.

The contributions of this paper are twofold. The first is an analysis of the characteristics of local optimal solutions of neural network inverse problems with rectified activation units. The second is the development of an efficient algorithm that exploits these characteristics, together with a demonstration of its superiority over a gradient based method through computational experiments.

The characteristics of neural network inverse problems and the types of local optimal solutions are discussed in Section 2. In Section 3, an outer approximation based algorithm is proposed. Section 4 contains the computational results for the proposed method and a projected gradient method.

A neural network consists of multiple layers of interconnected neurons. Each neuron is a computing unit defined by weights, a bias and an activation function. The weights and bias describe a linear relationship with the outputs of connected neurons, and the activation function, typically a nonlinear operator applied after the linear computation, generates the final output. Equation (2.1) describes the computation in a neuron with a rectified linear activation function:

    t_j = max( 0, Σ_i w_ij t_i + b_j ),   (2.1)

where t_i is the output of neuron i, and w_ij and b_j are the weight from neuron i to neuron j and the bias term of neuron j, respectively. In Equation (2.1), the rectified linear activation is described with the max operator. Equation (2.1) can be described as a mixed integer linear system as follows:

    t_j − s_j = Σ_i w_ij t_i + b_j,   ∀ j ∈ N_r,   (2.2)
    t_j ≤ M z_j,   ∀ j ∈ N_r,   (2.3)
    s_j ≤ M (1 − z_j),   ∀ j ∈ N_r,   (2.4)
    t_j, s_j ≥ 0,   ∀ j ∈ N_r,   (2.5)
    z_j ∈ {0, 1},   ∀ j ∈ N_r,   (2.6)

where the parameter M denotes a big-M, i.e., a sufficiently large number, and the set N_r denotes the neurons with a rectified linear activation function. The binary variable z_j indicates whether neuron j is active or not. When neuron j is active, i.e., z_j = 1, only the variable t_j can take a nonzero value; otherwise, the variable s_j can take a nonzero value. Only a positive output value (t_j) is passed to connected neurons.

A neural network can be described as a mixed integer linear system as follows.
Let N be the set of all neurons, and let N_I and N_O be the neurons in the input and output layers, respectively.

    Constraints (2.2)−(2.6),   (2.7)
    t_j = x_j,   ∀ j ∈ N_I,   (2.8)
    y_j = Σ_i w_ij t_i + b_j,   ∀ j ∈ N_O,   (2.9)
    y_j ∈ R,   ∀ j ∈ N_O.   (2.10)

In this mixed integer linear system, the x_j are the parameters for the input layer in Constraint (2.8), and the result is the solution of the variables y_j of the linear system for the output layer in Constraint (2.9). The formulation can be tightened with an extended formulation or with projected cuts from an extended formulation [Anderson et al., 2019].
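To make the formulation concrete, the following is a hedged sketch of assembling the system (2.2)-(2.10) with PuLP, the LP modeler used later in the computational experiments. The weight layout (one numpy matrix per layer, rows indexed by output neurons), the bound values and the big-M constant are assumptions made for the example; in practice M must be chosen large enough to bound every pre-activation.

```python
import pulp

def relu_network_mip(weights, biases, x_lb, x_ub, big_m=1e3):
    """Encode a trained ReLU network as the mixed integer linear system
    (2.2)-(2.10): t_j - s_j equals the affine pre-activation and a binary z_j
    forces one of t_j, s_j to zero through big-M constraints."""
    prob = pulp.LpProblem("relu_network", pulp.LpMinimize)
    n_in = weights[0].shape[1]
    x = [pulp.LpVariable(f"x_{i}", lowBound=x_lb, upBound=x_ub)
         for i in range(n_in)]
    t_prev = x                                                   # input layer outputs (2.8)
    for l, (W, b) in enumerate(zip(weights[:-1], biases[:-1])):
        t = []
        for j in range(W.shape[0]):
            t_j = pulp.LpVariable(f"t_{l}_{j}", lowBound=0)      # (2.5)
            s_j = pulp.LpVariable(f"s_{l}_{j}", lowBound=0)
            z_j = pulp.LpVariable(f"z_{l}_{j}", cat="Binary")    # (2.6)
            pre = pulp.lpSum(W[j, i] * t_prev[i]
                             for i in range(W.shape[1])) + float(b[j])
            prob += t_j - s_j == pre                             # (2.2)
            prob += t_j <= big_m * z_j                           # (2.3)
            prob += s_j <= big_m * (1 - z_j)                     # (2.4)
            t.append(t_j)
        t_prev = t
    # Linear output layer (2.9)-(2.10): the y_j are free variables.
    W, b = weights[-1], biases[-1]
    y = []
    for j in range(W.shape[0]):
        y_j = pulp.LpVariable(f"y_{j}")
        prob += y_j == pulp.lpSum(W[j, i] * t_prev[i]
                                  for i in range(W.shape[1])) + float(b[j])
        y.append(y_j)
    return prob, x, y
```

A loss objective for problem (P), e.g., a piecewise-linear or quadratic measure of the distance between the returned y variables and the target f̂, would then be added on top of this system.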
In this section, we discuss the properties of the neural network inverse problem with rectified linear activation units. Figure 1 shows a simple example of a neural network and the squared error loss with a given target. The neural network has 5 neurons and its output is piecewise linear, as depicted in graph (a) of Figure 1. Given a target output, the orange dotted line in (a) of Figure 1, graph (b) of Figure 1 shows the squared error loss with respect to the input x.

The first observation is that with a given activation, i.e., when the set of active neurons is predetermined, the resulting problem is convex. Let N(x) be the set of active neurons, whose output is positive for input x. Note that for a given solution x, the active set definition need not be unique; two different sets N₁(x) ≠ N₂(x) can both be consistent with x.

Proposition 2.1. If a neural network has only rectified linear activation units and a convex loss function, i.e., L(·, ŷ) is convex for a given target ŷ, then the neural network inverse problem g(x) over x ∈ {x′ | N(x′) = N(x̂)} for a given set of activations N(x̂) is convex.

Proof. With a fixed set of activations N(x̂), the feasible region {x′ | N(x′) = N(x̂)} can be described with the following linear system by setting the binary variables of the active neurons to one and the rest to zero:

    Y(N(x̂)) = { x ∈ R^n |
        t_j = x_j,   ∀ j ∈ N_I,
        t_j = Σ_i w_ij t_i + b_j,   ∀ j ∈ N(x̂),
        0 ≥ Σ_i w_ij t_i + b_j,   ∀ j ∈ N_r \ N(x̂),
        t_j ≥ 0,   ∀ j ∈ N(x̂),
        t_j = 0,   ∀ j ∈ N_r \ N(x̂) }.   (2.11)

Since the loss function is convex and the feasible region is convex, the resulting optimization problem is convex.

Figure 1: An illustrative example of several types of local optimal solutions. (a) A neural network describing a piecewise linear function f(x). The blue line represents the output of the neural network over the input x. Each integer interval has a unique activation pattern. The orange dotted line represents the target for the loss function. (b) The squared error loss function g(x) for f(x) with the given target. The figure illustrates various local optimal solutions. The orange dotted line represents the zero loss.

Any feasible solution can be categorized into one of two groups based on the following set definition. Let B(x) be the set of neurons whose output after the linear computation, i.e., before the activation operation, is exactly zero for input x:

    B(x) = { j ∈ N_r | Σ_i w_ij t̃_i + b_j = 0 },   (2.12)

where t̃_i is the corresponding output of neuron i for input x. One group consists of the solutions without any zero-output neuron before the activation, i.e., B(x*) = ∅. The other group has at least one neuron with exactly zero output before the activation, i.e., B(x*) ≠ ∅. Note that any solution falls into exactly one of the two categories, i.e., {x | B(x) = ∅ ∧ B(x) ≠ ∅} = ∅ and {x ∈ X | B(x) = ∅ ∨ B(x) ≠ ∅} = X. In Figure 1, any integer solution x ∈ {1, 2, 3, 4} has at least one zero-output neuron, while the set B(x) is empty for all other real numbers x.

The neural network inverse problem with rectified linear units can therefore be viewed as a union of many convex optimization problems over all possible activation permutations.

Definition 2.2. (Type 1) An optimal solution x* is called a self-contained local optimal solution if the solution x* is optimal to the neural network inverse problem g(x) over x ∈ Y(N(x*)) and no neuron has an output of exactly zero before the corresponding activation, i.e., B(x*) = ∅.

The 'Type 1' solution in Figure 1 is an example of a self-contained local optimal solution. In this case, it is sufficient to prove local optimality without considering its neighbor activation regions. If there exist zero-output neurons, i.e., Σ_i w_ij t̃_i + b_j = 0 for some j, there exists more than one activation pattern that contains the solution x*. In this case, it is required to check whether the solution is optimal with respect to all feasible permutations of N(x*). Let M(x*) be the set of all feasible activation permutations for the solution x*. Note that some permutations of the activations can be infeasible.
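The active set N(x) and the boundary set B(x) can be read off from a single forward pass. The following is a minimal sketch under the same assumed weight layout as in the earlier example; the tolerance argument anticipates the numerical tolerance τ used later in the implementation (Equation (3.33)).

```python
import numpy as np

def activation_sets(x, weights, biases, tol=1e-6):
    """Return the active set N(x) and the boundary set B(x) of a ReLU network.
    Neuron (l, j) is active when its pre-activation is positive, and on the
    boundary when the pre-activation is (numerically) zero."""
    active, boundary = set(), set()
    a = np.asarray(x, dtype=float)
    for l, (W, b) in enumerate(zip(weights[:-1], biases[:-1])):
        z = W @ a + b                        # pre-activation of layer l
        for j, z_j in enumerate(z):
            if abs(z_j) <= tol:
                boundary.add((l, j))         # j ∈ B(x): shared by several patterns
            elif z_j > tol:
                active.add((l, j))           # j ∈ N(x)
        a = np.maximum(z, 0.0)
    return active, boundary
```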
Definition 2.3. (Type 2) An optimal solution x* is called a boundary local optimal solution if the solution x* is optimal to the neural network inverse problem g(x) over x ∈ {x′ | N(x′) ∈ M(x*)} and the solution is associated with more than one activation pattern.

In this case, there exists at least one neuron with exactly zero output, i.e., B(x*) ≠ ∅. The 'Type 2' solution in Figure 1 falls into this category. When x = 3, there are two possible activations - one for the interval from 2 to 3 and the other for the interval from 3 to 4. In both cases, the solution (x = 3) is locally optimal. Conversely, consider the solution x = 1, marked with a red square in Figure 1. The solution x = 1 is locally optimal with respect to the convex region defined by the activation for 1 ≤ x ≤ 2, but not with respect to the region defined by the activation for 0 ≤ x ≤ 1.

Definition 2.4. (Type 3) An optimal solution x* is called a redundant local optimal solution if an element x_j of the solution x* is insensitive to the local optimality, i.e., solutions with any value of x_j within the feasible region x ∈ {x′ | N(x′) = N(x*)} are still locally optimal.

Note that redundant local optimal solutions can also be categorized as self-contained or boundary local optimal solutions. The 'Type 3' solution in Figure 1 is an example of a redundant local optimal solution. The solution is similar to a saddle point in nonlinear optimization problems. Note that the solution (x = 4) at the boundary of the 'Type 3' solution in Figure 1 is not a local optimal solution; its left neighbor has better objective values.
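The candidate activation patterns in M(x*) can be enumerated by toggling the neurons in B(x*) on top of the strictly active neurons. A small sketch, reusing the sets computed by the earlier helper, is shown below; the feasibility of each pattern against the linear system Y(·) still has to be verified separately.

```python
from itertools import product

def candidate_patterns(active, boundary):
    """Enumerate candidate activation patterns for M(x*): every neuron in
    B(x*) may be declared active or inactive, on top of the strictly active
    neurons in N(x*).  Some of the resulting permutations can be infeasible
    and must be checked against Y(.) before use."""
    boundary = sorted(boundary)
    for flags in product((False, True), repeat=len(boundary)):
        pattern = set(active)
        pattern.update(j for j, on in zip(boundary, flags) if on)
        yield frozenset(pattern)
```

The enumeration is exponential in |B(x*)|, which is one reason the algorithm of Section 3 filters patterns with the inequality (3.32) before solving the corresponding subproblems.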
Theorem 2.5. A solution x* is locally optimal to Problem (P) if and only if the solution x* is locally optimal for all neighbor activation subproblems, i.e.,

    g(x*) = min_{x ∈ X ∩ Y(N(x*))} g(x),   ∀ N(x*) ∈ M(x*).   (2.13)
Proof. Suppose that a solution x* is locally optimal with respect to all subproblems defined by its neighbor activation patterns and that it is not locally optimal to Problem (P). Then, for any ε > 0, there exists a solution x̂ such that ‖x* − x̂‖ < ε, g(x*) > g(x̂) and x* ∉ Y(N(x̂)). Since Y(N(x̂)) is a closed set (a linear system), there exists 0 < λ < 1 such that (1 − λ) x* + λ x̂ ∉ Y(N(x̂)). That is, x̂ is not a neighbor, which is a contradiction.

Conversely, suppose that there is an activation pattern N(x*) whose corresponding neighbor convex region contains a better solution x̄, i.e.,

    g(x*) > g(x̄) = min_{x ∈ X ∩ Y(N(x*))} g(x).   (2.14)

Since g(x) is convex over this region, the following inequality holds for 0 < λ ≤ 1:

    g((1 − λ) x* + λ x̄) ≤ (1 − λ) g(x*) + λ g(x̄) < g(x*).   (2.15)

This implies that there exists a neighbor solution of x* with a better objective value. Thus, x* is not a local optimal solution.

In this section, an outer-approximation guided algorithm is proposed for constrained neural network inverse problems. The proposed algorithm adapts core concepts from gradient based algorithms and outer approximation approaches [Geoffrion, 1970] while exploiting the characteristics of local optimal solutions. The proposed algorithm iteratively identifies a descent direction with outer approximation subproblems and determines the next solution with a step size.

The proposed algorithm comprises two phases - a primal and a dual phase. The primal phase improves the solution by incorporating local (neighbor) gradients, and the dual phase focuses on proving local optimality. The algorithm uses two outer approximation subproblems to find a descent direction for the primal and dual phases and one outer approximation subproblem to prove local optimality.
Outer approximation algorithms are widely used to solve large scale convex optimization and convex mixed integer programming problems [Geoffrion, 1970, Benders, 1962, Duran and Grossmann, 1986]. The main idea of these algorithms is to approximate nonlinear convex or complex linear systems with a set of hyperplanes. Consider the following optimization problem:

    min_{x ∈ X} g(x),   (3.1)

where g(x) is convex over x ∈ X. Then, we can approximate the problem with hyperplanes as follows:

    min_{x ∈ X, v ∈ R}  v   (3.2)
    s.t.  v ≥ g(x^k) + ∇g(x^k)^T (x − x^k),   k ∈ K,   (3.3)

where x^k is feasible, i.e., x^k ∈ X, and ∇g(x^k) is the gradient at x^k. When g(·) is a certain linear system, dual values can be used instead of gradients [Benders, 1962]. Similar approaches have been proposed and applied to large-scale nonsmooth convex optimization problems [Ben-Tal and Nemirovski, 2005].

The outer-approximation is valid only when the function g(x) is convex over the feasible set X. One can easily show that neural network functions with rectified linear activation functions are not convex. Even though the approximation is then not valid in general, it can still be used to determine a descent direction. The approximation provides additional information from previous observations compared to simple first order methods. In addition, if the final solution lies within a localized convex region, the corresponding approximation on that region is valid.

In this section, we discuss an outer approximation model for the neural network inverse problems. Let K be an index set for solutions. Let the set K(x′) be the index subset of solutions x^k that lie in the union of the convex feasible regions defined by the activation patterns of a solution x′:

    K(x′) = { k ∈ K | ∃ N(x^k) ∈ M(x′) }.   (3.4)

Let x* be the best solution among the solutions x^k, i.e., x* = arg min_{x^k, k ∈ K} g(x^k). Consider the following outer approximation model for neural network inverse problems:

    (NOA)  min_{x ∈ X, v ∈ R}  v   (3.5)
           s.t.  v ≥ g(x^k) + ∇g(x^k)^T (x − x^k),   k ∈ K(x*).   (3.6)

Note that g(x*) is not differentiable when B(x*) ≠ ∅. Let ∇_{N(·)} g(x*) be the gradient of the function g(·) at the solution x* within a specific activation pattern N(·). With an activation pattern N(·), the resulting problem, i.e., minimizing g(x) over x ∈ Y(N(·)), is convex and differentiable at any feasible solution.
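The following is a small sketch of the master problem (3.2)-(3.3) (equivalently (3.5)-(3.6)) built with PuLP, with a simple box standing in for the feasible set X. The bookkeeping that restricts the hyperplanes to K(x*) is left out, and the function name and the data layout of the cut list (plain float lists) are assumptions made for the example.

```python
import pulp

def solve_oa_master(cuts, x_lb, x_ub, n):
    """Solve the outer-approximation master problem: minimize v subject to one
    hyperplane per recorded cut, v >= g(x_k) + grad_k^T (x - x_k).
    `cuts` is a list of tuples (g_k, grad_k, x_k) with plain float lists."""
    prob = pulp.LpProblem("oa_master", pulp.LpMinimize)
    x = [pulp.LpVariable(f"x_{i}", lowBound=x_lb, upBound=x_ub) for i in range(n)]
    v = pulp.LpVariable("v")
    prob += v                                        # objective (3.5)
    for g_k, grad_k, x_k in cuts:                    # cuts drawn from K(x*)
        prob += v >= g_k + pulp.lpSum(
            grad_k[i] * (x[i] - x_k[i]) for i in range(n))   # (3.6)
    prob.solve()                                     # default CBC solver
    return [xi.value() for xi in x], v.value()
```

In the full algorithm, the cut list would be restricted to K(x*), or to the distance-localized set K* of Equation (3.14), rather than containing every visited point.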
Proposition 3.1. Let x* be a solution in the convex region defined by an activation pattern N(x*). The solution x* is locally optimal for the convex region Y(N(x*)) if and only if the following inequality holds:

    ∇_{N(·)} g(x*)^T (x − x*) ≥ 0,   ∀ x ∈ Y(N(x*)).   (3.7)
Proof. Let x* be a local optimal solution within the convex region Y(N(x*)). This implies that there is no feasible descent direction d = α (x − x*) for any feasible x ∈ Y(N(x*)) and scalar α > 0. That is, the inequality (3.7) is valid for all x ∈ Y(N(x*)).

Conversely, suppose that the inequality (3.7) is valid for all x ∈ Y(N(x*)) and that x* is not locally optimal. Then, there exists at least one solution x̂ such that g(x*) > g(x̂). Since the problem is convex, the following outer-approximation is always valid:

    g(x̂) ≥ g(x*) + ∇_{N(·)} g(x*)^T (x̂ − x*),
    ⇒ 0 > g(x̂) − g(x*) ≥ ∇_{N(·)} g(x*)^T (x̂ − x*).

This contradicts the validity of the inequality (3.7). Therefore, x* is locally optimal.
Theorem 3.2. Let v* be the optimal objective value of the outer approximation (NOA) with solutions x^k ∈ X, k ∈ K, and let x* be the best known solution with respect to the function g(x), i.e., x* = arg min_{x^k, k ∈ K} g(x^k). The solution x* is locally optimal to Problem (P) if the following conditions hold:

1. g(x*) = v*,
2. ∇_{N(·)} g(x*)^T (x − x*) ≥ 0,  ∀ x ∈ Y(N(·)),  N(·) ∈ M(x*).

Proof. Consider two cases: B(x*) = ∅ and B(x*) ≠ ∅.

i. B(x*) = ∅. Consider the feasible set (2.11). Since B(x*) = ∅, the following two constraints of (2.11) are non-binding:

    0 ≥ Σ_i w_ij t_i + b_j,   ∀ j ∈ N_r \ N(x̂),   (3.8)
    0 ≤ t_j = Σ_i w_ij t_i + b_j,   ∀ j ∈ N(x̂).   (3.9)

The rest of (2.11) is redundant for the function g(x). Therefore, the outer approximation (3.6) is a binding outer approximation for g(x) around the solution x*. This means that x* is locally optimal and, in particular, a self-contained local optimal solution.

ii. B(x*) ≠ ∅. In this case, it is required to show that the current solution x* is also locally optimal with respect to its neighbor activations M(x*). By Proposition 3.1, the second condition ensures that the solution is locally optimal for the feasible regions defined by the corresponding activation patterns.

For a given activation pattern N(·), the following linear programming problem checks whether the corresponding feasible region contains a descent direction:

    (OC(N(·)))   min_{x ∈ X}  ∇_{N(·)} g(x*)^T (x − x*)   (3.10)
                 s.t.  x ∈ Y(N(·)).   (3.11)

If the optimal objective value is equal to zero, then there is no descent direction.

The proposed algorithm consists of primal and dual phases. The primal phase is designed to find a better solution quickly, while the focus of the dual phase is to prove the local optimality of the best known solution. The dual phase employs the outer approximation model (NOA) discussed in Section 3.2. In order to identify the proper hyperplanes, the (NOA) outer approximation model requires bookkeeping of the activation patterns for each solution. In order to alleviate this bookkeeping effort in the early stage, the primal phase uses an outer approximation constructed from the simple distance to the best known solution.
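A hedged sketch of the descent-direction check (OC(N(·))) in PuLP is shown below. The fixed activation pattern is passed as per-layer boolean masks, the feasible set X is taken to be simple bounds, and the weight layout follows the earlier examples; grad and x_star are assumed to be plain lists of floats. A zero optimal value certifies that the region defined by the pattern contains no descent direction from x*.

```python
import pulp

def descent_check(x_star, grad, weights, biases, masks, x_lb, x_ub):
    """Sketch of (OC(N(.))): minimize grad^T (x - x*) over the fixed-activation
    region Y(N(.)) intersected with box bounds.  masks[l][j] is True when
    neuron j of layer l is active in the pattern N(.)."""
    prob = pulp.LpProblem("oc_check", pulp.LpMinimize)
    n = len(x_star)
    x = [pulp.LpVariable(f"x_{i}", lowBound=x_lb, upBound=x_ub) for i in range(n)]
    prob += pulp.lpSum(grad[i] * (x[i] - x_star[i]) for i in range(n))  # (3.10)
    t_prev = x
    for l, (W, b) in enumerate(zip(weights[:-1], biases[:-1])):
        t = []
        for j in range(W.shape[0]):
            pre = pulp.lpSum(W[j, i] * t_prev[i]
                             for i in range(W.shape[1])) + float(b[j])
            if masks[l][j]:                       # active: t_j = pre and t_j >= 0
                t_j = pulp.LpVariable(f"t_{l}_{j}", lowBound=0)
                prob += t_j == pre
            else:                                 # inactive: pre <= 0 and t_j = 0
                prob += pre <= 0
                t_j = 0.0
            t.append(t_j)
        t_prev = t
    prob.solve()
    return pulp.value(prob.objective)             # zero: no descent direction here
```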
Subproblem (DLOA k ) at iteration k is defined as follows;(DLOA k ) min x ∈X ,v ∈ R v (3.12)s.t. v ≥ g ( x k (cid:48) ) + ∇ g ( x k (cid:48) ) T ( x − x k (cid:48) ) k (cid:48) ∈ K ∗ ∪ { k } , (3.13)where set K ∗ is an index subset of hyperplanes that satisfy the following two conditions. One is the basesolutions for the selected hyperplanes should be close to the best known solution such that the Euclideandistance from each base solution for hyperplanes to the best known solution should be less than γ c . The8econd condition is that the resulting hyperplane should not exclude the best feasible solution. The set K ∗ is defined as follows; K ∗ = (cid:26) k ∈ K (cid:12)(cid:12)(cid:12)(cid:12) g ( x ∗ ) < g ( x k ) + ∇ g ( x k ) T ( x ∗ − x k ) , (cid:107) x ∗ − x k (cid:107) ≤ γ c (cid:27) . (3.14)Note that since the algorithm drops previously generated outer approximation constraints, it can experiencea stalling behavior such that the algorithm could revisit the solution that is previously examined. Suchbehaviors are prevented with two mechanism. The first mechanism is to always include the latest hyperplaneeven if it cuts off the best known solution. The other mechanism is adjusting the step size. If the algorithmfails to find a better solution, it decreases the step size, which guarantees a different solution. Supposethat the hyperplane defined by the latest solution cuts off the best feasible solution. There are two possibleoutcomes. One is that the new approximation identifies a better solution. In this case, there is no stallingproblem. The other case is that the new approximation fails to identify a new solution. In this case, thenewly added hyperplanes will be removed with the validity check against the best known solution. It canlead to the same approximation model as before. However, since the algorithm fails to find a better solution,it will reduce the step size, which prevents the stalling behavior. The following subproblems are employed in the dual phase to expedite proving the local optimality. Let N ( · ) ∈ M ( x ∗ ) be a selected activation pattern.(RLOA k ( N ( · ))) min x ∈X ,v ∈ R v (3.15)s.t. v ≥ g ( x k (cid:48) ) + ∇ g ( x k (cid:48) ) T ( x − x k (cid:48) ) ∀ x k (cid:48) ∈ Y ( N ( · )) , (3.16)0 ≥ r k (cid:48) T x + d k (cid:48) ∀ x k (cid:48) / ∈ Y ( N ( · )) , (3.17)where Constraint (3.17) are feasibility cuts under consideration of the convex region Y ( N ( · )). Since the set Y ( N ( · )) is a linear system, a feasibility cut for solution x k (cid:48) / ∈ Y ( N ( · )) can be constructed based on theextreme ray of the following dual problem.max r,π (cid:88) j ∈ N I x k (cid:48) j r j + (cid:88) j ∈ N \ N I b j π j (3.18) (cid:88) j ∈ N I x k (cid:48) j r j + (cid:88) j ∈ N \ N I b j π j ≤ , (3.19) r j − (cid:88) k ∈ N w jk π k = 0 , ∀ j ∈ N I , (3.20) π j − (cid:88) k ∈ N w jk π k ≤ , ∀ j ∈ N ( · ) , (3.21) r j ∈ R , ∀ j ∈ N I , (3.22) π j ∈ R , ∀ j ∈ N ( · ) , (3.23) π j ≥ , ∀ j ∈ N r \ N ( · ) , (3.24) π j = 0 , ∀ j ∈ N O (3.25)When the neural network has many nodes, solving the aforementioned dual problem can be challengingand often numerically unstable. In order to overcome numerical and computational challenges, the proposedalgorithm uses forward-propagation to populate Constraint (3.17).Assume that nodes closer to the input layer are assigned with lower index numbers than ones furtheraway. 
Let j d be the lowest indexed node with activation discrepancy between the two solutions such that j d = min( j ∈ N r | ( j ∈ N ( x k (cid:48) ) ∧ j / ∈ N ( · )) ∨ ( j / ∈ N ( x k (cid:48) ) ∧ j ∈ N ( · ))). Let L be the set of layers. Let n l be thenumber of nodes in layer l ∈ L . Let W l ∈ R n l × n l − and b l ∈ R n l be the weight matrix and the bias column9ector for layer l ∈ L , respectively. Let I l ∈ R n l × n l be a matrix whose diagonal entries corresponding toactive nodes in the activation pattern N ( · ) are one, otherwise zero; I li,j = (cid:26) , ∀ i = j, ν i ∈ N ( · ) , , otherwise , (3.26)where I li,j denotes j th element in i th row of I l and ν i denotes the node index corresponding to i th row.Suppose x is not in the same activation region N ( · ) such that N ( x ) (cid:54) = N ( · ). With a known activationpattern, the max operation can be described with matrix I l . The first layer of forward propagation can becomputed as follows; I [ W x + b ] (3.27)Let l j d be the layer for node j d . Then, the forward propagation up to node j d can be described as follows; (cid:99) W = (cid:89) l = l jd I l W l , (3.28)ˆ b = I l jd b l jd + (cid:88) l = l jd − l jd (cid:89) l (cid:48) = l +1 I l (cid:48) W l (cid:48) I l b l (3.29)Now, the feasibility constraint can be constructed as follows; suppose i d be i th row of the correspondingoutput to discrepancy node j d such that ν i d = j d . Let (cid:99) W i and ˆ b i denote i th row of matrix (cid:99) W and ˆ b ,respectively. (cid:99) W i d x + ˆ b i d ≤ if j d / ∈ N ( · ) , (3.30) (cid:99) W i d x + ˆ b i d ≥ if j d ∈ N ( · ) , (3.31)By the construction of j d , solution x k (cid:48) violates one of the constraints above. Algorithm 3.1 describes the outer approximation guided algorithm. The algorithm starts with an initialpoint x and hyperparameters in line 2 - 4. Parameter γ s , γ s min , and γ s max denote the step size and itslower and upper bounds, respectively. Parameter γ c defines the neighborhood size of distance-localizedouter approximation subproblems. Parameter ρ c and ρ e are decrease and increase ratios for the step size,respectively. Parameter (cid:15) is the local optimality gap for termination and parameter N defines the maximumnumber of iterations. From line 5 to line 7, it initializes the bookkeeping parameters; ¯ x ∗ and ¯ g ∗ denote thebest known solution and the corresponding objective value, respectively. Parameter k tracks the number ofiteration and parameter dual-phase takes a boolean value to indicate whether the algorithm is in the dualphase or not.In each iteration, the algorithm evaluates the function value and the gradient for the current solution x k and updates the corresponding constraint for subproblems (line 9 - 10). In line 11 to 17, if the algorithm findsa better solution than the current best known solution, it updates the best known solution and increasesthe step size with parameter ρ e . When there is no improvement, it reduces the step size with parameter ρ c .The step size is truncated by predetermined parameters γ s min and γ s max . Line 18 updates the iteration count.If the step size is too small (Line 19), the algorithm sets to the dual phase. When the algorithm is not inthe dual phase, it solves the distance-localized outer approximation subproblem to find the next direction. Ifthe objective of the approximation is close to the objective value of the best known solution, it reevaluatesthe approximation with solutions within the neighbor activation regions. 
If the resulting approximation isclose to the best known objective within (cid:15) , the algorithm activates the dual phase. In the dual phase, thealgorithm terminates if the best known solution is (cid:15) − optimal for all possible neighbor feasible region. If thealgorithm finds a neighbor region with potential improvement, it generates a next direction with the solution10 lgorithm 3.1 Outer approximation guided algorithm x ∈ X Define γ s > , γ s min > , γ s max > Define γ c > , < ρ c < , ρ e > Define (cid:15) > , N > ¯ x ∗ ← x , ¯ g ∗ ← g ( x ) k ← dual-phase ← f alse for k ∈ { , . . . , N } do compute g ( x k ) and ∇ g ( x k ) update constraints if g ( x k ) < ¯ g ∗ then γ s ← min( ρ e × γ s , γ s max ) dual-phase ← f alse ¯ x ∗ ← x k , ¯ g ∗ ← g ( x k ) else γ s ← max( γ s min , ρ c × γ s ) end if k ← k + 1 if γ s = γ s min then dual-phase ← true end if if ¬ dual-phase then ˆ x k , v k ← solve (DLOA k ) if ¯ g ∗ − v k < (cid:15) then ˆ x k , v k ← solve (NOA) if ¯ g ∗ − v k < (cid:15) then dual-phase ← true end if end if end if if dual-phase then ˆ x k ← ∅ for N ( · ) ∈ M ( x ∗ ) do ˆ x k , v k ← solve (RLOA k ( N ( · ))) if ¯ g ∗ − v k ≥ (cid:15) then break end if end for if ˆ x k = ∅ then break end if end if x k = ¯ x ∗ + γ s × ( ˆ x k − ¯ x ∗ ) end for of (RLOA k ( N ( · ))). In Line 43, the algorithm sets the next solution under consideration of the approximationsolution ˆ x k and the step size parameter γ s .Set M ( x ∗ ) in Line 33 can have exponentially many activation patterns. In the implementation, thealgorithm records the previously examined activation patterns to avoid reexamining same regions, repeatably.The algorithm refreshes the record when it finds a new best solution. In the examination of neighbor patterns,the algorithm filters the activation patterns with the following inequality in a preliminary way. ∇ N ( · ) g ( x ∗ ) T ( x k − x ∗ ) < , x k ∈ Y ( N ( · )) , N ( · ) ∈ M ( x ∗ ) (3.32)For each activation pattern N ( · ) ∈ M ( x ∗ ), the algorithm checks whether the direction to all previousobservations within the corresponding feasible region is a descent direction or not. If it is a descent direction,the algorithm solves RLOA k ( N ( · )) in Line 34. Even if the algorithm fails to find an activation pattern thatsatisfies the condition (3.32), it does not mean that the algorithm identifies an (cid:15) − optimal solution. Thealgorithm still needs to solve RLOA k ( N ( · )) in Line 34 for all possible activation patterns. The main purposeof the preliminary check is to quickly find a region with higher potential of improvement.It is very challenging to find an exact solution at the boundary. Therefore, in the implementation, theboundary solutions are determined with a numerical tolerance τ > B ( x ) = (cid:40) j ∈ N r (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) i w ij ˜ t i + b j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ τ (cid:41) . (3.33) Theorem 3.3. If X is compact, Algorithm 3.1 will converge to an (cid:15) − local optimal solution as N → ∞ .Proof. By the design of the algorithm, it only checks the optimality condition with hyperplanes within thecorresponding convex region respect to the best known solution. Since L ( · , ˆ f ) is convex and X is compact,the algorithm converges to an (cid:15) − local optimal[Kelley, 1960].11ote that the feasible set X does not need to be convex in order to prove the local optimality [Geoffrion, 1972,Eaves and Zangwill, 1971]. 
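To make the control flow of Algorithm 3.1 concrete, the following is a highly simplified sketch of the primal/dual loop. The subproblem solvers (solve_dloa, solve_noa, solve_rloa) and the neighbor-pattern generator are hypothetical callables standing in for the subproblems (DLOA_k), (NOA) and (RLOA_k(N(·))) defined above (e.g., closures over the recorded cuts and the incumbent); cut management, pattern caching and the boundary tolerance (3.33) are omitted.

```python
def oa_guided_search(x0, g_and_grad, solve_dloa, solve_noa, solve_rloa,
                     neighbor_patterns, eps=1e-4, step=0.1,
                     step_min=1e-4, step_max=1.0, rho_c=0.5, rho_e=1.5,
                     max_iter=1000):
    """Simplified skeleton of Algorithm 3.1 (outer-approximation guided search)."""
    x_best, g_best = list(x0), float("inf")
    x_k, dual_phase = list(x0), False
    for _ in range(max_iter):
        g_k, grad_k = g_and_grad(x_k)        # cut recording is assumed to happen
                                             # inside the solver closures
        if g_k < g_best:                     # improvement: expand the step size
            x_best, g_best = x_k, g_k
            step, dual_phase = min(rho_e * step, step_max), False
        else:                                # no improvement: shrink the step size
            step = max(step_min, rho_c * step)
        if step <= step_min:
            dual_phase = True
        x_hat = None
        if not dual_phase:
            x_hat, v = solve_dloa()          # distance-localized approximation
            if g_best - v < eps:
                x_hat, v = solve_noa()       # re-evaluate with (NOA)
                if g_best - v < eps:
                    dual_phase = True
        if dual_phase:
            x_hat = None
            for pattern in neighbor_patterns(x_best):
                cand, v = solve_rloa(pattern)   # region-localized approximation
                if g_best - v >= eps:           # region with potential improvement
                    x_hat = cand
                    break
            if x_hat is None:                   # eps-local optimality certified
                break
        x_k = [xb + step * (xh - xb) for xb, xh in zip(x_best, x_hat)]
    return x_best, g_best
```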
The current representation of the algorithm retains all prior hyperplanes. Thealgorithm can be improved by deleting prior hyperplanes [Hogan, 1973].Checking all possible activation patterns in M ( x ∗ ) guarantees an (cid:15) − local optimal. However, Set M ( x ∗ )can have exponentially many possible patterns, which leads to computational challenges. Alternatively, thealgorithm can examine the previously visited patterns and add a termination check at Line 40 by solving theouter-approximation (NOA). If the bound from the approximation is close to the best known objective, thenthe algorithm terminates. Otherwise, the algorithm continues with the solution of approximation (NOA).While this approach would reduce the computational challenges of examining on all possible patterns, theapproach does not guarantee a local optimality in some cases. For example, the boundary solution x = 4 inFigure 1 is the case if all previous observations are strictly greater than 4. The gradient projection methods can solve a class of constrained neural network inverse problems. Thegradient projection methods are an iterative method that identifies a descent direction with gradients andensures the feasibility through projection operations [Bertsekas, 1999]. An iteration of gradient projectionmethods can be described as follows; x k +1 = x k + α k (¯ x k − x k ) , (3.34)¯ x k = [ x k − s k ∇ g ( x k )] + X , (3.35)where x k is the solution at iteration k , ∇ g ( x k ) is the gradient of x k , [ · ] + X denotes projection on the set X ,and parameters α k and s k are a step size and a positive scalar, respectively. Variable ¯ x k is the projectedsolution of x k − s k ∇ g ( x k ) onto the set X . For a given solution x k , Equation (3.35) finds a descent direction x k − s k ∇ g ( x k ) and if the resulting solution is outside of feasible region X , it projects the solution ontothe feasible region X . Similar to gradient based algorithms, Equation (3.34) uses a step size to update thesolution.The projection operation can be computationally expensive depending on the characteristics of feasibleregion X . If the feasible region is simple bound constraints, the projection operation is same as roundingoperation. For other cases, the projection operation can be described as a quadratic programming problem,which often requires computational efforts. This section discusses the computational results against two sets of neural network inverse problem instances;One set is based on a material design problem. The other instances are randomly generated.All algorithms are implemented in Python 3.6.5 along with Tensorflow 1.10.0. The linear programmingproblems are implemented in PuLP 1.6.8 . The main linear programming solver of the implementation isCbC 2.9.0 , which is the default linear solver for PuLP. Computational experiments have been conductedon a Sandy Bridge dual-socket Linux machine with eight 2.6 GHz cores on each socket and 64GB of RAM. The material design problem considered in this section is to find the optimal topological polymer structure forgiven desired rheological properties. The main approach consists of two phases; the first phase is developinga neural network that mimics the behavior of the forward simulation, bob-rheology , that predicts therheological properties with a given topological structure [Read et al., 2011]. 
The second phase is solvingan inverse optimization problem for a given target rheological property description with the trained neural LP modeler written in Python https://github.com/coin-or/pulp https://projects.coin-or.org/Cbc https://sourceforge.net/projects/bob-rheology/ × × × × I (cid:48) . The rest of weights and biases are fixed.The projected gradient method is implemented within Tensorflow with the following modification. Themain modification is introducing two additional layers prior to the original input layer in order to describethe ratio constraint and describe the input parameters as variables. Let I be the input layer of the originaltrained neural network. Let I (cid:48) be the additional layer connected the input layer I and I (cid:48)(cid:48) be another layeradjacent to the layer I (cid:48) . We create one fewer neurons in layer I (cid:48) such that we duplicate all neurons in layer I except one neuron in the ratio constraint. In layer I (cid:48)(cid:48) , we create only one neuron. We set the biases of layer I to 0 except that the bias of the neuron without a replication in layer I (cid:48) is set to 1. The weights from layer I (cid:48) to layer I are set to 1 if both neurons have a same neuron number. The weight from the ratio neuron inlayer I (cid:48) to the ratio neuron in layer I is set to -1. All other weights from layer I (cid:48) to layer I are set to 0. Theweights from neurons in layer I (cid:48)(cid:48) to neurons in layer I (cid:48) are set to 0. The neuron in layer I (cid:48)(cid:48) is the input layerof the modified neural network. All neurons in layer I (cid:48)(cid:48) , I (cid:48) and I are a linear system without any activationfunction. The rest of neural network remains same as the original trained neural network. At last, we set onlythe biases of neurons in layer I (cid:48) trainable and set all other weights and biases not trainable. Figure 2 depictsthe modified neural network. Neurons and weight parameters in the dotted box are modified compared tothe original trained neural network.The bound constraints can be incorporated with a simple projection procedure. First, we optimize thebiases in layer I (cid:48) of the modified neural network with a Tensorflow optimizer. With a fixed number ofiterations, we examine the solution if it is within the bounds. If it is outside of the bounds, we project it tothe feasible bounds. Since it is a simple bound, the projection is equivalent to move the value to the closest13abel Step size ( γ s ) Neig. size ( γ c )OGO-0.01-0.1 10 − − OGO-0.01-0.05 10 − × − OGO-0.01-0.01 10 − − OGO-0.1-0.5 10 − × − OGO-0.1-0.05 10 − × − Label Step sizePGO-0.01 10 − PGO-0.001 10 − PGO-0.0001 10 − (a) Outer approximation guided method (b) Projected gradient methodTable 1: Parameter settings for computational experimentsbound.The projected gradient algorithm has two termination criteria; one condition is when the next solution issame as the previous solution, which means either the gradient of the current solution is equal to zero or thedescent direction is infeasible respect to the bounds. The other is stalling behaviors such that the Tensorflowcannot find a better solution with multiple attempts. The projected gradient approach and the proposed outer-approximation guided approach have been testedon 100 instances. All instances share the same trained neural network discussed in Section 4.1 with a giventarget and the mean squared error as the loss function. 
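For reference, the projected gradient baseline of Equations (3.34)-(3.35) with the simple bound projection described above reduces, with α_k = 1 and a fixed scalar s_k, to the following minimal sketch. The actual baseline in this paper is implemented inside Tensorflow with the modified network of Figure 2; the names and stopping rule here are assumptions made for the example.

```python
import numpy as np

def projected_gradient(x0, g_and_grad, lb, ub, step=1e-3, max_iter=5000, tol=1e-9):
    """Projected gradient iteration x_{k+1} = [x_k - s * grad g(x_k)]_X^+,
    where the projection onto simple bounds is a componentwise clip."""
    x = np.clip(np.asarray(x0, dtype=float), lb, ub)
    for _ in range(max_iter):
        _, grad = g_and_grad(x)
        x_new = np.clip(x - step * grad, lb, ub)   # gradient step + bound projection
        if np.linalg.norm(x_new - x) <= tol:       # no movement: stop
            break
        x = x_new
    return x
```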
The only difference between instances is their startingpoints, which have been collected through examination of the training data set. For each training set, it canbe easily to measure the loss between the corresponding output and the given target. The following resultsare a summary of the 100 optimization runs.Table 1 summarizes the various optimization hyper-parameter configurations for the experiment. Forthe projected gradient method, Adam and RMSprop have been tested as the Tensorflow optimizer. Forthese specific instances, RMSprop outperforms the other optimizer. For the stalling termination criteria, thenumber of no improvement steps is set to 20, which is tuned to a balanced setting between the solutionquality and the computational time. In order to improve the solution time of the projected gradient method,the projection operation is conducted at every 16 (epochs) steps of Tensorflow optimization instead ofprojection at each step. Table 1 (b) summarizes various step sizes that have been tested for the projectedgradient method. Table 1 (a) summarizes various step size parameter γ s and neighborhood size parameter γ c of problem (DLOA) for the proposed outer-approximation guided method. All other hyper-parametersfor Algorithm 3.1 are summarized in Table 2. (cid:15) γ s min γ s max ρ c ρ e N E − γ c (cid:15) × √ n . . E n denotes the number of input vari-ables.The percent-gap-closed to the best known solution are employed as the solution quality metrics. Thepercent-gap-closed metrics ρ kj for instance j and approach k is defined as follows; ρ kj = v j − v kj v j − v ∗ j , (4.1)where v j and v ∗ j denote the initial objective value and the best objective value among all approaches forinstance j . v kj is the objective value of approach k for instance j . The denominator of Equation (4.1) com-putes the best known improvement for instance j and the numerator computes the improvement for thecorresponding approach. The metrics measures how close to the best known solution the solution from eachmethod is. 14 .1.2 Overall solution time and quality Figure 3: The average solution time and average percent-gap-closed metrics. The x-axis represents optimiza-tion approaches with various hyper-parameter settings. The bar charts are the average solution time of 100instances. The dot charts show the average percent-gap-closed across 100 optimization runs at termination.The blue bars and dots are the results for the primal phase and the orange bars and dots are correspondingto the dual phase.Figure 3 summarizes the average solution time and the average percent-gap-closed metrics across 100instances in the primal and dual phases. The proposed outer-approximation algorithm can alter the statusbetween primal and dual phases. However, in the computational summary, once the algorithm enters thedual phase, the subsequent procedure is treated as the dual phase. The projected gradient method has onlythe primal phase.For the projected gradient method, as the step size decreases, the solution time increases and the solutionquality is improved. There is no such clear trend for the proposed algorithm in terms of the solution qualityand time. The computational result implies that the time spent in the primal phase is dependent on theratio of the step size γ s and the neighborhood size γ c . If they are similar, the algorithm tends to spend moretime in the primal phase.The proposed outer-approximation guided method outperforms the projected gradient method in termsof the solution time and quality. 
The projected gradient method with a small step size (PGO-0.01) cansolve the problems within comparable solution times as the proposed algorithm but it cannot achieve similarsolution qualities. The solution quality from the proposed approach is superior to one from the projectedgradient method.The average improvement on the computational time ranges between 30 and 35 seconds, which can beconsidered as insignificant. However, in the material design context, it is required to solve many cases upto hundreds or thousands in order to produce many leading candidates and overcome the local optimalityissues. In that regard, the percentage improvement, 1 - the average time for the proposed algorithm / theaverage time for the projected gradient method, is a more meaningful measure, which ranges between 77%and 92%. It implies that the proposed algorithm can solve the same number of problems with less than 23%of the computational efforts. Figure 4 shows the progress profile of the solution for four approaches (OGO-0.01-0.1, OGO-0.1-0.05, PGO-0.001, PGO-0.0001). The graphs in Figure 4 show the average solution quality over time. It clearly shows15igure 4: The progress profile of the average percentage gap closed metrics. The x-axis represents time spentin seconds. The y-axis is the average percentage gap closed.that the proposed algorithm finds a better solution quickly. Both outer-approximation approaches achieveover 90% solution quality within 5 seconds whereas the projected gradient methods around 80% of the bestknown solution.
The random instances are generated as follows; The process starts with a given neural network structure.The first step is randomly generating weights and biases for the given neural network structure. The initialrandomization uses a normal distribution. Randomly generated neural networks tend to have a huge spreadat outputs even with controlled inputs ranging between 0 and 1. The second step is to scale the output ofthe neural network. In the second step, the neural network is re-trained with many scaled outputs, whichcan be generated from the neural network from the first step. Many pairs of input and output data can begenerated along with the neural network from the first step. After the output data is processed to rangebetween 0 and 1, the processed data is used to retrain the neural network.The two network architectures in Table 3 are considered as the base structures. The targets are randomlygenerated and are not part of the scaled training set. The starting points are randomly chosen with a standarduniform distribution. Only bound constraints for inputs are considered.S1 100,256,128,64,128,64,32,8S2 256,128,64,128,64Table 3: Base neural network structures for random instancesFor the outer approximation guided approach, the hyper-parameter settings in Table 4 are tested. Inthis experiment, the various optimality gap criteria are considered. For the projected gradient approach, twostep sizes are considered. PGO 0.001 and PGO 0.0001 refer to the projected gradient methods with step size0 .
001 and 0 . − and initial solutions aretypically far from the local optimal solution because of random starting points. These two factors make thedenominator of the metrics large and the differences in the numerator small. In this experiment, the absolute16abel (cid:15) γ s γ c γ s min γ s max ρ c ρ e N OGO 0.001 1 E − .
01 0 . γ c (cid:15) × √ n . . E E − .
01 0 . γ c (cid:15) × √ n . . E n denotes thenumber of input variables.difference to the best known solution is used instead of the percentage-gap-closed. The metrics measures theabsolute difference between the best known solution and the solution of each instance and approach.(a) S1 (b) S2Figure 5: The progress profile of the average absolute difference. The x-axis represents time spent in seconds.The y-axis in a logarithmic scale is the average absolute difference to the best known solution.Figure 5 shows the representative progress profiles of experiments. The y-axis in a logarithmic scale is theaverage absolute difference between the solution over time and the best known solution. Each line denotesthe different approach. Note that the orange line is truncated at 10 − for display purpose.The outer approximation guided approach finds a good quality solution quickly for neural network struc-ture S1 instances compared to the projected gradient method. For neural network structure S2 instances, theprojected gradient method with step size of 0 .
001 shows better performance than two outer-approximationguided approaches at the earlier stage of the progress. After around 50 seconds, the outer-approximationguided approach with optimality tolerance of 10 − achieves a better average solution quality. Note thatthe optimality gap (cid:15) is respect to a local optimal solution and not for the global optimality. Therefore,OGO 0.0001 can have solutions with more than 0 . This paper discusses constrained neural network inverse optimization problems that find the optimal set ofinput parameters for a desired output and proposes an outer-approximation guided method, especially whenthe inverse problem has additional constraints on input parameters. The proposed method is devised byexploiting the characteristics of the local optimal solution for neural network inverse optimization problemswith rectified activation units. The proposed algorithm consists of primal and dual phases. In the primalphase, the algorithm incorporates neighbor gradients through outer approximations of neighbors to expeditethe convergence to a good quality solution. In the dual phase, the algorithm exploits the convex structure ofthe local optimal solution to improve the speed of convergence. In addition, the paper proposes a method togenerate feasibility cuts without solving an explicit dual problem. The superiority of the proposed algorithm17s demonstrated with computational results compared to a projected gradient method for a material designproblem and randomly generated instances.
References

[Anderson et al., 2019] Anderson, R., Huchette, J., Tjandraatmadja, C., and Vielma, J. P. (2019). Strong mixed-integer programming formulations for trained neural networks. In Lodi, A. and Nagarajan, V., editors, Integer Programming and Combinatorial Optimization, pages 27–42, Cham. Springer International Publishing.

[Ben-Tal and Nemirovski, 2005] Ben-Tal, A. and Nemirovski, A. (2005). Non-euclidean restricted memory level method for large-scale convex optimization. Mathematical Programming, 102(3):407–456.

[Benders, 1962] Benders, J. F. (1962). Partitioning procedures for solving mixed-variables programming problems. Numerische Mathematik, 4(1):238–252.

[Bertsekas, 1999] Bertsekas, D. (1999). Nonlinear Programming. Athena Scientific.

[Chen and Gu, 2020] Chen, C. T. and Gu, G. X. (2020). Generative deep neural networks for inverse materials design using backpropagation and active learning. Advanced Science, page 10.

[Corts et al., 2009] Corts, O., Urquiza, G., and Hernndez, J. (2009). Optimization of operating conditions for compressor performance by means of neural network inverse. Applied Energy, 86(11):2487–2493.

[Duran and Grossmann, 1986] Duran, M. A. and Grossmann, I. E. (1986). An outer-approximation algorithm for a class of mixed-integer nonlinear programs. Mathematical Programming, 36(3):307–339.

[Eaves and Zangwill, 1971] Eaves, B. and Zangwill, W. (1971). Generalized cutting plane algorithms. SIAM Journal on Control, 9(4):529–542.

[Fischetti and Jo, 2018] Fischetti, M. and Jo, J. (2018). Deep neural networks and mixed integer linear optimization. Constraints, 23(3):296–309.

[Geoffrion, 1970] Geoffrion, A. M. (1970). Elements of large-scale mathematical programming part I: Concepts. Management Science, 16(11):652–675.

[Geoffrion, 1972] Geoffrion, A. M. (1972). Generalized Benders decomposition. Journal of Optimization Theory and Applications, 10(4):237–260.

[Hochreiter and Schmidhuber, 1997] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

[Hogan, 1973] Hogan, W. W. (1973). Applications of a general convergence theory for outer approximation algorithms. Mathematical Programming, 5(1):151–168.

[Hornik et al., 1990] Hornik, K., Stinchcombe, M., and White, H. (1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3(5):551–560.

[Kelley, 1960] Kelley, Jr., J. (1960). The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712.

[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.

[Linden and Kindermann, 1989] Linden, A. T. and Kindermann, J. (1989). Inversion of multilayer nets. International 1989 Joint Conference on Neural Networks, pages 425–430 vol. 2.

[O'Malley et al., 2019] O'Malley, D., Golden, J. K., and Vesselinov, V. V. (2019). Learning to regularize with a variational autoencoder for hydrologic inverse analysis.

[Peters and Schaal, 2008] Peters, J. and Schaal, S. (2008). Natural Actor-Critic. Neurocomputing.

[Read et al., 2011] Read, D. J., Auhl, D., Das, C., den Doelder, J., Kapnistos, M., Vittorias, I., and McLeish, T. C. B. (2011). Linking models of polymerization and dynamics to predict branched polymer structure and flow. Science, 333(6051):1871–1874.

[Rezaee and Dadkhah, 2019] Rezaee, M. J. and Dadkhah, M. (2019). A hybrid approach based on inverse neural network to determine optimal level of energy consumption in electrical power generation. Computers & Industrial Engineering, 134:52–63.

[Schmidhuber, 2015] Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61:85–117.

[Silver et al., 2017] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., Van Den Driessche, G., Graepel, T., and Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature.

[Szegedy et al., 2014] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. (2014). Intriguing properties of neural networks. In