Learning to Solve AC Optimal Power Flow by Differentiating through Holomorphic Embeddings
Henning Lange∗, Bingqing Chen†, Mario Bergés†, Soummya Kar†
∗University of Washington, Seattle, WA 98195, USA
†Carnegie Mellon University, Pittsburgh, PA 15213, USA
Abstract—Alternating current optimal power flow (AC-OPF) is one of the fundamental problems in power systems operation. AC-OPF is traditionally cast as a constrained optimization problem that seeks optimal generation set points whilst fulfilling a set of non-linear equality constraints, the power flow equations. With increasing penetration of renewable generation, grid operators need to solve larger problems at shorter intervals. This motivates the research interest in learning OPF solutions with neural networks, which have fast inference times and are potentially scalable to large networks. The main difficulty in solving the AC-OPF problem lies in dealing with this equality constraint that has spurious roots, i.e. there are assignments of voltages that fulfill the power flow equations but that are not physically realizable. This property renders any method relying on projected gradients brittle because these non-physical roots can act as attractors. In this paper, we show efficient strategies that circumvent this problem by differentiating through the operations of a power flow solver that embeds the power flow equations into a holomorphic function. The resulting learning-based approach is validated experimentally on a 200-bus system and we show that, after training, the learned agent produces optimized power flow solutions reliably and quickly. Specifically, we report a 12x increase in speed and a 40% increase in robustness compared to a traditional solver. To the best of our knowledge, this approach constitutes the first learning-based approach that successfully respects the full non-linear AC-OPF equations.
Index Terms—alternating current optimal power flow; reinforcement learning; control; holomorphic embeddings
I. INTRODUCTION
The Optimal Power Flow (OPF) problem is fundamental to power systems operation [1]. In general, OPF finds the optimal generation set points that minimize operation costs given a set of loads while at the same time satisfying physical and security constraints. The physicality of solutions is ensured by enforcing the power flow equations, a set of non-linear equality constraints.

The increasing penetration of distributed energy resources (DER) is posing new challenges for power system operation. Traditionally, generation schedules were updated at 5-min intervals [1]. Due to the variability and uncertainty of renewable generation, system operators are required to update generation schedules more frequently. At the same time, system operators need to manage a growing number of smaller resources, resulting in a significantly larger problem than the conventional OPF problem [2]. Thus, the integration of DERs requires
Fig. 1: Framework: We propose a learning-based approach to the AC-OPF problem by directly differentiating through a HELM solver. Thus, we learn a neural policy, g_Θ(S_d), that respects the power flow equations. At the same time, we impose additional constraints, e.g., voltage magnitude and unit commitment, by learning the dual variables corresponding to these constraints with another NN, u_Ψ(S_d). Finally, we learn the policy end-to-end by optimizing the Lagrangian function, L.

solving larger OPF problems at shorter time intervals. This motivates research on using function approximators, such as neural networks (NN), to learn solutions to the OPF problem, due to their fast inference and their potential to scale well to larger power networks [3]. In order to deal with the non-linearity of the OPF problem, a common practice in power system operation is to linearize the grid dynamics, leading to the DC-OPF approximation. The DC-OPF problem is generally formulated as either a linear program (LP) or a quadratic program (QP), depending on the cost function. Such a linearization can work reasonably well for small networks, but the resulting approximation error scales poorly with network size [4]. Furthermore, the assumptions required for linearizing the power flow equations are not valid for heavily-loaded or distribution networks [5], and the author of [6] shows that even under mild assumptions DC solutions are usually never AC feasible. Ideally, the full non-linear AC power flow equations should be used. However, using the non-linear power flow equations comes with another challenge.
Because the non-linear AC-OPF equality constraints have a fractal nature riddled with spurious roots, which act as attractors [7], it is difficult to disambiguate physical from spurious non-physical solutions. As a result, convergence to the true solution cannot be guaranteed, rendering existing solvers brittle. In this paper, we propose a learning-based approach to the AC-OPF problem that alleviates the aforementioned issues by differentiating through a robust power flow solver, as shown in Figure 1. Our contributions can be summarized as follows. Firstly, we show that a power flow solver based on a holomorphic embedding of the power flow equations [8] is differentiable in its arguments. The primary benefit of using the Holomorphic Embedded Load Flow Method (HELM) is that it is much more likely to find the correct solution if it exists. Secondly, we can use this differentiable HELM solver as a computational layer in a NN. This allows us to learn a neural policy that solves the OPF problem while respecting the power flow constraints. Furthermore, we show how this method can be extended to enforce convex security constraints, i.e., voltage magnitudes, as well as non-convex integer constraints, i.e., unit commitment. To validate our approach, we empirically show that it is faster and more robust (i.e., results in physically-viable solutions more often) on a 200-bus system in comparison to traditional solvers based on mixed-integer interior point methods (MATPOWER [9]). Specifically, our proposed method achieves a 12x increase in speed and a 40% increase in robustness compared to MATPOWER.

II. RELATED WORK
A. Challenges of Solving AC-OPF
As described earlier, AC-OPF is traditionally posed as a constrained optimization problem, i.e. a cost function is minimized under network constraints, oftentimes relying on some form of projected gradient descent [10]. Note that this problem formulation faces numerous computational difficulties. First, the non-linear equality constraint (2) poses a challenge. In reality, (2) is a necessary but not sufficient condition for the system to be in a physical state. There are assignments of nodal voltages v that fulfill the power flow equations that, however, do not constitute a physical state [11], [7]. Optimization algorithms based on some form of projected gradient descent might be attracted to such a non-physical solution, rendering them not robust. Advanced techniques like the Homotopy [12] or Continuation [13] methods alleviate but do not fully remedy this problem, and can incur substantial computational costs. Furthermore, non-convex constraints create additional computational issues. Algorithms such as branch-and-bound that are typically employed to deal with integer constraints require solving multiple, and in the worst case exponentially many, relaxed linear programs [14].

B. Learning OPF
Because larger OPF problems need to be solved more frequently, there is growing interest in learning to solve OPF problems with NNs. These approaches formulate OPF as the problem of learning a policy, which maps demand configurations to generation set points. This policy could be learned from historical data, i.e., imitation learning, and/or by interacting with the system, i.e. reinforcement learning. The most common paradigm of existing solutions is to learn a mapping from demand assignments to optimal generation set points based on a data set provided by an external solver, such as MATPOWER [9]. This is analogous to behaviour cloning of an expert policy. In this case, the external OPF solver constitutes the expert policy [15]. Following [16], we refer to such a paradigm as OPF-then-learn. Specifically, the authors of [17], [3] pose AC-OPF as an end-to-end regression task. In [3], graph neural networks (GNN) were adopted, which make predictions based on local information from neighboring nodes. Such an approach allows for a decomposition of the grid-level problem based on power system connectivity and, in principle, scales well to large networks. However, a significant limitation of such an approach is that the solutions may not adhere to physical and security constraints inherent to the physical system. Concretely, 49% and 30% of the solutions were infeasible when applied to the IEEE 30-bus and 118-bus test systems respectively [17]. This has motivated research on learning-based methods that respect constraints. The authors of [15], [18], [19] studied the problem in the context of DC-OPF. Since DC-OPF can either be posed as an LP or QP, the problem has amenable mathematical properties and permits relatively simple solution strategies. More relevant to our work, [20], [1] studied the problem in the context of AC-OPF. To ensure feasibility, in [20] generation set points produced by a NN were passed into a power flow solver. This speeds up the computation since the power flow equations are easier to solve than AC-OPF. In [1], the authors applied primal-dual updates to the Lagrangian dual function of the underlying constrained optimization problem, where a NN predicts the generation set points and the power system states. Notably, the approach proposed by [1] produces more accurate and cost-effective solutions than those by the DC-OPF approximation, which is commonly used in industry.

However, there are key limitations to the OPF-then-learn paradigm. Firstly, a large labeled data set needs to be generated for training. Given the convergence issues of solving AC-OPF, some load configurations may not have a solution from commercial solvers. Secondly, the learned solution may not generalize to unseen scenarios, and re-training the system on incoming demand assignments requires creating a new training set. Because of this, the authors of [16] proposed OPF-and-learn as an alternative paradigm where generation cost is optimized directly by computing gradients through the OPF problem. While OPF-then-learn is analogous to imitation learning, OPF-and-learn can be seen as a one-step Markov Decision Process (MDP). However, the OPF-and-learn approach proposed in [16] relies on the DC-OPF linearization, which limits the accuracy of solutions and is not applicable to heavily-loaded or distribution networks. Furthermore, when assuming the DC-OPF linearization, creating a training set is trivial because optimal solutions can be obtained reliably and fast. Thus, by making the DC-OPF assumption, the circular dependence on training data is broken, but a significantly easier problem is solved, which limits the potential advantages of the OPF-and-learn paradigm. To the best of our knowledge, this paper introduces the first OPF-and-learn approach that respects the full non-linear power flow constraints as well as non-convex unit commitment and non-linear physical and security constraints.
III. PROPOSED LEARNING FRAMEWORK
Let c(v) be the cost associated with an assignment of nodal voltages v ∈ C^N, with N being the number of buses of the system. In the following, without loss of generality, we define the cost in terms of v for presentational simplicity. However, numerous constraints need to be enforced, some of which are non-linear and non-convex, i.e. the load flow problem is traditionally posed as:

minimize w.r.t. v: c(v)  (1)
subject to: S = S_g − S_d = diag(v)(Yv)^*  (2)
k_i(v) ≤ 0  (3)
h_i(v) = 0  (4)

with S_d ∈ C^N and S_g ∈ C^N being a demand and generation assignment, respectively, and Y ∈ C^{N×N} being the bus admittance matrix. For different demand assignments S_d, this optimization problem would be solved over and over again. Our proposed learning-based formulation following the OPF-and-learn paradigm is as follows: First, we note that the nodal voltages v are a function of S_d and S_g, i.e. v = v(S_d, S_g) with v solving the power flow equations (2). Second, we introduce a function that is tasked with producing optimal generator configurations S'_g as a function of the system state, in this case the demand S_d, in an expected risk minimization setting. Thus, S'_g = g_Θ(S_d) with Θ parameterizing the function g. Third, we assume knowledge of a data set D containing historic demand assignments, i.e. a collection of possible states S_d ∈ D. Note that unlike OPF-then-learn approaches [cite,cite,cite] we do not assume knowledge of the corresponding optimal generator assignments. The goal is to estimate the parameters of the function g, in this case Θ, that produce optimal generation assignments as a function of the demand. As an optimization objective and learning signal for the function g, we propose the following:

minimize w.r.t. Θ: Σ_{S_d ∈ D} c(v(S_d, g_Θ(S_d))) = L  (5)
subject to: k_i(v) ≤ 0  (6)
h_i(v) = 0  (7)

When compared to traditional optimization-based approaches, this problem formulation has a number of advantages:
1) The non-linear power flow constraint (2) vanishes. This increases robustness and avoids convergence to non-physical solutions, given a differentiable and robust power flow solver.
2) As we will show later, optimization under non-convex constraints can be amortized, i.e. the time training the system is spent once and, after training, inference is extremely fast and solely requires a forward-pass through a NN, even when respecting binary unit-commitment constraints.
3) Because training g entails learning an optimal mapping between Euclidean spaces that represent demand and generation assignments respectively, the proposed approach exploits covariances between load flow problems and allows for generalization to unseen problems.

However, optimizing Θ w.r.t. (5) poses challenges: If gradient descent is used for optimization, gradients need to be defined. Applying the chain rule to L yields:

∂L/∂Θ = (∂c/∂v)(∂v/∂g)(∂g/∂Θ)

Thus, the loss is differentiable if the cost function c, the voltage function v and the actor function g are differentiable. g and c can usually be assumed to be differentiable; however, the fact that computing the gradient through a power flow solver w.r.t. the generation assignment S_g, i.e. ∂v/∂S_g, is possible might not be obvious. In section IV, we will show that computing the gradient through a robust power flow solver that represents the full non-linear AC-OPF equations, namely the Holomorphic Embedded Load Flow Method, is indeed possible. Doing this allows us to obtain the learning signal ∂L/∂Θ. Furthermore, the constraints (6) and (7) need to be enforced.
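To make the chain-rule reasoning concrete, the following toy sketch differentiates through an unrolled scalar "solver" and uses the resulting gradient to train a one-parameter policy. Everything here is an illustrative stand-in, not the paper's model: the fixed-point map, the policy g_θ(s_d) = θ·s_d, and the demand values are all made up, and central differences stand in for automatic differentiation.

```python
import numpy as np

def solver(s, iters=40):
    """Toy stand-in for a differentiable power flow solver: unrolled
    fixed-point iteration for v solving v*(v - 1) = -s, a scalar
    caricature of the equality constraint (2)."""
    v = 1.0
    for _ in range(iters):
        v = 1.0 - s / v  # contraction for small |s|
    return v

def loss(theta, demands):
    """Miniature Eq. (5): cost of the solver output summed over a demand
    data set, with the hypothetical policy g_theta(s_d) = theta * s_d."""
    return sum((solver(s_d - theta * s_d) - 1.0) ** 2 for s_d in demands)

def train(demands, steps=200, lr=10.0, h=1e-6):
    """Gradient descent on theta; the learning signal flows *through* the
    unrolled solver (finite differences stand in for backpropagation)."""
    theta = 0.0
    for _ in range(steps):
        grad = (loss(theta + h, demands) - loss(theta - h, demands)) / (2 * h)
        theta -= lr * grad
    return theta
```

At the optimum θ = 1 the net injection vanishes, the solver returns v = 1 and the cost is zero, mirroring how the power-flow constraint is absorbed into the forward pass rather than imposed via projection.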
In section V, we will show that an auxiliary function u can be used to enforce the Karush-Kuhn-Tucker conditions, ultimately allowing us to enforce arbitrary constraints similar, but not identical, to the approaches introduced in [1], [16]. In section VI, we show how the proposed approach can be extended to handle non-convex constraints, e.g. binary unit commitment constraints, by optimizing a variational lower bound whilst introducing little computational overhead during inference. The approach is validated empirically on a 200-bus system. The experimental setup and results are described in section VII. In section VIII, our findings are concluded and pathways for future work are laid out.

IV. HOLOMORPHIC EMBEDDED LOAD FLOW METHOD
HELM was first proposed by Trias [8], [21] and was later extended in e.g. [22], [23]. HELM addresses the convergence issues of power flow solvers based on Householder's methods for root-finding, such as the Newton-Raphson algorithm. Because the power flow equations have multiple roots but only a single physically realizable solution, these types of solvers are at risk of converging to a spurious, unstable or low-voltage solution [24]. Whether or not the 'correct' root is found is usually dependent on the initial condition [7]. Because HELM does not require an initial guess or initial condition, it overcomes the ambiguity problem that Householder's solvers face by performing analytical continuation from a known physically-realizable solution. Analytical continuation also allows HELM to find the root that is on the same branch-cut as the previously known physically-realizable solution. This solution is unique because analytical continuation is unique when the function at hand is holomorphic. HELM treats the complex nodal voltages at each bus as holomorphic functions of a complex scalar z. These functions are evaluated at a point for which obtaining a physically-realizable solution is trivial (usually z = 0) and are then continued to the solution at a desired point where the original power flow equations are recovered (usually z = 1).
Let V(z) be a function of the complex scalar z. Equation (8) then describes such a holomorphic embedding, i.e. obtaining a solution at z = 0 is trivial because no power is flowing, and the original power flow equations (2) are recovered at z = 1. See [23] for a proof that V(z) is indeed holomorphic.

Y V(z) = z S^* / V^*(z^*)  (8)

In order to obtain the power series coefficients required for analytical continuation, V(z) and its reciprocal are approximated by power series expansions, i.e.:

V(z) = Σ_{n=0}^∞ c[n] z^n  (9)
1/V(z) = W(z) = Σ_{n=0}^∞ d^*[n] z^n  (10)

Similar to traditional power flow solvers such as Newton-Raphson, in order to avoid overspecification of the problem, a slack bus is introduced: Let Y^r ∈ C^{(N−1)×(N−1)} be the reduced Y matrix obtained by removing the row and column of the slack bus and y^s ∈ C^{N−1} be the slack-row of Y sans self-admittance. We assume that the voltage at the slack generator is v_s + 0j with v_s ∈ R. For the i-th row of (8) the following then holds:

Σ_k Y^r_{ik} Σ_n c_k[n] z^n + (v_s + 0j) y^s_i = z S^*_i Σ_n d^*_i[n] z^n  (11)

Setting z = 0:

Σ_k Y^r_{ik} c_k[0] = −v_s y^s_i  (12)

Thus, solving the linear system in (12) yields a solution at z = 0. Note that this solution is physically realizable because no power is flowing. Higher-order power series coefficients can be obtained by equating coefficients of the same order and by making use of (Σ_{n=0}^∞ c[n] z^n)(Σ_{n=0}^∞ d^*[n] z^n) = 1, which yields:

d_k[0] = 1 / c_k[0]  (13)
Σ_k Y^r_{ik} c_k[n] = S^*_i d^*_i[n−1]  (14)
d_i[n] = −(Σ_{m=0}^{n−1} c_i[n−m] d_i[m]) / c_i[0]  (15)

After obtaining the power series coefficients, analytic continuation is performed to obtain a solution at z = 1. However, since the radius of convergence is usually smaller than 1, analytical continuation is performed using Padé approximants instead of directly evaluating (9).
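The recursion (12)-(15) amounts to a handful of linear solves and convolutions. A minimal numpy sketch follows; the conjugate placement is our reading of (8)/(11), and the usage sums the series directly at z = 1 instead of using Padé approximants, which is valid only under light loading. The 2-bus admittance and injection values are made up for illustration.

```python
import numpy as np

def helm_coefficients(Yr, ys, vs, S, n_terms):
    """Power-series coefficients of the holomorphic embedding, Eqs. (12)-(15).

    Yr: reduced admittance matrix (slack row/column removed), complex (N-1, N-1)
    ys: slack column of Y without self-admittance, complex (N-1,)
    vs: real slack-bus voltage
    S:  net complex injections S_g - S_d at non-slack buses, complex (N-1,)
    """
    n_bus = len(ys)
    c = np.zeros((n_terms, n_bus), dtype=complex)
    d = np.zeros((n_terms, n_bus), dtype=complex)
    c[0] = np.linalg.solve(Yr, -vs * ys)   # Eq. (12): germ at z = 0, no power flowing
    d[0] = 1.0 / c[0]                      # Eq. (13)
    for n in range(1, n_terms):
        # Eq. (14): next voltage coefficient from the previous reciprocal one
        c[n] = np.linalg.solve(Yr, np.conj(S) * np.conj(d[n - 1]))
        # Eq. (15): convolution enforcing V(z) * W(z) = 1
        d[n] = -sum(c[n - m] * d[m] for m in range(n)) / c[0]
    return c, d

# Hypothetical 2-bus system: one slack bus, one load bus, line admittance y.
y = 1.0 - 10.0j
Yr = np.array([[y]])
ys = np.array([-y])
S = np.array([-0.10 - 0.05j])              # light load, so the series converges at z = 1
c, d = helm_coefficients(Yr, ys, 1.0, S, 30)
V = c.sum(axis=0)                          # direct evaluation of (9) at z = 1
```

The recursion enforces Σ_m c[n−m] d[m] = 0 for n ≥ 1, so the summed series satisfies the embedded equation Y^r V + v_s y^s = S^*/V^* at z = 1, which the test below checks.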
Padé is a rational approximation of power series known to have the widest radius of convergence [25]. Analytical continuation by Padé approximation is performed as follows:

V_i(z) ≈ R(z) = (Σ_{j=0}^{m} a_{i,j} z^j) / (1 + Σ_{k=1}^{m} b_{i,k} z^k)  (16)

Approximants of order m, i.e. a_i and b_i, can be obtained from the power series coefficients by solving a linear system of equations, specifically:

[ I  M(c_i) ] [ a_i ; b_i ] = c_i

with:

M(c_i) = [ 0
           −c_i[1]   0
           −c_i[2]   −c_i[1]   0
           ⋮          ⋮          ⋱
           −c_i[n]   −c_i[n−1]  ⋯  −c_i[n−m] ]

and I being the identity matrix. Because we perform analytical continuation to z = 1, plugging the obtained coefficients into (16) yields: V_i ≈ (Σ_{j=0}^{m} a_{i,j}) / (1 + Σ_{j=1}^{m} b_{i,j}).

A. Differentiating through HELM
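Before differentiating, it helps to pin down the Padé step itself. The following is the textbook diagonal [m/m] construction, equivalent in spirit to, though not syntactically identical with, the [I M(c_i)] system above: the denominator coefficients come from one linear solve that cancels series orders m+1..2m, and the numerator follows by convolution.

```python
import numpy as np

def pade_mm(c, m):
    """Diagonal [m/m] Padé approximant from series coefficients c[0..2m].

    Returns (a, b) such that R(z) = (sum_j a[j] z^j) / (sum_k b[k] z^k)
    with b[0] = 1, matching the '1 + ...' denominator of Eq. (16)."""
    c = np.asarray(c, dtype=complex)
    # Linear system cancelling orders m+1..2m of the series of R(z) - V(z)
    C = np.array([[c[m + i - 1 - j] for j in range(m)]
                  for i in range(1, m + 1)])
    rhs = -c[m + 1:2 * m + 1]
    b_tail = np.linalg.solve(C, rhs)
    b = np.concatenate([[1.0], b_tail])
    # Numerator by convolution of the series with the denominator
    a = np.array([sum(b[j] * c[k - j] for j in range(k + 1))
                  for k in range(m + 1)])
    return a, b

def eval_rational(a, b, z):
    """Evaluate R(z); at z = 1 this is the analytic continuation step."""
    num = sum(aj * z ** j for j, aj in enumerate(a))
    den = sum(bk * z ** k for k, bk in enumerate(b))
    return num / den
```

For a geometric series (the expansion of 1/(1−z)) the [1/1] approximant reproduces the function exactly well outside the series' radius of convergence, which is precisely why HELM can continue to z = 1.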
In the following we will view HELM as a function that maps complex nodal power to complex nodal voltages, i.e. v = v(S_d, g_Θ(S_d)). We will show that v is not holomorphic but R-differentiable in Θ, ultimately allowing us to compute gradients w.r.t. the parameters of the actor function g. By making use of the chain rule, the strategy is to decompose HELM into a succession of functions and show that each function is R-differentiable. Specifically, we decompose HELM into its algorithmic steps, i.e. v(S_d, g_Θ(S_d)) = f_v ∘ f_ab ∘ f_{c,n}(S_d − g_Θ(S_d)) with f_{c,n} computing power series coefficients, f_ab computing Padé approximants and f_v computing voltage phasors given Padé approximants. We then show that f_ab, f_{c,n} and f_v are R-differentiable. Note that f_{c,n} is a recursive function and that writing its gradient out would be tedious. But its gradients can be computed efficiently using the backpropagation algorithm, and the implementation is trivial in any deep learning framework with automatic differentiation capability, e.g. PyTorch and TensorFlow.

As stated earlier, HELM first computes the power series coefficients followed by Padé approximation. The power series coefficients c[n] and d[n] are obtained in alternating fashion: Let f_{c,n} and f_{d,n} be the functions that produce c[n] and d[n] respectively. Note that f_{c,n} requires knowledge of the previous d-coefficient and g_Θ, whereas f_{d,n} is a function of all previous c- and d-coefficients:

corresponds to (14): f_{c,n}(x) = (Y^r)^{−1} f^*_{d,n−1}(x) x^*  (17)
corresponds to (15): f_{d,n}(x) = −(Σ_{m=0}^{n−1} f_{c,n−m}(x) f_{d,m}(x)) / f_{c,0}(x)  (18)

Because of the complex conjugation in (17), v is not holomorphic in x, and hence not in Θ, if x = S_d − g_Θ(S_d). However, it is easy to see that, by induction, (17) and (18) are R-differentiable when f_{c,0} and f_{d,0} are R-differentiable, which is easy to see from (12) and (13). After obtaining the power series coefficients, the Padé approximants a and b are calculated. Note that this also only involves solving a linear system of equations, i.e.

f_ab(x) = [ a ; b ] = [ I  M(x) ]^{−1} x

which is differentiable. Then f_v involves only a summation and a fraction, i.e.:

f_v([ a ; b ]) = (Σ_{i=0}^{m} a_i) / (1 + Σ_{i=1}^{m} b_i)

Although v(x) = f_v(f_ab(f_{c,n}(x))) is not holomorphic, it is R-differentiable in its argument x and, when applied to x = S_d − g_Θ(S_d), it is R-differentiable in Θ.

V. ENFORCING CONSTRAINTS
A. A priori constraints
As stated earlier, we treat the generation assignment S_g as the output of a parameterized function g_Θ. Because of the reasoning laid out earlier, we require g to be differentiable and, because of recent successes of NNs in non-linear optimization, we choose g to be a NN with a penultimate sigmoidal layer [20]. We incorporate the generation limits of the generators into the output layer of the NN and therefore enforce generation limits by construction. Let σ ∈ (0,1)^{2N_g} be the penultimate layer with N_g being the number of generator buses. Thus, every generator is associated with two neurons, i.e.:

g_Θ(S_d)_i = (S_g)_i = (P^max_i − P^min_i) σ_i + P^min_i + j((Q^max_i − Q^min_i) σ_{i+N_g} + Q^min_i)

with P^max_i, P^min_i, Q^max_i and Q^min_i being the active and reactive generation limits respectively. Because σ is bounded by (0,1), non-slack generation limits cannot be violated. However, other constraints such as voltage magnitude or thermal line limits are not enforced by construction. That is why, in the next section, we show how to enforce what we call a posteriori constraints, i.e. constraints whose violation is only known after evaluating v.

B. A posteriori constraints
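The a priori output layer of Sec. V-A above can be sketched in a few lines; the generation-limit values used in the usage below are hypothetical, and a real implementation would sit at the end of a trained network rather than take raw logits.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_layer(logits, p_min, p_max, q_min, q_max):
    """Map unconstrained NN outputs to complex generation set points that
    satisfy box limits by construction, as described in Sec. V-A.

    logits: 2*N_g raw outputs (first N_g for active, last N_g for reactive power)
    """
    n_g = len(p_min)
    s = sigmoid(np.asarray(logits, dtype=float))
    p = (p_max - p_min) * s[:n_g] + p_min          # active power in [P_min, P_max]
    q = (q_max - q_min) * s[n_g:] + q_min          # reactive power in [Q_min, Q_max]
    return p + 1j * q

# Hypothetical limits for two generators
p_min, p_max = np.array([0.1, 0.0]), np.array([1.0, 0.8])
q_min, q_max = np.array([-0.3, -0.2]), np.array([0.3, 0.2])
```

Because the sigmoid output is strictly inside (0,1), no gradient projection is ever needed for these constraints: any parameter setting of the network yields a feasible set point.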
We adapt ideas from mathematical optimization to enforce arbitrary constraints on v. In mathematical optimization, the Karush-Kuhn-Tucker conditions (KKT conditions) are necessary conditions for a solution to be optimal [26]. Given the optimization problem (1) expressed in terms of v, the KKT conditions state that a solution v' is locally optimal, under some regularity conditions, when there exist μ_i such that:

• ∀i: μ_i ≥ 0 (dual feasibility)
• ∀i: μ_i k_i(v') = 0 (complementary slackness)
• ∀i: k_i(v') ≤ 0 (primal feasibility)
• ∇c(v') + Σ_i μ_i ∇k_i(v') = 0 (stationarity)

Note that, without loss of generality (because any equality constraint can be expressed as two inequality constraints) and for notational convenience, we restrict the optimization problem to only have inequality constraints. However, as stated earlier, we are not interested in the solution of a single constrained optimization problem but instead in solutions to all instances of a class of optimization problems. In this case, S_d, i.e. the demand assignment, specifies the instance of the optimization problem whereas the network topology, i.e. the admittance matrix Y, specifies the class. First, we note that the KKT multipliers are dependent on the instance of the optimization problem, thus instead of introducing a scalar μ_i, we introduce a scalar-valued function u_ψ(S_d). In order to enforce dual feasibility by construction, we choose u to be a NN with soft-plus output parameterized by ψ. Furthermore, let g^{S_d}_Θ = v(S_d, g_Θ(S_d)), (u^{S_d}_ψ)_i = u_ψ(S_d)_i the i-th output of u, and k^+_i(v) = max(k_i(v), 0). We will now introduce a learning criterion and show that local optima of this criterion fulfill the KKT conditions for instances of the class contained in the training set.
As a learning criterion we propose:

L(S_d) = c(g^{S_d}_Θ) + Σ_i (u^{S_d}_ψ)_i k^+_i(g^{S_d}_Θ)  (19)
∀ S_d ∈ D: max_ψ { min_Θ { L(S_d) } }  (20)

We will now show that, after convergence, for all S_d ∈ D, v' = g^{S_d}_Θ is locally optimal under some regularity conditions, i.e. it fulfills the KKT conditions, and furthermore, that the KKT multipliers for which the KKT conditions hold are:

μ_i = (u^{S_d}_ψ)_i if k_i(g^{S_d}_Θ) = 0, and 0 otherwise  (21)

• Dual feasibility: μ_i is dual feasible by construction: it is either 0 or greater than 0 because it is the output of a soft-plus NN.
• Complementary slackness: Follows directly from (21).
• Primal feasibility: Since (20) converged, we know that ∂L/∂(u^{S_d}_ψ)_i = 0 and, since ∂L/∂(u^{S_d}_ψ)_i = k^+_i(v') = 0, v' must be primal feasible. Or in other words: if v' were not primal feasible, then k^+_i(v') > 0, but then the maximization step of (20) could have increased L by increasing μ_i, which is a contradiction to the assumption that (20) has converged.
• Stationarity: Follows directly from the assumption that (20) has converged. Note that substituting k^+_i for k_i does not have an influence because if k_i(v) ≠ 0 then the corresponding μ_i = 0 (complementary slackness), and when k_i(v) = 0 then ∇k_i(v) = ∇k^+_i(v).

Note that the detour of substituting k^+_i(v) for k_i(v) improves performance substantially. Without the substitution, because more constraints are complied with initially, the NN drives the outputs before the soft-plus non-linearity to −∞ in order to make the corresponding μ_i equal to 0. The output units are then 'dead' and, because the gradient of the output non-linearity is close to 0, will stay 0.

C. Enforcing Physicality
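A scalar caricature of the max-min criterion (19)-(20) from Sec. V-B above illustrates the mechanics: the dual variable is a softplus output, the hinge k^+ is used exactly as in the text, and the iterates settle near the KKT point. The problem (minimize x² subject to 1 − x ≤ 0, with optimum x = 1, μ = 2) and all step sizes are illustrative, not the paper's setup.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def primal_dual(steps=20000, lr=2e-3):
    """Toy version of Eqs. (19)-(20): minimize c(x) = x^2 subject to
    k(x) = 1 - x <= 0. x plays the role of g_Theta; u = softplus(psi)
    plays the role of the dual network u_psi."""
    x, psi = 0.0, 0.0
    for _ in range(steps):
        u = softplus(psi)
        kplus = max(1.0 - x, 0.0)              # hinge k+ = max(k, 0)
        # min step in x: dL/dx = c'(x) + u * d k+ / dx
        grad_x = 2.0 * x + (-u if kplus > 0.0 else 0.0)
        x -= lr * grad_x
        # max step in psi: dL/dpsi = k+ * softplus'(psi), always >= 0
        psi += lr * kplus * sigmoid(psi)
    return x, softplus(psi)
```

Note how dual feasibility holds by construction (softplus is nonnegative) and the ascent direction on ψ vanishes exactly when the constraint is satisfied, mirroring the complementary-slackness argument above.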
So far, we have shown how to enforce 'a priori' constraints, i.e. constraints whose violation is known before inferring nodal voltages, by construction, as well as 'a posteriori' constraints, i.e. constraints whose violation is known after inferring nodal voltages, by introducing a learning objective that, after convergence, will enforce the KKT conditions. However, we have not yet shown how to keep the function g in the physical regime, i.e. prevent g from producing a generation assignment S_g for some S_d such that there is no v that fulfills the power flow equations (2).

Fig. 2: The log-mismatch ε as a function of α, i.e. ε(v(S_d, αS_g)). α scales a physical solution, i.e. when α = 1 the corresponding ε is small. The number of HELM iterations n is color coded.

An extreme example of a non-physical tuple (S_d, S_g), for any demand assignment S_d for which Σ_i real(S_d)_i > 0, is S_g = 0 (the zero vector). First, we note that HELM will always produce complex nodal voltages, even for non-physical tuples. However, for non-physical tuples the power flow equations (2) will not hold, i.e. there is a mismatch between the RHS and LHS of (2). We quantify this mismatch by defining:

ε(v) = || S_g − S_d − diag(v)(Yv)^* ||_∞

The goal now is to enforce that ε(v) < ξ, with ξ being some parameter which specifies when a power flow solution is deemed physical. Note that because ε is a function of v, in principle, an additional inequality constraint could be introduced, i.e. k_i(v) = ε(v) − ξ ≤ 0, and one could try to enforce this constraint as an a posteriori constraint as described earlier. However, in our experience this approach struggles, i.e. the learning objective usually does not converge. Figure 2 gives an intuition as to why this is the case. Figure 2 shows log(ε) as a function of α on a 200 bus system.
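The mismatch ε defined above is cheap to evaluate; a minimal numpy version follows, with the net injection S = S_g − S_d passed as a single vector. The 2-bus admittance and voltage values in the usage are hypothetical.

```python
import numpy as np

def mismatch(v, S, Y):
    """Infinity-norm power-flow mismatch eps(v) = || S - diag(v) (Y v)^* ||_inf,
    where S = S_g - S_d is the net complex injection vector."""
    return np.max(np.abs(S - v * np.conj(Y @ v)))

# Hypothetical 2-bus system and voltage profile
y = 1.0 - 5.0j
Y = np.array([[y, -y], [-y, y]])
v = np.array([1.0 + 0.0j, 0.98 - 0.02j])
S_exact = v * np.conj(Y @ v)   # the injections this voltage profile implies
```

For the exact injections the mismatch is zero to machine precision; perturbing one injection by δ raises ε to exactly |δ|, which is the quantity compared against the physicality threshold ξ.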
α scales the generation S_g of a physical tuple (S_d, S_g), i.e. the y-axis shows log(ε(v(S_d, αS_g))). Note that when α is either very small or very large, ε is close to flat and therefore the gradient of ε is close to 0. After randomly initializing the parameters of the function g, its guesses about optimal generation assignments will naturally be bad, which corresponds to scaling the optimal generation assignment with a small or big α. However, the function cannot improve its guesses by gradient descent because the gradient will be close to 0. In order to overcome this problem, we propose to optimize a proxy of the actual mismatch function ε. Note that an indicator of whether or not a solution is physical is whether or not the power series coefficients c_i[n] have converged to 0. Let c̄[n] be the mean n-th power series coefficient of all voltages, i.e. c̄[n] = Σ_i c_i[n]/N. Figure 3 shows a scatter-plot of log c̄[n] and log ε. Empirically, one can see that small c̄[n] is a sufficient condition for small ε, however not a necessary condition.

Fig. 3: Small log c̄[n] is a sufficient condition for small log(ε). However, the opposite is not true, i.e. small log(ε) does not bound log c̄[n].

That is, a small c̄[n] implies small ε but not vice versa. Thus, in order to enforce physicality, c̄[n] can be minimized as a proxy for ε. However, one might think that optimizing log c̄[n] is unnecessarily restrictive, i.e. it excludes solutions where the power series coefficients did not converge to 0 but the corresponding v nevertheless fulfills the power flow equations. But, as we will show later, imposing voltage magnitude constraints naturally enforces physicality, and additionally minimizing log c̄[n] is only required after the actor function g is first initialized, in order to 'nudge' it into the physical regime.

VI. BINARY CONSTRAINTS
Binary constraints naturally occur in optimal power flow when incorporating the possibility of completely shutting down generators. Introducing this constraint, also known as unit commitment, makes the generation limit constraints non-convex, i.e. 0 becomes a possible generation assignment even though points between 0 and P^min are not valid. Typically, optimal power flow solvers employ mixed integer programming techniques such as branch-and-bound or branch-and-cut [14] algorithms to tackle this problem. However, these algorithms can incur substantial computational cost, i.e. every branch requires solving an LP-relaxed optimal power flow problem and there are exponentially many branches in a worst-case scenario. However, using the problem formulation introduced here, because the constraint is a priori and can be enforced by construction, we can reduce the non-convex constraint into the problem of inferring the mode of a probability distribution P over binary configurations. As we will show later, because inference in P is intractable, we optimize a variational bound, i.e. we introduce a variational distribution Q_φ for which posterior inference is tractable and choose the variational parameters φ in such a way that Q_φ best approximates P. Specifically, we build on recent advances in Bayesian inference, specifically Variational Inference [27], and train a variational distribution Q_φ parameterized by a NN. See [28], [29] for recent reviews of Variational Inference. We begin by showing that computing the optimal binary configuration is equivalent to computing the mode of a distribution P. Let b ∈ {0,1}^{N_g} be the vector describing which generators are turned on or off, p(b, S_d) be an exponential distribution, p(b|S_d) its posterior (Boltzmann distribution) and L be the loss as defined in (19), i.e.
p(b, S_d) = λ exp(−λ L(b · S_d))   (22)

p(b | S_d) = exp(−λ L(b · S_d)) / Σ_{b' ∈ B} exp(−λ L(b' · S_d))   (23)

It is easy to see that computing the mode of (23), i.e. arg max_b p(b | S_d), is equivalent to choosing the binary configuration that results in the smallest loss. However, naïve evaluation of the mode is usually intractable because of the intractable denominator: naïvely computing the mode of (23) is equivalent to brute-force search, i.e. enumerating all possible latent configurations and picking the one with the smallest error. In the following, we show how ideas from Variational Inference [27] can reduce the computational burden of inference.

We introduce a variational distribution Q_φ whose posterior is tractable. Specifically, we choose q(b | S_d) to be a multivariate Bernoulli distribution and ensure tractability with ideas introduced in [30]. Note that Q_φ is parameterized by a NN, ensuring that inference at test time is fast. As a learning signal for the parameters φ of the auxiliary posterior distribution, we choose the Evidence Lower Bound defined by:

L_BO(φ) = E_{q_φ(b | S_d)} [ log p(b, S_d) / q_φ(b | S_d) ]   (24)
        = log p(S_d) − D_KL( q_φ(b | S_d) || p(b | S_d) )   (25)

Note that optimizing (24) does not require knowledge of the intractable posterior of P but nevertheless allows for minimizing a divergence measure between the true posterior (P) and the auxiliary posterior (Q). Thus, after training, because P and Q will be maximally similar, posterior inference is performed on Q instead in order to obtain an approximation of the mode of P. However, the price for this 'trick' is increased variance: it can be shown that the stochastic gradient estimator of (24) w.r.t. φ is an unbiased but higher-variance estimator of the KL-divergence [31]. In order to combat variance, a decades-old variance reduction technique is employed, namely sampling without replacement. Sampling without replacement from Q is not trivial.
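To illustrate what sampling without replacement from a distribution over binary configurations involves, the sketch below uses the well-known Gumbel-top-k trick on a toy factorized Bernoulli with a small number of generators. This is an illustrative alternative construction, not the Pareto or elimination samplers the paper actually employs, and the on-probabilities are made up for the example:

```python
import itertools
import math
import random

def gumbel_top_k(logits, k, rng=random):
    """Sample k distinct indices (without replacement) from the
    categorical distribution softmax(logits) via the Gumbel-top-k
    trick: perturb every logit with Gumbel(0, 1) noise and keep
    the indices of the k largest perturbed values."""
    keys = [l - math.log(-math.log(rng.random())) for l in logits]
    return sorted(range(len(logits)), key=lambda i: -keys[i])[:k]

# Toy factorized Bernoulli q(b) over N_g = 4 generators
# (hypothetical on-probabilities, for illustration only).
probs = [0.9, 0.2, 0.7, 0.5]
configs = list(itertools.product([0, 1], repeat=len(probs)))
logits = [sum(math.log(p if bit else 1.0 - p)
              for bit, p in zip(b, probs)) for b in configs]

# Five distinct candidate binary configurations, mirroring the
# strategy of drawing several samples and keeping the cheapest.
samples = [configs[i] for i in gumbel_top_k(logits, k=5)]
```

Note that this enumerates all 2^{N_g} configurations to build the logits, which is only tractable for small N_g; avoiding that enumeration is precisely what makes sampling without replacement from Q non-trivial.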
There is, however, a considerable body of preexisting work that we make use of: the sampling scheme introduced in [32] is employed with slight modifications. Specifically, instead of using the Pareto sampler as the underlying sampling mechanism, the slightly slower but more accurate elimination sampler introduced in [33] is used.

Fig. 4: A graphical depiction of the LOPF pipeline. NNs b and g are fed the complex demand S_d and tasked with producing the optimal binary activation vector and generator configuration, respectively. Because the voltages are a function of demand and generation, both are fed into the HELM-based power flow solver v. The loss L is computed based on the resulting voltages. In order to ensure that control and network inequality constraints are satisfied, a third NN is tasked with predicting Lagrange multipliers. Because HELM is differentiable, the whole pipeline can be optimized jointly.

In order to obtain an approximation of the mode of the true posterior, because Q allows for drawing samples efficiently, S-many samples are drawn from Q. Then, in order to approximate the mode of P, out of the S-many binary configurations sampled from Q, the one which results in the smallest generation cost is chosen. Note that the optimal configuration of generators depends on the binary configuration, thus b should additionally be fed into the actor function g. Figure 4 shows a graphical depiction of the proposed data pipeline.

VII. THE LOPF-ALGORITHM
In this section we summarize the resulting algorithm, which we call Learning Optimal Power Flow, or LOPF for short. The algorithm iterates over batches of the data set D, making updates to the three constituent NNs g, b and u. It is described in pseudo-code in Algorithm 1. For notational convenience, we define a function solve:

S_g = b · g_Θ(S_d, b)

solve(S_d, b) = [ ε(v(S_d, S_g)),  c(v(S_d, S_g)) + Σ_i (u_ψ(S_d))_i k_i(v(S_d, S_g)),  N^{−1} Σ_i (f_{c,n}(S_d − S_g))_i ]^T

Figure 5 shows a graphical depiction of the input/output relationships of the individual networks. Note that the network g not only produces active and reactive generation assignments for non-slack generators but also the voltage at the slack bus. Furthermore, the magnitudes by which constraints are violated, denoted by k, are fed into the network u that produces a proxy of the Lagrange multipliers. Additionally feeding k into the u-network eases and speeds up learning considerably.

VIII. EXPERIMENTS
Since this work introduces a learning-based approach to the problem of ACOPF, the performance of the algorithm is evaluated similarly to how the performance of reinforcement learning agents is evaluated, i.e. an empirical evaluation strategy is employed. Specifically, given a held-out test set of load flow problems that the system was not presented with during training, the generation cost and whether or not the system was able to find a feasible solution are recorded.
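The empirical evaluation protocol above can be sketched as follows. The agent, solver, and feasibility threshold below are stand-ins invented for illustration, not the paper's actual components:

```python
def evaluate(agent, solver, test_set, eps_threshold=1e-8):
    """For every held-out demand, record whether the agent's
    proposed dispatch is feasible and, if so, what it costs."""
    feasible, costs = [], []
    for demand in test_set:
        dispatch = agent(demand)  # a single NN forward pass: fast
        eps, cost, within_limits = solver(demand, dispatch)
        ok = eps < eps_threshold and within_limits
        feasible.append(ok)
        if ok:
            costs.append(cost)
    rate = 100.0 * sum(feasible) / len(feasible)
    mean_cost = sum(costs) / len(costs) if costs else float("nan")
    return rate, mean_cost

# Toy stand-in agent and solver, for illustration only.
def toy_agent(demand):
    return 0.5 * demand

def toy_solver(demand, dispatch):
    # Mismatch is tiny when dispatch covers the demand exactly.
    eps = abs(demand - 2.0 * dispatch) * 1e-12
    return eps, dispatch, True

rate, mean_cost = evaluate(toy_agent, toy_solver, [1.0, 2.0, 3.0])
```

This is the shape of the numbers reported below: a feasibility rate over the test set, and a mean cost computed over the feasible instances.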
Algorithm 1: LOPF algorithm in pseudo-code

input : data set D
output: trained model parameters Θ, φ and ψ

Initialize Θ, φ and ψ randomly;
while not converged do
    for number of subsets of D do
        select d ⊂ D;
        for S_d ∈ d do
            G_L = {}; G_c = {};
            B ∼ q(b | S_d) (without replacement);
            for b' ∈ B do
                ε, L, c ← solve(S_d, b');
                if ε < ξ then G_L ← G_L ∪ {L};
                else G_c ← G_c ∪ {c};
            Compute L_BO based on (24);
            Maximize L_BO w.r.t. φ;
            Maximize Σ_{L ∈ G_L} L w.r.t. ψ;
            Minimize Σ_{L ∈ G_L} L w.r.t. Θ;
            Minimize Σ_{c ∈ G_c} c w.r.t. Θ;

Fig. 5: The input and output relationships of the three constituent NNs. a) The g-network produces active and reactive power generation for non-slack generators as well as the voltage at the slack bus, given the demand S_d and the binary configuration produced by the b-network. b) The binary network parameterizes an auxiliary distribution; note that it produces multiple binary configurations by sampling from this distribution. c) The Lagrange network produces a proxy of the Lagrange multipliers; the constraint-violation magnitude k is additionally fed into the network to ease learning.

The requirements for feasibility, excluding those that are met by construction, are the following:
• The log-mismatch between the RHS and LHS of the power flow equations (2), i.e. ε, must be smaller than −
• Slack active and reactive generation are within limits
• Non-slack voltage magnitude constraints are met

The experiments were conducted on the 200-bus Illinois IEEE test case [9]. However, since the IEEE test cases only
                              LOPF        MIPS
Feasible [%]                  99.86       60.85
Mean cost [USD]               33325.98    25817.55
Mean time per instance [s]    1.2         14.4
TABLE I: Comparison of LOPF and MIPS in terms of robustness, optimality and speed on a held-out test set.

contain a single demand assignment, the demand base case was superimposed with temporal patterns extracted from the RE Europe data set [34], which contains historical demand for three years at an hourly interval. Let S_d' be the base demand taken from the test case and x_t the temporal demand patterns taken from the RE Europe data set. The temporal patterns were imposed such that the mean demand of every node equals the demand in the test case and such that the ratio between mean and standard deviation seen in the RE Europe data set is preserved. The data set was separated into a training set (20,280 data points) and a test set (6,000 data points) when conducting the experiments.

The NNs used in these experiments are standard fully connected three-layer networks with intermediate tanh activations; all intermediate layers have the same number of hidden units.

IX. RESULTS
As stated earlier, learning-based approaches to control problems are usually not guaranteed to be optimal but can offer advantages in terms of computational time and robustness. The performance of LOPF reinforces these expectations. Figure 7 (left) shows the percentage of load flow solutions produced by the system that violate any requirement for feasibility, as a function of learning steps. One learning step encompasses 32 load flow problems, and for each of the 32 load flow problems 50 candidate binary configurations are drawn from the auxiliary distribution. One can see that the system quickly learns to produce feasible solutions: initially the system produces feasible solutions for none of the load flow problems, but after just 30 steps close to all solutions proposed by the system are feasible.

Figure 7 (right) shows the log-mismatch between the RHS and LHS of the power flow equations (2), i.e. log ε. Note that initially the proposed approach produces generation assignments for which the HELM solver is unable to produce voltage phasors that fulfill the power flow equations, but by minimizing the power series coefficients as described in Section V-C, the system is quickly nudged into a regime where the proposed solutions fulfill the power flow equations. However, after approximately 75 learning steps, for a short period of time, the system again produces generation assignments that do not fulfill the power flow equations. This can most likely be explained by the fact that the system also tries to minimize cost: by trying to find cheaper generation assignments, the system left the regime in which solutions can be found by HELM but was then steered back into this regime.

Table I showcases the performance of our proposed algorithm in comparison to the MIPS solver proposed in [9]. In order
Fig. 6: Comparison of generation cost on the first 300 load flow problems of the test set.

Fig. 7: Left: the percentage of feasible solutions as a function of learning steps. Right: the average log-mismatch of the power flow equations (2) as a function of time.

to deal with the non-convex generation limit constraints, the MIPS solver was run with a unit-decommitment heuristic (runuopf). When obtaining the results for the MIPS solver, all initializations were left unchanged and only the demand was varied in the way described earlier. Slightly varying the demand reveals the weakness of traditional solvers: convergence to a feasible solution cannot be guaranteed. In our experiments, the MIPS solver produced solutions which comply with all constraints and fulfill the power flow equations in only about 61% of all problem instances (failure in 2349 out of 6000 instances). Our proposed solution produces feasible solutions for 99.86% of the problem instances (failure in 8 out of 6000 cases); when LOPF fails, it slightly violates voltage magnitude constraints. On top of that, our proposed learning-based approach is considerably faster than optimization-based approaches: because solutions are obtained by feeding a demand assignment through the NN, and the forward pass through NNs is fast, obtaining the generation assignment proposed by the system is fast. Note that the time per instance we report for the MIPS solver is the mean time over all load flow problems; when the solver fails, it usually fails quickly. If only the time per successful instance were reported, the mean time per instance of the MIPS solver would be close to 30 s.

However, Table I and Figure 6 more clearly reveal the main weakness of our proposed learning-based approach: even though solutions can be obtained robustly and fast, the proposed learning system does not find solutions that are optimal in terms of generation cost. On average, the solutions that
the approach produces are approximately 29% more expensive than the solutions found by the MIPS solver. Note that the average cost is reported only for those load flow problems for which both approaches yielded feasible solutions.

X. CONCLUSION AND FUTURE WORK
The main contribution of this paper is the introduction of a learning-based framework for the problem of ACOPF that respects the full non-linear ACOPF equations. Specifically, we introduce a learning-based approach in which a function is tasked with producing feasible, minimal-cost generation assignments as a function of the demand. A learning signal for this function is obtained by differentiating through the operations of a load flow solver. Furthermore, we show how convex security constraints and non-convex generation limit constraints can be enforced. The resulting system produces feasible solutions fast; however, these solutions are not necessarily optimal in terms of generation cost.

An obvious future research path is to close the optimality gap. At this moment, because of the complexity of the resulting system, it is hard to understand why the solutions are not optimal. Note that the proposed algorithm cannot be optimal by design, because the slack generator cannot be decommitted and there is a natural interpolation between load flow solutions. However, in the opinion of the authors, performance gains in terms of optimality should be possible.

Another potentially interesting research question is whether or not the trained auxiliary distribution Q, which learns the cost surface as a function of the binary generator configuration, allows for conditional sampling. Imagine a scenario where generators have failed. In such a scenario, it is paramount to reconfigure the network into a feasible state fast. If it is possible to sample from Q conditioned on the failed generators being off, then the proposed learning-based approach could potentially find application in emergency and security-sensitive situations. Note that this seemingly easy problem is not trivial because of the FactorNet [30] structure of the auxiliary distribution.
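The conditional-sampling question raised above can be made concrete with a naive rejection-sampling baseline: draw from the unconditional distribution and keep only samples in which the failed generators are off. This is tractable only when the conditioning event is not too rare, which is part of why conditioning the actual FactorNet-structured Q efficiently is an open question. The factorized Bernoulli below is a stand-in assumption, not the paper's distribution:

```python
import random

def sample_bernoulli(probs, rng):
    """One draw from a factorized Bernoulli with on-probabilities probs."""
    return tuple(1 if rng.random() < p else 0 for p in probs)

def rejection_conditional(probs, failed, n, rng=None):
    """Draw n samples conditioned on every generator index in
    `failed` being off, by simple rejection sampling."""
    rng = rng or random.Random(0)
    out = []
    while len(out) < n:
        b = sample_bernoulli(probs, rng)
        if all(b[i] == 0 for i in failed):
            out.append(b)
    return out

# Hypothetical 4-generator example with generator 1 failed.
samples = rejection_conditional([0.9, 0.3, 0.6, 0.5], failed={1}, n=100)
```

The expected number of draws grows like the inverse probability of the conditioning event, so this baseline degrades quickly when many generators fail simultaneously.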