Lagrangian Decomposition for Neural Network Verification
Rudy Bunel*, Alessandro De Palma*, Alban Desmaison
University of Oxford
{rudy, adepalma}@robots.ox.ac.uk

Krishnamurthy (Dj) Dvijotham, Pushmeet Kohli
DeepMind

Philip H.S. Torr, M. Pawan Kumar
University of Oxford
Abstract
A fundamental component of neural network verification is the computation of bounds on the values their outputs can take. Previous methods have either used off-the-shelf solvers, discarding the problem structure, or relaxed the problem even further, making the bounds unnecessarily loose. We propose a novel approach based on Lagrangian Decomposition. Our formulation admits an efficient supergradient ascent algorithm, as well as an improved proximal algorithm. Both algorithms offer three advantages: (i) they yield bounds that are provably at least as tight as previous dual algorithms relying on Lagrangian relaxations; (ii) they are based on operations analogous to forward/backward passes of neural network layers and are therefore easily parallelizable, amenable to GPU implementation and able to take advantage of the convolutional structure of problems; and (iii) they allow for anytime stopping while still providing valid bounds. Empirically, we show that we obtain bounds comparable with off-the-shelf solvers in a fraction of their running time, and obtain tighter bounds in the same time as previous dual algorithms. This results in an overall speed-up when employing the bounds for formal verification. Code for our algorithms is available at https://github.com/oval-group/decomposition-plnn-bounds.

* These authors contributed equally to this work.

1 INTRODUCTION

As deep learning powered systems become more and more common, the lack of robustness of neural networks and their reputation for being “Black Boxes” is increasingly worrisome. In order to deploy them in critical scenarios
where safety and robustness would be a prerequisite, we need to invent techniques that can prove formal guarantees for neural network behaviour. A particularly desirable property is resistance to adversarial examples (Goodfellow et al., 2015, Szegedy et al., 2014): perturbations maliciously crafted with the intent of fooling even extremely well performing models. After several defenses were proposed and subsequently broken (Athalye et al., 2018, Uesato et al., 2018), some progress has been made in formally verifying whether there exist any adversarial examples in the neighbourhood of a data point (Tjeng et al., 2019, Wong and Kolter, 2018).

Verification algorithms fall into three categories: unsound (some false properties may be proven true), incomplete (some true properties may fail to be proven), and complete (all properties are correctly verified as either true or false). A critical component of the verification systems developed so far is the computation of lower and upper bounds on the output of neural networks when their inputs are constrained to lie in a bounded set. In incomplete verification, by deriving bounds on the changes of the prediction vector under restricted perturbations, it is possible to identify safe regions of the input space. These results allow the rigorous comparison of adversarial defenses and prevent making overconfident statements about their efficacy (Wong and Kolter, 2018). In complete verification, bounds can also be used as essential subroutines of Branch and Bound complete verifiers (Bunel et al., 2018).

Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124, 2020.
Finally, bounds might also be used as a training signal to guide the network towards greater robustness and more verifiability (Gowal et al., 2018, Mirman et al., 2018, Wong and Kolter, 2018).

Most previous algorithms for computing bounds are either computationally expensive (Ehlers, 2017) or sacrifice a lot of tightness in order to scale (Gowal et al., 2018, Mirman et al., 2018, Wong and Kolter, 2018). In this work, we design novel customised relaxations and their corresponding solvers for obtaining bounds on neural networks. Our approach offers the following advantages:

• While previous approaches to neural network bounds (Dvijotham et al., 2018) are based on Lagrangian relaxations, we derive a new family of optimization problems for neural network bounds through
Lagrangian Decomposition, which in general yields duals at least as strong as those obtained through Lagrangian relaxation (Guignard and Kim, 1987). We in fact prove that, in the context of ReLU networks, for any dual solution from the approach by Dvijotham et al. (2018), the bounds output by our dual are at least as tight. We demonstrate empirically that this derivation computes tighter bounds in the same time when using supergradient methods. We further improve on the performance by devising a proximal solver for the problem, which decomposes the task into a series of strongly convex subproblems. For each, we use an iterative method for which we derive optimal step sizes.

• Both the supergradient and the proximal method operate through linear operations similar to those used during network forward/backward passes. As a consequence, we can leverage the convolutional structure when necessary, while standard solvers are often restricted to treating it as a general linear operation. Moreover, both methods are easily parallelizable: when computing bounds on the activations at layer k, we need to solve two problems for each hidden unit of the network (for the upper and lower bounds). These can all be solved in parallel. In complete verification, we need to compute bounds for several different problem domains at once: we solve these problems in parallel as well. Our GPU implementation thus allows us to solve several hundreds of linear programs at once on a single GPU, a level of parallelism that would be hard to match on CPU-based systems.

• Most standard linear programming based relaxations (Ehlers, 2017) will only return a valid bound if the problem is solved to optimality. Others, like the dual simplex method employed by off-the-shelf solvers (Gurobi Optimization, 2020), have a very high cost per iteration and will not yield tight bounds without incurring significant computational costs.
Both methods described in this paper are anytime (terminating them before convergence still provides a valid bound), and can be interrupted at very small granularity. This is useful in the context of a subroutine for complete verifiers, as it enables the user to choose an appropriate speed versus accuracy trade-off. It also offers great versatility as an incomplete verification method.
2 RELATED WORKS

Bound computations are mainly used for formal verification methods. Some methods are complete (Cheng et al., 2017, Ehlers, 2017, Katz et al., 2017, Tjeng et al., 2019, Xiang et al., 2017), always returning a verdict for each problem instance. Others are incomplete, based on relaxations of the verification problem. They trade completeness for speed: while they cannot verify properties for all problem instances, they scale significantly better. Two main types of bounds have been proposed. On the one hand, some approaches (Ehlers, 2017, Salman et al., 2019) rely on off-the-shelf solvers to solve accurate relaxations such as PLANET (Ehlers, 2017), which is the best known linear-sized approximation of the problem. On the other hand, as PLANET and other more complex relaxations do not have closed form solutions, some researchers have also proposed easier to solve, looser formulations (Gowal et al., 2018, Mirman et al., 2018, Weng et al., 2018, Wong and Kolter, 2018). Explicitly or implicitly, these are all equivalent to propagating a convex domain through the network to overapproximate the set of reachable values. Our approach consists in tackling a relaxation equivalent to the PLANET one (although generalised beyond ReLU), by designing a custom solver that achieves faster performance without sacrificing tightness. Some potentially tighter convex relaxations exist but involve a quadratic number of variables, such as the semi-definite programming method of Raghunathan et al. (2018). Better relaxations obtained from relaxing strong Mixed Integer Programming formulations (Anderson et al., 2019) have a quadratic number of variables or a potentially exponential number of constraints. We do not address them here.

A closely related approach to ours is the work of Dvijotham et al. (2018). Both their method and ours are anytime and operate on similar duals.
While their dual is based on the Lagrangian relaxation of the non-convex problem, ours is based on the Lagrangian Decomposition of the nonlinear activation’s convex relaxation. Thanks to the properties of Lagrangian Decomposition (Guignard and Kim, 1987), we can show that our dual problem provides better bounds when evaluated on the same dual variables. The relationship between the two duals is studied in detail in section 4.2. Moreover, in terms of the followed optimization strategy, in addition to using a supergradient method like Dvijotham et al. (2018), we present a proximal method, for which we can derive optimal step sizes. We show that these modifications enable us to compute tighter bounds using the same amount of compute time.
3 PRELIMINARIES

Throughout this paper, we will use bold lower case letters (like z) to represent vectors and upper case letters (like W) to represent matrices. Brackets are used to indicate the i-th coordinate of a vector (z[i]) and integer ranges (e.g., [1, n − 1]). We will study the computation of the lower bound problem based on a feedforward neural network, with element-wise activation function σ(·). The network inputs are restricted to a convex domain C, over which we assume that we can easily optimise linear functions. This is the same assumption that was made by Dvijotham et al. (2018). The computation for an upper bound is analogous. Formally, we wish to solve the following problem:

\begin{align}
\min_{\mathbf{z}, \hat{\mathbf{z}}} \quad & \hat{\mathbf{z}}_n \tag{1a}\\
\text{s.t.} \quad & \mathbf{z}_0 \in \mathcal{C}, \tag{1b}\\
& \hat{\mathbf{z}}_{k+1} = W_{k+1} \mathbf{z}_k + \mathbf{b}_{k+1} \quad k \in [0, n-1], \tag{1c}\\
& \mathbf{z}_k = \sigma(\hat{\mathbf{z}}_k) \quad k \in [1, n-1]. \tag{1d}
\end{align}

The output of the k-th layer of the network before and after the application of the activation function are denoted by ẑ_k and z_k respectively. Constraints (1c) implement the linear layers (fully connected or convolutional) while constraints (1d) implement the non-linear activation function. Constraint (1b) defines the region over which the bounds are being computed. While our method can be extended to more complex networks (such as ResNets), we focus on problem (1) for the sake of clarity.

The difficulty of problem (1) comes from the non-linearity (1d). Dvijotham et al. (2018) tackle it by relaxing (1c) and (1d) via Lagrangian multipliers, yielding the following dual: max_{μ,λ} d(μ, λ), where:

\begin{align}
d(\boldsymbol{\mu}, \boldsymbol{\lambda}) = \min_{\mathbf{z}, \hat{\mathbf{z}}} \quad & W_n \mathbf{z}_{n-1} + \mathbf{b}_n + \sum_{k=1}^{n-1} \boldsymbol{\mu}_k^T \left( \hat{\mathbf{z}}_k - W_k \mathbf{z}_{k-1} - \mathbf{b}_k \right) + \sum_{k=1}^{n-1} \boldsymbol{\lambda}_k^T \left( \mathbf{z}_k - \sigma(\hat{\mathbf{z}}_k) \right) \notag \\
\text{s.t.} \quad & \mathbf{l}_k \le \hat{\mathbf{z}}_k \le \mathbf{u}_k \quad k \in [1, n-1], \notag \\
& \sigma(\mathbf{l}_k) \le \mathbf{z}_k \le \sigma(\mathbf{u}_k) \quad k \in [1, n-1], \notag \\
& \mathbf{z}_0 \in \mathcal{C}. \tag{2}
\end{align}

If σ is a ReLU, this relaxation is equivalent (Dvijotham et al., 2018) to the PLANET relaxation (Ehlers, 2017). The dual requires upper (u_k) and lower (l_k) bounds on the value that ẑ_k can take, for k ∈ [1, n − 1].
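The experiments in this paper take intermediate bounds as the layer-wise best of Interval Propagation and the WK bounds (the exact procedure is in appendix B). The interval part, for one linear layer followed by a ReLU, can be sketched in a few lines; this is a simplified illustration with our own function names, not the paper's implementation:

```python
import numpy as np

def interval_linear(l, u, W, b):
    """Bounds on W z + b over the box l <= z <= u (exact for one linear layer)."""
    mid, rad = (u + l) / 2.0, (u - l) / 2.0   # box centre and half-width
    mid_out = W @ mid + b
    rad_out = np.abs(W) @ rad                  # worst-case deviation per output
    return mid_out - rad_out, mid_out + rad_out

def interval_relu(l, u):
    """Bounds on max(z, 0) over the box l <= z <= u."""
    return np.maximum(l, 0.0), np.maximum(u, 0.0)
```

Propagating such boxes layer by layer yields valid, if loose, values for l_k and u_k; tighter methods (such as the duals discussed here) can then refine them.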
We call these intermediate bounds: we detail how to compute them in appendix B. The dual algorithm by Dvijotham et al. (2018) solves (2) via supergradient ascent.

4 LAGRANGIAN DECOMPOSITION

We will now describe a novel approach to solve problem (1), and relate it to the dual algorithm by Dvijotham et al. (2018). We will focus on computing bounds for an output of the last layer of the neural network.
4.1 PROBLEM DERIVATION

Our approach is based on Lagrangian decomposition, also known as variable splitting (Guignard and Kim, 1987). Due to the compositional structure of neural networks, most constraints involve only a limited number of variables. As a result, we can split the problem into meaningful, easy to solve subproblems. We then impose constraints that the solutions of the subproblems should agree.

We start from the original non-convex primal problem (1), and substitute the nonlinear activation equality (1d) with a constraint corresponding to its convex hull. This leads to the following convex program:

\begin{align}
\min_{\mathbf{z}, \hat{\mathbf{z}}} \quad & \hat{\mathbf{z}}_n \tag{3a}\\
\text{s.t.} \quad & \mathbf{z}_0 \in \mathcal{C}, \tag{3b}\\
& \hat{\mathbf{z}}_{k+1} = W_{k+1} \mathbf{z}_k + \mathbf{b}_{k+1} \quad k \in [0, n-1], \tag{3c}\\
& \text{cvx\_hull}_\sigma(\hat{\mathbf{z}}_k, \mathbf{z}_k, \mathbf{l}_k, \mathbf{u}_k) \quad k \in [1, n-1]. \tag{3d}
\end{align}

In the following, we will use ReLU activation functions as an example, and employ the PLANET relaxation as its convex hull, which is the tightest linearly sized relaxation published thus far. By linearly sized, we mean that the numbers of variables and constraints required to describe the relaxation only grow linearly with the number of units in the network. We stress that the derivation can be extended to other non-linearities; for example, appendix A describes the case of the sigmoid activation function. For ReLUs, this convex hull takes the following form (Ehlers, 2017):

\begin{align}
\text{cvx\_hull}_\sigma \equiv
\begin{cases}
\mathbf{z}_k[i] \ge 0, \;\; \mathbf{z}_k[i] \ge \hat{\mathbf{z}}_k[i], \;\; \mathbf{z}_k[i] \le \frac{\mathbf{u}_k[i]\,(\hat{\mathbf{z}}_k[i] - \mathbf{l}_k[i])}{\mathbf{u}_k[i] - \mathbf{l}_k[i]} & \text{if } \mathbf{l}_k[i] \le 0 \le \mathbf{u}_k[i], \quad \text{(4a)}\\[2pt]
\mathbf{z}_k[i] = 0 & \text{if } \mathbf{u}_k[i] \le 0, \quad \text{(4b)}\\[2pt]
\mathbf{z}_k[i] = \hat{\mathbf{z}}_k[i] & \text{if } \mathbf{l}_k[i] \ge 0. \quad \text{(4c)}
\end{cases}
\end{align}

Constraints (4a), (4b), and (4c) correspond to the relaxation of the ReLU (1d) in different cases (respectively: ambiguous state, always blocking, or always passing).

To obtain a decomposition, we divide the constraints into subsets. Each subset will correspond to a pair of an activation layer and the linear layer coming after it. The only exception is the first linear layer, which is combined with the restriction of the input domain to C.
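The per-unit hull (4) can be made concrete with a small membership test; this is a sketch under our own naming, covering the blocking (4b), passing (4c) and ambiguous (4a) cases:

```python
def relu_hull_feasible(zhat, z, l, u, tol=1e-9):
    """Check membership of (zhat, z) in the ReLU convex hull (4) for one unit."""
    if u <= 0:                       # always blocking (4b): z must be 0
        return abs(z) <= tol
    if l >= 0:                       # always passing (4c): z must equal zhat
        return abs(z - zhat) <= tol
    # ambiguous (4a): above both lower facets, below the upper chord
    upper = u * (zhat - l) / (u - l)
    return z >= -tol and z >= zhat - tol and z <= upper + tol
```

For an ambiguous unit the feasible set is the triangle with vertices (l, 0), (0, 0) and (u, u); this geometry is what makes the per-unit subproblems below solvable in closed form.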
This choice is motivated by the fact that for piecewise-linear activation functions, we will be able to easily perform linear optimisation over the resulting subdomains. For a different activation function, it might be required to have a different decomposition. Formally, we introduce the following notation for subsets of constraints:

\begin{align}
\mathcal{P}_0(\mathbf{z}_0, \hat{\mathbf{z}}_1) \equiv
\begin{cases}
\mathbf{z}_0 \in \mathcal{C}, \\
\hat{\mathbf{z}}_1 = W_1 \mathbf{z}_0 + \mathbf{b}_1,
\end{cases} \tag{5}
\end{align}

\begin{align}
\mathcal{P}_k(\hat{\mathbf{z}}_k, \hat{\mathbf{z}}_{k+1}) \equiv
\begin{cases}
\mathbf{l}_k \le \hat{\mathbf{z}}_k \le \mathbf{u}_k, \\
\text{cvx\_hull}_\sigma(\hat{\mathbf{z}}_k, \mathbf{z}_k, \mathbf{l}_k, \mathbf{u}_k), \\
\hat{\mathbf{z}}_{k+1} = W_{k+1} \mathbf{z}_k + \mathbf{b}_{k+1}.
\end{cases} \tag{6}
\end{align}

The set P_k is defined without an explicit dependency on z_k, because z_k only appears internally to the subset of constraints and is not involved in the objective function. Using this grouping of the constraints, we can concisely write problem (3) as:

\begin{align}
\min_{\mathbf{z}, \hat{\mathbf{z}}} \; \hat{\mathbf{z}}_n \qquad \text{s.t.} \quad \mathcal{P}_0(\mathbf{z}_0, \hat{\mathbf{z}}_1), \quad \mathcal{P}_k(\hat{\mathbf{z}}_k, \hat{\mathbf{z}}_{k+1}) \quad k \in [1, n-1]. \tag{7}
\end{align}

To obtain a Lagrangian decomposition, we duplicate the variables so that each subset of constraints has its own copy of the variables it is involved in. Formally, we rewrite problem (7) as follows:

\begin{align}
\min_{\mathbf{z}, \hat{\mathbf{z}}} \quad & \hat{\mathbf{z}}_{A,n} \tag{8a}\\
\text{s.t.} \quad & \mathcal{P}_0(\mathbf{z}_0, \hat{\mathbf{z}}_{A,1}), \tag{8b}\\
& \mathcal{P}_k(\hat{\mathbf{z}}_{B,k}, \hat{\mathbf{z}}_{A,k+1}) \quad k \in [1, n-1], \tag{8c}\\
& \hat{\mathbf{z}}_{A,k} = \hat{\mathbf{z}}_{B,k} \quad k \in [1, n-1]. \tag{8d}
\end{align}

The additional equality constraints (8d) impose agreement between the various copies of the variables. We introduce the dual variables ρ and derive the Lagrangian dual: max_ρ q(ρ), where:

\begin{align}
q(\boldsymbol{\rho}) = \min_{\mathbf{z}, \hat{\mathbf{z}}} \quad & \hat{\mathbf{z}}_{A,n} + \sum_{k=1}^{n-1} \boldsymbol{\rho}_k^T \left( \hat{\mathbf{z}}_{B,k} - \hat{\mathbf{z}}_{A,k} \right) \notag \\
\text{s.t.} \quad & \mathcal{P}_0(\mathbf{z}_0, \hat{\mathbf{z}}_{A,1}), \quad \mathcal{P}_k(\hat{\mathbf{z}}_{B,k}, \hat{\mathbf{z}}_{A,k+1}) \quad k \in [1, n-1]. \tag{9}
\end{align}

Any value of ρ provides a valid lower bound by virtue of weak duality. While we will maximise over the choice of dual variables in order to obtain as tight a bound as possible, we will be able to interrupt the optimisation process at any point and obtain a valid bound by evaluating q. We stress that, at convergence, problems (9) and (2) will yield the same bounds, as they are both reformulations of the same convex problem. This does not imply that the two derivations yield solvers with the same efficiency.
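Weak duality can be made concrete on a toy problem of our own (not from the paper): split a single variable into two box-constrained copies and price their agreement, exactly as (8d) and (9) do.

```python
def q(rho):
    """Decomposition dual of: min x + y  s.t.  x in [0, 2], y in [1, 3], x = y.

    The primal optimum is 2 (at x = y = 1). Dualising x = y with price rho:
    q(rho) = min_{x in [0,2]} (1 - rho) * x  +  min_{y in [1,3]} (1 + rho) * y,
    and every rho gives q(rho) <= 2 (weak duality), with equality at rho = 1.
    """
    def min_linear(c, l, u):  # minimise c * t over t in [l, u]
        return c * l if c >= 0 else c * u
    return min_linear(1.0 - rho, 0.0, 2.0) + min_linear(1.0 + rho, 1.0, 3.0)
```

Here q(0) = 1 is a valid but loose bound, and q(1) = 2 matches the primal optimum; the solvers of the next section drive ρ towards such maximising values while every intermediate iterate remains a valid bound.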
In fact, we will next prove that, for ReLU-based networks, problem (9) yields bounds at least as tight as a corresponding dual solution from problem (2).

4.2 COMPARISON TO THE DUAL OF DVIJOTHAM ET AL. (2018)

We now compare our dual problem (9) with problem (2) by Dvijotham et al. (2018). From a high level perspective, our decomposition considers larger subsets of constraints and hence results in a smaller number of variables to optimize. We formally prove that, for ReLU-based neural networks, our formulation dominates theirs, producing tighter bounds based on the same dual variables.
Theorem 1.
Let us assume σ(ẑ_k) = max(0, ẑ_k). For any solution μ, λ of dual (2) by Dvijotham et al. (2018) yielding bound d(μ, λ), it holds that q(μ) ≥ d(μ, λ).

Proof. See appendix E.

Our dual is also related to the one that Wong and Kolter (2018) operate on. We show in appendix F how our problem can be initialised to a set of dual variables so as to generate a bound matching the one provided by their algorithm. We use this approach as our initialisation for incomplete verification (§ 6.1).

5 SOLVERS

Now that we have motivated the use of dual (9) over problem (2), it remains to show how to solve it efficiently in practice. We present two methods: one based on supergradient ascent, the other on proximal maximisation. A summary, including pseudo-code, can be found in appendix D.
5.1 SUPERGRADIENT METHOD

As Dvijotham et al. (2018) use supergradient methods on their dual, we start by applying them to problem (9) as well. At a given point ρ, obtaining the supergradient requires us to know the values of ẑ_A and ẑ_B for which the inner minimisation is achieved. Based on the identified values ẑ*_A and ẑ*_B, we can then compute the supergradient ∇_ρ q = ẑ*_B − ẑ*_A. The updates to ρ are then the usual supergradient ascent updates:

\begin{align}
\boldsymbol{\rho}^{t+1} = \boldsymbol{\rho}^t + \alpha^t \nabla_{\boldsymbol{\rho}} q(\boldsymbol{\rho}^t), \tag{10}
\end{align}

where α^t corresponds to a step size schedule that needs to be provided. It is also possible to use any variant of gradient descent, such as Adam (Kingma and Ba, 2015).

It remains to show how to perform the inner minimization over the primal variables. By design, each of the variables is only involved in one subset of constraints. As a result, the computation completely decomposes over the subproblems, each corresponding to one of the subsets of constraints. We therefore simply need to optimise linear functions over one subset of constraints at a time.

P_0 subproblems. To minimize over z_0 and ẑ_{A,1}, the variables constrained by P_0, we need to solve:

\begin{align}
[\mathbf{z}_0^*, \hat{\mathbf{z}}_{A,1}^*] = \operatorname*{argmin}_{\mathbf{z}_0, \hat{\mathbf{z}}_{A,1}} \; -\boldsymbol{\rho}_1^T \hat{\mathbf{z}}_{A,1} \quad \text{s.t.} \quad \mathbf{z}_0 \in \mathcal{C}, \quad \hat{\mathbf{z}}_{A,1} = W_1 \mathbf{z}_0 + \mathbf{b}_1. \tag{11}
\end{align}

Rewriting the problem as a linear function of z_0 only, problem (11) is simply equivalent to minimising −ρ_1^T W_1 z_0 over C. We assumed that the optimisation of a linear function over C was efficient. We now give examples of C where problem (11) can be solved efficiently.

Bounded input domain. If C is defined by a set of lower bounds l_0 and upper bounds u_0 (as in the case of ℓ∞ adversarial examples), the optimisation simply amounts to choosing either the lower or the upper bound, depending on the sign of the linear function. Let us denote the indicator function for condition c by 𝟙_c. The optimal solution is:

\begin{align}
\mathbf{z}_0^* = \mathbb{1}_{\boldsymbol{\rho}_1^T W_1 < 0} \odot \mathbf{l}_0 + \mathbb{1}_{\boldsymbol{\rho}_1^T W_1 \ge 0} \odot \mathbf{u}_0, \qquad \hat{\mathbf{z}}_{A,1}^* = W_1 \mathbf{z}_0^* + \mathbf{b}_1. \tag{12}
\end{align}

Figure 1: Feasible domain of the convex hull for an ambiguous ReLU.
Red circles indicate the vertices of the feasible region. (cid:96) Balls If C is defined by an (cid:96) ball of radius (cid:15) arounda point ¯x ( (cid:107) x − ¯x (cid:107) ≤ (cid:15) ), optimisation amounts tochoosing the point on the boundary of the ball such thatthe vector from the center to it is opposed to the costfunction. Formally, the optimal solution is given by: ˆx A, = W x + b , x = ¯x + ( √ (cid:15) / (cid:107) ρ (cid:107) ) ρ . (13) P k subproblems For the variables constrained by subproblem P k ( ˆz B,k , ˆz A,k +1 ), we need to solve: [ ˆz ∗ B,k , ˆz ∗ A,k +1 ] = argmin ˆz B,k , ˆz A,k +1 ρ Tk ˆz B,k − ρ Tk +1 ˆz A,k +1 s.t l k ≤ ˆz B,k ≤ u k , cvx hull σ ( ˆz B,k , z k , l k , u k ) , ˆz A,k +1 = W k +1 z k + b k +1 . (14)In the case where the activation function σ is the ReLUand cvx hull is given by equation (4), we can finda closed form solution. Using the last equality ofthe problem, we can rewrite the objective function as ρ Tk ˆz B,k − ρ Tk +1 W k +1 z k . If the ReLU is ambiguous, theshape of the convex hull is represented in Figure 1. Ifthe sign of ρ Tk +1 W k +1 is negative, optimizing subprob-lem (14) corresponds to having z k at its lower bound max ( ˆz B,k , . If on the contrary the sign is positive, z k must be equal to its upper bound u k u k − l k ( ˆz B,k − l k ) . Wecan therefore rewrite the subproblem (14) in the case ofambiguous ReLUs as: argmin ˆz B,k ∈ [ l k , u k ] (cid:18) ρ Tk − (cid:2) ρ Tk +1 W k +1 (cid:3) + u k u k − l k (cid:19) ˆz B,k − (cid:2) ρ Tk +1 W k +1 (cid:3) − max ( ˆz B,k , , (15)where [ A ] − corresponds to the negative values of A ( [ A ] − = min ( A, ) and [ A ] + corresponds to the pos-itive value of A ( [ A ] + = max ( A, ). The objectivefunction and the constraints all decompose completelyover the coordinates, so all problems can be solved inde-pendently. For each dimension, the problem is a convex,piecewise linear function, which means that the optimalpoint will be a vertex. 
The possible vertices are (l_k[i], 0), (0, 0), and (u_k[i], u_k[i]). In order to find the minimum, we can therefore evaluate the objective function at these three points and keep the one with the smallest value.

If for a ReLU we either have l_k[i] ≥ 0 or u_k[i] ≤ 0, the cvx_hull constraint is a simple linear equality constraint. For those coordinates, the problem is analogous to the one solved by equation (12), with the linear function being minimised over the ẑ_{B,k} box bounds being ρ_k^T ẑ_{B,k} in the case of blocking ReLUs, or (ρ_k^T − ρ_{k+1}^T W_{k+1}) ẑ_{B,k} in the case of passing ReLUs.

We have described how to solve problem (14) by simply evaluating linear functions at given points. The matrix operations involved correspond to standard operations of the neural network’s layers. Computing ẑ_{A,k+1} = W_{k+1} z_k + b_{k+1} is exactly the forward pass of the network, and computing ρ_{k+1}^T W_{k+1} is analogous to the backpropagation of gradients. We can therefore take advantage of existing deep learning frameworks to gain access to efficient implementations. When dealing with convolutional layers, we can employ specialised implementations rather than building the equivalent W_k matrix, which would contain a lot of redundancy.

We described here the solving process in the context of ReLU activation functions, but this can be generalised to non piecewise-linear activation functions. For example, appendix A describes the solution for sigmoid activations.

5.2 PROXIMAL METHOD

We now detail the use of proximal methods on problem (9) as an alternative to supergradient ascent.
Applying proximal maximization to the dual function q results in the Augmented Lagrangian Method, also known as the method of multipliers. The derivation of the update steps is detailed by Bertsekas and Tsitsiklis (1989). For our problem, this corresponds to alternating between updates to the dual variables ρ and updates to the primal variables ẑ, which are given by the following equations (superscript t indicates the value at the t-th iteration):

\begin{align}
\boldsymbol{\rho}_k^{t+1} = \boldsymbol{\rho}_k^t + \eta_k \left( \hat{\mathbf{z}}_{B,k}^t - \hat{\mathbf{z}}_{A,k}^t \right), \tag{16}
\end{align}

\begin{align}
[\mathbf{z}^t, \hat{\mathbf{z}}^t] = \operatorname*{argmin}_{\mathbf{z}, \hat{\mathbf{z}}} \mathcal{L}(\hat{\mathbf{z}}, \boldsymbol{\rho}^t) := \operatorname*{argmin}_{\mathbf{z}, \hat{\mathbf{z}}} \quad & \left[ \hat{\mathbf{z}}_{A,n} + \sum_{k=1}^{n-1} \boldsymbol{\rho}_k^T \left( \hat{\mathbf{z}}_{B,k} - \hat{\mathbf{z}}_{A,k} \right) + \sum_{k=1}^{n-1} \frac{\eta_k}{2} \left\| \hat{\mathbf{z}}_{B,k} - \hat{\mathbf{z}}_{A,k} \right\|_2^2 \right] \notag \\
\text{s.t.} \quad & \mathcal{P}_0(\mathbf{z}_0, \hat{\mathbf{z}}_{A,1}), \quad \mathcal{P}_k(\hat{\mathbf{z}}_{B,k}, \hat{\mathbf{z}}_{A,k+1}) \quad k \in [1, n-1]. \tag{17}
\end{align}

The term L(ẑ, ρ) is the Augmented Lagrangian of problem (8). The additional quadratic term in (17), compared to the objective of q(ρ), arises from the proximal terms on ρ. It has the advantage of making the problem strongly convex, and hence easier to optimize. Later on, we will show that this allows us to derive optimal step sizes in closed form. The weight η_k is a hyperparameter of the algorithm. A high value will make the problem more strongly convex and therefore quicker to solve, but it will also limit the ability of the algorithm to perform large updates. While obtaining the new values of ρ is trivial using equation (16), problem (17) does not have a closed-form solution. We show how to solve it efficiently nonetheless.

Problem (17) can be optimised using the conditional gradient method, also known as the Frank-Wolfe algorithm (Frank and Wolfe, 1956). The advantage it provides is that there is no need to perform a projection step to remain in the feasible domain. Indeed, the iterates remain in the feasible domain by construction, as convex combinations of points in the feasible domain.
At each time step, we replace the objective by its linear approximation and optimize this linear function over the feasible domain to get an update direction, named the conditional gradient. We then take a step in this direction. As the Augmented Lagrangian is smooth over the primal variables, there is no need to take the Frank-Wolfe step for all the network layers at once. We can in fact do it in a block-coordinate fashion, where a block is a network layer, with the goal of speeding up convergence; we refer the reader to appendix C for further details.

Obtaining the conditional gradient requires minimising a linearization of L(ẑ, ρ) on the primal variables, restricted to the feasible domain. This computation corresponds exactly to the one we perform for the inner minimisation of problem (9) over z and ẑ in order to compute the supergradient (cf. § 5.1), since ∇_{ẑ_{B,k}} L(ẑ, ρ^t) = ρ_k^{t+1} and ∇_{ẑ_{A,k+1}} L(ẑ, ρ^t) = −ρ_{k+1}^{t+1}. The equivalence of the conditional gradient and the supergradient is not particular to our problem; a more general description can be found in the work of Bach (2015).

The use of a proximal method allows us to employ an optimal step size, whose calculation is detailed in appendix C. This would not be possible with supergradient methods, as we would have no guarantee of improvement. We therefore have no need to choose the step size. In practice, we still have to choose the strength of the proximal term η_k. Finally, inspired by previous work on accelerating proximal methods (Lin et al., 2017, Salzo and Villa, 2012), we also apply momentum on the dual updates to accelerate convergence; for details see appendix C.

One of the benefits of our algorithms (and of the supergradient baseline by Dvijotham et al. (2018)) is that they are easily parallelizable.
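The closed-form step size mentioned above follows from exact line search on a quadratic. A hedged, generic sketch for a single block, assuming (as induced by the quadratic term of (17)) that the curvature of the objective along the update direction d is η‖d‖²; this is our own simplification, not the layer-wise derivation of appendix C:

```python
import numpy as np

def fw_optimal_step(x, s, grad_x, eta):
    """Frank-Wolfe update with exact line search on a quadratic.

    Along d = s - x the 1-D objective is g(gamma) = f(x) + gamma * <grad_x, d>
    + (gamma^2 / 2) * eta * ||d||^2, minimised over [0, 1] at
    gamma* = clip(-<grad_x, d> / (eta * ||d||^2), 0, 1).
    """
    d = s - x
    denom = eta * float(d @ d)
    if denom == 0.0:           # conditional gradient equals the iterate
        return x
    gamma = float(np.clip(-float(grad_x @ d) / denom, 0.0, 1.0))
    return x + gamma * d       # convex combination: stays feasible
```

Because gamma is clipped to [0, 1], the new iterate is a convex combination of feasible points and therefore remains feasible, which is the property that removes the need for projections.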
As explained above, the subproblems for all the units of a layer, and for several problem domains at once, can be solved in parallel.

6 EXPERIMENTS

In order to evaluate the performance of the proposed methods, we first perform a comparison in the quality and speed of the bounds obtained by different methods in the context of incomplete verification (§ 6.1), and then employ them within a complete verifier (§ 6.2). The methods included in the comparison are the following:

• WK uses the method of Wong and Kolter (2018). This is equivalent to a specific choice of ρ.

• DSG+ corresponds to supergradient ascent on dual (2), the method by Dvijotham et al. (2018). We use the Adam (Kingma and Ba, 2015) update rules and decrease the step size linearly, similarly to the experiments of Dvijotham et al. (2018). We experimented with other step size schedules, like constant step sizes or 1/t schedules, which all performed worse.

• Dec-DSG+ is a direct application of Theorem 1: it obtains a dual solution for problem (2) via DSG+ and sets ρ = μ in problem (9) to obtain the final bounds.

• Supergradient is the first of the two solvers presented (§ 5.1).

• Proximal is the solver presented in § 5.2 (a single iteration is a full pass over the network layers).

• Gurobi is our gold standard method, employing the commercial black box solver Gurobi. All the problems are solved to optimality, providing us with the best result that our solver could achieve in terms of accuracy.

• Gurobi-TL is the time-limited version of the above, which stops at the first dual simplex iteration for which the total elapsed time exceeds that required by the iterations of the proximal method.

We omit
Interval Bound Propagation (Gowal et al., 2018, Mirman et al., 2018) from the comparisons, as on average it performed significantly worse than WK on the networks we used: experimental results are provided in appendix G.1. For all of the methods above, as done in previous work on neural network verification (Bunel et al., 2020, Lu and Kumar, 2020), we compute intermediate bounds (see appendix B) by taking the layer-wise best bounds output by Interval Propagation and WK. This holds true for both the incomplete and the complete verification experiments. Gurobi was run on CPUs for both § 6.1 and § 6.2.

6.1 INCOMPLETE VERIFICATION

We investigate the effectiveness of the various methods for incomplete verification on images of the CIFAR-10 test set. For each image, we compute an upper bound on the robustness margin: the difference between the ground truth logit and all the other logits, under an allowed perturbation ε_verif in infinity norm of the inputs. If for any class the upper bound on the robustness margin is negative, then we are certain that the network is vulnerable to that adversarial perturbation. We measure the time to compute last layer bounds, and report the optimality gap compared to the best achieved solution.

As network architecture, we employ the small model used by Wong and Kolter (2018), whose structure is given in appendix G.1.1. We train the network against perturbations of size up to ε_train = 8/255, and test for adversarial vulnerability on ε_verif = 12/255. This is done using adversarial training (Madry et al., 2018), based on an attacker using 50 steps of projected gradient descent to obtain the samples. Additional experiments for a network trained using standard stochastic gradient descent and cross entropy, with no robustness objective, can be found in appendix G.1. For both supergradient methods (our Supergradient, and DSG+), we decrease the step size linearly, while for Proximal we apply momentum on the dual updates
and, for all layers, linearly increase the weight of the proximal terms from η = 10 to η = 5 × 10² (see appendix C).

Figure 2 presents results as a distribution of runtime and optimality gap. WK by Wong and Kolter (2018) performs a single pass over the network per optimization problem, which allows it to be extremely fast, but this comes at the cost of generating looser bounds. At the complete opposite end of the spectrum, Gurobi is extremely slow but provides the best achievable bounds. Dual iterative methods (DSG+, Supergradient, Proximal) allow the user to choose the trade-off between tightness and speed. In order to perform a fair comparison, we fixed the number of iterations for the various methods so that each of them would take the same average time (see Figure 3). This was done by tuning the iteration ratios on a subset of the images for the results in Figures 2 and 3. Note that the Lagrangian Decomposition has a higher cost per iteration due to the more expensive computations related to the more complex primal feasible set. The cost of the proximal method is even larger due to the primal-dual updates and the optimal step size computation. For DSG+, Supergradient

Figure 2: Distribution of runtime and gap to optimality on a network adversarially trained with the method by Madry et al. (2018), on CIFAR-10 images. In both cases, lower is better. The width at a given value represents the proportion of problems for which this is the result. Gurobi always returns the optimal solution, so it does not appear on the optimality gap plot, but it always has the highest runtime.
[Figures 2 and 3 plot runtimes and gaps to the best bound for Gurobi, Gurobi-TL, WK, DSG+ (520/1040 steps), Dec-DSG+ (520/1040 steps), Supergradient (370/740 steps), and Proximal (200/400 steps).]
Figure 3: Pointwise comparison for a subset of the methods on the data presented in Figure 2. Each datapoint corresponds to a CIFAR image; darker colour shades mean higher point density. The dotted line corresponds to equality and, in both graphs, lower is better along both axes.

and Proximal, the improved quality of the bounds as compared to the non-iterative method WK shows that there are benefits in actually solving the best relaxation rather than simply evaluating looser bounds. Time-limiting Gurobi at the first iteration which exceeded the cost of
Proxi-mal iterations significantly worsens the produced boundswithout a comparable cut in runtimes. This is due to thehigh cost per iteration of the dual simplex algorithm.By looking at the distributions and at the point-wise com-parisons in Figure 3, we can see that Supergradient yieldsconsistently better bounds than DSG+. As both meth-ods employ the Adam update rule (and the same hyper-parameters, which are optimal for both), we can concludethat operating on the Lagrangian Decomposition dual(1) produces better speed-accuracy trade-offs comparedto the dual (2) by Dvijotham et al. (2018). This is inline with the expectations from Theorem 1. Furthermore,while a direct application of the Theorem (Dec-DSG+)does improve on the DSG+ bounds, operating on the De-composition dual (1) is empirically more effective (seepointwise comparisons in appendix G.1). Finally, on av-erage, the proximal algorithm yields better bounds thanthose returned by Supergradient, further improving on theDSG+ baseline. In particular, we stress that the supportof optimality gap distribution is larger for Proximal, witha heavier tail towards better bounds.
We present results for complete verification. In this setting, our goal is to verify whether a network is robust to ℓ∞ norm perturbations of radius ε_verif. In order to do so, we search for a counter-example (an input point for which the output of the network is not the correct class) by minimizing the difference between the ground truth logit and a randomly chosen logit of images of the CIFAR-10 test set. If the minimum is positive, we have not succeeded in finding a counter-example, and the network is robust. In contrast to the previous section, we now seek to solve a nonconvex problem like (1) (where another layer representing the aforementioned difference has been added at the end) directly, rather than an approximation.

In order to perform the minimization exactly, we employ BaBSR, a Branch and Bound algorithm by Bunel et al. (2020). In short, Branch and Bound divides the verification domain into a number of smaller problems, for which it repeatedly computes upper and lower bounds. At every iteration, BaBSR picks the sub-problem with the lowest lower bound and creates two new sub-problems by fixing the most "impactful" ReLU to be passing or blocking. The impact of a ReLU is determined by estimating the change in the sub-problem's output lower bound caused by making the ReLU non-ambiguous. Sub-problems which cannot contain the global lower bound
(Figure 4 panels: percentage of properties verified vs. computation time [s] for the base, wide and deep models; methods: MIPplanet, Gurobi BaBSR, Proximal BaBSR, Supergradient BaBSR, DSG+ BaBSR.)
Figure 4: Cactus plots for the base, wide and deep models. For each, we compare the bounding methods and complete verification algorithms by plotting the percentage of solved properties as a function of runtime.

Table 1: For base, wide and deep models, we compare average solving time, average number of solved sub-problems and the percentage of timed out properties. The best performing iterative method is highlighted in bold.
Method                Base                              Wide                              Deep
                      time(s)    sub-problems  %Timeout time(s)    sub-problems  %Timeout time(s)    sub-problems  %Timeout
Gurobi BaBSR          1588.651   1342.777      10.56    2912.246   630.864       50.66    3007.237   299.913       54.00
MIPplanet             …          …             …        …          …             …        …          …             …
DSG+ BaBSR            426.554    5312.674      6.54     1048.594   4518.116      20.00    509.484    1992.345      9.60
Supergradient BaBSR   377.104    5079.682      …        …          …             …        …          …             …
Proximal BaBSR        …          …             …        …          …             …        …          …             …

are progressively discarded.

The computational bottleneck of Branch and Bound is the lower bound computation, for which we will employ a subset of the methods in section 6.2. Specifically, we want to compare the performance of DSG+, Supergradient and Proximal with Gurobi (as employed in Bunel et al. (2020)). For this purpose, we run the various Branch and Bound implementations on the dataset employed by Lu and Kumar (2020) to test their novel Branch and Bound splitting strategy. The dataset picks a non-correct class and a perturbation radius ε_verif for a subset of the CIFAR-10 test images, and runs Branch and Bound on three different convolutional network architectures of varying size: a "base" network, a "wide" network, and a "deep" network. Further details on the dataset and architectures are provided in appendix G.2.

We compare the Branch and Bound implementations with MIPplanet to provide an additional verification baseline. MIPplanet computes the global lower bound by using Gurobi to solve the Mixed-Integer Linear problem arising from the Planet relaxation (Bunel et al., 2020; Ehlers, 2017). As explained in section 6.1, due to the highly parallelizable nature of the dual iterative algorithms, we are able to compute lower bounds for multiple sub-problems at once for DSG+, Supergradient and Proximal, whereas the domains are solved sequentially for Gurobi. The number of simultaneously solved sub-problems is fixed separately for the base network and for the wide and deep networks. Intermediate bound computations (see appendix B) use a step size decreased from α = 10⁻ to α = 10⁻; for Proximal, we do not employ momentum and keep the weight of the proximal terms fixed to η = 10 for all layers throughout the iterations. As in the previous section, the number of iterations for the bounding algorithms is tuned so that they take roughly the same time, with a different iteration count for Proximal, Supergradient, and DSG+.
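The Branch and Bound loop described above (always expand the sub-problem with the lowest lower bound, split it, and re-bound the pieces) can be sketched generically. The following is a minimal illustration on a toy one-dimensional problem, not the BaBSR implementation: the `lower_bound`, `upper_bound` and `split` callables stand in for the network bounding methods and ReLU splitting discussed in the text, and all names are mine.

```python
import heapq

def branch_and_bound(lower_bound, upper_bound, split, root, eps=1e-3):
    """Generic BaB minimization: returns a value within eps of the global min.

    lower_bound/upper_bound map a (sub-)domain to bounds on the minimum over
    it; split divides a domain in two. As in the text, the sub-problem with
    the smallest lower bound is always expanded next; sub-problems whose lower
    bound exceeds the best upper bound cannot contain the minimum and are
    discarded.
    """
    best_ub = upper_bound(root)
    heap = [(lower_bound(root), 0, root)]  # (lb, tiebreak, domain)
    tie = 1
    while heap:
        lb, _, dom = heapq.heappop(heap)
        if lb >= best_ub - eps:  # cannot improve on the incumbent: done
            break
        for sub in split(dom):
            sub_lb = lower_bound(sub)
            best_ub = min(best_ub, upper_bound(sub))
            if sub_lb < best_ub - eps:  # keep only potentially improving domains
                heapq.heappush(heap, (sub_lb, tie, sub))
                tie += 1
    return best_ub

# Toy instance: minimize f(x) = (x - 0.3)^2 over [0, 1] (true minimum: 0).
f = lambda x: (x - 0.3) ** 2
lower = lambda d: 0.0 if d[0] <= 0.3 <= d[1] else min(f(d[0]), f(d[1]))
upper = lambda d: f(0.5 * (d[0] + d[1]))  # evaluate the midpoint
split = lambda d: [(d[0], 0.5 * (d[0] + d[1])), (0.5 * (d[0] + d[1]), d[1])]
min_val = branch_and_bound(lower, upper, split, (0.0, 1.0))
```

The tightness of `lower_bound` is exactly what governs how many sub-problems survive pruning, which is why better bounds reduce overall verification time.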
For all three, the dual variables are initialized from the dual solution of the lower bound computation of the parent node in the Branch and Bound tree. We time-limit the experiments at one hour.

Consistently with the incomplete verification results in the last section, Figure 4 and Table 1 show that Supergradient outperforms DSG+, confirming the benefits of our Lagrangian Decomposition approach. Better bounds decrease the number of sub-problems that Branch and Bound needs to solve, reducing the overall verification time. Furthermore, Proximal provides an additional increase in performance over DSG+, which is visible especially over the properties that are easier to verify. The gap between competing bounding methods increases with the size of the employed network, both in terms of layer width and network depth: a larger share of the properties times out when using DSG+ on the larger networks. A more detailed experimental analysis of the base model data is presented in appendix G.2.

We have presented a novel dual approach to compute bounds over the activations of neural networks based on Lagrangian Decomposition. It provides significant benefits compared to off-the-shelf solvers and improves on both looser relaxations and on a previous method based on Lagrangian relaxation. As future work, it remains to investigate whether better complete verification results can be obtained by combining our supergradient and proximal methods. Furthermore, it would be interesting to see if similar derivations could be used to make tighter but more expensive relaxations computationally feasible. Improving the quality of bounds may allow the training of robust networks to avoid over-regularisation.
Acknowledgments
ADP was supported by the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines and Systems, grant EP/L015987/1. RB wishes to thank Leonard Berrada for helpful discussions on the convex hull of sigmoid activation functions.
References
Anderson, R., Huchette, J., Tjandraatmadja, C., and Vielma, J. P. (2019). Strong mixed-integer programming formulations for trained neural networks. Conference on Integer Programming and Combinatorial Optimization.

Athalye, A., Carlini, N., and Wagner, D. A. (2018). Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. International Conference on Machine Learning.

Bach, F. (2015). Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization.

Bertsekas, D. P. and Tsitsiklis, J. N. (1989). Parallel and distributed computation: numerical methods. Prentice Hall, Englewood Cliffs, NJ.

Bunel, R., Lu, J., Turkaslan, I., Torr, P. H., Kohli, P., and Kumar, M. P. (2020). Branch and bound for piecewise linear neural network verification. Journal of Machine Learning Research.

Bunel, R., Turkaslan, I., Torr, P. H., Kohli, P., and Kumar, M. P. (2018). A unified view of piecewise linear neural network verification. Neural Information Processing Systems.

Cheng, C.-H., Nührenberg, G., and Ruess, H. (2017). Maximum resilience of artificial neural networks. Automated Technology for Verification and Analysis.

Dvijotham, K., Stanforth, R., Gowal, S., Mann, T., and Kohli, P. (2018). A dual approach to scalable verification of deep networks. Uncertainty in Artificial Intelligence.

Ehlers, R. (2017). Formal verification of piece-wise linear feed-forward neural networks. Automated Technology for Verification and Analysis.

Frank, M. and Wolfe, P. (1956). An algorithm for quadratic programming. Naval Research Logistics Quarterly.

Goodfellow, I. J., Shlens, J., and Szegedy, C. (2015). Explaining and harnessing adversarial examples. International Conference on Learning Representations.

Gowal, S., Dvijotham, K., Stanforth, R., Bunel, R., Qin, C., Uesato, J., Mann, T., and Kohli, P. (2018). On the effectiveness of interval bound propagation for training verifiably robust models. Workshop on Security in Machine Learning, NeurIPS.

Guignard, M. and Kim, S. (1987). Lagrangean decomposition: A model yielding stronger Lagrangean bounds. Mathematical Programming.

Gurobi Optimization, L. (2020). Gurobi optimizer reference manual.

Katz, G., Barrett, C., Dill, D., Julian, K., and Kochenderfer, M. (2017). Reluplex: An efficient SMT solver for verifying deep neural networks. International Conference on Computer-Aided Verification.

Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. International Conference on Learning Representations.

Lin, H., Mairal, J., and Harchaoui, Z. (2017). Catalyst acceleration for first-order convex optimization: From theory to practice. Journal of Machine Learning Research.

Lu, J. and Kumar, M. P. (2020). Neural network branching for neural network verification. International Conference on Learning Representations.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations.

Mirman, M., Gehr, T., and Vechev, M. (2018). Differentiable abstract interpretation for provably robust neural networks. International Conference on Machine Learning.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. NIPS Autodiff Workshop.

Raghunathan, A., Steinhardt, J., and Liang, P. S. (2018). Semidefinite relaxations for certifying robustness to adversarial examples. Neural Information Processing Systems.

Salman, H., Yang, G., Zhang, H., Hsieh, C.-J., and Zhang, P. (2019). A convex relaxation barrier to tight robustness verification of neural networks. Neural Information Processing Systems.

Salzo, S. and Villa, S. (2012). Inexact and accelerated proximal point algorithms. Journal of Convex Analysis, 19:1167–1192.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. (2014). Intriguing properties of neural networks. International Conference on Learning Representations.

Tjeng, V., Xiao, K., and Tedrake, R. (2019). Evaluating robustness of neural networks with mixed integer programming. International Conference on Learning Representations.

Uesato, J., O'Donoghue, B., Oord, A. v. d., and Kohli, P. (2018). Adversarial risk and the dangers of evaluating against weak attacks. International Conference on Machine Learning.

Weng, T.-W., Zhang, H., Chen, H., Song, Z., Hsieh, C.-J., Boning, D., Dhillon, I. S., and Daniel, L. (2018). Towards fast computation of certified robustness for ReLU networks. International Conference on Machine Learning.

Wong, E. and Kolter, Z. (2018). Provable defenses against adversarial examples via the convex outer adversarial polytope. International Conference on Machine Learning.

Xiang, W., Tran, H.-D., and Johnson, T. T. (2017). Output reachable set estimation and verification for multi-layer neural networks. IEEE Transactions on Neural Networks and Learning Systems.

Figure 5: Convex hull of the sigmoid activation function for different input bound configurations. (a) The upper bound is in Case 1, while the lower bound is in the special case of Case 2 with only one piece. (b) The upper bound is in the special case of Case 2 with only one piece, while the lower bound is in Case 1. (c) Both upper and lower bounds are in Case 2, with the two pieces visible.
A Sigmoid Activation function
This section describes the computations highlighted in the paper in the context where the activation function σ(x) is the sigmoid function:

σ(x) = 1 / (1 + e^(−x))    (18)

A similar methodology to the one described in this section could be used to adapt the method to other activation functions, such as the hyperbolic tangent, but we will not discuss them.

We start with a reminder of some properties of the sigmoid activation function. It takes values between 0 and 1, with σ(0) = 0.5. We can easily compute its derivatives:

σ′(x) = σ(x) (1 − σ(x))
σ″(x) = σ(x) (1 − σ(x)) (1 − 2σ(x))    (19)

If we limit the domain of study to the negative inputs ((−∞, 0]), then the function x ↦ σ(x) is a convex function, as can be seen by looking at the sign of the second derivative over that domain. Similarly, if we limit the domain of study to the positive inputs ([0, +∞)), the function is concave.

A.1 Convex hull computation
In the context of ReLU, the convex hull of the activation function is given by equation (4), as introduced by Ehlers (2017). We will now derive it for sigmoid functions. Upper and lower bounds are dealt with in the same way, so our description will only focus on how to obtain the concave upper bound, limiting the activation function convex hull from above. The computation to derive the convex lower bound is equivalent. Depending on the range of inputs over which the convex hull is taken, the form of the concave upper bound changes. We distinguish two cases.
Case 1: σ′(u_k) ≥ (σ(u_k) − σ(l_k)) / (u_k − l_k). We will prove that in this case, the upper bound is the line passing through the points (l_k, σ(l_k)) and (u_k, σ(u_k)), whose equation is:

φ_{u_k,l_k}(x) = ((σ(u_k) − σ(l_k)) / (u_k − l_k)) (x − l_k) + σ(l_k)    (20)

Consider the function d(x) = φ_{u_k,l_k}(x) − σ(x). To show that φ_{u_k,l_k} is a valid upper bound, we need to prove that d(x) ≥ 0 for all x ∈ [l_k, u_k]. We know that d(l_k) = 0 and d(u_k) = 0, and that d is a continuous function. Its derivative is given by:

d′(x) = (σ(u_k) − σ(l_k)) / (u_k − l_k) − σ(x) (1 − σ(x)).    (21)

To find the roots of d′, we can solve d′(x) = 0 for the value of σ(x) and then use the logit function to recover the value of x. As this is only a second order polynomial in σ(x), it can admit at most two roots. We know that lim_{x→∞} d′(x) ≥ 0, and our hypothesis tells us that d′(u_k) ≤ 0. This means that at least one of the roots necessarily lies beyond u_k, and therefore the derivative of d changes sign at most once on the [l_k, u_k] interval. If it never changes sign, d is monotonic; given that the values taken at both extreme points are the same, d being monotonic would imply that d is constant, which is impossible. We therefore conclude that the derivative changes sign exactly once on the interval, and that d is therefore unimodal. As we know that d′(u_k) ≤ 0, this indicates that d is first increasing and then decreasing. As both extreme points have a value of zero, this means that d(x) ≥ 0 for all x ∈ [l_k, u_k]. From this result, we deduce that φ_{u_k,l_k} is a valid upper bound of σ.
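The Case 1 bound and its validity condition can be checked numerically. The following sketch uses hypothetical interval values l, u (both negative, where the sigmoid is convex, so Case 1 always applies) and samples the gap d(x) = φ(x) − σ(x) over [l, u]:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def chord_upper_bound(l, u):
    """Case 1 upper bound (20): the line through (l, sigmoid(l)) and
    (u, sigmoid(u)). Valid as an upper bound on sigmoid over [l, u] whenever
    sigmoid'(u) >= (sigmoid(u) - sigmoid(l)) / (u - l)."""
    slope = (sigmoid(u) - sigmoid(l)) / (u - l)
    return lambda x: slope * (x - l) + sigmoid(l)

l, u = -4.0, -0.5  # illustrative bounds, l <= u <= 0
su = sigmoid(u)
# Case 1 condition: sigma'(u) = sigma(u)(1 - sigma(u)) at least the chord slope
assert su * (1.0 - su) >= (sigmoid(u) - sigmoid(l)) / (u - l)
phi = chord_upper_bound(l, u)
# Sample the gap d(x) = phi(x) - sigma(x); it should vanish at both endpoints
# and stay non-negative in between.
xs = [l + i * (u - l) / 100.0 for i in range(101)]
gaps = [phi(x) - sigmoid(x) for x in xs]
```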
As a linear function, φ_{u_k,l_k} is concave by definition. Given that it constitutes the line between two points on the curve, all of its points necessarily belong to the convex hull. Therefore, it is not possible for a concave upper bound of the activation function to have lower values. This means that φ_{u_k,l_k} defines the upper bounding part of the convex hull of the activation function.

Case 2: σ′(u_k) ≤ (σ(u_k) − σ(l_k)) / (u_k − l_k). In this case, we will have to decompose the upper bound into two parts, defined as follows:

φ_{u_k,l_k}(x) = ((σ(t_k) − σ(l_k)) / (t_k − l_k)) (x − l_k) + σ(l_k)   if x ∈ [l_k, t_k]    (22a)
φ_{u_k,l_k}(x) = σ(x)   if x ∈ [t_k, u_k],    (22b)

where t_k is defined as the point such that σ′(t_k) = (σ(t_k) − σ(l_k)) / (t_k − l_k) and t_k > 0. The value of t_k can be computed by solving the equation σ(t_k) (1 − σ(t_k)) = (σ(t_k) − σ(l_k)) / (t_k − l_k), which can be done using the Newton-Raphson method or a binary search. Note that this needs to be done only when defining the problem, and not at every iteration of the solver. In addition, the value of t_k depends only on l_k, so it is possible to build a table of the results at the desired accuracy beforehand and cache them.

Evaluating both pieces of equation (22) at t_k shows that φ_{u_k,l_k} is continuous. Both pieces are concave (for x ≥ t_k ≥ 0, σ is concave), and they share a supergradient (the linear function of slope (σ(t_k) − σ(l_k)) / (t_k − l_k)) at t_k, so φ_{u_k,l_k} is a concave function. The proof we did for Case 1 can be duplicated to show that the linear component is the best concave upper bound that can be achieved over the interval [l_k, t_k]. On the interval [t_k, u_k], φ_{u_k,l_k} is equal to the activation function, so it is also an upper bound which cannot be improved upon. Therefore, φ_{u_k,l_k} is the upper bounding part of the convex hull of the activation function. Note that a special case of this happens when l_k ≥ 0, in which case φ_{u_k,l_k} consists of only equation (22b).

All cases are illustrated in Figure 5. Case 1 is shown in Figure 5a, where the upper bound contains only the linear piece. Case 2 with both segments is visible in Figure 5c, with the cutoff point t_k highlighted by a green dot, and the special case with l_k ≥ 0 is demonstrated in Figure 5b.

A.2 Solving the P_k subproblems over sigmoid activations

As a reminder, the problem that needs to be solved is the following (where c_{k+1} = −ρ_{k+1}, c_k = ρ_k):

[x̂_{B,k}, x̂_{A,k+1}] = argmin_{ẑ_{B,k}, ẑ_{A,k+1}}  c_kᵀ ẑ_{B,k} + c_{k+1}ᵀ ẑ_{A,k+1}
s.t.  l_k ≤ ẑ_{B,k} ≤ u_k
      cvx_hull_σ(ẑ_{B,k}, z_k, l_k, u_k)
      ẑ_{A,k+1} = W_{k+1} z_k + b_{k+1},    (23)

where cvx_hull is defined either by equation (20) or (22). In this case, we will still be able to compute a closed form solution. We will denote by φ_{u_k,l_k} and ψ_{u_k,l_k} the upper and lower bound functions defining the convex hull of the sigmoid function, which can be obtained as described in the previous subsection. If we use the last equality constraint of Problem (23) to replace the ẑ_{A,k+1} term in the objective function, we obtain c_kᵀ ẑ_{B,k} + c_{k+1}ᵀ W_{k+1} z_k. Depending on the sign of c_{k+1}ᵀ W_{k+1}, z_k will take the value of either φ_{u_k,l_k} or ψ_{u_k,l_k}, resulting in the following problem:

min_{ẑ_{B,k}}  c_kᵀ ẑ_{B,k} + [c_{k+1}ᵀ W_{k+1}]₋ φ_{u_k,l_k}(ẑ_{B,k}) + [c_{k+1}ᵀ W_{k+1}]₊ ψ_{u_k,l_k}(ẑ_{B,k})
s.t.  l_k ≤ ẑ_{B,k} ≤ u_k.    (24)

To solve this problem, several observations can be made. First of all, at this point the optimisation decomposes completely over the components of ẑ_{B,k}, so all problems can be solved independently from each other.
The second observation is that, to solve this problem, we can decompose the minimisation over the whole range [l_k, u_k] into a set of minimisations over the separate pieces, and then return the minimum corresponding to the piece producing the lowest value.

The minimisation over a piece can take one of two forms. Both bounds can be linear (such as between the green dotted lines in Figure 5c), in which case the problem is easy to solve by looking at the signs of the coefficients of the objective function. The other option is that one of the bounds is equal to the activation function (such as in Figures 5a or 5b, or in the outer sections of Figure 5c), leaving us with a problem of the form:

min_{ẑ_{B,k}}  c_linᵀ ẑ_{B,k} + c_σᵀ σ(ẑ_{B,k})
s.t.  l ≤ ẑ_{B,k} ≤ u,    (25)

where l, u, c_lin and c_σ depend on what part of the problem we are trying to solve. This is a convex problem, so the minimum is reached either at an extremum of the considered domain (l or u), or at the points where the derivative of the objective function cancels. These correspond to the roots of c_lin + c_σ σ(ẑ_{B,k}) (1 − σ(ẑ_{B,k})). Provided that 1/4 + c_lin/c_σ ≥ 0, the possible roots are given by σ⁻¹(1/2 ± √(1/4 + c_lin/c_σ)), with σ⁻¹ being the logit function, the inverse of the sigmoid function (σ⁻¹(x) = log(x / (1 − x))). To solve problem (25), we evaluate its objective function at the extreme points (l and u) and at those roots that lie in the feasible domain, and return the point achieving the minimum. With this method, we can solve the P_k subproblems even when the activation function is a sigmoid.

B Intermediate bounds
Intermediate bounds are obtained by solving a relaxation of (1) over subsets of the network. Instead of defining the objective function on the activations of layer n, we define it over ẑ_k, iteratively for k ∈ [1..n−1]. The drawback of computing intermediate bounds with the same relaxation is the computational cost of this approach: we need to solve two LPs for each of the neurons in the network, and even if (differently from off-the-shelf LP solvers) our method allows for easy parallelization within each layer, the layers need to be tackled sequentially. Therefore, depending on the computational budget, the employed relaxation for intermediate bounds might be looser than (3) or (2): for instance, the one by Wong and Kolter (2018).

C Implementation details for Proximal method
We provide additional details for the implementation of the proximal method. We start with the optimal step size computation, define our block-coordinate updates for the primal variables, and finally provide some insight on acceleration.
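A primitive that recurs in the subsections below is an exact line search on a quadratic objective, clamped to [0, 1]. As a self-contained preview, here is a toy single-block analogue (the names, values and the single-block objective L(w) = ρ·w + (1/2η)‖w‖² are illustrative, not the paper's multi-layer Augmented Lagrangian):

```python
def clamp01(g):
    return min(1.0, max(0.0, g))

def optimal_step(v, v_hat, rho, eta):
    """Exact line search for f(gamma) = L(v + gamma * d), d = v_hat - v, with
    the toy objective L(w) = rho . w + (1 / (2 * eta)) * ||w||^2.
    f'(gamma) = (rho + v / eta) . d + (gamma / eta) * ||d||^2, so setting
    f'(gamma) = 0 and clamping to [0, 1] gives the optimal step."""
    d = [h - vi for h, vi in zip(v_hat, v)]
    g = [r + vi / eta for r, vi in zip(rho, v)]  # gradient of L at v
    num = -sum(gi * di for gi, di in zip(g, d))
    den = sum(di * di for di in d) / eta
    return clamp01(num / den) if den > 0.0 else 0.0

# The exact step can never do worse than staying put or jumping to v_hat.
L = lambda w, rho, eta: (sum(r * wi for r, wi in zip(rho, w))
                         + sum(wi * wi for wi in w) / (2.0 * eta))
v, v_hat, rho, eta = [1.0, -2.0], [0.0, 3.0], [0.5, 0.5], 2.0
g_star = optimal_step(v, v_hat, rho, eta)
point = [vi + g_star * (h - vi) for vi, h in zip(v, v_hat)]
```

The multi-layer formulas (27), (30) and (31) below follow exactly this pattern: a gradient-dot-direction numerator over a quadratic-coefficient denominator, clamped to [0, 1].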
C.1 Optimal Step Size Computation
Having obtained the conditional gradients x̂_k for all k ∈ [0..n−1] for our problem, we need to decide a step size. Computing the optimal step size amounts to solving a one dimensional quadratic problem:

γ* = argmin_{γ ∈ [0,1]} L(γ x̂ + (1 − γ) ẑ, ρ).    (26)

Computing the gradient with respect to γ and setting it to 0 returns the optimal step size. Denoting the gradient of the Augmented Lagrangian (17) with respect to (ẑ_{B,k} − ẑ_{A,k}) by ∇_{(ẑ_{B,k} − ẑ_{A,k})} L(ẑ, ρ) = ρ_k + (ẑ_{B,k} − ẑ_{A,k}) / η_k = g_k, this gives us the following formula:

γ* = clamp_{[0,1]} ( −[ (x̂_{A,n} − ẑ_{A,n}) + Σ_k g_kᵀ (x̂_{B,k} − ẑ_{B,k} − x̂_{A,k} + ẑ_{A,k}) ] / [ Σ_k (1/η_k) ‖x̂_{B,k} − ẑ_{B,k} − x̂_{A,k} + ẑ_{A,k}‖² ] )    (27)

C.2 Gauss-Seidel style updates
At iteration t, the Frank-Wolfe step takes the following form:

ẑ^{t+1} = γ* x̂ + (1 − γ*) ẑ^t    (28)

where the update is performed on the variables associated to all the network neurons at once. As the Augmented Lagrangian (17) is smooth in the primal variables, we can take a conditional gradient step after each x̂_k computation, resulting in Gauss-Seidel style updates (as opposed to the Jacobi style updates of (28)) (Bertsekas and Tsitsiklis, 1989). As the values of the primals at the following layers are inter-dependent through the gradient of the Augmented Lagrangian, these block-coordinate updates result in faster convergence. The updates become:

ẑ_k^{t+1} = γ*_k x̂_k + (1 − γ*_k) ẑ_k^t    (29)

where the optimal step size is given by:

γ*_k = clamp_{[0,1]} ( −[ g_kᵀ (x̂_{B,k} − ẑ_{B,k}) + g_{k+1}ᵀ (ẑ_{A,k+1} − x̂_{A,k+1}) ] / [ (1/η_k) ‖x̂_{B,k} − ẑ_{B,k}‖² + (1/η_{k+1}) ‖ẑ_{A,k+1} − x̂_{A,k+1}‖² ] )    (30)

with the special cases x̂_{B,0} = ẑ_{B,0} = 0 and:

γ*_{n−1} = clamp_{[0,1]} ( −[ g_{n−1}ᵀ (x̂_{B,n−1} − ẑ_{B,n−1}) + (x̂_{A,n} − ẑ_{A,n}) ] / [ (1/η_{n−1}) ‖x̂_{B,n−1} − ẑ_{B,n−1}‖² ] )    (31)

C.3 Momentum
The supergradient methods that we implemented, both for our dual (9) and for the dual (2) by Dvijotham et al. (2018), rely on Adam (Kingma and Ba, 2015) to speed up convergence. Inspired by its presence in Adam, and by the work on accelerating proximal methods (Lin et al., 2017; Salzo and Villa, 2012), we apply momentum on the proximal updates.

By closely looking at equation (16), we can draw a similarity between the dual update in the proximal algorithm and supergradient descent. The difference is that the former operation takes place after a closed-form minimization of the linear inner problem in ẑ_A, ẑ_B, whereas the latter happens after some steps of an optimization algorithm that solves the quadratic form of the Augmented Lagrangian (17). We denote the (approximate) argmin of the Lagrangian at the t-th iteration of the proximal method (method of multipliers) by ẑ^{t,†}.

Thanks to the aforementioned similarity, we can keep an exponential average of the gradient-like terms with parameter μ ∈ [0, 1] and adopt momentum for the dual update, yielding:

π_k^{t+1} = μ π_k^t + (ẑ_{B,k}^{t,†} − ẑ_{A,k}^{t,†}) / η_k
ρ_k^{t+1} = ρ_k^t + π_k^{t+1}    (32)

As, empirically, a decaying step size has a positive effect on Adam, we adopt the same strategy for the proximal algorithm as well, resulting in an increasing η_k. The Augmented Lagrangian then becomes:

[z^t, ẑ^t] = argmin_{z,ẑ} L(ẑ, ρ^t)  s.t.  P_0(z_0, ẑ_{A,1}); P_k(ẑ_{B,k}, ẑ_{A,k+1}) for k ∈ [1..n−1]

where

L(ẑ, ρ) = ẑ_{A,n} + Σ_{k=1..n−1} ρ_kᵀ (ẑ_{B,k} − ẑ_{A,k}) + Σ_{k=1..n−1} (1/2η_k) ‖ẑ_{B,k} − ẑ_{A,k}‖² − Σ_{k=1..n−1} (η_k/2) ‖μ π_k‖².    (33)

We point out that the Augmented Lagrangian's gradient ∇_{(ẑ_{B,k} − ẑ_{A,k})} L(ẑ, ρ) remains unvaried with respect to the momentum-less solver; hence, in practice, the main change is the dual update formula (32).

D Algorithm summary
We provide high level summaries of our solvers: supergradient (Algorithm 1) and proximal (Algorithm 2). For both methods, we decompose problem (3) into several subsets of constraints, duplicate the variables to make the subproblems independent, and enforce their consistency by introducing dual variables ρ, which we then optimize. This results in problem (9), which we solve using one of the two algorithms. Our initial dual solution is given by the algorithm of Wong and Kolter (2018). The details can be found in Appendix F.

Algorithm 1
Supergradient method

1: function SUBG_COMPUTE_BOUNDS({W_k, b_k, l_k, u_k}_{k=1..n})
2:   Initialise dual variables ρ using the algorithm of Wong and Kolter (2018)
3:   for nb_iterations do
4:     ẑ* ← closed-form inner minimization over the primal variables
5:     Compute the supergradient ∇_ρ q(ρ^t) = ẑ*_B − ẑ*_A
6:     ρ^{t+1} ← Adam's update rule (Kingma and Ba, 2015)
7:   end for
8:   return q(ρ)
9: end function

The structure of our employed supergradient method is relatively simple. We iteratively minimize over the primal variables (line 4) in order to compute the supergradient, which is then used to update the dual variables through the Adam update (line 6).

Algorithm 2
Proximal method

1: function PROX_COMPUTE_BOUNDS({W_k, b_k, l_k, u_k}_{k=1..n})
2:   Initialise dual variables ρ using the algorithm of Wong and Kolter (2018)
3:   Initialise primal variables ẑ at the conditional gradient of ρ
4:   for nb_outer_iterations do
5:     Update dual variables using equation (16) or (32)
6:     for nb_inner_iterations do
7:       for k ∈ [0..n−1] do
8:         Compute layer conditional gradient x̂_k
9:         Compute layer optimal step size γ*_k using (27)
10:        Update layer primal variables: ẑ_k = γ*_k x̂_k + (1 − γ*_k) ẑ_k
11:       end for
12:     end for
13:   end for
14:   return q(ρ)
15: end function

The proximal method instead alternates updates to the dual variables (applying the updates from (16); line 5) and to the primal variables (update of (17); lines 6 to 12). The inner minimization problem is solved only approximately, using the Frank-Wolfe algorithm (line 10), applied in a block-coordinate fashion over the network layers (line 7). We have closed form solutions for both the conditional gradient (line 8) and the optimal step-size (line 9).
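To make Algorithm 1 concrete, here is a self-contained toy instance (my own construction, not from the paper): a Lagrangian decomposition of min a + b subject to a = b over two boxes, optimized with plain supergradient ascent rather than Adam. Each dual evaluation solves the inner problems in closed form and returns a valid lower bound, illustrating the anytime property.

```python
def q_and_supergradient(rho):
    """Decomposition dual of: min a + b  s.t.  a = b, a in [-1, 2], b in [0, 3].
    With the penalty rho * (a - b), the dual is
    q(rho) = min_a (1 + rho) * a + min_b (1 - rho) * b;
    each inner min is solved in closed form by picking a box corner, as in
    line 4 of Algorithm 1, and (a* - b*) is a supergradient of q at rho."""
    a_star = -1.0 if (1.0 + rho) >= 0.0 else 2.0
    b_star = 0.0 if (1.0 - rho) >= 0.0 else 3.0
    q = (1.0 + rho) * a_star + (1.0 - rho) * b_star
    return q, a_star - b_star

rho, best = 0.0, float("-inf")
for t in range(1, 200):
    q, g = q_and_supergradient(rho)
    best = max(best, q)        # anytime: every q(rho) is a valid lower bound
    rho += g / (t ** 0.5)      # plain decaying-step ascent (the paper uses Adam)
```

The true optimum of the toy primal is 0 (at a = b = 0), and the dual maximum q(−1) = 0 attains it, so `best` converges to 0 from below.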
E Comparison between the duals
We will show the relation between the decomposition by Dvijotham et al. (2018) and the one proposed in this paper. We will show that, in the context of ReLU activation functions, from any solution of their dual providing a bound, our dual provides a better bound.
Theorem.
Let us assume σ(ẑ_k) = max(0, ẑ_k). For any solution μ, λ of dual (2) by Dvijotham et al. (2018) yielding bound d(μ, λ), it holds that q(μ) ≥ d(μ, λ).

Proof. We start by reproducing the formulation that Dvijotham et al. (2018) use, which we slightly modify in order to have a notation more comparable to ours. (Our activation function σ is denoted h in their paper. The equivalent of x in their paper is z in ours, while the equivalent of their z is ẑ. Also note that their paper uses the computation of upper bounds as examples, while ours uses the computation of lower bounds.) With our convention, doing the decomposition to obtain Problem (6) in (Dvijotham et al., 2018) would give:

min_{z,ẑ}  W_n z_{n−1} + b_n + Σ_{k=1}^{n−1} μ_kᵀ (ẑ_k − W_k z_{k−1} − b_k) + Σ_{k=1}^{n−1} λ_kᵀ (z_k − σ(ẑ_k))
such that  l_k ≤ ẑ_k ≤ u_k  for k ∈ [1, n−1]
           σ(l_k) ≤ z_k ≤ σ(u_k)  for k ∈ [1, n−1]
           z_0 ∈ C    (34)

Decomposing it, in the same way that Dvijotham et al. (2018) do in order to obtain their equation (7), we obtain:

b_n − Σ_{k=1}^{n−1} μ_kᵀ b_k + Σ_{k=1}^{n−1} min_{ẑ_k ∈ [l_k, u_k]} (μ_kᵀ ẑ_k − λ_kᵀ σ(ẑ_k)) + Σ_{k=1}^{n−1} min_{z_k ∈ [σ(l_k), σ(u_k)]} (−μ_{k+1}ᵀ W_{k+1} z_k + λ_kᵀ z_k) + min_{z_0 ∈ C} −μ_1ᵀ W_1 z_0    (35)

(with the convention that μ_n = −I). As a reminder, the formulation of our dual (9) is:

q(ρ) = min_{z,ẑ}  ẑ_{A,n} + Σ_{k=1..n−1} ρ_kᵀ (ẑ_{B,k} − ẑ_{A,k})
s.t.  P_0(z_0, ẑ_{A,1}); P_k(ẑ_{B,k}, ẑ_{A,k+1})  for k ∈ [1..n−1]    (36)

which can be decomposed as:

q(ρ) = Σ_{k=1}^{n−1} min_{ẑ_{B,k}, ẑ_{A,k+1} ∈ P_k(·,·)} (ρ_kᵀ ẑ_{B,k} − ρ_{k+1}ᵀ ẑ_{A,k+1}) + min_{z_0, ẑ_{A,1} ∈ P_0(·,·)} −ρ_1ᵀ ẑ_{A,1},    (37)

with the convention that ρ_n = −I.
We will show that when we choose the dual variables such that

ρ_k = μ_k,    (38)

we obtain a tighter bound than the one given by (35). We start by showing that the term optimised over P_0 is equivalent to some of the terms in (35):

min_{z_0, ẑ_{A,1} ∈ P_0(·,·)} −ρ_1ᵀ ẑ_{A,1} = min_{z_0, ẑ_{A,1} ∈ P_0(·,·)} −μ_1ᵀ ẑ_{A,1} = −μ_1ᵀ b_1 + min_{z_0 ∈ C} −μ_1ᵀ W_1 z_0    (39)

The first equality is given by the replacement formula (38), while the second one is obtained by replacing ẑ_{A,1} with its expression in P_0 (ẑ_{A,1} = W_1 z_0 + b_1).

We will now obtain a lower bound of the term optimised over P_k. Let us start by plugging in the values using formula (38) and replacing the value of ẑ_{A,k+1} using the constraint:

min_{ẑ_{B,k}, ẑ_{A,k+1} ∈ P_k(·,·)} (ρ_kᵀ ẑ_{B,k} − ρ_{k+1}ᵀ ẑ_{A,k+1}) = −μ_{k+1}ᵀ b_{k+1} + min_{ẑ_{B,k} ∈ [l_k,u_k], cvx_hull(ẑ_{B,k}, z_k)} (μ_kᵀ ẑ_{B,k} − μ_{k+1}ᵀ W_{k+1} z_k)    (40)

Focusing on the second term, which contains the minimization over the convex hull, we will obtain a lower bound. It is important, at this stage, to recall that, as seen in section 5.1.2, the minimum of the second term can be attained either at one of the three vertices of the triangle in Figure 1 (ambiguous ReLU), on the ẑ_{B,k} = z_k line (passing ReLU), or at the (ẑ_{B,k} = 0, z_k = 0) triangle vertex (blocking ReLU).
We will write ( ˆz B,k , z k ) ∈ ReLU sol ( ˆz B,k , z k , l k , u k ) .We can add a term λ Tk ( σ ( ˆz k ) − σ ( ˆz k )) = 0 and obtain: min ˆz B,k ∈ [ l k , u k ] cvx hull ( ˆz B,k , z k ) (cid:0) µ Tk ˆz B,k − µ Tk +1 W k +1 z k (cid:1) = min ˆz B,k ∈ [ l k , u k ] ReLU sol ( ˆz B,k , z k , l k , u k ) (cid:0) µ Tk ˆz B,k − µ Tk +1 W k +1 z k (cid:1) = min ˆz B,k ∈ [ l k , u k ] ReLU sol ( ˆz B,k , z k , l k , u k ) (cid:0) µ Tk ˆz B,k − µ Tk +1 W k +1 z k + λ Tk ( σ ( ˆz B,k ) − σ ( ˆz B,k )) (cid:1) ≥ min ˆz B,k ∈ [ l k , u k ] ReLU sol ( ˆz B,k , z k , l k , u k ) (cid:0) µ Tk ˆz B,k − λ Tk σ ( ˆz B,k ) (cid:1) + min ˆz B,k ∈ [ l k , u k ] ReLU sol ( ˆz B,k , z k , l k , u k ) (cid:0) − µ Tk +1 W k +1 z k + λ Tk σ ( ˆz B,k ) (cid:1) ≥ min ˆz B,k ∈ [ l k , u k ] (cid:0) µ Tk ˆz B,k − λ Tk σ ( ˆz B,k ) (cid:1) + min z k ∈ [ σ ( l k ) ,σ ( u k )] (cid:0) − µ Tk +1 W k +1 z k + λ Tk z k (cid:1) (41)quality between the second line and the third comes from the fact that we are adding a term equal to zero. Theinequality between the third line and the fourth is due to the fact that the sum of minimum is going to be lower than theminimum of the sum. For what concerns obtaining the final line, the first term comes from projecting z k out of thefeasible domain and taking the convex hull of the resulting domain. We need to look more carefully at the second term.Plugging in the ReLU formula: min ˆz B,k ∈ [ l k , u k ] ReLU sol ( ˆz B,k , z k , l k , u k ) (cid:0) − µ Tk +1 W k +1 z k + λ Tk max { , ˆz B,k } (cid:1) = min ˆz B,k ∈ [ σ ( l k ) , σ ( u k )] ReLU sol ( ˆz B,k , z k , l k , u k ) (cid:0) − µ Tk +1 W k +1 z k + λ Tk ˆz B,k (cid:1) ≥ min z k ∈ [ σ ( l k ) ,σ ( u k )] (cid:0) − µ Tk +1 W k +1 z k + λ Tk z k (cid:1) , as (keeping in mind the shape of ReLU sol and for the purposes of this specific problem) excluding the negative partof the ˆz B,k domain does not alter the minimal value. 
The final line then follows by observing that forcing ˆz B,k = z is aconvex relaxation of the positive part of ReLU sol .Summing up the lower bounds for all the terms in (37), as given by equations (39) and (41), we obtain the formulationof Problem (35). We conclude that the bound obtained by Dvijotham et al. (2018) is necessarily no larger than thebound derived using our dual. Given that we are computing lower bounds, this means that their bound is looser.
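The inequality proven above can be checked numerically on a toy example. The following sketch is our own illustration (not code from the paper's repository): it instantiates both decomposed duals, (35) and (37), for a hypothetical one-hidden-neuron network with an ambiguous ReLU, and verifies that $q(\mu) \geq d(\mu, \lambda)$ for random dual assignments.

```python
import random

def relu(x):
    return max(0.0, x)

# Toy network (hypothetical, for illustration): z0 in [-1, 1],
# zhat1 = w1*z0 + b1, z1 = relu(zhat1), output zhat2 = w2*z1 + b2.
w1, b1, w2, b2 = 2.0, 0.5, -1.5, 0.3
l1, u1 = -w1 + b1, w1 + b1      # pre-activation bounds [-1.5, 2.5]: ambiguous ReLU

def dsg_dual(mu, lam):
    # d(mu, lam): the decomposed form (35) specialised to this network.
    # Each inner objective is piecewise linear, so its minimum is attained
    # at an interval endpoint or at the ReLU breakpoint 0.
    t1 = min(mu * z - lam * relu(z) for z in (l1, 0.0, u1))
    t2 = min((w2 + lam) * z for z in (relu(l1), relu(u1)))
    t3 = min(-mu * w1 * z0 for z0 in (-1.0, 1.0))
    return b2 - mu * b1 + t1 + t2 + t3

def decomposition_dual(rho):
    # q(rho): the decomposed form (37). The P_1 subproblem is linear over the
    # convex hull of the ambiguous ReLU, hence minimised at one of its vertices.
    hull = [(l1, 0.0), (0.0, 0.0), (u1, u1)]             # (zhat_B1, z1) vertices
    t1 = min(rho * zb + w2 * z + b2 for zb, z in hull)   # convention rho_2 = -1
    t2 = min(-rho * (w1 * z0 + b1) for z0 in (-1.0, 1.0))
    return t1 + t2

random.seed(0)
for _ in range(1000):
    mu, lam = random.uniform(-3.0, 3.0), random.uniform(-3.0, 3.0)
    assert decomposition_dual(mu) >= dsg_dual(mu, lam) - 1e-9
```

At $\mu = \lambda = 0$ the two bounds coincide; for other assignments the decomposition bound is strictly larger on this toy problem, mirroring the strictness observed empirically in Section G.1.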
F Initialisation of the Dual
We now show that the solution generated by the method of Wong and Kolter (2018) can be used to initialise our dual (9), even though the two duals were derived differently: theirs comes from the Lagrangian dual of the Planet relaxation (Ehlers, 2017), while ours comes from Lagrangian decomposition. Specifically, we show that a solution extracted from theirs leads to exactly the same bound they generate.

As a reminder, we reproduce results proven by Wong and Kolter (2018). In order to obtain the bound, the problem they solve is:

$$ \min_{\hat{z}_n} \; c^T \hat{z}_n \quad \text{such that} \quad \hat{z}_n \in \tilde{Z}_\epsilon(x). \quad (42) $$

When $c$ is the identity matrix (so that we optimise the bounds of each neuron), we recover problem (3); $\tilde{Z}_\epsilon(x)$ corresponds to the constraints (3b) to (3c). Based on Theorem 1 and equations (8) to (10) of (Wong and Kolter, 2018), the dual problem they derive has objective function

$$ -\sum_{i=1}^{n} \nu_i^T b_i - x^T \hat{\nu}_0 - \epsilon \left\| \hat{\nu}_0 \right\|_1 + \sum_{i=1}^{n-1} \sum_{j \in \mathcal{I}_i} l_{i,j} \left[ \nu_{i,j} \right]_+ \quad (43) $$

with a solution given by

$$ \nu_n = -c, \qquad \hat{\nu}_i = -W_{i+1}^T D_{i+1} W_{i+2}^T D_{i+2} \cdots D_{n-1} W_n^T c, \qquad \nu_i = D_i \hat{\nu}_i, \quad (44) $$

where $D_i$ is a diagonal matrix with entries

$$ (D_i)_{jj} = \begin{cases} 0 & j \in \mathcal{I}_i^- \\ 1 & j \in \mathcal{I}_i^+ \\ \dfrac{u_{i,j}}{u_{i,j} - l_{i,j}} & j \in \mathcal{I}_i, \end{cases} \quad (45) $$

and $\mathcal{I}_i^-$, $\mathcal{I}_i^+$, $\mathcal{I}_i$ correspond respectively to blocked ReLUs (with always-negative inputs), passing ReLUs (with always-positive inputs), and ambiguous ReLUs. We adjust the indexing slightly: Wong and Kolter denote the input $z_1$, while we refer to it as $z_0$.

We will now prove that, by taking as our solution of (9)

$$ \rho_k = \nu_k, \quad (46) $$

we obtain exactly the same bound. The problem we are solving is (9), so, given a choice of $\boldsymbol{\rho}$, the bound that we generate is:

$$ \min_{\mathbf{z}, \hat{\mathbf{z}}} \; \hat{z}_{A,n} + \sum_{k=1}^{n-1} \rho_k^T \left( \hat{z}_{B,k} - \hat{z}_{A,k} \right) \quad \text{s.t.} \;\; \mathcal{P}_0(z_0, \hat{z}_{A,1}); \;\; \mathcal{P}_k(\hat{z}_{B,k}, \hat{z}_{A,k+1}) \text{ for } k \in [1, n-1], \quad (47) $$

which can be decomposed into several subproblems:

$$ \sum_{k=1}^{n-1} \; \min_{(\hat{z}_{B,k}, \hat{z}_{A,k+1}) \in \mathcal{P}_k} \left( \rho_k^T \hat{z}_{B,k} - \rho_{k+1}^T \hat{z}_{A,k+1} \right) + \min_{(z_0, \hat{z}_{A,1}) \in \mathcal{P}_0} -\rho_1^T \hat{z}_{A,1}, \quad (48) $$

where we take the convention, consistent with (46), that $\rho_n = \nu_n = -c = -I$.

Let us start by evaluating the problem over $\mathcal{P}_0$, having replaced $\boldsymbol{\rho}$ by $\boldsymbol{\nu}$ in accordance with (46):

$$ \min_{z_0, \hat{z}_{A,1}} \; -\nu_1^T \hat{z}_{A,1} \quad \text{such that} \quad \hat{z}_{A,1} = W_1 z_0 + b_1, \;\; z_0 \in \mathcal{C}, \quad (49) $$

where, in the context of (Wong and Kolter, 2018), $\mathcal{C}$ is defined by the bounds $l_0 = x - \epsilon$ and $u_0 = x + \epsilon$. We already explained how to solve this subproblem in the context of our Frank-Wolfe optimisation (see Equation (12)). Plugging the minimiser (each coordinate of $z_0$ set to the bound that minimises the linear objective; $\odot$ denotes elementwise product and $\mathbb{1}$ an elementwise indicator) into the objective function gives us:

$$
\begin{aligned}
\min_{(z_0, \hat{z}_{A,1}) \in \mathcal{P}_0} -\rho_1^T \hat{z}_{A,1}
&= -\nu_1^T W_1 \left( u_0 \odot \mathbb{1}_{\nu_1^T W_1 > 0} + l_0 \odot \mathbb{1}_{\nu_1^T W_1 \leq 0} \right) - \nu_1^T b_1 \\
&= -\hat{\nu}_0^T \left( u_0 \odot \mathbb{1}_{\hat{\nu}_0 > 0} + l_0 \odot \mathbb{1}_{\hat{\nu}_0 \leq 0} \right) - \nu_1^T b_1 \\
&= -\hat{\nu}_0^T x - \epsilon\, \hat{\nu}_0^T \mathbb{1}_{\hat{\nu}_0 > 0} + \epsilon\, \hat{\nu}_0^T \mathbb{1}_{\hat{\nu}_0 \leq 0} - \nu_1^T b_1 \\
&= -\hat{\nu}_0^T x - \epsilon \left\| \hat{\nu}_0 \right\|_1 - \nu_1^T b_1. \quad (50)
\end{aligned}
$$

We will now evaluate the values of the problems over $\mathcal{P}_k$. Once again, we replace $\boldsymbol{\rho}$ with the appropriate values of $\boldsymbol{\nu}$ in accordance with (46):

$$
\min_{\hat{z}_{B,k}, \hat{z}_{A,k+1}} \; \nu_k^T \hat{z}_{B,k} - \nu_{k+1}^T \hat{z}_{A,k+1} \quad \text{s.t.} \;\; l_k \leq \hat{z}_{B,k} \leq u_k, \;\; (\hat{z}_{B,k}, z_k) \in \text{cvx hull}_\sigma(\hat{z}_{B,k}, z_k, l_k, u_k), \;\; \hat{z}_{A,k+1} = W_{k+1} z_k + b_{k+1}. \quad (51)
$$

We can rephrase the problem so that it decomposes completely over the ReLUs, by merging the last equality constraint into the objective function:

$$
\min_{\hat{z}_{B,k}, z_k} \; \nu_k^T \hat{z}_{B,k} - \nu_{k+1}^T W_{k+1} z_k - \nu_{k+1}^T b_{k+1} \quad \text{s.t.} \;\; l_k \leq \hat{z}_{B,k} \leq u_k, \;\; (\hat{z}_{B,k}, z_k) \in \text{cvx hull}_\sigma(\hat{z}_{B,k}, z_k, l_k, u_k). \quad (52)
$$

Note that $\nu_{k+1}^T W_{k+1} = \hat{\nu}_k^T$. We will now omit the term $-\nu_{k+1}^T b_{k+1}$, which no longer depends on the optimisation variables and is therefore passed down directly to the value of the bound.

For the rest of the problem, we distinguish the different ReLU cases $\mathcal{I}^+$, $\mathcal{I}^-$ and $\mathcal{I}$.

In the case of a ReLU $j \in \mathcal{I}^+$, we have $\text{cvx hull}_\sigma(\cdot) \equiv \{\hat{z}_{B,k}[j] = z_k[j]\}$, so the replacement of $z_k[j]$ turns the objective term for $j$ into $\left( \nu_k[j] - \hat{\nu}_k[j] \right) \hat{z}_{B,k}[j]$. Given that $j \in \mathcal{I}^+$, the corresponding element of $D_k$ is 1, so $\nu_k[j] = \hat{\nu}_k[j]$, which means that the term in the objective function is zero.

In the case of a ReLU $j \in \mathcal{I}^-$, we have $\text{cvx hull}_\sigma(\cdot) \equiv \{z_k[j] = 0\}$, so the replacement of $z_k[j]$ turns the objective term for $j$ into $\nu_k[j]\, \hat{z}_{B,k}[j]$. Given that $j \in \mathcal{I}^-$, the corresponding element of $D_k$ is 0, so $\nu_k[j] = 0$, and this term is also zero.

The case of a ReLU $j \in \mathcal{I}$ is more complex. We apply the same strategy of picking the appropriate inequalities in the convex hull as we did for obtaining Equation (15), which leads us to:

$$
\min_{l_k \leq \hat{z}_{B,k} \leq u_k} \; \left( \nu_k[j] - \left[ \hat{\nu}_k[j] \right]_+ \frac{u_k[j]}{u_k[j] - l_k[j]} \right) \hat{z}_{B,k}[j] - \left[ \hat{\nu}_k[j] \right]_- \max\left( \hat{z}_{B,k}[j], 0 \right) + \left[ \hat{\nu}_k[j] \right]_+ \frac{u_k[j]}{u_k[j] - l_k[j]}\, l_k[j]. \quad (53)
$$

Once again, we can recognise that the coefficient in front of $\hat{z}_{B,k}[j]$ (the first term of the objective) is equal to 0 by construction of the values of $\nu$ and $\hat{\nu}$. The coefficient in front of $\max(\hat{z}_{B,k}[j], 0)$ in the second term is necessarily non-negative. Therefore, the minimum value of this problem is necessarily the constant third term, reachable for $\hat{z}_{B,k}[j] = 0$, which is feasible because $j \in \mathcal{I}$. The value of this constant term is:

$$ \left[ \hat{\nu}_k[j] \right]_+ \frac{u_k[j]}{u_k[j] - l_k[j]}\, l_k[j] = \left[ \nu_k[j] \right]_+ l_k[j]. \quad (54) $$

Regrouping the different cases and the constant term of problem (52) that we set aside, the value of the term corresponding to the $\mathcal{P}_k$ subproblem is:

$$ -\nu_{k+1}^T b_{k+1} + \sum_{j \in \mathcal{I}} \left[ \nu_k[j] \right]_+ l_k[j]. \quad (55) $$

Plugging all the optimisation results into Equation (48), we obtain:

$$
\begin{aligned}
\sum_{k=1}^{n-1} \left( \min_{\mathcal{P}_k} \rho_k^T \hat{z}_{B,k} - \rho_{k+1}^T \hat{z}_{A,k+1} \right) + \min_{\mathcal{P}_0} -\rho_1^T \hat{z}_{A,1}
&= \sum_{k=1}^{n-1} \left( -\nu_{k+1}^T b_{k+1} + \sum_{j \in \mathcal{I}} \left[ \nu_k[j] \right]_+ l_k[j] \right) - \hat{\nu}_0^T x - \epsilon \left\| \hat{\nu}_0 \right\|_1 - \nu_1^T b_1 \\
&= -\sum_{k=1}^{n} \nu_k^T b_k - \hat{\nu}_0^T x - \epsilon \left\| \hat{\nu}_0 \right\|_1 + \sum_{k=1}^{n-1} \sum_{j \in \mathcal{I}} \left[ \nu_k[j] \right]_+ l_k[j], \quad (56)
\end{aligned}
$$

which is exactly the value given by Wong and Kolter (2018), reported here as Equation (43). $\square$

G Supplementary experiments
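The backward pass (44)-(45) and the bound (43) can be illustrated concretely. The sketch below is our own (the two-layer network, its weights, and all variable names are arbitrary assumptions, not taken from the paper); it computes the Wong and Kolter initialisation for a random network and checks that the resulting value lower-bounds the network output over sampled inputs. The last-layer bias is set to zero to keep the bias bookkeeping minimal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary two-layer network (hypothetical, for illustration):
# x in R^2, three hidden ReLUs, one linear output with zero bias.
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, eps = np.array([0.5, -0.2]), 0.1

# Interval bounds on the pre-activations over the l_inf ball of radius eps.
l1 = W1 @ x - np.abs(W1).sum(axis=1) * eps + b1
u1 = W1 @ x + np.abs(W1).sum(axis=1) * eps + b1

# Backward pass of equations (44)-(45):
# nu_n = -c, nuhat_i = W_{i+1}^T nu_{i+1}, nu_i = D_i nuhat_i.
c = np.ones(1)
nu2 = -c
nuhat1 = W2.T @ nu2
D1 = np.where(u1 <= 0, 0.0, np.where(l1 >= 0, 1.0, u1 / (u1 - l1)))
nu1 = D1 * nuhat1
nuhat0 = W1.T @ nu1

# Dual objective (43): a valid lower bound on the minimum of the output.
amb = (l1 < 0) & (u1 > 0)
bound = (-nu1 @ b1 - nuhat0 @ x - eps * np.abs(nuhat0).sum()
         + (l1[amb] * np.maximum(nu1[amb], 0.0)).sum())

# Sanity check: no sampled input in the ball can go below the bound.
for _ in range(1000):
    z0 = x + rng.uniform(-eps, eps, size=2)
    assert (W2 @ np.maximum(W1 @ z0 + b1, 0.0) + b2)[0] >= bound - 1e-9
```

Setting $\rho_k$ to the `nu` vectors computed this way is exactly the initialisation described above; the derivation shows our dual then evaluates to the same value as `bound`.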
We now complement the results presented in the main paper and provide additional information on the experimental setting. We start with incomplete verification (§ G.1) and then move on to complete verification (§ G.2).
G.1 Incomplete Verification
Figure 6 presents additional pointwise comparisons for the distributional results shown in Figure 2. The comparison between Dec-DSG+ and DSG+ shows that the inequality in Theorem 1 is strict for most of the obtained dual solutions; nevertheless, operating directly on problem (9) is preferable in the vast majority of cases. For completeness, we also include a comparison between WK and Interval Propagation (IP), showing that the latter yields significantly worse bounds on most images, which is why we omitted it from the main body of the paper.
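Interval Propagation is not spelled out in this appendix; for reference, a minimal sketch of the standard procedure (the function name and structure are ours, not the paper's) propagates elementwise bounds through each affine layer, applying the ReLU to the bounds of every hidden layer:

```python
import numpy as np

def interval_bounds(layers, l, u):
    """Elementwise interval arithmetic through affine layers,
    each but the last followed by a ReLU."""
    for i, (W, b) in enumerate(layers):
        center, radius = (u + l) / 2.0, (u - l) / 2.0
        l = W @ center - np.abs(W) @ radius + b   # lower bound of W z + b, z in [l, u]
        u = W @ center + np.abs(W) @ radius + b
        if i < len(layers) - 1:                   # ReLU on every hidden layer
            l, u = np.maximum(l, 0.0), np.maximum(u, 0.0)
    return l, u

# One affine layer, inputs in [0, 1]^2: the output z[0] - z[1] spans [-1, 1].
l, u = interval_bounds([(np.array([[1.0, -1.0]]), np.zeros(1))], np.zeros(2), np.ones(2))
```

Because each layer's bounds are computed independently of the correlations between neurons, the looseness compounds with depth, which is consistent with IP being dominated by WK in Figures 6 and 9.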
[Figure 6 panels: pointwise comparisons of Dec-DSG+ (1040 steps) vs. DSG+, Supergradient (740 steps) vs. Dec-DSG+, and WK vs. IP, each shown for timing (in s) and for gap to best bound.]
Figure 6: Additional pointwise comparison for a subset of the methods in Figure 2, along with the omitted Interval Propagation (IP). Each datapoint corresponds to a CIFAR image; darker colour shades mean higher point density. The dotted line corresponds to equality and, in both graphs, lower is better along both axes.
[Figure 7 panels: distributions of runtime (s) and gap to best bound for Gurobi, Gurobi-TL, WK, DSG+ (520 and 1040 steps), Dec-DSG+ (520 and 1040 steps), Supergradient (370 and 740 steps), and Proximal (200 and 400 steps).]

Figure 7: Comparison of the distribution of runtime and gap to optimality on a normally trained network. In both cases, lower is better. The width at a given value represents the proportion of problems for which this is the result. Note that Gurobi always returns the optimal solution, so it does not appear in the optimality-gap plot, but it always has the highest runtime.
Figures 7, 8 and 9 show experiments (see Section 6.3) for a network trained using standard stochastic gradient descent and cross-entropy, with no robustness objective. We employ $\epsilon_{\text{verif}} = 5/255$. Most observations made in the main paper for the adversarially trained network (Figures 2 and 3) still hold true: in particular, the advantage of the Lagrangian Decomposition based dual over the dual by Dvijotham et al. (2018) is even more marked than in Figure 3. In fact, for a subset of the properties, DSG+ returns rather loose bounds even after a large number of iterations. The proximal algorithm returns better bounds than the supergradient method in this case as well, even though the larger support of the optimality-gap distribution and the larger runtime might reduce the advantages of the proximal algorithm.

[Figure 8, left pair: Supergradient (740 steps) vs. DSG+, timing (in s) and gap to best bound.]
[Figure 8, right pair: Proximal (400 steps) vs. Supergradient (740 steps), timing (in s) and gap to best bound.]
Figure 8: Pointwise comparison for a subset of the methods in Figure 7. Each datapoint corresponds to a CIFAR image; darker colour shades mean higher point density. The dotted line corresponds to equality and, in both graphs, lower is better along both axes.
[Figure 9 panels: pointwise comparisons of Dec-DSG+ (1040 steps) vs. DSG+, Supergradient (740 steps) vs. Dec-DSG+, and WK vs. IP, each shown for timing (in s) and for gap to best bound.]
Figure 9: Additional pointwise comparison for a subset of the methods in Figure 7, along with the omitted Interval Propagation (IP). Each datapoint corresponds to a CIFAR image; darker colour shades mean higher point density. The dotted line corresponds to equality and, in both graphs, lower is better along both axes.
G.1.1 Network Architecture
We now outline the network architecture used for both the adversarially trained and the normally trained network experiments.

Conv2D(channels=16, kernel_size=4, stride=2)
Conv2D(channels=32, kernel_size=4, stride=2)
Linear(channels=100)
Linear(channels=10)
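The listing above omits padding and input resolution. Assuming CIFAR-10 inputs (3x32x32) and a padding of 1 for both convolutions (an assumption on our part, matching Wong and Kolter's public models; it is not stated in the listing), the intermediate shapes follow from the standard convolution output-size formula:

```python
def conv_out(size, kernel, stride, padding):
    # Standard convolution output size: floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

s1 = conv_out(32, 4, 2, 1)     # first conv:  32x32 -> 16x16 spatial
s2 = conv_out(s1, 4, 2, 1)     # second conv: 16x16 -> 8x8 spatial
flat = 32 * s2 * s2            # 32 channels flattened into the first linear layer
assert (s1, s2, flat) == (16, 8, 2048)
```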
Table 2: Network architecture for incomplete verification, from Wong and Kolter (2018). Each layer but the last is followed by a ReLU activation function.

G.2 Complete Verification
As done by Lu and Kumar (2020), we provide a more in-depth analysis of the base model results presented in Figure 10. Depending on the time $t_i$ that the Gurobi baseline employed to verify them, the properties have been divided into an "easy", a "medium" and a "hard" set, according to two thresholds on $t_i$. The relative performance of the iterative dual methods remains unvaried, with the initial gap between our two methods increasing with the difficulty of the problem.

Table 3: For easy, medium and difficult verification properties on the base model, we compare average solving time, average number of solved sub-problems (on the properties that did not time out), and the percentage of timed-out properties.
                     |            Easy               |            Medium             |            Hard
Method               | time (s)  sub-probs  %timeout | time (s)  sub-probs  %timeout | time (s)   sub-probs  %timeout
Gurobi BaBSR         | 545.720    580.428     0.0    | 1370.395  1408.745     0.0    | 3127.995    2562.872   41.31
MIPplanet            | 112.100   2030.362     0.85   |  177.805  4274.811     0.52   | 1222.632   12444.449   23.71
Supergradient BaBSR  |  85.004   1662.000     0.64   |  136.853  3772.679     0.26   | 1133.264   12821.481
Proximal BaBSR       |                               |                               |

[Figure 10 panels (easy, medium and hard properties of the base model): percentage of properties verified vs. computation time (s) for MIPplanet, Gurobi BaBSR, Proximal BaBSR, Supergradient BaBSR and DSG+ BaBSR.]
Figure 10: Cactus plots for the base model, with properties of varying difficulty. For each, we compare the bounding methods andcomplete verification algorithms by plotting the percentage of solved properties as a function of runtime.
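The BaBSR variants compared above all follow the same branch-and-bound template: split the problem, bound each sub-problem with one of the methods discussed in this paper, and discard sub-problems whose lower bound already proves the property. The following toy sketch is ours (it splits the input domain rather than ReLU activations, unlike BaBSR) and illustrates that template with interval bounds on a one-dimensional piecewise-linear function:

```python
def out(z):
    # Toy "network" (ours, for illustration): 1 - relu(z) + relu(z - 0.5)
    return 1.0 - max(z, 0.0) + max(z - 0.5, 0.0)

def lower_bound(a, b):
    # Interval lower bound of `out` on [a, b] with 0 <= a <= b:
    # lower-bound each monotone term independently.
    return 1.0 - max(b, 0.0) + max(a - 0.5, 0.0)

def verify(threshold, tol=1e-4):
    """True iff min out(z) over z in [0, 1] provably exceeds `threshold`."""
    stack = [(0.0, 1.0)]
    while stack:
        a, b = stack.pop()
        if lower_bound(a, b) >= threshold:
            continue                      # sub-domain verified, prune it
        mid = 0.5 * (a + b)
        if out(mid) < threshold or b - a < tol:
            return False                  # counterexample found (or undecided at tol)
        stack += [(a, mid), (mid, b)]
    return True

# min of out on [0, 1] is 0.5, attained on [0.5, 1]
assert verify(0.25) and not verify(0.75)
```

The tightness of the bounding procedure determines how many sub-problems must be explored before every branch is pruned, which is exactly the effect measured by the sub-problem counts in Table 3.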
G.2.1 Network Architectures
Finally, we describe the properties and the network architecture for the dataset by Lu and Kumar (2020).