Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming
Sumanth Dathathri, Krishnamurthy Dvijotham, Alexey Kurakin, Aditi Raghunathan, Jonathan Uesato, Rudy Bunel, Shreya Shankar, Jacob Steinhardt, Ian Goodfellow, Percy Liang, Pushmeet Kohli
DeepMind · Google Brain · Stanford · UC Berkeley (* equal contribution; work done at Google)
{dathathri,dvij,kurakin,juesato}@google.com, [email protected]
Abstract
Convex relaxations have emerged as a promising approach for verifying desirable properties of neural networks like robustness to adversarial perturbations. Widely used Linear Programming (LP) relaxations only work well when networks are trained to facilitate verification. This precludes applications that involve verification-agnostic networks, i.e., networks not specially trained for verification. On the other hand, semidefinite programming (SDP) relaxations have successfully been applied to verification-agnostic networks, but do not currently scale beyond small networks due to poor time and space asymptotics. In this work, we propose a first-order dual SDP algorithm that (1) requires memory only linear in the total number of network activations, and (2) requires only a fixed number of forward/backward passes through the network per iteration. By exploiting iterative eigenvector methods, we express all solver operations in terms of forward and backward passes through the network, enabling efficient use of hardware like GPUs/TPUs. For two verification-agnostic networks on MNIST and CIFAR-10, we significantly improve ℓ∞ verified robust accuracy from 1% to 88% and from 6% to 40%, respectively. We also demonstrate tight verification of a quadratic stability specification for the decoder of a variational autoencoder.

1 Introduction

Applications of neural networks to safety-critical domains require ensuring that they behave as expected under all circumstances [32]. One way to achieve this is to ensure that neural networks conform to a list of specifications, i.e., relationships between the inputs and outputs of a neural network that ought to be satisfied.
Specifications can come from safety constraints (a robot should never enter certain unsafe states [40, 29, 12]), prior knowledge (a learned physical dynamics model should be consistent with the laws of physics [49]), or stability considerations (certain transformations of the network inputs should not significantly change its outputs [57, 7]).

Evaluating whether a network satisfies a given specification is challenging, due to the difficulty of searching for violations over high-dimensional input spaces. Because of this, several techniques that claimed to enhance neural network robustness were later shown to break under stronger attacks [61, 5]. This has motivated the search for verification algorithms that can provide provable guarantees that neural networks satisfy input-output specifications.

Popular approaches based on linear programming (LP) relaxations of neural networks are computationally efficient and have enabled successful verification for many specifications [37, 18, 30, 21]. LP relaxations are sound (they never incorrectly conclude that a specification is satisfied) but incomplete (they may fail to verify a specification even if it is actually satisfied). Consequently, these approaches tend to give poor or vacuous results when used in isolation, though they can achieve strong results when combined with training approaches specifically designed to aid verification [22, 51, 67, 21, 54, 6]. In contrast, we focus on verification-agnostic models, which are trained in a manner agnostic to the verification algorithm. This would enable applying verification to all neural networks, not just those trained to be verifiable.

* Equal contribution. Alphabetical order. Code available at https://github.com/deepmind/jax_verify.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
First, this means training procedures need not be constrained by the need to verify, allowing techniques that produce empirically robust networks which may not be easily verified [38]. Second, ML training algorithms are often not easily modifiable, e.g., production-scale ML models with highly specific pipelines. Third, for many tasks, defining formal specifications is difficult, motivating the need to learn specifications from data. In particular, recent work [24, 50, 66] has modeled natural perturbations to images, like changes in lighting conditions or in the skin tone of a person, using perturbations in the latent space of a generative model. In these cases, the specification itself is a verification-agnostic network, which the verifier must handle even if the prediction network is trained with verification in mind.

In contrast to LP-based approaches, the semidefinite programming (SDP) relaxation [52] has enabled robustness certification of verification-agnostic networks. However, the interior-point methods commonly used for SDP solving are computationally expensive, with O(n^6) runtime and O(n^4) memory requirements, where n is the number of neurons in the network [41, 60]. This limits the applicability of SDPs to small fully connected neural networks.

Within the SDP literature, a natural approach is to turn to first-order methods, exchanging precision for scalability [63, 53]. Because verification only needs a bound on the optimal value of the relaxation (and not the optimal solution), we need not design a general-purpose SDP solver, and can instead operate directly in the dual. A key benefit is that the dual problem can be cast as minimizing the maximum eigenvalue of an affine function, subject only to non-negativity constraints. This is a standard technique in the SDP literature [25, 42] and removes the need for an expensive projection onto the positive semidefinite cone.
Further, since any set of feasible dual variables provides a valid upper bound, we do not need to solve the SDP to optimality as done previously [52], and can instead stop once a sufficiently tight upper bound is attained.

In this paper, we show that applying these ideas to neural network verification yields an implementation that is efficient both in theory and in practice. Our solver requires O(n) memory rather than the O(n^4) of interior-point methods, and each iteration involves a constant number of forward and backward passes through the network.

Our contributions.
The key contributions of our paper are as follows:

1. By adapting ideas from the first-order SDP literature [25, 42], we observe that the dual of the SDP formulation for neural network verification can be expressed as a maximum eigenvalue problem with only interval bound constraints. This formulation generalizes [52] without loss of tightness, and applies to any quadratically-constrained quadratic program (QCQP), including the standard adversarial robustness specification and a variety of network architectures. Crucially, when applied to neural networks, we show that subgradient computations are expressible purely in terms of forward or backward passes through layers of the network. Consequently, applying a subgradient algorithm to this formulation achieves per-iteration complexity comparable to a constant number of forward and backward passes through the network.

2. We demonstrate the applicability of first-order SDP techniques to neural network verification. We first evaluate our solver by verifying ℓ∞ robustness of a variety of verification-agnostic networks on MNIST and CIFAR-10. We show that our approach can verify large networks beyond the scope of existing techniques. For these verification-agnostic networks, we obtain bounds an order of magnitude tighter than previous approaches (Figure 1). For an adversarially trained convolutional neural network (CNN) with no additional regularization on MNIST (ε = 0.1), compared to LP relaxations, we improve the verified robust accuracy from 0.4% to 87.8%. For the same training and architecture on CIFAR-10 (ε = 2/255), the corresponding improvement is from 5.8% to 39.6% (Table 1).

3. To demonstrate the generality of our approach, we verify a different quadratic specification on the stability of the output of the decoder of a variational autoencoder (VAE). The upper bound on specification violation computed by our solver closely matches the lower bound on specification violation (from PGD attacks) across a wide range of inputs (Section 6.2).

2 Related work
Neural network verification.
There is a large literature on verification methods for neural networks. Broadly, it can be grouped into complete verification using mixed-integer programming [26, 18, 59, 10, 2], bound propagation [56, 70, 65, 21], convex relaxation [30, 17, 67, 51], and randomized smoothing [35, 11]. Verified training approaches combined with convex relaxations have led to promising results [30, 51, 23, 6]. Randomized smoothing and verified training require special modifications to the predictor (smoothing the predictions by adding noise) and/or the training algorithm (training with additional noise or regularizers), and hence are not applicable to the verification-agnostic setting. Bound propagation approaches have been shown to be special instances of LP relaxations [37]. Hence we focus on describing convex relaxations and complete solvers, the areas most closely related to this paper.
Complete verification approaches.
These methods rely on exhaustive search to find counter-examples to the specification, using smart propagation or bounding methods to rule out parts of the search space that are determined to be free of counter-examples. The dominant paradigms in this space are Satisfiability Modulo Theories (SMT) [26, 18] and Mixed Integer Programming (MIP) [59, 10, 2]. The two main issues with these solvers are that: 1) they can take time exponential in the network size, and 2) they typically cannot run on deep learning accelerators (GPUs, TPUs).
Convex relaxation based methods.
Much work has relied on linear programming (LP) or similar relaxations for neural-network verification [30, 17]. Bound propagation approaches can also be viewed as a special case of LP relaxations [37]. Recent work [54] put all these approaches on a uniform footing and demonstrated through extensive experiments that there are fundamental barriers to the tightness of LP-based relaxations, and that obtaining tight verification procedures requires better relaxations. A similar argument in [52] demonstrated a large gap between LP and SDP relaxations even for networks with randomly chosen weights. Fazlyab et al. [19, 20] generalized SDP relaxations to arbitrary network structures and activation functions. However, these papers use off-the-shelf interior-point solvers for the resulting relaxations, preventing them from scaling to large CNNs. In this paper, we focus on SDP relaxations but develop customized solvers that can run on deep learning accelerators (GPUs/TPUs), enabling their application to large CNNs.
First-order SDP solvers.
While interior-point methods are theoretically compelling, the demands of large-scale SDPs motivate first-order solvers. Common themes in this literature include smoothing of nonsmooth objectives [42, 33, 14] and spectral bundle or proximal methods [25, 36, 45]. Conditional gradient methods use a sum of rank-one updates, and when combined with sketching techniques can represent the primal solution variable in linear space [68, 69]. Many primal-dual algorithms [64, 63, 41, 4, 15] exploit computational advantages of operating in the dual; in fact, our approach to verification operates exclusively in the dual, sidestepping the space and computational challenges associated with the primal matrix variable. Our formulation in Section 5.1 closely follows the eigenvalue optimization formulation from Section 3 of Helmberg and Rendl [25]. While we show in this work that vanilla subgradient methods are sufficient to achieve practical performance for many problems, many ideas from the first-order SDP literature are promising candidates for future work, and could allow faster or more reliable convergence. A full survey is beyond our scope, but we refer interested readers to Tu and Wang [60] and the related work of Yurtsever et al. [69] for excellent surveys.
Notation.
For vectors a, b, we use a ≤ b and a ≥ b to denote element-wise inequalities. We use B_ε(x) to denote the ℓ∞ ball of size ε around input x. For symmetric matrices X, Y, we use X ⪰ Y to denote that X − Y is positive semidefinite (i.e., X − Y is a symmetric matrix with non-negative eigenvalues). We use [x]_+ to denote max(x, 0) and [x]_− for min(x, 0). 1 represents a vector of all ones.

Neural networks.
We are interested in verifying properties of a neural network with L hidden layers and N neurons that takes input x_0. Here x_i denotes the activations at layer i, and the concatenated vector x = [x_0, x_1, x_2, ..., x_L] represents all the activations of the network. Let L_i denote an affine map corresponding to a forward pass through layer i (e.g., linear, convolutional, and average-pooling layers), and let σ_i be an element-wise activation function (e.g., ReLU, sigmoid, tanh). In this work, we focus on feedforward networks where x_{i+1} = σ_i(L_i(x_i)).

Verification. We study verification problems that involve determining whether φ(x) ≤ 0 for network inputs x_0 satisfying ℓ_0 ≤ x_0 ≤ u_0, where the specification φ is a function of the network activations x:

    opt := max_x  φ(x)
    subject to  x_{i+1} = σ_i(L_i(x_i))   (neural net constraints),
                ℓ_0 ≤ x_0 ≤ u_0           (input constraints).        (1)

The property is verified if opt ≤ 0. In this work, we focus on specifications φ that are quadratic functions. This includes several interesting properties like verification of adversarial robustness (where φ is linear), conservation of an energy in dynamical systems [49], or stability of VAE decoders (Section 6.2). Note that while we assume ℓ∞-norm input constraints for ease of presentation, our approach is applicable to any quadratic input constraint.

A starting point for our approach is the following observation from prior work: the neural network constraints in the verification problem (1) can be replaced with quadratic constraints for ReLUs [52] and other common activations [19], yielding a Quadratically Constrained Quadratic Program (QCQP). We bound the solution of the resulting QCQP via a Lagrangian relaxation. Following [52], we assume access to lower and upper bounds ℓ_i, u_i on the activations x_i such that ℓ_i ≤ x_i ≤ u_i; these can be obtained via existing bound propagation techniques [65, 30, 70].
We use ℓ ≤ x ≤ u to denote the collection of activations and bounds at all layers taken together.

We first describe the terms in the Lagrangian corresponding to the constraints encoding layer i in a ReLU network: x_{i+1} = ReLU(L_i(x_i)). Let ℓ_i, u_i denote the bounds such that ℓ_i ≤ x_i ≤ u_i. We associate Lagrange multipliers λ_i = [λ_i^a; λ_i^b; λ_i^c; λ_i^d] with each of the constraints as follows:

    x_{i+1} ≥ 0                                     [λ_i^a],
    x_{i+1} ≥ L_i(x_i)                              [λ_i^b],
    x_{i+1} ⊙ (x_{i+1} − L_i(x_i)) ≤ 0              [λ_i^c],
    x_i ⊙ x_i − (ℓ_i + u_i) ⊙ x_i + ℓ_i ⊙ u_i ≤ 0   [λ_i^d].        (2)

The linear constraints imply that x_{i+1} is greater than both 0 and L_i(x_i). The first quadratic constraint together with the linear constraints makes x_{i+1} equal to the larger of the two, i.e., x_{i+1} = max(L_i(x_i), 0). The second quadratic constraint follows directly from the bounds on the activations. The Lagrangian L(x_i, x_{i+1}, λ_i) corresponding to the constraints and Lagrange multipliers described above is:

    L(x_i, x_{i+1}, λ_i) = (−x_{i+1})^T λ_i^a + (L_i(x_i) − x_{i+1})^T λ_i^b
                           + (x_{i+1} ⊙ (x_{i+1} − L_i(x_i)))^T λ_i^c
                           + (x_i ⊙ x_i − (ℓ_i + u_i) ⊙ x_i + ℓ_i ⊙ u_i)^T λ_i^d
                         = (ℓ_i ⊙ u_i)^T λ_i^d                                          [independent of x_i, x_{i+1}]
                           − x_{i+1}^T λ_i^a + (L_i(x_i))^T λ_i^b − x_{i+1}^T λ_i^b
                           − x_i^T ((ℓ_i + u_i) ⊙ λ_i^d)                                [linear in x_i, x_{i+1}]
                           + x_{i+1}^T diag(λ_i^c) x_{i+1} − x_{i+1}^T diag(λ_i^c) L_i(x_i)
                           + x_i^T diag(λ_i^d) x_i                                      [quadratic in x_i, x_{i+1}].   (3)

The overall Lagrangian L(x, λ) is the sum of L(x_i, x_{i+1}, λ_i) across all layers together with the objective φ(x), and consists of terms that are either independent of x, linear in x, or quadratic in x.
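As a quick illustration (our own sketch, not code from the paper's solver), the per-coordinate ReLU encoding in (2) can be checked numerically: a value satisfies the two linear constraints and the first quadratic constraint exactly when it equals max(z, 0).

```python
import numpy as np

def satisfies_relu_constraints(x_out, z, tol=1e-9):
    """Check the linear + quadratic ReLU encoding for a scalar x_out = ReLU(z)."""
    linear = (x_out >= -tol) and (x_out >= z - tol)   # x_out >= 0 and x_out >= z
    quadratic = x_out * (x_out - z) <= tol            # x_out * (x_out - z) <= 0
    return linear and quadratic

rng = np.random.default_rng(0)
for z in rng.normal(size=100):
    # The true ReLU output always satisfies the constraints...
    assert satisfies_relu_constraints(max(z, 0.0), z)
    # ...and perturbed values violate at least one constraint.
    for bad in (max(z, 0.0) + 0.5, max(z, 0.0) - 0.5):
        assert not satisfies_relu_constraints(bad, z)
```

The constraints are thus an exact (not relaxed) encoding of the ReLU; the relaxation enters only later, through Lagrangian duality.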
Thus, L(x, λ) is a quadratic polynomial in x and can be written in the form

    L(x, λ) = c(λ) + x^T g(λ) + ½ x^T H(λ) x,

where each of the coefficients c(λ), g(λ), and H(λ) is affine as a function of λ. We describe our approach in terms of c(λ), g(λ), and H(λ), which need not be derived by hand: they can be obtained directly from the Lagrangian L(x, λ) via automatic differentiation, as we discuss in Section 5.2. We observe that L(x, λ) is itself composed entirely of forward passes L_i(x_i) and element-wise operations. This makes computing L(x, λ) both convenient to implement and efficient to compute in deep learning frameworks. Via standard Lagrangian duality, the Lagrangian provides a bound on opt:

    opt ≤ min_{λ ≥ 0} max_{ℓ ≤ x ≤ u} L(x, λ) = min_{λ ≥ 0} max_{ℓ ≤ x ≤ u} c(λ) + x^T g(λ) + ½ x^T H(λ) x.   (4)

We now describe our dual problem formulation starting from this Lagrangian (4).

Our goal is to develop a custom solver for large-scale neural network verification with the following desiderata: (1) compute anytime upper bounds valid after each iteration, (2) rely on elementary computations with efficient implementations that can exploit hardware like GPUs and TPUs, and (3) have per-iteration memory and computational cost that scales linearly in the number of neurons. To satisfy these desiderata, we employ first-order methods to solve the Lagrange dual problem (4). We derive a reformulation of the Lagrange dual with only non-negativity constraints on the decision variables (Section 5.1), then show how to efficiently and conveniently compute subgradients of the objective function (Section 5.2), and derive our final solver in Algorithm 1.
Several algorithms in the first-order SDP literature rely on reformulating the semidefinite programming problem as an eigenvalue minimization problem [25, 42]. Applying this idea, we obtain a Lagrange dual problem which has only non-negativity constraints and whose subgradients can be computed efficiently, enabling efficient projected subgradient methods.

Recall that ℓ_i, u_i denote precomputed lower and upper bounds on the activations x_i. For simplicity of presentation, we assume ℓ_i = −1 and u_i = 1 for all i. This is without loss of generality, since we can always center and rescale the activations based on the precomputed bounds to obtain normalized activations x̄ ∈ [−1, 1] and express the Lagrangian in terms of x̄.

Proposition 1. The optimal value opt of the verification problem (1) is bounded above by the Lagrange dual problem corresponding to the Lagrangian in (4), which can be written as follows:

    opt ≤ opt_relax := min_{λ ≥ 0, κ ≥ 0}  f(λ, κ),
    f(λ, κ) = c(λ) + 1^T [ κ − λ_min^−(diag(κ) − M(λ)) · 1 ],
    M(λ) = ½ ( 0      g(λ)^T
               g(λ)   H(λ)   ),                                        (5)

where λ_min^−(Z) = min(λ_min(Z), 0) is the negative portion of the smallest eigenvalue of Z, and κ ∈ R^{1+N}.

Proof Sketch.
Instead of directly optimizing over the primal variables x in the Lagrangian of the verification problem (4), we explicitly add the redundant constraint x ⊙ x ≤ 1 with associated dual variables κ, and then optimize over x in closed form. This does not change the primal (or dual) optimum, but makes the constraints in the dual problem simpler. In the corresponding Lagrange dual problem (now over λ, κ), there is a PSD constraint of the form diag(κ) ⪰ M(λ). Projecting onto this constraint directly is expensive and difficult. However, for any (λ, κ) ≥ 0, we can construct a dual feasible solution (λ, κ̂) by simply subtracting the smallest eigenvalue of diag(κ) − M(λ), if negative, from each entry of κ. For any non-negative λ, κ, the final objective f(λ, κ) is the objective of the corresponding dual feasible solution, and the bound follows from standard Lagrangian duality. The full proof appears in Appendix A.3.
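Proposition 1 can be made concrete on a toy problem. The sketch below is our own illustration (with λ fixed and M assembled densely only because the example is tiny): for any κ ≥ 0, shifting by the negative part of the minimum eigenvalue of diag(κ) − M yields a valid upper bound on the box-constrained quadratic.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2
c = rng.normal()
g = rng.normal(size=n)
H = rng.normal(size=(n, n)); H = H + H.T          # symmetric, possibly indefinite

# Homogenized coefficient matrix: [1; x]^T M [1; x] = x^T g + (1/2) x^T H x.
M = 0.5 * np.block([[np.zeros((1, 1)), g[None, :]], [g[:, None], H]])

def f(kappa):
    """Dual objective: c + 1^T kappa - (n+1) * min(lambda_min(diag(kappa) - M), 0)."""
    lam_min = np.linalg.eigvalsh(np.diag(kappa) - M)[0]
    return c + kappa.sum() - (n + 1) * min(lam_min, 0.0)

# Brute-force the primal maximum over the box [-1, 1]^n on a fine grid.
grid = np.linspace(-1.0, 1.0, 201)
X, Y = np.meshgrid(grid, grid)
vals = c + g[0] * X + g[1] * Y + 0.5 * (H[0, 0] * X**2 + 2 * H[0, 1] * X * Y + H[1, 1] * Y**2)
primal_max = vals.max()

# Any non-negative kappa (even kappa = 0) gives a valid anytime upper bound.
for _ in range(20):
    kappa = np.abs(rng.normal(size=n + 1))
    assert f(kappa) >= primal_max - 1e-9
assert f(np.zeros(n + 1)) >= primal_max - 1e-9
```

This is exactly the anytime property the solver exploits: there is no need to reach dual optimality before reporting a certificate.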
Raghunathan et al. [52] present an SDP relaxation of the QCQP for the verification of ℓ∞ adversarial robustness. The solution to their SDP is equal to opt_relax in our formulation (5) (Appendix A.4). Raghunathan et al. [52] solve the SDP via interior-point methods using off-the-shelf solvers, which cannot scale to larger networks due to memory requirements quartic in the number of activations. In contrast, our algorithm (Algorithm 1) has memory requirements that scale linearly in the number of activations.
Our proof is similar to the standard maximum eigenvalue reformulation of the SDP dual, as used in Helmberg and Rendl [25] or Nesterov [42] (see Appendix A.6 for details). Crucially for scalable implementation, our formulation avoids explicitly computing or storing the matrices for either the primal or dual SDPs. Instead, we rely on automatic differentiation of the Lagrangian and matrix-vector products to represent these matrices implicitly, achieving linear memory and runtime requirements. We discuss this approach now.
Our formulation (5) is amenable to first-order methods. Projections onto the feasible set are simple, and we now show how to efficiently compute a subgradient of the objective f(λ, κ). By Danskin's theorem [13],

    ∂_{λ,κ} ( c(λ) + 1^T [ κ − [v*^T (diag(κ) − M(λ)) v*]_− · 1 ] )  ⊆  ∂_{λ,κ} f(λ, κ),   (6a)

where

    v* = argmin_{‖v‖=1} v^T (diag(κ) − M(λ)) v,   i.e., v* = eigmin(diag(κ) − M(λ)),       (6b)

and ∂_{λ,κ} denotes the subdifferential with respect to λ, κ.¹ In other words, given any eigenvector v* corresponding to the minimum eigenvalue of the matrix diag(κ) − M(λ), we can obtain a valid subgradient by applying autodiff to the left-hand side of (6a) while treating v* as fixed. The main computational difficulty is computing v*. While our final certificate uses an exact eigendecomposition for v*, for the subgradient steps we can approximate v* using an iterative method such as Lanczos [34]. Lanczos only requires repeated applications of the linear map A : v ↦ (diag(κ) − M(λ)) v. This linear map can easily be represented via derivatives and Hessian-vector products of the Lagrangian.

Implementing implicit matrix-vector products via autodiff.
Recall from Section 4 that the Lagrangian is expressible via forward passes through affine layers and element-wise operations involving adjacent network layers. Since M(λ) is composed of the gradient and Hessian of the Lagrangian, computing the map v ↦ M(λ)v costs roughly as much as a forward+backward pass through the network. Furthermore, implementing this map is extremely convenient in ML frameworks supporting autodiff like TensorFlow [1], PyTorch [47], or JAX [8]. From the Lagrangian (4), we note that

    g(λ) = L_x(0, λ) = ∂L(x, λ)/∂x |_{x=0},
    H(λ) v = L_xx^v(0, λ, v) = (∂²L(x, λ)/∂x ∂x^T)|_{x=0} v = ∂(v^T L_x(x, λ))/∂x |_{x=0}.

Thus, g(λ) involves a single gradient, and by using a standard trick for Hessian-vector products [48], the product H(λ)v requires roughly double the cost of a standard forward-backward pass, with linear memory overhead. From the definition of M(λ) in (5), we can use the quantities above to get

    A[v] = (diag(κ) − M(λ)) v
         = κ ⊙ v − ½ ( g(λ)^T v_N ;  v_1 g(λ) + H(λ) v_N )
         = κ ⊙ v − ½ ( L_x(0, λ)^T v_N ;  v_1 L_x(0, λ) + L_xx^v(0, λ, v_N) ),

where v_1 is the first coordinate of v and v_N is the subvector of v formed by the remaining coordinates.

The Lagrange dual problem is a convex optimization problem, and a projected subgradient method with appropriately decaying step sizes converges to an optimal solution [43]. However, we can achieve faster convergence in practice through careful choices of initialization, regularization, and learning rates.
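Putting the pieces together, here is a self-contained numerical sketch (ours, not the paper's JAX solver: finite differences stand in for autodiff, which is exact for quadratics up to rounding, and a dense matrix is assembled only to check the answer). It computes eigmin(diag(κ) − M(λ)) purely from gradient and Hessian-vector products, via Lanczos with full reorthogonalization.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
c = rng.normal(); g = rng.normal(size=n)
H = rng.normal(size=(n, n)); H = H + H.T

def L(x):
    """Stand-in quadratic Lagrangian c + x^T g + (1/2) x^T H x (lambda held fixed)."""
    return c + x @ g + 0.5 * x @ H @ x

def grad_L(x, eps=1e-6):
    e = np.eye(n)
    return np.array([(L(x + eps * e[i]) - L(x - eps * e[i])) / (2 * eps) for i in range(n)])

def hvp(v, eps=1e-4):
    """Hessian-vector product of L; exact for quadratics up to rounding."""
    return (grad_L(eps * v) - grad_L(-eps * v)) / (2 * eps)

kappa = np.abs(rng.normal(size=n + 1))

def A_op(v):
    """A[v] = kappa * v - (1/2)[g^T v_N ; v_1 g + H v_N], via grad/HVP of L only."""
    v1, vN = v[0], v[1:]
    gl = grad_L(np.zeros(n))                 # g(lambda) = L_x(0, lambda)
    Mv = 0.5 * np.concatenate([[gl @ vN], v1 * gl + hvp(vN)])
    return kappa * v - Mv

def lanczos_eigmin(matvec, dim, num_iters, seed=0):
    """Minimum eigenpair from matvecs only (full reorthogonalization for stability)."""
    r = np.random.default_rng(seed)
    Q = np.zeros((dim, num_iters)); alphas, betas = [], []
    q = r.normal(size=dim); q /= np.linalg.norm(q)
    for j in range(num_iters):
        Q[:, j] = q
        w = matvec(q)
        alphas.append(q @ w)
        w = w - alphas[-1] * q - (betas[-1] * Q[:, j - 1] if j > 0 else 0.0)
        w = w - Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)   # reorthogonalize
        beta = np.linalg.norm(w)
        if beta < 1e-12:
            break
        betas.append(beta); q = w / beta
    k = len(alphas)
    T = np.diag(alphas) + np.diag(betas[:k - 1], 1) + np.diag(betas[:k - 1], -1)
    evals, evecs = np.linalg.eigh(T)
    return evals[0], Q[:, :k] @ evecs[:, 0]

# Check against the densely assembled matrix diag(kappa) - M.
M = 0.5 * np.block([[np.zeros((1, 1)), g[None, :]], [g[:, None], H]])
A_dense = np.diag(kappa) - M
v = rng.normal(size=n + 1)
assert np.allclose(A_op(v), A_dense @ v, atol=1e-4)
lam_approx, _ = lanczos_eigmin(A_op, dim=n + 1, num_iters=n + 1)
assert abs(lam_approx - np.linalg.eigvalsh(A_dense)[0]) < 1e-4
```

In the actual solver the finite differences are replaced by autodiff and the matrix is never materialized, which is what gives the linear memory footprint.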
Initialization.
Let κ_opt(λ) denote the value of κ that optimizes the bound (5) for a fixed λ. We initialize with λ = 0 and the corresponding κ_opt(0), using the following proposition.

Proposition 2. For any choice of λ satisfying H(λ) = 0, the optimal choice κ_opt(λ) is given by

    κ*_1 = ½ Σ_{i=1}^N |g(λ)|_i,    κ*_N = ½ |g(λ)|,

where κ = [κ_1; κ_N] is divided into a leading scalar κ_1 and a vector κ_N, and |g(λ)| is taken element-wise. See Appendix A.7 for a proof. Note that when φ(x) is linear, H(λ) = 0 is equivalent to removing the quadratic constraints on the activations and retaining the linear constraints in the Lagrangian (4).
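Under the conventions above (our reconstruction of the constants), the initialization in Proposition 2 can be sanity-checked numerically: with H(λ) = 0, this choice of κ makes diag(κ) − M(λ) diagonally dominant, hence PSD, and the resulting dual value recovers the interval-arithmetic bound c + Σ_i |g_i|.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
c, g = rng.normal(), rng.normal(size=n)

# With H = 0, the homogenized matrix is M = (1/2) [[0, g^T], [g, 0]].
M = 0.5 * np.block([[np.zeros((1, 1)), g[None, :]], [g[:, None], np.zeros((n, n))]])

# Proposition-2-style initialization: kappa_1 = (1/2) sum|g|, kappa_N = (1/2)|g|.
kappa_star = np.concatenate([[0.5 * np.abs(g).sum()], 0.5 * np.abs(g)])

# diag(kappa*) - M is PSD (each Gershgorin disc touches zero from the right)...
assert np.linalg.eigvalsh(np.diag(kappa_star) - M)[0] >= -1e-9
# ...so the dual bound c + 1^T kappa* equals the interval bound max_{|x|<=1} c + g^T x.
dual_bound = c + kappa_star.sum()
interval_bound = c + np.abs(g).sum()
assert abs(dual_bound - interval_bound) < 1e-9
```

Starting from a point that already matches the interval-arithmetic bound means the subgradient iterations can only improve on that baseline.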
Next, we note that there always exists an optimal dual solution satisfying κ_N = 0, because these are the Lagrange multipliers of a redundant constraint; the full proof appears in Appendix A.5. However, a nonzero κ_N has the empirical benefit of smoothing the optimization by preventing negative eigenvalues of A. This is mostly noticeable in the early optimization steps. Thus, we can regularize κ_N through either an additional loss term Σ κ_N, or by fixing κ_N to zero midway through optimization. In practice, we found that both options occasionally improve final performance.
Empirically, we observed that the optimization landscape varies significantly across dual variables associated with different constraints (such as linear vs. quadratic). In practice, we found that adaptive optimizers [16] such as Adam [27] or RMSProp [58] were necessary to stabilize optimization. Additional learning rate adjustment for κ and for the dual variables corresponding to the quadratic ReLU constraints provided an improvement on some network architectures (see Appendix B).

¹The subgradient is a singleton except when the multiplicity of the minimum eigenvalue is greater than one, in which case any minimal eigenvector yields a valid subgradient.

Algorithm 1 Verification via SDP-FO
Input:
Specification φ and bounds ℓ_0 ≤ x_0 ≤ u_0 on the inputs
Output:
Upper bound on the optimal value of (1)
Bound computation: Obtain layer-wise bounds ℓ, u = BoundProp(ℓ_0, u_0) using approaches such as [39, 70]
Lagrangian: Define the Lagrangian L(x, λ) from (4)
Initialization: Initialize λ, κ (Section 5.3)
for t = 1, ..., T do
    Define the linear operator A_t as A_t[v] = κ ⊙ v − ½ ( L_x(0,λ)^T v_N ; v_1 L_x(0,λ) + L_xx^v(0,λ,v_N) )   (see Section 5.2)
    v* ← eigmin(A_t) using the Lanczos algorithm [34]
    Define the function f_t(λ, κ) = L(0, λ) + 1^T [ κ − [v*^T A_t[v*]]_− · 1 ]   (see (6))
    f̄_t ← f_t(λ_t, κ_t)
    Update λ_t, κ_t using any gradient-based method with the gradients ∂f_t(λ_t, κ_t)/∂λ, ∂f_t(λ_t, κ_t)/∂κ to obtain λ̃, κ̃
    Project: λ_{t+1} ← [λ̃]_+, κ_{t+1} ← [κ̃]_+
end for
return min_t f̄_t

We refer to our algorithm (summarized in Algorithm 1) as SDP-FO, since it relies on a first-order method to solve the SDP relaxation. Although the full algorithm involves several components, the implementation is simple (~100 lines for the core logic when implemented in JAX [8]) and is easily applicable to general architectures and specifications. SDP-FO uses memory linear in the total number of network activations, with per-iteration runtime linear in the cost of a forward-backward pass.
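Stripped of the network-specific λ updates, the core loop of Algorithm 1 reduces to projected subgradient descent on the dual variables. The sketch below is our own minimal illustration on a tiny box-constrained quadratic (λ held fixed; Lanczos and autodiff replaced by np.linalg.eigh since the example is dense and tiny).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3
c = rng.normal(); g = rng.normal(size=n)
H = rng.normal(size=(n, n)); H = H + H.T
M = 0.5 * np.block([[np.zeros((1, 1)), g[None, :]], [g[:, None], H]])

def dual_value_and_subgrad(kappa):
    evals, evecs = np.linalg.eigh(np.diag(kappa) - M)
    lam_min, v = evals[0], evecs[:, 0]
    val = c + kappa.sum() - (n + 1) * min(lam_min, 0.0)
    # Danskin: the derivative of lam_min w.r.t. kappa_j is v_j^2 (v treated as fixed).
    sub = np.ones(n + 1) - (n + 1) * v**2 * (lam_min < 0)
    return val, sub

# Initialize kappa as in Proposition 2 (ignoring H for the initial point).
kappa = np.concatenate([[0.5 * np.abs(g).sum()], 0.5 * np.abs(g)])
best = np.inf
for t in range(200):
    val, sub = dual_value_and_subgrad(kappa)
    best = min(best, val)                                # anytime: every iterate is a bound
    step = 0.05 / np.sqrt(t + 1)
    kappa = np.maximum(kappa - step * sub, 0.0)          # project onto kappa >= 0

# Every iterate upper-bounds the true box-constrained maximum.
grid = np.linspace(-1, 1, 41)
pts = np.stack(np.meshgrid(grid, grid, grid), -1).reshape(-1, n)
vals = c + pts @ g + 0.5 * np.einsum('ij,jk,ik->i', pts, H, pts)
primal_max = vals.max()
assert best >= primal_max - 1e-9
```

The "return min over iterates" line of Algorithm 1 is what makes the method anytime: the loop can be stopped as soon as the running best bound is tight enough (e.g., below 0 for a robustness certificate).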
Computing valid certificates.
Because Lanczos is an approximate method, we always report final bounds by computing v* using a non-iterative exact eigendecomposition method from SciPy [44]. In practice, the estimates from Lanczos are very close to the exact values, while taking 0.2 s/iteration on a large convolutional network, compared to 5 minutes for an exact eigendecomposition (see Appendix C).

In this section, we evaluate our SDP-FO verification algorithm on two specifications: robustness to adversarial perturbations for image classifiers (Section 6.1), and robustness to latent-space perturbations for a generative model (Section 6.2). In both cases, we focus on verification-agnostic networks.
We first study verification of ℓ∞ robustness for networks trained on MNIST and CIFAR-10. For this specification, the objective φ(x) in (1) is given by (x_L)_{y'} − (x_L)_y, where x_L denotes the final network activations, i.e., the logits, y is the index of the true image label, and y' is a target label. For each image and target label, we obtain a lower bound φ(x̂) ≤ φ* on the optimum by running projected gradient descent (PGD) [38] on the objective φ(x) subject to the ℓ∞ input constraints. A verification technique provides an upper bound φ̄ ≥ φ*. An example is said to be verified when the worst-case upper bound across all possible target labels, denoted φ̄_x, is below 0. We first compare SDP-FO (Algorithm 1) to the LP relaxation from [18], as this is a widely used approach for verifying large networks, and is shown by [55] to encompass other relaxations including [30, 17, 65, 70, 39, 23]. We further compare to the SDP relaxation from [51] solved using MOSEK [3], a commercial interior-point SDP solver (SDP-IP), and the
MIP approach from [59].
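For intuition, the PGD lower bound can be sketched in a few lines. This is our own illustration with a toy linear "network", chosen because the ℓ∞-constrained maximum of φ is then available in closed form for comparison.

```python
import numpy as np

rng = np.random.default_rng(7)
d, num_classes = 10, 4
W = rng.normal(size=(num_classes, d))     # toy linear model: logits = W x
x0 = rng.normal(size=d)
eps, y, y_target = 0.1, 0, 1

def phi(x):
    logits = W @ x
    return logits[y_target] - logits[y]   # specification objective: target minus true logit

def pgd_lower_bound(steps=20, alpha=0.05):
    x = x0.copy()
    w = W[y_target] - W[y]                # gradient of phi (constant for a linear model)
    for _ in range(steps):
        x = x + alpha * np.sign(w)        # ascent step on phi
        x = np.clip(x, x0 - eps, x0 + eps)  # project back into the l_inf ball
    return phi(x)

# For a linear model the exact maximum over the ball is known in closed form.
exact_max = phi(x0) + eps * np.abs(W[y_target] - W[y]).sum()
lb = pgd_lower_bound()
assert lb <= exact_max + 1e-9            # PGD only ever gives a lower bound...
assert abs(lb - exact_max) < 1e-9        # ...and is tight for this linear objective.
```

For real (nonlinear) networks PGD gives only a heuristic lower bound, which is why the verified upper bound from SDP-FO is needed to close the gap from the other side.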
Models
Our main experiments on CNNs use two architectures:
CNN-A from [67] and
CNN-B from [6]. These contain roughly 200K parameters + 10K activations, and 2M parameters + 20K activations, respectively. All the networks we study are verification-agnostic: trained only with nominal and/or adversarial training [38], without any regularization to promote verifiability. While these networks are much smaller than modern deep neural networks, they are an order of magnitude larger than previously possible for verification-agnostic networks. To compare with prior

Core solver implementation at https://github.com/deepmind/jax_verify/blob/master/src/sdp_verify/sdp_verify.py

[Figure 1: scatter plots of the verified upper bound −φ̄_x against the adversarial lower bound −φ_x for SDP-FO, LP, and CROWN, with the reference line y = x; (a) MNIST, CNN-Adv; (b) CIFAR-10, CNN-Mix.]
Figure 1:
Enabling certification of verification-agnostic networks.
For 100 random examples on MNIST and CIFAR-10, we plot the verified upper bound on φ̄_x against the adversarial lower bound (taking the worst case over target labels for each). Recall that an example is verified when the verified upper bound satisfies φ̄_x < 0. Our key result is that SDP-FO achieves tight verification across all examples, with all points lying close to the line y = x. In contrast, LP and CROWN produce much looser gaps between the lower and upper bounds. We note that many CROWN bounds exceed the plotted y-axis limits.

work, we also evaluate a variety of fully connected MLP networks, using trained parameters from [51, 55]. These each contain roughly 1K activations. Complete training and hyperparameter details are included in Appendix B.1.
Scalable verification of verification-agnostic networks
Our central result is that, for verification-agnostic networks, SDP-FO tractably provides significantly stronger robustness guarantees than existing approaches. In Figure 1, we show that SDP-FO reliably achieves tight verification, despite using loose initial lower and upper bounds obtained from CROWN [70] in Algorithm 1. Table 1 summarizes the results. On all networks we study, we significantly improve on the baseline verified accuracies. For example, we improve verified robust accuracy for CNN-A-Adv on MNIST from 0.4% to 87.8%, and for CNN-A-Mix on CIFAR-10 from 5.8% to 39.6%.
| Dataset | ε | Model | Nominal acc. | PGD acc. | SDP-FO (Ours) | SDP-IP† | LP | MIP† |
|---|---|---|---|---|---|---|---|---|
| MNIST | | MLP-SDP [52] | 97.6% | 86.4% | | 80% | 39.5% | 69.2% |
| MNIST | | MLP-LP [52] | 92.8% | 81.2% | | 80% | 79.4% | - |
| MNIST | | MLP-Adv [52] | 98.4% | 93.4% | | 82% | 26.6% | - |
| MNIST | | MLP-Adv-B [55] | 96.8% | 84.0% | | - | 33.2% | 34.4% |
| MNIST | | CNN-A-Adv | 99.1% | 95.2% | 87.8% | - | 0.4% | - |
| MNIST | | MLP-Nor [55] | 98.0% | 46.6% | | - | 1.8% | 6.0% |
| CIFAR-10 | | CNN-A-Mix-4 | 67.8% | 55.6% | | * | * | * |

†: Using numbers from [52] for SDP-IP, and from [54] using the approach of [59] for MIP. Dashes indicate previously reported numbers are unavailable. *: Computationally infeasible due to quartic memory requirement.

Table 1: Comparison of verified accuracy across verification algorithms. Highlighted rows indicate models trained in a verification-agnostic manner. All numbers are computed across the same 500 test-set examples, except when using previously reported values. For all networks, SDP-FO outperforms previous approaches; the improvement is largest for verification-agnostic models.
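The "Verified Accuracy" columns of Table 1 aggregate the per-example sandwich shown in Figure 1: any attack yields a lower bound on the worst-case margin, any sound verifier an upper bound, and an example counts as verified only when the upper bound is negative. A minimal sketch of this aggregation, with hypothetical bound arrays:

```python
import numpy as np

# Hypothetical per-example bounds on the worst-case margin max_j phi_j(x):
# any attack yields a lower bound, any sound verifier an upper bound.
attack_lb   = np.array([-0.9, -0.4, 0.3, -0.2, -1.1])   # e.g. from PGD
verified_ub = np.array([-0.5, -0.1, 0.8,  0.4, -0.9])   # e.g. from SDP-FO

assert np.all(attack_lb <= verified_ub)   # soundness: bounds sandwich the truth
verified_acc = np.mean(verified_ub < 0)   # fraction certified robust
adv_acc      = np.mean(attack_lb < 0)     # fraction not broken by the attack
gap          = verified_ub - attack_lb    # per-example tightness (near 0 for SDP-FO in Fig. 1)
assert verified_acc <= adv_acc            # certified accuracy never exceeds empirical
```

The gap between the two accuracies is exactly the slack that a tighter verifier (or a stronger attack) can close.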
Comparisons on small-scale problems
We empirically compare SDP-FO against SDP-IP using MOSEK, a commercial interior-point solver. Since the two formulations are equivalent (see Appendix A.4), solving them to optimality should result in the same objective. This lets us carefully isolate the effectiveness of the optimization procedure from the SDP relaxation gap. For interior-point methods, however, the memory requirements are quadratic in the size of $M(\lambda)$, which quickly becomes intractable (on the order of petabytes for a network with 10K activations). This restricts the comparison to the small MLP networks from [52], while SDP-FO can scale to significantly larger networks. In Figure 4 of Appendix C.1, we confirm that on a small random subset of matching verification instances, SDP-FO bounds are only slightly worse than SDP-IP bounds. This suggests that optimization is typically not the bottleneck for SDP-FO; the main challenge is instead tightening the SDP relaxation. Indeed, we can tighten the relaxation by using CROWN precomputed bounds [70] rather than interval arithmetic bounds [39, 22], which almost entirely closes the gap between SDP-FO and PGD for the first three rows of Table 1, including the verification-agnostic MLP-Adv. Finally, compared to the numbers reported in [55], SDP-FO outperforms the MIP approach that uses progressive LP bound tightening [59].

Computational resources
We cap the number of projected gradient iterations for SDP-FO. Using a P100 GPU, the maximum runtime is roughly 15 minutes per MLP instance and 3 hours per CNN instance, though most instances are verified sooner. For reference, SDP-IP uses 25 minutes on a 4-core CPU per MLP instance [52], and is intractable for CNN instances due to quartic memory usage.
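The quadratic memory cost of interior-point methods noted above is exactly what SDP-FO avoids: the matrix $\mathrm{diag}(\kappa) - M(\lambda)$ is never materialized, since its minimum eigenvalue can be estimated from matrix-vector products alone (for a network, each matvec reduces to a fixed number of forward/backward passes). A minimal numpy sketch of this idea, using plain power iteration on a shifted operator in place of the Lanczos method, with a small dense matrix standing in for the implicit operator:

```python
import numpy as np

def min_eigval_via_matvec(matvec, dim, shift, iters=5000, seed=0):
    """Estimate the minimum eigenvalue of a symmetric operator A given only its
    matvec. Power iteration on (shift * I - A) flips the spectrum, so the
    smallest eigenvalue of A becomes the largest eigenvalue of the shifted
    operator; `shift` must upper-bound A's largest eigenvalue. A itself is
    never formed, so memory stays linear in `dim`."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = shift * v - matvec(v)  # apply (shift*I - A) without forming A
        v = w / np.linalg.norm(w)
    return float(v @ matvec(v))    # Rayleigh quotient of A at the converged v

# Toy check: a dense symmetric A stands in for the implicit diag(kappa) - M(lambda).
rng = np.random.default_rng(1)
B = rng.standard_normal((50, 50))
A = (B + B.T) / 2.0
est = min_eigval_via_matvec(lambda u: A @ u, 50, shift=np.linalg.norm(A, 2) + 1.0)
assert abs(est - np.linalg.eigvalsh(A)[0]) < 1e-4
```

In practice the shift can itself come from a cheap matvec-only upper bound on the largest eigenvalue (e.g., a few power iterations), keeping the whole procedure free of explicit matrices.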
Limitations.
In principle, our solver's linear asymptotics allow scaling to extremely large networks. In practice, however, we observe loose bounds for large networks. In Table 1, there is already a significantly larger gap between the PGD and SDP-FO bounds for the larger CNN-B models compared to their CNN-A counterparts, and in preliminary experiments this gap increases further with network size. Thus, while our results demonstrate that the SDP relaxation remains tight on significantly larger networks than those studied in Raghunathan et al. [52], additional innovations in either the formulation or the optimization process are necessary to enable further scaling.

[Figure 2: fraction of verified samples as a function of the perturbation radius (in numbers of standard deviations), for SDP-FO, IBP, and the PGD upper bound.]

Figure 2: Comparison of different approaches for verifying the robustness of the decoder of a VAE on MNIST, measured across 100 samples. The lower bound on the robust accuracy computed with SDP-FO closely matches the upper bound based on a PGD adversarial attack up to perturbations of $0.1\,\sigma_z$, while the lower bound based on IBP begins to diverge from the PGD upper bound at much smaller perturbations.

Setup
To test the generality of our approach, we consider a different specification: verifying the validity of reconstructions from deep generative models, specifically variational auto-encoders (VAEs) [28]. Let $q_E(z \mid s) = \mathcal{N}(\mu^E_{z;s}, \sigma^E_{z;s})$ denote the distribution of the latent representation $z$ corresponding to input $s$, and let $q_D(s \mid z) = \mathcal{N}(\mu^D_{s;z}, I)$ denote the decoder. Our aim is to certify robustness of the decoder to perturbations in the VAE latent space. Formally, the VAE decoder is robust to $\ell_\infty$ latent perturbations for input $s$ and perturbation radius $\alpha \in \mathbb{R}_{++}$ if
$$\varepsilon_{\mathrm{recon}}(s, \mu^D_{s;z}) := \|s - \mu^D_{s;z}\|^2 \le \tau \quad \forall z \ \text{s.t.}\ \|z - \mu^E_{z;s}\|_\infty \le \alpha\,\sigma^E_{z;s}, \tag{7}$$
where $\varepsilon_{\mathrm{recon}}$ is the reconstruction error. Note that unlike the adversarial robustness setting, where the objective was linear, the objective $\varepsilon_{\mathrm{recon}}$ is quadratic. Quadratic objectives are not directly amenable to LP or MIP solvers without further relaxing the quadratic objective to a linear one. For varying perturbation radii $\alpha$, we measure the fraction of the test set with verified reconstruction error below a threshold $\tau$, chosen as the median squared Euclidean distance between a point $s$ and the closest point with a different label (over MNIST).

Results
We verify a VAE on MNIST with a convolutional decoder containing roughly 10K total activations. Figure 2 shows the results. To visualize the improvements resulting from our solver, we include a comparison with guarantees based on interval bound propagation (IBP) [23, 39], which we use to generate the bounds used in Algorithm 1. Compared to IBP, SDP-FO successfully verifies at perturbation radii roughly 50x as large: the radius at which IBP verifies 50% of samples is roughly 50x smaller than the corresponding radius for SDP-FO. Besides the IBP bounds being themselves loose compared to the SDP relaxation, they suffer from a drawback similar to LP/MIP methods in that they bound $\varepsilon_{\mathrm{recon}}$ via $\ell_\infty$ bounds, which further loosens the resulting bounds on $\varepsilon_{\mathrm{recon}}$. Further details and visualizations are included in Appendix B.2.

We have developed a promising approach to scalable tight verification and demonstrated good performance at larger scale than was previously possible. While in principle this solver is applicable to arbitrarily large networks, further innovations (in either the formulation or the solving process) are necessary to obtain meaningful verified guarantees on larger networks.

Acknowledgements
We are grateful to Yair Carmon, Ollie Hinder, M Pawan Kumar, Christian Tjandraatmadja, Vincent Tjeng, and Rahul Trivedi for helpful discussions and suggestions. This work was supported by NSF Award Grant no. 1805310. AR was supported by a Google PhD Fellowship and Open Philanthropy Project AI Fellowship.
Broader Impact
Our work enables verifying properties of verification-agnostic neural networks, i.e., networks trained using procedures agnostic to any specification verification algorithm. While the present scalability of the algorithm does not allow it to be applied to SOTA deep learning models, in many applications it is vital to verify properties of smaller models running safety-critical systems (for example, learned controllers running on embedded systems). The work we have presented here does not address data-related issues directly, and would be susceptible to any biases inherent in the data that the model was trained on. However, as a verification technique, it does not enhance biases present in any pre-trained model, and is only used as a post-hoc check. We do not envisage any significant harmful applications of our work, although it may be possible for adversarial actors to use this approach to verify properties of models designed to induce harm (for example, learning-based bots designed to break spam filters or induce harmful behavior in a conversational AI system).

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.
[2] Ross Anderson, Joey Huchette, Will Ma, Christian Tjandraatmadja, and Juan Pablo Vielma. Strong mixed-integer programming formulations for trained neural networks. Mathematical Programming, pages 1–37, 2020.
[3] MOSEK ApS.
The MOSEK optimization toolbox for MATLAB manual. Version 9.0. , 2019. URL http://docs.mosek.com/9.0/toolbox/index.html .[4] Sanjeev Arora and Satyen Kale. A combinatorial, primal-dual approach to semidefiniteprograms.
J. ACM , 63(2), May 2016. ISSN 0004-5411. doi: 10.1145/2837020. URL https://doi.org/10.1145/2837020 .[5] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of se-curity: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420 , 2018.[6] Mislav Balunovic and Martin Vechev. Adversarial training and provable defenses: Bridg-ing the gap. In
International Conference on Learning Representations , 2020. URL https://openreview.net/forum?id=SJxSDxrKDr .[7] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndi´c, Pavel Laskov,Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In
Joint European conference on machine learning and knowledge discovery in databases , pages387–402. Springer, 2013.[8] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, DougalMaclaurin, and Skye Wanderman-Milne. Jax: composable transformations of python+ numpyprograms, 2018.
URL http://github.com/google/jax, page 18.
[9] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
[10] Rudy R Bunel, Ilker Turkaslan, Philip Torr, Pushmeet Kohli, and Pawan K Mudigonda. A unified view of piecewise linear neural network verification. In
Advances in Neural InformationProcessing Systems , pages 4790–4799, 2018.[11] Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness viarandomized smoothing. arXiv preprint arXiv:1902.02918 , 2019.[12] Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and YuvalTassa. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757 , 2018.[13] John M Danskin.
The theory of max-min with applications . Siam J. Appl. Math, 1966.[14] Alexandre d’Aspremont and Noureddine El Karoui. A stochastic smoothing algorithm forsemidefinite programming.
SIAM Journal on Optimization , 24(3):1138–1177, 2014.[15] Lijun Ding, Alp Yurtsever, Volkan Cevher, Joel A Tropp, and Madeleine Udell. An optimal-storage approach to semidefinite programming using approximate complementarity. arXivpreprint arXiv:1902.03373 , 2019.[16] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learningand stochastic optimization.
Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[17] Krishnamurthy Dvijotham, Robert Stanforth, Sven Gowal, Timothy Mann, and Pushmeet Kohli. A dual approach to scalable verification of deep networks. arXiv preprint arXiv:1803.06567, 2018.
[18] Rüdiger Ehlers. Formal verification of piece-wise linear feed-forward neural networks. In Deepak D'Souza and K. Narayan Kumar, editors,
Automated Technology for Verification and Analysis ,pages 269–286, Cham, 2017. Springer International Publishing. ISBN 978-3-319-68167-2.[19] Mahyar Fazlyab, Manfred Morari, and George J Pappas. Safety verification and robustnessanalysis of neural networks via quadratic constraints and semidefinite programming. arXivpreprint arXiv:1903.01287 , 2019.[20] Mahyar Fazlyab, Alexander Robey, Hamed Hassani, Manfred Morari, and George Pappas.Efficient and accurate estimation of lipschitz constants for deep neural networks. In
Advancesin Neural Information Processing Systems , pages 11423–11434, 2019.[21] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri,and Martin Vechev. Ai 2: Safety and robustness certification of neural networks with abstractinterpretation. In
Security and Privacy (SP), 2018 IEEE Symposium on , 2018.[22] Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, JonathanUesato, Timothy Mann, and Pushmeet Kohli. On the effectiveness of interval bound propagationfor training verifiably robust models. arXiv preprint arXiv:1810.12715 , 2018.[23] Sven Gowal, Krishnamurthy Dj Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin,Jonathan Uesato, Relja Arandjelovic, Timothy Mann, and Pushmeet Kohli. Scalable verifiedtraining for provably robust image classification. In
Proceedings of the IEEE InternationalConference on Computer Vision , pages 4842–4851, 2019.[24] Sven Gowal, Chongli Qin, Po-Sen Huang, Taylan Cemgil, Krishnamurthy Dvijotham, TimothyMann, and Pushmeet Kohli. Achieving robustness in the wild via adversarial mixing withdisentangled representations. In
Proceedings of the IEEE/CVF Conference on Computer Visionand Pattern Recognition , pages 1211–1220, 2020.[25] Christoph Helmberg and Franz Rendl. A spectral bundle method for semidefinite programming.
SIAM Journal on Optimization , 10(3):673–696, 2000.[26] Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: Anefficient smt solver for verifying deep neural networks. In
International Conference on ComputerAided Verification , pages 97–117. Springer, 2017.[27] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprintarXiv:1412.6980 , 2014.[28] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengioand Yann LeCun, editors, , 2014. URL http://arxiv.org/abs/1312.6114 .[29] Torsten Koller, Felix Berkenkamp, Matteo Turchetta, and Andreas Krause. Learning-basedmodel predictive control for safe exploration. In , pages 6059–6066. IEEE, 2018.[30] J Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convexouter adversarial polytope. arXiv preprint arXiv:1711.00851 , 2017.[31] Jacek Kuczy´nski and Henryk Wo´zniakowski. Estimating the largest eigenvalue by the powerand lanczos algorithms with a random start.
SIAM Journal on Matrix Analysis and Applications, 13(4):1094–1122, 1992.
[32] Lindsey Kuper, Guy Katz, Justin Gottschlich, Kyle Julian, Clark Barrett, and Mykel Kochenderfer. Toward scalable verification for safety-critical deep networks. arXiv preprint arXiv:1801.05950, 2018.
[33] Guanghui Lan, Zhaosong Lu, and Renato DC Monteiro. Primal-dual first-order methods with $O(1/\epsilon)$ iteration-complexity for cone programming. Mathematical Programming, 126(1):1–29, 2011.
[34] Cornelius Lanczos.
An iteration method for the solution of the eigenvalue problem of lineardifferential and integral operators . United States Governm. Press Office Los Angeles, CA, 1950.[35] Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certifiedrobustness to adversarial examples with differential privacy. arXiv preprint arXiv:1802.03471 ,2018.[36] Claude Lemaréchal and François Oustry. Nonsmooth algorithms to solve semidefinite programs.In
Advances in linear matrix inequality methods in control , pages 57–77. SIAM, 2000.[37] Changliu Liu, Tomer Arnon, Christopher Lazarus, Clark Barrett, and Mykel J Kochenderfer.Algorithms for verifying deep neural networks. arXiv preprint arXiv:1903.06758 , 2019.[38] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu.Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 ,2017.[39] Matthew Mirman, Timon Gehr, and Martin Vechev. Differentiable abstract interpretation forprovably robust neural networks. In
International Conference on Machine Learning , pages3575–3583, 2018.[40] Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in markov decision processes. arXiv preprint arXiv:1205.4810 , 2012.[41] Renato DC Monteiro. First-and second-order methods for semidefinite programming.
Mathematical Programming , 97(1-2):209–244, 2003.[42] Yurii Nesterov. Smoothing technique and its applications in semidefinite optimization.
Mathematical Programming , 110(2):245–259, 2007.[43] Yurii Nesterov.
Lectures on convex optimization , volume 137. Springer, 2018.[44] T. E. Oliphant. Python for scientific computing.
Computing in Science Engineering , 9(3):10–20,2007.[45] Neal Parikh and Stephen Boyd. Proximal algorithms.
Foundations and Trends in optimization ,1(3):127–239, 2014.[46] Beresford N Parlett.
The symmetric eigenvalue problem , volume 20. siam, 1998.[47] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperativestyle, high-performance deep learning library. In
Advances in Neural Information ProcessingSystems , pages 8024–8035, 2019.[48] Barak A Pearlmutter. Fast exact multiplication by the hessian.
Neural computation , 6(1):147–160, 1994.[49] Chongli Qin, Krishnamurthy (Dj) Dvijotham, Brendan O’Donoghue, Rudy Bunel, RobertStanforth, Sven Gowal, Jonathan Uesato, Grzegorz Swirszcz, and Pushmeet Kohli. Verificationof non-linear specifications for neural networks. In
International Conference on LearningRepresentations , 2019. URL https://openreview.net/forum?id=HyeFAsRctQ .[50] Haonan Qiu, Chaowei Xiao, Lei Yang, Xinchen Yan, Honglak Lee, and Bo Li. Semanticadv:Generating adversarial examples via attribute-conditional image editing. arXiv preprintarXiv:1906.07927 , 2019.[51] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adver-sarial examples. In
International Conference on Learning Representations , 2018. URL https://openreview.net/forum?id=Bys4ob-Rb .[52] Aditi Raghunathan, Jacob Steinhardt, and Percy S Liang. Semidefinite relaxations for certifyingrobustness to adversarial examples. In
Advances in Neural Information Processing Systems, pages 10877–10887, 2018.
[53] James Renegar. Efficient first-order methods for linear programming and semidefinite programming. arXiv preprint arXiv:1409.5832, 2014.
[54] Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, and Pengchuan Zhang. A convex relaxation barrier to tight robust verification of neural networks. arXiv preprint arXiv:1902.08722, 2019.
[55] Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, and Pengchuan Zhang. A convex relaxation barrier to tight robustness verification of neural networks.
CoRR , abs/1902.08722,2019. URL http://arxiv.org/abs/1902.08722 .[56] Gagandeep Singh, Timon Gehr, Matthew Mirman, Markus Püschel, and Martin Vechev. Fastand effective robustness certification. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman,N. Cesa-Bianchi, and R. Garnett, editors,
Advances in Neural Information Processing Systems 31 ,pages 10802–10813. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8278-fast-and-effective-robustness-certification.pdf .[57] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, IanGoodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprintarXiv:1312.6199 , 2013.[58] Tijmen Tieleman and Geoffery Hinton. Rmsprop gradient optimization. , 2014.[59] Vincent Tjeng, Kai Y. Xiao, and Russ Tedrake. Evaluating robustness of neural networks withmixed integer programming. In
International Conference on Learning Representations , 2019.URL https://openreview.net/forum?id=HyGIdiRqtm .[60] Stephen Tu and Jingyan Wang. Practical first order methods for large scale semidefiniteprogramming. Technical report, Technical report, University of California, Berkeley, 2014.[61] Jonathan Uesato, Brendan O’Donoghue, Aaron van den Oord, and Pushmeet Kohli. Adversarialrisk and the dangers of evaluating against weak attacks. arXiv preprint arXiv:1802.05666 , 2018.[62] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, DavidCournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J.van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J.Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, ˙Ilhan Polat, Yu Feng, Eric W. Moore,Jake Vand erPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero,Charles R Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt,and SciPy 1. 0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing inPython.
Nature Methods , 17:261–272, 2020. doi: https://doi.org/10.1038/s41592-019-0686-2.[63] Zaiwen Wen.
First-order methods for semidefinite programming . Columbia University, 2009.[64] Zaiwen Wen, Donald Goldfarb, and Wotao Yin. Alternating direction augmented lagrangianmethods for semidefinite programming.
Mathematical Programming Computation , 2(3-4):203–230, 2010.[65] Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Duane Boning,Inderjit S Dhillon, and Luca Daniel. Towards fast computation of certified robustness for relunetworks. arXiv preprint arXiv:1804.09699 , 2018.[66] Eric Wong and J Zico Kolter. Learning perturbation sets for robust machine learning. arXivpreprint arXiv:2007.08450 , 2020.[67] Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J Zico Kolter. Scaling provable adversarialdefenses. In
Advances in Neural Information Processing Systems , pages 8400–8409, 2018.[68] Alp Yurtsever, Madeleine Udell, Joel A Tropp, and Volkan Cevher. Sketchy decisions: Convexlow-rank matrix optimization with optimal storage. arXiv preprint arXiv:1702.06838 , 2017.[69] Alp Yurtsever, Joel A Tropp, Olivier Fercoq, Madeleine Udell, and Volkan Cevher. Scalablesemidefinite programming. arXiv preprint arXiv:1912.02949 , 2019.[70] Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neuralnetwork robustness certification with general activation functions. In
Advances in Neural Information Processing Systems, pages 4939–4948, 2018.
A Omitted proofs
A.1 Adversarial robustness as quadratic specification
Consider certifying robustness: for input $\bar{x} \in \mathbb{R}^d$ with true label $i$, the network does not misclassify any adversarial example within $\ell_\infty$ distance $\epsilon$ of $\bar{x}$. This property holds if the score of any incorrect class $j$ is always lower than that of $i$ for all perturbations. Thus $\varphi(x) = c^\top x_L$ with $c_j = 1$ and $c_i = -1$. The input constraints are also linear: $-\epsilon \le x_{0,i} - \bar{x}_i \le \epsilon$ for $i = 1, \ldots, d$.

A.2 Linear and quadratic constraints for ReLU networks

ReLU as quadratic constraints:
For the case of ReLU networks, we can do this exactly, as described in [52]. Consider a single activation $x_{\mathrm{post}} = \max(x_{\mathrm{pre}}, 0)$. This can be equivalently written as $x_{\mathrm{post}} \ge 0$, $x_{\mathrm{post}} \ge x_{\mathrm{pre}}$, stating that $x_{\mathrm{post}}$ is greater than or equal to both $0$ and $x_{\mathrm{pre}}$. Additionally, the quadratic constraint $x_{\mathrm{post}}(x_{\mathrm{post}} - x_{\mathrm{pre}}) = 0$ enforces that $x_{\mathrm{post}}$ is equal to at least one of the two. This can be extended to all units in the network, allowing us to replace the ReLU constraints with quadratic constraints.

A.3 Formulation of bound constrained dual problem
Proposition 1. The optimal value opt of the quadratic verification problem (1) is bounded above by
$$\mathrm{opt}_{\text{relax}} := \min_{\lambda \ge 0,\ \kappa \ge 0} \underbrace{c(\lambda) + \tfrac{1}{2}\,\mathbf{1}^\top \Big[\kappa - \lambda^-_{\min}\big(\mathrm{diag}(\kappa) - M(\lambda)\big)\,\mathbf{1}\Big]_+}_{f(\lambda,\,\kappa)}, \qquad M(\lambda) = \begin{pmatrix} 0 & g(\lambda)^\top \\ g(\lambda) & H(\lambda) \end{pmatrix}, \tag{8}$$
where $\lambda^-_{\min}(Z)$ is the negative portion of the smallest eigenvalue of $Z$, i.e. $[\lambda_{\min}(Z)]_-$, and $\kappa \in \mathbb{R}^{N+1}_+$.

Proof.
We start with the Lagrangian in (4), with activations rescaled so that $\ell = -\mathbf{1}$ and $u = \mathbf{1}$, where $\ell$ and $u$ are lower and upper bounds on the activations $x \in \mathbb{R}^N$. This normalization uses pre-computed bounds obtained via bound propagation, which are also used to write the quadratic constraints, as in [52]:
$$\mathrm{opt} \le \min_{\lambda \succeq 0}\ \max_{-\mathbf{1} \preceq x \preceq \mathbf{1}} \Big( c(\lambda) + g(\lambda)^\top x + \tfrac{1}{2}\, x^\top H(\lambda)\, x \Big). \tag{9}$$
Define
$$\tilde{X} = \begin{pmatrix} 1 & x^\top \\ x & x x^\top \end{pmatrix}.$$
In terms of this matrix, the Lagrangian relaxation (9) can be equivalently written as
$$\mathrm{opt} \le \min_{\lambda \succeq 0}\ \max_{\mathrm{diag}(\tilde{X}) \le \mathbf{1}}\ c(\lambda) + \tfrac{1}{2}\langle M(\lambda), \tilde{X}\rangle, \quad \text{where} \tag{10}$$
$$M(\lambda) = \begin{pmatrix} 0 & g(\lambda)^\top \\ g(\lambda) & H(\lambda) \end{pmatrix} \tag{11}$$
(the factor of $\tfrac{1}{2}$ is introduced for convenience). Note that $\tilde{X}$ is always a PSD matrix with diagonal entries bounded above by 1. This yields the following relaxation of (9):
$$\mathrm{opt} \le \mathrm{opt}_{\text{sdp}} := \min_{\lambda \succeq 0}\ \max_{\mathrm{diag}(X) \preceq \mathbf{1},\ X \succeq 0}\ c(\lambda) + \tfrac{1}{2}\langle M(\lambda), X\rangle. \tag{12}$$
We introduce a Lagrange multiplier $\kappa \in \mathbb{R}^{N+1}_+$ for the constraint $\mathrm{diag}(X) \preceq \mathbf{1}$. Since $\mathrm{opt}_{\text{sdp}}$ is convex, by strong duality we have
$$\mathrm{opt}_{\text{sdp}} = \min_{\lambda,\kappa \succeq 0}\ \max_{X \succeq 0}\ c(\lambda) + \tfrac{1}{2}\Big(\langle M(\lambda), X\rangle + \kappa^\top \mathbf{1} - \langle \mathrm{diag}(\kappa), X\rangle\Big) \tag{13}$$
$$= \min_{\lambda,\kappa \succeq 0}\ c(\lambda) + \tfrac{1}{2}\kappa^\top \mathbf{1} \quad \text{s.t.} \quad \mathrm{diag}(\kappa) - M(\lambda) \succeq 0, \tag{14}$$
since (i) if $\mathrm{diag}(\kappa) - M(\lambda)$ is not PSD, $\langle M(\lambda) - \mathrm{diag}(\kappa), X\rangle$ is unbounded when maximizing over PSD matrices $X$, and (ii) when $\mathrm{diag}(\kappa) - M(\lambda) \succeq 0$, the maximum of the inner maximization over PSD matrices $X$ is 0.

Projecting onto the PSD constraint $\mathrm{diag}(\kappa) - M(\lambda) \succeq 0$ directly is still expensive. Instead, we take the following approach: for any non-negative $(\kappa, \lambda)$, we generate a feasible $(\hat{\kappa}, \hat{\lambda})$ as
$$\hat{\kappa} = \Big[\kappa - \lambda^-_{\min}\big(\mathrm{diag}(\kappa) - M(\lambda)\big)\,\mathbf{1}\Big]_+, \qquad \hat{\lambda} = \lambda. \tag{15}$$
In other words, $\lambda$ remains unchanged, $\hat{\lambda} = \lambda$. To obtain $\hat{\kappa}$, we first compute the minimum eigenvalue $\lambda_{\min}$ of the matrix $\mathrm{diag}(\kappa) - M(\lambda)$.
If $\lambda_{\min}$ is positive, $(\kappa, \lambda)$ is already feasible and $\hat{\kappa} = \kappa$, $\hat{\lambda} = \lambda$. If it is negative, we subtract the negative portion $\lambda^-_{\min} = [\lambda_{\min}]_-$ from every diagonal entry, making $\mathrm{diag}(\hat{\kappa}) \succeq M(\lambda)$, and subsequently project onto the non-negativity constraint. The subsequent projection never decreases the value of $\hat{\kappa}$, and hence $\mathrm{diag}(\hat{\kappa}) - M(\lambda) \succeq 0$. Plugging $\hat{\kappa}, \hat{\lambda}$ into the objective above and removing the PSD constraint gives the following final formulation:
$$\mathrm{opt}_{\text{sdp}} = \mathrm{opt}_{\text{relax}} := \min_{\lambda,\kappa \succeq 0}\ c(\lambda) + \tfrac{1}{2}\Big[\kappa - \lambda^-_{\min}\big(\mathrm{diag}(\kappa) - M(\lambda)\big)\,\mathbf{1}\Big]_+^\top \mathbf{1}. \tag{16}$$
Feasible $(\kappa, \lambda)$ remain unchanged under this map, hence the equality.

A.4 Relaxation comparison to Raghunathan et al. [52]
Our solver (Algorithm 1) uses the formulation described in (5), replicated above in (16). In this section, we show that this formulation is equivalent to the SDP formulation in [52] when we use quadratic constraints to replace the ReLU constraints, as done in [52] and presented above in Appendix A.2. We show this via equivalence with the intermediate SDP formulation below; from Appendix A.3, the solution of this intermediate formulation matches that of the relaxation (16) that we optimize:
$$\mathrm{opt}_{\text{sdp}} := \min_{\lambda \succeq 0}\ \max_{\mathrm{diag}(X) \preceq \mathbf{1},\ X \succeq 0}\ c(\lambda) + \tfrac{1}{2}\langle M(\lambda), X\rangle. \tag{17}$$
To mirror the block structure in $M(\lambda)$, we write $X \succeq 0$ as
$$X = \begin{pmatrix} X_1 & X_x^\top \\ X_x & X_{xx} \end{pmatrix}, \qquad X_{xx} \succeq X_1^{-1} X_x X_x^\top, \tag{18}$$
where the last condition follows by Schur complements. The objective then takes the form $\max_{\mathrm{diag}(X_{xx}) \preceq \mathbf{1},\ X_1 \le 1}\ g(\lambda)^\top X_x + \tfrac{1}{2}\langle H(\lambda), X_{xx}\rangle$. Note that the feasible set (over $X_{xx}, X_x$) for $X_1 = 1$ contains the feasible set for any smaller $X_1$, by the Schur complement condition above. Since $X_1$ does not appear in the objective, we can set $X_1 = 1$ to obtain the equality
$$\mathrm{opt}_{\text{sdp}} = \min_{\lambda \succeq 0}\ \max_{\mathrm{diag}(X) \preceq \mathbf{1},\ X_1 = 1,\ X \succeq 0}\ c(\lambda) + g(\lambda)^\top X_x + \tfrac{1}{2}\langle H(\lambda), X_{xx}\rangle, \tag{19}$$
where $X_1$ is the first entry of $X$ and $X_x, X_{xx}$ are the blocks described in (18).

Prior SDP.
Now we start with the SDP formulation in [52]. Recall that we have a QCQP that represents the original verification problem with quadratic constraints on the activations. The relaxation in [52] involves introducing a new matrix variable $P$:
$$P = \begin{pmatrix} P[1] & P[x]^\top \\ P[x] & P[xx] \end{pmatrix}. \tag{20}$$
The quadratic constraints are now written in terms of $P$, where $P[x]$ replaces the linear terms and $P[xx]$ replaces the quadratic terms. Raghunathan et al. [52] optimize this primal SDP formulation to obtain $\mathrm{opt}_{\text{prior-sdp}}$. By strong duality, $\mathrm{opt}_{\text{prior-sdp}}$ matches the optimum of the dual problem obtained via the Lagrangian relaxation of the SDP. In terms of the quantities $g, H$ defined in this work ((3) and (4)), we have
$$\mathrm{opt}_{\text{prior-sdp}} = \min_{\lambda \succeq 0}\ \max_{\mathrm{diag}(P) \preceq \mathbf{1},\ P[1]=1,\ P \succeq 0}\ \mathcal{L}_{\text{prior-sdp}}(P, \lambda) \tag{21}$$
$$= \min_{\lambda \succeq 0}\ \max_{\mathrm{diag}(P) \preceq \mathbf{1},\ P[1]=1,\ P \succeq 0}\ c(\lambda) + g(\lambda)^\top P[x] + \tfrac{1}{2}\langle H(\lambda), P[xx]\rangle. \tag{22}$$
Identifying the matrix $P$ with $X$, (19) and (21) give $\mathrm{opt}_{\text{sdp}} = \mathrm{opt}_{\text{prior-sdp}}$. From (16), we have $\mathrm{opt}_{\text{sdp}} = \mathrm{opt}_{\text{relax}}$, and hence the optimal value of our formulation matches that of prior work [52] when using the same quadratic constraints as [52]. In other words, our reformulation, which admits a subgradient-based, memory-efficient solver, does not introduce additional looseness over the original formulation, which requires a memory-inefficient interior-point solver.

A.5 Regularization of κ via alternate dual formulation

In Section 5.3, we describe that it can be helpful to regularize $\kappa_{1:n}$ towards 0. This is motivated by the following proposition.

Proposition 3. The optimal value opt is upper-bounded by the alternate dual problem
$$\mathrm{opt} \le \min_{\lambda \succeq 0,\ \kappa_0 \ge 0} \underbrace{c(\lambda) + \tfrac{1}{2}\kappa_0}_{\hat{f}(\lambda,\,\kappa_0)} \quad \text{s.t.} \quad \begin{pmatrix} \kappa_0 & -g(\lambda)^\top \\ -g(\lambda) & -H(\lambda) \end{pmatrix} \succeq 0. \tag{23}$$
Further, for any feasible solution $(\lambda, \kappa_0)$ of this dual problem, we can obtain a corresponding solution to $\mathrm{opt}_{\text{relax}}$ with the same $\lambda$ and $\kappa = (\kappa_0; \kappa_{1:n})$, $\kappa_{1:n} = 0$, such that $f(\lambda, \kappa) = \hat{f}(\lambda, \kappa_0)$.

Proof.
We begin with the Lagrangian dual
$$\mathrm{opt} \le \mathrm{opt}_{\text{lagAlt}} := \min_{\lambda \succeq 0}\ \max_{x}\ c(\lambda) + x^\top g(\lambda) + \tfrac{1}{2}\, x^\top H(\lambda)\, x. \tag{24}$$
Note that this is exactly the dual from Equation (4), without the bound constraints on $x$ in the inner maximization. In other words, whereas Equation (4) encodes the bound constraints in both the Lagrangian and the inner-maximization constraints, in Equation (24) the bound constraints are encoded in the Lagrangian only. The inner maximization can be solved in closed form; it is maximized at $x = -H(\lambda)^{-1} g(\lambda)$, yielding
$$\mathrm{opt}_{\text{lagAlt}} = \min_{\lambda \succeq 0}\ c(\lambda) - \tfrac{1}{2}\, g(\lambda)^\top H(\lambda)^{-1} g(\lambda). \tag{25}$$
We can then reformulate using Schur complements:
$$\mathrm{opt}_{\text{lagAlt}} = \min_{\lambda \succeq 0,\ \kappa_0}\ c(\lambda) + \tfrac{1}{2}\kappa_0 \quad \text{s.t.} \quad \kappa_0 \ge -g(\lambda)^\top H(\lambda)^{-1} g(\lambda) \tag{26a}$$
$$= \min_{\lambda \succeq 0,\ \kappa_0}\ c(\lambda) + \tfrac{1}{2}\kappa_0 \quad \text{s.t.} \quad \hat{M}(\lambda) \succeq 0, \quad \text{where} \tag{26b}$$
$$\hat{M}(\lambda) = \begin{pmatrix} \kappa_0 & -g(\lambda)^\top \\ -g(\lambda) & -H(\lambda) \end{pmatrix}. \tag{26c}$$
To see that this provides a corresponding solution to $\mathrm{opt}_{\text{relax}}$, note that when $\hat{M} \succeq 0$, the choice $\kappa = (\kappa_0; \kappa_{1:n})$ with $\kappa_{1:n} = 0$ makes $\mathrm{diag}(\kappa) - M(\lambda) = \hat{M}(\lambda)$, and so $\lambda^-_{\min}\big[\mathrm{diag}(\kappa) - M(\lambda)\big] = 0$. Thus, for any solution $(\lambda, \kappa_0)$, we have $f(\lambda, \kappa) = \hat{f}(\lambda, \kappa_0) = c(\lambda) + \tfrac{1}{2}\kappa_0$.

Remark.
Proposition 3 indicates that regularizing $\kappa_{1:n}$ towards 0 corresponds to solving the alternate dual formulation $\mathrm{opt}_{\text{lagAlt}}$, which does not use bound constraints for the inner maximization. In this case, the roles of $\kappa_{1:n}$ and $\hat{\kappa}$ are slightly different: even when $\kappa_{1:n}$ is clamped to 0, the bound-constrained formulation admits an efficient projection operator, which in turn provides efficient any-time bounds.

A.6 Informal comparison to standard maximum eigenvalue formulation

Our derivation of Proposition 1 is similar to maximum-eigenvalue formulations for dual SDPs; our main emphasis is that, when applied to neural networks, we can use autodiff and implicit matrix-vector products to efficiently compute subgradients. For the reader's convenience, we also mention a minor difference in derivations. The common derivation of maximum-eigenvalue formulations starts with an SDP primal under the assumption that all feasible solutions for the matrix variable $X$ have fixed trace. This trace assumption plays a role analogous to our interval constraints in the QCQP (12). The interval constraints also imply a trace constraint (since $\mathrm{diag}(X) \preceq \mathbf{1}$ implies $\mathrm{tr}(X) \le N+1$), but they additionally allow us to use $\kappa$ to smooth the optimization. Without $\kappa$, any positive eigenvalue of $M(\lambda)$ causes large spikes in the objective: simplifying the objective $f(\lambda, \kappa)$ in (5) reveals the term $\tfrac{1}{2}(N+1)\,\lambda^+_{\max}(M(\lambda))$, which grows linearly with $N$. As expected, this term also appears in those other formulations [25, 42].

A.7 Proof of Proposition 2
Proposition 2. For any choice of $\lambda$ satisfying $H(\lambda) = 0$, the optimal choice $\kappa_{\mathrm{opt}}(\lambda)$ is given by
$$\kappa^*_0 = \sum_{i=1}^{n} |g(\lambda)|_i\,; \qquad \kappa^*_{1:n} = |g(\lambda)|,$$
where we have divided $\kappa = [\kappa_0; \kappa_{1:n}]$ into a leading scalar $\kappa_0$ and a vector $\kappa_{1:n}$.

Proof.
We use the dual expression from Equation (14):

$\mathrm{opt}_{\mathrm{sdp}} = \min_{\lambda \succeq 0,\, \kappa} \; c(\lambda) + \mathbf{1}^\top \kappa \quad \text{s.t.} \quad \mathrm{diag}(\kappa) - M(\lambda) \succeq 0.$

Splitting $\kappa$ into its leading component $\kappa_0$ (a scalar) and the subvector $\kappa_n = [\kappa_1, \ldots, \kappa_n]$ (a vector of the same dimension as $x$), the constraint between $\kappa$ and $\lambda$ evaluates to

$\mathrm{diag}(\kappa) - M(\lambda) = \begin{pmatrix} \kappa_0 & -g(\lambda)^\top \\ -g(\lambda) & \mathrm{diag}(\kappa_n) \end{pmatrix} \succeq 0.$

Using Schur complements, we can rewrite the PSD constraint as

$\mathrm{diag}(\kappa) - M(\lambda) \succeq 0 \iff \kappa_0 \ge \sum_{i \ge 1} \kappa_i^{-1}\,(g(\lambda))_i^2.$

Since the objective is monotonically increasing in $\kappa_0$, the optimal choice for $\kappa_0$ is the lower bound above, $\kappa_0 = \sum_{i \ge 1} \kappa_i^{-1}(g(\lambda))_i^2$. Given this choice, the objective in terms of $\kappa_n$ becomes

$\sum_{i \ge 1} \left( \kappa_i + \frac{(g(\lambda))_i^2}{\kappa_i} \right).$

By the AM-GM inequality, the optimal choice for the remaining terms $\kappa_n$ is then $\kappa_n = |g(\lambda)|$.

B Experimental details
B.1 Verifying Adversarial Robustness: Training and Hyperparameter Details

Optimization details.
We perform subgradient descent using the Adam [27] update rule for MLP experiments, and RMSProp for CNN experiments. We use an initial learning rate of $10^{-3}$, which we anneal twice by a factor of 10. We use 15K optimization steps for all MLP experiments, 60K for CNN experiments on MNIST, and 150K on CIFAR-10. All experiments run on a single P100 GPU.

Adaptive learning rates. For MLP experiments, we use an adaptive learning rate for the dual variables associated with the constraint $x_{i+1} \odot \left(x_{i+1} - L_i(x_i)\right) \le 0$, as mentioned in Section 5.3. In early experiments for MLP-Adv, we observed very sharp curvature in the dual objective with respect to these variables: the gradient takes values several orders of magnitude larger than the values of the solution at convergence. Thus, for all MLP experiments, we decrease the learning rates associated with these variables by a large constant factor. While SDP-FO produced meaningful bounds even without this adjustment, we observed that it makes optimization significantly more stable for MLP experiments. This adjustment was not necessary for CNN experiments.

Training Modes
We conduct experiments on networks trained in three different modes. Nor indicates the network was trained only on unperturbed examples, with the standard cross-entropy loss. Adv networks use adversarial training [38]. Mix networks average the adversarial and normal losses, with equal weight on each. We find that Mix training, while providing a significant improvement in test accuracy, renders the model less verifiable (across verification methods) than training only with adversarial examples.

The suffix -4 in the network name (e.g. CNN-A-Mix-4) indicates networks trained with a larger perturbation radius $\epsilon_{\mathrm{train}}$. We find that using larger $\epsilon_{\mathrm{train}}$ implicitly facilitates verification at smaller $\epsilon$ (across verification methods), but is accompanied by a significant drop in clean accuracy. For all other networks, we choose $\epsilon_{\mathrm{train}}$ to match the evaluation $\epsilon$, which slightly improves adversarial robustness, as reported in [22].

Pre-trained networks
For the networks MLP-LP, MLP-SDP, and MLP-Adv, we use the trained parameters from [52], and for the networks MLP-Nor and MLP-Adv-B we use the trained parameters from [55].
Model Architectures

Each model architecture is associated with a prefix of the network name. Table 2 summarizes the CNN model architectures. The MLP models (MLP-Adv, MLP-LP/MLP-SDP, and MLP-B/MLP-Nor) are taken directly from [52, 55] and use fully-connected layers with ReLU activations, with the number of neurons per layer as specified in those works.
Model:        CNN-A            CNN-B
Architecture: CONV 16 4×4+2    CONV 32 5×5+2
              CONV 32 4×4+1    CONV 128 4×4+2
              FC 100           FC 250
              FC 10            FC 10

Table 2: Architecture of CNN models used on MNIST and CIFAR-10. Each layer (except the last fully connected layer) is followed by ReLU activations. CONV T W×H+S corresponds to a convolutional layer with T filters of size W×H with a stride of S in both dimensions. FC T corresponds to a fully connected layer with T output neurons.

B.2 Verifying VAEs

Architecture Details
We train a VAE on the MNIST dataset with the architecture detailed in Table 3.
Encoder    Decoder
FC 512     FC 1568
FC 512     CONV-T 32 3×3+2
FC 512     CONV-T 3×3+1
FC 16

Table 3: The VAE consists of an encoder and a decoder; the architecture details for both are provided here. CONV-T T W×H+S corresponds to a transpose convolutional layer with T filters of size W×H with a stride of S in both dimensions.

(a) Original and perturbed digit '9'. (b) Original and perturbed digit '0'.
Figure 3: Two digits from the MNIST data set, and the corresponding images when perturbed with Gaussian noise whose squared $\ell_2$-norm is equal to the threshold $\tau$. Here $\tau$ corresponds to the threshold on the reconstruction error used in equation (7).

Optimization details.
We perform subgradient descent using RMSProp with an initial learning rate of $10^{-3}$, which we anneal twice by a factor of 10. All experiments run on a single P100 GPU, and each verification instance takes under 7 hours to run.

Computing bounds on the reconstruction loss based on interval bound propagation
Interval bound propagation lets us compute bounds on the activations of the decoder, given bounded $\ell_2$ perturbations in the latent space of the VAE. Given a lower bound $\mathrm{lb}$ and an upper bound $\mathrm{ub}$ on the output of the decoder, we can compute an upper bound on the reconstruction error $\|s - \hat{s}\|_2^2$ over all valid latent perturbations as $\|\max\{|\mathrm{ub} - s|, |s - \mathrm{lb}|\}\|_2^2$, where $\max$ represents the element-wise maximum of the two vectors. We visualize images perturbed by noise corresponding to the threshold $\tau$ on the reconstruction error from Section 6.2 in Figure 3.

C Additional results
C.1 Detailed comparison to off-the-shelf solver

Setup
We isolate the impact of optimization by comparing performance to an off-the-shelf solver on the same SDP relaxation. For this experiment, we use the MLP-Adv network from [51], selecting quadratic constraints to attain a relaxation equivalent to [51]. We compare across 10 random examples, using the target label with the highest loss under a PGD attack, i.e. the target label closest to being misclassified. For each example, we measure $\Phi_{\text{PGD}}$, $\Phi_{\text{SDP-IP}}$, and $\Phi_{\text{SDP-FO}}$, as defined in Section 6.1. Since the interior-point method used by MOSEK can solve SDPs exactly for small-scale problems, this allows separating the looseness incurred by the relaxation from that incurred by optimization. In particular, $\Phi_{\text{SDP-IP}} - \Phi_{\text{PGD}}$ is the relaxation gap (plus any suboptimality of PGD), while $\Phi_{\text{SDP-FO}} - \Phi_{\text{SDP-IP}}$ is the optimization gap due to inexactness in the SDP-FO dual solution.
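As a concrete illustration of the PGD lower bound used above, the projected-gradient loop can be sketched as follows. This is a minimal NumPy sketch on a toy linear objective; the helper name `pgd_maximize` and the toy objective are hypothetical (the paper attacks the actual network loss, typically with many restarts):

```python
import numpy as np

def pgd_maximize(grad_fn, x0, eps, steps=50, lr=0.05):
    """Projected gradient ascent over the l_inf ball of radius eps around x0."""
    x = x0.copy()
    for _ in range(steps):
        x = x + lr * np.sign(grad_fn(x))      # signed-gradient ascent step
        x = np.clip(x, x0 - eps, x0 + eps)    # project back onto the l_inf ball
    return x

# Toy objective: maximize w^T x; the optimum lies at a corner of the ball.
w = np.array([1.0, -2.0, 0.5])
grad_fn = lambda x: w                          # gradient of w^T x is constant
x0 = np.zeros(3)
x_adv = pgd_maximize(grad_fn, x0, eps=0.1)
assert np.allclose(x_adv, 0.1 * np.sign(w))    # PGD reaches the corner
```

Any feasible point found this way gives a valid lower bound on the maximization objective, which is why $\Phi_{\text{PGD}}$ always lies below the verified upper bounds.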
Results

We observe that SDP-FO converges to a near-optimal dual solution in all 10 examples; this is shown in Figure 4. Quantitatively, over the 10 examples, the optimization gap $\Phi_{\text{SDP-FO}} - \Phi_{\text{SDP-IP}}$ is roughly an order of magnitude smaller than the relaxation gap $\Phi_{\text{SDP-IP}} - \Phi_{\text{PGD}}$. Thus, SDP-FO presents a significantly more scalable approach while sacrificing little in precision for this network.

Remark.
While small-scale problems can be solved exactly with second-order interior-point methods, these approaches have poor asymptotic scaling. In particular, both the SDP primal and dual problems involve matrix variables with a number of elements quadratic in the number of network activations $N$. Solving for the KKT stationarity conditions (e.g. via computing the Cholesky decomposition) then requires memory $O(N^4)$. At a high level, SDP-FO uses a first-order method to save a quadratic factor, and saves another quadratic factor through the use of iterative algorithms that avoid materializing the $M(\lambda)$ matrix. SDP-FO achieves $O(Nk)$ memory usage, where $k$ is the number of Lanczos iterations; in our experiments, we have found $k \ll N$ suffices for Lanczos convergence.

(a) MNIST, MLP-Adv
Figure 4: Comparison to off-the-shelf solver. For 10 examples on MNIST, we plot the verified upper bound on $\phi$ against the adversarial lower bound (using a single target label for each), comparing SDP-FO to the optimal SDP bound found with SDP-IP (using MOSEK). In all cases, the SDP-FO bound is very close to the SDP-IP bound, demonstrating that SDP-FO converges to a near-optimal dual solution. Note that in many cases, the scatter points for SDP-FO and SDP-IP directly overlap due to the small gap.

C.2 Investigation of relaxation tightness for MLP-Adv

Setup
The discussion in Appendix C.1 suggests that SDP-FO is a sufficiently reliable optimizer, so that the main remaining obstacle to tight verification is the tightness of the relaxation. In our main experiments, we use simple interval arithmetic [39, 23] for bound propagation, to match the relaxation in [51]. However, by using CROWN [70] for bound propagation, we can achieve a tighter relaxation.
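For context, the interval arithmetic referred to above propagates an axis-aligned box through each layer. Below is a minimal NumPy sketch for an affine layer followed by ReLU; the helper names are ours for illustration, not the paper's implementation:

```python
import numpy as np

def ibp_affine(lb, ub, W, b):
    # Propagate the box [lb, ub] through x -> W x + b using midpoint/radius form.
    mid, rad = (lb + ub) / 2.0, (ub - lb) / 2.0
    new_mid = W @ mid + b
    new_rad = np.abs(W) @ rad          # radius grows with |W|, independent of signs
    return new_mid - new_rad, new_mid + new_rad

def ibp_relu(lb, ub):
    # ReLU is monotone, so bounds propagate elementwise.
    return np.maximum(lb, 0.0), np.maximum(ub, 0.0)

rng = np.random.default_rng(0)
W, b = rng.standard_normal((3, 2)), rng.standard_normal(3)
lb, ub = np.array([-0.1, 0.2]), np.array([0.1, 0.4])
l1, u1 = ibp_relu(*ibp_affine(lb, ub, W, b))

# Soundness check: every point in the input box maps inside [l1, u1].
for _ in range(100):
    x = lb + (ub - lb) * rng.random(2)
    y = np.maximum(W @ x + b, 0.0)
    assert np.all(y >= l1 - 1e-9) and np.all(y <= u1 + 1e-9)
```

CROWN instead propagates linear lower and upper bounding functions through the network, which typically yields tighter output boxes than this interval rule.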
Results

Using CROWN bounds in place of interval arithmetic bounds substantially improves the overall verified accuracy, closing most of the gap to the PGD upper bound. For this model, while the SDP relaxation still yields meaningful bounds when provided very loose initial bounds, it benefits significantly from tighter initial bounds. More broadly, this suggests that SDP-FO provides a reliable optimizer which combines naturally with the development of tighter SDP relaxations.

C.3 Verifying Adversarial Robustness: Additional Results
Table 4 provides additional results on verifying adversarial robustness for different perturbation radii and training modes. Here, we consider perturbations and training modes not included in Table 1. We find that, across settings, SDP-FO outperforms the LP relaxation.

Table 4: Comparison of verified accuracy across various networks (CNN-A-Adv on MNIST and CIFAR-10, with several choices of $\epsilon_{\mathrm{train}}$) and perturbation radii, reporting nominal accuracy, PGD accuracy, and verified accuracy for SDP-FO (ours) and LP. All SDP-FO numbers are computed on the first 100 test set examples, and LP numbers on the first 1000 test set examples. The perturbations and training modes considered here differ from those in Table 1. For all networks, SDP-FO outperforms the LP-relaxation baseline.

C.4 Comparison between Lanczos and exact eigendecomposition
All final numbers we report use the minimum eigenvalue from an exact eigendecomposition (we use the eigh routine available in SciPy [62]). However, the exact decomposition is far too expensive to use during optimization. On all networks we studied, Lanczos provides a reliable surrogate while using dramatically less computation. For example, for CNN-A-Mix, the average gap between the exact and Lanczos dual bounds (the value of Equation (5) using the true $\lambda_{\min}$ compared to the Lanczos approximation of $\lambda_{\min}$) is small relative to the overall gap between the verified upper and adversarial lower bounds. We observed similarly reliable Lanczos performance across models, for both the image-classifier and VAE models in Sections 6.1 and 6.2. At the same time, Lanczos is dramatically faster than the exact eigendecomposition: on the order of seconds (using 200 Lanczos iterations) compared to minutes, and for the VAE model this gap is even larger.
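This trade-off can be reproduced in miniature with SciPy (a sketch on a random symmetric matrix rather than the actual $M(\lambda)$): `scipy.sparse.linalg.eigsh` runs Lanczos against a matrix-free `LinearOperator`, so only matrix-vector products are required, while `scipy.linalg.eigh` computes the exact decomposition.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(0)
n = 300
A = rng.standard_normal((n, n))
M = (A + A.T) / 2.0                     # symmetric test matrix

# Matrix-free operator: Lanczos only ever asks for matrix-vector products.
op = LinearOperator((n, n), matvec=lambda v: M @ v)

exact_min = eigh(M, eigvals_only=True)[0]                  # full decomposition
lanczos_min = eigsh(op, k=1, which='SA',                   # smallest algebraic
                    return_eigenvectors=False, tol=1e-9)[0]
assert abs(exact_min - lanczos_min) <= 1e-6 * max(1.0, abs(exact_min))
```

In SDP-FO, the product $v \mapsto M(\lambda)v$ is implemented via network forward/backward passes, so $M(\lambda)$ itself is never materialized.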