Monotone operator equilibrium networks
Ezra Winston J. Zico Kolter
School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
{ewinston, zkolter}@cs.cmu.edu

Abstract
Implicit-depth models such as Deep Equilibrium Networks have recently been shown to match or exceed the performance of traditional deep networks while being much more memory efficient. However, these models suffer from unstable convergence to a solution and lack guarantees that a solution exists. On the other hand, Neural ODEs, another class of implicit-depth models, do guarantee existence of a unique solution but perform poorly compared with traditional networks. In this paper, we develop a new class of implicit-depth model based on the theory of monotone operators, the Monotone Operator Equilibrium Network (MON). We show the close connection between finding the equilibrium point of an implicit network and solving a form of monotone operator splitting problem, which admits efficient solvers with guaranteed, stable convergence. We then develop a parameterization of the network which ensures that all operators remain monotone, which guarantees the existence of a unique equilibrium point. Finally, we show how to instantiate several versions of these models, and implement the resulting iterative solvers, for structured linear operators such as multi-scale convolutions. The resulting models vastly outperform the Neural ODE-based models while also being more computationally efficient. Code is available at http://github.com/locuslab/monotone_op_net.

Introduction

Recent work in deep learning has demonstrated the power of implicit-depth networks, models where features are created not by explicitly iterating some number of nonlinear layers, but by finding a solution to some implicitly defined equation. Instances of such models include the Neural ODE [8], which computes hidden layers as the solution to a continuous-time dynamical system, and the Deep Equilibrium (DEQ) Model [5], which finds a fixed point of a nonlinear dynamical system corresponding to an effectively infinite-depth weight-tied network. These models, which trace back to some of the original work on recurrent backpropagation [2, 23], have recently regained attention since they have been shown to match or even exceed the performance of traditional deep networks in domains such as sequence modeling [5]. At the same time, these models show drastically improved memory efficiency over traditional networks, since backpropagation is typically done analytically using the implicit function theorem, without needing to store the intermediate hidden layers.

However, implicit-depth models that perform well require extensive tuning in order to achieve stable convergence to a solution. Obtaining convergence in DEQs requires careful initialization and regularization, which has proven difficult in practice [21]. Moreover, solutions to these models are not guaranteed to exist or be unique, making the output of the models potentially ill-defined. While Neural ODEs [8] do guarantee existence of a unique solution, training remains unstable since the ODE problems can become severely ill-posed [10]. Augmented Neural ODEs [10] improve the stability of Neural ODEs by learning ODEs with simpler flows, but neither model achieves efficient
convergence nor performs well on standard benchmarks. Crucial questions remain about how models can have guaranteed, unique solutions, and what algorithms are most efficient at finding them.

In this paper, we present a new class of implicit-depth equilibrium model, the Monotone Operator Equilibrium Network (MON), which guarantees stable convergence to a unique fixed point. The model is based upon the theory of monotone operators [6, 26], and illustrates a close connection between simple fixed-point iteration in weight-tied networks and the solution to a particular form of monotone operator splitting problem. Using this connection, this paper lays the theoretical and practical foundations for such networks. We show how to parameterize networks in a manner that ensures all operators remain monotone, which establishes the existence and uniqueness of the equilibrium point. We show how to backpropagate through such networks using the implicit function theorem; this leads to a corresponding (linear) operator splitting problem for the backward pass, which also is guaranteed to have a unique solution. We then adapt traditional operator splitting methods, such as forward-backward splitting or Peaceman-Rachford splitting, to naturally derive algorithms for efficiently computing these equilibrium points.

Finally, we demonstrate how to practically implement such models and operator splitting methods, in the cases of typical feedforward, fully convolutional, and multi-scale convolutional networks. For convolutional networks, the most efficient fixed-point solution methods require an inversion of the associated linear operator, and we illustrate how to achieve this using the fast Fourier transform. The resulting networks show strong performance on several benchmark tasks, vastly improving upon the accuracy and efficiency of Neural ODE-based models, the other implicit-depth models where solutions are guaranteed to exist and be unique.

Related work
Implicit models in deep learning
There has been a growing interest in recent years in implicit layers in deep learning. Instead of specifying the explicit computation to perform, a layer specifies some condition that should hold at the solution to the layer, such as a nonlinear equality, or a differential equation solution. Using the implicit function theorem allows for backpropagating through the layer solutions analytically, making these layers very memory efficient, as they do not need to maintain intermediate iterations of the solution procedure. Recent examples include layers that compute inference in graphical models [15], solve optimization problems [12, 3, 13, 1], execute model-based control policies [4], solve two-player games [20], solve gradient-based optimization for meta-learning [24], and many others.
Stability of fixed-point models
The issue of model stability has in fact been at the heart of much work in fixed-point models. The original work on attractor-style recurrent models, trained via recurrent backpropagation [2, 23], precisely attempted to ensure that the forward iteration procedure was stable. And indeed, much of the work in recurrent architectures such as LSTMs has focused on these issues of stability [14]. Recent work has revisited recurrent backpropagation in a similar manner to DEQs, with the similar aim of speeding up the computation of fixed points [19]. Other work has looked at the stability of implicit models [11], with an emphasis on guaranteeing the existence of fixed points, but focused on alternative stability conditions and considered only relatively small-scale experiments. Other recent work has looked to use control-theoretic methods to ensure the stability of implicit models [25], though again they consider only small-scale evaluations.
Monotone operators in deep learning
Although most work in the field of monotone operators is concerned with general convex analysis, the recent work of [9] does also highlight connections between deep networks and monotone operator problems. Unlike our current work, however, that work focused largely on the fact that many common nonlinearities can be expressed via proximal operators, and analyzed traditional networks under the assumption that certain of the operators were monotone, but did not address conditions for the networks to be monotone or algorithms for solving or backpropagating through the networks.
This section lays out our main methodological and theoretical contribution, a class of equilibrium networks based upon monotone operators. We begin with some preliminaries, then highlight the connection between the fixed point of a weight-tied network and an associated operator splitting problem, before developing the MON parameterization and the operator splitting algorithms used for its forward and backward passes. (We largely use the terms "fixed point" and "equilibrium point" interchangeably in this work, using fixed point in the context of an iterative procedure, and equilibrium point to refer more broadly to the point itself.)
Monotone operator theory

The theory of monotone operators plays a foundational role in convex analysis and optimization. Monotone operators are a natural generalization of monotone functions, which can be used to assess the convergence properties of many forms of iterative fixed-point algorithms. We emphasize that the majority of the work in this paper relies on well-known properties of monotone operators, and we refer to standard references on the topic including [6] and a less formal survey by [26]; we do include a brief recap of the definitions and results we require in Appendix A. Formally, an operator is a subset of the space $A \subseteq \mathbb{R}^n \times \mathbb{R}^n$; in our setting this will usually correspond to a set-valued or single-valued function. Operator splitting approaches refer to methods for finding a zero in a sum of operators, i.e., find $x$ such that $0 \in (A + B)(x)$. There are many such methods, but the two we will use mainly in this work are forward-backward splitting (eqn. A9 in the Appendix) and Peaceman-Rachford splitting (eqn. A10). As we will see, both finding a network equilibrium point and backpropagating through it can be formulated as operator splitting problems, and different operator splitting methods will lead to different approaches in their application to our subsequent implicit networks.
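As a concrete illustration of this splitting viewpoint, the following is a small self-contained sketch (our own illustrative example, not taken from the paper) of forward-backward splitting applied to a toy inclusion $0 \in (A + B)(x)$, with $A(x) = Gx + h$ strongly monotone and $B$ the subdifferential of the indicator of the nonnegative orthant, whose resolvent is simply clipping at zero:

```python
import numpy as np

np.random.seed(0)
n, m = 50, 1.0

# A(x) = G x + h, with G + G^T >= 2 m I by construction (strongly monotone).
P = np.random.randn(n, n) / np.sqrt(n)
R = np.random.randn(n, n) / np.sqrt(n)
G = m * np.eye(n) + P.T @ P + (R - R.T)
h = np.random.randn(n)

# B = subdifferential of the indicator of {x >= 0}; its resolvent is projection onto the orthant.
prox = lambda v: np.maximum(v, 0)

L = np.linalg.norm(G, 2)          # Lipschitz constant of A
alpha = m / L ** 2                # safely inside the convergence range for forward-backward

x = np.zeros(n)
for _ in range(5000):             # forward-backward splitting: x <- R_B(x - alpha * A(x))
    x = prox(x - alpha * (G @ x + h))

# x is (approximately) a zero of A + B: it is a fixed point of the damped update.
print(np.linalg.norm(x - prox(x - alpha * (G @ x + h))))
```

The same scaffolding carries over directly to the network setting below, where the linear part of the splitting comes from the network weights and the proximal part from the nonlinearity.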
Deep equilibrium models
The MON architecture is closely related to the DEQ model, which parameterizes a "weight-tied, input-injected" network of the form $z^{i+1} = g(z^i, x)$, where $x$ denotes the input to the network, injected at each layer; $z^i$ denotes the hidden layer at depth $i$; and $g$ denotes a nonlinear function which is the same for each layer (hence the network is weight-tied). The key aspect of the DEQ model is that in this weight-tied setting, instead of forward iteration, we can simply use any root-finding approach to find an equilibrium point of such an iteration $z^\star = g(z^\star, x)$. Assuming the model is stable, this equilibrium point corresponds to an "infinite-depth fixed point" of the layer. The MON architecture can be viewed as an instance of a DEQ model, but one that relies on the theory of monotone operators, and a specific parameterization of the network weights, to guarantee the existence of a unique fixed point for the network. Crucially, however, as is the case for DEQs, naive forward iteration of this model is not necessarily stable; we therefore employ operator splitting methods to develop provably (linearly) convergent methods for finding such fixed points.

As a starting point of our analysis, consider the weight-tied, input-injected network in which $x \in \mathbb{R}^d$ denotes the input and $z^k \in \mathbb{R}^n$ denotes the hidden units at layer $k$, given by the iteration
$$z^{k+1} = \sigma(W z^k + U x + b) \tag{1}$$
where $\sigma : \mathbb{R} \to \mathbb{R}$ is a nonlinearity applied elementwise, $W \in \mathbb{R}^{n \times n}$ are the hidden unit weights, $U \in \mathbb{R}^{n \times d}$ are the input-injection weights, and $b \in \mathbb{R}^n$ is a bias term. (This setting can also be seen as corresponding to a recurrent network with identical inputs at each time step; indeed, this is the view of so-called attractor networks [23]. However, because in modern usage recurrent networks typically refer to sequential models with different inputs at each time step, we don't adopt this terminology.) An equilibrium point, or fixed point, of this system is some point $z^\star$ which remains constant after an update:
$$z^\star = \sigma(W z^\star + U x + b). \tag{2}$$
We begin by observing that it is possible to characterize this equilibrium point exactly as the solution to a certain operator splitting problem, under certain choices of operators and activation $\sigma$. This can be formalized in the following theorem:
Theorem 1. Finding a fixed point of the iteration (1) is equivalent to finding a zero of the operator splitting problem $0 \in (A + B)(z^\star)$ with the operators
$$A(z) = (I - W)(z) - (Ux + b), \qquad B = \partial f \tag{3}$$
and $\sigma(\cdot) = \operatorname{prox}^1_f(\cdot)$ for some function $f$, where $\operatorname{prox}^\alpha_f$ denotes the proximal operator
$$\operatorname{prox}^\alpha_f(x) \equiv \operatorname*{argmin}_z \tfrac{1}{2}\|x - z\|_2^2 + \alpha f(z). \tag{4}$$

Proof. The proof here is immediate: the forward-backward algorithm applied to the above operators with $\alpha = 1$ corresponds exactly to the network's fixed-point iteration:
$$z^{k+1} = R_B(z^k - \alpha A(z^k)) = \operatorname{prox}^\alpha_f\big(z^k - \alpha(I - W)z^k + \alpha(Ux + b)\big) = \operatorname{prox}^1_f(W z^k + Ux + b).$$

It is also well-established that many common nonlinearities used in deep networks can be represented as proximal operators [7, 9]. For example, the ReLU nonlinearity $\sigma(x) = [x]_+$ is equal to the proximal operator of the indicator of the positive orthant $f(x) = I\{x \geq 0\}$, and tanh, sigmoid, and softplus all have close correspondence with proximal operators of simple expressions [7].

In fact, this view establishes that some seemingly unstable iterations can actually still lead to convergent algorithms. ReLU activations, for instance, have traditionally been avoided in iterative models such as recurrent networks, due to exploding or vanishing gradient problems and nonsmoothness. Yet this iteration shows that (with input injection and the constraint on $W$ discussed below), ReLU operators are perfectly well-suited to these fixed-point iterations.

The above connection is straightforward, but also carries interesting implications for deep learning. Specifically, we can establish the existence and uniqueness of the equilibrium point $z^\star$ via the simple sufficient criterion that $I - W$ is strongly monotone, or in other words $I - W \succeq mI$ for some $m > 0$ (see Appendix A). This constraint is by no means a trivial condition. Although many layers obey this condition under typical initialization schemes, during training it is normal for $W$ to move outside this regime. Thus, the first step of the MON architecture is to parameterize $W$ in such a way that it always satisfies this strong monotonicity constraint.
Proposition 1. We have $I - W \succeq mI$ if and only if there exist $A, B \in \mathbb{R}^{n \times n}$ such that
$$W = (1 - m)I - A^T A + B - B^T. \tag{5}$$
Proof. First assume $W$ is of this form. Then clearly
$$\frac{(I - W)}{2} + \frac{(I - W)^T}{2} = mI + A^T A \succeq mI. \tag{6}$$
Alternatively, if $I - W \succeq mI \iff (1 - m)I \succeq (W + W^T)/2$, then there exists some $A$ such that
$$(W + W^T)/2 = (1 - m)I - A^T A. \tag{7}$$
Thus
$$W = \frac{W + W^T}{2} + \frac{W - W^T}{2} = (1 - m)I - A^T A + B - B^T,$$
taking, e.g., $B = (W - W^T)/4$ so that $B - B^T = (W - W^T)/2$.
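As a quick numerical sanity check of Proposition 1 (our own sketch, not from the paper), the following constructs $W$ from arbitrary $A$ and $B$ as in (5) and verifies that the symmetric part of $I - W$ has smallest eigenvalue at least $m$:

```python
import numpy as np

np.random.seed(0)
n, m = 100, 0.5

A = np.random.randn(n, n)
B = np.random.randn(n, n)
W = (1 - m) * np.eye(n) - A.T @ A + B - B.T          # eq. (5)

# I - W >= m I, where positive definiteness of the non-symmetric matrix is
# defined through its symmetric part (see the note below).
sym_part = np.eye(n) - (W + W.T) / 2
print(np.linalg.eigvalsh(sym_part).min())            # >= m (up to round-off), for any A, B
```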
We therefore propose to simply parameterize $W$ directly in this form, by defining the $A$ and $B$ matrices directly. While this is an overparameterized form for a dense matrix, we could avoid this issue by, e.g., constraining $A$ to be lower triangular (making it the Cholesky factor of $A^T A$) and by making $B$ strictly upper triangular; in practice, however, simply using general $A$ and $B$ matrices has little impact upon the performance of the method. The parameterization does notably raise additional complications when dealing with convolutional layers, but we defer this discussion to Section 4.2.

Given the MON formulation, the first natural question to ask is: how should we compute the equilibrium point $z^\star = \sigma(W z^\star + Ux + b)$? Crucially, it can be the case that the simple forward iteration of the network equation (1) does not converge, i.e., the iteration may be unstable. Fortunately, monotone operator splitting leads to a number of iterative methods for finding these fixed points, which are guaranteed to converge under proper conditions. (For non-symmetric matrices, which of course is typically the case with $W$, positive definiteness is defined via the symmetric component: $I - W \succeq mI \Leftrightarrow I - (W + W^T)/2 \succeq mI$.)

Algorithm 1: Forward-backward equilibrium solving
    z := 0; err := 1
    while err > ε:
        z_+ := (1 − α) z + α (W z + U x + b)
        z_+ := σ(z_+)
        err := ‖z_+ − z‖₂ / ‖z_+‖₂
        z := z_+
    return z
Algorithm 2: Peaceman-Rachford equilibrium solving
    z, u := 0; err := 1; V := (I + α(I − W))⁻¹
    while err > ε:
        u_{1/2} := 2z − u
        z_{1/2} := V (u_{1/2} + α(Ux + b))
        u_+ := 2 z_{1/2} − u_{1/2}
        z_+ := σ(u_+)
        err := ‖z_+ − z‖₂ / ‖z_+‖₂
        z, u := z_+, u_+
    return z

For example, the forward-backward iteration applied to the monotone operator formulation from Theorem 1 results exactly in a damped version of the forward iteration
$$z^{k+1} = \sigma\big(z^k - \alpha((I - W)z^k - (Ux + b))\big) = \sigma\big((1 - \alpha)z^k + \alpha(W z^k + Ux + b)\big). \tag{8}$$
This iteration is guaranteed to converge linearly to the fixed point $z^\star$ provided that $\alpha \leq 2m/L^2$, where the operator $I - W$ is Lipschitz and strongly monotone with parameters $L$ (which is simply the operator norm $\|I - W\|_2$) and $m$ [26].

A key advantage of the MON formulation is the flexibility to employ alternative operator splitting methods that converge much more quickly to the equilibrium. One such example is Peaceman-Rachford splitting which, when applied to the formulation from Theorem 1, takes the form
$$\begin{aligned} u^{k+1/2} &= 2z^k - u^k \\ z^{k+1/2} &= (I + \alpha(I - W))^{-1}\big(u^{k+1/2} + \alpha(Ux + b)\big) \\ u^{k+1} &= 2z^{k+1/2} - u^{k+1/2} \\ z^{k+1} &= \operatorname{prox}^\alpha_f(u^{k+1}) \end{aligned} \tag{9}$$
where we use the explicit form of the resolvents for the two monotone operators of the model. The advantage of Peaceman-Rachford splitting over forward-backward splitting is two-fold: 1) it typically converges in fewer iterations, which is a key bottleneck for many implicit models; and 2) it converges for any $\alpha > 0$ [26], unlike forward-backward splitting, which depends on the Lipschitz constant of $I - W$. The disadvantage of Peaceman-Rachford splitting, however, is that it requires an inverse involving the weight matrix $W$. It is not immediately clear how to apply such methods if the $W$ matrix involves convolutions or multi-layer models; we discuss these points in Section 4.2. A summary of these methods for computation of the forward equilibrium point is given in Algorithms 1 and 2.

Finally, a key challenge for any implicit model is to determine how to perform backpropagation through the layer. As with most implicit models, a potential benefit of the fixed-point conditions we describe is that, by using the implicit function theorem, it is possible to perform backpropagation without storing the intermediate iterates of the operator splitting algorithm in memory, and instead backpropagating directly through the equilibrium point.

To begin, we present a standard approach to differentiating through the fixed point $z^\star$ using the implicit function theorem. This formulation has some compelling properties for MON, namely the fact that this (sub)gradient will always exist. When training a network via gradient descent, we need to compute the gradients of the loss function
$$\frac{\partial \ell}{\partial (\cdot)} = \frac{\partial \ell}{\partial z^\star}\frac{\partial z^\star}{\partial (\cdot)} \tag{10}$$
where $(\cdot)$ denotes some input to the layer or its parameters, i.e., $W$, $x$, etc. The challenge here is computing (or left-multiplying by) the Jacobian $\partial z^\star / \partial(\cdot)$, since $z^\star$ is not an explicit function of the inputs. While it would be possible to simply compute gradients through the "unrolled" updates, e.g., $z^{k+1} = \sigma(W z^k + Ux + b)$ for forward iteration, this would require storing each intermediate state $z^k$, a potentially memory-intensive operation. Instead, the following theorem gives an explicit formula for the necessary (sub)gradients. We state the theorem more directly in terms of the operators mentioned in Theorem 1; that is, we use $\operatorname{prox}^1_f(\cdot)$ in place of $\sigma(\cdot)$.
Theorem 2. For the equilibrium point $z^\star = \operatorname{prox}^1_f(W z^\star + Ux + b)$, we have
$$\frac{\partial \ell}{\partial (\cdot)} = \frac{\partial \ell}{\partial z^\star}(I - JW)^{-1} J \frac{\partial (W z^\star + Ux + b)}{\partial (\cdot)} \tag{11}$$
where
$$J = D\operatorname{prox}^1_f(W z^\star + Ux + b) \tag{12}$$
denotes the Clarke generalized Jacobian of the nonlinearity evaluated at the point $W z^\star + Ux + b$. Furthermore, for the case that $(I - W) \succeq mI$, this derivative always exists.

Proof. Differentiating both sides of the fixed-point equation $z^\star = \sigma(W z^\star + Ux + b)$ we have
$$\frac{\partial z^\star}{\partial (\cdot)} = \frac{\partial \operatorname{prox}^1_f(W z^\star + Ux + b)}{\partial (\cdot)} = J\left(W\frac{\partial z^\star}{\partial (\cdot)} + \frac{\partial (W z^\star + Ux + b)}{\partial (\cdot)}\right) \tag{13}$$
for $J$ defined in (12) (we require the Clarke generalized Jacobian owing to the fact that the nonlinearity need not be a smooth function). Rearranging, we get
$$(I - JW)\frac{\partial z^\star}{\partial (\cdot)} = J\frac{\partial (W z^\star + Ux + b)}{\partial (\cdot)} \;\Leftrightarrow\; \frac{\partial z^\star}{\partial (\cdot)} = (I - JW)^{-1} J\frac{\partial (W z^\star + Ux + b)}{\partial (\cdot)}. \tag{14}$$
To show that this derivative always exists, we need to show that the $I - JW$ matrix is nonsingular. Owing to the fact that proximal operators are monotone and non-expansive, we have $0 \leq J_{ii} \leq 1$. First, letting $\lambda(\cdot)$ denote the set of eigenvalues of a matrix, note that
$$\lambda(I - JW) = \lambda(I - J^{1/2} W J^{1/2}). \tag{15}$$
This follows from the similarity transform $\lambda(I - JW) = \lambda(J^{-1/2}(I - JW)J^{1/2})$ for $J > 0$, and the case of $J_{ii} = 0$ follows via the continuity of eigenvalues, taking the limit $J_{ii} \to 0$. Now, using the fact that $0 \preceq J \preceq I$, we have
$$\operatorname{Re}\lambda(I - J^{1/2} W J^{1/2}) = \operatorname{Re}\lambda(I - J + J^{1/2}(I - W)J^{1/2}) > 0 \tag{16}$$
since $I - W \succeq mI$ and $I - J \succeq 0$.

To apply the theorem in practice to perform reverse-mode differentiation, we need to solve the system
$$(I - JW)^{-T}\left(\frac{\partial \ell}{\partial z^\star}\right)^T. \tag{17}$$
The above system is a linear equation, and while it is typically computationally infeasible to compute the inverse $(I - JW)^{-T}$ exactly, we could compute a solution to $(I - JW)^{-T}v$ using, e.g., conjugate gradient methods. However, we present an alternative formulation for computing (17) as the solution to a (linear) monotone operator splitting problem:
Theorem 3. Let $z^\star$ be a solution to the monotone operator splitting problem defined in Theorem 1, and define $J$ as in (12). Then for $v \in \mathbb{R}^n$ the solution of the equation
$$u^\star = (I - JW)^{-T}v \tag{18}$$
is given by
$$u^\star = v + W^T\tilde{u}^\star \tag{19}$$
where $\tilde{u}^\star$ is a zero of the operator splitting problem $0 \in (\tilde{A} + \tilde{B})(\tilde{u}^\star)$, with operators defined as
$$\tilde{A}(\tilde{u}) = (I - W^T)(\tilde{u}), \qquad \tilde{B}(\tilde{u}) = D\tilde{u} - v \tag{20}$$
where $D$ is a diagonal matrix defined by $J = (I + D)^{-1}$ (where we allow for the possibility of $D_{ii} = \infty$ for $J_{ii} = 0$).

Proof. We begin with the case where $J_{ii} \neq 0$ and thus $D_{ii} < \infty$. As above, because proximal operators are themselves monotone non-expansive operators, we always have $0 \leq J_{ii} \leq 1$, so that $D_{ii} \geq 0$.
Now, first assuming that $J_{ii} > 0$, and hence $D_{ii} < \infty$, we have
$$\begin{aligned}
u = (I - JW)^{-T}v
&\;\Leftrightarrow\; (I - W^T(I + D)^{-1})u = v \\
&\;\Leftrightarrow\; W^{-T}u - (I + D)^{-1}u = W^{-T}v \\
&\;\Leftrightarrow\; (I + D)W^{-T}u - u = (I + D)W^{-T}v \\
&\;\Leftrightarrow\; W^{-T}u - u + DW^{-T}u = (I + D)W^{-T}v \\
&\;\Leftrightarrow\; \tilde{u} - W^T\tilde{u} + D\tilde{u} = (I + D)W^{-T}v
\end{aligned} \tag{21}$$
where we define $\tilde{u} = W^{-T}u$. To simplify the right hand side of this equation and remove the explicit $W^{-T}v$ terms, we note that
$$(I - JW)^{-T} = (I - W^TJ)^{-1} = I + (I - W^TJ)^{-1}W^TJ. \tag{22}$$
Thus, we can always solve the above equation with the $v$ term of the form $W^TJv$, giving
$$(I + D)W^{-T}W^TJv = (I + D)Jv = v. \tag{23}$$
This gives us a (linear) operator splitting problem with the $\tilde{A}$ and $\tilde{B}$ operators given in (20). (Although we could solve this operator splitting problem directly, the presence of the $W^{-T}v$ term has two notable downsides: 1) even if the $W$ matrix itself is nonsingular, it may be arbitrarily close to a singular matrix, thus making direct solutions with this matrix introduce substantial numerical errors; and 2) for operator splitting methods that do not require an inverse of $W$ (e.g., forward-backward splitting), it would be undesirable to require an explicit inverse in the backward pass.)

To handle the case where $J_{ii} = 0 \Leftrightarrow D_{ii} = \infty$, we can simply take the limit $D_{ii} \to \infty$, and note that all the operators are well-defined for this case. For instance, the resolvent operator
$$R_{\tilde{B}}(u) = (I + \alpha D)^{-1}(u + \alpha v) \tag{24}$$
and thus
$$R_{\tilde{B}}(u)_i = \frac{u_i + \alpha v_i}{1 + \alpha D_{ii}} \to 0 \tag{25}$$
as $D_{ii} \to \infty$.

Finally, owing to the fact that $I - W^T \succeq mI$ and $D_{ii} \geq 0$, the $\tilde{A}$ and $\tilde{B}$ operators are strongly monotone and monotone respectively, so we conclude that operator splitting techniques applied to the problem will be guaranteed to converge.

An advantage of this approach when using Peaceman-Rachford splitting is that it allows us to reuse a fast method for multiplying by $(I + \alpha(I - W))^{-1}$, which is required by Peaceman-Rachford splitting during both the forward pass (equilibrium solving) and backward pass (backpropagation) of training a MON. Algorithms detailing both the Peaceman-Rachford and forward-backward solvers for the backpropagation problem (20) are given in Algorithms 3 and 4.

Algorithm 3: Forward-backward equilibrium backpropagation
    u := 0; err := 1; v := ∂ℓ/∂z⋆
    while err > ε:
        u_+ := (1 − α)u + αWᵀu
        u_{+,i} := (u_{+,i} + αv_i)/(1 + αD_ii) if D_ii < ∞, else 0
        err := ‖u_+ − u‖₂ / ‖u_+‖₂
        u := u_+
    return u

Algorithm 4: Peaceman-Rachford equilibrium backpropagation
    z, u := 0; err := 1; v := ∂ℓ/∂z⋆; V := (I + α(I − W))⁻¹
    while err > ε:
        u_{1/2} := 2z − u
        z_{1/2} := Vᵀ u_{1/2}
        u_+ := 2 z_{1/2} − u_{1/2}
        z_{+,i} := (u_{+,i} + αv_i)/(1 + αD_ii) if D_ii < ∞, else 0
        err := ‖z_+ − z‖₂ / ‖z_+‖₂
        z, u := z_+, u_+
    return z
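To make Algorithms 1–4 concrete for the dense case, here is a compact NumPy sketch (our own illustration; the released code linked above differs in structure and detail). It builds $W$ as in (5), finds the equilibrium with Peaceman-Rachford iteration (Algorithm 2) using a ReLU as the proximal operator, and computes the backward quantity $(I - JW)^{-T}v$ from Theorem 2 with a direct linear solve, which for small problems is equivalent to running Algorithm 3 or 4 to convergence:

```python
import numpy as np

def build_W(A, B, m=1.0):
    """W = (1 - m) I - A^T A + B - B^T, so that I - W >= m I (Proposition 1)."""
    n = A.shape[0]
    return (1 - m) * np.eye(n) - A.T @ A + B - B.T

def mon_forward_pr(W, U, b, x, alpha=1.0, tol=1e-6, max_iter=1000):
    """Peaceman-Rachford equilibrium solving (Algorithm 2) with sigma = ReLU."""
    n = W.shape[0]
    V = np.linalg.inv(np.eye(n) + alpha * (np.eye(n) - W))   # resolvent of the linear operator
    bias = U @ x + b
    z, u = np.zeros(n), np.zeros(n)
    for _ in range(max_iter):
        u_half = 2 * z - u
        z_half = V @ (u_half + alpha * bias)
        u_new = 2 * z_half - u_half
        z_new = np.maximum(u_new, 0)                          # prox of the ReLU
        if np.linalg.norm(z_new - z) <= tol * (np.linalg.norm(z_new) + 1e-12):
            return z_new
        z, u = z_new, u_new
    return z

def mon_backward(W, pre_act, grad_z):
    """Gradient w.r.t. the pre-activation W z* + U x + b via Theorem 2 (J = ReLU subgradient)."""
    J = (pre_act > 0).astype(float)
    u = np.linalg.solve((np.eye(len(grad_z)) - J[:, None] * W).T, grad_z)  # (I - J W)^{-T} grad
    return J * u   # gradients w.r.t. W, U, b, x then follow by the chain rule

np.random.seed(0)
n, d = 30, 10
A, B = np.random.randn(n, n) / n ** 0.5, np.random.randn(n, n) / n ** 0.5
W = build_W(A, B)
U, b, x = np.random.randn(n, d), np.random.randn(n), np.random.randn(d)

z_star = mon_forward_pr(W, U, b, x)
print(np.linalg.norm(z_star - np.maximum(W @ z_star + U @ x + b, 0)))  # ~0: a true fixed point

grad_pre = mon_backward(W, W @ z_star + U @ x + b, np.random.randn(n))
```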
Example monotone operator networks

With the basic foundations from the previous section, we now highlight several different instantiations of the MON architecture. In each of these settings, as in Theorem 1, we will formulate the objective as one of finding a solution to the operator splitting problem $0 \in (A + B)(z^\star)$ for
$$A(z) = (I - W)(z) - (Ux + b), \qquad B = \partial f \tag{26}$$
or equivalently as computing an equilibrium point $z^\star = \operatorname{prox}^1_f(W z^\star + Ux + b)$. In each of these settings we need to define what the input and hidden state $x$ and $z$ correspond to, what the $W$ and $U$ operators consist of, and what the function $f$ is which determines the network nonlinearity. Key to the application of monotone operator methods are that 1) we need to constrain the $W$ matrix such that $I - W \succeq mI$, as described in the previous section, and 2) we need a method to compute (or solve) the inverse $(I + \alpha(I - W))^{-1}$, needed e.g. for Peaceman-Rachford splitting; while this would not be needed if using only forward-backward splitting, we believe that the full power of the monotone operator view is realized precisely when these more involved methods are possible.

The simplest setting, of course, is the case we have largely highlighted above, where $x \in \mathbb{R}^d$ and $z \in \mathbb{R}^n$ are unstructured vectors, and $W \in \mathbb{R}^{n \times n}$, $U \in \mathbb{R}^{n \times d}$, and $b \in \mathbb{R}^n$ are dense matrices and vectors respectively. As indicated above, we parameterize $W$ directly by $A, B \in \mathbb{R}^{n \times n}$ as in (5). Since the $Ux$ term simply acts as a bias in the iteration, there is no constraint on the form of $U$. We can form an inverse directly by simply forming and inverting the matrix $I + \alpha(I - W)$, which has cost $O(n^3)$. Note that this inverse needs to be formed only once, and can be reused over all iterations of the operator splitting method and over an entire batch of examples (but recomputed, of course, when $W$ changes). Any proximal function can be used as the activation: for example the ReLU, though as mentioned there are also close approximations to the sigmoid, tanh, and softplus.

The real power of MONs comes with the ability to use more structured linear operators such as convolutions. We let $x \in \mathbb{R}^{ds^2}$ be a $d$-channel input of size $s \times s$ and $z \in \mathbb{R}^{ns^2}$ be an $n$-channel hidden layer. We also let $W \in \mathbb{R}^{ns^2 \times ns^2}$ denote the linear form of a 2D convolutional operator, and similarly for $U \in \mathbb{R}^{ns^2 \times ds^2}$. As above, $W$ is parameterized by two additional convolutional operators $A, B$ of the same form as $W$. Note that this implicitly increases the receptive field size of $W$: if $A$ and $B$ are $3 \times 3$ convolutions, then $W = (1 - m)I - A^TA + B - B^T$ will have an effective kernel size of 5.
Inversion

The benefit of convolutional operators in this setting is the ability to perform efficient inversion via the fast Fourier transform. Specifically, in the case that $A$ and $B$ represent circular convolutions, we can reduce the matrices to block-diagonal form via the discrete Fourier transform (DFT) matrix
$$A = F_s D_A F_s^* \tag{27}$$
where $F_s$ denotes (a permuted form of) the 2D DFT operator and $D_A \in \mathbb{C}^{ns^2 \times ns^2}$ is a (complex) block-diagonal matrix where each block $D_{A_{ii}} \in \mathbb{C}^{n \times n}$ corresponds to the DFT at one particular location in the image. In this form, we can efficiently multiply by the inverse of the convolutional operator, noting that
$$I + \alpha(I - W) = (1 + \alpha m)I + \alpha A^TA - \alpha B + \alpha B^T = F_s\big((1 + \alpha m)I + \alpha D_A^* D_A - \alpha D_B + \alpha D_B^*\big)F_s^*. \tag{28}$$
The inner term here is itself a block-diagonal matrix with complex $n \times n$ blocks (each block is also guaranteed to be invertible by the same logic as for the full matrix). Thus, we can multiply a set of hidden units $z$ by the inverse of this matrix by simply inverting each $n \times n$ block, taking the fast Fourier transform (FFT) of $z$, multiplying each corresponding block of $F_s z$ by the corresponding inverse, then taking the inverse FFT. The details are given in Appendix B.

The computational cost of multiplying by this inverse is $O(n^2 s^2 \log s + n^3 s^2)$ to compute the FFT of each convolutional filter and precompute the inverses, and then $O(bns^2 \log s + bn^2 s^2)$ to multiply by the inverses for a set of hidden units with a minibatch of size $b$. Note that just computing the convolutions in a normal manner has cost $O(bn^2 s^2)$, so that these computations are on the same order as performing typical forward passes through a network, though empirically 2-3 times slower owing to the relative complexity of performing the necessary FFTs.
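As an illustration of this Fourier-domain inverse (our own sketch, restricted to a single input/output channel so the $n \times n$ blocks reduce to scalars), the following applies $(I + \alpha(I - W))$ and its inverse entirely via FFTs and checks that they compose to the identity:

```python
import numpy as np

s, m, alpha = 16, 1.0, 1.0
rng = np.random.default_rng(0)

# Single-channel circular 3x3 convolutions A and B, represented by their
# Fourier symbols (the FFT of the kernel embedded in an s x s image).
def symbol(kernel):
    k = np.zeros((s, s))
    k[:3, :3] = kernel
    return np.fft.fft2(k)

a_hat = symbol(rng.standard_normal((3, 3)))
b_hat = symbol(rng.standard_normal((3, 3)))

def conv(z, sym):                          # circular convolution applied via the DFT
    return np.fft.ifft2(np.fft.fft2(z) * sym).real

def op(z):                                 # (I + alpha*(I - W)) z,  W = (1-m)I - A^T A + B - B^T
    AtAz = conv(conv(z, a_hat), np.conj(a_hat))
    return (1 + alpha * m) * z + alpha * (AtAz - conv(z, b_hat) + conv(z, np.conj(b_hat)))

def op_inverse(y):                         # Fourier-domain inverse, eq. (28) with one channel
    d = (1 + alpha * m) + alpha * (np.abs(a_hat) ** 2 - b_hat + np.conj(b_hat))
    return np.fft.ifft2(np.fft.fft2(y) / d).real

z = rng.standard_normal((s, s))
print(np.max(np.abs(op_inverse(op(z)) - z)))   # ~1e-15: an exact inverse
```

In the multi-channel case the scalar symbol d becomes an n × n block per spatial frequency, which is inverted once and reused across the batch, exactly as described above.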
Zero padding

One drawback to the above method is that using the FFT in this manner requires that all convolutions be circular. While empirically there is little drawback to simply replacing traditional convolutions with their circular variants, in some cases it may be desirable to avoid this setting, where information about the image may wrap around the borders. If it is desirable to avoid this, we explicitly remove any circular dependence by zero-padding the hidden units with $(k-1)/2$ border pixels, where $k$ denotes the receptive field size of the convolution. This zero padding can then be enforced by simply setting all the border entries to zero within the nonlinearity of the network; because setting an element to zero is equivalent to the proximal operator for the indicator of the zero set, such operations still fit within the monotone operator setting.

Although a single fully-connected or convolutional operator within a MON can be suitable for small-scale problems, in typical deep learning settings it is common to model hidden units at different hierarchical levels. While MONs may seem inherently "single-layer," we can model this same hierarchical structure by incorporating structure into the $W$ matrix. For example, assuming a convolutional setting, with input $x \in \mathbb{R}^{ds^2}$ as in the previous section, we could partition $z$ into $L$ different hierarchical resolutions and let $W$ have a multi-tier structure, e.g.
$$z = \begin{bmatrix} z_1 \in \mathbb{R}^{n_1 s_1^2} \\ z_2 \in \mathbb{R}^{n_2 s_2^2} \\ \vdots \\ z_L \in \mathbb{R}^{n_L s_L^2} \end{bmatrix}, \qquad
W = \begin{bmatrix} W_{11} & 0 & \cdots & 0 \\ W_{21} & W_{22} & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & W_{L,L-1} & W_{LL} \end{bmatrix}$$
where $z_i$ denotes the hidden units at level $i$, an $s_i \times s_i$ resolution hidden unit with $n_i$ channels; $W_{ii}$ denotes an $n_i$-channel to $n_i$-channel convolution; and $W_{i+1,i}$ denotes an $n_i$-channel to $n_{i+1}$-channel, strided convolution. This structure of $W$ allows for both inter- and intra-tier influence. One challenge is to ensure that we can represent $W$ with the form $(1 - m)I - A^TA + B - B^T$ while still maintaining the above structure, which we achieve by parameterizing each $W_{ij}$ block appropriately. Another consideration is the inversion of the multi-tier operator, which can be achieved via the FFT similarly as for a single-convolution $W$, but with additional complexity arising from the fact that the $A_{i+1,i}$ convolutions are strided. These details are described in Appendix C.

Experiments

To test the expressive power and training stability of MONs, we evaluate the MON instantiations described in Section 4 on several image classification benchmarks. We take as a point of comparison the Neural ODE (NODE) [8] and Augmented Neural ODE (ANODE) [10] models, the only other implicit-depth models which guarantee the existence and uniqueness of a solution.

The training process relies upon the operator splitting algorithms derived in Sections 3.4 and 3.5; for each batch of examples, the forward pass of the network involves finding the network fixed point (Algorithm 1 or 2), and the backward pass involves backpropagating the loss gradient through the fixed point (Algorithm 3 or 4). We analyze the convergence properties of both the forward-backward and Peaceman-Rachford operator splitting methods, and use the more efficient Peaceman-Rachford splitting for our model training. For further training details and model architectures see Appendix D. Experiment code can be found at http://github.com/locuslab/monotone_op_net.

Performance on image benchmarks
We train small MONs on CIFAR-10 [17], SVHN [22], and MNIST [18], with a similar number of parameters as the ODE-based models reported in [8] and [10]. The results are shown in Table 1. Training curves for MONs, NODE, and ANODE on CIFAR-10 are shown in Figure 1, and additional training curves are shown in Figure E1.
Notably, except for the fully-connected model on MNIST, all MONs significantly outperform the ODE-based models across datasets. We highlight the performance of the small single-convolution MON on CIFAR-10, which outperforms Augmented Neural ODE by 13.5%. Additionally, we train two larger MONs on CIFAR-10 with data augmentation. The strong performance (89.7% test accuracy) of the multi-tier network, in particular, goes a long way towards closing the performance gap with traditional deep networks.

[Table 1: Test accuracy of MON models compared to Neural ODEs and Augmented Neural ODEs, with panels for CIFAR-10, SVHN, and MNIST listing method, model size, and accuracy. † as reported in [10]; * with data augmentation.]

[Figure 1: Test accuracy of MONs during training on CIFAR-10, with NODE [8] and ANODE [10] for comparison. NODE and ANODE curves obtained using code provided by [10].]

[Figure 2: Forward-pass iterations required by Peaceman-Rachford equilibrium solving over the course of training, for the best α and for α = 1.]

Efficiency of operator splitting methods
We compare the convergence rates of Peaceman-Rachford and forward-backward splitting on a fully trained model, using a large multi-tier MON trained on CIFAR-10. Figure 3 shows convergence for both methods during the forward pass, for a range of α. As the theory suggests, the convergence rates depend strongly on the choice of α. Forward-backward splitting does not converge above a certain threshold value of α, while below that threshold its convergence speed varies inversely with α. In contrast, Peaceman-Rachford splitting is guaranteed to converge for any α > 0, but the dependence on α is non-monotonic. We see that, for the optimal choice of α, Peaceman-Rachford can converge much more quickly than forward-backward splitting. The convergence rate also depends on the Lipschitz parameter L of I − W, which we observe increases during training. Peaceman-Rachford splitting therefore requires an increasing number of iterations during both the forward pass (Figure 2) and backward pass (Figure E2).

Finally, we compare the efficiency of MON to that of the ODE-based models. We report the time and number of function evaluations (ODE solver steps or operator splitting iterations) required by the ~170k-parameter models to train on CIFAR-10 for 40 epochs. The MON, Neural ODE, and ANODE training takes respectively 1.4, 4.4, and 3.3 hours, with an average of 20, 96, and 90 function evaluations.
[Figure 3: Convergence of Peaceman-Rachford and forward-backward equilibrium solving on a fully-trained model, showing relative error versus iterations for forward-backward splitting and for Peaceman-Rachford splitting with α ∈ {1, 1/4, 1/8, 1/16, 1/32, 1/64}.]
Conclusion

The connection between monotone operator splitting and implicit network equilibria brings a new suite of tools to the study of implicit-depth networks. The strong performance, efficiency, and guaranteed stability of MON indicate that such networks could become practical alternatives to deep networks, while the flexibility of the framework means that performance can likely be further improved by, e.g., imposing additional structure on W or employing other operator splitting methods. At the same time, we see potential for the study of MONs to inform traditional deep learning itself. The guarantees we can derive about what architectures and algorithms work for implicit-depth networks may give us insights into what will work for explicit deep networks.

References
[1] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and J. Z. Kolter. Differentiable convex optimization layers. In Neural Information Processing Systems, 2019.
[2] L. B. Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Artificial Neural Networks. 1990.
[3] B. Amos and J. Z. Kolter. OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, 2017.
[4] B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter. Differentiable MPC for end-to-end planning and control. In Advances in Neural Information Processing Systems, pages 8299–8310, 2018.
[5] S. Bai, J. Z. Kolter, and V. Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, pages 688–699, 2019.
[6] H. H. Bauschke, P. L. Combettes, et al. Convex analysis and monotone operator theory in Hilbert spaces, volume 408. Springer, 2011.
[7] A. Bibi, B. Ghanem, V. Koltun, and R. Ranftl. Deep layers as stochastic solvers. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryxxCiRqYX.
[8] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
[9] P. L. Combettes and J.-C. Pesquet. Deep neural network structures solving variational inequalities. Set-Valued and Variational Analysis, pages 1–28, 2020.
[10] E. Dupont, A. Doucet, and Y. W. Teh. Augmented neural ODEs. In Advances in Neural Information Processing Systems, pages 3134–3144, 2019.
[11] L. El Ghaoui, F. Gu, B. Travacca, and A. Askari. Implicit deep learning. arXiv preprint arXiv:1908.06315, 2019.
[12] S. Gould, B. Fernando, A. Cherian, P. Anderson, R. S. Cruz, and E. Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447, 2016.
[13] S. Gould, R. Hartley, and D. Campbell. Deep declarative networks: A new hope. arXiv preprint arXiv:1909.04866, 2019.
[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[15] M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, 2016.
[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6980.
[17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[18] Y. LeCun, C. Cortes, and C. Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
[19] R. Liao, Y. Xiong, E. Fetaya, L. Zhang, K. Yoon, X. Pitkow, R. Urtasun, and R. Zemel. Reviving and improving recurrent back-propagation. In International Conference on Machine Learning, pages 3082–3091, 2018.
[20] C. K. Ling, F. Fang, and J. Z. Kolter. What game are we playing? End-to-end learning in normal and extensive form games. In International Joint Conference on AI, 2018.
[21] D. Linsley, A. K. Ashok, L. N. Govindarajan, R. Liu, and T. Serre. Stable and expressive recurrent vision models, 2020.
[22] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[23] F. J. Pineda. Generalization of back propagation to recurrent and higher order neural networks. In Advances in Neural Information Processing Systems, 1988.
[24] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, 2019.
[25] M. Revay and I. R. Manchester. Contracting implicit recurrent neural networks: Stable models with improved trainability. arXiv preprint arXiv:1912.10402, 2019.
[26] E. K. Ryu and S. Boyd. Primer on monotone operator methods. Appl. Comput. Math, 15(1):3–43, 2016.
[27] L. N. Smith and N. Topin. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, page 1100612. International Society for Optics and Photonics, 2019.
A Monotone operator theory
We briefly review some of the basic properties of monotone operators that we make use of throughout this work. A relation or operator (which in our setting will often roughly correspond to a set-valued function) is a subset of the space $A \subseteq \mathbb{R}^n \times \mathbb{R}^n$; we use the notation $A(x) = \{y \mid (x, y) \in A\}$, or simply $A(x) = y$ if only a single $y$ is contained in this set. We make use of a few basic operators and relations: the identity operator $I = \{(x, x) \mid x \in \mathbb{R}^n\}$; the operator sum $(A + B)(x) = \{(x, y + z) \mid (x, y) \in A, (x, z) \in B\}$; the inverse operator $A^{-1} = \{(y, x) \mid (x, y) \in A\}$; and the subdifferential operator $\partial f = \{(x, \partial f(x)) \mid x \in \operatorname{dom} f\}$. An operator $A$ has Lipschitz constant $L$ if for any $(x, u), (y, v) \in A$
$$\|u - v\|_2 \leq L\|x - y\|_2. \tag{A1}$$
An operator $A$ is monotone if
$$(u - v)^T(x - y) \geq 0, \quad \forall (x, u), (y, v) \in A \tag{A2}$$
which for the case of $A$ being a function $A : \mathbb{R}^n \to \mathbb{R}^n$ is equivalent to the condition
$$(A(x) - A(y))^T(x - y) \geq 0, \quad \forall x, y \in \operatorname{dom} A. \tag{A3}$$
In the case of scalar-valued functions, this corresponds to our common notion of a monotonic function. The operator $A$ is strongly monotone with parameter $m$ if
$$(u - v)^T(x - y) \geq m\|x - y\|_2^2, \quad \forall (x, u), (y, v) \in A. \tag{A4}$$
A monotone operator $A$ is maximal monotone if no other monotone operator strictly contains it; formally, most of the convergence properties we use require maximal monotonicity, though we are intentionally informal about this and merely use the established fact that several well-known operators are maximal monotone. Specifically, a linear operator $A(x) = Gx + h$ for $G \in \mathbb{R}^{n \times n}$ and $h \in \mathbb{R}^n$ is (maximal) monotone if and only if $G + G^T \succeq 0$, and strongly monotone with parameter $m$ if $G + G^T \succeq 2mI$. Similarly, a subdifferential operator $\partial f$ is maximal monotone iff $f$ is a convex closed proper (CCP) function.

The resolvent and Cayley operators for an operator $A$ are denoted $R_A$ and $C_A$ and are respectively defined as
$$R_A = (I + \alpha A)^{-1}, \qquad C_A = 2R_A - I \tag{A5}$$
for any $\alpha > 0$. The resolvent and Cayley operators are non-expansive (i.e., have Lipschitz constant $L \leq 1$) for any maximal monotone $A$, and are contractive (i.e., $L < 1$) for strongly monotone $A$. We will mainly use two well-known properties of these operators. First, when $A(x) = Gx + h$ is linear, then
$$R_A(x) = (I + \alpha G)^{-1}(x - \alpha h) \tag{A6}$$
and when $A = \partial f$ for some CCP function $f$, then the resolvent is given by a proximal operator
$$R_A(x) = \operatorname{prox}^\alpha_f(x) \equiv \operatorname*{argmin}_z \tfrac{1}{2}\|x - z\|_2^2 + \alpha f(z). \tag{A7}$$
Operator splitting approaches refer to methods to find a zero in a sum of operators (assumed here to be maximal monotone), i.e., find $x$ such that
$$0 \in (A + B)(x). \tag{A8}$$
There are many such operator splitting methods, which lead to different approaches in their application to our subsequent implicit networks, but the two we use mainly in this work are 1) forward-backward splitting, given by the update
$$x^{k+1} := R_B(x^k - \alpha A(x^k)); \tag{A9}$$
and 2) Peaceman-Rachford splitting, which is given by the iteration
$$u^{k+1} = C_A C_B(u^k), \qquad x^k = R_B(u^k). \tag{A10}$$
Both methods will converge linearly to an $x$ that is a zero of the operator sum under certain conditions: a sufficient condition for forward-backward splitting to converge is that $A$ be strongly monotone with parameter $m$ and Lipschitz with constant $L$ and that $\alpha < 2m/L^2$; Peaceman-Rachford splitting will converge for any choice of $\alpha$ for strongly monotone $A$, though the convergence speed will often vary substantially based upon $\alpha$.
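The following small sketch (our own, for illustration) implements the resolvent and Cayley operators of (A5)–(A7) and runs the Peaceman-Rachford iteration (A10) on a toy problem, with a linear strongly monotone A and with B the subdifferential of the indicator of the nonnegative orthant:

```python
import numpy as np

np.random.seed(0)
n, m, alpha = 40, 1.0, 1.0

P = np.random.randn(n, n) / np.sqrt(n)
G = m * np.eye(n) + P.T @ P                 # A(x) = G x + h, strongly monotone
h = np.random.randn(n)

R_A = lambda x: np.linalg.solve(np.eye(n) + alpha * G, x - alpha * h)   # resolvent (A6)
R_B = lambda x: np.maximum(x, 0)            # resolvent (A7): prox of the orthant indicator
C_A = lambda x: 2 * R_A(x) - x              # Cayley operators (A5)
C_B = lambda x: 2 * R_B(x) - x

u = np.zeros(n)
for _ in range(300):                        # Peaceman-Rachford iteration (A10)
    u = C_A(C_B(u))
x = R_B(u)

# x should be (approximately) a zero of A + B: a fixed point of the damped projected update.
print(np.linalg.norm(x - np.maximum(x - (G @ x + h), 0)))
```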
B Convolutional MONs

B.1 Inversion via the discrete Fourier transform
First consider the case where $W \in \mathbb{R}^{s^2 \times s^2}$ is the matrix representation of an unstrided (circular) convolution with a single input channel and single output channel. The convolution operates on vectorized $s \times s$ inputs. It is well known that $W$ is diagonalized by the 2D DFT operator $F_{s^2} = F_s \otimes F_s$, where $F_s$ is the Fourier basis matrix $(F_s)_{ij} = \frac{1}{\sqrt{s}}\omega^{(i-1)(j-1)}$ and $\omega = \exp(2\pi\iota/s)$. We denote $\iota = \sqrt{-1}$ to avoid confusion with the index $i$. So
$$F_{s^2} W F_{s^2}^* = D, \tag{B1}$$
a complex diagonal matrix.

Now take the case where $W \in \mathbb{R}^{ns^2 \times ns^2}$ has $n$ input and output channels. Then
$$(I_n \otimes F_{s^2}) W (I_n \otimes F_{s^2}^*) = D = \begin{bmatrix} D_{11} & D_{12} & \cdots & D_{1n} \\ D_{21} & D_{22} & \cdots & D_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ D_{n1} & D_{n2} & \cdots & D_{nn} \end{bmatrix} \tag{B2}$$
where $I_n$ is the $n \times n$ identity matrix and each block $D_{ij} \in \mathbb{C}^{s^2 \times s^2}$ is a complex diagonal matrix. We will denote $F_{s,n} = I_n \otimes F_{s^2}$.

It is more efficient to consider the permuted form of $D$,
$$\hat{D} = \begin{bmatrix} \hat{D}_1 & 0 & \cdots & 0 \\ 0 & \hat{D}_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{D}_{s^2} \end{bmatrix} \tag{B3}$$
where each block $\hat{D}_k \in \mathbb{C}^{n \times n}$ consists of the $k$th diagonal elements of all the $D_{ij}$, that is, $(\hat{D}_k)_{ij} = (D_{ij})_{kk}$. Then inverting or multiplying by $\hat{D}$ reduces to inverting or multiplying by the diagonal blocks, which is amenable to accelerated batch-wise computation in the form of an $s^2 \times n \times n$ tensor. However, the original form (B2) is more convenient mathematically and we use that here.

To perform the required inversion of the operator
$$I + \alpha(I - W) = (1 + \alpha m)I + \alpha A^T A - \alpha B + \alpha B^T \tag{B4}$$
we use the fact that $F_{s,n}$ is unitary and obtain
$$(1 + \alpha m)I + \alpha A^T A - \alpha B + \alpha B^T = (1 + \alpha m)F_{s,n}^* F_{s,n} + F_{s,n}^*\big(\alpha D_A^* F_{s,n} F_{s,n}^* D_A - \alpha D_B + \alpha D_B^*\big)F_{s,n} = F_{s,n}^*\big((1 + \alpha m)I + \alpha D_A^* D_A - \alpha D_B + \alpha D_B^*\big)F_{s,n}. \tag{B5}$$
The inner term here itself has the blockwise-diagonal form (B2). Thus, we can multiply a set of hidden units $z$ by the inverse of this matrix by considering the permuted form (B3), inverting each block $\hat{D}_i$, taking the FFT of $z$, multiplying each corresponding block of $F_{s,n}z$ by the corresponding inverse, then taking the inverse FFT.
C Multi-tier MONs

C.1 Parameterization
Recall the setting of Section 4.3, with
$$z = \begin{bmatrix} z_1 \in \mathbb{R}^{n_1 s_1^2} \\ z_2 \in \mathbb{R}^{n_2 s_2^2} \\ \vdots \\ z_L \in \mathbb{R}^{n_L s_L^2} \end{bmatrix}, \qquad
W = \begin{bmatrix} W_{11} & 0 & \cdots & 0 \\ W_{21} & W_{22} & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & W_{L,L-1} & W_{LL} \end{bmatrix}. \tag{C1}$$
To ensure $W$ has the form $(1 - m)I - A^TA + B - B^T$, we restrict both $A$ and $B$ to have the same bidiagonal structure as $W$. Then the diagonal terms $W_{ii}$ have the form
$$W_{ii} = (1 - m)I - A_{ii}^TA_{ii} - A_{i+1,i}^TA_{i+1,i} + B_{ii} - B_{ii}^T \tag{C2}$$
for $i < L$, and
$$W_{LL} = (1 - m)I - A_{LL}^TA_{LL} + B_{LL} - B_{LL}^T. \tag{C3}$$
To compute the off-diagonal terms $W_{i+1,i}$, note that restricting $W$ to be bidiagonal makes the off-diagonal terms of $B$ redundant. E.g., since $W_{12} = 0$, then
$$-A_{21}^TA_{22} - B_{21}^T = W_{12} = 0 \;\Rightarrow\; W_{21} = -A_{22}^TA_{21} + B_{21} = -2A_{22}^TA_{21}. \tag{C4}$$
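As a small numerical check of this block parameterization (our own sketch, with dense matrices standing in for the convolutional blocks of a two-tier model), forcing the off-diagonal block of B as above yields a bidiagonal W that still satisfies the monotonicity constraint:

```python
import numpy as np

np.random.seed(0)
n1, n2, m = 6, 4, 1.0

# Dense stand-ins for the convolutional blocks of a two-tier A and B.
A11, A21, A22 = np.random.randn(n1, n1), np.random.randn(n2, n1), np.random.randn(n2, n2)
B11, B22 = np.random.randn(n1, n1), np.random.randn(n2, n2)
B21 = -A22.T @ A21          # forced by requiring the (1,2) block of W to vanish, eq. (C4)

Z = np.zeros((n1, n2))
A = np.block([[A11, Z], [A21, A22]])
B = np.block([[B11, Z], [B21, B22]])
W = (1 - m) * np.eye(n1 + n2) - A.T @ A + B - B.T

print(np.allclose(W[:n1, n1:], 0))                      # upper off-diagonal block is zero
print(np.allclose(W[n1:, :n1], -2 * A22.T @ A21))       # the resulting W_21 block
sym = np.eye(n1 + n2) - (W + W.T) / 2
print(np.linalg.eigvalsh(sym).min() >= m - 1e-8)        # I - W >= m I still holds
```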
C.2 Inversion via the discrete Fourier transform

Consider $W$ of the form (C1) with convolutions
$$\begin{aligned}
W_{ii} &= (1 - m)I - A_{ii}^TA_{ii} - A_{i+1,i}^TA_{i+1,i} + B_{ii} - B_{ii}^T \\
W_{LL} &= (1 - m)I - A_{LL}^TA_{LL} + B_{LL} - B_{LL}^T \\
W_{i+1,i} &= -2A_{i+1,i+1}^TA_{i+1,i}.
\end{aligned} \tag{C5}$$
Here the $A_{ii}$ and $B_{ii}$ terms are unstrided convolutions with $n_i$ input and $n_i$ output channels, while the $A_{i+1,i}$ are strided convolutions with $n_i$ input channels and $n_{i+1}$ output channels.

In order to multiply by $(I + \alpha(I - W))^{-1}$, we use back substitution to solve for $x$ in
$$z = (I + \alpha(I - W))x. \tag{C6}$$
Let $W' = (I + \alpha(I - W))$. The back substitution proceeds by tiers, i.e.,
$$\begin{aligned}
x_1 &= W_{11}'^{-1} z_1 \\
x_2 &= W_{22}'^{-1}(z_2 - W_{21}'x_1) \\
x_3 &= W_{33}'^{-1}(z_3 - W_{32}'x_2) \\
&\;\;\vdots
\end{aligned} \tag{C7}$$
Therefore only the diagonal blocks $W_{ii}'$ need be inverted. The inversion of, e.g.,
$$W_{11}' = (1 + \alpha m)I + \alpha(A_{11}^TA_{11} + A_{21}^TA_{21} - B_{11} + B_{11}^T) \tag{C8}$$
is complicated by the fact that $A_{21}$ is strided, so that it is no longer diagonalized by the DFT. Instead, we perform inversion using the following proposition.
Proposition C1. Let $A \in \mathbb{R}^{ns^2 \times ns^2}$ be an unstrided circular convolution with $n$ input and $n$ output channels, and let $B \in \mathbb{R}^{ns^2/r^2 \times ns^2}$ be a strided circular convolution with $n$ input and $n$ output channels and stride $r$, where $r$ divides $s$. Then
$$(A + B^T B)^{-1} = F_{s,n}^*\big(D_A^{-1} - D_A^{-1} D_B^*(I_n \otimes K)D_B D_A^{-1}\big)F_{s,n} \tag{C9}$$
where
$$D_A = F_{s,n} A F_{s,n}^*, \qquad D_B = F_{s,n} B' F_{s,n}^*, \qquad K = SJ\big(r^2 I + J^T S^T D_B D_A^{-1} D_B^* S J\big)^{-1} J^T S^T \tag{C10}$$
where $B'$ is the unstrided convolution underlying $B$ (see (C13) below), $J = \mathbf{1}_{r^2} \otimes I_{s^2/r^2}$ is $r^2$ stacked identity matrices of size $(s^2/r^2) \times (s^2/r^2)$, and $S = (I_r \otimes S_{s/r,s})$ is a permutation matrix, where $S_{a,b} \in \mathbb{R}^{ab \times ab}$ denotes the perfect shuffle matrix defined by subselecting rows of the identity matrix $I_{ab}$, here given in MATLAB notation:
$$S_{a,b} = \begin{bmatrix} I_{ab}(1 : b : ab,\, :) \\ I_{ab}(2 : b : ab,\, :) \\ \vdots \\ I_{ab}(b : b : ab,\, :) \end{bmatrix}. \tag{C11}$$
Proof. We will show that
$$A + B^T B = F_{s,n}^*\Big(D_A + D_B^*\big(I_n \otimes \big(\tfrac{1}{r^2} S J J^T S^T\big)\big)D_B\Big)F_{s,n}. \tag{C12}$$
The desired result then follows by applying the Woodbury matrix identity.

We start by breaking $B$ into an unstrided convolution $B'$, which can be diagonalized by the DFT, and a matrix $U_{r,s}$ which performs the striding on each channel:
$$B = (I_n \otimes U_{r,s})B' = (I_n \otimes U_{r,s})F_{s,n}^* D_B F_{s,n} \tag{C13}$$
where $U_{r,s} \in \mathbb{R}^{(s^2/r^2) \times s^2}$ is defined by subselecting rows of the identity matrix:
$$U_{r,s} = \begin{bmatrix} I_{s^2}(1 : r : s,\, :) \\ I_{s^2}(rs + 1 : r : (r+1)s,\, :) \\ I_{s^2}(2rs + 1 : r : (2r+1)s,\, :) \\ \vdots \\ I_{s^2}(s^2 - sr + 1 : r : s^2 - s(r-1),\, :) \end{bmatrix}. \tag{C14}$$
So
$$B^T B = F_{s,n}^* D_B^*\big(I_n \otimes (F_{s^2} U_{r,s}^T U_{r,s} F_{s^2}^*)\big)D_B F_{s,n}. \tag{C15}$$
We want to show that $F_{s^2} U_{r,s}^T U_{r,s} F_{s^2}^* = \tfrac{1}{r^2} S J J^T S^T$. Observe that
$$U_{r,s}^T U_{r,s} = (T_{r,s} \otimes T_{r,s}) \tag{C16}$$
where $T_{r,s} \in \mathbb{R}^{s \times s}$ is given by
$$(T_{r,s})_{ij} = \begin{cases} 1 & \text{if } i = j \text{ and } i \equiv 1 \ (\mathrm{mod}\ r), \\ 0 & \text{else.} \end{cases} \tag{C17}$$
Then by the properties of the Kronecker product,
$$F_{s^2} U_{r,s}^T U_{r,s} F_{s^2}^* = (F_s \otimes F_s)(T_{r,s} \otimes T_{r,s})(F_s^* \otimes F_s^*) = (F_s T_{r,s} F_s^*) \otimes (F_s T_{r,s} F_s^*). \tag{C18}$$
We now show that $(F_s T_{r,s} F_s^*) = L$ where
$$L_{ij} = \begin{cases} \tfrac{1}{r} & \text{if } i \equiv j \ (\mathrm{mod}\ s/r), \\ 0 & \text{else.} \end{cases} \tag{C19}$$
To do so we use several properties of the roots of unity $z_k = \exp(2\pi\iota k/s)$:
1. If $a \equiv b \ (\mathrm{mod}\ s)$ then $z_a = z_b$.
2. If $z$ is a primitive $s$th root of unity then $z^m$ is a primitive $a$th root of unity where $a = \frac{s}{\gcd(m, s)}$.
3. The sum of the $s$th roots of unity, $\sum_{k=0}^{s-1} z_k$, is zero if $s > 1$.

We first compute $L_{ij}$ for the case when $i \equiv j \ (\mathrm{mod}\ s/r)$, or in other words $i = j + k\frac{s}{r}$ for some integer $k$. We have
$$L_{ij} = \frac{1}{s}\sum_{p = 1,\,1+r,\,\ldots,\,s-r+1} \omega^{(i-1)(p-1)}\bar{\omega}^{(p-1)(j-1)} = \frac{1}{s}\sum_{p = 0,\,r,\,\ldots,\,s-r} \exp(2\pi\iota p(i-j)/s) = \frac{1}{s}\sum_{p = 0,\,r,\,\ldots,\,s-r} \exp(2\pi\iota pk/r) = \frac{1}{s}\sum_{p=0}^{\frac{s}{r}-1} \exp(2\pi\iota pk) = \frac{1}{r}. \tag{C20}$$
For the case when $i \not\equiv j \ (\mathrm{mod}\ s/r)$, or in other words $i = j + k\frac{s}{r} + m$ for some integers $k$ and $m$ with $-\frac{s}{r} < m < \frac{s}{r}$ and $m \neq 0$, we have
$$L_{ij} = \frac{1}{s}\sum_{p = 1,\,1+r,\,\ldots,\,s-r+1} \omega^{(i-1)(p-1)}\bar{\omega}^{(p-1)(j-1)} = \frac{1}{s}\sum_{p = 0,\,r,\,\ldots,\,s-r} \exp(2\pi\iota p(i-j)/s) = \frac{1}{s}\sum_{p=0}^{\frac{s}{r}-1} \exp(2\pi\iota p(i-j)r/s) = \frac{1}{s}\sum_{p=0}^{\frac{s}{r}-1} \exp(2\pi\iota pmr/s)\exp(2\pi\iota pk). \tag{C21}$$
By property (2), since $\exp(2\pi\iota r/s)$ is a primitive $\frac{s}{r}$th root of unity, $\exp(2\pi\iota mr/s)$ is a primitive $d$th root of unity where $d = \frac{s/r}{\gcd(m,\, s/r)} > 1$. Since $d$ divides $s/r$, we can split the sum into several sums of $d$th roots of unity using property (1), each of which will sum to zero by property (3):
$$L_{ij} = \frac{1}{s}\sum_{p=0}^{\frac{s}{r}-1} \exp(2\pi\iota pmr/s) = \frac{1}{s}\sum_{q=0}^{\frac{s}{rd}-1}\sum_{p=0}^{d-1} \exp(2\pi\iota(p + qd)mr/s) = \frac{1}{rd}\sum_{p=0}^{d-1} \exp(2\pi\iota pmr/s) = 0 \tag{C22}$$
where the second equality follows from property (1), since $p + qd \equiv p \ (\mathrm{mod}\ d)$, and the final sum is zero by property (3), since $\exp(2\pi\iota mr/s)$ is a primitive $d$th root of unity.

We now have $F_{s^2} U_{r,s}^T U_{r,s} F_{s^2}^* = L \otimes L$, and it remains to use properties of the Kronecker product to show that $L \otimes L = \frac{1}{r^2} S J J^T S^T$. In particular we need associativity and the fact that for $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{m \times m}$, we have
$$B \otimes A = S_{n,m}(A \otimes B)S_{n,m}^T \tag{C23}$$
where $S_{n,m}$ is the perfect shuffle matrix. Note that $L = \frac{1}{r}\,\mathbf{1}_{r \times r} \otimes I_{s/r}$, where $\mathbf{1}_{r \times r}$ is the $r \times r$ matrix of all ones.
Then
$$L \otimes L = \frac{1}{r^2}(\mathbf{1}_{r \times r} \otimes I_{s/r}) \otimes (\mathbf{1}_{r \times r} \otimes I_{s/r}) = \frac{1}{r^2}\,\mathbf{1}_{r \times r} \otimes \big(I_{s/r} \otimes (\mathbf{1}_{r \times r} \otimes I_{s/r})\big) = \frac{1}{r^2}\,\mathbf{1}_{r \times r} \otimes \big(S_{s/r,s}\big((\mathbf{1}_{r \times r} \otimes I_{s/r}) \otimes I_{s/r}\big)S_{s/r,s}^T\big) = \frac{1}{r^2}\,\mathbf{1}_{r \times r} \otimes \big(S_{s/r,s}(\mathbf{1}_{r \times r} \otimes I_{s^2/r^2})S_{s/r,s}^T\big) = \frac{1}{r^2}(I_r \otimes S_{s/r,s})(\mathbf{1}_{r^2 \times r^2} \otimes I_{s^2/r^2})(I_r \otimes S_{s/r,s}^T) = \frac{1}{r^2}\,S J J^T S^T \tag{C24}$$
which completes the proof.
CIFAR-10
                       Single conv   Multi-tier     Single conv lg.   Multi-tier lg.
Num. channels          81            (16, 32, 60)   200               (64, 128, 128)
Num. params            172,218       170,194        853,612           1,014,546
Epochs                 40            40             65                65
Initial lr             0.001         0.01           0.001             0.001
Lr schedule            step decay    step decay     1-cycle           1-cycle
Lr decay steps         25            10             -                 -
Lr decay factor        10            10             -                 -
Max learning rate      -             -              0.01              0.05
Data augmentation      -             -              yes               yes

SVHN / MNIST
                       SVHN            SVHN           MNIST    MNIST          MNIST
                       Single conv     Multi-tier     FC       Single conv    Multi-tier
Num. channels          81              (16, 32, 60)   87*      54             (16, 32, 32)
Num. params            172,218         170,194        84,313   84,460         81,394
Initial lr             0.001           0.001          0.001    0.001          0.001
Epochs                 40              40             40       40             40
Lr decay steps         25              10             10       10             10
Lr decay factor        10              10             10       10             10

Table D1: Model hyperparameters. *FC is a dense layer with output dimension of 87.
D Experiment details
D.1 Model architecture
Recall that a MON is defined by a choice of linear operators $W$ and $U$, bias $b$, and nonlinearity $\sigma$, and that we parameterize $W$ via linear operators $A$ and $B$. For all experiments we use $\sigma = \text{ReLU}$. In the fully-connected network,
$A$, $B$, and $U$ are dense matrices; in the single-convolution network they are unstrided convolutions with kernel size 3. The structure of the multi-tier network is as described in (C1) and (C5); we use three tiers with unstrided convolutions for $U$ and $A_{ii}$, $B_{ii}$ and stride-2 convolutions for the subdiagonal terms $A_{i+1,i}$, all with kernels of size 3. The number of channels for single and multi-tier convolutional models varies by dataset, as shown in Table D1.

For all models, the fixed point $z^\star$ is mapped to logits $\hat{y}$ via a dense output layer, and the single-convolution model first applies $4 \times 4$ average pooling, i.e.,
$$\hat{y} = W_o z^\star + b_o \quad \text{or} \quad \hat{y} = W_o \operatorname{AvgPool}_{4\times 4}(z^\star) + b_o.$$

D.2 Training details
Because $W = (1 - m)I - A^TA + B - B^T$ contains both linear and quadratic terms, we find that a variant of weight normalization helps to keep the gradients of the different parameters on the same scale. For example, when $W$ is a dense matrix, we reparameterize $A^TA$ as $g\,A^TA/\|A\|^2$ and $B$ as $h\,B/\|B\|$, where $g$ and $h$ are learned scalars. When $W$ consists of a single or multi-tiered convolution, we reparameterize each convolution kernel analogously.

All models are trained by running Peaceman-Rachford splitting with a fixed error tolerance $\epsilon$. The parameter $m$ also affects convergence speed, since it controls the contraction factor of the relevant operators; consistent with this, we find that Peaceman-Rachford splitting takes longer to converge for smaller $m$, and we use $m = 1$ for all models since model performance is not sensitive to $m$ over a wide range of values. We also find that the Lipschitz parameter $L$ of $I - W$ increases during training, changing the optimal $\alpha$ value. We therefore tune $\alpha$ over a small grid of candidate values over the course of training so as to minimize forward-pass iterations.

Table D1 gives details of the training hyperparameters used for each model. All models are trained with ADAM [16], using a batch size of 128. For all but the large CIFAR-10 models, the initial learning rate is chosen from {1e-2, 1e-3} and decayed by a factor of 10 after every 10 or 25 epochs, and the default ADAM momentum parameters are used. All training data is normalized to mean $\mu = 0$, standard deviation $\sigma = 1$.
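A short PyTorch sketch of this reparameterization for the dense case (our own illustrative version; the exact normalization used in the released code may differ, and the scalars g and h are assumed to remain positive during training):

```python
import torch
import torch.nn as nn

class MonotoneDenseW(nn.Module):
    """Parameterizes W = (1 - m) I - g * A^T A / ||A||^2 + h * (B - B^T) / ||B||."""

    def __init__(self, n, m=1.0):
        super().__init__()
        self.m = m
        self.A = nn.Parameter(torch.randn(n, n) / n ** 0.5)
        self.B = nn.Parameter(torch.randn(n, n) / n ** 0.5)
        self.g = nn.Parameter(torch.tensor(1.0))   # learned rescaling of the A^T A term
        self.h = nn.Parameter(torch.tensor(1.0))   # learned rescaling of the skew term

    def forward(self):
        n = self.A.shape[0]
        quad = self.g * (self.A.T @ self.A) / (self.A.norm() ** 2)
        skew = self.h * (self.B - self.B.T) / self.B.norm()
        return (1 - self.m) * torch.eye(n, device=self.A.device) - quad + skew

W = MonotoneDenseW(64)()   # usage: materialize W once per forward/backward pass
```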
CIFAR-10 with data augmentation

When training large models on CIFAR-10 we use standard data augmentation, consisting of zero-padding the 32 × 32 images to 40 × 40, then randomly cropping back to 32 × 32, and finally performing random horizontal flips. In order to reduce the number of training epochs, we use a single cycle of increasing and decreasing learning rate to achieve super-convergence [27]. The learning rate is increased from 1e-3 to the max learning rate (see Table D1) over 30 epochs, then decreased back to 1e-3 over 30 epochs, then held at 1e-3 for 5 epochs. (The max learning rate is chosen by training for a single epoch while increasing the learning rate until the loss diverges.) The momentum is also decreased from 0.95 to 0.85 over 30 epochs, then back to 0.95 over 30 epochs, then held at 0.95 for 5 epochs. However, we note that the model obtains the same performance when trained with a constant learning rate of 1e-3 for around 200 epochs.
D.3 Dataset statistics
MNIST [18] consists of black and white examples of handwritten digits 0-9. SVHN [22] consists of color images of digits 0-9 extracted from house numbers captured by Google Street View. CIFAR-10 [17] consists of small images from 10 object classes. Dataset statistics are shown in Table D2.

             Train examples   Test examples   Image dim.   Num. channels
MNIST        60,000           10,000          28 × 28      1
SVHN         73,257           26,032          32 × 32      3
CIFAR-10     50,000           10,000          32 × 32      3

Table D2: Dataset statistics
E Additional results and figures

[Figure E1: Test accuracy of MONs during training on SVHN (single conv, multi-tier), on CIFAR-10 with data augmentation (single conv lg., multi-tier lg.), and on MNIST (fully connected, single conv, multi-tier).]