Differentiable Forward and Backward Fixed-Point Iteration Layers
Younghan Jeon†∗   Minsik Lee‡∗   Jin Young Choi†
[email protected]   [email protected]   [email protected]
† Department of Electrical and Computer Engineering, ASRI, Seoul National University
‡ Division of Electrical Engineering, Hanyang University
∗ Authors contributed equally.

Abstract
Recently, several studies proposed methods to utilize some classes of optimization problems in designing deep neural networks to encode constraints that conventional layers cannot capture. However, these methods are still in their infancy and require special treatments, such as analyzing the KKT conditions, for deriving the backpropagation formula. In this paper, we propose a new layer formulation called the fixed-point iteration (FPI) layer that facilitates the use of more complicated operations in deep networks. The backward FPI layer is also proposed for backpropagation, which is motivated by the recurrent back-propagation (RBP) algorithm. In contrast to RBP, however, the backward FPI layer yields the gradient by a small network module without an explicit calculation of the Jacobian. In actual applications, both the forward and backward FPI layers can be treated as nodes in the computational graph. All components in the proposed method are implemented at a high level of abstraction, which allows efficient higher-order differentiations on the nodes. In addition, we present two practical methods of the FPI layer, FPI_NN and FPI_GD, where the update operation of FPI is a small neural network module and a single gradient descent step based on a learnable cost function, respectively. FPI_NN is intuitive, simple, and fast to train, while FPI_GD can be used for efficient training of energy networks that have been recently studied. While RBP and its related studies have not been applied to practical examples, our experiments show that the FPI layer can be successfully applied to real-world problems such as image denoising, optical flow, and multi-label classification.
1 Introduction

Recently, several papers proposed to compose a deep neural network with more complicated algorithms rather than with the simple operations that had been used previously. For example, there have been methods using certain types of optimization problems in deep networks, such as differentiable optimization layers [4] and energy function networks [5, 9]. These methods can be used to introduce a prior in a deep network and provide a possibility of bridging the gap between deep learning and some of the traditional methods. However, they are still premature and require non-trivial efforts to implement in actual applications. Especially, the backpropagation formula has to be derived explicitly for each different formulation based on some criteria like the KKT conditions. This limits the practicality of the approaches since there can be numerous different formulations depending on the actual problems.

Meanwhile, there has been an algorithm called recurrent back-propagation (RBP), proposed by Almeida [3] and Pineda [21] several decades ago. RBP is a method to train an RNN that converges to a steady state. The advantages of RBP are that it can be applied universally to most operations that consist of repeated computations and that the whole process can be summarized by a single update equation. Even with its long history, however, RBP and related studies [6, 19] have been tested only for verifying the theoretical concept, and there has been no example that applied these methods to a practical task. Moreover, there have been no studies using RBP in conjunction with other neural network components to verify its effect in more complex settings.

In this paper, to facilitate the use of more complicated operations in deep networks, we introduce a new layer formulation that can be practically implemented and trained based on RBP with some additional considerations. To this end, we employ the fixed-point iteration (FPI), which is the basis of many numerical algorithms including most gradient-based optimizations, as a layer of a neural network. In the FPI layer, the layer's input and its weights are used to define an update equation, and the output of the layer is the fixed point of the update equation. Under mild conditions, the FPI layer is differentiable and the derivative depends only on the fixed point, which is much more efficient than adding all the individual iterations to the computational graph.

We also propose a backpropagation method called the backward FPI layer, based on RBP [3, 21], to compute the derivative of the FPI layer efficiently. We prove that if the aforementioned conditions for the FPI layer hold, then the backward FPI layer also converges. In contrast to RBP, the backward FPI layer yields the gradient by a small network module, which allows us to avoid the explicit calculation of the Jacobian. In other words, we do not need a separate derivation for the backpropagation formula and can utilize existing autograd functionalities. Especially, we provide a modularized implementation of the partial differentiation operation, which is essential in the backward FPI layer but is absent in regular autograd libraries, based on an independent computational graph. This makes the proposed method very simple to apply to various practical applications. Since FPI covers many different types of numerical algorithms as well as optimization problems, there are a lot of potential applications for the proposed method.
The FPI layer is highly modularized, so it can be easily used together with other existing layers such as convolution, ReLU, etc., and it has a richer representation power than a feedforward layer with the same number of weights. The contributions of this paper are summarized as follows.

• We propose a method to use an FPI as a layer of a neural network. The FPI layer can be utilized to incorporate the mechanisms of conventional iterative algorithms, such as numerical optimization, into deep networks. Unlike other existing layers based on differentiable optimization problems, the implementation is much simpler and the backpropagation formula can be universally derived.

• For backpropagation, the backward FPI layer is proposed based on RBP to compute the gradient efficiently. Under mild conditions, we show that both the forward and backward FPI layers are guaranteed to converge. All components are highly modularized, and a general partial differentiation tool is developed so that the FPI layer can be used in various circumstances without any modification.

• Two types of FPI layers (FPI_NN, FPI_GD) are presented. The proposed networks based on the FPI layers are applied to practical tasks such as image denoising, optical flow, and multi-label classification, which have been largely absent in existing RBP-based studies, and show good performance.

The remainder of this paper is organized as follows. We first introduce related works in Section 2. The proposed FPI layer is explained in Section 3, and the experimental results follow in Section 4. Finally, we conclude the paper in Section 5. The code will be available upon publication.
2 Related work

Fixed-point iteration:
For a given function $g$ and a sequence of vectors $\{x_n \in \mathbb{R}^d\}$, the fixed-point iteration [10] is defined by the following update equation:

$x_{n+1} = g(x_n), \quad n = 0, 1, 2, \cdots$   (1)

If the sequence converges, its limit is a fixed point $\hat{x}$ of $g$, satisfying $\hat{x} = g(\hat{x})$. The gradient descent method ($x_{n+1} = x_n - \gamma \nabla f(x_n)$) is a popular example of fixed-point iteration. Many numerical algorithms are based on fixed-point iteration, and there are also many examples in machine learning. Here are some important concepts about fixed-point iteration.

Definition 1 (Contraction mapping) [16]. On a metric space $(X, d)$, the function $f: X \to X$ is a contraction mapping if there is a real number $0 \le k < 1$ that satisfies the following inequality for all $x_1$ and $x_2$ in $X$:

$d\big(f(x_1), f(x_2)\big) \le k \cdot d(x_1, x_2).$   (2)

The smallest $k$ that satisfies the above condition is called the Lipschitz constant of $f$. The distance metric is defined to be an arbitrary norm $\|\cdot\|$ in this paper. Based on the above definition, the Banach fixed-point theorem [7] states the following.

Theorem 1 (Banach fixed-point theorem). A contraction mapping has exactly one fixed point, and it can be found by starting with any initial point and iterating the update equation until convergence.
Therefore, if $g$ is a contraction mapping, the iteration converges to a unique point $\hat{x}$ regardless of the starting point $x_0$. The above concepts are important in deriving the proposed FPI layer in this paper.
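To make these concepts concrete, the following is a minimal NumPy sketch (ours, not from the paper) in which the update map g is a single gradient-descent step on a convex quadratic. With a sufficiently small step size, g is a contraction, so by the Banach fixed-point theorem the iterates converge to the unique fixed point from any starting point; this is also the kind of update used later by the FPI_GD variant. The step size, tolerance, and matrix below are illustrative choices.

```python
import numpy as np

# f(x) = 0.5 * x^T A x - b^T x with A positive definite, so grad f(x) = A x - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
gamma = 0.2  # step size; g is a contraction since gamma < 2 / lambda_max(A)

def g(x):
    """One gradient-descent step, used as the fixed-point update map."""
    return x - gamma * (A @ x - b)

x = np.zeros(2)  # arbitrary starting point x_0
for _ in range(10_000):
    x_next = g(x)
    if np.linalg.norm(x_next - x) < 1e-10:  # converged to the (numerical) fixed point
        x = x_next
        break
    x = x_next

print(x)                      # fixed point x_hat, satisfying x_hat = g(x_hat)
print(np.linalg.solve(A, b))  # equals argmin_x f(x); matches x_hat
```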
Energy function networks: Scalar-valued networks that estimate energy (or error) functions have recently attracted considerable research interest. The energy function network (EFN) has a different structure from general feed-forward neural networks, and the concept was first proposed in [18]. After training an EFN for a certain task, the answer to a test sample is obtained by finding the input of the trained EFN that minimizes the network's output. The structured prediction energy network (SPEN) [9] performs gradient descent on an energy function network to find the solution, and a structured support vector machine [22] loss is applied to the obtained solution. The input convex neural networks (ICNNs) [5] are defined in a specific way so that the networks have convex structures with respect to (w.r.t.) the inputs, and their learning and inference are performed by the entropy method, which is derived based on the KKT optimality conditions. The deep value networks [13] and the IoU-Net [14] directly learn loss metrics such as the intersection over union (IoU) of bounding boxes and then perform inference by gradient-based optimization methods.

Although the above approaches provide novel ways of utilizing neural networks in optimization frameworks, they have not been combined with other existing deep network components to verify their effects in more complicated problems. Moreover, they are mostly limited to certain types of problems and require complicated learning processes. Our method can be applied to broader situations than EFN approaches, and these approaches can be equivalently implemented by the proposed method once the update equation for the optimization problem is derived.
Differentiable optimization layers:
Recently, a few papers using optimization problems as a layer of a deep learning architecture have been proposed. Such a structure can contain a more complicated behavior in one layer than the usual layers in neural networks, and can potentially reduce the depth of the network. OptNet [4] presents how to use a quadratic program (QP) as a layer of a neural network; it also uses the KKT conditions to compute the derivative of the solution of the QP. Agrawal et al. [1] propose an approach to differentiate disciplined convex programs, which are a subclass of convex optimization problems. There are a few other studies that differentiate optimization problems such as submodular models [11], cone programs [2], semidefinite programs [23], and so on. However, most of them have limited applications, and users need to adapt their problems to the rigid problem settings. On the other hand, our method makes it easy to use a large class of iterative algorithms as a network layer, which also includes the differentiable optimization problems.
Recurrent back-propagation:
RBP is a method to train a special case of RNN, proposed by Almeida [3] and Pineda [21]. RBP computes the gradient of the steady state of an RNN with constant memory. Although RBP has great potential, it is rarely used in practical problems of deep learning. Some artificial experiments showing its memory efficiency were performed, but it was difficult to apply to complex and practical tasks. Recently, Liao et al. [19] tried to revive RBP using the conjugate gradient method and the Neumann series. However, both the forward and backward passes use a fixed number of steps (maximum 100), which might not be sufficient for convergence in practical problems. Also, if the forward pass does not converge, the equilibrium point is meaningless, so it can be unstable to train the network using the unconverged final point, which is a problem not addressed in that paper. Deep equilibrium models (DEQ) [6] find the equilibrium points of a deep sequence model via an existing root-finding algorithm. Then, for back-propagation, they compute the gradient of the equilibrium point by another root-finding method. In short, both the forward and backward passes are implemented via quasi-Newton methods. DEQ can also be evaluated with constant memory, but it can only model sequential (temporal) data, and the aforementioned convergence issues still exist.

RBP-based methods mainly perform experiments to verify theoretical concepts and have not been applied to practical examples. Our work incorporates the concept of RBP in the FPI layer to apply complicated iterative operations in deep networks, and presents two types of algorithms accordingly. The proposed method is the first RBP-based method that shows successful applications to practical tasks in machine learning or computer vision, and it can be widely used to promote RBP-based research in the deep learning field.
3 Proposed method

The fixed-point iteration formula covers a wide variety of forms and can be applied to most iterative algorithms. Section 3.1 describes the basic structure and principles of the FPI layer. Sections 3.2 and 3.3 explain the differentiation of the layer for backpropagation. Section 3.4 describes the convergence of the FPI layer. Section 3.5 presents two exemplar cases of the FPI layer.
3.1 Basic structure of the FPI layer

Here we describe the basic operation of the FPI layer. Let $g(x, z; \theta)$ be a parametric function where $x$ and $z$ are vectors of real numbers and $\theta$ is the parameter. We assume that $g$ is differentiable w.r.t. $x$ and has a Lipschitz constant less than one w.r.t. $x$, so that the following fixed-point iteration converges to a unique point according to the Banach fixed-point theorem:

$x_{n+1} = g(x_n, z; \theta), \qquad \hat{x} = \lim_{n\to\infty} x_n.$   (3)

The FPI layer can be defined based on the above relations. The FPI layer $F$ receives an observed sample or the output of the previous layer as its input $z$, and yields the fixed point $\hat{x}$ of $g$ as the layer's output, i.e.,

$\hat{x} = g(\hat{x}, z; \theta) = F(x_0, z; \theta) = \lim_{n\to\infty} g^{(n)}(x_0, z; \theta) = (g \circ g \circ \cdots \circ g)(x_0, z; \theta),$   (4)

where $\circ$ indicates the function composition operator. The layer receives the initial point $x_0$ as well, but its actual value does not matter in the training procedure because $g$ has a unique fixed point. Hence, $x_0$ can be predetermined to any value such as a zero vector. Accordingly, we will often express $\hat{x}$ as a function of $z$ and $\theta$, i.e., $\hat{x}(z; \theta)$. When using an FPI layer, the first equation in (3) is repeated until convergence to find the output $\hat{x}$. We may use acceleration techniques such as the Anderson acceleration [20] for faster convergence. Note that there is no apparent relation between the shapes of $x_n$ and $z$; hence the sizes of the input and output of an FPI layer do not have to be the same.

3.2 Differentiation of the FPI layer

Similar to other network layers, learning of $F$ is performed by updating $\theta$ based on backpropagation. For this, the derivatives of the FPI layer have to be calculated. One simple way to compute the gradients is to construct a computational graph for all the iterations up to the fixed point $\hat{x}$. However, this method is not only time consuming but also requires a lot of memory.

In this section, we show that the derivative of the entire FPI layer depends only on the fixed point $\hat{x}$. In other words, all the $x_n$ before convergence are actually not needed in the computation of the derivatives. Hence, we can retain only the value of $\hat{x}$ to perform backpropagation, and consider the entire $F$ as a node in the computational graph.

Note that $\hat{x} = g(\hat{x}, z; \theta)$ is satisfied at the fixed point $\hat{x}$. If we differentiate both sides of this equation w.r.t. $\theta$, we have

$\frac{\partial \hat{x}}{\partial \theta} = \frac{\partial g}{\partial \theta}(\hat{x}, z; \theta) + \frac{\partial g}{\partial x}(\hat{x}, z; \theta)\,\frac{\partial \hat{x}}{\partial \theta}.$   (5)

Here, $z$ is not differentiated because $z$ and $\theta$ are independent. Rearranging the above equation gives

$\frac{\partial \hat{x}}{\partial \theta} = \Big(I - \frac{\partial g}{\partial x}(\hat{x}, z; \theta)\Big)^{-1}\frac{\partial g}{\partial \theta}(\hat{x}, z; \theta),$   (6)

which confirms that the derivative of the output of $F(x_0, z; \theta) = \hat{x}$ depends only on the value of $\hat{x}$. One downside of the above derivation is that it requires the calculation of the Jacobians of $g$, which may need a lot of memory space (e.g., for convolutional layers). Moreover, calculating the inverse can also be a burden. In the next section, we provide an efficient way to resolve these issues.
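Before turning to backpropagation in detail, the following PyTorch-style sketch illustrates the forward pass described above. It is a simplified illustration under our own assumptions (g returns a tensor shaped like x, and the output shape is taken to match z only to simplify initialization), not the authors' released implementation; the class and argument names are hypothetical. The iterates are computed under no_grad, so only the fixed point needs to be kept, in line with the discussion above.

```python
import torch
import torch.nn as nn

class ForwardFPILayer(nn.Module):
    """Sketch of a forward FPI layer: iterate x <- g(x, z; theta) until a fixed point.

    g is assumed to be a contraction in x so the iteration converges; the output is
    assumed to have the same shape as z only so that the zero initialization is easy.
    """
    def __init__(self, g: nn.Module, max_iter: int = 200, tol: float = 1e-5):
        super().__init__()
        self.g, self.max_iter, self.tol = g, max_iter, tol

    def forward(self, z: torch.Tensor, x0: torch.Tensor = None) -> torch.Tensor:
        x = torch.zeros_like(z) if x0 is None else x0  # the starting point does not matter
        with torch.no_grad():  # intermediate iterates are not stored in the graph
            for _ in range(self.max_iter):
                x_next = self.g(x, z)
                if (x_next - x).norm() <= self.tol * (1.0 + x.norm()):
                    x = x_next
                    break
                x = x_next
        # Re-attach the fixed point to the graph through one extra step of g. This
        # one-step gradient is only an approximation; the paper's backward FPI layer
        # (next subsection) computes the full derivative in (6) at the fixed point.
        return self.g(x, z)
```

A small conv/ReLU stack could be passed in as g, mirroring the FPI_NN case discussed later.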
3.3 Backward FPI layer

To train the FPI layer, we need to obtain the gradient w.r.t. its parameter $\theta$. In contrast to RBP [3, 21], we propose a computationally efficient layer, called the backward FPI layer, that yields the gradient without explicitly calculating the Jacobian. Here, we assume that the FPI layer is in the middle of the network. If we denote the loss of the entire network by $L$, then what we need for backpropagation of the FPI layer is $\nabla_\theta L(\hat{x})$. According to (6), we have

$\nabla_\theta L = \Big(\frac{\partial \hat{x}}{\partial \theta}\Big)^{\top} \nabla_{\hat{x}} L = \Big(\frac{\partial g}{\partial \theta}(\hat{x}, z; \theta)\Big)^{\top}\Big(I - \frac{\partial g}{\partial x}(\hat{x}, z; \theta)\Big)^{-\top}\nabla_{\hat{x}} L.$   (7)

This section describes how to calculate the above equation efficiently. (7) can be divided into two steps as follows:

$c = \Big(I - \frac{\partial g}{\partial x}(\hat{x}, z; \theta)\Big)^{-\top}\nabla_{\hat{x}} L, \qquad \nabla_\theta L = \Big(\frac{\partial g}{\partial \theta}(\hat{x}, z; \theta)\Big)^{\top} c.$   (8)

Rearranging the first equation in (8) yields $c = \big(\frac{\partial g}{\partial x}(\hat{x}, z; \theta)\big)^{\top} c + \nabla_{\hat{x}} L$, which can be expressed in an iteration form, i.e.,

$c_{n+1} = \Big(\frac{\partial g}{\partial x}(\hat{x}, z; \theta)\Big)^{\top} c_n + \nabla_{\hat{x}} L,$   (9)

which corresponds to RBP. This iteration eliminates the need for the inverse calculation, but it still requires the Jacobian of $g$ w.r.t. $\hat{x}$. Here, we derive a new network layer, i.e., the backward FPI layer, that yields the gradient without an explicit calculation of the Jacobian. To this end, we define a new function $h$ as $h(x, z, c; \theta) = c^{\top} g(x, z; \theta)$; then (9) becomes

$c_{n+1} = \frac{\partial h}{\partial x}(\hat{x}, z, c_n; \theta) + \nabla_{\hat{x}} L.$   (10)

Note that the output of $h$ is a scalar. Here, we can consider $h$ as another small network containing only a single step of $g$ (with an additional inner product). The gradient of $h$ can be computed based on existing autograd functionalities with some additional considerations. Similarly, the second equation in (8) is expressed using $h$:

$\nabla_\theta L = \frac{\partial h}{\partial \theta}(\hat{x}, z, c; \theta),$   (11)

where $c$ is the fixed point obtained from the fixed-point iteration in (10). In this way, we can compute $\nabla_\theta L$ by (11) without any memory-intensive operations or Jacobian calculations. $c$ can be obtained by initializing it to an arbitrary value and repeating the above update until convergence.

Note that this backward procedure can itself be treated as a node in the computational graph, hence the name backward FPI layer. However, care should be taken in the above derivation because the differentiations w.r.t. $x$ and $\theta$ are partial differentiations: $x$ and $\theta$ might have some dependency on each other, which can disrupt the partial differentiation process if it is computed with a usual autograd framework. Let $\phi(a, b)$ hereafter denote the gradient operation in the conventional autograd framework that calculates the derivative of $a$ w.r.t. $b$, where $a$ and $b$ are both nodes in a computational graph. Here, $b$ can also be a set of nodes, in which case the output of $\phi$ is also a set of derivatives.

To resolve this issue, we implemented a general partial differentiation operator $P(s; r) = \partial r(s)/\partial s$, where $s$ is a set of nodes, $r$ is a function (a function object, to be precise), and $\partial r(s)/\partial s$ denotes the set of corresponding partial derivatives. Let $I(t)$ denote an operator that creates a set of leaf nodes by cloning the nodes in the set $t$ and detaching them from the computational graph. $P$ first creates an independent computational graph having leaf nodes $s' = I(s)$. These leaf nodes are then passed onto $r$ to yield $r' = r(s')$, and now we can differentiate $r'$ w.r.t. $s'$ using $\phi(r', s')$ to calculate the partial derivatives, because the nodes in $s'$ are independent of each other. Here, the resulting derivatives $\partial r'/\partial s'$ are also attached to the independent graph as the output of $\phi$. The $P$ operator then creates another set of leaf nodes $I(\partial r'/\partial s')$, which is attached to the original graph (where $s$ resides) as the output of $P$, i.e., $\partial r(s)/\partial s$. In this way, the whole process is completed and the partial differentiation can be performed accurately. If some of the partial derivatives are not needed in the process, we can simply omit them in the calculation of $\partial r'/\partial s'$.
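To show how (9)-(11) can be realized without forming a Jacobian, here is a hedged PyTorch-style sketch (again not the authors' code; backward_fpi and its arguments are hypothetical names). Detaching and cloning the fixed point before rebuilding a one-step graph of g plays the role of the operator P described above, and torch.autograd.grad with grad_outputs=c evaluates the vector-Jacobian products (dg/dx)^T c and (dg/dtheta)^T c, i.e., the partials of h = c^T g.

```python
import torch

def backward_fpi(g, x_hat, z, params, grad_L_xhat, max_iter=200, tol=1e-6):
    """Sketch of the backward FPI layer: solve c = (dg/dx)^T c + dL/dx_hat by iteration,
    then return dL/dtheta = (dg/dtheta)^T c, using only vector-Jacobian products."""
    x_hat = x_hat.detach().clone().requires_grad_(True)  # independent leaf, like I(s) in P
    g_out = g(x_hat, z)  # a one-step graph of g at the fixed point

    c = torch.zeros_like(grad_L_xhat)  # arbitrary initialization; iteration (10)
    for _ in range(max_iter):
        # (dg/dx)^T c, i.e., dh/dx with h = c^T g, without building the Jacobian
        jtc, = torch.autograd.grad(g_out, x_hat, grad_outputs=c, retain_graph=True)
        c_next = jtc + grad_L_xhat
        if (c_next - c).norm() <= tol * (1.0 + c.norm()):
            c = c_next
            break
        c = c_next

    # Equation (11): dL/dtheta = dh/dtheta(x_hat, z, c; theta), again a single VJP.
    return torch.autograd.grad(g_out, params, grad_outputs=c)
```

In the paper, the iteration in (10) is itself implemented with another forward FPI layer built from the P and H operators below, which is what allows higher-order differentiation; this sketch only covers the first-order case.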
Note that the above independent graph is preserved for backpropagation. Let $H(v; u)$ be an operator that creates a new function object that calculates $\sum_i \langle v_i, u_i \rangle$, where the node $v_i$ is an element of the set $v$ and $u_i$ is one of the outputs of the function object $u$. In the backward path of $P$, the set $\delta$ of gradients passed onto $P$ by backpropagation is used to create a function object $\eta = H(\delta; \rho)$, where $\rho(s)$ is a function object that calculates $P(s; r) = \partial r(s)/\partial s$. Then, the backpropagated gradients for $s$ can be calculated with another $P$ operation, i.e., $P(\delta \cup s; \eta)$ (here, the derivatives for $\delta$ do not need to be calculated). In practice, the independent graph created in the forward path is reused for $\rho$ in the calculation of $\eta$.

The backward FPI layer can be highly modularized with the above operators, i.e., $P$, $H$, and a plus operator can construct (10) and (11) entirely, and the iteration of (10) can be implemented with another forward FPI layer. This allows multiple differentiations of the forward and backward FPI layers. A picture depicting all the above processes is provided in the supplementary material. All the forward and backward layers are implemented at a high level of abstraction, and therefore they can easily be applied to practical tasks by changing the structure of $g$ to one that is suitable for each task.

3.4 Convergence of the FPI layer

The forward path of the FPI layer converges if the bounded Lipschitz assumption holds. For example, to make a fully connected layer a contraction mapping, simply dividing the weight by a number greater than the maximum singular value of the weight matrix suffices. In practice, we empirically found that setting the initial values of the weights $\theta$ to small values is enough to keep $g$ a contraction mapping throughout the training procedure.
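As a small illustration of the convergence condition above (a sketch under our own assumptions, not the authors' code), the helper below rescales a linear layer's weight by its largest singular value so that the map x -> Wx + b has a Lipschitz constant of at most k < 1. PyTorch's built-in torch.nn.utils.spectral_norm offers a related reparameterization.

```python
import torch
import torch.nn as nn

def make_contractive_(layer: nn.Linear, k: float = 0.9) -> nn.Linear:
    """Rescale a linear layer in place so that x -> W x + b is a contraction with constant <= k."""
    with torch.no_grad():
        sigma = torch.linalg.matrix_norm(layer.weight, ord=2)  # largest singular value of W
        if sigma > k:
            layer.weight.mul_(k / sigma)  # scale so the new spectral norm equals k
    return layer

# Usage sketch: make the update module contractive before wrapping it in an FPI layer.
g_linear = make_contractive_(nn.Linear(64, 64), k=0.9)
```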
Convergence of the backward FPI layer. The backward FPI layer is composed of a linear mapping based on the Jacobian $\partial g/\partial x$ at $\hat{x}$. Convergence of the backward FPI layer can be confirmed by the following proposition.

Proposition 1. If $g$ is a contraction mapping, the backward FPI layer (Eq. (9)) converges to a unique point.

Proof. For simplicity, we omit $z$ and $\theta$ from $g$. By the definition of the contraction mapping and the assumption of an arbitrary norm metric, $\frac{\|g(x_1) - g(x_2)\|}{\|x_1 - x_2\|} \le k$ is satisfied for all $x_1$ and $x_2$ ($0 \le k < 1$). For a unit vector $v$, i.e., $\|v\| = 1$ for the aforementioned norm, and a scalar $t$, let $x_2 = x_1 + tv$. Then, the above inequality becomes $\frac{\|g(x_1 + tv) - g(x_1)\|}{|t|} \le k$. For another vector $u$ with $\|u\|_* \le 1$, where $\|\cdot\|_*$ indicates the dual norm, we have

$\frac{u^{\top}\big(g(x_1 + tv) - g(x_1)\big)}{|t|} \le \frac{\|g(x_1 + tv) - g(x_1)\|}{|t|} \le k$   (12)

based on the definition of the dual norm. This indicates that

$\lim_{t \to 0^{+}} \frac{u^{\top}\big(g(x_1 + tv) - g(x_1)\big)}{|t|} = \nabla_v (u^{\top} g)(x_1) \le k.$   (13)

According to the chain rule, $\nabla(u^{\top} g) = (u^{\top} J_g)^{\top}$, where $J_g$ is the Jacobian of $g$. That gives

$\nabla_v (u^{\top} g)(x_1) = \big(\nabla(u^{\top} g)(x_1)\big)^{\top} v = u^{\top} J_g(x_1)\, v \le k.$   (14)

Let $x_1 = \hat{x}$; then $u^{\top} J_g(\hat{x})\, v \le k$ for all $u$, $v$ that satisfy $\|u\|_* \le 1$ and $\|v\| = 1$. Therefore,

$\|J_g(\hat{x})\| = \sup_{\|v\|=1} \|J_g(\hat{x})\, v\| = \sup_{\|v\|=1,\ \|u\|_* \le 1} u^{\top} J_g(\hat{x})\, v \le k < 1,$   (15)

which indicates that the linear mapping by the weight $J_g(\hat{x})$ is a contraction mapping. By the Banach fixed-point theorem, the backward FPI layer converges to a unique fixed point.

3.5 Two representative cases of the FPI layer

As mentioned before, FPI can take a wide variety of forms. We present two representative methods that are easy to apply to practical problems.
FPI_NN: The most intuitive way to use the FPI layer is to set $g$ to an arbitrary neural network module. In FPI_NN, the input variable recursively enters the same network module until convergence. $g$ can be composed of layers that are widely used in deep networks, such as convolution, ReLU, and linear layers. FPI_NN can perform more complicated behaviors with the same number of parameters than using $g$ directly without FPI, as demonstrated in the experiments section.

FPI_GD: The gradient descent method is a perfect example for the FPI layer. This can be used for efficient implementations of energy function networks like ICNN [5]. Unlike a typical network, which obtains the answer directly as its output (i.e., $f(a; \theta)$ is the answer to the query $a$), an energy function network retrieves the answer by optimizing an input variable of the network (i.e., $\arg\min_x f(x, a; \theta)$ becomes the answer). The easiest way to optimize the network $f$ is gradient descent ($x_{n+1} = x_n - \gamma \nabla f(x_n, a; \theta)$). This is a form of FPI, and the fixed point $\hat{x}$ is the optimal point of $f$, i.e., $\hat{x} = \arg\min_x f(x, a; \theta)$. In the case of a single FPI layer network with FPI_GD, $\hat{x}$ becomes the final output of the network. Accordingly, this output is fed into the final loss function $L(x^*, \hat{x})$ to train the parameter $\theta$ during the training procedure. This behavior conforms to that of an energy function network. However, unlike the existing methods, the proposed method can be trained easily with the universal backpropagation formula. Therefore, the proposed FPI layer can be an effective alternative for training energy function networks. An advantage of FPI_GD is that it can easily satisfy the bounded Lipschitz condition by adjusting the step size $\gamma$.

4 Experiments

Since several studies [3, 21, 19, 6] have already shown that RBP-based algorithms require only a constant amount of memory, we omit memory-related experiments. Instead, we focus on applying the proposed method to practical tasks. It is worth noting that both the forward and backward FPI layers are highly modularized, and exactly the same implementations were shared across all the experiments without any alteration. The only difference was the choice of $g$, where we could simply plug in its functional definition, which shows the efficiency of the proposed framework. In the image denoising experiment, we compare the performance of the FPI layer to a non-FPI network that has the same structure as $g$. In the optical flow problem, a relatively very small FPI layer is attached at the end of FlowNet [12] to show its effectiveness. For all experiments, the detailed structure of $g$ and the hyperparameters for training are described in the supplementary material. Results on the multi-label classification problem show that the FPI layer is superior in performance compared to existing state-of-the-art algorithms. All training was performed using the Adam [17] optimizer.

4.1 Image denoising

Here, we compare the image denoising performance for gray images perturbed by Gaussian noise with variance $\sigma$. Image denoising has traditionally been solved with iterative, numerical algorithms, hence an iterative structure like the proposed FPI layer can be an appropriate choice for the problem. To generate the image samples, we cropped the images in the Flying Chairs dataset [12] and converted them to gray scale (400 images for training and 100 images for testing). We constructed a single FPI_NN layer network for this experiment.
For comparison, we also constructed a feedforward network that has the same structure as $g$. The performance is reported in terms of peak signal-to-noise ratio (PSNR) in Table 1.

Table 1: Denoising performance (PSNR, higher is better).
Method      | σ = 15 | σ = 20 | σ = 25
Feedforward | 32.18  | 30.44  | 29.09
FPI_NN      |        |        |

Table 1 shows that the single FPI layer network outperforms the feedforward network in all experiments. Note that the performance gap between the two algorithms is larger in more noisy circumstances. Since both networks are trained to yield the best performance in their given settings, this confirms that a structure with repeated operations can be more suitable for this type of problem. An advantage of the proposed FPI layer here is that there is no explicit calculation of the Jacobian, which can be quite large in this image-based problem, even though there was no specialized component except the bare definition of $g$, thanks to the highly modularized nature of the layer. Examples of image denoising results are shown in the supplementary material.

4.2 Optical flow

Optical flow is one of the major research areas of computer vision that aims to acquire motions by matching pixels in two images. We demonstrate the effectiveness of the FPI layer by a simple experiment, where the layer is attached at the end of FlowNet [12]. The Flying Chairs dataset [12] was used with the original split, which has 22,232 training and 640 test samples. In this case, the FPI layer plays the role of post-processing. We attached a very simple FPI layer consisting of conv/deconv layers and recorded the average end point error (aEPE) per epoch, as shown in Figure 1. Although the number of additional parameters is extremely small (less than 0.01%) and the computation time is nearly the same as that of the original FlowNet, it shows a noticeable performance improvement.

[Figure 1: Average EPE per training epoch (lower is better); curves for FlowNet and FlowNet+FPI.]

Table 2: F1 score of multi-label text classification (higher is better).
Method                  | F1 score
MLP [9]                 | 38.9
Feedforward net [5]     | 39.6
SPEN [9]                | 42.2
ICNN [5]                | 41.5
DVN (GT) [8]            | 42.9
DVN (Adversarial) [13]  |
FPI_GD layer (ours)     | 43.2
FPI_NN layer (ours)     | 43.4
4.3 Multi-label classification

The multi-label text classification dataset (Bibtex) was introduced in [15]. The goal of the task is to find the correlation between the data and the multi-label features. Both the data and the features are binary, with 1836 indicators and 159 labels, respectively. The numbers of indicators and labels differ for each sample, and the number of labels is unknown during the evaluation process. We used the same train and test split as in [15] and evaluated the F1 scores. Here, we used two single FPI layer networks with FPI_GD and FPI_NN. We set $g$ of FPI_NN and $f$ of FPI_GD to similar structures, which contain one or two fully-connected layers and activation functions. As mentioned, the detailed structures of the networks are described in the supplementary material. Table 2 shows the F1 scores. Here, DVN (Adversarial) achieves the best performance; however, it generates adversarial samples for data augmentation. Despite their simple structures, our algorithms perform the best among those using only the training data, which confirms the effectiveness of the proposed method.

5 Conclusion

This paper proposed a novel architecture that uses the fixed-point iteration as a layer of a neural network. The backward FPI layer was also proposed to backpropagate through the FPI layer efficiently. We proved that both the forward and backward FPI layers are guaranteed to converge under mild conditions. All components are highly modularized so that we can efficiently apply the FPI layer to practical tasks by only changing the structure of $g$. Two representative cases of the FPI layer (FPI_NN and FPI_GD) have been introduced. Experiments have shown that our method has advantages for several problems compared to the feedforward network. For some problems like denoising, the iterative structure of the FPI layer can be more helpful, and in some other problems, it can be used to refine the performance of an existing method. Finally, we have also shown in the multi-label classification example that the FPI layer can achieve state-of-the-art performance with a very simple structure.

Acknowledgements
This work was supported in part by the Next-Generation ICD Program through NRF funded by the Ministry of S&ICT [2017M3C4A7077582], and in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1C1C1012479).
Broader Impact
This work does not present any foreseeable societal consequence for now because it proposes theoretical ideas that can be generally applied to various deep learning structures. However, our method can provide a new direction for various fields of machine learning or computer vision. Recently, a huge portion of new techniques has been developed based on deep learning in various research fields. In many cases, this leads to rewriting a large part of the traditional methodologies, because deep networks bear quite different structures from traditional algorithms. During the process, much of the conventional wisdom found in existing theories is being re-discovered from raw datasets. This induces a high economic cost in developing new technologies. Our method can provide an alternative to this trend by incorporating the existing mechanisms in many iterative algorithms, which can reduce the development costs. There have already been many studies that combine deep learning with more complicated models such as SMPL; however, one usually has to derive the backpropagation formula separately for each method, which introduces considerable difficulties and, as a result, development costs. Our method, in contrast, can be applied universally to many types of iterative algorithms, so the consilience between various models from different fields can be stimulated. Another possible consequence of the proposed method is that it might also expand the application of deep learning in non-GPU environments. As demonstrated in the experiments, the introduction of an FPI operation in a deep network can achieve similar performance with a much simpler structure, at the expense of an iterative calculation. This can be helpful for using deep networks in many resource-limited environments and may accelerate the trend of ubiquitous deep learning.
References

[1] Agrawal, Akshay, Amos, Brandon, Barratt, Shane, Boyd, Stephen, Diamond, Steven, and Kolter, J Zico. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, pp. 9558-9570, 2019.
[2] Agrawal, Akshay, Barratt, Shane, Boyd, Stephen, Busseti, Enzo, and Moursi, Walaa M. Differentiating through a conic program. arXiv preprint arXiv:1904.09043, 2019.
[3] Almeida, Luis B. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In IEEE International Conference on Neural Networks, 1987, pp. 609-618, 1987.
[4] Amos, Brandon and Kolter, J Zico. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 136-145. JMLR.org, 2017.
[5] Amos, Brandon, Xu, Lei, and Kolter, J Zico. Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 146-155. JMLR.org, 2017.
[6] Bai, Shaojie, Kolter, J Zico, and Koltun, Vladlen. Deep equilibrium models. In Advances in Neural Information Processing Systems, pp. 688-699, 2019.
[7] Banach, Stefan. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fund. Math, 3(1):133-181, 1922.
[8] Beardsell, Philippe and Hsu, Chih-Chao. Structured Prediction with Deep Value Networks, 2020. URL https://github.com/philqc/deep-value-networks-pytorch.
[9] Belanger, David and McCallum, Andrew. Structured prediction energy networks. In International Conference on Machine Learning, pp. 983-992, 2016.
[10] Burden, Richard and Faires, JD. Numerical Analysis. Cengage Learning, 2004.
[11] Djolonga, Josip and Krause, Andreas. Differentiable learning of submodular models. In Advances in Neural Information Processing Systems, pp. 1013-1023, 2017.
[12] Fischer, Philipp, Dosovitskiy, Alexey, Ilg, Eddy, Häusser, Philip, Hazırbaş, Caner, Golkov, Vladimir, Van der Smagt, Patrick, Cremers, Daniel, and Brox, Thomas. FlowNet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852, 2015.
[13] Gygli, Michael, Norouzi, Mohammad, and Angelova, Anelia. Deep value networks learn to evaluate and iteratively refine structured outputs. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1341-1351. JMLR.org, 2017.
[14] Jiang, Borui, Luo, Ruixuan, Mao, Jiayuan, Xiao, Tete, and Jiang, Yuning. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784-799, 2018.
[15] Katakis, Ioannis, Tsoumakas, Grigorios, and Vlahavas, Ioannis. Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD, volume 18, pp. 5, 2008.
[16] Khamsi, Mohamed A and Kirk, William A. An Introduction to Metric Spaces and Fixed Point Theory, volume 53. John Wiley & Sons, 2011.
[17] Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, M, and Huang, F. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.
[19] Liao, Renjie, Xiong, Yuwen, Fetaya, Ethan, Zhang, Lisa, Yoon, KiJung, Pitkow, Xaq, Urtasun, Raquel, and Zemel, Richard. Reviving and improving recurrent back-propagation. In International Conference on Machine Learning, pp. 3082-3091, 2018.
[20] Peng, Yue, Deng, Bailin, Zhang, Juyong, Geng, Fanyu, Qin, Wenjie, and Liu, Ligang. Anderson acceleration for geometry optimization and physics simulation. ACM Transactions on Graphics (TOG), 37(4):1-14, 2018.
[21] Pineda, Fernando J. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19):2229, 1987.
[22] Tsochantaridis, Ioannis, Hofmann, Thomas, Joachims, Thorsten, and Altun, Yasemin. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 104, 2004.
[23] Wang, Po-Wei, Donti, Priya L, Wilder, Bryan, and Kolter, Zico. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. arXiv preprint arXiv:1905.12149, 2019.