Differentiable Forward and Backward Fixed-Point Iteration Layers
Younghan Jeon†∗   Minsik Lee‡∗   Jin Young Choi†
[email protected]   [email protected]   [email protected]
† Department of Electrical and Computer Engineering, ASRI, Seoul National University
‡ Division of Electrical Engineering, Hanyang University
∗ Authors contributed equally.

Abstract
Recently, several studies proposed methods to utilize some classes of optimization problems in designing deep neural networks to encode constraints that conventional layers cannot capture. However, these methods are still in their infancy and require special treatments, such as analyzing the KKT conditions, for deriving the backpropagation formula. In this paper, we propose a new layer formulation called the fixed-point iteration (FPI) layer that facilitates the use of more complicated operations in deep networks. The backward FPI layer is also proposed for backpropagation, which is motivated by the recurrent back-propagation (RBP) algorithm. In contrast to RBP, however, the backward FPI layer yields the gradient by a small network module without an explicit calculation of the Jacobian. In actual applications, both the forward and backward FPI layers can be treated as nodes in the computational graph. All components in the proposed method are implemented at a high level of abstraction, which allows efficient higher-order differentiations on the nodes. In addition, we present two practical methods of the FPI layer, FPI_NN and FPI_GD, where the update operation of FPI is a small neural network module and a single gradient descent step based on a learnable cost function, respectively. FPI_NN is intuitive, simple, and fast to train, while FPI_GD can be used for efficient training of energy networks that have been recently studied. While RBP and its related studies have not been applied to practical examples, our experiments show that the FPI layer can be successfully applied to real-world problems such as image denoising, optical flow, and multi-label classification.
1 Introduction

Recently, several papers proposed to compose a deep neural network with more complicated algorithms rather than with the simple operations that had been used previously. For example, there have been methods using certain types of optimization problems in deep networks, such as differentiable optimization layers [4] and energy function networks [5, 9]. These methods can be used to introduce a prior in a deep network and provide a possibility of bridging the gap between deep learning and some of the traditional methods. However, they are still premature and require non-trivial efforts to implement in actual applications. Especially, the backpropagation formula has to be derived explicitly for each different formulation based on some criteria like the KKT conditions. This limits the practicality of the approaches since there can be numerous different formulations depending on the actual problems.

Meanwhile, there has been an algorithm called recurrent back-propagation (RBP), proposed by Almeida [3] and Pineda [21] several decades ago. RBP is a method to train an RNN that converges to a steady state. The advantages of RBP are that it can be applied universally to most operations that consist of repeated computations and that the whole process can be summarized by a single update equation. Even with its long history, however, RBP and related studies [6, 19] have been tested only for verifying the theoretical concept, and there has been no example that applied these methods to a practical task. Moreover, there have been no studies using RBP in conjunction with other neural network components to verify its effect in more complex settings.

In this paper, to facilitate the use of more complicated operations in deep networks, we introduce a new layer formulation that can be practically implemented and trained based on RBP with some additional considerations. To this end, we employ the fixed-point iteration (FPI), which is the basis of many numerical algorithms including most gradient-based optimizations, as a layer of a neural network. In the FPI layer, the layer's input and its weights are used to define an update equation, and the output of the layer is the fixed point of the update equation. Under mild conditions, the FPI layer is differentiable and the derivative depends only on the fixed point, which is much more efficient than adding all the individual iterations to the computational graph.

We also propose a backpropagation method called the backward FPI layer, based on RBP [3, 21], to compute the derivative of the FPI layer efficiently. We prove that if the aforementioned conditions for the FPI layer hold, then the backward FPI layer also converges. In contrast to RBP, the backward FPI layer yields the gradient by a small network module, which allows us to avoid the explicit calculation of the Jacobian. In other words, we do not need a separate derivation for the backpropagation formula and can utilize existing autograd functionalities. Especially, we provide a modularized implementation of the partial differentiation operation, which is essential in the backward FPI layer but is absent in regular autograd libraries, based on an independent computational graph. This makes the proposed method very simple to apply to various practical applications. Since FPI covers many different types of numerical algorithms as well as optimization problems, there are a lot of potential applications for the proposed method.
The FPI layer is highly modularized, so it can be easily used together with other existing layers such as convolution, ReLU, etc., and it has a richer representation power than a feedforward layer with the same number of weights. The contributions of this paper are summarized as follows.

• We propose a method to use an FPI as a layer of a neural network. The FPI layer can be utilized to incorporate the mechanisms of conventional iterative algorithms, such as numerical optimization, into deep networks. Unlike other existing layers based on differentiable optimization problems, the implementation is much simpler and the backpropagation formula can be universally derived.

• For backpropagation, the backward FPI layer is proposed based on RBP to compute the gradient efficiently. Under mild conditions, we show that both the forward and backward FPI layers are guaranteed to converge. All components are highly modularized, and a general partial differentiation tool is developed so that the FPI layer can be used in various circumstances without any modification.

• Two types of FPI layers (FPI_NN, FPI_GD) are presented. The proposed networks based on the FPI layers are applied to practical tasks such as image denoising, optical flow, and multi-label classification, which have been largely absent in existing RBP-based studies, and show good performance.

The remainder of this paper is organized as follows. We first introduce related works in Section 2. The proposed FPI layer is explained in Section 3, and the experimental results follow in Section 4. Finally, we conclude the paper in Section 5. The code will be available upon publication.
2 Related work

Fixed-point iteration:
For a given function $g$ and a sequence of vectors $\{x_n \in \mathbb{R}^d\}$, the fixed-point iteration [10] is defined by the following update equation:

$x_{n+1} = g(x_n), \quad n = 0, 1, 2, \cdots$   (1)

If the sequence converges, its limit is a fixed point $\hat{x}$ of $g$, satisfying $\hat{x} = g(\hat{x})$. The gradient descent method ($x_{n+1} = x_n - \gamma \nabla f(x_n)$) is a popular example of fixed-point iteration. Many numerical algorithms are based on fixed-point iteration, and there are also many examples in machine learning. Here are some important concepts about fixed-point iteration.

Definition 1 (Contraction mapping) [16]. On a metric space $(X, d)$, the function $f: X \to X$ is a contraction mapping if there is a real number $0 \le k < 1$ that satisfies the following inequality for all $x_1$ and $x_2$ in $X$:

$d\big(f(x_1), f(x_2)\big) \le k \cdot d(x_1, x_2).$   (2)

The smallest $k$ that satisfies the above condition is called the Lipschitz constant of $f$. The distance metric is defined to be an arbitrary norm $\|\cdot\|$ in this paper. Based on the above definition, the Banach fixed-point theorem [7] states the following.

Theorem 1 (Banach fixed-point theorem). A contraction mapping has exactly one fixed point, and it can be found by starting with any initial point and iterating the update equation until convergence.
Therefore, if $g$ is a contraction mapping, the iteration converges to a unique point $\hat{x}$ regardless of the starting point $x_0$. The above concepts are important in deriving the proposed FPI layer in this paper.
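To make these concepts concrete, the following is a minimal NumPy sketch (ours, not from the paper) in which the update map g is a single gradient-descent step on a convex quadratic. With a sufficiently small step size, g is a contraction, so by the Banach fixed-point theorem the iterates converge to the unique fixed point from any starting point; this is also the kind of update used later by the FPI_GD variant. The step size, tolerance, and matrix below are illustrative choices.

```python
import numpy as np

# f(x) = 0.5 * x^T A x - b^T x with A positive definite, so grad f(x) = A x - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
gamma = 0.2  # step size; g is a contraction since gamma < 2 / lambda_max(A)

def g(x):
    """One gradient-descent step, used as the fixed-point update map."""
    return x - gamma * (A @ x - b)

x = np.zeros(2)  # arbitrary starting point x_0
for _ in range(10_000):
    x_next = g(x)
    if np.linalg.norm(x_next - x) < 1e-10:  # converged to the (numerical) fixed point
        x = x_next
        break
    x = x_next

print(x)                      # fixed point x_hat, satisfying x_hat = g(x_hat)
print(np.linalg.solve(A, b))  # equals argmin_x f(x); matches x_hat
```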
Energy function networks: Scalar-valued networks that estimate energy (or error) functions have recently attracted considerable research interest. The energy function network (EFN) has a different structure from general feed-forward neural networks, and the concept was first proposed in [18]. After training an EFN for a certain task, the answer to a test sample is obtained by finding the input of the trained EFN that minimizes the network's output. The structured prediction energy network (SPEN) [9] performs gradient descent on an energy function network to find the solution, and a structured support vector machine [22] loss is applied to the obtained solution. The input convex neural networks (ICNNs) [5] are defined in a specific way so that the networks have convex structures with respect to (w.r.t.) the inputs, and their learning and inference are performed by the entropy method, which is derived based on the KKT optimality conditions. The deep value networks [13] and the IoU-Net [14] directly learn loss metrics such as the intersection over union (IoU) of bounding boxes and then perform inference by gradient-based optimization methods.

Although the above approaches provide novel ways of utilizing neural networks in optimization frameworks, they have not been combined with other existing deep network components to verify their effects in more complicated problems. Moreover, they are mostly limited to certain types of problems and require complicated learning processes. Our method can be applied to broader situations than EFN approaches, and these approaches can be equivalently implemented by the proposed method once the update equation for the optimization problem is derived.
Differentiable optimization layers:
Recently, a few papers using optimization problems as a layer of a deep learning architecture have been proposed. Such a structure can contain a more complicated behavior in one layer than the usual layers in neural networks, and can potentially reduce the depth of the network. OptNet [4] presents how to use a quadratic program (QP) as a layer of a neural network; it also uses the KKT conditions to compute the derivative of the solution of the QP. Agrawal et al. [1] propose an approach to differentiate disciplined convex programs, which are a subclass of convex optimization problems. There are a few other studies that differentiate optimization problems such as submodular models [11], cone programs [2], semidefinite programs [23], and so on. However, most of them have limited applications, and users need to adapt their problems to the rigid problem settings. On the other hand, our method makes it easy to use a large class of iterative algorithms as a network layer, which also includes the differentiable optimization problems.
Recurrent back-propagation:
RBP is a method to train a special case of RNN, proposed by Almeida [3] and Pineda [21]. RBP computes the gradient of the steady state of an RNN with constant memory. Although RBP has great potential, it is rarely used in practical problems of deep learning. Some artificial experiments showing its memory efficiency were performed, but it was difficult to apply to complex and practical tasks. Recently, Liao et al. [19] tried to revive RBP using the conjugate gradient method and the Neumann series. However, both the forward and backward passes use a fixed number of steps (maximum 100), which might not be sufficient for convergence in practical problems. Also, if the forward pass does not converge, the equilibrium point is meaningless, so it can be unstable to train the network using the unconverged final point, which is a problem not addressed in that paper. Deep equilibrium models (DEQ) [6] find the equilibrium points of a deep sequence model via an existing root-finding algorithm. Then, for back-propagation, they compute the gradient of the equilibrium point by another root-finding method. In short, both the forward and backward passes are implemented via quasi-Newton methods. DEQ can also be evaluated with constant memory, but it can only model sequential (temporal) data, and the aforementioned convergence issues still exist.

RBP-based methods mainly perform experiments to verify theoretical concepts and have not been applied to practical examples. Our work incorporates the concept of RBP in the FPI layer to apply complicated iterative operations in deep networks, and presents two types of algorithms accordingly. The proposed method is the first RBP-based method that shows successful applications to practical tasks in machine learning or computer vision, and it can be widely used to promote RBP-based research in the deep learning field.
3 Proposed method

The fixed-point iteration formula covers a wide variety of forms and can be applied to most iterative algorithms. Section 3.1 describes the basic structure and principles of the FPI layer. Sections 3.2 and 3.3 explain the differentiation of the layer for backpropagation. Section 3.4 describes the convergence of the FPI layer. Section 3.5 presents two exemplar cases of the FPI layer.
3.1 Basic structure of the FPI layer

Here we describe the basic operation of the FPI layer. Let $g(x, z; \theta)$ be a parametric function where $x$ and $z$ are vectors of real numbers and $\theta$ is the parameter. We assume that $g$ is differentiable w.r.t. $x$ and has a Lipschitz constant less than one w.r.t. $x$, so that the following fixed-point iteration converges to a unique point according to the Banach fixed-point theorem:

$x_{n+1} = g(x_n, z; \theta), \qquad \hat{x} = \lim_{n\to\infty} x_n.$   (3)

The FPI layer can be defined based on the above relations. The FPI layer $F$ receives an observed sample or the output of the previous layer as its input $z$, and yields the fixed point $\hat{x}$ of $g$ as the layer's output, i.e.,

$\hat{x} = g(\hat{x}, z; \theta) = F(x_0, z; \theta) = \lim_{n\to\infty} g^{(n)}(x_0, z; \theta) = (g \circ g \circ \cdots \circ g)(x_0, z; \theta),$   (4)

where $\circ$ indicates the function composition operator. The layer receives the initial point $x_0$ as well, but its actual value does not matter in the training procedure because $g$ has a unique fixed point. Hence, $x_0$ can be predetermined to any value such as a zero vector. Accordingly, we will often express $\hat{x}$ as a function of $z$ and $\theta$, i.e., $\hat{x}(z; \theta)$. When using an FPI layer, the first equation in (3) is repeated until convergence to find the output $\hat{x}$. We may use acceleration techniques such as the Anderson acceleration [20] for faster convergence. Note that there is no apparent relation between the shapes of $x_n$ and $z$; hence the sizes of the input and output of an FPI layer do not have to be the same.

3.2 Differentiation of the FPI layer

Similar to other network layers, learning of $F$ is performed by updating $\theta$ based on backpropagation. For this, the derivatives of the FPI layer have to be calculated. One simple way to compute the gradients is to construct a computational graph for all the iterations up to the fixed point $\hat{x}$. However, this method is not only time consuming but also requires a lot of memory.

In this section, we show that the derivative of the entire FPI layer depends only on the fixed point $\hat{x}$. In other words, all the $x_n$ before convergence are actually not needed in the computation of the derivatives. Hence, we can retain only the value of $\hat{x}$ to perform backpropagation, and consider the entire $F$ as a node in the computational graph.

Note that $\hat{x} = g(\hat{x}, z; \theta)$ is satisfied at the fixed point $\hat{x}$. If we differentiate both sides of this equation w.r.t. $\theta$, we have

$\frac{\partial \hat{x}}{\partial \theta} = \frac{\partial g}{\partial \theta}(\hat{x}, z; \theta) + \frac{\partial g}{\partial x}(\hat{x}, z; \theta)\,\frac{\partial \hat{x}}{\partial \theta}.$   (5)

Here, $z$ is not differentiated because $z$ and $\theta$ are independent. Rearranging the above equation gives

$\frac{\partial \hat{x}}{\partial \theta} = \Big(I - \frac{\partial g}{\partial x}(\hat{x}, z; \theta)\Big)^{-1}\frac{\partial g}{\partial \theta}(\hat{x}, z; \theta),$   (6)

which confirms that the derivative of the output of $F(x_0, z; \theta) = \hat{x}$ depends only on the value of $\hat{x}$. One downside of the above derivation is that it requires the calculation of the Jacobians of $g$, which may need a lot of memory space (e.g., for convolutional layers). Moreover, calculating the inverse can also be a burden. In the next section, we provide an efficient way to resolve these issues.
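Before turning to backpropagation in detail, the following PyTorch-style sketch illustrates the forward pass described above. It is a simplified illustration under our own assumptions (g returns a tensor shaped like x, and the output shape is taken to match z only to simplify initialization), not the authors' released implementation; the class and argument names are hypothetical. The iterates are computed under no_grad, so only the fixed point needs to be kept, in line with the discussion above.

```python
import torch
import torch.nn as nn

class ForwardFPILayer(nn.Module):
    """Sketch of a forward FPI layer: iterate x <- g(x, z; theta) until a fixed point.

    g is assumed to be a contraction in x so the iteration converges; the output is
    assumed to have the same shape as z only so that the zero initialization is easy.
    """
    def __init__(self, g: nn.Module, max_iter: int = 200, tol: float = 1e-5):
        super().__init__()
        self.g, self.max_iter, self.tol = g, max_iter, tol

    def forward(self, z: torch.Tensor, x0: torch.Tensor = None) -> torch.Tensor:
        x = torch.zeros_like(z) if x0 is None else x0  # the starting point does not matter
        with torch.no_grad():  # intermediate iterates are not stored in the graph
            for _ in range(self.max_iter):
                x_next = self.g(x, z)
                if (x_next - x).norm() <= self.tol * (1.0 + x.norm()):
                    x = x_next
                    break
                x = x_next
        # Re-attach the fixed point to the graph through one extra step of g. This
        # one-step gradient is only an approximation; the paper's backward FPI layer
        # (next subsection) computes the full derivative in (6) at the fixed point.
        return self.g(x, z)
```

A small conv/ReLU stack could be passed in as g, mirroring the FPI_NN case discussed later.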
3.3 Backward FPI layer

To train the FPI layer, we need to obtain the gradient w.r.t. its parameter $\theta$. In contrast to RBP [3, 21], we propose a computationally efficient layer, called the backward FPI layer, that yields the gradient without explicitly calculating the Jacobian. Here, we assume that the FPI layer is in the middle of the network. If we denote the loss of the entire network by $L$, then what we need for backpropagation of the FPI layer is $\nabla_\theta L(\hat{x})$. According to (6), we have

$\nabla_\theta L = \Big(\frac{\partial \hat{x}}{\partial \theta}\Big)^{\top} \nabla_{\hat{x}} L = \Big(\frac{\partial g}{\partial \theta}(\hat{x}, z; \theta)\Big)^{\top}\Big(I - \frac{\partial g}{\partial x}(\hat{x}, z; \theta)\Big)^{-\top}\nabla_{\hat{x}} L.$   (7)

This section describes how to calculate the above equation efficiently. (7) can be divided into two steps as follows:

$c = \Big(I - \frac{\partial g}{\partial x}(\hat{x}, z; \theta)\Big)^{-\top}\nabla_{\hat{x}} L, \qquad \nabla_\theta L = \Big(\frac{\partial g}{\partial \theta}(\hat{x}, z; \theta)\Big)^{\top} c.$   (8)

Rearranging the first equation in (8) yields $c = \big(\frac{\partial g}{\partial x}(\hat{x}, z; \theta)\big)^{\top} c + \nabla_{\hat{x}} L$, which can be expressed in an iteration form, i.e.,

$c_{n+1} = \Big(\frac{\partial g}{\partial x}(\hat{x}, z; \theta)\Big)^{\top} c_n + \nabla_{\hat{x}} L,$   (9)

which corresponds to RBP. This iteration eliminates the need for the inverse calculation, but it still requires the Jacobian of $g$ w.r.t. $\hat{x}$. Here, we derive a new network layer, i.e., the backward FPI layer, that yields the gradient without an explicit calculation of the Jacobian. To this end, we define a new function $h$ as $h(x, z, c; \theta) = c^{\top} g(x, z; \theta)$; then (9) becomes

$c_{n+1} = \frac{\partial h}{\partial x}(\hat{x}, z, c_n; \theta) + \nabla_{\hat{x}} L.$   (10)

Note that the output of $h$ is a scalar. Here, we can consider $h$ as another small network containing only a single step of $g$ (with an additional inner product). The gradient of $h$ can be computed based on existing autograd functionalities with some additional considerations. Similarly, the second equation in (8) is expressed using $h$:

$\nabla_\theta L = \frac{\partial h}{\partial \theta}(\hat{x}, z, c; \theta),$   (11)

where $c$ is the fixed point obtained from the fixed-point iteration in (10). In this way, we can compute $\nabla_\theta L$ by (11) without any memory-intensive operations or Jacobian calculations. $c$ can be obtained by initializing it to an arbitrary value and repeating the above update until convergence.

Note that this backward procedure can itself be treated as a node in the computational graph, hence the name backward FPI layer. However, care should be taken in the above derivation because the differentiations w.r.t. $x$ and $\theta$ are partial differentiations: $x$ and $\theta$ might have some dependency on each other, which can disrupt the partial differentiation process if it is computed with a usual autograd framework. Let $\phi(a, b)$ hereafter denote the gradient operation in the conventional autograd framework that calculates the derivative of $a$ w.r.t. $b$, where $a$ and $b$ are both nodes in a computational graph. Here, $b$ can also be a set of nodes, in which case the output of $\phi$ is also a set of derivatives.

To resolve this issue, we implemented a general partial differentiation operator $P(s; r) = \partial r(s)/\partial s$, where $s$ is a set of nodes, $r$ is a function (a function object, to be precise), and $\partial r(s)/\partial s$ denotes the set of corresponding partial derivatives. Let $I(t)$ denote an operator that creates a set of leaf nodes by cloning the nodes in the set $t$ and detaching them from the computational graph. $P$ first creates an independent computational graph having leaf nodes $s' = I(s)$. These leaf nodes are then passed onto $r$ to yield $r' = r(s')$, and now we can differentiate $r'$ w.r.t. $s'$ using $\phi(r', s')$ to calculate the partial derivatives, because the nodes in $s'$ are independent of each other. Here, the resulting derivatives $\partial r'/\partial s'$ are also attached to the independent graph as the output of $\phi$. The $P$ operator then creates another set of leaf nodes $I(\partial r'/\partial s')$, which is attached to the original graph (where $s$ resides) as the output of $P$, i.e., $\partial r(s)/\partial s$. In this way, the whole process is completed and the partial differentiation can be performed accurately. If some of the partial derivatives are not needed in the process, we can simply omit them in the calculation of $\partial r'/\partial s'$.
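To show how (9)-(11) can be realized without forming a Jacobian, here is a hedged PyTorch-style sketch (again not the authors' code; backward_fpi and its arguments are hypothetical names). Detaching and cloning the fixed point before rebuilding a one-step graph of g plays the role of the operator P described above, and torch.autograd.grad with grad_outputs=c evaluates the vector-Jacobian products (dg/dx)^T c and (dg/dtheta)^T c, i.e., the partials of h = c^T g.

```python
import torch

def backward_fpi(g, x_hat, z, params, grad_L_xhat, max_iter=200, tol=1e-6):
    """Sketch of the backward FPI layer: solve c = (dg/dx)^T c + dL/dx_hat by iteration,
    then return dL/dtheta = (dg/dtheta)^T c, using only vector-Jacobian products."""
    x_hat = x_hat.detach().clone().requires_grad_(True)  # independent leaf, like I(s) in P
    g_out = g(x_hat, z)  # a one-step graph of g at the fixed point

    c = torch.zeros_like(grad_L_xhat)  # arbitrary initialization; iteration (10)
    for _ in range(max_iter):
        # (dg/dx)^T c, i.e., dh/dx with h = c^T g, without building the Jacobian
        jtc, = torch.autograd.grad(g_out, x_hat, grad_outputs=c, retain_graph=True)
        c_next = jtc + grad_L_xhat
        if (c_next - c).norm() <= tol * (1.0 + c.norm()):
            c = c_next
            break
        c = c_next

    # Equation (11): dL/dtheta = dh/dtheta(x_hat, z, c; theta), again a single VJP.
    return torch.autograd.grad(g_out, params, grad_outputs=c)
```

In the paper, the iteration in (10) is itself implemented with another forward FPI layer built from the P and H operators below, which is what allows higher-order differentiation; this sketch only covers the first-order case.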
Note that the above independent graph is preserved for backpropagation. Let $H(v; u)$ be an operator that creates a new function object that calculates $\sum_i \langle v_i, u_i \rangle$, where the node $v_i$ is an element of the set $v$ and $u_i$ is one of the outputs of the function object $u$. In the backward path of $P$, the set $\delta$ of gradients passed onto $P$ by backpropagation is used to create a function object $\eta = H(\delta; \rho)$, where $\rho(s)$ is a function object that calculates $P(s; r) = \partial r(s)/\partial s$. Then, the backpropagated gradients for $s$ can be calculated with another $P$ operation, i.e., $P(\delta \cup s; \eta)$ (here, the derivatives for $\delta$ do not need to be calculated). In practice, the independent graph created in the forward path is reused for $\rho$ in the calculation of $\eta$.

The backward FPI layer can be highly modularized with the above operators, i.e., $P$, $H$, and a plus operator can construct (10) and (11) entirely, and the iteration of (10) can be implemented with another forward FPI layer. This allows multiple differentiations of the forward and backward FPI layers. A picture depicting all the above processes is provided in the supplementary material. All the forward and backward layers are implemented at a high level of abstraction, and therefore they can easily be applied to practical tasks by changing the structure of $g$ to one that is suitable for each task.

3.4 Convergence of the FPI layer

The forward path of the FPI layer converges if the bounded Lipschitz assumption holds. For example, to make a fully connected layer a contraction mapping, simply dividing the weight by a number greater than the maximum singular value of the weight matrix suffices. In practice, we empirically found that setting the initial values of the weights $\theta$ to small values is enough to keep $g$ a contraction mapping throughout the training procedure.
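As a small illustration of the convergence condition above (a sketch under our own assumptions, not the authors' code), the helper below rescales a linear layer's weight by its largest singular value so that the map x -> Wx + b has a Lipschitz constant of at most k < 1. PyTorch's built-in torch.nn.utils.spectral_norm offers a related reparameterization.

```python
import torch
import torch.nn as nn

def make_contractive_(layer: nn.Linear, k: float = 0.9) -> nn.Linear:
    """Rescale a linear layer in place so that x -> W x + b is a contraction with constant <= k."""
    with torch.no_grad():
        sigma = torch.linalg.matrix_norm(layer.weight, ord=2)  # largest singular value of W
        if sigma > k:
            layer.weight.mul_(k / sigma)  # scale so the new spectral norm equals k
    return layer

# Usage sketch: make the update module contractive before wrapping it in an FPI layer.
g_linear = make_contractive_(nn.Linear(64, 64), k=0.9)
```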
Convergence of the backward FPI layer. The backward FPI layer is composed of a linear mapping based on the Jacobian $\partial g/\partial x$ at $\hat{x}$. Convergence of the backward FPI layer can be confirmed by the following proposition.

Proposition 1. If $g$ is a contraction mapping, the backward FPI layer (Eq. (9)) converges to a unique point.

Proof. For simplicity, we omit $z$ and $\theta$ from $g$. By the definition of the contraction mapping and the assumption of an arbitrary norm metric, $\frac{\|g(x_1) - g(x_2)\|}{\|x_1 - x_2\|} \le k$ is satisfied for all $x_1$ and $x_2$ ($0 \le k < 1$). For a unit vector $v$, i.e., $\|v\| = 1$ for the aforementioned norm, and a scalar $t$, let $x_2 = x_1 + tv$. Then, the above inequality becomes $\frac{\|g(x_1 + tv) - g(x_1)\|}{|t|} \le k$. For another vector $u$ with $\|u\|_* \le 1$, where $\|\cdot\|_*$ indicates the dual norm, we have

$\frac{u^{\top}\big(g(x_1 + tv) - g(x_1)\big)}{|t|} \le \frac{\|g(x_1 + tv) - g(x_1)\|}{|t|} \le k$   (12)

based on the definition of the dual norm. This indicates that

$\lim_{t \to 0^{+}} \frac{u^{\top}\big(g(x_1 + tv) - g(x_1)\big)}{|t|} = \nabla_v (u^{\top} g)(x_1) \le k.$   (13)

According to the chain rule, $\nabla(u^{\top} g) = (u^{\top} J_g)^{\top}$, where $J_g$ is the Jacobian of $g$. That gives

$\nabla_v (u^{\top} g)(x_1) = \big(\nabla(u^{\top} g)(x_1)\big)^{\top} v = u^{\top} J_g(x_1)\, v \le k.$   (14)

Let $x_1 = \hat{x}$; then $u^{\top} J_g(\hat{x})\, v \le k$ for all $u$, $v$ that satisfy $\|u\|_* \le 1$ and $\|v\| = 1$. Therefore,

$\|J_g(\hat{x})\| = \sup_{\|v\|=1} \|J_g(\hat{x})\, v\| = \sup_{\|v\|=1,\ \|u\|_* \le 1} u^{\top} J_g(\hat{x})\, v \le k < 1,$   (15)

which indicates that the linear mapping by the weight $J_g(\hat{x})$ is a contraction mapping. By the Banach fixed-point theorem, the backward FPI layer converges to a unique fixed point.

3.5 Two representative cases of the FPI layer

As mentioned before, FPI can take a wide variety of forms. We present two representative methods that are easy to apply to practical problems.
FPI_NN: The most intuitive way to use the FPI layer is to set $g$ to an arbitrary neural network module. In FPI_NN, the input variable recursively enters the same network module until convergence. $g$ can be composed of layers that are widely used in deep networks, such as convolution, ReLU, and linear layers. FPI_NN can perform more complicated behaviors with the same number of parameters than using $g$ directly without FPI, as demonstrated in the experiments section.

FPI_GD: The gradient descent method is a perfect example for the FPI layer. This can be used for efficient implementations of energy function networks like ICNN [5]. Unlike a typical network, which obtains the answer directly as its output (i.e., $f(a; \theta)$ is the answer to the query $a$), an energy function network retrieves the answer by optimizing an input variable of the network (i.e., $\arg\min_x f(x, a; \theta)$ becomes the answer). The easiest way to optimize the network $f$ is gradient descent ($x_{n+1} = x_n - \gamma \nabla f(x_n, a; \theta)$). This is a form of FPI, and the fixed point $\hat{x}$ is the optimal point of $f$, i.e., $\hat{x} = \arg\min_x f(x, a; \theta)$. In the case of a single FPI layer network with FPI_GD, $\hat{x}$ becomes the final output of the network. Accordingly, this output is fed into the final loss function $L(x^*, \hat{x})$ to train the parameter $\theta$ during the training procedure. This behavior conforms to that of an energy function network. However, unlike the existing methods, the proposed method can be trained easily with the universal backpropagation formula. Therefore, the proposed FPI layer can be an effective alternative for training energy function networks. An advantage of FPI_GD is that it can easily satisfy the bounded Lipschitz condition by adjusting the step size $\gamma$.

4 Experiments

Since several studies [3, 21, 19, 6] have already shown that RBP-based algorithms require only a constant amount of memory, we omit memory-related experiments. Instead, we focus on applying the proposed method to practical tasks. It is worth noting that both the forward and backward FPI layers are highly modularized, and exactly the same implementations were shared across all the experiments without any alteration. The only difference was the choice of $g$, where we could simply plug in its functional definition, which shows the efficiency of the proposed framework. In the image denoising experiment, we compare the performance of the FPI layer to a non-FPI network that has the same structure as $g$. In the optical flow problem, a relatively very small FPI layer is attached at the end of FlowNet [12] to show its effectiveness. For all experiments, the detailed structure of $g$ and the hyperparameters for training are described in the supplementary material. Results on the multi-label classification problem show that the FPI layer is superior in performance compared to existing state-of-the-art algorithms. All training was performed using the Adam [17] optimizer.

4.1 Image denoising

Here, we compare the image denoising performance for gray images perturbed by Gaussian noise with variance $\sigma$. Image denoising has traditionally been solved with iterative, numerical algorithms, hence an iterative structure like the proposed FPI layer can be an appropriate choice for the problem. To generate the image samples, we cropped the images in the Flying Chairs dataset [12] and converted them to gray scale (400 images for training and 100 images for testing). We constructed a single FPI_NN layer network for this experiment.
For comparison, we also constructed a feedforward network that has the same structure as $g$. The performance is reported in terms of peak signal-to-noise ratio (PSNR) in Table 1.

Table 1: Denoising performance (PSNR, higher is better).
Method      | σ = 15 | σ = 20 | σ = 25
Feedforward | 32.18  | 30.44  | 29.09
FPI_NN      |        |        |

Table 1 shows that the single FPI layer network outperforms the feedforward network in all experiments. Note that the performance gap between the two algorithms is larger in more noisy circumstances. Since both networks are trained to yield the best performance in their given settings, this confirms that a structure with repeated operations can be more suitable for this type of problem. An advantage of the proposed FPI layer here is that there is no explicit calculation of the Jacobian, which can be quite large in this image-based problem, even though there was no specialized component except the bare definition of $g$, thanks to the highly modularized nature of the layer. Examples of image denoising results are shown in the supplementary material.

4.2 Optical flow

Optical flow is one of the major research areas of computer vision that aims to acquire motions by matching pixels in two images. We demonstrate the effectiveness of the FPI layer by a simple experiment, where the layer is attached at the end of FlowNet [12]. The Flying Chairs dataset [12] was used with the original split, which has 22,232 training and 640 test samples. In this case, the FPI layer plays the role of post-processing. We attached a very simple FPI layer consisting of conv/deconv layers and recorded the average end point error (aEPE) per epoch, as shown in Figure 1. Although the number of additional parameters is extremely small (less than 0.01%) and the computation time is nearly the same as that of the original FlowNet, it shows a noticeable performance improvement.

[Figure 1: Average EPE per training epoch (lower is better); curves for FlowNet and FlowNet+FPI.]

Table 2: F1 score of multi-label text classification (higher is better).
Method                  | F1 score
MLP [9]                 | 38.9
Feedforward net [5]     | 39.6
SPEN [9]                | 42.2
ICNN [5]                | 41.5
DVN (GT) [8]            | 42.9
DVN (Adversarial) [13]  |
FPI_GD layer (ours)     | 43.2
FPI_NN layer (ours)     | 43.4
4.3 Multi-label classification

The multi-label text classification dataset (Bibtex) was introduced in [15]. The goal of the task is to find the correlation between the data and the multi-label features. Both the data and the features are binary, with 1836 indicators and 159 labels, respectively. The numbers of indicators and labels differ for each sample, and the number of labels is unknown during the evaluation process. We used the same train and test split as in [15] and evaluated the F1 scores. Here, we used two single FPI layer networks with FPI_GD and FPI_NN. We set $g$ of FPI_NN and $f$ of FPI_GD to similar structures, which contain one or two fully-connected layers and activation functions. As mentioned, the detailed structures of the networks are described in the supplementary material. Table 2 shows the F1 scores. Here, DVN (Adversarial) achieves the best performance; however, it generates adversarial samples for data augmentation. Despite their simple structures, our algorithms perform the best among those using only the training data, which confirms the effectiveness of the proposed method.

5 Conclusion

This paper proposed a novel architecture that uses the fixed-point iteration as a layer of a neural network. The backward FPI layer was also proposed to backpropagate through the FPI layer efficiently. We proved that both the forward and backward FPI layers are guaranteed to converge under mild conditions. All components are highly modularized so that we can efficiently apply the FPI layer to practical tasks by only changing the structure of $g$. Two representative cases of the FPI layer (FPI_NN and FPI_GD) have been introduced. Experiments have shown that our method has advantages for several problems compared to the feedforward network. For some problems like denoising, the iterative structure of the FPI layer can be more helpful, and in some other problems, it can be used to refine the performance of an existing method. Finally, we have also shown in the multi-label classification example that the FPI layer can achieve state-of-the-art performance with a very simple structure.

Acknowledgements
This work was supported in part by the Next-Generation ICD Program through NRF funded by the Ministry of S&ICT [2017M3C4A7077582], and in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1C1C1012479).
Broader Impact
This work does not present any foreseeable societal consequence for now because it proposes theoretical ideas that can be generally applied to various deep learning structures. However, our method can provide a new direction for various fields of machine learning or computer vision. Recently, a huge portion of new techniques has been developed based on deep learning in various research fields. In many cases, this leads to rewriting a large part of the traditional methodologies, because deep networks bear quite different structures from traditional algorithms. During the process, much of the conventional wisdom found in existing theories is being re-discovered from raw datasets. This induces a high economic cost in developing new technologies. Our method can provide an alternative to this trend by incorporating the existing mechanisms in many iterative algorithms, which can reduce the development costs. There have already been many studies that combine deep learning with more complicated models such as SMPL; however, one usually has to derive the backpropagation formula separately for each method, which introduces considerable difficulties and, as a result, development costs. Our method, in contrast, can be applied universally to many types of iterative algorithms, so the consilience between various models from different fields can be stimulated. Another possible consequence of the proposed method is that it might also expand the application of deep learning in non-GPU environments. As demonstrated in the experiments, the introduction of an FPI operation in a deep network can achieve similar performance with a much simpler structure, at the expense of an iterative calculation. This can be helpful for using deep networks in many resource-limited environments and may accelerate the trend of ubiquitous deep learning.
References

[1] Agrawal, Akshay, Amos, Brandon, Barratt, Shane, Boyd, Stephen, Diamond, Steven, and Kolter, J Zico. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, pp. 9558-9570, 2019.
[2] Agrawal, Akshay, Barratt, Shane, Boyd, Stephen, Busseti, Enzo, and Moursi, Walaa M. Differentiating through a conic program. arXiv preprint arXiv:1904.09043, 2019.
[3] Almeida, Luis B. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In IEEE International Conference on Neural Networks, 1987, pp. 609-618, 1987.
[4] Amos, Brandon and Kolter, J Zico. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 136-145. JMLR.org, 2017.
[5] Amos, Brandon, Xu, Lei, and Kolter, J Zico. Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 146-155. JMLR.org, 2017.
[6] Bai, Shaojie, Kolter, J Zico, and Koltun, Vladlen. Deep equilibrium models. In Advances in Neural Information Processing Systems, pp. 688-699, 2019.
[7] Banach, Stefan. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fund. Math, 3(1):133-181, 1922.
[8] Beardsell, Philippe and Hsu, Chih-Chao. Structured Prediction with Deep Value Networks, 2020. URL https://github.com/philqc/deep-value-networks-pytorch.
[9] Belanger, David and McCallum, Andrew. Structured prediction energy networks. In International Conference on Machine Learning, pp. 983-992, 2016.
[10] Burden, Richard and Faires, JD. Numerical Analysis. Cengage Learning, 2004.
[11] Djolonga, Josip and Krause, Andreas. Differentiable learning of submodular models. In Advances in Neural Information Processing Systems, pp. 1013-1023, 2017.
[12] Fischer, Philipp, Dosovitskiy, Alexey, Ilg, Eddy, Häusser, Philip, Hazırbaş, Caner, Golkov, Vladimir, Van der Smagt, Patrick, Cremers, Daniel, and Brox, Thomas. FlowNet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852, 2015.
[13] Gygli, Michael, Norouzi, Mohammad, and Angelova, Anelia. Deep value networks learn to evaluate and iteratively refine structured outputs. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1341-1351. JMLR.org, 2017.
[14] Jiang, Borui, Luo, Ruixuan, Mao, Jiayuan, Xiao, Tete, and Jiang, Yuning. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784-799, 2018.
[15] Katakis, Ioannis, Tsoumakas, Grigorios, and Vlahavas, Ioannis. Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD, volume 18, pp. 5, 2008.
[16] Khamsi, Mohamed A and Kirk, William A. An Introduction to Metric Spaces and Fixed Point Theory, volume 53. John Wiley & Sons, 2011.
[17] Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, M, and Huang, F. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.
[19] Liao, Renjie, Xiong, Yuwen, Fetaya, Ethan, Zhang, Lisa, Yoon, KiJung, Pitkow, Xaq, Urtasun, Raquel, and Zemel, Richard. Reviving and improving recurrent back-propagation. In International Conference on Machine Learning, pp. 3082-3091, 2018.
[20] Peng, Yue, Deng, Bailin, Zhang, Juyong, Geng, Fanyu, Qin, Wenjie, and Liu, Ligang. Anderson acceleration for geometry optimization and physics simulation. ACM Transactions on Graphics (TOG), 37(4):1-14, 2018.
[21] Pineda, Fernando J. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19):2229, 1987.
[22] Tsochantaridis, Ioannis, Hofmann, Thomas, Joachims, Thorsten, and Altun, Yasemin. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 104, 2004.
[23] Wang, Po-Wei, Donti, Priya L, Wilder, Bryan, and Kolter, Zico. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. arXiv preprint arXiv:1905.12149, 2019.