Min-Max-Plus Neural Networks
Ye Luo*
School of Informatics, Xiamen University, Xiamen, Fujian 361005, China
[email protected], [email protected]

Shiqing Fan
School of Informatics, Xiamen University, Xiamen, Fujian 361005, China
[email protected]

* This work is supported by the National Natural Science Foundation of China (Grant No. 61875169).
February 15, 2021
Abstract
We present a new model of neural networks called Min-Max-Plus Neural Networks (MMP-NNs) based on operations in tropical arithmetic. In general, an MMP-NN is composed of three types of alternately stacked layers, namely linear layers, min-plus layers and max-plus layers. Specifically, the latter two types of layers constitute the nonlinear part of the network, which is trainable and more sophisticated compared to the nonlinear part of conventional neural networks. In addition, we show that with higher capability of nonlinearity expression, MMP-NNs are universal approximators of continuous functions, even when the number of multiplication operations is tremendously reduced (possibly to none in certain extreme cases). Furthermore, we formulate the backpropagation algorithm for the training process of MMP-NNs and introduce a normalization algorithm to improve the rate of convergence in training.
1 Introduction

Conventional artificial neural networks typically have a fixed nonlinear activation function that applies to all neurons. Introducing a trainable nonlinear part into the network usually further enhances its fitting capability. For example, the seminal work of He et al. [1] introduced the Parametric ReLU, whose nonlinear activation carries trainable parameters, and showed for the first time that human-level performance on ImageNet classification (a top-5 error rate of about 5.1%) can be surpassed.

In this paper we introduce Min-Max-Plus Neural Networks (MMP-NNs), whose trainable nonlinear part is built from min-plus and max-plus layers, which only use additions and min/max operations. As a result, the number of multiplication operations can be tremendously reduced in the computation of such a network. In particular, using certain configurations, the number of multiplications can even be reduced to none while the network is still a universal approximator.
1.2 Related Work

Our work is closely related to the applications of tropical mathematics to the analysis of deep neural networks. At present, many studies have analyzed the mechanism of deep neural networks from different perspectives, such as the advantages of deep-structured networks over shallow-structured networks [3-5], the impact of activation functions on the expressive capability of networks [6], and explanations of the generalization capability of networks [7]. Methods that build analogies to neural networks in order to explain their characteristics, for example using tropical geometry to model the structure of deep neural networks, have also been proposed.

Zhang et al. [8] proposed to use the theory of tropical geometry to analyze deep neural networks in order to explain the characteristics of the neural network structure. In particular, they established a connection between feedforward neural networks with ReLU activation and tropical geometry for the first time, proving that this kind of neural network is equivalent to the family of tropical rational mappings. They also showed that a hidden layer of a feedforward ReLU neural network can be described by zonotopes, which serve as building blocks of a deeper network, and associated the decision boundary of such a network with a tropical hypersurface.

In order to expand the studies of [6] on upper bounds for the number of linear regions of layers with ReLU, leaky ReLU and maxout activations, [9] presented an approach from a tropical perspective which treats neural network layers with piecewise linear activations as tropical polynomials, i.e., polynomials over tropical semirings.

Calafiore et al. [10] proposed a new type of neural network called log-sum-exp (LSE) networks, which are universal approximators of convex functions. In addition, they showed that difference-LSE networks are smooth universal approximators of continuous functions over compact convex sets [11]. LSE networks and difference-LSE networks are also closely related to tropical mathematics via the so-called "dequantization" procedure.

There are also studies that use tropical geometry to solve the problem of optimal piecewise-linear regression. Maragos et al. [12] generalized tropical geometrical objects using weighted lattices and provided the optimal solution of max-⋆ equations (max-⋆ algebra with an arbitrary binary operation ⋆ that distributes over max) using morphological adjunctions that are projections on weighted lattices. Then, by fitting max-⋆ tropical curves and surfaces to arbitrary data that constitute polygonal or polyhedral shape approximations, the relationship between tropical geometry and the optimization of piecewise-linear regression can be established. With this theory, they further proposed an approach for multivariate convex regression using an approximation model of the maximum of hyperplanes represented as a multivariate max-plus tropical polynomial [13], and showed that the method has lower complexity than most other methods for fitting piecewise-linear (PWL) functions.

Newer tropical methods have also been developed to formally simulate the training of some neural networks. Smyrnis et al. [14, 15] emulate the division of regular polynomials, when applied to those of the max-plus semiring, from the aspect of tropical polynomial division. This is done via the approximation of the Newton polytope of the dividend polynomial by that of the divisor.
The process has been applied to minimize a two-layer fully connected network for a binary classification problem, and then evaluated in various experiments to demonstrate its ability to approximate the network with the least performance loss. Since these polytopes encode the totality of the information contained in the network, the results demonstrate that the Newton polytopes of the tropical polynomials corresponding to the network provide a reliable way to approximate its labels.

1.2.2 Investigations on reducing multiplication operations in neural networks

For a deep neural network (DNN), the computation during forward inference is dominated by multiplications between floating-point weights and floating-point activations. It is well known that the execution of a multiplication is typically slower than the execution of an addition and has higher energy consumption. In recent years, there have been many studies on how to trade multiplications for additions to speed up the computation in deep learning.

The seminal work [16] introduced BinaryConnect, a method for training a DNN with binary (e.g., -1 or 1) weights during the forward and backward propagations, so that many multiply-accumulate operations can be replaced by simple accumulations. Moreover, Hubara et al. [17] proposed BNNs, which binarize not only the weights but also the activations in convolutional neural networks at runtime. In order to obtain greatly improved performance, Rastegari et al. [18] introduced XNOR-Networks, in which both the filters and the input to convolutional layers are binary. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs in real time. To further increase the speed of training binarized networks, [19] proposed DoReFa-Net to train convolutional neural networks that have low-bitwidth weights and activations using low-bitwidth parameter gradients.

Considered from another aspect, Chen et al. [20] gave up the convolution operation, which involves a large amount of matrix multiplication. They propose Adder Networks (AdderNets), which maximize the use of addition while abandoning convolution operations; specifically, the ℓ1-norm distance between the filters and the input features is taken as the output response instead of the result of a convolution. A corresponding optimization method is developed using regularized full-precision gradients. The experimental results show that AdderNets can well approximate the performance of CNNs with the same architectures, which may have some impact on future hardware design.

The main contributions of this paper are summarized as follows:

(i) Using tropical arithmetic, we propose a new model of neural networks called Min-Max-Plus Neural Networks, which have a trainable and more sophisticated nonlinear part compared to conventional neural networks. We show that a general form of MMP-NNs is composed of three types of layers: linear layers, min-plus layers and max-plus layers, which can be represented by matrices, min-plus matrices and max-plus matrices respectively. Then we show that the computation of MMP-NNs is essentially a sequence of matrix multiplications and tropical matrix multiplications.

(ii) We show that such a model of MMP-NNs is quite general in the sense that conventional maxout networks, ReLU networks, leaky/parametric ReLU networks, and the dequantization of Log-Sum-Exp (LSE) networks can all be considered as specializations of a special type (called Type I) of MMP-NNs, while on the other hand MMP-NNs can be quite non-conventional.
A special type (called Type II) of MMP-NNs has only one linear layer at the input end, with the remaining layers being min-plus layers and max-plus layers stacked together alternately. Another special type (Type III) of MMP-NNs has the structure of a Type II network with an additional output linear layer attached. With a more sophisticated nonlinear expressor, a Type III network is expected to have enhanced fitting ability compared with similarly sized conventional networks.

(iii) We show that MMP-NNs of all types are universal approximators. In particular, we show that the space of functions expressible by Type II networks with a fixed linear layer can be elegantly characterized using tropical convexity. Moreover, the proof that we give for Type II networks being universal approximators is quite distinct from the usual difference-of-convex-functions approach.

(iv) We show that by using Type II networks, since there is only one linear layer, the number of multiplications in the computation can be tremendously reduced. This gives an advantage to Type II networks in scenarios where computing resources are limited.

(v) We show that MMP-NNs can also be trained using backpropagation as in conventional feedforward networks. We provide formulas for the gradient calculations of the non-conventional min-plus layers and max-plus layers.

(vi) We propose a normalization process to adjust the parameters in the min-plus and max-plus layers, which helps to expedite the rate of convergence in training MMP-NNs. This normalization process is a generalization of the Legendre-Fenchel transformation widely used in physics and convex optimization.

The remainder of the paper is organized as follows: In Section 2, we give a brief overview of the terminology of tropical mathematics related to this work; in Section 3, we give a general description of the building blocks and architecture of MMP-NNs, and discuss several special types (Type I, II, and III) of MMP-NNs; in Section 4, we prove that all types of MMP-NNs are universal approximators; in Section 5, we describe a training method for MMP-NNs using backpropagation and introduce the normalization algorithm.
2 Preliminaries

In this section, we provide some preliminaries of tropical mathematics that are related to our work in this paper. One may refer to [2, 21] for a more comprehensive introduction to tropical mathematics.
Definition 2.1.
Let R_min := R ∪ {∞}, R_max := R ∪ {−∞}, and R̄ := R ∪ {∞, −∞}.

(i) The min-plus algebra (R_min, ⊙, ⊕) and the max-plus algebra (R_max, ⊙, ⊞) are semirings where a ⊙ b := a + b, a ⊕ b := min(a, b) and a ⊞ b := max(a, b).

(ii) The operations ⊙, ⊕ and ⊞ are called tropical multiplication, tropical lower addition and tropical upper addition respectively.

(iii) By convention, the tropical semiring is defined as either the min-plus algebra (R_min, ⊙, ⊕) or the max-plus algebra (R_max, ⊙, ⊞).

Note that a ⊙ 0 = a, b ⊕ ∞ = b and c ⊞ (−∞) = c for all a ∈ R, b ∈ R_min and c ∈ R_max. This means that 0 is the tropical multiplicative identity for both the min-plus and max-plus algebras, ∞ is the identity for tropical lower addition, and −∞ is the identity for tropical upper addition. Moreover, the tropical division of a, b ∈ R is defined as a ⊘ b := a − b and the tropical inverse of a ∈ R is the negation −a = 0 ⊘ a. Note that the min-plus algebra and the max-plus algebra are isomorphic under negation.

The min-plus algebra (resp. max-plus algebra) is an idempotent semiring, since a ⊕ a = a (resp. a ⊞ a = a) for all a ∈ R. It can be easily verified that the tropical operations on (R_min, ⊙, ⊕) and (R_max, ⊙, ⊞) satisfy the usual principles of commutativity, associativity and distributivity.

Remark 1. While only one of the min-plus algebra and the max-plus algebra is considered in most other works related to the tropical semiring, we need to deal with both in this work. In particular, the following property will be employed: the two addition operations ⊕ and ⊞ also mutually satisfy the distributive law, i.e., a ⊞ (b ⊕ c) = (a ⊞ b) ⊕ (a ⊞ c) and a ⊕ (b ⊞ c) = (a ⊕ b) ⊞ (a ⊕ c) for all a, b, c ∈ R.

Tropical operations can also be applied to functions. For a topological space X, let C(X) be the space of real-valued continuous functions on X. Correspondingly, we can define tropical operations on C(X) as follows:

(i) For f, g ∈ C(X), the lower tropical addition and upper tropical addition are defined as f ⊕ g := min(f, g) and f ⊞ g := max(f, g) respectively, where the minimum and the maximum are taken in a point-wise manner.

(ii) For f, g ∈ C(X) and c ∈ R, the tropical scalar multiplication, tropical multiplication, tropical division and tropical inverse are defined respectively as c ⊙ f := c + f, f ⊗ g := f + g, f ⊘ g := f − g and −f = 0 ⊘ f.

(iii) By abuse of notation, we also let ∞ and −∞ denote constant functions on X taking values ∞ and −∞ respectively. Then for f ∈ C(X), we have f ⊕ ∞ = f, f ⊞ (−∞) = f, f ⊕ (−∞) = −∞, f ⊞ ∞ = ∞, f ⊗ ∞ = ∞ and f ⊗ (−∞) = −∞.

(iv) (C(X) ∪ {∞}, ⊕, ⊗) and (C(X) ∪ {−∞}, ⊞, ⊗) are idempotent semirings with operations satisfying the usual principles of commutativity, associativity and distributivity, and the property f ⊞ (g ⊕ h) = (f ⊞ g) ⊕ (f ⊞ h) and f ⊕ (g ⊞ h) = (f ⊕ g) ⊞ (f ⊕ h) for all f, g, h ∈ C(X) ∪ {∞, −∞}.
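As a quick illustration (not from the paper), the following Python snippet encodes the scalar tropical operations of Definition 2.1 and numerically checks the mutual distributive law of ⊕ and ⊞ noted in Remark 1 on a few sample values.

```python
import itertools

def t_mul(a, b):  # tropical multiplication a ⊙ b
    return a + b

def t_low(a, b):  # tropical lower addition a ⊕ b
    return min(a, b)

def t_up(a, b):   # tropical upper addition a ⊞ b
    return max(a, b)

# mutual distributivity: a ⊞ (b ⊕ c) == (a ⊞ b) ⊕ (a ⊞ c), and with ⊕/⊞ swapped
for a, b, c in itertools.product([-2.0, 0.0, 1.5, 3.0], repeat=3):
    assert t_up(a, t_low(b, c)) == t_low(t_up(a, b), t_up(a, c))
    assert t_low(a, t_up(b, c)) == t_up(t_low(a, b), t_low(a, c))
print("mutual distributive law verified on sample values")
```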
Now let us introduce the notion of tropical convexity [22, 23] for both the min-plus algebra and the max-plus algebra.

Definition 2.2.
A subset T of C(X) is said to be lower tropically convex (respectively upper tropically convex) if a ⊙ f ⊕ b ⊙ g (respectively a ⊙ f ⊞ b ⊙ g) is contained in T for all a, b ∈ R and all f, g ∈ T.

Let f_1, ..., f_n ∈ C(X), a_1, ..., a_n ∈ R_min and b_1, ..., b_n ∈ R_max. Then we say a_1 ⊙ f_1 ⊕ ⋯ ⊕ a_n ⊙ f_n is a min-plus combination of f_1, ..., f_n and b_1 ⊙ f_1 ⊞ ⋯ ⊞ b_n ⊙ f_n is a max-plus combination of f_1, ..., f_n. We also call both min-plus and max-plus combinations tropical linear combinations.

Definition 2.3.
Let V be a subset of C(X). The lower tropical convex hull (respectively upper tropical convex hull) of V is the smallest lower (respectively upper) tropically convex subset of C(X) containing V.

We may characterize the tropical convex hulls as sets of tropical linear combinations as stated in the following proposition (for more details, see Section 3 of [23]).

Proposition 2.4.
For a subset V of C(X), the lower (respectively upper) tropical convex hull of V is composed of all min-plus (respectively max-plus) linear combinations of elements in V, i.e.,

tconv_min(V) = {(a_1 ⊙ g_1) ⊕ ⋯ ⊕ (a_m ⊙ g_m) | m ∈ N, a_i ∈ R, g_i ∈ V}

and

tconv_max(V) = {(b_1 ⊙ g_1) ⊞ ⋯ ⊞ (b_m ⊙ g_m) | m ∈ N, b_i ∈ R, g_i ∈ V}.

A polynomial in n variables with coefficients in R has an expression of the form p(x) = Σ_{i=1}^m c_i x_1^{a_i1} ⋯ x_n^{a_in} with x = (x_1, ..., x_n), c_i ∈ R and a_ij ∈ N. The terms c_i x_1^{a_i1} ⋯ x_n^{a_in} are called monomials. Polynomials and monomials can be considered as functions on R^n. If in addition the coefficients c_i and the coordinates x_i are required to be positive reals while the assumption on the exponents is relaxed so that the a_ij are allowed to be arbitrary real numbers, then p(x) is called a posynomial.

Definition 2.5.
Using tropical operations, the tropical counterparts of monomials, polynomials, and posynomials are defined as follows:

(i) A tropical monomial in n variables is a function of x = (x_1, ..., x_n) ∈ R^n of the form c + ⟨a, x⟩ with c ∈ R and a = (a_1, ..., a_n) ∈ R^n. Here ⟨a, x⟩ = a_1 x_1 + ⋯ + a_n x_n is the inner product of the vectors a and x.

(ii) A min-plus polynomial (respectively a max-plus polynomial) in n variables is a function on R^n of the form p(x) = ⊕_{i=1,...,m} c_i ⊙ x_1^{⊗a_i1} ⋯ x_n^{⊗a_in} = min_{i=1,...,m}(c_i + ⟨a_i, x⟩) (respectively p(x) = ⊞_{i=1,...,m} c_i ⊙ x_1^{⊗a_i1} ⋯ x_n^{⊗a_in} = max_{i=1,...,m}(c_i + ⟨a_i, x⟩)), where c_i ∈ R, x = (x_1, ..., x_n) ∈ R^n, a_i = (a_i1, ..., a_in) ∈ N^n and ⟨a_i, x⟩ = a_i1 x_1 + ⋯ + a_in x_n is the inner product of the vectors a_i and x.

(iii) If instead of requiring a_i ∈ N^n we allow a_i ∈ R^n, then the function min_{i=1,...,m}(c_i + ⟨a_i, x⟩) is called a min-plus posynomial and the function max_{i=1,...,m}(c_i + ⟨a_i, x⟩) is called a max-plus posynomial.

Remark 2. In the context of this paper, the coefficient vector a = (a_1, ..., a_n) of a tropical monomial c + ⟨a, x⟩ is not necessarily restricted to be a vector of integers (as in many other works), and our discussions will mainly focus on tropical posynomials rather than tropical polynomials. By the above definitions, tropical monomials are simply affine functions, and min-plus (respectively max-plus) posynomials are precisely convex-upward (respectively convex-downward) piecewise-linear functions.

Using tropical operations, we can also define additions and products of tropical matrices. A min-plus matrix is a matrix (a_ij)_ij with entries a_ij ∈ R_min and a max-plus matrix is a matrix (a_ij)_ij with entries a_ij ∈ R_max. We denote the space of m × n min-plus matrices by R_min^{m×n} and the space of m × n max-plus matrices by R_max^{m×n}.

Definition 2.6.
Consider min-plus matrices A = (a_ij) ∈ R_min^{m×n}, B = (b_ij) ∈ R_min^{m×n} and C = (c_ij) ∈ R_min^{n×p}, and max-plus matrices A′ = (a′_ij) ∈ R_max^{m×n}, B′ = (b′_ij) ∈ R_max^{m×n} and C′ = (c′_ij) ∈ R_max^{n×p}.

(i) The min-plus sum S = A ⊕ B is the m × n min-plus matrix S = (s_ij)_ij with entries s_ij = a_ij ⊕ b_ij = min(a_ij, b_ij) for i = 1, ..., m and j = 1, ..., n.

(ii) The max-plus sum S′ = A′ ⊞ B′ is the m × n max-plus matrix S′ = (s′_ij)_ij with entries s′_ij = a′_ij ⊞ b′_ij = max(a′_ij, b′_ij) for i = 1, ..., m and j = 1, ..., n.

(iii) The min-plus matrix product T = A ⊗ C is the m × p min-plus matrix T = (t_ij)_ij with entries t_ij = a_i1 ⊙ c_1j ⊕ ⋯ ⊕ a_in ⊙ c_nj = min_{k=1,...,n}(a_ik + c_kj) for i = 1, ..., m and j = 1, ..., p.

(iv) The max-plus matrix product T′ = A′ ⊗ C′ is the m × p max-plus matrix T′ = (t′_ij)_ij with entries t′_ij = a′_i1 ⊙ c′_1j ⊞ ⋯ ⊞ a′_in ⊙ c′_nj = max_{k=1,...,n}(a′_ik + c′_kj) for i = 1, ..., m and j = 1, ..., p.

(v) Let I_n^⊕ ∈ R_min^{n×n} be the matrix with 0 on the diagonal and ∞ in every off-diagonal entry, and let I_n^⊞ ∈ R_max^{n×n} be the matrix with 0 on the diagonal and −∞ in every off-diagonal entry. Then I_n^⊕ is called the n × n min-plus identity matrix and I_n^⊞ is called the n × n max-plus identity matrix, since I_n^⊕ ⊗ A = A ⊗ I_n^⊕ = A for all A ∈ R_min^{n×n} and I_n^⊞ ⊗ B = B ⊗ I_n^⊞ = B for all B ∈ R_max^{n×n}.
Example 2.7. Concrete numerical instances of the min-plus and max-plus matrix products are computed entry-by-entry from Definition 2.6; a worked computation is given in the code sketch below.

Remark 3. Let A ∈ R_min^{m×n}, A′ ∈ R_max^{m×n} and B ∈ R^{n×p}. Suppose each row vector of A contains at least one entry that is not ∞ and each row vector of A′ contains at least one entry that is not −∞. Then A ⊗ B ∈ R^{m×p} and A′ ⊗ B ∈ R^{m×p}. By letting p = 1, this means that both A and A′ can be considered as "tropically linear" transformations (called min-plus transformations and max-plus transformations respectively) from R^n to R^m.

Figure 1: (a) A linear layer is represented by a linear transformation. (b) A min-plus layer is represented by a min-plus linear transformation. (c) A max-plus layer is represented by a max-plus linear transformation.
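The following short NumPy sketch (illustrative, not taken from the paper; the matrices are arbitrary sample values) computes the min-plus and max-plus matrix products of Definition 2.6.

```python
import numpy as np

def min_plus_matmul(A, C):
    # min-plus product: (A ⊗ C)_ij = min_k (a_ik + c_kj)
    return np.min(A[:, :, None] + C[None, :, :], axis=1)

def max_plus_matmul(A, C):
    # max-plus product: (A ⊗ C)_ij = max_k (a_ik + c_kj)
    return np.max(A[:, :, None] + C[None, :, :], axis=1)

A = np.array([[1.0, -1.0],
              [3.0,  4.0]])
B = np.array([[0.0,  2.0],
              [5.0, -3.0]])
print(min_plus_matmul(A, B))   # [[ 1. -4.] [ 3.  1.]]
print(max_plus_matmul(A, B))   # [[ 4.  3.] [ 9.  5.]]
```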
3 Building Blocks and Architecture of MMP-NNs

In this section, we give a description of the building blocks and the general form of MMP-NNs, and in addition discuss several special types (Type I, II, and III) of MMP-NNs.
In general, a Min-Max-Plus Neural Network (MMP-NN) is composed of three types of layers: linear layers, min-plus layers and max-plus layers. As shown in Figure 1, a linear layer is represented by a linear transformation, a min-plus layer is represented by a min-plus transformation, and a max-plus layer is represented by a max-plus transformation (Remark 3).

More precisely, suppose the input of a linear layer, a min-plus layer or a max-plus layer is an n-dimensional vector in R^n and the output is an m-dimensional vector in R^m. Then the corresponding linear transformation λ : R^n → R^m, min-plus transformation α : R^n → R^m and max-plus transformation β : R^n → R^m can be expressed as λ(x) = L · x, α(x) = A ⊗ x and β(x) = B ⊗ x respectively, where x ∈ R^n, L ∈ R^{m×n}, A ∈ R_min^{m×n} and B ∈ R_max^{m×n}. Since the output should be in R^m, we require that each row vector of A contains at least one entry that is not ∞ and each row vector of B contains at least one entry that is not −∞ (Remark 3). Figure 1 shows an example of a linear layer, a min-plus layer and a max-plus layer, all with m = 3 and n = 2.

Remark 4. It should be emphasized that even though min-plus transformations and max-plus transformations are "tropically linear", in general they are nonlinear in the usual sense, i.e., piecewise linear but convex upward or downward.
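For illustration (a minimal sketch under the assumption of NumPy arrays, not the authors' code), the three layer types and a single composite layer can be evaluated as follows; a general MMP-NN simply chains such composite layers.

```python
import numpy as np

def linear_layer(L, x):
    return L @ x                       # λ(x) = L · x

def min_plus_layer(A, x):
    return np.min(A + x, axis=1)       # α(x)_i = min_j (a_ij + x_j)

def max_plus_layer(B, x):
    return np.max(B + x, axis=1)       # β(x)_i = max_j (b_ij + x_j)

def composite_layer(L, A, B, x):
    # one composite layer: linear, then min-plus, then max-plus
    return max_plus_layer(B, min_plus_layer(A, linear_layer(L, x)))

rng = np.random.default_rng(0)
x = rng.normal(size=2)
L = rng.normal(size=(3, 2))            # linear: R^2 -> R^3
A = rng.normal(size=(4, 3))            # min-plus: R^3 -> R^4
B = rng.normal(size=(3, 4))            # max-plus: R^4 -> R^3
print(composite_layer(L, A, B, x))
```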
Figure 2(a) shows the general form of a feedforward multilayer MMP-NN, which is composed of a sequence of composite layers, each containing a linear layer, a min-plus layer and a max-plus layer. Consider an MMP-NN with K composite layers and denote the linear transformation, min-plus transformation and max-plus transformation of the k-th composite layer by λ_k, α_k and β_k respectively.

Let d be the number of input nodes and p be the number of output nodes of the MMP-NN. Then the MMP-NN is a function Φ : R^d → R^p which can be written as a composition of linear, min-plus and max-plus transformations

Φ = β_K ∘ α_K ∘ λ_K ∘ β_{K−1} ∘ α_{K−1} ∘ λ_{K−1} ∘ ⋯ ∘ β_1 ∘ α_1 ∘ λ_1.

Figure 2: (a) The general form of MMP-NNs; (b) Type I MMP-NNs; (c) Type II MMP-NNs; (d) Type III MMP-NNs.

Let d_k, n_k and m_k be the numbers of input nodes of the k-th linear layer, the k-th min-plus layer and the k-th max-plus layer respectively. Note that d_1 = d and we let d_{K+1} = p. Then we have λ_k : R^{d_k} → R^{n_k}, α_k : R^{n_k} → R^{m_k} and β_k : R^{m_k} → R^{d_{k+1}} for k = 1, ..., K. Let L_k ∈ R^{n_k×d_k}, A_k ∈ R_min^{m_k×n_k} and B_k ∈ R_max^{d_{k+1}×m_k} be the corresponding matrix, min-plus matrix and max-plus matrix for λ_k, α_k and β_k respectively. Then for each x ∈ R^d, we have

Φ(x) = B_K ⊗ (A_K ⊗ (L_K · (B_{K−1} ⊗ ⋯ ⊗ (L_2 · (B_1 ⊗ (A_1 ⊗ (L_1 · x)))) ⋯ ))).

Remark 5. It should be emphasized that there is no associative law for the above intermediate transformations. For example, suppose A_1 and B_1 both have entries in R. To compute y = B_1 ⊗ (A_1 ⊗ (L_1 · x)), one has to compute step by step: u = L_1 · x, v = A_1 ⊗ u and y = B_1 ⊗ v. If one computes u = L_1 · x and C = B_1 ⊗ A_1 first and then computes C ⊗ u, the result will not agree with the above step-by-step computation in general.

Conventionally, a multilayer feedforward neural network is composed of a sequence of affine transformations and nonlinear activations. The activation function is usually fixed and relatively simple, e.g. ReLU or sigmoid. Even though the nonlinearity added by each activation neuron is rather small, the total nonlinearity of the system accumulates over layers of transformations and activations.

For a single composite layer of a general MMP-NN, a linear layer λ_k is followed by a min-plus layer α_k and a max-plus layer β_k, and the composition of α_k and β_k can be considered as a nonlinear activation. This composition can introduce a high degree of nonlinearity within a single activation. In addition, we allow the parameters in all three types of matrices (linear, min-plus and max-plus) to be trained, as will be discussed in detail in Section 5.

The architecture can be simplified by dropping some layers (or, equivalently, by letting the corresponding dropped layer be represented by the identity linear, min-plus or max-plus matrix). Here we show several such simplifications: Figure 2(b) shows an MMP-NN (called Type I) with the min-plus layer removed from each composite layer; Figure 2(c) shows an MMP-NN (called Type II) with the linear layer removed from each composite layer except the first one; Figure 2(d) shows an MMP-NN (called Type III) which is a Type II network with an additional linear layer attached to the output end.

While the structures of Type II and Type III networks are non-conventional, which we will discuss further in Section 4, Type I networks have connections to several conventional networks, as explained in the following:
(i) Maxout activation.
Type I networks are essentially equivalent to maxout networks. Recall that in a maxout unit [24], for an input x ∈ R^d, given W_i = (W_ij) ∈ R^{n_i×d}, where W_ij is the j-th row of W_i, and b_i = (b_ij) ∈ R^{n_i} for i = 1, ..., m, the output is y = (y_i) ∈ R^m where y_i = max_{j=1,...,n_i}(W_ij x + b_ij). Here the W_i and b_i are learned parameters.

Let n = Σ_{i=1}^m n_i. A maxout network can be translated to a Type I network with a linear layer represented by a matrix L followed by a max-plus layer represented by a max-plus matrix B. More precisely, we have y = B ⊗ (L · x), where L ∈ R^{n×d} is obtained by stacking W_1, ..., W_m vertically and B ∈ R_max^{m×n} is the block-structured max-plus matrix whose i-th row carries the entries b_i1, ..., b_{in_i} in the block of columns corresponding to W_i and −∞ elsewhere.

(ii) ReLU activation.
For an input x ∈ R^d, consider an affine transformation followed by ReLU units. Given W = (W_i) ∈ R^{m×d}, where W_i is the i-th row of W, and b = (b_i) ∈ R^m, the output is y = (y_i) ∈ R^m computed by y_i = max(W_i · x + b_i, 0).

This computation can be translated to a Type I network as follows. Let L ∈ R^{(m+1)×d} be the matrix obtained from W by appending a zero row, and let B ∈ R_max^{m×(m+1)} be the max-plus matrix whose i-th row has b_i in the i-th column, 0 in the last column and −∞ elsewhere. Then y = B ⊗ (L · x). (A numerical sketch of this translation is given after this list.)

(iii) Leaky/Parametric ReLU activation.
For an input x ∈ R^d, consider an affine transformation followed by leaky ReLU units. Given W = (W_i) ∈ R^{m×d}, where W_i is the i-th row of W, and b = (b_i) ∈ R^m, let v = (v_i) ∈ R^m with v_i = W_i · x + b_i. Then the output is y = (y_i) ∈ R^m computed by y_i = max(v_i, λv_i).

The above computation can also be translated to a Type I network computation. Let L ∈ R^{2m×d} be the matrix obtained by stacking W on top of λW, and let B ∈ R_max^{m×2m} be the max-plus matrix whose i-th row has b_i in the i-th column, λb_i in the (m+i)-th column and −∞ elsewhere. Then y = B ⊗ (L · x).

(iv) Log-Sum-Exp (LSE) network.
The concept of LSE neural networks was introduced recently by Calafiore et al. [10] as a smooth universal approximator of convex functions. In addition, they show that the difference-LSE (the difference of the outputs of two LSE networks) is a universal approximator of continuous functions [11]. By definition, an
LSE is a function f : R^d → R that can be written as f(x) = log(Σ_{i=1}^n β_i exp(a^(i) · x)) for some n ∈ N, where a^(i), x ∈ R^d and β_i > 0. Given T ∈ R_{>0}, by changing the base of the log function and the exp function concordantly, an LSE_T is a function that can be written as

f_T(x) = T log( Σ_{i=1}^n β_i^{1/T} exp(a^(i) · x / T) ) = T log( Σ_{i=1}^n exp(a^(i) · (x/T) + b_i/T) ),

where b_i = log β_i. As T → 0^+, the family of functions (f_T)_{T>0} converges uniformly to f_0 = max_{i=1,...,n}(a^(i) · x + b_i). The above procedure of deriving f_0 from (f_T)_{T>0} is called Maslov dequantization, which is among the original motivations in the development of tropical mathematics.

Let L ∈ R^{n×d} be the matrix with row vectors a^(1), ..., a^(n), and let B = (b_1 ⋯ b_n) ∈ R^{1×n}. Then f_0(x) = B ⊗ (L · x). This means that Type I networks are actually dequantizations of LSE networks.
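As a sanity check on the ReLU translation in item (ii), the following sketch (illustrative; W, b and x are randomly chosen sample values, not from the paper) verifies numerically that B ⊗ (L · x) reproduces max(Wx + b, 0).

```python
import numpy as np

def max_plus(B, v):
    return np.max(B + v, axis=1)       # (B ⊗ v)_i = max_j (b_ij + v_j)

rng = np.random.default_rng(1)
m, d = 4, 3
W, b, x = rng.normal(size=(m, d)), rng.normal(size=m), rng.normal(size=d)

L = np.vstack([W, np.zeros((1, d))])           # (m+1) x d, last row is zero
B = np.full((m, m + 1), -np.inf)
B[:, -1] = 0.0                                 # the 0 offset paired with the zero row
B[np.arange(m), np.arange(m)] = b              # b_i on the diagonal

relu = np.maximum(W @ x + b, 0.0)
type_i = max_plus(B, L @ x)
print(np.allclose(relu, type_i))               # True
```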
4 MMP-NNs as Universal Approximators

In this section, we show the universal approximation property of Type I, II and III networks. In particular, we focus our discussion on Type II networks, which are the most non-conventional. Note that a Type II network contains only one linear layer, and all multiplication operations of the whole network are performed in this layer. The other layers are min-plus layers and max-plus layers, where only additions, min operations and max operations are performed. The real power of Type II networks is that, with specific configurations, one can tremendously reduce the number of multiplications, which are more resource-intensive than additions and min/max operations. For example, to approximate a continuous function f : R → R using a Type I network, we need to generate enough lines with distinct slopes from the linear layer to make a refined approximation of f. Using a Type II network, however, it suffices to use far fewer lines with only a few slopes (for example, two fixed slopes ±K, where K is a Lipschitz constant of f).

As shown in the previous section, Type I networks (Figure 2(b)) can be used to express several conventional networks that have already been proven to be universal approximators:
(i) Maxout networks.
As shown in [24], maxout networks are universal approximators of convex functions, and by using two maxout networks, the difference of their outputs can be used to approximate any continuous function.
(ii) ReLU networks.
From a tropical-geometric point of view, [8] shows that the output of a ReLU network is actually a tropical rational function (the difference of two tropical posynomials). As for maxout networks, tropical posynomials are universal approximators of convex functions and tropical rational functions are universal approximators of continuous functions.
(iii) LSE networks.
As proven in [10] and [11], LSE networks are universal approximators of convex functions and difference-LSE networks are universal approximators of continuous functions.

It follows that Type I networks are universal approximators. This also means that general MMP-NNs (Figure 2(a)) are universal approximators.
For the networks related to Type I networks discussed above, the general approach to showing universal approximation of continuous functions is to first show universal approximation of convex functions and then use the difference of two approximators of convex functions to approximate general continuous functions. However, this approach does not apply to Type II networks (Figure 2(c)), since there is no linear layer at the output end, which means that we cannot take the difference of two approximators of convex functions.

If a Type II network contains only one single composite layer (made of a linear layer, a min-plus layer and a max-plus layer), then we also call it a Linear-Min-Max (LMM) network. In the following, we will analyze Type II networks in detail and show that:

(1) Type II networks are equivalent to Linear-Min-Max (LMM) networks;

(2) LMM networks are universal approximators, and thus Type II networks are universal approximators; and

(3) with specific configurations, the number of multiplications can be tremendously reduced in a Type II network.
Using the theory of tropical convexity introduced in Section 2.2, there is an elegant characterization of the space of all possible functions that can be expressed by Type II networks with the linear layer fixed.

Consider a Type II network which affords the function Φ : R^d → R. Then Φ can be written as a composition of transformations

Φ = β_K ∘ α_K ∘ ⋯ ∘ β_2 ∘ α_2 ∘ β_1 ∘ α_1 ∘ λ,

where λ is a linear transformation, the α_i's are min-plus transformations and the β_i's are max-plus transformations. Here β_K is a max-plus transformation to R (there is a single output node), while the number K of layers and the numbers of intermediate nodes are arbitrary. Denote by S_λ the space of all functions afforded by Type II networks with the linear layer fixed to λ : R^d → R^n. Denote by S′_λ the space of all functions afforded by LMM networks (meaning that K = 1) with the linear layer fixed to λ.

Now let the coordinates of R^d be x_1, ..., x_d. We denote by x̃_i the coordinate function on R^d sending a vector x = (x_j)_j ∈ R^d to the value of its i-th coordinate, for i = 1, ..., d. Then λ(x) is a vector y = (f_i(x))_i ∈ R^n where each f_i : R^d → R is a linear function of x. In particular, the function f_i is a linear combination of the coordinate functions x̃_1, ..., x̃_d, i.e., f_i = a_i1 x̃_1 + ⋯ + a_id x̃_d.

Let V_λ := {f_1, ..., f_n} ⊆ C(R^d). By Proposition 2.4, we see that

(i) tconv_min(V_λ) = {(a_1 ⊙ g_1) ⊕ ⋯ ⊕ (a_m ⊙ g_m) | m ∈ N, a_i ∈ R, g_i ∈ V_λ};

(ii) tconv_max(V_λ) = {(b_1 ⊙ g_1) ⊞ ⋯ ⊞ (b_m ⊙ g_m) | m ∈ N, b_i ∈ R, g_i ∈ V_λ};

(iii) tconv_min(tconv_max(V_λ)) = {(a_1 ⊙ g_1) ⊕ ⋯ ⊕ (a_m ⊙ g_m) | m ∈ N, a_i ∈ R, g_i ∈ tconv_max(V_λ)};

(iv) tconv_max(tconv_min(V_λ)) = {(b_1 ⊙ g_1) ⊞ ⋯ ⊞ (b_m ⊙ g_m) | m ∈ N, b_i ∈ R, g_i ∈ tconv_min(V_λ)}.

By substitution, elements of tconv_min(tconv_max(V_λ)) can always be written as (c_11 ⊙ f_1 ⊞ ⋯ ⊞ c_1n ⊙ f_n) ⊕ ⋯ ⊕ (c_m1 ⊙ f_1 ⊞ ⋯ ⊞ c_mn ⊙ f_n) with c_ij ∈ R_max, and elements of tconv_max(tconv_min(V_λ)) can always be written as (d_11 ⊙ f_1 ⊕ ⋯ ⊕ d_1n ⊙ f_n) ⊞ ⋯ ⊞ (d_m1 ⊙ f_1 ⊕ ⋯ ⊕ d_mn ⊙ f_n) with d_ij ∈ R_min. Therefore, we may conclude that S′_λ = tconv_max(tconv_min(V_λ)).

Actually we have tconv_max(tconv_min(V_λ)) = tconv_min(tconv_max(V_λ)) (a detailed proof is provided in Section 3 of our previous work [23], which is essentially based on the mutual distributive law of ⊕ and ⊞ stated in Remark 1), and consequently we can write tconv(V_λ) := tconv_max(tconv_min(V_λ)) = tconv_min(tconv_max(V_λ)). Therefore, every element f of tconv(V_λ) can be written as (c_11 ⊙ f_1 ⊕ ⋯ ⊕ c_1n ⊙ f_n) ⊞ ⋯ ⊞ (c_m1 ⊙ f_1 ⊕ ⋯ ⊕ c_mn ⊙ f_n) for some m ∈ N. This means f can be realized by an LMM network Φ = β ∘ α ∘ λ, where the min-plus layer α is represented by a min-plus matrix in R_min^{m×n} and the max-plus layer β by a max-plus matrix in R_max^{1×m}. Consequently, we conclude that S′_λ = tconv(V_λ).

Furthermore, for a deep Type II network Φ = β_K ∘ α_K ∘ ⋯ ∘ β_2 ∘ α_2 ∘ β_1 ∘ α_1 ∘ λ with K ≥ 2, we see that Φ lies in the K-fold iterated hull tconv_max(tconv_min( ⋯ (tconv_max(tconv_min(V_λ))) ⋯ )). By applying the identity tconv_max(tconv_min(V_λ)) = tconv_min(tconv_max(V_λ)) recurrently, it follows that this iterated hull equals tconv(V_λ). Therefore, Φ can always be equally expressed by an LMM network, i.e., Φ = β ∘ α ∘ λ for some m, a min-plus transformation α represented by a matrix in R_min^{m×n} and a max-plus transformation β represented by a matrix in R_max^{1×m}.

In sum, we have the following theorem:

Theorem 4.1.
Type II networks are equivalent to LMM networks. More specifically, for any fixed linear transformation λ, we have S_λ = S′_λ = tconv(V_λ).

By Theorem 4.1, showing that Type II networks are universal approximators is equivalent to showing that LMM networks are universal approximators.

Recall that for a normed space (X, ‖·‖), a function f : X → R is called Lipschitz continuous if there exists a constant K > 0 such that for all x and x′ in X, |f(x) − f(x′)| ≤ K ‖x − x′‖. Any such K is referred to as a Lipschitz constant of f. We let X ⊆ R^d and use the maximum norm ‖·‖_∞ on X for simplicity of discussion. The following theorem says that any Lipschitz function can be approximated by a sequence of Type II networks with a common fixed linear layer.

Theorem 4.2. For any Lipschitz function f on a finite region X ⊆ R^d, there exists a linear transformation λ such that a sequence of functions in S_λ converges to f uniformly.

Proof. Let f : X → R be a Lipschitz continuous function with a Lipschitz constant K for the maximum norm ‖·‖_∞ on R^d. This means that |f(x) − f(x′)| ≤ K ‖x − x′‖_∞ for all x, x′ ∈ X. We will use LMM networks to make the approximation.

For the linear layer, we use a fixed linear transformation λ : R^d → R^{2d}, where λ(x) = (f_i(x))_i ∈ R^{2d} with f_{2i−1}(x) = K x_i and f_{2i}(x) = −K x_i for x = (x_i)_i ∈ R^d. After a min-plus transformation α : R^{2d} → R^m, we get α(λ(x)) = (g_i(x))_i ∈ R^m, where g_i(x) = a_i1 ⊙ f_1(x) ⊕ ⋯ ⊕ a_{i,2d} ⊙ f_{2d}(x) = min(f_1(x) + a_i1, ..., f_{2d}(x) + a_{i,2d}) for i = 1, ..., m. We note that the graph of g_i as a function on R^d has the shape of a pyramid. Let P_i be the tip of the pyramid of g_i. By varying the values of the a_ij, the tip P_i can have an arbitrary height and its projection to R^d can be located anywhere in R^d. We call the projection of P_i the center of g_i.

Consider two points x and x′ in R^d. Let g and g′ be two min-plus combinations of f_1, ..., f_{2d} with centers at x and x′ respectively. Let h = max(g, g′). We observe that as long as |g(x) − g′(x′)| ≤ K ‖x − x′‖_∞, we have h(x) = g(x) and h(x′) = g′(x′). It is easy to verify that g, g′ and h are Lipschitz continuous, also with Lipschitz constant K.

Assign a grid G on R^d with Δx_1 = ⋯ = Δx_d = δ. Since X is a finite region, the intersection of G and X must be a finite set M. Suppose the cardinality of M is m. By adjusting the entries of the matrix (a_ij) ∈ R^{m×(2d)}, we can arrange the g_i's such that the following properties are satisfied:

(i) the set of centers of the g_i's is exactly M; and

(ii) for each g_i, we have g_i(x_i) = f(x_i), where x_i is the center of g_i.

Let h = max_{i=1,...,m}(g_i). We claim that for each point x ∈ X, |h(x) − f(x)| is bounded by 2Kδ. Indeed, let x′ be a grid point nearest to x. This means that ‖x − x′‖_∞ ≤ δ. Therefore, |h(x) − h(x′)| ≤ Kδ and |f(x) − f(x′)| ≤ Kδ by the Lipschitz continuity of h and f. In addition, since g_i(x_i) = f(x_i) for all grid points x_i and f is Lipschitz continuous, we conclude that h(x_i) = g_i(x_i) = f(x_i) for all grid points x_i. Then

|h(x) − f(x)| ≤ |h(x) − h(x′)| + |h(x′) − f(x′)| + |f(x′) − f(x)| ≤ Kδ + 0 + Kδ = 2Kδ.
The above argument actually shows that ‖h − f‖_∞ ≤ 2Kδ. Therefore, h approaches f uniformly as δ approaches 0, which can be achieved by increasing the number of intermediate nodes of the LMM network. Thus we have proved that LMM networks (equivalently, Type II networks) with a fixed linear layer are universal approximators of Lipschitz functions.

Note that on a compact metric space X, any continuous function is the uniform limit of a sequence of Lipschitz functions [25]. Hence we have the following corollary of the above theorem.

Corollary 4.3.
Type II networks (equivalently, LMM networks) are universal approximators of continuous functions.
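The construction in the proof of Theorem 4.2 is easy to reproduce numerically. The sketch below (an illustration under the assumption of NumPy; the target sin and the uniform grid are arbitrary choices, not from the paper) builds the fixed linear layer x ↦ (Kx, −Kx), a min-plus layer whose pyramid tips sit on the grid points, and a final max-plus layer, and prints the uniform error, which shrinks as the grid is refined.

```python
import numpy as np

K = 1.0                                   # Lipschitz constant bound
f = np.sin                                # target 1-Lipschitz function
centers = np.linspace(0, 2*np.pi, 40)     # the grid M from the proof

def lmm(x):
    # linear layer: lambda(x) = (K*x, -K*x)  -- the only multiplications
    lin = np.stack([K*x, -K*x], axis=-1)                                     # (..., 2)
    # min-plus layer: g_i(x) = min(K*x + a_i1, -K*x + a_i2) with
    # a_i1 = f(c_i) - K*c_i, a_i2 = f(c_i) + K*c_i, a pyramid with tip (c_i, f(c_i))
    a = np.stack([f(centers) - K*centers, f(centers) + K*centers], axis=-1)  # (m, 2)
    g = np.min(lin[..., None, :] + a, axis=-1)                               # (..., m)
    # max-plus layer with zero offsets: h(x) = max_i g_i(x)
    return np.max(g, axis=-1)

xs = np.linspace(0, 2*np.pi, 1000)
print(np.max(np.abs(lmm(xs) - f(xs))))    # uniform error, bounded by 2*K*delta
```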
In the proof of Theorem 4.2, if the input of the Type II network is d-dimensional, only d multiplications in total need to be computed (f_{2i}(x) = −Kx_i can be obtained from f_{2i−1}(x) = Kx_i by negation), all in the linear transformation λ : R^d → R^{2d}. If we already know that the function to be approximated has relatively small variation (in this case we may assume K = 1 is a Lipschitz constant of f), then we may even simply fix the linear layer with f_{2i−1}(x) = x_i and f_{2i}(x) = −x_i. In this case, no multiplication is necessary.

Moreover, to form pyramid-shaped functions after the min-plus activation, a linear transformation R^d → R^{d+1} suffices instead of R^d → R^{2d} as in the proof of Theorem 4.2, i.e., only d + 1 hyperplanes need to be generated by the linear layer instead of 2d hyperplanes (for example, we may let f_i(x) = Kx_i for i = 1, ..., d and f_{d+1}(x) = −K(x_1 + ⋯ + x_d)). This number can be further reduced in real applications.

On the other hand, as a tradeoff, if more multiplications are introduced in the linear layer, then the approximating function generated by the network can be "smoother".

The above discussion of Type II networks as function approximators also implies that to train a Type II network, one may either fix the linear layer with preassigned linear transformations and only train the min-plus and max-plus layers (in this case, each component of the network output is a Lipschitz function whose Lipschitz constants have an upper bound determined by the preassigned linear transformation), or train the linear layer together with the min-plus and max-plus layers (in this case, each component of the network output can approximate any continuous function by Corollary 4.3).

Note that a Type III network is a Type II network with an additional linear layer attached to the output end. Then the fact that Type II networks are equivalent to LMM networks (Theorem 4.1) implies that Type III networks are equivalent to
Linear-Min-Max-Linear (LMML) networks, which are LMM networks with an additional linear layer attached to the output end.

For a set V of vectors, denote the linear span of V by Span(V) := {Σ_{i=1}^k c_i v_i | k ∈ N, c_i ∈ R, v_i ∈ V}. Let Λ_d be the space of all linear transformations on R^d. Then the space of all functions expressed by Type II networks with d input nodes is Expr_d^II := ∪_{λ∈Λ_d} S_λ, and the space of all functions expressed by Type III networks with d input nodes is Expr_d^III := Span(Expr_d^II).

It is clear that Type III networks are universal approximators of continuous functions, since Type II networks are universal approximators of continuous functions. On the other hand, compared to most conventional neural networks with simple fixed nonlinear activation functions, one may consider Type III networks as conventional networks whose nonlinear activation part is enlarged and trainable. Note that the composition of min-plus and max-plus layers affords a powerful nonlinear expression capability, which by itself gives Type II networks (which do not contain the output linear layer of Type III networks) the capability of universal approximation of continuous functions. One may expect that the more sophisticated nonlinear expressor in a Type III network enhances its overall fitting capability compared to conventional networks of similar scale.

5 Training of MMP-NNs

In this section, we first formulate the backpropagation algorithm for training the three types of layers of MMP-NNs, and then introduce a special technique called "normalization"/"restricted normalization", by which the parameters in the nonlinear part of MMP-NNs (the entries of the min-plus and max-plus matrices) are properly adjusted. This counters the effect of parameter deviation, which commonly happens to the nonlinear part of MMP-NNs and may seriously affect the convergence of training. In particular, the normalization technique can be considered as a generalization of, and is inspired by, the Legendre-Fenchel transformation or convex conjugate, which is widely used in physics and convex optimization.
5.1 Backpropagation

As with conventional feedforward neural networks, we can use backpropagation to train MMP-NNs. In the process of backpropagation, the calculation of the parameter gradients of min-plus layers and max-plus layers is different from that of linear layers. We summarize the gradient calculations as follows.

(i) For the linear layers, consider the linear functions

y_i = Σ_{j=1}^n w_ij · x_j,

where n is the dimension of the input. From the forward calculation formula of the linear layer, the gradient of w can be obtained directly as

Δw_ij = (∂y_i/∂w_ij) · Δy_i = x_j · Δy_i,

where Δy_i is the gradient of the loss with respect to the output y_i.

(ii) For the min-plus layers, the forward calculation formula can be written as

y′_i = min(w′_i1 + x′_1, ..., w′_ij + x′_j, ..., w′_in + x′_n),

and the gradient of w′ is

Δw′_ij = (∂y′_i/∂w′_ij) · Δy′_i = Δy′_i if the term w′_ij + x′_j is the smallest term in the forward computation of y′_i, and Δw′_ij = 0 otherwise.

(iii) Analogously, the forward calculation formula of the max-plus layers can be written as

y″_i = max(w″_i1 + x″_1, ..., w″_ij + x″_j, ..., w″_in + x″_n),

and the gradient is

Δw″_ij = (∂y″_i/∂w″_ij) · Δy″_i = Δy″_i if the term w″_ij + x″_j is the largest term in the forward computation of y″_i, and Δw″_ij = 0 otherwise.
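A minimal sketch of these update rules (illustrative, not the authors' implementation): for a min-plus layer the incoming gradient Δy′_i is routed to the single weight w′_ij whose term attained the minimum (ties broken arbitrarily by argmin), and symmetrically with argmax for a max-plus layer. The input gradient dx returned below is the standard extra piece needed to backpropagate through stacked layers; the text above states only the weight gradients.

```python
import numpy as np

def min_plus_forward(Wp, x):
    s = Wp + x                                   # s_ij = w'_ij + x_j
    j_star = np.argmin(s, axis=1)                # index of the smallest term per output
    y = s[np.arange(Wp.shape[0]), j_star]
    return y, j_star

def min_plus_backward(Wp, x, j_star, dy):
    dW = np.zeros_like(Wp)
    dW[np.arange(Wp.shape[0]), j_star] = dy      # Δw'_ij = Δy'_i only for the minimizing j
    dx = np.zeros_like(x)
    np.add.at(dx, j_star, dy)                    # route the same gradient to the selected inputs
    return dW, dx

# the max-plus layer is handled identically, with argmin replaced by argmax
```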
5.2 Normalization

Consider a min-plus transformation α : R^n → R^m and a max-plus transformation β : R^n → R^m represented by α(x) = A ⊗ x and β(x) = B ⊗ x respectively, where x ∈ R^n, A = (a_ij)_ij ∈ R_min^{m×n} and B = (b_ij)_ij ∈ R_max^{m×n}. Let f_1, ..., f_n be real-valued functions on the same domain X. Then for i = 1, ..., m, the min-plus combinations g_i = a_i1 ⊙ f_1 ⊕ ⋯ ⊕ a_in ⊙ f_n = min(a_i1 + f_1, ..., a_in + f_n) and the max-plus combinations h_i = b_i1 ⊙ f_1 ⊞ ⋯ ⊞ b_in ⊙ f_n = max(b_i1 + f_1, ..., b_in + f_n) are also real-valued functions on X (Figure 3). (In this sense, α and β can also be considered respectively as the min-plus and max-plus transformations of the vector (f_1, ..., f_n) of functions on X.)

Figure 3: Functions f_1, ..., f_n are transformed to g_1, ..., g_m by the min-plus matrix A ∈ R_min^{m×n} and to h_1, ..., h_m by the max-plus matrix B ∈ R_max^{m×n}.

Consider, for example, g_i = min(a_i1 + f_1, ..., a_in + f_n). Note that for a specific x ∈ X, g_i(x) must take the same value as a_ij + f_j(x) for at least one of the j's in {1, ..., n}. On the other hand, it is possible that for some f_j, inf_{x∈X}((a_ij + f_j(x)) − g_i(x)) > 0, i.e., the graph of the function a_ij + f_j is strictly above the graph of the function g_i by a margin. If such a "detachment" of the functions a_ij + f_j and g_i happens, a small adjustment of the coefficient a_ij in the training process described in Subsection 5.1 will be ineffective, which can seriously affect the convergence of training.

To avoid such a phenomenon of "parameter deviation" and expedite the convergence of training, we introduce a normalization process for the parameters, as described below.

Algorithm 5.1 (Normalization).

(1) Min-plus normalization

Input: A min-plus matrix A = (a_ij)_ij ∈ R_min^{m×n} and functions f_1, ..., f_n ∈ C(X).

Computation: For i = 1, ..., m, let g_i = a_i1 ⊙ f_1 ⊕ ⋯ ⊕ a_in ⊙ f_n = min(a_i1 + f_1, ..., a_in + f_n) ∈ C(X). For each i = 1, ..., m and j = 1, ..., n, compute

ν(a_ij) = − inf_{x∈X}(f_j(x) − g_i(x)) = sup_{x∈X}(g_i(x) − f_j(x)).

(In most of the cases we are interested in, the infimum and supremum can be replaced by the minimum and maximum respectively.)

Output: A min-plus matrix ν(A) := (ν(a_ij))_ij ∈ R_min^{m×n}.

(2) Max-plus normalization

Input: A max-plus matrix B = (b_ij)_ij ∈ R_max^{m×n} and functions f_1, ..., f_n ∈ C(X).

Computation: For i = 1, ..., m, let h_i = b_i1 ⊙ f_1 ⊞ ⋯ ⊞ b_in ⊙ f_n = max(b_i1 + f_1, ..., b_in + f_n) ∈ C(X). For each i = 1, ..., m and j = 1, ..., n, compute

ν(b_ij) = − sup_{x∈X}(f_j(x) − h_i(x)) = inf_{x∈X}(h_i(x) − f_j(x)).

(In most of the cases we are interested in, the infimum and supremum can be replaced by the minimum and maximum respectively.)

Output: A max-plus matrix ν(B) := (ν(b_ij))_ij ∈ R_max^{m×n}.

Remark 6. We call ν(a_ij) the (min-plus) normalization of a_ij, ν(A) the (min-plus) normalization of A, ν(b_ij) the (max-plus) normalization of b_ij, and ν(B) the (max-plus) normalization of B.
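A small sketch of the min-plus normalization (illustrative only; here the suprema are evaluated over a finite sample of points from X, which anticipates the restricted normalization of Algorithm 5.4): given the sampled values f_j(x), it returns ν(a_ij) = max_x (g_i(x) − f_j(x)). The sample functions are those of Example 5.3 below.

```python
import numpy as np

def min_plus_normalize(A, F):
    """A: (m, n) min-plus matrix; F: (n, N) array of values f_j(x) on N sample points."""
    G = np.min(A[:, :, None] + F[None, :, :], axis=1)     # g_i(x) = min_j (a_ij + f_j(x)), shape (m, N)
    return np.max(G[:, None, :] - F[None, :, :], axis=2)  # nu(a_ij) = max_x (g_i(x) - f_j(x))

# Example 5.3 sampled on [-3, 3]: f1 = x, f2 = -x, f3 = x^2, f4 = 0
xs = np.linspace(-3, 3, 601)
F = np.stack([xs, -xs, xs**2, np.zeros_like(xs)])
A = np.array([[2.0, 2.0, 0.0, 1.5]])
print(min_plus_normalize(A, F))        # approximately [[2, 2, 0, 1]], as in Example 5.3
```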
Proposition 5.2. Using the notation of Algorithm 5.1, the normalization has the following properties:

(a) For each i = 1, ..., m and j = 1, ..., n, inf_{x∈X}((ν(a_ij) + f_j(x)) − g_i(x)) = 0 and sup_{x∈X}((ν(b_ij) + f_j(x)) − h_i(x)) = 0.

(b) ν(a_ij) ≤ a_ij and ν(b_ij) ≥ b_ij.

(c) For each i = 1, ..., m, g_i = ν(a_i1) ⊙ f_1 ⊕ ⋯ ⊕ ν(a_in) ⊙ f_n = min(ν(a_i1) + f_1, ..., ν(a_in) + f_n) and h_i = ν(b_i1) ⊙ f_1 ⊞ ⋯ ⊞ ν(b_in) ⊙ f_n = max(ν(b_i1) + f_1, ..., ν(b_in) + f_n).

Proof.
Here we will only prove the case of min-plus normalization; the case of max-plus normalization can be proved analogously.

For (a), since ν(a_ij) = sup_{x∈X}(g_i(x) − f_j(x)), we must have that (1) g_i(x) ≤ ν(a_ij) + f_j(x) for all x ∈ X, and (2) for each δ > 0 there exists x ∈ X such that g_i(x) + δ ≥ ν(a_ij) + f_j(x). This means that inf_{x∈X}((ν(a_ij) + f_j(x)) − g_i(x)) = 0.

For (b), note that g_i = min(a_i1 + f_1, ..., a_in + f_n). Hence g_i ≤ a_ij + f_j, which implies inf_{x∈X}((a_ij + f_j(x)) − g_i(x)) ≥ 0. Therefore, we must have ν(a_ij) ≤ a_ij by (a).

For (c), we observe that min(ν(a_i1) + f_1, ..., ν(a_in) + f_n) ≤ g_i = min(a_i1 + f_1, ..., a_in + f_n) by (b). On the other hand, for each j, we must have g_i(x) ≤ ν(a_ij) + f_j(x) for all x ∈ X, since ν(a_ij) = sup_{x∈X}(g_i(x) − f_j(x)). Therefore, we have the identity g_i = min(ν(a_i1) + f_1, ..., ν(a_in) + f_n).

Figure 4: An example of one-dimensional min-plus normalization: (a) A min-plus combination of the functions f_1(x) = x, f_2(x) = −x, f_3(x) = x² and f_4(x) = 0 as g(x) = min(f_1(x) + 2, f_2(x) + 2, f_3(x), f_4(x) + 1.5). (b) After min-plus normalization, g(x) = min(f_1(x) + 2, f_2(x) + 2, f_3(x), f_4(x) + 1).

Example 5.3.
For X = R, let f_1(x) = x, f_2(x) = −x, f_3(x) = x² and f_4(x) = 0 be functions on R. Consider the function g = 2 ⊙ f_1 ⊕ 2 ⊙ f_2 ⊕ f_3 ⊕ 1.5 ⊙ f_4 = min(f_1 + 2, f_2 + 2, f_3, f_4 + 1.5), which is a min-plus combination of f_1, f_2, f_3 and f_4. As shown in Figure 4(a), g(x) = f_1(x) + 2 for x ≤ −1, g(x) = f_3(x) for −1 ≤ x ≤ 1, and g(x) = f_2(x) + 2 for x ≥ 1. However, min((f_4 + 1.5) − g) = 0.5, i.e., the whole graph of f_4 + 1.5 is strictly above the graph of g. Running the min-plus normalization algorithm, we derive max(g − f_1) = max(g − f_2) = 2, max(g − f_3) = 0, and max(g − f_4) = 1. Therefore, after normalization, we can write g = min(f_1 + 2, f_2 + 2, f_3, f_4 + 1), where the graphs of f_1 + 2, f_2 + 2, f_3 and f_4 + 1 are all attached to the graph of g, as shown in Figure 4(b).

Remark 7. We emphasize here that if the functions f_1, ..., f_n in Algorithm 5.1 are linear functions on a Euclidean space, then the normalization process described in Algorithm 5.1 is exactly the classical Legendre transformation applied to convex functions. On the other hand, there is no restriction that the functions f_1, ..., f_n in our algorithm be linear, while the normalization process remains valid in general. In our previous work [23], we have developed a theory of tropical convexity analysis providing a more refined treatment of this general setting.

Remark 8. Recall that a general MMP-NN with K composite layers, d input nodes and p output nodes is essentially a function Φ : R^d → R^p which can be written as Φ = β_K ∘ α_K ∘ λ_K ∘ β_{K−1} ∘ α_{K−1} ∘ λ_{K−1} ∘ ⋯ ∘ β_1 ∘ α_1 ∘ λ_1, where the λ_i, α_i and β_i are linear, min-plus and max-plus transformations respectively. In a training process, we may apply min-plus normalizations to the min-plus layers α_i and max-plus normalizations to the max-plus layers β_i after steps of parameter tuning to improve and expedite convergence. In particular, the input of α_i is a vector of functions on R^d and the output of α_i is a vector of min-plus combinations of the input functions. If we apply min-plus normalization to α_i, the parameters, which are the entries of the corresponding min-plus matrix, will be adjusted. However, by Proposition 5.2, this parameter adjustment does not affect the output functions. This means that the normalization of one layer does not affect the results of other layers. The advantage of this layer-to-layer independence of normalization is that if we need to normalize one specific layer, there is no need to process the preceding layers first.

In Algorithm 5.1, a practical issue in computing ν(a_ij) and ν(b_ij) is that we need to find the minimum/maximum of functions on X, which is usually a domain in a Euclidean space. The computation of the minimum/maximum of a function can be extremely hard and time-consuming unless the input functions are simple and the domain X is also simple and of low dimension. In an MMP-NN, typically only the input functions of the first min-plus layer α_1 are linear functions, which enables us to find rigorous solutions of the minimum/maximum for the layer α_1 in a relatively easy manner. The input functions of the min-plus/max-plus layers following α_1 become less tractable as the layers go deeper. As a result, a direct min/max computation is much harder for these layers.

To solve this problem, we propose a modified version of Algorithm 5.1 called restricted normalization (Algorithm 5.4), which only deals with a finite subset D of the domain X. In practice, D can simply be chosen from the training set, which tremendously simplifies the min/max computation in the normalization process.

Algorithm 5.4 (Restricted Normalization).
(1) Restricted min-plus normalization

Input: A min-plus matrix A = (a_ij)_ij ∈ R_min^{m×n}, functions f_1, ..., f_n ∈ C(X), and a finite subset D ⊆ X.

Computation: For i = 1, ..., m, let g_i = a_i1 ⊙ f_1 ⊕ ⋯ ⊕ a_in ⊙ f_n = min(a_i1 + f_1, ..., a_in + f_n) ∈ C(X). For each i = 1, ..., m and j = 1, ..., n, compute

ν_D(a_ij) = − min_{x∈D}(f_j(x) − g_i(x)) = max_{x∈D}(g_i(x) − f_j(x)).

Output: A min-plus matrix ν_D(A) := (ν_D(a_ij))_ij ∈ R_min^{m×n}.

(2) Restricted max-plus normalization

Input: A max-plus matrix B = (b_ij)_ij ∈ R_max^{m×n}, functions f_1, ..., f_n ∈ C(X), and a finite subset D ⊆ X.

Computation: For i = 1, ..., m, let h_i = b_i1 ⊙ f_1 ⊞ ⋯ ⊞ b_in ⊙ f_n = max(b_i1 + f_1, ..., b_in + f_n) ∈ C(X). For each i = 1, ..., m and j = 1, ..., n, compute

ν_D(b_ij) = − max_{x∈D}(f_j(x) − h_i(x)) = min_{x∈D}(h_i(x) − f_j(x)).

Output: A max-plus matrix ν_D(B) := (ν_D(b_ij))_ij ∈ R_max^{m×n}.

Remark 9. We call ν_D(a_ij) the (min-plus) normalization of a_ij restricted to D, ν_D(A) the (min-plus) normalization of A restricted to D, ν_D(b_ij) the (max-plus) normalization of b_ij restricted to D, and ν_D(B) the (max-plus) normalization of B restricted to D.

Proposition 5.5.
Proposition 5.5. Using the notation of Algorithm 5.1 and Algorithm 5.4, for i = 1, · · · , m, let g′_i = ν_D(a_{i1}) ⊙ f_1 ⊕ · · · ⊕ ν_D(a_{in}) ⊙ f_n = min(ν_D(a_{i1}) + f_1, · · · , ν_D(a_{in}) + f_n) and h′_i = ν_D(b_{i1}) ⊙ f_1 ⊞ · · · ⊞ ν_D(b_{in}) ⊙ f_n = max(ν_D(b_{i1}) + f_1, · · · , ν_D(b_{in}) + f_n). The restricted normalization has the following properties:
(a) For each i = 1, · · · , m and j = 1, · · · , n, min_{x∈X}((ν_D(a_{ij}) + f_j(x)) − g′_i(x)) = min_{x∈D}((ν_D(a_{ij}) + f_j(x)) − g′_i(x)) = 0 and max_{x∈X}((ν_D(b_{ij}) + f_j(x)) − h′_i(x)) = max_{x∈D}((ν_D(b_{ij}) + f_j(x)) − h′_i(x)) = 0.
(b) ν_D(a_{ij}) ≤ ν(a_{ij}) ≤ a_{ij} and ν_D(b_{ij}) ≥ ν(b_{ij}) ≥ b_{ij}.
(c) For each i = 1, · · · , m, we have g′_i ≤ g_i and h′_i ≥ h_i, while g′_i|_D = g_i|_D and h′_i|_D = h_i|_D.

Figure 5: Examples of restricted min-plus normalization of the min-plus combination g(x) = min(f_1(x) + 2, f_2(x) + 2, f_3(x), f_4(x) + 1.5) of the functions f_1(x) = x, f_2(x) = −x, f_3(x) = x^2 and f_4(x) = 0 (as in Example 5.3 and Figure 4). (a) Normalization restricted to D_1 = {−1, 1}, with resulting combination g′(x); (b) normalization restricted to D_2 = {−1.19, 0, 0.8}, with resulting combination g″(x).

Proof. Here we only prove the case of restricted min-plus normalization; the case of restricted max-plus normalization can be proved analogously.
Consider all the functions restricted to D. Replacing X by D, we can use the same argument as in the proof of Proposition 5.2 to show that min_{x∈D}((ν_D(a_{ij}) + f_j(x)) − g′_i(x)) = 0, ν_D(a_{ij}) ≤ a_{ij} and g′_i|_D = g_i|_D.
To show that min_{x∈X}((ν_D(a_{ij}) + f_j(x)) − g′_i(x)) = min_{x∈D}((ν_D(a_{ij}) + f_j(x)) − g′_i(x)), we note that by the definition of g′_i, we have g′_i(x) ≤ ν_D(a_{ij}) + f_j(x) for all j and all x ∈ X, which means that min_{x∈X}((ν_D(a_{ij}) + f_j(x)) − g′_i(x)) ≥ 0. On the other hand, the minimum over D is 0, so the minimum over X is attained at some point of D and the identity follows.
It remains to show that ν_D(a_{ij}) ≤ ν(a_{ij}) and g′_i ≤ g_i. By the definitions of ν_D(a_{ij}) and ν(a_{ij}), we have ν_D(a_{ij}) = max_{x∈D}(g_i(x) − f_j(x)) ≤ sup_{x∈X}(g_i(x) − f_j(x)) = ν(a_{ij}). Using Proposition 5.2(c), this also implies that g′_i = min(ν_D(a_{i1}) + f_1, · · · , ν_D(a_{in}) + f_n) ≤ min(ν(a_{i1}) + f_1, · · · , ν(a_{in}) + f_n) = g_i.
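The following short sketch checks these properties numerically for the single-row combination used in Example 5.3 and in Example 5.6 below; the dense grid on [−2, 2] standing in for the domain X is an illustrative assumption.

import numpy as np

fs = [lambda x: x, lambda x: -x, lambda x: x**2, lambda x: np.zeros_like(x)]
A = np.array([[2.0, 2.0, 0.0, 1.5]])          # one min-plus row: g = min_j (a_j + f_j)

def values(points):                            # F[k, j] = f_j(x_k)
    return np.stack([f(points) for f in fs], axis=1)

def min_plus(M, F):                            # rows of M combined with the sampled f_j
    return np.min(F[:, None, :] + M[None, :, :], axis=2)

def normalize(M, F):                           # nu(m_ij) = max_k (g_i(x_k) - f_j(x_k))
    G = min_plus(M, F)
    return np.max(G[:, :, None] - F[:, None, :], axis=0)

FX = values(np.linspace(-2.0, 2.0, 4001))      # dense grid standing in for X
FD = values(np.array([-1.0, 1.0]))             # the finite subset D_1 of Example 5.6

nu_X, nu_D = normalize(A, FX), normalize(A, FD)
print(np.all(nu_D <= nu_X + 1e-9), np.all(nu_X <= A + 1e-9))   # Proposition 5.5(b)
print(np.all(min_plus(nu_D, FX) <= min_plus(A, FX) + 1e-9))    # Proposition 5.5(c): g' <= g on X
print(np.allclose(min_plus(nu_D, FD), min_plus(A, FD)))        # Proposition 5.5(c): g'|_D = g|_D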
Example 5.6. Reconsider the min-plus combination g(x) = min(f_1(x) + 2, f_2(x) + 2, f_3(x), f_4(x) + 1.5) of the functions f_1(x) = x, f_2(x) = −x, f_3(x) = x^2 and f_4(x) = 0 as in Example 5.3. Consider two distinct finite sets D_1 = {−1, 1} and D_2 = {−1.19, 0, 0.8}. We will show that the results of the normalization restricted to D_1 and restricted to D_2 are distinct.
(i) Running the min-plus normalization algorithm restricted to D_1, we derive max_{x∈D_1}(g(x) − f_1(x)) = max_{x∈D_1}(g(x) − f_2(x)) = 2, max_{x∈D_1}(g(x) − f_3(x)) = 0, and max_{x∈D_1}(g(x) − f_4(x)) = 1. Therefore, after this restricted normalization, we get the function g′ = min(f_1 + 2, f_2 + 2, f_3, f_4 + 1), which is exactly the same as the function g (Figure 5(a)).
(ii) Running the min-plus normalization algorithm restricted to D_2, we derive max_{x∈D_2}(g(x) − f_1(x)) = 2, max_{x∈D_2}(g(x) − f_2(x)) = 1.44, max_{x∈D_2}(g(x) − f_3(x)) = 0, and max_{x∈D_2}(g(x) − f_4(x)) = 0.81. Therefore, after this restricted normalization, we get the function g″ = min(f_1 + 2, f_2 + 1.44, f_3, f_4 + 0.81), which is not identical to g (Figure 5(b)). However, one can observe that g″ ≤ g and, restricted to x ∈ D_2, we still have g(x) = g″(x) (see Proposition 5.5(c)).

Remark. We emphasize here that by Proposition 5.5, even though the derived function g′_i (resp. h′_i) is in general not identical to the function g_i (resp. h_i), we still have the agreement of g′_i with g_i (resp. h′_i with h_i) restricted to D, as demonstrated in Example 5.6. This also means that the layer-to-layer independence of normalization stated in Remark 8 is still valid when restricted to D. This is actually satisfactory enough for most of the cases we are interested in, since typically the set D is chosen from the training set.
References

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
[2] Diane Maclagan and Bernd Sturmfels. Introduction to tropical geometry, volume 161. American Mathematical Soc., 2015.
[3] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in neural information processing systems, pages 666–674, 2011.
[4] Yoshua Bengio and Olivier Delalleau. On the expressive power of deep architectures. In Algorithmic Learning Theory: 22nd International Conference, ALT 2011, Espoo, Finland, October 5-7, 2011, Proceedings, volume 6925, page 18. Springer, 2011.
[5] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on learning theory, pages 907–940, 2016.
[6] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014.
[7] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR). OpenReview.net, 2017.
[8] Liwen Zhang, Gregory Naitzat, and Lek-Heng Lim. Tropical geometry of deep neural networks. In International Conference on Machine Learning, pages 5824–5832, 2018.
[9] Vasileios Charisopoulos and Petros Maragos. A tropical approach to neural networks with piecewise linear activations. arXiv preprint arXiv:1805.08749, 2018.
[10] Giuseppe C Calafiore, Stephane Gaubert, and Corrado Possieri. Log-sum-exp neural networks and posynomial models for convex and log-log-convex data. IEEE Transactions on Neural Networks and Learning Systems, 2019.
[11] Giuseppe C Calafiore, Stéphane Gaubert, and Corrado Possieri. A universal approximation result for difference of log-sum-exp neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
[12] Petros Maragos and Emmanouil Theodosis. Tropical geometry and piecewise-linear approximation of curves and surfaces on weighted lattices. ArXiv, abs/1912.03891, 2019.
[13] Petros Maragos and Emmanouil Theodosis. Multivariate tropical regression and piecewise-linear surface fitting. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3822–3826. IEEE, 2020.
[14] Georgios Smyrnis and Petros Maragos. Tropical polynomial division and neural networks. ArXiv, abs/1911.12922, 2019.
[15] G. Smyrnis, P. Maragos, and G. Retsinas. Maxpolynomial division with application to neural network simplification. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4192–4196, 2020.
[16] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.
[17] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016.
[18] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016.
[19] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
[20] Hanting Chen, Yunhe Wang, Chunjing Xu, Boxin Shi, Chao Xu, Qi Tian, and Chang Xu. Addernet: Do we really need multiplications in deep learning? arXiv preprint arXiv:1912.13200, 2019.
[21] Peter Butkovič. Max-linear systems: theory and algorithms. Springer Science & Business Media, 2010.
[22] Mike Develin and Bernd Sturmfels. Tropical convexity. Doc. Math, 9(1-27):7–8, 2004.
[23] Ye Luo. Idempotent analysis, tropical convexity and reduced divisors. arXiv preprint arXiv:1808.01987, 2018.
[24] Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In International conference on machine learning, pages 1319–1327. PMLR, 2013.
[25] G. Georganopoulos. Sur l'approximation des fonctions continues par des fonctions lipschitziennes.