Reverse Derivative Ascent: A Categorical Approach to Learning Boolean Circuits
David I. Spivak and Jamie Vicary (Eds.): Applied Category Theory 2020 (ACT2020), EPTCS 333, 2021, pp. 247–260, doi:10.4204/EPTCS.333.17. © P.W. Wilson, F. Zanasi. This work is licensed under the Creative Commons Attribution License.
Paul Wilson
University College London and University of Southampton
[email protected]

Fabio Zanasi
University College London
[email protected]
We introduce Reverse Derivative Ascent: a categorical analogue of gradient-based methods for machine learning. Our algorithm is defined at the level of so-called reverse differential categories. It can be used to learn the parameters of models which are expressed as morphisms of such categories. Our motivating example is boolean circuits: we show how our algorithm can be applied to such circuits by using the theory of reverse differential categories. Note our methodology allows us to learn the parameters of boolean circuits directly, in contrast to existing binarised neural network approaches. Moreover, we demonstrate its empirical value by giving experimental results on benchmark machine learning datasets.
Computation of the reverse derivative is a critical part of gradient-based machine learning methods (see e.g. [16] for an overview). In essence, the reverse derivative tells us how to update the parameters of a model, given a prediction error. This update procedure is at the core of many optimisation methods, such as stochastic gradient descent [16, 2.2], used for training deep neural networks.

Now, the model class of choice is typically neural networks, which can be considered as the smooth maps R^a → R^b. A natural question to ask is whether these gradient-based methods could be generalised to other settings. In this paper, we focus on boolean functions, i.e. maps Z_2^a → Z_2^b, and their operational counterpart, boolean circuits.

This setting has real practical value. Larger neural network models typically require expensive and power-hungry GPGPU hardware to train and run [4, 14]. Performance can be improved via binarisation: the extraction from a trained neural network of a boolean circuit, which provides a better optimised representation of the same model.

The usual pattern in these approaches (see e.g. BinaryConnect [4] and LUTNet [20]) is to perform the training aspect exclusively on the 'real-valued' side. However, training schemes for binarised models such as boolean circuits are typically more efficient [9]. It is thus natural to ask: "why not learn the parameters of boolean circuits directly?"

Reverse Derivative Ascent, the algorithm that we introduce in this paper, originates from this question: instead of training a neural network and then extracting a boolean circuit, we begin with a boolean circuit, and learn its parameters directly.

In defining and analysing the algorithm, we take a categorical approach. Our methodology relies on the abstract framework provided by reverse differential categories [3], which axiomatises the concept of a reverse derivative operator. We proceed in three steps:

1. We give a syntactic presentation in terms of string diagrams for the reverse differential operator of polynomials, and a safety condition specifying when we can apply this operator to boolean circuits.
2. We define Reverse Derivative Ascent as a 'gradient'-based algorithm working for arbitrary morphisms of reverse differential categories.
3. We can then apply Reverse Derivative Ascent to boolean circuits, and demonstrate its empirical value by giving experimental results for benchmark datasets.

The categorical setting brings two main advantages. First, by exploiting the presentation of boolean circuits as an axiomatic theory of string diagrams [12], we are able to define a suitable reverse derivative operator compositionally, by induction on the circuit syntax. Second, because our definition of Reverse Derivative Ascent is phrased at the general level of reverse differential categories, it paves the way for application to model classes other than boolean circuits, which we leave for future work.

The rest of the paper is structured as follows. We begin with necessary background in Section 2, defining (parametrised) boolean functions and circuits. In Section 3 we define our graphical operator for boolean circuits, give a safety condition for its application, and show that it is consistent with the reverse differential combinator of polynomials. We include additional material relevant to Section 3 in Appendix A. In Section 4, we describe the Reverse Derivative Ascent algorithm and provide a Haskell library to run our algorithm on boolean circuits. Also, we give empirical evidence that it is able to learn functions from data. We conclude the paper with a discussion of future work in Section 5.

We first recall the basics of boolean functions.
Definition 1. A boolean function is a map f : Z_2^a → Z_2^b. We also say that a map g : Z_2^(p+a) → Z_2^b is a parametrised boolean function with p parameters, a inputs, and b outputs. We denote by BoolFun the symmetric strict monoidal category whose objects are natural numbers with addition as tensor product, and whose morphisms are boolean functions. The monoidal product of this category is in fact the cartesian product, making this category cartesian.

While parametrised boolean functions are of course exactly the boolean functions, by distinguishing the p parameters from the a inputs we mean to declare our intent: our goal is to learn a map Z_2^a → Z_2^b which approximates a given dataset of input/output examples of type (x, y) ∈ (Z_2^a, Z_2^b). To do this, we must choose a particular model: a function f : Z_2^(p+a) → Z_2^b. We then use our machine learning algorithm to search for a set of parameters θ ∈ Z_2^p such that f(θ, −) : Z_2^a → Z_2^b approximates the dataset well.

Example 2. Suppose we wish to learn a boolean function Z_2 → Z_2 with no prior knowledge of the dataset. One choice of model is the function eval : Z_2^(2+1) → Z_2, defined as eval(θ_0, θ_1, x) ↦ θ_x. That is, the function of 2 parameters which maps the data bit x to θ_0 if x = 0, and θ_1 if x = 1. In this case, our parameters represent a truth table: i.e., they extensionally specify the Z_2^a → Z_2^b function we are learning.

Remark 3.
An interesting property of the eval model is that, because there are a finite number of boolean functions Z_2^a → Z_2^b, the entire function space can be represented with 2^a · b parameters. Clearly, this makes it suitable only for small a, but [20] demonstrates it can be profitably used as a compositional building block for larger models.

In order to apply Reverse Derivative Ascent to boolean functions, we will need to use the reverse differential operator of a related category, which we now introduce. (Note that our implementation represents circuits in terms of the corresponding boolean functions: this is just a presentation choice, because the categories of boolean functions and boolean circuits are isomorphic. We refer to Section 4.2.1 for a more comprehensive discussion of the implementation.)
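To make the truth-table reading of eval concrete, here is a minimal Python sketch (our illustration, not the paper's Haskell implementation) of the general model with 2^a · b parameters, where the parameter vector simply lists the output for each possible input:

```python
from itertools import product

def eval_model(theta, x):
    """Truth-table model from Example 2, generalised: theta is a tuple of
    2**a * b parameter bits, x a tuple of a input bits. The output is the
    b-bit slice of theta selected by the input."""
    a = len(x)
    b = len(theta) // (2 ** a)
    i = sum(bit << k for k, bit in enumerate(x))   # x read as a table index
    return theta[i * b:(i + 1) * b]

# Example 2 (a = b = 1): parameters (theta_0, theta_1)
assert eval_model((0, 1), (0,)) == (0,)   # returns theta_0 when x = 0
assert eval_model((0, 1), (1,)) == (1,)   # returns theta_1 when x = 1

# 2**a * b parameters represent the entire function space Z_2^a -> Z_2^b:
a, b = 2, 1
tables = {tuple(eval_model(t, x) for x in product((0, 1), repeat=a))
          for t in product((0, 1), repeat=2 ** a * b)}
assert len(tables) == 16   # all 16 boolean functions Z_2^2 -> Z_2
```

As the final assertion checks, every function Z_2^2 → Z_2 is realised by some choice of parameters, which is exactly why the model only scales to small a.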
Definition 4. Following [3], let Poly_{Z_2} be the category with objects the natural numbers, and with morphisms p : a → b being b-tuples of polynomials in a variables. That is, p = ⟨p_1(x⃗), p_2(x⃗), ..., p_b(x⃗)⟩ with components p_i(x⃗) ∈ Z_2[x_1, ..., x_a], the polynomial ring in a variables over Z_2. Composition of morphisms is the composition of polynomials as in [3], where the composite of a --p--> b --q--> c is the tuple of polynomials given by q(p_1(x⃗), p_2(x⃗), ..., p_b(x⃗)).

In order to make the relationship between BoolFun and Poly_{Z_2} clear, we will now recall graphical presentations of both. Boolean functions have a well-known graphical representation as boolean circuits. This correspondence can be made formal by establishing an isomorphism between BoolFun and a category BoolCirc whose morphisms are (open) boolean circuits; see [12, Section 4]. Furthermore, the morphisms of BoolCirc can be pictured as the string diagrams [17] freely generated by a certain signature and equations. Similarly, morphisms of Poly_{Z_2} have a graphical representation as polynomial circuits, which is obtained by relaxing one of the equations of BoolCirc. As we will exploit such graphical representations in our developments, we recall them below.
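Before the graphical presentation, the algebraic one can be sketched directly. The following Python fragment (our illustration, with hypothetical names) represents a polynomial over Z_2 as a set of monomials, each monomial a tuple of exponents, and implements composition by substitution. Note that x² is not identified with x here, which is exactly the distinction between polynomial and boolean circuits:

```python
# A morphism a -> b of Poly_{Z_2} is a b-tuple of polynomials in a variables.
# Here a polynomial is a frozenset of monomials (coefficients are mod 2, so a
# monomial is simply present or absent), and a monomial is a tuple of
# exponents. Note x**2 is NOT identified with x.

def add(p, q):
    return p ^ q                      # mod-2 addition: symmetric difference

def mul(p, q):
    out = frozenset()
    for m in p:
        for n in q:
            out = out ^ {tuple(i + j for i, j in zip(m, n))}
    return out

def substitute(p, qs, a):
    """Compose polynomials: substitute qs[i] (each in a variables) for the
    i-th variable of p, yielding a polynomial in a variables."""
    out = frozenset()
    for mono in p:
        term = frozenset({(0,) * a})  # the constant 1
        for i, e in enumerate(mono):
            for _ in range(e):
                term = mul(term, qs[i])
        out = add(out, term)
    return out

# Composition 2 --p--> 1 --q--> 1 with p = <x1 + x2> and q = <y^2>:
p = frozenset({(1, 0), (0, 1)})       # x1 + x2
q = frozenset({(2,)})                 # y^2
# (x1 + x2)^2 = x1^2 + x2^2 over Z_2, since the cross terms cancel:
assert substitute(q, [p], 2) == frozenset({(2, 0), (0, 2)})
```

The final assertion shows mod-2 cancellation at work: the two x1·x2 cross terms of the square annihilate each other.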
Definition 5. We denote by BoolCirc the symmetric strict monoidal category whose objects are the natural numbers and whose morphisms are the string diagrams freely generated by the generators (1) (XOR, AND, copy, and discard, together with their units) and, for all morphisms f, the equations (2). (The generators and equations are rendered as string diagrams in the original paper and are not reproduced here.) We call circuits the string diagrams freely obtained by the generators in (1) and quotiented by the laws of symmetric monoidal categories. When we say boolean circuits, however, we mean morphisms of BoolCirc; i.e., those circuits quotiented by equations (2).
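A small Python sketch of how these generators can be read as boolean functions, with sequential and parallel composition (an informal rendering of the circuits-to-functions interpretation; the combinator names are ours):

```python
# Each generator read as a boolean function on tuples of bits.
XOR     = lambda bits: (bits[0] ^ bits[1],)
AND     = lambda bits: (bits[0] & bits[1],)
COPY    = lambda bits: (bits[0], bits[0])
DISCARD = lambda bits: ()

def seq(f, g):
    """Sequential composition f ; g, in diagrammatic order."""
    return lambda bits: g(f(bits))

def tensor(f, g, n):
    """Parallel composition: f acts on the first n wires, g on the rest."""
    return lambda bits: f(bits[:n]) + g(bits[n:])

# One of the equations of BoolCirc: COPY ; AND computes x AND x, which
# equals the identity wire on booleans.
sq = seq(COPY, AND)
assert all(sq((x,)) == (x,) for x in (0, 1))
```

The last assertion is precisely the equation that will later be dropped when passing to polynomial circuits, where x · x and x are distinct.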
Definition 6. We denote by PolyCirc the symmetric strict monoidal category whose objects are the natural numbers, and whose morphisms are the string diagrams freely generated by the generators (1) and the equations (2) minus the axiom identifying the AND of a copied wire with that wire (i.e., x · x = x). We denote this set of axioms by A, and call the morphisms of this category polynomial circuits.

Note the equational theories of both boolean and polynomial circuits yield that XOR and AND (with their respective units) form two commutative monoids, and copy (with discard) a commutative comonoid. In fact, the comonoid structure makes both BoolCirc and PolyCirc into cartesian categories.

For boolean circuits, one may think operationally of the generators as the XOR gate, the AND gate, and copy. This intuition is at the basis of the interpretation functor ⟦·⟧_B : BoolCirc → BoolFun of boolean circuits as boolean functions. Saying that (1) and (2) present BoolFun amounts to the following statement.
Proposition 7 ([12]). ⟦·⟧_B : BoolCirc → BoolFun is an isomorphism of symmetric monoidal categories.

Corollary 8. ⟦c⟧_B = ⟦d⟧_B if and only if c =_(2) d.

Similarly for polynomial circuits, we may think of the XOR and AND generators respectively as the two-variable polynomials x_1 + x_2 and x_1 x_2. Saying that the generators (1) and equations A present PolyCirc amounts to the following statement about the interpretation functor ⟦·⟧_P:

Proposition 9. ⟦·⟧_P : PolyCirc → Poly_{Z_2} is an isomorphism of symmetric monoidal categories.

Corollary 10. ⟦f⟧_P = ⟦g⟧_P if and only if f =_A g.

Proof. The key idea is that the hom-sets Poly_{Z_2}(a, b) and PolyCirc(a, b) have the structure of the free module over the polynomial ring Z_2[x_1, ..., x_a], and so there exists an isomorphism between them. See Appendix A for the full proof.

Remark 11. We note that one of the axioms from [12, Figure 40] is redundant, and can be derived from the others. This is because for all n ∈ N, the hom-sets BoolFun(n, 1) have a ring structure in which 0 · x = 0.

Example 12.
Continuing our example of the eval : Z_2^(2+1) → Z_2 function from Example 2, its corresponding boolean circuit (3) is shown in the original paper, with its inputs labelled θ_0, θ_1, x. Note that this circuit can be equally interpreted as a morphism of Poly_{Z_2}, namely as the single 3-variable polynomial ⟨θ_0 + (θ_0 + θ_1)x⟩.

In order to define our machine learning algorithm, we need a notion of reverse derivative for boolean circuits. To this aim, we follow a principled approach by recalling reverse differential categories [3], which axiomatise the notion of a reverse differential combinator for categorical morphisms. More concretely, we will translate the reverse derivative combinator of Poly_{Z_2} given in [3] to the graphical setting of PolyCirc, and show how we can exploit the syntactic similarity with boolean circuits in order to apply it to morphisms of BoolCirc. However, we will see that this does not make BoolCirc a reverse derivative category, and applying the reverse derivative in this way requires a safety condition which we will introduce.
Definition 13 (from [3]). A reverse differential category is a category which is
(i) cartesian;
(ii) left-additive, meaning that each object a is canonically equipped with a commutative monoid structure (+_a : a × a → a, 0_a : I → a);
(iii) equipped with a reverse differential combinator which maps morphisms f : A → B to reverse derivatives R[f] : A × B → A, obeying the axioms RD.1 through RD.7 of [3, Section 3].

Intuitively, R[f] approximately computes the change in input necessary to achieve a given change of output for a function f. For example, suppose we have a parametrised boolean function f : Z_2^(p+a) → Z_2^b whose predictions we denote ŷ = f(θ, x). We may have some observed data (x, y) ∈ (Z_2^a, Z_2^b) that disagree with our predictions, i.e., where ŷ ≠ y, and we wish to adjust the parameters θ of our model to better match our observations. The reverse derivative allows us to compute a change in parameters δθ so that f(θ + δθ, x) is a better prediction than f(θ, x). This intuition is exactly the basis for Reverse Derivative Ascent, which we describe in Section 4.

Our next goal is to show that we can apply a notion of reverse derivative to morphisms of BoolCirc. We establish some preliminary intuition through an example.
Example 14. By directly translating the definition of the reverse derivative combinator for morphisms of Poly_{Z_2} to boolean functions f : Z_2^a → Z_2^b, we obtain the a-tuple of (a + b)-variable functions

  ⟨ Σ D_1[f](x) · δy, ..., Σ D_a[f](x) · δy ⟩

where we use · to denote pointwise multiplication of bitvectors, Σ x to denote the sum of vector components, and D_i[f] is the i-th partial derivative of f, as defined in [15]: D_i[f](x) = f(x) + f(x + e_i), with e_i the i-th basis vector, whose entries are 0 except for the i-th, which is 1. (Note that reverse derivatives compute the changes in all inputs, so for a parametrised boolean circuit, this includes both parameter and data inputs. This definition is essentially the same as the definition of the reverse differential combinator given by [3] for polynomials over a semiring, except for the definition of D_i, for which we use the partial derivatives of boolean functions given by [15].)

The above indeed allows us to learn the parameters of boolean circuits from data but has one major flaw: efficiency. Computing it requires a + 1 evaluations of f, and in models with just a moderate number of parameters and/or where f is expensive to compute, this quickly becomes intractable. However, this definition is still useful for taking the reverse derivative of a 'black box' function whose symbolic form is not known, such as a function from a software library. Indeed, we expose it in our Haskell library as the function RDA.ReverseDerivative.rdiffB.

We will now develop a more efficient approach for boolean functions: the key step is the introduction of R̃, a syntactic operator defined inductively on circuits. However, we will see that this operation does not respect the equations of BoolCirc, and so we introduce a safety condition restricting us to those circuits on which R̃ is well defined. Finally, we show that because every circuit has a safe equivalent in BoolCirc, we are able to define an operator for BoolCirc which coincides with the reverse derivative of Poly_{Z_2}.

Definition 15. For each circuit f : a → b, we define the operator R̃[f] : a + b → a inductively on the generators (1), composition, and monoidal product. Since each generator represents a specific morphism of Poly_{Z_2}, we define R̃ on generators accordingly; the defining rules (4) are rendered as string diagrams in the original paper.
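The brute-force definition of Example 14 can be sketched directly in Python (the paper's library exposes this idea as the Haskell function RDA.ReverseDerivative.rdiffB; the function below is our illustration, not that API):

```python
def rdiff_brute(f, a):
    """Brute-force reverse derivative of a boolean function f: Z_2^a -> Z_2^b.
    Returns R: given (x, dy), compute dx_i = sum_j D_i[f](x)_j * dy_j (mod 2),
    where D_i[f](x) = f(x) + f(x + e_i) as in Example 14."""
    def R(x, dy):
        fx = f(x)
        dx = []
        for i in range(a):
            # flip the i-th bit of x, i.e. compute x + e_i over Z_2
            xe = tuple(b ^ (1 if j == i else 0) for j, b in enumerate(x))
            Di = tuple(u ^ v for u, v in zip(fx, f(xe)))  # partial derivative
            dx.append(sum(d & e for d, e in zip(Di, dy)) % 2)
        return tuple(dx)
    return R

# Black-box example: f(x1, x2) = x1 AND x2.
f = lambda x: (x[0] & x[1],)
R = rdiff_brute(f, 2)
# D_1[f](x) = x2 and D_2[f](x) = x1, so R(x, dy) = (x2 * dy, x1 * dy):
assert R((1, 0), (1,)) == (0, 1)
assert R((1, 1), (1,)) == (1, 1)
```

Note that each call to R performs a + 1 evaluations of f, which is what makes this approach intractable for models with many parameters.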
Following axioms RD.5 and RD.4 of [3, Definition 13], we define R̃ on composition and monoidal product of circuits as in (5) (rendered as string diagrams in the original paper): R̃[f ; g] is assembled from a copy of f together with R̃[g] and R̃[f], and R̃[f ⊗ g] applies R̃[f] and R̃[g] to the corresponding wires. Strictly speaking, since PolyCirc is a cartesian category, the definition of R̃[f ⊗ g] is only implied by RD.4, and so we verify that this definition indeed respects the axioms A of PolyCirc.

Lemma 16. R̃ is well defined for circuits modulo A; that is, c =_A d implies R̃[c] =_A R̃[d].

Proof.
It suffices to check the statement on c and d that are equal modulo a single axiom ⟨l, r⟩ in A. This means that, modulo the laws of symmetric monoidal categories, c can be factorised as c_L ; (id ⊗ l) ; c_R and d as c_L ; (id ⊗ r) ; c_R. Thus R̃[c] = R̃[c_L ; (id ⊗ l) ; c_R] and R̃[d] = R̃[c_L ; (id ⊗ r) ; c_R]. Unravelling these circuits according to Definition 15, one may observe that in order to prove that R̃[c] = R̃[d] we only need to check that R̃[l] = R̃[r]. This can be verified exhaustively for all ⟨l, r⟩ ∈ A.

Consequently, it is clear that this syntactic definition of R̃ is equivalent to the reverse derivative R of PolyCirc and, by isomorphism, of Poly_{Z_2}. However, this definition is not compatible with boolean circuits: although in BoolCirc we have the axiom identifying the AND of a copied wire with that wire, applying R̃ to the two sides of this axiom yields circuits which are clearly not equal. Indeed, Lemma 16 highlights that this axiom is the only problematic one.

To address this issue, we now introduce a condition called safety, and show that R̃ always respects (2) when applied to safe circuits. In essence, the following series of results gives us a recipe to take the reverse derivative of a boolean circuit, even though BoolCirc does not form a reverse derivative category.
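Our reading of the generator rules (4), and of the incompatibility just described, can be sketched over single bits in Python (the paper defines these rules as string diagrams; the function names here are ours):

```python
# Reverse derivatives of the generators: each takes the generator's inputs
# together with a change in its outputs, and returns a change in its inputs,
# all over Z_2 (^ is XOR, & is AND).
def R_xor(x1, x2, dy):        # d(x1 + x2)/dx1 = d(x1 + x2)/dx2 = 1
    return (dy, dy)

def R_and(x1, x2, dy):        # d(x1 * x2)/dx1 = x2, d(x1 * x2)/dx2 = x1
    return (x2 & dy, x1 & dy)

def R_copy(x, dy1, dy2):      # changes on the two copies are summed
    return (dy1 ^ dy2,)

def R_discard(x):             # no output change to propagate back
    return (0,)

# The problematic axiom: COPY ; AND equals the identity wire in BoolCirc,
# but R~[COPY ; AND] is constantly 0 whereas R~[identity] returns dy.
def R_sq(x, dy):
    d1, d2 = R_and(x, x, dy)      # both AND inputs carry the value x
    return R_copy(x, d1, d2)      # ... and the two equal changes cancel

assert R_sq(1, 1) == (0,)         # not equal to R~[identity](1, 1) == (1,)
```

The final assertion reproduces, on bits, exactly the failure that motivates the safety condition below.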
Safety can be succinctly defined by regarding a boolean circuit combinatorially as a directed graph.

Definition 17. We say a circuit c is safe if, for every AND generator in c, the two input ports of that generator are not reachable from the same input port of c.

Example 18. The circuit consisting of a copy followed by an AND is not safe, whereas (3) is safe. The circuit (6) of the original paper, which is equivalent in BoolCirc to (3), is not safe: both inputs of its rightmost AND generator are reachable from θ_0 (and actually also from θ_1).

As witnessed by (3) and (6), it is actually possible to show that:

Lemma 19.
For each boolean circuit c, there is a safe boolean circuit d such that c =_(2) d. (This can be made completely formal by interpreting string diagrams as directed hypergraphs with boundaries [2, 21], where generators form hyperedges and the wires connecting them are the nodes. In this context, reachability between wires, as in Definition 17 above, can be defined as the existence of a forward path between the corresponding nodes in the hypergraph.)
Proof. The idea is that one may put circuits in a canonical form, after which it is straightforward to eliminate all the unsafe paths by iteratively applying x · x = x as a rewrite rule. See Appendix A for the full proof.

By virtue of Lemma 19, in order to show that R̃ yields a well-defined operator on the whole of BoolCirc, it suffices to show that it is well defined on safe circuits. To do so, the following is the key intermediate lemma.
Lemma 20. If two circuits c and d are safe and c =_(2) d, then c =_A d.

Proof.
We give an overview of the structure of the argument and refer to Appendix A for the full details. The proof relies on the polynomial interpretation of circuits, ⟦·⟧_P. Under this interpretation, the copy-then-AND circuit is interpreted as ⟨x²⟩ and the identity wire as ⟨x⟩. Thus the axiom identifying them is not sound under this interpretation, because ⟨x²⟩ ≠ ⟨x⟩. On the other hand, we know that c =_A d if and only if ⟦c⟧_P = ⟦d⟧_P by Corollary 10. By definition, this is the same as saying that c =_(2) d if and only if ⟦c⟧_P = ⟦d⟧_P modulo the equation x² = x.

Now, one can prove that, for any safe circuit c, ⟦c⟧_P does not contain any squared term (Lemma 34). Thus, coming to c and d as in the statement of the lemma, because c =_(2) d, it follows that ⟦c⟧_P = ⟦d⟧_P modulo the equation x² = x. Because c and d are safe, their interpretations do not contain any squared terms, and thus ⟦c⟧_P = ⟦d⟧_P. By completeness of A with respect to ⟦·⟧_P, we conclude that c =_A d.

We can now conclude that:

Proposition 21. R̃ is well defined on safe circuits modulo (2); that is, for c and d safe, if c =_(2) d then R̃[c] =_(2) R̃[d].

Proof. Suppose we have two safe circuits c, d such that c =_(2) d. By Lemma 20 we know that c =_A d, and therefore we have R̃[c] =_A R̃[d] by Lemma 16. Finally, because A ⊆ (2), we have that R̃[c] =_(2) R̃[d].

We can now define an operator R on boolean circuits which computes the reverse derivative.

Definition 22.
Let f be a morphism of BoolCirc, i.e. a (2)-equivalence class of boolean circuits. Let c be a safe circuit in this class, which exists by Lemma 19. Define R[f] as the (2)-equivalence class of R̃[c]. Since c is safe, R is well defined thanks to Proposition 21.

Note that this definition is a minor abuse of notation, because R does not make BoolCirc a reverse derivative category. This is because the safety condition is not compositional, and thus cannot satisfy axiom RD.5. Nevertheless, we are still able to use R to learn the parameters of boolean functions, as we demonstrate in the following sections.

We now introduce our machine learning algorithm, Reverse Derivative Ascent. The definition refers to the category BoolCirc, as boolean circuits are our motivating example. However, our formulation makes sense in any reverse differential category. We proceed in two parts: the inner 'step' of the algorithm, which we call rdaStep, and the outer 'iteration' of rdaStep, which is rda.
Definition 23. Let f : p + a → b be a boolean circuit in BoolCirc, thus computing a parametrised boolean function with p parameters. We define rdaStep f as the circuit (7) of the original paper: it applies f to the current parameters and input, combines the result with the true label to obtain the model error, and feeds parameters, input, and error into R[f] to produce the updated parameters.

rdaStep f represents a single iteration of rda. Its function is to compute a new parameter vector θ′ for a single labelled dataset example (x, y) ∈ (Z_2^a, Z_2^b). We highlight two important parts of rdaStep. First, it computes the model error δy := f(θ, x) + y: the difference between the model prediction and the true label. Secondly, it uses R[f] to compute a change in parameters δθ, the projection of R[f](θ, x, δy) onto the parameter wires, such that f(θ + δθ, −) will more closely approximate the example datum.

Of course, we would like to update our parameters multiple times: this is the ascent part of Reverse Derivative Ascent. In Haskell, rda is simply the scanl operation over rdaStep, but we define rda as a circuit to emphasize its generality as a morphism of a reverse derivative category.

Definition 24.
Let n ∈ N, and let (x_i, y_i) ∈ (Z_2^a, Z_2^b) denote a sequence of examples with 0 < i ≤ n. rda f is defined as the circuit (8) of the original paper: a chain of n copies of rdaStep f, in which the i-th copy consumes the example (x_i, y_i) together with the parameters θ_{i−1} produced by the previous copy, the final output being the learned parameters θ_n.

Remark 25.
In general, there is no need for the elements (x_i, y_i) to be in direct correspondence with elements of the dataset. Commonly, the sequence of examples 'shown' to the algorithm will be shuffled, with repetitions [16, 6.1].

We now show empirical results of our method (Table 1; the table is not reproduced in this extraction), which suggest that our algorithm is genuinely able to learn useful functions from real-world data. The full source code for running these experiments is available at http://catgrad.com/p/reverse-derivative-ascent. We begin with a brief discussion of our implementation. (Following [3], we write composition left-to-right to mimic diagrammatic order.)

The purpose of our implementation is to specify and evaluate circuits as machine learning models. A boolean circuit a → b is represented by a term of the datatype a :-> b; more complex circuits are built by composition and tensoring from the primitives of (1). Note that, in our implementation, such primitives already come interpreted as the corresponding boolean functions, exploiting the isomorphism between BoolCirc and BoolFun (Proposition 7). This is just a presentation choice, which spares us the need to define a separate syntax and interpreter. In the future, we plan to enhance the flexibility of our tool by making these two components distinct.

In fact, our datatype :-> is a pair of a circuit and its reverse derivative: constructing a circuit simultaneously constructs its reverse derivative precisely as in (5). In this way, the reverse derivative is built up compositionally from smaller parts, and therefore to compute the reverse derivative we need only extract the second element of the pair, for which we provide the rdiff function. (Interestingly, this pairing is the way in which the reverse derivative construction can be made functorial; see [3, Proposition 31] for details.)

As we have seen in Section 3, composing reverse derivatives in this way is only valid for safe circuits. This prototype version of the code does not implement the procedure described in Appendix A to extract a safe circuit, and so we provide a second method to compute reverse derivatives: the brute-force rdiffB function (as described in Example 14). Note that this method can be applied even to unsafe circuits, but is significantly less efficient compared to the compositional rdiff as defined in Definition 15. For example, in our experiment code we consider the eval model, an instance of which we show in Example 12. For an a-dimensional input, the model has 2^a parameters, and so computing rdiffB eval requires running eval an exponential number of times. By comparison, R[eval] (as computed by rdiff eval) is a circuit whose size is within a constant factor of eval, and whose result needs to be computed just once. To showcase the difference, in the two experiments which follow, we use rdiff compositionally for the Iris model (since it is safe) but use rdiffB for the MNIST model. In the latter case, the number of parameters is equal to the number of inputs, so this method is not too computationally demanding. We discuss further avenues for improvement to our prototype implementation in Section 5.
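As a minimal illustration of Definitions 23 and 24 (names and API ours, not the Haskell library's), the step and its iteration can be sketched in Python, here learning the identity function with the eval circuit of Example 12 and a hand-derived reverse derivative for it:

```python
def rda_step(f, R, p, theta, x, y):
    """One step of Reverse Derivative Ascent (Definition 23). f is the model
    on concatenated (theta, x) tuples; R is its reverse derivative, mapping
    ((theta, x), dy) to a change in all p + a inputs."""
    y_hat = f(theta + x)
    dy = tuple(u ^ v for u, v in zip(y_hat, y))   # model error (XOR in Z_2)
    d = R(theta + x, dy)                          # change in all inputs
    dtheta = d[:p]                                # project onto parameters
    return tuple(t ^ dt for t, dt in zip(theta, dtheta))  # theta + dtheta

def rda(f, R, p, theta, examples):
    """Iterate rda_step over a sequence of examples (Definition 24)."""
    for x, y in examples:
        theta = rda_step(f, R, p, theta, x, y)
    return theta

# The eval circuit of Example 12: f(theta0, theta1, x) = theta0 + (theta0 + theta1) x
f = lambda v: (v[0] ^ ((v[0] ^ v[1]) & v[2]),)
# Its reverse derivative, hand-derived from the partial derivatives
# D_theta0 = 1 + x, D_theta1 = x, D_x = theta0 + theta1:
R = lambda v, dy: ((1 ^ v[2]) & dy[0], v[2] & dy[0], (v[0] ^ v[1]) & dy[0])

# Start from theta = (1, 0) (the NOT table) and show the identity examples:
theta = rda(f, R, 2, (1, 0), [((0,), (0,)), ((1,), (1,))])
assert theta == (0, 1)                       # learned the identity truth table
assert f(theta + (0,)) == (0,) and f(theta + (1,)) == (1,)
```

Two examples suffice here because each step flips exactly the truth-table entry that disagreed with the observed label.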
The Iris dataset [5] is a simple example of a classification problem, and is frequently used for pedagogical purposes, e.g. in [6]. It consists of 150 labelled examples of three types of iris flower. Each example consists of four measurements of the flower petal and sepal sizes, so we have a dataset of examples (x̃, ỹ) ∈ (R^4, {Setosa, Versicolor, Virginica}).
We run two experiments with this dataset, using our running example of the eval model. We first tackle the simpler problem of the sub-dataset consisting of the labelled examples for classes Setosa and Versicolor, which we call 'Iris (2-class)' in Table 1, and then show results for the full 3-class problem. We also run two variations of each of these experiments, corresponding to different ways of encoding the labels: the 3-bit one-hot encoding, and the encoding of labels as binary numbers. (One-hot refers to the standard practice of encoding the i-th class label of n total as a vector with zero entries except for the i-th; see e.g. [1, Section 3.3].) In all experiments, we preprocess the data by normalizing and rounding each feature x̃_i ∈ R into a single bit x_i ∈ Z_2. (This essentially throws away as much of the information of the dataset as possible: we map each feature to a simple 'high' or 'low' value.) For the n-class problem, this gives us a dataset of examples (x, y) ∈ (Z_2^4, Z_2^n) for the one-hot encoding, and (x, y) ∈ (Z_2^4, Z_2^⌈log n⌉) for the binary encoding.

MNIST [13] is an image classification dataset widely used as a benchmark in machine learning (see e.g. [4]). It consists of 60000 examples of images of handwritten numeric digits (0 to 9), with each image consisting of 28 × 28 greyscale pixels encoded as bytes. The dataset therefore consists of examples (x̃, ỹ) ∈ ({0..255}^(28×28), {0..9}).

We do not tackle the full 10-class problem, but leave it for future work. Instead, we restrict ourselves to the subset of classes {0, 1}. While this means we cannot compare our method to the state of the art on this benchmark, we believe it demonstrates that our method is indeed capable of learning. As with the Iris data, we also binarize the pixels of the dataset by normalisation and rounding, to give our 'binarized' dataset of examples (x, y) ∈ (Z_2^(28×28), Z_2).

Clearly the dimensionality of this problem is too large to use eval, so we instead use a model pseudoLinear : (28 × 28) + (28 × 28) → 1, so named because its structure is loosely inspired by the linear layers of neural networks. We give only a brief informal description of this model here (for technical details, see the experiment code we release with this paper at https://github.com/statusfailed/act-2020-experiments).

Essentially, the model learns a 'feature mask', which is simply a bitmap image that is pointwise multiplied with the input. If the resulting bitvector has fewer than 25% as many 1 bits as the mask, the model returns 1. The intuition is that the model should learn the 'average' handwritten 0 digit, and compare it with inputs. In the two-class case this is a fair assumption, since images of 1 and 0 are typically very different, but it is unlikely to generalise well.

From Table 1, we can see that the eval model is able to learn a near-perfect classifier for the 2-class problem, but fares poorly on the full problem. This is because our preprocessing essentially limits the model to fixed, axis-aligned decision boundaries. The Setosa and Versicolor classes are clearly separable when plotted, so this works well, but the Versicolor and Virginica classes are not. We also note that because eval is essentially a lookup table, the label encoding has no effect on model accuracy. This suggests that eval may be useful as an 'output unit' in larger models.

Finally, we note that our MNIST model, while only classifying a subset of the full problem, returns fairly good results. To make an apples-to-oranges comparison, the approach of [4] gives a similar accuracy of around 99%.
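The forward pass of pseudoLinear, as we read the informal description above, can be sketched in a few lines (the released Haskell experiment code is the authoritative definition; the name and details here are our reading):

```python
def pseudo_linear(mask, x):
    """Sketch of the pseudoLinear forward pass: mask and x are bit-tuples of
    equal length (the 28*28 image, flattened). Return 1 when the masked input
    has fewer than 25% as many 1 bits as the mask itself."""
    masked = sum(m & b for m, b in zip(mask, x))   # pointwise multiply, count 1s
    ones = sum(mask)                               # 1 bits in the learned mask
    return 1 if 4 * masked < ones else 0           # masked < 0.25 * ones

# A toy 4-pixel 'image': the mask selects pixels 0 and 1.
mask = (1, 1, 0, 0)
assert pseudo_linear(mask, (1, 1, 1, 1)) == 0   # full overlap with the mask
assert pseudo_linear(mask, (0, 0, 1, 1)) == 1   # no overlap: classified as '1'
```

The learned mask plays the role of the 'average' 0 digit: inputs overlapping it heavily are classified as 0, inputs missing it as 1.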
In this paper, we saw how the categorical axiomatisation of the reverse derivative can be used to define a general 'gradient'-based algorithm for machine learning. Further, we showed how our algorithm can be used to learn parameters of a novel model class: boolean circuits. However, there are many opportunities for future work, which we broadly classify into two parts.
Empirical Work
The first task is to discover principles for building effective parametrised circuit models. While a number of compositional building blocks for neural network models have been discovered and studied, the same is not true for parametrised boolean circuits. One exciting challenge is to understand whether neural network architectures can be translated to the setting of circuits.

Furthermore, although our empirical results show our algorithm is certainly able to learn parameters from data, a new machine learning method would typically be expected to show results on the full MNIST problem, as well as other image processing benchmarks like CIFAR [11]. Therefore, some empirical study of circuit architectures with respect to these benchmarks will have to be undertaken.

As mentioned in Section 4.2.1, another important point is to enhance our implementation. First, we intend to clearly separate boolean circuits from their interpretations as boolean functions, so that other semantic interpretations are possible. Second, we plan to implement our procedure to turn a circuit into its safe equivalent, and study its complexity.
Theoretical Work
One avenue for theoretical work is to demonstrate the use of Reverse Derivative Ascent on categories other than boolean circuits. For example, when interpreted in the category whose objects are natural numbers and whose morphisms a → b are the smooth maps R^a → R^b, our method is similar to stochastic gradient descent (SGD) [16], with the following differences. Firstly, computing the model error (7) means computing the difference between the true label and the model prediction, but in Z_2 this coincides with addition because elements are self-inverse. Secondly, SGD has a notion of learning rate: a constant multiplied by the parameter change which prevents the algorithm 'overshooting' the optimal parameter value. Thirdly, we have no explicit loss function, which is important for discovering the conditions under which guarantees of convergence exist. By comparison, several different guarantees of convergence are known for variants of gradient descent as used in neural networks (see e.g. [16, p. 2]), although in some cases tweaks such as slowly decreasing the learning rate are required to make such guarantees and prevent oscillation around local minima.

Another setting of interest is boolean circuits with notions of feedback, something which has already received attention in the literature [18, 19]. Characterising these differences between settings may help to understand gradient methods in a more general light.

It will also be important to relate our work to existing category-theoretic views of gradient-based methods such as [8, 7]. In particular, we believe our method is a special case of [7]. Concretely, we note that for two parametrised boolean functions f : p + a → b and g : q + b → c, taking the reverse derivative of their 'parametrised composition' (id × f) ; g : q + p + a → c yields exactly the composite update-request morphism from their formalism.

Acknowledgements
We are grateful to the reviewers for their insightful comments and remarks. We would also like to thank David Sprunger and Liviu Pirvan for several helpful discussions.
References

[1] Christopher M. Bishop (2006): Pattern recognition and machine learning. Information Science and Statistics, Springer, New York, doi:10.978.038731/0732.
[2] Filippo Bonchi, Fabio Gadducci, Aleks Kissinger, Pawel Sobocinski & Fabio Zanasi (2016): Rewriting modulo symmetric monoidal structure. In: Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science (LICS '16), pp. 710–719, doi:10.1145/2933575.2935316.
[3] Robin Cockett, Geoffrey Cruttwell, Jonathan Gallagher, Jean-Simon Pacaud Lemay, Benjamin MacAdam, Gordon Plotkin & Dorette Pronk (2019): Reverse derivative categories. arXiv:1910.07065 [cs, math].
[4] Matthieu Courbariaux, Yoshua Bengio & Jean-Pierre David: BinaryConnect: Training Deep Neural Networks with binary weights during propagations. arXiv:1511.00363 [cs].
[5] Dheeru Dua & Casey Graff (2017): UCI Machine Learning Repository.
[6] Richard O. Duda, Peter E. Hart & David G. Stork (2000): Pattern Classification (2nd Edition). Wiley-Interscience, USA.
[7] Brendan Fong, David I. Spivak & Rémy Tuyéras (2019): Backprop as Functor: A compositional perspective on supervised learning. arXiv:1711.10455 [cs, math].
[8] Bruno Gavranović (2020): Learning Functors using Gradient Descent. Electronic Proceedings in Theoretical Computer Science.
[9] Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv:1602.02830 [cs].
[10] Nathan Jacobson (2012): Basic Algebra I: Second Edition. Courier Corporation.
[11] Alex Krizhevsky (2009): Learning Multiple Layers of Features from Tiny Images. Master's thesis, Department of Computer Science, University of Toronto.
[12] Yves Lafont (2003): Towards an algebraic theory of Boolean circuits. Journal of Pure and Applied Algebra.
[13] Yann LeCun, Léon Bottou, Yoshua Bengio & Patrick Haffner (1998): Gradient-Based Learning Applied to Document Recognition. In: Proceedings of the IEEE, pp. 2278–2324, doi:10.1109/5.726791.
[14] Rajat Raina, Anand Madhavan & Andrew Y. Ng (2009): Large-scale deep unsupervised learning using graphics processors. In: Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), ACM Press, Montreal, Quebec, Canada, pp. 1–8, doi:10.1145/1553374.1553486.
[15] A. Martín del Rey, G. Rodríguez Sánchez & A. de la Villa Cuenca (2012): On the boolean partial derivatives and their composition. Applied Mathematics Letters.
[16] Sebastian Ruder (2016): An overview of gradient descent optimization algorithms. arXiv:1609.04747 [cs].
[17] Peter Selinger (2010): A survey of graphical languages for monoidal categories. arXiv:0908.3347 [math].
[18] The differential calculus of causal functions. arXiv:1904.10611 [cs].
[19] David Sprunger & Shin-ya Katsumata (2019): Differentiable Causal Computations via Delayed Trace. In: Proceedings of the 34th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS 2019), IEEE, Vancouver, BC, Canada, pp. 1–12, doi:10.1109/LICS.2019.8785670.
[20] Erwei Wang, James J. Davis, Peter Y. K. Cheung & George A. Constantinides (2019): LUTNet: Rethinking Inference in FPGA Soft Logic. IEEE International Symposium on Field-Programmable Custom Computing Machines, doi:10.1109/FCCM.2019.00014.
[21] Fabio Zanasi (2017): Rewriting in Free Hypergraph Categories. Electronic Proceedings in Theoretical Computer Science.
[22] I. I. Zhegalkin: Sur le calcul des propositions dans la logique symbolique.

A Polynomial Interpretation of Boolean Circuits
In subsection 3.1 we used the interpretation of circuits with axioms A as morphisms of Poly_Z2. We now make this interpretation precise. We will discuss two results for these polynomial circuits: soundness and completeness of their interpretation ⟦·⟧_P, and a canonical form.

A.1 Soundness and Completeness
To show the existence of an isomorphism between PolyCirc and Poly_Z2, we will show that the hom-sets of both categories have the structure of a free module over a polynomial ring. For this, we must recall the definition of a free module:

Definition 26. Following [10, p. 170], let S be a ring. The free module S^b is the cartesian product of b copies of S, with elements ⟨p_1, p_2, ..., p_b⟩, addition defined pointwise,

⟨p_1, p_2, ..., p_b⟩ + ⟨q_1, q_2, ..., q_b⟩ = ⟨p_1 + q_1, p_2 + q_2, ..., p_b + q_b⟩,

a zero element 0 = ⟨0, 0, ..., 0⟩, and scalar multiplication s⟨p_1, p_2, ..., p_b⟩ = ⟨sp_1, sp_2, ..., sp_b⟩.

It is clear that the hom-sets of Poly_Z2 have this structure:

Proposition 27. Hom-sets Poly_Z2(a, b) have the structure of the free module S^b, with S the polynomial ring S = Z_2[x_1, ..., x_a].

Proof. Immediate from the definition of Poly_Z2.

Furthermore, hom-sets of PolyCirc also have this structure. This implies the existence of a module isomorphism between the hom-sets of PolyCirc and Poly_Z2, which is the basis for the functor ⟦·⟧_P. We begin, however, with some special-case examples.

Example 28. The hom-set PolyCirc(0, 1) has the structure of the ring Z_2, with every circuit c equal to one of the two constant circuits for 0 and 1.

Example 29. Each hom-set PolyCirc(a, 1) has the structure of the polynomial ring Z_2[x_1, ..., x_a], with indeterminates x_1 ... x_a given by the projections π_1 ... π_a.

Proposition 30. Hom-sets PolyCirc(a, b) have the structure of the free module Z_2[x_1, ..., x_a]^b.

Proof. For morphisms f, g : a → b, put addition f + g to be the circuit which copies the input, applies f and g in parallel, and adds the resulting outputs pointwise, and multiplication f ∗ g to be the analogous circuit with pointwise multiplication; the zero element 0 : a → b is the circuit which discards its input and outputs the constant 0 on each of its b wires (the original gives these definitions as string diagrams). One can verify graphically, using equations A, that the module axioms hold. If we define the family of b morphisms e_i : a → b, 0 < i ≤ b, which discard their input and output the constant 1 in the i-th position and 0 elsewhere, we can see that it forms a basis: each of the generators of Equation 1 can be constructed through addition and scalar multiplication of the morphisms e_i and 0.

We are now ready to give the proof of Proposition 9.

Proof of Proposition 9. By Proposition 27 and Proposition 30, there is a module isomorphism between Poly_Z2(a, b) and PolyCirc(a, b). Further, because the identity-on-objects functor ⟦·⟧_P is defined in terms of this bijection, it is a full and faithful functor, and so Poly_Z2 ≅ PolyCirc.

Here we take scalar multiplication of f : a → b by g : a → 1 to be f ∗ (g ∆∗), where ∆∗ is the unique 1 → b morphism formed by tensor and composition of the diagonal map and identity.
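The free-module structure above can also be sketched computationally. In the following hypothetical Python encoding (illustrative only; it is not part of the paper's development), a polynomial over Z_2 in two indeterminates is a set of exponent tuples, a morphism a → b is a b-tuple of such polynomials, and the module operations are defined pointwise:

```python
from itertools import product

# A polynomial over Z_2: a frozenset of monomials, each monomial a
# tuple of exponents. Coefficients live in Z_2, so a monomial is
# either present (coefficient 1) or absent (coefficient 0).

def p_add(p, q):
    # In Z_2, polynomial addition is symmetric difference of monomials.
    return p ^ q

def p_mul(p, q):
    # Multiply and reduce coefficients mod 2: a monomial survives
    # iff it is produced an odd number of times.
    out = set()
    for m, n in product(p, q):
        out ^= {tuple(i + j for i, j in zip(m, n))}
    return frozenset(out)

# A morphism a -> b is a b-tuple of polynomials; addition is
# pointwise, and scalar multiplication is by a polynomial a -> 1.
def m_add(f, g):
    return tuple(p_add(fi, gi) for fi, gi in zip(f, g))

def m_scale(s, f):
    return tuple(p_mul(s, fi) for fi in f)

# Example with a = 2 indeterminates: x1 = (1,0), x2 = (0,1).
x1 = frozenset({(1, 0)})
x2 = frozenset({(0, 1)})
f = (x1, p_add(x1, x2))  # morphism 2 -> 2: <x1, x1 + x2>
print(m_add(f, f))       # f + f = <0, 0>: elements of Z_2 are self-inverse
print(m_scale(x2, f))    # <x1*x2, x1*x2 + x2^2>
```

Note how the squared term x2^2 appears in the last line: it is exactly such exponents that the canonical form of the next section isolates.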
A.2 Canonical Form
We now give a canonical form for morphisms of PolyCirc. This canonical form essentially isolates all occurrences of squaring, governed by the axiom (given diagrammatically in the original) equating a wire multiplied with itself to a single wire, which we can then use to show that all boolean circuits have a safe equivalent in the proof of Lemma 19.

Definition 31. We say a circuit f : a → b is in canonical form if it can be written as ⟨C(p_1), C(p_2), ..., C(p_b)⟩, where ⟨p_1, ..., p_b⟩ = ⟦f⟧_P and C(−) is defined on polynomials as follows. Denote by pow(n) the circuit computing the n-th power of its input, defined inductively: pow(0) discards its input and outputs the constant 1, and pow(n) copies its input and multiplies one copy by pow(n − 1). Let x_i^k be an arbitrary indeterminate raised to a power k ∈ N. We define C(x_i^k) to be the circuit which projects out the i-th of its a inputs and applies pow(k), and we note that this is consistent with ⟦·⟧_P in the sense that the k-fold product π_i ∗ ... ∗ π_i =_A C(x_i^k). We now define C(·) inductively on polynomials. Either p is a constant, in which case C(0) and C(1) are the corresponding constant circuits, or p = m + p′, in which case C(p) = C(m) + C(p′), where m is a monomial and p′ a polynomial. A monomial is a product of distinct indeterminates raised to powers x_i^{k_i}, x_j^{k_j}, ..., and its canonical form is therefore C(x_i^{k_i}) ∗ C(x_j^{k_j}) ∗ ....

Remark 32.
Intuitively, the canonical form can be pictured as a sum of products of projections π_i, each followed by a pow morphism (the original renders this as a string diagram).

Example 33.
Continuing with our running example, we note that the interpretation of eval : 2 + 1 → 1 is x_1 + x_1x_3 + x_2x_3 (where the parameters are x_1 and x_2), and so its safe canonical form can be written as π_1 + (π_1 ∗ π_3) + (π_2 ∗ π_3). To see that it is equivalent to (3), we can apply the counit law (copying a wire and then discarding one copy is the identity), and then use distributivity.

We are now ready to show the proof of Lemma 19.

Proof of Lemma 19.
We show that for each circuit c there is a safe circuit d such that c and d are equal modulo equations (2). We begin by noting that, with equations (2), we have pow(k) = id for k ≥ 1, which can be seen by repeatedly applying the squaring axiom. Using this, we can see that the canonical form of Definition 31 can be rewritten so that each pow(k) morphism becomes the identity. Finally, because each product in the canonical form is of distinct indeterminates, the rewritten canonical form does not contain any more squared terms, and so is safe.

To conclude, we show how the condition of safety essentially allows only those circuits with interpretations as Zhegalkin polynomials [22], as used in the proof of Lemma 20.

Lemma 34.
If a circuit f is safe, then the polynomial ⟦f⟧_P has only exponents in {0, 1}.

Proof.