The differential calculus of causal functions
David Sprunger
National Institute of Informatics, Tokyo [email protected]
Bart Jacobs
Radboud University, Nijmegen [email protected]
Abstract
Causal functions of sequences occur throughout computer science, from theory to hardware to machine learning. Mealy machines, synchronous digital circuits, signal flow graphs, and recurrent neural networks all have behaviour that can be described by causal functions. In this work, we examine a differential calculus of causal functions which includes many of the familiar properties of standard multivariable differential calculus. These causal functions operate on infinite sequences, but this work gives a different notion of an infinite-dimensional derivative than either the Fréchet or Gateaux derivative used in functional analysis. In addition to showing many standard properties of differentiation, we show causal differentiation obeys a unique recurrence rule. We use this recurrence rule to compute the derivative of a simple recurrent neural network called an Elman network by hand and describe how the computed derivative can be used to train the network.
Mathematics of computing → Differential calculus; Computing methodologies → Neural networks
Keywords and phrases sequences, causal functions, derivatives, recurrent neural networks, Elman networks
Funding
David Sprunger: This author is supported by ERATO HASUO Metamathematics for Systems Design Project (No. JPMJER1603), JST.
Many computations on infinite data streams operate in a causal manner, meaning their k-th output depends only on the first k inputs. Mealy machines, clocked digital circuits, signal flow graphs, recurrent neural networks, and discrete-time feedback loops in control theory are a few examples of systems performing such computations. When designing these kinds of systems to fit some specification, a common issue is figuring out how adjusting one part of the system will affect the behaviour of the whole. If the system has some real-valued semantics, as is especially common in machine learning or control theory, the derivative of these semantics with respect to a quantity of interest, say an internal parameter, gives a locally-valid first-order estimate of the system-wide effect of a small change to that quantity. Unfortunately, since the most natural semantics for infinite data streams is in an infinite-dimensional vector space, it is not practical to use the resulting infinite-dimensional derivative.

To get around this, one tactic is to replace the infinite system by a finite system obtained by an approximation or heuristic and take derivatives of the replacement system. This can be seen, for example, in backpropagation through time [13], which trains a recurrent neural network by first unrolling the feedback loop the appropriate number of times and then applying traditional backpropagation to the unrolled network.

This tactic has the advantage that we can take derivatives in a familiar (finite-dimensional) setting, but the disadvantage that it is not clear what properties survive the approximation process from the unfamiliar (infinite-dimensional) setting. For example, it is not immediately clear whether backpropagation through time obeys the usual rules of differential calculus, like a sum or chain rule, nor is this issue confronted in the literature, to the best of our knowledge. Thus, useful compositional properties of differentiation are ignored in exchange for a comfortable setting in which to do calculus.

In this work, we take advantage of the fact that causal functions between sequences are already essentially limits of finite-dimensional functions and therefore have derivatives which can also be expressed as essentially limits of the derivatives of these finite-dimensional functions. This leads us to the basics of a differential calculus of causal functions. Unlike with arbitrary functions between sequences, this limiting process allows us to avoid the use of normed vector spaces, and so we believe our notion of derivative is distinct from Fréchet derivatives.

Outline.
In section 2, we define causal functions and recall several mechanisms by which these functions on infinite data can be defined. In particular, we recall a coalgebraic scheme finding causal functions as the behaviour of Mealy machines (Proposition 6), and give a definitional scheme in terms of so-called finite approximants (Definition 8). In section 3, we define differentiability and derivatives of causal functions on real-vector sequences (Definition 12) and compute several examples. In section 4, we obtain several rules for our differential causal calculus analogous to those of multivariable calculus, including a chain rule, parallel rule, sum rule, product rule, reciprocal rule, and quotient rule (Propositions 18, 19, 22, 23, 26, and 27, respectively). We additionally find a new rule without a traditional analogue, which we call the recurrence rule (Theorem 28). Finally, in section 5, we apply this calculus to find derivatives of a simple kind of recurrent neural network called an Elman network [6] by hand. We also demonstrate how to use the derivative of the network with respect to a parameter to guide updates of that parameter to drive the network towards a desired behaviour.

A sequence or stream in a set A is a countably infinite list of values from A, which we also think of as a function from the natural numbers ω to A. If σ is a stream in A, we denote its value at k ∈ ω by σ_k. We may also think of a stream as a listing of its image, like σ = (σ_0, σ_1, . . .). The set of all sequences in A is denoted A^ω.

Given a ∈ A and σ ∈ A^ω, we can form a new sequence by prepending a to σ. The sequence a : σ is defined by (a : σ)_0 = a and (a : σ)_{k+1} = σ_k. This operation can be extended to prepend arbitrary finite-length words w ∈ A* by the obvious recursion. Conversely, we can destruct a given sequence into an element and a second sequence with functions hd : A^ω → A and tl : A^ω → A^ω defined by hd(σ) = σ_0 and tl(σ)_k = σ_{k+1}.

Definition 1 (slicing). If σ ∈ A^ω is a stream and j ≤ k are natural numbers, the slicing σ_{j:k} is the list (σ_j, σ_{j+1}, . . . , σ_k) ∈ A^{k−j+1}.

Definition 2 (causal function). A function f : A^ω → B^ω is causal when σ_{0:k} = τ_{0:k} implies f(σ)_k = f(τ)_k for all σ, τ ∈ A^ω and k ∈ ω.

A standard coalgebraic approach to causal functions is to view them as the behaviour of Mealy machines.
Definition 3 (Mealy functor). Given two sets A, B, the functor M_{A,B} : Set → Set is defined by M_{A,B}(X) = (B × X)^A on objects and M_{A,B}(f) : φ ↦ (id_B × f) ∘ φ on morphisms.

M_{A,B}-coalgebras are Mealy machines with input alphabet A and output alphabet B, and possibly an infinite state space. The set of causal functions A^ω → B^ω carries a final M_{A,B}-coalgebra using the following operations, originally observed by Rutten in [10].
Definition 4. The Mealy output of a causal function f : A^ω → B^ω is the function hd_f : A → B defined by (hd_f)(a) = f(a : σ)_0 for any σ ∈ A^ω.

Definition 5. Given a ∈ A and a causal function f : A^ω → B^ω, the Mealy (a-)derivative of f is the causal function ∂_a f : A^ω → B^ω defined by (∂_a f)(σ) = tl(f(a : σ)).

Note hd_f is well-defined even though σ may be freely chosen, due to the causality of f.

Proposition 6 (Proposition 2.2, [10]). The set of causal functions A^ω → B^ω carries an M_{A,B}-coalgebra via f ↦ λa. ((hd_f)(a), ∂_a f), which is a final M_{A,B}-coalgebra.
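Operationally, Proposition 6 says that running a Mealy machine step by step from a given state yields a causal function on streams. The following is a minimal Python sketch of this reading (ours, not the paper's): streams are modelled as generators, and the helper name causal_function is our own.

    from itertools import islice

    def causal_function(step, state):
        """The causal function A^omega -> B^omega behaving like `state`,
        where step(s, a) = (output, next_state) packages the coalgebra
        structure X -> (B x X)^A of Definition 3."""
        def f(sigma):
            s = state
            for a in sigma:
                b, s = step(s, a)
                yield b
        return f

    # The one-state machine for pointwise sum (see Example 7 below):
    # hd(s)(a1, a2) = a1 + a2 and every Mealy derivative stays at s.
    plus = causal_function(lambda s, pair: (pair[0] + pair[1], s), state=())

    print(list(islice(plus(zip([0, 1, 2, 3, 4], [0, 2, 4, 6, 8])), 5)))
    # [0, 3, 6, 9, 12]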
Hence, a coalgebraic methodology for defining causal functions is to define a Mealy machine and take the image of a particular state in the final coalgebra. By constructing the Mealy machine cleverly, one can ensure the resulting causal function has some desired properties. This is the core idea behind the “syntactic method” using GSOS definitions in [8]. In that work, a Mealy machine of terms is built in such a way that all causal functions (A^k)^ω → A^ω can be recovered.

Example 7. Suppose (A, +_A, ·_A, 0_A) is a vector space over R. This vector space structure can be extended to A^ω componentwise in the obvious way. To illustrate the coalgebraic method, we characterise this structure with coalgebraic definitions.

To define sequence vector sum coalgebraically, we define a Mealy machine s : 1 → (A × 1)^{A×A} with one state, satisfying hd(s)(a_1, a_2) = a_1 +_A a_2 and ∂_{(a_1,a_2)}(s) = s. Then +_{A^ω} : (A × A)^ω → A^ω is defined to be the image of s in the final M_{A×A,A}-coalgebra.

Note that technically the vector sum in A^ω should be a function of type A^ω × A^ω → A^ω, so we are tacitly using the isomorphism between (A × A)^ω and A^ω × A^ω. We will be using similar recastings of sequences in the sequel without bringing up this point again.

The zero vector can similarly be defined by a single-state Mealy machine 1 → (A × 1)^1 with input alphabet 1 and output alphabet A, satisfying hd(s)(∗) = 0_A and ∂_∗(s) = s. The zero vector of A^ω is the global element picked out by the image of s.

Finally, scalar multiplication can be defined with a Mealy machine R → (A × R)^A with states r ∈ R, such that hd(r)(a) = r ·_A a and ∂_a r = r. Then r ·_{A^ω} σ ≜ [[r]](σ), where [[r]] is the image of r in the final M_{A,A}-coalgebra.

We immediately begin dropping the subscripts from +_{A^ω} and ·_{A^ω} when the relevant vector space can be inferred from context.

Another approach to causal functions is to consider them as a limit of finite approximations, replacing the single function on infinite data with infinitely many functions on finite data. There are (at least) two approaches with this general style, which we briefly describe next.
Definition 8. Let f : A^ω → B^ω be a causal function and σ ∈ A^ω.
The pointwise approximation of f is the sequence of functions U_k(f) : A^{k+1} → B defined by U_k(f)(w) ≜ f(w : σ)_k.
The stringwise approximation of f is the sequence of functions T_k(f) : A^{k+1} → B^{k+1} defined by T_k(f)(w) ≜ f(w : σ)_{0:k}.

Again, these are well-defined despite σ being arbitrary, due to f's causality. We chose the letters U and T deliberately: sometimes the pointwise approximants of a causal function are called its Unrollings, and the stringwise approximants are called its Truncations.

Conversely, given an arbitrary collection of functions u_k : A^{k+1} → B for k ∈ ω, there is a unique causal function whose pointwise approximation is the sequence u_k. Thus we have the following bijective correspondence:

    causal functions A^ω → B^ω
    ==========================================     (1)
    functions A^{k+1} → B for each k ∈ ω

We can nearly do the same for stringwise approximations, but the sequence t_k : A^{k+1} → B^{k+1} must satisfy t_k(w) = t_{k+1}(wa)_{0:k} for all w ∈ A^{k+1} and a ∈ A.

The interchangeability between a causal function and its approximants is a crucial theme in this work. Since a function's pointwise and stringwise approximants are inter-obtainable, we will sometimes refer to a causal function's “finite approximants”, by which we mean either family of approximants.

Finite approximants are a very flexible way of defining causal functions, but causal functions may have a more compact representation when they conform to a regular pattern. Recurrence is one such pattern, where a causal function is defined by repeatedly using an ordinary function g : A × B → B and an initial value i ∈ B to obtain rec_i(g) : A^ω → B^ω via:

    [rec_i(g)(σ)]_k = g(σ_0, i)                        if k = 0
    [rec_i(g)(σ)]_k = g(σ_k, [rec_i(g)(σ)]_{k−1})      if k > 0

In approximant form, U_k(rec_i(g))(σ_{0:k}) = g(σ_k, g(σ_{k−1}, . . . g(σ_1, g(σ_0, i)) . . .)). Note these pointwise approximants satisfy the recurrence relation U_k(rec_i(g))(σ_{0:k}) = g(σ_k, U_{k−1}(rec_i(g))(σ_{0:k−1})), as the following sketch illustrates.
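A minimal Python sketch (ours, not the paper's) of the recurrence scheme and its pointwise approximants; the helper names rec and U are our own, and streams are again modelled as generators.

    def rec(g, i):
        """rec_i(g) : A^omega -> B^omega, with streams as generators."""
        def f(sigma):
            acc = i
            for a in sigma:
                acc = g(a, acc)  # [rec_i(g)(sigma)]_k = g(sigma_k, previous value)
                yield acc
        return f

    def U(k, f, prefix):
        """Pointwise approximant U_k(f): by causality, f(sigma)_k depends only
        on sigma_{0:k}, so a length-(k+1) prefix of the input suffices."""
        out = None
        for out, _ in zip(f(iter(prefix)), range(k + 1)):
            pass
        return out

    from operator import mul
    # The running product of Example 9 below is rec_1(mul):
    print(U(3, rec(mul, 1.0), [2.0, 3.0, 4.0, 5.0]))  # 2*3*4*5 = 120.0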
Example 9. The unary running product function Π : R^ω → R^ω can be defined by a recurrence relation: Π(σ) = τ ⇔ τ_{k+1} = σ_{k+1} · τ_k after τ_0 = σ_0. Here g is multiplication of reals and i = 1. In approximant form, [Π(σ)]_k = Π_{i=0}^{k} σ_i.

A special case of recurrent causal functions occurs when there is an h : A → B such that g(a, b) = h(a) for all (a, b) ∈ A × B. In this case, [rec_i(g)(σ)]_k = h(σ_k) and in particular does not depend on the initial value i or any entry σ_j for j < k. We denote rec_i(g) by map(h) in this special case, since it maps h componentwise across the input sequence.

Our goal in this work is to develop a basic differential calculus for causal functions. Thus we will focus our attention on causal functions between real-vector sequences (R^n)^ω for n ∈ ω, specializing from causal functions on general sets from the last section. We will draw many of our illustrating examples for derivatives from Rutten's stream calculus [9], which describes many such causal functions between real-number streams. More importantly, [9] establishes many useful algebraic properties of these functions rigorously via coalgebraic methods.

There are many different approaches one might consider to defining differentiable causal functions. One might be to take the original coalgebraic definition and replace the underlying category (Set) with a category of finite-dimensional Cartesian spaces and differentiable (or smooth) maps. Unfortunately, the space of differentiable functions between finite-dimensional spaces is not finite-dimensional, so the exponential needed to define the M_{A,B} functor in this category does not exist.

Another approach is to think of causal functions as functions between infinite-dimensional vector spaces and take standard notions from analysis, like Fréchet derivatives, and apply them in this context. However, norms on sequence spaces usually impose a finiteness condition, like boundedness or square-summability, on the domains and ranges of sequence functions. These restrictions are compatible with many causal functions like the pointwise sum function above, but other causal functions like the running product function become significantly less interesting.

Our approach to differentiating causal functions is to consider a causal function differentiable when all of its finite approximants are differentiable via the correspondence (1). We will develop this idea rigorously in section 3.2, but first we need to know a bit about linear causal functions.
Stated abstractly, the derivative of a function at a point is a linear map which provides an approximate change in the output of a function given an input representing a small change in the input to that function [11]. Since linear functions R → R are in bijective correspondence with their slopes, in single-variable calculus the derivative of a function at a point is typically instead given as a single real number. In multivariable calculus, derivatives are usually represented by (Jacobian) matrices, since matrices represent linear maps between finite-dimensional spaces. Linear functions between infinite-dimensional vector spaces do not have a similarly compact, computationally-useful representation, but we can still define derivatives of (causal) functions at points to be linear (causal) maps.

We described the natural vector space structure of (R^n)^ω in Example 7. A linear causal function is a causal function which is also linear with respect to this vector space structure.

Definition 10. A causal function f : (R^n)^ω → (R^m)^ω is linear when f(r · σ) = r · f(σ) and f(σ + τ) = f(σ) + f(τ) for all r ∈ R and σ, τ ∈ (R^n)^ω.

Lemma 11.
Let f : (R^n)^ω → (R^m)^ω be a causal function. The following are equivalent: (1) f is linear; (2) U_k(f) : (R^n)^{k+1} → R^m is linear for all k ∈ ω; (3) T_k(f) : (R^n)^{k+1} → (R^m)^{k+1} is linear for all k ∈ ω.

This refines the correspondence (1), allowing us to define a linear causal function by naming linear finite approximants.

Since linear functions between finite-dimensional vector spaces can be represented by matrices, we can think of a linear causal function as a limit of the matrices representing its finite approximants. This view results in row-finite infinite matrices, such as:

    [ A_00   0     0     · · · ]
    [ A_10   A_11  0     · · · ]
    [ A_20   A_21  A_22  · · · ]
    [  ⋮      ⋮     ⋮     ⋱   ]

where the A_ij are m-row, n-column blocks and every block with j > i is 0. These are related to the matrices for the approximants of the causal function as follows. The matrix [A_k0 A_k1 · · · A_kk] represents U_k(f), and the matrix

    [ A_00   0     · · ·  0    ]
    [ A_10   A_11  · · ·  0    ]
    [  ⋮      ⋮     ⋱     ⋮   ]
    [ A_k0   A_k1  · · ·  A_kk ]

represents T_k(f). The compatibility conditions on the functions T_k(f) ensure that the matrix for T_k(f) can be found in the upper left corner of the matrix for T_{k+1}(f). Note also that the lower triangular nature of the matrices for T_k(f) is a consequence of causality: the first m outputs can depend only on the first n inputs, so the last entries in the top row must all be 0, and so on.

Unlike finite-dimensional matrices, we do not think these infinite matrices are a computationally useful representation, but they are conceptually useful to get an idea of how causal linear functions can be considered the limit of their linear truncations.

As we have mentioned, we will use the derivatives of the approximants of a causal function to define the derivative of the causal function itself. We denote the m-row, n-column Jacobian matrix of a differentiable function ϕ : R^n → R^m at x ∈ R^n by Jϕ(x). Recall this matrix is

    [ ∂ϕ_1/∂x_1 (x)   ∂ϕ_1/∂x_2 (x)   · · ·   ∂ϕ_1/∂x_n (x) ]
    [ ∂ϕ_2/∂x_1 (x)   ∂ϕ_2/∂x_2 (x)   · · ·   ∂ϕ_2/∂x_n (x) ]
    [       ⋮               ⋮           ⋱           ⋮       ]
    [ ∂ϕ_m/∂x_1 (x)   ∂ϕ_m/∂x_2 (x)   · · ·   ∂ϕ_m/∂x_n (x) ]

where ϕ_i : R^n → R and ϕ = ⟨ϕ_1, . . . , ϕ_m⟩. We will also be glossing over the distinction between a matrix and the linear function it represents, using Jϕ(x) to mean either when convenient.

Definition 12.
A causal function f : (R^n)^ω → (R^m)^ω is differentiable at σ ∈ (R^n)^ω if all of its finite approximants U_k(f) : (R^n)^{k+1} → R^m are differentiable at σ_{0:k} for all k ∈ ω. If f is differentiable at σ, the derivative of f at σ is the unique linear causal function D∗f(σ) : (R^n)^ω → (R^m)^ω satisfying U_k(D∗f(σ)) = J(U_k(f))(σ_{0:k}).

In this definition we are using the correspondence (1), refined in Lemma 11, which allows us to define a causal (linear) function by specifying its (linear) finite approximants. We could equally well have used stringwise approximants in this definition rather than pointwise approximants, as the following lemma states.
Lemma 13. The causal function f is differentiable at σ if and only if each of the T_k(f) is differentiable at σ_{0:k} for all k ∈ ω. In this case, D∗f(σ) satisfies T_k(D∗f(σ)) = J(T_k(f))(σ_{0:k}).

Though we have mentioned this is not particularly useful computationally, the derivative of a differentiable function at a point has a representation as a row-finite infinite matrix.
Lemma 14. If f is differentiable at σ, each U_k(f) : (R^n)^{k+1} → R^m has an m-row, n(k+1)-column Jacobian matrix representing its derivative at σ_{0:k}. Let A_ki be the m-row, n-column blocks of this Jacobian, so that J(U_k(f))(σ_{0:k}) = [A_k0 A_k1 · · · A_kk]. The derivative of f at σ is the linear causal function represented by the row-finite infinite matrix

    D∗f(σ) = [ A_00   0     0     · · · ]
             [ A_10   A_11  0     · · · ]
             [ A_20   A_21  A_22  · · · ]
             [  ⋮      ⋮     ⋮     ⋱   ]

Note that this linear causal function can be evaluated at a sequence ∆σ ∈ (R^n)^ω by multiplying the infinite matrix by ∆σ, considered as an infinite column vector.

Next, we use this definition of derivative to find the causal derivatives of some basic functions from Rutten's stream calculus.
Example 15. We show the pointwise sum stream function + : (R^2)^ω → R^ω is its own derivative at every point (σ, τ) ∈ (R^2)^ω. Note U_k(+)(σ_0, τ_0, . . . , σ_k, τ_k) = σ_k + τ_k, so J(U_k(+))(σ_0, τ_0, . . . , σ_k, τ_k) = [0 0 · · · 0 1 1]. This is the matrix representation of U_k(+) itself, so (D∗+)(σ, τ) = + or, in other notation, (D∗+)(σ, τ)(∆σ, ∆τ) = ∆σ + ∆τ for any σ, τ, ∆σ, ∆τ ∈ R^ω.

This argument can be repeated for all pointwise sum functions + : (R^n × R^n)^ω → (R^n)^ω, replacing the “1” entries in the Jacobian above with I_n blocks.

Since the derivative of any constant x : 1 → R^n is 0_{R^n} : 1 → R^n, the derivative of any constant sequence must necessarily be the zero sequence. In stream calculus, there are two important constant sequences defined corecursively: [r], defined by hd([r])(∗) = r and ∂_∗([r]) = [0] for all r ∈ R, and X, defined by hd(X)(∗) = 0 and ∂_∗(X) = [1]. Written out as sequences, [r] = (r, 0, 0, 0, . . .) and X = (0, 1, 0, 0, . . .).

Example 16. D∗[r] = D∗X = [0].

Next, we consider the Cauchy sequence product. Under the correspondence between sequences σ ∈ R^ω and formal power series Σ_i σ_i x^i ∈ R[[x]], the Cauchy product is the sequence operation corresponding to the (Cauchy) product of formal power series. This operation is coalgebraically characterized in Rutten [9] as the unique function × : (R^2)^ω → R^ω satisfying hd(×)(s_0, t_0) = s_0 · t_0 and (∂_{(s_0,t_0)}×)(σ, τ) = tl(σ) × τ + [s_0] × tl(τ). For our purposes, the explicit definition is more useful: U_k(×)(σ_{0:k}, τ_{0:k}) = Σ_{i=0}^{k} σ_i · τ_{k−i}.

Example 17.
We compute the derivative of the Cauchy product:

    J(U_k(×))(σ_0, τ_0, . . . , σ_k, τ_k) = [ τ_k  σ_k  τ_{k−1}  σ_{k−1}  · · ·  τ_0  σ_0 ]

Notice that multiplying this matrix by (an initial segment of) a small-change sequence (∆σ_0, ∆τ_0, . . . , ∆σ_k, ∆τ_k) yields

    J(U_k(×))(σ_0, τ_0, . . . , σ_k, τ_k)(∆σ_0, ∆τ_0, . . . , ∆σ_k, ∆τ_k) = Σ_{i=0}^{k} ∆σ_i · τ_{k−i} + Σ_{i=0}^{k} σ_i · ∆τ_{k−i}

Therefore, (D∗×(σ, τ))(∆σ, ∆τ) = ∆σ × τ + σ × ∆τ.
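As a quick numeric sanity check of this formula (our own Python sketch, not part of the paper), we can compare the predicted directional derivative of U_k(×) against a central finite difference; since U_k(×) is bilinear, the two agree up to rounding error.

    def cauchy_U(k, sig, tau):
        """U_k of the Cauchy product: sum of sig_i * tau_{k-i} for i <= k."""
        return sum(sig[i] * tau[k - i] for i in range(k + 1))

    def fd_directional(k, sig, tau, dsig, dtau, h=1e-6):
        """Central finite-difference derivative of U_k at (sig, tau)
        in the direction (dsig, dtau)."""
        up = cauchy_U(k, [s + h * d for s, d in zip(sig, dsig)],
                         [t + h * d for t, d in zip(tau, dtau)])
        dn = cauchy_U(k, [s - h * d for s, d in zip(sig, dsig)],
                         [t - h * d for t, d in zip(tau, dtau)])
        return (up - dn) / (2 * h)

    sig, tau = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
    dsig, dtau = [0.1, 0.2, 0.3], [0.4, 0.5, 0.6]
    # Predicted by the formula above, using bilinearity of the product:
    predicted = cauchy_U(2, dsig, tau) + cauchy_U(2, sig, dtau)  # 5.6
    print(predicted, fd_directional(2, sig, tau, dsig, dtau))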
Another sequence product considered in the stream calculus is the Hadamard product, also called the pointwise product. Defined coalgebraically, the Hadamard product is the unique binary operation satisfying hd(⊙)(s_0, t_0) = s_0 · t_0 and (∂_{(s_0,t_0)}⊙)(σ, τ) = tl(σ) ⊙ tl(τ). This has a similar derivative to the Cauchy product: D∗⊙(σ, τ)(∆σ, ∆τ) = ∆σ ⊙ τ + σ ⊙ ∆τ.

Note that these derivatives make sense without any reference to properties of the sequences used. We are not aware of a way to realize this derivative as an instance of a notion of derivative known in analysis. The most obvious notion to try is a Fréchet derivative induced by a norm on the space of sequences. However, all norms we know on these spaces, including ℓ^p-norms and γ-geometric norms ‖σ‖ = Σ_i |σ_i| · γ^i for γ ∈ (0, 1), are defined only on restricted subspaces of sequences, ruling out many of the causal functions we consider.

Just as it is impractical to compute all derivatives from the definition in undergraduate calculus, it is also impractical to compute causal derivatives directly from the definition. To ease this burden, one typically proves various “rules” of differentiation which provide compositional recipes for finding derivatives. That is our task in this section.

There are at least two good reasons to hope a priori that the standard rules of differentiation might hold for causal derivatives. First, causal derivatives were defined to agree with standard derivatives in their finite approximants. Since these approximant derivatives satisfy these rules, we might hope that they hold over the limiting process. Second, smooth causal functions form a Cartesian differential category, as was shown in [12]. The theory of Cartesian differential categories includes as axioms or theorems abstract versions of the chain rule, sum rule, etc. However, neither of these reasons is immediately sufficient, so we must provide independent justification.
We begin by stating some rules familiar from undergraduate calculus.
Proposition 18 (causal chain rule). Suppose f : (R^n)^ω → (R^m)^ω and g : (R^m)^ω → (R^ℓ)^ω are causal functions. Suppose further f is differentiable at σ ∈ (R^n)^ω and g is differentiable at f(σ). Then h = g ∘ f is differentiable at σ and its derivative is D∗g(f(σ)) ∘ D∗f(σ).

Proof. Let f_k = T_k(f), g_k = T_k(g), and h_k = T_k(h). We know h_k = g_k ∘ f_k. We show the stringwise approximants of D∗(g ∘ f)(σ) and D∗g(f(σ)) ∘ D∗f(σ) match:

    T_k(D∗(g ∘ f)(σ)) = J(h_k)(σ_{0:k}) = J(g_k ∘ f_k)(σ_{0:k})
                      = J(g_k)(f_k(σ_{0:k})) × J(f_k)(σ_{0:k})                   (∗)
                      = J(g_k)(f(σ)_{0:k}) × J(f_k)(σ_{0:k})
                      = T_k(D∗g(f(σ))) ∘ T_k(D∗f(σ)) = T_k(D∗g(f(σ)) ∘ D∗f(σ))

where the starred line is by the classical chain rule. ◀

Since we have already overloaded × for both the Cauchy stream product and the matrix product, we use ∥ for the parallel composition of functions, where the parallel composition of φ : R^n → R^m and ψ : R^p → R^q is φ ∥ ψ : R^{n+p} → R^{m+q}, defined by (φ ∥ ψ)(x, y) = (φ(x), ψ(y)) for x ∈ R^n and y ∈ R^p. We do not know of a standard name for this rule, but in multivariable calculus there is a rule J(φ ∥ ψ)(x, y) = Jφ(x) ∥ Jψ(y), which we shall call the parallel rule. There is a similar rule for causal derivatives, which we describe next.

Proposition 19 (causal parallel rule). Suppose f : (R^n)^ω → (R^m)^ω and h : (R^p)^ω → (R^q)^ω are causal functions, and that they are differentiable at σ ∈ (R^n)^ω and τ ∈ (R^p)^ω, respectively. Then f ∥ h : (R^{n+p})^ω → (R^{m+q})^ω is differentiable at (σ, τ) ∈ (R^{n+p})^ω and its derivative is D∗f(σ) ∥ D∗h(τ).

Proof.
The stringwise approximants of D∗(f ∥ h)(σ, τ) and D∗f(σ) ∥ D∗h(τ) match:

    T_k(D∗(f ∥ h)(σ, τ)) = J(T_k(f ∥ h))(σ_{0:k}, τ_{0:k}) = J(T_k(f) ∥ T_k(h))(σ_{0:k}, τ_{0:k})
                         = J(T_k(f))(σ_{0:k}) ∥ J(T_k(h))(τ_{0:k})               (∗)
                         = T_k(D∗f(σ)) ∥ T_k(D∗h(τ)) = T_k(D∗f(σ) ∥ D∗h(τ))

where the starred line is by the classical parallel rule. ◀

Proposition 20 (causal linearity). If f : (R^n)^ω → (R^m)^ω is a linear causal function, it is differentiable at every σ ∈ (R^n)^ω and its derivative is D∗f(σ) = f.

These three results are the fundamental properties of causal differentiation we will be using. Many other standard rules are consequences of these. For example, we can derive a sum rule from these properties.
Definition 21. The sum of two causal maps f, g : (R^n)^ω → (R^m)^ω is defined to be f + g ≜ + ∘ (f ∥ g) ∘ ∆_{(R^n)^ω}, where ∆_{(R^n)^ω} is the sequence duplication map.

Proposition 22 (causal sum rule). If f and g as in Definition 21 are both differentiable at σ, so is their sum, and its derivative is D∗f(σ) + D∗g(σ).

Proof. Using the properties above, we find

    D∗(f + g)(σ) = D∗(+ ∘ (f ∥ g) ∘ ∆_{(R^n)^ω})(σ)                                   (sum of maps def'n)
                 = D∗(+)(((f ∥ g) ∘ ∆_{(R^n)^ω})(σ)) ∘ D∗((f ∥ g) ∘ ∆_{(R^n)^ω})(σ)    (causal chain rule)
                 = + ∘ D∗((f ∥ g) ∘ ∆_{(R^n)^ω})(σ)                                    (linearity of +)
                 = + ∘ D∗(f ∥ g)(∆_{(R^n)^ω}(σ)) ∘ D∗(∆_{(R^n)^ω})(σ)                  (causal chain rule)
                 = + ∘ D∗(f ∥ g)(σ, σ) ∘ ∆_{(R^n)^ω}                                   (def'n & linearity of ∆)
                 = + ∘ (D∗f(σ) ∥ D∗g(σ)) ∘ ∆_{(R^n)^ω}                                 (causal parallel rule)
                 = D∗f(σ) + D∗g(σ)                                                     (sum of maps def'n)

as desired. ◀

For functions f, g : R^ω → R^ω, we can define their Cauchy and Hadamard products f × g and f ⊙ g with the pattern of Definition 21 and prove two product rules using the derivatives of the binary operations × and ⊙ we computed earlier.

Proposition 23 (causal product rules). If f, g : R^ω → R^ω are causal functions differentiable at σ, so are their Cauchy and Hadamard products, and their derivatives are

    D∗(f × g)(σ)(∆σ) = D∗f(σ)(∆σ) × g(σ) + f(σ) × D∗g(σ)(∆σ)
    D∗(f ⊙ g)(σ)(∆σ) = D∗f(σ)(∆σ) ⊙ g(σ) + f(σ) ⊙ D∗g(σ)(∆σ)

A typical point of confusion in undergraduate calculus is the role of constants: sometimes they are treated like elements of the underlying vector space and sometimes like functions which always return that vector. In our calculus, a constant can similarly sometimes mean a fixed sequence picked out by c : 1 → (R^n)^ω, or the composition of this map after a discarding map !_{(R^n)^ω} : (R^n)^ω → 1. We have described the derivative of a constant element in Example 16; now we treat constant maps.
Proposition 24 (causal constant rule). The derivative of !_{(R^n)^ω} : (R^n)^ω → 1 is !_{(R^n)^ω}. If c : (R^n)^ω → (R^m)^ω is a constant map, its derivative at every point is the zero map, D∗c(σ)(∆σ) = 0_{(R^m)^ω}.

Proposition 25 (causal constant multiple rule). If c : R^ω → R^ω is a constant function and f : R^ω → R^ω is any other causal function differentiable at σ, so is c × f, and its derivative is c × D∗f(σ).

Proof. Combine the causal product rule and the causal constant rule. ◀
We have seen the standard rules presented in the last section are useful as computational shortcuts, just as they are in undergraduate calculus. In the causal calculus they turn out to be perhaps even more crucial, since some differentiable causal functions do not have simple closed forms, so trying to find their derivative from the definition is extremely difficult.

The stream inverse [9] is the first partial causal function we will consider. This operation is defined on σ ∈ R^ω such that σ_0 ≠ 0 with the unbounded-order recurrence relation

    [σ^{−1}]_k = 1/σ_0                                              if k = 0
    [σ^{−1}]_k = −(1/σ_0) · Σ_{i=0}^{k−1} ( σ_{k−i} · [σ^{−1}]_i )    if k > 0

Reasoning about this function in terms of its components is extraordinarily difficult since each component is defined in terms of all the preceding components. However, there is a useful fact from Rutten [9] which we can use to find the derivative of this operation at all σ where it is defined: σ × σ^{−1} = [1].

Proposition 26 (causal reciprocal rule). The partial function (·)^{−1} : R^ω → R^ω is differentiable at all σ ∈ R^ω such that σ_0 ≠ 0, and its derivative is

    (D∗(·)^{−1})(σ)(∆σ) = [−1] × σ^{−1} × σ^{−1} × ∆σ

Proof. Since σ × σ^{−1} = [1], their derivatives must also be equal. In particular:

    [0] = D∗[1] = D∗(σ × σ^{−1})(∆σ) = σ × (D∗(·)^{−1})(σ)(∆σ) + ∆σ × σ^{−1}

using the causal product rule. Solving this equation for (D∗(·)^{−1})(σ)(∆σ) yields

    (D∗(·)^{−1})(σ)(∆σ) = [−1] × σ^{−1} × σ^{−1} × ∆σ

where we are implicitly using many of the identities established in [9]. ◀

When adopting the conventions that σ^{−n} ≜ σ^{−(n−1)} × σ^{−1} and σ × τ^{−1} ≜ σ/τ, this rule looks quite like the usual rule for the derivative of the reciprocal function: (J(·)^{−1})(x)(∆x) = −∆x/x².

Proposition 27 (causal quotient rule). If f, g : R^ω → R^ω are causal functions differentiable at σ and g(σ)_0 ≠ 0, then f/g is also differentiable at σ and its derivative is

    ( D∗f(σ)(∆σ) × g(σ) + [−1] × f(σ) × D∗g(σ)(∆σ) ) / g(σ)²

So far, causal differential calculus is rather similar to traditional differential calculus. There are two different product rules corresponding to two different products. We were forced to use an implicit differentiation trick to find the derivative of the reciprocal function, but in the end we found a familiar result. However, next we state a rule with no traditional analogue.
Theorem 28 (causal recurrence rule). Let g : R^n × R^m → R^m be differentiable (everywhere) and i ∈ R^m. Then rec_i(g) : (R^n)^ω → (R^m)^ω is differentiable (everywhere) as a causal function, and its derivative ∆τ ≜ [D∗rec_i(g)](σ)(∆σ) satisfies the following recurrence:

    τ_{k+1} = g(σ_{k+1}, τ_k)                        after  τ_0 = g(σ_0, i)
    ∆τ_{k+1} = Jg(σ_{k+1}, τ_k)(∆σ_{k+1}, ∆τ_k)      after  ∆τ_0 = Jg(σ_0, i)(∆σ_0, 0_{R^m})

Proof.
We check U_k(D∗rec_i(g)(σ))(∆σ_{0:k}) = ∆τ_k by induction on k. To simplify our notation, we write u_k ≜ U_k(rec_i(g)). The base case is easy:

    U_0([D∗rec_i(g)](σ))(∆σ_0) = J(U_0(rec_i(g)))(σ_0)(∆σ_0)
                               = J(λx. g(x, i))(σ_0)(∆σ_0) = Jg(σ_0, i)(∆σ_0, 0_{R^m})

The induction step uses the fact that u_k(σ_{0:k}) = g(σ_k, u_{k−1}(σ_{0:k−1})), that is, u_k = g ∘ ⟨λ_k, u_{k−1} ∘ π_k⟩, where π_k is the (linear) map discarding the last element of a list and λ_k is the (linear) map extracting it:

    U_k([D∗rec_i(g)](σ))(∆σ_{0:k}) = Ju_k(σ_{0:k})(∆σ_{0:k})
        = [Jg(σ_k, τ_{k−1}) ∘ ⟨Jλ_k(σ_{0:k}), J(u_{k−1} ∘ π_k)(σ_{0:k})⟩](∆σ_{0:k})
        = [Jg(σ_k, τ_{k−1}) ∘ ⟨λ_k, Ju_{k−1}(σ_{0:k−1}) ∘ π_k⟩](∆σ_{0:k})
        = Jg(σ_k, τ_{k−1})(∆σ_k, Ju_{k−1}(σ_{0:k−1})(∆σ_{0:k−1}))
        = Jg(σ_k, τ_{k−1})(∆σ_k, ∆τ_{k−1})   ◀
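The proof is constructive: the pair (τ_k, ∆τ_k) can be computed alongside the recurrence itself, one input at a time. The following minimal Python sketch (ours, not the paper's; the names d_rec and Jg are our own) implements this forward-mode reading of the recurrence rule.

    def d_rec(g, Jg, i, zero):
        """Derivative of rec_i(g): stream of (tau_k, Delta tau_k) pairs.

        Jg(a, b, da, db) must return the Jacobian Jg(a, b) applied to (da, db).
        Seeding with tau = i and Delta tau = zero reproduces both 'after' clauses."""
        def df(pairs):  # pairs is a stream of (sigma_k, Delta sigma_k)
            tau, dtau = i, zero
            for a, da in pairs:
                # both right-hand sides use the previous tau and Delta tau
                tau, dtau = g(a, tau), Jg(a, tau, da, dtau)
                yield tau, dtau
        return df

    # Running product (Example 30 below): g(s, t) = s * t,
    # so Jg(s, t)(ds, dt) = ds * t + s * dt.
    dprod = d_rec(lambda s, t: s * t,
                  lambda s, t, ds, dt: ds * t + s * dt, i=1.0, zero=0.0)
    print(list(dprod(zip([2.0, 3.0, 4.0], [1.0, 0.0, 0.0]))))
    # [(2.0, 1.0), (6.0, 3.0), (24.0, 12.0)]; the last Delta tau entry is
    # d(sigma_0*sigma_1*sigma_2)/d(sigma_0) = sigma_1*sigma_2 = 12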
Degenerate recurrences, which do not refer to previous values generated by the recurrence, are a special instance of this rule.

Corollary 29 (causal map rule). Let h : R^n → R^m be a differentiable function. Then map(h) is differentiable as a causal function, and its derivative is map(Jh), i.e., D∗map(h)(σ)(∆σ)_k = Jh(σ_k)(∆σ_k).

To illustrate the recurrence rule, we revisit the running product function, introduced in Example 9, and compute its derivative.

Example 30.
The unary running product function Π : R^ω → R^ω was defined to be rec_1(g) where g is binary multiplication of reals. In approximant form, U_k(rec_1(g))(σ_{0:k}) = Π_{i=0}^{k} σ_i. We compute a recurrence for the derivative of this function using the recurrence rule. Since g is binary multiplication, Jg(s, t)(∆s, ∆t) = ∆s · t + s · ∆t. By the recurrence rule, [D∗rec_1(g)](σ)(∆σ) satisfies the recurrence

    τ_{k+1} = σ_{k+1} · τ_k                       after  τ_0 = σ_0
    ∆τ_{k+1} = ∆σ_{k+1} · τ_k + σ_{k+1} · ∆τ_k    after  ∆τ_0 = ∆σ_0

Note that a direct computation of the derivative of this function is available, since we have a simple form for its pointwise approximants. Directly from the definition we would get

    ∆τ_k = U_k(D∗rec_1(g)(σ))(∆σ_{0:k}) = Σ_{i=0}^{k} Π_{j=0}^{k} ρ_ij

where ρ_ij is ∆σ_j if i = j and σ_j otherwise.

Used naively, this formula results in O(k²) real-number multiplications and requires access to the entire initial segment of σ at all times. In contrast, computing the same quantity using the recurrence obtained by the recurrence rule requires O(k) multiplications and can be computed on-the-fly, requiring only the availability of the first elements of σ and ∆σ to make initial progress and releasing their memory just after use.

We next turn toward a potential application domain of our causal differential calculus: machine learning. In particular, we demonstrate that it is possible to use this calculus in the training of recurrent neural networks (RNNs). RNNs differ from the more common feedforward networks in that they are designed to process sequences of inputs rather than single inputs. This makes them especially useful in analyzing long texts (sequences of words), spoken language (sequences of sounds), and videos (sequences of images). In fact, particular RNN architectures are the core underlying technologies of many speech recognition products today, such as Alexa and Siri.

In this section, we will be using our causal differential calculus to find the derivative of a simple kind of recurrent neural network, namely an Elman network [6]. This is an influential early example of a network with feedback, though modern feedback networks typically have more structure. Elman networks can operate on sequences of vectors from R^n, but to keep things slightly simpler we will consider Elman networks operating on sequences of real numbers only.

Let α, β, γ, δ, ε ∈ R be arbitrary parameters and φ_1, φ_2 : R → R be arbitrary differentiable “activation” functions. (“Activation” here has no technical meaning, but carries a connotation that the function is likely taken from a folklore set of functions including the sigmoid function, hyperbolic tangent, softplus, rectified linear unit, and logistic function. Usually these functions have bounded range, often [0, 1].) Given an input sequence σ ∈ R^ω, the Elman network defined by these parameters produces the sequence E(σ) = τ ∈ R^ω satisfying the following recurrence:

    ρ_{k+1} = φ_1(ασ_{k+1} + βρ_k + γ)    after  ρ_0 = φ_1(ασ_0 + γ)
    τ_{k+1} = φ_2(δρ_{k+1} + ε)           after  τ_0 = φ_2(δρ_0 + ε)

In our notation, if we define g_1(x, y) ≜ φ_1(αx + βy + γ) and g_2(x) ≜ φ_2(δx + ε), then E ≜ map(g_2) ∘ rec_0(g_1). We can therefore find the causal derivative of this Elman network relatively easily using the causal chain rule and causal recurrence rule.
Indeed, letting D∗E(σ)(∆σ) = ∆τ, these rules tell us ∆τ satisfies the recurrence:

    ρ_{k+1} = φ_1(ασ_{k+1} + βρ_k + γ)                           after  ρ_0 = φ_1(ασ_0 + γ)
    τ_{k+1} = φ_2(δρ_{k+1} + ε)                                   after  τ_0 = φ_2(δρ_0 + ε)
    ∆ρ_{k+1} = φ′_1(ασ_{k+1} + βρ_k + γ) · (α∆σ_{k+1} + β∆ρ_k)    after  ∆ρ_0 = φ′_1(ασ_0 + γ) · (α∆σ_0)
    ∆τ_{k+1} = φ′_2(δρ_{k+1} + ε) · (δ∆ρ_{k+1})                   after  ∆τ_0 = φ′_2(δρ_0 + ε) · (δ∆ρ_0)

This derivative tells us how we would expect the output of the Elman network to change in response to a small change ∆σ to its input sequence σ. This can be useful information in analyzing the behavior of the network. However, we can also use causal differentiation to predict how the network's output would change in response to a small change in one of the parameters, which is a crucial piece of information used when training the network.

Let us now imagine that we have some data on how this Elman network should behave, in the form of an input/output pair (σ̂, τ̂) ∈ R^ω × R^ω representing ground truth, and we want to figure out how to adjust one of the parameters, say α, so that our Elman network better reflects this ground truth.

We can define a causal function related to the Elman network E, but where we now consider α to be a variable and fix σ to be σ̂. Denote this function E_σ̂ : R^ω → R^ω and note that if τ = E_σ̂(α̂) for α̂ ∈ R^ω, then τ satisfies the recurrence relation

    ρ_{k+1} = φ_1(ασ̂_{k+1} + βρ_k + γ)    after  ρ_0 = φ_1(ασ̂_0 + γ)
    τ_{k+1} = φ_2(δρ_{k+1} + ε)            after  τ_0 = φ_2(δρ_0 + ε)

We have simplified our expression using the fact that parameters are fixed values that do not change in the course of the computation of the output sequence, so α̂_k = α for all k ∈ ω. Similarly, when we make a small change to this parameter, that small change will remain independent of the position in the sequence, so ∆α_k = ∆α for all k.

We can compute the derivative of this recurrence relation similarly to above, and find it will satisfy the following recurrence relation:

    ρ_{k+1} = φ_1(ασ̂_{k+1} + βρ_k + γ)                            after  ρ_0 = φ_1(ασ̂_0 + γ)
    τ_{k+1} = φ_2(δρ_{k+1} + ε)                                    after  τ_0 = φ_2(δρ_0 + ε)
    ∆ρ_{k+1} = φ′_1(ασ̂_{k+1} + βρ_k + γ) · (∆α σ̂_{k+1} + β∆ρ_k)    after  ∆ρ_0 = φ′_1(ασ̂_0 + γ) · (∆α σ̂_0)
    ∆τ_{k+1} = φ′_2(δρ_{k+1} + ε) · (δ∆ρ_{k+1})                    after  ∆τ_0 = φ′_2(δρ_0 + ε) · (δ∆ρ_0)
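This recurrence is again easy to run forward. A minimal Python sketch (ours, not the paper's; the function names are our own, and we fix both activations to the sigmoid, matching Example 31 below) computes (τ_k, ∆τ_k) one entry at a time:

    import math

    def phi(x):   # sigmoid activation, phi(x) = 1 / (1 + e^{-x})
        return 1.0 / (1.0 + math.exp(-x))

    def dphi(x):  # its derivative, phi'(x) = phi(x) * (1 - phi(x))
        p = phi(x)
        return p * (1.0 - p)

    def elman_alpha_derivative(sighat, alpha, dalpha, beta, gamma, delta, eps):
        """Yield (tau_k, Delta tau_k) for E_sighat at alpha, entry by entry."""
        rho, drho = 0.0, 0.0  # seeds; beta * 0 = 0 recovers the 'after' clauses
        for s in sighat:
            pre = alpha * s + beta * rho + gamma
            rho, drho = phi(pre), dphi(pre) * (dalpha * s + beta * drho)
            yield phi(delta * rho + eps), dphi(delta * rho + eps) * delta * drho

    # Instantiated with the parameters of Example 31 below:
    pairs = elman_alpha_derivative([1.0] * 4, alpha=1.0, dalpha=0.1,
                                   beta=1.0, gamma=0.1, delta=1.0, eps=-0.1)
    for tau_k, dtau_k in pairs:
        print(round(tau_k, 5), round(dtau_k, 5))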
Example 31. Let us take a very specific example to illustrate this process. We instantiate the above Elman network with α = β = δ = 1, γ = 0.1, ε = −0.1, and φ_1 = φ_2 both the sigmoid function. (The sigmoid function φ : R → R is defined by φ(x) = 1/(1 + e^{−x}). It is traditionally denoted by σ, but since we have been using σ as a sequence variable we use φ.)

We suppose our ground truth data tells us a sequence starting σ̂ = (1, 1, 1, 1, . . .) should be sent to a sequence starting τ̂. In reality, rounded to 5 decimal places, our Elman network as currently parametrized sends σ̂ to a sequence whose entries each exceed the corresponding entries of τ̂. Our task is to decide how to adjust α so that the new network will better match our data, in particular reducing every entry by about 0.05.

To do so, we use the recurrence for E_σ̂ from above with our particular choice of parameters. Since we have chosen many coefficients and all the entries of σ̂ to be 1, there is significant simplification:

    ρ_{k+1} = φ(ρ_k + 1.1)                    after  ρ_0 = φ(1.1)
    τ_{k+1} = φ(ρ_{k+1} − 0.1)                 after  τ_0 = φ(ρ_0 − 0.1)
    ∆ρ_{k+1} = φ′(ρ_k + 1.1) · (∆α + ∆ρ_k)     after  ∆ρ_0 = φ′(1.1) · ∆α
    ∆τ_{k+1} = φ′(ρ_{k+1} − 0.1) · ∆ρ_{k+1}    after  ∆τ_0 = φ′(ρ_0 − 0.1) · ∆ρ_0

The only free variable in this recurrence is ∆α. We choose ∆α = 0.1, for reasons to be explained later. Then we can compute ∆τ, a sequence of small positive values on the order of a few thousandths.

What does this tell us? The recurrence is supposed to compute the derivative of E_σ̂ at 1 and apply the resulting linear map to 0.1. Using the interpretation of derivative as approximate change, this suggests that if we increase our parameter α from its current value of 1 by ∆α = 0.1, we should expect E_σ̂(1.1) to be about E_σ̂(1) + ∆τ. Since our goal is to reduce the output of the network, this adjustment is not a great idea.

What are we to do? One option is to pick a new value for ∆α and recompute the approximate change, but there is a smarter way. We know that the derivative of E_σ̂ at 1 is linear, so if we instead decrease α by 0.1, we would expect E_σ̂(0.9) to be about E_σ̂(1) − ∆τ. Indeed, after making this adjustment, we find E_σ̂(0.9) is close to this prediction. The adjustment ended up decreasing the result by about 0.00015 more than we predicted, which amounts to approximately a 5% overshot of the original prediction.

While it is nice to know our prediction about the change was fairly accurate, subtracting 0.1 from α has not achieved our goal: in each component, our Elman network's output decreased by at most 0.005 while we were trying to create a reduction of 0.05. A natural idea here would be to really exploit the linearity of the derivative and make a bigger adjustment to α, namely subtracting (0.05/0.005) · ∆α = 10 · ∆α = 1. Computing E_σ̂(0), we find it is much closer to our goal than E_σ̂(0.9) turned out to be.

This seems like good news, but if we check the accuracy of the prediction our derivative makes, we would find that the actual reduction from E_σ̂(1) to E_σ̂(0) is between 25% and 65% greater than the derivative predicted. Thus, though we were able to make greater progress aligning our network with ground truth, the bigger adjustment came with much greater error. This is a classic tradeoff in neural network training: the linear approximation provided by the derivative is only valid locally, so taking bigger steps along the gradient comes with potentially greater rewards in terms of improvements in network performance, but also carries extra risk that greater error could lead the training astray.

In this paper, we presented a basic differential calculus for causal functions between sequences of real-valued vectors. We gave a definition of derivative for causal functions, showed how to compute derivatives from this definition, and established many classical rules from multivariable calculus, including the chain, parallel, sum, product, reciprocal, and quotient rules. We additionally showed a rule unique to the causal calculus: the recurrence rule. We then showed how to use these rules in a practical example, namely the training of an Elman network.
Related work. We are not aware of other works directly treating differentiation of causal functions, though we suspect there may be connections to hard-core analysis literature. This work is obviously inspired in results and structure by standard undergraduate multivariable calculus, e.g. [11]. We also have a related categorical treatment of differentiation of causal functions [12] using the framework of Cartesian differential categories [2]. That is much more abstract than the present work, but when concretized to the current scenario would only apply to smooth causal functions.

Though we drew our example differentiable functions almost exclusively from Rutten's stream calculus [9], we would also like to point out signal flow graphs, an interesting graphical representation of causal functions investigated in e.g. [1, 3, 4, 7]. We expect that interpreting our differential calculus in this setting could yield a treatment of differentiation in string diagrams.

We suspect the recurrence rule we obtained, particularly when differentiating Elman networks, may also have connections to the automatic differentiation literature we are not aware of at this time. In particular, it does rather seem like the recurrence rule augments a recurrence with dual numbers.
Future directions. As neural networks become more advanced and practitioners find new and interesting ways of using gradients of these networks, we believe theoreticians have a role to play in systematizing the theory of these new applications of derivatives. We believe that the coalgebra community, as experts with many tools for understanding programs operating on infinite data structures, is particularly well-positioned to help develop these theories. For example, nearly every rule of causal differentiation we established here relies on a coalgebraically-derived property from Rutten's stream calculus [9]. We looked at functions on sequences in particular, but we have every reason to believe further results are possible for more advanced neural network architectures on more exotic infinite data structures.

We are particularly interested in merging our results here with a line of research initiated in [12] using Cartesian differential categories. We believe this causal calculus could be an instance of a Cartesian differential restriction category [5], which would drastically improve the scope of our previous results to cover partial and non-smooth causal functions.
References
[1] Henning Basold, Marcello Bonsangue, Helle Hvid Hansen, and Jan Rutten. (Co)Algebraic Characterizations of Signal Flow Graphs, pages 124–145. Springer International Publishing, Cham, 2014. doi:10.1007/978-3-319-06880-0_6.
[2] R. F. Blute, J. R. B. Cockett, and R. A. G. Seely. Cartesian differential categories. Theory and Applications of Categories, 22:622–672, 2009.
[3] Filippo Bonchi, Paweł Sobociński, and Fabio Zanasi. A categorical semantics of signal flow graphs. In CONCUR 2014 - Concurrency Theory - 25th International Conference, CONCUR 2014, Rome, Italy, September 2-5, 2014, Proceedings, pages 435–450, 2014. doi:10.1007/978-3-662-44584-6_30.
[4] Filippo Bonchi, Paweł Sobociński, and Fabio Zanasi. Full abstraction for signal flow graphs. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2015, Mumbai, India, January 15-17, 2015, pages 515–526, 2015. doi:10.1145/2676726.2676993.
[5] J. R. B. Cockett, G. S. H. Cruttwell, and J. D. Gallagher. Differential restriction categories. Theory and Applications of Categories, 25(21):537–613, 2011.
[6] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, Mar 1990. doi:10.1207/s15516709cog1402_1.
[7] Stefan Milius. A sound and complete calculus for finite stream circuits. In Proceedings of the 25th Annual IEEE Symposium on Logic in Computer Science, LICS 2010, 11-14 July 2010, Edinburgh, United Kingdom, pages 421–430, 2010. doi:10.1109/LICS.2010.11.
[8] Jan Rutten, Clemens Kupke, and Helle Hvid Hansen. Stream differential equations: Specification formats and solution methods. Logical Methods in Computer Science, 13, 2017.
[9] J.J.M.M. Rutten. A coinductive calculus of streams. Mathematical Structures in Computer Science, 15(1):93–147, Feb 2005. doi:10.1017/S0960129504004517.
[10] J.J.M.M. Rutten. Algebraic specification and coalgebraic synthesis of Mealy automata. Electronic Notes in Theoretical Computer Science, 160:305–319, Aug 2006. doi:10.1016/j.entcs.2006.05.030.
[11] Michael Spivak. Calculus on Manifolds. 1965.
[12] David Sprunger and Shin-ya Katsumata. Differentiable causal computations via delayed trace. CoRR, abs/1903.01093, 2019. URL: http://arxiv.org/abs/1903.01093.
[13] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, Oct 1990. doi:10.1109/5.58337.