The differential calculus of causal functions
David Sprunger
National Institute of Informatics, Tokyo [email protected]
Bart Jacobs
Radboud University, Nijmegen [email protected]
Abstract
Causal functions of sequences occur throughout computer science, from theory to hardware to machine learning. Mealy machines, synchronous digital circuits, signal flow graphs, and recurrent neural networks all have behaviour that can be described by causal functions. In this work, we examine a differential calculus of causal functions which includes many of the familiar properties of standard multivariable differential calculus. These causal functions operate on infinite sequences, but this work gives a different notion of an infinite-dimensional derivative than either the Fréchet or Gateaux derivative used in functional analysis. In addition to showing many standard properties of differentiation, we show causal differentiation obeys a unique recurrence rule. We use this recurrence rule to compute the derivative of a simple recurrent neural network called an Elman network by hand and describe how the computed derivative can be used to train the network.
Mathematics of computing → Differential calculus; Computing methodologies → Neural networks
Keywords and phrases sequences, causal functions, derivatives, recurrent neural networks, Elman networks
Funding
David Sprunger: This author is supported by ERATO HASUO Metamathematics for Systems Design Project (No. JPMJER1603), JST.
Many computations on infinite data streams operate in a causal manner, meaning their k-th output depends only on the first k inputs. Mealy machines, clocked digital circuits, signal flow graphs, recurrent neural networks, and discrete-time feedback loops in control theory are a few examples of systems performing such computations. When designing these kinds of systems to fit some specification, a common issue is figuring out how adjusting one part of the system will affect the behaviour of the whole. If the system has some real-valued semantics, as is especially common in machine learning or control theory, the derivative of these semantics with respect to a quantity of interest, say an internal parameter, gives a locally-valid first-order estimate of the system-wide effect of a small change to that quantity. Unfortunately, since the most natural semantics for infinite data streams is in an infinite-dimensional vector space, it is not practical to use the resulting infinite-dimensional derivative.

To get around this, one tactic is to replace the infinite system by a finite system obtained by an approximation or heuristic and take derivatives of the replacement system. This can be seen, for example, in backpropagation through time [13], which trains a recurrent neural network by first unrolling the feedback loop the appropriate number of times and then applying traditional backpropagation to the unrolled network.

This tactic has the advantage that we can take derivatives in a familiar (finite-dimensional) setting, but the disadvantage that it is not clear what properties survive the approximation process from the unfamiliar (infinite-dimensional) setting. For example, it is not immediately clear whether backpropagation through time obeys the usual rules of differential calculus, like a sum or chain rule, nor is this issue confronted in the literature, to the best of our knowledge. Thus, useful compositional properties of differentiation are ignored in exchange for a comfortable setting in which to do calculus.

In this work, we take advantage of the fact that causal functions between sequences are already essentially limits of finite-dimensional functions and therefore have derivatives which can also be expressed as essentially limits of the derivatives of these finite-dimensional functions. This leads us to the basics of a differential calculus of causal functions. Unlike with arbitrary functions between sequences, this limiting process allows us to avoid the use of normed vector spaces, and so we believe our notion of derivative is distinct from Fréchet derivatives.

Outline.
In section 2, we define causal functions and recall several mechanisms by which these functions on infinite data can be defined. In particular, we recall a coalgebraic scheme finding causal functions as the behaviour of Mealy machines (Proposition 6), and give a definitional scheme in terms of so-called finite approximants (Definition 8). In section 3, we define differentiability and derivatives of causal functions on real-vector sequences (Definition 12) and compute several examples. In section 4, we obtain several rules for our differential causal calculus analogous to those of multivariable calculus, including a chain rule, parallel rule, sum rule, product rule, reciprocal rule, and quotient rule (Propositions 18, 19, 22, 23, 26, and 27, respectively). We additionally find a new rule without a traditional analogue, which we call the recurrence rule (Theorem 28). Finally, in section 5, we apply this calculus to find derivatives of a simple kind of recurrent neural network called an Elman network [6] by hand. We also demonstrate how to use the derivative of the network with respect to a parameter to guide updates of that parameter to drive the network towards a desired behaviour.

A sequence or stream in a set A is a countably infinite list of values from A, which we also think of as a function from the natural numbers ω to A. If σ is a stream in A, we denote its value at k ∈ ω by σ_k. We may also think of a stream as a listing of its image, like σ = (σ_0, σ_1, . . .). The set of all sequences in A is denoted A^ω.

Given a ∈ A and σ ∈ A^ω, we can form a new sequence by prepending a to σ. The sequence a : σ is defined by (a : σ)_0 = a and (a : σ)_{k+1} = σ_k. This operation can be extended to prepend arbitrary finite-length words w ∈ A* by the obvious recursion. Conversely, we can destruct a given sequence into an element and a second sequence with functions hd : A^ω → A and tl : A^ω → A^ω defined by hd(σ) = σ_0 and tl(σ)_k = σ_{k+1}.

Definition 1 (slicing). If σ ∈ A^ω is a stream and j ≤ k are natural numbers, the slicing σ_{j:k} is the list (σ_j, σ_{j+1}, . . . , σ_k) ∈ A^{k−j+1}.

Definition 2 (causal function). A function f : A^ω → B^ω is causal when σ_{0:k} = τ_{0:k} implies f(σ)_k = f(τ)_k for all σ, τ ∈ A^ω and k ∈ ω.

A standard coalgebraic approach to causal functions is to view them as the behaviour of Mealy machines.
Definition 3 (Mealy functor). Given two sets A, B, the functor M_{A,B} : Set → Set is defined by M_{A,B}(X) = (B × X)^A on objects and M_{A,B}(f) : φ ↦ (id_B × f) ∘ φ on morphisms.

M_{A,B}-coalgebras are Mealy machines with input alphabet A and output alphabet B, and possibly an infinite state space. The set of causal functions A^ω → B^ω carries a final M_{A,B}-coalgebra using the following operations, originally observed by Rutten in [10].
Definition 4. The Mealy output of a causal function f : A^ω → B^ω is the function hd_f : A → B defined by (hd_f)(a) = f(a : σ)_0 for any σ ∈ A^ω.

Definition 5. Given a ∈ A and a causal function f : A^ω → B^ω, the Mealy (a-)derivative of f is the causal function ∂_a f : A^ω → B^ω defined by (∂_a f)(σ) = tl(f(a : σ)).

Note hd_f is well-defined even though σ may be freely chosen, due to the causality of f.

Proposition 6 (Proposition 2.2, [10]). The set of causal functions A^ω → B^ω carries an M_{A,B}-coalgebra via f ↦ λa. ((hd_f)(a), ∂_a f), which is a final M_{A,B}-coalgebra.
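Operationally, Proposition 6 says that running a Mealy machine step by step from a given state yields a causal function on streams. The following is a minimal Python sketch of this reading (ours, not the paper's): streams are modelled as generators, and the helper name causal_function is our own.

    from itertools import islice

    def causal_function(step, state):
        """The causal function A^omega -> B^omega behaving like `state`,
        where step(s, a) = (output, next_state) packages the coalgebra
        structure X -> (B x X)^A of Definition 3."""
        def f(sigma):
            s = state
            for a in sigma:
                b, s = step(s, a)
                yield b
        return f

    # The one-state machine for pointwise sum (see Example 7 below):
    # hd(s)(a1, a2) = a1 + a2 and every Mealy derivative stays at s.
    plus = causal_function(lambda s, pair: (pair[0] + pair[1], s), state=())

    print(list(islice(plus(zip([0, 1, 2, 3, 4], [0, 2, 4, 6, 8])), 5)))
    # [0, 3, 6, 9, 12]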
Hence, a coalgebraic methodology for defining causal functions is to define a Mealy machine and take the image of a particular state in the final coalgebra. By constructing the Mealy machine cleverly, one can ensure the resulting causal function has some desired properties. This is the core idea behind the “syntactic method” using GSOS definitions in [8]. In that work, a Mealy machine of terms is built in such a way that all causal functions (A^k)^ω → A^ω can be recovered.

Example 7. Suppose (A, +_A, ·_A, 0_A) is a vector space over R. This vector space structure can be extended to A^ω componentwise in the obvious way. To illustrate the coalgebraic method, we characterise this structure with coalgebraic definitions.

To define sequence vector sum coalgebraically, we define a Mealy machine s : 1 → (A × 1)^{A×A} with one state, satisfying hd(s)(a_1, a_2) = a_1 +_A a_2 and ∂_{(a_1,a_2)}(s) = s. Then +_{A^ω} : (A × A)^ω → A^ω is defined to be the image of s in the final M_{A×A,A}-coalgebra.

Note that technically the vector sum in A^ω should be a function of type A^ω × A^ω → A^ω, so we are tacitly using the isomorphism between (A × A)^ω and A^ω × A^ω. We will be using similar recastings of sequences in the sequel without bringing up this point again.

The zero vector can similarly be defined by a single-state Mealy machine 1 → (A × 1)^1 with input alphabet 1 and output alphabet A, satisfying hd(s)(∗) = 0_A and ∂_∗(s) = s. The zero vector of A^ω is the global element picked out by the image of s.

Finally, scalar multiplication can be defined with a Mealy machine R → (A × R)^A with states r ∈ R, such that hd(r)(a) = r ·_A a and ∂_a r = r. Then r ·_{A^ω} σ ≜ [[r]](σ), where [[r]] is the image of r in the final M_{A,A}-coalgebra.

We immediately begin dropping the subscripts from +_{A^ω} and ·_{A^ω} when the relevant vector space can be inferred from context.

Another approach to causal functions is to consider them as a limit of finite approximations, replacing the single function on infinite data with infinitely many functions on finite data. There are (at least) two approaches with this general style, which we briefly describe next.
Definition 8. Let f : A^ω → B^ω be a causal function and σ ∈ A^ω.
The pointwise approximation of f is the sequence of functions U_k(f) : A^{k+1} → B defined by U_k(f)(w) ≜ f(w : σ)_k.
The stringwise approximation of f is the sequence of functions T_k(f) : A^{k+1} → B^{k+1} defined by T_k(f)(w) ≜ f(w : σ)_{0:k}.

Again, these are well-defined despite σ being arbitrary, due to f's causality. We chose the letters U and T deliberately: sometimes the pointwise approximants of a causal function are called its Unrollings, and the stringwise approximants are called its Truncations.

Conversely, given an arbitrary collection of functions u_k : A^{k+1} → B for k ∈ ω, there is a unique causal function whose pointwise approximation is the sequence u_k. Thus we have the following bijective correspondence:

    causal functions A^ω → B^ω
    ==========================================     (1)
    functions A^{k+1} → B for each k ∈ ω

We can nearly do the same for stringwise approximations, but the sequence t_k : A^{k+1} → B^{k+1} must satisfy t_k(w) = t_{k+1}(wa)_{0:k} for all w ∈ A^{k+1} and a ∈ A.

The interchangeability between a causal function and its approximants is a crucial theme in this work. Since a function's pointwise and stringwise approximants are inter-obtainable, we will sometimes refer to a causal function's “finite approximants”, by which we mean either family of approximants.

Finite approximants are a very flexible way of defining causal functions, but causal functions may have a more compact representation when they conform to a regular pattern. Recurrence is one such pattern, where a causal function is defined by repeatedly using an ordinary function g : A × B → B and an initial value i ∈ B to obtain rec_i(g) : A^ω → B^ω via:

    [rec_i(g)(σ)]_k = g(σ_0, i)                        if k = 0
    [rec_i(g)(σ)]_k = g(σ_k, [rec_i(g)(σ)]_{k−1})      if k > 0

In approximant form, U_k(rec_i(g))(σ_{0:k}) = g(σ_k, g(σ_{k−1}, . . . g(σ_1, g(σ_0, i)) . . .)). Note these pointwise approximants satisfy the recurrence relation U_k(rec_i(g))(σ_{0:k}) = g(σ_k, U_{k−1}(rec_i(g))(σ_{0:k−1})), as the following sketch illustrates.
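A minimal Python sketch (ours, not the paper's) of the recurrence scheme and its pointwise approximants; the helper names rec and U are our own, and streams are again modelled as generators.

    def rec(g, i):
        """rec_i(g) : A^omega -> B^omega, with streams as generators."""
        def f(sigma):
            acc = i
            for a in sigma:
                acc = g(a, acc)  # [rec_i(g)(sigma)]_k = g(sigma_k, previous value)
                yield acc
        return f

    def U(k, f, prefix):
        """Pointwise approximant U_k(f): by causality, f(sigma)_k depends only
        on sigma_{0:k}, so a length-(k+1) prefix of the input suffices."""
        out = None
        for out, _ in zip(f(iter(prefix)), range(k + 1)):
            pass
        return out

    from operator import mul
    # The running product of Example 9 below is rec_1(mul):
    print(U(3, rec(mul, 1.0), [2.0, 3.0, 4.0, 5.0]))  # 2*3*4*5 = 120.0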
Example 9. The unary running product function Π : R^ω → R^ω can be defined by a recurrence relation: Π(σ) = τ ⇔ τ_{k+1} = σ_{k+1} · τ_k after τ_0 = σ_0. Here g is multiplication of reals and i = 1. In approximant form, [Π(σ)]_k = Π_{i=0}^{k} σ_i.

A special case of recurrent causal functions occurs when there is an h : A → B such that g(a, b) = h(a) for all (a, b) ∈ A × B. In this case, [rec_i(g)(σ)]_k = h(σ_k) and in particular does not depend on the initial value i or any entry σ_j for j < k. We denote rec_i(g) by map(h) in this special case, since it maps h componentwise across the input sequence.

Our goal in this work is to develop a basic differential calculus for causal functions. Thus we will focus our attention on causal functions between real-vector sequences (R^n)^ω for n ∈ ω, specializing from causal functions on general sets from the last section. We will draw many of our illustrating examples for derivatives from Rutten's stream calculus [9], which describes many such causal functions between real-number streams. More importantly, [9] establishes many useful algebraic properties of these functions rigorously via coalgebraic methods.

There are many different approaches one might consider to defining differentiable causal functions. One might be to take the original coalgebraic definition and replace the underlying category (Set) with a category of finite-dimensional Cartesian spaces and differentiable (or smooth) maps. Unfortunately, the space of differentiable functions between finite-dimensional spaces is not finite-dimensional, so the exponential needed to define the M_{A,B} functor in this category does not exist.

Another approach is to think of causal functions as functions between infinite-dimensional vector spaces and take standard notions from analysis, like Fréchet derivatives, and apply them in this context. However, norms on sequence spaces usually impose a finiteness condition, like boundedness or square-summability, on the domains and ranges of sequence functions. These restrictions are compatible with many causal functions like the pointwise sum function above, but other causal functions like the running product function become significantly less interesting.

Our approach to differentiating causal functions is to consider a causal function differentiable when all of its finite approximants are differentiable via the correspondence (1). We will develop this idea rigorously in section 3.2, but first we need to know a bit about linear causal functions.
Stated abstractly, the derivative of a function at a point is a linear map which provides an approximate change in the output of a function given an input representing a small change in the input to that function [11]. Since linear functions R → R are in bijective correspondence with their slopes, in single-variable calculus the derivative of a function at a point is typically instead given as a single real number. In multivariable calculus, derivatives are usually represented by (Jacobian) matrices, since matrices represent linear maps between finite-dimensional spaces. Linear functions between infinite-dimensional vector spaces do not have a similarly compact, computationally-useful representation, but we can still define derivatives of (causal) functions at points to be linear (causal) maps.

We described the natural vector space structure of (R^n)^ω in Example 7. A linear causal function is a causal function which is also linear with respect to this vector space structure.

Definition 10. A causal function f : (R^n)^ω → (R^m)^ω is linear when f(r · σ) = r · f(σ) and f(σ + τ) = f(σ) + f(τ) for all r ∈ R and σ, τ ∈ (R^n)^ω.

Lemma 11.
Let f : (R^n)^ω → (R^m)^ω be a causal function. The following are equivalent: (1) f is linear; (2) U_k(f) : (R^n)^{k+1} → R^m is linear for all k ∈ ω; (3) T_k(f) : (R^n)^{k+1} → (R^m)^{k+1} is linear for all k ∈ ω.

This refines the correspondence (1), allowing us to define a linear causal function by naming linear finite approximants.

Since linear functions between finite-dimensional vector spaces can be represented by matrices, we can think of a linear causal function as a limit of the matrices representing its finite approximants. This view results in row-finite infinite matrices, such as:

    [ A_00   0     0     · · · ]
    [ A_10   A_11  0     · · · ]
    [ A_20   A_21  A_22  · · · ]
    [  ⋮      ⋮     ⋮     ⋱   ]

where the A_ij are m-row, n-column blocks and every block with j > i is 0. These are related to the matrices for the approximants of the causal function as follows. The matrix [A_k0 A_k1 · · · A_kk] represents U_k(f), and the matrix

    [ A_00   0     · · ·  0    ]
    [ A_10   A_11  · · ·  0    ]
    [  ⋮      ⋮     ⋱     ⋮   ]
    [ A_k0   A_k1  · · ·  A_kk ]

represents T_k(f). The compatibility conditions on the functions T_k(f) ensure that the matrix for T_k(f) can be found in the upper left corner of the matrix for T_{k+1}(f). Note also that the lower triangular nature of the matrices for T_k(f) is a consequence of causality: the first m outputs can depend only on the first n inputs, so the last entries in the top row must all be 0, and so on.

Unlike finite-dimensional matrices, we do not think these infinite matrices are a computationally useful representation, but they are conceptually useful to get an idea of how causal linear functions can be considered the limit of their linear truncations.

As we have mentioned, we will use the derivatives of the approximants of a causal function to define the derivative of the causal function itself. We denote the m-row, n-column Jacobian matrix of a differentiable function ϕ : R^n → R^m at x ∈ R^n by Jϕ(x). Recall this matrix is

    [ ∂ϕ_1/∂x_1 (x)   ∂ϕ_1/∂x_2 (x)   · · ·   ∂ϕ_1/∂x_n (x) ]
    [ ∂ϕ_2/∂x_1 (x)   ∂ϕ_2/∂x_2 (x)   · · ·   ∂ϕ_2/∂x_n (x) ]
    [       ⋮               ⋮           ⋱           ⋮       ]
    [ ∂ϕ_m/∂x_1 (x)   ∂ϕ_m/∂x_2 (x)   · · ·   ∂ϕ_m/∂x_n (x) ]

where ϕ_i : R^n → R and ϕ = ⟨ϕ_1, . . . , ϕ_m⟩. We will also be glossing over the distinction between a matrix and the linear function it represents, using Jϕ(x) to mean either when convenient.

Definition 12.
A causal function f : (R^n)^ω → (R^m)^ω is differentiable at σ ∈ (R^n)^ω if all of its finite approximants U_k(f) : (R^n)^{k+1} → R^m are differentiable at σ_{0:k} for all k ∈ ω. If f is differentiable at σ, the derivative of f at σ is the unique linear causal function D∗f(σ) : (R^n)^ω → (R^m)^ω satisfying U_k(D∗f(σ)) = J(U_k(f))(σ_{0:k}).

In this definition we are using the correspondence (1), refined in Lemma 11, which allows us to define a causal (linear) function by specifying its (linear) finite approximants. We could equally well have used stringwise approximants in this definition rather than pointwise approximants, as the following lemma states.
Lemma 13. The causal function f is differentiable at σ if and only if each of the T_k(f) is differentiable at σ_{0:k} for all k ∈ ω. In this case, D∗f(σ) satisfies T_k(D∗f(σ)) = J(T_k(f))(σ_{0:k}).

Though we have mentioned this is not particularly useful computationally, the derivative of a differentiable function at a point has a representation as a row-finite infinite matrix.
Lemma 14. If f is differentiable at σ, each U_k(f) : (R^n)^{k+1} → R^m has an m-row, n(k+1)-column Jacobian matrix representing its derivative at σ_{0:k}. Let A_ki be the m-row, n-column blocks of this Jacobian, so that J(U_k(f))(σ_{0:k}) = [A_k0 A_k1 · · · A_kk]. The derivative of f at σ is the linear causal function represented by the row-finite infinite matrix

    D∗f(σ) = [ A_00   0     0     · · · ]
             [ A_10   A_11  0     · · · ]
             [ A_20   A_21  A_22  · · · ]
             [  ⋮      ⋮     ⋮     ⋱   ]

Note that this linear causal function can be evaluated at a sequence ∆σ ∈ (R^n)^ω by multiplying the infinite matrix by ∆σ, considered as an infinite column vector.

Next, we use this definition of derivative to find the causal derivatives of some basic functions from Rutten's stream calculus.
Example 15. We show the pointwise sum stream function + : (R^2)^ω → R^ω is its own derivative at every point (σ, τ) ∈ (R^2)^ω. Note U_k(+)(σ_0, τ_0, . . . , σ_k, τ_k) = σ_k + τ_k, so J(U_k(+))(σ_0, τ_0, . . . , σ_k, τ_k) = [0 0 · · · 0 1 1]. This is the matrix representation of U_k(+) itself, so (D∗+)(σ, τ) = + or, in other notation, (D∗+)(σ, τ)(∆σ, ∆τ) = ∆σ + ∆τ for any σ, τ, ∆σ, ∆τ ∈ R^ω.

This argument can be repeated for all pointwise sum functions + : (R^n × R^n)^ω → (R^n)^ω, replacing the “1” entries in the Jacobian above with I_n blocks.

Since the derivative of any constant x : 1 → R^n is 0_{R^n} : 1 → R^n, the derivative of any constant sequence must necessarily be the zero sequence. In stream calculus, there are two important constant sequences defined corecursively: [r], defined by hd([r])(∗) = r and ∂_∗([r]) = [0] for all r ∈ R, and X, defined by hd(X)(∗) = 0 and ∂_∗(X) = [1]. Written out as sequences, [r] = (r, 0, 0, 0, . . .) and X = (0, 1, 0, 0, . . .).

Example 16. D∗[r] = D∗X = [0].

Next, we consider the Cauchy sequence product. Under the correspondence between sequences σ ∈ R^ω and formal power series Σ_i σ_i x^i ∈ R[[x]], the Cauchy product is the sequence operation corresponding to the (Cauchy) product of formal power series. This operation is coalgebraically characterized in Rutten [9] as the unique function × : (R^2)^ω → R^ω satisfying hd(×)(s_0, t_0) = s_0 · t_0 and (∂_{(s_0,t_0)}×)(σ, τ) = tl(σ) × τ + [s_0] × tl(τ). For our purposes, the explicit definition is more useful: U_k(×)(σ_{0:k}, τ_{0:k}) = Σ_{i=0}^{k} σ_i · τ_{k−i}.

Example 17.
We compute the derivative of the Cauchy product:

    J(U_k(×))(σ_0, τ_0, . . . , σ_k, τ_k) = [ τ_k  σ_k  τ_{k−1}  σ_{k−1}  · · ·  τ_0  σ_0 ]

Notice that multiplying this matrix by (an initial segment of) a small-change sequence (∆σ_0, ∆τ_0, . . . , ∆σ_k, ∆τ_k) yields

    J(U_k(×))(σ_0, τ_0, . . . , σ_k, τ_k)(∆σ_0, ∆τ_0, . . . , ∆σ_k, ∆τ_k) = Σ_{i=0}^{k} ∆σ_i · τ_{k−i} + Σ_{i=0}^{k} σ_i · ∆τ_{k−i}

Therefore, (D∗×(σ, τ))(∆σ, ∆τ) = ∆σ × τ + σ × ∆τ.
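As a quick numeric sanity check of this formula (our own Python sketch, not part of the paper), we can compare the predicted directional derivative of U_k(×) against a central finite difference; since U_k(×) is bilinear, the two agree up to rounding error.

    def cauchy_U(k, sig, tau):
        """U_k of the Cauchy product: sum of sig_i * tau_{k-i} for i <= k."""
        return sum(sig[i] * tau[k - i] for i in range(k + 1))

    def fd_directional(k, sig, tau, dsig, dtau, h=1e-6):
        """Central finite-difference derivative of U_k at (sig, tau)
        in the direction (dsig, dtau)."""
        up = cauchy_U(k, [s + h * d for s, d in zip(sig, dsig)],
                         [t + h * d for t, d in zip(tau, dtau)])
        dn = cauchy_U(k, [s - h * d for s, d in zip(sig, dsig)],
                         [t - h * d for t, d in zip(tau, dtau)])
        return (up - dn) / (2 * h)

    sig, tau = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
    dsig, dtau = [0.1, 0.2, 0.3], [0.4, 0.5, 0.6]
    # Predicted by the formula above, using bilinearity of the product:
    predicted = cauchy_U(2, dsig, tau) + cauchy_U(2, sig, dtau)  # 5.6
    print(predicted, fd_directional(2, sig, tau, dsig, dtau))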
Another sequence product considered in the stream calculus is the Hadamard product, also called the pointwise product. Defined coalgebraically, the Hadamard product is the unique binary operation satisfying hd(⊙)(s_0, t_0) = s_0 · t_0 and (∂_{(s_0,t_0)}⊙)(σ, τ) = tl(σ) ⊙ tl(τ). This has a similar derivative to the Cauchy product: D∗⊙(σ, τ)(∆σ, ∆τ) = ∆σ ⊙ τ + σ ⊙ ∆τ.

Note that these derivatives make sense without any reference to properties of the sequences used. We are not aware of a way to realize this derivative as an instance of a notion of derivative known in analysis. The most obvious notion to try is a Fréchet derivative induced by a norm on the space of sequences. However, all norms we know on these spaces, including ℓ^p-norms and γ-geometric norms ‖σ‖ = Σ_i |σ_i| · γ^i for γ ∈ (0, 1), are defined only on restricted subspaces of sequences, ruling out many of the causal functions we consider.

Just as it is impractical to compute all derivatives from the definition in undergraduate calculus, it is also impractical to compute causal derivatives directly from the definition. To ease this burden, one typically proves various “rules” of differentiation which provide compositional recipes for finding derivatives. That is our task in this section.

There are at least two good reasons to hope a priori that the standard rules of differentiation might hold for causal derivatives. First, causal derivatives were defined to agree with standard derivatives in their finite approximants. Since these approximant derivatives satisfy these rules, we might hope that they hold over the limiting process. Second, smooth causal functions form a Cartesian differential category, as was shown in [12]. The theory of Cartesian differential categories includes as axioms or theorems abstract versions of the chain rule, sum rule, etc. However, neither of these reasons is immediately sufficient, so we must provide independent justification.
We begin by stating some rules familiar from undergraduate calculus.
Proposition 18 (causal chain rule). Suppose f : (R^n)^ω → (R^m)^ω and g : (R^m)^ω → (R^ℓ)^ω are causal functions. Suppose further f is differentiable at σ ∈ (R^n)^ω and g is differentiable at f(σ). Then h = g ∘ f is differentiable at σ and its derivative is D∗g(f(σ)) ∘ D∗f(σ).

Proof. Let f_k = T_k(f), g_k = T_k(g), and h_k = T_k(h). We know h_k = g_k ∘ f_k. We show the stringwise approximants of D∗(g ∘ f)(σ) and D∗g(f(σ)) ∘ D∗f(σ) match:

    T_k(D∗(g ∘ f)(σ)) = J(h_k)(σ_{0:k}) = J(g_k ∘ f_k)(σ_{0:k})
                      = J(g_k)(f_k(σ_{0:k})) × J(f_k)(σ_{0:k})                   (∗)
                      = J(g_k)(f(σ)_{0:k}) × J(f_k)(σ_{0:k})
                      = T_k(D∗g(f(σ))) ∘ T_k(D∗f(σ)) = T_k(D∗g(f(σ)) ∘ D∗f(σ))

where the starred line is by the classical chain rule. ◀

Since we have already overloaded × for both the Cauchy stream product and the matrix product, we use ∥ for the parallel composition of functions, where the parallel composition of φ : R^n → R^m and ψ : R^p → R^q is φ ∥ ψ : R^{n+p} → R^{m+q}, defined by (φ ∥ ψ)(x, y) = (φ(x), ψ(y)) for x ∈ R^n and y ∈ R^p. We do not know of a standard name for this rule, but in multivariable calculus there is a rule J(φ ∥ ψ)(x, y) = Jφ(x) ∥ Jψ(y), which we shall call the parallel rule. There is a similar rule for causal derivatives, which we describe next.

Proposition 19 (causal parallel rule). Suppose f : (R^n)^ω → (R^m)^ω and h : (R^p)^ω → (R^q)^ω are causal functions, and that they are differentiable at σ ∈ (R^n)^ω and τ ∈ (R^p)^ω, respectively. Then f ∥ h : (R^{n+p})^ω → (R^{m+q})^ω is differentiable at (σ, τ) ∈ (R^{n+p})^ω and its derivative is D∗f(σ) ∥ D∗h(τ).

Proof.
The stringwise approximants of D∗(f ∥ h)(σ, τ) and D∗f(σ) ∥ D∗h(τ) match:

    T_k(D∗(f ∥ h)(σ, τ)) = J(T_k(f ∥ h))(σ_{0:k}, τ_{0:k}) = J(T_k(f) ∥ T_k(h))(σ_{0:k}, τ_{0:k})
                         = J(T_k(f))(σ_{0:k}) ∥ J(T_k(h))(τ_{0:k})               (∗)
                         = T_k(D∗f(σ)) ∥ T_k(D∗h(τ)) = T_k(D∗f(σ) ∥ D∗h(τ))

where the starred line is by the classical parallel rule. ◀

Proposition 20 (causal linearity). If f : (R^n)^ω → (R^m)^ω is a linear causal function, it is differentiable at every σ ∈ (R^n)^ω and its derivative is D∗f(σ) = f.

These three results are the fundamental properties of causal differentiation we will be using. Many other standard rules are consequences of these. For example, we can derive a sum rule from these properties.
Definition 21. The sum of two causal maps f, g : (R^n)^ω → (R^m)^ω is defined to be f + g ≜ + ∘ (f ∥ g) ∘ ∆_{(R^n)^ω}, where ∆_{(R^n)^ω} is the sequence duplication map.

Proposition 22 (causal sum rule). If f and g as in Definition 21 are both differentiable at σ, so is their sum, and its derivative is D∗f(σ) + D∗g(σ).

Proof. Using the properties above, we find

    D∗(f + g)(σ) = D∗(+ ∘ (f ∥ g) ∘ ∆_{(R^n)^ω})(σ)                                   (sum of maps def'n)
                 = D∗(+)(((f ∥ g) ∘ ∆_{(R^n)^ω})(σ)) ∘ D∗((f ∥ g) ∘ ∆_{(R^n)^ω})(σ)    (causal chain rule)
                 = + ∘ D∗((f ∥ g) ∘ ∆_{(R^n)^ω})(σ)                                    (linearity of +)
                 = + ∘ D∗(f ∥ g)(∆_{(R^n)^ω}(σ)) ∘ D∗(∆_{(R^n)^ω})(σ)                  (causal chain rule)
                 = + ∘ D∗(f ∥ g)(σ, σ) ∘ ∆_{(R^n)^ω}                                   (def'n & linearity of ∆)
                 = + ∘ (D∗f(σ) ∥ D∗g(σ)) ∘ ∆_{(R^n)^ω}                                 (causal parallel rule)
                 = D∗f(σ) + D∗g(σ)                                                     (sum of maps def'n)

as desired. ◀

For functions f, g : R^ω → R^ω, we can define their Cauchy and Hadamard products f × g and f ⊙ g with the pattern of Definition 21 and prove two product rules using the derivatives of the binary operations × and ⊙ we computed earlier.

Proposition 23 (causal product rules). If f, g : R^ω → R^ω are causal functions differentiable at σ, so are their Cauchy and Hadamard products, and their derivatives are

    D∗(f × g)(σ)(∆σ) = D∗f(σ)(∆σ) × g(σ) + f(σ) × D∗g(σ)(∆σ)
    D∗(f ⊙ g)(σ)(∆σ) = D∗f(σ)(∆σ) ⊙ g(σ) + f(σ) ⊙ D∗g(σ)(∆σ)

A typical point of confusion in undergraduate calculus is the role of constants: sometimes they are treated like elements of the underlying vector space and sometimes like functions which always return that vector. In our calculus, a constant can similarly sometimes mean a fixed sequence picked out by c : 1 → (R^n)^ω, or the composition of this map after a discarding map !_{(R^n)^ω} : (R^n)^ω → 1. We have described the derivative of a constant element in Example 16; now we treat constant maps.
Proposition 24 (causal constant rule). The derivative of !_{(R^n)^ω} : (R^n)^ω → 1 is !_{(R^n)^ω}. If c : (R^n)^ω → (R^m)^ω is a constant map, its derivative at every point is the zero map, D∗c(σ)(∆σ) = 0_{(R^m)^ω}.

Proposition 25 (causal constant multiple rule). If c : R^ω → R^ω is a constant function and f : R^ω → R^ω is any other causal function differentiable at σ, so is c × f, and its derivative is c × D∗f(σ).

Proof. Combine the causal product rule and the causal constant rule. ◀
We have seen the standard rules presented in the last section are useful as computational shortcuts, just as they are in undergraduate calculus. In the causal calculus they turn out to be perhaps even more crucial, since some differentiable causal functions do not have simple closed forms, so trying to find their derivative from the definition is extremely difficult.

The stream inverse [9] is the first partial causal function we will consider. This operation is defined on σ ∈ R^ω such that σ_0 ≠ 0 with the unbounded-order recurrence relation

    [σ^{−1}]_k = 1/σ_0                                              if k = 0
    [σ^{−1}]_k = −(1/σ_0) · Σ_{i=0}^{k−1} ( σ_{k−i} · [σ^{−1}]_i )    if k > 0

Reasoning about this function in terms of its components is extraordinarily difficult since each component is defined in terms of all the preceding components. However, there is a useful fact from Rutten [9] which we can use to find the derivative of this operation at all σ where it is defined: σ × σ^{−1} = [1].

Proposition 26 (causal reciprocal rule). The partial function (·)^{−1} : R^ω → R^ω is differentiable at all σ ∈ R^ω such that σ_0 ≠ 0, and its derivative is

    (D∗(·)^{−1})(σ)(∆σ) = [−1] × σ^{−1} × σ^{−1} × ∆σ

Proof. Since σ × σ^{−1} = [1], their derivatives must also be equal. In particular:

    [0] = D∗[1] = D∗(σ × σ^{−1})(∆σ) = σ × (D∗(·)^{−1})(σ)(∆σ) + ∆σ × σ^{−1}

using the causal product rule. Solving this equation for (D∗(·)^{−1})(σ)(∆σ) yields

    (D∗(·)^{−1})(σ)(∆σ) = [−1] × σ^{−1} × σ^{−1} × ∆σ

where we are implicitly using many of the identities established in [9]. ◀

When adopting the conventions that σ^{−n} ≜ σ^{−(n−1)} × σ^{−1} and σ × τ^{−1} ≜ σ/τ, this rule looks quite like the usual rule for the derivative of the reciprocal function: (J(·)^{−1})(x)(∆x) = −∆x/x².

Proposition 27 (causal quotient rule). If f, g : R^ω → R^ω are causal functions differentiable at σ and g(σ)_0 ≠ 0, then f/g is also differentiable at σ and its derivative is

    ( D∗f(σ)(∆σ) × g(σ) + [−1] × f(σ) × D∗g(σ)(∆σ) ) / g(σ)²

So far, causal differential calculus is rather similar to traditional differential calculus. There are two different product rules corresponding to two different products. We were forced to use an implicit differentiation trick to find the derivative of the reciprocal function, but in the end we found a familiar result. However, next we state a rule with no traditional analogue.
Theorem 28 (causal recurrence rule). Let g : R^n × R^m → R^m be differentiable (everywhere) and i ∈ R^m. Then rec_i(g) : (R^n)^ω → (R^m)^ω is differentiable (everywhere) as a causal function, and its derivative ∆τ ≜ [D∗rec_i(g)](σ)(∆σ) satisfies the following recurrence:

    τ_{k+1} = g(σ_{k+1}, τ_k)                        after  τ_0 = g(σ_0, i)
    ∆τ_{k+1} = Jg(σ_{k+1}, τ_k)(∆σ_{k+1}, ∆τ_k)      after  ∆τ_0 = Jg(σ_0, i)(∆σ_0, 0_{R^m})

Proof.
We check U_k(D∗rec_i(g)(σ))(∆σ_{0:k}) = ∆τ_k by induction on k. To simplify our notation, we write u_k ≜ U_k(rec_i(g)). The base case is easy:

    U_0([D∗rec_i(g)](σ))(∆σ_0) = J(U_0(rec_i(g)))(σ_0)(∆σ_0)
                               = J(λx. g(x, i))(σ_0)(∆σ_0) = Jg(σ_0, i)(∆σ_0, 0_{R^m})

The induction step uses the fact that u_k(σ_{0:k}) = g(σ_k, u_{k−1}(σ_{0:k−1})), that is, u_k = g ∘ ⟨λ_k, u_{k−1} ∘ π_k⟩, where π_k is the (linear) map discarding the last element of a list and λ_k is the (linear) map extracting it:

    U_k([D∗rec_i(g)](σ))(∆σ_{0:k}) = Ju_k(σ_{0:k})(∆σ_{0:k})
        = [Jg(σ_k, τ_{k−1}) ∘ ⟨Jλ_k(σ_{0:k}), J(u_{k−1} ∘ π_k)(σ_{0:k})⟩](∆σ_{0:k})
        = [Jg(σ_k, τ_{k−1}) ∘ ⟨λ_k, Ju_{k−1}(σ_{0:k−1}) ∘ π_k⟩](∆σ_{0:k})
        = Jg(σ_k, τ_{k−1})(∆σ_k, Ju_{k−1}(σ_{0:k−1})(∆σ_{0:k−1}))
        = Jg(σ_k, τ_{k−1})(∆σ_k, ∆τ_{k−1})   ◀
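The proof is constructive: the pair (τ_k, ∆τ_k) can be computed alongside the recurrence itself, one input at a time. The following minimal Python sketch (ours, not the paper's; the names d_rec and Jg are our own) implements this forward-mode reading of the recurrence rule.

    def d_rec(g, Jg, i, zero):
        """Derivative of rec_i(g): stream of (tau_k, Delta tau_k) pairs.

        Jg(a, b, da, db) must return the Jacobian Jg(a, b) applied to (da, db).
        Seeding with tau = i and Delta tau = zero reproduces both 'after' clauses."""
        def df(pairs):  # pairs is a stream of (sigma_k, Delta sigma_k)
            tau, dtau = i, zero
            for a, da in pairs:
                # both right-hand sides use the previous tau and Delta tau
                tau, dtau = g(a, tau), Jg(a, tau, da, dtau)
                yield tau, dtau
        return df

    # Running product (Example 30 below): g(s, t) = s * t,
    # so Jg(s, t)(ds, dt) = ds * t + s * dt.
    dprod = d_rec(lambda s, t: s * t,
                  lambda s, t, ds, dt: ds * t + s * dt, i=1.0, zero=0.0)
    print(list(dprod(zip([2.0, 3.0, 4.0], [1.0, 0.0, 0.0]))))
    # [(2.0, 1.0), (6.0, 3.0), (24.0, 12.0)]; the last Delta tau entry is
    # d(sigma_0*sigma_1*sigma_2)/d(sigma_0) = sigma_1*sigma_2 = 12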
Degenerate recurrences, which do not refer to previous values generated by the recurrence, are a special instance of this rule.

Corollary 29 (causal map rule). Let h : R^n → R^m be a differentiable function. Then map(h) is differentiable as a causal function, and its derivative is map(Jh), i.e., D∗map(h)(σ)(∆σ)_k = Jh(σ_k)(∆σ_k).

To illustrate the recurrence rule, we revisit the running product function, introduced in Example 9, and compute its derivative.

Example 30.
The unary running product function Π : R^ω → R^ω was defined to be rec_1(g) where g is binary multiplication of reals. In approximant form, U_k(rec_1(g))(σ_{0:k}) = Π_{i=0}^{k} σ_i. We compute a recurrence for the derivative of this function using the recurrence rule. Since g is binary multiplication, Jg(s, t)(∆s, ∆t) = ∆s · t + s · ∆t. By the recurrence rule, [D∗rec_1(g)](σ)(∆σ) satisfies the recurrence

    τ_{k+1} = σ_{k+1} · τ_k                       after  τ_0 = σ_0
    ∆τ_{k+1} = ∆σ_{k+1} · τ_k + σ_{k+1} · ∆τ_k    after  ∆τ_0 = ∆σ_0

Note that a direct computation of the derivative of this function is available, since we have a simple form for its pointwise approximants. Directly from the definition we would get

    ∆τ_k = U_k(D∗rec_1(g)(σ))(∆σ_{0:k}) = Σ_{i=0}^{k} Π_{j=0}^{k} ρ_ij

where ρ_ij is ∆σ_j if i = j and σ_j otherwise.

Used naively, this formula results in O(k²) real-number multiplications and requires access to the entire initial segment of σ at all times. In contrast, computing the same quantity using the recurrence obtained by the recurrence rule requires O(k) multiplications and can be computed on-the-fly, requiring only the availability of the first elements of σ and ∆σ to make initial progress and releasing their memory just after use.

We next turn toward a potential application domain of our causal differential calculus: machine learning. In particular, we demonstrate that it is possible to use this calculus in the training of recurrent neural networks (RNNs). RNNs differ from the more common feedforward networks in that they are designed to process sequences of inputs rather than single inputs. This makes them especially useful in analyzing long texts (sequences of words), spoken language (sequences of sounds), and videos (sequences of images). In fact, particular RNN architectures are the core underlying technologies of many speech recognition products today, such as Alexa and Siri.

In this section, we will be using our causal differential calculus to find the derivative of a simple kind of recurrent neural network, namely an Elman network [6]. This is an influential early example of a network with feedback, though modern feedback networks typically have more structure. Elman networks can operate on sequences of vectors from R^n, but to keep things slightly simpler we will consider Elman networks operating on sequences of real numbers only.

Let α, β, γ, δ, ε ∈ R be arbitrary parameters and φ_1, φ_2 : R → R be arbitrary differentiable “activation” functions. (“Activation” here has no technical meaning, but carries a connotation that the function is likely taken from a folklore set of functions including the sigmoid function, hyperbolic tangent, softplus, rectified linear unit, and logistic function. Usually these functions have bounded range, often [0, 1].) Given an input sequence σ ∈ R^ω, the Elman network defined by these parameters produces the sequence E(σ) = τ ∈ R^ω satisfying the following recurrence:

    ρ_{k+1} = φ_1(ασ_{k+1} + βρ_k + γ)    after  ρ_0 = φ_1(ασ_0 + γ)
    τ_{k+1} = φ_2(δρ_{k+1} + ε)           after  τ_0 = φ_2(δρ_0 + ε)

In our notation, if we define g_1(x, y) ≜ φ_1(αx + βy + γ) and g_2(x) ≜ φ_2(δx + ε), then E ≜ map(g_2) ∘ rec_0(g_1). We can therefore find the causal derivative of this Elman network relatively easily using the causal chain rule and causal recurrence rule.
Indeed, letting D∗E(σ)(∆σ) = ∆τ, these rules tell us ∆τ satisfies the recurrence:

    ρ_{k+1} = φ_1(ασ_{k+1} + βρ_k + γ)                           after  ρ_0 = φ_1(ασ_0 + γ)
    τ_{k+1} = φ_2(δρ_{k+1} + ε)                                   after  τ_0 = φ_2(δρ_0 + ε)
    ∆ρ_{k+1} = φ′_1(ασ_{k+1} + βρ_k + γ) · (α∆σ_{k+1} + β∆ρ_k)    after  ∆ρ_0 = φ′_1(ασ_0 + γ) · (α∆σ_0)
    ∆τ_{k+1} = φ′_2(δρ_{k+1} + ε) · (δ∆ρ_{k+1})                   after  ∆τ_0 = φ′_2(δρ_0 + ε) · (δ∆ρ_0)

This derivative tells us how we would expect the output of the Elman network to change in response to a small change ∆σ to its input sequence σ. This can be useful information in analyzing the behavior of the network. However, we can also use causal differentiation to predict how the network's output would change in response to a small change in one of the parameters, which is a crucial piece of information used when training the network.

Let us now imagine that we have some data on how this Elman network should behave, in the form of an input/output pair (σ̂, τ̂) ∈ R^ω × R^ω representing ground truth, and we want to figure out how to adjust one of the parameters, say α, so that our Elman network better reflects this ground truth.

We can define a causal function related to the Elman network E, but where we now consider α to be a variable and fix σ to be σ̂. Denote this function E_σ̂ : R^ω → R^ω and note that if τ = E_σ̂(α̂) for α̂ ∈ R^ω, then τ satisfies the recurrence relation

    ρ_{k+1} = φ_1(ασ̂_{k+1} + βρ_k + γ)    after  ρ_0 = φ_1(ασ̂_0 + γ)
    τ_{k+1} = φ_2(δρ_{k+1} + ε)            after  τ_0 = φ_2(δρ_0 + ε)

We have simplified our expression using the fact that parameters are fixed values that do not change in the course of the computation of the output sequence, so α̂_k = α for all k ∈ ω. Similarly, when we make a small change to this parameter, that small change will remain independent of the position in the sequence, so ∆α_k = ∆α for all k.

We can compute the derivative of this recurrence relation similarly to above, and find it will satisfy the following recurrence relation:

    ρ_{k+1} = φ_1(ασ̂_{k+1} + βρ_k + γ)                            after  ρ_0 = φ_1(ασ̂_0 + γ)
    τ_{k+1} = φ_2(δρ_{k+1} + ε)                                    after  τ_0 = φ_2(δρ_0 + ε)
    ∆ρ_{k+1} = φ′_1(ασ̂_{k+1} + βρ_k + γ) · (∆α σ̂_{k+1} + β∆ρ_k)    after  ∆ρ_0 = φ′_1(ασ̂_0 + γ) · (∆α σ̂_0)
    ∆τ_{k+1} = φ′_2(δρ_{k+1} + ε) · (δ∆ρ_{k+1})                    after  ∆τ_0 = φ′_2(δρ_0 + ε) · (δ∆ρ_0)
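This recurrence is again easy to run forward. A minimal Python sketch (ours, not the paper's; the function names are our own, and we fix both activations to the sigmoid, matching Example 31 below) computes (τ_k, ∆τ_k) one entry at a time:

    import math

    def phi(x):   # sigmoid activation, phi(x) = 1 / (1 + e^{-x})
        return 1.0 / (1.0 + math.exp(-x))

    def dphi(x):  # its derivative, phi'(x) = phi(x) * (1 - phi(x))
        p = phi(x)
        return p * (1.0 - p)

    def elman_alpha_derivative(sighat, alpha, dalpha, beta, gamma, delta, eps):
        """Yield (tau_k, Delta tau_k) for E_sighat at alpha, entry by entry."""
        rho, drho = 0.0, 0.0  # seeds; beta * 0 = 0 recovers the 'after' clauses
        for s in sighat:
            pre = alpha * s + beta * rho + gamma
            rho, drho = phi(pre), dphi(pre) * (dalpha * s + beta * drho)
            yield phi(delta * rho + eps), dphi(delta * rho + eps) * delta * drho

    # Instantiated with the parameters of Example 31 below:
    pairs = elman_alpha_derivative([1.0] * 4, alpha=1.0, dalpha=0.1,
                                   beta=1.0, gamma=0.1, delta=1.0, eps=-0.1)
    for tau_k, dtau_k in pairs:
        print(round(tau_k, 5), round(dtau_k, 5))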
Example 31. Let us take a very specific example to illustrate this process. We instantiate the above Elman network with α = β = δ = 1, γ = 0.1, ε = −0.1, and φ_1 = φ_2 both the sigmoid function. (The sigmoid function φ : R → R is defined by φ(x) = 1/(1 + e^{−x}). It is traditionally denoted by σ, but since we have been using σ as a sequence variable we use φ.)

We suppose our ground truth data tells us a sequence starting σ̂ = (1, 1, 1, 1, . . .) should be sent to a sequence starting τ̂. In reality, rounded to 5 decimal places, our Elman network as currently parametrized sends σ̂ to a sequence whose entries each exceed the corresponding entries of τ̂. Our task is to decide how to adjust α so that the new network will better match our data, in particular reducing every entry by about 0.05.

To do so, we use the recurrence for E_σ̂ from above with our particular choice of parameters. Since we have chosen many coefficients and all the entries of σ̂ to be 1, there is significant simplification:

    ρ_{k+1} = φ(ρ_k + 1.1)                    after  ρ_0 = φ(1.1)
    τ_{k+1} = φ(ρ_{k+1} − 0.1)                 after  τ_0 = φ(ρ_0 − 0.1)
    ∆ρ_{k+1} = φ′(ρ_k + 1.1) · (∆α + ∆ρ_k)     after  ∆ρ_0 = φ′(1.1) · ∆α
    ∆τ_{k+1} = φ′(ρ_{k+1} − 0.1) · ∆ρ_{k+1}    after  ∆τ_0 = φ′(ρ_0 − 0.1) · ∆ρ_0

The only free variable in this recurrence is ∆α. We choose ∆α = 0.1, for reasons to be explained later. Then we can compute ∆τ, a sequence of small positive values on the order of a few thousandths.

What does this tell us? The recurrence is supposed to compute the derivative of E_σ̂ at 1 and apply the resulting linear map to 0.1. Using the interpretation of derivative as approximate change, this suggests that if we increase our parameter α from its current value of 1 by ∆α = 0.1, we should expect E_σ̂(1.1) to be about E_σ̂(1) + ∆τ. Since our goal is to reduce the output of the network, this adjustment is not a great idea.

What are we to do? One option is to pick a new value for ∆α and recompute the approximate change, but there is a smarter way. We know that the derivative of E_σ̂ at 1 is linear, so if we instead decrease α by 0.1, we would expect E_σ̂(0.9) to be about E_σ̂(1) − ∆τ. Indeed, after making this adjustment, we find E_σ̂(0.9) is close to this prediction. The adjustment ended up decreasing the result by about 0.00015 more than we predicted, which amounts to approximately a 5% overshot of the original prediction.

While it is nice to know our prediction about the change was fairly accurate, subtracting 0.1 from α has not achieved our goal: in each component, our Elman network's output decreased by at most 0.005 while we were trying to create a reduction of 0.05. A natural idea here would be to really exploit the linearity of the derivative and make a bigger adjustment to α, namely subtracting (0.05/0.005) · ∆α = 10 · ∆α = 1. Computing E_σ̂(0), we find it is much closer to our goal than E_σ̂(0.9) turned out to be.

This seems like good news, but if we check the accuracy of the prediction our derivative makes, we would find that the actual reduction from E_σ̂(1) to E_σ̂(0) is between 25% and 65% greater than the derivative predicted. Thus, though we were able to make greater progress aligning our network with ground truth, the bigger adjustment came with much greater error. This is a classic tradeoff in neural network training: the linear approximation provided by the derivative is only valid locally, so taking bigger steps along the gradient comes with potentially greater rewards in terms of improvements in network performance, but also carries extra risk that greater error could lead the training astray.

In this paper, we presented a basic differential calculus for causal functions between sequences of real-valued vectors. We gave a definition of derivative for causal functions, showed how to compute derivatives from this definition, and established many classical rules from multivariable calculus, including the chain, parallel, sum, product, reciprocal, and quotient rules. We additionally showed a rule unique to the causal calculus: the recurrence rule. We then showed how to use these rules in a practical example, namely the training of an Elman network.
Related work. We are not aware of other works directly treating differentiation of causal functions, though we suspect there may be connections to hard-core analysis literature. This work is obviously inspired in results and structure by standard undergraduate multivariable calculus, e.g. [11]. We also have a related categorical treatment of differentiation of causal functions [12] using the framework of Cartesian differential categories [2]. That is much more abstract than the present work, but when concretized to the current scenario would only apply to smooth causal functions.

Though we drew our example differentiable functions almost exclusively from Rutten's stream calculus [9], we would also like to point out signal flow graphs, an interesting graphical representation of causal functions investigated in e.g. [1, 3, 4, 7]. We expect that interpreting our differential calculus in this setting could yield a treatment of differentiation in string diagrams.

We suspect the recurrence rule we obtained, particularly when differentiating Elman networks, may also have connections to the automatic differentiation literature we are not aware of at this time. In particular, it does rather seem like the recurrence rule augments a recurrence with dual numbers.
Future directions. As neural networks become more advanced and practitioners find new and interesting ways of using gradients of these networks, we believe theoreticians have a role to play in systematizing the theory of these new applications of derivatives. We believe that the coalgebra community, as experts with many tools for understanding programs operating on infinite data structures, is particularly well-positioned to help develop these theories. For example, nearly every rule of causal differentiation we established here relies on a coalgebraically-derived property from Rutten's stream calculus [9]. We looked at functions on sequences in particular, but we have every reason to believe further results are possible for more advanced neural network architectures on more exotic infinite data structures.

We are particularly interested in merging our results here with a line of research initiated in [12] using Cartesian differential categories. We believe this causal calculus could be an instance of a Cartesian differential restriction category [5], which would drastically improve the scope of our previous results to cover partial and non-smooth causal functions.
References
[1] Henning Basold, Marcello Bonsangue, Helle Hvid Hansen, and Jan Rutten. (Co)Algebraic Characterizations of Signal Flow Graphs, pages 124–145. Springer International Publishing, Cham, 2014. doi:10.1007/978-3-319-06880-0_6.
[2] R. F. Blute, J. R. B. Cockett, and R. A. G. Seely. Cartesian differential categories. Theory and Applications of Categories, 22:622–672, 2009.
[3] Filippo Bonchi, Paweł Sobociński, and Fabio Zanasi. A categorical semantics of signal flow graphs. In CONCUR 2014 - Concurrency Theory - 25th International Conference, CONCUR 2014, Rome, Italy, September 2-5, 2014, Proceedings, pages 435–450, 2014. doi:10.1007/978-3-662-44584-6_30.
[4] Filippo Bonchi, Paweł Sobociński, and Fabio Zanasi. Full abstraction for signal flow graphs. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2015, Mumbai, India, January 15-17, 2015, pages 515–526, 2015. doi:10.1145/2676726.2676993.
[5] J. R. B. Cockett, G. S. H. Cruttwell, and J. D. Gallagher. Differential restriction categories. Theory and Applications of Categories, 25(21):537–613, 2011.
[6] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, Mar 1990. doi:10.1207/s15516709cog1402_1.
[7] Stefan Milius. A sound and complete calculus for finite stream circuits. In Proceedings of the 25th Annual IEEE Symposium on Logic in Computer Science, LICS 2010, 11-14 July 2010, Edinburgh, United Kingdom, pages 421–430, 2010. doi:10.1109/LICS.2010.11.
[8] Jan Rutten, Clemens Kupke, and Helle Hvid Hansen. Stream differential equations: Specification formats and solution methods. Logical Methods in Computer Science, 13, 2017.
[9] J.J.M.M. Rutten. A coinductive calculus of streams. Mathematical Structures in Computer Science, 15(1):93–147, Feb 2005. doi:10.1017/S0960129504004517.
[10] J.J.M.M. Rutten. Algebraic specification and coalgebraic synthesis of Mealy automata. Electronic Notes in Theoretical Computer Science, 160:305–319, Aug 2006. doi:10.1016/j.entcs.2006.05.030.
[11] Michael Spivak. Calculus on Manifolds. 1965.
[12] David Sprunger and Shin-ya Katsumata. Differentiable causal computations via delayed trace. CoRR, abs/1903.01093, 2019. URL: http://arxiv.org/abs/1903.01093.
[13] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, Oct 1990. doi:10.1109/5.58337.