Using Echo State Networks to Approximate Value Functions for Control
Allen G. Hart, Kevin R. Olding, A. M. G. Cox, Olga Isupova, J. H. P. Dawes
Echo State Networks for Reinforcement Learning
Allen G. Hart [email protected]
Department of Mathematical Sciences, University of Bath, Bath BA2 7AY, UK
Kevin R. Olding [email protected]
Department of Mathematical Sciences, University of Bath, Bath BA2 7AY, UK
Alexander M. G. Cox [email protected]
Department of Mathematical Sciences, University of Bath, Bath BA2 7AY, UK
Olga Isupova [email protected]
Department of Computer Science, University of Bath, Bath BA2 7AY, UK
Jonathan H. P. Dawes [email protected]
Department of Mathematical Sciences, University of Bath, Bath BA2 7AY, UK
Abstract
An Echo State Network (ESN) is a type of single-layer recurrent neural network with randomly-chosen internal weights and a trainable output layer. We prove under mild conditions that a sufficiently large Echo State Network can approximate the value function of a broad class of stochastic and deterministic control problems. Such control problems are generally non-Markovian.

We describe how the ESN can form the basis for novel and computationally efficient reinforcement learning algorithms in a non-Markovian framework. We demonstrate this theory with two examples. In the first, we use an ESN to solve a deterministic, partially observed, control problem which is a simple game we call 'Bee World'. In the second example, we consider a stochastic control problem inspired by a market making problem in mathematical finance. In both cases we can compare the dynamics of the algorithms with analytic solutions to show that even after only a single reinforcement policy iteration the algorithms perform with reasonable skill.
Keywords:
Echo State Networks; Liquid State Machines; Reservoir Computing; Stochastic Optimal Control; Mathematical Finance; Reinforcement Learning
1. Introduction
An Echo State Network (ESN) is a special type of single-layer recurrent neural network introduced at the turn of the millennium by Jaeger (2001) and Maass et al. (2002) to study time series. Training is fast because the training step involves only the selection of weights in the output layer rather than updating the internal weights in the recurrent layer. Furthermore, the simple formulation of ESNs renders them amenable to mathematical analysis. Given a time series $z_k$ (where $k$ is the discrete time index) of $m$-dimensional data points, an ESN is set up as follows. We randomly generate a $d \times d$ reservoir matrix $A$, a $d \times m$ input matrix $C$ and a $d \times 1$ bias vector $\zeta$. Then we iteratively generate a sequence of $d$-dimensional reservoir state vectors $x_k$ according to
$$ x_{k+1} = \sigma(A x_k + C z_k + \zeta) $$
where $\sigma(x)_i = \max(0, x_i)$ is the rectified linear unit (ReLU) activation function. Observe that the $k$th reservoir state $x_k$ depends on all past data points $\ldots, z_{k-2}, z_{k-1}$ and therefore captures non-Markovian temporal correlations in the data. If the 2-norm of the reservoir matrix satisfies $\|A\| < 1$ then, as $n$ tends to infinity, the influence on the reservoir state $x_{k+n}$ of the data points $\ldots, z_{k-2}, z_{k-1}$ in the distant past becomes arbitrarily small. This is called the fading memory property and is closely related to the echo state property (ESP) introduced in the context of ESNs by Jaeger (2001). The ESP is the statement that the sequence of reservoir states $(x_k)_{k \in \mathbb{Z}}$ is, for a given input data sequence $(z_k)_{k \in \mathbb{Z}}$, uniquely determined.

When an ESN has the ESP, it can be applied to a class of supervised learning problems where we have a time series of $m$-dimensional data points $r_k$, called targets, that depend on all previous input time series data $\ldots, z_{k-2}, z_{k-1}, z_k$ and we seek to learn the relationship between the sequence of past data points and the target for each $k$. We can train an ESN to solve this problem by finding the $d \times m$ matrix $W$ that minimises
$$ \sum_{k=0}^{\ell-1} \| W^\top x_k - r_k \|^2 + \lambda \|W\|^2, $$
where $\ell$ is the number of labelled data points, and $\lambda > 0$ is the Tikhonov regularisation (a.k.a. ridge regression) parameter. Throughout this paper, $\|\cdot\|$ denotes the matrix 2-norm, vector 2-norm or modulus, depending on whether the argument is a matrix, vector, or scalar, respectively.

This minimisation problem can be solved using regularised linear least squares regression, and hence we can both obtain $W$ quickly, and guarantee that $W$ is the global optimum. This compares extremely favourably with training a (deep) neural network with stochastic gradient descent and backpropagation, which takes considerably longer, and may not converge to the global optimum (Schlegel et al., 2018).

Despite the training procedure being entirely linear, ESNs are universal approximators, and can therefore model arbitrarily complex relationships between the sequence of past data points and the targets. This is made formal in a recent result by Gonon et al. (2020) that we review here and then build on. We emphasise that not only are ESNs theoretically very promising, they have performed remarkably well in practice on problems ranging from seizure detection, to robot control, handwriting recognition, and financial forecasting, where ESNs have won competitions (Lukoševičius and Jaeger, 2009; Lukoševičius et al., 2012; Rodan and Tino, 2011; Triefenbach et al., 2010).
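To make this training procedure concrete, the following short sketch (our illustration in Python/NumPy, not code from the paper or any ESN library) drives a randomly generated reservoir with an input sequence and fits the readout $W$ by ridge regression; the reservoir size, input data, scaling and regularisation values below are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, ell = 300, 2, 2000      # reservoir size, input dimension, number of samples (assumed)
lam = 1e-3                    # ridge regularisation parameter (assumed)

# Random internal weights; A rescaled so that ||A||_2 = 1 (the conventional set-up used
# in the numerical sections of this paper, not Procedure 2.6).
A = rng.uniform(-0.05, 0.05, (d, d))
A /= np.linalg.norm(A, 2)
C = rng.uniform(-0.05, 0.05, (d, m))
zeta = rng.uniform(-0.05, 0.05, d)

def relu(x):
    return np.maximum(x, 0.0)

# Placeholder input sequence z_k and targets r_k.
z = rng.uniform(0.0, 1.0, (ell, m))
r = rng.uniform(0.0, 1.0, ell)

X = np.zeros((ell, d))
x = np.zeros(d)
for k in range(ell):
    x = relu(A @ x + C @ z[k] + zeta)   # x_{k+1} = sigma(A x_k + C z_k + zeta)
    X[k] = x                            # reservoir state after seeing ..., z_{k-1}, z_k

# Ridge-regression readout: minimise sum_k ||W^T x_k - r_k||^2 + lam ||W||^2.
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ r)
prediction = X @ W                      # approximates the targets r_k
```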
Impressively, ESNs outperformed RNNs and LSTMs at a chaotic time series prediction task by a factor of over 2400 (Jaeger and Haas, 2004). ESNs have also proved themselves competitive in reinforcement learning (Szita et al., 2006) and control (Peitz and Bieker, 2021).

In a sequence of papers, Grigoryeva and Ortega (2018), Grigoryeva and Ortega (2019), and Gonon et al. (2020) recently analysed ESNs in the context of nonlinear filters and functionals. Roughly speaking, a filter $U$ is a map from a bi-infinite sequence $\ldots, z_{-2}, z_{-1}, z_0, z_1, z_2, \ldots$ of real vectors to another bi-infinite sequence of real vectors $\ldots, x_{-2}, x_{-1}, x_0, x_1, x_2, \ldots$, and a functional $H$ maps a bi-infinite sequence $\ldots, z_{-2}, z_{-1}, z_0, z_1, z_2, \ldots$ of real vectors to a single real vector or number. We can view an ESN as a filter that maps an input sequence $\ldots, z_{-1}, z_0, z_1, z_2, \ldots$ to a reservoir sequence $\ldots, x_{-1}, x_0, x_1, x_2, \ldots$, or a functional that maps $\ldots, z_{-1}, z_0, z_1, z_2, \ldots$ to the lone reservoir state $x_0$. The theory of filters and functionals is therefore a natural theoretical setting for ESNs. Within this theory, this paper presents three novel results.

Our first result assumes that we have a time series of data $z_k$ and a set of targets $r_k$ that depend on all previous data points $\ldots, z_{k-1}, z_k$ via a functional $\mathcal{R}$ which sends infinite sequences of data points to targets. We then have a supervised learning problem of finding the relationship between the data and targets. In the special case that $z_k = r_k$, this problem is time series forecasting. Our first novel result states that if we have sufficiently many data points $z_k$, drawn from a stationary, ergodic, and bounded process $Z$, which need not be Markovian, and we obtain $W$ using regularised linear least squares, then a sufficiently large ESN will approximate, as closely as required, the functional $\mathcal{R}$ sending inputs $\ldots, z_{k-1}, z_k$ to the targets $r_k$.

This result has applications in the statistical inference of dynamical systems, which was recently reviewed by McGoff et al. (2015). This area of research is especially focused on statistical inference (i.e. learning) of stationary ergodic processes. Furthermore, we can use this result in the context of reinforcement learning (RL) and optimal control. We envisage an agent operating under a given policy (in the parlance of reinforcement learning) or control (in the parlance of control theory) that generates a sequence of (reward, action, observation) triples $z_k = (r_k, a_k, \omega_k)$. Then the trained ESN functional, which maps previous (reward, action, observation) triples $\ldots, z_{k-2}, z_{k-1}$ to rewards $r_k$, models the reward functional arbitrarily well. The set up does not assume the RL problem is Markovian, and allows for a continuous state space.

Our second novel result generalises the first, and encompasses the case where the functional $V$ is the value functional of a stochastic control process, or partially observed Markov Decision Process (MDP). By training an ESN to approximate the value functional, we establish a stepping stone toward developing an offline reinforcement learning algorithm supported by an ESN that can solve a large class of control problems. Moreover, since ESNs are recurrent, they can be used for non-Markovian problems, where a reinforcement learning agent must exploit its memory of past observations, actions and rewards.
Our third result is presented in the context of building an online reinforcement learning algorithm that can, under certain conditions, determine the optimal value function for a given policy.

We demonstrate some of these theoretical results numerically on two examples. The first is a deterministic game which we call 'Bee World'. The goal of the game for the bee is to navigate a time varying distribution of nectar in order to maximise the total future discounted value of the nectar acquired over all future time. The bee does not have access to the entire state space, and only observes the nectar it collects at each moment in time. The problem is therefore a partially observed Markov Decision Process which requires memory of the past to solve. We demonstrate how a simple and easily-configurable reinforcement learning algorithm supported by an ESN can learn to play Bee World with respectable skill.

The second numerical example is inspired by a market making problem in mathematical finance. The mathematical formulation of this problem boils down to a market maker seeking to control a one dimensional Brownian motion so that it stays near the origin. The cost of straying from the origin is quadratic in the distance from the origin, and the cost of applying a push toward the origin is quadratic in the strength of the push. The market maker must therefore balance the cost of applying the control against the cost of allowing the motion to drift too far from the origin. We briefly discuss the financial motivation for this problem, then solve it analytically in continuous and discrete time. The set up most commonly seen in the literature is continuous time, but only in discrete time is the problem suitable for an ESN. We then compare the optimal discrete time solution to a solution learned by a reinforcement learning agent supported by an ESN.

The structure of the paper closely follows the summary of results presented above. In section 2 we set up the mathematical formalism for ESNs and the functionals that we wish to approximate. Section 3 introduces our novel theoretical results, while sections 4 and 5 respectively present applications to a deterministic, and then a stochastic, optimal control problem. We conclude in section 6.
2. Background
In this section, we introduce the theory and notation of nonlinear filters (in relation to ESNs) developed by Grigoryeva and Ortega (2018), Grigoryeva and Ortega (2019), and Gonon et al. (2020). First of all, $(\mathbb{R}^m)^{\mathbb{Z}}$ is the set of maps with domain $\mathbb{Z}$ and codomain $\mathbb{R}^m$. We call this the set of bi-infinite $\mathbb{R}^m$-valued sequences.

Next, a filter is a map $U : (\mathbb{R}^m)^{\mathbb{Z}} \to (\mathbb{R}^d)^{\mathbb{Z}}$. A filter $U$ is called causal if inputs from the past and present $\ldots, z_{-2}, z_{-1}, z_0$ contribute to $U(z)_0$ but states in the future $z_1, z_2, \ldots$ do not. More formally, $U$ is causal if for all $z, y \in (\mathbb{R}^m)^{\mathbb{Z}}$ that satisfy $z_k = y_k$ for all $k \leq 0$ it follows that $U(z)_0 = U(y)_0$. We can now define the time shift filter $T : (\mathbb{R}^m)^{\mathbb{Z}} \to (\mathbb{R}^m)^{\mathbb{Z}}$ by $T(z)_k = z_{k+1}$, which we interpret as the map that steps forward one unit of time. A filter $U$ is called time invariant if $U$ commutes with the time shift operator $T$. If $U$ is a causal and time invariant filter then we call $U$ a causal time invariant (CTI) filter.

A functional is a map $H : (\mathbb{R}^m)^{\mathbb{Z}} \to \mathbb{R}^d$. Grigoryeva and Ortega (2019) show that there is a bijection between the space of CTI filters and the space of functionals. To see this, take a functional $H$ and define the $k$th term of the associated filter $U$ via $U(z)_k = H T^k(z)$. Conversely, given a filter $U$, the associated functional $H$ is given by $H(z) = U(z)_0$.

We can view an ESN as a CTI filter from the space of input sequences $\ldots, z_{-1}, z_0, z_1, \ldots$ to the space of reservoir sequences $\ldots, x_{-1}, x_0, x_1, \ldots$. To make this connection between ESNs and filters formal, we will first present a generalisation of an Echo State Network called a reservoir system.
Definition 2.1 (Reservoir system) Let $F : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$ and $h : \mathbb{R}^d \to \mathbb{R}^s$. Then we call the following system of equations
$$ x_{k+1} = F(x_k, z_k), \qquad r_k = h(x_k) \qquad (1) $$
a reservoir system.

Remark 2.2
We can see that if
$$ F(x, z) = \sigma(A x + C z + \zeta), \qquad h(x) = W^\top x $$
then we retrieve an ESN with $d \times d$ reservoir matrix $A$, $d \times m$ input matrix $C$, bias vector $\zeta \in \mathbb{R}^d$, linear output layer $W \in \mathbb{R}^d$, and activation function $\sigma = \mathrm{ReLU}$, defined in the introduction.
We require that the reservoir system induces a unique filter from the input sequence to the reservoir sequence. This property is the Echo State Property that we briefly mentioned in the introduction.
Definition 2.3 (Echo State Property (Jaeger, 2001)) A reservoir system has the Echo State Property (ESP) if for any $(z_k)_{k \in \mathbb{Z}} \in (\mathbb{R}^m)^{\mathbb{Z}}$ there exists a unique $(x_k)_{k \in \mathbb{Z}} \in (\mathbb{R}^d)^{\mathbb{Z}}$ satisfying the equations of the reservoir system (1).

To any reservoir system with the Echo State Property we can associate a unique CTI reservoir filter $U : (\mathbb{R}^m)^{\mathbb{Z}} \to (\mathbb{R}^d)^{\mathbb{Z}}$ defined by $U(z)_k = x_k$. To this reservoir filter, we may assign a CTI reservoir functional $H : (\mathbb{R}^m)^{\mathbb{Z}} \to \mathbb{R}^d$ defined by $H(z) = x_0$. In a supervised learning context, we have a time series of data points $\ldots, z_{-2}, z_{-1}, z_0$ and a time series of targets $\ldots, r_{-1}, r_0$ that each depend on all previous data points. The output functional $h \circ H : (\mathbb{R}^m)^{\mathbb{Z}} \to \mathbb{R}$ is the map we use to approximate the relationship between the data and the targets, so $h \circ H(\ldots, z_{k-1}, z_k, z_{k+1}, \ldots) \approx r_k$. Note that $h \circ H$ is causal, so does not peer into the future and use data $z_{k+1}, z_{k+2}, \ldots$ that has not yet occurred. When the reservoir system is an ESN, the map $h$ is the linear map $W^\top$ obtained by least squares ridge regression, so that $W^\top H(\ldots, z_{k-1}, z_k, \ldots) \approx r_k$. We assume there exists a true map from the data to the targets that we label $\mathcal{R} : (\mathbb{R}^m)^{\mathbb{Z}} \to \mathbb{R}$ so that $\mathcal{R}(\ldots, z_{k-1}, z_k, \ldots) = r_k$. Our goal is to find $W$ such that $W^\top H \approx \mathcal{R}$.

Definition 2.4 (ESN filter and functional) If an ESN has the ESP then we will write $H^{A,C,\zeta}$ to denote the reservoir functional associated to an ESN with parameters $A$, $C$ and $\zeta$. We will also write $H^{A,C,\zeta}_W$ to denote the output functional $W^\top H^{A,C,\zeta}$ (defined by left multiplication of $H^{A,C,\zeta}$ by the linear readout layer).

Next, we will present a procedure, introduced by Gonon et al. (2020), for randomly generating the ESN's internal weights $A$, $C$ and biases $\zeta$, which ensures the ESN has the ESP and allows for the universal approximation of target functionals $\mathcal{R}$. The procedure differs from the procedure commonly seen in the literature, where $A$, $C$, $\zeta$ are populated with i.i.d. Gaussians, or i.i.d. uniform deviates, and then $A$ is rescaled so that its 2-norm (or spectral radius) is less than 1. Furthermore, the procedure introduced by Gonon et al. (2020) depends on some details of the input process, which must satisfy mild conditions stated below.

Definition 2.5 (Admissible input process) An $(\mathbb{R}^m)^{\mathbb{Z}}$-valued random variable $Z$ is called an admissible process if for any $T \in \mathbb{N}$ there exists $M_T > 0$ such that for all $k \in \mathbb{Z}$
$$ \|(Z_{k-T}, Z_{k-T+1}, \ldots, Z_k)\| \leq M_T \qquad (2) $$
Lebesgue-almost surely.
We will now present a procedure by which the matrices $A$, $C$, $\zeta$ are randomly generated.

Procedure 2.6 (Initialising the random weights of an ESN) Let $N \in \mathbb{N}$, $R > 0$ be the input parameters for the procedure. Suppose that $Z$ is an admissible input process. Consequently, for any $T \in \mathbb{N}$ there exists $M_T$ such that (for $k = 0$ in (2))
$$ \|(Z_{-T}, Z_{-T+1}, \ldots, Z_0)\| \leq M_T $$
Lebesgue-almost surely. Then, for a given $T$, we initialise the ESN reservoir matrix $A$, input matrix $C$, and biases $\zeta$ according to the following procedure.

1. Draw $N$ i.i.d. samples $A_1, \ldots, A_N$ from the uniform distribution on $B_R \subset \mathbb{R}^{d(T+1)}$, where $B_R$ is the ball of radius $R$ and centre 0, and draw $N$ i.i.d. samples $\zeta_1, \ldots, \zeta_N$ from the uniform distribution on $[-\max(M_T R, 1), \max(M_T R, 1)]$.

2. Let $S$ and $c$ be shift matrices defined
$$ S = \begin{pmatrix} 0_{d,dT} & 0_{d,d} \\ I_{dT} & 0_{dT,d} \end{pmatrix}, \qquad c = \begin{pmatrix} I_d \\ 0_{dT,d} \end{pmatrix} $$
and set
$$ a = \begin{pmatrix} A_1^\top \\ \vdots \\ A_N^\top \end{pmatrix}, \quad \bar{A} = \begin{pmatrix} S & 0_{d(T+1),N} \\ a & 0_{N,N} \end{pmatrix}, \quad \bar{C} = \begin{pmatrix} c \\ a c \end{pmatrix}, \quad \bar{\zeta} = \begin{pmatrix} 0_{d(T+1)} \\ \zeta_1 \\ \vdots \\ \zeta_N \end{pmatrix} $$
so that
$$ A = \begin{pmatrix} \bar{A} & -\bar{A} \\ -\bar{A} & \bar{A} \end{pmatrix}, \qquad C = \begin{pmatrix} \bar{C} \\ -\bar{C} \end{pmatrix}, \qquad \zeta = \begin{pmatrix} \bar{\zeta} \\ -\bar{\zeta} \end{pmatrix}. $$

We are now ready to present the key result by Gonon et al. (2020) (which generalises a result by Hart et al. (2020a)), and which holds in the following supervised learning context. Given time series data $z_k$ (from an admissible process $Z$) and a time series of targets $r_k$ depending on all previous data $\ldots, z_{k-2}, z_{k-1}$, we wish to approximate the functional that sends $\ldots, z_{k-2}, z_{k-1}$ to $r_k$. We will denote this functional $\mathcal{R}$. The problem of approximating $\mathcal{R}$ given the data and targets is a supervised learning problem. The result can be summarised as follows. Suppose we have an ESN with weights $A$, $C$ and biases $\zeta$ randomly generated by Procedure 2.6. Then, the ESN admits a linear readout matrix $W$ for which the ESN equipped with the matrix $W$ (denoted $H^{A,C,\zeta}_W$) approximates the relationship $\mathcal{R}$ between data points $\ldots, z_{k-2}, z_{k-1}$ and targets $r_k$ as closely as is required.

Theorem 2.7 (Gonon et al. (2020))
Suppose that $Z$ is an admissible input process. Let $\mathcal{R} : (D^m)^{\mathbb{Z}} \to \mathbb{R}$ (where $D^m$ is a compact subset of $\mathbb{R}^m$) be CTI and measurable with respect to some measure $\mu$ such that $\mathbb{E}_\mu[|\mathcal{R}(Z)|^2] < \infty$.

Then for any $\epsilon > 0$ and $\delta \in (0,1)$ there exist $N \in \mathbb{N}$, $R > 0$ such that, with probability $(1-\delta)$, the ESN with parameters $A$, $C$, $\zeta$ generated by Procedure 2.6 (with inputs $N, R$) has the ESP and admits a readout layer $W$ such that
$$ \left( \mathbb{E}_\mu\!\left[ \left\| H^{A,C,\zeta}_W(Z) - \mathcal{R}(Z) \right\|^2 \,\middle|\, A, C, \zeta \right] \right)^{1/2} := \left( \int_{(D^m)^{\mathbb{Z}}} \left\| H^{A,C,\zeta}_W(z) - \mathcal{R}(z) \right\|^2 d\mu(z) \right)^{1/2} < \epsilon. $$
3. Novel Results about ESNs
Theorem 2.7 is an existence result stating that there exists a linear readout layer $W$ yielding an arbitrarily good approximation. Our first novel contribution is to strengthen the result under additional assumptions. The new result states that, given a sufficiently large ESN and sufficiently many training data $z_k$ drawn from a stationary, ergodic and bounded process $Z$, if we train an ESN using regularised least squares then the arbitrarily good readout layer $W$ will be attained (with probability as close to 1 as desired). This result is analogous to the main result by Hart et al. (2020b), who prove a similar theorem for ESNs trained on deterministic inputs.

Our result holds in the following supervised learning context. Given time series data $z_k$ (from an admissible, stationary, ergodic, bounded process $Z$) and a time series of targets $r_k$ depending on all previous data $\ldots, z_{k-2}, z_{k-1}$, we wish to approximate the mapping from $\ldots, z_{k-2}, z_{k-1}$ to $r_k$. This mapping is denoted $\mathcal{R}$. Our result states that an ESN with weights $A$, $C$ and biases $\zeta$ randomly generated by Procedure 2.6, which is fed the training data $z_k$, and then trained by regularised least squares, will yield a matrix $W$. This ESN equipped with the matrix $W$ (denoted $H^{A,C,\zeta}_W$) will approximate the relationship $\mathcal{R}$ between data points $\ldots, z_{k-2}, z_{k-1}$ and targets $r_k$ as closely as required.

Theorem 3.1
Suppose that $Z$ is an admissible input process that is also stationary and ergodic, with invariant measure $\mu$. Let $z$ denote an arbitrary realisation of $Z$. Let $\mathcal{R} : (D^m)^{\mathbb{Z}} \to \mathbb{R}$ (where $D^m$ is a compact subset of $\mathbb{R}^m$) be CTI, $\mu$-measurable, and satisfy $\mathbb{E}_\mu[|\mathcal{R}(Z)|^2] < \infty$.

Then for any $\epsilon > 0$ and $\delta \in (0,1)$ there exist $N \in \mathbb{N}$, $R > 0$, $\lambda^* > 0$ and $\ell \in \mathbb{N}$ such that the ESN with parameters $A$, $C$, $\zeta$ generated by Procedure 2.6 (with inputs $N, R$), and $W^*_\ell$ which minimises the least squares problem
$$ \frac{1}{\ell} \sum_{k=0}^{\ell-1} \left\| H^{A,C,\zeta}_{W^*_\ell} T^{-k}(z) - \mathcal{R}\, T^{-k}(z) \right\|^2 + \lambda \|W^*_\ell\|^2, $$
(where $\lambda \in (0, \lambda^*)$) satisfies with probability $(1-\delta)$ the inequality
$$ \mathbb{E}_\mu\!\left[ \left\| H^{A,C,\zeta}_{W^*_\ell}(Z) - \mathcal{R}(Z) \right\| \,\middle|\, A, C, \zeta \right] < \epsilon. $$

Proof
Later in this paper, we state and prove a result (Theorem 3.2) which admits this present result in the special case that $\gamma = 0$.

In summary, we have stated that for any $\epsilon > 0$ and $\delta \in (0,1)$ there exists an ESN of size $d$ with output layer $W$, trained by the Tikhonov-regularised least squares procedure against $\ell$ training points, whose output functional approximates the target arbitrarily closely with arbitrarily high probability. The theorem is (sadly) non-constructive in the sense that the number of neurons $d$, number of training points $\ell$ and regularisation parameter $\lambda^*$ are not computed for a given $\epsilon$ and $\delta$. We believe, heuristically, that the number of neurons $d$ required grows with the complexity of the target functional $\mathcal{R}$, while the number of training points $\ell$ grows with the mixing time of the ergodic process $Z$.

We will now pivot towards our second novel result, which generalises the first. Suppose that we have a contraction mapping $\Phi$ on the space of functionals, and we seek a $W^*$ such that the ESN functional $H^{A,C,\zeta}_{W^*}$ approximates the unique fixed point of $\Phi$. The existence of the unique fixed point is guaranteed by Banach's fixed point theorem. Finding the fixed point of a contraction mapping has applications in reinforcement learning because the optimal value function (and optimal quality function) of a Markov Decision Process (MDP) is a fixed point of a Bellman operator. The theory we are presenting here can be viewed as a generalisation of an MDP because the input processes we are considering may have long time correlations (violating the Markov property) which can only be recognised by filters with sufficiently long and robust memories, like Echo State Networks.

We can observe first of all that if $\Phi$ is the constant map $\Phi(H) = \mathcal{R}$, then $\Phi$ is clearly a contraction mapping with fixed point $\mathcal{R}$. In this case, the problem is exactly the same as that solved by Theorem 3.1. We are especially interested in the case of $\Phi$ taking the form of the Bellman value operator. To make this formal, we will consider a stationary ergodic process $Z$ with invariant measure $\mu$. Then we define the map $T_Z$ as a CTI filter on the bi-infinite sequences $(D^m)^{\mathbb{Z}}$, which returns the random variable
$$ T_Z(z)_k = \begin{cases} z_{k+1} & \text{if } k < 0 \\ Z_{k+1} \mid Z_j = z_j \ \forall j \leq 0 & \text{if } k \geq 0. \end{cases} $$

Next, we introduce $\mathcal{R} : (D^m)^{\mathbb{Z}} \to \mathbb{R}$ as the CTI reward functional, giving a reward (or expectation over a distribution of rewards) to an agent that has observed a given sequence of (reward, action, observation) triples. We let $\gamma \in [0,1)$ denote the discount factor, and define the operator
$$ \Phi(H)(z) := \mathcal{R}(z) + \gamma\, \mathbb{E}_\mu[H T_Z(z)]. \qquad (3) $$
In this case, $\Phi$ is a contraction mapping with Lipschitz constant $\gamma$. With this, we will define the CTI value functional $V : (D^m)^{\mathbb{Z}} \to \mathbb{R}$ (with respect to the process $Z$) as
$$ V(z) := \mathbb{E}_\mu\!\left[ \sum_{k=0}^{\infty} \gamma^k \mathcal{R}\, T^k(Z) \,\middle|\, Z_j = z_j \ \forall j \leq 0 \right]. $$
The value functional $V$ takes a sequence of (reward, action, observation) triples and returns the expected discounted sum of future rewards. Furthermore, the value functional $V$ is the unique fixed point of the Bellman operator $\Phi$.
Re-arranging the definition of $V(z)$ above, we have that
$$ \begin{aligned} V(z) &= \mathbb{E}_\mu\!\left[ \sum_{k=0}^{\infty} \gamma^k \mathcal{R}\, T^k(Z) \,\middle|\, Z_j = z_j \ \forall j \leq 0 \right] \\ &= \mathbb{E}_\mu\!\left[ \sum_{k=1}^{\infty} \gamma^k \mathcal{R}\, T^k(Z) \,\middle|\, Z_j = z_j \ \forall j \leq 0 \right] + \mathcal{R}(z) \\ &= \gamma\, \mathbb{E}_\mu\!\left[ \sum_{k=0}^{\infty} \gamma^k \mathcal{R}\, T^{k+1}(Z) \,\middle|\, Z_j = z_j \ \forall j \leq 0 \right] + \mathcal{R}(z) \\ &= \gamma\, \mathbb{E}_\mu\!\left[ \sum_{k=0}^{\infty} \gamma^k \mathcal{R}\, T^k(T(Z)) \,\middle|\, Z_j = z_j \ \forall j \leq 0 \right] + \mathcal{R}(z), \end{aligned} $$
where we have carried out straightforward relabellings of the indexing of terms in the sum by $k$. Then by the law of total expectation we may write this last expression as
$$ V(z) = \gamma\, \mathbb{E}_\mu\!\left[ \mathbb{E}_\mu\!\left[ \sum_{k=0}^{\infty} \gamma^k \mathcal{R}\, T^k(Z) \,\middle|\, Z_j = T_Z(z)_j \ \forall j \leq 0 \right] \right] + \mathcal{R}(z) = \gamma\, \mathbb{E}_\mu[V T_Z(z)] + \mathcal{R}(z) = \Phi(V)(z), $$
which shows that $V$ is indeed a fixed point of $\Phi$, and so is the unique such, since $\Phi$ is a contraction.

Our goal is now to seek a $W^*$ such that the ESN functional $H^{A,C,\zeta}_{W^*}$ closely approximates the unique fixed point $V$ of $\Phi$. One approach is to collect a dataset from a single training trajectory, and then perform least squares regression to find $W^*$. This is an example of offline learning (in the reinforcement learning parlance) because the training occurs after the data has been collected. This is in contrast to online learning, where training takes place dynamically as new data becomes available. We will make this offline approach formal in the following theorem.
Theorem 3.2 Suppose that $Z$ is an admissible input process that is also stationary and ergodic with invariant measure $\mu$. Let $z$ denote an arbitrary realisation of $Z$. Let $\mathcal{R} : (D^m)^{\mathbb{Z}} \to \mathbb{R}$ be $\mu$-measurable and satisfy $\mathbb{E}[|\mathcal{R}(Z)|^2] < \infty$, and define $\Phi$ using (3) on the $\mu$-measurable functionals $H$ that satisfy $\mathbb{E}_\mu[|H(Z)|^2] < \infty$. Let $\gamma \in [0,1)$.

Then for any $\epsilon > 0$, $\delta \in (0,1)$ there exist $N \in \mathbb{N}$, $R, \lambda^* > 0$ and $\ell \in \mathbb{N}$ such that the ESN with parameters $A$, $C$, $\zeta$ generated by Procedure 2.6 (with inputs $N, R$), and $W^*_\ell$ minimising the least squares problem
$$ \frac{1}{\ell} \sum_{k=0}^{\ell-1} \left\| W^{*\top}_\ell \left( H^{A,C,\zeta} T^{-k}(z) - \gamma H^{A,C,\zeta} T^{-k+1}(z) \right) - \mathcal{R}\, T^{-k}(z) \right\|^2 + \lambda \|W^*_\ell\|^2 $$
where $\lambda \in (0, \lambda^*)$, satisfies with probability $(1-\delta)$
$$ \mathbb{E}_\mu\!\left[ \left\| H^{A,C,\zeta}_{W^*_\ell}(Z) - \Phi H^{A,C,\zeta}_{W^*_\ell}(Z) \right\| \,\middle|\, A, C, \zeta \right] < \epsilon. $$

Proof
First let $V$ be the unique fixed point of the contraction mapping $\Phi$, whose existence and uniqueness is guaranteed by Banach's fixed point theorem. Denote the Lipschitz constant of $\Phi$ by the symbol $\tau$. We fix $\epsilon > 0$ and $\delta \in (0,1)$; then by Theorem 2.7 there exists with probability $(1-\delta)$ a linear readout $W \in \mathbb{R}^d$ such that
$$ \mathbb{E}_\mu\!\left[ \left\| H^{A,C,\zeta}_W(Z) - V(Z) \right\| \,\middle|\, A, C, \zeta \right] < \frac{\epsilon}{3(1+\tau)}. \qquad (4) $$
Then it follows that
$$ \begin{aligned} \mathbb{E}_\mu[\|H^{A,C,\zeta}_W - \Phi H^{A,C,\zeta}_W\| \mid A,C,\zeta] &= \mathbb{E}_\mu[\|H^{A,C,\zeta}_W(Z) - \Phi H^{A,C,\zeta}_W(Z) + V(Z) - V(Z)\| \mid A,C,\zeta] \\ &\leq \mathbb{E}_\mu[\|H^{A,C,\zeta}_W(Z) - V(Z)\| \mid A,C,\zeta] + \mathbb{E}_\mu[\|V(Z) - \Phi H^{A,C,\zeta}_W(Z)\| \mid A,C,\zeta] \\ &= \mathbb{E}_\mu[\|H^{A,C,\zeta}_W(Z) - V(Z)\| \mid A,C,\zeta] + \mathbb{E}_\mu[\|\Phi V(Z) - \Phi H^{A,C,\zeta}_W(Z)\| \mid A,C,\zeta] \\ &\leq \mathbb{E}_\mu[\|H^{A,C,\zeta}_W(Z) - V(Z)\| \mid A,C,\zeta] + \tau\, \mathbb{E}_\mu[\|V(Z) - H^{A,C,\zeta}_W(Z)\| \mid A,C,\zeta] \\ &= (1+\tau)\, \mathbb{E}_\mu[\|V(Z) - H^{A,C,\zeta}_W(Z)\| \mid A,C,\zeta] < (1+\tau)\, \frac{\epsilon}{3(1+\tau)} \quad \text{by (4)}, \end{aligned} $$
which yields the estimate
$$ \mathbb{E}_\mu[\|H^{A,C,\zeta}_W - \Phi H^{A,C,\zeta}_W\| \mid A,C,\zeta] < \frac{\epsilon}{3}. \qquad (5) $$
Now, we can choose $\lambda^*$ such that for any $\lambda \in (0, \lambda^*)$
$$ \lambda \|W\|^2 < \frac{\epsilon}{6}. \qquad (6) $$
Next we define a sequence of vectors $(W^*_j)_{j \in \mathbb{N}}$ by
$$ W^*_j = \operatorname*{arg\,min}_{U \in \mathbb{R}^d} \left( \frac{1}{j} \sum_{k=0}^{j-1} \| H^{A,C,\zeta}_U T^{-k}(z) - \gamma H^{A,C,\zeta}_U T^{-k+1}(z) - \mathcal{R}\, T^{-k}(z) \|^2 + \lambda \|U\|^2 \right). $$
We may view $\arg\min$ as a continuous map on the space of strictly convex $C^2$ functions that returns their unique minimiser. The regularised linear least squares problem is a strictly convex $C^2$ problem, so it makes sense to write
$$ \begin{aligned} \lim_{j\to\infty} W^*_j &= \lim_{j\to\infty} \operatorname*{arg\,min}_{U \in \mathbb{R}^d} \left( \frac{1}{j} \sum_{k=0}^{j-1} \| H^{A,C,\zeta}_U T^{-k}(z) - \gamma H^{A,C,\zeta}_U T^{-k+1}(z) - \mathcal{R}\, T^{-k}(z) \|^2 + \lambda \|U\|^2 \right) \\ &= \operatorname*{arg\,min}_{U \in \mathbb{R}^d} \lim_{j\to\infty} \left( \frac{1}{j} \sum_{k=0}^{j-1} \| H^{A,C,\zeta}_U T^{-k}(z) - \gamma H^{A,C,\zeta}_U T^{-k+1}(z) - \mathcal{R}\, T^{-k}(z) \|^2 + \lambda \|U\|^2 \right) \\ &= \operatorname*{arg\,min}_{U \in \mathbb{R}^d} \left( \mathbb{E}[\| H^{A,C,\zeta}_U(Z) - \gamma H^{A,C,\zeta}_U T(Z) - \mathcal{R}(Z) \|^2 \mid A,C,\zeta] + \lambda \|U\|^2 \right), \end{aligned} $$
where the last equality holds by the Ergodic Theorem. We will denote by $W^*_\infty$ the limit of minimisers $W^*_\infty = \lim_{j\to\infty} W^*_j$. Now, we may choose $\ell \in \mathbb{N}$ sufficiently large that
$$ \left| \mathbb{E}_\mu[\| W^{*\top}_\ell (H^{A,C,\zeta}(Z) - \gamma H^{A,C,\zeta} T(Z)) - \mathcal{R}(Z) \| \mid A,C,\zeta] - \mathbb{E}_\mu[\| W^{*\top}_\infty (H^{A,C,\zeta}(Z) - \gamma H^{A,C,\zeta} T(Z)) - \mathcal{R}(Z) \| \mid A,C,\zeta] \right| < \frac{\epsilon}{6}, \qquad (7) $$
and
$$ \left| \lim_{j\to\infty} \left( \frac{1}{j} \sum_{k=0}^{j-1} \| W^{*\top}_j (H^{A,C,\zeta} T^{-k}(z) - \gamma H^{A,C,\zeta} T^{-k+1}(z)) - \mathcal{R}\, T^{-k}(z) \| + \lambda \|W^*_j\|^2 \right) - \left( \frac{1}{\ell} \sum_{k=0}^{\ell-1} \| W^{*\top}_\ell (H^{A,C,\zeta} T^{-k}(z) - \gamma H^{A,C,\zeta} T^{-k+1}(z)) - \mathcal{R}\, T^{-k}(z) \| + \lambda \|W^*_\ell\|^2 \right) \right| < \frac{\epsilon}{6}, \qquad (8) $$
and by the Ergodic Theorem
$$ \left| \frac{1}{\ell} \sum_{k=0}^{\ell-1} \| W^\top (H^{A,C,\zeta} T^{-k}(z) - \gamma H^{A,C,\zeta} T^{-k+1}(z)) - \mathcal{R}\, T^{-k}(z) \| - \lim_{j\to\infty} \frac{1}{j} \sum_{k=0}^{j-1} \| W^\top (H^{A,C,\zeta} T^{-k}(z) - \gamma H^{A,C,\zeta} T^{-k+1}(z)) - \mathcal{R}\, T^{-k}(z) \| \right| < \frac{\epsilon}{6}. \qquad (9) $$
Now the proof proceeds directly:
$$ \begin{aligned} \mathbb{E}_\mu[\| H^{A,C,\zeta}_{W^*_\ell}(Z) - \Phi H^{A,C,\zeta}_{W^*_\ell}(Z) \| \mid A,C,\zeta] &= \mathbb{E}_\mu[\| H^{A,C,\zeta}_{W^*_\ell}(Z) - \gamma H^{A,C,\zeta}_{W^*_\ell} T(Z) - \mathcal{R}(Z) \| \mid A,C,\zeta] \\ &= \mathbb{E}_\mu[\| W^{*\top}_\ell (H^{A,C,\zeta}(Z) - \gamma H^{A,C,\zeta} T(Z)) - \mathcal{R}(Z) \| \mid A,C,\zeta]. \end{aligned} $$
Then we apply (7), which yields
$$ \mathbb{E}_\mu[\| H^{A,C,\zeta}_{W^*_\ell}(Z) - \Phi H^{A,C,\zeta}_{W^*_\ell}(Z) \| \mid A,C,\zeta] < \mathbb{E}_\mu[\| W^{*\top}_\infty (H^{A,C,\zeta}(Z) - \gamma H^{A,C,\zeta} T(Z)) - \mathcal{R}(Z) \| \mid A,C,\zeta] + \frac{\epsilon}{6}. $$
Then we apply the Ergodic Theorem:
$$ \begin{aligned} &\mathbb{E}_\mu[\| W^{*\top}_\infty (H^{A,C,\zeta}(Z) - \gamma H^{A,C,\zeta} T(Z)) - \mathcal{R}(Z) \| \mid A,C,\zeta] + \frac{\epsilon}{6} \\ &\quad = \lim_{j\to\infty} \left( \frac{1}{j} \sum_{k=0}^{j-1} \| W^{*\top}_\infty (H^{A,C,\zeta} T^{-k}(z) - \gamma H^{A,C,\zeta} T^{-k+1}(z)) - \mathcal{R}\, T^{-k}(z) \| \right) + \frac{\epsilon}{6} \\ &\quad \leq \lim_{j\to\infty} \left( \frac{1}{j} \sum_{k=0}^{j-1} \| W^{*\top}_\infty (H^{A,C,\zeta} T^{-k}(z) - \gamma H^{A,C,\zeta} T^{-k+1}(z)) - \mathcal{R}\, T^{-k}(z) \| \right) + \lambda \|W^*_\infty\|^2 + \frac{\epsilon}{6} \\ &\quad = \lim_{j\to\infty} \left( \frac{1}{j} \sum_{k=0}^{j-1} \| W^{*\top}_j (H^{A,C,\zeta} T^{-k}(z) - \gamma H^{A,C,\zeta} T^{-k+1}(z)) - \mathcal{R}\, T^{-k}(z) \| + \lambda \|W^*_j\|^2 \right) + \frac{\epsilon}{6}, \end{aligned} $$
then apply (8)
$$ < \frac{1}{\ell} \sum_{k=0}^{\ell-1} \| W^{*\top}_\ell (H^{A,C,\zeta} T^{-k}(z) - \gamma H^{A,C,\zeta} T^{-k+1}(z)) - \mathcal{R}\, T^{-k}(z) \| + \lambda \|W^*_\ell\|^2 + \frac{2\epsilon}{6} $$
$$ \leq \frac{1}{\ell} \sum_{k=0}^{\ell-1} \| W^\top (H^{A,C,\zeta} T^{-k}(z) - \gamma H^{A,C,\zeta} T^{-k+1}(z)) - \mathcal{R}\, T^{-k}(z) \| + \lambda \|W\|^2 + \frac{2\epsilon}{6}, $$
then apply (9)
$$ < \lim_{j\to\infty} \left( \frac{1}{j} \sum_{k=0}^{j-1} \| W^\top (H^{A,C,\zeta} T^{-k}(z) - \gamma H^{A,C,\zeta} T^{-k+1}(z)) - \mathcal{R}\, T^{-k}(z) \| \right) + \lambda \|W\|^2 + \frac{3\epsilon}{6}, $$
then apply (6)
$$ < \lim_{j\to\infty} \left( \frac{1}{j} \sum_{k=0}^{j-1} \| W^\top (H^{A,C,\zeta} T^{-k}(z) - \gamma H^{A,C,\zeta} T^{-k+1}(z)) - \mathcal{R}\, T^{-k}(z) \| \right) + \frac{4\epsilon}{6}. $$
Then apply the Ergodic Theorem again:
$$ = \mathbb{E}_\mu[\| W^\top (H^{A,C,\zeta}(Z) - \gamma H^{A,C,\zeta} T(Z)) - \mathcal{R}(Z) \| \mid A,C,\zeta] + \frac{4\epsilon}{6} = \mathbb{E}_\mu[\| H^{A,C,\zeta}_W - \Phi H^{A,C,\zeta}_W \| \mid A,C,\zeta] + \frac{4\epsilon}{6}, $$
then apply (5)
$$ < \frac{\epsilon}{3} + \frac{4\epsilon}{6} = \epsilon. $$

In some reinforcement learning applications, it is useful, or even essential, for the optimisation of $W$ to occur dynamically as new data comes in; such algorithms are called online learning algorithms. In this section, we will present and discuss some preliminary novel results surrounding online learning algorithms that use ESNs. We will first introduce a lemma, stating that, under reasonable conditions, the ODE
$$ \frac{d}{dt} W = -h(W) := -\mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z) \left( H^{A,C,\zeta}_W(Z) - \Phi H^{A,C,\zeta}_W(Z) \right) \right] \qquad (10) $$
converges exponentially quickly to a globally asymptotically stable fixed point $W^*$, for which the associated ESN functional $H^{A,C,\zeta}_{W^*}$ is close to the unique fixed point of $\Phi$. By close we mean that the orthogonal projection of $\Phi H^{A,C,\zeta}_{W^*}$ onto the finite dimensional vector space of functionals $\{ H^{A,C,\zeta}_W \mid W \in \mathbb{R}^d \}$ is $H^{A,C,\zeta}_{W^*}$. Unlike the previous result (Theorem 3.2), we do not need to assume that the contraction mapping satisfies $\Phi(H) = \mathcal{R} + \gamma \mathbb{E}[H T_Z]$. We could choose for example $\Phi(H) = \mathcal{R} + \gamma \sup_\varphi \mathbb{E}[H T_{Z^{(\varphi)}}]$ where $Z^{(\varphi)}$ is a process under a control $\varphi$. The fixed point of this operator is the optimal value function $V^*$.

Lemma 3.3
Let $Z$ be an admissible input process. Let $A$, $C$, $\zeta$ be a $d \times d$, $d \times m$, and $d \times 1$ dimensional reservoir matrix, input matrix and bias vector produced by Procedure 2.6. Let $H^{A,C,\zeta}$ and $H^{A,C,\zeta}_W$ denote the associated ESN functionals. Let $\Phi$ be a contraction mapping, with Lipschitz constant $0 \leq \tau < 1$, on the space of CTI functionals $H : (D^m)^{\mathbb{Z}} \to \mathbb{R}$ that are $\mu$-measurable and satisfy $\mathbb{E}[H(Z)^2] < \infty$. Suppose further that $0 \leq \tau < \kappa^{-1}$ where $\kappa$ is the condition number of the autocorrelation matrix
$$ \Sigma = \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z)\, H^{A,C,\zeta}(Z)^\top \,\middle|\, A, C, \zeta \right]. $$
Then there exists a $\delta > 0$ such that the ODE
$$ \frac{d}{dt} W = -h(W) := -\mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z) \left( H^{A,C,\zeta}_W(Z) - \Phi H^{A,C,\zeta}_W(Z) \right) \,\middle|\, A, C, \zeta \right] $$
satisfies
$$ \frac{d}{dt} \|W - W^*\|^2 \leq -\delta \|W - W^*\|^2 \qquad (11) $$
where $W^*$ is a globally asymptotically stable fixed point. $W^*$ enjoys the further property that $H^{A,C,\zeta}_{W^*} = P \Phi H^{A,C,\zeta}_{W^*}$, where $P$ denotes the $L^2(\mu)$ orthogonal projection operator on the $\mu$-measurable functionals $H$ satisfying $\mathbb{E}[H(Z)^2] < \infty$ and is defined
$$ P H(z) := H^{A,C,\zeta}(z)^\top\, \Sigma^{-1}\, \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z)\, H(Z) \,\middle|\, A, C, \zeta \right]. $$

Proof
To show that $W^*$ is a globally asymptotically stable fixed point it suffices to show that there exists a $\delta > 0$ such that
$$ (W - W^*) \cdot (h(W^*) - h(W)) \leq -\delta \|W - W^*\|^2, $$
as this implies
$$ \frac{d}{dt} \|W - W^*\|^2 \leq -\delta \|W - W^*\|^2. $$
To construct this $\delta$, we first note that
$$ h(W) = \Sigma W - \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z)\, \Phi H^{A,C,\zeta}_W(Z) \,\middle|\, A,C,\zeta \right], $$
so, by a direct computation, we have
$$ \begin{aligned} (W - W^*) \cdot (h(W^*) - h(W)) &= (W - W^*) \cdot \left( \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z)\, \Phi H^{A,C,\zeta}_W(Z) \,\middle|\, A,C,\zeta \right] - \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z)\, \Phi H^{A,C,\zeta}_{W^*}(Z) \,\middle|\, A,C,\zeta \right] \right) \\ &\qquad - (W - W^*) \cdot (\Sigma W - \Sigma W^*) \\ &= (W - W^*) \cdot \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z) \left( \Phi H^{A,C,\zeta}_W(Z) - \Phi H^{A,C,\zeta}_{W^*}(Z) \right) \,\middle|\, A,C,\zeta \right] - (W - W^*)^\top \Sigma (W - W^*) \\ &\leq (W - W^*) \cdot \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z) \left( \Phi H^{A,C,\zeta}_W(Z) - \Phi H^{A,C,\zeta}_{W^*}(Z) \right) \,\middle|\, A,C,\zeta \right] - \sigma \|W - W^*\|^2 \end{aligned} $$
where $\sigma$ is the smallest eigenvalue of $\Sigma$,
$$ \leq \tau\, (W - W^*) \cdot \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z) \left( H^{A,C,\zeta}_W(Z) - H^{A,C,\zeta}_{W^*}(Z) \right) \,\middle|\, A,C,\zeta \right] - \sigma \|W - W^*\|^2 $$
because $\tau$ is the Lipschitz constant for $\Phi$,
$$ = \tau\, (W - W^*)^\top \Sigma (W - W^*) - \sigma \|W - W^*\|^2 \leq \tau \rho \|W - W^*\|^2 - \sigma \|W - W^*\|^2 $$
where $\rho$ is the largest eigenvalue of $\Sigma$,
$$ = -(\sigma - \tau \rho) \|W - W^*\|^2, $$
so we can set $\delta := \sigma - \tau \rho$ and notice $\delta > 0$ because $0 \leq \tau < \kappa^{-1} = \sigma/\rho$.

Next, to show that $H^{A,C,\zeta}_{W^*} = P \Phi H^{A,C,\zeta}_{W^*}$ we observe that, since $W^*$ is an equilibrium point of the ODE $\dot{W} = -h(W)$, it follows that $h(W^*) = 0$ and therefore
$$ \begin{aligned} &\mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z) \left( H^{A,C,\zeta}_{W^*}(Z) - \Phi H^{A,C,\zeta}_{W^*}(Z) \right) \,\middle|\, A,C,\zeta \right] = 0 \\ \implies\ & \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z)\, H^{A,C,\zeta}(Z)^\top \,\middle|\, A,C,\zeta \right] W^* - \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z)\, \Phi H^{A,C,\zeta}_{W^*}(Z) \,\middle|\, A,C,\zeta \right] = 0, \end{aligned} $$
so
$$ \Sigma W^* = \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z)\, \Phi H^{A,C,\zeta}_{W^*}(Z) \,\middle|\, A,C,\zeta \right], $$
so
$$ W^* = \Sigma^{-1}\, \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z)\, \Phi H^{A,C,\zeta}_{W^*}(Z) \,\middle|\, A,C,\zeta \right], $$
so
$$ H^{A,C,\zeta}_{W^*} = H^{A,C,\zeta\,\top}\, \Sigma^{-1}\, \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z)\, \Phi H^{A,C,\zeta}_{W^*}(Z) \,\middle|\, A,C,\zeta \right] = P \Phi(H^{A,C,\zeta}_{W^*}). $$

One rather restrictive condition of this lemma is that the Lipschitz constant $\tau$ of the contraction $\Phi$ must be less than the reciprocal condition number $\kappa^{-1}$. Now, we can interpret $\kappa$ as a measure of how orthonormal the columns of the autocorrelation matrix $\Sigma$ are.
In particular, if the columns are indeed orthonormal, then $\kappa = 1$ and this condition ceases to be restrictive at all. If the columns are close to being linearly dependent, then $\kappa$ is large, so the requirement that $\tau < \kappa^{-1}$ becomes troublesome. If indeed there is a linear dependence, the matrix $\Sigma$ is not even invertible and the theorem breaks down completely. If we interpret $H^{A,C,\zeta}(Z)$ as a vector of features, then $\kappa$ grows with the correlation between features. Higher correlation between the features imposes a greater constraint on the Lipschitz constant $\tau$. If we have zero inter-feature correlation then $\kappa = 1$ and we have no restriction at all on $\tau$.

Now, to actually solve ODE (10) we may need to compute
$$ h(W) := \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z) \left( H^{A,C,\zeta}_W(Z) - \Phi H^{A,C,\zeta}_W(Z) \right) \,\middle|\, A,C,\zeta \right] \qquad (12) $$
which may, or may not, be practical. For example, if the process $Z$ is ergodic, we can approximate (12) by taking a sufficiently long time average of
$$ H^{A,C,\zeta} T^k(z) \left( H^{A,C,\zeta}_W T^k(z) - \Phi H^{A,C,\zeta}_W T^k(z) \right). $$
Alternatively, we may approach the problem of solving (10) by first considering the explicit Euler method (with time-steps $\alpha_k > 0$)
$$ W_{k+1} = W_k - \alpha_k h(W_k) = W_k - \alpha_k\, \mathbb{E}_\mu\!\left[ H^{A,C,\zeta}(Z) \left( H^{A,C,\zeta}_{W_k}(Z) - \Phi H^{A,C,\zeta}_{W_k}(Z) \right) \,\middle|\, A,C,\zeta \right]; $$
then we might (heuristically) expect the algorithm
$$ W_{k+1} = W_k - \alpha_k\, H^{A,C,\zeta} T^k(z) \left( H^{A,C,\zeta}_{W_k} T^k(z) - \Phi H^{A,C,\zeta}_{W_k} T^k(z) \right) \qquad (13) $$
to converge to $W^*$, where the $\alpha_k$ are positive real numbers that satisfy
$$ \sum_{k=1}^{\infty} \alpha_k = \infty, \qquad \sum_{k=1}^{\infty} \alpha_k^2 < \infty. $$
We believe this heuristic could be made rigorous under mild assumptions, because algorithm (13) closely resembles the major algorithm extensively studied in (Benveniste et al., 1990) and (Borkar, 2009), for which similar results hold. Theorems 17 and 2.1.1, appearing in (Benveniste et al., 1990) and (Borkar, 2009) respectively, suggest that an algorithm much like (13) converges almost surely to $W^*$ if its associated ODE (reminiscent of (10)) satisfies condition (11), and the input process $Z$ is strongly mixing. The conjecture that algorithm (13) converges to $W^*$ is also reminiscent of Theorem 3.1 by Melo and Ribeiro (2007), and related results by Chen et al. (2019). These results are closely related to Q-learning and stochastic gradient descent. We note that (sadly) finding the fixed point of the general contraction mapping $\Phi$ renders the estimation of $W$ a nonlinear problem.
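To make the heuristic concrete, the sketch below (our illustration, not the authors' code) implements update (13) for the Bellman operator $\Phi(H) = \mathcal{R} + \gamma \mathbb{E}[H T_Z]$, replacing the expectation inside $\Phi$ by the one-sample surrogate $r_k + \gamma W_k^\top x_{k+1}$; the step sizes $\alpha_k = 1/(k+1)$ are one choice satisfying the summability conditions above.

```python
import numpy as np

def online_td_update(X, r, gamma, d):
    """Stochastic-approximation sketch of update (13) with the operator
    Phi(H) = R + gamma * E[H T_Z], estimated from the single sample r_k + gamma * W^T x_{k+1}.
    X : (ell, d) array of reservoir states x_k;  r : (ell,) array of observed rewards."""
    W = np.zeros(d)
    for k in range(len(r) - 1):
        alpha = 1.0 / (k + 1)                       # sum alpha_k = inf, sum alpha_k^2 < inf
        td_error = W @ X[k] - (r[k] + gamma * (W @ X[k + 1]))
        W = W - alpha * X[k] * td_error             # W_{k+1} = W_k - alpha_k x_k (H_W - sampled Phi H_W)
    return W
```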
4. Bee World
To demonstrate the theory presented in section 3, we created a game called
Bee World and show that a simple reinforcement learning algorithm supported by an ESN can learn to play Bee World with respectable skill. The game is designed so that the theory presented previously is easy to visualise, rather than because the game is hard to master.

Bee World is set on the circle of unit length, which we denote by $S$, and represent as an interval with edges identified. At every point $y$ on the circle, there is a non-negative quantity of nectar, which may be enjoyed by the bee without depletion. 'Without depletion' means that the bee takes a negligible amount of nectar from the point $y$, so the bee occupying point $y$ does not cause the amount of nectar at $y$ to change. Furthermore, the nectar at every point $y$ varies with time $t$ according to the prescribed function
$$ n(y, t) = 1 + \cos(\omega t) \sin(2\pi y) \qquad (14) $$
(which we chose somewhat arbitrarily) that is unknown to the bee. Thus, the amount of nectar enjoyed by the bee at time $t$ is a value that lies in the interval $[0, 2]$, which we will denote $N$. Time advances in discrete integer steps $t = 0, 1, 2, \ldots$, and at any time point $t$ a bee at point $y$ observes the quantity of nectar $r \in N$ at point $y$ and nothing else. Having made this observation, the bee may choose to move anywhere in the interval $(y - c, y + c)$ for some fixed $0 < c < 1$, and arrive at its chosen destination at time $t + 1$. The interval of possible moves $(-c, c)$ is called the action space and is denoted $\mathcal{A}$. The goal of the bee is to devise a policy whereby, given all its previous observations, the bee makes a decision as to where to move next, such that the discounted sum over all future nectar is as great as possible. The space of all previous (reward, action) pairs $(N \times \mathcal{A})^{\mathbb{Z}_-}$ is contained in the space of bi-infinite sequences $(\mathbb{R}^2)^{\mathbb{Z}}$. The agent playing Bee World makes no observations beyond the rewards (nectar) and actions, but we could easily envision a more general game where the agent makes observations from a set $\Omega$ and therefore makes its decisions based on a left sequence of (reward, action, observation) triples.

The policy adopted by the bee may be realised as a deterministic policy $\pi : (N \times \mathcal{A})^{\mathbb{Z}_-} \to \mathcal{A}$ (a CTI functional) for which the bee executes an action $a \in \mathcal{A}$ determined by the history of (reward, action) pairs. Alternatively, the bee may adopt a stochastic policy, for which every history of (reward, action) pairs admits a distribution over actions $\mathcal{A}$ from which the bee makes a random choice.

Though the evolution of Bee World is Markovian (and deterministic), the bee makes only a partial observation of the state of Bee World (i.e. the amount of nectar the bee observes at time $t$), so the bee must take advantage of its memory to reconstruct the true state and find an optimal policy. This need for memory renders the problem suitable for an ESN, while ruling out the conventional theory of Markov Decision Processes. The problem of playing Bee World could be formulated as a partially observed Markov Decision Process. Under a policy $\pi$, the nectar-action pairs experienced by the bee yield a realisation of the $(N \times \mathcal{A})^{\mathbb{Z}}$-valued random variable $Z$.
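As an illustration of the game dynamics (our sketch, not the authors' implementation), a minimal Bee World simulator might look as follows; the angular frequency $\omega$ and the maximum step size $c$ are not specified above, so the values used here are assumptions.

```python
import numpy as np

OMEGA = 0.1   # assumed angular frequency of the nectar oscillation
C_MAX = 0.05  # assumed maximum step size c

def nectar(y, t):
    """Nectar function (14): n(y, t) = 1 + cos(omega t) sin(2 pi y), taking values in [0, 2]."""
    return 1.0 + np.cos(OMEGA * t) * np.sin(2.0 * np.pi * y)

def step(y, t, action):
    """Advance one time step: the bee moves by `action` in (-c, c) on the unit circle
    and then observes only the nectar at its new position."""
    y_new = (y + action) % 1.0
    reward = nectar(y_new, t + 1)
    return y_new, reward
```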
It therefore makes sense to define the value functional $V : (N \times \mathcal{A})^{\mathbb{Z}} \to \mathbb{R}$ associated to $Z$ by
$$ V(z) = \mathbb{E}_\mu\!\left[ \sum_{k=0}^{\infty} \gamma^k \mathcal{R}\, T^k(Z) \,\middle|\, Z_j = z_j \ \forall j \leq 0 \right] \qquad (15) $$
where $\mathcal{R} : (N \times \mathcal{A})^{\mathbb{Z}} \to \mathbb{R}$ is the reward functional defined by $\mathcal{R}(z) = r_0$, where $r_0$ is the nectar collected at time $0$, $T$ is the shift operator, and $\gamma \in [0,1)$ is the discount factor representing the relative importance of near and long term nectar consumption. We can see after a simple rearrangement of (15) that
$$ V(z) = \mathcal{R}(z) + \gamma\, \mathbb{E}_\mu[V T_Z(z)], $$
so $V$ is the unique fixed point of the contraction mapping $\Phi$ defined by
$$ \Phi(H)(z) := \mathcal{R}(z) + \gamma\, \mathbb{E}_\mu[H T_Z(z)] $$
as discussed in Section 3. Thus, by Theorem 3.2, we can approximate the value functional $V$ using an ESN trained by regularised least squares as long as the nectar-action pairs $z \in (N \times \mathcal{A})^{\mathbb{Z}}$ are drawn from a suitable ergodic process $Z$. Therefore, we chose an initial policy $\pi_0$ such that $Z$ is ergodic. In particular, we chose a stochastic policy $\pi_0(z) \sim U(-c, c)$ for all histories of (reward, action) pairs $z \in (N \times \mathcal{A})^{\mathbb{Z}_-}$, so that the bee takes a uniform sample from the action space $\mathcal{A} = (-c, c)$ at any point $y \in S$. For the purpose of playing a game, we fixed a small maximum step size $c$ and a discount factor $\gamma \in (0,1)$. We allowed the bee to execute this policy for 2000 time steps and recorded the observed nectar at every time. The first 250 time steps are plotted in Figure 1.

Next, we set up an ESN of dimension $d = 300$, with reservoir matrix, input matrix, and bias $A$, $C$, $\zeta$ populated with i.i.d. uniform random variables. $A$ was then multiplied by a scaling factor such that the 2-norm of $A$ satisfies $\|A\| = 1$. We chose the activation function $\sigma(x) := \max(0, x)$. We should pause here and note that the ESN described here differs slightly from the ESN described in Procedure 2.6. We decided to set up the ESN in a traditional way, which is empirically observed to be highly successful, as demonstrated in the literature, rather than the more cumbersome way that is sufficient to prove results.

We then computed a sequence of reservoir states $x_k \in \mathbb{R}^d$ for the ESN using the iteration
$$ x_{k+1} = \sigma(A x_k + C z_k + \zeta) $$
where $x_0 = 0$ and each $z_k \in (N \times \mathcal{A})$ comprises 2 components: the first is the quantity of nectar observed by the bee at time $k$, and the second is the action $a \in (-c, c)$ executed at time $k$ under policy $\pi_0$. Now we return our attention to Theorem 3.2, and see that the $W^*_\ell$ minimising
$$ \frac{1}{\ell} \sum_{k=0}^{\ell-1} \| W^{*\top}_\ell (H^{A,C,\zeta} T^{-k}(z) - \gamma H^{A,C,\zeta} T^{-k+1}(z)) - \mathcal{R}\, T^{-k}(z) \|^2 + \lambda \|W^*_\ell\|^2 $$
converges to the $W$ minimising
$$ \sum_{k} \| W^\top (x_k - \gamma x_{k+1}) - r_k \|^2 + \lambda \|W\|^2, \qquad (16) $$
so we can immediately reformulate (16) as the least squares problem
$$ W = (\Xi^\top \Xi + \lambda I)^{-1} \Xi^\top U $$
where $\Xi$ is the matrix whose $k$th row is
$$ \Xi_k := (x_k - \gamma x_{k+1})^\top, $$
$U$ is the vector whose $k$th entry is $r_k$, the $k$th quantity of nectar, and $\lambda$ is the regularisation parameter, which we set to a small positive value. We solved this linear system using the SVD. Now
$$ V(z) \approx H^{A,C,\zeta}_{W^*_\ell}(z) \equiv (W^*_\ell)^\top H^{A,C,\zeta}(z) \equiv W^\top x, $$
where $x$ is the reservoir state associated to the left infinite input sequence $z$. Furthermore, the map $(W^\top \cdot)$ therefore approximates the unique fixed point of $\Phi$ (by Theorem 3.2) and this fixed point is exactly the value functional we are looking for. Thus, we can easily compute the approximate value of an arbitrary reservoir state $x$ under the initial policy $\pi_0$ by computing the inner product $W^\top x$. We illustrate this in Figure 1a by plotting, at each time step, the value of the observed state alongside the observed nectar.
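A minimal sketch of this offline value-function fit (our illustration, assuming the reservoir states have already been collected as described above) is:

```python
import numpy as np

def fit_value_readout(X, r, gamma, lam):
    """Ridge-regression sketch of (16): find W minimising
    sum_k ||W^T (x_k - gamma x_{k+1}) - r_k||^2 + lam ||W||^2.
    X : (ell, d) reservoir states;  r : (ell,) observed rewards (nectar)."""
    Xi = X[:-1] - gamma * X[1:]          # k-th row is x_k - gamma x_{k+1}
    U = r[:-1]
    d = X.shape[1]
    # Normal equations W = (Xi^T Xi + lam I)^{-1} Xi^T U; the paper reports using the SVD,
    # here a direct solve is used for brevity.
    W = np.linalg.solve(Xi.T @ Xi + lam * np.eye(d), Xi.T @ U)
    return W

# W @ x then approximates the value V(z) of the reservoir state x.
```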
(a) The nectar collected (blue) and the approximate value function under the initial policy $\pi_0$ (red) is plotted for the first 250 time steps ($x$-axis). (b) The nectar function $n(y,t)$ (14) at every point represented as a heat map in the $(t, y)$ plane, with the position of the bee at time $t$ under the initial policy indicated by the overlaid white circles.

Figure 1: Dynamics of Bee World where the bee executes the initial policy $\pi_0(z) \sim U(-c, c)$ for the first 250 time steps.

Having computed an approximate value function under the initial policy $\pi_0(z) \sim U(-c, c)$, we were faced with the problem of how to improve upon this policy. Exploring efficient and effective algorithms for iteratively improving a policy is a rich area of reinforcement learning research, but outside the scope of this section. Instead, we implemented a simple and greedy approach. For a given reservoir state $x$ we consider 100 actions $a_1, a_2, \ldots, a_{100}$ uniformly sampled over $\mathcal{A} = (-c, c)$, then for each action we consider the nectar-action pairs $z^{(1)}, \ldots, z^{(100)} \in N \times \mathcal{A}$, where the nectar for each pair is the current nectar, and is therefore the same in every pair. Then we compute the next reservoir states for each pair
$$ x^{(i)}_{k+1} = \sigma(A x_k + C z^{(i)}_k + \zeta) $$
and estimate the value of executing the $i$th action by computing $W^\top x^{(i)}_{k+1}$. Then we choose to execute the action $a^*$ with the greatest estimated value. This determines our new policy $\pi_1$, which yields a significant improvement over the initial policy $\pi_0$, as illustrated in Figure 2. Under the initial policy $\pi_0$ the bee collected an average of approximately 1.05 nectar per unit time, in comparison to 1.52 nectar under the improved policy $\pi_1$. This is much closer to the optimal value of approximately 1.60, which we obtain in the next section.

(a) The nectar collected (blue) and the approximate value function (red) is plotted for the first 250 time steps under the improved policy $\pi_1$. (b) The nectar function $n(y,t)$ (14) at every point represented as a heat map in the $(t, y)$ plane, with the position of the bee at time $t$ under the improved policy indicated by the overlaid white circles.

Figure 2: Dynamics of Bee World where the bee executes the improved policy $\pi_1$ for the first 250 time steps.

We now analyse Bee World so that we can compare the ESN solution to results that we can prove. To make our own lives easier, we consider a smooth version of Bee World, rather than the discrete time version solved by the ESN, so that we can formulate Bee World as a control problem that admits a solution via the Euler-Lagrange equation. We consider the control system
$$ \dot{\tau} = 1, \qquad \dot{y} = u(y, \tau) $$
where $u$ is the controller, dependent on $y$ and $\tau$. Then we have a cost function
$$ C(y, \tau, u) = f(u) - n(y, \tau) $$
where $f(u)$ is the penalty term for using the control $u$ and $n(y, \tau)$ is the nectar function. In the above formulation of Bee World,
$$ f(u) = \begin{cases} 0 & \text{if } -c \leq u \leq c \\ \infty & \text{otherwise} \end{cases} $$
where $c$ is the maximum step size introduced above. Then the objective is to find
$$ u^* = \operatorname*{arg\,min}_u \left\{ \int_0^\infty \gamma^t C(y, \tau, u)\, dt \right\}. $$
We can see that $f$ is not a well defined function, so we will introduce the family of functions
$$ f_\epsilon(u) = -\epsilon \log(\cos(\pi u / (2c))) $$
where $\epsilon > 0$, and notice that $f_\epsilon$ approaches $f$ pointwise as $\epsilon \to 0$.
Next, we recall that the stationary points (including the minimum) of the integral functional
$$ I[y] = \int_0^\infty F(t, y, \dot{y})\, dt $$
all satisfy the Euler-Lagrange equation
$$ \frac{d}{dt} \frac{\partial F}{\partial \dot{y}} - \frac{\partial F}{\partial y} = 0. $$
So, we let
$$ F(t, y, \dot{y}) = \gamma^t C(t, y, \dot{y}) = \gamma^t \left( -\epsilon \log(\cos(\pi \dot{y}/(2c))) - \cos(\omega t)\sin(2\pi y) - 1 \right); $$
then
$$ \begin{aligned} \frac{d}{dt} \frac{\partial F}{\partial \dot{y}} - \frac{\partial F}{\partial y} &= \frac{d}{dt}\!\left( \gamma^t \frac{d}{d\dot{y}} \left( -\epsilon \log(\cos(\pi \dot{y}/(2c))) \right) \right) + 2\pi \gamma^t \cos(\omega t)\cos(2\pi y) \\ &= \frac{\pi \epsilon}{2c} \frac{d}{dt}\!\left( \gamma^t \tan(\pi \dot{y}/(2c)) \right) + 2\pi \gamma^t \cos(\omega t)\cos(2\pi y) \\ &= \frac{\pi \epsilon}{2c} \left( \log(\gamma)\, \gamma^t \tan(\pi \dot{y}/(2c)) + \gamma^t \frac{\pi \ddot{y}}{2c} \sec^2(\pi \dot{y}/(2c)) \right) + 2\pi \gamma^t \cos(\omega t)\cos(2\pi y) \\ &= \gamma^t \left[ \frac{\pi \epsilon}{2c} \left( \log(\gamma) \tan(\pi \dot{y}/(2c)) + \frac{\pi \ddot{y}}{2c} \sec^2(\pi \dot{y}/(2c)) \right) + 2\pi \cos(\omega t)\cos(2\pi y) \right]. \end{aligned} $$
Setting this expression to zero, dividing by $\gamma^t$ and writing $v = \dot{y}$, we can reformulate it as a dynamical system
$$ \begin{aligned} \dot{v} &= -\frac{2c \cos^2(\pi v/(2c))}{\pi} \left( \frac{4c}{\epsilon} \cos(\omega \tau)\cos(2\pi y) + \log(\gamma) \tan(\pi v/(2c)) \right) \\ \dot{y} &= v \\ \dot{\tau} &= 1 \end{aligned} \qquad (17) $$
whose solutions are stationary points of the integral functional. For small $\epsilon$, we approach the Bee World problem. We took a small value of $\epsilon$, a fixed $\gamma$, initial position $y = 0$, and initial velocity $v = 0$, then simulated a trajectory of the ODE using scipy.integrate.odeint. We plotted this in Figure 3. The average nectar collected under this policy was approximately 1.60.

Figure 3: A numerical solution to the ODE (17) for small $\epsilon$ (white line) superposed on the heat map of the nectar function $n(y,t)$ given in (14). Dark colours indicate regions of low nectar, light regions indicate high values of the nectar function. We observe that the solution trajectory spends much more time near local maxima of the nectar function but has complicated oscillatory fluctuations during transitions between local maxima. The oscillations are likely due to approaching a sort of singularity as $\epsilon \to 0$.
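For reference, a trajectory like the one in Figure 3 can be generated by integrating (17) numerically with scipy.integrate.odeint, as described above. The sketch below is our illustration; the values of $\omega$, $c$, $\gamma$ and $\epsilon$ are assumptions, since they are not all specified in the text.

```python
import numpy as np
from scipy.integrate import odeint

OMEGA, C_MAX, GAMMA, EPS = 0.1, 0.05, 0.5, 1e-3   # assumed parameter values

def rhs(state, t):
    """Right-hand side of the dynamical system (17) for the state (v, y, tau)."""
    v, y, tau = state
    u = np.pi * v / (2.0 * C_MAX)
    dv = -(2.0 * C_MAX * np.cos(u) ** 2 / np.pi) * (
        (4.0 * C_MAX / EPS) * np.cos(OMEGA * tau) * np.cos(2.0 * np.pi * y)
        + np.log(GAMMA) * np.tan(u)
    )
    return [dv, v, 1.0]

t_grid = np.linspace(0.0, 250.0, 5000)
trajectory = odeint(rhs, [0.0, 0.0, 0.0], t_grid)  # columns: v, y, tau
```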
5. Application to Stochastic Control
ESNs have shown remarkable promise in solving problems in mathematical finance, including by Lin et al. (2009), Zhang et al. (2013), and Dan et al. (2014), who used an ESN to predict the future values of stock prices. Bozsik and Ilonczai (2012) used an ESN to learn the solution to a credit rating problem and Maciel et al. (2014) used an ESN to forecast exchange rates, comparing the results to forecasts made with an ARMA model. In this section we will introduce a stochastic optimal control problem arising in the market making problem. We will solve this problem analytically, and compare this to the solution obtained by a reinforcement learning agent supported by an ESN.

We consider a stochastic control problem inspired by the motivations of a market maker acting in a general financial market. In practice the specific role of a market maker depends on the particular market, but we consider a market maker who provides liquidity to other market participants by quoting prices at which they are willing to sell (ask) and buy (bid) an asset. By setting the ask price higher than the bid price, in general they can profit from the difference when they receive both a buy and sell order at these prices. However, the market maker faces risk, since if they buy a quantity of the asset the market price might move against them before they are able to find a seller.

The market making problem is a complex one, and has been studied extensively since the publication of the paper by Avellaneda and Stoikov (2008). The paper of Guéant (2017) gives a good overview of much of this work. We consider a stylised version of this problem that focuses on inventory management without considering explicit optimal quoting strategies. We consider that a market maker acting relatively passively around the market price in ordinary conditions would expect to observe a random demand for buy and sell orders. If as a result of random fluctuations they find their inventory has drifted away from zero, they would set prices more competitively on either the ask or bid side to encourage trades to balance their position. Very broadly, the conclusions of work on the market making problem are that there is a price to be paid to exert control over the inventory process and bring inventories closer to zero.

Motivated by this insight, we consider the market maker's inventory to be a stochastic process $(Y_t)_{t \geq 0}$ with dynamics
$$ dY_t = \varphi_t\, dt + \sigma\, dW_t $$
where $(W_t)_{t \geq 0}$ is a standard Brownian motion. The parameter $\sigma$ measures the volatility of the incoming order flow, and $(\varphi_t)_{t \geq 0}$ is the control process by which the market maker adds drift into their order flow by moving their bid and ask quotes. Naturally, there is a cost involved in applying the control, and a further cost to holding inventory away from zero.
We introduce parameters α and β to quantify these effects and model the market maker's profit as a stochastic process solving

    dZ_t = (r − α ϕ_t² − β Y_t²) dt,

where r is the rate of profit the market maker would achieve from the bid-ask spread if they did not have concerns about asset price movements. We consider the case where the market maker seeks to maximise their long-run discounted profit

    v(y) = max_ϕ E_y[ ∫_0^∞ e^(−δt) dZ_t ],

where E_y is the expectation with the process started at Y_0 = y. We note this is an infinite-horizon, linear-quadratic regulator (LQR) type problem. We can show that the market maker's value function and optimal control are

    v(y) = −αh y² + (r − αh σ²)/δ,    ϕ*(y) = −h y,    (18)

where

    h := ( −αδ + √(α²δ² + 4αβ) ) / (2α).

Further, the inventory process (Y_t)_{t≥0}, when controlled by the optimal control ϕ*(y) = −h y, is the Ornstein-Uhlenbeck process

    dY_t = −h Y_t dt + σ dW_t,

whose stationary distribution is the Gaussian N(0, σ²/(2h)).

To turn this into a problem that can be used to train an Echo State Network, we reformulate it in discrete time: we consider a process Y_0, Y_1, Y_2, ... such that

    Y_{k+1} − Y_k = ǫ ϕ_k + σ √ǫ N_k,

where (N_k)_{k∈N} is a sequence of i.i.d. random variables with N_k ~ N(0, 1) for each k ∈ N, and ǫ > 0 is the time increment. The control is now a sequence ϕ = (ϕ_k)_{k∈N}. The profit process satisfies Z_0 = 0 and

    dZ_k := Z_{k+1} − Z_k = ǫ (r − α ϕ_k² − β Y_k²),

and the market maker seeks to maximise the value function

    v(y) = max_ϕ E_y[ Σ_{k=0}^∞ e^(−δǫk) dZ_k ]

over choices of the control ϕ, where E_y is the expectation with the process started at Y_0 = y. It can be shown that, in the limit as ǫ → 0, the optimal control and value function for this problem converge precisely to the optimal control and value function in the continuous case.

We state here the results in the case ǫ = 1, the value we will use for the application of the Echo State Network below. Writing γ = e^(−δ), we find in this case that the value function and optimal control are given by

    v(y) = −αp y² + (r − γαp σ²)/(1 − γ),    ϕ*(y) = −p y,

where

    p := ( (α(γ − 1) + γβ) + √( (α(γ − 1) + γβ)² + 4αβγ ) ) / (2γα).

The process Y controlled by ϕ* is Markovian, and has transition operator

    (T s)(y) = ∫_{−∞}^{∞} P(Y_{k+1} = y | Y_k = x) s(x) dx = (1/(√(2π) σ)) ∫_{−∞}^{∞} exp( −(y − (1 − p)x)² / (2σ²) ) s(x) dx.

It is straightforward to verify that the Gaussian probability density function

    s*(y) = ( √(p(2 − p)) / (√(2π) σ) ) exp( −y² p(2 − p) / (2σ²) ),    (19)

is a fixed point of T, and hence that the controlled process has stationary distribution N(0, σ²/(p(2 − p))).

We now seek to solve the market making problem with a reinforcement learning algorithm supported by an ESN. In this setup, we assume the market maker has no knowledge of the cost function and no knowledge of the effect of executing an action. The agent must execute a variety of actions in a variety of states to learn about the environment and the effect of its actions. The market maker then makes reasonable changes to its policy to arrive at a policy that reduces the long-term costs of operation. The policy obtained by the reinforcement learning approach is compared to the optimal policy derived with full knowledge of the system.
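For reference, the closed-form quantities stated above for the ǫ = 1 case are easy to compute numerically. The short Python helper below simply transcribes the expressions for p, the value function, and the stationary variance as written above; it is a convenience for checking the experiments, not part of the original implementation, and the discount factor passed in the example is a placeholder value.

```python
import numpy as np

def discrete_solution(alpha, beta, gamma, sigma, r=0.0):
    """Transcription of the closed-form eps = 1 solution stated above.

    Returns the feedback gain p (optimal control phi*(y) = -p * y),
    a callable for the value function v(y), and the stationary variance
    sigma**2 / (p * (2 - p)) of the controlled inventory process.
    """
    b = alpha * (gamma - 1.0) + gamma * beta
    p = (b + np.sqrt(b ** 2 + 4.0 * alpha * beta * gamma)) / (2.0 * gamma * alpha)

    def value(y):
        return -alpha * p * y ** 2 + (r - gamma * alpha * p * sigma ** 2) / (1.0 - gamma)

    stationary_var = sigma ** 2 / (p * (2.0 - p))
    return p, value, stationary_var

# Example with the experiment's parameters alpha = beta = sigma = 1, r = 0;
# gamma = e^{-delta} here uses an assumed delta, not the paper's exact value.
p, v, var = discrete_solution(alpha=1.0, beta=1.0, gamma=np.exp(-0.5), sigma=1.0)
```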
For the purpose of running the simulation, we let the cost of operating the control α = 1, the cost of straying from the origin β = 1, the timestep ǫ = 1, and the volatility parameter σ = 1. We take the baseline profit parameter r = 0. The inventory held, and the action taken, by the market maker at time k will be denoted y_k and a_k respectively. A sequence of (inventory, action) pairs will be denoted z ∈ (R²)^Z with z_k = (y_k, a_k). The value functional for the market maker problem is defined

    V(z) = E_μ[ Σ_{k=0}^∞ γ^k R(T^k Z) | Z_j = z_j ∀ j ≤ 0 ],

where R : (R²)^Z → R is the reward functional

    R(z) = −(α a_0² + β y_0²),

T is the shift operator, and γ ∈ [0, 1) is the discount factor representing the relative importance of near- and long-term costs. We can see after a simple rearrangement that

    V(z) = R(z) + γ E_μ[ V(T Z) | Z_j = z_j ∀ j ≤ 0 ],

so V is the unique fixed point of the contraction mapping Φ defined by

    Φ(H)(z) = R(z) + γ E_μ[ H(T Z) | Z_j = z_j ∀ j ≤ 0 ],

as discussed in Section 3. Thus, by Theorem 3.2, we can approximate the value function V using an ESN trained by regularised least squares if the (inventory, action) pairs (y_k, a_k) are the realisation of a stationary ergodic process. Consequently, we sought an initial policy π_0 such that the process Z comprising the inventory-action pairs under policy π_0 is stationary and ergodic. In particular, we chose

    π_0(y) ∼ N(0, σ_i²) − η y,    (20)

with η = 0. , a constant representing the rate of exponential drift toward the origin, and σ_i = 1. We ran this policy for 10000 time steps and recorded the pairs z_k along with the rewards r_k. Next, we set up an ESN of dimension d = 300, with the reservoir matrix A, input matrix C, and bias ζ populated with i.i.d. uniform random variables U(−c, c) for a fixed 0 < c < 1. A was then multiplied by a scaling factor such that its 2-norm satisfies ‖A‖ = 1. We chose the ReLU activation function. We then computed the reservoir states

    x_{k+1} = σ(A x_k + C z_k + ζ),

starting with the initial reservoir state x_0 = 0. An arbitrary reservoir state x then represents the left-infinite sequence of (inventory, action) pairs z. We seek an expression for the value of the reservoir state x by solving the least squares problem

    W = (Ξ^⊤ Ξ + λI)^{−1} Ξ^⊤ U

(using the singular value decomposition), where Ξ is the matrix whose kth column is Ξ_k := x_k − γ x_{k+1}, U is the vector of observations whose kth entry is the reward r_k, and λ is the regularisation parameter. We also chose a discount factor γ = e^(−δ); in practice, the discount factor is usually much larger. With this, we obtain an expression for the value of the reservoir state x given by W^⊤ x. The results of this policy are shown in Figures 4 and 5.

Figure 4: Under the initial policy, the value V(Y) (y-axis) learned by the ESN at the inventory Y (x-axis) at each of the 10000 timesteps is shown. The parabolic shape is consistent with the analytically derived optimal value function, shown in red. We note that the value function under the initial policy π_0 is not expected to match the value function under the optimal policy π*.

Figure 5: Dynamics of the market maker over time executing (a) the initial policy π_0 and (b) the improved policy π_1. For each plot, the inventory (y-axis) is shown evolving with time (x-axis).
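The policy-evaluation step described above (reservoir generation, state iteration, and the regularised least-squares solve with Ξ_k = x_k − γ x_{k+1}) can be summarised in a few lines of Python. This is a minimal sketch under assumed values for the scale of the random weights, the regularisation parameter, and the discount factor, since the exact constants used in the experiment are not reproduced here; it is not the authors' original code.

```python
import numpy as np

def fit_esn_value(z, rewards, d=300, gamma=np.exp(-0.5), lam=1e-4, scale=0.05, seed=1):
    """Sketch of the ESN policy-evaluation step described above.

    z       : array of shape (T, 2) of (inventory, action) pairs z_k.
    rewards : array of shape (T,) of observed rewards r_k.
    A, C and zeta are drawn i.i.d. uniform and A is rescaled to unit 2-norm,
    as in the text; `gamma`, `lam` and `scale` are assumed placeholder values.
    Returns the reservoir parameters and the readout W solving the
    regularised least-squares problem with Xi_k = x_k - gamma * x_{k+1}.
    """
    rng = np.random.default_rng(seed)
    T, m = z.shape
    A = rng.uniform(-scale, scale, size=(d, d))
    A /= np.linalg.norm(A, 2)                      # enforce ||A||_2 = 1
    C = rng.uniform(-scale, scale, size=(d, m))
    zeta = rng.uniform(-scale, scale, size=d)

    relu = lambda v: np.maximum(v, 0.0)
    x = np.zeros((T + 1, d))                       # x_0 = 0
    for k in range(T):
        x[k + 1] = relu(A @ x[k] + C @ z[k] + zeta)

    Xi = x[:T] - gamma * x[1:T + 1]                # row k is Xi_k = x_k - gamma x_{k+1}
    W = np.linalg.solve(Xi.T @ Xi + lam * np.eye(d), Xi.T @ rewards)
    return A, C, zeta, W                           # value of a state x is W @ x
```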
We sought to create a new and improved policy, based on the observations made under the initial policy, using a naïve approach. At each time step, we consider 100 trial actions a^(1), a^(2), ..., a^(100) drawn from the standard normal distribution N(0, 1) and compute

    x^(i)_{k+1} = σ(A x_k + C z^(i)_k + ζ),

where z^(i)_k is the (inventory, action) pair (y_k, a^(i)) and a^(i) is the ith trial action. For each i, we compute W^⊤ x^(i)_{k+1} to obtain the predicted value of executing action a^(i). We then choose to execute the action a* with the greatest predicted value, and update the reservoir state using this (inventory, action) pair (y_k, a*). This defines our new policy. We ran this new policy for 10000 time steps and illustrate the results in Figures 6a and 6b.

The one-step reinforcement learning algorithm did not perfectly replicate the analytically derived optimal control, but has moved in a promising direction. We can see in Figure 6a that the inventory process under the improved policy produces (inventory, action) pairs that are scattered about the optimal policy. This suggests that the market maker trained by reinforcement learning is behaving well in some average sense, despite performing many sub-optimal actions. It also appears that the reinforcement learning algorithm uses the control more aggressively than is optimal. This sub-optimal control results in greater costs than the optimal control: the average cost incurred under the improved policy π_1 is 2.65, while the average cost under the analytically derived optimal policy is σ/√(p(2 − p)).

Figure 6: (a) The (inventory, action) pairs (y_k, a_k) under the improved policy π_1, represented as points on a scatter plot. The inventory is on the x-axis and the action is on the y-axis. The red line represents the analytically derived optimal control (equation (18)). (b) The invariant measure of the inventory process under the improved policy π_1, approximated with a histogram. The histogram is compared to the analytically derived invariant measure N(0, σ²/(p(2 − p))) of the optimally controlled process (equation (19)).

Despite these suboptimal moves, it seems that the inventory process learned by the market maker has an invariant measure that closely matches the optimal invariant measure. It is promising to see that an invariant measure exists at all, because the controlled process is assumed to be stationary and ergodic (and therefore admits an invariant measure) in Theorem 3.2.

It is also worth noting that the inventory process, controlled either by the ESN or by the optimal control, has support on R, which is not a compact space. Therefore, the conditions of Theorem 3.2 do not technically hold. However, the numerical results here suggest that the ESN has learned the value functional adequately well, suggesting that Theorem 3.2 may hold under relaxed conditions.
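The greedy one-step improvement described above amounts to scoring candidate actions with the learned readout and keeping the best one. The Python sketch below reuses the (assumed) reservoir parameters A, C, ζ and readout W from the previous sketch; the 100 trial actions and the N(0, 1) proposal follow the text, while everything else is an illustrative choice rather than the experimental code.

```python
import numpy as np

def greedy_action(x_k, y_k, A, C, zeta, W, n_trials=100, seed=None):
    """One step of the naive policy improvement described above.

    Samples trial actions from N(0, 1), scores each candidate
    (inventory, action) pair with the trained readout W, and returns the
    action with the largest predicted value together with the corresponding
    next reservoir state. Names are illustrative.
    """
    rng = np.random.default_rng(seed)
    relu = lambda v: np.maximum(v, 0.0)
    trial_actions = rng.normal(0.0, 1.0, size=n_trials)
    best_value, best_action, best_state = -np.inf, None, None
    for a in trial_actions:
        z_trial = np.array([y_k, a])               # candidate (inventory, action) pair
        x_next = relu(A @ x_k + C @ z_trial + zeta)
        v = W @ x_next                             # predicted value of taking action a
        if v > best_value:
            best_value, best_action, best_state = v, a, x_next
    return best_action, best_state
```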
6. Conclusions and future work
In this paper, we presented three novel results about Echo State Networks trained on data drawn from a stationary ergodic process. The first applies to offline supervised learning. The theorem states that, given a target function, enough training data, and a large enough ESN, the least squares training procedure will yield an arbitrarily good approximation to the target function. The second result applies to an agent performing a stochastic policy π. After the agent has collected sufficiently many training data, and given a sufficiently large ESN, the least squares training procedure will yield an arbitrarily good approximation to the value function associated with the policy π. The third result is relevant to online reinforcement learning. Though the result is quite preliminary, the lemma is introduced with the intention of developing online algorithms (inspired by Q-learning) to learn the optimal policy for non-Markovian problems.

We demonstrated the second result (which generalises the first) on a deterministic control problem (Bee World) and a stochastic control problem (the market making problem). We chose these 'toy model' problems so that we could understand the performance of the algorithm completely in cases that are solvable analytically, although these optimal solutions are themselves not entirely trivial. The reinforcement learning algorithm we use to improve the policy in both Bee World and the market making problem is extremely simple: it is essentially one iteration of an ǫ-greedy policy improvement (Sutton and Barto, 2015) with ǫ set to 0. Despite the simplicity of the algorithm, the single iteration considerably improved the policy, resulting in a reasonable approximation to the optimal policy.

It therefore seems a natural direction of future work to develop more sophisticated learning algorithms. As this work develops, it will become essential to have a rigorous framework describing the relationship between filters, functionals, random processes, and reinforcement learning. The theory presented in this paper tentatively connects these objects using ideas from Markov Decision Processes, but the theory is far from complete. A very recent and promising framework for uniting these ideas is QuaSiModO: Quantization-Simulation-Modeling-Optimization (Peitz and Bieker, 2021). The authors analyse the interplay between the following:

1. Quantising the action space A
2. Simulating a system under a given control/policy

3. Modelling the full system given a partial/full observation of the state space

4. Optimising the control/policy

It may also be fruitful to integrate existing algorithms into this ESN learning framework. In particular, the linear upper confidence bound (linUCB) algorithm (Sutton and Barto, 2015) has a linear structure that fits cleanly into the linear training framework of the ESN. Furthermore, in cases where practitioners prefer to use other recurrent neural networks (RNNs), like Long Short-Term Memory networks (LSTMs), the rigorous theory of ESNs should prove useful in architecture design. Saxe et al. (2011) have shown that different deep neural network architectures can be ranked by randomly initialising the internal weights and training only the outer weights by linear regression. Once the best performing architecture (with random internal weights) has been identified, the authors then train the internal weights of the highest ranking architecture. This is much faster than training the internal weights (a nonlinear problem) for every architecture. The ranking of architectures with random internal weights closely approximates the ranking of architectures with optimised internal weights. From our point of view, the authors are essentially approximating fully trained networks with (non-recurrent) ESNs. We conclude that research into ESNs with random internal weights reveals something about RNNs with optimised internal weights.
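To make the linUCB remark concrete, here is a hedged sketch of how a linUCB-style update could be run on top of reservoir states: the candidate next reservoir state for each trial action plays the role of the feature vector, and the usual ridge-regression statistics and confidence bonus are maintained for the linear readout. This is a speculative illustration of the future-work direction, not an algorithm from the paper; `ucb_alpha`, `ridge`, and the class name are our own.

```python
import numpy as np

class ReservoirLinUCB:
    """Sketch: linUCB with ESN reservoir states as d-dimensional features.

    Each candidate action is scored by the ridge estimate plus an
    upper-confidence bonus computed from the would-be next reservoir state.
    Illustrative only; hyperparameters are assumptions.
    """

    def __init__(self, d, ucb_alpha=1.0, ridge=1.0):
        self.A = ridge * np.eye(d)      # Gram matrix of observed features
        self.b = np.zeros(d)            # feature-weighted rewards
        self.ucb_alpha = ucb_alpha

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b          # ridge-regression readout
        return theta @ x + self.ucb_alpha * np.sqrt(x @ A_inv @ x)

    def select(self, candidate_states):
        # candidate_states: array of shape (n_actions, d) of would-be reservoir states
        scores = [self.score(x) for x in candidate_states]
        return int(np.argmax(scores))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```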
Acknowledgements

Allen Hart and Kevin Olding are supported by scholarships from the EPSRC Centre for Doctoral Training in Statistical Applied Mathematics at Bath (SAMBa), under the project EP/L015684/1. We thank Jeremy Worsfold for insights about reinforcement learning and the linUCB algorithm, and for refactoring the Bee World code. We also thank Adam White for useful discussions about reinforcement learning settings.
References
Marco Avellaneda and Sasha Stoikov. High-frequency trading in a limit order book. Quantitative Finance, 8(3):217–224, 2008. doi: 10.1080/14697680701381228.

A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, 1990.

Vivek S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency, 2009.

J. Bozsik and Z. Ilonczai. Echo state network-based credit rating system. In , pages 185–190, 2012.

Zaiwei Chen, Sheng Zhang, Thinh T. Doan, Siva Theja Maguluri, and John-Paul Clarke. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv:1905.11425, 2019.

Jingpei Dan, Wenbo Guo, Weiren Shi, Bin Fang, and Tingping Zhang. Deterministic echo state networks based stock price forecasting. Abstract and Applied Analysis, 2014:137148, 2014. doi: 10.1155/2014/137148.

Lukas Gonon, Lyudmila Grigoryeva, and Juan-Pablo Ortega. Approximation bounds for random neural networks and reservoir systems. arXiv:2002.05933, 2020.

Lyudmila Grigoryeva and Juan-Pablo Ortega. Echo state networks are universal. Neural Networks, 108:495–508, 2018. doi: 10.1016/j.neunet.2018.08.025.

Lyudmila Grigoryeva and Juan-Pablo Ortega. Differentiable reservoir computing. Journal of Machine Learning Research, 20(179):1–62, 2019.

Olivier Guéant. Optimal market making. Applied Mathematical Finance, 24(2):112–154, 2017. doi: 10.1080/1350486X.2017.1342552.

Allen Hart, James Hook, and Jonathan Dawes. Embedding and approximation theorems for echo state networks. Neural Networks, 128:234–247, 2020a. doi: 10.1016/j.neunet.2020.05.013.

Allen G. Hart, James L. Hook, and Jonathan H. P. Dawes. Echo state networks trained by Tikhonov least squares are L2 approximators of ergodic dynamical systems. arXiv:2005.06967, 2020b.

Herbert Jaeger. The "echo state" approach to analysing and training recurrent neural networks. GMD Report 148, German National Research Institute for Computer Science, 2001.

Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004. doi: 10.1126/science.1091277.

Xiaowei Lin, Zehong Yang, and Yixu Song. Short-term stock price prediction based on echo state networks. Expert Systems with Applications, 36(3, Part 2):7313–7317, 2009. doi: 10.1016/j.eswa.2008.09.049.

Mantas Lukoševičius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009. doi: 10.1016/j.cosrev.2009.03.005.

Mantas Lukoševičius, Herbert Jaeger, and Benjamin Schrauwen. Reservoir computing trends. Künstliche Intelligenz, 26(4):365–371, 2012.

Wolfgang Maass, Thomas Natschläger, and Henry Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002. doi: 10.1162/089976602760407955.

L. Maciel, F. Gomide, D. Santos, and R. Ballini. Exchange rate forecasting using echo state networks for trading strategies. In , pages 40–47, 2014.

Kevin McGoff, Sayan Mukherjee, and Natesh Pillai. Statistical inference for dynamical systems: A review. Statistics Surveys, 9:209–252, 2015. doi: 10.1214/15-SS111.

Francisco S. Melo and M. Isabel Ribeiro. Q-learning with linear function approximation. In Nader H. Bshouty and Claudio Gentile, editors, Learning Theory, pages 308–322, Berlin, Heidelberg, 2007. Springer.

Sebastian Peitz and Katharina Bieker. On the universal transformation of data-driven models to control systems. arXiv:2102.04722, 2021.

A. Rodan and P. Tino. Minimum complexity echo state network. IEEE Transactions on Neural Networks, 22(1):131–144, 2011.

Andrew Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Ng. On random weights and unsupervised feature learning. In Proceedings of the 28th International Conference on Machine Learning, 2011.

Matthew Schlegel, Andrew Jacobsen, Muhammad Zaheer, Andrew Patterson, Adam White, and Martha White. General value function networks. arXiv:1807.06763, 2018.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2015.

István Szita, Viktor Gyenes, and András Lőrincz. Reinforcement learning with echo state networks. In Stefanos D. Kollias, Andreas Stafylopatis, Włodzisław Duch, and Erkki Oja, editors, Artificial Neural Networks – ICANN 2006, pages 830–839, Berlin, Heidelberg, 2006. Springer.

Fabian Triefenbach, Azarakhsh Jalalvand, Benjamin Schrauwen, and Jean-Pierre Martens. Phoneme recognition with large hierarchical reservoirs. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2307–2315. Curran Associates, Inc., 2010.

Huaguang Zhang, Jiuzhen Liang, and Zhilei Chai. Stock prediction based on phase space reconstruction and echo state networks. Journal of Algorithms & Computational Technology, 7(1):87–100, 2013. doi: 10.1260/1748-3018.7.1.87.