Learning to swim in potential flow
Yusheng Jiao, Feng Ling, Sina Heydari, and Eva Kanso∗
Department of Aerospace and Mechanical Engineering, University of Southern California, Los Angeles, CA 90089
∗ [email protected]; https://sites.usc.edu/kansolab/
Nicolas Heess and Josh Merel
DeepMind, London
Fish swim by undulating their bodies. These propulsive motions require coordinated shape changes of a body that interacts with its fluid environment, but the specific shape coordination that leads to robust turning and swimming motions remains unclear. We propose a simple model of a three-link fish swimming in a potential flow environment, and we use model-free reinforcement learning to arrive at optimal shape changes for two swimming tasks: swimming in a desired direction and swimming towards a known target. This fish model belongs to a class of problems in geometric mechanics, known as driftless dynamical systems, which allow us to analyze the swimming behavior in terms of geometric phases over the shape space of the fish. These geometric methods are less intuitive in the presence of drift. Here, we use the shape space analysis as a tool for assessing, visualizing, and interpreting the control policies obtained via reinforcement learning in the absence of drift. We then examine the robustness of these policies to drift-related perturbations. Although the fish has no direct control over the drift itself, it learns to take advantage of the presence of moderate drift to reach its target.
Keywords: Fish swimming, Reinforcement learning, Sensorimotor control.
I. INTRODUCTION
Fish swim through interactions of body deformations with the fluid environment. A fish assimilates sensory information from its body and the external environment and produces patterns of muscle activation that result in desired body deformations; see Fig. 1A. How these sensorimotor decisions are enacted at the physiological level, at the level of neuronal circuits, remains unclear [1–4]. Animal models, such as the zebrafish Danio rerio [5–7], as well as robotic and mathematical models [8, 9], provide valuable insights into the sensorimotor control underlying fish behavior. Such understanding offers enticing paradigms for the design of artificial soft robotic systems in which the control is embodied in the physics [10, 11].

Embodied systems sense and respond to their environment through a physical body [12, 13]; physical interactions with the environment are thus vital for sensing and control. In fish, the dynamics of the fluid environment is essential both at an evolutionary time scale – in shaping body morphologies and sensorimotor modalities – and at a behavioral time scale. Fish bodies are tuned to exploit flows [14, 15]. Body designs and undulatory motions have been examined in computational models of fluid-structure interactions [16–19], including models of body stiffness and neuromechanical control [20, 21]. The fish's ability to sense minute water motions is attributed to a lateral-line mechanosensory system [22, 23] involved in behaviors ranging from rheotaxis [24–26] to schooling [27]. Recent developments prove that machine learning techniques are highly effective in addressing problems of flow sensing and fish behavior [28–34].

A central problem in fish behavior, which is also relevant for underwater robotic systems, is gait design or motion planning: what body deformations produce a desired swimming objective? The answer requires an understanding of how the numerous biomechanical degrees of freedom of the fish body are coordinated to achieve the common objective. Mathematically, this problem is often expressed in terms of an optimality principle: find control laws that optimize a desired objective, such as maximizing swimming speed or minimizing energetic cost [17, 35–37]. But how these control laws are implemented in the nervous system, and how they are acquired via learning algorithms, are typically beyond the scope of such methods. As these optimization methods rely on an internal model of the dynamics, different computational results have been obtained by varying the specification of the physical model of the fish, the performance metric, and the imposed control constraints [17, 35–38].

Model-free reinforcement learning (RL) of embodied systems offers an alternative framework for gait design that is mathematically and computationally tractable [40, 41]. In this framework, the fish world is divided into a body controlled by a learning agent and an environment that encompasses everything outside of what the agent can control.
Figure 1. Model-free reinforcement learning and the three-link fish. A. Illustration of sensorimotor feedback loops in fish. Motor commands generated in the nervous system activate the musculoskeletal system, resulting in deformations of the body. Body deformations, through interaction with the fluid environment, lead to swimming, while sensory modalities provide sensory feedback to the nervous system. The dashed arrow between the musculoskeletal and sensory systems indicates somatic sensing used to assess whether previous motor commands were successfully executed. Other reflexive or preflex signals could also be at play [2, 22, 39]. B. Three-link fish swimming in a quiescent fluid. The locomotion variables (x, y, β) are set in a lab-fixed frame, while the shape variables (α₁, α₂) and target variables (ρ, ψ) are set in a body frame, symbolizing egocentric control and learning. C. To apply model-free reinforcement learning to our problem, we only need to set the appropriate state, observation, action, and reward variables based on our fish model. The agent can be viewed as an abstract representation of the parts of the fish responsible for sensorimotor decisions.

In RL, the agent must learn from experiences in a trial-and-error fashion. Specifically, repeated interactions of the body with the environment enable the agent to collect sensory observations, control actions, and associated rewards. The goal of the agent is thus to learn to produce behavior that maximizes rewards, and the process is model-free when learning does not make use of either a priori or developed knowledge of the physics of the system. The learned feedback control law, called a policy, is essentially a mapping from sensory observations to control actions. This mapping is nonlinear and stochastic, and, by construction, rather than providing a single optimal trajectory, it can be applied to any initial condition and transferred to conditions other than those seen during training, such as when the body or fluid environment is perturbed [42].

Here, we employ RL to design swimming gaits. We use an idealization in which the fish is modeled as an articulated body consisting of three links, with the front and rear links free to rotate relative to the middle link via hinge joints [35, 36, 43–46]. In describing the physics of the fish, we cede the complexity of accounting for the full details of the fluid medium in favor of considering momentum exchange between the articulated body and the surrounding fluid in the context of a potential flow model [35, 36, 44, 45]. This model is a canonical example of a class of under-actuated control problems whose dynamics can be described over the actuation (shape) space using tools from geometric mechanics [35, 47–49]. Specifically, the fish swimming motion can be represented by the sum of a dynamic phase, or drift, and a geometric phase over the shape space of all body deformations [36]. From a motion control perspective, this model is advantageous in that it allows for the use of geometric mechanics tools for gait analysis and manual design of control policies over the full shape space [46]. These geometric tools also provide an intuitive way for direct and interpretable visualizations of the RL-based policies.

II. MATHEMATICAL MODEL OF THE THREE-LINK FISH
Consider a three-link fish as shown in Fig. 1(B). Rotations of the front and rear links relative to the middle link are denoted by the angles α₁ and α₂, such that (α₁, α₂) fully describes all possible body deformations. We constrain the swimming motions to a two-dimensional plane, and let (x, y) and β denote the net planar displacement and rotation of the fish body, such that $(\dot{x}, \dot{y})$ and $\dot{\beta}$ represent the linear and rotational velocities of the fish in the inertial frame (the dot notation $\dot{(\;)} = \mathrm{d}(\;)/\mathrm{d}t$ represents differentiation with respect to time t). We also introduce the linear velocity (v_x, v_y) expressed in a body frame attached to the center of the middle link,

$$v_x = \dot{x}\cos\beta + \dot{y}\sin\beta, \qquad v_y = -\dot{x}\sin\beta + \dot{y}\cos\beta. \tag{1}$$

The total linear momentum (p_x, p_y) and total angular momentum π of the body-fluid system are expressed in the inertial frame, and their counterparts (P_x, P_y) and Π are expressed in the body-fixed frame,

$$p_x = P_x\cos\beta - P_y\sin\beta, \qquad p_y = P_x\sin\beta + P_y\cos\beta, \qquad \pi = \Pi. \tag{2}$$

In potential flow, it is a known result that the total momenta of the body-fluid system can be expressed in terms of the body geometry and velocity, via the so-called added mass matrices [35, 36, 50]. Expressions for the added mass matrices of the three-link fish are derived in detail in Appendix A. The total momenta (P_x, P_y) and Π are given by

$$\begin{pmatrix} P_x \\ P_y \\ \Pi \end{pmatrix} = \mathbb{I}_{\rm lock} \begin{pmatrix} v_x \\ v_y \\ \dot{\beta} \end{pmatrix} + \mathbb{I}_{\rm couple} \begin{pmatrix} \dot{\alpha}_1 \\ \dot{\alpha}_2 \end{pmatrix}, \tag{3}$$

where $\mathbb{I}_{\rm lock}$ is the locked mass matrix at a given shape of the fish (see Eqs. A6–A9) and $\mathbb{I}_{\rm couple}$ is the mass matrix associated with shape deformations (see Eq. A10).

In the absence of external forces and moments on the fish-fluid system, the total momenta (p_x, p_y) and π of the body-fluid system in the inertial frame are conserved for all time. Conservation of the total momenta yields, upon inverting (2) and substituting into (3),

$$\begin{pmatrix} v_x \\ v_y \\ \dot{\beta} \end{pmatrix} = \mathbb{I}_{\rm lock}^{-1} \begin{pmatrix} \cos\beta & \sin\beta & 0 \\ -\sin\beta & \cos\beta & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} p_x \\ p_y \\ \pi \end{pmatrix} - \mathbb{I}_{\rm lock}^{-1}\, \mathbb{I}_{\rm couple} \begin{pmatrix} \dot{\alpha}_1 \\ \dot{\alpha}_2 \end{pmatrix}. \tag{4}$$

If we further substitute (1) into (4), we arrive at three coupled first-order equations of motion for x, y, β given α₁ and α₂. The control problem consists of finding the time evolution of shape changes (α₁(t), α₂(t)) that achieve a desired locomotion task (x(t), y(t)) and β(t). Specifically, a swimming gait is defined as a cyclic shape change (α₁(t), α₂(t)), with period T, that results in a net swimming (x(t), y(t)) or turning β(t) of the fish body.

This model is a canonical example of a class of under-actuated control problems whose dynamics can be described over the shape space using tools from geometric mechanics [35, 47–49]. On the right-hand side of (4), the first term represents a dynamic phase or drift, and the second term represents a geometric phase over the fish shape space (α₁, α₂) [36]. The geometric phase is best described in terms of the local connection matrix A [36, 46], which is a function only of the shape variables α₁ and α₂,

$$\mathbf{A} = \begin{pmatrix} A_{x1} & A_{x2} \\ A_{y1} & A_{y2} \\ A_{\beta 1} & A_{\beta 2} \end{pmatrix} := -\mathbb{I}_{\rm lock}^{-1}\, \mathbb{I}_{\rm couple}. \tag{5}$$

Each row of A describes a nonlinear vector field over the shape space, giving rise to three vector fields A_x ≡ (A_{x1}, A_{x2}), A_y ≡ (A_{y1}, A_{y2}), and A_β ≡ (A_{β1}, A_{β2}) over the (α₁, α₂) plane, as shown in Fig. 2(A).

In driftless systems, net locomotion is fully controlled by the fish shape changes. However, in the presence of drift, shape control is not sufficient to determine the fish motion in the physical space, which is now affected by the drift term in (4).
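To make the structure of (4) concrete, the following minimal Python sketch integrates the reconstruction equation for a prescribed circular gait at zero total momentum. The routines `I_lock` and `I_couple` are illustrative placeholders standing in for the added mass matrices of Appendix A, and the gait amplitude is arbitrary; this is a sketch of the computation, not the implementation used in the paper.

```python
# Sketch: integrating eq. (4) for a prescribed shape trajectory at zero momentum.
# I_lock(alpha) and I_couple(alpha) below are hypothetical stand-ins for the
# added mass matrices derived in Appendix A.
import numpy as np
from scipy.integrate import solve_ivp

def I_lock(alpha):        # 3x3 locked mass matrix at shape alpha = (a1, a2); placeholder
    a1, a2 = alpha
    return np.diag([2.0 + np.cos(a1) + np.cos(a2), 3.0, 1.0])

def I_couple(alpha):      # 3x2 shape-coupling matrix; placeholder
    a1, a2 = alpha
    return np.array([[-np.sin(a1), np.sin(a2)],
                     [ np.cos(a1), np.cos(a2)],
                     [ 1.0,       -1.0      ]])

def alpha_of_t(t):        # prescribed circular gait in the shape space
    return np.array([0.7 * np.cos(t), 0.7 * np.sin(t)])

def alpha_dot_of_t(t):
    return np.array([-0.7 * np.sin(t), 0.7 * np.cos(t)])

def rhs(t, q, p_inertial):
    """q = (x, y, beta); p_inertial = (p_x, p_y, pi) is conserved in eq. (4)."""
    x, y, beta = q
    R = np.array([[ np.cos(beta), np.sin(beta), 0.0],
                  [-np.sin(beta), np.cos(beta), 0.0],
                  [ 0.0,          0.0,          1.0]])
    alpha = alpha_of_t(t)
    Ilock_inv = np.linalg.inv(I_lock(alpha))
    # body-frame velocities (v_x, v_y, beta_dot) from eq. (4)
    v = Ilock_inv @ (R @ p_inertial) - Ilock_inv @ I_couple(alpha) @ alpha_dot_of_t(t)
    vx, vy, beta_dot = v
    # rotate the linear velocity back to the inertial frame, inverting eq. (1)
    return [vx * np.cos(beta) - vy * np.sin(beta),
            vx * np.sin(beta) + vy * np.cos(beta),
            beta_dot]

sol = solve_ivp(rhs, (0.0, 2 * np.pi), [0.0, 0.0, 0.0],
                args=(np.zeros(3),), max_step=0.01)   # zero momentum: driftless case
print(sol.y[:, -1])       # net (x, y, beta) after one gait cycle
```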
III. GEOMETRIC PHASES
Geometric phases are defined as the net locomotion (x, y, β) that results from prescribed cyclic shape changes in the (α₁, α₂) plane at zero total momentum (no drift).

Figure 2. Using the connection matrix for simple gait design. A and B. The rows of the connection matrix A give three vector fields, A_x, A_y, and A_β, with components $A_{xi} = \partial v_x/\partial\dot\alpha_i$, $A_{yi} = \partial v_y/\partial\dot\alpha_i$, and $A_{\beta i} = \partial\dot\beta/\partial\dot\alpha_i$; the curls of these vector fields yield three corresponding scalar fields. The magnitudes of the scalar fields are normalized to lie within [−1, 1]. The example swimming (black) and turning (green) gaits, and the three initial shapes marked by ◦, △, □, are sketched for additional clarity. C and D. Fish that start with the body centered at the origin of the x-y plane and follow the same gait circle can still swim/turn in different directions in the physical space, depending on their initial body shapes.
Inverting (1) and substituting (5) into (4) at zero total momentum, we arrive at

$$\begin{pmatrix} \cos\beta & \sin\beta & 0 \\ -\sin\beta & \cos\beta & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \dot{x} \\ \dot{y} \\ \dot{\beta} \end{pmatrix} = \begin{pmatrix} A_{x1} & A_{x2} \\ A_{y1} & A_{y2} \\ A_{\beta 1} & A_{\beta 2} \end{pmatrix} \begin{pmatrix} \dot{\alpha}_1 \\ \dot{\alpha}_2 \end{pmatrix}. \tag{6}$$

Motions (x, y, β) in the physical space are obtained by integrating (6) with respect to time. Rotations are directly proportional to the line integral of the vector field A_β, as evident by integrating the last equation in (6),

$$\beta(T) - \beta(0) = \oint_C \mathrm{d}\beta = \oint_C \left( A_{\beta 1}\, \mathrm{d}\alpha_1 + A_{\beta 2}\, \mathrm{d}\alpha_2 \right). \tag{7}$$

Here, T is the time period for going around the closed trajectory C in the shape space once. Using Green's theorem, we get

$$\beta(T) - \beta(0) = \iint \left( \frac{\partial A_{\beta 2}}{\partial \alpha_1} - \frac{\partial A_{\beta 1}}{\partial \alpha_2} \right) \mathrm{d}\alpha_1\, \mathrm{d}\alpha_2 = \iint \mathrm{curl}(A_\beta)\, \mathrm{d}\alpha_1\, \mathrm{d}\alpha_2. \tag{8}$$

The scalar field curl(A_β) provides an intuitive tool for understanding the effect of a cyclic shape deformation on the net rotation of the fish: net rotations are proportional to the integral of curl(A_β) over the area enclosed by a closed shape trajectory. However, translational motions (x, y) are not directly proportional to the area integrals of curl(A_x) and curl(A_y), but to a combination of all three integrals coupled through the fish's rotational dynamics, as evident from (6). Despite this complication, the scalar fields defined by curl(A_x), curl(A_y), and curl(A_β), shown in Fig. 2(B), are informative of the net translational (x, y) and rotational β motions of the fish. To illustrate the utility of these curl fields, we show two examples of cyclic shape changes, depicted by black and green lines. A fish changing its shape following the black line undergoes zero net rotation, because the area integral of curl(A_β) is identically zero, but it swims forward in the (x, y) plane. The net displacement per period is a conserved quantity, whereas the direction of motion depends on a combination of the fish's initial shape (α₁(0), α₂(0)) and initial orientation β(0), as shown in Fig. 2(C). Here, we consider three different initial shapes, depicted by the markers ◦, △, □, all at β(0) = 0. Similarly, shape deformations following the green line lead to net reorientations in the physical space, as shown in Fig. 2(D). Evidently, no net motion occurs if the shape trajectory is degenerate, that is, if it does not enclose an area in the shape space. Further, a re-scaling of time does not affect the net motion of the fish, only the speed at which the fish completes these cyclic shape changes.

In Fig. 2 and hereafter, the equations of motion are scaled using the total length of the fish and the total mass in the head-to-tail direction of the straight fish as the characteristic length and mass scales. Specifically, we set the total mass of the fish body equal to the added mass (the actual mass of the fish is negligible). We leave the time scale unchanged because the characteristic time depends on the speed of shape changes, which is a control parameter to be determined by the controller.
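As a quick numerical illustration of (7), the sketch below evaluates the line integral of A_β around circular gaits in the shape space. The connection components used here are illustrative placeholders, not the fish's actual fields: a gait whose enclosed region integrates curl(A_β) to zero produces no net rotation, while a suitably off-centered gait does.

```python
# Sketch: net rotation over one gait cycle from the line integral in eq. (7).
# The components A_beta1, A_beta2 below are illustrative placeholders.
import numpy as np

def A_beta(a1, a2):
    # placeholder connection components; curl(A_beta) = 2 sin(a1) sin(a2)
    return np.sin(a1) * np.cos(a2), -np.cos(a1) * np.sin(a2)

def net_rotation(center, radius, n=2000):
    """Line integral of A_beta around a circular gait in the (a1, a2) plane."""
    s = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    a1 = center[0] + radius * np.cos(s)
    a2 = center[1] + radius * np.sin(s)
    da1 = -radius * np.sin(s) * (2.0 * np.pi / n)   # d a1 along the gait
    da2 =  radius * np.cos(s) * (2.0 * np.pi / n)   # d a2 along the gait
    Ab1, Ab2 = A_beta(a1, a2)
    return np.sum(Ab1 * da1 + Ab2 * da2)

# A gait centered at the origin encloses equal positive and negative curl(A_beta):
print(net_rotation(center=(0.0, 0.0), radius=1.0))   # ~0: no net turning
print(net_rotation(center=(0.8, 0.8), radius=0.5))   # nonzero: a turning gait
```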
IV. MOTION CONTROL VIA REINFORCEMENT LEARNING

The scalar fields curl(A_x), curl(A_y), curl(A_β) over the shape space encode information about the net locomotion of the fish in a driftless environment, and can be used to design simple swimming and turning gaits as shown in Fig. 2. However, these geometric tools do not allow for a straightforward design of control policies for arbitrary motion planning [44, 46], and they are even less instructive in the presence of drift. Here we consider an RL-driven approach for motion control.

We use RL to train the three-link fish on two different tasks: (i) swim in a desired direction in a driftless environment; (ii) swim towards a target point located at a distance ρ and angular position ψ from the fish nose, with ρ and ψ expressed in the fish frame of reference (Fig. 1B). In the first task, the desired direction is fixed to be parallel to the x-axis without loss of generality. For the second task, we first train the fish in a driftless environment, then introduce drift and train again in the presence of drift. The first task allows for a direct comparison of the performance of the trained policy to manually-designed policies in the context of geometric mechanics. The second task allows for evaluation and comparison of the performance of the driftless and drift-aware policies under environmental perturbations.

Central to any RL implementation are the notions of the state of the system, the observations given to the learning agent, the actions taken by the agent, and the rewards given to the agent in light of its behavior. The state s_t of the fish-fluid-target system at a time t is given by the fish position and orientation in the inertial frame (x, y, β), its shape (α₁, α₂), and the target position relative to the fish (ρ, ψ); see Fig. 1(B,C). As sensory input, we provide the fish with a set of proprioception-based observations of its shape variables α₁ and α₂, as well as an egocentric observation of the task: for controlling the direction of swimming, the fish knows the desired swimming direction relative to itself, ψ = −β, and for swimming towards a target, it knows the angular position ψ of the target point relative to itself. This yields a set of observations o_t = (α₁, α₂, ψ)_t. Additionally, for training in the presence of drift, the magnitude and direction of the drift are also provided as observations. As control action, the fish has direct control of its shape, using the rates of shape change as actions a_t = $(\dot\alpha_1, \dot\alpha_2)_t$. With this choice of action, the control can be projected onto the shape space and directly compared to the geometric mechanics approach. We constrain the values of the actions to a fixed symmetric interval, and the shape angles α₁ and α₂ are only allowed to change within a symmetric range about the straight configuration.

The fish is controlled by a stochastic policy π_θ(a_t | o_t) that produces actions a_t given observations o_t of the state of the fish-environment system. The policy is parameterized by a set of parameters θ to be optimized. An optimal policy is learned to produce behavior that maximizes rewards. We use a dense shaping reward, that is, the fish is given a reward at every decision time step. Specifically, we set the reward to be the distance the fish travels in the desired direction or towards the target state. For learning to swim parallel to the x-axis, we use the reward r_t = x_{nose,t+1} − x_{nose,t}, which is the change in the fish nose position along the x-axis. For learning to swim towards a target, the reward r_t = ρ_t − ρ_{t+1} is based on the change in the relative distance ρ from the fish nose to the target.
The return $R_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$ is defined as the infinite-horizon objective based on the sum of discounted future rewards, where γ ∈ [0, 1] is known as the discount factor; it determines the preference for immediate over future rewards. We set γ = 0.99 to make the fish foresighted. The goal is to arrive at an optimal set of parameters θ that maximizes the expected return $J(\pi_\theta) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]$ for a distribution of initial states. Here, the expectation is taken with respect to the distribution over trajectories π(τ) induced jointly by the fish dynamics, viewed as a partially-observable Markov decision process, and the policy π_θ(a_t | o_t). One approach to solving this optimization problem is to use a policy gradient method that computes an estimate of the gradient ∇_θ J for learning. Policy gradient methods are widely used to learn complex control tasks and are often regarded as among the most effective reinforcement learning techniques, especially for robotics applications [51–55]. Here, we use a specific class of policy gradient methods, known as actor-critic methods [56, 57], where the fish learns simultaneously a policy (actor) and a value function (critic). We implement this method using the clipped advantage Proximal Policy Optimization (PPO) algorithm proposed in [58]. This algorithm ensures fast learning and robust performance by limiting the amount of change allowed for the policy within one update. A pseudo-code implementation of the PPO algorithm and additional implementation details are provided in Appendix B.
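For concreteness, the sketch below shows one way the observation-action-reward interface for the target-seeking task described above could be wired up. The dynamics update `step_fish`, the decision time step, and the reward measured from the body center (the paper measures distance from the nose) are assumptions made for illustration only.

```python
# Sketch of the agent-environment interface for the target-seeking task.
# `step_fish` is a hypothetical stand-in for integrating eq. (4) over one
# decision step; here the reward uses the body center rather than the nose.
import numpy as np

class TargetSeekingEnv:
    def __init__(self, step_fish, dt=0.1):
        self.step_fish = step_fish   # advances (x, y, beta, a1, a2) given (a1_dot, a2_dot)
        self.dt = dt                 # assumed decision time step

    def reset(self):
        self.state = np.zeros(5)                       # x, y, beta, a1, a2
        theta = np.random.uniform(0.0, 2.0 * np.pi)    # target at distance 3, random bearing
        self.target = 3.0 * np.array([np.cos(theta), np.sin(theta)])
        return self._observe()

    def _observe(self):
        x, y, beta, a1, a2 = self.state
        rel = self.target - np.array([x, y])
        psi = np.arctan2(rel[1], rel[0]) - beta        # egocentric bearing of the target
        return np.array([a1, a2, np.arctan2(np.sin(psi), np.cos(psi))])

    def _rho(self):
        return np.linalg.norm(self.target - self.state[:2])

    def step(self, action):
        rho_before = self._rho()
        self.state = self.step_fish(self.state, action, self.dt)
        reward = rho_before - self._rho()              # r_t = rho_t - rho_{t+1}
        return self._observe(), reward
```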
V. TRAINING THE FISH TO SWIM

We train the three-link fish to (i) swim parallel to the x-direction in a driftless environment; (ii) swim towards a target point in the absence and presence of drift. Note that the first task can be viewed as a special case of the second task as the target location goes to infinity. We refer to the first task as direction-control for short, and to the two variants of the second task as naive and drift-aware target-seeking, respectively, based on their awareness of drift.

For the purpose of efficient training, we impose a finite time interval, following which the system state is reset to the initial state for a new round of training. Each round is referred to as an episode. In all training episodes, we initialize the fish center at the origin of the inertial frame, and we initialize the shape angles α₁, α₂ and body orientation β by sampling from a uniform distribution over all permissible angles to maximize the chances of robust learning. We fix the maximum episode length to 150 time steps, with no early termination allowed. In the target-seeking policies, the target is initially placed at a fixed distance (three unit lengths) from the fish center but at a random orientation. For the drift-aware policy, the drift magnitude is sampled from a uniform distribution between 0 and 0.15, and its direction is sampled from a uniform distribution between 0 and 2π. Here, the system is non-dimensional such that one unit of drift coming from behind will drive a straight fish to move forward one unit length per unit time. The episode initialization is sketched in code after this section.

For the direction-control policy, we perform 24 runs of RL training with 10,000 episodes in each run. The training process is illustrated in terms of reward evolution in Appendix B, Fig. B.1(A). Here, the reward is calculated by taking the sum of all rewards in an episode. There is some variability among the seeds, but most trained policies perform very well; only a single policy did not converge by the end of the training episodes. Note that fluctuations in reward after policy convergence are partly due to the stochasticity built into the policy itself and partly due to variation in task difficulty given the random initial conditions: different initial conditions require different amounts of time and effort for the fish body to align with the x-axis.

It is worth pointing out that the training results of the direction-control policy are affected significantly by the episode length. In order to swim in a desired direction starting from an arbitrary initial orientation, the fish has to first turn in that direction, then swim forward. Policies trained with longer episodes perform better in the swimming portion of the trajectory but fail to make large-angle turns, as training data collected on swimming significantly outweigh those collected on turning. On the other hand, policies trained with shorter episodes can make turns of any angle, but are less likely to swim straight after turning. We chose the episode length 150 as a compromise.

The evolution of rewards during training of the two target-seeking policies is plotted in Fig. B.1(B), each with 20 runs of the RL algorithm and 15,000 episodes in each run. The naive policy converges faster than the drift-aware policy, and both policies converge more slowly than the direction-control policy. These results indicate that the task itself, as well as variations in the environment and the number of observations, affects the convergence rate, that is, the learning difficulty. Note that the numerical value of the reward is not directly comparable between policies for different tasks.
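The episode initialization described above can be sketched as follows; the bound `alpha_max` on the initial shape angles is an assumed placeholder for the permissible shape range, while the target distance and drift ranges are taken from the text.

```python
# Sketch of the episode initialization used during training; alpha_max is an
# assumed bound standing in for the permissible range of shape angles.
import numpy as np

def sample_episode(rng, alpha_max=np.pi / 2, drift_aware=False):
    init = {
        "alpha": rng.uniform(-alpha_max, alpha_max, size=2),   # initial shape
        "beta": rng.uniform(-np.pi, np.pi),                    # initial orientation
        "target_bearing": rng.uniform(0.0, 2.0 * np.pi),       # target 3 units from center
    }
    if drift_aware:
        init["drift_magnitude"] = rng.uniform(0.0, 0.15)
        init["drift_direction"] = rng.uniform(0.0, 2.0 * np.pi)
    return init

episodes = [sample_episode(np.random.default_rng(s), drift_aware=True) for s in range(20)]
```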
VI. BEHAVIOR OF TRAINED FISH

We evaluate the performance of the trained policies by testing them under two types of conditions: conditions similar to those seen during training and perturbed conditions not seen during training.

To visualize the RL direction-control policy, we plot in Fig. 3(A) the action vector fields $(\dot\alpha_1, \dot\alpha_2)$ over the observation space (α₁, α₂, β). These vector fields depend on the orientation β of the fish in the physical space, such that the control policy $(\dot\alpha_1, \dot\alpha_2)$ forms a foliation over β. Three slices of this foliation are highlighted. The right-hand side of Fig. 3(A) provides a closer look at the policy slice at β = 0; the arrows indicate the mean actions advised by the policy for a given set of observations α₁, α₂ and β = 0. In Fig. 3(B), we show the details of two trajectories in the physical space starting from two distinct configurations. In the first test, the fish starts at the orientation β(0) = 0 from a straight shape (α₁(0), α₂(0)) = (0, 0); in the second test, it starts from a curled shape at a nonzero orientation, so that it must first turn before swimming parallel to the x-axis. In both cases, the fish is able to turn to the desired direction and swim steadily. In Fig. 3(A), we highlight the corresponding trajectories in the (α₁, α₂, β) space. As the fish moves through the physical space, β changes, leading the fish to take actions from other slices of the foliation of action vector fields. Both trajectories tend to the same periodic swimming cycle around β = 0.
Figure 3. Visualizing the direction-control RL policy. A. Given the direction-control task, we visualize the mean RL policy actions $(\dot\alpha_1, \dot\alpha_2)$ as vector fields in the observation space (α₁, α₂, β). Two example observation trajectories are shown, one starting at β(0) = 0 from the straight shape α₁(0) = α₂(0) = 0 (blue) and one starting from a curled shape at a nonzero orientation. Three slices of the foliation are highlighted, including the slice at β = 0; the inset shows a flattened view of the slice at β = 0. B. Physical space trajectories of the corresponding examples shown in A are plotted together with the fish body configurations at the starting point and at chosen points near the ends.
Figure 4. RL provides a smooth transition between turning and swimming gaits. A. The shape space trajectory of the fish reorienting itself to swim parallel to the x-axis, produced by the mean RL policy, is superimposed on the scalar curl fields from Fig. 2(B). Note that this trajectory starts off-centered and smoothly moves to cycles that are symmetric about the origin. B. The physical space trajectory due to the mean RL policy (red) in comparison to a manually patched turning-to-swimming trajectory (green and black) using the circular gaits in Fig. 2. Without further fine-tuning of the shape of the gaits, the manually patched result shows a more abrupt and unnecessarily large turn. In both simulations, the fish starts in a straight configuration with the same nonzero initial orientation β(0).

The shape space analysis introduced in Section III is very useful here to illustrate the shape deformations produced by the RL control policy. In Fig. 4(A), we superimpose the shape changes produced by the RL policy onto the scalar fields curl A_x, curl A_y and curl A_β over the shape space (α₁, α₂). The corresponding motion in the physical space is depicted in red in Fig. 4(B). The shape deformations produced by the RL policy can be interpreted as consisting of two regimes: an initial turning regime followed by a forward swimming regime. The turning regime is indicated by the initial portion of the shape trajectory enclosing most of the blue area in the curl A_β image; the integral in (7) along this portion of the trajectory is negative, leading to a clockwise rotation. The swimming regime is indicated by the periodic shape changes enclosing the rectangular orange portion in the curl A_x image. The area integral of curl A_x along this portion of the trajectory is positive, whereas the corresponding area integrals of curl A_y and curl A_β are identically zero, leading to net motion in the positive x-direction. These shape deformations and the resulting turning and swimming motions can be compared to a manually-designed shape trajectory based on the turning (green) and swimming (black) gaits in Fig. 2(A,B). Specifically, starting from a straight fish configuration, we follow the solid portion of the green trajectory (turning) in Fig. 2(A,B), and transition into the black trajectory (swimming) at the second intersection of the green and black shape trajectories. The resulting motion in the physical space is superimposed onto Fig. 4(B). In the RL-produced motion, the fish turns more slowly than in the manually patched gait in order to produce more forward motion, which makes the transition between turning and swimming seamless. Note that, in the swimming regime, the policy produces cyclic shape deformations that do not maximize the area integral of curl A_x (the shape trajectory does not enclose the whole orange portion in the curl A_x image).
Figure 5. Racing against the RL fish. To test the optimality of our direction-control RL policy, we compare forward swimming performance between the geometrically-designed gaits and the RL results. In A–C we use rectangular gaits with length equal to 1.2 (small), 2.4 (medium), and 3.6 (large), respectively, and widths all equivalent to that of the shape space trajectory following the mean actions from the RL policy in D. All shape space trajectories are shown superimposed on the curl A_x field on the right. Physical space trajectories on the left show that the mean RL policy achieves its excellent performance by choosing an optimal amount of lateral oscillation during forward swimming, while the small and large rectangular gaits move slower due to either insufficient or overwhelming sideways motion. Note that the fish in A–C are initialized with the same shape but slightly different initial orientations to ensure they all swim exactly in the x-direction. In addition, all fish utilize the maximum actions allowed at each time step.

We next explore why shape deformations according to the optimal policy do not maximize the area integral of curl A_x. The key lies in the fact that the physics of the problem, specifically the rotational motion of the fish, couples displacements in the x- and y-directions.

To better explain this aspect of the RL policy, we manually prescribed cyclic shape changes that follow rectangular trajectories reminiscent of the trajectory generated by the RL policy for forward swimming. The manually-designed shape trajectories all share the same width as the RL policy albeit at different lengths, namely, 1.2 (small), 2.4 (medium), and 3.6 (large), to enclose increasingly larger regions of the orange portion of the curl A_x field; see the right column of Fig. 5. These cyclic deformations result in net displacements in the x-direction with zero-mean excursions in the y-direction. The y-excursions are due to the fact that, even though the area integrals of curl A_y and curl A_β over the regions enclosed by these shape trajectories are identically zero, leading to zero net rotation over a full cycle of shape deformations, the instantaneous rotations β of the fish body couple displacements in the x- and y-directions, as evident from Eqs. (1) and (4). For the cyclic shape changes in Fig. 5(A), the amplitude of the y-excursions is small, but so is the net displacement in the x-direction; meanwhile in Fig. 5(C), both the net displacement in the x-direction and the amplitude of the y-excursions are large. The fastest fish is the one that maximizes forward motion while minimizing lateral movements, as shown in Fig. 5(B) and recapitulated in the RL result shown in Fig. 5(D). It is worth emphasizing that the RL policy arrives at this optimal solution merely by sampling observations, actions, and rewards, with no prior or developed knowledge of the physics of the problem.

Next, we investigate the effect of the initial orientation β(0) ∈ [−π, π] on the amount of control effort required to turn and swim parallel to the x-axis. In Fig. 6(A), we show three examples of fish motion following the trained policy, corresponding to three distinct initial orientations β(0) = 0, π/6, and 5π/6.
Figure 6. Performance of RL policies in a driftless environment. A. Swimming trajectories (center of mass) produced by the mean action of the direction-control policy are shown for fish starting at orientations β = 0 (blue), π/6, and 5π/6. B. The actuation effort of reorientation is roughly proportional to the absolute value of the initial fish orientation, owing to the amount of turning maneuvers required. Results shown are based on 25 stochastic policy roll-outs per tested orientation angle. C. Center of mass (blue) and nose (grey) swimming trajectories using the mean action of the naive target-seeking policy are shown for two targets located at angular positions of π/6 and 5π/6. The fish is considered to have reached the target when its nose is within a small distance ε of the target. D. The actuation effort needed to perform the shape changes increases as the target angular position moves away from 0. Results shown use 25 stochastic policy roll-outs per tested target angle. Note that solid lines and shaded regions of B and D show the median results and the variation between the 25th and 75th percentiles, respectively.

In all examples, the fish starts from a straight configuration at the origin of the physical space. To measure the actuation effort needed in these motions, we use the integral $\int_0^\tau T_{\rm shape}\, \mathrm{d}t$ of the actuation energy $T_{\rm shape} = \frac{1}{2}(J_0 + m_2 a^2)\left(\dot\alpha_1^2 + \dot\alpha_2^2\right)$, which is the energy imparted to the fluid by the fish shape changes; see Appendix A. In Fig. 6(B), we show the actuation effort versus the initial orientation of a straight fish. Here the fish is instructed by the stochastic policy with the same action noise as in the training process. The actuation effort, as well as its variation, is larger for larger turning angles.
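A minimal sketch of the actuation-effort metric defined above, accumulated with a rectangle rule over a sampled action sequence; the constants J0 and m2*a^2 and the decision time step are placeholder values, not the paper's parameters.

```python
# Sketch: actuation effort as the time integral of T_shape over a trajectory.
import numpy as np

def actuation_effort(alpha_dot, dt, J0=1.0, m2a2=1.0):
    """alpha_dot: array of shape (n_steps, 2) holding (a1_dot, a2_dot) samples."""
    T_shape = 0.5 * (J0 + m2a2) * np.sum(alpha_dot**2, axis=1)
    return np.sum(T_shape) * dt          # rectangle-rule estimate of the integral

alpha_dot = np.random.default_rng(0).normal(size=(150, 2))   # placeholder action sequence
print(actuation_effort(alpha_dot, dt=0.1))
```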
We examine the behavior and effort of a fish instructed to reach a known target in an environment with zero drift. Fig. 6(C) shows examples of the fish swimming motion in the physical space for targets located at ψ = π/6 and ψ = 5π/6. All tests run until the fish reaches the target or a maximum interval of 1000 time steps is exceeded. We vary the target angular position while maintaining the fish initial shape and orientation (the fish always starts straight, heading in the x-direction), and we calculate the actuation effort as a function of the target orientation; see Fig. 6(D). The actuation effort increases as the target moves from the front to the back of the fish, because larger turns are required for the fish to align its heading direction with the direction of the target; this is consistent with our findings based on the direction-control policy. It is worth emphasizing that the direction-control task is equivalent to the target-seeking task with the target placed at x = +∞.

Lastly, we test the behavior and effort of a target-seeking fish in the presence of non-zero drift. Fig. 7(A) and (C) show a comparison between the naive and drift-aware policies for targets located at ψ = π/6 and ψ = 5π/6, with drift of magnitude 0.1 pointing in the x-direction and the −y-direction, respectively. The naive policy (grey lines) is able to reach the target, even though it does not directly observe the drift, albeit following different actions and swimming trajectories than the drift-aware policy (orange lines). Specifically, when following the drift-aware policy, the fish curls less when the drift is helpful and curls more when the drift is unfavorable. We assess the performance of the two policies for various drift magnitudes and directions. In Fig. 7(B), we calculate the actuation effort as a function of the drift magnitude with a fixed drift direction for two target locations.
Figure 7. Performance of target-seeking policies in the presence of drift. We compare the swimming trajectories and actuation efforts of the target-seeking policies for different drift magnitudes and directions. A and C. The naive policy (grey) and the drift-aware policy (orange) produce similar average actions in environments with constant (0.1) drift in the positive x-direction and negative y-direction, respectively. Both panels showcase trajectories reaching targets located three unit lengths away at angular positions ψ of π/6 and 5π/6. B. For a fixed drift direction (positive x-direction), the actuation effort as a function of drift magnitude evolves differently depending on the target angular position ψ. When the drift opposes the direction of the target (left), both policies fail on average to reach the target for large enough drift (cross marks). No failures are observed when the drift facilitates swimming towards the target (right). In both cases, the drift-aware policy significantly outperforms the naive policy at large drift magnitudes by saving actuation effort. Intriguingly, the inclusion of extra observations in the drift-aware policy seems to result in slightly suboptimal performance when the drift magnitude is very small. D. With the drift magnitude fixed to 0.15, the naive policy fails to reach the target if the drift direction is near the opposite end of the target angular position ψ. However, at this drift magnitude, the drift-aware policy can still reach both targets regardless of the drift direction. Note that solid lines and shaded regions of B and D show the median results and the variation between the 25th and 75th percentiles, respectively.

We find that the naive policy outperforms the drift-aware policy for small drift, but loses out or even fails to finish the task when the drift is large, especially when the drift is in the adverse direction relative to the target location. This implies that it might be wise to discard some sensory input (observations) when the perturbation due to drift is weak, especially if these extra observations act more like a distraction than a clue. As the perturbation gets stronger, however, it becomes necessary to take more observations into account. Note that neither the naive nor the drift-aware policy is able to complete the task in the given amount of time when the drift magnitude is very large and its direction is adversarial to the target location. This is because the shape actuation has no direct control over the drift itself. In Fig. 7(D), we fix the magnitude of drift to 0.15 and change its direction. Using the actuation effort as our performance metric as before, we find that the drift-aware policy has better or similar performance under all tested conditions. The naive policy fails when the drift acts in the adverse direction relative to the target, while the drift-aware policy is always able to reach the target before the episode terminates.
VII. CONCLUDING REMARKS

We considered a three-link fish swimming in potential flow. We reviewed how swimming in potential flow can be expressed as a combination of a dynamic phase (drift) and a geometric phase (driftless) over the space of fish body deformations [35]. In the driftless case, net locomotion is purely determined by the fish shape deformations, and shape space geometric techniques can be used for gait design and motion planning [44, 46], but shape actuation cannot control the drift itself. Yet, even in the driftless regime, motion planning starting from an arbitrary fish orientation and shape is non-trivial. In this paper, we applied model-free reinforcement learning techniques to the fish motion control problem, and we arrived at optimal policies for swimming (i) in a desired direction, and (ii) towards a target in the absence and presence of drift. The RL-based policies produce behavior that is robust to variations in the fish initial shape and orientation and in the target location. We used the actuation effort as a measure of policy performance under various initial conditions and in various environments, and we quantified the robustness of the RL policies to the presence of drift. We found that although the fish has no control over the drift itself, it learns to take advantage of the presence of moderate drift to reach its target. However, large adversarial drift hinders the fish's ability to reach the target. The geometric tools provided a useful framework for visualizing and interpreting the RL policies.

A few comments on the advantages and limitations of our implementation are in order. Despite algorithmic advantages, obtaining an RL policy is computationally costly, especially when the environment simulator involves high-fidelity fluid-structure interaction models. To circumvent this problem, recent work on training fish to swim uses a limited set of observations and actions [31]. For example, the zebrafish model of [31] allows only 5 discrete actions, each corresponding to a fixed amplitude of body curvature change. Reduced-order fluid models, such as the potential flow model used here, offer an enticing framework for designing control laws that can later be tested and refined in more realistic flow environments, as done in [19] for a manually-designed swimming gait from [35]. Specifically, in the simplified potential flow environment employed here, we are able to train continuous action policies using a rich set of observations, with an eye on probing the performance of these policies in more realistic flow environments in future work.
ACKNOWLEDGMENTS
Kanso would like to acknowledge support from the Office of Naval Research through the grants N00014-17-1-2287 and N00014-17-1-2062. The authors are thankful for useful discussions with Dr. Yuval Tassa.

[1] Dickinson MH, Lehmann FO, Sane SP. Wing rotation and the aerodynamic basis of insect flight. Science. 1999;284(5422):1954–1960.
[2] Tytell E, Holmes P, Cohen A. Spikes alone do not behavior make: why neuroscience needs biomechanics. Current Opinion in Neurobiology. 2011;21(5):816–822.
[3] Merel J, Botvinick M, Wayne G. Hierarchical motor control in mammals and machines. Nature Communications. 2019;10(1):1–12.
[4] Madhav MS, Cowan NJ. The synergy between neuroscience and control theory: the nervous system as inspiration for hard control challenges. Annual Review of Control, Robotics, and Autonomous Systems. 2020;3:243–267.
[5] Huang KH, Ahrens MB, Dunn TW, Engert F. Spinal projection neurons control turning behaviors in zebrafish. Current Biology. 2013;23(16):1566–1573.
[6] Burgess HA, Granato M. Sensorimotor gating in larval zebrafish. Journal of Neuroscience. 2007;27(18):4984–4994.
[7] Privat M, Romano SA, Pietri T, Jouary A, Boulanger-Weill J, Elbaz N, et al. Sensorimotor transformations in the zebrafish auditory system. Current Biology. 2019;29(23):4010–4023.
[8] Marchese AD, Onal CD, Rus D. Autonomous soft robotic fish capable of escape maneuvers using fluidic elastomer actuators. Soft Robotics. 2014;1(1):75–87. PMID: 27625912. Available from: https://doi.org/10.1089/soro.2013.0009.
[9] Lauder GV. Fish locomotion: recent advances and new directions. Annual Review of Marine Science. 2015;7(1):521–545. PMID: 25251278. Available from: https://doi.org/10.1146/annurev-marine-010814-015614.
[10] Cianchetti M, Follador M, Mazzolai B, Dario P, Laschi C. Design and development of a soft robotic octopus arm exploiting embodied intelligence. In: 2012 IEEE International Conference on Robotics and Automation. IEEE; 2012. p. 5271–5276.
[11] Laschi C, Cianchetti M, Mazzolai B, Margheri L, Follador M, Dario P. Soft robot arm inspired by the octopus. Advanced Robotics. 2012;26(7):709–727.
[12] Pfeifer R, Bongard J. How the Body Shapes the Way We Think: A New View of Intelligence. MIT Press; 2006.
[13] Pfeifer R, Lungarella M, Iida F. Self-organization, embodiment, and biologically inspired robotics. Science. 2007;318(5853):1088–1093. Available from: https://science.sciencemag.org/content/318/5853/1088.
[14] Liao JC, Beal DN, Lauder GV, Triantafyllou MS. Fish exploiting vortices decrease muscle activity. Science. 2003;302(5650):1566–1569.
[15] Müller UK. Fish 'n flag. Science. 2003;302(5650):1511–1512.
[16] Eloy C, Le Gal P, Le Dizès S. Elliptic and triangular instabilities in rotating cylinders. Journal of Fluid Mechanics. 2003;476:357–388.
[17] Kern S, Koumoutsakos P. Simulations of optimized anguilliform swimming. Journal of Experimental Biology. 2006;209(24):4841–4857.
[18] Mittal R, Dong H, Bozkurttas M, Loebbecke A, Najjar F. Analysis of flying and swimming in nature using an immersed boundary method. In: 36th AIAA Fluid Dynamics Conference and Exhibit; 2006. p. 2867.
[19] Eldredge JD. Numerical simulations of undulatory swimming at moderate Reynolds number. Bioinspiration & Biomimetics. 2006;1(4):S19.
[20] Tytell ED, Hsu CY, Williams TL, Cohen AH, Fauci LJ. Interactions between internal forces, body stiffness, and fluid environment in a neuromechanical model of lamprey swimming. Proceedings of the National Academy of Sciences. 2010;107(46):19832–19837.
[21] Hamlet C, Fauci LJ, Tytell ED. The effect of intrinsic muscular nonlinearities on the energetics of locomotion in a computational model of an anguilliform swimmer. Journal of Theoretical Biology. 2015;385:119–129.
[22] Engelmann J, Hanke W, Mogdans J, Bleckmann H. Hydrodynamic stimuli and the fish lateral line. Nature. 2000;408(6808):51–52. Available from: https://doi.org/10.1038/35040706.
[23] Ristroph L, Liao JC, Zhang J. Lateral line layout correlates with the differential hydrodynamic pressure on swimming fish. Physical Review Letters. 2015;114:018102. Available from: https://link.aps.org/doi/10.1103/PhysRevLett.114.018102.
[24] Colvert B, Kanso E. Fishlike rheotaxis. Journal of Fluid Mechanics. 2016;793:656–666.
[25] Colvert B, Liu G, Dong H, Kanso E. How can a source be located by responding to local information in its hydrodynamic trail? In: 2017 IEEE 56th Annual Conference on Decision and Control (CDC). IEEE; 2017. p. 2756–2761.
[26] Montgomery JC, Baker CF, Carton AG. The lateral line can mediate rheotaxis in fish. Nature. 1997;389(6654):960–963. Available from: https://doi.org/10.1038/40135.
[27] Filella A, Nadal F, Sire C, Kanso E, Eloy C. Model of collective fish behavior with hydrodynamic interactions. Physical Review Letters. 2018;120(19):198101.
[28] Gazzola M, Tchieu AA, Alexeev D, de Brauer A, Koumoutsakos P. Learning to school in the presence of hydrodynamic interactions. Journal of Fluid Mechanics. 2016;789:726–749.
[29] Gustavsson K, Biferale L, Celani A, Colabrese S. Finding efficient swimming strategies in a three-dimensional chaotic flow by reinforcement learning. The European Physical Journal E. 2017;40(12):110.
[30] Colabrese S, Gustavsson K, Celani A, Biferale L. Flow navigation by smart microswimmers via reinforcement learning. Physical Review Letters. 2017;118(15):158004.
[31] Verma S, Novati G, Koumoutsakos P. Efficient collective swimming by harnessing vortices through deep reinforcement learning. Proceedings of the National Academy of Sciences. 2018;115(23):5849–5854.
[32] Colvert B, Alsalman M, Kanso E. Classifying vortex wakes using neural networks. Bioinspiration & Biomimetics. 2018;13(2):025003.
[33] Alsalman M, Colvert B, Kanso E. Training bioinspired sensors to classify flows. Bioinspiration & Biomimetics. 2018;14(1):016009. Available from: https://doi.org/10.1088/1748-3190/aaef1d.
[34] Weber P, Arampatzis G, Novati G, Verma S, Papadimitriou C, Koumoutsakos P. Optimal flow sensing for schooling swimmers. Biomimetics. 2020;5(1):10.
[35] Kanso E, Marsden JE. Optimal motion of an articulated body in a perfect fluid. In: Proceedings of the 44th IEEE Conference on Decision and Control. IEEE; 2005. p. 2511–2516.
[36] Kanso E, Marsden JE, Rowley CW, Melli-Huber JB. Locomotion of articulated bodies in a perfect fluid. Journal of Nonlinear Science. 2005;15(4):255–289.
[37] Eloy C. On the best design for undulatory swimming. Journal of Fluid Mechanics. 2013;717:48–89.
[38] Müller UK, Van Leeuwen JL. Undulatory fish swimming: from muscles to flow. Fish and Fisheries. 2006;7(2):84–103.
[39] Dickinson MH, Farley CT, Full RJ, Koehl M, Kram R, Lehman S. How animals move: an integrative view. Science. 2000;288(5463):100–106.
[40] Degris T, Pilarski PM, Sutton RS. Model-free reinforcement learning with continuous action in practice. In: 2012 American Control Conference (ACC); 2012. p. 2177–2182.
[41] Haith AM, Krakauer JW. Model-based and model-free mechanisms of human motor learning. In: Richardson MJ, Riley MA, Shockley K, editors. Progress in Motor Control. New York, NY: Springer New York; 2013. p. 1–21.
[42] Sutton RS, Barto AG. Reinforcement Learning: An Introduction. MIT Press; 2018.
[43] Eldredge JD. Numerical simulation of the fluid dynamics of 2D rigid body motion with the vortex particle method. Journal of Computational Physics. 2007;221(2):626–648.
[44] Melli JB, Rowley CW, Rufat DS. Motion planning for an articulated body in a perfect planar fluid. SIAM Journal on Applied Dynamical Systems. 2006;5(4):650–669.
[45] Nair S, Kanso E. Hydrodynamically coupled rigid bodies. Journal of Fluid Mechanics. 2007;592:393–411.
[46] Hatton RL, Choset H. Connection vector fields and optimized coordinates for swimming systems at low and high Reynolds numbers. In: ASME 2010 Dynamic Systems and Control Conference. American Society of Mechanical Engineers Digital Collection; 2010. p. 817–824.
[47] Morgansen KA, Duindam V, Mason RJ, Burdick JW, Murray RM. Nonlinear control methods for planar carangiform robot fish locomotion. In: Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No. 01CH37164). vol. 1. IEEE; 2001. p. 427–434.
[48] Bloch AM, Krishnaprasad P, Marsden JE, Murray RM. Nonholonomic mechanical systems with symmetry. Archive for Rational Mechanics and Analysis. 1996;136(1):21–99.
[49] Murray RM, Li Z, Sastry SS. A Mathematical Introduction to Robotic Manipulation. CRC Press; 1994.
[50] Lamb H. Hydrodynamics. Dover Books on Physics. Dover Publications; 1945. Available from: https://books.google.com/books?id=237xDg7T0RkC.
[51] Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning. 1992;8(3-4):229–256.
[52] Sutton RS, McAllester DA, Singh SP, Mansour Y. Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems; 2000. p. 1057–1063.
[53] Kakade SM. A natural policy gradient. In: Advances in Neural Information Processing Systems; 2002. p. 1531–1538.
[54] Peters J, Schaal S. Natural actor-critic. Neurocomputing. 2008;71(7-9):1180–1190.
[55] Peters J, Schaal S. Policy gradient methods for robotics. In: 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE; 2006. p. 2219–2225.
[56] Konda VR, Tsitsiklis JN. Actor-critic algorithms. In: Advances in Neural Information Processing Systems; 2000. p. 1008–1014.
[57] Haarnoja T, Zhou A, Abbeel P, Levine S. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. 2018.
[58] Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. 2017.
[59] Leonard NE. Stability of a bottom-heavy underwater vehicle. Automatica. 1997;33(3):331–346.
[60] Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.

Appendix A: Physics of the fish model
We review the derivation of the equations of motion governing the swimming of an articulated three-link fish in potential flow (Fig. 1).
1. Fish kinematics
Consider planar motions of the three-link fish. Let x = (x, y) denote the position of the center of mass G of the middle link, and let β denote the orientation of the fish relative to a fixed inertial frame, here taken to be the angle between the x-axis and the major axis of symmetry of the middle link. Let α₁ and α₂ be the rotation angles of the front link relative to the middle link and of the middle link relative to the rear link; that is to say, (α₁, α₂) represents the shape of the three-link fish. It is convenient for the following development to introduce a body-fixed frame (b₁, b₂, b₃), attached at G and co-rotating with the middle link. This body-fixed frame is related to the inertial frame (e₁, e₂, e₃) via a rigid-body rotation such that e₁ = cos β b₁ − sin β b₂, e₂ = sin β b₁ + cos β b₂, and e₃ = b₃.

The velocity $(\dot{x}, \dot{y})$ of the center of mass of the middle link, when expressed in the body-fixed frame, is given by

$$\mathbf{v} = v_x \mathbf{b}_1 + v_y \mathbf{b}_2 = (\dot{x}\cos\beta + \dot{y}\sin\beta)\, \mathbf{b}_1 + (-\dot{x}\sin\beta + \dot{y}\cos\beta)\, \mathbf{b}_2. \tag{A1}$$

Assuming all three links are made of identical ellipsoids of length a, width b, and height c, the velocities of the centers of mass G₁ and G₂ of the front and rear links, expressed in the body-fixed frame of the middle link, are given by (i = 1, 2 denote the front and rear links, respectively)

$$\mathbf{v}_i = \left(v_x \mp a\dot{\alpha}_i\sin\alpha_i - a\dot{\beta}\sin\alpha_i\right)\mathbf{b}_1 + \left(v_y \pm a\dot{\beta} + a\dot{\alpha}_i\cos\alpha_i \pm a\dot{\beta}\cos\alpha_i\right)\mathbf{b}_2. \tag{A2}$$

The angular velocities of the middle, front, and rear links are given by $\dot{\beta}$, $\dot{\beta} + \dot{\alpha}_1$, and $\dot{\beta} - \dot{\alpha}_2$, respectively.
2. Kinetic energy of the articulated body
In the absence of the fluid, the kinetic energy of the articulated three-link body is given by

$$T_{\rm body} = \frac{1}{2} m_s\, \mathbf{v}\cdot\mathbf{v} + \frac{1}{2} J_s \dot{\beta}^2 + \frac{1}{2}\sum_i \left[ m_s\, \mathbf{v}_i\cdot\mathbf{v}_i + J_s \left(\dot{\beta} \pm \dot{\alpha}_i\right)^2 \right], \tag{A3}$$

where $m_s = \frac{4}{3}\pi abc\,\rho_s$ and $J_s = \frac{1}{5}(a^2 + b^2)\, m_s$ are the mass and moment of inertia of each solid link, with ρ_s the density of the links.
3. Kinetic energy of the fluid
The three-link fish is submerged in an unbounded domain of incompressible and irrotational fluid, such that the fluid velocity u = ∇φ can be expressed as the gradient of a potential function φ. It is a standard result in potential flow theory that the kinetic energy of the fluid can be expressed in terms of the variables of the submerged solid [35, 36, 50]. In the case of a single ellipsoid, the kinetic energy of the fluid is given by $T_{\rm fluid} = \left[(m_x v_x^2 + m_y v_y^2) + J_z \dot{\beta}^2\right]/2$, where m_x, m_y and J_z are the added masses and added moment of inertia due to the presence of the fluid, expressed in a body-fixed frame that coincides with the major and minor axes of the ellipsoid. These quantities depend on the geometric properties a, b, c of the submerged ellipsoid [59]. For a non-spherical body, the added masses m_x, m_y depend on the direction of motion: the added mass is larger when moving in the direction of the minor axis of symmetry of the ellipsoid, that is to say, in the transverse direction, hence m_x ≤ m_y.

In the case of the three-link fish, the kinetic energy of the fluid is of the form

$$\begin{aligned} T_{\rm fluid} = \; & \frac{1}{2} m_x v_x^2 + \frac{1}{2} m_y v_y^2 + \frac{1}{2} J_z \dot{\beta}^2 + \frac{1}{2}\sum_i J_z \left(\dot{\beta} \pm \dot{\alpha}_i\right)^2 \\ & + \frac{1}{2}\sum_i m_x \left(v_x\cos\alpha_i \pm v_y\sin\alpha_i + a\dot{\beta}\sin\alpha_i\right)^2 \\ & + \frac{1}{2}\sum_i m_y \left(\mp v_x\sin\alpha_i + v_y\cos\alpha_i \pm a\dot{\beta}\cos\alpha_i \pm a\dot{\beta} + a\dot{\alpha}_i\right)^2. \end{aligned} \tag{A4}$$

Here we transform the velocity components of the head and tail by α₁ and α₂, respectively, to match the added mass components.
4. Kinetic energy of the body-fluid system
The kinetic energy of the fish-fluid system is obtained by taking the sum of Eq. (A3) and Eq. (A4), which can be expressed in matrix form as follows:

$$T = T_{\rm body} + T_{\rm fluid} = \frac{1}{2} \begin{pmatrix} v_x \\ v_y \\ \dot{\beta} \\ \dot{\alpha}_1 \\ \dot{\alpha}_2 \end{pmatrix}^{\!T} \begin{pmatrix} \mathbb{I}_{\rm lock} & \mathbb{I}_{\rm couple} \\ \mathbb{I}_{\rm couple}^T & \mathbb{I}_{\rm shape} \end{pmatrix} \begin{pmatrix} v_x \\ v_y \\ \dot{\beta} \\ \dot{\alpha}_1 \\ \dot{\alpha}_2 \end{pmatrix}. \tag{A5}$$

Here, $\mathbb{I}_{\rm lock}$ is a 3 × 3 matrix that depends only on α₁ and α₂,

$$\mathbb{I}_{\rm lock} = \begin{pmatrix} M & H \\ H^T & J \end{pmatrix}, \tag{A6}$$

where M is a 2 × 2 matrix,

$$M = \begin{pmatrix} m_1\left(1 + \sum_i \cos^2\alpha_i\right) + m_2 \sum_i \sin^2\alpha_i & \frac{1}{2}(m_1 - m_2)(\sin 2\alpha_1 - \sin 2\alpha_2) \\ \frac{1}{2}(m_1 - m_2)(\sin 2\alpha_1 - \sin 2\alpha_2) & m_2\left(1 + \sum_i \cos^2\alpha_i\right) + m_1 \sum_i \sin^2\alpha_i \end{pmatrix}, \tag{A7}$$

J is a moment-of-inertia scalar given by

$$J = 3 J_0 + m_1 a^2 \sum_i \sin^2\alpha_i + m_2 a^2 \sum_i \left(1 + \cos\alpha_i\right)^2, \tag{A8}$$

and H is given by

$$H = \begin{pmatrix} \frac{1}{2}(m_1 - m_2)\, a \sum_i \sin 2\alpha_i - m_2\, a \sum_i \sin\alpha_i \\ \frac{1}{2}(m_1 - m_2)\, a \left(\cos 2\alpha_2 - \cos 2\alpha_1\right) + m_2\, a \left(\cos\alpha_1 - \cos\alpha_2\right) \end{pmatrix}. \tag{A9}$$

Here we used m₁ = m_s + m_x, m₂ = m_s + m_y, and J₀ = J_s + J_z. Note that H couples the translational and rotational motions of the articulated body. In the case of a single ellipsoid, H is identically zero.

Further, $\mathbb{I}_{\rm couple}$ is a 3 × 2 matrix given by

$$\mathbb{I}_{\rm couple} = \begin{pmatrix} -m_2\, a\sin\alpha_1 & m_2\, a\sin\alpha_2 \\ m_2\, a\cos\alpha_1 & m_2\, a\cos\alpha_2 \\ J_0 + m_2\, a^2 (1 + \cos\alpha_1) & -J_0 - m_2\, a^2 (1 + \cos\alpha_2) \end{pmatrix}. \tag{A10}$$

Finally, $\mathbb{I}_{\rm shape}$ is a 2 × 2 matrix given by

$$\mathbb{I}_{\rm shape} = \begin{pmatrix} J_0 + m_2\, a^2 & 0 \\ 0 & J_0 + m_2\, a^2 \end{pmatrix}. \tag{A11}$$
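The matrices in Eqs. (A6)–(A11) can be assembled directly; the sketch below does so and forms the local connection of Eq. (5). The numerical values of m1, m2, J0, and a are arbitrary placeholders, and the code mirrors the expressions as reconstructed above.

```python
# Sketch: assembling I_lock (A6-A9), I_couple (A10), and I_shape (A11), and
# forming the connection matrix A = -I_lock^{-1} I_couple of Eq. (5).
import numpy as np

def mass_matrices(a1, a2, m1=2.0, m2=6.0, J0=1.0, a=0.5):
    """Mass matrices at shape (a1, a2); parameter values are placeholders."""
    c1, c2, s1, s2 = np.cos(a1), np.cos(a2), np.sin(a1), np.sin(a2)
    m12 = 0.5 * (m1 - m2)
    M = np.array([
        [m1 * (1 + c1**2 + c2**2) + m2 * (s1**2 + s2**2),
         m12 * (np.sin(2 * a1) - np.sin(2 * a2))],
        [m12 * (np.sin(2 * a1) - np.sin(2 * a2)),
         m2 * (1 + c1**2 + c2**2) + m1 * (s1**2 + s2**2)],
    ])
    J = 3 * J0 + m1 * a**2 * (s1**2 + s2**2) + m2 * a**2 * ((1 + c1)**2 + (1 + c2)**2)
    H = np.array([
        [m12 * a * (np.sin(2 * a1) + np.sin(2 * a2)) - m2 * a * (s1 + s2)],
        [m12 * a * (np.cos(2 * a2) - np.cos(2 * a1)) + m2 * a * (c1 - c2)],
    ])
    I_lock = np.block([[M, H], [H.T, np.array([[J]])]])
    I_couple = np.array([
        [-m2 * a * s1,                m2 * a * s2],
        [ m2 * a * c1,                m2 * a * c2],
        [J0 + m2 * a**2 * (1 + c1), -(J0 + m2 * a**2 * (1 + c2))],
    ])
    I_shape = np.diag([J0 + m2 * a**2, J0 + m2 * a**2])
    return I_lock, I_couple, I_shape

I_lock, I_couple, _ = mass_matrices(0.3, -0.2)
A = -np.linalg.solve(I_lock, I_couple)   # local connection matrix, Eq. (5)
print(A)
```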
Appendix B: Proximal Policy Optimization (PPO) Algorithms

Algorithm B.1 Environment Simulation

for time step t = 0, 1, ... do
    if t = 0 or the episode terminates then
        store the time step of episode termination; reset the state s_t ∼ P(s_0)
        evaluate the observation: o_t ∼ o(s_t)
    end if
    sample an action from the policy a_t ∼ π_θ(a_t | o_t)
    evolve the next state according to the fish physics s_{t+1} ∼ P(s_{t+1} | s_t, a_t)
    evaluate the next observation o_{t+1} ∼ o(s_{t+1}) and reward r_t ∼ r(a_t, o_{t+1})
    if t = 0 or mod(t, N) ≠ 0 then
        append the current action, observation, reward, and probability of sampling the action to assemble the vectors a_{N×n_a}, o_{N×n_o}, r_{N×1}, and π_{θ_old}(a|o)_{N×1}
    else
        update the agent networks according to Algorithm B.2
    end if
end for
Algorithm B.2 Updating the Agent

for update epoch number κ = 0, 1, ..., K do
    compute the truncated return using the rewards r_{N×1} and assemble it into the vector R_{N×1}
    estimate the infinite-horizon return using R_{N×1} and V_T = V_φ(o_T) if bootstrapping is desired (see Eq. B2)
    using o_{N×n_o} and the value function V_φ, evaluate the expected returns at each time step and store them into V_{N×1}
    compute the advantage A = R_{N×1} − V_{N×1} and normalize it by its mean and variance if desired
    evaluate the probability of realizing a_{N×n_a} based on o_{N×n_o} for the policy π_θ, and store it into π_θ(a|o)_{N×1}
    compute the action-likelihood ratio: ϱ_θ = π_θ(a|o)_{N×1} / π_{θ_old}(a|o)_{N×1}
    compute the clipped surrogate loss function: L_clip(θ) = mean[ min[ ϱ_θ · A, clip(ϱ_θ, 1 − ε, 1 + ε) · A ] ]
    compute the value-function loss: L_value(φ) = 0.5 · mean[ (R_{N×1} − V_{N×1})² ]
    compute the total loss: L(θ, φ) = −L_clip(θ) + L_value(φ) − α · entropy[π_θ]
    update the parameters (θ, φ) to minimize the total loss using a gradient-based optimizer (e.g., Adam [60])
end for
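A minimal NumPy sketch of the loss computation in Algorithm B.2 follows. The clipping parameter eps and the entropy coefficient c_ent are assumed values (the text does not report them), and gradient computation and network updates are omitted.

```python
# Sketch of the per-update losses in Algorithm B.2; ratios, advantages,
# returns, values, and entropy are assumed to be precomputed.
import numpy as np

def ppo_losses(ratio, advantage, returns, values, entropy, eps=0.2, c_ent=0.01):
    adv = (advantage - advantage.mean()) / (advantage.std() + 1e-8)  # optional normalization
    L_clip = np.mean(np.minimum(ratio * adv,
                                np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv))
    L_value = 0.5 * np.mean((returns - values) ** 2)
    return -L_clip + L_value - c_ent * entropy        # total loss to minimize
```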
We implement the clipped advantage Proximal Policy Optimization (PPO) method proposed by [58] for our RL training. PPO maximizes a surrogate objective that clips off unwanted changes when the policy deviates too much from the policy of the previous cycle, to ensure faster and more robust convergence. We refer readers to the original reference cited above, as well as to OpenAI's documentation of the PPO algorithm and their baseline implementations, for a thorough explanation of the theory and details behind this method.

Our implementation can be separated into two parts. The main loop simulates the environment using the action sequence a_t generated by the agent, and stores the observed rollouts for future updates; see Algorithm B.1.
Figure B.1. Evolution of rewards during the training process. A. Total rewards per episode achieved by policies trained to swim parallel to the x-axis in a driftless environment using bootstrapped (blue) and truncated (black) return estimates. Here solid lines indicate the median, and the shaded regions show the variation between the 25th and 75th percentiles for 24 runs of the learning algorithm. B. Total rewards obtained by policies trained to swim towards a given target, both of which adopt bootstrapped return estimates. Here red indicates naive policies trained in a driftless environment, while yellow represents policies trained in the presence of drift, with the drift magnitude and direction supplied as additional observations to the policy. Again, lines and shaded regions indicate the median and the 25th–75th percentile range, respectively.
Note that n_o and n_a are used to indicate the numbers of observable states and actions. Equations describing the fish-fluid interactions were integrated numerically between consecutive decision steps using an adaptive-time-step, explicit RK45 method. The agent networks are updated every N time steps for K epochs. Here the value of K is chosen to be 80 and the value of N is set to 4050, an integer multiple of the episode length 150. For simplicity, we assume our continuous action variables follow a multivariate normally-distributed policy π_θ with mean value represented by a neural network parameterized by θ and constant diagonal covariance matrices; the critic/value function V_φ(o_t) is also represented by a neural network with parameters φ. Specifically, both the mean policy and the value function are implemented as feed-forward neural networks with two hidden layers and tanh activation functions. The sizes of the two hidden layers were fixed to 64 and 32, respectively. Finally, using the trajectories collected during the previous N time steps, the parameters θ, φ are updated according to the total loss function L(θ, φ) via a back-propagating gradient-based optimizer; see Algorithm B.2. Note that since we did not perform systematic hyper-parameter tuning, readers might want to explore different values for better performance.

Another important side note is that, since it is in general impossible to obtain the unrealized infinite-horizon return $R_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$, we need to choose an appropriate estimator of this value based on finite-length simulations. We can either simply truncate the rewards after some step k,

$$\hat{R}_t\Big|_{\rm truncation} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^k r_{t+k}, \tag{B1}$$

or we can use the trained value function (critic) to approximate the residual contribution to the return via k-step bootstrapping,

$$\hat{R}_t\Big|_{\rm bootstrapping} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^k r_{t+k} + \gamma^{k+1} V_\phi(o_{t+k+1}). \tag{B2}$$

We compared these two approaches for the direction-control task and observed that bootstrapping results in faster convergence and higher rewards in general; see Fig. B.1(A). As a result, bootstrapping is used for all tasks depicted in the main text, where k was determined by the number of available future rewards; namely, k decreases from 149 to 0 as the number of time steps increases from 1 to 150 in each episode.
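The two return estimators of Eqs. (B1) and (B2) can be written compactly as follows; the reward and value arrays are assumed to be long enough for the chosen k.

```python
# Sketch of the return estimators in Eqs. (B1)-(B2) for a single episode:
# rewards r[0..T-1], value estimates V[t] = V_phi(o_t), discount gamma.
import numpy as np

def truncated_return(r, t, k, gamma):
    idx = np.arange(k + 1)
    return np.sum(gamma**idx * r[t:t + k + 1])                 # Eq. (B1)

def bootstrapped_return(r, V, t, k, gamma):
    return truncated_return(r, t, k, gamma) \
        + gamma**(k + 1) * V[t + k + 1]                        # Eq. (B2)
```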
In addition, we show the difference in training rewards and convergence speed between the naive policy and the drift-aware policy in Fig. B.1(B). In general, the inclusion of more observations increased the time to convergence and the variance in training rewards.

Lastly, we invite interested readers to visit our source repository at https://github.com/mjysh/RL3linkFish.