Learning to swim in potential flow
Yusheng Jiao, Feng Ling, Sina Heydari, and Eva Kanso∗
Department of Aerospace and Mechanical Engineering, University of Southern California, Los Angeles, CA 90089
∗ [email protected]; https://sites.usc.edu/kansolab/
Nicolas Heess and Josh Merel
DeepMind, London
Fish swim by undulating their bodies. These propulsive motions require coordinated shape changes of a body that interacts with its fluid environment, but the specific shape coordination that leads to robust turning and swimming motions remains unclear. We propose a simple model of a three-link fish swimming in a potential flow environment, and we use model-free reinforcement learning to arrive at optimal shape changes for two swimming tasks: swimming in a desired direction and swimming towards a known target. This fish model belongs to a class of problems in geometric mechanics, known as driftless dynamical systems, which allow us to analyze the swimming behavior in terms of geometric phases over the shape space of the fish. These geometric methods are less intuitive in the presence of drift. Here, we use the shape space analysis as a tool for assessing, visualizing, and interpreting the control policies obtained via reinforcement learning in the absence of drift. We then examine the robustness of these policies to drift-related perturbations. Although the fish has no direct control over the drift itself, it learns to take advantage of the presence of moderate drift to reach its target.
Keywords: Fish swimming, Reinforcement learning, Sensorimotor control.
I. INTRODUCTION
Fish swim through interactions of body deformations with the fluid environment. A fish assimilates sensory information from its body and the external environment and produces patterns of muscle activation that result in desired body deformations; see Fig. 1A. How these sensorimotor decisions are enacted at the physiological level, at the level of neuronal circuits, remains unclear [1–4]. Animal models, such as the zebrafish Danio rerio [5–7], as well as robotic and mathematical models [8, 9], provide valuable insights into the sensorimotor control underlying fish behavior. Such understanding offers enticing paradigms for the design of artificial soft robotic systems in which the control is embodied in the physics [10, 11].

Embodied systems sense and respond to their environment through a physical body [12, 13]; physical interactions with the environment are thus vital for sensing and control. In fish, the dynamics of the fluid environment is essential both at an evolutionary time scale – in shaping body morphologies and sensorimotor modalities – and at a behavioral time scale. Fish bodies are tuned to exploit flows [14, 15]. Body designs and undulatory motions have been examined in computational models of fluid-structure interactions [16–19], including models of body stiffness and neuromechanical control [20, 21]. The fish's ability to sense minute water motions is attributed to a lateral-line mechanosensory system [22, 23] involved in behaviors ranging from rheotaxis [24–26] to schooling [27]. Recent developments prove that machine learning techniques are highly effective in addressing problems of flow sensing and fish behavior [28–34].

A central problem in fish behavior, which is also relevant for underwater robotic systems, is gait design or motion planning: what body deformations produce a desired swimming objective? The answer requires an understanding of how the numerous biomechanical degrees of freedom of the fish body are coordinated to achieve the common objective. Mathematically, this problem is often expressed in terms of an optimality principle: find control laws that optimize a desired objective, such as maximizing swimming speed or minimizing energetic cost [17, 35–37]. But how these control laws are implemented in the nervous system, and how they are acquired via learning algorithms, are typically beyond the scope of such methods. As these optimization methods rely on an internal model of the dynamics, different computational results have been obtained by varying the specification of the physical model of the fish, the performance metric, and the imposed control constraints [17, 35–38].

Model-free reinforcement learning (RL) of embodied systems offers an alternative framework for gait design that is mathematically and computationally tractable [40, 41]. In this framework, the fish world is divided into a body controlled by a learning agent and an environment that encompasses everything outside of what the agent can control.
Figure 1. Model-free reinforcement learning and the three-link fish. A. Illustration of sensorimotor feedback loops in fish. Motor commands generated in the nervous system activate the musculoskeletal system, resulting in deformations of the body. Body deformations, through interaction with the fluid environment, lead to swimming, while sensory modalities provide sensory feedback to the nervous system. The dashed arrow between the musculoskeletal and sensory systems indicates somatic sensing used to assess whether previous motor commands were successfully executed. Other reflexive or preflex signals could also be at play [2, 22, 39]. B. Three-link fish swimming in a quiescent fluid. The locomotion variables (x, y, β) are set in a lab-fixed frame, while the shape variables (α₁, α₂) and target variables (ρ, ψ) are set in a body frame, symbolizing egocentric control and learning. C. To apply model-free reinforcement learning to our problem, we only need to set the appropriate state, observation, action, and reward variables based on our fish model. The agent can be viewed as an abstract representation of the parts of the fish responsible for sensorimotor decisions.

In RL, the agent must learn from experiences in a trial-and-error fashion. Specifically, repeated interactions of the body with the environment enable the agent to collect sensory observations, control actions, and associated rewards. The goal of the agent is thus to learn to produce behavior that maximizes rewards, and the process is model-free when learning does not make use of either a priori or developed knowledge of the physics of the system. The learned feedback control law, called a policy, is essentially a mapping from sensory observations to control actions. This mapping is nonlinear and stochastic, and, by construction, rather than providing a single optimal trajectory, it can be applied to any initial condition and transferred to conditions other than those seen during training, such as when the body or fluid environment is perturbed [42].

Here, we employ RL to design swimming gaits. We use an idealization in which the fish is modeled as an articulated body consisting of three links, with the front and rear links free to rotate relative to the middle link via hinge joints [35, 36, 43–46]. In describing the physics of the fish, we cede the complexity of accounting for the full details of the fluid medium in favor of considering momentum exchange between the articulated body and the surrounding fluid in the context of a potential flow model [35, 36, 44, 45]. This model is a canonical example of a class of under-actuated control problems whose dynamics can be described over the actuation (shape) space using tools from geometric mechanics [35, 47–49]. Specifically, the fish swimming motion can be represented by the sum of a dynamic phase, or drift, and a geometric phase over the shape space of all body deformations [36]. From a motion control perspective, this model is advantageous in that it allows for the use of geometric mechanics tools for gait analysis and manual design of control policies over the full shape space [46]. These geometric tools also provide an intuitive way for direct and interpretable visualizations of the RL-based policies.

II. MATHEMATICAL MODEL OF THE THREE-LINK FISH
Consider a three-link fish as shown in Fig. 1(B). Rotations of the front and rear links relative to the middle link are denoted by the angles α₁ and α₂, such that (α₁, α₂) fully describes all possible body deformations. We constrain the swimming motions to a two-dimensional plane, and let (x, y) and β denote the net planar displacement and rotation of the fish body, such that $(\dot{x}, \dot{y})$ and $\dot{\beta}$ represent the linear and rotational velocities of the fish in the inertial frame (the dot notation $\dot{(\;)} = \mathrm{d}(\;)/\mathrm{d}t$ represents differentiation with respect to time t). We also introduce the linear velocity (v_x, v_y) expressed in a body frame attached to the center of the middle link,

$$v_x = \dot{x}\cos\beta + \dot{y}\sin\beta, \qquad v_y = -\dot{x}\sin\beta + \dot{y}\cos\beta. \tag{1}$$

The total linear momentum (p_x, p_y) and total angular momentum π of the body-fluid system are expressed in the inertial frame, and their counterparts (P_x, P_y) and Π are expressed in the body-fixed frame,

$$p_x = P_x\cos\beta - P_y\sin\beta, \qquad p_y = P_x\sin\beta + P_y\cos\beta, \qquad \pi = \Pi. \tag{2}$$

In potential flow, it is a known result that the total momenta of the body-fluid system can be expressed in terms of the body geometry and velocity, via the so-called added mass matrices [35, 36, 50]. Expressions for the added mass matrices of the three-link fish are derived in detail in Appendix A. The total momenta (P_x, P_y) and Π are given by

$$\begin{pmatrix} P_x \\ P_y \\ \Pi \end{pmatrix} = \mathbb{I}_{\rm lock} \begin{pmatrix} v_x \\ v_y \\ \dot{\beta} \end{pmatrix} + \mathbb{I}_{\rm couple} \begin{pmatrix} \dot{\alpha}_1 \\ \dot{\alpha}_2 \end{pmatrix}, \tag{3}$$

where $\mathbb{I}_{\rm lock}$ is the locked mass matrix at a given shape of the fish (see Eqs. A6–A9) and $\mathbb{I}_{\rm couple}$ is the mass matrix associated with shape deformations (see Eq. A10).

In the absence of external forces and moments on the fish-fluid system, the total momenta (p_x, p_y) and π of the body-fluid system in the inertial frame are conserved for all time. Conservation of the total momenta yields, upon inverting (2) and substituting into (3),

$$\begin{pmatrix} v_x \\ v_y \\ \dot{\beta} \end{pmatrix} = \mathbb{I}_{\rm lock}^{-1} \begin{pmatrix} \cos\beta & \sin\beta & 0 \\ -\sin\beta & \cos\beta & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} p_x \\ p_y \\ \pi \end{pmatrix} - \mathbb{I}_{\rm lock}^{-1}\, \mathbb{I}_{\rm couple} \begin{pmatrix} \dot{\alpha}_1 \\ \dot{\alpha}_2 \end{pmatrix}. \tag{4}$$

If we further substitute (1) into (4), we arrive at three coupled first-order equations of motion for x, y, β given α₁ and α₂. The control problem consists of finding the time evolution of shape changes (α₁(t), α₂(t)) that achieve a desired locomotion task (x(t), y(t)) and β(t). Specifically, a swimming gait is defined as a cyclic shape change (α₁(t), α₂(t)), with period T, that results in a net swimming (x(t), y(t)) or turning β(t) of the fish body.

This model is a canonical example of a class of under-actuated control problems whose dynamics can be described over the shape space using tools from geometric mechanics [35, 47–49]. On the right-hand side of (4), the first term represents a dynamic phase or drift, and the second term represents a geometric phase over the fish shape space (α₁, α₂) [36]. The geometric phase is best described in terms of the local connection matrix A [36, 46], which is a function only of the shape variables α₁ and α₂,

$$\mathbf{A} = \begin{pmatrix} A_{x1} & A_{x2} \\ A_{y1} & A_{y2} \\ A_{\beta 1} & A_{\beta 2} \end{pmatrix} := -\mathbb{I}_{\rm lock}^{-1}\, \mathbb{I}_{\rm couple}. \tag{5}$$

Each row of A describes a nonlinear vector field over the shape space, giving rise to three vector fields A_x ≡ (A_{x1}, A_{x2}), A_y ≡ (A_{y1}, A_{y2}), and A_β ≡ (A_{β1}, A_{β2}) over the (α₁, α₂) plane, as shown in Fig. 2(A).

In driftless systems, net locomotion is fully controlled by the fish shape changes. However, in the presence of drift, shape control is not sufficient to determine the fish motion in the physical space, which is now affected by the drift term in (4).
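To make the structure of (4) concrete, the following minimal Python sketch integrates the reconstruction equation for a prescribed circular gait at zero total momentum. The routines `I_lock` and `I_couple` are illustrative placeholders standing in for the added mass matrices of Appendix A, and the gait amplitude is arbitrary; this is a sketch of the computation, not the implementation used in the paper.

```python
# Sketch: integrating eq. (4) for a prescribed shape trajectory at zero momentum.
# I_lock(alpha) and I_couple(alpha) below are hypothetical stand-ins for the
# added mass matrices derived in Appendix A.
import numpy as np
from scipy.integrate import solve_ivp

def I_lock(alpha):        # 3x3 locked mass matrix at shape alpha = (a1, a2); placeholder
    a1, a2 = alpha
    return np.diag([2.0 + np.cos(a1) + np.cos(a2), 3.0, 1.0])

def I_couple(alpha):      # 3x2 shape-coupling matrix; placeholder
    a1, a2 = alpha
    return np.array([[-np.sin(a1), np.sin(a2)],
                     [ np.cos(a1), np.cos(a2)],
                     [ 1.0,       -1.0      ]])

def alpha_of_t(t):        # prescribed circular gait in the shape space
    return np.array([0.7 * np.cos(t), 0.7 * np.sin(t)])

def alpha_dot_of_t(t):
    return np.array([-0.7 * np.sin(t), 0.7 * np.cos(t)])

def rhs(t, q, p_inertial):
    """q = (x, y, beta); p_inertial = (p_x, p_y, pi) is conserved in eq. (4)."""
    x, y, beta = q
    R = np.array([[ np.cos(beta), np.sin(beta), 0.0],
                  [-np.sin(beta), np.cos(beta), 0.0],
                  [ 0.0,          0.0,          1.0]])
    alpha = alpha_of_t(t)
    Ilock_inv = np.linalg.inv(I_lock(alpha))
    # body-frame velocities (v_x, v_y, beta_dot) from eq. (4)
    v = Ilock_inv @ (R @ p_inertial) - Ilock_inv @ I_couple(alpha) @ alpha_dot_of_t(t)
    vx, vy, beta_dot = v
    # rotate the linear velocity back to the inertial frame, inverting eq. (1)
    return [vx * np.cos(beta) - vy * np.sin(beta),
            vx * np.sin(beta) + vy * np.cos(beta),
            beta_dot]

sol = solve_ivp(rhs, (0.0, 2 * np.pi), [0.0, 0.0, 0.0],
                args=(np.zeros(3),), max_step=0.01)   # zero momentum: driftless case
print(sol.y[:, -1])       # net (x, y, beta) after one gait cycle
```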
III. GEOMETRIC PHASES
Geometric phases are defined as the net locomotion (x, y, β) that results from prescribed cyclic shape changes in the (α₁, α₂) plane at zero total momentum (no drift).

Figure 2. Using the connection matrix for simple gait design. A and B. The rows of the connection matrix A give three vector fields, A_x, A_y, and A_β, with components $A_{xi} = \partial v_x/\partial\dot\alpha_i$, $A_{yi} = \partial v_y/\partial\dot\alpha_i$, and $A_{\beta i} = \partial\dot\beta/\partial\dot\alpha_i$; the curls of these vector fields yield three corresponding scalar fields. The magnitudes of the scalar fields are normalized to lie within [−1, 1]. The example swimming (black) and turning (green) gaits, and the three initial shapes marked by ◦, △, □, are sketched for additional clarity. C and D. Fish that start with the body centered at the origin of the x-y plane and follow the same gait circle can still swim/turn in different directions in the physical space, depending on their initial body shapes.
Inverting (1) and substituting (5) into (4) at zero total momentum, we arrive at

$$\begin{pmatrix} \cos\beta & \sin\beta & 0 \\ -\sin\beta & \cos\beta & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \dot{x} \\ \dot{y} \\ \dot{\beta} \end{pmatrix} = \begin{pmatrix} A_{x1} & A_{x2} \\ A_{y1} & A_{y2} \\ A_{\beta 1} & A_{\beta 2} \end{pmatrix} \begin{pmatrix} \dot{\alpha}_1 \\ \dot{\alpha}_2 \end{pmatrix}. \tag{6}$$

Motions (x, y, β) in the physical space are obtained by integrating (6) with respect to time. Rotations are directly proportional to the line integral of the vector field A_β, as evident by integrating the last equation in (6),

$$\beta(T) - \beta(0) = \oint_C \mathrm{d}\beta = \oint_C \left( A_{\beta 1}\, \mathrm{d}\alpha_1 + A_{\beta 2}\, \mathrm{d}\alpha_2 \right). \tag{7}$$

Here, T is the time period for going around the closed trajectory C in the shape space once. Using Green's theorem, we get

$$\beta(T) - \beta(0) = \iint \left( \frac{\partial A_{\beta 2}}{\partial \alpha_1} - \frac{\partial A_{\beta 1}}{\partial \alpha_2} \right) \mathrm{d}\alpha_1\, \mathrm{d}\alpha_2 = \iint \mathrm{curl}(A_\beta)\, \mathrm{d}\alpha_1\, \mathrm{d}\alpha_2. \tag{8}$$

The scalar field curl(A_β) provides an intuitive tool for understanding the effect of a cyclic shape deformation on the net rotation of the fish: net rotations are proportional to the integral of curl(A_β) over the area enclosed by a closed shape trajectory. However, translational motions (x, y) are not directly proportional to the area integrals of curl(A_x) and curl(A_y), but to a combination of all three integrals coupled through the fish's rotational dynamics, as evident from (6). Despite this complication, the scalar fields defined by curl(A_x), curl(A_y), and curl(A_β), shown in Fig. 2(B), are informative of the net translational (x, y) and rotational β motions of the fish. To illustrate the utility of these curl fields, we show two examples of cyclic shape changes, depicted by black and green lines. A fish changing its shape following the black line undergoes zero net rotation, because the area integral of curl(A_β) is identically zero, but it swims forward in the (x, y) plane. The net displacement per period is a conserved quantity, whereas the direction of motion depends on a combination of the fish's initial shape (α₁(0), α₂(0)) and initial orientation β(0), as shown in Fig. 2(C). Here, we consider three different initial shapes, depicted by the markers ◦, △, □, all at β(0) = 0. Similarly, shape deformations following the green line lead to net reorientations in the physical space, as shown in Fig. 2(D). Evidently, no net motion occurs if the shape trajectory is degenerate, that is, if it does not enclose an area in the shape space. Further, a re-scaling of time does not affect the net motion of the fish, only the speed at which the fish completes these cyclic shape changes.

In Fig. 2 and hereafter, the equations of motion are scaled using the total length of the fish and the total mass in the head-to-tail direction of the straight fish as the characteristic length and mass scales. Specifically, we set the total mass of the fish body equal to the added mass (the actual mass of the fish is negligible). We leave the time scale unchanged because the characteristic time depends on the speed of shape changes, which is a control parameter to be determined by the controller.
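As a quick numerical illustration of (7), the sketch below evaluates the line integral of A_β around circular gaits in the shape space. The connection components used here are illustrative placeholders, not the fish's actual fields: a gait whose enclosed region integrates curl(A_β) to zero produces no net rotation, while a suitably off-centered gait does.

```python
# Sketch: net rotation over one gait cycle from the line integral in eq. (7).
# The components A_beta1, A_beta2 below are illustrative placeholders.
import numpy as np

def A_beta(a1, a2):
    # placeholder connection components; curl(A_beta) = 2 sin(a1) sin(a2)
    return np.sin(a1) * np.cos(a2), -np.cos(a1) * np.sin(a2)

def net_rotation(center, radius, n=2000):
    """Line integral of A_beta around a circular gait in the (a1, a2) plane."""
    s = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    a1 = center[0] + radius * np.cos(s)
    a2 = center[1] + radius * np.sin(s)
    da1 = -radius * np.sin(s) * (2.0 * np.pi / n)   # d a1 along the gait
    da2 =  radius * np.cos(s) * (2.0 * np.pi / n)   # d a2 along the gait
    Ab1, Ab2 = A_beta(a1, a2)
    return np.sum(Ab1 * da1 + Ab2 * da2)

# A gait centered at the origin encloses equal positive and negative curl(A_beta):
print(net_rotation(center=(0.0, 0.0), radius=1.0))   # ~0: no net turning
print(net_rotation(center=(0.8, 0.8), radius=0.5))   # nonzero: a turning gait
```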
IV. MOTION CONTROL VIA REINFORCEMENT LEARNING

The scalar fields curl(A_x), curl(A_y), curl(A_β) over the shape space encode information about the net locomotion of the fish in a driftless environment, and can be used to design simple swimming and turning gaits as shown in Fig. 2. However, these geometric tools do not allow for a straightforward design of control policies for arbitrary motion planning [44, 46], and they are even less instructive in the presence of drift. Here we consider an RL-driven approach for motion control.

We use RL to train the three-link fish on two different tasks: (i) swim in a desired direction in a driftless environment; (ii) swim towards a target point located at a distance ρ and angular position ψ from the fish nose, with ρ and ψ expressed in the fish frame of reference (Fig. 1B). In the first task, the desired direction is fixed to be parallel to the x-axis without loss of generality. For the second task, we first train the fish in a driftless environment, then introduce drift and train again in the presence of drift. The first task allows for a direct comparison of the performance of the trained policy to manually-designed policies in the context of geometric mechanics. The second task allows for evaluation and comparison of the performance of the driftless and drift-aware policies under environmental perturbations.

Central to any RL implementation are the notions of the state of the system, the observations given to the learning agent, the actions taken by the agent, and the rewards given to the agent in light of its behavior. The state s_t of the fish-fluid-target system at a time t is given by the fish position and orientation in the inertial frame (x, y, β), its shape (α₁, α₂), and the target position relative to the fish (ρ, ψ); see Fig. 1(B,C). As sensory input, we provide the fish with a set of proprioception-based observations of its shape variables α₁ and α₂, as well as an egocentric observation of the task: for controlling the direction of swimming, the fish knows the desired swimming direction relative to itself, ψ = −β, and for swimming towards a target, it knows the angular position ψ of the target point relative to itself. This yields a set of observations o_t = (α₁, α₂, ψ)_t. Additionally, for training in the presence of drift, the magnitude and direction of the drift are also provided as observations. As control action, the fish has direct control of its shape, using the rates of shape change as actions a_t = $(\dot\alpha_1, \dot\alpha_2)_t$. With this choice of action, the control can be projected onto the shape space and directly compared to the geometric mechanics approach. We constrain the values of the actions to a fixed symmetric interval, and the shape angles α₁ and α₂ are only allowed to change within a symmetric range about the straight configuration.

The fish is controlled by a stochastic policy π_θ(a_t | o_t) that produces actions a_t given observations o_t of the state of the fish-environment system. The policy is parameterized by a set of parameters θ to be optimized. An optimal policy is learned to produce behavior that maximizes rewards. We use a dense shaping reward, that is, the fish is given a reward at every decision time step. Specifically, we set the reward to be the distance the fish travels in the desired direction or towards the target state. For learning to swim parallel to the x-axis, we use the reward r_t = x_{nose,t+1} − x_{nose,t}, which is the change in the fish nose position along the x-axis. For learning to swim towards a target, the reward r_t = ρ_t − ρ_{t+1} is based on the change in the relative distance ρ from the fish nose to the target.
The return $R_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$ is defined as the infinite-horizon objective based on the sum of discounted future rewards, where γ ∈ [0, 1] is known as the discount factor; it determines the preference for immediate over future rewards. We set γ = 0.99 to make the fish foresighted. The goal is to arrive at an optimal set of parameters θ that maximizes the expected return $J(\pi_\theta) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]$ for a distribution of initial states. Here, the expectation is taken with respect to the distribution over trajectories π(τ) induced jointly by the fish dynamics, viewed as a partially-observable Markov decision process, and the policy π_θ(a_t | o_t). One approach to solving this optimization problem is to use a policy gradient method that computes an estimate of the gradient ∇_θ J for learning. Policy gradient methods are widely used to learn complex control tasks and are often regarded as among the most effective reinforcement learning techniques, especially for robotics applications [51–55]. Here, we use a specific class of policy gradient methods, known as actor-critic methods [56, 57], where the fish learns simultaneously a policy (actor) and a value function (critic). We implement this method using the clipped advantage Proximal Policy Optimization (PPO) algorithm proposed in [58]. This algorithm ensures fast learning and robust performance by limiting the amount of change allowed for the policy within one update. A pseudo-code implementation of the PPO algorithm and additional implementation details are provided in Appendix B.
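For concreteness, the sketch below shows one way the observation-action-reward interface for the target-seeking task described above could be wired up. The dynamics update `step_fish`, the decision time step, and the reward measured from the body center (the paper measures distance from the nose) are assumptions made for illustration only.

```python
# Sketch of the agent-environment interface for the target-seeking task.
# `step_fish` is a hypothetical stand-in for integrating eq. (4) over one
# decision step; here the reward uses the body center rather than the nose.
import numpy as np

class TargetSeekingEnv:
    def __init__(self, step_fish, dt=0.1):
        self.step_fish = step_fish   # advances (x, y, beta, a1, a2) given (a1_dot, a2_dot)
        self.dt = dt                 # assumed decision time step

    def reset(self):
        self.state = np.zeros(5)                       # x, y, beta, a1, a2
        theta = np.random.uniform(0.0, 2.0 * np.pi)    # target at distance 3, random bearing
        self.target = 3.0 * np.array([np.cos(theta), np.sin(theta)])
        return self._observe()

    def _observe(self):
        x, y, beta, a1, a2 = self.state
        rel = self.target - np.array([x, y])
        psi = np.arctan2(rel[1], rel[0]) - beta        # egocentric bearing of the target
        return np.array([a1, a2, np.arctan2(np.sin(psi), np.cos(psi))])

    def _rho(self):
        return np.linalg.norm(self.target - self.state[:2])

    def step(self, action):
        rho_before = self._rho()
        self.state = self.step_fish(self.state, action, self.dt)
        reward = rho_before - self._rho()              # r_t = rho_t - rho_{t+1}
        return self._observe(), reward
```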
V. TRAINING THE FISH TO SWIM

We train the three-link fish to (i) swim parallel to the x-direction in a driftless environment; (ii) swim towards a target point in the absence and presence of drift. Note that the first task can be viewed as a special case of the second task as the target location goes to infinity. We refer to the first task as direction-control for short, and to the two variants of the second task as naive and drift-aware target-seeking, respectively, based on their awareness of drift.

For the purpose of efficient training, we impose a finite time interval, following which the system state is reset to the initial state for a new round of training. Each round is referred to as an episode. In all training episodes, we initialize the fish center at the origin of the inertial frame, and we initialize the shape angles α₁, α₂ and body orientation β by sampling from a uniform distribution over all permissible angles to maximize the chances of robust learning. We fix the maximum episode length to 150 time steps, with no early termination allowed. In the target-seeking policies, the target is initially placed at a fixed distance (three unit lengths) from the fish center but at a random orientation. For the drift-aware policy, the drift magnitude is sampled from a uniform distribution between 0 and 0.15, and its direction is sampled from a uniform distribution between 0 and 2π. Here, the system is non-dimensional such that one unit of drift coming from behind will drive a straight fish to move forward one unit length per unit time. The episode initialization is sketched in code after this section.

For the direction-control policy, we perform 24 runs of RL training with 10,000 episodes in each run. The training process is illustrated in terms of reward evolution in Appendix B, Fig. B.1(A). Here, the reward is calculated by taking the sum of all rewards in an episode. There is some variability among the seeds, but most trained policies perform very well; only a single policy did not converge by the end of the training episodes. Note that fluctuations in reward after policy convergence are partly due to the stochasticity built into the policy itself and partly due to variation in task difficulty given the random initial conditions: different initial conditions require different amounts of time and effort for the fish body to align with the x-axis.

It is worth pointing out that the training results of the direction-control policy are affected significantly by the episode length. In order to swim in a desired direction starting from an arbitrary initial orientation, the fish has to first turn in that direction, then swim forward. Policies trained with longer episodes perform better in the swimming portion of the trajectory but fail to make large-angle turns, as training data collected on swimming significantly outweigh those collected on turning. On the other hand, policies trained with shorter episodes can make turns of any angle, but are less likely to swim straight after turning. We chose the episode length 150 as a compromise.

The evolution of rewards during training of the two target-seeking policies is plotted in Fig. B.1(B), each with 20 runs of the RL algorithm and 15,000 episodes in each run. The naive policy converges faster than the drift-aware policy, and both policies converge more slowly than the direction-control policy. These results indicate that the task itself, as well as variations in the environment and the number of observations, affects the convergence rate, that is, the learning difficulty. Note that the numerical value of the reward is not directly comparable between policies for different tasks.
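The episode initialization described above can be sketched as follows; the bound `alpha_max` on the initial shape angles is an assumed placeholder for the permissible shape range, while the target distance and drift ranges are taken from the text.

```python
# Sketch of the episode initialization used during training; alpha_max is an
# assumed bound standing in for the permissible range of shape angles.
import numpy as np

def sample_episode(rng, alpha_max=np.pi / 2, drift_aware=False):
    init = {
        "alpha": rng.uniform(-alpha_max, alpha_max, size=2),   # initial shape
        "beta": rng.uniform(-np.pi, np.pi),                    # initial orientation
        "target_bearing": rng.uniform(0.0, 2.0 * np.pi),       # target 3 units from center
    }
    if drift_aware:
        init["drift_magnitude"] = rng.uniform(0.0, 0.15)
        init["drift_direction"] = rng.uniform(0.0, 2.0 * np.pi)
    return init

episodes = [sample_episode(np.random.default_rng(s), drift_aware=True) for s in range(20)]
```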
VI. BEHAVIOR OF TRAINED FISH

We evaluate the performance of the trained policies by testing them under two types of conditions: conditions similar to those seen during training and perturbed conditions not seen during training.

To visualize the RL direction-control policy, we plot in Fig. 3(A) the action vector fields $(\dot\alpha_1, \dot\alpha_2)$ over the observation space (α₁, α₂, β). These vector fields depend on the orientation β of the fish in the physical space, such that the control policy $(\dot\alpha_1, \dot\alpha_2)$ forms a foliation over β. Three slices of this foliation are highlighted. The right-hand side of Fig. 3(A) provides a closer look at the policy slice at β = 0; the arrows indicate the mean actions advised by the policy for a given set of observations α₁, α₂ and β = 0. In Fig. 3(B), we show the details of two trajectories in the physical space starting from two distinct configurations. In the first test, the fish starts at the orientation β(0) = 0 from a straight shape (α₁(0), α₂(0)) = (0, 0); in the second test, it starts from a curled shape at a nonzero orientation, so that it must first turn before swimming parallel to the x-axis. In both cases, the fish is able to turn to the desired direction and swim steadily. In Fig. 3(A), we highlight the corresponding trajectories in the (α₁, α₂, β) space. As the fish moves through the physical space, β changes, leading the fish to take actions from other slices of the foliation of action vector fields. Both trajectories tend to the same periodic swimming cycle around β = 0.
Figure 3. Visualizing the direction-control RL policy. A. Given the direction-control task, we visualize the mean RL policy actions $(\dot\alpha_1, \dot\alpha_2)$ as vector fields in the observation space (α₁, α₂, β). Two example observation trajectories are shown, one starting at β(0) = 0 from the straight shape α₁(0) = α₂(0) = 0 (blue) and one starting from a curled shape at a nonzero orientation. Three slices of the foliation are highlighted, including the slice at β = 0; the inset shows a flattened view of the slice at β = 0. B. Physical space trajectories of the corresponding examples shown in A are plotted together with the fish body configurations at the starting point and at chosen points near the ends.
Figure 4. RL provides a smooth transition between turning and swimming gaits. A. The shape space trajectory of the fish reorienting itself to swim parallel to the x-axis, produced by the mean RL policy, is superimposed on the scalar curl fields from Fig. 2(B). Note that this trajectory starts off-centered and smoothly moves to cycles that are symmetric about the origin. B. The physical space trajectory due to the mean RL policy (red) in comparison to a manually patched turning-to-swimming trajectory (green and black) using the circular gaits in Fig. 2. Without further fine-tuning of the shape of the gaits, the manually patched result shows a more abrupt and unnecessarily large turn. In both simulations, the fish starts in a straight configuration with the same nonzero initial orientation β(0).

The shape space analysis introduced in Section III is very useful here to illustrate the shape deformations produced by the RL control policy. In Fig. 4(A), we superimpose the shape changes produced by the RL policy onto the scalar fields curl A_x, curl A_y and curl A_β over the shape space (α₁, α₂). The corresponding motion in the physical space is depicted in red in Fig. 4(B). The shape deformations produced by the RL policy can be interpreted as consisting of two regimes: an initial turning regime followed by a forward swimming regime. The turning regime is indicated by the initial portion of the shape trajectory enclosing most of the blue area in the curl A_β image; the integral in (7) along this portion of the trajectory is negative, leading to a clockwise rotation. The swimming regime is indicated by the periodic shape changes enclosing the rectangular orange portion in the curl A_x image. The area integral of curl A_x along this portion of the trajectory is positive, whereas the corresponding area integrals of curl A_y and curl A_β are identically zero, leading to net motion in the positive x-direction. These shape deformations and the resulting turning and swimming motions can be compared to a manually-designed shape trajectory based on the turning (green) and swimming (black) gaits in Fig. 2(A,B). Specifically, starting from a straight fish configuration, we follow the solid portion of the green trajectory (turning) in Fig. 2(A,B), and transition into the black trajectory (swimming) at the second intersection of the green and black shape trajectories. The resulting motion in the physical space is superimposed onto Fig. 4(B). In the RL-produced motion, the fish turns more slowly than in the manually patched gait in order to produce more forward motion, which makes the transition between turning and swimming seamless. Note that, in the swimming regime, the policy produces cyclic shape deformations that do not maximize the area integral of curl A_x (the shape trajectory does not enclose the whole orange portion in the curl A_x image).
Figure 5. Racing against the RL fish. To test the optimality of our direction-control RL policy, we compare forward swimming performance between the geometrically-designed gaits and the RL results. In A–C we use rectangular gaits with length equal to 1.2 (small), 2.4 (medium), and 3.6 (large), respectively, and widths all equivalent to that of the shape space trajectory following the mean actions from the RL policy in D. All shape space trajectories are shown superimposed on the curl A_x field on the right. Physical space trajectories on the left show that the mean RL policy achieves its excellent performance by choosing an optimal amount of lateral oscillation during forward swimming, while the small and large rectangular gaits move slower due to either insufficient or overwhelming sideways motion. Note that the fish in A–C are initialized with the same shape but slightly different initial orientations to ensure they all swim exactly in the x-direction. In addition, all fish utilize the maximum actions allowed at each time step.

We next explore why shape deformations according to the optimal policy do not maximize the area integral of curl A_x. The key lies in the fact that the physics of the problem, specifically the rotational motion of the fish, couples displacements in the x- and y-directions.

To better explain this aspect of the RL policy, we manually prescribed cyclic shape changes that follow rectangular trajectories reminiscent of the trajectory generated by the RL policy for forward swimming. The manually-designed shape trajectories all share the same width as the RL policy albeit at different lengths, namely, 1.2 (small), 2.4 (medium), and 3.6 (large), to enclose increasingly larger regions of the orange portion of the curl A_x field; see the right column of Fig. 5. These cyclic deformations result in net displacements in the x-direction with zero-mean excursions in the y-direction. The y-excursions are due to the fact that, even though the area integrals of curl A_y and curl A_β over the regions enclosed by these shape trajectories are identically zero, leading to zero net rotation over a full cycle of shape deformations, the instantaneous rotations β of the fish body couple displacements in the x- and y-directions, as evident from Eqs. (1) and (4). For the cyclic shape changes in Fig. 5(A), the amplitude of the y-excursions is small, but so is the net displacement in the x-direction; meanwhile in Fig. 5(C), both the net displacement in the x-direction and the amplitude of the y-excursions are large. The fastest fish is the one that maximizes forward motion while minimizing lateral movements, as shown in Fig. 5(B) and recapitulated in the RL result shown in Fig. 5(D). It is worth emphasizing that the RL policy arrives at this optimal solution merely by sampling observations, actions, and rewards, with no prior or developed knowledge of the physics of the problem.

Next, we investigate the effect of the initial orientation β(0) ∈ [−π, π] on the amount of control effort required to turn and swim parallel to the x-axis. In Fig. 6(A), we show three examples of fish motion following the trained policy, corresponding to three distinct initial orientations β(0) = 0, π/6, and 5π/6.
Figure 6. Performance of RL policies in a driftless environment. A. Swimming trajectories (center of mass) produced by the mean action of the direction-control policy are shown for fish starting at orientations β = 0 (blue), π/6, and 5π/6. B. The actuation effort of reorientation is roughly proportional to the absolute value of the initial fish orientation, owing to the amount of turning maneuvers required. Results shown are based on 25 stochastic policy roll-outs per tested orientation angle. C. Center of mass (blue) and nose (grey) swimming trajectories using the mean action of the naive target-seeking policy are shown for two targets located at angular positions of π/6 and 5π/6. The fish is considered to have reached the target when its nose is within a small distance ε of the target. D. The actuation effort needed to perform the shape changes increases as the target angular position moves away from 0. Results shown use 25 stochastic policy roll-outs per tested target angle. Note that solid lines and shaded regions of B and D show the median results and the variation between the 25th and 75th percentiles, respectively.

In all examples, the fish starts from a straight configuration at the origin of the physical space. To measure the actuation effort needed in these motions, we use the integral $\int_0^\tau T_{\rm shape}\, \mathrm{d}t$ of the actuation energy $T_{\rm shape} = \frac{1}{2}(J_0 + m_2 a^2)\left(\dot\alpha_1^2 + \dot\alpha_2^2\right)$, which is the energy imparted to the fluid by the fish shape changes; see Appendix A. In Fig. 6(B), we show the actuation effort versus the initial orientation of a straight fish. Here the fish is instructed by the stochastic policy with the same action noise as in the training process. The actuation effort, as well as its variation, is larger for larger turning angles.
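A minimal sketch of the actuation-effort metric defined above, accumulated with a rectangle rule over a sampled action sequence; the constants J0 and m2*a^2 and the decision time step are placeholder values, not the paper's parameters.

```python
# Sketch: actuation effort as the time integral of T_shape over a trajectory.
import numpy as np

def actuation_effort(alpha_dot, dt, J0=1.0, m2a2=1.0):
    """alpha_dot: array of shape (n_steps, 2) holding (a1_dot, a2_dot) samples."""
    T_shape = 0.5 * (J0 + m2a2) * np.sum(alpha_dot**2, axis=1)
    return np.sum(T_shape) * dt          # rectangle-rule estimate of the integral

alpha_dot = np.random.default_rng(0).normal(size=(150, 2))   # placeholder action sequence
print(actuation_effort(alpha_dot, dt=0.1))
```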
We examine the behavior and effort of a fish instructed to reach a known target in an environment with zero drift. Fig. 6(C) shows examples of the fish swimming motion in the physical space for targets located at ψ = π/6 and ψ = 5π/6. All tests run until the fish reaches the target or a maximum interval of 1000 time steps is exceeded. We vary the target angular position while maintaining the fish initial shape and orientation (the fish always starts straight, heading in the x-direction), and we calculate the actuation effort as a function of the target orientation; see Fig. 6(D). The actuation effort increases as the target moves from the front to the back of the fish, because larger turns are required for the fish to align its heading direction with the direction of the target; this is consistent with our findings based on the direction-control policy. It is worth emphasizing that the direction-control task is equivalent to the target-seeking task with the target placed at x = +∞.

Lastly, we test the behavior and effort of a target-seeking fish in the presence of non-zero drift. Fig. 7(A) and (C) show a comparison between the naive and drift-aware policies for targets located at ψ = π/6 and ψ = 5π/6, with drift of magnitude 0.1 pointing in the x-direction and the −y-direction, respectively. The naive policy (grey lines) is able to reach the target, even though it does not directly observe the drift, albeit following different actions and swimming trajectories than the drift-aware policy (orange lines). Specifically, when following the drift-aware policy, the fish curls less when the drift is helpful and curls more when the drift is unfavorable. We assess the performance of the two policies for various drift magnitudes and directions. In Fig. 7(B), we calculate the actuation effort as a function of the drift magnitude with a fixed drift direction for two target locations.
Figure 7. Performance of target-seeking policies in the presence of drift. We compare the swimming trajectories and actuation efforts of the target-seeking policies for different drift magnitudes and directions. A and C. The naive policy (grey) and the drift-aware policy (orange) produce similar average actions in environments with constant (0.1) drift in the positive x-direction and negative y-direction, respectively. Both panels showcase trajectories reaching targets located three unit lengths away at angular positions ψ of π/6 and 5π/6. B. For a fixed drift direction (positive x-direction), the actuation effort as a function of drift magnitude evolves differently depending on the target angular position ψ. When the drift opposes the direction of the target (left), both policies fail on average to reach the target for large enough drift (cross marks). No failures are observed when the drift facilitates swimming towards the target (right). In both cases, the drift-aware policy significantly outperforms the naive policy at large drift magnitudes by saving actuation effort. Intriguingly, the inclusion of extra observations in the drift-aware policy seems to result in slightly suboptimal performance when the drift magnitude is very small. D. With the drift magnitude fixed to 0.15, the naive policy fails to reach the target if the drift direction is near the opposite end of the target angular position ψ. However, at this drift magnitude, the drift-aware policy can still reach both targets regardless of the drift direction. Note that solid lines and shaded regions of B and D show the median results and the variation between the 25th and 75th percentiles, respectively.

We find that the naive policy outperforms the drift-aware policy for small drift, but loses out or even fails to finish the task when the drift is large, especially when the drift is in the adverse direction relative to the target location. This implies that it might be wise to discard some sensory input (observations) when the perturbation due to drift is weak, especially if these extra observations act more like a distraction than a clue. As the perturbation gets stronger, however, it becomes necessary to take more observations into account. Note that neither the naive nor the drift-aware policy is able to complete the task in the given amount of time when the drift magnitude is very large and its direction is adversarial to the target location. This is because the shape actuation has no direct control over the drift itself. In Fig. 7(D), we fix the magnitude of drift to 0.15 and change its direction. Using the actuation effort as our performance metric as before, we find that the drift-aware policy has better or similar performance under all tested conditions. The naive policy fails when the drift acts in the adverse direction relative to the target, while the drift-aware policy is always able to reach the target before the episode terminates.
VII. CONCLUDING REMARKS

We considered a three-link fish swimming in potential flow. We reviewed how swimming in potential flow can be expressed as a combination of a dynamic phase (drift) and a geometric phase (driftless) over the space of fish body deformations [35]. In the driftless case, net locomotion is purely determined by the fish shape deformations, and shape space geometric techniques can be used for gait design and motion planning [44, 46], but shape actuation cannot control the drift itself. Yet, even in the driftless regime, motion planning starting from an arbitrary fish orientation and shape is non-trivial. In this paper, we applied model-free reinforcement learning techniques to the fish motion control problem, and we arrived at optimal policies for swimming (i) in a desired direction, and (ii) towards a target in the absence and presence of drift. The RL-based policies produce behavior that is robust to variations in the fish initial shape and orientation and in the target location. We used the actuation effort as a measure of policy performance under various initial conditions and in various environments, and we quantified the robustness of the RL policies to the presence of drift. We found that although the fish has no control over the drift itself, it learns to take advantage of the presence of moderate drift to reach its target. However, large adversarial drift hinders the fish's ability to reach the target. The geometric tools provided a useful framework for visualizing and interpreting the RL policies.

A few comments on the advantages and limitations of our implementation are in order. Despite algorithmic advantages, obtaining an RL policy is computationally costly, especially when the environment simulator involves high-fidelity fluid-structure interaction models. To circumvent this problem, recent work on training fish to swim uses a limited set of observations and actions [31]. For example, the zebrafish model of [31] allows only 5 discrete actions, each corresponding to a fixed amplitude of body curvature change. Reduced-order fluid models, such as the potential flow model used here, offer an enticing framework for designing control laws that can later be tested and refined in more realistic flow environments, as done in [19] for a manually-designed swimming gait from [35]. Specifically, in the simplified potential flow environment employed here, we are able to train continuous action policies using a rich set of observations, with an eye on probing the performance of these policies in more realistic flow environments in future work.
ACKNOWLEDGMENTS
Kanso would like to acknowledge support from the Office of Naval Research through the grants N00014-17-1-2287 and N00014-17-1-2062. The authors are thankful for useful discussions with Dr. Yuval Tassa.

[1] Dickinson MH, Lehmann FO, Sane SP. Wing rotation and the aerodynamic basis of insect flight. Science. 1999;284(5422):1954–1960.
[2] Tytell E, Holmes P, Cohen A. Spikes alone do not behavior make: why neuroscience needs biomechanics. Current Opinion in Neurobiology. 2011;21(5):816–822.
[3] Merel J, Botvinick M, Wayne G. Hierarchical motor control in mammals and machines. Nature Communications. 2019;10(1):1–12.
[4] Madhav MS, Cowan NJ. The synergy between neuroscience and control theory: the nervous system as inspiration for hard control challenges. Annual Review of Control, Robotics, and Autonomous Systems. 2020;3:243–267.
[5] Huang KH, Ahrens MB, Dunn TW, Engert F. Spinal projection neurons control turning behaviors in zebrafish. Current Biology. 2013;23(16):1566–1573.
[6] Burgess HA, Granato M. Sensorimotor gating in larval zebrafish. Journal of Neuroscience. 2007;27(18):4984–4994.
[7] Privat M, Romano SA, Pietri T, Jouary A, Boulanger-Weill J, Elbaz N, et al. Sensorimotor transformations in the zebrafish auditory system. Current Biology. 2019;29(23):4010–4023.
[8] Marchese AD, Onal CD, Rus D. Autonomous soft robotic fish capable of escape maneuvers using fluidic elastomer actuators. Soft Robotics. 2014;1(1):75–87. PMID: 27625912. Available from: https://doi.org/10.1089/soro.2013.0009.
[9] Lauder GV. Fish locomotion: recent advances and new directions. Annual Review of Marine Science. 2015;7(1):521–545. PMID: 25251278. Available from: https://doi.org/10.1146/annurev-marine-010814-015614.
[10] Cianchetti M, Follador M, Mazzolai B, Dario P, Laschi C. Design and development of a soft robotic octopus arm exploiting embodied intelligence. In: 2012 IEEE International Conference on Robotics and Automation. IEEE; 2012. p. 5271–5276.
[11] Laschi C, Cianchetti M, Mazzolai B, Margheri L, Follador M, Dario P. Soft robot arm inspired by the octopus. Advanced Robotics. 2012;26(7):709–727.
[12] Pfeifer R, Bongard J. How the Body Shapes the Way We Think: A New View of Intelligence. MIT Press; 2006.
[13] Pfeifer R, Lungarella M, Iida F. Self-organization, embodiment, and biologically inspired robotics. Science. 2007;318(5853):1088–1093. Available from: https://science.sciencemag.org/content/318/5853/1088.
[14] Liao JC, Beal DN, Lauder GV, Triantafyllou MS. Fish exploiting vortices decrease muscle activity. Science. 2003;302(5650):1566–1569.
[15] Müller UK. Fish 'n flag. Science. 2003;302(5650):1511–1512.
[16] Eloy C, Le Gal P, Le Dizès S. Elliptic and triangular instabilities in rotating cylinders. Journal of Fluid Mechanics. 2003;476:357–388.
[17] Kern S, Koumoutsakos P. Simulations of optimized anguilliform swimming. Journal of Experimental Biology. 2006;209(24):4841–4857.
[18] Mittal R, Dong H, Bozkurttas M, Loebbecke A, Najjar F. Analysis of flying and swimming in nature using an immersed boundary method. In: 36th AIAA Fluid Dynamics Conference and Exhibit; 2006. p. 2867.
[19] Eldredge JD. Numerical simulations of undulatory swimming at moderate Reynolds number. Bioinspiration & Biomimetics. 2006;1(4):S19.
[20] Tytell ED, Hsu CY, Williams TL, Cohen AH, Fauci LJ. Interactions between internal forces, body stiffness, and fluid environment in a neuromechanical model of lamprey swimming. Proceedings of the National Academy of Sciences. 2010;107(46):19832–19837.
[21] Hamlet C, Fauci LJ, Tytell ED. The effect of intrinsic muscular nonlinearities on the energetics of locomotion in a computational model of an anguilliform swimmer. Journal of Theoretical Biology. 2015;385:119–129.
[22] Engelmann J, Hanke W, Mogdans J, Bleckmann H. Hydrodynamic stimuli and the fish lateral line. Nature. 2000;408(6808):51–52. Available from: https://doi.org/10.1038/35040706.
[23] Ristroph L, Liao JC, Zhang J. Lateral line layout correlates with the differential hydrodynamic pressure on swimming fish. Physical Review Letters. 2015;114:018102. Available from: https://link.aps.org/doi/10.1103/PhysRevLett.114.018102.
[24] Colvert B, Kanso E. Fishlike rheotaxis. Journal of Fluid Mechanics. 2016;793:656–666.
[25] Colvert B, Liu G, Dong H, Kanso E. How can a source be located by responding to local information in its hydrodynamic trail? In: 2017 IEEE 56th Annual Conference on Decision and Control (CDC). IEEE; 2017. p. 2756–2761.
[26] Montgomery JC, Baker CF, Carton AG. The lateral line can mediate rheotaxis in fish. Nature. 1997;389(6654):960–963. Available from: https://doi.org/10.1038/40135.
[27] Filella A, Nadal F, Sire C, Kanso E, Eloy C. Model of collective fish behavior with hydrodynamic interactions. Physical Review Letters. 2018;120(19):198101.
[28] Gazzola M, Tchieu AA, Alexeev D, de Brauer A, Koumoutsakos P. Learning to school in the presence of hydrodynamic interactions. Journal of Fluid Mechanics. 2016;789:726–749.
[29] Gustavsson K, Biferale L, Celani A, Colabrese S. Finding efficient swimming strategies in a three-dimensional chaotic flow by reinforcement learning. The European Physical Journal E. 2017;40(12):110.
[30] Colabrese S, Gustavsson K, Celani A, Biferale L. Flow navigation by smart microswimmers via reinforcement learning. Physical Review Letters. 2017;118(15):158004.
[31] Verma S, Novati G, Koumoutsakos P. Efficient collective swimming by harnessing vortices through deep reinforcement learning. Proceedings of the National Academy of Sciences. 2018;115(23):5849–5854.
[32] Colvert B, Alsalman M, Kanso E. Classifying vortex wakes using neural networks. Bioinspiration & Biomimetics. 2018;13(2):025003.
[33] Alsalman M, Colvert B, Kanso E. Training bioinspired sensors to classify flows. Bioinspiration & Biomimetics. 2018;14(1):016009. Available from: https://doi.org/10.1088/1748-3190/aaef1d.
[34] Weber P, Arampatzis G, Novati G, Verma S, Papadimitriou C, Koumoutsakos P. Optimal flow sensing for schooling swimmers. Biomimetics. 2020;5(1):10.
[35] Kanso E, Marsden JE. Optimal motion of an articulated body in a perfect fluid. In: Proceedings of the 44th IEEE Conference on Decision and Control. IEEE; 2005. p. 2511–2516.
[36] Kanso E, Marsden JE, Rowley CW, Melli-Huber JB. Locomotion of articulated bodies in a perfect fluid. Journal of Nonlinear Science. 2005;15(4):255–289.
[37] Eloy C. On the best design for undulatory swimming. Journal of Fluid Mechanics. 2013;717:48–89.
[38] Müller UK, Van Leeuwen JL. Undulatory fish swimming: from muscles to flow. Fish and Fisheries. 2006;7(2):84–103.
[39] Dickinson MH, Farley CT, Full RJ, Koehl M, Kram R, Lehman S. How animals move: an integrative view. Science. 2000;288(5463):100–106.
[40] Degris T, Pilarski PM, Sutton RS. Model-free reinforcement learning with continuous action in practice. In: 2012 American Control Conference (ACC); 2012. p. 2177–2182.
[41] Haith AM, Krakauer JW. Model-based and model-free mechanisms of human motor learning. In: Richardson MJ, Riley MA, Shockley K, editors. Progress in Motor Control. New York, NY: Springer New York; 2013. p. 1–21.
[42] Sutton RS, Barto AG. Reinforcement Learning: An Introduction. MIT Press; 2018.
[43] Eldredge JD. Numerical simulation of the fluid dynamics of 2D rigid body motion with the vortex particle method. Journal of Computational Physics. 2007;221(2):626–648.
[44] Melli JB, Rowley CW, Rufat DS. Motion planning for an articulated body in a perfect planar fluid. SIAM Journal on Applied Dynamical Systems. 2006;5(4):650–669.
[45] Nair S, Kanso E. Hydrodynamically coupled rigid bodies. Journal of Fluid Mechanics. 2007;592:393–411.
[46] Hatton RL, Choset H. Connection vector fields and optimized coordinates for swimming systems at low and high Reynolds numbers. In: ASME 2010 Dynamic Systems and Control Conference. American Society of Mechanical Engineers Digital Collection; 2010. p. 817–824.
[47] Morgansen KA, Duindam V, Mason RJ, Burdick JW, Murray RM. Nonlinear control methods for planar carangiform robot fish locomotion. In: Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No. 01CH37164). vol. 1. IEEE; 2001. p. 427–434.
[48] Bloch AM, Krishnaprasad P, Marsden JE, Murray RM. Nonholonomic mechanical systems with symmetry. Archive for Rational Mechanics and Analysis. 1996;136(1):21–99.
[49] Murray RM, Li Z, Sastry SS. A Mathematical Introduction to Robotic Manipulation. CRC Press; 1994.
[50] Lamb H. Hydrodynamics. Dover Books on Physics. Dover Publications; 1945. Available from: https://books.google.com/books?id=237xDg7T0RkC.
[51] Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning. 1992;8(3-4):229–256.
[52] Sutton RS, McAllester DA, Singh SP, Mansour Y. Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems; 2000. p. 1057–1063.
[53] Kakade SM. A natural policy gradient. In: Advances in Neural Information Processing Systems; 2002. p. 1531–1538.
[54] Peters J, Schaal S. Natural actor-critic. Neurocomputing. 2008;71(7-9):1180–1190.
[55] Peters J, Schaal S. Policy gradient methods for robotics. In: 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE; 2006. p. 2219–2225.
[56] Konda VR, Tsitsiklis JN. Actor-critic algorithms. In: Advances in Neural Information Processing Systems; 2000. p. 1008–1014.
[57] Haarnoja T, Zhou A, Abbeel P, Levine S. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. 2018.
[58] Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. 2017.
[59] Leonard NE. Stability of a bottom-heavy underwater vehicle. Automatica. 1997;33(3):331–346.
[60] Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.

Appendix A: Physics of the fish model
We review the derivation of the equations of motion governing the swimming of an articulated three-link fish in potential flow (Fig. 1).
1. Fish kinematics
Consider planar motions of the three-link fish. Let x = (x, y) denote the position of the center of mass G of the middle link, and let β denote the orientation of the fish relative to a fixed inertial frame, here taken to be the angle between the x-axis and the major axis of symmetry of the middle link. Let α₁ and α₂ be the rotation angles of the front link relative to the middle link and of the middle link relative to the rear link; that is to say, (α₁, α₂) represents the shape of the three-link fish. It is convenient for the following development to introduce a body-fixed frame (b₁, b₂, b₃), attached at G and co-rotating with the middle link. This body-fixed frame is related to the inertial frame (e₁, e₂, e₃) via a rigid-body rotation such that e₁ = cos β b₁ − sin β b₂, e₂ = sin β b₁ + cos β b₂, and e₃ = b₃.

The velocity $(\dot{x}, \dot{y})$ of the center of mass of the middle link, when expressed in the body-fixed frame, is given by

$$\mathbf{v} = v_x \mathbf{b}_1 + v_y \mathbf{b}_2 = (\dot{x}\cos\beta + \dot{y}\sin\beta)\, \mathbf{b}_1 + (-\dot{x}\sin\beta + \dot{y}\cos\beta)\, \mathbf{b}_2. \tag{A1}$$

Assuming all three links are made of identical ellipsoids of length a, width b, and height c, the velocities of the centers of mass G₁ and G₂ of the front and rear links, expressed in the body-fixed frame of the middle link, are given by (i = 1, 2 denote the front and rear links, respectively)

$$\mathbf{v}_i = \left(v_x \mp a\dot{\alpha}_i\sin\alpha_i - a\dot{\beta}\sin\alpha_i\right)\mathbf{b}_1 + \left(v_y \pm a\dot{\beta} + a\dot{\alpha}_i\cos\alpha_i \pm a\dot{\beta}\cos\alpha_i\right)\mathbf{b}_2. \tag{A2}$$

The angular velocities of the middle, front, and rear links are given by $\dot{\beta}$, $\dot{\beta} + \dot{\alpha}_1$, and $\dot{\beta} - \dot{\alpha}_2$, respectively.
2. Kinetic energy of the articulated body
In the absence of the fluid, the kinetic energy of the articulated three-link body is given by

$$T_{\rm body} = \frac{1}{2} m_s\, \mathbf{v}\cdot\mathbf{v} + \frac{1}{2} J_s \dot{\beta}^2 + \frac{1}{2}\sum_i \left[ m_s\, \mathbf{v}_i\cdot\mathbf{v}_i + J_s \left(\dot{\beta} \pm \dot{\alpha}_i\right)^2 \right], \tag{A3}$$

where $m_s = \frac{4}{3}\pi abc\,\rho_s$ and $J_s = \frac{1}{5}(a^2 + b^2)\, m_s$ are the mass and moment of inertia of each solid link, with ρ_s the density of the links.
3. Kinetic energy of the fluid
The three-link fish is submerged in an unbounded domain of incompressible and irrotational fluid, such that the fluid velocity u = ∇φ can be expressed as the gradient of a potential function φ. It is a standard result in potential flow theory that the kinetic energy of the fluid can be expressed in terms of the variables of the submerged solid [35, 36, 50]. In the case of a single ellipsoid, the kinetic energy of the fluid is given by $T_{\rm fluid} = \left[(m_x v_x^2 + m_y v_y^2) + J_z \dot{\beta}^2\right]/2$, where m_x, m_y and J_z are the added masses and added moment of inertia due to the presence of the fluid, expressed in a body-fixed frame that coincides with the major and minor axes of the ellipsoid. These quantities depend on the geometric properties a, b, c of the submerged ellipsoid [59]. For a non-spherical body, the added masses m_x, m_y depend on the direction of motion: the added mass is larger when moving in the direction of the minor axis of symmetry of the ellipsoid, that is to say, in the transverse direction, hence m_x ≤ m_y.

In the case of the three-link fish, the kinetic energy of the fluid is of the form

$$\begin{aligned} T_{\rm fluid} = \; & \frac{1}{2} m_x v_x^2 + \frac{1}{2} m_y v_y^2 + \frac{1}{2} J_z \dot{\beta}^2 + \frac{1}{2}\sum_i J_z \left(\dot{\beta} \pm \dot{\alpha}_i\right)^2 \\ & + \frac{1}{2}\sum_i m_x \left(v_x\cos\alpha_i \pm v_y\sin\alpha_i + a\dot{\beta}\sin\alpha_i\right)^2 \\ & + \frac{1}{2}\sum_i m_y \left(\mp v_x\sin\alpha_i + v_y\cos\alpha_i \pm a\dot{\beta}\cos\alpha_i \pm a\dot{\beta} + a\dot{\alpha}_i\right)^2. \end{aligned} \tag{A4}$$

Here we transform the velocity components of the head and tail by α₁ and α₂, respectively, to match the added mass components.
4. Kinetic energy of the body-fluid system
The kinetic energy of the fish-fluid system is obtained by taking the sum of Eq. (A3) and Eq. (A4), which can be expressed in matrix form as follows:

$$T = T_{\rm body} + T_{\rm fluid} = \frac{1}{2} \begin{pmatrix} v_x \\ v_y \\ \dot{\beta} \\ \dot{\alpha}_1 \\ \dot{\alpha}_2 \end{pmatrix}^{\!T} \begin{pmatrix} \mathbb{I}_{\rm lock} & \mathbb{I}_{\rm couple} \\ \mathbb{I}_{\rm couple}^T & \mathbb{I}_{\rm shape} \end{pmatrix} \begin{pmatrix} v_x \\ v_y \\ \dot{\beta} \\ \dot{\alpha}_1 \\ \dot{\alpha}_2 \end{pmatrix}. \tag{A5}$$

Here, $\mathbb{I}_{\rm lock}$ is a 3 × 3 matrix that depends only on α₁ and α₂,

$$\mathbb{I}_{\rm lock} = \begin{pmatrix} M & H \\ H^T & J \end{pmatrix}, \tag{A6}$$

where M is a 2 × 2 matrix,

$$M = \begin{pmatrix} m_1\left(1 + \sum_i \cos^2\alpha_i\right) + m_2 \sum_i \sin^2\alpha_i & \frac{1}{2}(m_1 - m_2)(\sin 2\alpha_1 - \sin 2\alpha_2) \\ \frac{1}{2}(m_1 - m_2)(\sin 2\alpha_1 - \sin 2\alpha_2) & m_2\left(1 + \sum_i \cos^2\alpha_i\right) + m_1 \sum_i \sin^2\alpha_i \end{pmatrix}, \tag{A7}$$

J is a moment-of-inertia scalar given by

$$J = 3 J_0 + m_1 a^2 \sum_i \sin^2\alpha_i + m_2 a^2 \sum_i \left(1 + \cos\alpha_i\right)^2, \tag{A8}$$

and H is given by

$$H = \begin{pmatrix} \frac{1}{2}(m_1 - m_2)\, a \sum_i \sin 2\alpha_i - m_2\, a \sum_i \sin\alpha_i \\ \frac{1}{2}(m_1 - m_2)\, a \left(\cos 2\alpha_2 - \cos 2\alpha_1\right) + m_2\, a \left(\cos\alpha_1 - \cos\alpha_2\right) \end{pmatrix}. \tag{A9}$$

Here we used m₁ = m_s + m_x, m₂ = m_s + m_y, and J₀ = J_s + J_z. Note that H couples the translational and rotational motions of the articulated body. In the case of a single ellipsoid, H is identically zero.

Further, $\mathbb{I}_{\rm couple}$ is a 3 × 2 matrix given by

$$\mathbb{I}_{\rm couple} = \begin{pmatrix} -m_2\, a\sin\alpha_1 & m_2\, a\sin\alpha_2 \\ m_2\, a\cos\alpha_1 & m_2\, a\cos\alpha_2 \\ J_0 + m_2\, a^2 (1 + \cos\alpha_1) & -J_0 - m_2\, a^2 (1 + \cos\alpha_2) \end{pmatrix}. \tag{A10}$$

Finally, $\mathbb{I}_{\rm shape}$ is a 2 × 2 matrix given by

$$\mathbb{I}_{\rm shape} = \begin{pmatrix} J_0 + m_2\, a^2 & 0 \\ 0 & J_0 + m_2\, a^2 \end{pmatrix}. \tag{A11}$$
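The matrices in Eqs. (A6)–(A11) can be assembled directly; the sketch below does so and forms the local connection of Eq. (5). The numerical values of m1, m2, J0, and a are arbitrary placeholders, and the code mirrors the expressions as reconstructed above.

```python
# Sketch: assembling I_lock (A6-A9), I_couple (A10), and I_shape (A11), and
# forming the connection matrix A = -I_lock^{-1} I_couple of Eq. (5).
import numpy as np

def mass_matrices(a1, a2, m1=2.0, m2=6.0, J0=1.0, a=0.5):
    """Mass matrices at shape (a1, a2); parameter values are placeholders."""
    c1, c2, s1, s2 = np.cos(a1), np.cos(a2), np.sin(a1), np.sin(a2)
    m12 = 0.5 * (m1 - m2)
    M = np.array([
        [m1 * (1 + c1**2 + c2**2) + m2 * (s1**2 + s2**2),
         m12 * (np.sin(2 * a1) - np.sin(2 * a2))],
        [m12 * (np.sin(2 * a1) - np.sin(2 * a2)),
         m2 * (1 + c1**2 + c2**2) + m1 * (s1**2 + s2**2)],
    ])
    J = 3 * J0 + m1 * a**2 * (s1**2 + s2**2) + m2 * a**2 * ((1 + c1)**2 + (1 + c2)**2)
    H = np.array([
        [m12 * a * (np.sin(2 * a1) + np.sin(2 * a2)) - m2 * a * (s1 + s2)],
        [m12 * a * (np.cos(2 * a2) - np.cos(2 * a1)) + m2 * a * (c1 - c2)],
    ])
    I_lock = np.block([[M, H], [H.T, np.array([[J]])]])
    I_couple = np.array([
        [-m2 * a * s1,                m2 * a * s2],
        [ m2 * a * c1,                m2 * a * c2],
        [J0 + m2 * a**2 * (1 + c1), -(J0 + m2 * a**2 * (1 + c2))],
    ])
    I_shape = np.diag([J0 + m2 * a**2, J0 + m2 * a**2])
    return I_lock, I_couple, I_shape

I_lock, I_couple, _ = mass_matrices(0.3, -0.2)
A = -np.linalg.solve(I_lock, I_couple)   # local connection matrix, Eq. (5)
print(A)
```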
Appendix B: Proximal Policy Optimization (PPO) Algorithms

Algorithm B.1 Environment Simulation

for time step t = 0, 1, ... do
    if t = 0 or the episode terminates then
        store the time step of episode termination; reset the state s_t ∼ P(s_0)
        evaluate the observation: o_t ∼ o(s_t)
    end if
    sample an action from the policy a_t ∼ π_θ(a_t | o_t)
    evolve the next state according to the fish physics s_{t+1} ∼ P(s_{t+1} | s_t, a_t)
    evaluate the next observation o_{t+1} ∼ o(s_{t+1}) and reward r_t ∼ r(a_t, o_{t+1})
    if t = 0 or mod(t, N) ≠ 0 then
        append the current action, observation, reward, and probability of sampling the action to assemble the vectors a_{N×n_a}, o_{N×n_o}, r_{N×1}, and π_{θ_old}(a|o)_{N×1}
    else
        update the agent networks according to Algorithm B.2
    end if
end for
Algorithm B.2 Updating the Agent

for update epoch number κ = 0, 1, ..., K do
    compute the truncated return using the rewards r_{N×1} and assemble it into the vector R_{N×1}
    estimate the infinite-horizon return using R_{N×1} and V_T = V_φ(o_T) if bootstrapping is desired (see Eq. B2)
    using o_{N×n_o} and the value function V_φ, evaluate the expected returns at each time step and store them into V_{N×1}
    compute the advantage A = R_{N×1} − V_{N×1} and normalize it by its mean and variance if desired
    evaluate the probability of realizing a_{N×n_a} based on o_{N×n_o} for the policy π_θ, and store it into π_θ(a|o)_{N×1}
    compute the action-likelihood ratio: ϱ_θ = π_θ(a|o)_{N×1} / π_{θ_old}(a|o)_{N×1}
    compute the clipped surrogate loss function: L_clip(θ) = mean[ min[ ϱ_θ · A, clip(ϱ_θ, 1 − ε, 1 + ε) · A ] ]
    compute the value-function loss: L_value(φ) = 0.5 · mean[ (R_{N×1} − V_{N×1})² ]
    compute the total loss: L(θ, φ) = −L_clip(θ) + L_value(φ) − α · entropy[π_θ]
    update the parameters (θ, φ) to minimize the total loss using a gradient-based optimizer (e.g., Adam [60])
end for
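A minimal NumPy sketch of the loss computation in Algorithm B.2 follows. The clipping parameter eps and the entropy coefficient c_ent are assumed values (the text does not report them), and gradient computation and network updates are omitted.

```python
# Sketch of the per-update losses in Algorithm B.2; ratios, advantages,
# returns, values, and entropy are assumed to be precomputed.
import numpy as np

def ppo_losses(ratio, advantage, returns, values, entropy, eps=0.2, c_ent=0.01):
    adv = (advantage - advantage.mean()) / (advantage.std() + 1e-8)  # optional normalization
    L_clip = np.mean(np.minimum(ratio * adv,
                                np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv))
    L_value = 0.5 * np.mean((returns - values) ** 2)
    return -L_clip + L_value - c_ent * entropy        # total loss to minimize
```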
We implement the clipped advantage Proximal Policy Optimization (PPO) method proposed by [58] for our RL training. PPO maximizes a surrogate objective that clips off unwanted changes when the policy deviates too much from the policy of the previous cycle, to ensure faster and more robust convergence. We refer readers to the original reference cited above, as well as to OpenAI's documentation of the PPO algorithm and their baseline implementations, for a thorough explanation of the theory and details behind this method.

Our implementation can be separated into two parts. The main loop simulates the environment using the action sequence a_t generated by the agent, and stores the observed rollouts for future updates; see Algorithm B.1.
Figure B.1. Evolution of rewards during the training process. A. Total rewards per episode achieved by policies trained to swim parallel to the x-axis in a driftless environment using bootstrapped (blue) and truncated (black) return estimates. Here solid lines indicate the median, and the shaded regions show the variation between the 25th and 75th percentiles for 24 runs of the learning algorithm. B. Total rewards obtained by policies trained to swim towards a given target, both of which adopt bootstrapped return estimates. Here red indicates naive policies trained in a driftless environment, while yellow represents policies trained in the presence of drift, with the drift magnitude and direction supplied as additional observations to the policy. Again, lines and shaded regions indicate the median and the 25th–75th percentile range, respectively.
Note that n_o and n_a are used to indicate the numbers of observable states and actions. Equations describing the fish-fluid interactions were integrated numerically between consecutive decision steps using an adaptive-time-step, explicit RK45 method. The agent networks are updated every N time steps for K epochs. Here the value of K is chosen to be 80 and the value of N is set to 4050, an integer multiple of the episode length 150. For simplicity, we assume our continuous action variables follow a multivariate normally-distributed policy π_θ with mean value represented by a neural network parameterized by θ and constant diagonal covariance matrices; the critic/value function V_φ(o_t) is also represented by a neural network with parameters φ. Specifically, both the mean policy and the value function are implemented as feed-forward neural networks with two hidden layers and tanh activation functions. The sizes of the two hidden layers were fixed to 64 and 32, respectively. Finally, using the trajectories collected during the previous N time steps, the parameters θ, φ are updated according to the total loss function L(θ, φ) via a back-propagating gradient-based optimizer; see Algorithm B.2. Note that since we did not perform systematic hyper-parameter tuning, readers might want to explore different values for better performance.

Another important side note is that, since it is in general impossible to obtain the unrealized infinite-horizon return $R_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$, we need to choose an appropriate estimator of this value based on finite-length simulations. We can either simply truncate the rewards after some step k,

$$\hat{R}_t\Big|_{\rm truncation} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^k r_{t+k}, \tag{B1}$$

or we can use the trained value function (critic) to approximate the residual contribution to the return via k-step bootstrapping,

$$\hat{R}_t\Big|_{\rm bootstrapping} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^k r_{t+k} + \gamma^{k+1} V_\phi(o_{t+k+1}). \tag{B2}$$

We compared these two approaches for the direction-control task and observed that bootstrapping results in faster convergence and higher rewards in general; see Fig. B.1(A). As a result, bootstrapping is used for all tasks depicted in the main text, where k was determined by the number of available future rewards; namely, k decreases from 149 to 0 as the number of time steps increases from 1 to 150 in each episode.
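The two return estimators of Eqs. (B1) and (B2) can be written compactly as follows; the reward and value arrays are assumed to be long enough for the chosen k.

```python
# Sketch of the return estimators in Eqs. (B1)-(B2) for a single episode:
# rewards r[0..T-1], value estimates V[t] = V_phi(o_t), discount gamma.
import numpy as np

def truncated_return(r, t, k, gamma):
    idx = np.arange(k + 1)
    return np.sum(gamma**idx * r[t:t + k + 1])                 # Eq. (B1)

def bootstrapped_return(r, V, t, k, gamma):
    return truncated_return(r, t, k, gamma) \
        + gamma**(k + 1) * V[t + k + 1]                        # Eq. (B2)
```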
In addition, we show the difference in training rewards and convergence speed between the naive policy and the drift-aware policy in Fig. B.1(B). In general, the inclusion of more observations increased the time to convergence and the variance in training rewards.

Lastly, we invite interested readers to visit our source repository at https://github.com/mjysh/RL3linkFish.