A reinforcement learning control approach for underwater manipulation under position and torque constraints
Ignacio Carlucho (a), Mariano De Paula (b), Corina Barbalata (a), Gerardo G. Acosta (b)
[email protected], [email protected], [email protected], [email protected]
(a) Department of Mechanical Engineering, Louisiana State University, Baton Rouge, USA
(b) INTELYMEC Group, Centro de Investigaciones en Física e Ingeniería del Centro (CIFICEN), UNICEN – CICpBA – CONICET, 7400 Olavarría, Argentina
Abstract—In marine operations underwater manipulators play a primordial role. However, due to uncertainties in the dynamic model and disturbances caused by the environment, low-level control methods require great capabilities to adapt to change. Furthermore, under position and torque constraints the requirements for the control system are greatly increased. Reinforcement learning is a data-driven control technique that can learn complex control policies without the need of a model. The learning capabilities of these types of agents allow for great adaptability to changes in the operative conditions. In this article we present a novel reinforcement learning low-level controller for the position control of an underwater manipulator under torque and position constraints. The reinforcement learning agent is based on an actor-critic architecture using sensor readings as state information. Simulation results using the Reach Alpha 5 underwater manipulator show the advantages of the proposed control strategy.
Index Terms—Underwater manipulation, Reinforcement learning, Neural networks, Intelligent control, Deep Deterministic Policy Gradient
I. INTRODUCTION
In the last decades the importance of underwater manipulators in marine operations has grown continuously. Most robotic underwater industrial applications are conducted with Remotely Operated Vehicles (ROVs), where a human operator is tasked with the remote operation of the manipulator [1]. However, due to the limited number of expert operators and the high cost of operations, the industry is migrating towards Autonomous Underwater Vehicles (AUVs) [2]. In this type of scenario, a manipulator, usually electric, is mounted on the AUV and operates autonomously; this, however, requires a robust and adaptable control system. Furthermore, in autonomous missions different types of operational constraints may appear, such as specific joint constraints that must be followed in order to avoid collisions [3], or decreased joint torque due to faulty motors. These constraints increase the need for designing complex control systems for robust manipulation.

One of the most used low-level controllers for manipulators is the classical Proportional Integral Derivative (PID) controller [4], due mostly to its simplicity of use and low computational requirements. However, when it is used for controlling manipulator arms, it must cope with highly non-linear systems. This issue is aggravated in underwater environments, where unknown disturbances affect the behaviour of the arm. Furthermore, for underwater manipulators, controllers are used under the assumption that the arm will move slowly and that, as such, it is possible to decouple each degree of freedom, something that is not true for every application [5]. Researchers have also turned to non-linear optimal control techniques as a viable option, since they allow to optimize a cost function under different metrics. One of these techniques is Model Predictive Control (MPC) [6], used successfully for controlling a number of different underwater robots [7], [8]. However, one of the drawbacks of this technique is that it requires an accurate model of the plant in order to work properly, not a trivial matter in underwater robotics [9].

Data-driven control techniques have appeared as an alternative for systems with complex or unknown models. One of these techniques is Reinforcement Learning (RL) [10]. In the RL framework, the robot arm control problem can be formulated as a Markov Decision Process (MDP) [11]. Solving an RL problem consists in iteratively learning a task from interactions to achieve a goal. During learning, an artificial agent (controller) interacts with the target system (arm) by taking an action (torque command) that makes the robot evolve from its current state x_t ∈ X ⊆ R^n to x_{t+1}. The agent then receives a numerical signal r_t, called reward, which provides a measure of how good (or bad) the action taken at time t is in terms of the observed state transition. Many works have used the RL paradigm for controlling AUVs in underwater environments [12]. However, this technique has not yet been applied to underwater manipulators.

The main contribution of this work is the development of a reinforcement learning based control system for the low-level control of an electric underwater manipulator under position and torque constraints. Our reinforcement learning formulation is based on the Deep Deterministic Policy Gradient (DDPG) algorithm [13]. The proposed method uses an actor-critic structure, where the actor is a function that maps system states to actions and the critic is a function that assesses the actions chosen by the actor.
Deep Neural Networks (DNNs) are used as function approximators for the actor and critic. Results in simulation show the advantages of our proposal when controlling a simulated version of the Reach 5 Alpha manipulator, shown in Fig. 1. The proposed controller is compared with an MPC, showing that the RL controller is able to outperform the MPC.

The article is structured as follows: Section II presents an overview of related works, followed by Section III, which introduces the basics of RL control utilized in our formulation. In Section IV the details of our implementation are described, in Section V we present the results obtained with our proposed control scheme, and finally Section VI presents the overall conclusions of the proposed work.
II. RELATED WORKS

Designing control systems under constraint considerations for robotic manipulators appeared due to the need of robots to interact with the environment. Some of the most fundamental approaches focused on designing motion/interaction control systems by using a hybrid control formulation [14], [15]. In this approach, the constraints are expressed based on the end-effector's working space, and are used to decide the type of control law (either a motion control law or a force regulator). Nevertheless, constraints are not imposed only by the interaction with the environment, but are also required for cases when the robotic manipulator has to adjust its working space due to obstacles in the environment, or faults in the robotic system. In [16] a passivity-based kinematic control law is proposed under joint velocity limit considerations. The method proposed can be adapted to different feedback control laws, but can be applied only to redundant systems. An adaptive neural-network control for robotic manipulators with parametric uncertainties and motion constraints is proposed in [17]. The simulation and experimental results with a 2 degree of freedom (DOF) planar manipulator show the velocity constraints always being respected, but steady-state errors are present. Deep neural-network approaches have become popular in the past years. An example is given in [18], where an obstacle avoidance control law is designed for redundant manipulators. The problem is reformulated as a Quadratic Programming (QP) problem at the speed level, and a deep recurrent neural network is designed to solve the QP problem in an online way. Although the simulation results show that the robot is capable of avoiding the obstacles while tracking the predefined trajectories, an experimental evaluation is not presented.

The requirements of adaptability and the difficulties with modeling have led research towards intelligent control methods, such as reinforcement learning. Many RL methods have been previously applied to the control of manipulators [19]. In [20] asynchronous reinforcement learning was used to train a series of robotic manipulators to solve a door opening task. In [21] a mobile manipulator task is solved by Proximal Policy Optimization (PPO), a well known RL algorithm [22], in combination with a Deep Neural Network.
Fig. 1. Reach 5 Alpha underwater manipulator [26].

Specifically for underwater environments, RL has been used as a control technique in several previous works [23]–[25]. However, the majority of these works focus on the control of AUVs. Works utilizing RL in underwater manipulators are lacking in the literature.
III. REINFORCEMENT LEARNING BASED CONTROL
From the point of view of classical and modern control, the design strategy of a control system is based on the complete knowledge of the dynamics of the system under study. Assuming that there is a model with the adequate capacity to describe the dynamics of the system, generally through a set of differential equations, the control problem is reduced to designing a controller (or agent) capable of generating the adequate control actions for the system to achieve a given objective, goal, task or specific desired behavior. In this way, the performance capabilities of conventionally designed control systems are excessively dependent on the mathematical models used to describe the behavior of the dynamic systems to be controlled. However, underwater manipulation is a complex decision making problem in which the presence of uncertainty in the dynamics is ubiquitous and, consequently, it is of paramount importance to design and use controllers with suitable adaptation capabilities.

Markov decision processes are models for sequential decision making problems when outcomes are uncertain [11]. In our formulation, we consider a finite-horizon Markov decision process with 1, 2, ..., T decisions and T − 1 visited stages [27]. That is, the decision (action) at time t is made at the beginning of stage t, which corresponds to the time interval from t to the next t+1. So, at any stage, or discrete time, t, the system is at a state x_t. In this sense, we have a finite set X of system states, such that x_t ∈ X, ∀t = 1, ..., T. The decision maker observes state x_t ∈ X at stage t and may choose an action u_t from the set of finite allowable actions, generating a cost L(x_t, u_t). Moreover, we let p(·|x_t, u_t) denote the probability distribution, or transition probabilities, of obtaining state x′ = x_{t+1} at stage t+1.

A deterministic Markovian decision rule at state x_t is a function ψ_t : x_t → u_t which maps the action choice given at state x_t. It is called deterministic because it chooses an action with certainty, and Markovian (memoryless) since it depends only on the current system state. We let D_t denote the set of possible deterministic Markovian decision rules at stage t. D_t is a subset of more general rules, where the action may depend on the past history of the system and actions may not be chosen with certainty but rather according to a probability distribution.

A policy or strategy specifies the decision rules to be used at all stages and provides the decision maker with a plan of which action to take given the stage and state. That is, a policy π is a sequence of decision rules, and we restrict ourselves to ranking policies π_t belonging to the set D_t of deterministic Markov policies (if randomized policies were included, the set of policies would not be countable). In some problems, the decision maker only focuses on this subset of policies, e.g. because randomized policies are hard to manage in practice, or due to restrictions in the management strategy. Moreover, if the states at a given time step t correspond to different physical locations, implementation of policies having a single action at each location may be the only acceptable option.

Under the above summarized framework of Markov decision processes, reinforcement learning can be formalized, where an RL agent located in its environment chooses an action, from the set of available ones, at every discrete time step t based on the current state of the system x_t.
In return, at each time step, the agent receives a reward signal that quantifies the quality of the action taken in terms of the goal of the control task. In this paper we only consider one criterion of optimality, namely the expected total cost criterion, so the objective is to obtain an optimal policy π* that satisfies:

L* = max_π L^π = max_π E_π { R_t | x_t = x }    (1)

where r_t is the instantaneous reward obtained at time step t and R_t is the cumulative reward, such that R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}.

The basic RL algorithms discussed in the literature have been extensively developed to solve reinforcement learning problems without the need of a dynamic model and when the state and action spaces are finite, which means that the value functions support a tabular representation [10], [28], [29]. In order to find an approximate solution to the control problem it is possible to obtain a discretized form of the space of states and/or actions [30]–[32] and then apply the RL algorithms that use discrete spaces. However, as the granularity of the representation increases, the computational implementations suffer the so-called curse of dimensionality, which consists of an exponential increase in the computational complexity of the problem due to the increase in the dimension (or number) of the state-action pairs to be selected. This makes it impossible to construct a value function for the problem in question, since the agent has a low probability of "visiting" the same state, or state-action pair, more than once, depending on whether it is working with the state value function or the state-action value function, respectively.

In the underwater manipulation problem we have to deal with dynamic systems where the states and the applied actions are defined in real domains (continuous spaces), which imposes an important limitation for the tabular representation of the value functions. To overcome this drawback, functional approximation techniques have emerged, including inductive models that attempt to generalize the value function. In recent years, powerful brain-inspired deep neural networks [33] have been introduced as function approximators into the RL framework, giving rise to deep reinforcement learning methodologies [34]. For instance, the Deep Deterministic Policy Gradient (DDPG) algorithm [13] is one of the most widespread deep RL algorithms; it utilizes an actor-critic formulation together with neural networks as function approximators to obtain a deterministic optimal policy. In the actor-critic formulation, the role of the actor is to select an action based on the policy, such that u_t = π(x_t). The critic, on the other hand, gives feedback on how good or bad the selected action was. In the DDPG algorithm the state-action value function Q(x_t, u_t) is used as a critic. This function is defined as:

Q^π(x_t, u_t) = E { R_t | x_t, u_t } = E { Σ_{k=0}^{∞} γ^k r_{t+k+1} | x_t, u_t }    (2)

The update to the state-action value function can then be performed as:

Q^w(x_t, u_t) = E { r(x_t, u_t) + γ Q^w(x_{t+1}, u_{t+1}) }    (3)

where Q^w is a differentiable parameterized function, so that Q^w ≈ Q^π.

For the actor we consider a function π that maps states directly into actions with parameters θ, thus π(x_t|θ).
We then define a performance objective function L(π_θ) = E { r^γ | π_θ } and a probability distribution ρ^π; the performance as an expectation can then be written as:

L(π_θ) = ∫ ρ^π(x) r(x, π_θ(x)) dx = E [ r(x_t, π_θ(x_t)) ]    (4)

and by applying the chain rule to the expected return, we can then write:

∇_θ L = E [ ∇_θ π_θ(x) ∇_u Q^π(x, u) ]    (5)

where Eq. (5) is the deterministic policy gradient, as demonstrated in [35]. As indicated previously, both the actor and the critic can be represented by function approximators, where deep neural networks are commonly used since they allow working with continuous state spaces. However, the nonlinearities of these networks and the training procedures used made it difficult for algorithms to converge; recent advances have addressed these problems. The main causes of the lack of convergence were the correlation of the samples used for training and the correlation between the updates of the Q network [36].

The first of these issues was addressed by implementing a replay buffer that stores state transitions, the actions applied, and the rewards earned. The agent is then trained using mini-batches of transitions that are randomly selected from the replay buffer [37]. The second problem was solved by incorporating target networks, which are a direct copy of the actor's and critic's networks, called π′ and Q′, with parameters θ′ and ω′ respectively, and which are periodically updated according to the parameter τ, so that θ′ ← τθ + (1−τ)θ′ and ω′ ← τω + (1−τ)ω′, where τ ≪ 1.
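For illustration, the critic update of Eq. (3), the policy gradient step of Eq. (5), and the soft target-network updates can be summarized in the following minimal sketch. It assumes PyTorch as the framework; the function signature, batch layout and hyperparameter values are illustrative and not taken from the original implementation.

```python
import torch
import torch.nn as nn

def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.001):
    """One DDPG step: critic TD update (Eq. 3), actor gradient (Eq. 5),
    and soft target-network updates with tau << 1. gamma and tau here
    are illustrative values, not the paper's exact settings."""
    x, u, r, x_next = batch  # mini-batch sampled from the replay buffer

    # Critic target: r + gamma * Q'(x', pi'(x')), using the target networks
    with torch.no_grad():
        u_next = actor_target(x_next)
        y = r + gamma * critic_target(x_next, u_next)

    # Minimize the TD error so that Q_w approximates Q_pi
    critic_loss = nn.functional.mse_loss(critic(x, u), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the deterministic policy gradient of Eq. (5)
    actor_loss = -critic(x, actor(x)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft updates: theta' <- tau*theta + (1 - tau)*theta'
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```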
IV. IMPLEMENTATION DETAILS

A simulated version of the Reach Alpha 5 manipulator is used. The Reach Alpha 5 is a 5 DOF underwater manipulator, capable of lifting 2 kg, and able to operate at depths of up to 300 m. The manipulator is shown in Fig. 1.

For our proposed formulation, the state (x_t) of the RL agent is determined by the joint positions (q ∈ R^n) in [rad] and joint velocities (q̇ ∈ R^n) in [rad/s], together with the desired joint positions (q_req ∈ R^n) in [rad], such that the state is determined as x_t = [q_t, q̇_t, q_req], with n being the number of DOFs of the manipulator. The goal of the agent is to achieve a determined joint position, where the request comes to the agent from a higher layer in the control hierarchy.

A fully connected feed-forward network is used for both the actor and critic, with two hidden layers of 400 and 300 units each. Leaky ReLUs are used as activation functions for the hidden layers, with Tanh used for the output neurons of the actor. Different learning rates are used for the actor and the critic, with Adam used as the optimizer, and a decay is applied to each learning rate after 100 thousand training steps.

In order to be able to achieve the required position and torque constraints, the reward function was developed as follows: if the position is within the allowed bounds, the reward is a Gaussian function that penalizes the agent when the joint position is not close to the request and gives a positive reward when the position matches or is close to the request. On the other hand, when the agent goes over the allowed bounds it is penalized with a large negative number. Formally the reward is defined as:

r_t = { e^{−((x_t − x_ref)/σ)²},  if x_min < x_t < x_max
      { −K,                      otherwise                  (6)

with σ being a parameter that shapes the Gaussian function and −K the constant out-of-bounds penalty. For the experiments presented here a fixed value of σ was utilized.

For the training of the agent, a different random goal was selected in the allowed work space for each epoch of training. The agent was trained for a total of 2000 epochs, with each epoch lasting 20 seconds of real-time operation. Initially, random noise is added to the actions of the agent to allow for exploration, such that u_t = π(x_t) + εN, with N being Ornstein–Uhlenbeck noise and ε linearly decaying over time. The minibatch size used for training is 64; the soft-update parameter τ and the discount factor γ are kept fixed throughout training.
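The state construction, the reward of Eq. (6) and the exploration noise can be sketched as follows. The numeric values (σ, the joint bounds, the out-of-bounds penalty and the noise parameters) are placeholders, since the exact values are only partially given above; this is a minimal sketch, not the original code.

```python
import numpy as np

# Placeholder values: sigma, the joint bounds and the penalty are
# illustrative, not the exact values used in the paper.
SIGMA = 0.1
PENALTY = -10.0
X_MIN = np.array([-3.0, -1.5, -1.5, -3.0])  # per-joint lower bounds [rad]
X_MAX = np.array([3.0, 1.5, 1.5, 3.0])      # per-joint upper bounds [rad]

def make_state(q, q_dot, q_req):
    """State x_t = [q_t, q_dot_t, q_req], as defined above."""
    return np.concatenate([q, q_dot, q_req])

def reward(q, q_req):
    """Shape of Eq. (6): Gaussian reward around the requested position
    inside the bounds, a large negative penalty outside."""
    in_bounds = (q > X_MIN) & (q < X_MAX)
    gauss = np.exp(-(((q - q_req) / SIGMA) ** 2))
    return float(np.where(in_bounds, gauss, PENALTY).sum())

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise for u_t = pi(x_t) + eps*N."""
    def __init__(self, n, theta=0.15, sigma=0.2, dt=0.05):
        self.n, self.theta, self.sigma, self.dt = n, theta, sigma, dt
        self.state = np.zeros(n)

    def sample(self):
        dx = (-self.theta * self.state * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(self.n))
        self.state += dx
        return self.state.copy()
```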
V. RESULTS

For the presented results the agent was trained as previously stated and the policy is now executed without the addition of exploratory noise, effectively making ε = 0. Furthermore, the nonlinear model has been degraded, with some parameters changed randomly to test the adaptability of the agent, and random noise is introduced into the velocity and position readings. While the Reach 5 Alpha has 5 DOF, the last degree of freedom corresponds to the gripper joint, which we are not interested in controlling; as such, all results are shown for the first 4 DOF.

A simulation using the trained RL agent was run on the Reach Alpha manipulator under normal operative conditions with a reference joint position of x_ref = (2. , . , −. , . ) [rad]. Fig. 2a shows the joint positions while being controlled by the RL agent, Fig. 2b shows the control actions and Fig. 2c shows the position errors. In Fig. 2a it can be seen how the agent reaches the desired position in less than two seconds, without any overshoot, even when the requested position requires a long rotation of over two radians in Joint 1. The lack of overshoot demonstrates how the agent is capable of behaving without breaking any of the position constraints imposed during training. The agent utilizes the maximum torque initially available, as can be seen in Fig. 2b, and then utilizes small corrections to keep the joints in position. Fig. 2c shows how the errors are rapidly reduced, with no steady state error present.

Another example shows the behaviour of the arm when a completely different reference point is selected, x_ref = (−. , . , −. , −. ) [rad]. Fig. 3a shows the obtained positions when using the RL controller. As can be seen, the requested position is reached again in under two seconds, without any overshoot. Again the agent utilizes high levels of torque initially, with lower levels after the requested joint positions have been reached, as depicted in Fig. 3b. Furthermore, no steady state error is present, as illustrated in Fig. 3c.

The next example aims to test the behavior of the agent under torque constraints. In this example, the torque output of Joint 1 is reduced by 75%, with a desired requested position of x_ref = (2. , −. , −. , . ) [rad]. In Fig. 4a the achieved positions are shown, where it can be seen that the agent is capable of rapidly reaching the desired positions for Joints 2, 3 and 4, while Joint 1 takes around 5 seconds due to the new restrictions in torque, but no overshoot is present. This can also be seen in Fig. 4c, where the errors are shown: Joint 1 takes longer to reach the request, however no steady state error or overshoot are present. Additionally, Fig. 4b shows the torque output, where the reduced torque of Joint 1 can be clearly seen.

Fig. 2. Test 1 of the proposed RL algorithm with x_ref = (2. , . , −. , . ) [rad]: (a) Joint Position, (b) Torque Output, (c) Joint Errors.
Fig. 3. Test 2 of the proposed RL algorithm with x_ref = (−. , . , −. , −. ) [rad]: (a) Joint Position, (b) Torque Output, (c) Joint Errors.
Fig. 4. Test 3 of the proposed RL algorithm under torque constraints with x_ref = (2. , −. , −. , . ) [rad]: (a) Joint Position, (b) Torque Output, (c) Joint Errors.

A. Comparative results
In this section we introduce a comparison between the proposed RL agent and an MPC controller. The cost function of the MPC controller is J = Σ_{k=0}^{N} ||x_ref − x_{t+k}||_{Q(t)} + ||Δu_{t+k}||_{R(t)}. The gains Q and R of the cost function were tuned accordingly. As previously, the Reach 5 Alpha simulated arm is used, where a desired position (x_ref) should be attained.

An experiment is presented where a desired reference position of x_ref = (2. , −. , −. , . ) [rad] is selected. Fig. 5 shows the obtained joint positions when using the proposed RL agent, while Fig. 6 shows the results when utilizing the baseline MPC controller. While the MPC controller is able to reach the required position for Joint 1 in less than 2.5 seconds (Fig. 6a), it takes the rest of the joints longer. On the other hand, the RL agent is faster and presents no overshoot (Fig. 5a). In addition, the RL agent presents no steady state error, as can be seen in Fig. 5c, while the MPC shows some error in the steady state, as Fig. 6c shows. While both controllers utilize high levels of torque initially, the MPC, shown in Fig. 6b, seems to require less torque once the steady state is reached as compared to the RL agent, in Fig. 5b.

Fig. 5. Comparative results using RL for x_ref = (2. , −. , −. , . ) [rad]: (a) Joint Position, (b) Torque Output, (c) Joint Errors.
Fig. 6. Comparative results using MPC for x_ref = (2. , −. , −. , . ) [rad]: (a) Joint Position, (b) Torque Output, (c) Joint Errors.
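To make the MPC cost function J defined above concrete, the following is a minimal sketch of evaluating it over a prediction horizon, assuming time-invariant weight matrices; the horizon length and gain values are hypothetical, not the tuned values of the baseline controller.

```python
import numpy as np

def mpc_cost(x_pred, u_pred, x_ref, Q, R):
    """Evaluate J = sum_k ||x_ref - x_{t+k}||_Q + ||du_{t+k}||_R over the
    horizon, where ||v||_M denotes the weighted quadratic form v^T M v."""
    J = 0.0
    u_prev = u_pred[0]
    for k in range(len(x_pred)):
        e = x_ref - x_pred[k]      # tracking error at prediction step k
        du = u_pred[k] - u_prev    # control increment
        J += e @ Q @ e + du @ R @ du
        u_prev = u_pred[k]
    return J

# Hypothetical weights for a 4-joint arm: penalize position error
# more heavily than control increments.
Q = np.eye(4) * 10.0
R = np.eye(4) * 0.1
```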
A series of experiments were performed in which random positions were selected and a number of metrics were obtained in order to compare the performance of the two algorithms. These metrics include the average energy consumed (E) in Joules, the Root Mean Square Error (RMSE), the Mean Integral Error (MIE), the Mean Steady State Error (MSSE), the Overshoot (OS) in percentage, and the Settling Time (ST) in seconds. A set of 20 experiments was conducted, and the obtained results can be seen in Table I.

TABLE I: COMPARATIVE RL VS MPC

Algorithm   E [J]   RMSE   MIE     MSSE     OS [%]   ST [s]
RL          19.86   0.23   58.27   0.0033   1.43     6.26
MPC         13.8    0.27   97.64   0.033    18.21    7.96

The presented metrics show a much more favorable performance of the RL agent. Both MIE and MSSE are significantly lower, while the RMSE also presents lower values for the RL implementation. The OS is practically non-existent for the RL agent as compared to the MPC, while the ST is over a second less for the RL, making it the faster solution. The only disadvantage is seen with regard to the energy consumption (E), which is lower for the MPC controller. However, the RL controller does not take energy consumption into account, as this was not the main focus of the proposal. Overall, the presented results show the benefits of the RL controller.
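As a reference for how such metrics can be extracted from a recorded trajectory, the following is a sketch for a single joint; the exact definitions used for Table I (e.g. the settling band) are not given above, so those in the sketch are assumptions.

```python
import numpy as np

def step_metrics(t, x, x_ref, band=0.02):
    """Sketch of per-joint metrics for a step to x_ref; the 2% settling
    band and these exact definitions are assumptions, not the paper's."""
    e = x_ref - x                              # tracking error over time
    rmse = float(np.sqrt(np.mean(e ** 2)))     # root mean square error
    mie = float(np.trapz(np.abs(e), t))        # integral of |error|
    msse = float(abs(e[-1]))                   # steady state error
    peak = x.max() if x_ref > 0 else x.min()   # assumes a step from zero
    os_pct = 100.0 * max(0.0, (peak - x_ref) / x_ref)
    settled = np.abs(e) <= band * abs(x_ref)   # inside the settling band
    if settled.all():
        st = float(t[0])
    elif settled[-1]:
        # first time after the last excursion outside the band
        st = float(t[np.where(~settled)[0][-1] + 1])
    else:
        st = float(t[-1])                      # never settles in the run
    return rmse, mie, msse, os_pct, st
```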
VI. CONCLUSIONS

In this article we presented a novel strategy for the low-level control of an underwater manipulator under position and torque constraints, based on a reinforcement learning formulation. The actions selected by the actor are directly the torque commands sent to the manipulator, while the state is determined by the current position and velocity, together with the desired joint positions. By including the goal in the state we are able to generalize over different control requests, a fundamental requirement for a control system.

The data driven approach provided by RL avoids the use of complex models and is able to adapt to changes in the operative conditions. For instance, sudden changes that limit the normal operation of the system, such as obstacles in the working space, failure of any motor, and others, can cause reduced joint movement, leading to a limited range of joint positions and/or a limited torque range. Such constraints can be difficult to overcome using classical controllers due to the lack of accurate model information and poor tuning of the controller. On the other hand, reinforcement learning controllers are able to obtain highly non-linear policies that can operate within the required boundaries and are able to adapt to new requirements.

As future works, the authors suggest the implementation of the algorithm on the Reach 5 Alpha arm as well as on other manipulators to test the adaptability of the proposal. Furthermore, investigating the possibility of reducing the higher energy consumption, when comparing with the MPC, could be of interest for autonomous operations.
REFERENCES

[1] Satja Sivčev, Joseph Coleman, Edin Omerdić, Gerard Dooly, and Daniel Toal. Underwater manipulators: A review. Ocean Engineering, 163:431–450, 2018.
[2] A. M. Yazdani, K. Sammut, O. Yakimenko, and A. Lammas. A survey of underwater docking guidance systems. Robotics and Autonomous Systems, 124:103382, 2020.
[3] Xanthi S. Papageorgiou, Herbert G. Tanner, Savvas G. Loizou, and Kostas J. Kyriakopoulos. Switching manipulator control for motion on constrained surfaces. Journal of Intelligent and Robotic Systems, 62:217–239, 2011.
[4] J. G. Ziegler and N. B. Nichols. Optimum settings for automatic controllers. Journal of Dynamic Systems, Measurement, and Control, 115(2B):220, 1993.
[5] Corina Barbalata, Matthew W. Dunnigan, and Yvan R. Petillot. Coupled and decoupled force/motion controllers for an underwater vehicle-manipulator system. 2018.
[6] Carlos E. García, David M. Prett, and Manfred Morari. Model predictive control: Theory and practice–a survey. Automatica, 25(3):335–348, 1989.
[7] C. Shen, Y. Shi, and B. Buckham. Trajectory tracking control of an autonomous underwater vehicle using Lyapunov-based model predictive control. IEEE Transactions on Industrial Electronics, 65(7):5796–5805, 2018.
[8] Guoxing Bai, Yu Meng, Li Liu, Weidong Luo, and Qing Gu. Review and comparison of path tracking based on model predictive control. Volume 8, page 1077, 2019.
[9] Thor I. Fossen et al. Guidance and Control of Ocean Vehicles, volume 199. 1994.
[10] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[11] George E. Monahan. State of the art–a survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, 28(1):1–16, 1982.
[12] Ignacio Carlucho, Mariano De Paula, Sen Wang, Yvan Petillot, and Gerardo G. Acosta. Adaptive low-level control of autonomous underwater vehicles using deep reinforcement learning. Robotics and Autonomous Systems, 2018.
[13] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.
[14] James K. Mills and Andrew A. Goldenberg. Force and position control of manipulators during constrained motion tasks. IEEE Transactions on Robotics and Automation, 5(1):30–46, 1989.
[15] Tsuneo Yoshikawa. Dynamic hybrid position/force control of robot manipulators–description of hand constraints and calculation of joint driving force. IEEE Journal on Robotics and Automation, 3(5):386–392, 1987.
[16] Yinyan Zhang, Shuai Li, Jianxiao Zou, and Ameer Hamza Khan. A passivity-based approach for kinematic control of manipulators with constraints. IEEE Transactions on Industrial Informatics, 16(5):3029–3038, 2019.
[17] Mingming Li, Yanan Li, Shuzhi Sam Ge, and Tong Heng Lee. Adaptive control of robotic manipulators with unified motion constraints. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(1):184–194, 2016.
[18] Zhihao Xu, Xuefeng Zhou, and Shuai Li. Deep recurrent neural networks based obstacle avoidance control for redundant manipulators. Frontiers in Neurorobotics, 13:47, 2019.
[19] M. Deisenroth, C. Rasmussen, and D. Fox. Learning to control a low-cost manipulator using data-efficient reinforcement learning. In Robotics: Science and Systems, 2011.
[20] S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396, 2017.
[21] Cong Wang, Qifeng Zhang, Qiyan Tian, Shuo Li, Xiaohui Wang, David Lane, Yvan Petillot, and Sen Wang. Learning mobile manipulation through deep reinforcement learning. Sensors, 20(3):939, 2020.
[22] John Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv, abs/1707.06347, 2017.
[23] Ignacio Carlucho, Mariano De Paula, Sen Wang, Yvan Petillot, and Gerardo G. Acosta. Adaptive low-level control of autonomous underwater vehicles using deep reinforcement learning. Robotics and Autonomous Systems, 107:71–86, 2018.
[24] Andres El-Fakdi and Marc Carreras. Two-step gradient-based reinforcement learning for underwater robotics behavior learning. Robotics and Autonomous Systems, 61(3):271–282, 2013.
[25] G. Frost, F. Maurelli, and D. M. Lane. Reinforcement learning in a behaviour-based control architecture for marine archaeology. In OCEANS 2015 - Genova, pages 1–5, 2015.
[26] Blueprint Lab. Reach Alpha 5, 2020. URL: https://blueprintlab.com/products/reach-alpha/.
[27] K. M. van Hee and L. B. Hartman. Application of Markov Decision Processes to Search Problems. Research Report, University of Waterloo, Department of Computer Science, 1992.
[28] Ignacio Carlucho, Mariano De Paula, Sebastian A. Villar, and Gerardo G. Acosta. Incremental Q-learning strategy for adaptive PID control of mobile robots. Expert Systems with Applications, 2017.
[29] Ignacio Carlucho, Mariano De Paula, and Gerardo G. Acosta. Double Q-PID algorithm for mobile robot control. Expert Systems with Applications, 137:292–307, 2019.
[30] William S. Lovejoy. Computationally feasible bounds for partially observed Markov decision processes. Operations Research, 39(1):162–175, 1991.
[31] L. Crespo and J. Q. Sun. Solution of fixed final state optimal control problems via simple cell mapping. Nonlinear Dynamics, 23:391–403, 2000.
[32] L. G. Crespo and J. Q. Sun. Stochastic optimal control via Bellman's principle. Automatica, 39:2109–2114, 2003.
[33] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[34] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[35] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 387–395, 2014.
[36] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 1582–1591. PMLR, 2018.
[37] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 448–456. JMLR.org, 2015.