rl_reach: Reproducible Reinforcement Learning Experiments for Robotic Reaching Tasks
Pierre Aumjaud, David McAuliffe, Francisco Javier Rodríguez Lera, Philip Cardiff
Pierre Aumjaud∗
University College Dublin, Dublin, Ireland
[email protected]

David McAuliffe
Resero Ltd, Dublin, Ireland
[email protected]

Francisco Javier Rodríguez Lera
Universidad de León, León, Spain
[email protected]

Philip Cardiff
University College Dublin, Dublin, Ireland
[email protected]

∗ Corresponding author

Abstract
Training reinforcement learning agents to solve a given task is highly dependent on identifying optimal sets of hyperparameters and selecting suitable environment input/output configurations. This tedious process could be eased with a straightforward toolbox allowing its user to quickly compare different training parameter sets. We present rl_reach, a self-contained, open-source and easy-to-use software package designed to run reproducible reinforcement learning experiments for customisable robotic reaching tasks. rl_reach packs together training environments, agents, hyperparameter optimisation tools and policy evaluation scripts, allowing its users to quickly investigate and identify optimal training configurations. rl_reach is publicly available at this URL: https://github.com/PierreExeter/rl_reach.
Keywords: Reinforcement Learning · Robotics · Benchmark · Model-free · Stable Baselines

Current code version: v1.0
Permanent link to code/repository: https://github.com/PierreExeter/rl_reach
Permanent link to Reproducible Capsule: https://codeocean.com/capsule/4112840/tree/
Legal Code License: MIT License
Code versioning system used: git
Software code language used: Python 3
Compilation requirements & dependencies: Docker OR Python 3, Conda, CUDA (optional)
Link to developer documentation/manual: https://rl-reach.readthedocs.io/en/latest/index.html
Support email for questions: [email protected]

Table 1: Code metadata
Industrial processes have seen their productivity and efficiency increase considerably in recent decades thanks to the automation of repetitive tasks, notably with the advances in robotics. This productivity can be further improved by enabling robotic agents to solve tasks independently, without being explicitly programmed by humans.

Reinforcement Learning (RL) is a general framework for solving sequential decision-making tasks through self-learning and, as such, it has found natural applications in robotics. In RL, an agent interacts with an environment by sending actions and receiving an observation – describing the current state of the world – and a reward – describing the quality of the action taken. The agent's objective is to maximise the expected cumulative return, i.e. the expected sum of rewards collected over an episode, by learning a policy that selects the appropriate action in each situation.

RL has found many successful applications; however, experiments are notoriously hard to reproduce as the learning process is highly dependent on weight initialisation and environment stochasticity [1]. In order to improve reproducibility and compare RL solutions objectively, various standard toy problems have been implemented [2–7]. A number of software suites provide training environments for continuous control tasks in robotics, such as dm_control [8, 9], Meta-World [10], SURREAL [11], RLBench [12], D4RL [13], robosuite [14] and robo-gym [15].

We introduce rl_reach, a self-contained, open-source and easy-to-use software package for running reproducible RL experiments applied to robotic reaching tasks. Its objective is to allow researchers to quickly investigate and identify promising sets of training parameters for a given task. rl_reach is built on top of Stable Baselines 3 [16] – a popular RL framework. The training environments are based on the WidowX MK-II robotic arm and are adapted from the Replab project [17], a benchmark platform for running RL robotics experiments. rl_reach encapsulates all the necessary elements for producing a robust performance benchmark of RL solutions for simple robotic reaching tasks. We aim to promote reproducible experimentation practice in RL research.

The rl_reach software has been designed to quickly and reliably run RL experiments and compare the performance of trained RL agents across algorithms, hyperparameters and training environments. The code metadata are given in Table 1. rl_reach's key features are:
• Self-contained: rl_reach packs together a widely-used RL framework – Stable Baselines 3 [16] – training environments, and evaluation and hyperparameter tuning scripts (Figure 1). In addition to its ease of use, only a few other packages offer such self-contained code.

• Free and open-source: The source code is written in Python 3 and published under the permissive MIT license, with no commercial licensing restrictions. rl_reach only makes use of free and open-source projects such as the deep learning library PyTorch [18] or the physics simulator PyBullet [19]. Many RL frameworks require a paid MuJoCo license, which can be an obstacle for sharing research results. Code quality and legibility are maintained with standard software development tools, including the Git version control system, the Pylint syntax checker, the Travis continuous integration service and automated tests.
• Easy-to-use: A simple command-line interface is provided to train agents, evaluate policies, visualise the results and tune hyperparameters. Documentation is provided to assist end-users with the installation and main usage of rl_reach. The software and its dependencies can be installed from source with the GitHub repository and Conda environment provided. Portability is maximised across platforms by providing rl_reach as a Docker image, allowing it to run on any operating system that supports Docker. Finally, a reproducible code capsule is available online on the CodeOcean platform.
• Customisable training environments: rl_reach comes with a number of training environments for solving the reaching task with the WidowX robotic arm. These environments are easily customisable to experiment with different action, observation or reward functions; a simplified, illustrative sketch of such an environment is given after this list. While many similar software packages use toy problems as benchmark tasks, rl_reach provides its users with a training environment that is closer to an industrial problem, namely reaching a target position with a robotic arm.

• Stable Baselines inheritance: Since rl_reach is built on top of Stable Baselines 3 [16] and its "Zoo", it comes with the same functionalities. In particular, it supports recent model-free RL algorithms such as A2C, DDPG, HER, PPO, SAC and TD3, and automatic hyperparameter tuning with the Optuna optimisation framework [20].
• Reproducible experiments: Each experiment (with a unique identification number) consists of a number of runs with identical training parameters but initialised with different random seeds. The evaluation metrics are averaged across all the seed runs to promote reproducible, reliable and robust experiments.
• Straightforward benchmark: When a trained policy is evaluated, the evaluation metrics, the environment's variables and the training hyperparameters are automatically logged in CSV format. The performance of a selection of experiment runs can be visualised and compared graphically (Figure 2).
• Debugging tools: It is possible to produce a 2D or 3D live plot of the end-effector and goal positions during an evaluation episode (Figure 3), as well as a number of physical characteristics of the environment, such as the end-effector and target positions, the joints' angular positions, the reward, and the distance, velocity or acceleration between the end-effector and the target (Figure 4). It is also possible to plot the training curves for each individual seed run (Figure 5). These plots have proven useful for debugging purposes, especially when testing a new training environment.
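As a rough illustration of how such an environment can be customised, the sketch below implements a deliberately simplified, purely kinematic planar reaching task in the classic Gym API, with switchable dense and sparse rewards. The class name, spaces, goal and reward shapes are illustrative assumptions made for this example only; they do not reproduce rl_reach's actual PyBullet-based WidowX environments.

```python
import gym
import numpy as np
from gym import spaces


class SimpleReachEnv(gym.Env):
    """Minimal planar reaching task: move a point end-effector towards a fixed goal.

    A simplified, purely kinematic stand-in for rl_reach's PyBullet-based WidowX
    environments, used only to illustrate customisable action, observation and
    reward functions.
    """

    def __init__(self, reward_type="dense", max_steps=100):
        super().__init__()
        self.reward_type = reward_type
        self.max_steps = max_steps
        # Action: small Cartesian displacement of the end-effector.
        self.action_space = spaces.Box(low=-0.05, high=0.05, shape=(2,), dtype=np.float32)
        # Observation: end-effector position followed by the goal position.
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.goal = np.array([0.5, 0.5], dtype=np.float32)

    def reset(self):
        self.tip = np.zeros(2, dtype=np.float32)
        self.steps = 0
        return self._get_obs()

    def step(self, action):
        self.tip = np.clip(self.tip + action, -1.0, 1.0)
        self.steps += 1
        dist = float(np.linalg.norm(self.tip - self.goal))
        if self.reward_type == "dense":
            reward = -dist                        # dense reward: negative distance to the goal
        else:
            reward = 1.0 if dist < 0.05 else 0.0  # sparse reward: success bonus only
        done = self.steps >= self.max_steps
        return self._get_obs(), reward, done, {"distance": dist}

    def _get_obs(self):
        return np.concatenate([self.tip, self.goal]).astype(np.float32)
```

A Stable Baselines 3 agent can be trained directly on such an environment instance, e.g. PPO("MlpPolicy", SimpleReachEnv()).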
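In the same illustrative spirit, the debugging plots described in the Debugging tools bullet above amount to recording per-timestep quantities during an evaluation episode and plotting them. A minimal sketch using the toy environment above is given below; the random policy and figure layout are assumptions for this example and do not correspond to rl_reach's own plotting scripts.

```python
import matplotlib.pyplot as plt

env = SimpleReachEnv(reward_type="dense")
obs = env.reset()
distances, rewards = [], []
done = False
while not done:
    action = env.action_space.sample()           # random policy, for illustration only
    obs, reward, done, info = env.step(action)
    distances.append(info["distance"])
    rewards.append(reward)

# Plot the distance to the goal and the reward over the evaluation episode.
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(distances)
ax1.set_ylabel("distance to goal (m)")
ax2.plot(rewards)
ax2.set_ylabel("reward")
ax2.set_xlabel("timestep")
plt.show()
```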
Figure 1: rl_reach's flowchart and components. The flowchart links the PyBullet-based training environments (Environment 1 to 10), the Stable Baselines 3 RL agents (PPO, TD3, SAC, etc.) and Optuna hyperparameter tuning through the action, observation and reward signals, and collects the outputs in an experiment folder (trained policies, optimal hyperparameters, evaluation metrics, training curves, live target visualisation and benchmark plots).
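To make the workflow of Figure 1 concrete, the following hedged sketch reproduces its core loop directly against the Stable Baselines 3 API rather than through rl_reach's own command-line scripts: the same configuration is trained under several seeds, each trained policy is evaluated, and the metrics are aggregated and written to CSV, as rl_reach does for each experiment. The environment id, number of seeds, timestep budget and output file name are placeholders chosen for this example.

```python
import pandas as pd
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Hypothetical experiment configuration; rl_reach drives an equivalent loop
# through its command-line scripts and logs the results to CSV automatically.
ENV_ID = "Pendulum-v1"   # placeholder id; substitute a WidowX reaching environment
N_SEEDS = 3
N_TIMESTEPS = 20_000

rows = []
for seed in range(N_SEEDS):
    model = PPO("MlpPolicy", ENV_ID, seed=seed, verbose=0)
    model.learn(total_timesteps=N_TIMESTEPS)
    mean_return, std_return = evaluate_policy(model, model.get_env(), n_eval_episodes=20)
    rows.append({"seed": seed, "mean_return": mean_return, "std_return": std_return})

# Average the evaluation metric across the seed runs and log everything to CSV.
df = pd.DataFrame(rows)
df.to_csv("experiment_results.csv", index=False)
print("mean return over seeds:", df["mean_return"].mean())
```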
Reinforcement Learning is a recent and highly active research field, with a relatively large number of RL solutions published every year. Accurately evaluating and objectively comparing novel and existing RL approaches is crucial to ensure continued progress in the field. Reproducing RL experimental results is often challenging due to stochasticity in the training process and training environments [1]. By providing a systematic tool for carrying out reproducible RL experiments, we hope that rl_reach will promote better experimental practice in the RL research community and improve the reporting and interpretation of results. Since rl_reach's interface is straightforward and intuitive and allows for a quick graphical comparison of experiments, it can also be used as an educational platform for learning the practicalities of RL training.

Training RL agents is highly dependent on a number of intrinsic (e.g. initialisation seeds, reward functions, action shape, number of time steps) and extrinsic (algorithm hyperparameters) variables. Identifying the critical parameters that control successful training can be a daunting task. Thanks to its easily customisable learning environments and extensive logging of training parameters, rl_reach offers a convenient way to explore the effects of both intrinsic and extrinsic parameters on the training performance.

Finally, rl_reach provides learning environments designed to train a robotic manipulator to reach a target position. This task is more industrially relevant than many of the toy problems considered in other benchmark packages, thus allowing straightforward transfer of RL applications from academic research to industry.

A peer-reviewed article [21] has emanated from this software, in which the performance of robotic RL agents trained to reach target positions is compared. The trained policies were successfully transferred from the simulated to the physical robot environment.
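As an illustration of how such an exploration of extrinsic (hyperparameter) variables is typically set up with the tools rl_reach builds on – Stable Baselines 3 and the Optuna framework – the sketch below searches over a few PPO hyperparameters. It is not rl_reach's own tuning script, and the environment id, search space and trial budget are assumptions made for this example.

```python
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

ENV_ID = "Pendulum-v1"  # placeholder id; substitute a WidowX reaching environment


def objective(trial: optuna.Trial) -> float:
    # Sample a few extrinsic (algorithm) hyperparameters.
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "gamma": trial.suggest_float("gamma", 0.9, 0.9999),
        "n_steps": trial.suggest_categorical("n_steps", [256, 512, 1024, 2048]),
    }
    model = PPO("MlpPolicy", ENV_ID, verbose=0, **params)
    model.learn(total_timesteps=10_000)
    mean_return, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
    return mean_return


# Maximise the mean evaluation return over a small budget of trials.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print("best hyperparameters:", study.best_params)
```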
Figure 2: An example of a visualisation plot comparing the performance of different RL experiments (mean return, training time, mean and maximum success ratios, and reach time across observation configurations obs1 to obs5).

Figure 3: The training environment with live visualisation of the end-effector and target positions.

Figure 4: An example of a metadata plot after the evaluation of a trained policy (joint positions, actions, reward terms, end-effector and goal coordinates, distance, velocity and acceleration over the timesteps of an evaluation episode).

Figure 5: An example of a training curve plot (average return for each of the ten seed runs and the mean reward).
We chose to focus on the reaching task as it is one of the simplest tasks to solve with a robotic arm, which allows users to run experiments with relatively low computing resources while still being industrially relevant. Moreover, the reaching task allows the user to shape the reward easily and to implement training environments with both dense and sparse rewards. However, rl_reach would benefit from supporting more complex and diverse manipulation tasks such as stacking, assembly, pushing or inserting. It also does not include the classic toy problems traditionally used for benchmarking RL agents. Finally, an implementation of the training environments for the physical WidowX arm would help validate the performance of policies trained in simulation.

rl_reach has been designed as a self-contained tool, packaging both the training environments and the RL framework Stable Baselines 3 for convenience. However, this does not offer the flexibility to experiment with RL algorithms that are not supported by this framework. A potential future improvement would consist in producing a modular implementation of rl_reach in which both the training environments and the RL agents could be easily interchanged.
Acknowledgements
This Career-FIT project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 713654.
References

[1] Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., Meger, D.: Deep reinforcement learning that matters. 32nd AAAI Conference on Artificial Intelligence, pp. 3207–3214 (2018)
[2] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv (2016)
[3] Bellemare, M.G., Veness, J.: The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research, 253–279 (2013)
[4] Beattie, C., Leibo, J.Z., Teplyashin, D., Ward, T., Wainwright, M., Lefrancq, A., Green, S., Sadik, A., Schrittwieser, J., Anderson, K., York, S., Cant, M., Cain, A., Bolton, A., Gaffney, S., King, H., Hassabis, D., Legg, S., Petersen, S.: DeepMind Lab. arXiv (2016)
[5] Nichol, A., Pfau, V., Hesse, C., Klimov, O., Schulman, J.: Gotta Learn Fast: A New Benchmark for Generalization in RL. arXiv (2018)
[6] Cobbe, K., Hesse, C., Hilton, J., Schulman, J.: Leveraging Procedural Generation to Benchmark Reinforcement Learning. In: Proceedings of the 37th International Conference on Machine Learning, pp. 2048–2056 (2020)
[7] Osband, I., Doron, Y., Hessel, M., Aslanides, J., Sezener, E., Saraiva, A., McKinney, K., Lattimore, T., Szepezvari, C., Singh, S., van Roy, B., Sutton, R., Silver, D., van Hasselt, H.: Behaviour Suite for Reinforcement Learning. In: International Conference on Learning Representations (2020)
[8] Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D.D.L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., Riedmiller, M.: DeepMind Control Suite. arXiv (2018)
[9] Tassa, Y., Tunyasuvunakool, S., Muldal, A., Doron, Y., Trochim, P., Liu, S., Bohez, S., Merel, J., Erez, T., Lillicrap, T., Heess, N.: dm_control: Software and Tasks for Continuous Control. Software Impacts (100022) (2020)
[10] Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., Levine, S.: Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning. In: Conference on Robot Learning (CoRL) (2019)
[11] Fan, L., Zhu, Y., Zhu, J., Liu, Z., Zeng, O., Gupta, A., Creus-Costa, J., Savarese, S., Fei-Fei, L.: SURREAL: Open-Source Reinforcement Learning Framework and Robot Manipulation Benchmark. In: 2nd Conference on Robot Learning, Proceedings of Machine Learning Research, vol. 87, pp. 767–782 (2018)
[12] James, S., Ma, Z., Arrojo, D.R., Davison, A.J.: RLBench: The Robot Learning Benchmark & Learning Environment. IEEE Robotics and Automation Letters (2), 3019–3026 (2020)
[13] Fu, J., Kumar, A., Nachum, O., Tucker, G., Levine, S.: D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv (2020)
[14] Zhu, Y., Wong, J., Mandlekar, A., Martín-Martín, R.: robosuite: A Modular Simulation Framework and Benchmark for Robot Learning. arXiv (2020)
[15] Lucchi, M., Zindler, F., Mühlbacher-Karrer, S., Pichler, H.: robo-gym – An Open Source Toolkit for Distributed Deep Reinforcement Learning on Real and Simulated Robots. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2020)
[16] Raffin, A., Hill, A., Ernestus, M., Gleave, A., Kanervisto, A., Dormann, N.: Stable Baselines3 (2019). URL https://github.com/DLR-RM/stable-baselines3
[17] Yang, B., Zhang, J., Pong, V., Levine, S., Jayaraman, D.: REPLAB: A Reproducible Low-Cost Arm Benchmark Platform for Robotic Learning. In: International Conference on Robotics and Automation (ICRA) (2019)
[18] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 8026–8037 (2019)
[19] Coumans, E., Bai, Y.: PyBullet, a Python Module for Physics Simulation for Games, Robotics and Machine Learning (2019). URL https://pybullet.org/