Using PCA to Efficiently Represent State Spaces
William Curran — CURRANW@ONID.OREGONSTATE.EDU
Oregon State University, Corvallis, Oregon

Tim Brys — TIMBRYS@VUB.AC.BE
Vrije Universiteit Brussel, Brussels, Belgium

Matthew E. Taylor — TAYLORM@EECS.WSU.EDU
Washington State University, Pullman, Washington

William D. Smart — BILL.SMART@OREGONSTATE.EDU
Oregon State University, Corvallis, Oregon
Abstract
Reinforcement learning algorithms need to deal with the exponential growth of states and actions when exploring optimal control in high-dimensional spaces. This is known as the curse of dimensionality. By projecting the agent's state onto a low-dimensional manifold, we can represent the state space in a smaller, more efficient form. Using this representation during learning, the agent converges to a good policy much faster. We test this approach in the Mario Benchmarking Domain. When using dimensionality reduction in Mario, learning converges much faster to a good policy, but there is a critical convergence-performance trade-off: by projecting onto a low-dimensional manifold, we are ignoring potentially important data. In this paper, we explore this trade-off between convergence and performance. We find that by learning in as few as 4 dimensions (instead of 9), we can exceed the performance of learning in the full-dimensional space while converging faster.
1. Introduction
Learning in high-dimensional spaces is necessary and difficult in robotic applications. State and action spaces in robotics become large, continuous, and scale exponentially with the number of joints. This leads to the curse of dimensionality.

To address this issue, researchers have developed transfer learning (Pan and Yang, 2010) and learning from demonstration (Argall et al., 2009) approaches. Transfer learning reduces computational complexity by learning in a simple domain and transferring that knowledge to a more complex domain. Transfer learning poses three key research questions: what to transfer, how to transfer, and when to transfer. Each of these questions is difficult to answer, and all are completely domain dependent. When the source domain and the target domain are loosely related, straightforward transfer learning does not work and can lead to worse performance (Pan and Yang, 2010). Transfer learning also requires a computable mapping from the source domain to the target domain, which is not always possible.

Learning from demonstration (LfD) methods speed up convergence by bootstrapping learning with demonstrations (Argall et al., 2009). LfD learns a policy using examples or demonstrations provided by a human or robotic teacher, extracting state-action pairs from those demonstrations to bootstrap learning. However, the demonstrations must be consistent and must accurately represent solving the task. These methods also solve a specific complex task rather than solving for general control (Argall et al., 2009).

In this work, we focus on the core problem of dealing with large state spaces. We approach this issue by projecting the full state space onto a low-dimensional manifold. We use Principal Component Analysis to find this transform and apply it during each learning iteration, so that learning is performed only in the low-dimensional space. This leads to a critical trade-off.
By projecting onto a low-dimensional manifold, we are throwing out low-variance, yet potentially important, data. However, learning can converge to a good, yet suboptimal, policy much faster. In this paper, we explore this trade-off between convergence and performance.

We organize the rest of this paper as follows. Section 2 describes related work on dimensionality reduction in machine learning and reinforcement learning. In Section 3 we describe our approach combining PCA, dimensionality reduction, and reinforcement learning. Section 4 introduces the Mario Benchmarking Domain and our learning approach. Experimental results are in Section 5, followed by discussion and future work in Sections 6 and 7.
2. Related Work
To motivate our approach, we first review previous work in dimensionality reduction.

Previous work in dimensionality reduction focuses on reducing the space for classification or function approximation. PCA has been effective in many machine learning and data mining applications for extracting features from large data sets (Turk and Pentland, 1991). Rather than using PCA for feature extraction in large data sets, we use PCA to reduce the dimensionality of the state space during learning.

Swinehart and Abbott (2005) used a similar approach for function approximation with neural networks. They find that they can reduce convergence time for reinforcement-based random-walk learning by reducing the dimension of the parameter space. Liu and Mahadevan (2011) also use dimensionality reduction to compute policies in a low-dimensional subspace, which they obtain from the high-dimensional space through random projections. They likewise reduce convergence time in continuous state spaces.
3. PCA+RL in High-Dimensional Spaces

To be viable in many scenarios, robots need to perform complex manipulation tasks. These complex manipulations require high degree-of-freedom arms and manipulators. For example, the PR2 robot has two 7-DoF arms. When learning position, velocity, and acceleration control, this leads to a 21-dimensional state space per arm. Learning in these high-dimensional spaces is computationally intractable without optimization techniques.

To learn in high-dimensional state spaces, our algorithm first computes a transform between the high-dimensional state space and a lower-dimensional space. To perform this computation, we need trajectories across a representative set of the agent's state space. We can then use any dimensionality reduction technique to learn the transform. In this work, we use Principal Component Analysis (PCA) (Shlens, 2005).

PCA identifies patterns in data and reduces the dimensions of the dataset with minimal loss of information. It does this by computing a transform that converts correlated data to linearly uncorrelated data. This transformation ensures that the first principal component has the largest possible variance, and that each additional component has the largest possible variance while being uncorrelated with all previous components. Essentially, PCA represents as much of the demonstrated state space as possible in a lower dimension. The transform is given by:

    T = XW    (1)

where X is the demonstrated data and W is a p-by-p matrix whose columns are the eigenvectors of X^T X, with p the number of principal components (in this case, the number of dimensions).

To transform to an arbitrary dimension k, we choose the k eigenvectors of W with the largest eigenvalues to form a p-by-k matrix W_k:

    T_k = XW_k    (2)

We then use reinforcement learning to learn trajectories in the new manifold. All learning is in the lower-dimensional space.
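As a concrete illustration, Equations 1 and 2 can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the demonstration matrix, its dimensions, and the mean-centering step (standard for PCA, though not written explicitly in Equation 1) are our assumptions.

```python
import numpy as np

def compute_projection(X, k):
    """Compute the p-by-k projection matrix W_k (Equations 1-2) from
    demonstrated state data X of shape (n_samples, p). SVD of the
    centered data gives the eigenvectors of X^T X in a numerically
    stable way."""
    Xc = X - X.mean(axis=0)                    # center the demonstrations
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T                            # top-k principal directions

# Hypothetical example: 200 demonstrated 9-D states projected to 4-D.
X = np.random.rand(200, 9)
W_k = compute_projection(X, 4)
T_k = (X - X.mean(axis=0)) @ W_k               # T_k = X W_k (Equation 2)
```

The columns of `W_k` are orthonormal, so projecting a state is a single matrix-vector product.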
For each learning iteration, we project the state x down to the lower-dimensional space:

    x_k = W_k^T x    (3)

We can then compute the action using the chosen learning algorithm and execute that action in the simulation. The simulation calculates the new state given the executed action, and we project that state down to the same lower-dimensional space before performing a learning update (Figure 1).

By learning in a smaller space, reinforcement learning algorithms converge much faster. However, PCA cannot represent all the variance in the demonstrations. Therefore, given infinite time, the converged learning performance will always be worse than learning in the full space. This leads to a critical trade-off: by projecting onto a low-dimensional manifold, we are throwing out low-variance, yet possibly important, data. Yet learning can still converge to a good, though suboptimal, policy much faster.
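The learning cycle described above can be sketched as follows. The `env` and `agent` objects are stand-ins for any simulator and RL algorithm, not an API from the paper; only the projection step (Equation 3) is taken directly from the text.

```python
import numpy as np

def project_state(W_k, x):
    """Equation 3: x_k = W_k^T x, the state in the k-dim manifold."""
    return W_k.T @ x

def run_episode(env, agent, W_k):
    """One episode of the PCA+RL cycle (Figure 1): the simulator works
    in the full space, while all learning happens in k dimensions."""
    x = env.reset()
    x_k = project_state(W_k, x)
    done = False
    while not done:
        a = agent.choose_action(x_k)          # act on the projected state
        x_next, r, done = env.step(a)         # simulator returns full state
        x_next_k = project_state(W_k, x_next)
        agent.update(x_k, a, r, x_next_k)     # learning update in k dims
        x_k = x_next_k
```

Note that the simulator never sees the projection; only the agent's observations pass through `W_k`.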
4. Mario Benchmark Problem
The Mario benchmark problem (Karakovskiy and Togelius, 2012) is based on Infinite Mario Bros, a public reimplementation of the original game Super Mario Bros. In this task, Mario needs to collect as many points as possible. Points are gained by killing an enemy, devouring a mushroom or a fireflower, grabbing a coin, finding a hidden block, and finishing the level, and lost by getting hurt by a creature or dying. The actions available to Mario correspond to the buttons on the NES controller: (left, right, no direction), (jump, don't jump), and (run/fire, don't run/fire). Mario can take one action from each of these groups simultaneously, resulting in 12 distinct combined or 'super' actions. The state space in Mario is quite complex, as Mario observes the exact locations and types of all enemies on the screen. He also observes all information about himself, such as which mode he is in (small, big, fire). Lastly, he has a gridlike receptive field in which each cell indicates what type of object it contains (such as a brick, a coin, a mushroom, a goomba (enemy), etc.). A screenshot is shown in Figure 2.

Figure 1: PCA+RL Flowchart

Our reinforcement learning agent for Mario is inspired by Liao's and Brys' previous work (Brys et al., 2014; Liao et al., 2012). We use a Q(λ)-learner with a tabular state representation and α = 0.01, λ = 0.5, γ = 0.9, and ε = 0.05. The part of the state space the agent considers consists of these variables:

• whether Mario is able to jump
• whether Mario is on the ground
• whether Mario is able to shoot fireballs
• Mario's current direction (8 directions or standing still)
• enemies close by (within one grid cell) in 8 directions
• enemies at midrange (within one to three grid cells) in 8 directions
• whether there is an obstacle in the four vertical grid cells in front of Mario
• the closest enemy position within a 21x21 grid surrounding Mario, plus one value for an absent enemy

This makes for a large number of possible states, with 12 Q-values per state (one per action). The size of the state space is not a problem computationally, because the state space is sparsely visited, with the majority of state visits going to a small set of states (Liao et al., 2012).

Previous work has investigated using demonstrations in the Mario domain to speed up reinforcement learning, albeit in different ways: by shaping the reward using the demonstration (Brys et al., 2015), or by learning a reward function using inverse reinforcement learning (Lee et al., 2014).

Figure 2: A screenshot of Mario.

In the experiments, we run every learning episode on a procedurally generated level based on a random seed, with a fixed difficulty. We also randomly select the mode Mario starts in (small, large, fire) for each episode. Making an agent learn to play Mario this way helps avoid overfitting to a specific level and makes for a more generally applicable Mario agent. Our results are always averaged over multiple trials.
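The tabular Q(λ)-learner described above can be sketched as follows, using the paper's hyperparameters. This is an illustrative sketch, not the authors' code: the state encoding is a placeholder, and trace cutting on exploratory actions (a detail of Watkins-style Q(λ)) is omitted for brevity.

```python
import random
from collections import defaultdict

ALPHA, LAMBDA, GAMMA, EPSILON = 0.01, 0.5, 0.9, 0.05
N_ACTIONS = 12                 # 3 directions x 2 jump x 2 run/fire

Q = defaultdict(float)         # sparse table: only visited pairs are stored
E = defaultdict(float)         # eligibility traces

def choose_action(s):
    """Epsilon-greedy over the 12 combined actions."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[(s, a)])

def q_lambda_update(s, a, r, s_next):
    """One Q(lambda) step with accumulating traces."""
    best_next = max(Q[(s_next, b)] for b in range(N_ACTIONS))
    delta = r + GAMMA * best_next - Q[(s, a)]
    E[(s, a)] += 1.0
    for key in list(E):
        Q[key] += ALPHA * delta * E[key]
        E[key] *= GAMMA * LAMBDA
```

Using `defaultdict` keeps the table sparse, which matches the observation that only a small set of states is actually visited.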
5. Results
As a preliminary analysis, we calculated all the principal components of the Mario domain to see which dimensions PCA weighted most heavily during learning. The jump, ground, and current direction features are heavily represented in the first few principal components. This is intuitive, as this state changes frequently throughout a game of Mario. These features are also fundamental skills required to play Mario.

PCA also placed the features related to enemies in close proximity to Mario entirely in the last few principal components. These features are only important in very specific scenarios where Mario needs to react quickly to many nearby enemies. If only one enemy is nearby, it is also represented in the closest enemy X and closest enemy Y features.

This analysis legitimizes our approach in the Mario Benchmarking Domain. It demonstrates that we should initially learn using the fundamental skills required to play Mario, and we will show that we can learn these skills quickly. The skills represented in the higher principal components are better suited to strict optimization in specific scenarios. Analyzing the principal components of the demonstrations serves as a sanity check as well as a validation of the richness of the demonstrations.

When using our approach in the Mario domain, results were as expected. By projecting the state down to a low-dimensional manifold (fewer than 4 dimensions), the learning algorithm converged quickly to bad policies. However, when using a manifold of 4 dimensions or more, we converged quickly to a much better policy. These are promising results, although these well-performing dimensions may still converge to a suboptimal policy after more than 5000 episodes.

Since learning was poor in the first two manifolds, the jump and ground features alone are evidently not informative enough to learn an effective Mario policy. Yet when projecting onto a 3- or 4-dimensional manifold, we see a large increase in policy performance.
Figure 3: The emphasis of each feature relative to the principal component.

In these manifolds, the features jump, ground, current direction, shoot, closest enemy Y, and obstacles are all represented. It is intuitive that these features are important for basic skill in Mario. The remaining features are important, but only for fine-tuning policies.
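The per-feature analysis behind Figure 3 amounts to inspecting the PCA loading matrix. The sketch below shows one way to do this; the feature names echo the state variables of Section 4, but the data is a random stand-in for the real demonstrations, so the printed ranking is illustrative only.

```python
import numpy as np

# Placeholder feature names and demonstration data (assumptions,
# not the paper's actual dataset).
features = ["jump", "ground", "shoot", "direction", "enemies_near",
            "enemies_mid", "obstacles", "enemy_x", "enemy_y"]
X = np.random.rand(500, len(features))
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt = components

# For each of the first few components, report the feature with the
# largest absolute loading, i.e. the one the component emphasizes most.
for i, component in enumerate(Vt[:3]):
    top = max(range(len(features)), key=lambda j: abs(component[j]))
    print(f"PC{i + 1}: heaviest feature = {features[top]}")
```

On real demonstration data, this kind of inspection would reproduce the observation above that jump, ground, and direction dominate the leading components.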
6. Discussion
The strength of our approach is its simplicity and generality combined with fast convergence. Once we sample the domain, PCA can be run on those samples. We can apply the transform given by PCA in the learning cycle whenever we compute a new state. This is a simple matrix operation and adds negligible computational cost. This preliminary work shows that learning can be performed efficiently in these low-dimensional spaces at an increased convergence rate.

However, there are some fundamental issues. Our algorithm assumes that all high-variance states are important, yet it is possible for a state to be unnecessary for learning while still having high variance. We also stop learning once we converge in the lower-dimensional space, which means that, given infinite time, learning in the higher-dimensional space will converge to better performance. We discuss our solution to these problems in Section 7, using an iterative learning approach.
7. Future Work
There are many directions in which this approach can be extended. Our next step will be to perform additional analysis on the current results. Further experiments include varying the amount of training data given to PCA, or using random demonstration data; we can then see how the quality and quantity of data affect learning.

A major focus of future work is to improve upon the converged performance of the low-dimensional state representations. Work by Grzes and Kudenko has shown that mixed-resolution function approximation works well in complex domains. They initially learned with a less expressive function approximation to provide early guidance during learning, then learned with a more expressive function approximation to obtain a high-quality policy. We can leverage a similar idea. We propose in future work to learn until convergence in an n-dimensional space; as shown in this work, that convergence time is low. We then transfer that knowledge to an (n+1)-dimensional space. We hypothesize that this learning technique will converge faster than learning entirely in the full-dimensional space.

Figure 4: Results with varying manifolds. Lines in bold are experiments that performed better than or equal to learning with the full state. Dimensions greater than 5 performed similarly to the full state space, and were not included for clarity. Error bars are over 100 statistical runs.

References
Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, May 2009.

Tim Brys, Anna Harutyunyan, Peter Vrancx, Matthew E. Taylor, Daniel Kudenko, and Ann Nowé. Multi-objectivization of reinforcement learning problems by reward shaping. In International Joint Conference on Neural Networks (IJCNN), pages 2315–2322. IEEE, 2014.

Tim Brys, Anna Harutyunyan, Halit Bener Suay, Sonia Chernova, Matthew E. Taylor, and Ann Nowé. Reinforcement learning from demonstration through shaping. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2015.

M. Grzes and D. Kudenko. Reinforcement learning with reward shaping and mixed resolution function approximation. International Journal of Agent Technologies and Systems (IJATS), 1(2):36–54, 2009.

Sergey Karakovskiy and Julian Togelius. The Mario AI benchmark and competitions. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):55–67, 2012.

Geoffrey Lee, Min Luo, Fabio Zambetta, and Xiaodong Li. Learning a super mario controller from examples of human play. In Evolutionary Computation (CEC), 2014 IEEE Congress on, pages 1–8. IEEE, 2014.

Yizheng Liao, Kun Yi, and Zhe Yang. CS229 final report: Reinforcement learning to play Mario. Technical report, Stanford University, USA, 2012.

Bo Liu and Sridhar Mahadevan. Compressive reinforcement learning with oblique random projections, 2011.

Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, October 2010.

Jonathon Shlens. A tutorial on principal component analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies, 2005.

Christian D. Swinehart and L. F. Abbott. Dimensional reduction for reward-based learning, 2005.

M.A. Turk and A.P. Pentland. Face recognition using eigenfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1991.