Deep Reinforcement Learning using Capsules in Advanced Game Environments
Per-Arne Andersen

Supervisors: Morten Goodwin and Ole-Christoffer Granmo

Master's Thesis
University of Agder, 2018
Faculty of Engineering and Science
Department of ICT

Abstract
Reinforcement Learning (RL) is a research area that has blossomed tremendously in recent years and has shown remarkable potential for artificial-intelligence-based opponents in computer games. This success is primarily due to the vast capabilities of Convolutional Neural Networks (ConvNet), enabling algorithms to extract useful information from noisy environments. Capsule Network (CapsNet) is a recent addition to the Deep Learning algorithm group and has only barely begun to be explored. The network is an architecture for image classification, with superior performance on the MNIST dataset. CapsNets have not been explored beyond image classification.

This thesis introduces the use of CapsNet for Q-Learning-based game algorithms. To successfully apply CapsNet to advanced gameplay, three main contributions follow. First, the introduction of four new game environments as frameworks for RL research with increasing complexity, namely FlashRL, Deep Line Wars, Deep RTS, and Deep Maze. These environments fill the gap between the relatively simple and the more complex game environments available for RL research, and are used in the thesis to test and explore CapsNet behavior.

Second, the thesis introduces a generative modeling approach to produce artificial training data for use in Deep Learning models, including CapsNets. We empirically show that conditional generative modeling can successfully generate game data of sufficient quality to train a Deep Q-Network well.

Third, we show that CapsNet is a reliable architecture for Deep Q-Learning-based algorithms for game AI. A capsule is a group of neurons that determines the presence of objects in the data, and is shown in the literature to increase the robustness of training and predictions while lowering the amount of training data needed. It should therefore be well suited for game playing. We conclusively show that capsules can be applied to Deep Q-Learning, and present experimental results of this method in the environments introduced. We further show that capsules do not scale as well as convolutions, indicating that CapsNet-based algorithms alone will not be able to play even more advanced games without improved scalability.

Table of Contents
Abstract
Glossary
List of Figures
List of Tables
List of Publications

I Research Overview
II Contributions
III Experiments and Results
References
Appendices
  A Hardware Specification
IV Publications
  A Towards a Deep Reinforcement Learning Approach for Tower Line Wars
  B FlashRL: A Reinforcement Learning Platform for Flash Games

Glossary
AGI      Artificial General Intelligence
AI       Artificial Intelligence
ANN      Artificial Neural Network
CapsNet  Capsule Network
CCDN     Conditional Convolution Deconvolution Network
CGAN     Conditional Generative Adversarial Network
ConvNet  Convolutional Neural Network
DDQN     Double Deep Q-Learning
DL       Deep Learning
DNN      Deep Neural Network
DQN      Deep Q-Network
DRL      Deep Reinforcement Learning
FCN      Fully-Connected Network
GAN      Generative Adversarial Network
MDP      Markov Decision Process
MLP      Multilayer Perceptron
MSE      Mean Squared Error
ReLU     Rectified Linear Unit
RL       Reinforcement Learning
RTS      Real-Time Strategy
SGD      Stochastic Gradient Descent

List of Publications
A. Towards a Deep Reinforcement Learning Approach for Tower Line Wars
B. FlashRL: A Reinforcement Learning Platform for Flash Games

Part I
Research Overview

Chapter 1
Introduction
Despite many advances in Artificial Intelligence (AI) for games, no universal Reinforcement Learning (RL) algorithm can be applied to advanced game environments without extensive data manipulation or customization. This includes traditional Real-Time Strategy (RTS) games such as Warcraft III, Starcraft II, and Age of Empires. RL has been applied to simpler games, such as those on the Atari 2600 platform, but has, to the best of our knowledge, not been successfully applied to more advanced games. Further, existing game environments that target AI research are either overly simplistic, such as the Atari 2600, or complex, such as Starcraft II.

RL has in recent years made tremendous progress in learning how to control agents from high-dimensional sensory inputs like images. In simple environments, this has been proven to work well [36], but it is still an issue for advanced environments with large state and action spaces [34]. In environments where the objective is easily observable, there is a short distance between the action and the reward, which fuels the learning [21]. This is because the consequence of any action is quickly observed, and thus easily learned. When the objective is complicated, the game objectives still need to be mapped to a reward, but this becomes far less trivial [24]. For the Atari 2600 game Ms. Pac-Man, this was solved through a hybrid reward architecture that transforms the objective into a low-dimensional representation [59]. Similarly, OpenAI's bot is able to beat the world's top professionals at one-versus-one DotA 2. It uses an RL algorithm trained with self-play methods, learning how to predict the opponent's next move.

Applying RL to advanced environments is challenging because the algorithm must be able to learn features from a high-dimensional input in order to act correctly within the environment [15]. This is solved by trial and error to gather knowledge about the mechanics of the environment, a process that is slow and unstable [37]. Tree-search algorithms have been successfully applied to board games such as Tic-Tac-Toe and Chess, but fall short in environments with large state spaces [8]. This is a problem because the grand objective is to use these algorithms in real-world environments, which are often complex by nature. Convolutional Neural Networks (ConvNet) [28] address the complexity problem, but face several challenges when it comes to interpreting the environment data correctly.

1.1 Motivation
The primary motivation of this thesis is to create a foundation for RL research in advanced environments, to use generative modeling to train artificial neural networks, and to apply the Capsule Network architecture in RL algorithms.

1.2 Thesis Definition
The primary objective of this thesis is to perform Deep Reinforcement Learning using Capsules in Advanced Game Environments. The research is split into six goals, following the thesis hypotheses.
Goal 1: Investigate the state-of-the-art research in the field of Deep Learning, and learn how Capsule Networks function internally.

Goal 2: Design and develop game environments that can be used for research into RL agents for the RTS game genre.

Goal 3: Research generative modeling and implement an experimental architecture for generating artificial training data for games.

Goal 4: Research the novel CapsNet architecture for MNIST classification and combine this with RL problems.

Goal 5: Combine Deep Q-Learning and CapsNet and perform experiments on the environments from Goal 2.

Goal 6: Combine the elements of Goal 3 and Goal 5; the goal is to successfully train an RL agent with artificial training data.
Hypothesis 1: Generative modeling using deep learning is capable of generating artificial training data for games of sufficient quality.

Hypothesis 2: CapsNet can be used in Deep Q-Learning with performance comparable to ConvNet-based models.
The first goal of this thesis is to create a learning platform for RTS game research. The second is to use generative modeling to produce artificial training data for RL algorithms. The third goal is to apply CapsNets to Deep Reinforcement Learning algorithms. The hypotheses are that it is possible to produce artificial training data, and that CapsNets can be applied to Deep Q-Learning algorithms.

1.3 Contributions
This thesis introduces four new game environments: FlashRL, Deep Line Wars, Deep RTS, and Deep Maze. These environments integrate well with OpenAI GYM, creating a novel learning platform that targets Deep Reinforcement Learning for advanced games.

CapsNet is applied to RL algorithms, providing new insight into how CapsNet performs on problems beyond object recognition. The thesis also presents a novel method that uses generative modeling to train RL agents with artificial training data. There is, to the best of our knowledge, no documented research on using CapsNet in RL problems, nor are there environments specifically targeting RTS AI research.

Proceedings of the 30th Norwegian Informatics Conference, Oslo, Norway, 2017.
Proceedings of the 37th SGAI International Conference on Artificial Intelligence, Cambridge, UK, 2017.

1.4 Thesis Outline

Chapter 2 provides preliminary background research on Artificial Neural Networks (2.1, 2.2), Generative Models (2.3), the Markov Decision Process (2.4), and Reinforcement Learning (2.5).

Chapter 3 investigates the current state-of-the-art in Deep Neural Networks (3.1), RL (3.2), GANs (3.3), and game environments (3.5).

Chapter 4 outlines the technical specifications of the new game environments FlashRL (4.1), Deep Line Wars (4.2), Deep RTS (4.3), and Deep Maze (4.4). In addition, a well-established game environment (Section 4.5) is introduced to validate the experiments conducted in this thesis.

Chapter 5 introduces the proposed solutions for the goals defined in Section 1.2. Section 5.1 outlines how the environments are presented as a learning platform. Section 5.2 introduces the proposal to use capsules in RL. Section 5.3 describes the Deep Q-Learning algorithm and the implementations used for the experiments in this thesis. Finally, the artificial training data generator is outlined in Section 5.4.

Chapters 6 and 7 show experimental results from the work presented in Chapter 5.

Chapter 8 concludes on the thesis hypotheses and provides a summary of the work done. Section 8.2 outlines the road map for future research related to the thesis.

Chapter 2
Background
Deep Learning (DL) is a branch of machine learning algorithms that recently became popularized due to the exponential growth in available computing power. DL is unique in that it is designed to learn data representations, as opposed to task-specific algorithms. Methods from DL are frequently used in RL algorithms, creating a new branch called Deep Reinforcement Learning (DRL). Artificial Neural Networks (ANN) are used at its core, utilizing the most novel DL techniques to gain state-of-the-art capabilities.

This chapter outlines background theory for topics related to the research performed later in this thesis. Section 2.1 shows how Artificial Neural Networks work, moving on to computer vision with Convolutional Neural Networks in Section 2.2. Section 2.4 outlines the theory behind the Markov Decision Process (MDP) and how it is used in RL.

2.1 Artificial Neural Networks
An Artificial Neural Network (ANN) is a computing system inspired by how biological nervous systems, such as the brain, function [19]. ANNs are composed of an interconnected network of neurons that pass data to the next layer when stimulated by an activation signal. When a network consists of several hidden layers, it is considered a Deep Neural Network (DNN). Figure 2.1 illustrates a Deep Multi-Layer Perceptron (MLP) with two hidden layers.

Figure 2.1: Deep Neural Network with two hidden layers
Figure 2.2: Single Perceptron

f(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} (w_i \cdot x_i) + b > 0 \\ 0 & \text{otherwise} \end{cases}   (2.1)

MLPs are considered a network because they are composed of many different functions. Each of these functions is represented as a perceptron. The combination of these functions gives us the ability to represent complex and high-dimensional functions [19]. Figure 2.2 illustrates a single perceptron from an MLP, where x_1, x_2, ..., x_n are inputs to the perceptron. Each of these inputs has a weight w_1, w_2, ..., w_n. Input x_n and weight w_n are multiplied into z_n = x_n \cdot w_n, and z = \sum_{i=1}^{n} z_i + b, where b is the bias value and z is the perceptron value. In Figure 2.2, the perceptron has a binary activation function (Equation 2.1): the neuron produces the value 1 for all z above 0, and 0 otherwise. There are several different activation functions that can be used in a perceptron network, see Section 2.1.1.

Table 2.1: Common activation functions

Name        Equation
TanH        \tanh(z) = (e^{z} - e^{-z}) / (e^{z} + e^{-z})
Softmax     \sigma(z)_j = e^{z_j} / \sum_{k=1}^{K} e^{z_k}  for j = 1, ..., K
Sigmoid     f(z) = 1 / (1 + e^{-z})
ReLU        f(z) = 0 if z < 0, z otherwise
LeakyReLU   f(z) = z if z > 0, 0.01z otherwise
Binary      f(z) = 0 if z < 0, 1 if z >= 0
The purpose of an activation function is often to introduce non-linearity into the network. It is proven that a DNN using only linear activations is equivalent to a single-layered network [42]. It is therefore natural to use non-linear activation functions in the hidden layers of an ANN if the goal is to approximate non-linear functions. TanH and the Rectified Linear Unit (ReLU) have proven to work well in ANNs [22, 39, 65], but several other alternatives exist, as illustrated in Table 2.1. Researchers do not fully understand why one activation function works better than another for a particular problem, which is why trial and error is used to find the best fit [33].
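To make Equation 2.1 and Table 2.1 concrete, the following minimal NumPy sketch evaluates a single perceptron under a few of the listed activation functions; the inputs, weights, and bias are arbitrary illustration values, not taken from the thesis:

```python
import numpy as np

def perceptron(x, w, b, activation):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = np.dot(w, x) + b
    return activation(z)

# Activation functions from Table 2.1
binary = lambda z: 1.0 if z > 0 else 0.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: max(0.0, z)
leaky_relu = lambda z: z if z > 0 else 0.01 * z

x = np.array([0.5, -1.0, 2.0])   # arbitrary inputs
w = np.array([0.8, 0.2, -0.5])   # arbitrary weights
b = 0.1                          # bias

for name, fn in [("binary", binary), ("sigmoid", sigmoid),
                 ("relu", relu), ("leaky_relu", leaky_relu)]:
    print(name, perceptron(x, w, b, fn))
```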
Optimization in ANNs is the process of updating the weights of neurons in a network. In the optimization process, a loss function is defined. This function calculates the error/cost value of the network at the output layer. The error value describes the distance between the ground truth and the predicted value. For the network to improve, this error is backpropagated through the network until each neuron has an error value that reflects its positive or negative contribution to the ground truth. Each neuron also calculates the gradient of its weights by multiplying the output delta with the input activation value. Weights are updated using stochastic gradient descent (SGD), which is a method of gradually descending the loss until reaching an optimal value.
Figure 2.3: Loss functions (Huber loss with δ ∈ {0.1, 1.0, 2.0, 5.0, 10.0}, compared against the squared error)
To measure the inconsistency between the predicted value and the ground truth, a loss function is used in ANNs. The loss function calculates a positive number that is minimized throughout the optimization of the parameters (Section 2.1.2). A loss function can be any mathematical formula, but several well-established functions exist, and their performance varies with the task. (Weights and parameters are used interchangeably throughout the thesis.)

Mean Squared Error (MSE) is a quadratic loss function widely used in linear regression, and it is also used in this thesis. Equation 2.2 is the standard form of MSE, where the goal is to minimize the squared residuals (y^{(i)} - \hat{y}^{(i)})^2:

L = \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2   (2.2)

Huber Loss is a loss function widely used in DRL. It is similar to MSE, but is less sensitive to predictions far from the ground truth. Equation 2.3 defines the function, where a refers to the residuals and δ to the sensitivity:

L_{\delta}(a) = \begin{cases} \frac{1}{2} a^2 & \text{for } |a| \le \delta \\ \delta (|a| - \frac{1}{2}\delta) & \text{otherwise} \end{cases}   (2.3)

Figure 2.3 illustrates the difference between MSE and Huber Loss using different δ configurations.

Hyper-parameters are tunable variables in ANNs. These parameters include the learning rate, learning rate decay, loss function, and the optimization algorithm, like Adam or SGD.
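For reference, Equations 2.2 and 2.3 translate directly into a few lines of NumPy; the sample values below are arbitrary:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error, Equation 2.2."""
    return np.mean((y_true - y_pred) ** 2)

def huber(y_true, y_pred, delta=1.0):
    """Huber loss, Equation 2.3: quadratic near zero, linear in the tails."""
    a = y_true - y_pred                       # residuals
    quadratic = 0.5 * a ** 2
    linear = delta * (np.abs(a) - 0.5 * delta)
    return np.mean(np.where(np.abs(a) <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 0.0, 3.2])
print(mse(y_true, y_pred), huber(y_true, y_pred, delta=1.0))
```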
2.2 Convolutional Neural Networks

A Convolutional Neural Network is a novel ANN architecture that primarily reduces the computing power required to learn weights and biases for three-dimensional inputs. ConvNets are split into three layers:

1. Convolution layer
2. Activation layer
3. Pooling layer (optional)

A convolution layer has two primary components, the kernel (parameters) and the stride. The kernel consists of a weight matrix that is multiplied by the input values in its receptive field. The receptive field is the area of the input that the kernel is focused on. The kernel slides over the input with a fixed stride; the stride value determines how fast this sliding happens. With a stride of 1, the receptive field moves in the direction x + 1, and when at the end of the input x-axis, y + 1.

Figure 2.4: Convolutional Neural Network for classification

Consider a three-dimensional matrix representing an image of size 28 × 28 × 3. In this example, the goal is to classify the image as either a cat or a dog. Using the hyperparameters kernel = 3 × 3 and stride = 1 × 1, there are 32 shared parameters to optimize. In contrast, a Fully-Connected Network (FCN) with a single neuron layer would have 2357 parameters to optimize. The reason convolutions work is that they exploit what is called feature locality. ConvNets use filters that each learn a specific feature of the input, for example horizontal and vertical lines. For every convolutional layer added to the network, the information becomes more abstract, identifying objects and shapes. Figures 2.4 and 2.5 illustrate how a simple ConvNet is modeled compared to an FCN. The ConvNet uses a stride of 1 × 1.

Pooling is the operation of reducing the data resolution, often subsequent to a convolution layer. This is beneficial because it reduces the number of parameters to optimize, hence decreasing the computational requirements. Pooling also controls overfitting by generalizing features, which makes the network better at handling spatial invariance [48]. There are several ways to perform pooling.
Max and Average pooling are considered the most stable methods, whereas Max pooling is the most used in state-of-the-art research [29]. Figure 2.6 illustrates the pooling process using Max and Average pooling on a 4 × 4 × X input volume, where X is the depth of the input volume. The hyperparameters for the pooling operation are kernel = 2 × 2 and stride = 2 × 2, producing a 2 × 2 × X output volume. This operation is performed independently for each depth slice of the input volume.

Figure 2.5: Fully-Connected Neural Network for classification
Figure 2.6: Max and Average pooling operation
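A minimal NumPy sketch of the pooling operation described above (2 × 2 kernel, stride 2, applied independently per depth slice) could look like this:

```python
import numpy as np

def pool2d(volume, mode="max"):
    """2x2 pooling with stride 2, applied independently to each depth slice."""
    h, w, d = volume.shape
    out = np.zeros((h // 2, w // 2, d))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            window = volume[i:i + 2, j:j + 2, :]        # 2x2 receptive field
            if mode == "max":
                out[i // 2, j // 2, :] = window.max(axis=(0, 1))
            else:                                        # average pooling
                out[i // 2, j // 2, :] = window.mean(axis=(0, 1))
    return out

volume = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)  # 4 x 4 x X, X = 3
print(pool2d(volume, "max").shape)  # (2, 2, 3)
```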
Historically, ConvNets drastically improved the performance of image recognition because they successfully reduced the number of parameters required while preserving the important features of the image. There are, however, several challenges, most notably that they are not rotation invariant. ConvNets are far more intricate than covered in this section, but that is beyond the scope of this thesis. For an in-depth survey of the ConvNet architecture, refer to Recent Advances in Convolutional Neural Networks [12].

2.3 Generative Models
Figure 2.7: Overview: Generative Model
Generative Models are a family of algorithms that try to generate an artificial output based on some input, often randomized. Generative Adversarial Networks and Variational Autoencoders are two methods that have shown excellent results in this task. These methods have primarily been used to generate realistic images from datasets like MNIST and CIFAR-10. This section outlines the theory needed to understand the underlying architecture of generative models.

The objective of most Generative Models is to generate a distribution of data that is close to the ground-truth distribution (the dataset). The Generative Model takes a Gaussian distribution z as input and outputs \hat{p}(x), as illustrated in Figure 2.7. The goal is to find parameters θ for which the generated distribution best matches the ground-truth distribution. Convolutional Neural Networks are often used in Generative Modeling, typically for models using noise as input. The model has several hidden parameters θ that are tuned via backpropagation methods like stochastic gradient descent. If the model reaches optimal parameters, \hat{p}(x) = p(x) is considered true.

2.4 Markov Decision Process

MDP is a mathematical method of modeling decision-making within an environment. An environment defines a real or virtual world with a set of rules. This thesis focuses on virtual environments, specifically games with the corresponding game-mechanic limitations. The core problem of MDPs is to find an optimal policy function for the decision maker (hereby referred to as an agent).

\underbrace{a}_{\text{Action}} = \underbrace{\pi(s)}_{\text{Policy } \pi \text{ for state } s}   (2.4)

Equation 2.4 illustrates how a decision/action is made using observed knowledge of the environmental state. The goal of the policy function is to find the decision that yields the best cumulative reward from the environment. An MDP behaves like a Markov chain, hence gaining the Markov Property. The Markov property describes a system where future states depend only on the present and not on the past. This enables MDP-based algorithms to do iterative learning [54]. MDP is the foundation of how RL algorithms operate to learn optimal behavior in an environment.

2.5 Reinforcement Learning
Reinforcement learning is a process where an agent performs actions in an environment, trying to maximize some cumulative reward [53] (see Section 2.4). RL differs from supervised learning in that the ground truth is never presented directly. In RL there are model-free and model-based algorithms. In model-free RL, the algorithm must learn the environmental properties (the model) without guidance; in contrast, in model-based RL the features of the environment are defined manually [10]. For model-free algorithms, learning only happens in the present, and the future must be explored before knowledge about the environment can be learned [11, 26]. This thesis focuses on
Q-Learning algorithms, a model-free RL technique that may potentially solve difficult game environments. This section investigates the background theory of Q-Learning and extends this method to Deep Q-Learning (DQN), a novel algorithm that combines RL and ANNs.
Q-Learning is a model-free algorithm, meaning that the MDP stays hidden throughout the learning process. The objective is to learn the optimal policy by estimating the action-value function Q*(s, a), yielding the maximum expected reward when performing action a in state s of an environment. The optimal policy can then be found by

\pi(s) = \operatorname{argmax}_{a} Q^{*}(s, a)   (2.5)

Equation 2.5 is derived from finding the optimal utility of a state, U(s) = \max_{a} Q(s, a). Since the utility is the maximum value, the argmax of that same value qualifies as the optimal policy. The update rule for Q-Learning is based on value iteration:

Q(s, a) \leftarrow Q(s, a) + \underbrace{\alpha}_{\text{LR}} \Big( \underbrace{R(s)}_{\text{Reward}} + \underbrace{\gamma}_{\text{Discount}} \underbrace{\max_{a'} Q(s', a')}_{\text{New } Q} - \underbrace{Q(s, a)}_{\text{Old } Q} \Big)   (2.6)

Equation 2.6 shows the iterative process of propagating back the estimated Q-value for each discrete time step in the environment. α is the learning rate of the algorithm, usually a low number between 0.001 and 0.00001. The reward function R(s) ∈ ℝ is often bounded to −1 < R(s) < 1. γ is the discount factor, discounting the importance of future states. The "old Q" is the estimated Q-value of the starting state, while the "new Q" estimates the future state. Equation 2.6 is guaranteed to converge towards the optimal action-value function, Q_i → Q* as i → ∞ [36, 53].

At the most basic level, Q-Learning utilizes a table for storing (s, a, r, s') pairs. Instead, a non-linear function approximation can be used to approximate Q(s, a; θ), where θ describes the tunable parameters (weights) of the approximator. This is called Deep Q-Learning. ANNs are used as an approximation method for retrieving values from the Q-table, but at the cost of stability. Using an ANN is much like the compression found in JPEG images: the compression is lossy, and information is lost at compression time. This makes DQN unstable, since values may be wrongfully encoded during training. In addition to value iteration, a loss function must be defined for the backpropagation process of updating the parameters:

L(\theta_i) = \mathbb{E} \big[ (r + \gamma \max_{a'} Q(s', a'; \theta_i) - Q(s, a; \theta_i))^2 \big]   (2.7)

Equation 2.7 illustrates the loss function proposed by Mnih et al. [37]. It uses the Bellman equation to calculate the loss in gradient descent. To increase training stability, Experience Replay is used. This is a memory module that stores memories from already explored parts of the state space. Experiences are often selected at random and replayed to the neural network as training data [36].
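To make Equations 2.5 and 2.6 concrete, a minimal tabular Q-Learning sketch in Python might look as follows; the state/action counts and hyperparameter values are illustrative assumptions:

```python
import numpy as np

n_states, n_actions = 64, 4
alpha, gamma = 0.001, 0.95           # learning rate and discount factor
Q = np.zeros((n_states, n_actions))  # tabular action-value function

def q_update(s, a, r, s_next):
    """One value-iteration step of Equation 2.6."""
    new_q = r + gamma * np.max(Q[s_next])  # reward + discounted future estimate
    Q[s, a] += alpha * (new_q - Q[s, a])   # move old Q towards new Q

def policy(s):
    """Greedy policy of Equation 2.5."""
    return np.argmax(Q[s])
```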
Chapter 3

State-of-the-art
This thesis focuses on topics that are under active research, meaning that the state-of-the-art advances quickly. There have been many achievements in Deep Learning, primarily related to Computer Vision topics. This chapter investigates recent advancements in Deep Learning (3.1), Deep Reinforcement Learning (3.2), Generative Modeling (3.3), Capsule Networks (3.4), and Game Learning Platforms (3.5). Following the success of Deep Learning, there have been several breakthroughs in popular game environments; Section 3.6 outlines the state-of-the-art in applying RL algorithms to game environments.

3.1 Deep Learning
Deep Learning has a long history, dating back to the late 1980s. One of the first relevant papers in the area is Learning representations by back-propagating errors by Rumelhart et al. [44]. In this paper, they illustrated that a deep neural network could be trained using backpropagation. The deep architecture proved that a neural network could successfully learn non-linear functions.

Yann LeCun started research into Convolutional Neural Networks (ConvNet) in the early 1990s, with handwritten zip-code classification as the primary goal [27]. He created the famous MNIST dataset, which is still widely used in the literature [28]. After ten years of research, LeCun et al. achieved state-of-the-art results on the MNIST dataset using ConvNets similar to those found in the literature today [28]. But due to scaling issues with deep ANNs, they were outperformed by classifiers like Support Vector Machines. It was not until 2006, with the paper A fast learning algorithm for deep belief nets by Hinton et al., that Deep Learning reappeared [17]. This paper showed how to effectively train a deep neural network by training one layer at a time. This was the beginning of Deep Neural Networks as they are known today.

For this thesis, Computer Vision is the most interesting application. There have been many advances in computer vision in the last couple of years. AlexNet [25], VGGNet [40], and ResNet [63] are models achieving state-of-the-art results in the ImageNet competition. These models are complex, but do a good job at image recognition. For DRL there is, to the best of our knowledge, no abstract model that works for all environments; the model must therefore be adapted to best fit the environment at hand.

3.2 Deep Reinforcement Learning
The earliest work found related to Deep Reinforcement Learning is Reinforcement Learning for Robots Using Neural Networks. This PhD thesis illustrated how an ANN could successfully be used in RL to perform actions in an environment with delayed reward signals [31].

With several breakthroughs in computer vision in the early 2010s, researchers started work on integrating ConvNets into RL algorithms. Q-Learning together with Deep Learning was a game-changing moment, and it has had tremendous success in many single-agent environments on the Atari 2600 platform. Deep Q-Learning (DQN), as proposed by Mnih et al., used ConvNets to predict the Q function. This architecture outperformed human expertise in over half of the games [36].

Hasselt et al. proposed Double DQN (DDQN), which reduced the overestimation of action values in the Deep Q-Network. This led to improvements in some of the games on the Atari platform [7]. Wang et al. then proposed a dueling architecture of DQN, which introduced separate estimation of the value function and the advantage function. These two functions were then combined to obtain the Q-value. Dueling DQN was implemented on top of the previous work of van Hasselt et al. [43].

Harm van Seijen et al. recently published an algorithm called Hybrid Reward Architecture (HRA), a divide-and-conquer method where several agents estimate a reward and a Q-value for each state. The algorithm performed above human expertise in Ms. Pac-Man, which is considered one of the hardest games in the Atari 2600 collection, and is currently state-of-the-art in the reinforcement learning domain. The drawback of this algorithm is that the generality of Mnih et al.'s approach is lost due to the huge number of separate agents that have domain-specific sensory input [59].

There have been few attempts at using Deep Q-Learning on advanced simulators made explicitly for machine learning. It is probable that this is because very few environments have been created for this purpose.

3.3 Generative Modeling
Figure 3.1: Illustration of Generative Adversarial Network Model
There are primarily three generative models actively used in recent literature: GANs, Variational Autoencoders, and autoregressive models. GANs show far better results than any other generative model and are the primary field of research for this thesis.

GANs show great potential when it comes to generating artificial images from real samples. The first occurrence of GAN was introduced in the paper Generative Adversarial Networks by Ian J. Goodfellow et al. [23]. This paper proposed a framework using a generator and a discriminator neural network. The general idea of the framework is a two-player game where the generator generates synthetic images from noise and tries to fool the discriminator by learning to create authentic-looking images, see Figure 3.1.

In future work, it was specified that the proposed framework could be extended from p(x) → p(x | c). This was later proposed in the paper Conditional Generative Adversarial Nets (CGAN) by Mirza et al. [35]. GAN is extended to a conditional model by demanding additional information y as input to the generator and discriminator. This makes it possible to condition the generated images on information like labels, as illustrated in Figure 3.2.

Radford et al. [33] proposed Deep Convolutional Generative Adversarial Networks (DCGAN) in Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. This paper improved on using ConvNets in unsupervised settings. Several architectural constraints were set to make training of DCGAN stable in most scenarios. The paper illustrated many great examples of images generated with DCGAN, for instance state-of-the-art bedroom images.

Figure 3.2: Illustration of Conditional Generative Adversarial Network Model

In summer 2016, Salimans et al. (with Goodfellow) presented Improved Techniques for Training GANs, achieving state-of-the-art results in the classification of MNIST, CIFAR-10, and SVHN [46]. This paper introduced minibatch discrimination, historical averaging, one-sided label smoothing, and virtual batch normalization.

There have been many advances in GANs between and after these papers. Throughout the research process, the most prominent architecture for our problem is the Conditional GAN, which enables us to condition the input variable x on a variable y. The most recent paper on this topic is Towards Diverse and Natural Image Descriptions via a Conditional GAN by Dai et al. [9]. This paper focuses on captioning images using Conditional GANs and produced captions of similar quality to human-made captions. In RL terms, it successfully learns a good policy for the dataset.

3.4 Capsule Networks
Capsule Neural Networks (CapsNet) is a novel deep learning architecture that attempts to improve the performance of image and object recognition. CapsNet is theorized to be far better at detecting rotated objects and to require less training data than traditional ConvNets. Instead of creating deep networks like, for example, ResNet-50, a capsule layer is created, containing several sub-layers in depth. Each of these capsules holds a group of neurons, where the objective is to learn a specific object or part of an object. When an image is fed into the capsule layer, an iterative process of identifying objects begins. The higher-dimension layers receive signals from the lower dimensions; the higher-dimension layer then determines which signal is the strongest, and a connection is made to the winning signal (betting). This method is called dynamic routing. This routing-by-agreement ensures that features are mapped to the output while preserving all input information.

Pooling in ConvNets is also a primitive form of routing, but information about the input is lost in the process. This makes pooling much more vulnerable to attacks compared to dynamic routing. In the current state-of-the-art, CapsNet is explained as inverse graphics, where a capsule tries to learn an activity vector describing the probability that an object exists.

Capsule Networks are still in their infancy, and there is no well-documented research on this topic yet, apart from the state-of-the-art paper Dynamic Routing Between Capsules by Sabour et al. [45].

3.5 Game Learning Platforms
There exist several exciting game learning platforms used to research state-of-the-art AI algorithms. The goal of these platforms is generally to provide the infrastructure necessary for studying Artificial General Intelligence (AGI). AGI is a term used for AI algorithms that can perform well across several environments without retraining. DRL is currently the most promising branch of algorithms for approaching AGI.

Bellemare et al. provided in 2012 the learning platform Arcade Learning Environment (ALE), which enabled scientists to conduct cutting-edge research in general deep learning [4]. The package provided hundreds of Atari 2600 environments that in 2013 allowed Mnih et al. to make a breakthrough with Deep Q-Learning and A3C. The platform has been a critical component in several advances in RL research [32, 36, 37].

The Malmo project is a platform built on top of the popular game Minecraft. This game is set in a 3D environment where the objective is to survive in a world of dangers. The paper The Malmo Platform for Artificial Intelligence Experimentation by Johnson et al. claims that the platform has all the characteristics qualifying it as a platform for AGI research [20].

ViZDoom is a platform for research in visual reinforcement learning. With the paper ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning, Kempka et al. illustrated that an RL agent could successfully learn to play the game Doom, a first-person shooter, with behavior similar to humans [41].

With the paper DeepMind Lab, Beattie et al. released a platform for 3D navigation and puzzle-solving tasks. The primary purpose of DeepMind Lab is to act as a platform for DRL research [3].

In 2016, Brockman et al. from OpenAI released GYM, which they refer to as "a toolkit for developing and comparing reinforcement learning algorithms". GYM provides various types of environments from the following technologies: algorithmic tasks, Atari 2600, board games, the Box2D physics engine, the MuJoCo physics engine, and text-based environments. OpenAI also hosts a website where researchers can submit their performance for comparison between algorithms. GYM is open source and encourages researchers to add support for their own environments [5].

OpenAI recently released a new learning platform called Universe. This platform adds support for environments running inside VNC. It also supports running Flash games and browser applications. However, despite OpenAI's open-source policy, they do not allow researchers to add new environments to the repository, which limits the possibility of running arbitrary environments. Universe is, however, a significant learning platform, as it also has support for desktop games like Grand Theft Auto IV, which allows for research in autonomous driving [30].

Very recently, the Extensive Lightweight Flexible (ELF) research platform was released with the NIPS paper ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games. This paper focuses on RTS game research, and ELF is the first platform officially targeting these types of games [58].
Platform          Diversity   AGI   Advanced Environment(s)
ALE               Yes         Yes   No
Malmo Platform    No          No    Yes
ViZDoom           No          Yes   Yes
DeepMind Lab      No          No    Yes
OpenAI Gym        Yes         Yes   No
OpenAI Universe   Yes         Yes   Partially
ELF               No          No    Yes
GYM-CAIR          Yes         Yes   Yes

Table 3.1: Summary of researched platforms
Multiple interesting observations about the current state-of-the-art in learning platforms for RL algorithms were made during our research. Table 3.1 describes the capabilities of each learning platform with respect to fulfilling the requirements of this thesis. GYM-CAIR is included in this comparison and is further described in Chapters 4 and 5.

3.6 Reinforcement Learning in Games
Reinforcement Learning for games is a well-established field of research and is frequently used to measure how well an algorithm can perform within an environment. This section presents some of the most important achievements of Reinforcement Learning.

TD-Gammon is an algorithm capable of reaching an expert level of play in the board game Backgammon [56, 57]. The algorithm was developed by Gerald Tesauro in 1992 at IBM's Thomas J. Watson Research Center. TD-Gammon consists of a three-layer ANN and is trained using an RL technique called TD-Lambda. TD-Lambda is a temporal-difference learning algorithm invented by Richard S. Sutton [52]. The ANN iterates over all possible moves the player can perform and estimates the reward for each particular move. The action that yields the highest reward is then selected. TD-Gammon is one of the first algorithms to utilize self-play methods to improve the ANN parameters.

In late 2015, AlphaGO became the first algorithm to win against a human professional Go player. AlphaGO is an RL framework that uses Monte Carlo tree search and two deep neural networks for value and policy estimation [49]. Value refers to the expected future reward from a state, assuming that the agent plays perfectly. The policy network attempts to learn which action is best in any given board configuration. The earliest versions of AlphaGO used training data from games played by human professionals. In the most recent version, AlphaGO Zero, only self-play is used to train the AI [51]. In a recent update, AlphaGO was generalized to work for Chess and Shogi (Japanese Chess), using only 24 hours to reach a superhuman level of play [50].

DOTA 2 is an advanced player-versus-player game where the player controls a hero unit. The game objective is to defeat the enemy heroes and destroy their base. In August 2017, OpenAI invented an RL-based AI that defeated professional players in one-versus-one games. Training was done using only self-play, and the algorithm learned how to exploit game mechanics to perform well.

DeepStack is an algorithm that can perform expert-level play in Texas Hold'em poker. This algorithm uses tree search in conjunction with neural networks to perform sensible actions in the game [38]. DeepStack is a general-purpose algorithm that aims to solve problems with imperfect information.

There have been several other significant achievements in AI, but these are not directly related to the use of RL algorithms. They include Deep Blue and Watson from IBM; Deep Blue, for instance, relies on brute-force search rather than learning.

Part II

Contributions

Chapter 4
Environments
Simulated environments are a popular way to conduct experiments on algorithms in computer science. These simulated environments are often tailored to the problem at hand and quickly prove, or disprove, the capability of an algorithm. This chapter proposes four new game environments for deep learning research: FlashRL, Deep Line Wars, Deep RTS, and Deep Maze. The game Flappy Bird is introduced as a validation environment for the experiments conducted in Chapter 7. Figure 4.1 illustrates that each of these environments has a different field of focus, and an agent placed in these environments is challenged on several topics, for instance multitasking, deep and shallow state interpretation, and planning. This chapter creates a foundation for research into CapsNet-based RL algorithms in advanced game environments.
Figure 4.1: Environment field of focus

4.1 FlashRL
Adobe Flash is a multimedia software platform used for the production of applications and animations. The Flash run-time was recently declared deprecated by Adobe and will no longer be supported by 2020. Flash is still frequently used in web applications, and countless games have been created for the platform. Several web browsers have removed support for the Flash run-time, making it difficult to access these game environments. Flash games are an excellent resource for machine learning benchmarking due to the size and diversity of the game repository. It is therefore essential to preserve the Flash run-time as a platform for RL.
Flash Reinforcement Learning (FlashRL) is a novel platform that acts as an input/output interface between Flash games and DRL algorithms. FlashRL enables researchers to interface efficiently against almost any Flash-based game environment.

Figure 4.2: FlashRL: Architecture

The learning platform is developed primarily for Linux-based operating systems, but is likely to run on Cygwin with few modifications. There are several key components that FlashRL uses to operate adequately, see Figure 4.2. FlashRL uses XVFB to create a virtual frame-buffer. The frame-buffer acts like a regular desktop environment found in Linux desktop distributions [18]. Inside the frame-buffer, a Flash game chosen by the researcher is executed by a third-party Flash player, for example Gnash. A VNC server serves the frame-buffer and enables FlashRL to access the display, mouse, and keyboard via the VNC protocol. The VNC client pyVLC was made specifically for FlashRL; its code base originates from python-vnc-viewer [55]. The last component of FlashRL is the Reinforcement Learning API, which gives the developer access to the input/output of pyVLC. This makes it easy to develop sequenced algorithms using API callbacks, or to invoke commands manually with threading.

Figure 4.3: FlashRL: Frame-buffer access methods

Figure 4.3 illustrates two methods of accessing the frame-buffer from the Flash environment. Both approaches are sufficient to perform RL, but each has its strengths and weaknesses. Method 1 sends frames at a fixed rate, for example 60 frames per second. The second method does not set any restriction on how fast the frame-buffer can be captured. This is preferable for developers who do not require images at fixed time steps, because it demands less processing power per frame. The framework was developed with deep learning in mind and is proven to work well with Keras and TensorFlow [1].

Figure 4.4: FlashRL: Available environments

There are close to a thousand game environments available in the first version of FlashRL. These game environments were gathered from different sources on the world wide web. FlashRL has a relatively small code base, and to preserve this size, the Flash repository is hosted at a remote site. Because of the large repository, not all games have been tested thoroughly, and the game quality may therefore vary. Figure 4.4 illustrates tested games that yield great value for DRL research.

4.2 Deep Line Wars
Figure 4.5: Deep Line Wars: Graphical User Interface
The game objective of Deep Line Wars is to invade the opposing player (hereby the enemy) with mercenary units until all health points are depleted (see Figure 4.5). For every friendly unit that enters the red area of the map, the enemy health pool is reduced by one. When a player purchases a mercenary unit, it spawns at a random location inside the red area of the owner's base. Mercenary units automatically move towards the enemy base. To protect the base, players can construct towers that shoot projectiles at the opponent's mercenaries. When a mercenary dies, a fair percentage of its gold value is awarded to the opponent. When a player sends a unit, the income is increased by a percentage of the unit's gold value. As part of the income system, players gain gold at fixed intervals [2].

To successfully master the game mechanics of Deep Line Wars, the player (agent) must learn to

• use offensive strategies when spawning units,
• defend against the opposing player's invasions, and
• maintain a healthy balance between offense and defense to maximize income.

The game is designed so that if the player performs better than the opponent at these mechanics, he is guaranteed to win.

Figure 4.6: Deep Line Wars: Game-state representation
Figure 4.7: Deep Line Wars: Game-state representation using heatmaps
Table 4.1: Deep Line Wars state representations (image and heat-map) with their matrix and data sizes

In the heat-map representation (Figure 4.7), the state is drawn with

• red pixels as friendly buildings,
• green pixels as enemy units, and
• teal pixels as the mouse cursor.

When using grayscale heat-maps, the RGB values are squashed into a one-dimensional matrix with values ranging between 0 and 1. The economy drastically increases the complexity of Deep Line Wars and is challenging to represent correctly using images alone. Therefore, a secondary data structure is available featuring health, gold, lumber, and income. This data can then be fed into a hybrid DL model as an auxiliary input [61].

4.3 Deep RTS

RTS games are considered to be the most challenging games for AI algorithms to master [60]. With colossal state and action spaces in a continuous setting, it is nearly impossible to estimate the computational complexity of games such as Starcraft II.

The game objective of Deep RTS is to build a base consisting of a Town Hall, and then expand the base to gain the military power to defeat the opponents. Each player starts with a worker. Workers can construct buildings and gather resources to gain an economic advantage.

Figure 4.8: Deep RTS: Graphical User Interface
The game mechanics consist of two main terminologies, Micro and Macro management. The player with the best ability to manage their resources, military, and defenses is likely to win the game. There is a considerable leap from mastering Deep Line Wars to Deep RTS, much because Deep RTS features more than two players.

Table 4.2: Player resources (lumber, gold, oil, food, and units)

The graphical user interface displays the action distribution, player resources, player scoreboard, and a live performance graph. The action distribution keeps track of which actions a player has performed during the game session; these statistics are stored to the hard drive after a game has reached the terminal state. Player resources (Table 4.2) are shown in the top bar of the game. The player scoreboard indicates the overall performance of each player, measured by kills, defensive points, offensive points, and resource count. Deep RTS features several hotkeys for moderating game settings like game speed and state representation; the hotkey menu is accessed by pressing the G key.

Deep RTS is an environment developed as an intermediate step between the Atari 2600 and the famous game Starcraft II. It is designed to measure the performance of RL algorithms while preserving the game goal. Deep RTS is developed for high-performance simulation of RTS scenarios. The game engine is written in C++ for performance, but has an API wrapper for Python, seen in Figure 4.9. It has a flexible configuration to enable different AI approaches, for instance online and offline RL. Deep RTS can represent the state as raw game images (C++) and as a matrix, which is compatible with both C++ and Python.

4.4 Deep Maze

Figure 4.10: Deep Maze: Graphical User Interface
Deep Maze is a game environment designed to challenge RL agents with the shortest-path problem. Deep Maze defines the problem as follows: how can the agent optimally navigate through any fully-observable maze?

The environment is simple, but it becomes drastically more complicated when the objective is to find the optimal policy π*(s), where s = state, for all maze configurations. There are multiple difficulty levels for Deep Maze in two separate modes: deterministic and stochastic. In deterministic mode, the maze structure never changes between games; stochastic mode randomizes the maze structure for every game played. There are multiple size configurations, ranging from 7 × 7 to 55 × 55 in width and height, seen in Figure 4.10.

Figure 4.11 illustrates how the theoretical maximum state-space set S of Deep Maze increases with maze size. This is calculated with the binomial coefficient

|S| = \binom{\text{width} \times \text{height}}{\text{player} + \text{goal}} = \binom{w \times h}{2}

This is, however, reduced depending on the maze composition, where dense maze structures are generally less complex to solve. The simulation is designed for performance, so that each discrete time step is calculated with the fewest possible CPU cycles. The simulation is estimated to run at 3 000 000 ticks per second on modern hardware, which allows for fast training of RL algorithms.

Figure 4.11: Deep Maze: State-space complexity

From an RL point of view, Deep Maze challenges an agent in state interpretation and navigation, where the goal is to reach the terminal state in the fewest possible time steps. It is a flexible environment that enables research in a single-environment setting, as well as multiple scenarios played in sequence.
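To make the binomial concrete, the sketch below computes this theoretical upper bound for a few maze sizes; only 7 × 7 and 55 × 55 are named in the text, so the intermediate sizes are illustrative assumptions:

```python
from math import comb

# Upper bound on the Deep Maze state space: choose 2 cells (player + goal)
# out of the w*h cells in the maze grid.
for size in (7, 13, 25, 55):
    cells = size * size
    print(f"{size}x{size}: {comb(cells, 2):,} states")
```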
4.5 Flappy Bird

Figure 4.12: Flappy Bird: Graphical User Interface
Flappy Bird is a popular mobile phone game developed by Dong Nguyen in May 2013. The game objective is to control a bird by "flapping" its wings to pass pipes, see Figure 4.12. The player is awarded one point for each pipe passed.

Flappy Bird is widely used in RL research and was first introduced in Deep Reinforcement Learning for Flappy Bird [6]. This report shows superhuman agent performance in the game using regular DQN methods.

OpenAI's GYM platform implements Flappy Bird through the PyGame Learning Environment (PLE). It supports both visual and non-visual state representation. The visual representation is an RGB image, while the non-visual information includes vectorized data of the bird's position, velocity, upcoming pipe distance, and position. Figure 4.12 illustrates the visual state representation of Flappy Bird: an RGB image with dimensions 512 × 288. Flappy Bird is an excellent environment for RL and provides adequate validation of the new game environments introduced in this thesis.

Source code: https://github.com/yenchenlin/DeepLearningFlappyBird
Available at: https://github.com/ntasfi/PyGame-Learning-Environment

Chapter 5
Proposed Solutions
Three key solutions are presented in this thesis. The first is an architecture that provides a generic communication interface between the environments and the DRL agents. The second is to apply capsule layers to DQN, enabling research into CapsNet-based RL algorithms. The third is a novel technique for generating artificial training data for DQN models. Together, these components propose a DRL ecosystem suited for research purposes, see Figure 5.1.
Figure 5.1: Proposed Deep Reinforcement Learning ecosystem

5.1 Environments
Figure 5.2: Architecture: gym-cair
OpenAI GYM is an open-source learning platform exposing several game environments to the AI research community. Many games are already available, but these are relatively simple, with easily achievable game objectives. A game environment is registered on the GYM platform by defining a scenario. The scenario predefines the environment settings that determine the complexity. This type of registration is beneficial because it makes it possible to construct multiple scenarios per game environment. An example is the Deep Maze environment, which contains scenarios for deterministic and stochastic gameplay for the different maze sizes.

Figure 5.2 illustrates how the environment ecosystem is designed using OpenAI GYM. Environments are registered on the GYM (1) platform. Deep Line Wars (2), Deep RTS (3), and Deep Maze (4) are then added to a common repository, called gym-cair (5). This repository links together all environments, which can be imported via Python (6).
Algorithm 1 Generic GYM RL routine
1: state_x = gym.reset()
2: terminal = False
3: while not terminal do
4:    env.render()
5:    a = env.action_space.sample()
6:    state_{x+1}, r_{x+1}, terminal, info = env.step(a)
7:    state_x = state_{x+1}
8: end while

The benefit of using GYM is that all environments are constrained to a generic RL interface, seen in Algorithm 1. The environment is initially reset by running the gym.reset function (Line 1). It is assumed that the environment does not start in a terminal state (Line 2). While the environment is not in a terminal state, the agent can perform actions (Lines 5 and 6). This procedure is repeated until the environment reaches the terminal state.

By using this setup, it is far simpler to perform experiments in the proposed environments. It also enables better comparison, because GYM ensures that the environment configuration is not altered while conducting the experiments.
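In runnable form, Algorithm 1 corresponds to the standard GYM loop below, using the classic gym API of the time; CartPole-v0 is used here as a stand-in environment id, and any registered gym-cair scenario would be used the same way:

```python
import gym

env = gym.make("CartPole-v0")   # stand-in for a gym-cair scenario id
state = env.reset()             # Line 1: initial state
terminal = False                # Line 2: assume a non-terminal start

while not terminal:             # Line 3
    env.render()                # Line 4
    action = env.action_space.sample()                      # Line 5: random policy
    next_state, reward, terminal, info = env.step(action)   # Line 6
    state = next_state          # Line 7

env.close()
```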
5.2 Capsule Networks

Table 5.1: CapsNet output shapes and parameter counts for 28 × 28 and 84 × 84 inputs

Layer Name      Output (28 × 28)   Params      Output (84 × 84)   Params
Input           28 × 28 × 1        0           84 × 84 × 1        0
Conv Layer      20 × 20 × 256      20 992      76 × 76 × 256      20 992
Primary Caps    6 × 6 × 256        5 308 672   34 × 34 × 256      5 308 672
Capsule Layer   16 × 16            2 359 296   16 × 16            75 759 616
Output          16                 0           16                 0
Capsule Networks recently illustrated that a shallow ANN could successfully classify the MNIST dataset of digits with state-of-the-art results, using considerably fewer parameters than regular ConvNets. The idea behind CapsNet is to interpret the input by identifying parts of the whole, namely the objects of the input [45]. The objects are identified using capsules that have the responsibility of finding specific objects in the whole. A capsule becomes active when the object it searches for exists.

It becomes significantly harder to use CapsNet in RL. The objective of RL is to identify actions that are sensible to perform in any given state. This means that actions become the parts, and the state becomes the whole. Instead of classifying objects, the capsules now estimate a vector of the likelihood that an action is sensible in the current state.

Several issues need to be solved for CapsNet to work properly in the environments outlined in Chapter 4. The first problem is the input size. The MNIST dataset contains images of 28 × 28 × 1 pixels, whereas the environments from Chapter 4 produce considerably larger state images, which drastically increases the number of parameters in the capsule layers (Table 5.1). Figure 5.3 illustrates how the parameter count grows exponentially with the input size. In an attempt to solve the scalability issue, several convolutional layers can be put in front of the CapsNet. This enables the algorithm to extract feature maps from the original input, but it is crucial not to use any form of pooling prior to the capsule layer: the whole reason to use capsules is that they solve several invariance problems that max-pooling possesses.

Figure 5.4 illustrates how a standard CapsNet is structured, using a single convolutional layer. When a neural network is used, a question is defined to instruct the network to predict an answer. For a simple image classification task, the question is: what do you see in the image? The neural network then tries to answer using its current knowledge: I see a bird. The answer is then revealed to the network, which allows it to tune its response if it answered incorrectly. The same analogy can be used for an RL problem.

The hope is that, despite the scalability issues, it is possible to accurately encode states into the correct capsules for each possible action in the environment. There are several methods to improve the training, but in this thesis only primitive Q-Learning strategies are used.

Figure 5.3: Capsule Networks: Parameter count for different input sizes
Figure 5.4: Capsule Networks: Architecture
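The capsule-layer counts in Table 5.1 can be reproduced with a short calculation. The sketch below assumes the CapsNet of Sabour et al. [45]: a 9 × 9 convolution with stride 1, primary capsules of dimension 8 formed from 256 channels with stride 2, and 16 output capsules of dimension 16, so each primary capsule owns one 8 × 16 transformation matrix per output capsule:

```python
def capsnet_params(input_size, conv_kernel=9, primary_dim=8,
                   out_caps=16, out_dim=16, channels=256):
    conv_out = input_size - conv_kernel + 1           # stride-1 convolution
    primary_out = (conv_out - conv_kernel) // 2 + 1   # stride-2 primary caps
    n_primary = primary_out * primary_out * channels // primary_dim
    # One 8x16 transformation matrix per (primary capsule, output capsule) pair
    caps_params = n_primary * out_caps * primary_dim * out_dim
    return primary_out, caps_params

for size in (28, 84):
    out, params = capsnet_params(size)
    print(f"{size}x{size} input -> primary caps {out}x{out}, "
          f"capsule layer params: {params:,}")
# 28x28 -> 6x6,  2,359,296 parameters;  84x84 -> 34x34, 75,759,616 parameters
```

Both results match Table 5.1, illustrating why the capsule layer, not the convolutions, dominates the growth in parameters.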
5.3 Deep Q-Learning

Table 5.2: Deep Q-Learning models (Model | Paper | Year)

Deep Q-Learning models (it is assumed that all models have a preceding input layer):

Model  Name   Layer 1   Layer 2   Layer 3   Layer 4                 Layer 5
1      DQN    ConvRelu  ConvRelu  ConvRelu  DenseRelu               OutputLinear
2      DRQN   ConvRelu  ConvRelu  ConvRelu  LSTM                    OutputLinear
3      DDQN   ConvRelu  ConvRelu  ConvRelu  DenseRelu / DenseRelu   OutputLinear / OutputLinear
4      DuDQN  Uses 2x DQN, gradual updates from Target to Main
5      CDQN   Identical to DDQN but with a different update strategy
6      DCQN   ConvRelu  ConvRelu  ConvRelu  Capsule                 OutCaps
7      RCQN   ConvRelu  ConvRelu  ConvRelu  Capsule (time-distributed / recurrent)  OutCaps
Table 5.3: Deep Q-Learning architectures

There are many different Deep Q-Learning algorithms available, with different hyper-parameters, network depths, experience replay strategies, and learning rates. The primary problem of DQN is learning stability, and this is shown by the countless configurations found in the literature [7, 14, 16, 36, 37, 43]. Refer to Section 2.5.2 for how the algorithm performs learning of the Q function. Models 1-4 (Table 5.2) are the most commonly used DQN architectures found in the literature. Model 5 shows great potential in continuous environments, comparable to the environments from Chapter 4. Models 6 and 7 are two novel approaches using Capsule Layers in conjunction with Convolution Layers [45, 64]. General knowledge of ANN, DQN, and CapsNet from Chapter 2 is required.

Deep Q-Learning Hyperparameters

Parameter      Value Range
Learning Rate  0.0 → 1.0
Memory Size    0 → ∞
Batch Size     0 → ∞
ε_min          0.0 → 1.0
ε_max          0.0 → 1.0, and ε_max > ε_min
ε_start        ε_start ∈ [ε_min, ε_max]

Table 5.4: Deep Q-Learning hyper-parameters

5.4 Artificial Data Generator

The Artificial Data Generator component from Figure 5.1 is an attempt to shorten the exploration phase in RL. By generating artificial training data, the hope is that DQN models can learn features that were never experienced within the environment. The proposed algorithm could be able to predict future states s_{i+1} given s_i conditioned on the action a in the generator function s_{i+1} = G(s_i | a) [35]. The initial plan was to utilize generative adversarial networks, but these were not able to generate conditioned states successfully. Instead, an architecture called Conditional Convolution Deconvolution Network was developed that uses SGD to update its parameters (Section 5.4).

Conditional Convolution Deconvolution Network (CCDN) is an architecture that tries to predict the consequence of a condition applied to an image. A state is conditioned on an action to predict future game states.

Figure 5.5 illustrates the general idea of the model. The model is designed using two input streams, an image stream and a condition stream. The image stream is a series of convolutional layers followed by a fully-connected layer. The condition stream contains fully-connected layers where the last layer matches the number of neurons in the last layer of the image stream. These streams are then multiplied, followed by a fully-connected layer that encodes the conditioned state. The conditioned state is then reconstructed using deconvolutions. The output layer is the final reconstructed image of the predicted next state given the condition.

The process of training this model is supervised, as it consumes data from the experience replay buffer gathered by RL agents. The model is trained by fetching a memory from the experience replay memory (s_i, a, s_{i+1}), where s_i is the current state, a is the action, and s_{i+1} is the transition T(s_i, a). CCDN generates an artificial state ŝ_{i+1} by using the generator model G(s_i | a, θ). The parameters θ are optimized using SGD, and the loss is calculated using MSE.

MSE = \frac{1}{n} \sum_{i=1}^{n} \left( S_{i+1} - \hat{S}_{i+1} \right)^2 \quad (5.1)

Equation 5.1 (Equation 2.2 with the predicted value y now denoted S) is simple in that the loss decreases as the value of the predicted state Ŝ_{i+1} gets closer to S_{i+1}. Table 5.5 illustrates how states are generated using CCDN.
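To make the two-stream idea concrete, a minimal PyTorch sketch follows. The layer sizes, channel counts, and the 84 × 84 single-channel input are illustrative assumptions, not the thesis's exact configuration; only the overall structure (convolutions, a matching fully-connected condition stream, elementwise multiplication, and deconvolutions trained with MSE and SGD) follows the description above.

```python
import torch
import torch.nn as nn

class CCDN(nn.Module):
    """Sketch of a conditional convolution-deconvolution network."""
    def __init__(self, n_actions, hidden=32):
        super().__init__()
        self.image_stream = nn.Sequential(           # convolutions + dense layer
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 84 -> 42
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 42 -> 21
            nn.Flatten(),
            nn.Linear(32 * 21 * 21, hidden), nn.ReLU(),
        )
        self.condition_stream = nn.Sequential(       # last layer matches image stream
            nn.Linear(n_actions, hidden), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                # deconvolutions reconstruct s'
            nn.Linear(hidden, 32 * 21 * 21), nn.ReLU(),
            nn.Unflatten(1, (32, 21, 21)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 21 -> 42
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),  # 42 -> 84
        )

    def forward(self, state, action_onehot):
        z = self.image_stream(state) * self.condition_stream(action_onehot)
        return self.decoder(z)

# One training step: MSE between predicted and real next states (Equation 5.1),
# with parameters updated by SGD; dummy tensors stand in for replay memories.
model = CCDN(n_actions=4)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
states, next_states = torch.rand(8, 1, 84, 84), torch.rand(8, 1, 84, 84)
actions = torch.eye(4)[torch.randint(4, (8,))]       # one-hot conditions
loss = nn.MSELoss()(model(states, actions), next_states)
opt.zero_grad(); loss.backward(); opt.step()
```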
It is assumed that it is possible to transition between states given an action, in order to create training data. When sufficient training data is collected, the recorded state data is used to estimate future states. In this example, there is a 2 × 2 grid environment with four actions: up, down, left, and right. This yields a theoretical maximum state-space of 4 states with 256 possible transitions (4 actions per cell gives 4⁴ = 256 state and action combinations). When a portion of the state-space is explored through random-play, the CCDN algorithm can train by comparing its predictions against real data. The goal is for the model to converge towards learning the transition function of the environment, so that it can continue generating future states without any interaction with the environment. It is likely that the model is able to converge towards the optimal solution for more than a single time-step into the future.

Figure 5.5: Architecture: CCDN
Table 5.5: Proposed prediction cycle for CCDN, from real states s and transitions T(s, A_right), T(s, A_down) to generated states G(s, A_right) and G(ŝ, A_down)
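The counting in the 2 × 2 example can be checked mechanically; a tiny sketch, where the state and action names are the example's, not an API:

```python
from itertools import product

states = [(x, y) for x in range(2) for y in range(2)]  # 2 x 2 grid -> 4 states
actions = ["up", "down", "left", "right"]

# One action choice per cell gives 4^4 = 256 state-and-action combinations.
combinations = list(product(actions, repeat=len(states)))
print(len(states), len(combinations))  # prints: 4 256
```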
Part III

Experiments and Results

Chapter 6
Conditional Convolution Deconvolution Network
This chapter presents the Conditional Convolution Deconvolution Network (CCDN). The purpose of the CCDN algorithm is to generate artificial training data for Deep Reinforcement Learning algorithms. The data is generated from the game environments introduced in Chapter 4. The goal is to generate high-quality training data that can be used to train algorithms without self-play. The algorithm is used on the following game environments:
• FlashRL: Multitask (Section 4.1),
• Deep Line Wars (Section 4.2),
• Deep Maze (Section 4.4), and
• Flappy Bird (Section 4.5).
Deep RTS (Section 4.3) is excluded from these tests because it does not support image state-representation; this is planned as future work.

A dataset consisting of 100 000 unique state transitions is collected for each environment using random-play strategies. A training set is created consisting of 60 000 transitions (60%), and the remaining 40 000 transitions (40%) are used as a test set. The training of CCDN took approximately 160 hours per game environment when using the hardware listed in Appendix A.
CCDN predicts future states by conditioning the current state on an action. Figure 5.5 illustrates the architecture used in these experiments. To calculate the loss, MSE (Equation 5.1) was used during CCDN training. The model was tested using 32, 64, 128, 256, and 512 neurons in the fully-connected layer. Depending on the neuron count, the model has approximately 13 000 000 to 67 000 000 parameters in total.

It is beneficial to have a significant number of parameters because it allows the model to encode more data. The drawback is that the model uses more memory and takes longer to train. The aim was to train the model for 10 000 epochs within a maximum of 168 hours. For this reason, the algorithm used 32 neurons in the hidden layers, which gave reasonably good results for some environments.

Conditioned actions are not shown in the generated images from these experiments. This is because the precision is still too coarse, and the generated future states are yet too far from the ground truth. Even so, the results are impressive for some environments, and there is a possibility that the generated samples can be used in conjunction with real samples to train DQN models.

6.2 Deep Line Wars

Figure 6.1: CCDN: Deep Line Wars: Training Performance
Deep Line Wars shows excellent results when using CCDN to generate future states ŝ = G(s | a). Table 6.1 illustrates the transition from real states (left side) to generated future states by training CCDN using SGD.

CCDN was not able to converge, but it is possible that this is due to the low neuron count of 32 in the fully-connected layer. Figure 6.1 shows that the loss gradually increased while the accuracy declined. Loss and accuracy do not reflect the generated images seen in Table 6.1. The observed transitions at Day 5–6.5 illustrate realistic transition behavior between states. Observations indicate that CCDN learns input features like:
• background intensity (represents health points),
• possible mouse position (white square),
• possible unit positions, and
• building positions.
The model is still not able to correctly predict the movement of units. This could potentially be solved by stacking several state transitions before predicting future states [6], either using ConvNets or recurrence in neural networks (RNN).
Table 6.1: CCDN: Deep Line Wars, conditioned state transitions after 1 to 6.5 days of training

6.3 Deep Maze
Figure 6.2: CCDN: Deep Maze: Training Performance
Deep Maze should be considered one of the more straightforward environments to generate high-quality training data for, because it has the simplest state-space. From Table 6.2 it is clear that CCDN recognized features like the maze structure early in the training process. Figure 6.2 illustrates that CCDN converged quickly, having a loss near 0 at the 5th epoch of training. High accuracy was reported during training when using MSE as the loss function. By inspecting the produced images manually, it was clear that CCDN did not learn how to predict the position of the player inside the maze. Hallways inside the maze did not show any sensible information about the actual location of the player. Instead, the maze hallways were generated as random noise. There are, however, some parts of the maze that present less noise, indicating that the player did not visit these locations as frequently.
Table 6.2: CCDN: Deep Maze, conditioned state transitions after 1 to 6.5 days of training

6.4 FlashRL: Multitask
Figure 6.3: CCDN: FlashRL: Training Performance
CCDN produced high-quality state transitions when applied to FlashRL: Multitask. Since Multitask is an environment consisting of several different scenes (Menu, Stage 1, Stage 2), it was expected that CCDN would fail to generate sensible output. Table 6.3 illustrates that CCDN was instead able to extract features from all states and map them to the correct actions. The transitions from Day 2.5 and Day 3.5 illustrate a slight change in the paddle tilt and the position of the ball. This shows that the algorithm can, to some extent, understand the game mechanics. In addition to this, CCDN can draw the menu, including a slight change in the mouse position. The results show that CCDN can learn to extract:
• the current scene layout, and
• primitive physics.
Figure 6.3 illustrates that CCDN did not reach more than 5% accuracy at training time even though the loss was close to zero. It is not clear what is causing the training instability, because measuring the loss of the images manually using MSE gave far better accuracy for most training samples. The results indicate that CCDN did indeed learn to extract features from the Multitask environment.
Table 6.3: CCDN: FlashRL: Multitask, conditioned state transitions after 1 to 6.5 days of training

6.5 Flappy Bird
Figure 6.4: CCDN: Flappy Bird: Training Performance
Table 6.4 illustrates the generated transitions for the third-party game Flappy Bird. Figure 6.4 shows that CCDN has a gradual decrease in the loss while the accuracy increases to approximately 35%. Flappy Bird has the highest accuracy of the tested game environments, but observations show that CCDN is only able to generate noise. The reason is that Flappy Bird has a scrolling background, meaning that CCDN must encode a lot more data than in the other environments. Because of this, CCDN could not determine how to generate future state representations for this game.

It is expected that this problem could be mitigated by performing data preprocessing. The literature indicates that RL algorithms often strip away the background to simplify the game-state [13]. Also, it is likely that CCDN could successfully encode Flappy Bird with additional parameters, but this would increase the training time to several weeks.
Table 6.4: CCDN: Flappy Bird, conditioned states after 1 to 6 days of training

6.6 Summary
CCDN is a novel algorithm suited for generating artificial training data for RL algorithms and shows great potential for some environments. The results indicate that CCDN has issues in game environments with a sparse state-space representation. Flappy Bird illustrates the problem well, because CCDN generates noise instead of future states for action and state pairs. One method to combat this problem may be to increase the neuron count of the fully-connected layer in the CCDN model.

ANN-based algorithms frequently suffer from training instability. The results show that the CCDN algorithm was not able to accurately determine the loss using regular MSE. This could potentially be the cause of the training instability, because the optimizer would not be able to determine how well it is doing when updating the network parameters. It is likely that replacing the MSE loss function could improve the generated images drastically.

The results presented in this chapter show excellent potential in using CCDN for the generation of artificial training data for game environments. It shows excellent performance in Deep Line Wars and FlashRL: Multitask and could potentially reduce the required amount of exploration in RL algorithms.

Chapter 7
Deep Q-Learning
This chapter presents experimental results of the research done using Deep Q-Learning with CapsNet and ConvNet based models. The goal is to use CapsNet in Deep Q-Learning to solve the environments from Chapter 4.

RL algorithms are known to be computationally intensive and are thus difficult to train for environments with large state-spaces [62]. Models are trained using the hardware specified in Appendix A. Chapter 5 proposed 7 DQN architectures that could potentially control an agent well within the environments. Models 1 and 6 from Table 5.2 were selected as the primary research area to limit the scope of this thesis, since training all 7 models in all 5 environments was not feasible in the available time. To increase training stability, the hyper-parameters from Table 5.4 are tuned further per environment. The datasets are populated with 20% artificial training data generated by CCDN. Table 7.1 lists the updated hyper-parameters that performed best when experimenting with CapsNet and ConvNet based models. The DQN models use SGD to optimize their parameters. Initial training data is sampled using random-play strategies, gradually moving into exploitation using ε-greedy.

Experiments conducted in this thesis are available at http://github.com/UIA-CAIR.
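A minimal sketch of the ε-greedy schedule follows. The selection rule is standard; the decay form is an assumption, reading the 0.005 ε-decay of Table 7.1 below as a multiplicative per-episode decay from ε_max towards ε_min.

```python
import random

def select_action(q_values, epsilon, n_actions):
    """epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                        # exploration
    return max(range(n_actions), key=lambda a: q_values[a])       # exploitation

# Assumed decay schedule: shrink epsilon by 0.5% per episode, floored at eps_min.
eps, eps_min, decay = 1.0, 0.1, 0.005
for episode in range(1000):
    # ... run one episode, choosing actions with select_action(...) ...
    eps = max(eps_min, eps * (1.0 - decay))
```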
Environment          α     γ     ε-decay  Batch Size  Dataset Size
Deep Line Wars       3e-5  0.98  0.005    16          1M
Deep Maze            3e-5  0.98  0.005    16          1M
FlashRL: Multitask   1e-4  0.98  0.005    16          1M
Deep RTS             3e-5  0.98  0.005    16          1M
Flappy Bird          2e-4  0.98  0.005    16          1M

Table 7.1: DQN: Hyper-parameters

7.1 Experiments

Figure 7.1: DQN-CapsNet: Deep Line Wars (loss, win percent, and action frequency)
Figure 7.2: DQN-ConvNet: Deep Line Wars (loss, win percent, and action frequency)
Figure 7.3: DQN-CapsNet: Deep RTS (loss, win percent, and action frequency)
Figure 7.4: DQN-ConvNet: Deep RTS (loss, win percent, and action frequency)
Figure 7.5: DQN-CapsNet: Deep Maze (loss, total reward, and action frequency)
Figure 7.6: DQN-ConvNet: Deep Maze (loss, total reward, and action frequency)
Figure 7.7: DQN-CapsNet: Flappy Bird (loss, total reward, and action frequency)
Figure 7.8: DQN-ConvNet: Flappy Bird (loss, total reward, and action frequency)

7.2 Deep Line Wars
For Deep Line Wars, both agents illustrated relatively strong capabilities when it comes to exploiting game mechanics and finding the opponent's weakness. The opponent is a random-play agent that builds an uneven defense and sends units without any economic considerations. Figure 7.1 and Figure 7.2 show that both agents find the opponent's weakness to be its defense.

Figure 7.9: DQN-CapsNet: Agent building defenses due to low health in Deep Line Wars

The results show that the game mechanics are not balanced, making Purchase Unit 1 the obvious choice for offensive actions. This unit is strong enough to survive most defenses and does the most damage to the opponent's health pool. The ConvNet agent performs better over a period of 100 episodes, and both agents can master the random-play opponent.

7.3 Deep RTS
Deep RTS shows exciting results, where DQN-CapsNet starts at a low loss with a high total reward, then slowly diverges in reward and loss. The results show that DQN-CapsNet and DQN-ConvNet perform comparably. It is not clear why DQN-CapsNet diverged, but the large action-space is a likely candidate. It is difficult to see any sense in the determination of the action-state mapping, but some observations indicated that the agent favors gathering instead of military actions.

7.4 Deep Maze
The goal of Deep Maze is to find the shortest path from start to goal in a 25 × 25 labyrinth. Figure 7.5 and Figure 7.6 show that DQN-CapsNet had issues with training stability. The algorithm was tested with several different hyper-parameter configurations, but none remedied this. DQN-ConvNet did not indicate any issues during training. Both agents had issues finding the shortest path; judging by the total reward, both agents ended with a negative score. For each move done after reaching the optimal move count, a negative reward is given to the agent. Observations show that both agents have similar performance in this experiment.

Figure 7.10: DQN-CapsNet: Agent attempting to find the shortest path in a 25 × 25 Deep Maze

Figure 7.10 illustrates an in-game image of the 25 × 25 map used in this experiment. The green square is the start area, while the red is the goal. The optimal path for this experiment is a series of 21 actions.

7.5 FlashRL: Multitask
For FlashRL: Multitask, DQN-CapsNet was not able to compete with DQN-ConvNet. It was not able to learn how to control the first paddle. The results are for this reason not included for this environment. Refer to Publication B for results using DQN-ConvNet.

7.6 Flappy Bird
Flappy Bird is a difficult environment for an agent to master because the state-space is large due to the scrolling background. In the literature, the training time for this environment is between one and four days. For this experiment, the agents trained for seven days, in the hope that both would converge. Figure 7.7 and Figure 7.8 show that both agents performed well, with DQN-ConvNet scoring 0.4 points higher. For each pipe the bird passes, 0.1 points are awarded to the total reward.

7.7 Summary
Looking at the results, it is clear that DQN-CapsNet overestimates actions in almost all environments. Instead of having a sensible distribution of actions, it often chooses to favor a particular move after a short period of training.

Recent state-of-the-art suggests that self-play using dueling methods may increase stability and performance in the long term [43], but this was not possible due to GPU memory limitations. It is clear that DQN-CapsNet can work for other tasks than image recognition, but there are still many challenges to solve before it can perform comparably to DQN-ConvNet. A significant issue is that Capsules do not scale well with several outputs (actions), resulting in a model that quickly becomes too large for the GPU memory to handle. The upcoming paragraphs summarize the findings of the experiments conducted using DQN-CapsNet and DQN-ConvNet.
Training Loss
An interesting observation during training was that none of the models had a gradual decline in loss. This may be because the state-space was quite large for all environments in the test-bed. Some investigation revealed that environments with sparse input had a more significant loss increase. By comparing Figure 7.1 and Figure 7.3, it is clear that Deep RTS has a far higher loss than Deep Line Wars when predicting the best Q-Value for an action. Since CapsNet primarily detects objects, it is likely that the sudden jumps in loss (Figure 7.5) can be explained by several capsules changing their prediction vectors at the same time. A possible improvement would be to decay the learning rate throughout the training period. It is likely that the training loss issues can be managed with new hyper-parameter configurations.
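A learning-rate decay like the one suggested above could take many forms; a minimal sketch of exponential decay, where the decay rate and step interval are assumed constants rather than values from the thesis:

```python
def decayed_lr(lr0, step, decay_rate=0.96, decay_steps=1000):
    """Exponential learning-rate decay: lr0 * decay_rate^(step / decay_steps)."""
    return lr0 * decay_rate ** (step / decay_steps)

# Example: the 3e-5 rate from Table 7.1 shrinks to roughly 1.1e-5
# after 25 000 training steps under this schedule.
print(decayed_lr(3e-5, 25_000))
```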
Action Frequency
Results show that CapsNet tends to overestimate actions drastically in environments with few actions (Deep Maze and Flappy Bird). It is possible that this is because a Capsule looks for parts in the whole. Since CapsNet is positionally invariant, one explanation may be that the model classifies states by looking for the existence of an object, instead of the likelihood of the best action. For Flappy Bird, the model determines that the agent should use Flap as long as the bird exists in the input. For environments with large action-spaces, observations show a more consistent action frequency.
Environment          Random  DQN-CapsNet  DQN-ConvNet
Deep Line Wars       50      57           78
Deep RTS             1.4     5.0          5.1
Deep Maze            -600    -275         -225
FlashRL: Multitask   14      N/A          300
Flappy Bird          1.4     7.9          8.3

Table 7.2: Comparison of DQN-CapsNet, DQN-ConvNet, and random-play accumulated reward (higher is better)
Agent Performance
Table 7.2 shows that DQN-CapsNet does indeed perform above random-play agents in the selected environments, but falls behind DQN-ConvNet. For all environments, a higher score is better. In Deep Line Wars, the reward increases as the agent keeps surviving the game or defeats the enemy. The CapsNet agent has approximately a 57% win chance, while the ConvNet agent wins 78% of the games against a random-play agent.

In Deep RTS, the accumulated score is measured during the first 600 seconds of the game. This is typically resource harvesting, as the agent was never able to create long-term strategies. In early training, CapsNet accumulated far more resources than ConvNet, but this gradually declined during training. This means that the model diverged from the optimal solution. It is likely that this is because the model starts to overestimate action Q-values. In comparison, the results show that both models perform comparably while performing well beyond the capability of random-play agents.

In Deep Maze, none of the agents were able to find the optimal path to the goal. Additional experiments were conducted and showed better results for smaller mazes (9 × 9 and 11 × 11). For 25 × 25, the CapsNet agent used on average 275 additional actions to reach the goal, while the ConvNet agent performed marginally better, using 225 additional actions.

The CapsNet agent is able to perform well in Flappy Bird. With only 0.4 points less than ConvNet, it is clear that both agents perform at the same level of expertise. It is possible that the CapsNet agent could achieve far better results if a solution is found for the Q-Value overestimation problem.

Chapter 8
Conclusion and Future Work
This thesis conclusively shows that Capsules are viable to use in advanced game environments. It is further shown that capsules do not scale as well as convolutions, implying that capsule networks alone will not be able to play even more advanced games without improved scalability.
This thesis has focused on Deep Reinforcement Learning using Capsules in Advanced Game Environments. This work presents several new game environments that are tailored for research into RL algorithms in the RTS genre. This contribution could potentially lead to ground-breaking performance in advanced game environments, enabling RL agents to perform well in games like Starcraft II. The combination of Capsule Networks and Deep Q-Learning illustrated results comparable to regular ConvNets, with regard to stability, on the new learning platform. As a secondary goal, a generative model, CCDN, was implemented, which successfully generates future state representations in the majority of the test environments.

Since Capsule Networks are a novel research area still in its infancy, more research is required to determine their capabilities in RL for advanced game environments. This chapter presents the thesis conclusion and future work for the continuation of a PhD thesis in DRL.

8.1 Conclusion
Hypothesis 1:
Generative modeling using deep learning is capable of generating artificial training data for games with sufficient quality.
Our work shows that it is indeed possible to generate artificial training data using deep learning, and that the data is of sufficient quality to perform off-line training of deep neural networks.
Hypothesis 2:
CapsNet can be used in Deep Q-Learning with comparable performance to ConvNet-based models.
The research shows that CapsNet can be directly adapted to work with Deep Q-Learning, but the stability is inferior to regular ConvNets. Some experiments show results comparable to ConvNets, but it is not clear how CapsNets reason in an RL environment.
Goal 1:
Investigate the state-of-the-art research in the field of Deep Learning, and learn how Capsule Networks function internally.
A thorough survey of the state-of-the-art in deep learning was outlined in Chapter 3. Much of the performed work was inspired by previous research, which enabled several exciting discoveries in RL. The results show that it is possible to combine CapsNet with other algorithms.
Goal 2:
Design and develop game environments that can be used for research into RL agents for the RTS game genre.
The thesis outlines four new game environments that target research into RL agents for RTS games.

Deep RTS is a Warcraft II clone that is suited for agents of high-quality play. It requires the agent to perform actions in a high-dimensional environment that is continuously moving. Since the Deep RTS state is of such high dimension, it is still not feasible to master this environment. Deep Line Wars was created to enable research on a simpler scale; this enabled research into some of the RTS aspects found in Deep RTS. To simplify things even further, Deep Maze was created to only account for trivial state interpretations. Flash RL was created as a side project, enabling research into a vast library of Flash games.

Together, these game environments create a platform that allows for in-depth research into RL problems in the RTS game genre.
Goal 3:
Research generative modeling and implement an experimental architecture for generating artificial training data for games.

CCDN is introduced as a novel architecture for generating artificial future states of a game, using the present state and action to learn the transition function of an environment. Early experimental results are presented in this work, showing that it has the potential to successfully train a neural network based model.
Goal 4:
Research the novel CapsNet architecture for MNIST classification and apply this to RL problems.
Section 5.2 outlines the research into CapsNet in scenarios that are different from the MNIST experiments conducted by Sabour et al. [45]. The objective of Capsules is redefined so that they can work for RL-related problems.
Goal 5:
Combine Deep Q-Learning and CapsNet and perform experiments on the environments from Goal 2.
In Chapter 7, DQN and CapsNet were successfully combined, illustrating that the combination has the potential to perform well in several advanced game environments. Although these results only show minor agent intelligence, they are an excellent beginning for further research into this type of deep model.
Goal 6:
Combine the elements of Goal 3 and Goal 5. The goal is to train an RL agent with artificial training data successfully.
Results show that training data generated with CCDN can be used in conjunction with real data to train a DQN algorithm successfully.

All of the goals defined in the scope of this thesis were accomplished. Although the results are not astounding for all goals, they enable further research into several new deep learning fields. The work presented in this thesis enables further research into CapsNet-based RL in advanced game environments. Because of the new learning platform, researchers can better perform research into RTS games. It is possible that the work from this thesis could be the foundation for novel RL algorithms in the future.

8.2 Future Work
Environments
1. Continue work on Flash RL, enabling it to replace OpenAI Universe Flash.
2. Propose a partnership with ELF and implement Deep RTS and Deep Line Wars into ELF.
3. Develop a full-fledged platform that expands beyond gym-cair.
4. Implement image state-representation for Deep RTS.

Generative Modeling
1. Additional experiments with hyper-parameters for the existing models.
2. Attempt to stabilize training.
3. Investigate if it is possible to use adversarial methods to train the generative model.
4. Identify and solve the issue with the loss function in CCDN.
Deep Capsule Q-Learning
1. Improve the stability of the current architecture, enabling less data preprocessing for the algorithm to function.
2. Improve the scalability of Capsules for large action spaces.
3. Do additional experiments with multiple configurations to find the cause of the training instability.
4. More research into combining Capsules with RL algorithms.
Planned Publications
1. Deep RTS: A Real-time Strategy game for Reinforcement Learning.
2. CCDN: Towards infinite training data using generative models.
3. DCQN: Using Capsules in Deep Q-Learning.

ELF source-code: https://github.com/facebookresearch/ELF
Proposed publication titles may change in final versions.

References

[1] Per-Arne Andersen, Morten Goodwin, and Ole-Christoffer Granmo. FlashRL: A Reinforcement Learning Platform for Flash Games. Norsk Informatikkonferanse, 2017.
[2] Per-Arne Andersen, Morten Goodwin, and Ole-Christoffer Granmo. Towards a deep reinforcement learning approach for tower line wars. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 10630 LNAI, pages 101–114, 2017.
[3] Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. DeepMind Lab. dec 2016.
[4] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. IJCAI International Joint Conference on Artificial Intelligence, 2015-January:4148–4152, 2015.
[5] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. jun 2016.
[6] Kevin Chen. Deep Reinforcement Learning for Flappy Bird. page 6, 2015.
[7] Wenliang Chen, Min Zhang, Yue Zhang, and Xiangyu Duan. Exploiting meta features for dependency parsing and part-of-speech tagging. Artificial Intelligence, 230:173–191, sep 2016.
[8] Gianfranco Ciardo and Andrew S. Miner. Storage alternatives for large structured state spaces. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 1245, pages 44–57, 1997.
[9] Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. Towards Diverse and Natural Image Descriptions via a Conditional GAN. mar 2017.
[10] Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri, and Mitsuo Kawato. Multiple model-based reinforcement learning. Neural Computation, 14(6):1347–1369, 2002.
[11] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action Elimination and Stopping Conditions for Reinforcement Learning. ICML, 7:1079–1105, 2003.
[12] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, and Tsuhan Chen. Recent advances in convolutional neural networks. Pattern Recognition, dec 2017.
[13] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates, 2017.
[14] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous Deep Q-Learning with Model-based Acceleration. mar 2016.
[15] Abhishek Gupta, Clemens Eppner, Sergey Levine, and Pieter Abbeel. Learning dexterous manipulation for a soft robotic hand from human demonstrations. In IEEE International Conference on Intelligent Robots and Systems, volume 2016-November, pages 3786–3793, 2016.
[16] Matthew Hausknecht and Peter Stone. Deep Recurrent Q-Learning for Partially Observable MDPs. jul 2015.
[17] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18(7):1527–1554, 2006.
[18] Harold L. Hunt II and Jon Turney. Cygwin/X Contributor's Guide. 2004.
[19] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2017.
[20] Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The Malmo platform for artificial intelligence experimentation. IJCAI International Joint Conference on Artificial Intelligence, 2016-January:4246–4247, 2016.
[21] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
[22] I. Kanter, Y. LeCun, and S. Solla. Second-order properties of error surfaces: learning time and generalization. Advances in Neural Information Processing Systems (NIPS 1990), 3:918–924, 1991.
[23] Shohei Kinoshita, Takahiro Ogawa, and Miki Haseyama. LDA-based music recommendation with CF-based similar user selection. pages 215–216, jun 2016.
[24] George Konidaris and Andrew G. Barto. Autonomous shaping. International Conference on Machine Learning, pages 489–496, 2006.
[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks, 2012.
[26] Ben J. A. Kröse. Learning from delayed rewards. PhD thesis, King's College, Cambridge, UK, 1995.
[27] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4):541–551, dec 1989.
[28] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323, 1998.
[29] Chen-Yu Lee, Patrick W. Gallagher, and Zhuowen Tu. Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree. sep 2015.
[30] Yuxi Li. Deep Reinforcement Learning: An Overview. arXiv, pages 1–30, 2017.
[31] L. J. Lin. Reinforcement Learning for Robots Using Neural Networks. Report, CMU, pages 1–155, 1993.
[32] Björn Lindström, Ida Selbing, Tanaz Molapour, and Andreas Olsson. Racial Bias Shapes Social Reinforcement Learning. Psychological Science, 25(3):711–719, feb 2014.
[33] Chunhui Liu, Aayush Bansal, Victor Fragoso, and Deva Ramanan. Do Convolutional Neural Networks act as Compositional Nearest Neighbors? arXiv, pages 1–15, 2017.
[34] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J. Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, Dharshan Kumaran, and Raia Hadsell. Learning to Navigate in Complex Environments. nov 2016.
[35] Mehdi Mirza and Simon Osindero. Conditional Generative Adversarial Nets. nov 2014.
[36] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with Deep Reinforcement Learning. dec 2013.
[37] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, feb 2015.
[38] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker. jan 2017.
[39] Fernando Naclerio, Marco Seijo-Bujia, Eneko Larumbe-Zabala, and Conrad P. Earnest. Carbohydrates alone or mixing with beef or whey protein promote similar training outcomes in resistance training males: A double-blind, randomized controlled clinical trial. International Journal of Sport Nutrition and Exercise Metabolism, 27(5):408–420, 2017.
[40] Bruno A. Olshausen and David J. Field. Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4):481–487, sep 2004.
[41] Etienne Perot, Maximilian Jaritz, Marin Toromanoff, and Raoul De Charette. End-to-End Driving in a Realistic Racing Game with Deep Reinforcement Learning. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2017-July:474–475, may 2017.
[42] Laurent Praly and Yuan Wang. Stabilization in spite of matched unmodeled dynamics and an equivalent definition of input-to-state stability. Mathematics of Control, Signals, and Systems, 9(1):1–33, 1996.
[43] Carlos Ramirez-Perez and Victor Ramos. SDN meets SDR in self-organizing networks: Fitting the pieces of network management. IEEE Communications Magazine, 54(1):48–57, nov 2016.
[44] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, oct 1986.
[45] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic Routing Between Capsules. NIPS, oct 2017.
[46] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training GANs. jun 2016.
[47] Wojciech Samek, Alexander Binder, Gregoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. Evaluating the Visualization of What a Deep Neural Network Has Learned. IEEE Transactions on Neural Networks and Learning Systems, sep 2016.
[48] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 6354 LNCS, pages 92–101, 2010.
[49] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[50] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. dec 2017.
[51] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George Van Den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
[52] Richard S. Sutton and Andrew G. Barto. Time-Derivative Models of Pavlovian Reinforcement. Learning and Computational Neuroscience: Foundations of Adaptive Networks, (Mowrer 1960):497–537, 1990.
[53] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 9. MIT Press, 1998.
[54] W. W. Swart, C. E. Gearing, and T. Var. A dynamic programming–integer programming algorithm for allocating touristic investments, volume 27. 1972.
[55] Techtonik. python-vnc-viewer. https://github.com/techtonik/python-vnc-viewer, 2015.
[56] Gerald Tesauro. TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play. Neural Computation, 6(2):215–219, 1994.
[57] Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
[58] Yuandong Tian, Qucheng Gong, Wenling Shang, Yuxin Wu, and C. Lawrence Zitnick. ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games. jul 2017.
[59] Harm van Seijen, Mehdi Fatemi, Joshua Romoff, Romain Laroche, Tavian Barnes, and Jeffrey Tsang. Hybrid Reward Architecture for Reinforcement Learning. jun 2017.
[60] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, John Quan, Stephen Gaffney, Stig Petersen, Karen Simonyan, Tom Schaul, Hado van Hasselt, David Silver, Timothy Lillicrap, Kevin Calderone, Paul Keet, Anthony Brunasso, David Lawrence, Anders Ekermo, Jacob Repp, and Rodney Tsing. StarCraft II: A New Challenge for Reinforcement Learning. aug 2017.
[61] Fang Wan and Chaoyang Song. Logical Learning Through a Hybrid Neural Network with Auxiliary Inputs. may 2017.
[62] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1386–1393, apr 2014.
[63] Songtao Wu, Shenghua Zhong, and Yan Liu. Deep residual learning for image steganalysis. Multimedia Tools and Applications, pages 1–17, dec 2017.
[64] Edgar Xi, Selina Bing, and Yang Jin. Capsule Network Performance on Complex Data. dec 2017.
[65] Yann LeCun. Efficient backprop. Neural networks: tricks of the trade, 53(9):1689–1699, 1998.

Appendices
A Hardware Specification
Operating System  Ubuntu 17.10
Processor         Intel i7-7700K
Memory            64GB DDR4
Graphics          1x NVIDIA GeForce 1080 Ti

Part IV
Publications

Appendix A
Towards a Deep Reinforcement Learning Approach for Tower Line Wars

Per-Arne Andersen, Morten Goodwin, and Ole-Christoffer Granmo
University of Agder, Grimstad, Norway
[email protected]
Abstract.
There have been numerous breakthroughs with reinforcement learning in recent years, perhaps most notably on Deep Reinforcement Learning successfully playing and winning relatively advanced computer games. There is undoubtedly an anticipation that Deep Reinforcement Learning will play a major role when the first AI masters the complicated game plays needed to beat a professional Real-Time Strategy game player. For this to be possible, there needs to be a game environment that targets and fosters AI research, and specifically Deep Reinforcement Learning. Some game environments already exist; however, these are either overly simplistic, such as Atari 2600, or complex, such as Starcraft II from Blizzard Entertainment.

We propose a game environment in between Atari 2600 and Starcraft II, particularly targeting Deep Reinforcement Learning algorithm research. The environment is a variant of Tower Line Wars from Warcraft III, Blizzard Entertainment. Further, as a proof of concept that the environment can harbor Deep Reinforcement algorithms, we propose and apply a Deep Q-Reinforcement architecture. The architecture simplifies the state space so that it is applicable to Q-learning, and in turn improves performance compared to current state-of-the-art methods. Our experiments show that the proposed architecture can learn to play the environment well, and score 33% better than standard Deep Q-learning, which in turn proves the usefulness of the game environment.
Keywords:
Reinforcement Learning · Q-Learning · Deep Learning · Game environment
Despite many advances in AI for games, no universal reinforcement learning algorithm can be applied to Real-Time Strategy (RTS) games without data manipulation or customization. This includes traditional games such as Warcraft III, Starcraft II, and Tower Line Wars. Reinforcement Learning (RL) has been applied to simpler games, such as games for the Atari 2600 platform, but has, to the best of our knowledge, not successfully been applied to RTS games. Further, existing game environments that target AI research are either overly simplistic, such as Atari 2600, or complex, such as Starcraft II.

© Springer International Publishing AG 2017. M. Bramer and M. Petridis (Eds.): SGAI-AI 2017, LNAI 10630, pp. 101–114, 2017. https://doi.org/10.1007/978-3-319-71078-5_
Reinforcement Learning has had tremendous progress in recent years in learning to control agents from high-dimensional sensory inputs like vision. In simple environments, this has been proven to work well [1], but it is still an issue for complex environments with large state and action spaces [2]. In games where the objective is easily observable, there is a short distance between action and reward, which fuels the learning. This is because the consequence of any action is quickly observed and then easily learned. When the objective is more complicated, the game objectives still need to be mapped to the reward function, but this becomes far less trivial. For the Atari 2600 game Ms. Pac-Man this was solved through a hybrid reward architecture that transforms the objective to a low-dimensional representation [3]. Similarly, OpenAI's bot is able to beat the world's top professionals at 1v1 in DotA 2. It uses reinforcement learning while it plays against itself, learning to predict the opponent's moves.

Real-Time Strategy games, including Warcraft III, are a genre of games much more comparable to the complexity of real-world environments. They have a sparse state space with many different sensory inputs that any game-playing algorithm must be able to master in order to perform well within the environment. Due to the complexity, and because many action sequences are required to constitute a reward, standard reinforcement learning techniques including Q-learning are not able to master the games successfully.

This paper introduces a two-player version of the popular Tower Line Wars modification from the game Warcraft III. We refer to this variant as Deep Line Wars. Note that Tower Line Wars is not an RTS game, but it has many similar elements, such as time-delayed objectives, resource management, and offensive and defensive strategy planning. To prove that the environment is working we, inspired by recent advances from van Seijen et al. [3], apply a method of separating the abstract reward function of the environment into smaller rewards. This approach uses a Deep Q-Network with a Convolutional Neural Network to map actions to states, and can play the game successfully and perform better than standard Deep Q-learning by 33%.

The rest of the paper is organized as follows: We first investigate recent discoveries in Deep RL in Sect. 2. We then briefly outline how Q-Learning works and how we interpret Bellman's equation for utilizing Neural Networks as a function approximator in Sect. 3. We present our contribution in Sect. 4 and present a comparison of other game environments that are widely used in reinforcement learning. We introduce a variant of Deep Q-Learning in Sect. 5 and present a comparison to other RL models used in state-of-the-art research. Finally, we show results in Sect. 6, define a roadmap of future work in Sect. 7, and conclude our work in Sect. 8.

There have been several breakthroughs related to reinforcement learning performance in recent years [4]. Q-Learning together with Deep Learning was a game-changing moment, and has had tremendous success in many single-agent
There have been several breakthroughs related to reinforcement learning per-formance in recent years [4]. Q-Learning together with Deep Learning was agame-changing moment, and has had tremendous success in many single agent owards a Deep Reinforcement Learning Approach for Tower Line Wars 103 environments on the Atari 2600 platform [1]. Deep Q-Learning as proposed byMnih et al. [1] as shown in Fig. 1 used a neural network as a function approxi-mator and outperformed human expertise in over half of the games [1].
Fig. 1. Deep Q-Learning architecture
van Hasselt et al. proposed Double DQN, which reduced the overestimation of action values in the Deep Q-Network [5]. This led to improvements in some of the games on the Atari platform. Wang et al. then proposed a dueling architecture of DQN, which introduced estimation of the value function and the advantage function [6]. These two functions are then combined to obtain the Q-Value. Dueling DQN was implemented together with the previous work of van Hasselt et al. [6].

Harm van Seijen et al. recently published an algorithm called Hybrid Reward Architecture (HRA), which is a divide-and-conquer method where several agents estimate a reward and a Q-value for each state [3]. The algorithm performed above human expertise in Ms. Pac-Man, which is considered one of the hardest games in the Atari 2600 collection, and is currently state-of-the-art in the reinforcement learning domain [3]. The drawback of this algorithm is that the generalization of Mnih et al.'s approach is lost due to a huge number of separate agents that have domain-specific sensory input.

There have been few attempts at using Deep Q-Learning on advanced simulators specifically made for machine learning. It is probable that this is because there are very few environments created for this purpose.
Reinforcement learning can be considered a hybrid between supervised and unsupervised learning. We implement what we call an agent that acts in our environment. This agent is placed in the unknown environment, where it tries to maximize the environmental reward [7].

A Markov Decision Process (MDP) is a mathematical method of modeling decision-making within an environment. We often use this method when utilizing model-based RL algorithms. In Q-Learning, we do not try to model the
MDP. Instead, we try to learn the optimal policy by estimating the action-value function Q*(s, a), yielding the maximum expected reward in state s when executing action a. The optimal policy can then be found by

\pi(s) = \operatorname{argmax}_a Q^*(s, a) \quad (1)

This is derived from Bellman's equation, because we can consider the utility function U(s) = max_a Q(s, a) to be true. This gives us the ability to derive the following update rule from Bellman's work:

Q(s, a) \leftarrow Q(s, a) + \underbrace{\alpha}_{\text{Learning Rate}} \Big[ \underbrace{R(s)}_{\text{Reward}} + \underbrace{\gamma}_{\text{Discount}} \underbrace{\max_{a'} Q(s', a')}_{\text{New Estimate}} - \underbrace{Q(s, a)}_{\text{Old Estimate}} \Big] \quad (2)

This is an iterative process of propagating back the estimated Q-value for each discrete time-step in the environment. It is guaranteed to converge towards the optimal action-value function, Q_i → Q* as i → ∞ [1, 7]. At the most basic level, Q-Learning utilizes a table for storing (s, a, r, s') pairs. We can instead use a non-linear function approximator to approximate Q(s, a; θ), where θ describes the tunable parameters of the approximator. Artificial Neural Networks (ANN) are a popular function approximator, but training using an ANN is relatively unstable. We define the loss function as follows:

L(\theta_i) = \mathbb{E}\Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta_i) - Q(s, a; \theta_i) \big)^2 \Big] \quad (3)

As we can see, this equation uses the Bellman equation to calculate the loss for the gradient descent. To combat training instability, we use Experience Replay. This is a memory module which stores memories from experienced states and draws a uniform distribution of experiences to train the network [1]. This is what we call a
Deep Q-Network, described here in its most primitive form. See related work for recent advancements in DQN.
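Equation (2) and the uniform replay sampling translate directly into a few lines of code. A tabular sketch follows, where the α and γ values are assumed constants rather than values from the paper:

```python
import random
from collections import defaultdict

Q = defaultdict(float)      # Q-table over (state, action) pairs, initialized to 0
alpha, gamma = 0.1, 0.99    # learning rate and discount (assumed values)
actions = range(4)
replay = []                 # experience replay: stores (s, a, r, s') tuples

def q_update(s, a, reward, s_next):
    """One application of the update rule in Eq. (2)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)       # max_a' Q(s', a')
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

def train_from_replay(batch_size=32):
    """Draw a uniform sample of stored experiences and replay the update."""
    for s, a, r, s_next in random.sample(replay, min(batch_size, len(replay))):
        q_update(s, a, r, s_next)
```

A Deep Q-Network replaces the table with a neural network trained on the loss in Eq. (3), but the replay mechanism and the target term are the same.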
For a player to play RTS games well, he typically needs to master highly difficult strategies. Most RTS strategies incorporate
– build strategies,
– economy management,
– defense evaluation, and
– offense evaluation.
These objectives are easy to master when separated but become hard to perfect together. Starcraft II is one of the most popular RTS games, but due to its complexity, it is not expected that an AI-based system can beat this game anytime soon. At the very least, state-of-the-art Deep Q-Learning is not directly applicable. Blizzard Entertainment and Google DeepMind have collaborated on
Fig. 2.
Properties of selected game environments
Collected data in Fig. 2 argues that games that have been solved by the current state-of-the-art are usually non-stochastic and fully observable. Also, current AI prefers environments which are not simultaneous, meaning they can be paused between each state transition. This makes sense because hardware still limits advances in AI.
By doing rough estimations of the state-space of the game environments from Fig. 2, it is clear that the state-of-the-art has made a big leap in recent years, with the most recent contribution being Ms. Pac-Man [3]. However, when computing the state-space of a regular Starcraft II map, only taking unit compositions into account, the 128 × 128 = 16 384 map cells alone yield an astronomically large number of configurations [11].
Fig. 3. State-space complexity of selected game environments
The predicament is that the difference in complexity between Ms. Pac-Man and Starcraft II is tremendous. Figure 3 illustrates a relative and subjective comparison between the state-complexity of relevant game environments. State-space complexity describes approximately how many different game configurations a game can have. It is based on map size, unit position, and unit actions. The comparison is a bit arbitrary because the games are complex in different manners. However, there is no doubt that the distance between Ms. Pac-Man, perhaps the most advanced computer game mastered so far, and Starcraft II is colossal. To advance AI solutions towards Starcraft II, we argue that there is a need for several new game environments that exceed the complexity of existing games and challenge researchers on multi-agent issues closely related to Starcraft II [12]. We therefore introduce Deep Line Wars as a two-player variant of Tower Line Wars. Deep Line Wars is a game simulator aimed at filling the gap between Atari 2600 and Starcraft II. It features the most important aspects of an RTS game.

The objective of this game is, as seen in Fig. 4, to invade the opposing player with units until all health is consumed. The opposing player's health is reduced for each friendly unit that enters the red area of the map. A unit spawns at a random location on the red line of the controlling player's side and automatically
Fig. 4.
Graphical interface of Deep Line Wars walks towards the enemy base. To protect your base against units, the player canbuild towers which shoot projectiles at enemy units. When an enemy unit dies,a fair percentage of the unit value is given to the player. When a player sends aunit, the income variable is increased by a defined percentage of the unit value.Players gold are increased at regular intervals determined in the configurationfiles. To master Deep Line Wars, the player must learn following skill-set:– offensive strategies of spawning units,– defending against the opposing player’s invasions, and– maintain a healthy balance between offensive and defensive in order to max-imize incomeand is guaranteed a victory if mastered better than the opposing player.Because the game is specifically targeted towards machine learning, the game-state is defined as a multi-dimensional matrix. Figure 5 represents a 5 × ×
– red pixels as friendly buildings,
– green pixels as enemy units, and
– teal pixels as the mouse cursor.

Fig. 5. Game-state representation

Fig. 6. State abstraction using gray-scale heat-maps

We also included an option to reduce the state-space to a single-channel matrix using gray-scale imaging, as illustrated in Fig. 6. Each of the above features is then represented by a value between 0 and 1. We do this because Convolutional Neural Networks are computationally demanding, and by reducing input dimensionality we can speed up training [1]. We do not down-scale images, because the environment is only 30 × 11 pixels large. The state cannot be fully described by these heat-maps, as there are economics, health, and income values that must be interpreted separately. This is solved by a one-dimensional vectorized representation of these data that can be fed into the model.
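As a minimal sketch of such a gray-scale reduction, the snippet below collapses the three feature channels into one heat-map. The intensity values assigned to each feature are assumptions for illustration; the paper does not specify the exact mapping used by Deep Line Wars:

```python
import numpy as np

# Illustrative intensity values per feature; the actual mapping used by
# Deep Line Wars may differ (assumption).
FEATURE_INTENSITY = {"building": 1.0, "enemy": 0.66, "cursor": 0.33}

def to_grayscale(state_rgb):
    """Collapse a (30, 11, 3) RGB game-state into a single-channel (30, 11)
    heat-map with values in [0, 1], reducing the ConvNet input size."""
    red, green, blue = (state_rgb[..., i] for i in range(3))
    gray = np.zeros(state_rgb.shape[:2], dtype=np.float32)
    gray[red > 0] = FEATURE_INTENSITY["building"]   # friendly buildings
    gray[green > 0] = FEATURE_INTENSITY["enemy"]    # enemy units
    gray[(green > 0) & (blue > 0)] = FEATURE_INTENSITY["cursor"]  # teal cursor
    return gray
```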
The main contribution of this paper is the game environment presented in Sect. 4. A key element is to show that the game environment works properly, and we therefore introduce a learning algorithm that tries to play the game. This is in no way meant as a perfect solver for Deep Line Wars, but rather as a proof of concept that learning algorithms can be applied in the Deep Line Wars environment. In our solution, we consider the environment an MDP with state set S, action set A, and a set of reward functions R. Each of the weighted reward functions derives from a specific agent within the MDP, and together they define the absolute reward of the environment, R_env, by the following equation:

$$R_{\mathrm{env}}(s, a) = \sum_{i=1}^{n} w_i R_i(s, a) \qquad (4)$$

where R_env(s, a) is the weighted sum, with weights w_i, of the reward functions R_i(s, a). The proposed model divides the overall problem into separate smaller problems that can be trivialized with certain kinds of generic algorithms.

Fig. 7. Separation of the reward function
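As a minimal sketch of Equation (4), the snippet below computes the environment reward as a weighted sum of per-agent reward functions. The three reward functions and their weights are hypothetical examples, not the paper's actual configuration:

```python
def r_env(state, action, reward_fns, weights):
    """Equation (4): R_env(s, a) = sum_i w_i * R_i(s, a)."""
    return sum(w * r(state, action) for w, r in zip(weights, reward_fns))

# Hypothetical per-agent reward functions for three sub-objectives.
def economy_reward(s, a):  return s["income_delta"]
def defense_reward(s, a):  return -s["own_health_lost"]
def offense_reward(s, a):  return s["enemy_health_lost"]

state = {"income_delta": 2.0, "own_health_lost": 1.0, "enemy_health_lost": 3.0}
total = r_env(state, action=0,
              reward_fns=[economy_reward, defense_reward, offense_reward],
              weights=[0.3, 0.4, 0.3])  # weights w_i are assumed values
```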
Once the reward for the observed state is calculated, we compute the Q-value Q(s, a) utilizing R_env in a variant of DQN.

We conducted experiments with several deep learning algorithms in order to benchmark the current state-of-the-art against Deep Line Wars, a multi-agent, multi-sensory environment. All algorithms were benchmarked with identical game parameters. We tested DeepQNetwork, a state-of-the-art DQN from Mnih et al. [1], DeepQRewardNetwork, a rule-based algorithm, and random behaviour. Each of the algorithms was tested with several configurations, seen in Fig. 8. We did not expect any of these algorithms to beat the rule-based challenger, due to the difficulty of this AI.
Fig. 8. Property matrix of tested algorithms

The extended execution graph algorithm (see Sect. 7) was not part of the test bed, because it was not able to compete with any of the simpler DQN algorithms without guided mouse management.

Tests were run on an Intel i7-4770K with 64 GB RAM and an NVIDIA GeForce GTX 1080 Ti. Each of the algorithms was trained/executed for 1500 episodes. An episode is a game that either of the players wins or that reaches the 600 s time limit. DQN had a discount factor of 0.99, a learning rate of 0.001, and a batch size of 32.

Throughout the learning process, we can see that DeepQNetwork and DeepQRewardNetwork learn to perform resource management correctly. Figure 9 illustrates income throughout learning over 1500 episodes. The random player is presented as an aggregated average of 1500 games, while the remaining algorithms are single instances; it is not practical to perform more than a single run of the Deep Learning algorithms, because each episode takes several minutes to finish, which adds up to an enormous training time.

Figure 9 shows that the proposed algorithms outperform random behavior after relatively few episodes. DeepQRewardNetwork performs approximately 33% better than DeepQNetwork. We believe this is because its reward function R(s, a) is better defined, making it easier to learn the optimal policy in a shorter period of time. These results show that DeepQRewardNetwork converges better towards the optimal policy but, as seen in Fig. 9, diverges after approximately 1300 games. The reason for the divergence is that experience replay does not correctly batch important memories to the model, which causes the model to train on unimportant memories and diverge. This is considered part of future work and is addressed more thoroughly in Sect. 7. The rule-based algorithm can be regarded as an average player and compared to human level in this game environment.

Fig. 9. Income after each episode
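For concreteness, the following is a minimal sketch of a Q-network using the hyperparameters reported above (discount 0.99, learning rate 0.001, batch size 32). The layer topology and the 30 × 11 gray-scale input shape are assumptions based on the environment description; the paper does not list the exact architecture:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

DISCOUNT, LEARNING_RATE, BATCH_SIZE = 0.99, 0.001, 32  # from the paper
NUM_ACTIONS = 5  # hypothetical size of the action set

# Assumed topology: two small convolutions over the 30x11 gray-scale state,
# followed by a dense head emitting one Q-value per action.
model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(30, 11, 1)),
    Conv2D(64, (3, 3), activation="relu"),
    Flatten(),
    Dense(256, activation="relu"),
    Dense(NUM_ACTIONS, activation="linear"),
])
model.compile(optimizer=Adam(learning_rate=LEARNING_RATE), loss="mse")
# DISCOUNT and BATCH_SIZE would be used in the experience-replay training loop.
```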
Fig. 10. Victory distribution of tested algorithms
Figure 10 shows that DeepQNetwork and DeepQRewardNetwork achieve a win ratio of about 63–67% throughout the learning process. Compared to the rule-based AI, they come nowhere near mastering the game, but they clearly outperform random behavior in the game environment.
This paper introduced a new learning environment for reinforcement learning and applied state-of-the-art Deep Q-Learning to the problem. Initial results showed progress towards an AI that could beat a rule-based AI. There are still several challenges that must be addressed before an unsupervised AI can learn complex environments like Tower Line Wars. Mouse-input-based games are difficult to map to an abstract state representation, because a huge number of sequenced mouse clicks is required to act correctly in the game. DQN cannot, in its current state, handle long sequences of actions and must be guided in order to succeed. Finding a solution to this problem without guiding is thought to be the biggest blocker for these types of environments and will be the focus of future work.

DeepQNetwork and DeepQRewardNetwork had issues with divergence after approximately 1300 episodes. This is because our experience replay algorithm did not take into account that the majority of experiences are bad; it could not successfully prioritize the important memories. As future work, we propose to instead use prioritized experience replay from Schaul et al. [13], as sketched below.
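A minimal sketch of the proportional variant of prioritized experience replay [13]: transitions with larger TD-error are sampled more often. The capacity and alpha values are illustrative assumptions:

```python
import numpy as np

class PrioritizedReplay:
    """Proportional prioritized replay: P(i) ~ (|td_error_i| + eps)^alpha."""

    def __init__(self, capacity=10000, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.memory, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.memory) >= self.capacity:  # evict the oldest transition
            self.memory.pop(0)
            self.priorities.pop(0)
        self.memory.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size=32):
        probs = np.array(self.priorities) / np.sum(self.priorities)
        idx = np.random.choice(len(self.memory), batch_size, p=probs)
        return [self.memory[i] for i in idx]
```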
Fig. 11. Divide and conquer execution graph
Figure 7 shows how different sensors separate the reward from the environment to obtain a more precise reward bound to an action. In our research, we developed an algorithm that utilizes different models depending on which state the player is in. Figure 11 shows the general idea, where the state is categorized into three different types: Offensive, Defensive, and No Action. The state is evaluated by a Convolutional Neural Network that outputs a one-hot vector signaling which category the player is currently in. Each of the blocks in Fig. 11 then represents a form of state-modeling determined by the programmer. Our initial tests did not yield any promising results, but according to the Bellman equations it is a qualified way of evaluating the state and performing learning iteratively.
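A minimal sketch of such a state classifier: a small ConvNet mapping the game-state to a softmax over the three categories, whose argmax selects the sub-model. The layer sizes are assumptions, since the paper does not specify the topology:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense

# Three-way classifier over {Offensive, Defensive, No Action}.
classifier = Sequential([
    Conv2D(16, (3, 3), activation="relu", input_shape=(30, 11, 1)),
    Flatten(),
    Dense(64, activation="relu"),
    Dense(3, activation="softmax"),  # one-hot category of the current state
])
# The argmax of the softmax output decides which block in Fig. 11
# (which sub-model) handles the current state.
```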
Deep Line Wars is a simple yet advanced Real-Time Strategy game simulator that attempts to fill the gap between Atari 2600 and Starcraft II. DQN shows promising initial results but is far from perfect in its current state-of-the-art form. An attempt at making abstractions in the reward signal yielded somewhat improved performance, but at the cost of generality. Because of the enormous state-space, DQN cannot compete with simple rule-based algorithms. We believe that this is caused specifically by the mouse input, which requires some understanding of the state to perform well. This also causes the algorithm to overestimate some actions, specifically the offensive actions, because it is not able to build defenses correctly without receiving negative rewards. It is imperative that a solution to the mouse-input actions is found before DQN can perform better. A potential approach could be to use the StarCraft II API to obtain additional training data, including mouse sequences [14].
References
1. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. In: NIPS Deep Learning Workshop (2013)
2. Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A.J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., Kumaran, D., Hadsell, R.: Learning to navigate in complex environments. CoRR abs/1611.03673 (2016)
3. van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., Tsang, J.: Hybrid reward architecture for reinforcement learning. CoRR abs/1706.04208 (2017)
4. Gosavi, A.: Reinforcement learning: a tutorial survey and recent advances. INFORMS J. Comput. 21(2), 178–192 (2009)
5. van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double Q-learning. CoRR abs/1509.06461 (2015)
6. Wang, Z., de Freitas, N., Lanctot, M.: Dueling network architectures for deep reinforcement learning. CoRR abs/1511.06581 (2015)
7. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998)
8. Traysent: StarCraft II API – technical design, November 2016. https://us.battle.net/forums/en/sc2/topic/20751149219
9. Vinyals, O.: DeepMind and Blizzard to release StarCraft II as an AI research environment, November 2016. https://deepmind.com/blog/deepmind-and-blizzard-release-starcraft-ii-ai-research-environment/
10. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. CoRR abs/1509.02971 (2015)
11. Uriarte, A., Ontañón, S.: Game-tree search over high-level game states in RTS games, October 2014
12. Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning environment: an evaluation platform for general agents. CoRR abs/1207.4708 (2012)
13. Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. CoRR abs/1511.05952 (2015)
14. Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Sasha Vezhnevets, A., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., Quan, J., Gaffney, S., Petersen, S., Simonyan, K., Schaul, T., van Hasselt, H., Silver, D., Lillicrap, T., Calderone, K., Keet, P., Brunasso, A., Lawrence, D., Ekermo, A., Repp, J., Tsing, R.: StarCraft II: a new challenge for reinforcement learning. ArXiv e-prints, August 2017

Appendix B

FlashRL: A Reinforcement Learning Platform for Flash Games
Per-Arne Andersen
Morten Goodwin
Ole-Christoffer Granmo
University of Agder, Faculty of Engineering and Science
Serviceboks 509, NO-4898 Grimstad, Norway
Abstract
Reinforcement Learning (RL) is a research area that has blossomed tremendously in recent years and has shown remarkable potential in, among other things, successfully playing computer games. However, only a few game platforms provide the diversity in tasks and state-space needed to advance RL algorithms. The existing platforms offer RL access to Atari games and a few web-based games, but no platform fully exposes access to Flash games. This is unfortunate, because applying RL to Flash games has the potential to push the research on RL algorithms. This paper introduces the Flash Reinforcement Learning platform (FlashRL), which attempts to fill this gap by providing an environment for thousands of Flash games on a novel platform for Flash automation. It opens up easy experimentation with RL algorithms for Flash games, which has previously been challenging. The platform shows excellent performance, with as little as 5% CPU utilization on consumer hardware, and it shows promising results for novel reinforcement learning algorithms.
This paper was presented at the NIK-2017 conference.

Introduction
There are several challenges related to developing algorithms that can interact with human-level performance in real-world environments, such as computer games. Researchers often use toy experiments when working with Reinforcement Learning (RL), because they are easier, cheaper, and less time-consuming to orchestrate. With several applications for RL in daily life, it has become an essential field of research [13, 4]. However, existing learning platforms for games have major limitations, such as few game environments and little environment control.
OpenAI is a non-profit company that is currently one of the leading research organizations in RL. OpenAI Universe is a software platform that has several game environments aimed at AI research. The problem with this software is that individual developers are not directly permitted to add new environments to the repository, and there is little documentation on how to contribute new environments. FlashRL changes this with our proposed architecture, as control is given back to each researcher.
Adobe Flash is a multimedia software platform used for the production of applications and animation. The Flash run-time was recently declared deprecated by Adobe and will no longer be supported after 2020. Flash is still frequently used in web applications, and several thousand games have been created for the platform. Several browsers have removed support for Flash, making it impossible to access the mentioned game environments. Games have proven to be an excellent area for machine learning benchmarking due to the size and diversity of their state-spaces. It is therefore essential to preserve Flash as an environment for reinforcement learning.

Automating Flash applications is a relatively untouched area. The technology has been succeeded by several better options for web development, for example HTML5. This makes it hard for algorithms to control Flash environments programmatically. There are already reinforcement learning platforms that support Flash games as part of their game library, but these use browsers to execute the Flash run-time.

Figure 1: Interacting with Flash through browser automation

Figure 1 illustrates how interaction with the Flash environment would typically be carried out through browser automation software such as Selenium. Selenium can automate most modern browsers. It does not directly support Flash automation but can easily be used for this purpose with minimal customisation [3]. With the loss of browser support, the difficulty of controlling Flash applications increases, and there is a significant risk that excellent game environments for reinforcement learning are lost.

FlashRL is unique for reinforcement learning as it allows researchers to use any desired Flash environment. It gives full control of the game environment and is not based on running Flash applications in the browser. FlashRL is targeted at research in reinforcement learning but can also be used with other machine learning algorithms. It supports all kinds of Flash applications but is primarily used for agent-based gameplay. Several thousand game environments are included in the first release of the software (the authors take no credit for any of the game environments). Multitask 2 is a Flash game that is excellent for reinforcement learning, as it requires the agent to perform several tasks simultaneously. We show in this paper that our learning platform can be used to train novel reinforcement learning algorithms without any customisation.

In Section 2, we discuss related work on existing learning platforms for machine learning and argue why web browsers are no longer viable as a Flash run-time. Section 3 briefly outlines what reinforcement learning is and explains how Q-Learning works. Section 4 outlines the proposed platform and thoroughly describes its underlying architecture. In Section 5, we show initial results of utilizing the proposed learning platform for reinforcement learning. Section 6 summarises the work and argues why the proposed learning platform is suitable for reinforcement learning research. Section 7 outlines a road-map for further development of the platform.

With the increasing popularity of RL, there is a need for flexible learning platforms. Several learning platforms exist that can run a limited number of games, but no platform features an open-source interface with the possibility to run any Flash game.
Bellemare et al. provided in 2012 a learning platform, the Arcade Learning Environment (ALE), that enabled scientists to conduct cutting-edge research in general deep learning [1]. The package provided hundreds of Atari 2600 environments that in 2013 allowed Mnih et al. to achieve breakthroughs with Deep Q-Learning and, later, A3C. The platform has been a key component in several breakthroughs in RL research [11, 9, 8].

In 2016, Brockman et al. from OpenAI released GYM, which they referred to as "a toolkit for developing and comparing reinforcement learning algorithms" [2]. GYM provides various types of environments based on the following technologies [2]: algorithmic tasks, Atari 2600, board games, the Box2D physics engine, the MuJoCo physics engine, and text-based environments. OpenAI also hosts a website where researchers can submit their performance for comparison between algorithms. GYM is open-source and encourages researchers to add support for their own environments.

OpenAI recently released a new learning platform called Universe. This environment adds support for environments running inside VNC. It also supports running Flash games and browser applications. However, despite OpenAI's open-source policy, they do not allow researchers to add new environments to the repository, which limits the possibility of running arbitrary environments. Universe is, however, a significant learning platform, as it also supports desktop games like Grand Theft Auto IV, allowing for research in autonomous driving [7].

Selenium is software for automating web browsers, used primarily for unit-testing of web content. There were some efforts to create a version that allowed interaction with Flash content, but they were quickly abandoned. There is limited support for interacting with Flash by selecting the DOM element in HTML and sending key-presses via JavaScript. Several learning platforms utilize this method, but due to the deprecation of Flash in browsers, it is no longer a viable option.
Reinforcement learning can be considered a hybrid between supervised and unsupervised learning. We implement what we call an agent that acts in our environment. This agent is placed in the unknown environment, where it tries to maximize the environmental reward [14].

A Markov Decision Process (MDP) is a mathematical method of modeling decision-making within an environment, often used with model-based RL algorithms. In Q-Learning, we do not try to model the MDP. Instead, we try to learn the optimal policy by estimating the action-value function $Q^*(s, a)$, which yields the maximum expected reward when executing action $a$ in state $s$. The optimal policy can then be found by

$$\pi(s) = \operatorname*{argmax}_a Q^*(s, a) \qquad (1)$$

This is derived from Bellman's equation, because we can consider the utility function $U(s) = \max_a Q(s, a)$ to hold. This allows us to derive the following update rule from Bellman's work:

$$Q(s, a) \leftarrow Q(s, a) + \underbrace{\alpha}_{\text{learning rate}} \Big( \underbrace{R(s)}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \underbrace{\max_{a'} Q(s', a')}_{\text{new estimate}} - \underbrace{Q(s, a)}_{\text{old estimate}} \Big) \qquad (2)$$

This is an iterative process of propagating back the estimated Q-value for each discrete time-step in the environment. It is guaranteed to converge towards the optimal action-value function, $Q_i \rightarrow Q^*$ as $i \rightarrow \infty$ [14, 10]. At the most basic level, Q-Learning utilizes a table for storing $(s, a, r, s')$ tuples. We can instead use a non-linear function approximator to approximate $Q(s, a; \theta)$, where $\theta$ denotes the tunable parameters of the approximator. Artificial Neural Networks (ANN) are a popular function approximator, but training with an ANN is relatively unstable.
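As a minimal sketch of Equations (1) and (2), the snippet below implements a tabular Q-Learning update for small discrete state and action sets (the sizes and learning rate are assumed for illustration):

```python
import numpy as np

n_states, n_actions = 16, 4          # assumed sizes for illustration
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99             # learning rate and discount factor

def q_update(s, a, r, s_next):
    """Equation (2): move Q(s, a) toward R(s) + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])   # new estimate
    Q[s, a] += alpha * (td_target - Q[s, a])    # correct the old estimate

def policy(s):
    """Equation (1): the greedy policy pi(s) = argmax_a Q*(s, a)."""
    return int(np.argmax(Q[s]))
```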
Gnash Flashplayer and the reinforcement learning algorithms.
Flash Reinforcement Learning (FlashRL) is a new platform that allows researchers to run algorithms on any Flash-based game efficiently.The learning platform is developed primarily for the operating system Linux butis likely to run on Cygwin with few modifications. There are several key componentsthat FlashRL uses to operate adequate, see Figure 2. It uses a Linux library calledXVFB to create a virtual frame-buffer that is used for graphics rendering [6]. Insidethis frame-buffer, a Flash game chosen by the researcher is executed by a third partyflash player, for example,
Gnash . A VNC server serves the XVFB frame-buffer andallows FlashRL to access it by utilizing a VNC Client. The VNC Client can thenissue commands like keyboard presses and mouse movements. The VNC Client pyVLC was specially made for this learning platform. The code base originatesfrom python-vnc-viewer [15]. The last component of FlashRL is the Reinforcementigure 2: FlashRL Architecture OverviewLearning API that allows the developer to access the input/output of the VNC client.This makes it easy to develop sequenced algorithms by using the API callbacks ormanually by threading.Figure 3: Frame-buffer Access MethodsFigure 3 illustrates two methods of accessing the frame-buffer from the FlashGame. Both approaches are sufficient to perform reinforcement learning, but eachhas its strength and weaknesses. Method 1, seen in Figure 3 allows the developer toget frames served at a fixed rate, for example, 60 frames per second. Method 2 doesnot restrict the frequency of how fast the frame-buffer is captured. This is preferablefor developers that do not require images from fixed time-steps as it requires lessprocessing power per frame. The framework was developed with deep learning inmind and is proven to work with Keras and Tensorflow.Several thousand game environments are shipped with the initial version ofFlashRL. These game environments were gathered from different sources on theweb. FlashRL has a relatively small code-base and to preserve this size, all of theFlash games are hosted remotely. The quality varies, and some of the games are nottested or labeled. Most games are however tested and can be played without issues,igure 4: Selected environments from the FlashRL game repositorysee Figure 4.
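A hypothetical usage sketch of the two frame-buffer access methods follows; the class and parameter names are illustrative assumptions and not FlashRL's documented API:

```python
from PIL import Image

def on_frame(frame: Image.Image):
    """Invoked for every captured frame; pre-process and hand it to the agent."""
    state = frame.convert("L").resize((84, 84))  # gray-scale, down-scaled
    # agent.observe(state) ...

# Method 1: frames pushed through a callback at a fixed rate (e.g. 30 fps).
# env = FlashEnvironment("multitask2", fps=30, frame_callback=on_frame)

# Method 2: poll the frame-buffer as fast as the CPU allows (no sleep).
# while True:
#     on_frame(env.grab_frame())
```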
This section presents experiments with reinforcement learning algorithms applied in FlashRL. We use the game Multitask 2 to test the learning platform. Multitask 2 was chosen because it challenges the algorithm to master four different mini-games simultaneously. The experiments are grouped into two parts. The first experiment determines the hardware requirements of the platform and benchmarks the speed of critical operations. The second experiment is an implementation of standard Deep Q-Learning trained on raw state images from Multitask 2 to perform game actions. The latter is meant as a proof of concept that RL algorithms can be applied in FlashRL. All experiments were conducted on Ubuntu Linux 17.04 x64 running Python 3.5.3. The machine has 64 GB memory, an NVIDIA GeForce 1080 Ti, and an Intel i7-7700K as hardware.

Multitask 2
Figure 5 illustrates the game-play of Multitask 2. The game is split into four game phases. The first phase (lower right corner in Figure 5) is a single paddle on which the player must balance a ball. In the second phase (lower left corner in Figure 5), the player must control a second paddle to avoid arrows traveling towards it. The third phase (upper right corner in Figure 5) consists of an arrow with mechanics relatable to the game Flappy Bird [12]. In the final phase (upper left corner in Figure 5), the player must additionally jump over holes in the ground. To succeed in the game, the player must control eight actions simultaneously. The score is calculated by adding a single point for each second survived in the game.
Experiment 1: Hardware Requirements
Recall from Section 4 that there are two methods of accessing the frame-buffer. The first method (Method 1) is based on retrieving the frame-buffer at fixed time intervals. The second method (Method 2) has no interval restriction. This makes Method 2 faster because it does not require sleep between frames, but it causes the framework to consume all available CPU, which is not always preferable.

Figure 5: In-game footage of the game Multitask 2 (http://multitaskgames.com/multitask-2.html)

We can see from Figure 6 that using Method 1 with the interval set to 30 fps uses approximately 5% of the CPU. Increasing the interval to 300 fps increases this to 13%. We gradually increased the interval until the CPU ran at maximum: a single i7-7700K can capture approximately 6300 frames per second from the frame-buffer before struggling to keep up. The GPU registered no load during these tests, because the Flash environment is software-rendered. Memory consumption was between 200 MB and 500 MB depending on the speed. We believe the memory increase occurs because Python does not garbage-collect old frame-buffer snapshots between iterations and therefore accumulates memory load.
Experiment 2: Reinforcement Learning
Deep Q-Network (DQN) is a novel algorithm architecture developed by Mnih et al. at Google DeepMind. It combines Q-Learning with a neural network that estimates the Q-values [11]. In our tests we used Double Q-Learning from van Hasselt et al. [5]. We also used the Dueling architecture from Wang et al., which increases learning precision by using two estimators: a state-value and an action-advantage function [16]. We used a discount factor of 0.99, a learning rate of 0.001, and a mini-batch size of 16. We used an ε-greedy exploration/exploitation strategy, where ε started at 0.9 and finished at 0.1; the ε annealing was set to 10,000 steps. This is a relatively short annealing phase, but it seemed to work well in this environment.

Figure 6: Hardware benchmark

Figure 7: Deep Q-Learning Training

Figure 7 illustrates the training of the DQN, where the x-axis represents episodes of the game and the y-axis the score before reaching the terminal state. The agent had trouble adapting to the third phase (see Section 5). Phase 3 is relatively hard to master because it requires the player to balance the arrow in the air. At around 230 episodes we saw a drop in score, because the network seems to prioritize the first phase of the game. It reached the second phase a few times but was not able to control the paddle successfully for longer periods of time. This is why the score stalls at approximately 400 episodes. We believe that the network could have performed better with additional training time; it trained for a total of two days. Hopefully, it will be easier to train the network when FlashRL can speed-forward games; see Section 7. The results are overall acceptable, as they show that FlashRL delivers quality states that a reinforcement learning agent can learn from.
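A minimal sketch of the ε-greedy schedule described above, annealing ε linearly from 0.9 to 0.1 over 10,000 steps:

```python
import random

EPS_START, EPS_END, ANNEAL_STEPS = 0.9, 0.1, 10000  # from the paper

def epsilon(step):
    """Linearly anneal epsilon from EPS_START down to EPS_END."""
    frac = min(step / ANNEAL_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def select_action(q_values, step):
    """Explore with probability epsilon(step); otherwise act greedily."""
    if random.random() < epsilon(step):
        return random.randrange(len(q_values))                  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit
```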
FlashRL offers an easy-to-use architecture for performing RL in Flash-based games. It is demonstrated to work well for Multitask 2, one of the included environments. FlashRL fills the gap that emerged with the deprecation of Flash. Its main focus is RL, but it can also be used for other machine learning purposes. This paper shows that FlashRL can be used to train RL algorithms, in particular on Multitask 2. The work shows promising results, and continuing to expand the game repository may provide new insights about RL in the future. FlashRL will be kept alive as long as Flash environments are an asset to the machine learning community. It is available to the public at https://github.com/UIA-CAIR/FlashRL and can easily be adapted to any research requirement.
Several improvements are planned for FlashRL. This paper outlined the features of the initial version of FlashRL, which is sufficient for simple reinforcement learning research. As seen in Section 5, a Deep Q-Learning based agent can successfully learn from the Multitask 2 environment and gradually perform better.
Speed-forward Option
Learning algorithms often require several thousand episodes to gain expert knowledge of the environment. FlashRL is currently limited to the speed at which the game loop is executed (usually 30 fps in real-time). An important improvement would be to lift this restriction and allow algorithms to train at an accelerated rate. This would certainly reduce the training duration of feedback-based algorithms.
Game Repository Analysis
The game repository features many unlabeled, unrated, and untested games. Some games are potentially useless in a machine learning setting and require a review. The review phase is time-consuming, and the authors of this paper did not have enough time to analyze each of the environments manually. The goal is to gradually label and categorize all games in the repository.

Website
A future goal is to allow the execution of algorithms from a web interface and to add gamification aspects to the library. This would potentially create competition between researchers, much like Kaggle and OpenAI Universe.
Cross-Platform Support
FlashRL is, in its initial version, only supported in Python 3 on the Linux platform. The goal is to extend it so that it can also run without modifications on Microsoft Windows operating systems.

References
[1] Marc G. Bellemare et al. "The Arcade Learning Environment: An Evaluation Platform for General Agents". In: CoRR abs/1207.4708 (2012). URL: http://arxiv.org/abs/1207.4708.
[2] Greg Brockman et al. OpenAI Gym. 2016. eprint: arXiv:1606.01540.
[3] Flash Testing with Selenium. Aug. 2017.
[4] Michael E. Grost et al. "Applications of Artificial Intelligence". In: CAD/CAM Robotics and Factories of the Future: Volume II: Automation of Design, Analysis and Manufacturing. Ed. by Birendra Prasad, S. N. Dwivedi, and K. B. Irani. Berlin, Heidelberg: Springer Berlin Heidelberg, 1989, pp. 165–229. ISBN: 978-3-642-52323-6. URL: https://doi.org/10.1007/978-3-642-52323-6_3.
[5] Hado van Hasselt, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-learning". In: CoRR abs/1509.06461 (2015). URL: http://arxiv.org/abs/1509.06461.
[6] Harold L. Hunt II and Jon Turney. "Cygwin/X Contributor's Guide". In: (2004).
[7] Yuxi Li. "Deep Reinforcement Learning: An Overview". In: CoRR abs/1701.07274 (2017). URL: http://arxiv.org/abs/1701.07274.
[8] Volodymyr Mnih et al. "Asynchronous Methods for Deep Reinforcement Learning". In: CoRR abs/1602.01783 (2016). URL: http://arxiv.org/abs/1602.01783.
[9] Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". In: Nature (2015). ISSN: 0028-0836. URL: http://dx.doi.org/10.1038/nature14236.
[10] Volodymyr Mnih et al. "Playing Atari With Deep Reinforcement Learning". In: NIPS Deep Learning Workshop. 2013.
[11] Volodymyr Mnih et al. "Playing Atari with Deep Reinforcement Learning". In: CoRR abs/1312.5602 (2013). URL: http://arxiv.org/abs/1312.5602.
[12] Matthew Piper. "How to Beat Flappy Bird: A Mixed-Integer Model Predictive Control Approach". PhD thesis. The University of Texas at San Antonio, 2017.
[13] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 3rd. Upper Saddle River, NJ, USA: Prentice Hall Press, 2009. ISBN: 0136042597, 9780136042594.
[14] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[15] Techtonik. python-vnc-viewer. https://github.com/techtonik/python-vnc-viewer. 2015.
[16] Ziyu Wang, Nando de Freitas, and Marc Lanctot. "Dueling Network Architectures for Deep Reinforcement Learning". In: CoRR abs/1511.06581 (2015). URL: http://arxiv.org/abs/1511.06581.