MSPM: A Modularized and Scalable Multi-Agent Reinforcement Learning-based System for Financial Portfolio Management
Zhenhan Huang, Fumihide Tanaka∗
University of Tsukuba, Tsukuba, Japan
[email protected], [email protected]
∗ Corresponding Author

Abstract
Financial portfolio management is one of the most applicable problems in Reinforcement Learning (RL) owing to its sequential decision-making nature. Existing RL-based approaches, while inspiring, often lack the scalability, reusability, or profundity of intake information needed to accommodate the ever-changing capital markets. In this paper, we design and develop MSPM, a novel multi-agent reinforcement-learning-based system with a modularized and scalable architecture for portfolio management. MSPM involves two asynchronously updated units: the Evolving Agent Module (EAM) and the Strategic Agent Module (SAM). A self-sustained EAM produces signal-comprised information for a specific asset using heterogeneous data inputs, and each EAM is reusable, with connections to multiple SAMs. A SAM is responsible for the asset reallocation of a portfolio using the profound information from the connected EAMs. With this elaborate architecture and the multi-step condensation of volatile market information, MSPM aims to provide a customizable, stable, and dedicated solution to portfolio management that existing approaches do not. We also tackle the data-shortage issue of newly-listed stocks by transfer learning, and validate the necessity of EAM. Experiments on 8-year U.S. stock-market data prove the effectiveness of MSPM in profit accumulation through its outperformance over existing benchmarks.
Introduction

Portfolio management is a continuous process of reallocating capital into multiple assets [Markowitz, 1952], and it aims to maximize accumulated profits with an option to minimize the overall risks of the portfolio. To perform such a practice, portfolio managers who focus on stock markets conventionally read financial statements and balance sheets, pay attention to news from media and announcements from financial institutions such as central banks, and analyze stock price trends.
By the resemblance in the nature of the problem, researchers unsurprisingly want to adapt the Reinforcement Learning (RL) framework to portfolio management, and with the incorporation of Deep Learning (DL), they have built robust and valid deep RL-based methods [Jiang et al., 2017] [Ye et al., 2020] [Liang et al., 2018]. Akin to how portfolio managers receive information from various sources, existing approaches incorporate heterogeneous data [Ye et al., 2020]. Researchers have also proposed Multi-agent Reinforcement Learning (MARL) approaches [López et al., 2010] [Sycara et al., 1995] [Lee et al., 2020]. In [Lee et al., 2020], the authors design a system including a group of Deep Q-Network [Mnih et al., 2013] based agents, resembling independent investors making investment decisions, and end up with a diversified portfolio. However, while inspiring, these cutting-edge approaches lack the scalability, reusability, or profundity of intake information needed to accommodate the ever-changing markets. For example, in SARL [Ye et al., 2020], a general framework, the encoder's intake is either financial news data for embedding or stock prices for trading-signal generation; this lack of scalability prevents the encoder from efficiently producing holistic information and eventually limits what the RL-based agents can learn. Additionally, agents in existing multi-agent-based systems are mostly ad hoc and rarely reusable. As one of the first attempts, MAPS [Lee et al., 2020] is by its very nature a reinforcement-learning implementation of Ensemble Learning [Rokach, 2010], an implicit solution stacking multiple Deep Q-Network agents, which may not be as reliable as explicit solutions implementing policy gradient methods [Jiang et al., 2017].

In this paper, we design and develop MSPM, a novel multi-agent reinforcement-learning-based system with a modularized and scalable architecture for portfolio management. In MSPM, assets are vital and organic building blocks. This vitalness is reflected in that each asset has its exclusive module, which takes heterogeneous data and utilizes a Deep Q-Network (DQN) based agent to produce signal-comprised information. This module, as a dedicated system, is called the Evolving Agent Module (EAM). Once we have set up and trained the EAMs corresponding to the assets in a portfolio, we connect them to a decisive system: the Strategic Agent Module (SAM), a Proximal Policy Optimization (PPO) [Schulman et al., 2017] based agent. A SAM represents a portfolio and uses the profound information from the connected EAMs for asset reallocation. EAM and SAM are asynchronously updated, and an EAM possesses reusability, which means it can connect to multiple SAMs, and these connections can be arbitrary. With the power of parallel computing, we are able to perform capital reallocation for various portfolios at scale simultaneously. By experiment and comparison, we confirm the necessity of the Evolving Agent Module (EAM) and validate MSPM's outperformance over classical trading strategies in terms of both accumulated portfolio value and Sortino Ratio.
Our contribution can be listed as follows:
• To our knowledge, MSPM is the first approach to formalize a modularized and scalable multi-agent reinforcement learning system using signal-comprised information for financial portfolio management.
• We tackle the data-shortage issue of newly-listed stocks by transfer learning for RL-based systems, and revise the EIIE-style neural network architecture to better accommodate Actor-Critic methods.
• We validate that MSPM outperforms the benchmarks by at least 100% and 343% in terms of accumulated rate of return, under the extreme market conditions of U.S. stock markets from March to December 2020 during the global pandemic.

MSPM Overview

To address the portfolio management problem, we first assume a composition of minimal-viable market information: 1. historical prices from U.S. stock markets and 2. asset-related news from media, as the environment with which the agents in Evolving Agent Modules (EAMs) interact. Then, the combination of the signal-comprised information that the EAMs generate becomes the state that the decisive agents in SAMs observe. SAMs consequently reallocate the assets in the portfolios. Each EAM is reusable, which means we may connect it to any portfolios/SAMs once it is set up and trained, and a SAM is connected with at least one EAM. The relationship between EAMs and SAMs is illustrated in Figure 1 and sketched in code below. An EAM is retrained periodically using the latest information from the market, media, financial institutions, etc.; in this paper we implement only the former two sources of data and leave the remaining input sources as an open question for future study. In the next sections, we introduce the details of EAM and SAM.
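To make the module topology concrete, here is a minimal sketch of the EAM-to-SAM wiring described above. The class names and attributes are illustrative inventions, not the paper's API; the tickers mirror the two portfolios used later in the experiments, including the GOOGL EAM shared by both.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EvolvingAgentModule:
    """One EAM per asset: turns heterogeneous data into trading signals."""
    ticker: str

    def signal_tensor(self):
        # Placeholder for the 2-D signal-comprised tensor s*_t.
        raise NotImplementedError

@dataclass
class StrategicAgentModule:
    """One SAM per portfolio: consumes signals from the EAMs it connects to."""
    name: str
    eams: List[EvolvingAgentModule] = field(default_factory=list)

# EAMs are reusable: the GOOGL EAM feeds both portfolios, as in the paper.
eam_googl = EvolvingAgentModule("GOOGL")
sam_a = StrategicAgentModule("Portfolio(a)",
                             [EvolvingAgentModule("AAPL"),
                              EvolvingAgentModule("AMD"), eam_googl])
sam_b = StrategicAgentModule("Portfolio(b)",
                             [eam_googl, EvolvingAgentModule("NVDA"),
                              EvolvingAgentModule("TSLA")])
```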
Evolving Agent Module (EAM)

Figure 2 provides an overview of the EAM's architecture.
Data Sources

An Evolving Agent Module (EAM) has two data sources: 1. the asset's historical prices, and 2. asset-related financial news. The historical prices used for training EAMs are Quandl End-Of-Day data fetched through Quandl's API. The financial news data for each asset come from news websites (e.g., finance.yahoo.com) and social media (e.g., twitter.com) on a daily basis.
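As a hedged sketch of the price-data side of this pipeline, the snippet below pulls End-Of-Day bars with the quandl Python client over the date range used in the experiments; the EOD/AAPL dataset code is an assumption based on the End-Of-Day product named above, and a valid API key is required.

```python
import quandl

quandl.ApiConfig.api_key = "YOUR_API_KEY"  # placeholder credential

# Daily OHLCV history for one asset; columns include Open/High/Low/Close/Volume.
prices = quandl.get("EOD/AAPL",
                    start_date="2013-01-01",
                    end_date="2020-12-31")
print(prices.tail())
```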
Figure 1: The surjective relationship between EAMs and SAMs. Each Evolving Agent Module (EAM) is responsible for a single asset and involves a DQN agent and a FinBERT model, utilizing heterogeneous data to produce signal-comprised information. Each Strategic Agent Module (SAM) is a module for a portfolio; it involves a PPO agent and reallocates the assets using the stacked signal-comprised 3-D tensor (Profound State v+) from the connected EAMs. Moreover, trained EAMs are reusable across different portfolios and can be combined at will. With parallel computing, capital reallocation may be performed for various portfolios at scale simultaneously.

State Representation

The state that the agent in an EAM observes at any given periodic (daily) time-step t is a 2-D tensor $v_t$, stacked from the recent n-day historical prices $s_t$ and the 1-D tensor of sentiment polarities of the designated asset, $\rho_t$. Specifically,

$$v_t = (s_t, \rho_t) \tag{1}$$

where the sentiment polarities $\rho$, which range continuously from -1.0 to 1.0 to indicate bearishness (-1.0) or bullishness (1.0), are classified using the pre-trained FinBERT classifier [Araci, 2019] [Devlin et al., 2019] on asset-related financial news and then averaged. Furthermore, we feed the agent not only the news sentiments but also the news volumes, i.e., the number of news items used for generating the sentiment polarities of each asset. This method aims to alleviate the unbalanced-news issue found in existing research [Ye et al., 2020].

DQN Agent

We train a Deep Q-Network (DQN) agent [Mnih et al., 2013] for the sequential decision-making in the Evolving Agent Module. In Deep Q-Learning, the agent acts based on the state-action value function

$$Q_\theta(s_t, a_t) = \mathbb{E}_{\pi_\theta}\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k+1}\,\middle|\, s_t = s, a_t = a\right]$$

which we represent with a Residual Network with 1-D convolutions.
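The paper does not specify the Q-network layer-by-layer, so the following is a minimal sketch of one plausible residual block for the 1-D convolutional network, assuming PyTorch; the channel count and kernel size are illustrative.

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """One residual block of a 1-D convolutional Q-network."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # skip connection preserves the input signal

# v_t as a batch: (batch, features, n_days), e.g. price rows plus a sentiment row.
q_trunk = nn.Sequential(ResBlock1D(8), ResBlock1D(8))  # 8 channels is illustrative
```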
DQN Extensions
We also implement three extensions of the original DQN [Lapan, 2018]: the dueling architecture [Wang et al., 2016], Double DQN [Hasselt et al., 2016], and two-step Bellman unrolling. A sketch of the resulting training target follows below.
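As a hedged illustration of how two of these extensions interact in the training target: Double DQN decouples action selection (online network) from action evaluation (target network), and the two-step Bellman unrolling folds two immediate rewards into the bootstrap. Function and argument names are ours, assuming batched PyTorch Q-networks.

```python
import torch

@torch.no_grad()
def two_step_double_dqn_target(r_t, r_t1, s_t2, done, q_net, target_net, gamma=0.99):
    """Target = r_t + g*r_{t+1} + g^2 * Q_target(s_{t+2}, argmax_a Q_online(s_{t+2}, a)).

    Double DQN: the online net selects the action, the target net evaluates it.
    done flags episodes ending within the two-step window (handling simplified).
    """
    best_a = q_net(s_t2).argmax(dim=1, keepdim=True)         # online net picks action
    q_next = target_net(s_t2).gather(1, best_a).squeeze(1)   # target net evaluates it
    return r_t + gamma * r_t1 + (gamma ** 2) * q_next * (1.0 - done)
```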
Transfer Learning

Instead of training every EAM from scratch, we initially train a Foundational EAM using the historical prices of AAPL (Apple Inc.) and then train all other EAMs based on this pre-trained EAM. By doing so, the EAMs start with a prior pattern recognition of stock trends, and this approach also tackles the data-shortage issue of newly-listed stocks.
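A minimal sketch of the warm-start itself, assuming PyTorch modules; the paper does not publish this code, so the names are illustrative.

```python
import copy
import torch.nn as nn

def spawn_eam_network(foundational_net: nn.Module) -> nn.Module:
    """Clone the AAPL-trained Foundational EAM's Q-network to seed a new asset's EAM."""
    new_net = copy.deepcopy(foundational_net)  # same architecture, warm-started weights
    # Fine-tune new_net on the new asset's (possibly short) price history afterwards.
    return new_net
```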
Asset Trading Signal

The DQN agent in an EAM acts to trade the designated asset with an action of either buying, selling (i.e., closing the position), or skipping at every time step t. The chosen action, $a_t \in \{$buying, selling, skipping$\}$, is called the asset trading signal. As indicated by the actions, there is no short (selling) position, only a long (buying) position, and a new position can be opened only after an existing position has been closed. The reward $r_t(s_t, \iota_t)$ the agent receives at each time-step t is:

$$r_t(s_t, \iota_t) = \begin{cases} \sum_{i=t_l}^{t}\left(\dfrac{v_i^{(close)}}{v_{i-1}^{(close)}} - 1 - \beta\right), & \text{if } \iota_t \\ 0, & \text{if not } \iota_t \end{cases} \tag{2}$$

where $v_t^{(close)}$ is the close price of the given asset at time step t, $t_l$ is the time step when the long position was opened and commissions were deducted, $\beta$ stands for the commission of 0.002, and $\iota_t$ is the indicator of an open position (i.e., a position is still open).

As illustrated in Figure 2, after the EAMs are trained, we feed new historical prices $s'_t$ and financial news of the designated assets to generate predictive trading signals $a^*_t$. Then we stack the same new historical prices with $a^*_t$ to formalize a 2-D signal-comprised tensor $s^*_t$ as the data source to train SAM; one way to perform this stacking is sketched below.
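One plausible reading of the stacking step, assuming NumPy arrays; the exact feature layout of $s^*_t$ is not fully specified in our source.

```python
import numpy as np

def signal_comprised_tensor(new_prices: np.ndarray, signals: np.ndarray) -> np.ndarray:
    """Stack an n-day price window s'_t with the DQN signals a*_t into s*_t.

    new_prices: shape (features, n), e.g. close/high/low rows over n days
    signals:    shape (n,), per-day trading signal in {buy, sell, skip}
    """
    return np.vstack([new_prices, signals[np.newaxis, :]])  # shape (features + 1, n)
```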
Strategic Agent Module (SAM)

We now introduce the details of SAM; Figure 3 provides an overview of its architecture. A SAM is dependent on the EAMs it is connected with, as its data source is a combination of the signal-comprised information from these EAMs.
Profound State

The 2-D signal-comprised tensors from the connected EAMs are stacked and transformed into a 3-D signal-comprised tensor called the Profound State $v^+_t$, which is the state SAM observes at each time-step t. Figure 4 shows the abstract of the profound state $v^+_t$.

Figure 2: An EAM is a module for a designated asset. Each Evolving Agent Module (EAM) takes two types of heterogeneous data: 1. the designated asset's historical prices, and 2. asset-related financial news. At the center of the EAM is an extended DQN agent using a 1-D convolution ResNet for the sequential decision-making. Instead of training every EAM from scratch, we train EAMs by transfer learning from the Foundational EAM. At every time step t, the DQN agent in the EAM observes the state $v_t$ of historical prices $s_t$ and news sentiments $\rho_t$ of the designated asset, acts to trade with an action $a^*_t$ of either buying, selling, or skipping, and eventually generates a 2-D signal-comprised tensor $s^*_t$ using the new prices $s'_t$ and signals $a^*_t$.

PPO Agent

At the center of a SAM is a Proximal Policy Optimization (PPO) [Schulman et al., 2017] agent that performs the reallocation of assets. PPO is an actor-critic style policy gradient method and has been widely used on continuous-action-space problems because of its balance between good performance and ease of implementation. A policy $\pi_\theta$ is a parametrized mapping $S \times A \rightarrow [0, 1]$ from state space to action space. Among the different objective functions of PPO, we implement the Clipped Surrogate Objective [Schulman et al., 2017]:

$$L(\theta) = \hat{\mathbb{E}}_{\pi_{\theta'}}\left[\min\left(r_t(\theta)A_t^{\theta'},\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)A_t^{\theta'}\right)\right] \tag{3}$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta'}(a_t|s_t)}$ and $A_t^{\theta'}$ is the advantage function, expressed as $A_t^{\theta'} = Q_{\theta'}(s_t, a_t) - V_{\theta'}(s_t)$, where the state-action value function $Q_{\theta'}(s_t, a_t)$ is

$$Q_{\theta'}(s_t, a_t) = \mathbb{E}_{\pi_{\theta'}}\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k+1}\,\middle|\, s_t = s, a_t = a\right]$$

and the value function $V_{\theta'}(s_t)$ is

$$V_{\theta'}(s_t) = \mathbb{E}_{\pi_{\theta'}}\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k+1}\,\middle|\, s_t = s\right]$$

Figure 3: A SAM is a module for a portfolio. Each SAM takes the signal-comprised 3-D tensor, which is stacked and transformed from the 2-D tensors of the connected EAMs, and then generates the reallocation weights for the assets in the portfolio.

Figure 4: The input of the Strategic Agent Module, the Profound State $v^+_t \in \mathbb{R}^{f \times m^* \times n}$, is a 3-D tensor, where f is the number of features, $m^* = m + 1$ is the number of assets m in the portfolio plus cash, and n is the fixed rolling-window length.

For the PPO agent, we design a policy network architecture targeting the uniqueness of the continuous action space in financial portfolio management problems, inspired by the Ensemble of Identical Independent Evaluators (EIIE) topology [Jiang et al., 2017]. Since the assets' reallocation weights $a_t$ at time-step t are strictly required to sum to 1.0, we set up $m^*$ normal distributions $N_1(\mu^1_t, \sigma^2), \ldots, N_{m^*}(\mu^{m^*}_t, \sigma^2)$, where $m^* = m + 1$ and $\mu_t \in \mathbb{R}^{1 \times m^* \times 1}$ is the linear output of the last layer of the neural network, with a fixed standard deviation $\sigma$, and we sample from these distributions to obtain $x_t \in \mathbb{R}^{m^* \times 1}$. We eventually obtain the reallocation weights $a_t = \mathrm{Softmax}(x_t)$ and the log probability of $x_t$ for the PPO agent to learn; a sketch of this sampling head follows below.

Figure 5 shows the details of the Policy Network (Actor), namely $\theta'$, of the Strategic Agent Module. Due to their resemblance and equivalence, the architectures of the Value Network (Critic) and the Target Policy Network, namely $\theta$, are not illustrated.
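A hedged sketch of the sampling head, assuming PyTorch; the paper fixes σ but its value is treated as an assumption here (0.2 below).

```python
import torch
from torch.distributions import Normal

def reallocation_action(mu: torch.Tensor, sigma: float = 0.2):
    """Sample x_t ~ N(mu, sigma^2) per asset, then squash to weights via Softmax.

    mu: shape (m_star,), linear output of the policy head (m assets + cash).
    Returns the weights a_t (summing to 1) and the joint log-probability of x_t
    from which PPO's ratio r_t(theta) is built.
    """
    dist = Normal(mu, sigma)
    x_t = dist.sample()
    log_prob = dist.log_prob(x_t).sum()   # joint log-prob over the m* components
    a_t = torch.softmax(x_t, dim=0)       # reallocation weights summing to 1.0
    return a_t, log_prob
```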
Figure 5: The policy network ($\theta'$) of the Strategic Agent Module, designed to accommodate the PPO algorithm. After $x_t \in \mathbb{R}^{m^* \times 1}$ is sampled from the normal distributions $N(\mu_t, \sigma^2)$, we calculate the log probability of $x_t$ and then get the reallocation weights $a_t = \mathrm{Softmax}(x_t)$. A ReLU activation function follows every convolutional layer except the last one. f is the number of features, $m^*$ is the number of assets in the portfolio plus cash, and n = 50 is the fixed rolling-window length.

Figure 6: Allocation weights drift due to the assets' price fluctuations.

Reallocation Weights

The action the PPO agent takes at each time step t is

$$a_t = (a_{0,t}, a_{1,t}, \ldots, a_{m,t})^T \tag{4}$$

the vector of reallocation weights at time-step t, with $\sum_{i=0}^{m} a_{i,t} = 1$. Figure 6 illustrates how these weights drift as prices fluctuate. After the assets are reallocated by $a_t$, the allocation weights of the portfolio at the end of time step t eventually become

$$w_t = \frac{y_t \odot a_t}{y_t \cdot a_t} \tag{5}$$

due to the price fluctuation during the time-step period, where

$$y_t = \frac{v_t^{+(close)}}{v_{t-1}^{+(close)}} = \left(1, \frac{v_{1,t}^{+(close)}}{v_{1,t-1}^{+(close)}}, \ldots, \frac{v_{m,t}^{+(close)}}{v_{m,t-1}^{+(close)}}\right)^T \tag{6}$$

is the relative price vector, namely the asset price changes over time, involving the prices of the assets and cash. $v_{i,t}^{+(close)}$ denotes the closing price of the i-th asset at time t, where $i \in \{1, \ldots, m\}$ and m is the number of assets in the portfolio. A code sketch of this drift follows below.
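Eq. (5) and Eq. (6) in code, as a small NumPy sketch; the first component of both vectors is cash.

```python
import numpy as np

def drifted_weights(a_t: np.ndarray, y_t: np.ndarray) -> np.ndarray:
    """Eq. (5): weights after intra-period price moves, w_t = (y_t * a_t) / (y_t . a_t).

    a_t: reallocation weights chosen at time step t (cash first).
    y_t: relative price vector, 1.0 for cash, close_t / close_{t-1} per asset.
    """
    return (y_t * a_t) / np.dot(y_t, a_t)

# Example: 20% cash, two assets that move +5% and -3% within the period.
w_t = drifted_weights(np.array([0.2, 0.4, 0.4]), np.array([1.0, 1.05, 0.97]))
```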
Risk-adjusted Periodic Rate of Return

The Risk-adjusted Periodic (daily) Rate of Return is:

$$r^*_t(s_t, a_t) = \ln\left(a_t \cdot y_t - \beta\sum_{i=0}^{m}|a_{i,t} - w_{i,t}| - \phi\sigma_t\right) \tag{7}$$

where $w_t$ represents the allocation weights of the assets at the end of time step t [Jiang et al., 2017] [Liang et al., 2018], and

$$\beta\sum_{i=0}^{m}|a_{i,t} - w_{i,t}| \tag{8}$$

is the transaction cost, where $\beta = 0.002$ is the commission rate; $\sigma_t$ measures the volatility of the assets' price fluctuations during the last 50 days, and $\phi$ is the risk discount, which can be fine-tuned as a hyperparameter. A code sketch of this reward follows at the end of this subsection.

Table 1: The EAM-Training dataset involves the historical prices ($s_t$) and news sentiments ($\rho_t$) of all 5 assets in both portfolios (a) and (b), and is used to train the AAPL-based Foundational EAM and then to transfer-learn the other 4 assets. The EAM-Predicting dataset involves new historical prices ($s'_t$) and news sentiments ($\rho^*_t$), from which the EAMs generate signal-comprised tensors ($s^*_t = (s'_t, a^*_t)$) to formalize the SAM-Training ($v^+$) datasets. The SAM-Testing dataset has the same structure as SAM-Training but is used solely for back-testing.

Experiments

We propose two portfolios in this paper: Portfolio(a), which involves three stocks [AAPL, AMD, GOOGL], and Portfolio(b), which involves another three stocks [GOOGL, NVDA, TSLA]. To build Portfolio(a) and Portfolio(b), we train two SAMs: SAM(a) and SAM(b). It is also worth mentioning that the two SAMs share the same EAM for the stock they have in common (GOOGL).
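Returning to the reward of Eq. (7), here is a minimal NumPy sketch; the cost term follows Eq. (8), and φ = 0.05 mirrors one of the risk-discount settings used in the validation experiments, though all names are ours.

```python
import numpy as np

def risk_adjusted_return(a_t, y_t, w_t, sigma_t, beta=0.002, phi=0.05):
    """Eq. (7): log portfolio growth net of transaction cost and a volatility penalty.

    a_t:     new reallocation weights (cash first), summing to 1
    y_t:     relative price vector, 1.0 for cash
    w_t:     weights the portfolio drifted to (Eq. 5)
    sigma_t: 50-day price volatility; phi is the risk discount
    """
    cost = beta * np.sum(np.abs(a_t - w_t))          # transaction cost, Eq. (8)
    return np.log(np.dot(a_t, y_t) - cost - phi * sigma_t)
```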
Datasets
Stock historical prices are Quandl End-Of-Day data from 2013-01-01 to 2020-12-31, fetched using Quandl's API. We also use web news sentiment data (FinSentS) from Quandl, provided by InfoTrie; on the stock AAPL, FinSentS and the news sentiments generated by FinBERT are close. Due to the restrictions of APIs and the limitations of web scraping, the news sentiment data utilized for the EAMs in the following experiments rely mainly on FinSentS.
Performance Metrics

In addition to the daily rate of return, two further performance metrics are compared:
1. Daily Rate of Return (DRR)

$$\bar{r}_T = \frac{1}{T}\sum_{t=1}^{T}\exp(r_t) \tag{9}$$

where T is the terminal time step and $r_t$ is the risk-unadjusted periodic (daily) rate of return obtained at every time step.
2. Accumulated Rate of Return (ARR) [Ormos and Urbán, 2013]

$$p_T/p_0$$

where $p_T$ stands for the portfolio value at the terminal time step, $p_0$ is the portfolio value at the initial time step, and

$$p_T = p_0\exp\left(\sum_{t=1}^{T}r_t\right) = p_0\prod_{t=1}^{T}\beta_t\, a_t \cdot y_t \tag{10}$$
3. Sortino Ratio (SR)
The Sortino Ratio [Sortino and Price, 1994] is often referred to as a risk-adjusted return; it measures portfolio performance relative to a risk-free return after adjusting for the portfolio's downside risk. In our case, the Sortino Ratio is calculated as:

$$SR = \frac{\frac{1}{T}\sum_{t=1}^{T}\exp(r_t) - r_f}{\sigma_t^{downside}} \tag{11}$$

where $r_t$ is the risk-unadjusted periodic (daily) rate of return, $\sigma_t^{downside} = \sqrt{\mathrm{Var}_t(r^l_t - r_f)}$, $r_f$ is the risk-free return and conventionally equals zero, $r^l_t$ are the returns in $r_t$ less than 0, and t = T is the terminal time step.
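The three metrics in one NumPy sketch, following Eqs. (9)-(11) as reconstructed above; r is the vector of daily risk-unadjusted returns $r_t$, and $r_f$ defaults to zero as in the text.

```python
import numpy as np

def performance_metrics(r: np.ndarray, r_f: float = 0.0):
    """Compute DRR (Eq. 9), ARR (Eq. 10), and Sortino Ratio (Eq. 11)."""
    drr = np.mean(np.exp(r))                 # average daily growth factor
    arr = np.exp(np.sum(r))                  # p_T / p_0, costs folded into r
    r_l = r[r < 0.0]                         # downside returns only
    sigma_down = np.sqrt(np.var(r_l - r_f)) if r_l.size else np.nan
    sortino = (drr - r_f) / sigma_down
    return drr, arr, sortino
```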
Benchmarks

We compare the performance of our MSPM system to benchmarks that are classical portfolio investment strategies [Li and Hoi, 2014] [MLFinLab, 2021]:

• CRP stands for (Uniform) Constant Rebalanced Portfolio: invest an equal proportion of capital, namely 1/N, in each asset. This sounds quite simple but is, in fact, challenging to beat [DeMiguel et al., 2007]; a minimal back-test sketch follows this list.
• Buy and Hold (BAH) invests without rebalancing. Once the capital is invested, no further reallocation is made.
• Exponential Gradient Portfolio (EG) invests capital in the latest best-performing stock while using a regularization term to retain the previous portfolio information.
• Follow the Regularized Leader (FTRL) tracks the Best Constant Rebalanced Portfolio up to the previous period with an additional regularization term. FTRL reweights based on the entire history of the data with the expectation of having had the maximum returns.
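As referenced in the CRP bullet, a minimal uniform-CRP back-test, assuming a NumPy matrix of per-period relative prices; transaction costs are ignored for brevity, and the initial value mirrors the experiments.

```python
import numpy as np

def uniform_crp(relative_prices: np.ndarray, p0: float = 10_000.0) -> float:
    """Back-test the 1/N Constant Rebalanced Portfolio.

    relative_prices: (T, N) matrix of close_t / close_{t-1} per period and asset.
    Rebalances to equal weights every period; returns the final portfolio value.
    """
    n_assets = relative_prices.shape[1]
    b = np.full(n_assets, 1.0 / n_assets)   # 1/N weights, restored each period
    growth = relative_prices @ b            # per-period portfolio growth factors
    return p0 * float(np.prod(growth))
```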
Results

We set the initial portfolio value to $p_0 = 10{,}000$. The testing data consist of the past-50-day Profound States $v^+$ from 2020-03-03 to 2020-12-31. As shown in Figure 7 and Figure 8, the EAM-enabled SAMs outperform the benchmarks by at least 100% and 343% in terms of ARR on the two portfolios, respectively.

Figure 7: SAM(a) outperforms all benchmarks on Portfolio(a) in back-testing in terms of accumulated portfolio value.

Figure 8: SAM(b) outperforms all benchmarks on Portfolio(b) in back-testing in terms of accumulated portfolio value.

Validation of EAM
Since the EAMs provide the trading-signal-comprised information to the SAMs, we want to verify the EAM's indispensability by comparing the system performance with and without the EAMs. Figures 7 and 8 clearly show that the EAM-disabled SAMs, which share the same hyperparameters as the EAM-enabled SAMs, perform worse. Moreover, we train a set of SAMs for 500 episodes with learning rates of 1e-3 and 1e-4, risk discounts of 0.05 and 0.01, and a gamma of 0.99, for Portfolio(a). As the learning curves in Figure 9 show, the EAM-disabled SAMs never reach results as good as those of the EAM-enabled SAMs in terms of periodic rate of return. The results validate that only with the trading signals from the EAMs can the SAMs achieve ideal performance.
Figure 9: According to the learning curves, the EAM-enabled SAMs always learn better than the EAM-disabled SAMs, which validates the EAM's indispensability in our MSPM system.

Table 2: Back-testing results. For each strategy, the upper row reports Portfolio(a) and the lower row Portfolio(b); dashes mark missing values.

Portfolios (a) & (b)     DRR%    ARR%     SR
1. CRP                   0.29    75.90    4.41
                         0.49    156.81   –
2. Buy and Hold          0.31    80.18    4.29
                         0.63    230.57   5.75
3. FTRL                  0.30    77.95    4.14
                         0.55    184.47   5.65
4. EG                    0.31    79.47    4.19
                         0.61    215.17   5.69
5. MSPM-SAM              –       –        –

Conclusion

We propose MSPM, a modularized multi-agent reinforcement-learning-based system, aiming to bring scalability, reusability, and profundity of intake information to financial portfolio management. We prove that MSPM is qualified as a stepping stone to inspire more creative system designs for financial portfolio management through its novelty and outperformance over the existing benchmarks. Moreover, we confirm the necessity of the Evolving Agent Module, an essential component of the system.
References

[Araci, 2019] Dogu Araci. FinBERT: Financial sentiment analysis with pre-trained language models, 2019.

[DeMiguel et al., 2007] Victor DeMiguel, Lorenzo Garlappi, and Raman Uppal. Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy? The Review of Financial Studies, 22(5):1915–1953, 2007.

[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

[Hasselt et al., 2016] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2094–2100. AAAI Press, 2016.

[Jiang et al., 2017] Zhengyao Jiang, Dixing Xu, and Jinjun Liang. A deep reinforcement learning framework for the financial portfolio management problem, 2017.

[Lapan, 2018] Maxim Lapan. Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-Networks, Value Iteration, Policy Gradients, TRPO, AlphaGo Zero and More. Packt Publishing, 2018.

[Lee et al., 2020] Jinho Lee, Raehyun Kim, Seok-Won Yi, and Jaewoo Kang. MAPS: Multi-agent reinforcement learning-based portfolio management system. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, July 2020.

[Li and Hoi, 2014] Bin Li and Steven C. H. Hoi. Online portfolio selection: A survey. ACM Computing Surveys, 46(3), January 2014.

[Liang et al., 2018] Zhipeng Liang, Hao Chen, Junhao Zhu, Kangkang Jiang, and Yanran Li. Adversarial deep reinforcement learning in portfolio management, 2018.

[López et al., 2010] Vivian F. López, Noel Alonso, Luis Alonso, and María N. Moreno. A multiagent system for efficient portfolio management. In Yves Demazeau et al., editors, Trends in Practical Applications of Agents and Multiagent Systems, pages 53–60, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[Markowitz, 1952] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.

[MLFinLab, 2021] MLFinLab. Machine Learning Financial Laboratory, 2021.

[Mnih et al., 2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv:1312.5602, NIPS Deep Learning Workshop, 2013.

[Ormos and Urbán, 2013] Mihály Ormos and András Urbán. Performance analysis of log-optimal portfolio strategies with transaction costs. Quantitative Finance, 13(10):1587–1597, 2013.

[Rokach, 2010] Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33:1–39, 2010.

[Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.

[Sortino and Price, 1994] Frank A. Sortino and Lee N. Price. Performance measurement in a downside risk framework. The Journal of Investing, 3(3):59–64, 1994.

[Sycara et al., 1995] Katia Sycara, K. Decker, and Dajun Zeng. Designing a multi-agent portfolio management system. In Proceedings of the AAAI Workshop on Internet Information Systems, October 1995.

[Wang et al., 2016] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning, 2016.

[Ye et al., 2020] Yunan Ye, Hengzhi Pei, Boxin Wang, Pin-Yu Chen, Yada Zhu, Ju Xiao, and Bo Li. Reinforcement-learning based portfolio management with augmented asset movement prediction states. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.