MSPM: A Modularized and Scalable Multi-Agent Reinforcement Learning-based System for Financial Portfolio Management
Zhenhan Huang, Fumihide Tanaka∗
University of Tsukuba, Tsukuba, Japan
[email protected], [email protected]
∗ Corresponding Author

Abstract
Financial portfolio management is one of the most applicable problems in Reinforcement Learning (RL) owing to its sequential decision-making nature. Existing RL-based approaches, while inspiring, often lack the scalability, reusability, or profundity of intake information needed to accommodate the ever-changing capital markets. In this paper, we design and develop MSPM, a novel multi-agent reinforcement-learning-based system with a modularized and scalable architecture for portfolio management. MSPM involves two asynchronously updated units: the Evolving Agent Module (EAM) and the Strategic Agent Module (SAM). A self-sustained EAM produces signal-comprised information for a specific asset using heterogeneous data inputs, and each EAM is reusable, with connections to multiple SAMs. A SAM is responsible for the asset reallocation of a portfolio using the profound information from the connected EAMs. With this elaborate architecture and the multi-step condensation of volatile market information, MSPM aims to provide a customizable, stable, and dedicated solution to portfolio management that existing approaches do not. We also tackle the data-shortage issue of newly-listed stocks by transfer learning, and validate the necessity of EAM. Experiments on 8-year U.S. stock-market data prove the effectiveness of MSPM in profit accumulation through its outperformance over existing benchmarks.
Introduction

Portfolio management is a continuous process of reallocating capital into multiple assets [Markowitz, 1952], and it aims to maximize accumulated profits with an option to minimize the overall risks of the portfolio. To perform such a practice, portfolio managers who focus on stock markets conventionally read financial statements and balance sheets, pay attention to news from media and announcements from financial institutions such as central banks, and analyze stock price trends.
By the resemblance in the nature of the problem, researchers unsurprisingly want to adapt the Reinforcement Learning (RL) framework to portfolio management, and with the incorporation of Deep Learning (DL), they have built robust and valid deep RL-based methods [Jiang et al., 2017] [Ye et al., 2020] [Liang et al., 2018]. Akin to how portfolio managers receive information from various sources, existing approaches incorporate heterogeneous data [Ye et al., 2020]. Researchers have also proposed Multi-agent Reinforcement Learning (MARL) approaches [López et al., 2010] [Sycara et al., 1995] [Lee et al., 2020]. In [Lee et al., 2020], the authors design a system including a group of Deep Q-Network [Mnih et al., 2013] based agents, resembling independent investors making investment decisions, and end up with a diversified portfolio. However, while inspiring, these cutting-edge approaches lack the scalability, reusability, or profundity of intake information needed to accommodate the ever-changing markets. For example, in SARL [Ye et al., 2020], a general framework, the encoder's intake is either financial news data for embedding or stock prices for trading-signal generation; this lack of scalability prevents the encoder from efficiently producing holistic information and eventually limits what the RL-based agents can learn. Additionally, agents in existing multi-agent-based systems are mostly ad hoc and rarely reusable. As one of the first attempts, MAPS [Lee et al., 2020] is by its very nature a reinforcement-learning implementation of Ensemble Learning [Rokach, 2010], an implicit solution stacking multiple Deep Q-Network agents, which may not be as reliable as explicit solutions implementing policy gradient methods [Jiang et al., 2017].

In this paper, we design and develop MSPM, a novel multi-agent reinforcement-learning-based system with a modularized and scalable architecture for portfolio management. In MSPM, assets are vital and organic building blocks. This vitalness is reflected in that each asset has its exclusive module, which takes heterogeneous data and utilizes a Deep Q-Network (DQN) based agent to produce signal-comprised information. This module, as a dedicated system, is called the Evolving Agent Module (EAM). Once we have set up and trained the EAMs corresponding to the assets in a portfolio, we connect them to a decisive system: the Strategic Agent Module (SAM), a Proximal Policy Optimization (PPO) [Schulman et al., 2017] based agent. A SAM represents a portfolio and uses the profound information from the connected EAMs for asset reallocation. EAM and SAM are asynchronously updated, and an EAM possesses reusability, which means it can connect to multiple SAMs, and these connections can be arbitrary. With the power of parallel computing, we are able to perform capital reallocation for various portfolios at scale simultaneously. By experiment and comparison, we confirm the necessity of the Evolving Agent Module (EAM) and validate MSPM's outperformance over classical trading strategies in terms of both accumulated portfolio value and Sortino Ratio.
Our contribution can be listed as follows:
• To our knowledge, MSPM is the first approach to formalize a modularized and scalable multi-agent reinforcement learning system using signal-comprised information for financial portfolio management.
• We tackle the data-shortage issue of newly-listed stocks by transfer learning for RL-based systems, and revise the EIIE-style neural network architecture to better accommodate Actor-Critic methods.
• We validate that MSPM outperforms the benchmarks by at least 100% and 343% in terms of accumulated rate of return, under the extreme market conditions of U.S. stock markets from March to December 2020 during the global pandemic.

MSPM Overview

To address the portfolio management problem, we first assume a composition of minimal-viable market information: 1. historical prices from U.S. stock markets and 2. asset-related news from media, as the environment with which the agents in Evolving Agent Modules (EAMs) interact. Then, the combination of the signal-comprised information that the EAMs generate becomes the state that the decisive agents in SAMs observe. SAMs consequently reallocate the assets in the portfolios. Each EAM is reusable, which means we may connect it to any portfolios/SAMs once it is set up and trained, and a SAM is connected with at least one EAM. The relationship between EAMs and SAMs is illustrated in Figure 1 and sketched in code below. An EAM is retrained periodically using the latest information from the market, media, financial institutions, etc.; in this paper we implement only the former two sources of data and leave the remaining input sources as an open question for future study. In the next sections, we introduce the details of EAM and SAM.
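To make the module topology concrete, here is a minimal sketch of the EAM-to-SAM wiring described above. The class names and attributes are illustrative inventions, not the paper's API; the tickers mirror the two portfolios used later in the experiments, including the GOOGL EAM shared by both.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EvolvingAgentModule:
    """One EAM per asset: turns heterogeneous data into trading signals."""
    ticker: str

    def signal_tensor(self):
        # Placeholder for the 2-D signal-comprised tensor s*_t.
        raise NotImplementedError

@dataclass
class StrategicAgentModule:
    """One SAM per portfolio: consumes signals from the EAMs it connects to."""
    name: str
    eams: List[EvolvingAgentModule] = field(default_factory=list)

# EAMs are reusable: the GOOGL EAM feeds both portfolios, as in the paper.
eam_googl = EvolvingAgentModule("GOOGL")
sam_a = StrategicAgentModule("Portfolio(a)",
                             [EvolvingAgentModule("AAPL"),
                              EvolvingAgentModule("AMD"), eam_googl])
sam_b = StrategicAgentModule("Portfolio(b)",
                             [eam_googl, EvolvingAgentModule("NVDA"),
                              EvolvingAgentModule("TSLA")])
```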
Evolving Agent Module (EAM)

Figure 2 provides an overview of the EAM's architecture.
Data Sources

An Evolving Agent Module (EAM) has two data sources: 1. the asset's historical prices, and 2. asset-related financial news. The historical prices used for training EAMs are Quandl End-Of-Day data fetched through Quandl's API. The financial news data for each asset come from news websites (e.g., finance.yahoo.com) and social media (e.g., twitter.com) on a daily basis.
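As a hedged sketch of the price-data side of this pipeline, the snippet below pulls End-Of-Day bars with the quandl Python client over the date range used in the experiments; the EOD/AAPL dataset code is an assumption based on the End-Of-Day product named above, and a valid API key is required.

```python
import quandl

quandl.ApiConfig.api_key = "YOUR_API_KEY"  # placeholder credential

# Daily OHLCV history for one asset; columns include Open/High/Low/Close/Volume.
prices = quandl.get("EOD/AAPL",
                    start_date="2013-01-01",
                    end_date="2020-12-31")
print(prices.tail())
```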
Figure 1: The surjective relationship between EAMs and SAMs. Each Evolving Agent Module (EAM) is responsible for a single asset and involves a DQN agent and a FinBERT model, utilizing heterogeneous data to produce signal-comprised information. Each Strategic Agent Module (SAM) is a module for a portfolio; it involves a PPO agent and reallocates the assets using the stacked signal-comprised 3-D tensor (Profound State v+) from the connected EAMs. Moreover, trained EAMs are reusable across different portfolios and can be combined at will. With parallel computing, capital reallocation may be performed for various portfolios at scale simultaneously.

State Representation

The state that the agent in an EAM observes at any given periodic (daily) time-step t is a 2-D tensor $v_t$, stacked from the recent n-day historical prices $s_t$ and the 1-D tensor of sentiment polarities of the designated asset, $\rho_t$. Specifically,

$$v_t = (s_t, \rho_t) \tag{1}$$

where the sentiment polarities $\rho$, which range continuously from -1.0 to 1.0 to indicate bearishness (-1.0) or bullishness (1.0), are classified using the pre-trained FinBERT classifier [Araci, 2019] [Devlin et al., 2019] on asset-related financial news and then averaged. Furthermore, we feed the agent not only the news sentiments but also the news volumes, i.e., the number of news items used for generating the sentiment polarities of each asset. This method aims to alleviate the unbalanced-news issue found in existing research [Ye et al., 2020].

DQN Agent

We train a Deep Q-Network (DQN) agent [Mnih et al., 2013] for the sequential decision-making in the Evolving Agent Module. In Deep Q-Learning, the agent acts based on the state-action value function

$$Q_\theta(s_t, a_t) = \mathbb{E}_{\pi_\theta}\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k+1}\,\middle|\, s_t = s, a_t = a\right]$$

which we represent with a Residual Network with 1-D convolutions.
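The paper does not specify the Q-network layer-by-layer, so the following is a minimal sketch of one plausible residual block for the 1-D convolutional network, assuming PyTorch; the channel count and kernel size are illustrative.

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """One residual block of a 1-D convolutional Q-network."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # skip connection preserves the input signal

# v_t as a batch: (batch, features, n_days), e.g. price rows plus a sentiment row.
q_trunk = nn.Sequential(ResBlock1D(8), ResBlock1D(8))  # 8 channels is illustrative
```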
DQN Extensions
We also implement three extensions of the original DQN [Lapan, 2018]: the dueling architecture [Wang et al., 2016], Double DQN [Hasselt et al., 2016], and two-step Bellman unrolling. A sketch of the resulting training target follows below.
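As a hedged illustration of how two of these extensions interact in the training target: Double DQN decouples action selection (online network) from action evaluation (target network), and the two-step Bellman unrolling folds two immediate rewards into the bootstrap. Function and argument names are ours, assuming batched PyTorch Q-networks.

```python
import torch

@torch.no_grad()
def two_step_double_dqn_target(r_t, r_t1, s_t2, done, q_net, target_net, gamma=0.99):
    """Target = r_t + g*r_{t+1} + g^2 * Q_target(s_{t+2}, argmax_a Q_online(s_{t+2}, a)).

    Double DQN: the online net selects the action, the target net evaluates it.
    done flags episodes ending within the two-step window (handling simplified).
    """
    best_a = q_net(s_t2).argmax(dim=1, keepdim=True)         # online net picks action
    q_next = target_net(s_t2).gather(1, best_a).squeeze(1)   # target net evaluates it
    return r_t + gamma * r_t1 + (gamma ** 2) * q_next * (1.0 - done)
```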
Transfer Learning

Instead of training every EAM from scratch, we initially train a Foundational EAM using the historical prices of AAPL (Apple Inc.) and then train all other EAMs based on this pre-trained EAM. By doing so, the EAMs start with a prior pattern recognition of stock trends, and this approach also tackles the data-shortage issue of newly-listed stocks.
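A minimal sketch of the warm-start itself, assuming PyTorch modules; the paper does not publish this code, so the names are illustrative.

```python
import copy
import torch.nn as nn

def spawn_eam_network(foundational_net: nn.Module) -> nn.Module:
    """Clone the AAPL-trained Foundational EAM's Q-network to seed a new asset's EAM."""
    new_net = copy.deepcopy(foundational_net)  # same architecture, warm-started weights
    # Fine-tune new_net on the new asset's (possibly short) price history afterwards.
    return new_net
```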
Asset Trading Signal

The DQN agent in an EAM acts to trade the designated asset with an action of either buying, selling (i.e., closing the position), or skipping at every time step t. The chosen action, $a_t \in \{$buying, selling, skipping$\}$, is called the asset trading signal. As indicated by the actions, there is no short (selling) position, only a long (buying) position, and a new position can be opened only after an existing position has been closed. The reward $r_t(s_t, \iota_t)$ the agent receives at each time-step t is:

$$r_t(s_t, \iota_t) = \begin{cases} \sum_{i=t_l}^{t}\left(\dfrac{v_i^{(close)}}{v_{i-1}^{(close)}} - 1 - \beta\right), & \text{if } \iota_t \\ 0, & \text{if not } \iota_t \end{cases} \tag{2}$$

where $v_t^{(close)}$ is the close price of the given asset at time step t, $t_l$ is the time step when the long position was opened and commissions were deducted, $\beta$ stands for the commission of 0.002, and $\iota_t$ is the indicator of an open position (i.e., a position is still open).

As illustrated in Figure 2, after the EAMs are trained, we feed new historical prices $s'_t$ and financial news of the designated assets to generate predictive trading signals $a^*_t$. Then we stack the same new historical prices with $a^*_t$ to formalize a 2-D signal-comprised tensor $s^*_t$ as the data source to train SAM; one way to perform this stacking is sketched below.
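One plausible reading of the stacking step, assuming NumPy arrays; the exact feature layout of $s^*_t$ is not fully specified in our source.

```python
import numpy as np

def signal_comprised_tensor(new_prices: np.ndarray, signals: np.ndarray) -> np.ndarray:
    """Stack an n-day price window s'_t with the DQN signals a*_t into s*_t.

    new_prices: shape (features, n), e.g. close/high/low rows over n days
    signals:    shape (n,), per-day trading signal in {buy, sell, skip}
    """
    return np.vstack([new_prices, signals[np.newaxis, :]])  # shape (features + 1, n)
```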
Strategic Agent Module (SAM)

We now introduce the details of SAM; Figure 3 provides an overview of its architecture. A SAM is dependent on the EAMs it is connected with, as its data source is a combination of the signal-comprised information from these EAMs.
Profound State

The 2-D signal-comprised tensors from the connected EAMs are stacked and transformed into a 3-D signal-comprised tensor called the Profound State $v^+_t$, which is the state SAM observes at each time-step t. Figure 4 shows the abstract of the profound state $v^+_t$.

Figure 2: An EAM is a module for a designated asset. Each Evolving Agent Module (EAM) takes two types of heterogeneous data: 1. the designated asset's historical prices, and 2. asset-related financial news. At the center of the EAM is an extended DQN agent using a 1-D convolution ResNet for the sequential decision-making. Instead of training every EAM from scratch, we train EAMs by transfer learning from the Foundational EAM. At every time step t, the DQN agent in the EAM observes the state $v_t$ of historical prices $s_t$ and news sentiments $\rho_t$ of the designated asset, acts to trade with an action $a^*_t$ of either buying, selling, or skipping, and eventually generates a 2-D signal-comprised tensor $s^*_t$ using the new prices $s'_t$ and signals $a^*_t$.

PPO Agent

At the center of a SAM is a Proximal Policy Optimization (PPO) [Schulman et al., 2017] agent that performs the reallocation of assets. PPO is an actor-critic style policy gradient method and has been widely used on continuous-action-space problems because of its balance between good performance and ease of implementation. A policy $\pi_\theta$ is a parametrized mapping $S \times A \rightarrow [0, 1]$ from state space to action space. Among the different objective functions of PPO, we implement the Clipped Surrogate Objective [Schulman et al., 2017]:

$$L(\theta) = \hat{\mathbb{E}}_{\pi_{\theta'}}\left[\min\left(r_t(\theta)A_t^{\theta'},\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)A_t^{\theta'}\right)\right] \tag{3}$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta'}(a_t|s_t)}$ and $A_t^{\theta'}$ is the advantage function, expressed as $A_t^{\theta'} = Q_{\theta'}(s_t, a_t) - V_{\theta'}(s_t)$, where the state-action value function $Q_{\theta'}(s_t, a_t)$ is

$$Q_{\theta'}(s_t, a_t) = \mathbb{E}_{\pi_{\theta'}}\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k+1}\,\middle|\, s_t = s, a_t = a\right]$$

and the value function $V_{\theta'}(s_t)$ is

$$V_{\theta'}(s_t) = \mathbb{E}_{\pi_{\theta'}}\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k+1}\,\middle|\, s_t = s\right]$$

Figure 3: A SAM is a module for a portfolio. Each SAM takes the signal-comprised 3-D tensor, which is stacked and transformed from the 2-D tensors of the connected EAMs, and then generates the reallocation weights for the assets in the portfolio.

Figure 4: The input of the Strategic Agent Module, the Profound State $v^+_t \in \mathbb{R}^{f \times m^* \times n}$, is a 3-D tensor, where f is the number of features, $m^* = m + 1$ is the number of assets m in the portfolio plus cash, and n is the fixed rolling-window length.

For the PPO agent, we design a policy network architecture targeting the uniqueness of the continuous action space in financial portfolio management problems, inspired by the Ensemble of Identical Independent Evaluators (EIIE) topology [Jiang et al., 2017]. Since the assets' reallocation weights $a_t$ at time-step t are strictly required to sum to 1.0, we set up $m^*$ normal distributions $N_1(\mu^1_t, \sigma^2), \ldots, N_{m^*}(\mu^{m^*}_t, \sigma^2)$, where $m^* = m + 1$ and $\mu_t \in \mathbb{R}^{1 \times m^* \times 1}$ is the linear output of the last layer of the neural network, with a fixed standard deviation $\sigma$, and we sample from these distributions to obtain $x_t \in \mathbb{R}^{m^* \times 1}$. We eventually obtain the reallocation weights $a_t = \mathrm{Softmax}(x_t)$ and the log probability of $x_t$ for the PPO agent to learn; a sketch of this sampling head follows below.

Figure 5 shows the details of the Policy Network (Actor), namely $\theta'$, of the Strategic Agent Module. Due to their resemblance and equivalence, the architectures of the Value Network (Critic) and the Target Policy Network, namely $\theta$, are not illustrated.
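A hedged sketch of the sampling head, assuming PyTorch; the paper fixes σ but its value is treated as an assumption here (0.2 below).

```python
import torch
from torch.distributions import Normal

def reallocation_action(mu: torch.Tensor, sigma: float = 0.2):
    """Sample x_t ~ N(mu, sigma^2) per asset, then squash to weights via Softmax.

    mu: shape (m_star,), linear output of the policy head (m assets + cash).
    Returns the weights a_t (summing to 1) and the joint log-probability of x_t
    from which PPO's ratio r_t(theta) is built.
    """
    dist = Normal(mu, sigma)
    x_t = dist.sample()
    log_prob = dist.log_prob(x_t).sum()   # joint log-prob over the m* components
    a_t = torch.softmax(x_t, dim=0)       # reallocation weights summing to 1.0
    return a_t, log_prob
```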
Figure 5: The policy network ($\theta'$) of the Strategic Agent Module, designed to accommodate the PPO algorithm. After $x_t \in \mathbb{R}^{m^* \times 1}$ is sampled from the normal distributions $N(\mu_t, \sigma^2)$, we calculate the log probability of $x_t$ and then get the reallocation weights $a_t = \mathrm{Softmax}(x_t)$. A ReLU activation function follows every convolutional layer except the last one. f is the number of features, $m^*$ is the number of assets in the portfolio plus cash, and n = 50 is the fixed rolling-window length.

Figure 6: Allocation weights drift due to the assets' price fluctuations.

Reallocation Weights

The action the PPO agent takes at each time step t is

$$a_t = (a_{0,t}, a_{1,t}, \ldots, a_{m,t})^T \tag{4}$$

the vector of reallocation weights at time-step t, with $\sum_{i=0}^{m} a_{i,t} = 1$. Figure 6 illustrates how these weights drift as prices fluctuate. After the assets are reallocated by $a_t$, the allocation weights of the portfolio at the end of time step t eventually become

$$w_t = \frac{y_t \odot a_t}{y_t \cdot a_t} \tag{5}$$

due to the price fluctuation during the time-step period, where

$$y_t = \frac{v_t^{+(close)}}{v_{t-1}^{+(close)}} = \left(1, \frac{v_{1,t}^{+(close)}}{v_{1,t-1}^{+(close)}}, \ldots, \frac{v_{m,t}^{+(close)}}{v_{m,t-1}^{+(close)}}\right)^T \tag{6}$$

is the relative price vector, namely the asset price changes over time, involving the prices of the assets and cash. $v_{i,t}^{+(close)}$ denotes the closing price of the i-th asset at time t, where $i \in \{1, \ldots, m\}$ and m is the number of assets in the portfolio. A code sketch of this drift follows below.
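Eq. (5) and Eq. (6) in code, as a small NumPy sketch; the first component of both vectors is cash.

```python
import numpy as np

def drifted_weights(a_t: np.ndarray, y_t: np.ndarray) -> np.ndarray:
    """Eq. (5): weights after intra-period price moves, w_t = (y_t * a_t) / (y_t . a_t).

    a_t: reallocation weights chosen at time step t (cash first).
    y_t: relative price vector, 1.0 for cash, close_t / close_{t-1} per asset.
    """
    return (y_t * a_t) / np.dot(y_t, a_t)

# Example: 20% cash, two assets that move +5% and -3% within the period.
w_t = drifted_weights(np.array([0.2, 0.4, 0.4]), np.array([1.0, 1.05, 0.97]))
```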
Risk-adjusted Periodic Rate of Return

The Risk-adjusted Periodic (daily) Rate of Return is:

$$r^*_t(s_t, a_t) = \ln\left(a_t \cdot y_t - \beta\sum_{i=0}^{m}|a_{i,t} - w_{i,t}| - \phi\sigma_t\right) \tag{7}$$

where $w_t$ represents the allocation weights of the assets at the end of time step t [Jiang et al., 2017] [Liang et al., 2018], and

$$\beta\sum_{i=0}^{m}|a_{i,t} - w_{i,t}| \tag{8}$$

is the transaction cost, where $\beta = 0.002$ is the commission rate; $\sigma_t$ measures the volatility of the assets' price fluctuations during the last 50 days, and $\phi$ is the risk discount, which can be fine-tuned as a hyperparameter. A code sketch of this reward follows at the end of this subsection.

Table 1: The EAM-Training dataset involves the historical prices ($s_t$) and news sentiments ($\rho_t$) of all 5 assets in both portfolios (a) and (b), and is used to train the AAPL-based Foundational EAM and then to transfer-learn the other 4 assets. The EAM-Predicting dataset involves new historical prices ($s'_t$) and news sentiments ($\rho^*_t$), from which the EAMs generate signal-comprised tensors ($s^*_t = (s'_t, a^*_t)$) to formalize the SAM-Training ($v^+$) datasets. The SAM-Testing dataset has the same structure as SAM-Training but is used solely for back-testing.

Experiments

We propose two portfolios in this paper: Portfolio(a), which involves three stocks [AAPL, AMD, GOOGL], and Portfolio(b), which involves another three stocks [GOOGL, NVDA, TSLA]. To build Portfolio(a) and Portfolio(b), we train two SAMs: SAM(a) and SAM(b). It is also worth mentioning that the two SAMs share the same EAM for the stock they have in common (GOOGL).
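Returning to the reward of Eq. (7), here is a minimal NumPy sketch; the cost term follows Eq. (8), and φ = 0.05 mirrors one of the risk-discount settings used in the validation experiments, though all names are ours.

```python
import numpy as np

def risk_adjusted_return(a_t, y_t, w_t, sigma_t, beta=0.002, phi=0.05):
    """Eq. (7): log portfolio growth net of transaction cost and a volatility penalty.

    a_t:     new reallocation weights (cash first), summing to 1
    y_t:     relative price vector, 1.0 for cash
    w_t:     weights the portfolio drifted to (Eq. 5)
    sigma_t: 50-day price volatility; phi is the risk discount
    """
    cost = beta * np.sum(np.abs(a_t - w_t))          # transaction cost, Eq. (8)
    return np.log(np.dot(a_t, y_t) - cost - phi * sigma_t)
```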
Datasets
Stock historical prices are Quandl End-Of-Day data from 2013-01-01 to 2020-12-31, fetched using Quandl's API. We also use web news sentiment data (FinSentS) from Quandl, provided by InfoTrie; on the stock AAPL, FinSentS and the news sentiments generated by FinBERT are close. Due to the restrictions of APIs and the limitations of web scraping, the news sentiment data utilized for the EAMs in the following experiments rely mainly on FinSentS.
Performance Metrics

In addition to the daily rate of return, two further performance metrics are compared:
1. Daily Rate of Return (DRR)

$$\bar{r}_T = \frac{1}{T}\sum_{t=1}^{T}\exp(r_t) \tag{9}$$

where T is the terminal time step and $r_t$ is the risk-unadjusted periodic (daily) rate of return obtained at every time step.
2. Accumulated Rate of Return (ARR) [Ormos and Urbán, 2013]

$$p_T/p_0$$

where $p_T$ stands for the portfolio value at the terminal time step, $p_0$ is the portfolio value at the initial time step, and

$$p_T = p_0\exp\left(\sum_{t=1}^{T}r_t\right) = p_0\prod_{t=1}^{T}\beta_t\, a_t \cdot y_t \tag{10}$$
3. Sortino Ratio (SR)
The Sortino Ratio [Sortino and Price, 1994] is often referred to as a risk-adjusted return; it measures portfolio performance relative to a risk-free return after adjusting for the portfolio's downside risk. In our case, the Sortino Ratio is calculated as:

$$SR = \frac{\frac{1}{T}\sum_{t=1}^{T}\exp(r_t) - r_f}{\sigma_t^{downside}} \tag{11}$$

where $r_t$ is the risk-unadjusted periodic (daily) rate of return, $\sigma_t^{downside} = \sqrt{\mathrm{Var}_t(r^l_t - r_f)}$, $r_f$ is the risk-free return and conventionally equals zero, $r^l_t$ are the returns in $r_t$ less than 0, and t = T is the terminal time step.
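The three metrics in one NumPy sketch, following Eqs. (9)-(11) as reconstructed above; r is the vector of daily risk-unadjusted returns $r_t$, and $r_f$ defaults to zero as in the text.

```python
import numpy as np

def performance_metrics(r: np.ndarray, r_f: float = 0.0):
    """Compute DRR (Eq. 9), ARR (Eq. 10), and Sortino Ratio (Eq. 11)."""
    drr = np.mean(np.exp(r))                 # average daily growth factor
    arr = np.exp(np.sum(r))                  # p_T / p_0, costs folded into r
    r_l = r[r < 0.0]                         # downside returns only
    sigma_down = np.sqrt(np.var(r_l - r_f)) if r_l.size else np.nan
    sortino = (drr - r_f) / sigma_down
    return drr, arr, sortino
```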
Benchmarks

We compare the performance of our MSPM system to benchmarks that are classical portfolio investment strategies [Li and Hoi, 2014] [MLFinLab, 2021]:

• CRP stands for (Uniform) Constant Rebalanced Portfolio: invest an equal proportion of capital, namely 1/N, in each asset. This sounds quite simple but is, in fact, challenging to beat [DeMiguel et al., 2007]; a minimal back-test sketch follows this list.
• Buy and Hold (BAH) invests without rebalancing. Once the capital is invested, no further reallocation is made.
• Exponential Gradient Portfolio (EG) invests capital in the latest best-performing stock while using a regularization term to retain the previous portfolio information.
• Follow the Regularized Leader (FTRL) tracks the Best Constant Rebalanced Portfolio up to the previous period with an additional regularization term. FTRL reweights based on the entire history of the data with the expectation of having had the maximum returns.
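As referenced in the CRP bullet, a minimal uniform-CRP back-test, assuming a NumPy matrix of per-period relative prices; transaction costs are ignored for brevity, and the initial value mirrors the experiments.

```python
import numpy as np

def uniform_crp(relative_prices: np.ndarray, p0: float = 10_000.0) -> float:
    """Back-test the 1/N Constant Rebalanced Portfolio.

    relative_prices: (T, N) matrix of close_t / close_{t-1} per period and asset.
    Rebalances to equal weights every period; returns the final portfolio value.
    """
    n_assets = relative_prices.shape[1]
    b = np.full(n_assets, 1.0 / n_assets)   # 1/N weights, restored each period
    growth = relative_prices @ b            # per-period portfolio growth factors
    return p0 * float(np.prod(growth))
```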
Results

We set the initial portfolio value to $p_0 = 10{,}000$. The testing data consist of the past-50-day Profound States $v^+$ from 2020-03-03 to 2020-12-31. As shown in Figure 7 and Figure 8, the EAM-enabled SAMs outperform the benchmarks by at least 100% and 343% in terms of ARR on the two portfolios, respectively.

Figure 7: SAM(a) outperforms all benchmarks on Portfolio(a) in back-testing in terms of accumulated portfolio value.

Figure 8: SAM(b) outperforms all benchmarks on Portfolio(b) in back-testing in terms of accumulated portfolio value.

Validation of EAM
Since the EAMs provide the trading-signal-comprised information to the SAMs, we want to verify the EAM's indispensability by comparing the system performance with and without the EAMs. Figures 7 and 8 clearly show that the EAM-disabled SAMs, which share the same hyperparameters as the EAM-enabled SAMs, perform worse. Moreover, we train a set of SAMs for 500 episodes with learning rates of 1e-3 and 1e-4, risk discounts of 0.05 and 0.01, and a gamma of 0.99, for Portfolio(a). As the learning curves in Figure 9 show, the EAM-disabled SAMs never reach results as good as those of the EAM-enabled SAMs in terms of periodic rate of return. The results validate that only with the trading signals from the EAMs can the SAMs achieve ideal performance.
Figure 9: According to the learning curves, the EAM-enabled SAMs always learn better than the EAM-disabled SAMs, which validates the EAM's indispensability in our MSPM system.

Table 2: Back-testing results. For each strategy, the upper row reports Portfolio(a) and the lower row Portfolio(b); dashes mark missing values.

Portfolios (a) & (b)     DRR%    ARR%     SR
1. CRP                   0.29    75.90    4.41
                         0.49    156.81   –
2. Buy and Hold          0.31    80.18    4.29
                         0.63    230.57   5.75
3. FTRL                  0.30    77.95    4.14
                         0.55    184.47   5.65
4. EG                    0.31    79.47    4.19
                         0.61    215.17   5.69
5. MSPM-SAM              –       –        –

Conclusion

We propose MSPM, a modularized multi-agent reinforcement-learning-based system, aiming to bring scalability, reusability, and profundity of intake information to financial portfolio management. We prove that MSPM is qualified as a stepping stone to inspire more creative system designs for financial portfolio management through its novelty and outperformance over the existing benchmarks. Moreover, we confirm the necessity of the Evolving Agent Module, an essential component of the system.
References

[Araci, 2019] Dogu Araci. FinBERT: Financial sentiment analysis with pre-trained language models, 2019.

[DeMiguel et al., 2007] Victor DeMiguel, Lorenzo Garlappi, and Raman Uppal. Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy? The Review of Financial Studies, 22(5):1915–1953, 2007.

[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

[Hasselt et al., 2016] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2094–2100. AAAI Press, 2016.

[Jiang et al., 2017] Zhengyao Jiang, Dixing Xu, and Jinjun Liang. A deep reinforcement learning framework for the financial portfolio management problem, 2017.

[Lapan, 2018] Maxim Lapan. Deep Reinforcement Learning Hands-On: Apply Modern RL Methods, with Deep Q-Networks, Value Iteration, Policy Gradients, TRPO, AlphaGo Zero and More. Packt Publishing, 2018.

[Lee et al., 2020] Jinho Lee, Raehyun Kim, Seok-Won Yi, and Jaewoo Kang. MAPS: Multi-agent reinforcement learning-based portfolio management system. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, July 2020.

[Li and Hoi, 2014] Bin Li and Steven C. H. Hoi. Online portfolio selection: A survey. ACM Computing Surveys, 46(3), January 2014.

[Liang et al., 2018] Zhipeng Liang, Hao Chen, Junhao Zhu, Kangkang Jiang, and Yanran Li. Adversarial deep reinforcement learning in portfolio management, 2018.

[López et al., 2010] Vivian F. López, Noel Alonso, Luis Alonso, and María N. Moreno. A multiagent system for efficient portfolio management. In Yves Demazeau et al., editors, Trends in Practical Applications of Agents and Multiagent Systems, pages 53–60, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[Markowitz, 1952] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.

[MLFinLab, 2021] MLFinLab. Machine Learning Financial Laboratory, 2021.

[Mnih et al., 2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv:1312.5602, NIPS Deep Learning Workshop, 2013.

[Ormos and Urbán, 2013] Mihály Ormos and András Urbán. Performance analysis of log-optimal portfolio strategies with transaction costs. Quantitative Finance, 13(10):1587–1597, 2013.

[Rokach, 2010] Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33:1–39, 2010.

[Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.

[Sortino and Price, 1994] Frank A. Sortino and Lee N. Price. Performance measurement in a downside risk framework. The Journal of Investing, 3(3):59–64, 1994.

[Sycara et al., 1995] Katia Sycara, K. Decker, and Dajun Zeng. Designing a multi-agent portfolio management system. In Proceedings of the AAAI Workshop on Internet Information Systems, October 1995.

[Wang et al., 2016] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning, 2016.

[Ye et al., 2020] Yunan Ye, Hengzhi Pei, Boxin Wang, Pin-Yu Chen, Yada Zhu, Ju Xiao, and Bo Li. Reinforcement-learning based portfolio management with augmented asset movement prediction states. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.