Understanding algorithmic collusion with experience replay
Bingyan Han ∗ February 18, 2021
Abstract
In an infinitely repeated pricing game, pricing algorithms based on artificial intelligence (Q-learning) may consistently learn to charge supra-competitive prices even without communication. Although concerns on algorithmic collusion have arisen, little is known on the underlying factors. In this work, we experimentally analyze the dynamics of algorithms with three variants of experience replay. Algorithmic collusion still has roots in human preferences. Randomizing experience yields prices close to the static Bertrand equilibrium, and higher prices are easily restored by favoring the latest experience. Moreover, relative performance concerns also stabilize the collusion. Finally, we investigate scenarios with heterogeneous agents and test robustness on various factors.
Keywords:
Bertrand oligopoly, algorithmic collusion, experience replay, reinforcement learning, deep Q-learning, relative performance.
With the digitalization of the economy and the advances in data analytics, firms are increasingly handing key manual decisions such as product pricing over to computers (Fisher et al., 2018; Miklós-Thal and Tucker, 2019; Hansen et al., 2020). However, the sophistication and power of algorithms have also led to another prominent concern on the possibility of collusion. Pricing algorithms may be advanced enough to learn that it is optimal to collude (Ezrachi and Stucke, 2016). Although many are skeptical and regard autonomous collusion as mere science fiction, recent experimental research (Waltman and Kaymak, 2008; Klein, 2019; Calvano et al., 2020; Hansen et al., 2020) suggests that dynamic pricing algorithms can learn collusive strategies from scratch, even without human guidance or communication with each other. On the empirical side, there are a few, albeit

∗ Division of Science and Technology, BNU-HKBU United International College, Zhuhai, China, [email protected]

rank experience replay. Remarkably, it recovers supra-competitive prices in much shorter runs, while the reward-punishment scheme is not realized under the deep Q-learning setting.

With an illustration of the destabilization and the stabilization, it is not surprising that algorithmic collusion still has roots in our human preferences, such as up-to-date experience and relative performance evaluation (RPE). In particular, by breaking the local trends in serially generated learning data, we eliminate the collusive strategies. This conclusion is supported by the comparison between online and random experience replay, for both classic and deep Q-learning. Second, relative performance concerns facilitate the coordination between competitors. More data do not automatically mean better outcomes. By filtering the information acquired, agents easily detect that supra-competitive prices are more profitable. Rank experience replay significantly accelerates the speed of learning, compared with Calvano et al. (2020). Nevertheless, the downside of rank experience replay is that it distracts the algorithms from monopoly prices if the economic environment is asymmetric, since it also aims to achieve the closest rewards for each firm. See Section 6.1 for details. Overall, we may believe that firms are more likely to adopt online and rank experience replay, which have more realistic economic motivations. Random experience replay is a bit unnatural and contrived, motivated mainly by a pure algorithmic concern to obtain uncorrelated data. Finally, from another perspective, algorithmic collusion is also vulnerable. By simply modifying a few lines of the code, the outcomes are dramatically different. Therefore, when necessary, authorities should audit and test firms' pricing algorithms to identify tacit collusion. Unlike human collusion, which is difficult to probe since people may lie about their considerations, algorithms can be tested and reviewed thoroughly and openly.

Since the collusion depends on coordination, we have also considered player heterogeneity in Section 5. Achieving supra-competitive prices is harder and unstable in the case considered in Section 5.1. But in general, it is still possible to realize anti-competitive prices even with heterogeneous players. Moreover, we find that rank experience replay also stabilizes the training progression and ensures the presence of higher prices. Importantly, we discover that an agent with deep Q-learning and random experience replay, who learns competitive prices when facing a homogeneous agent, turns out to charge higher prices when facing a classic Q-learning agent.
It raises a tricky question on the responsibility of the deep agent for supra-competitive prices. See Section 5.1 for elaboration.

The rest of the paper is organized as follows. Section 2 presents the Bertrand oligopoly economic setting together with a concrete review of classic Q-learning and deep Q-learning. Section 3 destabilizes the algorithmic collusion with random experience replay. Section 4 recovers the supra-competitive prices with online and rank experience replay. Section 5 discusses several scenarios with heterogeneous players. Concerns on robustness are addressed from several aspects in Section 6. Section 7 concludes with questions for future research. The code is publicly available at the GitHub repository https://github.com/hanbingyan/collusion for replication of these findings.

To provide a fair comparison and validate the destabilization/stabilization results, we adopt the same economic environment as in Calvano et al. (2020). Consider an infinitely repeated pricing game with n differentiated products and an outside good. In a Bertrand oligopoly setting, firms compete with each other by controlling the prices of their products. At each period t, the demand q_{i,t} for product i follows a logit model given by

q_{i,t} = \frac{e^{(b_i - p_{i,t})/\mu}}{\sum_{j=1}^{n} e^{(b_j - p_{j,t})/\mu} + e^{b_0/\mu}},    (2.1)

where p_{i,t} is the price and b_i represents the product quality index. The constant \mu measures horizontal differentiation between products. Product 0 is the outside good. We refer interested readers to Calvano et al. (2020, Section II.A.) for further motivation on the logit demand model (2.1). Consequently, the reward for firm i at period t is r_{i,t} = (p_{i,t} - c_i) q_{i,t}, where c_i is the constant marginal cost. For simplicity, suppose all firms stay active during the whole repeated pricing game.

Consider the game for one period first. If each firm only maximizes its own profit separately, the derived price vector in equilibrium, denoted p^N, is called the Bertrand-Nash equilibrium. On the contrary, if all firms unite and maximize the aggregate profits, they obtain a monopoly price vector p^M and achieve higher rewards. Theoretically, firms can set prices with continuous values. However, algorithms such as Q-learning typically require a finite action space. We consider the same feasible action space as in Calvano et al. (2020), denoted A, with m equally spaced points from the interval [p^N - \xi(p^M - p^N), p^M + \xi(p^M - p^N)], controlled by a parameter \xi > 0. Suppose all firms use the same action space A.
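To make the environment concrete, the following minimal Python sketch computes the logit demands (2.1) and the per-period profits r_{i,t}. The parameter values shown (two symmetric firms, b_i - c_i = 1, b_0 = 0, \mu = 0.25) anticipate the baseline setting used later; the function names are ours and the snippet is illustrative rather than the paper's code.

```python
import numpy as np

def logit_demand(prices, b, b0, mu):
    """Logit demand (2.1): one share per firm, plus an outside good with index b0."""
    utils = np.exp((b - prices) / mu)        # e^{(b_i - p_i)/mu} for each firm
    denom = utils.sum() + np.exp(b0 / mu)    # add the outside good e^{b0/mu}
    return utils / denom

def profits(prices, b, b0, mu, costs):
    """Per-period rewards r_i = (p_i - c_i) * q_i."""
    q = logit_demand(prices, b, b0, mu)
    return (prices - costs) * q

# Baseline-style parameters (two symmetric firms).
b = np.array([2.0, 2.0])      # quality indices, so that b_i - c_i = 1
costs = np.array([1.0, 1.0])  # marginal costs c_i = 1
b0, mu = 0.0, 0.25            # outside good and horizontal differentiation

print(profits(np.array([1.473, 1.473]), b, b0, mu, costs))  # profits near the Bertrand-Nash prices
```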
For an infinitely repeated game, each firm i faces the problem of maximizing its own discounted return

E[ \sum_{t=0}^{\infty} \gamma^t r_{i,t} ],    (2.2)

where 0 < \gamma < 1 is the discount factor. The environment is summarized by a state s_t \in S in every period t = 0, 1, 2, ..., containing the current information of the environment. For simplicity, we assume all agents observe the common state, and partial information is not considered. Similar to Calvano et al. (2020), the state space S is the set of all past prices in the last k periods and is therefore finite with |S| = m^{nk}. The rationale for this specification is explained in Calvano et al. (2020, Section II.C). Firms then choose actions (i.e., prices) according to their policies \pi_i, conditioning on the observed state. A policy is simply a mapping from the state space S to the action space A. Prices are publicly observable by the firms, while the pricing policy \pi_i is undisclosed. After prices are selected simultaneously, firms obtain rewards individually and the environment moves on to the next state s_{t+1}.

We define the optimal action-value function Q^{*,i}(s, a_i) for firm i as the maximum expected payoff achievable by following any policy \pi_i, after observing state s and then taking some action a_i \in A:

Q^{*,i}(s, a_i) = \max_{\pi_i} E[ \sum_{t=0}^{\infty} \gamma^t r_{i,t} \mid s, a_i, \pi_i ].    (2.3)

Denote any greedy policy that achieves the maximum in (2.3) as \pi^{*,i}. From now on, for notational simplicity, we omit i whenever the quantities apply to any firm i and it is clear from the context. We highlight that any firm can observe competitors' actions and rewards, but not the greedy policies represented by matrices or functions.

Crucially, Q^*(s, a) satisfies the Bellman equation

Q^*(s, a) = E[ r + \gamma \max_{a' \in A} Q^*(s', a') \mid s, a ],    (2.4)

where r is the reward for one period and s' is the next state observed after taking action a under state s. Algorithms are distinguished from each other by how they utilize the Bellman equation (2.4) to learn the optimal action-value function Q^*(s, a) and the implied greedy policies \pi^*. In this paper, we mainly focus on two algorithms: the classic Q-learning algorithm (Watkins, 1989) adopted in Calvano et al. (2020), and the deep Q-learning algorithm (Mnih et al., 2015). These algorithms were originally proposed for the single-agent setting. We first extend them to the multi-agent setting in the simplest way and then discuss the main ideas to destabilize or stabilize the collusion. In this paper, we use firms/players/agents interchangeably.

Classic Q-learning algorithms use consecutive samples to learn the optimal action-value function Q^*(s, a), which is indeed an |S| x |A| matrix since the action space and the state space are finite. Starting from a given initial matrix Q_0, during period t, each agent selects an action a_t under the current state s_t, and observes the reward r_t and the next state s_{t+1}. The following equation iteratively updates the corresponding cell s = s_t, a = a_t of the matrix Q_t(s, a), by setting

Q_{t+1}(s_t, a_t) = (1 - \alpha) Q_t(s_t, a_t) + \alpha [ r_t + \gamma \max_{a' \in A} Q_t(s_{t+1}, a') ].    (2.5)

The parameter 0 \le \alpha \le 1 is the learning rate. Usually a relatively small \alpha is adopted, since large values tend to dismiss the information acquired too rapidly.
Other cells with s \neq s_t or a \neq a_t remain the same.

An arbitrary initial Q_0 may not contain much useful information about the true Q^*. The algorithm then faces a trade-off between experimenting with actions that are currently suboptimal (exploration) and continuing to exploit the information already obtained (exploitation). Therefore, an \varepsilon-greedy policy is introduced, which follows the current greedy policy with probability 1 - \varepsilon_t and takes a purely random action with probability \varepsilon_t. We consider a time-declining exploration rate, exogenously set as

\varepsilon_t = e^{-\beta t},    (2.6)

with parameter \beta > 0. We highlight three features of this online updating scheme. First, each experience tuple (s_t, a_t, r_t, s_{t+1}) is used to update the Q-matrix in an online fashion. At the current period t, previous actions, rewards, and states before period t are discarded and no longer used for the learning process. Therefore, each observation has a direct impact on the Q-matrix only once. However, in practice, firms may store experience for a relatively long time. Second, only one state-action cell (s, a) = (s_t, a_t) is updated during a single learning step. The learning speed might be slow. Some structural information about the action-value functions could have been integrated into the modeling. Third, it is well known that trajectories of Markov decision processes have strong temporal correlations (Mnih et al., 2015; Zhang and Sutton, 2017; Fan et al., 2020), as illustrated in Figure 1. Under a baseline setting introduced later, we calculate the correlations between consecutive values of one agent's actions, in contrast to randomly sampled previous actions. To be more precise, at period t, the correlation between (a_{t-L}, a_{t-L+1}, ..., a_{t-1}) and (a_{t-L-1}, a_{t-L}, ..., a_{t-2}) is referred to as the online correlation in Figure 1. The random one is calculated with uniformly sampled sequences instead.

Clearly, consecutive samples exhibit local dependency, which is expected since the next action relies on the current state and the trained action-value functions. In contrast, correlations between random samples show no obvious pattern. Mnih et al. (2015) exploit this observation to break temporal correlations and obtain uncorrelated samples.
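For concreteness, here is a minimal sketch of the one-cell update (2.5) together with the \varepsilon-greedy exploration (2.6), written in Python. The state encoding, the learning rate value, and the helper names are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, gamma, alpha, beta = 15, 0.95, 0.15, 1e-5   # prices, discount, learning rate, exploration decay

# One-period memory (k = 1): the state is the pair of last-period actions, encoded as an integer.
def encode(a_mine, a_rival):
    return a_mine * m + a_rival

Q = np.zeros((m * m, m))             # |S| x |A| Q-matrix for one agent

def eps_greedy(state, t):
    eps = np.exp(-beta * t)          # time-declining exploration rate (2.6)
    if rng.random() < eps:
        return int(rng.integers(m))  # explore
    return int(np.argmax(Q[state]))  # exploit the current greedy action

def update(state, action, reward, next_state):
    # One-cell update (2.5); all other cells are untouched.
    target = reward + gamma * Q[next_state].max()
    Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target

s = encode(3, 7)            # example state: my last action was 3, the rival's was 7
a = eps_greedy(s, t=0)
# after observing the reward r and the next state s_next from the market:
# update(s, a, r, s_next)
```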
Figure 1: Correlations in time series of actions from classic Q-learning. For online correlations, the last 129 actions before each period are collected and the correlation between the first 128 and the last 128 actions is calculated (i.e., with time lag 1). For random correlations, two sequences with 128 actions each are randomly sampled from the recent 2000 actions before each period. We only report 2000 periods for one player during the learning process; the data for the other player share the same characteristics.
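The construction behind Figure 1 can be mimicked with a few lines of numpy. The window length and history size below follow the caption; the action series itself is simulated here just to keep the snippet self-contained, whereas in the paper it comes from the C-Online training path.

```python
import numpy as np

rng = np.random.default_rng(1)
actions = rng.integers(0, 15, size=5000)   # placeholder action path
L = 128

def online_corr(a, t):
    # Correlation between (a_{t-L}, ..., a_{t-1}) and (a_{t-L-1}, ..., a_{t-2}), i.e. lag-1 windows.
    x, y = a[t - L:t], a[t - L - 1:t - 1]
    return np.corrcoef(x, y)[0, 1]

def random_corr(a, t, history=2000):
    # Two windows of length L sampled uniformly from the last `history` actions before t.
    i, j = rng.integers(t - history, t - L, size=2)
    return np.corrcoef(a[i:i + L], a[j:j + L])[0, 1]

t = 4000
print(online_corr(actions, t), random_corr(actions, t))
```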
Deep Q-learning can resolve these issues of classic Q-learning. First, deep Q-learning approximates Q^*(s, a) with neural networks to improve the learning speed. Specifically, a Q-network Q(s, a; \theta) with weights \theta is a neural network function approximator of Q^*(s, a). To train Q-networks, Mnih et al. (2015) consider a technique called experience replay (Lin, 1992), inspired by neuroscientific discoveries on brains. Briefly speaking, at each period t, the agent's experience tuple (s_t, a_t, r_t, s_{t+1}) is stored in a replay memory buffer with fixed length. When performing updates, Mnih et al. (2015) first sample tuples uniformly at random from the replay memory. We refer to this sample selection method as random experience replay. As shown in Figure 1, randomization can generate uncorrelated samples and break the strong temporal correlations. Moreover, tuples are likely to be sampled several times, which improves data efficiency. Next, the sampled experience tuples are used as a mini-batch in the minimization of certain loss functions (mean-square errors, Huber losses) on the differences in the Bellman equation (2.4). To be more precise, the optimal yet unknown values r + \gamma \max_{a' \in A} Q^*(s', a') on the right-hand side of (2.4) are replaced by the approximate but known values r + \gamma \max_{a' \in A} Q(s', a'; \theta^-), where the parameters \theta^- are from some previous periods. The difference \delta between the two sides of (2.4) is defined as

\delta = Q(s, a; \theta) - [ r + \gamma \max_{a' \in A} Q(s', a'; \theta^-) ].    (2.7)

Let L(\delta; \theta) be the loss function for a set of \delta, calculated from the mini-batch via random experience replay. Stochastic gradient methods are utilized to update \theta. After several episodes, \theta^- is updated to the latest \theta. Therefore, unlike supervised learning, targets in deep Q-learning are not fixed and should be updated periodically. Algorithm 1 presents the details of deep Q-learning under a multi-agent setting. The Q-networks are trained over episodes and each episode contains a certain number of periods or iterations.
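A compressed PyTorch sketch of the replay buffer and the one-step loss built from (2.7) is given below. The buffer capacity, the batch size, and the use of the Huber loss mirror the text, while `q_net` and `target_net` stand for Q(.;\theta) and Q(.;\theta^-); all names and tensor conventions here are our own assumptions.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

buffer = deque(maxlen=2000)                      # replay memory with fixed capacity

def store(s, a, r, s_next):
    buffer.append((s, a, r, s_next))             # s, s_next: float tensors; a: int; r: float

def td_loss(q_net, target_net, batch_size=128, gamma=0.95):
    # Random experience replay: draw the mini-batch uniformly from the buffer.
    batch = random.sample(list(buffer), batch_size)
    s, a, r, s_next = zip(*batch)
    s, s_next = torch.stack(s), torch.stack(s_next)
    a, r = torch.tensor(a), torch.tensor(r)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)            # Q(s, a; theta)
    with torch.no_grad():                                           # target uses the frozen weights theta^-
        target = r + gamma * target_net(s_next).max(dim=1).values   # r + gamma * max_a' Q(s', a'; theta^-)
    return F.smooth_l1_loss(q_sa, target)                           # Huber loss on delta from (2.7)
```

A gradient step then follows (`optimizer.zero_grad(); loss.backward(); optimizer.step()`), and every few episodes the target network is refreshed with `target_net.load_state_dict(q_net.state_dict())`.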
Algorithm 1: Multi-agent deep Q-learning with random experience replay

For each agent i = 1, ..., n:
    Initialize a replay memory D_i with capacity N;
    Initialize the Q-network Q_i with Calvano et al. (2020, Equation (8));
    Initialize the target Q-network \hat{Q}_i with Calvano et al. (2020, Equation (8));
for episode = 1, ..., M do
    Initialize a random state s_1 = ((a^1_0, ..., a^n_0), ..., (a^1_{-k+1}, ..., a^n_{-k+1}));
    for period t = 1, ..., T do
        For each agent i, with probability 1 - \varepsilon_t select a greedy action a^i_t = arg max_a Q_i(s_t, a; \theta_i); otherwise select a random action a^i_t;
        Execute the actions (a^1_t, ..., a^n_t) together and let each agent observe all rewards r_{i,t};
        Set the state s_{t+1} = ((a^1_t, ..., a^n_t), ..., (a^1_{t-k+1}, ..., a^n_{t-k+1}));
        Store the transition (s_t, (a^1_t, ..., a^n_t), s_{t+1}, (r_{1,t}, ..., r_{n,t})) in every D_i;
        For each agent i, sample a random mini-batch of transitions independently from D_i;
        For each agent i, perform a gradient descent step on the individual loss function with respect to the weights \theta_i;
    end
    Every C episodes, reset \hat{Q}_i = Q_i;
end

2.4 Discussions

In the single-agent setting, there are theoretical guarantees on convergence for the two algorithms; see Watkins and Dayan (1992) for classic Q-learning and Fan et al. (2020) for deep Q-learning. However, in multi-agent Q-learning, when agents' actions are regarded as part of the state variables, the environment becomes non-stationary in the eyes of each agent. An agent's policy depends on the states and therefore on his rivals' policies, which are also changing over periods through learning or experimenting under \varepsilon-greedy policies. Therefore, multi-agent Q-learning currently lacks general convergence results, due to the technical difficulties arising from non-stationarity. In practice, convergence is verified only ex post. Luckily, the algorithms under the multi-agent setting investigated in this paper converge in practice, possibly thanks to the relatively simple economic environment with low-dimensional state and action spaces.

Why do we select deep Q-learning instead of other variants of Q-learning? First, to destabilize algorithmic collusion, we must highlight that the key ingredient is random experience replay rather than deep neural networks. In our experiments, we adopt a neural network with only one hidden layer. The Q-network is by no means deep. It only speeds up the cell updates by learning Q^* as a function. Later on, we show that the results are robust to the neural network design in various aspects. However, random experience replay plays an essential role in breaking temporal correlations and improving data efficiency by reusing experience tuples. Second, since algorithmic collusion depends on the methodology that firms adopt, it is reasonable to consider some popular methods and assume the firms are using or exploring them.
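Read as code, the outer structure of Algorithm 1 is a plain double loop over episodes and periods. The self-contained sketch below uses the baseline logit market as the environment step and leaves the gradient update (which would reuse the loss sketched above) as a comment; the state encoding, learning-rate choices, and every name here are our own assumptions, not the paper's code.

```python
import math
import random
from collections import deque

import torch
import torch.nn as nn

n, m, k, h = 2, 15, 1, 512                          # agents, actions, state memory, hidden width
M, T, C = 2000, 500, 50                             # episodes, periods per episode, target refresh

def make_qnet():
    return nn.Sequential(nn.Linear(n * k, h), nn.ReLU(), nn.Linear(h, m))

q_nets = [make_qnet() for _ in range(n)]
target_nets = [make_qnet() for _ in range(n)]
buffers = [deque(maxlen=2000) for _ in range(n)]

def play_period(actions):
    # Baseline logit market: prices on the grid 1.43 + 0.04 * action, b_i = 2, c_i = 1, b_0 = 0, mu = 0.25.
    prices = 1.43 + 0.04 * torch.tensor(actions, dtype=torch.float32)
    utils = torch.exp((2.0 - prices) / 0.25)
    shares = utils / (utils.sum() + math.exp(0.0 / 0.25))
    rewards = (prices - 1.0) * shares
    return rewards.tolist(), prices                  # next state = this period's prices (k = 1)

for episode in range(M):
    state = 1.43 + 0.04 * torch.randint(0, m, (n * k,)).float()   # random initial state
    for t in range(T):
        eps = math.exp(-1e-5 * (episode * T + t))                 # epsilon_t = e^{-beta t}
        actions = [random.randrange(m) if random.random() < eps
                   else int(q(state).argmax()) for q in q_nets]
        rewards, next_state = play_period(actions)
        for i in range(n):
            buffers[i].append((state, actions[i], rewards[i], next_state))
            # ...sample a mini-batch from buffers[i] and take a gradient step on the TD loss...
        state = next_state
    if episode % C == 0:                                          # every C episodes, reset the targets
        for tgt, q in zip(target_nets, q_nets):
            tgt.load_state_dict(q.state_dict())
```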
In the following two sections, to facilitate the comparison with Calvano et al. (2020), we consider the same baseline model setting as in Calvano et al. (2020) unless otherwise specified. Let the number of agents be n = 2, marginal costs c_i = 1, quality indices b_i - c_i = 1, outside good b_0 = 0, and \mu = 0.25. The discount factor is \gamma = 0.95, the number of feasible prices m = 15, \xi = 0.1, and the length of past prices in the state k = 1. Let the exploration rate parameter be \beta = 1 \times 10^{-5}, the middle value in Calvano et al. (2020). Under this specification, the Bertrand equilibrium price p^N is approximately 1.473 and the monopoly price p^M is close to 1.925. The action space of feasible prices is equally spaced and given as {1.43, 1.47, 1.51, ..., 1.95, 1.99} with a step size of 0.04. We encode the prices as action 0 for price 1.43, action 1 for price 1.47, etc. There are discretization errors and p^N, p^M are not exactly included in the action space. Besides, the state space has 225 elements.

Since certain features, like fluctuating between two consecutive prices, may be masked after averaging across different replications, we usually plot graphs for one particular session, unless otherwise specified, and describe the common characteristics found among all sessions. We first destabilize the collusion and then recover the supra-competitive prices.
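The reported values of p^N and p^M can be verified numerically: iterate best responses for the Bertrand-Nash price and maximize joint profit for the monopoly price. The scipy-based sketch below is our own verification under the baseline parameters, not code from the paper; it returns values near 1.473 and 1.925.

```python
import numpy as np
from scipy.optimize import minimize_scalar

b, c, b0, mu = 2.0, 1.0, 0.0, 0.25

def share(p_own, p_rival):
    u_own, u_riv = np.exp((b - p_own) / mu), np.exp((b - p_rival) / mu)
    return u_own / (u_own + u_riv + np.exp(b0 / mu))

def best_response(p_rival):
    res = minimize_scalar(lambda p: -(p - c) * share(p, p_rival),
                          bounds=(1.0, 3.0), method="bounded")
    return res.x

# Bertrand-Nash price: iterate best responses to the symmetric fixed point.
p = 2.0
for _ in range(100):
    p = best_response(p)
print("p_N ~", round(p, 3))          # approximately 1.473

# Monopoly price: both firms charge the same price and maximize joint profit.
res = minimize_scalar(lambda p: -2 * (p - c) * share(p, p), bounds=(1.0, 3.0), method="bounded")
print("p_M ~", round(res.x, 3))      # approximately 1.925
```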
3 Destabilizing collusion

Temporal correlations have been illustrated in Figure 1. As indicated by the blue dots, trajectories of one player's actions usually maintain local trends. In contrast, the randomly sampled actions have no obvious patterns and are close to uncorrelated data. This difference turns out to be crucial in Mnih et al. (2015) for single-agent deep Q-learning. Originally, Mnih et al. (2015) conjectured that strong temporal correlations might lead to poor local minima or even divergence in the single-agent case. Notably, for multi-agent reinforcement learning, strong temporal correlations have certain benefits in maintaining stationarity of the environment. Rivals' policies may not vary too rapidly, so the agent can trust recently observed experience tuples that are not obsolete. In contrast, multi-agent deep Q-learning with random experience replay is found to be highly non-stationary by Foerster et al. (2016); Leibo et al. (2017); Foerster et al. (2017). Therefore, in those works random experience replay is either disabled entirely or carefully modified. Overall, these findings inspire us to destabilize algorithmic collusion by breaking temporal correlations with random experience replay. Since the economic environment is relatively simple compared with complicated computer gaming scenarios, the algorithms may still be able to reach convergent results.

Before moving to the experience replay setting, we further motivate the discussion by modifying the classic Q-learning algorithm. Instead of updating the cell corresponding to the currently observed state, we randomly sample a previous experience tuple containing the action, state, and reward from a finite memory buffer and update that cell. This idea resembles the random experience replay technique but with a mini-batch size of one. Hereafter, we refer to this modification as the C-Random algorithm and to the classic Q-learning of Section 2.2 as the C-Online algorithm. Although it may be economically unrealistic to force the agents to update an outdated cell instead of the current observation, we believe this modification provides a benchmark and isolates the effect of random experience replay while keeping other factors unchanged. One can expect a cell to be visited more frequently, since randomization helps the C-Random algorithm escape from the local region of the current trajectory. However, a finite replay memory buffer prevents extensive random visits and guarantees convergence.

In Figure 2, we observe that the C-Random algorithm shifts the long-run prices to the lower side. It indicates that breaking the temporal correlations can destabilize the collusion. However, supra-competitive prices still constitute a considerable proportion. One possible reason is that the C-Random algorithm still updates one cell per period. An agent's policy is not altered enough and can generate paths similar to the C-Online algorithm. Besides, we discover that the high prices (>= 1.91) also appear more frequently after convergence. Randomization may also enable the C-Random algorithm to explore these prices more extensively compared with the C-Online algorithm. Nevertheless, we highlight that the proportions of the lower prices increase more significantly.
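A minimal sketch of the C-Random modification: the only change relative to the C-Online update is that the cell being updated comes from a randomly drawn stored tuple rather than the current transition. The buffer size follows the Figure 2 caption; the function names and the integer state encoding are ours.

```python
import random
from collections import deque
import numpy as np

memory = deque(maxlen=2250)          # finite memory buffer, as used for C-Random in Figure 2
Q = np.zeros((225, 15))              # |S| x |A| matrix with baseline sizes
alpha, gamma = 0.15, 0.95

def c_random_step(s, a, r, s_next):
    memory.append((s, a, r, s_next))
    # Update a randomly sampled past cell instead of the current one (mini-batch of size one).
    s_u, a_u, r_u, s_next_u = random.choice(memory)
    target = r_u + gamma * Q[s_next_u].max()
    Q[s_u, a_u] = (1 - alpha) * Q[s_u, a_u] + alpha * target
```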
43 1 .
47 1 .
51 1 .
55 1 .
59 1 .
63 1 .
67 1 .
71 1 .
75 1 .
79 1 .
83 1 .
87 1 .
91 1 .
95 1 . Price0.000.020.040.060.080.100.120.140.160.180.20 P e r c e n t C-OnlineC-RandomC-Rank
Figure 2: Distributions of long-run prices after convergence. The blue dots replicate the outcomes in Calvano et al. (2020) with the classic Q-learning algorithm. The C-Random algorithm employs a replay memory with size 2250. The C-Rank algorithm is based on relative performance concerns and is detailed later in Section 4.2. We run each algorithm for 500 replications and collect the last 20 observations for each replication. The figure only plots data for one player since the other player has similar results.
Motivated by the findings in Section 3.1, we explore the deep Q-learning framework with random experience replay (Mnih et al., 2015). For brevity, we refer to this renowned deep Q-learning Algorithm 1 as the D-Random algorithm. Consider a memory buffer size of 2000 and a mini-batch size of 128. Compared with the computer science literature that regularly sets buffer sizes around 10^6 (Mnih et al., 2015; Zhang and Sutton, 2017), ours is much smaller, since the action space and state space are significantly smaller. We assume the two D-Random players use the same configuration for their Q-networks: a fully connected neural network with one hidden layer and hidden size h = 512. To be more precise, the Q-network is a function f : R^{nk} -> R^m, from the state space to the Q-values of the m actions, given as

f(s) = W_2 \sigma(W_1 s + v_1) + v_2.    (3.1)

Here W_1 \in R^{h x nk}, v_1 \in R^h, W_2 \in R^{m x h}, and v_2 \in R^m are the network weights, and \sigma(u) = max{u, 0} is the rectified linear unit (ReLU) activation function, applied element-wise. Neural networks are commonly over-parameterized since the number of parameters greatly exceeds the input dimension nk. Initially, we set the action-value function implied by f to the one given in Calvano et al. (2020, Equation 8), which can be achieved by setting W_1, W_2, v_1 = 0 and v_2 as in Calvano et al. (2020, Equation 8). This is the main motive for not fixing v_2 at zero. The robustness of our neural network design is checked in Section 6.2. This design does not rely on any fancy tricks and is commonly used in the deep learning literature. The Q-network is capable of generating greedy policies with flexible structures. However, our trained Q-networks consistently yield constant greedy policy matrices.

Following Mnih et al. (2015), we train the Q-networks for a given number of episodes and do not specify convergence criteria as in Calvano et al. (2020). Greedy policy matrices under deep Q-learning usually fluctuate between several consecutive prices after convergence, which is illustrated later in Figure 3b. We fix the number of episodes to 2000. Each episode contains 500 periods. Table 1 reports descriptive statistics of the long-run prices during the last episode for the two players, labeled Player 0 and Player 1. Besides, each episode yields 500 greedy policy matrices per player. However, each matrix turns out to be constant. Thus, Table 1 also presents statistics for these matrices. Both players select prices 1.47 and 1.51 most frequently. The highest is 1.63, with a relatively low frequency. The two players do not charge the same price all the time, but their choices are close to each other. The characteristics reported in Table 1 are pervasive in all replications we have run.

Price                1.43    1.47    1.51    1.55    1.59    1.63
Percent (Player 0)   17.4%   19.8%   38.0%   17.4%   5.8%    1.6%
Percent (Player 1)   16.8%   25.0%   33.0%   19.4%   5.0%    0.8%

Relative price   Player 0 > Player 1   Player 0 = Player 1   Player 0 < Player 1
Percent          40.0%                 24.0%                 36.0%

Table 1: Statistics of long-run prices for two D-Random players during the last episode.

Figure 3a further illustrates the evolution of selected prices during the entire training with several quantiles. In the beginning, the proportion of prices not exceeding 1.51 is only 20%. Since there are 15 feasible prices and three of them satisfy this threshold, the starting percentage is not far from equally assigned probabilities. The proportion rises steadily and eventually reaches 80%. Sharp fluctuations during the learning process may be due to random exploration.
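The one-hidden-layer network in (3.1) is only a few lines of PyTorch. The constant initialization that reproduces Calvano et al. (2020, Equation 8) is indicated schematically: the bias values used below are placeholders rather than the exact vector of initial Q-values.

```python
import torch
import torch.nn as nn

n, k, m, h = 2, 1, 15, 512          # agents, state memory, actions, hidden width

class QNet(nn.Module):
    """f(s) = W2 * relu(W1 s + v1) + v2, mapping a state in R^{nk} to m Q-values."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(n * k, h)
        self.fc2 = nn.Linear(h, m)

    def forward(self, s):
        return self.fc2(torch.relu(self.fc1(s)))

net = QNet()
with torch.no_grad():               # start from W1 = W2 = v1 = 0 and v2 = initial Q-values
    net.fc1.weight.zero_(); net.fc1.bias.zero_(); net.fc2.weight.zero_()
    net.fc2.bias.copy_(torch.full((m,), 1.0))   # placeholder constant instead of Calvano et al. (2020, Eq. 8)
```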
Remarkably, the D-Random algorithms seldom select extremely high prices during the whole training period. The initial greedy policies select price 1.59 under the initialization given in Calvano et al. (2020, Equation 8). Importantly, we find the results are robust to the method of initialization; see Section 6.4. Even when the initial greedy choice is set to the highest feasible price, the outcomes remain close to the Bertrand equilibrium. Figure 3b shows that the highest selected prices in each episode decrease slightly. Moreover, the frequencies of these highest prices are typically low. Table 1 provides detailed numbers for the last episode. Proportions of high prices vanish quickly under the D-Random algorithms. Finally, the most frequent prices fluctuate between 1.47 and 1.51, close to the one-shot Bertrand equilibrium.

Figure 4 demonstrates the frequencies of prices at the start and the end of the learning process. Figure 4 agrees with Figure 3a on the trend of prices within several thresholds. Figure 4b also shows the detailed percentages for the highest and the most frequent prices during the last several episodes, consistent with Figure 3b. From Table 1 and Figures 3 and 4, we can conclude that the D-Random algorithm obtains prices close to the Bertrand equilibrium and destabilizes the algorithmic collusion discovered under the C-Online method. Moreover, we also adopt the same idea as in Calvano et al. (2020): step in after convergence, exogenously force one player to defect, and check for any punishment from the rival. We observe that both players immediately return to the long-run prices at the next period after the deviation.
Figure 3: Evolution of selected prices in each episode: (a) percentages of prices not exceeding 1.51 and 1.71; (b) the highest and the most frequent prices. Only data for one player are shown; the other player exhibits a similar pattern.
Figure 4: Distributions of selected prices in each episode: (a) the first 30 episodes; (b) the last 30 episodes. For simplicity, only the first and the last 30 episodes are reported.

The forced deviation has an almost zero impact on the outcomes. This is reasonable since the deviation lasts only one period and the agents sample their experience randomly from the replay memories, so the weight placed on the deviation is very low. Moreover, the constant greedy policies also imply that there is no reward-punishment scheme learned by the D-Random algorithm.

Compared with the C-Random algorithm of Section 3.1, the D-Random algorithm further breaks the temporal correlations. Fully connected neural networks (3.1) do not learn the order of samples within mini-batches. A moderate batch size also ensures that the sampled experience tuples are close to being uncorrelated (Fan et al., 2020). Moreover, Q-networks update the greedy policies for all cells as a whole. This characteristic has advantages and also disadvantages compared with the C-Online and C-Random algorithms. One can expect Q-networks to learn more efficiently. However, imposing structure on Q-networks may implicitly restrict the domain of Q-values. Although neural networks are universal function approximators (Cybenko, 1989), practical implementations may not satisfy all theoretical requirements. On the other hand, we find no convergence issues in employing two-agent D-Random algorithms, in contrast to Foerster et al. (2016, 2017); Leibo et al. (2017). One possible explanation is that non-stationarity is not strong enough under a relatively simplified economic environment.
A natural question is: with deep Q-learning, is it still possible to recover algorithmic collusion, or at least prices significantly higher than the Bertrand equilibrium? Random experience replay assigns equal weights to all experience tuples in the memory buffer. However, agents may have certain preferences and do not always pick up data randomly. The first idea is online experience replay. It resembles the C-Online algorithm and uses the most recent data in the memory buffer. For the fairness of comparison, we still fix the same mini-batch size of 128. The only difference is that the mini-batch contains exactly the latest 128 experience tuples. We refer to deep Q-learning with online experience replay as the D-Online algorithm. One can expect that temporal correlations can be restored moderately under this setting. Experiments suggest that online experience replay raises, albeit slightly, the price levels to above 1.51. Compared with Figure 3b, only the sampling method is modified. We can attribute this trend towards supra-competitive prices to the temporal correlations within the recent samples. Another distinction is that each D-Online player only selects two, or even one, price(s) after convergence. Compared with six possible long-run prices in Table 1, online experience replay greatly stabilizes the training progression.

However, the D-Online algorithm does not obtain prices close to 1.78 as in Calvano et al. (2020). Clearly, Q-networks do not detect the order of samples in the mini-batch. A natural extension for the future is to adopt other architectures such as LSTMs or RNNs. One can also conjecture that reducing the mini-batch size and chunking the data to feed the Q-networks sequentially would strengthen the local dependencies. However, a small mini-batch size usually makes the learning process unstable, since not enough information is passed to the network during a single period. Indeed, with a small mini-batch size like 8, the algorithms select many more prices, compared with merely two choices for each player before.
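The D-Online variant only changes how the mini-batch is drawn: instead of a uniform sample, it takes the most recent tuples from the same buffer. The sketch below uses our own names; the batch size follows the text.

```python
import random
from collections import deque

buffer = deque(maxlen=2000)
BATCH = 128

def sample_random(buf):
    # D-Random: uniform sampling breaks temporal correlations.
    return random.sample(list(buf), BATCH)

def sample_online(buf):
    # D-Online: always replay the latest BATCH tuples, preserving local trends.
    return list(buf)[-BATCH:]
```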
To incorporate relative performance concerns, we assume each firm only memorizes the periods when its profit does not exceed that of its rivals. Periods when it overtakes the competitors are dismissed. This assumption may sound extreme, but it amplifies the effects of relative performance. We consider the simplest case with two firms. (When there are more firms, one can easily extend our idea by assigning sampling priority to tuples according to the reward rankings: the lower the ranking of a firm's profit, the higher the priority for the tuple.) The replay memory then only stores the experience tuples (s_t, a_t, r_t, s_{t+1}) from such periods. In our symmetric economic environment, rank experience replay makes Player 0 focus solely on the tuples with reward r_{0,t} <= r_{1,t}. Under these scenarios, the selected prices satisfy a^0_t >= a^1_t. However, there is no relative constraint on the current state s_t. Thus, rank experience replay is still able to visit arbitrary states.

Suppose we still sample randomly from the rank replay memory buffers. We call deep Q-learning with rank experience replay the D-Rank algorithm. Remarkably, the D-Rank algorithm quickly converges to supra-competitive prices; see Figure 5. Both players select the same long-run price and are therefore ranked the same. It is the equilibrium when firms are also concerned with rankings. Nevertheless, only high prices persist. No reward-punishment mechanism is detected as in Calvano et al. (2020). The greedy policy matrices are still constant for all states. If one firm defects for one period, both immediately restore the pre-deviation price in the following period. The defection incurs no punishment from the rival at all.
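In code, rank experience replay for the two-firm case is a one-line filter at storage time. This is a sketch under the description above; the buffer and function names are ours.

```python
from collections import deque

rank_buffers = [deque(maxlen=2000), deque(maxlen=2000)]

def store_rank(s, actions, rewards, s_next):
    # Firm i keeps a period only if its own profit does not exceed the rival's.
    for i, j in ((0, 1), (1, 0)):
        if rewards[i] <= rewards[j]:
            rank_buffers[i].append((s, actions[i], rewards[i], s_next))
```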
Figure 5: Rank experience replay. Fractions with prices below 1.79 drop quickly. The strategies fix at 1.79 after a relatively short number of episodes. Both players select 1.79. Graphs for the other player are omitted.

Why do D-Rank algorithms yield supra-competitive prices? Note that under the baseline economic environment, a lower price generates a higher profit than the rival's. Nevertheless, if both firms choose lower prices, they both earn lower profits. There could have been price wars. Impressively, D-Rank algorithms avoid this Prisoner's dilemma and maximize the aggregate profits even without communication. The absence of strong temporal correlations does not dominate the advantages of exploiting ranking information. We explain the dynamics of D-Rank algorithms with the trade-off between exploitation and exploration. On the one hand, the D-Rank method improves the exploitation by filtering the information acquired.
Figure 6: Distributions of long-run prices when deep players have heterogeneous concerns on relative performance: (a) the D-Rank player; (b) the D-Random player. In the last episode, the D-Rank player selects price 1.55 most frequently, while the D-Random rival selects 1.51 more commonly. During 65.6% of the last episode, the D-Rank player selects a higher price than his competitor.
For most scenarios in the previous sections, we have assumed two homogeneous players. Since collusion is a matter of coordination, a natural concern is two players adopting different learning algorithms and/or different economic considerations. There are numerous combinations. To ease the exposition and highlight the roles of random and rank experience replay, we only consider three particular cases: a C-Online player and a D-Random player; a C-Online player and a D-Rank player; a C-Rank player and a D-Rank player.
Suppose one player uses the C-Online algorithm and the other player adopts the D-Random algorithm instead. Overall, convergence is more difficult in this situation. We have observed sessions with prices fluctuating over a wide range, with no clear sign of convergence. When converged, the long-run prices for the D-Random player are still more volatile, as plotted in Figure 7b for the entire training process of a particular session. However, a comparison between Figure 7b and Figure 3b does indicate the possibility of high prices under heterogeneous players. During the last episode, the D-Random player consistently chooses price 1.59, while the greedy policy matrix of the C-Online player selects prices in a cycle, detailed in Figure 7a. The upper half of the matrix generally has lower prices (deeper colors), while the lower half has higher prices (lighter colors). When we exogenously force the D-Random player to defect, the greedy policies of both players are unchanged. Figure 7a indicates that if the D-Random player cuts his price, then the C-Online player will also cut his price, which is the punishment of deviations. It agrees with the reward-punishment scheme in Calvano et al. (2020). Besides, the deep player still uses a constant greedy policy matrix, even when the rival is more flexible.

The high prices raise a tricky question for regulations on algorithmic collusion. Apparently, firms cannot escape the responsibility for collusion by attributing it completely to the algorithms. Suppose we agree that two firms with C-Online algorithms are both liable for algorithmic collusion. In contrast, if they employ D-Random algorithms, then the price is competitive and close to the Bertrand equilibrium. However, when a C-Online player appears, the prices are raised significantly again. We may agree that the C-Online player is liable for the collusive strategies. But should the D-Random player under the heterogeneous setting also take some responsibility? If we think the D-Random player is liable, in the same sense as the C-Online player, then the D-Random player may have the following complaint: if the C-Online competitor also used his D-Random algorithm, then there would be no collusion at all. If we think the D-Random player is innocent, he definitely benefits from the raised prices and makes more profits. If we think the D-Random player is partially liable, to what extent should we punish him? How can we judge whether the D-Random agent, a fallen angel, is acting voluntarily or involuntarily? These questions are left open for further discussion.
Figure 7: The left subgraph shows the greedy policy matrix of the C-Online player during the last episode. Recall that the prices are encoded consecutively as follows: 0 for price 1.43, 1 for price 1.47, etc. The horizontal axis stands for the previous choice of the C-Online player. The vertical axis is for the D-Random player. The right subgraph plots the highest and the most frequent prices chosen by the D-Random player in the same way as in Figure 3b.
Notably, rank experience replay also stabilizes the collusion under heterogeneous players. First, consider a C-Online player with a D-Rank player. Fluctuations in the D-Rank player's actions are reduced. But in general, we cannot form a solid conclusion on whether the prices of D-Rank players are higher or lower than those of D-Random players. The main reason is that high variances are present when a D-Random player meets a C-Online player, which weakens the power of the comparison.
Furthermore, we assume both players are concerned about rankings. The algorithms converge on a relatively short time scale, similar to Figure 5, with supra-competitive prices. More importantly, they inhibit the punishment of defection, as indicated by a comparison between Figure 7a and Figure 8a. Imagine the D-Rank player reduces his price to below 1.47 (action 1); the C-Rank player will still keep his price at a high level. In contrast, Figure 7a would tell the C-Online player to cut the price.
Figure 8: C-Rank player and D-Rank player. The left subgraph illustrates the greedy policy of the C-Rank player. Both players select price 1.83, labeled as action 10, after convergence. The right subgraph indicates a rapid convergence for the D-Rank player, similar to Figure 5.

To identify the factors contributing to this elimination of punishment, we have also checked the greedy policies for another two cases: two C-Rank players, or a C-Online player with a D-Rank player. However, no similar effect is detected. We speculate on the reasons as follows. Deep Q-learning tends to select a constant greedy policy under many scenarios. But the D-Random player in Figure 7b is volatile and the classic player cannot detect it. When both players consider rankings, the algorithms are the most stable. The classic player has successfully anticipated this pattern and believes the deep player will immediately return to the pre-deviation price at the next period. To validate this conjecture, we consider two classic players. We force one player to select a fixed price, for example 1.83, with a time-increasing probability of 1 - \varepsilon_t; the \varepsilon-greedy policy is adopted with probability \varepsilon_t. This player tries to mimic the behavior of a D-Rank player. His competitor is a typical C-Rank player. We manage to approximately replicate the features of the greedy policy shown in Figure 8a. As depicted in Figure 9, this experiment also eliminates the punishment from the C-Rank player. In this design, the classic player must adopt rank experience replay, while it does not matter whether the player mimicking a constant policy adopts it or not.

Figure 9: Greedy policy of the C-Rank player, facing another classic player who gradually adopts a constant policy. We set the constant policy as price 1.83 (action 10). This figure resembles Figure 8a in the sense that it eliminates the punishment for the defection of the constant player.
There are numerous hyper-parameters and combinations. To ease the exposition, we only consider one factor at a time and fix most others as given under the baseline setting.
Asymmetry makes collusion more difficult (Calvano et al., 2020). Moreover, another concern is the interaction between asymmetry and rank experience replay. We focus on cost asymmetry, reported in Table 2. D-Random algorithms still perfectly inhibit the algorithmic collusion and obtain the price pair closest to the Bertrand equilibrium. Similar to Calvano et al. (2020), the collusion under C-Online algorithms is reduced only to a limited extent. Remarkably, Table 2 discloses the mechanism and a drawback of rank experience replay. For both C-Rank and D-Rank algorithms, the more efficient firm, Player 1 with a lower marginal cost, always charges a higher price than the less efficient rival. This pattern is distinct from all other algorithms and the theoretical prices. In a two-agent setting, rank experience replay makes competitors concentrate on the borderline where rewards are the closest to each other, where they could achieve similar rankings, if possible. The price pairs (1.…, 1.60) and (1.…, 1.75) lie exactly at this border. The C-Rank algorithm gets trapped at a lower price pair compared with the D-Rank case. Trajectories of D-Rank algorithms show that the prices do not move monotonically. They rise and drop for several turns, while the lower bound increases and fixes at the long-run prices after convergence. This reflects the tight game between two players with relative profits in mind. Besides, since the theoretical prices in Table 2 require enlarging the action space, it also verifies the robustness with respect to action spaces.

Strategy   Bertrand   Monopoly   C-Online    C-Rank   D-Random   D-Rank
Player 0   1.372      2.198      1.95 ⇄ …    …        …          …
Table 2: Long-run prices under cost asymmetry. The action space is enlarged as {…} for both players. Other parameters are the same as in the baseline setting. The first two columns are theoretical prices.

Designing neural networks usually needs expertise and careful tuning. There are many choices that might look arbitrary, including the depth (the number of hidden layers), the width (the number of neurons in each layer), activation functions, and other techniques embracing dropout, layer normalization, etc. It is impossible to test all designs. Although the universal approximation result (Cybenko, 1989) guarantees sufficiency even with a single hidden layer, one piece of conventional wisdom, yet debatable, is that a deeper network is better than a wider one (Eldan and Shamir, 2016). Since our baseline model only has one hidden layer, a natural concern is whether more hidden layers could discover the collusive strategies. We have tested a deeper network with 10 hidden layers and a smaller width (h = 16). The long-run prices do not show any significant difference, with the Bertrand equilibrium still achieved. Moreover, the greedy policies remain constant for both players.

Another concern is the bias term v_2 in the last layer. Some works do not use the final bias term (Fan et al., 2020). If it is activated, the initialization of weights should be done carefully (Karpathy, 2019). As explained previously, if the last bias term is adopted, then it is easy to follow the initialization in Calvano et al. (2020). Otherwise, the inverse problem of finding the exact network weights for given greedy policies is nontrivial. The last bias term uniformly shifts the Q-value functions regardless of the states. If the vector v_2 is larger than the first term in (3.1) by several orders of magnitude, then the greedy policy is more likely to be constant. However, even when we fix v_2 = 0, the network still produces constant greedy policies. Another important observation is that, after many iterations, the algorithm usually learns the term v_2 only, while keeping the other weights unchanged. This stabilizes the training progression, while it may also restrict the structure of the Q-value functions. Nevertheless, it is chosen by the algorithm automatically.

Experience replay introduces a new hyper-parameter, the memory buffer size. Zhang and Sutton (2017) provide an empirical study on the importance of the buffer size and find that it is task-dependent. Both a small and a large buffer size can hurt the training process. We have also considered a larger one of 2 × …. Deep Q-learning with random or rank experience replay tends to be robust: algorithms with a large or a moderate size do not exhibit significant differences in the results.

On the other hand, the effect of mini-batch sizes has been mentioned concretely for online experience replay. Similarly, for random experience replay, a smaller mini-batch size like eight also makes the training progression volatile. However, the long-run prices still fluctuate near the Bertrand equilibrium.

In the baseline setting, the Q-matrix is initialized under the assumption that one agent's opponents initially select any actions uniformly at random (Calvano et al., 2020), regardless of the state. The greedy policy under this specification is constant. It consistently charges a price of 1.59 (action 4) for any state. However, it is debatable which initialization is optimal or truly non-informative.
Therefore, we conduct a robustness check on several initialization methods.

Initialization   Baseline   Q = 0         Random        Q = 19     Topmost
D-Random         Bertrand   Bertrand      Bertrand      Bertrand   Bertrand
D-Rank           High       High          High          High       High
C-Rank           High       Upper bound   Upper bound   High       Volatile

Table 3: Convergence results under different initializations. The action space is enlarged to {…} with 20 actions, equally spaced with step size 0.04. The upper bound is significantly higher than the monopoly price.

Table 3 considers five methods. The first method, baseline, is from Calvano et al. (2020, Equation 8). The second sets the Q-matrix to zero; Sutton (1996) and Zhang and Sutton (2017) document that zero initial Q-values encourage exploration. The third, random, method is implemented differently for the classic and deep settings. Classic algorithms uniformly sample Q-matrix entries from [0, …

This paper utilizes an experimental approach to understand algorithmic collusion. A crucial open question is theoretical guarantees on the convergence to collusive or competitive strategies and the convergence rates under a multi-agent setting. Another problem left for future research is that fully connected neural networks consistently learn constant greedy policies; the underlying rationale is currently unclear. More effort is needed in exploring other architectures of deep networks. Finally, the experiment assumes a relatively simplified economic environment. One direction is to consider more realistic settings or to create innovative approaches to study the topic with empirical data.
References
Assad, S., Clark, R., Ershov, D., and Xu, L. (2020). Algorithmic pricing and competition: Empirical evidence from the German retail gasoline market. Working Paper. https://ssrn.com/abstract=3682021.

Bizjak, J., Kalpathy, S. L., Li, Z. F., and Young, B. (2016). The role of peer firm selection in explicit relative performance awards. Working Paper. https://ssrn.com/abstract=2833309.

Calvano, E., Calzolari, G., Denicolò, V., Harrington, J. E., and Pastorello, S. (2020). Protecting consumers from collusive prices due to AI. Science, (6520), 1040–1042.

Calvano, E., Calzolari, G., Denicolò, V., and Pastorello, S. (2020). Artificial intelligence, algorithmic pricing, and collusion. American Economic Review, (10), 3267–97.

Casas-Arce, P. and Martinez-Jerez, F. A. (2009). Relative performance compensation, contests, and dynamic incentives. Management Science, (8), 1306–1320.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, (4), 303–314.

Eldan, R. and Shamir, O. (2016). The power of depth for feedforward neural networks. In Feldman, V., Rakhlin, A., and Shamir, O. (Eds.), Proceedings of the 29th Conference on Learning Theory, volume 49, (pp. 907–940).

Ezrachi, A. and Stucke, M. E. (2016). Virtual competition. Oxford University Press.

Fan, J., Wang, Z., Xie, Y., and Yang, Z. (2020). A theoretical analysis of deep Q-learning. In Bayen, A. M., Jadbabaie, A., Pappas, G. J., Parrilo, P. A., Recht, B., Tomlin, C. J., and Zeilinger, M. N. (Eds.), Proceedings of the 2nd Annual Conference on Learning for Dynamics and Control, volume 120, (pp. 486–489).

Fisher, M., Gallino, S., and Li, J. (2018). Competition-based dynamic pricing in online retailing: A methodology validated with field experiments. Management Science, (6), 2496–2514.

Foerster, J., Nardelli, N., Farquhar, G., Afouras, T., Torr, P. H. S., Kohli, P., and Whiteson, S. (2017). Stabilising experience replay for deep multi-agent reinforcement learning. In Precup, D. and Teh, Y. W. (Eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70, (pp. 1146–1155).

Foerster, J. N., Assael, Y. M., de Freitas, N., and Whiteson, S. (2016). Learning to communicate with deep multi-agent reinforcement learning. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R. (Eds.), Advances in Neural Information Processing Systems 29, (pp. 2137–2145).

Hansen, K., Misra, K., and Pai, M. (2020). Algorithmic collusion: Supra-competitive prices via independent algorithms. Marketing Science.

Hao, S., Jin, Q., and Zhang, G. (2011). Relative firm profitability and stock return sensitivity to industry-level news. The Accounting Review, (4), 1321–1347.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.

Karpathy, A. (2019). A recipe for training neural networks. https://karpathy.github.io/2019/04/25/recipe/.

Klein, T. (2019). Autonomous algorithmic collusion: Q-learning under sequential pricing. Working Paper. https://ssrn.com/abstract=3195812.

Kokkoris, I. (2020). A few reflections on the recent caselaw on algorithmic collusion. Working Paper. https://ssrn.com/abstract=3665966.

Lazear, E. P. and Rosen, S. (1981). Rank-order tournaments as optimum labor contracts. Journal of Political Economy, (5), 841–864.

Leibo, J. Z., Zambaldi, V. F., Lanctot, M., Marecki, J., and Graepel, T. (2017). Multi-agent reinforcement learning in sequential social dilemmas. In Larson, K., Winikoff, M., Das, S., and Durfee, E. H. (Eds.), Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, (pp. 464–473).

Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, (3-4), 293–321.

Miklós-Thal, J. and Tucker, C. (2019). Collusion by algorithm: Does better demand prediction facilitate coordination between sellers? Management Science, (4), 1552–1561.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, (7540), 529–533.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized experience replay. In Bengio, Y. and LeCun, Y. (Eds.), 4th International Conference on Learning Representations (ICLR).

Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Touretzky, D., Mozer, M. C., and Hasselmo, M. (Eds.), Advances in Neural Information Processing Systems, volume 8. MIT Press.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.

Waltman, L. and Kaymak, U. (2008). Q-learning agents in a Cournot oligopoly model. Journal of Economic Dynamics and Control, (10), 3275–3293.

Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, (3-4), 279–292.

Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, King's College, University of Cambridge.

Zhang, S. and Sutton, R. S. (2017). A deeper look at experience replay. arXiv preprint arXiv:1712.01275.